SQL Server 2012 (formerly code-named SQL Server Denali) introduces several important T-SQL programmability features; this article focuses on one of those features—window functions. SQL Server 2005 was the first milestone in supporting window functions; it introduced window ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE), as well as limited support for window aggregate functions—only with a window partition clause. SQL Server 2012 enhances support for window aggregate functions by introducing window order and frame clauses, support for offset functions (LAG, LEAD, FIRST_VALUE, and LAST_VALUE), and support for window distribution functions (PERCENT_RANK, CUME_DIST, PERCENTILE_DISC, and PERCENTILE_CONT).
Window functions are the best thing to happen since sliced bread; therefore, I'm going to spend more than one article on the topic. In fact, I just finished writing an entire book on the topic—it's that big! This month I'll introduce the concept of windowing, describe the elements involved in window specifications, and cover window aggregate functions. In later articles I'll describe window offset functions, window distribution functions, and optimization of window functions.
To be able to run the examples from this series, you need to use SQL Server Denali CTP3 or later, as well as a sample database called TSQL2012. You can download SQL Server 2012 from Microsoft's website. You can click the link to download the sample database TSQL2012.
A window function is a function that's applied to a set of rows defined by a window descriptor and returns a single value for each row from the underlying query. The purpose of the window descriptor is to define the set of rows that the function should apply to. You provide the window specification using a clause called OVER that you're probably familiar with from SQL Server 2005's window ranking and aggregate functions. Here's an example from SQL Server 2012, relying on new capabilities:
USE TSQL2012; SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders;
Figure 1 shows an abbreviated form of the query output.
empid ordermonth qty runqty ----- ----------------------- ---- ------- 1 2006-07-01 00:00:00.000 121 121 1 2006-08-01 00:00:00.000 247 368 1 2006-09-01 00:00:00.000 255 623 1 2006-10-01 00:00:00.000 143 766 1 2006-11-01 00:00:00.000 318 1084 ... 2 2006-07-01 00:00:00.000 50 50 2 2006-08-01 00:00:00.000 94 144 2 2006-09-01 00:00:00.000 137 281 2 2006-10-01 00:00:00.000 248 529 2 2006-11-01 00:00:00.000 237 766 ...
Don't worry about not understanding the full window function's specification yet; we'll get to that later. I just wanted you to have the code in front of you as I discuss the concepts. For now, suffice it to say that the query returns detail elements from rows in the EmpOrders view, such as the employee ID, the order month, and the current month's quantity; the query also uses a window function to compute a running total quantity from the beginning of the employee's activity and until the current month. Again, more details are coming shortly.
As you know, there are other forms of set functions that T-SQL supports besides window functions—but window functions have several important advantages. Aggregate functions are a good example of a kind of function that needs to be applied to a set of rows—not the only kind, but a good starting point to explain the concepts. The relevance of the window concept is in how you define the set of rows for the function to apply to, and where you can use the function in the language.
Try to think of the more traditional T-SQL language constructs that let you define a set of rows for an aggregate function to work with. The most obvious one that comes to mind is grouped queries. Grouped queries define groups, or sets, of rows to which aggregate functions can be applied. Each group is represented by one result row. A big limitation of grouped queries is that after you group the data, you have to apply all computations in the context of the groups. You gain insights to new information in the form of aggregates, but you lose the detail. This means that you can't directly involve in the same expression a detail element from a source row as well as an aggregate for the group.
Think of even simple things such as computing a percent of an order value out of the customer total. Using grouped queries, you basically must have one query that groups the data and computes aggregates and another that joins the result of the grouped query and the detail rows. Now think of how things get even more complicated if you need to compute the percent of the current order value out of the customer total, as well as out of the grand total.
Another way to define a set of rows for an aggregate function to work with is using subqueries. Each subquery can be applied to a different set of rows, and you can mix into the same expressions a detail element and the result of an aggregate applied to a subquery. However, a subquery starts with a fresh view of the data.
What if you need to apply the aggregate function to a subset of rows from the underlying query, and the underlying query has a lot going on (table operators such as joins, additional filters, and so on)? You need to repeat in the subquery a lot of the logic that appears in the underlying query. This can lead to verbose and complex code. Also, each subquery is optimized in SQL Server with separate access to the data; the optimizer doesn't have the logic today to internally collapse multiple subquery calculations that need to be applied to the same set of rows to perform only one visit to the data.
Window functions are designed in such an interesting way that they circumvent a lot of the limitations that apply to grouped queries and subqueries. In addition, their design lends itself to good optimization and fairly simple indexing guidelines.
Window Aggregate Function Specification
To understand what's involved in the specification of a window function, examine again the query I provided earlier:
SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders;
The first thing that's important to understand is that the OVER clause defines the set of rows for the function to work with. You can think of it as SUM(qty) OVER(<the following set of rows>). The initial set of rows that the function has to work with is the underlying query's result set. So the form SUM(qty) OVER() basically means the sum of all quantities from the underlying query's result set. Anything you add to the window specification will basically let you further restrict the window. Because the underlying query's result set is achieved only after you get to the SELECT phase of the logical query processing (after FROM, WHERE, GROUP BY, and HAVING), you're allowed to use window functions only in the SELECT and ORDER BY clauses of a query. If you need to refer to the result of a window function in other query clauses, you have to do so indirectly by using a table expression, such as a CTE.
Before I discuss how you can further restrict the window, consider the advantages of defining the set of rows for the function using a windowing concept versus grouping or subqueries. Unlike with grouped queries, the window function defines its own set of rows to work with. There are no limitations imposed on the rest of the expressions in the query. You can mix detail elements and results of window functions in the same expression. You can even have different window functions work with different windows of rows. Unlike subqueries, window functions don't start from a fresh view of the data but rather with the underlying query's result set—hence there's no need to repeat the logic from the underlying query. I also mentioned that window functions lend themselves to good optimization, but I'll cover the optimization part in a future article.
As to the specification of window aggregate functions, the elements you add to the window specification basically let you restrict the set of rows the function will apply to (remember, the starting point is all rows from the underlying query's result set). The three parts that you can indicate for window aggregate functions in SQL Server 2012 are <window partition clause>, <window order clause>, and <window frame clause>.
Window Partition Clause
The window partition clause was already available to window aggregate functions in SQL Server 2005. It restricts the window to only those rows that have the same values in the partitioning columns as in the current row. If SUM(qty) OVER() means "the sum of all quantities," SUM(qty) OVER(PARTITION BY empid) means "the sum of all quantities from the rows that have the same employee ID as in the current row." In simpler terms: "the sum of all quantities for the current employee." And as I mentioned, you can mix detail elements and window functions. For example, the expression 100.0 * qty / SUM(qty) OVER() gives you the percent of the current quantity out of the grand total. The expression 100.0 * qty / SUM(qty) OVER(PARTITION BY empid) gives you the percent of the current quantity out of the employee total.
Window Ordering and Framing
The window order clause is there to give meaning to framing options. Framing means that you define ordering within the window partition, and then based on that ordering you place two boundary points that frame the subset of rows that you want to restrict. For example, our query that computes running totals partitions the window by empid (PARTITION BY empid), orders the window by ordermonth (ORDER BY ordermonth), and frames it with the option ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that based on order month ordering, the applicable frame of rows is the subset of rows within the current employee's partition with no low boundary point and until the current row. In our case, this means all months from the beginning of the employee's activity and until the current month. The ROWS option isn't the only supported frame unit; there's another option called RANGE that I'll discuss later.
As for defining the window frame extent (the frame bounds), you can refer to the start or end of the window partition with the options UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, respectively. You can also indicate a bound as an offset from the current row: <n> PRECEDING means <n> rows before the current row, and <n> FOLLOWING means <n> rows after the current row. You can also refer to the current row simply as CURRENT ROW.
You already saw the form ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This form will likely be the one you use most often for practical reasons. Conveniently, there's an abbreviated form that means the same thing: ROWS UNBOUNDED PRECEDING. Note, though, that if you indicate a window order clause but not a window frame clause, the default isn't exactly the same—but I'll get to it in the next section when describing the RANGE option.
Here's a query showing bounds expressed as offset from the current row:
SELECT empid, ordermonth, MAX(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prvqty, qty AS curqty, MAX(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS nxtqty, AVG(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS avg3months FROM Sales.EmpOrders;
The first function operates on a frame with just one row—the previous row. The second function operates on a frame with also just one row—the next row. The third function operates on a frame with three rows—one before the current row until one after the current row. Figure 2 shows the output of this query.
empid ordermonth prvqty curqty nxtqty avg3months ------ ----------------------- ------ ------ ------ ---------- 1 2006-07-01 00:00:00.000 NULL 121 247 184 1 2006-08-01 00:00:00.000 121 247 255 207 1 2006-09-01 00:00:00.000 247 255 143 215 1 2006-10-01 00:00:00.000 255 143 318 238 1 2006-11-01 00:00:00.000 143 318 536 332 ... 1 2008-01-01 00:00:00.000 583 397 566 515 1 2008-02-01 00:00:00.000 397 566 467 476 1 2008-03-01 00:00:00.000 566 467 586 539 1 2008-04-01 00:00:00.000 467 586 299 450 1 2008-05-01 00:00:00.000 586 299 NULL 442 ...
Note that when only one row is involved in the frame, you should use other new functions, called LAG and LEAD—which I'll discuss next month. The offset window frame extent options are mainly interesting when there can be more than one row in the frame.
ROWS vs. RANGE
The ROWS option relies on a very basic concept of an offset in terms of number of rows with respect to the current row. But considering our sample data, you realize that there's no assurance that an employee will handle orders every month. In other words, there could be gaps between months of activity for an employee. Suppose that you want to compute the average quantity for the past 3 months of activity. Using the form ROWS BETWEEN 2 PRECEDING AND CURRENT ROW will give you what you want only if there's an assurance for no gaps between months of activity. But in our case there can be gaps, so the frame will include three rows that could represent a period of more than 3 months. That's what the RANGE option is there to resolve by defining offsets from the current row's value, as opposed to offset in terms of number of rows.
Note, though, that the current implementation of the RANGE option in SQL Server 2012 is very basic. To achieve filtering options such as "the past 3 months" you need window frame extent options that weren't implemented in SQL Server 2012, as well as support for a temporal interval data type that isn't available in SQL Server. But just to give you an idea, according to standard SQL, you'd achieve the task as follows (don't run this code):
SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth RANGE BETWEEN INTERVAL '2' MONTH PRECEDING AND CURRENT ROW) AS sum3month FROM Sales.EmpOrders;
The RANGE frame extent options that SQL Server 2012 does support are only with UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW as delimiters. There's a subtle, yet very important difference between ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when the ordering isn't unique (when there's a possibility for ties). The former truly uses the current row as the upper bound; namely, it doesn't include ties, whereas the latter does.
What's even more important to note is that if you indicate a window order clause but not a window frame clause, the default frame will be RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and not ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The reason it's important to emphasize this fact is that even when the ordering is unique and the two are logically equivalent, SQL Server's optimizer might not realize this, and then the form with RANGE (the default) can end up being more expensive to compute. As I mentioned, I'll discuss optimization of window functions in a future article, but for now you can already make it a best practice: Stick to the ROWS form whenever possible and try to avoid indicating ordering without framing.
More to Come
This article is the first in a series discussing window functions support in SQL Server 2012. I described the windowing concept, the advantages of window functions compared with more traditional T-SQL constructs, and the elements in the specification of window aggregate functions. Next month I'll cover other window functions that SQL Server 2012 supports.