| Executive Summary: |
Until Microsoft SQL Server supports the OVER clause, your best solutions for running aggregates are a set-based solution using subqueries or joins, for very small partition sizes, or a CLR data reader–based solution, for large partitions.
A running aggregate is an aggregate that keeps accumulating values over a sequence—typically temporal, possibly within partitions. A running sum of quantity over time (days) for each employee is an example of a running aggregate. For each employee and date, you’re looking for the total quantity from the beginning of the employee’s activity until the current date.
In previous articles I covered three solutions to running aggregates: a set-based solution using subqueries or joins, a T-SQL cursor-based solution, and a CLR-based solution using a data reader. (See the Learning Path for these and other articles related to running aggregates.) I explained how the solutions scale when you change various aspects, such as the number of aggregations, the number of partitions, and the partition size.
In this article I present two new solutions—one based on nested iterations that you can implement in SQL Server 2008 and 2005, and another that SQL Server doesn’t yet support (as of SQL Server 2008), but hopefully will in the future.
For consistency with my previous articles, I’ll use the same problem and sample data. Web Listing 1, Web Listing 2, and Web Listing 3 contain the code to create the Sales table from my previous examples and populate it with sample data. Run the code in Web Listing 1 to create the Sales table. Run the code in Web Listing 2 to create a table function called GetNums that generates a sequence of integers of a requested size, which is later used to populate the Sales table. Run the code in Web Listing 3 to populate the Sales table with sample data. Set the values of the variables @num_partitions and @rows_per_partition to determine the number of partitions (employees) and partition size, respectively, based on your needs.
My sample problem is to calculate the running total quantity for each employee over dates. That is, for each employee and date, calculate the running total quantity from the beginning of the employee activity until the current date.
Solution Based on Nested Iterations
In my previous articles, the solutions I presented were either purely set-based (using subqueries or joins) or purely cursor-based (using a T-SQL cursor or a .NET data reader). The first solution I present in this article combines iterative and set-based logic. The idea is to iterate through the entries within a partition in sequence order, using a set-based query in each iteration to process the nth entry across all partitions. Listing 1 contains an implementation of this approach using a recursive query.
The code in Listing 1 first generates row numbers for the rows from the Sales table, partitioned by empid, ordered by dt, materializes the sales rows along with the row numbers (rownum column) in a temporary table, and clusters the table by rownum and empid. The code then defines a common table expression (CTE) that calculates the running sum. The CTE’s anchor member processes the first entry for all employees (rownum = 1), and the recursive member processes in each iteration the next entry (previous rownum plus 1) for all employees, adding the current entry’s quantity to the running sum of quantity that was accumulated so far.
Listing 2 has an implementation of the nested iterations using loops instead of recursive queries. You might want to use this solution if you’re working with SQL Server versions prior to 2005.
Materializing the sales rows with the row numbers in a temporary table and indexing that table obviously involves a cost (especially the indexing part, which involves sorting). However, this solution’s greatest cost is in calculating the running aggregation. Therefore, I focus on the second part of the solution in my performance discussions. Figure 1 shows the execution plan for the recursive query used in Listing 1.
The first Clustered Index Seek operator in the plan fetches all rows with rownum = 1 from the index. The plan then spools those rows (i.e., stores them in a temporary table). Then, the plan uses a loop, and in each iteration the code processes all rows with the next row number, calculating the running aggregate and adding the rows from the current round to the spool, until no more rows are found. The plan uses a seek operation per each row from the previous round to match it with the next row (rownum greater than previous by one) for the same employee.
Solution Based on the OVER Clause
The second solution I present isn’t yet supported in SQL Server (as of SQL Server 2008); hopefully the next major SQL Server release will support it. This solution is based on the standard OVER clause, which SQL Server only partially implements. Unfortunately, SQL Server doesn’t yet implement the parts of the OVER clause that are necessary to calculate running aggregations (i.e., the ORDER BY, ROWS, and other subclauses of the OVER clause for aggregate functions). The ORDER BY subclause lets you define logical ordering in the window, and the ROWS subclause lets you define the range of rows you want to aggregate, marking the low boundary and high boundary points with respect to the current row. For example, to express the running sum of quantity for each employee and date, you’d run the following code:
SELECT empid, dt, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sumqty FROM dbo.Sales;
Remember, you can’t actually run this code in SQL Server. As you can see, the code is very natural and intuitive. Because the calculation is supposed to be independent for each employee, you specify PARTITION BY empid. Because ordering is supposed to be based on dates, you specify ORDER BY dt. Finally, because you want to aggregate all rows with no low boundary point until the current row, you specify ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
If you know how SQL Server optimizes those aspects of the OVER clause that are currently implemented, you’ll have a good sense of how running aggregates will perform once SQL Server implements them. For example, consider the following running count aggregate that isn’t yet supported in SQL Server:
SELECT empid, dt, qty, COUNT(*) OVER(PARTITION BY empid ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sumqty FROM dbo.Sales;
This aggregate is logically equivalent to the following row number calculation:
SELECT empid, dt, qty, ROW_NUMBER() OVER(PARTITION BY empid ORDER BY dt) AS rownum FROM dbo.Sales;
Figure 2 shows the plan for the row number calculation.
As you can see, the plan is highly efficient. The clustered index on empid and dt is scanned in order. The Segment operator is in charge of passing one partition (segment) at a time to the next operator. Finally, the Sequence Project operator is in charge of assigning the row numbers. It resets the value to 1 for the first row in the segment, and it increments the previous value by 1 for all other rows. When SQL Server enhances the OVER clause in the future to support running aggregates, the plan will likely be very similar in the special case of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Only the Sequence Project operator will apply other calculations, such as adding the current measure’s value instead of just adding 1. Therefore, it should be safe to use the existing ROW_NUMBER calculation to evaluate the expected future performance of running aggregates based on the OVER clause.
Performance of Solutions
To evaluate the solutions’ performance, I changed three aspects: number of aggregations, number of partitions, and partition size. For both solutions (i.e., the solution based on nested iterations and the solution based on the OVER clause), most of the cost of the plan is associated with access to the data, and not the calculations. Adding more aggregations in both cases adds to the calculations but not to the amount of data access. Therefore, adding more aggregations adds little cost, and the complexity is close to constant.
As for adding partitions but keeping the partition size constant, in both solutions the impact should be linear, because both involve a constant cost per partition. However, the solution based on nested iterations should be substantially less efficient than the one based on the OVER clause. The former involves scanning the data, calculating row numbers, materializing the result in a temporary table, and indexing it, plus performing a seek operation in the clustered index per each row. With the latter, the data is scanned only once in clustered index order, and the calculation using the OVER clause is based on the existing index ordering.
The graph in Figure 3 shows the effect of changing the number of partitions. It includes the two solutions I discuss in this article, as well as the three solutions I discussed in previous articles.
It’s interesting to note that all five solutions have fairly linear complexity with respect to changes in the number of partitions. With the tested partition size (10 rows per partition), the solution based on the OVER clause leaves all the others behind. With small partitions, the solution based on nested iterations is more efficient than the T-SQL cursor-based solution, but it’s slower than the others.
As for the effect of partition size, again, because both solutions (i.e., nested iterations and the OVER clause) involve a fairly constant cost per row, their scaling with respect to partition size should be linear. I should mention again that the solution based on nested iterations materializes the rows in a temporary table and creates an index on that table, which therefore involves sorting. The complexity of sorting isn’t linear, but because this part of the solution is a small portion of the entire solution cost (with the sizes I used in my performance tests), I ignored this part. The second part of the solution that processes the recursive query (or the looping logic) does involve a constant cost per row.
The graph in Figure 4 shows how all five solutions, including the new ones, scale with respect to changes in partition size. You can see that all the solutions—other than the set-based solution using subqueries or joins, which has quadratic complexity—have fairly linear complexity. With the number of partitions used in this test (1,000), and partition size varying between 10 and 1,000, analyzing the result is interesting. As you can see, the solution based on the OVER clause is by far the big winner in all cases. The solution based on nested iterations is faster than the T-SQL cursor but slower than the CLR data reader. It is slower than the set-based solution using subqueries or joins up to a partition size of about 350 rows, but beyond this point becomes faster.
Until the OVER Clause is Supported...
In the past few months I covered five solutions to running aggregations. Four of these solutions can currently be implemented in SQL Server, and one solution (i.e., the solution based on the OVER clause) will hopefully be available in a future version of SQL Server. The performance expectations for the OVER clause solution show that it will leave all the other solutions far behind. Hopefully Microsoft will implement this solution in the next major release of SQL Server. Until then, the set-based solution using subqueries or joins is a reasonable choice for very small partition sizes; its quadratic complexity makes it too slow with large partitions. For large partitions, the fastest current solution is the one based on the CLR data reader.
Web Listing 1: DDL Statement to Create Sales Table
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID('dbo.Sales', 'U') IS NOT NULL DROP TABLE dbo.Sales; CREATE TABLE dbo.Sales ( empid INT NOT NULL, -- partitioning column dt DATETIME NOT NULL, -- ordering column qty INT NOT NULL DEFAULT (1), -- measure 1 val MONEY NOT NULL DEFAULT (1.00), -- measure 2 CONSTRAINT PK_Sales PRIMARY KEY(empid, dt) ); GO
Web Listing 2: DDL Statement to Create GetNums Function
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums; GO CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE AS RETURN WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5) SELECT n FROM Nums WHERE n <= @n; GO
Web Listing 3: Code to Populate Sales Table with Sample Data
DECLARE @num_partitions AS INT, @rows_per_partition AS INT, @start_dt AS DATETIME; SET @num_partitions = 10000; SET @rows_per_partition = 10; SET @start_dt = '20090101'; TRUNCATE TABLE dbo.Sales; INSERT INTO dbo.Sales WITH (TABLOCK) (empid, dt) SELECT NP.n AS empid, DATEADD(day, RPP.n - 1, @start_dt) AS dt FROM dbo.GetNums(@num_partitions) AS NP CROSS JOIN dbo.GetNums(@rows_per_partition) AS RPP;
Listing 1: Nested Iterations, Using Recursive Queries
SELECT empid, dt, qty, ROW_NUMBER() OVER(PARTITION BY empid ORDER BY dt) AS rownum INTO #Sales FROM dbo.Sales; CREATE UNIQUE CLUSTERED INDEX idx_rownum_empid ON #Sales(rownum, empid); WITH C AS ( SELECT 1 AS rownum, empid, dt, qty, qty AS sumqty FROM #Sales WHERE rownum = 1 UNION ALL SELECT PRV.rownum + 1, PRV.empid, PRV.dt, CUR.qty, PRV.sumqty + CUR.qty FROM C AS PRV JOIN #Sales AS CUR ON CUR.rownum = PRV.rownum + 1 AND CUR.empid = PRV.empid ) SELECT empid, dt, qty, sumqty FROM C OPTION (MAXRECURSION 0); DROP TABLE #Sales;
Listing 2: Nested Iterations, Using Loops
SELECT ROW_NUMBER() OVER(PARTITION BY empid ORDER BY dt) AS rownum, empid, dt, qty, CAST(qty AS BIGINT) AS sumqty INTO #Sales FROM dbo.Sales; CREATE UNIQUE CLUSTERED INDEX idx_rownum_empid ON #Sales(rownum, empid); DECLARE @rownum AS INT; SET @rownum = 1; BEGIN TRAN; WHILE 1 = 1 BEGIN SET @rownum = @rownum + 1; UPDATE CUR SET sumqty = PRV.sumqty + CUR.qty FROM #Sales AS CUR JOIN #Sales AS PRV ON CUR.rownum = @rownum AND PRV.rownum = @rownum - 1 AND CUR.empid = PRV.empid; IF @@rowcount = 0 BREAK; END COMMIT TRAN; SELECT empid, dt, qty, sumqty FROM #Sales; DROP TABLE #Sales;