I’m writing this blog entry in response to the following comment about the article “Set-Based vs. Cursor-Based Solutions for Running Aggregates.” There was just not enough room to respond directly with a comment of my own.
>> wgshef: This set based method is using an inefficient triangular-join. <<
The set-based solution I presented in this article is a very commonly used technique to handle running totals. The point I was trying to make in the article is that its execution plan has close to quadratic complexity in respect to the partition size. More specifically, for a partition size r, the number of rows processed per partition is r + (r + r^2)/2 (which is unfortunately how SQL Server’s optimizer handles the part of the join that you refer to in your comment as a “triangular join”). For small partitions that’s fine because the square effect is small, but with large partitions that’s very inefficient. Again, that’s exactly the point I was trying to deliver in the article so in this respect we’re both on the same page.
>> wgshef: There are other set-based methods that will blow this (and, obviously, the cursor) away. See http://www.sqlservercentral.com/Forums/FindPost728664.aspx for a set-based example of a one million row running update that runs in 35 seconds, and a corresponding cursor-based solution that runs in 144 seconds. <<
I assume that you refer to the following technique from the thread you pointed to:
SET @PrevBalance = AccountBalance = CASE
WHEN AccountNumber = @PrevAccount
THEN @PrevBalance + Amount
@PrevAccount = AccountNumber
FROM #pcWork WITH(INDEX(0),TABLOCKX) --Warm fuzzy options
OPTION (MAXDOP 1) --Just to be absolutely safe
There may be other set-based methods that will blow the one I showed away, and I’d be more than happy to see those, but the UPDATE technique you refer to is neither set-based (I will explain why shortly) nor officially guaranteed to work the way you expect it to, and hence I wouldn’t recommend using it.
BTW, as an aside, the hint index = 0 doesn’t instruct the optimizer to do something in clustered index order, rather, just do a table scan (or clustered index scan with no order guaranteed). To instruct the optimizer to perform an activity in clustered index order you would normally use index = 1.
In SQL Server 2008 and earlier versions, I haven’t yet found a pure set-based technique for running totals that SQL Server’s optimizer handles very efficiently with large partition sizes. In another article in this series, that will be published in the coming months, I describe standard language elements for calculating running totals that are based on the OVER clause, and that unfortunately weren’t yet implemented in SQL Server. Those lend themselves to excellent optimization, and hopefully we will see them in a future version of SQL Server.
So, back to the UPDATE technique you referred to; it violates several fundamental concepts in SQL that originate from mathematical set theory:
1. A table is a set of rows that have no particular order. A set-based query should not rely on the physical order in which the data is stored, accessed, or processed. The UPDATE solution relies on physical access order based on the clustered index. Even if it was guaranteed that such a statement would always be processed in clustered index order (and I haven’t seen such an official guarantee), this fact alone would make it a non set-based solution.
2. A set should be treated as a whole. The UPDATE solution relies on logic that handles one-row-at-a-time. It is a cursor in disguise, even though it doesn’t use T-SQL’s CURSOR object.
3. SQL (the language) implements a concept called all-at-once, meaning that all expressions in the same logical phase are as if evaluated at the same point in time. That’s why, for example, you can write a statement that swaps column values like this: UPDATE T1 SET col1=col2, col2=col1. The proposed UPDATE solution to the running totals problem makes another nonguaranteed assumption that the assignments in the SET clause will be processed from left to right.
Some think that anything that is not set-based (e.g., a cursor) is “dirty” and therefore should not be used. I’m not one of those. I see the great value in conformance with the relational model, and in true set-based solutions, but I also realize that there are practical limitations, like the way some set-based solutions get optimized. So yeah, my preference would be to use a set-based solution if an efficient one exists, and if not, consider other tools along with their disadvantages. But when considering other tools, do this responsibly. Namely, use tools with well defined, documented, guaranteed behavior.
This UPDATE statement reminds me of other cases where people relied on behavior that they assumed to work the way they expected from observation alone, and later were proved wrong. There are many examples…
The first example goes back to SQL Server 6.5. Back then the optimizer had only one algorithm available to handle grouping--based on sorting. So when you wrote a query such as:
SELECT col1, SUM(col2) AS total
GROUP BY col1;
You always got the data in group order. Some programmers assumed from observation that grouping by something implies that the data will always be sorted by the grouped columns, hence they stopped specifying an ORDER BY clause when they wanted the data to be sorted this way. Neither the standard nor SQL Server’s documentation supported this assumption. Then starting with SQL Server 7.0, the optimizer can also handle grouping and aggregation with hashing, in which case, presentation ordering may not be group ordering anymore.
Another classic example is that some people try to create a “sorted view”—which contradicts the premise of a set—because a view is a table, a table is a relation, a relation is a set, and a set has no order. So when you try to create a view based on a query with an ORDER BY, SQL Server rejects the attempt. Some people discovered that when they use TOP 100 PERCENT, SQL Server allows an ORDER BY in a view definition, but they fail to realize that in a definition of a view, an ORDER BY is only guaranteed to serve the logical filtering purpose for TOP. In SQL Server 2000, when you queried a view with a TOP 100 PERCENT and an ORDER BY, and didn’t specify an ORDER BY clause in the outer query against the view, SQL Server returned the data in order. From observation, people just assumed that they can trust this behavior to always work. There was no standard or documentation to support this assumption. And starting with SQL Server 2005, the optimizer became smarter, realizing that such a view does not need to guarantee presentation ordering, and therefore the optimizer simply ignores both the TOP and the ORDER BY clauses in the view’s query.
Another example is generating row numbers using a SELECT INTO with an ORDER BY clause, like so:
SELECT IDENTITY(INT, 1, 1) AS rownum, col1, col2, col3
ORDER BY col1;
From observation, people assumed that this technique will always assign identity values that “honor” the ORDER BY clause. But this technique is not standard, and at least in the past the behavior was not defined anywhere officially. Fortunately, due to the common use of this technique in production, Microsoft published a knowledgebase article that describes this technique; and incidentally, what it says is that it is not guaranteed, and that if you need a guarantee, you should first CREATE the table with an IDENTITY property, then insert into this table the result of a SELECT with an ORDER BY clause.
All these were just a few examples why you should always make sure you rely on techniques that are based on documented, well defined, guaranteed behavior. I’m afraid that the running UPDATE technique doesn’t qualify and therefore I hope that people will avoid using it in production.
As a consultant/developer/DBA you can always stand behind your choices when relying on documented and guaranteed behavior. But when relying on undocumented techniques, how would you then justify to a customer that X amount of dollars disappeared from their account balance, if such a thing happens?
What I do hope is that Microsoft will realize how common is the need to calculate running totals and other similar calculations, and provide a more complete implementation of the OVER clause/window functions.
Imagine how great it would be if one day we’d be able to express a running total like this:
SELECT AccountNumber, TransactionDate, CheckBookID, Amount,
SUM(Amount) OVER(PARTITION BY AccountNumber
ORDER BY TransactionDate, CheckBookID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS Balance
ORDER BY AccountNumber, TransactionDate, CheckBookID;
A fast track of optimization of this code will likely do a single ordered scan of an index created on (AccountNumber, TransactionDate, CheckBookID) INCLUDE(Amount).
Until that day, yes, I will consider using non set-based techniques but ones that are documented, well defined and guaranteed.