This month, I want to introduce a problem that I like to think of as identifying sections. The generic form of the problem involves a table (call it T1) that has two columns of interest: one representing order among rows (call it id) and the other holding some value (call it val). The task at hand is to identify sections of consecutive rows that share the same value. The terms section and consecutive rows are problematic when you're dealing with sets, but as I mentioned, the column ID represents logical order among rows, and once order is defined, these terms become meaningful. For each identified section, you need to return the minimum id, maximum id, val, count of rows in the section, and possibly other aggregates.
Run the code in Listing 1 to create the table T1 and populate it with sample data. Table 1 shows the desired result. Before I demonstrate techniques to solve the problem in its generic form, I want to mention a couple of practical scenarios in which you might face this problem. One example involves bank transactions: You want to isolate consecutive periods of time in which an account had the same balance sign—negative, nonnegative, where the value element is SIGN(balance). Another example involves statistics regarding sports: You want to identify periods in which consecutive games had the same results—losses, ties, wins where the value element is SIGN(positive_ points-negative_points). Of course, there are many other examples. I'll demonstrate two techniques to solve the problem—one based on a subquery, and the other based on row numbers.
The Subquery-Based Solution
This problem's primary challenge is to come up with an expression that returns a value that's unique to the section to which the row belongs. I'll refer to this expression as grp. Once you manage to write such an expression, the rest is pretty easy: Group the rows by grp, and return the statistics requested for the group (e.g., minimum id, maximum id, count of rows).
The code in Listing 2 demonstrates how you can use a subquery to calculate the section identifier (grp). Per each outer row in T1, the subquery returns the minimum id out of the rows with a greater id and a different val than the one in the outer row. Table 2 shows the output of this query. Notice that each consecutive section of rows (based on id ordering) with the same value has a unique grp value. The last group (the one with the highest ids) has a NULL grp value, but that's not a problem. In the next step, you'll group the data by grp to return statistics about the section, and NULLs behave just like known values for grouping purposes—namely, all NULLs will produce one group.
The code in Listing 3 shows the complete solution. The query from Listing 2 that produces the section identifier (grp) is used to create the derived table D; the outer query groups the rows by grp and returns for each group the requested information (minimum id, maximum id, count of rows). Table 1 shows the output of this query.
This solution isn't limited to SQL Server 2005. It works in earlier versions of SQL Server, as well.
Solution Based on Row numbers
The second solution uses the ROW_NUMBER function in a nontrivial way to calculate a section identifier. The section identifier is a combination of val and the difference between ROW_NUMBER() OVER (ORDER BY id) and ROW_NUMBER() OVER(ORDER BY val, id).
This calculation would probably be best understood by running the code in Listing 4 and examining its output, which Table 3 shows. You're looking at the data, along with the individual row number calculations, and the difference between the two (call it diff ). Table 3's output is sorted for demonstration purposes by rn_val_id (val, id ordering). Keep in mind that the logical ordering that matters in identifying sections is by id alone. Notice that all rows in the same section get a unique diff value among the rows with the same val. In other words, the combination of val and diff uniquely identifies a section.
To understand the logic behind this calculation, examine the three sections with a val of a, and rn_val_id values 1-4, 5-6, 7-7. These are row numbers based on val, id ordering. Among the rows with the same val (i.e., a), the row numbers are consecutive across the sections. However, if you examine the rn_id values of the rows in those sections (row numbers based on id ordering), you'll naturally have gaps between the ranges of row numbers: 1-4, 7-8, 12-12. Now, within a section, both rn_val_id and rn_id keep incrementing in the same manner by one for each row. However, although there are no gaps between the sections in the rn_val_id values, there are gaps between the sections in the rn_ id values. In other words, the difference between the two (rn_id - rn_val_id, call it diff ) is constant within a section, and increases in the next section with the same val. Thus, the combination of val, diff uniquely identifies a section.
If you find this logic hard to grasp, examine Table 3's output and try to determine how you can use val, rn_id, and rn_val_id to calculate a section identifier. You might find it useful to examine the same data sorted by id (or rn_id, which is the same), as you see in Table 4.
Finally, you're left with aggregating the data. Listing 5 shows the final solution. The code from Listing 4 (except for the presentation-related ORDER BY) defines the common table expression C. The outer query groups the rows by val and diff (the section identifier in this solution), and returns the desired aggregates (minimum id, maximum id, val, count of rows). Table 1 shows the output of this query.
Missing Ranking Calculations
This type of calculation (section identifier) is pretty generic and applies to many problems. In essence, you're looking at a type of ranking calculation in which ordering is defined by one set of columns (id, in our case) and ranking is defined by another (val, in our case). SQL Server 2005 does support the RANK and DENSE_RANK functions, but unfortunately neither allows a separation between the set of columns that define ordering and the set of columns that define ranking. Rather, sorting and ranking are one and the same. It would have been nice to have RANK and DENSE_RANK functions that permitted the separation. Here's an example of pseudo code demonstrating how the syntax might have looked had it been supported:
SELECT id, val,
RANK(val) OVER(ORDER BY id) AS grp
With such support, solving this article's problem would be trivial—not to mention the fact that such a function would lend itself to good optimization. Having an index on (sort_cols, rank_cols), this calculation could have been achieved with a single ordered pass over the leaf level of the index.
Which Solution Is Better?
I tested the performance of both solutions against more realistic table sizes than the ones I used in this article. On one hand, the subquery-based solution requires substantially more I/O than the solution based on row numbers. On the other hand, it doesn't involve sorting, whereas the solution based on row numbers does. With a table size of a couple hundred thousand rows, both solutions run with similar performance. With large table sizes (e.g., a million rows and beyond), the nonlinear algorithmic complexity of sorting involved in the solution based on row numbers outweighs the larger amount of I/O involved with the subquery-based solution. In other words, the subquery-based solution becomes faster. However, up to a million rows in the table, the differences aren't dramatic, so you should use the solution that you feel more comfortable with. Hopefully, in the future, we'll see RANK and DENSE_ RANK functions with separate definitions of ordering columns and ranking columns, allowing such calculations to be both simpler and much faster.