Flex your logic muscles by finding ranges of activity
"Crime is common. Logic is rare. Therefore it is upon the logic rather than upon the crime that you should dwell."
Sir Arthur Conan Doyle, 1859-1930
The Adventures of Sherlock Holmes, "The Copper Beeches"
T-SQL querying problems often require you to apply complex logical thinking. Even though some people may be born with stronger logical-deduction abilities than others, logic is an area in which you can significantly improve by constant practice. So, the more you practice tough T-SQL puzzles, the more skilled you become. Since T-SQL querying problems require highly logical thought processes, you can improve your T-SQL skills by practicing purely logical puzzles, too. Logic puzzles are fun, intriguing, and challenging; they appeal to people of all ages. Besides improving your own T-SQL querying abilities, you can challenge your children, grandchildren, nieces, or nephews with logic puzzles. You never know; you might be raising a T-SQL master for the next generation. This article's T-SQL puzzle requires you to apply complex logic to a problem involving groups of date ranges. Then, the "Logical Puzzle" sidebar, page 27, provides a pure logical puzzle for you to practice.
Grouping Ranges Puzzle
Imagine you're the DBA for a chain of small stores selling firewood and potatoes to visitors at various national parks around the United States. In bad weather, when few people visit parks, a storeowner might decide not to open the store. At the end of every day that a store is open, the storeowner enters the day's total revenue into the FireTatoes Sales application, which uses SQL Server 2005 as its database. The application records the sales data in a local Sales table and SQL Server replicates it for analysis purposes to a central SQL Server database at the chain's headquarters at Angel's Landing in Zion National Park, Utah. Run the code in Listing 1 to create the central database's Stores and Sales tables and populate them with sample data.
At the beginning of every month, your manager needs a report that shows the activity periods of each store during the previous month. You need to write the T-SQL query that generates the desired results, grouping or "collapsing" consecutive sale days. For example, Figure 1 shows the desired results for January 2005. To make sure you understand the requirements, compare Listing 1's base data for store 1 (Zion) to the result in Figure 1. There are two groups of consecutive sale days for store 1 in January 2005: 1, 2, 3 and 23, 24, 25, 26. So, the result groups are 20050101 through 20050103 and 20050123 through 20050126. Try to solve the puzzle yourself before looking at my solutions; I walk through a SQL Server 2000−compatible solution first, then demonstrate how to solve the problem in SQL Server 2005.
SQL Server 2000−Compatible Solution
I'll first provide a SQL Server 2000−compatible solution that you can apply if you need to devise a solution to a similar problem involving SQL Server 2000 databases. The toughest part of this problem is figuring out what to group by. You need to calculate some value (call it grp) that you can use in the GROUP BY clause. Here's where you need to apply logic. First, define the logical problem: Calculate a value for each sales row that will be the same for all of a store's consecutive sale days. This means that you need to calculate a different value for each group of rows. Second, define the solution—that is, logically express the grp value. Once you define the solution logically, translating it to T-SQL should be straightforward. So, here's my suggestion for defining the grp value: For the same store as the current row's store, return the earliest sales date that's greater than or equal to the current row's sales date, after which the store was closed.
I think about it this way: you want to return a particular date for all consecutive rows in one group, and it's actually the latest date within the group. You return another date for all consecutive rows of a different group. If the answer isn't clear yet, bear with me until you look at the result of the T-SQL query that calculates the grp value, then go over the logic again.
Next, the translation to T-SQL: You want to return all rows from the table (call it the outer table) and use a subquery to fetch the desired date from another instance of the table (call it the inner table). "Earliest date" translates to MIN(dt); "for the same store" translates to a correlation by store ID; "greater than or equal to" translates to inner table's dt >= outer table's dt. And "after which the store was closed" translates to a NOT EXISTS predicate that looks for a sales date for the same store that's greater than the current sales date by 1 day. In other words, the subquery returns the latest sales date for the group.
Listing 2 shows the T-SQL query that calculates the desired grp value (the grouping factor). Figure 2 shows the result of Listing 2's query. Notice that you indeed get the same grp value for all rows of the same group and a different grp value for all rows of a different group. If you're confused by the logic I've described so far, look at the result in Figure 2 and try to figure out what you see there, then read my explanation again.
The rest of the solution is simple, as Listing 3 shows. Query the derived table D, which Listing 2 created. Group the rows by the store ID and grp, and return the store ID and lowest and highest dt values, which represent the range boundaries. This query gives you Figure 1's results.
SQL Server 2005 Solution
T-SQL enhancements in SQL Server 2005 will let you come up with a much shorter, faster solution. Specifically, I'm talking about the ROW_NUMBER() function that Microsoft introduced in SQL Server 2005. For details about ROW_NUMBER(), see my April 2004 T-SQL 2005 Web column, "Calculating Row Numbers in SQL Server 2005," InstantDoc ID 42302.
Imagine row numbers assigned to each of a store's sale rows in date order. The most recent sale row gets 1, the next gets 2, and so on. The ability to calculate such row numbers lets you define the grp value as follows: current sale date minus the number of days that corresponds to the date's row number. Yes, that's all there is to it.
Now let's figure out why this simple calculation gives you what you're after. By definition, a "consecutive date" means a date that's greater by one than the previous date. The same applies to row numbers because they're consecutive by definition. So, if you subtract the row numbers from a group of consecutive dates, you're bound to get the same result date for each.
The real beauty of this solution is in what happens with the next group of consecutive dates. There's a gap of dates between one group and another, but there's no gap in the row numbers. This disparity ensures that the second group of dates will have a different result date value than the first. In fact, the difference between the grp value the calculation generated for one group and the grp value it generated for the next group will be the number of days in the gap minus one.
If you find this logic complicated, try following an example. Take the dates for store 1—20050101, 20050102, 20050103, 20050123, 20050124, 20050125, and 20050126—and their respective row numbers—1, 2, 3, 4, 5, 6, and 7. From each date, subtract the number of days that corresponds to each row number, as follows:
20050101 − 1 = 20041231 20050102 − 2 = 20041231 20050103 − 3 = 20041231 20050123 − 4 = 20050119 20050124 − 5 = 20050119 20050125 − 6 = 20050119 20050126 − 7 = 20050119
There's a gap of 20 days between the first group and the second, so the grp value generated for the second group is 19 greater than that for the first. In short, you got what you were after: a grp value that's the same for all rows of one group and different for all rows of another group. Here's the T-SQL query that calculates the grp values:
SELECT storeid, dt, dt - ROW_NUMBER() OVER(PARTITION BY storeid ORDER BY dt) AS grp FROM dbo.Sales WHERE dt BETWEEN '20050101' AND '20050131';
Figure 3 shows the results.
Again, the rest is simple. As Listing 4 shows, you use a derived table or a common table expression (CTE) and group the rows by the store ID and grp value. Since this solution applies to SQL Server 2005 only, I used a CTE to demonstrate the solution.
Work It Out for Yourself
Solving T-SQL querying problems sometimes requires you to apply complex logic. This puzzle showed how to find groups of existing (activity) ranges; next time, I'll examine how to group non-existing (inactivity) ranges. Meanwhile, try to write a solution that returns ranges of dates when each store was closed during January 2005, producing the result that Figure 4 shows.
I again emphasize that you can improve your T-SQL problem-solving skills by practicing pure logical puzzles such as the one in the "Logical Puzzle" sidebar. I'll provide the solution to this logic puzzle next month. If you have any new, interesting T-SQL or pure logic puzzles, I'd be happy to hear from you.