With logic, even querying nonexistent data is no problem
T-SQL querying problems often require you to apply complex logic in your solutions. In my January 2005 column, "Grouping Ranges" (InstantDoc ID 44570), I presented a scenario in which you had to use T-SQL to return ranges of consecutive existing sales dates for each store during January 2005; the solutions required you to apply logical deduction. To solve this month's problem, you'll also need to apply logical reasoning. This month's problem is the inverse of last month's: You need to identify nonexistent ranges of dates for each store during January 2005. That is, you need to produce a report that shows when stores were inactive during a particular month. Figure 1 shows the desired results. Two of the three solutions that I devised apply to all versions of SQL Server; the other applies only to SQL Server 2005. Before reviewing my solutions, try solving the puzzle yourself. I'll use the same tables I used last month in my examples. Run Listing 1's code to create the Sales and Stores tables and populate them with sample data. As a reminder, the Stores table contains a row for each store, with the store ID and name. The Sales table contains a row for each sales day that the store was open, including the store ID, sales date, and daily revenue.
Solution 1: Filtering Start Points and Matching End Points
The first solution involves two steps. First, you identify the starting points of gaps in the sales dates. Then, you match an endpoint of a gap to each starting point.
You can identify a sales date within the Sales table as a starting point of a gap by checking whether the following date exists in the table. Let's take store 1's data as an example. The store was open on January 1. To figure out whether this date marks a starting point of a gap, you need to check whether the store was also open on January 2. It was, so you continue to the next date. The store was open on January 3 but wasn't open on January 4, so there's the gap starting point. The starting point of the gap is January 4, 2005, which is the first missing date following the last activity date, January 3, 2005.
You can already see that you need to devise a query against the Stores table that uses a NOT EXISTS filter. However, because you're relying on existing sales dates to identify the starting points of the gaps, you need to be aware of an exception. If the store wasn't open on the first of the month, and the last day of the previous month doesn't exist in the table, such a query against Sales, using just the NOT EXISTS filter, won't identify the first of the month as the starting point of a gap. To take care of this problem, you can create a derived table, such as the one at callout A in Listing 2, that adds to January 2005's data an extra row that contains the date December 31, 2004, for each store. Run Listing 2's code to return the desired results, which Figure 2 shows—starting points of the gaps—accounting for gaps that start at the beginning of the month.
Notice that in the derived table S1, which the code at callout A creates, I didn't include existing sales rows for January 31. These rows are of no interest because we're not looking for gaps in February that start after January 31. Also, since I couldn't rely on having such a row in the Sales table, I had to query the Stores table to explicitly create a row for each store that contained the date December 31, 2004. The NOT EXISTS filter in the outer query (at callout B) returns only those sales rows for which the store was closed on the next date (no row exists where S1.dt = S2.dt - 1). The SELECT list of the outer query adds 1 day to the existing date to return the starting point of the gap.
The second step of this solution is to match a gap's endpoint to each starting point that you already found. Logically, you can match an endpoint to a starting point by identifying the next existing sales date in the table for the given store and starting point, minus 1 day. This logical deduction is based on the fact that the next existing date is the first date following the gap.
You can easily find the corresponding endpoint of a gap by using a subquery in the SELECT list, as I did at callout A in Listing 3. The subquery returns the MIN date minus 1 day from another instance of the table (call it S3), where the store ID is the same as the one in S1 (the outer table) and the date is greater than the one in S1. If the subquery returns NULL, the gap ends at the end of the month since there are no more dates for this month. For that case, I used a COALESCE() function that returns January 31, 2005. You can run the code in Listing 3, which contains the complete solution query, to get the desired results.
Solution 2: Calculating the Difference Between Adjacent Dates
The second solution to the problem is based on calculating the difference in day units between neighboring sales dates. This solution also involves two steps. First, you return pairs of neighboring sales dates, then you filter only pairs of dates that are more than 1 day apart.
If you think about it, neighboring sales dates that are more than 1 day apart mark the boundaries of a gap. You add 1 day to the earlier date, subtract 1 day from the later date, and you get the starting and ending dates of the gap itself. Of the three solutions that I present in this article, I like this one best because the logical deduction here is so simple and intuitive.
Run the code that Listing 4 shows to return the adjacent sales dates that Figure 3 shows. SQL Server issues the query against the derived table S1—the same derived table I used in the first solution. At callout A, the SELECT list invokes the subquery that returns the next existing sales date for each store. If no such sales date exists in the Sales table on or before February 1, 2005, a COALESCE() function substitutes "February 1, 2005" for the NULL.
The second step is simple: Use the query from step 1 to create the derived table D, as I did at callout A in Listing 5. Add a WHERE clause in the outer query against D that filters the pairs of dates between which the difference is more than 1 day. In the outer query's SELECT list, add 1 day to the current date, and subtract 1 day from the next date to return the gap itself. You can run Listing 5's code, which contains the complete solution query, to get the desired results.
Solution 3: Using an Auxiliary Table of Dates
The third solution applies to SQL Server 2005 only. This solution involves using an auxiliary table that contains all possible dates you need to cover. In this example, you need only 31 rows to contain all possible dates in January 2005, but in practice, the table would likely contain several years' worth of data. Still, the auxiliary table would be small (hundreds to perhaps thousands of rows) even if you need to cover several years.
Run the code that Listing 6 shows to create the Dates auxiliary table and populate it with 10 years' worth of data. The first step of this solution is to return all nonexistent sales dates during January 2005 for each store. You need to cross-join the Dates auxiliary table with the Stores table and use a NOT EXISTS filter that checks that there's no row in Sales that contains the date from Dates and the store ID from Stores. Run the code that Listing 7 shows to return all nonexistent dates for each store during January 2005.
Now that you have all individual dates when each store wasn't open, you're actually dealing with the same problem I covered last month—grouping ranges of consecutive dates—so you can apply any of the solutions I discussed in "Grouping Ranges." The code in Listing 8 applies the row-numbering solution, in which you calculate a grouping factor by subtracting from the sales date a number of days equal to the row's number (according to date order for each store separately).
At callout A in Listing 8, the Common Table Expression (CTE) I named GrpCTE combines the previous step's query, which returns individual nonexistent dates, and the calculation of the grouping factor (date minus row number). The outer query, at callout B, simply groups all consecutive dates for each store by specifying store ID and the grouping factor in the GROUP BY clause. In the SELECT list, you calculate each gap's starting and ending points by requesting the MIN and MAX dates for each group.
Which Solution Is Best?
When I need to pick one of several solutions, I usually examine two aspects: performance and simplicity. I tested all three of these sample solutions against much larger tables. Solutions 1 and 3 have similar run times and are approximately twice as fast as Solution 2. However, Solution 2 is by far the simplest of the three. If the performance of all solutions were acceptable (i.e., if they all finished fast enough to meet the users' needs), I'd probably choose Solution 2. Otherwise, Solution 1 seems best because it applies to all versions of SQL Server.
As you probably noticed, the toughest part of solving this month's T-SQL puzzle was the logical manipulation required to devise a solution. After you've devised your strategy, translating the logical solution to T-SQL is easy. To help you keep in practice and improve your logical deduction capabilities, I provided an extra logical challenge last month in the form of a pure logic puzzle. You can find the solution in the "Logical Puzzle" sidebar, along with a new logic problem to puzzle over.