Try these filtering tricks in your data warehouse

Downloads |
---|

43981.zip |

A data warehouse contains data that has been aggregated for data analysis. You aggregate measures (such as quantities and amounts) for different entities (such as employees, customers, and products). You also determine the time units (e.g., day, month, year) for the aggregations according to your analysis needs. To allow fast query response, you might keep multiple summary tables, each containing a different time unit for the aggregations—for example, a table for daily aggregations, another for monthly aggregations, and a third for yearly aggregations. A group of calculations called *cumulative aggregations *requires you to perform additional manipulation of the summary tables in your data warehouse. They're called cumulative because you need to return a cumulative aggregate of all the rows either from a certain point in time until the current row's point in time or until some condition is met. For example, for each employee and month, you could return the current month's order quantity and the total number of orders from the beginning of the employee's activity until the current month. Let's look at five example cumulative-aggregation problems and explore their solutions.

### Cumulative Aggregations

The examples use an employee monthly orders summary scenario. First, run the code in Listing 1 to create the sample EmpOrders table and populate it with sample data—specifically, monthly order quantities placed by each employee. Each row contains the order month and year in the ordermonth column, the employee ID in the empid column, and the total monthly order quantity in the qty column. My design decision to use a datetime data type for the ordermonth column might seem strange. You need to store only a year and a month to represent the unit of time in this case, but a valid datetime value must also contain a day portion. To populate the ordermonth column with valid dates, you can specify the first of each month in the day portion and ignore it when you analyze the data. So why don't I store the year and month in a character-based column in the form YYMM if the day portion isn't required? I prefer to store years and months in a datetime column because doing so allows easy manipulation (such as calculating date differences or adding date parts) through datetime functions. Such manipulation isn't possible with integers or character strings that contain only the year and month portions of a date. I don't use datetime functions in this article, but I wanted to follow my best practices for datetime values in this case. Now let's walk through the five cumulative-aggregation problems.

** 1. Cumulative totals for each employee by month. **Suppose you want to analyze the progress each employee made during a period of time. For each employee and month, you want to return the total and average order quantities from the employee's first order month through the current month. Figure 1 shows the desired result (abbreviated). The solution for this request is fairly simple. You use a join to match each row from one instance of the EmpOrders table (call it O1) with all the rows from another instance of the same table (call it O2) that have the same employee ID and an earlier or identical order month. Because a row in O1 might have more than one match in O2, SQL Server duplicates each row from O1 for every match in O2 that contains the quantities you want to aggregate. You then group the result by all attributes you want to return from O1—in this case, O1.empid, O1.ordermonth, and O1.qty. Finally, you return those attributes together with the total and average quantities from O2, as Listing 2 shows.

** 2. Cumulative totals up to 500 for each employee. **A slight twist to the first problem is to return only the cumulative aggregations up to the month the cumulative total quantity reaches or passes a certain threshold—say, 500. Figure 2 shows the desired result (abbreviated). You need to add a filter to the query, and you must specify the filter in the HAVING clause because SQL Server applies the HAVING clause to the grouped rows. SQL Server applies the other filters that you can specify in a query (ON and WHERE) before grouping the rows. You might be tempted to use the filter

*SUM(O2.qty) <= 500;*however, such a filter won't return the correct result for an employee unless the cumulative total for a month reaches exactly 500. If the month in which the total reaches 500 has a cumulative total greater than 500, you won't get the row that contains that order month in the result. So, you need to devise a sophisticated filter that can accommodate a situation where cumulative totals don't reach exactly 500. Because each row in the grouped result contains both a cumulative quantity (SUM(O2.qty)) and the current month's quantity (O1.qty), you can use the filter

SUM(O2.qty)—SUM(O1.qty) < 500

This logical expression is true for all months in which the total quantity is less than 500 and for the first month in which the total reaches or surpasses 500. Listing 3 shows the complete solution query for problem 2.

** 3. Cumulative total reaches 500 for each employee. **The third problem adds another twist to the original problem. Suppose you want only one result row for each employee, showing where the employee reached or passed the 500-item barrier; you're not interested in the history leading to that point. To achieve the desired result, which Figure 3 shows, you need to revise the HAVING filter from the previous query so that the logical expression returns FALSE for all rows except the last. Think about the problem this way: The last row is unique because it's the only row where the total quantity reaches or passes 500. So all you need to add to the HAVING filter is

*AND SUM(O2.qty) >= 500,*as Listing 4's query shows.

** 4. Cumulative totals for each employee before cumulative total reaches 500. **The fourth problem is a slight variation of the second problem. This time, you want to see each employee's progress up to, but not including, the 500 quantity barrier. You don't want to see any rows in which an employee reached or surpassed 500. Figure 4 shows the desired result. This problem is simple to solve, as the query in Listing 5 shows, because the HAVING filter is so intuitive. The filter

*HAVING SUM(O2.qty) < 500*returns all rows where the cumulative total is less than 500.

** 5. Point before cumulative total reaches 500 for each employee.** The fifth problem is probably the most complex: From the results of the fourth problem, you want to return only the last row for each employee. In other words, you're after only the cumulative aggregate at the latest point

*before*the employee's total reached 500. Figure 5 shows the desired result. You need to figure out when each employee reached or passed the 500 barrier, then calculate the aggregates up to but not including that month. To achieve this, you can change the JOIN condition slightly and match the rows from O1 and O2 based on a less-than operator (<), leaving off the equal-to operator (=):

ON O2.ordermonth < O1.ordermonth

This filter would match each order month from O1 with all earlier order months from O2. In this way, you can calculate aggregations up to but not including O1's order month and figure out when the employee reached or passed the 500 limit by adding O1's quantity to the total quantity. Take employee 1's data as an example. The filter would match the row for September 1996 in O1 with all earlier months in O2 (July 1996 and August 1996). The total quantity from O2 (121+247=368) is smaller than 500. The total quantity from O2 (121+247=368) plus the current month's quantity from O1 (368+255=623) is greater than or equal to 500. So, the month you're after is the latest month in O2, namely August 1996. Remember that the request was for the month just before the employee reached the 500 barrier.

But the revised JOIN condition introduces a problem. Because you're matching the current month from O1 to earlier months from O2, the first (earliest) row for each employee won't have a match. If an employee's earliest row happens to be the row you're after, you won't get it back in the result. To solve this problem, change the join type to outer join. An outer join will return the earliest row for each employee, with NULLs in the attributes from O2, because it won't find a match for these rows. Use the ISNULL() function to substitute a 0 for the aggregation of an employee's earliest row.

Now that you've determined the join type and JOIN condition, you can work on the HAVING filter. You can use a similar approach to the one you used in the third problem. For that problem, you used the filter

SUM(O2.qty) - O1.qty < 500 AND SUM(O2.qty) >= 500

However, because you used a "<=" operator in the HAVING clause in that case, the total quantity included the current month, and in this case, it doesn't. Also, you want to substitute a 0 for a NULL result, so you end up with the following filter:

ISNULL(SUM(O2.qty), 0) < 500 AND ISNULL(SUM(O2.qty), 0) + O1.qty >= 500

The total quantity before the current point in time must be smaller than 500, and the total quantity including the current point in time must be greater than or equal to 500. For each employee, you isolate the month in O1 where the employee's order quantity reaches or passes 500—but that isn't what you need. Instead of returning the current month from O1, you can return the latest month from O2, which represents the point just before the order total reached 500. The query in Listing 6 shows the complete solution to problem 5.

### Tricky Filters

Some variations of cumulative aggregate problems are fairly simple to solve, but others can be quite tricky. All the solutions rely on accessing two instances of the same table: one "fixed" instance that duplicates the rows for each order month match, and another "running" instance where the rows are aggregated. You aggregate the data by attributes from the fixed side of the join and accumulate attributes from the running side. The toughest part of each problem is figuring out the HAVING filter, which tells SQL Server which groups to return. This article's examples demonstrate the important variations that require tricky filters, which you can use when facing similar problems in your data warehouse.