I've been working on a large B2B Customer Relationship Management (CRM) program. One important Key Performance Indicator (KPI) we wanted to develop for the program was the average number of days between orders. We wanted this KPI to determine the customers' usual purchase patterns so we could answer questions such as "Do some customers purchase more often than the average customer?" and "Is the average time between purchases increasing?" (An increase could be a sign that the customer is at risk for leaving the program.)
Determining the average number of days between orders involves finding two consecutive orders from the same customer and calculating the number of days between them. The first solution I tried used a looping algorithm that worked something like this:
- Find the first order for a customer and save the order date in a temporary variable.
- Find the next order and save that order date in a second temporary variable.
- Find the difference between the two dates and save that value.
- Move the second temporary variable's value into the first temporary variable, retrieve the next record, and save that record's order date in the second temporary variable. Find the difference between the two dates and save that value.
- Continue looping through all the records for a customer, comparing the dates from consecutive orders until you encounter the next customer's records.
Not only were the algorithm queries complicated to write, they didn't perform very well. When I tested the algorithm against the FactInternetSales table, it took more than four minutes to complete on my server running SQL Server 2005 SP2. FactInternetSales has only around 60,000 records and 19,000 customers. I knew that the performance of this algorithm would only get worse as these numbers got larger.
I decided to try a different approach. I began by using the ROW_NUMBER function with an OVER clause to get a chronological list of each customer's orders. After I saved the output to a temporary table, I used an INNER JOIN operation to bring adjacent records together. Callout A in Listing 1 shows this query.
Listing 1: DaysBetweenOrders.sql
As Figure 1 shows, the query produces a list of orders sorted by date for each customer and an additional column that specifies the sequence of each order in each customer list.
If you look at the data closely, you'll see that some orders occur on the same day. We needed to have only one order per customer per day because we wanted to measure the average number of days between orders. (You might choose to do this differently depending on the business problem and rules.) So, as callout B shows, I used the GROUP BY statement to get only one order per customer per day. I also eliminated the time portion from the datetime field. I then used a SELECT statement to get the results into a temporary table named #orders. At this point, the results looked like that in Figure 2.
The next step was to join the table to itself (i.e., join Copy1 of the #orders table to Copy2 of the #orders table) in such a way that consecutive orders could get matched, thus making date comparisons simple. The idea was to join the first order to the second order, the second order to the third order, and so on for each customer, as Figure 3 shows. The trick that turned this idea into reality was adding 1 to the RowNumber of Copy1 before joining it to Copy2. That way, a customer's first order is joined to his or her second order.
Figure 3: Matching consecutive orders to make date comparisons simple (click to enlarge)
With the consecutive orders matched, I used the DATEDIFF function to measure the time between them. The beauty of this approach is that you don't have to worry about the endpoints because they take care of themselves. The last order for a customer isn't joined to any order (e.g., customer 11000 order 3 in Copy1 isn't joined to any order in Copy2) and is therefore automatically dropped from the result set.
Callout C in Listing 1 highlights the code that joins the #orders table to itself, matches corresponding orders, and measures the days between those orders. As you can see, the resulting data set is put into the new #BtwnOrders table. If you were to select the records from #BtwnOrders by adding
- SELECT * FROM #BtwnOrders
at the end of the code in callout C, you'd see results like that in Figure 4.
Figure 4: Sample results from running DaysBetweenOrders.sql in its entirety (click to enlarge)
The final step was to use the AVG function in a SELECT statement to calculate the average number of days between orders for each customer. Callout D in Listing 1 shows this code. When you run DaysBetweenOrders.sql in its entirety, you'll get results such as
I found that this solution took less than one second to run against the FactInternetSales table, which is a vast improvement over the four minutes required by the looping algorithm. When I ran this solution against 12 million records for 120,000 customers, it completed in just over two minutes.
You can download DaysBetweenOrders.sql by clicking the 101924.zip hotlink at the top of the page. Using DaysBetweenOrders.sql as a template, you can create your own solution that determines the average number of days between important events, such as customer orders, server failures, support calls, or website visits.