Thu, 07/07/2011 - 1:00pm

Last week I provided a T-SQL challenge involving a table called DailyVisits that holds information about daily visits to a website. Your task was to identify, for each day, certain statistics compared to the previous day, like how many visits there were, how many visitors were added, how many removed, and how many remained. You can find the puzzle details here. I’d like to thank all those who submitted solutions including Peter Larsson (Peso), Alejandro Mesa, simran and Tomaz Kastrun.

To test the performance of the solutions I populated the DailyVisits table with about 180,000 rows using the following code:

-- large set of sample data for perf tests

-- about 180,000 rows

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

TRUNCATE TABLE dbo.DailyVisits;

INSERT INTO dbo.DailyVisits WITH (TABLOCK) (dt, visitor)

SELECT dt, CAST(number AS VARCHAR(10)) AS visitor

FROM dbo.GetDates(@from, @to) AS D

CROSS APPLY (SELECT number, ABS(CHECKSUM(NEWID())) % 2 AS rnd

FROM master..spt_values

WHERE type = 'p'

AND number BETWEEN 1 AND 2000) AS A

WHERE rnd = 1;

I also added the following index:

CREATE UNIQUE INDEX idx_visitor_dt ON dbo.DailyVisits(visitor, dt);

Part of the difficulty in addressing the task was that there was no assurance for visits every day; technically, it is possible to have zero visits in some days, and then there are no entries for such days in the DailyVisits table. But you had to include days without visits in the output if they fell in the input range of dates. In the puzzle’s comments section you can find that simran and Tomaz Kastrun addressed this difficulty by using either a loop or a recursive query to loop through the dates in the input range.

**Simran’s Solutions**

Simran used in one solution a loop and in the other recursion to iterate through the dates in the input range and then used subueries to calculate the different measures. Both solutions ran over a minute against the larger set of sample data.

**Tomaz Kastrun’s Solutions**

Tomaz Kastrun used recursion in his first solution to create the dates in the input period. In his second solution he used master..spt_values as alternative to an auxiliary table of numbers, and based on the numbers generated the dates in the input period. Then in the second part of the solutions, Tomaz joined the set of dates produced in the first step with DailyVisits twice, using left outer joins—once for current date, and second time for previous date. Then he use expressions checking counts of each measure by looking at current vs. previous elements. Both solutions ran over a minute.

**GetDates Functions**

Alejandro, I and Peso used a helper table called GetDates that generates a set of dates within an input range. Here’s the function’s definition:

-- definition of GetDates function

IF OBJECT_ID('dbo.GetDates', 'IF') IS NOT NULL

DROP FUNCTION dbo.GetDates;

GO

CREATE FUNCTION dbo.GetDates(@low AS DATE, @high AS DATE) RETURNS TABLE

AS

RETURN

WITH

L0 AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),

L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum

FROM L5)

SELECT TOP (DATEDIFF(day, @low, @high) + 1) DATEADD(day, rownum - 1, @low) AS dt

FROM Nums

ORDER BY rownum;

GO

-- test function

SELECT dt

FROM dbo.GetDates('20110601', '20110608') AS D;

dt

----------

2011-06-01

2011-06-02

2011-06-03

2011-06-04

2011-06-05

2011-06-06

2011-06-07

2011-06-08

**Alejandro’s Solutions**

Here’s Alejandro’s first solution (call it Alejandro 1):

Run time: 55 seconds.

Description (by Alejandro): My approach was to generate the Cartesian product between visitors and dates in the range, which could be very expensive if there are many visitors, or many dates, or both. Then doing an outer join with the table to flag each row as visit or no visit. Finally a self join to compare current and previous visit to calculate if current visit was added, removed, or remained, and grouping by date.

Solution code:

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

-- self join

WITH V AS (

SELECT DISTINCT visitor

FROM dbo.DailyVisits

)

, D AS (

SELECT dt

FROM dbo.GetDates(@from, @to)

)

, EDV AS (

SELECT

V.visitor,

D.dt,

CASE WHEN DV.visitor IS NULL AND DV.dt IS NULL THEN 0 ELSE 1 END AS visit_cnt

FROM

(

V

CROSS JOIN

D

)

LEFT OUTER JOIN

dbo.DailyVisits AS DV

ON V.visitor = DV.visitor

AND D.dt = DV.dt

)

SELECT

A.dt,

SUM(ISNULL(A.visit_cnt, 0)) AS numvisits,

SUM(CASE WHEN A.visit_cnt > ISNULL(B.visit_cnt, 0) THEN 1 ELSE 0 END) AS added,

SUM(CASE WHEN A.visit_cnt < ISNULL(B.visit_cnt, 0) THEN 1 ELSE 0 END) AS removed,

SUM(CASE WHEN A.visit_cnt + ISNULL(B.visit_cnt, 0) = 2 THEN 1 ELSE 0 END) AS remained

FROM

EDV AS A

LEFT OUTER JOIN

EDV AS B

ON A.visitor = B.visitor

AND A.dt = DATEADD([day], 1, B.dt)

GROUP BY

A.dt

ORDER BY

A.dt

Alejandro also proposed a second solution (call it Alejandro 2) that uses window functions called LAG and LEAD that aren’t yet supported by SQL Server, but are planned to be supported in SQL Server Denali:

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

WITH R AS

(

SELECT dt AS cur_dt, visitor,

CASE WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY visitor ORDER BY dt), dt) = 1 THEN 1 ELSE 0 END AS prvbit,

CASE WHEN DATEDIFF(day, dt, LEAD(dt) OVER(PARTITION BY visitor ORDER BY dt)) = 1 THEN 1 ELSE 0 END AS nxtbit

FROM dbo.DailyVisits

)

SELECT dt,

COUNT(CASE WHEN dt = cur_dt THEN 1 END) AS numvisits,

COUNT(CASE WHEN dt = cur_dt AND COALESCE(prvbit, 0) = 0 THEN 1 END) AS added,

COUNT(CASE WHEN dt <> cur_dt AND nxtbit = 0 THEN 1 END) AS removed,

COUNT(CASE WHEN dt = cur_dt AND prvbit = 1 THEN 1 END) AS remained

FROM dbo.GetDates(@from, @to) AS D

LEFT OUTER JOIN R

ON dt IN(cur_dt, DATEADD(day, 1, cur_dt))

GROUP BY dt;

Performance remains to be seen in the future. As for the logic of the solution, the LAG function returns a value from the previous row in the partition based on the given ordering, and LEAD returns a value from the next row. The solution defines a CTE called R where in addition to the existing info from DailyVisits it computes two flags using LAG and LEAD indicating whether the current visitor had a visit the day before and day after, respectively. Then the solution performs a left outer join between the set of dates in the input range obtained from the GetNums table and R. Finally, in the outer query, the solution analyzes the flags to determine the needed statistics.

**Itzik’s Solutions**

Here’s my first solution (call it Itzik 1):

Description: The solution performs a full outer join between two instances of DailyVisits—one representing the current row and the other the previous row. The join predicate is: CUR.visitor = PRV.visitor AND CUR.dt = DATEADD(day, 1, PRV.dt). The solution then performs a left outer join between the set of dates in the input range obtained from GetDates (call it D) and the result of the aforementioned full join. The join predicate is: D.dt IN (CUR.dt, DATEADD(day, 1, PRV.dt)). The solution then groups the rows by D.dt, and computes the desired statistics by analyzing the details in the current vs. previous rows.

Solution code:

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

SELECT D.dt,

COUNT(CUR.visitor) AS numvisits,

COUNT(CASE WHEN CUR.visitor IS NOT NULL AND PRV.visitor IS NULL THEN 1 END) AS added,

COUNT(CASE WHEN CUR.visitor IS NULL AND PRV.visitor IS NOT NULL THEN 1 END) AS removed,

COUNT(CASE WHEN CUR.visitor = PRV.visitor THEN 1 END) AS remained

FROM dbo.GetDates(@from, @to) AS D

LEFT JOIN

( dbo.DailyVisits AS CUR

FULL JOIN dbo.DailyVisits AS PRV

ON CUR.visitor = PRV.visitor

AND CUR.dt = DATEADD(day, 1, PRV.dt))

ON D.dt IN (CUR.dt, DATEADD(day, 1, PRV.dt))

GROUP BY D.dt;

In my second solution (call it Itzik 2) I also used the LAG function that is planned to be supported in SQL Server Denali. Here’s the solution’s code:

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

WITH C1 AS

(

SELECT dt, visitor,

CASE WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY visitor ORDER BY dt), dt) = 1

THEN 1 ELSE 0 END AS prvbit

FROM dbo.DailyVisits

),

C2 AS

(

SELECT dt,

COUNT(*) AS numvisits,

SUM(1-prvbit) AS added,

SUM(prvbit) AS remained,

-- removed = prev numvisits - cur remained

CASE WHEN DATEDIFF(day, LAG(dt) OVER(ORDER BY dt), dt) = 1

THEN LAG(COUNT(*)) OVER(ORDER BY dt) - SUM(prvbit) ELSE 0 END AS removed,

0 AS filler

FROM C1

GROUP BY dt

UNION ALL

SELECT dt, 0, 0, 0, 0, 1

FROM dbo.GetDates(@from, @to) AS D

WHERE dt NOT IN (SELECT dt FROM dbo.DailyVisits)

)

SELECT dt,

numvisits,

added,

remained,

CASE WHEN filler = 0 THEN removed ELSE

LAG(numvisits) OVER(ORDER BY dt) END AS removed

FROM C2;

The CTE C1 uses the LAG function to compute a flag (call it prvbit) that indicates whether the visitor had a previous visit the day before.

The CTE C2 unifies the results of two queries. One query groups the rows from C1 by dt, and computes based on the prvbit flag how many daily visits there were, how many added and how many remained. To calculate how many were removed you can subtract from the previous day’s number of visits the current day’s remained number. To get the previous day’s number of visits, the solution uses the LAG function again. The solution then queries the GetDates function returning all dates from the input range that do not appear in DailyVisits, and unifies the rows with the result of the grouped query. The code assigns a flag 1 to the rows from GetDates indicating that those are filler rows, and the flag 0 to the rows in the result of the grouped query to indicate those aren’t filler rows.

Finally, the outer query queries the CTE C2, and computes in filler rows the *removed* attribute using the LAG function a third time. The number of visits in the previous day is the number of removed visits in the filler row.

Performance is yet to be seen, but I suspect the query will perform quite good.

**Peso’s Solution**

And the best saved for last! Peso’s solution ran only 2 seconds against the large set of sample data. It is very clever and creative, and I enjoyed very much following its logic. Here’s the solution code followed by Peso’s description of the solution:

DECLARE

@from AS DATE = '20110101',

@to AS DATE = '20110630';

;WITH cteSource(dt, NumVisits, Added, Removed)

AS (

SELECT dt,

SUM(NumVisits) AS NumVisits,

SUM(Added) AS Added,

SUM(Removed) AS Removed

FROM (

SELECT DATEADD(DAY, f.DayDelta, w.dt) AS dt,

w.Visitor,

f.NumVisits,

f.Added,

f.Removed

FROM dbo.DailyVisits AS w

CROSS JOIN (

VALUES (0, 1, 1, -1),

(1, 0, -1, 1)

) AS f(DayDelta, NumVisits, Added, Removed)

WHERE w.dt BETWEEN DATEADD(DAY, -1, @From) AND @To

) AS d

WHERE dt BETWEEN @From AND @To

GROUP BY dt,

Visitor

)

SELECT dt,

SUM(NumVisits) AS NumVisits,

SUM(Added) AS Added,

SUM(Removed) AS Removed,

SUM(Remained) AS Remained

FROM (

SELECT dt,

SUM(NumVisits) AS NumVisits,

SUM(CASE WHEN Added = 1 THEN 1 ELSE 0 END) AS Added,

SUM(CASE WHEN Removed = 1 THEN 1 ELSE 0 END) AS Removed,

SUM(CASE WHEN 1 IN (Added, Removed) THEN 0 ELSE 1 END) AS Remained

FROM cteSource

GROUP BY dt

UNION ALL

SELECT dt,

0 AS NumVisits,

0 AS Added,

0 AS Removed,

0 AS Remained

FROM dbo.GetDates(@From, @To)

) AS d

GROUP BY dt;

**Peso’s description:**

This was an interesting challenge because it had a “make it or break it” logic which was not that obvious at first glance. My beta query (in my head) used a JOIN which I think most of you used too. But when you think of the actual problem in this challenge, it’s about “carry-forward” from one day to next.

What does this mean, in terms of logic and performance? Well, if the problem is to carry forward some values from one day to the next day, which is the fastest way to get the final values for the two dates? It’s *not* by joining, because an efficient self-join need extremely well built indexes and even when they are in place, SQL Server optimizer still need to seek out the very next day. And here is another catch by using an index; what if there is no next day, as in the example code given in this challenge?

A better approach is to *calculate* the next day! Doing this you avoid self-joining and can enhance performance both in terms of speed and resource utilization. I have dissected my final solution into 5 sections and 1 non-recursive CTE and I will explain the idea and usage for each section below.

**Section A:**

To have anything to work with, you must get something out from dbo.DailyVisits source table; the dates and the visitors. This is no secret and an absolute must.

**Section B**

Here is the first optimization trick because here is where you calculate the next day value, rather than seeking it out by self-joining. This also enables the solution to be a “one-pass only” solution which I prefer. The DayDelta is a number to add to the real date value; 0 means same day and 1 means next day according to how DATEADD work. The next optimization trick is to work with the visit count and here I used a value of 1 for the real days and 0 for the calculated virtual next date.

By doing this I can easily just SUM this column to get the correct count for NumVisits result. I use the same technique for the Added and Removed column; use +1 and -1 to sum in a later stage. The third optimization trick is to *not* calculate Remained at this point. I will get back to Remained in Section D.

**Section C:**

Here is the major performance gain. Since the clustered index on the dbo.DailyVisits table is on dt column, I can use this to get a nice SEEK operation from the table. And the primary WHERE clause needs to deal with the calculated date values so I want the rows from the day *before* the earliest day, to the last day. This is because I want the calculated *next* day to be included in the beginning of the interval.

**Common Table Expression cteSource:**

Is this section, I simply sum up all NumVisits, Added and Removed values by dt. The trick here is to group by the visitor too, to avoid the costly distinct aggregate operation. For this resultset from cteSource I will later apply another group by and then only over dt column. More about that in Section D. Here I also apply the secondary WHERE clause to remove the real row containing the date before the earliest date and the calculated next date which occurs after the latest date.

**Section D:**

To avoid the distinct aggregate I apply two group by’s; first over both dt and visitor columns and the second over dt column only. This is the first trick in this section and proved to be very efficient and removes the need for a sort operation in the execution plan. The second trick is to wait to calculate the Remained column until now. If you understand that if there was no difference in Added and Removed value for the visitor at any given day, the visitor must have kept/remained his presence.

**Section E:**

The first trick is to have this section to deal with missing dates in the source table. Here I use a very efficient function written by Mr. Ben-Gan to get the full interval. The second trick is to use 0 as value for all columns, because if the row for that date is missing, there cannot be any movement either in terms of Added, Removed or Remained.

**Final query:**

Now the solution is very simple. Just add Section D and Section E and sum the column values.

The query is very fast. I tested on a sample table with about 120,000 random rows and my solution used about 2 ms CPU vs 4400 ms CPU for the JOIN solution; it also used only 10 reads vs. about 800,000 reads for the JOIN solution.

Cheers,

BG
Syndicate