Solutions to TSQL Challenge – Reoccurring Visits

Last week I provided a T-SQL challenge involving a table called DailyVisits that holds information about daily visits to a website. Your task was to identify, for each day, certain statistics compared to the previous day, like how many visits there were, how many visitors were added, how many removed, and how many remained. You can find the puzzle details here. I’d like to thank all those who submitted solutions including Peter Larsson (Peso), Alejandro Mesa, simran and Tomaz Kastrun.

To test the performance of the solutions I populated the DailyVisits table with about 180,000 rows using the following code:

-- large set of sample data for perf tests

-- about 180,000 rows

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

TRUNCATE TABLE dbo.DailyVisits;

 

INSERT INTO dbo.DailyVisits WITH (TABLOCK) (dt, visitor)

  SELECT dt, CAST(number AS VARCHAR(10)) AS visitor

  FROM dbo.GetDates(@from, @to) AS D

    CROSS APPLY (SELECT number, ABS(CHECKSUM(NEWID())) % 2 AS rnd

                 FROM master..spt_values

                 WHERE type = 'p'

                   AND number BETWEEN 1 AND 2000) AS A

  WHERE rnd = 1;

I also added the following index:

CREATE UNIQUE INDEX idx_visitor_dt ON dbo.DailyVisits(visitor, dt);

Part of the difficulty in addressing the task was that there was no assurance for visits every day; technically, it is possible to have zero visits in some days, and then there are no entries for such days in the DailyVisits table. But you had to include days without visits in the output if they fell in the input range of dates. In the puzzle’s comments section you can find that simran and Tomaz Kastrun addressed this difficulty by using either a loop or a recursive query to loop through the dates in the input range.

Simran’s Solutions

Simran used in one solution a loop and in the other recursion to iterate through the dates in the input range and then used subueries to calculate the different measures. Both solutions ran over a minute against the larger set of sample data.

Tomaz Kastrun’s Solutions

Tomaz Kastrun used recursion in his first solution to create the dates in the input period. In his second solution he used master..spt_values as alternative to an auxiliary table of numbers, and based on the numbers generated the dates in the input period. Then in the second part of the solutions, Tomaz joined the set of dates produced in the first step with DailyVisits twice, using left outer joins—once for current date, and second time for previous date. Then he use expressions checking counts of each measure by looking at current vs. previous elements. Both solutions ran over a minute.

GetDates Functions

Alejandro, I and Peso used a helper table called GetDates that generates a set of dates within an input range. Here’s the function’s definition:

-- definition of GetDates function

IF OBJECT_ID('dbo.GetDates', 'IF') IS NOT NULL

  DROP FUNCTION dbo.GetDates;

GO

CREATE FUNCTION dbo.GetDates(@low AS DATE, @high AS DATE) RETURNS TABLE

AS

RETURN

  WITH

    L0   AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),

    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum

            FROM L5)

  SELECT TOP (DATEDIFF(day, @low, @high) + 1) DATEADD(day, rownum - 1, @low) AS dt

  FROM Nums

  ORDER BY rownum;

GO

 

-- test function

SELECT dt

FROM dbo.GetDates('20110601', '20110608') AS D;

 

dt

----------

2011-06-01

2011-06-02

2011-06-03

2011-06-04

2011-06-05

2011-06-06

2011-06-07

2011-06-08

Alejandro’s Solutions

Here’s Alejandro’s first solution (call it Alejandro 1):

Run time: 55 seconds.

Description (by Alejandro): My approach was to generate the Cartesian product between visitors and dates in the range, which could be very expensive if there are many visitors, or many dates, or both.  Then doing an outer join with the table to flag each row as visit or no visit.  Finally a self join to compare current and previous visit to calculate if current visit was added, removed, or remained, and grouping by date.

Solution code:

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

-- self join

WITH V AS (

SELECT DISTINCT visitor

FROM dbo.DailyVisits

)

, D AS (

SELECT dt

FROM dbo.GetDates(@from, @to)

)

, EDV AS (

SELECT

      V.visitor,

      D.dt,

      CASE WHEN DV.visitor IS NULL AND DV.dt IS NULL THEN 0 ELSE 1 END AS visit_cnt

FROM

      (

      V

      CROSS JOIN

      D

      )

      LEFT OUTER JOIN

      dbo.DailyVisits AS DV

      ON V.visitor = DV.visitor

      AND D.dt = DV.dt

)

SELECT

      A.dt,

      SUM(ISNULL(A.visit_cnt, 0)) AS numvisits,

      SUM(CASE WHEN A.visit_cnt > ISNULL(B.visit_cnt, 0) THEN 1 ELSE 0 END) AS added,

      SUM(CASE WHEN A.visit_cnt < ISNULL(B.visit_cnt, 0) THEN 1 ELSE 0 END) AS removed,

      SUM(CASE WHEN A.visit_cnt + ISNULL(B.visit_cnt, 0) = 2 THEN 1 ELSE 0 END) AS remained

FROM

      EDV AS A

      LEFT OUTER JOIN

      EDV AS B

      ON A.visitor = B.visitor

      AND A.dt = DATEADD([day], 1, B.dt)

GROUP BY

      A.dt

ORDER BY

      A.dt

Alejandro also proposed a second solution (call it Alejandro 2) that uses window functions called LAG and LEAD that aren’t yet supported by SQL Server, but are planned to be supported in SQL Server Denali:

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

WITH R AS

(

  SELECT dt AS cur_dt, visitor,

    CASE WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY visitor ORDER BY dt), dt) = 1 THEN 1 ELSE 0 END AS prvbit,

    CASE WHEN DATEDIFF(day, dt, LEAD(dt) OVER(PARTITION BY visitor ORDER BY dt)) = 1 THEN 1 ELSE 0 END AS nxtbit

  FROM dbo.DailyVisits

)

SELECT dt,

  COUNT(CASE WHEN dt = cur_dt THEN 1 END) AS numvisits,

  COUNT(CASE WHEN dt = cur_dt AND COALESCE(prvbit, 0) = 0 THEN 1 END) AS added,

  COUNT(CASE WHEN dt <> cur_dt AND nxtbit = 0 THEN 1 END) AS removed,

  COUNT(CASE WHEN dt = cur_dt AND prvbit = 1 THEN 1 END) AS remained

FROM dbo.GetDates(@from, @to) AS D

  LEFT OUTER JOIN R

    ON dt IN(cur_dt, DATEADD(day, 1, cur_dt))

GROUP BY dt;

Performance remains to be seen in the future. As for the logic of the solution, the LAG function returns a value from the previous row in the partition based on the given ordering, and LEAD returns a value from the next row. The solution defines a CTE called R where in addition to the existing info from DailyVisits it computes two flags using LAG and LEAD indicating whether the current visitor had a visit the day before and day after, respectively. Then the solution performs a left outer join between the set of dates in the input range obtained from the GetNums table and R. Finally, in the outer query, the solution analyzes the flags to determine the needed statistics.

Itzik’s Solutions

Here’s my first solution (call it Itzik 1):

Run time: 31 seconds.

Description: The solution performs a full outer join between two instances of DailyVisits—one representing the current row and the other the previous row. The join predicate is: CUR.visitor = PRV.visitor AND CUR.dt = DATEADD(day, 1, PRV.dt). The solution then performs a left outer join between the set of dates in the input range obtained from GetDates (call it D) and the result of the aforementioned full join. The join predicate is: D.dt IN (CUR.dt, DATEADD(day, 1, PRV.dt)). The solution then groups the rows by D.dt, and computes the desired statistics by analyzing the details in the current vs. previous rows.

Solution code:

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

SELECT D.dt,

  COUNT(CUR.visitor) AS numvisits,

  COUNT(CASE WHEN CUR.visitor IS NOT NULL AND PRV.visitor IS NULL THEN 1 END) AS added,

  COUNT(CASE WHEN CUR.visitor IS NULL AND PRV.visitor IS NOT NULL THEN 1 END) AS removed,

  COUNT(CASE WHEN CUR.visitor = PRV.visitor THEN 1 END) AS remained

 

FROM dbo.GetDates(@from, @to) AS D

 

  LEFT JOIN

 

    (           dbo.DailyVisits AS CUR

      FULL JOIN dbo.DailyVisits AS PRV

             ON CUR.visitor = PRV.visitor

            AND CUR.dt = DATEADD(day, 1, PRV.dt))

 

         ON D.dt IN (CUR.dt, DATEADD(day, 1, PRV.dt))

 

GROUP BY D.dt;

In my second solution (call it Itzik 2) I also used the LAG function that is planned to be supported in SQL Server Denali. Here’s the solution’s code:

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

WITH C1 AS

(

  SELECT dt, visitor,

    CASE WHEN DATEDIFF(day, LAG(dt) OVER(PARTITION BY visitor ORDER BY dt), dt) = 1

      THEN 1 ELSE 0 END AS prvbit

  FROM dbo.DailyVisits

),

C2 AS

(

  SELECT dt,

    COUNT(*) AS numvisits,

    SUM(1-prvbit) AS added,

    SUM(prvbit) AS remained,

    -- removed = prev numvisits - cur remained

    CASE WHEN DATEDIFF(day, LAG(dt) OVER(ORDER BY dt), dt) = 1

      THEN LAG(COUNT(*)) OVER(ORDER BY dt) - SUM(prvbit) ELSE 0 END AS removed,

    0 AS filler

  FROM C1

  GROUP BY dt

 

  UNION ALL

 

  SELECT dt, 0, 0, 0, 0, 1

  FROM dbo.GetDates(@from, @to) AS D

  WHERE dt NOT IN (SELECT dt FROM dbo.DailyVisits)

)

SELECT dt,

  numvisits,

  added,

  remained,

  CASE WHEN filler = 0 THEN removed ELSE

    LAG(numvisits) OVER(ORDER BY dt) END AS removed

FROM C2;

The CTE C1 uses the LAG function to compute a flag (call it prvbit) that indicates whether the visitor had a previous visit the day before.

The CTE C2 unifies the results of two queries. One query groups the rows from C1 by dt, and computes based on the prvbit flag how many daily visits there were, how many added and how many remained. To calculate how many were removed you can subtract from the previous day’s number of visits the current day’s remained number. To get the previous day’s number of visits, the solution uses the LAG function again. The solution then queries the GetDates function returning all dates from the input range that do not appear in DailyVisits, and unifies the rows with the result of the grouped query. The code assigns a flag 1 to the rows from GetDates indicating that those are filler rows, and the flag 0 to the rows in the result of the grouped query to indicate those aren’t filler rows. 

Finally, the outer query queries the CTE C2, and computes in filler rows the removed attribute using the LAG function a third time. The number of visits in the previous day is the number of removed visits in the filler row.

Performance is yet to be seen, but I suspect the query will perform quite good.

Peso’s Solution

And the best saved for last! Peso’s solution ran only 2 seconds against the large set of sample data. It is very clever and creative, and I enjoyed very much following its logic. Here’s the solution code followed by Peso’s description of the solution:

DECLARE

  @from AS DATE = '20110101',

  @to   AS DATE = '20110630';

 

;WITH cteSource(dt, NumVisits, Added, Removed)

AS (

            SELECT              dt,

                                    SUM(NumVisits) AS NumVisits,

                                    SUM(Added) AS Added,

                                    SUM(Removed) AS Removed

            FROM                (

                                                SELECT              DATEADD(DAY, f.DayDelta, w.dt) AS dt,

                                                                        w.Visitor,

                                                                        f.NumVisits,

                                                                        f.Added,

                                                                        f.Removed

                                                FROM                dbo.DailyVisits AS w

                                                CROSS JOIN       (

                                                                                    VALUES  (0, 1,  1, -1),

                                                                                                (1, 0, -1,  1)

                                                                        ) AS f(DayDelta, NumVisits, Added, Removed)

                                                WHERE              w.dt BETWEEN DATEADD(DAY, -1, @From) AND @To

                                    ) AS d

            WHERE              dt BETWEEN @From AND @To

            GROUP BY          dt,

                                    Visitor

)

SELECT              dt,

                        SUM(NumVisits) AS NumVisits,

                        SUM(Added) AS Added,

                        SUM(Removed) AS Removed,

                        SUM(Remained) AS Remained

FROM                (

                                    SELECT              dt,

                                                            SUM(NumVisits) AS NumVisits,

                                                            SUM(CASE WHEN Added = 1 THEN 1 ELSE 0 END) AS Added,

                                                            SUM(CASE WHEN Removed = 1 THEN 1 ELSE 0 END) AS Removed,

                                                            SUM(CASE WHEN 1 IN (Added, Removed) THEN 0 ELSE 1 END) AS Remained

                                    FROM                cteSource

                                    GROUP BY          dt

 

                                    UNION ALL

 

                                    SELECT              dt,

                                                            0 AS NumVisits,

                                                            0 AS Added,

                                                            0 AS Removed,

                                                            0 AS Remained

                                    FROM                dbo.GetDates(@From, @To)

                        ) AS d

GROUP BY          dt;

Peso’s description:

This was an interesting challenge because it had a “make it or break it” logic which was not that obvious at first glance. My beta query (in my head) used a JOIN which I think most of you used too. But when you think of the actual problem in this challenge, it’s about “carry-forward” from one day to next.

What does this mean, in terms of logic and performance? Well, if the problem is to carry forward some values from one day to the next day, which is the fastest way to get the final values for the two dates? It’s not by joining, because an efficient self-join need extremely well built indexes and even when they are in place, SQL Server optimizer still need to seek out the very next day. And here is another catch by using an index; what if there is no next day, as in the example code given in this challenge?

A better approach is to calculate the next day! Doing this you avoid self-joining and can enhance performance both in terms of speed and resource utilization. I have dissected my final solution into 5 sections and 1 non-recursive CTE and I will explain the idea and usage for each section below.

Section A:

To have anything to work with, you must get something out from dbo.DailyVisits source table; the dates and the visitors. This is no secret and an absolute must.

Section B

Here is the first optimization trick because here is where you calculate the next day value, rather than seeking it out by self-joining. This also enables the solution to be a “one-pass only” solution which I prefer. The DayDelta is a number to add to the real date value; 0 means same day and 1 means next day according to how DATEADD work. The next optimization trick is to work with the visit count and here I used a value of 1 for the real days and 0 for the calculated virtual next date.
By doing this I can easily just SUM this column to get the correct count for NumVisits result. I use the same technique for the Added and Removed column; use +1 and -1 to sum in a later stage. The third optimization trick is to not calculate Remained at this point. I will get back to Remained in Section D.

Section C:

Here is the major performance gain. Since the clustered index on the dbo.DailyVisits table is on dt column, I can use this to get a nice SEEK operation from the table. And the primary WHERE clause needs to deal with the calculated date values so I want the rows from the day before the earliest day, to the last day. This is because I want the calculated next day to be included in the beginning of the interval.

Common Table Expression cteSource:

Is this section, I simply sum up all NumVisits, Added and Removed values by dt. The trick here is to group by the visitor too, to avoid the costly distinct aggregate operation. For this resultset from cteSource I will later apply another group by and then only over dt column. More about that in Section D. Here I also apply the secondary WHERE clause to remove the real row containing the date before the earliest date and the calculated next date which occurs after the latest date.

Section D:

To avoid the distinct aggregate I apply two group by’s; first over both dt and visitor columns and the second over dt column only. This is the first trick in this section and proved to be very efficient and removes the need for a sort operation in the execution plan. The second trick is to wait to calculate the Remained column until now. If you understand that if there was no difference in Added and Removed value for the visitor at any given day, the visitor must have kept/remained his presence.

Section E:

The first trick is to have this section to deal with missing dates in the source table. Here I use a very efficient function written by Mr. Ben-Gan to get the full interval. The second trick is to use 0 as value for all columns, because if the row for that date is missing, there cannot be any movement either in terms of Added, Removed or Remained.

Final query:

Now the solution is very simple. Just add Section D and Section E and sum the column values.
The query is very fast. I tested on a sample table with about 120,000 random rows and my solution used about 2 ms CPU vs 4400 ms CPU for the JOIN solution; it also used only 10 reads vs. about 800,000 reads for the JOIN solution.

Cheers,

BG

Discuss this Blog Entry 22

on Jul 14, 2011
Not bad Mr Celko. Not bad at all. It's clever to build a "dual date" cte holding yesterday's and today's dates. Add "sum(today_flg) as NumVisits," to the final output and yout get the same result as the rest of us. Statistics is very good. Only twice the CPU and twice the duration as my #6 version on my laptop with dual cpu. I'll try later in my production server with 16 cores.
on Jul 8, 2011
Thank you Peter, Alejandro and Itzik for nice and creative solutions.
on Jul 8, 2011
Here is another approach from me. It uses twice the CPU as my previous solution, 300 times more reads but finishes in almost a 1/10th of the time of my previous version. This suggestion heavily relies/utilizes parallellism... ;WITH cte AS ( SELECT f.dt, f.NumVisits, f.Visitor, f.Added, f.Removed, f.Remained FROM dbo.GetDates(@From, @To) AS gd OUTER APPLY ( SELECT dt, Visitor, SUM(NumVisits) AS NumVisits, CASE WHEN SUM(Added) > 0 THEN 1 ELSE 0 END AS Added, CASE WHEN SUM(Removed) > 0 THEN 1 ELSE 0 END AS Removed, CASE WHEN COUNT(*) = 2 THEN 1 ELSE 0 END AS Remained FROM ( SELECT CASE WHEN dv.dt = gd.dt THEN dv.dt ELSE DATEADD(DAY, 1, dv.dt) END AS dt, dv.Visitor, CASE WHEN dv.dt = gd.dt THEN 1 ELSE 0 END AS NumVisits, CASE WHEN dv.dt = gd.dt THEN 1 ELSE -1 END AS Added, CASE WHEN dv.dt = gd.dt THEN -1 ELSE 0 END AS Removed FROM dbo.DailyVisits AS dv WHERE dv.dt BETWEEN DATEADD(DAY, -1, gd.dt) AND gd.dt ) AS d GROUP BY dt, Visitor ) AS f ) SELECT dt, SUM(NumVisits) AS NumVisits, SUM(Added) AS Added, SUM(Removed) AS Removed, SUM(Remained) AS Remained FROM cte GROUP BY dt
on Jul 9, 2011
Holy smoley! Only 1 scan count and 344 reads. ;WITH cteSource(dt, NumVisits, Added, Removed, Remained) AS ( SELECT dt, SUM(NumVisits) AS NumVisits, SUM(NumVisits) - SUM(Remained) AS Added, SUM(Removed) AS Removed, SUM(Remained) AS Remained FROM ( SELECT f.theDate AS dt, dv.Visitor, SUM(f.Virtual) AS NumVisits, MIN(1 - f.Virtual) AS Removed, COUNT(*) - 1 AS Remained FROM dbo.DailyVisits AS dv CROSS APPLY ( VALUES (DATEADD(DAY, 1, dv.dt), 0), (dv.dt, 1) ) AS f(theDate, Virtual) WHERE f.theDate BETWEEN @From AND @To GROUP BY f.theDate, dv.Visitor ) AS d GROUP BY dt ) SELECT dt, SUM(NumVisits) AS NumVisits, SUM(Added) AS Added, SUM(Removed) AS Removed, SUM(Remained) AS Remained FROM ( SELECT dt, NumVisits, Added, Removed, Remained FROM cteSource UNION ALL SELECT dt, 0 AS NumVisits, 0 AS Added, 0 AS Removed, 0 AS Remained FROM dbo.GetDates(@From, @To) ) AS d GROUP BY dt
on Jul 15, 2011
Test done on production server. -- Celko #1 (181 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0. Table 'dimDate'. Scan count 7, logical reads 7. Table 'DailyVisits'. Scan count 362, logical reads 2412. SQL Server Execution Times: CPU time = 1370 ms, elapsed time = 547 ms. -- Peso #6 (181 row(s) affected) Table 'Worktable'. Scan count 0, logical reads 0. Table 'DailyVisits'. Scan count 181, logical reads 1943. SQL Server Execution Times: CPU time = 1325 ms, elapsed time = 101 ms.
Alejandro Mesa (not verified)
on Jul 9, 2011
Peter, Here are the statistics for #6. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'DailyVisits'. Scan count 181, logical reads 1680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1185 ms, elapsed time = 603 ms. Hardware: Lenovo W500, with Intel Core Duo 2.53 Ghz, 8 BG RAM, a 7200 rpm HD. Software: Windows 7 Pro x64 / SQL Server 2008 DE SP2 x64 I see lower elapsed time (2204 vs 603) and same CPU time, between #1 and #6.
on Jul 13, 2011
You're welcome. See you for next challenge.
on Jul 8, 2011
Thank you guys. Those were some great solutions. Keep the challenges coming
on Jul 8, 2011
This is the last one. This runs in 0.1 second on my server while the featured #3 runs in 1.4 seconds. Enjoy! -- SwePeso #6 ;WITH cte AS ( SELECT f.dt, f.NumVisits, f.Visitor, f.Added, f.Removed, f.Remained FROM dbo.GetDates(@From, @To) AS gd OUTER APPLY ( SELECT q.dt, q.Visitor, SUM(q.NumVisits) AS NumVisits, CASE WHEN SUM(q.Added) > 0 THEN 1 ELSE 0 END AS Added, CASE WHEN SUM(q.Removed) > 0 THEN 1 ELSE 0 END AS Removed, CASE WHEN COUNT(*) = 2 THEN 1 ELSE 0 END AS Remained FROM ( SELECT gd.dt, d.Visitor, 1 - d.theDiff AS NumVisits, 1 - 2 * d.theDiff AS Added, 2 * d.theDiff - 1 AS Removed FROM ( SELECT dv.Visitor, CASE WHEN dv.dt = gd.dt THEN 0 ELSE 1 END AS theDiff FROM dbo.DailyVisits AS dv WHERE dv.dt BETWEEN DATEADD(DAY, -1, gd.dt) AND gd.dt ) AS d ) AS q GROUP BY q.dt, q.Visitor ) AS f ) SELECT dt, SUM(NumVisits) AS NumVisits, SUM(Added) AS Added, SUM(Removed) AS Removed, SUM(Remained) AS Remained FROM cte GROUP BY dt
on Jul 9, 2011
AMB, can you test my #6 on your machine?
on Jul 9, 2011
All, I tried to fix the typo Alejandro was referring to, and when attempting to publish the entry, the entire blog body disappeared. I've sent e-mail to the editors and hopefully it will be fixed soon. In the meanwhile, I arrange a page in my site with the blog entry: http://tsql.solidq.com/blog20110707.htm. Sorry about the temporary inconvenience. Cheers, BG
Alejandro Mesa (not verified)
on Jul 8, 2011
Peter, Yes, I did like the description, and it helped big time to grasp the idea. The one thing that I liked more, was the cross apply and the numbers used for (dailydelta, numvisits, added, removed). That was awesome. Itzik had a typo in the time taken by his first solution. It takes 25 sec in my box, and mine 55 sec. Nothing will change compared to yours :), but hey 30 sec difference is bit deal. I really appreciate learning from you all. -- AMB
on Jul 9, 2011
Thank you . That is what I suspected. I get similar timings on my dualcore laptop. 3108 ms duration vs 943 ms (1934 ms CPU vs 1732 ms). On my 16-core server I get this timings; 1470 ms duration vs 103 ms, and 921 ms CPU vs 1529 ms. 2 cpu's used vs 10 cpu's used. CROSS APPLY is one of my favorite command to get parallellism.
on Jul 25, 2012
Late to the party but this beats Peso's #6 with a very simplistic FULL join: ;WITH tallyCTE AS ( SELECT TOP (SELECT DATEDIFF(d,MIN(dt),MAX(dt))+1 FROM dbo.DailyVisits) n = DATEADD(d,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,(SELECT MIN(dt) FROM dbo.DailyVisits)) FROM sys.all_columns c1 CROSS JOIN sys.all_columns c2 ) SELECT n, visits = SUM(CASE WHEN value IN ('R','A') THEN 1 ELSE 0 END), added = SUM(CASE WHEN value = 'A' THEN 1 ELSE 0 END), removed = SUM(CASE WHEN value = 'D' THEN 1 ELSE 0 END), remained = SUM(CASE WHEN value = 'R' THEN 1 ELSE 0 END) FROM tallyCTE LEFT JOIN ( SELECT dt = ISNULL(dv2.dt,DATEADD(d,1,dv1.dt)), value = CASE WHEN dv1.dt IS NOT NULL AND dv2.dt IS NOT NULL THEN 'R' --Remained WHEN dv1.dt IS NOT NULL AND dv2.dt IS NULL THEN 'D' --Removed WHEN dv1.dt IS NULL and dv2.dt IS NOT NULL THEN 'A' --Added END FROM dbo.DailyVisits dv1 --Full join returns all records. ISNULL and CASE statement creates the groups FULL JOIN dbo.DailyVisits dv2 ON dv1.dt = DATEADD(d,-1,dv2.dt) AND dv1.visitor = dv2.visitor ) x ON n = dt group by n
Alejandro Mesa (not verified)
on Jul 7, 2011
I tip my hat to you both. That is a great solution, Peter! -- AMB
on Jul 7, 2011
Thank you Alejandro. I hope you liked the solution description. The matching section placeholders in code seem to be missing.
on Jul 8, 2011
Here is a better solution, #5. Almost the same number of reads, twice the CPU and 1/10th the duration of the featured #3 query. ;WITH cte AS ( SELECT f.dt, f.NumVisits, f.Visitor, f.Added, f.Removed, f.Remained FROM dbo.GetDates(@From, @To) AS gd OUTER APPLY ( SELECT q.dt, q.Visitor, SUM(q.NumVisits) AS NumVisits, CASE WHEN SUM(q.Added) > 0 THEN 1 ELSE 0 END AS Added, CASE WHEN SUM(q.Removed) > 0 THEN 1 ELSE 0 END AS Removed, CASE WHEN COUNT(*) = 2 THEN 1 ELSE 0 END AS Remained FROM ( SELECT gd.dt, d.Visitor, 1 - d.theDiff AS NumVisits, 1 - 2 * d.theDiff AS Added, 2 * d.theDiff - 1 AS Removed FROM ( SELECT dv.Visitor, DATEDIFF(DAY, dv.dt, gd.dt) AS theDiff FROM dbo.DailyVisits AS dv WHERE dv.dt BETWEEN DATEADD(DAY, -1, gd.dt) AND gd.dt ) AS d ) AS q GROUP BY q.dt, q.Visitor ) AS f ) SELECT dt, SUM(NumVisits) AS NumVisits, SUM(Added) AS Added, SUM(Removed) AS Removed, SUM(Remained) AS Remained FROM cte GROUP BY dt
on Jul 11, 2011
After having grasped the idea in SwePeso solution, I can't help but say that the cross apply logic is simply awesome. Thanks once again for sharing your super creative solution.
on Jul 25, 2012
This beats Peso's #6: Late to the party but this beats Peso's #6 with a very simplistic FULL join: ;WITH tallyCTE AS ( SELECT TOP (SELECT DATEDIFF(d,MIN(dt),MAX(dt))+1 FROM dbo.DailyVisits) n = DATEADD(d,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,(SELECT MIN(dt) FROM dbo.DailyVisits)) FROM sys.all_columns c1 CROSS JOIN sys.all_columns c2 ) SELECT n, visits = SUM(CASE WHEN value IN ('R','A') THEN 1 ELSE 0 END), added = SUM(CASE WHEN value = 'A' THEN 1 ELSE 0 END), removed = SUM(CASE WHEN value = 'D' THEN 1 ELSE 0 END), remained = SUM(CASE WHEN value = 'R' THEN 1 ELSE 0 END) FROM tallyCTE LEFT JOIN ( SELECT dt = ISNULL(dv2.dt,DATEADD(d,1,dv1.dt)), value = CASE WHEN dv1.dt IS NOT NULL AND dv2.dt IS NOT NULL THEN 'R' --Remained WHEN dv1.dt IS NOT NULL AND dv2.dt IS NULL THEN 'D' --Removed WHEN dv1.dt IS NULL and dv2.dt IS NOT NULL THEN 'A' --Added END FROM dbo.DailyVisits dv1 --Full join returns all records. ISNULL and CASE statement creates the groups FULL JOIN dbo.DailyVisits dv2 ON dv1.dt = DATEADD(d,-1,dv2.dt) AND dv1.visitor = dv2.visitor ) x ON n = dt group by n
on Jul 14, 2011
Not bad Mr Celko. Not bad at all. It's clever to build a "dual date" cte holding both yesterday's and today's dates. Add "sum(today_flg) as NumVisits," to the final select and you get the same result as the rest of us. Statistics says this query uses twice the cpu and twice the duration as my #6 version on my dualcore laptop. I'll try later on my production server with 16 cores.
on Jul 19, 2011
Solution#1, used idx_visitor_dt. ;with Calendar as ( select @from dt union all select DATEADD(DAY, 1, dt) from Calendar where dt < @to ) select c.dt , isnull(todayNum, 0) numvisitors , isnull(todayNum - remain, 0) added , isnull(yesterdayNum - remain, 0) removed , isnull(remain, 0) remain from Calendar c left join ( select dt , COUNT(v1) as todayNum , COUNT(v2) as yesterdayNum , SUM(remain) as remain from ( select isnull(today.dt, DATEADD(day, 1, yesterday.dt)) as dt , today.visitor as v1 , yesterday.visitor as v2 , case when today.visitor = yesterday.visitor then 1 else 0 end remain from DailyVisits today full join DailyVisits yesterday on today.dt = dateadd(DAY, 1, yesterday.dt) and today.visitor = yesterday.visitor ) t group by dt ) t2 on t2.dt = c.dt option (maxrecursion 0) Solution#2, used idx_visitor_dt, dbo.GetDates. select d.dt , isnull(todayNum, 0) numvisitors , isnull(todayNum - remain, 0) added , isnull(yesterdayNum - remain, 0) removed , isnull(remain, 0) remain from dbo.GetDates(@from, @to) d left join ( select dt , COUNT(v1) as todayNum , COUNT(v2) as yesterdayNum , SUM(remain) as remain from ( select isnull(today.dt, DATEADD(day, 1, yesterday.dt)) as dt , today.visitor as v1 , yesterday.visitor as v2 , case when today.visitor = yesterday.visitor then 1 else 0 end remain from DailyVisits today full join DailyVisits yesterday on today.dt = dateadd(DAY, 1, yesterday.dt) and today.visitor = yesterday.visitor ) t group by dt ) t2 on t2.dt = d.dt
on Jul 14, 2011
I am late to the party, but here goes. Let's assume we have a the usual Calendar table. Use it to build a table of date pairs. Use the date pairs flag each visitor as a today and/or yesterday visit. The LEFT OUTER JOIN will give us the days when nothing happened. Finally aggregate by date and use a little math trick to filter the flags inside the SUM(), which should be faster than a CASE expression. WITH Date_Span (yesterday, today) AS (SELECT DATEADD(DAY, -1, cal_date), cal_date FROM Calender WHERE cal_dte BETWEEN @in_start_date AND @in_end_date), TwoDays(today, visitor_id, today_flg, yesterday_flg) AS (SELECT D1.today, V1.visitor_id, SUM (CASE WHEN D1.today = V1.login_date THEN 1 ELSE 0 END) AS today_flg, SUM (CASE WHEN D1.yesterday = V1.login_date THEN 1 ELSE 0 END) AS yesterday_flg FROM Date_Span AS D1 LEFT OUTER JOIN DailyVisits AS V1 ON login_date IN (D1.yesterday, D1.today) GROUP BY D1.today, V1.visitor_id) SELECT today, SUM (today_flg * (1-yesterday_flg)) AS new_flg_tot, SUM((1-today_flg) * yesterday_flg) AS dropped_flg_tot, SUM(today_flg * yesterday_flg) AS remained_flg_tot FROM TwoDays GROUP BY today;

Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×