When using SQL Server, you frequently need to work with data that represents intervals of time. For example, consider intervals representing sessions, contracts, projects, and so on. I covered a few aspects of manipulation of intervals in the past, but there are still many interesting challenges to cover. Tasks related to interval manipulation are typically quite intriguing, especially because coming up with efficient solutions isn't easy.

I recently had the chance to work on a few challenges involving intervals and counts. This article is the first in a series covering those challenges and their solutions.

Related: Solving Gaps and Islands with Enhanced Window Functions

I'd like to thank Adam Machanic, who sent me the challenge that's the focus of this article. I'd also like to thank Ben Flanaghan, who created one of the fundamental techniques that I rely on in one of the solutions that I present in this article.

Sample Data

I'll use the same sample data for all of the challenges that I'll discuss in this series. The data involves a table called Sessions that holds sessions against different applications. The column app represents the application (call it the partitioning element), and the columns starttime and endtime represent the start and end points of the session's interval, respectively. The starttime and endtime columns use the DATETIME2(0) type, which has a precision of one second.

Related: Interval Queries in SQL Server

Run the code in Listing 1 to create the Sessions table and two supporting indexes, which will be used by the different solutions that I'll present in the series.

                              SET NOCOUNT ON;                              USE tempdb;                              IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;                              IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;                              CREATE TABLE dbo.Apps                              (                                app       VARCHAR(10) NOT NULL,                                CONSTRAINT PK_Apps PRIMARY KEY(app)                              );                              CREATE TABLE dbo.Sessions                              (                                keycol    INT         NOT NULL,                                app       VARCHAR(10) NOT NULL,                                starttime DATETIME2(0)   NOT NULL,                                endtime   DATETIME2(0)   NOT NULL,                                CONSTRAINT PK_Sessions PRIMARY KEY(keycol),                                CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime)                              );                              CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol);                              CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol);

The index called idx_start is defined on the key list (app, starttime, keycol), and the index called idx_end is defined on the key list (app, endtime, keycol).

Use the code in Listing 2 to populate the Sessions table with a small set of sample data.

                              TRUNCATE TABLE dbo.Sessions;                              TRUNCATE TABLE dbo.Apps;                              INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');                              INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES                                (2,  'app1', '20130212 08:30:00', '20130212 10:30:00'),                                (3,  'app1', '20130212 08:30:00', '20130212 08:45:00'),                                (5,  'app1', '20130212 09:00:00', '20130212 09:30:00'),                                (7,  'app1', '20130212 09:15:00', '20130212 10:30:00'),                                (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'),                                (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'),                                (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'),                                (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'),                                (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'),                                (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'),                                (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'),                                (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'),                                (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'),                                (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'),                                (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'),                                (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'),                                (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'),                                (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'),                                (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'),                                (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');

You can use the sample data from Listing 2 to verify the correctness of your solutions. Figure 1 provides a graphic representation of the small set of sample data.

Intervals

To generate a large set of sample data, you'll need the helper function GetNums, which accepts input parameters called @low and @high and returns a sequence of integers in the requested range. Use the code in Listing 3 to create the GetNums function.

                              IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;                              GO                              CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE                              AS                              RETURN                                WITH                                  L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),                                  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),                                  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),                                  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),                                  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),                                  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),                                  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum                                          FROM L5)                                SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n                                FROM Nums                                ORDER BY rownum;                              GO

Use the code in Listing 4 to populate the Sessions table with a large set of sample data.

                              TRUNCATE TABLE dbo.Sessions;                              TRUNCATE TABLE dbo.Apps;                              DECLARE                                @numrows AS INT = 2000000, -- total number of rows                                @numapps AS INT = 100;     -- number of applications                              INSERT INTO dbo.Apps WITH(TABLOCK) (app)                                SELECT 'app' + CAST(n AS VARCHAR(10)) AS app                                FROM dbo.GetNums(1, @numapps) AS Nums;                              INSERT INTO dbo.Sessions WITH(TABLOCK)                                  (keycol, app, starttime, endtime)                                SELECT                                  ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol,                                  D.*,                                  DATEADD(                                    second,                                    1 + ABS(CHECKSUM(NEWID())) % (20*60),                                    starttime) AS endtime                                FROM                                (                                  SELECT                                    'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app,                                    DATEADD(                                      second,                                      1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60),                                      '20130101') AS starttime                                  FROM dbo.GetNums(1, @numrows) AS Nums                                ) AS D;

You can control the number of applications and the total number of rows in the sample data by setting the variables @numapps and @numrows, respectively. The code in Listing 4 fills the Apps table with 100 applications and the Sessions table with 2,000,000 rows.

The Challenge: Computing Counts During Discrete Intervals

The challenge that's the focus of this article is to identify the discrete intervals that exist in the data and determine how many source intervals each discrete interval overlaps with. Figure 2 provides a graphic representation of the desired result against the small set of sample data.

Counts During Discrete Intervals

Figure 3 shows the desired output of your query.

                              app   starttime            endtime              cnt                              ----- -------------------- -------------------- ----                              app1  2013-02-12 08:30:00  2013-02-12 08:45:00  2                              app1  2013-02-12 08:45:00  2013-02-12 09:00:00  1                              app1  2013-02-12 09:00:00  2013-02-12 09:15:00  2                              app1  2013-02-12 09:15:00  2013-02-12 09:30:00  4                              app1  2013-02-12 09:30:00  2013-02-12 10:30:00  2                              app1  2013-02-12 10:30:00  2013-02-12 10:45:00  1                              app1  2013-02-12 10:45:00  2013-02-12 11:00:00  2                              app1  2013-02-12 11:00:00  2013-02-12 11:30:00  3                              app1  2013-02-12 11:30:00  2013-02-12 12:30:00  2                              app1  2013-02-12 12:30:00  2013-02-12 14:30:00  1                              app2  2013-02-12 08:30:00  2013-02-12 08:45:00  1                              app2  2013-02-12 09:00:00  2013-02-12 09:30:00  1                              app2  2013-02-12 11:45:00  2013-02-12 12:00:00  1                              app2  2013-02-12 12:30:00  2013-02-12 12:45:00  1                              app2  2013-02-12 12:45:00  2013-02-12 13:00:00  2                              app2  2013-02-12 13:00:00  2013-02-12 13:30:00  3                              app2  2013-02-12 13:30:00  2013-02-12 14:00:00  2                              app2  2013-02-12 14:00:00  2013-02-12 15:30:00  1                              app2  2013-02-12 15:30:00  2013-02-12 16:30:00  2                              app2  2013-02-12 16:30:00  2013-02-12 17:00:00  1                              app3  2013-02-12 08:00:00  2013-02-12 08:30:00  2                              app3  2013-02-12 08:30:00  2013-02-12 09:00:00  1                              app3  2013-02-12 09:00:00  2013-02-12 09:30:00  1                              app3  2013-02-12 09:30:00  2013-02-12 10:00:00  1

The desired output in Figure 3 assumes that you're not interested in returning intervals with a count of zero. For example, there were zero active sessions connected to app2 between 8:45 and 9:00. You might need to return such intervals for some purposes, in which case your solution will need to accommodate this need.

Solution Based on PIVOT, LEAD, and SUM Window Aggregates

The first step in the solution is to generate one sequence of start and end events, marking start events with an event type of +1 (causes the count of active sessions to increase) and end events with -1 (count decreases). To generate such a sequence of events, you unify the results of two queries—one returning start events and the other end events, like so:

                              SELECT app, starttime AS ts, +1 AS type                              FROM dbo.Sessions                              UNION ALL                              SELECT app, endtime AS ts, -1 AS type                              FROM dbo.Sessions                              ORDER BY app, ts;

Figure 4 shows the output of this query.

                              app   ts                  type                              ----- ------------------- -----------                              app1  2012-02-12 08:30:00 1                              app1  2012-02-12 08:30:00 1                              app1  2012-02-12 08:45:00 -1                              app1  2012-02-12 09:00:00 1                              app1  2012-02-12 09:15:00 1                              app1  2012-02-12 09:15:00 1                              app1  2012-02-12 09:30:00 -1                              app1  2012-02-12 09:30:00 -1                              app1  2012-02-12 10:30:00 -1                              app1  2012-02-12 10:30:00 -1                              app1  2012-02-12 10:30:00 1                              app1  2012-02-12 10:45:00 1                              app1  2012-02-12 11:00:00 1                              app1  2012-02-12 11:30:00 -1                              app1  2012-02-12 12:30:00 -1                              app1  2012-02-12 14:30:00 -1                              ...

Next ,you need to group the sequence by app and ts, pivoting the counts of events by the type (remember 1 for start and -1 for end). You get one row for each distinct application and timestamp with a column called [1] for the count of start events and [-1] for the count of end events at that point. That timestamp marks when the discrete interval starts. You then use the LEAD function to retrieve the next distinct ts value, marking when the discrete interval ends. The number of overlapping sessions with the current interval is then the running total sum of the counts of start events minus counts of end events. After defining a CTE called C1 based on the previous query, the outer query looks like this:

                              SELECT                                app,                                ts AS starttime,                                LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,                                SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS                                  UNBOUNDED PRECEDING) AS cnt                              FROM C1                                PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P;                              

The code in Listing 5 combines the queries.

                              WITH C1 AS                              (                                SELECT app, starttime AS ts, +1 AS type                                FROM dbo.Sessions                                UNION ALL                                SELECT app, endtime AS ts, -1 AS type                                FROM dbo.Sessions                              )                              SELECT                                app,                                ts AS starttime,                                LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,                                SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cnt                              FROM C1                                PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P                              ORDER BY app, starttime;

Figure 5 shows the output of the code in Listing 5.

                              app   starttime            endtime              cnt                              ----- -------------------- -------------------- ----                              app1  2013-02-12 08:30:00  2013-02-12 08:45:00  2                              app1  2013-02-12 08:45:00  2013-02-12 09:00:00  1                              app1  2013-02-12 09:00:00  2013-02-12 09:15:00  2                              app1  2013-02-12 09:15:00  2013-02-12 09:30:00  4                              app1  2013-02-12 09:30:00  2013-02-12 10:30:00  2                              app1  2013-02-12 10:30:00  2013-02-12 10:45:00  1                              app1  2013-02-12 10:45:00  2013-02-12 11:00:00  2                              app1  2013-02-12 11:00:00  2013-02-12 11:30:00  3                              app1  2013-02-12 11:30:00  2013-02-12 12:30:00  2                              app1  2013-02-12 12:30:00  2013-02-12 14:30:00  1                              app1  2013-02-12 14:30:00  NULL                 0                              app2  2013-02-12 08:30:00  2013-02-12 08:45:00  1                              app2  2013-02-12 08:45:00  2013-02-12 09:00:00  0                              app2  2013-02-12 09:00:00  2013-02-12 09:30:00  1                              app2  2013-02-12 09:30:00  2013-02-12 11:45:00  0                              app2  2013-02-12 11:45:00  2013-02-12 12:00:00  1                              app2  2013-02-12 12:00:00  2013-02-12 12:30:00  0                              app2  2013-02-12 12:30:00  2013-02-12 12:45:00  1                              app2  2013-02-12 12:45:00  2013-02-12 13:00:00  2                              app2  2013-02-12 13:00:00  2013-02-12 13:30:00  3                              app2  2013-02-12 13:30:00  2013-02-12 14:00:00  2                              app2  2013-02-12 14:00:00  2013-02-12 15:30:00  1                              app2  2013-02-12 15:30:00  2013-02-12 16:30:00  2                              app2  2013-02-12 16:30:00  2013-02-12 17:00:00  1                              app2  2013-02-12 17:00:00  NULL                 0                              app3  2013-02-12 08:00:00  2013-02-12 08:30:00  2                              app3  2013-02-12 08:30:00  2013-02-12 09:00:00  1                              app3  2013-02-12 09:00:00  2013-02-12 09:30:00  1                              app3  2013-02-12 09:30:00  2013-02-12 10:00:00  1                              app3  2013-02-12 10:00:00  NULL                 0

Notice that the result includes intervals with a count of zero, as well as an interval with a starttime value at the last end event and a NULL in the endtime column. Naturally you'll want to remove the rows with the NULL in the endtime column. To achieve this, you can define a CTE based on the outer query (call it C2) and apply the filter endtime IS NOT NULL in the query against C2. If you also want to remove the intervals with the count of 0, use the filter cnt > 0, which will also remove the rows with the NULL in the endtime column because the count there is always 0. Listing 6 contains the complete solution, assuming you want to keep only intervals that have overlapping sessions.

                              WITH C1 AS                              (                                SELECT app, starttime AS ts, +1 AS type                                FROM dbo.Sessions                                UNION ALL                                SELECT app, endtime AS ts, -1 AS type                                FROM dbo.Sessions                              ),                              C2 AS                              (                                SELECT                                  app,                                  ts AS starttime,                                  LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,                                  SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cnt                                FROM C1                                  PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P                              )                              SELECT *                              FROM C2                              WHERE cnt > 0                              ORDER BY app, starttime;

This solution generated the plan shown in Figure 6 on my system against the large set of sample data and ran for 37 seconds.

Plan for Query in Listing 6

What's most interesting about this solution is that the two supportive indexes idx_start and idx_end are scanned in order, and throughout the plan, there's not even one explicit sort operator. The grouping, pivoting, LEAD, and SUM window functions, and even the presentation ORDER BY clause, all rely on the order preservation of the data.

Solution Based on LEAD and ROW_NUMBER Window Aggregates

The next solution also generates a chronological sequence of start and end events, plus it computes three ordinals with the ROW_NUMBER function:

  1. An ordinal for start events (call it s) against the starts query
  2. An ordinal for end events (call it e) against the ends query
  3. An ordinal for start or end events (call it se) against the combined starts and ends

Here's the query that generates the sequence of events and the ordinals:

                              WITH C1 AS                              (                                SELECT app, starttime AS ts, +1 AS type, keycol,                                  ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime,                                    keycol) AS s, -- start ordinal                                  NULL AS e                                FROM dbo.Sessions                                UNION ALL                                SELECT app, endtime, -1 AS type, keycol, NULL AS s,                                  ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime,                                    keycol) AS e -- end ordinal                                FROM dbo.Sessions                              )                              SELECT *,                                ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type,                                  keycol) AS se -- start or end ordinal                              FROM C1;

Figure 7 shows the output of this query.

                              app   ts                  type  keycol  s     e     se                              ----- ------------------- ----- ------- ----- ----- ---                              app1  2012-02-12 08:30:00 1     2       1     NULL  1                              app1  2012-02-12 08:30:00 1     3       2     NULL  2                              app1  2012-02-12 08:45:00 -1    3       NULL  1     3                              app1  2012-02-12 09:00:00 1     5       3     NULL  4                              app1  2012-02-12 09:15:00 1     7       4     NULL  5                              app1  2012-02-12 09:15:00 1     11      5     NULL  6                              app1  2012-02-12 09:30:00 -1    5       NULL  2     7                              app1  2012-02-12 09:30:00 -1    11      NULL  3     8                              app1  2012-02-12 10:30:00 -1    2       NULL  4     9                              app1  2012-02-12 10:30:00 -1    7       NULL  5     10                              app1  2012-02-12 10:30:00 1     13      6     NULL  11                              app1  2012-02-12 10:45:00 1     17      7     NULL  12                              app1  2012-02-12 11:00:00 1     19      8     NULL  13                              app1  2012-02-12 11:30:00 -1    17      NULL  6     14                              app1  2012-02-12 12:30:00 -1    19      NULL  7     15                              app1  2012-02-12 14:30:00 -1    13      NULL  8     16                              ...

Next, the solution computes the count of active sessions after each event. Notice in the output in Figure 7 that for start events you have the start ordinal (s) and the start or end ordinal (se) available. You can compute how many end events happened until that point as e = se - s. Then you can compute the count of active sessions as cnt = s - e, or more fully as cnt = s - (se - s). In a similar way you can compute the count after end events using se and e as cnt = (se - e) - e. To get the count after both start and end events in the same target column, use cnt = COALESCE(s - (se - s), s - (se - s)). You also need to return with each event the timestamp of the next event. This can be achieved with the LEAD function. Assuming you define a CTE called C2 based on the outer query in the previous code sample, you use the following query against C2 to compute both the count and the next timestamp:

                              SELECT *,                                (COALESCE(                                  s - (se - s), -- count of active intervals after start event                                  (se - e) - e) -- count of active intervals after end event                                ) AS cnt,                                LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol)                                  AS nextts                              FROM C2

Notice the ORDER BY list of the LEAD function. It ensures that start events are positioned after end events, and it uses keycol as a tiebreaker for determinism. The points where ts is different than nextts represent discrete intervals, and the counts associated with those points are inclusive of all events at those points, meaning that they represent the counts of overlapping sessions with those intervals. The final step is to define a CTE based on the last query (call it C3) and have the outer query filter the events where ts <> nextts—and if you're not interested in intervals with zero counts, remove those as well. Here's the outer query against C3:

                              SELECT                                app,                                ts AS starttime,                                nextts AS endtime,                                cnt                              FROM C3                              WHERE ts <> nextts                                AND cnt > 0;

Listing 7 contains the complete solution.

                              WITH C1 AS                              (                                SELECT app, starttime AS ts, +1 AS type, keycol,                                  ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime, keycol) AS s, -- start ordinal                                  NULL AS e                                FROM dbo.Sessions                                UNION ALL                                SELECT app, endtime, -1 AS type, keycol, NULL AS s,                                  ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime, keycol) AS e -- end ordinal                                FROM dbo.Sessions                              ),                              C2 AS                              (                                SELECT *,                                  ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type, keycol) AS se -- start or end ordinal                                FROM C1                              ),                              C3 AS                              (                                SELECT *,                                  (COALESCE(                                      s - (se - s), -- count of active intervals after start event                                      (se - e) - e) -- count of active intervals after end event                                  ) AS cnt,                                  LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol) AS nextts                                FROM C2                              )                              SELECT                                app,                                ts AS starttime,                                nextts AS endtime,                                cnt                              FROM C3                              WHERE ts <> nextts                                AND cnt > 0                              ORDER BY app, starttime;

Figure 8 shows the plan for this solution.

Plan for Query in Listing 7

This plan is more efficient than the one for the previous solution in a couple of ways. It doesn't involve grouping, and it reduces the number of Window Spool operators. This query ran for 22 seconds on my system—almost half the run time of the first solution.

Related: SQL Server 2012's Window Functions, Part 1

To Be Continued

This article is the first in a series covering challenges related to manipulation of intervals involving computation of various counts. In this article, I covered how to identify discrete intervals and counts of overlapping source intervals. I showed one solution that uses PIVOT, LEAD, and SUM window aggregates, as well as a faster solution using the ROW_NUMBER and LEAD functions. Next month I'll continue the coverage of this challenge, showing further performance improvements, as well as a variation of the task.