When using SQL Server, you frequently need to work with data that represents intervals of time. For example, consider intervals representing sessions, contracts, projects, and so on. I covered a few aspects of manipulation of intervals in the past, but there are still many interesting challenges to cover. Tasks related to interval manipulation are typically quite intriguing, especially because coming up with efficient solutions isn't easy.

I recently had the chance to work on a few challenges involving intervals and counts. This article is the first in a series covering those challenges and their solutions.

Related: Solving Gaps and Islands with Enhanced Window Functions

I'd like to thank Adam Machanic, who sent me the challenge that's the focus of this article. I'd also like to thank Ben Flanaghan, who created one of the fundamental techniques that I rely on in one of the solutions that I present in this article.

Sample Data

I'll use the same sample data for all of the challenges that I'll discuss in this series. The data involves a table called Sessions that holds sessions against different applications. The column app represents the application (call it the partitioning element), and the columns starttime and endtime represent the start and end points of the session's interval, respectively. The starttime and endtime columns use the DATETIME2(0) type, which has a precision of one second.

Related: Interval Queries in SQL Server

Run the code in Listing 1 to create the Sessions table and two supporting indexes, which will be used by the different solutions that I'll present in the series.

Listing 1: DDL for Sessions Table
SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL DROP TABLE dbo.Sessions;
IF OBJECT_ID(N'dbo.Apps', N'U') IS NOT NULL DROP TABLE dbo.Apps;

CREATE TABLE dbo.Apps
(
  app       VARCHAR(10) NOT NULL,
  CONSTRAINT PK_Apps PRIMARY KEY(app)
);

CREATE TABLE dbo.Sessions
(
  keycol    INT         NOT NULL,
  app       VARCHAR(10) NOT NULL,
  starttime DATETIME2(0)   NOT NULL,
  endtime   DATETIME2(0)   NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(keycol),
  CONSTRAINT CHK_Sessios_et_st CHECK(endtime > starttime)
);

CREATE UNIQUE INDEX idx_start ON dbo.Sessions(app, starttime, keycol);
CREATE UNIQUE INDEX idx_end ON dbo.Sessions(app, endtime, keycol);

The index called idx_start is defined on the key list (app, starttime, keycol), and the index called idx_end is defined on the key list (app, endtime, keycol).

Use the code in Listing 2 to populate the Sessions table with a small set of sample data.

Listing 2: Code to Fill Sessions Table with Small Set of Sample Data
TRUNCATE TABLE dbo.Sessions;
TRUNCATE TABLE dbo.Apps;

INSERT INTO dbo.Apps(app) VALUES('app1'),('app2'),('app3');

INSERT INTO dbo.Sessions(keycol, app, starttime, endtime) VALUES
  (2,  'app1', '20130212 08:30:00', '20130212 10:30:00'),
  (3,  'app1', '20130212 08:30:00', '20130212 08:45:00'),
  (5,  'app1', '20130212 09:00:00', '20130212 09:30:00'),
  (7,  'app1', '20130212 09:15:00', '20130212 10:30:00'),
  (11, 'app1', '20130212 09:15:00', '20130212 09:30:00'),
  (13, 'app1', '20130212 10:30:00', '20130212 14:30:00'),
  (17, 'app1', '20130212 10:45:00', '20130212 11:30:00'),
  (19, 'app1', '20130212 11:00:00', '20130212 12:30:00'),
  (23, 'app2', '20130212 08:30:00', '20130212 08:45:00'),
  (29, 'app2', '20130212 09:00:00', '20130212 09:30:00'),
  (31, 'app2', '20130212 11:45:00', '20130212 12:00:00'),
  (37, 'app2', '20130212 12:30:00', '20130212 14:00:00'),
  (41, 'app2', '20130212 12:45:00', '20130212 13:30:00'),
  (43, 'app2', '20130212 13:00:00', '20130212 14:00:00'),
  (47, 'app2', '20130212 14:00:00', '20130212 16:30:00'),
  (53, 'app2', '20130212 15:30:00', '20130212 17:00:00'),
  (61, 'app3', '20130212 08:00:00', '20130212 08:30:00'),
  (62, 'app3', '20130212 08:00:00', '20130212 09:00:00'),
  (63, 'app3', '20130212 09:00:00', '20130212 09:30:00'),
  (64, 'app3', '20130212 09:30:00', '20130212 10:00:00');

You can use the sample data from Listing 2 to verify the correctness of your solutions. Figure 1 provides a graphic representation of the small set of sample data.

Intervals

To generate a large set of sample data, you'll need the helper function GetNums, which accepts input parameters called @low and @high and returns a sequence of integers in the requested range. Use the code in Listing 3 to create the GetNums function.

Listing 3: GetNums Helper Function
IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
            FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

Use the code in Listing 4 to populate the Sessions table with a large set of sample data.

Listing 4: Code to Fill Sessions Table with Large Set of Sample Data
TRUNCATE TABLE dbo.Sessions;
TRUNCATE TABLE dbo.Apps;

DECLARE
  @numrows AS INT = 2000000, -- total number of rows
  @numapps AS INT = 100;     -- number of applications

INSERT INTO dbo.Apps WITH(TABLOCK) (app)
  SELECT 'app' + CAST(n AS VARCHAR(10)) AS app
  FROM dbo.GetNums(1, @numapps) AS Nums;

INSERT INTO dbo.Sessions WITH(TABLOCK)
    (keycol, app, starttime, endtime)
  SELECT
    ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol,
    D.*,
    DATEADD(
      second,
      1 + ABS(CHECKSUM(NEWID())) % (20*60),
      starttime) AS endtime
  FROM
  (
    SELECT
      'app' + CAST(1 + ABS(CHECKSUM(NEWID())) % @numapps AS VARCHAR(10)) AS app,
      DATEADD(
        second,
        1 + ABS(CHECKSUM(NEWID())) % (30*24*60*60),
        '20130101') AS starttime
    FROM dbo.GetNums(1, @numrows) AS Nums
  ) AS D;

You can control the number of applications and the total number of rows in the sample data by setting the variables @numapps and @numrows, respectively. The code in Listing 4 fills the Apps table with 100 applications and the Sessions table with 2,000,000 rows.

The Challenge: Computing Counts During Discrete Intervals

The challenge that's the focus of this article is to identify the discrete intervals that exist in the data and determine how many source intervals each discrete interval overlaps with. Figure 2 provides a graphic representation of the desired result against the small set of sample data.

Counts During Discrete Intervals

Figure 3 shows the desired output of your query.

Figure 3: Count of Overlapping Intervals During Each Discrete Interval
app   starttime            endtime              cnt
----- -------------------- -------------------- ----
app1  2013-02-12 08:30:00  2013-02-12 08:45:00  2
app1  2013-02-12 08:45:00  2013-02-12 09:00:00  1
app1  2013-02-12 09:00:00  2013-02-12 09:15:00  2
app1  2013-02-12 09:15:00  2013-02-12 09:30:00  4
app1  2013-02-12 09:30:00  2013-02-12 10:30:00  2
app1  2013-02-12 10:30:00  2013-02-12 10:45:00  1
app1  2013-02-12 10:45:00  2013-02-12 11:00:00  2
app1  2013-02-12 11:00:00  2013-02-12 11:30:00  3
app1  2013-02-12 11:30:00  2013-02-12 12:30:00  2
app1  2013-02-12 12:30:00  2013-02-12 14:30:00  1
app2  2013-02-12 08:30:00  2013-02-12 08:45:00  1
app2  2013-02-12 09:00:00  2013-02-12 09:30:00  1
app2  2013-02-12 11:45:00  2013-02-12 12:00:00  1
app2  2013-02-12 12:30:00  2013-02-12 12:45:00  1
app2  2013-02-12 12:45:00  2013-02-12 13:00:00  2
app2  2013-02-12 13:00:00  2013-02-12 13:30:00  3
app2  2013-02-12 13:30:00  2013-02-12 14:00:00  2
app2  2013-02-12 14:00:00  2013-02-12 15:30:00  1
app2  2013-02-12 15:30:00  2013-02-12 16:30:00  2
app2  2013-02-12 16:30:00  2013-02-12 17:00:00  1
app3  2013-02-12 08:00:00  2013-02-12 08:30:00  2
app3  2013-02-12 08:30:00  2013-02-12 09:00:00  1
app3  2013-02-12 09:00:00  2013-02-12 09:30:00  1
app3  2013-02-12 09:30:00  2013-02-12 10:00:00  1

The desired output in Figure 3 assumes that you're not interested in returning intervals with a count of zero. For example, there were zero active sessions connected to app2 between 8:45 and 9:00. You might need to return such intervals for some purposes, in which case your solution will need to accommodate this need.

Solution Based on PIVOT, LEAD, and SUM Window Aggregates

The first step in the solution is to generate one sequence of start and end events, marking start events with an event type of +1 (causes the count of active sessions to increase) and end events with -1 (count decreases). To generate such a sequence of events, you unify the results of two queries—one returning start events and the other end events, like so:

SELECT app, starttime AS ts, +1 AS type
FROM dbo.Sessions

UNION ALL

SELECT app, endtime AS ts, -1 AS type
FROM dbo.Sessions

ORDER BY app, ts;

Figure 4 shows the output of this query.

Figure 4: Sequence of Events
app   ts                  type
----- ------------------- -----------
app1  2012-02-12 08:30:00 1
app1  2012-02-12 08:30:00 1
app1  2012-02-12 08:45:00 -1
app1  2012-02-12 09:00:00 1
app1  2012-02-12 09:15:00 1
app1  2012-02-12 09:15:00 1
app1  2012-02-12 09:30:00 -1
app1  2012-02-12 09:30:00 -1
app1  2012-02-12 10:30:00 -1
app1  2012-02-12 10:30:00 -1
app1  2012-02-12 10:30:00 1
app1  2012-02-12 10:45:00 1
app1  2012-02-12 11:00:00 1
app1  2012-02-12 11:30:00 -1
app1  2012-02-12 12:30:00 -1
app1  2012-02-12 14:30:00 -1
...

Next ,you need to group the sequence by app and ts, pivoting the counts of events by the type (remember 1 for start and -1 for end). You get one row for each distinct application and timestamp with a column called [1] for the count of start events and [-1] for the count of end events at that point. That timestamp marks when the discrete interval starts. You then use the LEAD function to retrieve the next distinct ts value, marking when the discrete interval ends. The number of overlapping sessions with the current interval is then the running total sum of the counts of start events minus counts of end events. After defining a CTE called C1 based on the previous query, the outer query looks like this:

SELECT
  app,
  ts AS starttime,
  LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,
  SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS
    UNBOUNDED PRECEDING) AS cnt
FROM C1
  PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P;

The code in Listing 5 combines the queries.

Listing 5: Solution Based on Window Aggregate and Offset Functions
WITH C1 AS
(
  SELECT app, starttime AS ts, +1 AS type
  FROM dbo.Sessions

  UNION ALL

  SELECT app, endtime AS ts, -1 AS type
  FROM dbo.Sessions
)
SELECT
  app,
  ts AS starttime,
  LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,
  SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cnt
FROM C1
  PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P
ORDER BY app, starttime;

Figure 5 shows the output of the code in Listing 5.

Figure 5: Output of Query in Listing 5
app   starttime            endtime              cnt
----- -------------------- -------------------- ----
app1  2013-02-12 08:30:00  2013-02-12 08:45:00  2
app1  2013-02-12 08:45:00  2013-02-12 09:00:00  1
app1  2013-02-12 09:00:00  2013-02-12 09:15:00  2
app1  2013-02-12 09:15:00  2013-02-12 09:30:00  4
app1  2013-02-12 09:30:00  2013-02-12 10:30:00  2
app1  2013-02-12 10:30:00  2013-02-12 10:45:00  1
app1  2013-02-12 10:45:00  2013-02-12 11:00:00  2
app1  2013-02-12 11:00:00  2013-02-12 11:30:00  3
app1  2013-02-12 11:30:00  2013-02-12 12:30:00  2
app1  2013-02-12 12:30:00  2013-02-12 14:30:00  1
app1  2013-02-12 14:30:00  NULL                 0
app2  2013-02-12 08:30:00  2013-02-12 08:45:00  1
app2  2013-02-12 08:45:00  2013-02-12 09:00:00  0
app2  2013-02-12 09:00:00  2013-02-12 09:30:00  1
app2  2013-02-12 09:30:00  2013-02-12 11:45:00  0
app2  2013-02-12 11:45:00  2013-02-12 12:00:00  1
app2  2013-02-12 12:00:00  2013-02-12 12:30:00  0
app2  2013-02-12 12:30:00  2013-02-12 12:45:00  1
app2  2013-02-12 12:45:00  2013-02-12 13:00:00  2
app2  2013-02-12 13:00:00  2013-02-12 13:30:00  3
app2  2013-02-12 13:30:00  2013-02-12 14:00:00  2
app2  2013-02-12 14:00:00  2013-02-12 15:30:00  1
app2  2013-02-12 15:30:00  2013-02-12 16:30:00  2
app2  2013-02-12 16:30:00  2013-02-12 17:00:00  1
app2  2013-02-12 17:00:00  NULL                 0
app3  2013-02-12 08:00:00  2013-02-12 08:30:00  2
app3  2013-02-12 08:30:00  2013-02-12 09:00:00  1
app3  2013-02-12 09:00:00  2013-02-12 09:30:00  1
app3  2013-02-12 09:30:00  2013-02-12 10:00:00  1
app3  2013-02-12 10:00:00  NULL                 0

Notice that the result includes intervals with a count of zero, as well as an interval with a starttime value at the last end event and a NULL in the endtime column. Naturally you'll want to remove the rows with the NULL in the endtime column. To achieve this, you can define a CTE based on the outer query (call it C2) and apply the filter endtime IS NOT NULL in the query against C2. If you also want to remove the intervals with the count of 0, use the filter cnt > 0, which will also remove the rows with the NULL in the endtime column because the count there is always 0. Listing 6 contains the complete solution, assuming you want to keep only intervals that have overlapping sessions.

Listing 6: Code to Remove Intervals with 0 Count
WITH C1 AS
(
  SELECT app, starttime AS ts, +1 AS type
  FROM dbo.Sessions

  UNION ALL

  SELECT app, endtime AS ts, -1 AS type
  FROM dbo.Sessions
),
C2 AS
(
  SELECT
    app,
    ts AS starttime,
    LEAD(ts) OVER(PARTITION BY app ORDER BY ts) AS endtime,
    SUM([1]-[-1]) OVER(PARTITION BY app ORDER BY ts ROWS UNBOUNDED PRECEDING) AS cnt
  FROM C1
    PIVOT(COUNT(type) FOR type IN([1],[-1])) AS P
)
SELECT *
FROM C2
WHERE cnt > 0
ORDER BY app, starttime;

This solution generated the plan shown in Figure 6 on my system against the large set of sample data and ran for 37 seconds.

Plan for Query in Listing 6

What's most interesting about this solution is that the two supportive indexes idx_start and idx_end are scanned in order, and throughout the plan, there's not even one explicit sort operator. The grouping, pivoting, LEAD, and SUM window functions, and even the presentation ORDER BY clause, all rely on the order preservation of the data.

Solution Based on LEAD and ROW_NUMBER Window Aggregates

The next solution also generates a chronological sequence of start and end events, plus it computes three ordinals with the ROW_NUMBER function:

  1. An ordinal for start events (call it s) against the starts query
  2. An ordinal for end events (call it e) against the ends query
  3. An ordinal for start or end events (call it se) against the combined starts and ends

Here's the query that generates the sequence of events and the ordinals:

WITH C1 AS
(
  SELECT app, starttime AS ts, +1 AS type, keycol,
    ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime,
      keycol) AS s, -- start ordinal
    NULL AS e
  FROM dbo.Sessions

  UNION ALL

  SELECT app, endtime, -1 AS type, keycol, NULL AS s,
    ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime,
      keycol) AS e -- end ordinal
  FROM dbo.Sessions
)
SELECT *,
  ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type,
    keycol) AS se -- start or end ordinal
FROM C1;

Figure 7 shows the output of this query.

Figure 7: Sequence of Events with Ordinals
app   ts                  type  keycol  s     e     se
----- ------------------- ----- ------- ----- ----- ---
app1  2012-02-12 08:30:00 1     2       1     NULL  1
app1  2012-02-12 08:30:00 1     3       2     NULL  2
app1  2012-02-12 08:45:00 -1    3       NULL  1     3
app1  2012-02-12 09:00:00 1     5       3     NULL  4
app1  2012-02-12 09:15:00 1     7       4     NULL  5
app1  2012-02-12 09:15:00 1     11      5     NULL  6
app1  2012-02-12 09:30:00 -1    5       NULL  2     7
app1  2012-02-12 09:30:00 -1    11      NULL  3     8
app1  2012-02-12 10:30:00 -1    2       NULL  4     9
app1  2012-02-12 10:30:00 -1    7       NULL  5     10
app1  2012-02-12 10:30:00 1     13      6     NULL  11
app1  2012-02-12 10:45:00 1     17      7     NULL  12
app1  2012-02-12 11:00:00 1     19      8     NULL  13
app1  2012-02-12 11:30:00 -1    17      NULL  6     14
app1  2012-02-12 12:30:00 -1    19      NULL  7     15
app1  2012-02-12 14:30:00 -1    13      NULL  8     16
...

Next, the solution computes the count of active sessions after each event. Notice in the output in Figure 7 that for start events you have the start ordinal (s) and the start or end ordinal (se) available. You can compute how many end events happened until that point as e = se - s. Then you can compute the count of active sessions as cnt = s - e, or more fully as cnt = s - (se - s). In a similar way you can compute the count after end events using se and e as cnt = (se - e) - e. To get the count after both start and end events in the same target column, use cnt = COALESCE(s - (se - s), s - (se - s)). You also need to return with each event the timestamp of the next event. This can be achieved with the LEAD function. Assuming you define a CTE called C2 based on the outer query in the previous code sample, you use the following query against C2 to compute both the count and the next timestamp:

SELECT *,
  (COALESCE(
    s - (se - s), -- count of active intervals after start event
    (se - e) - e) -- count of active intervals after end event
  ) AS cnt,
  LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol)
    AS nextts
FROM C2

Notice the ORDER BY list of the LEAD function. It ensures that start events are positioned after end events, and it uses keycol as a tiebreaker for determinism. The points where ts is different than nextts represent discrete intervals, and the counts associated with those points are inclusive of all events at those points, meaning that they represent the counts of overlapping sessions with those intervals. The final step is to define a CTE based on the last query (call it C3) and have the outer query filter the events where ts <> nextts—and if you're not interested in intervals with zero counts, remove those as well. Here's the outer query against C3:

SELECT
  app,
  ts AS starttime,
  nextts AS endtime,
  cnt
FROM C3
WHERE ts <> nextts
  AND cnt > 0;

Listing 7 contains the complete solution.

Listing 7: Solution Based on Ranking and Offset Window Functions
WITH C1 AS
(
  SELECT app, starttime AS ts, +1 AS type, keycol,
    ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime, keycol) AS s, -- start ordinal
    NULL AS e
  FROM dbo.Sessions

  UNION ALL

  SELECT app, endtime, -1 AS type, keycol, NULL AS s,
    ROW_NUMBER() OVER(PARTITION BY app ORDER BY endtime, keycol) AS e -- end ordinal
  FROM dbo.Sessions
),
C2 AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY app ORDER BY ts, type, keycol) AS se -- start or end ordinal
  FROM C1
),
C3 AS
(
  SELECT *,
    (COALESCE(
        s - (se - s), -- count of active intervals after start event
        (se - e) - e) -- count of active intervals after end event
    ) AS cnt,
    LEAD(ts) OVER(PARTITION BY app ORDER BY ts, type, keycol) AS nextts
  FROM C2
)
SELECT
  app,
  ts AS starttime,
  nextts AS endtime,
  cnt
FROM C3
WHERE ts <> nextts
  AND cnt > 0
ORDER BY app, starttime;

Figure 8 shows the plan for this solution.

Plan for Query in Listing 7

This plan is more efficient than the one for the previous solution in a couple of ways. It doesn't involve grouping, and it reduces the number of Window Spool operators. This query ran for 22 seconds on my system—almost half the run time of the first solution.

Related: SQL Server 2012's Window Functions, Part 1

To Be Continued

This article is the first in a series covering challenges related to manipulation of intervals involving computation of various counts. In this article, I covered how to identify discrete intervals and counts of overlapping source intervals. I showed one solution that uses PIVOT, LEAD, and SUM window aggregates, as well as a faster solution using the ROW_NUMBER and LEAD functions. Next month I'll continue the coverage of this challenge, showing further performance improvements, as well as a variation of the task.