Packing Date Intervals

SQL Server does not support a native temporal interval data type, nor does it support native relational operations on intervals, e.g., packing intervals. Packing of intervals means merging all intervals that overlap into one contiguous interval. As an example for a practical need for packing, consider the following Projects table and its sample data:

CREATE TABLE dbo.Projects

(

  projectid  INT          NOT NULL,

  title      VARCHAR(100) NOT NULL,

  start_date DATE         NOT NULL,

  end_date   DATE         NOT NULL

);

 

INSERT INTO dbo.Projects(projectid, title, start_date, end_date) VALUES

  (1, 'Project 1', '20100212', '20100220'),

  (2, 'Project 2', '20100214', '20100312'),

  (3, 'Project 3', '20100124', '20100201');

 

The task at hand is: given the start and end dates of an input period, return a row for each contiguous period where there was a project running. For example, given January 1st, 2010 as the start of the input period and December 31st as the end, the desired result is:

start_period end_period

------------ ----------

2010-01-24   2010-02-01

2010-02-12   2010-03-12

As usual with the T-SQL challenges that I provide, I urge you to first try and come up with your own solution before looking at mine. That’s the best way to develop your T-SQL skills.

As for my solution, it makes use of a table function called GetNums that accepts an integer input, and returns a virtual auxiliary table of numbers starting with 1 and ending with the input number. Here’s the code to create the function:

IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;

GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE

AS

RETURN

  WITH

  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),

  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)

  SELECT TOP (@n) n FROM Nums ORDER BY n;

GO

 

The first step in the solution is to write a query that uses the GetNums function to generate a sequence of dates within the input period:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT dt

FROM Dates;

 

Output:

dt

----------

2010-01-01

2010-01-02

2010-01-03

...

2010-12-29

2010-12-30

2010-12-31
(365 row(s) affected)

 

The second step is to join the Projects and Dates tables, unpacking each project interval to the individual dates within it, like so:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt

----------

2010-01-24

2010-01-25

2010-01-26

2010-01-27

2010-01-28

2010-01-29

2010-01-30

2010-01-31

2010-02-01

2010-02-12

2010-02-13

2010-02-14

2010-02-14

2010-02-15

2010-02-15

2010-02-16

2010-02-16

2010-02-17

2010-02-17

2010-02-18

2010-02-18

2010-02-19

2010-02-19

2010-02-20

2010-02-20

2010-02-21

2010-02-22

2010-02-23

2010-02-24

2010-02-25

2010-02-26

2010-02-27

2010-02-28

2010-03-01

2010-03-02

2010-03-03

2010-03-04

2010-03-05

2010-03-06

2010-03-07

2010-03-08

2010-03-09

2010-03-10

2010-03-11

2010-03-12

 


The third step is to generate a group identifier for each consecutive range of dates (aka, Island). The point about the group identifier is to generate a value that is the same for all entries that belong to the same island, and different than the value produced for other islands. More generically, this step involves identifying islands in a temporal sequence with possible duplicates. You can find details about different variations of islands problems and their solutions in SQL Server MVP Deep Dives (Manning, 2009), Chapter 5 - Gaps and islands. This chapter happens to be the sample chapter of the book, so you can download it for free. (Of course, we do hope that you will like what you read so much that you will decide to purchase the book, and this way help with the efforts to contribute to War Child!)

Back to our problem, the solution to our variation of the islands problem can be described in two substeps. The first involves using the DENSE_RANK function to generate a sequence of dense rank values alongside the dt sequence, like so:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt, DENSE_RANK() OVER(ORDER BY D.dt) AS drnk

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt         drnk

---------- --------------------

2010-01-24 1

2010-01-25 2

2010-01-26 3

2010-01-27 4

2010-01-28 5

2010-01-29 6

2010-01-30 7

2010-01-31 8

2010-02-01 9

2010-02-12 10

2010-02-13 11

2010-02-14 12

2010-02-14 12

2010-02-15 13

2010-02-15 13

2010-02-16 14

2010-02-16 14

2010-02-17 15

2010-02-17 15

2010-02-18 16

2010-02-18 16

2010-02-19 17

2010-02-19 17

2010-02-20 18

2010-02-20 18

2010-02-21 19

2010-02-22 20

2010-02-23 21

2010-02-24 22

2010-02-25 23

2010-02-26 24

2010-02-27 25

2010-02-28 26

2010-03-01 27

2010-03-02 28

2010-03-03 29

2010-03-04 30

2010-03-05 31

2010-03-06 32

2010-03-07 33

2010-03-08 34

2010-03-09 35

2010-03-10 36

2010-03-11 37

2010-03-12 38

 

Now look at the two sequences… the dt sequence has gaps, whereas the drnk sequence doesn’t. With this in mind, you can generate a group identifier by subtracting drnk times the interval of the temporal sequence (one day in our case) from the dt value. This way you will get the same date as a result for all members of the same island:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

)

SELECT D.dt,

  DATEADD(day, -1*DENSE_RANK() OVER(ORDER BY D.dt), D.dt) AS grp

FROM dbo.Projects AS P

  JOIN Dates AS D

    ON D.dt BETWEEN P.start_date AND P.end_date;

 

Output:

dt         grp

---------- ----------

2010-01-24 2010-01-23

2010-01-25 2010-01-23

2010-01-26 2010-01-23

2010-01-27 2010-01-23

2010-01-28 2010-01-23

2010-01-29 2010-01-23

2010-01-30 2010-01-23

2010-01-31 2010-01-23

2010-02-01 2010-01-23

2010-02-12 2010-02-02

2010-02-13 2010-02-02

2010-02-14 2010-02-02

2010-02-14 2010-02-02

2010-02-15 2010-02-02

2010-02-15 2010-02-02

2010-02-16 2010-02-02

2010-02-16 2010-02-02

2010-02-17 2010-02-02

2010-02-17 2010-02-02

2010-02-18 2010-02-02

2010-02-18 2010-02-02

2010-02-19 2010-02-02

2010-02-19 2010-02-02

2010-02-20 2010-02-02

2010-02-20 2010-02-02

2010-02-21 2010-02-02

2010-02-22 2010-02-02

2010-02-23 2010-02-02

2010-02-24 2010-02-02

2010-02-25 2010-02-02

2010-02-26 2010-02-02

2010-02-27 2010-02-02

2010-02-28 2010-02-02

2010-03-01 2010-02-02

2010-03-02 2010-02-02

2010-03-03 2010-02-02

2010-03-04 2010-02-02

2010-03-05 2010-02-02

2010-03-06 2010-02-02

2010-03-07 2010-02-02

2010-03-08 2010-02-02

2010-03-09 2010-02-02

2010-03-10 2010-02-02

2010-03-11 2010-02-02

2010-03-12 2010-02-02

 

The actual date that you get as a result is not really significant; what’s important is that it meets the two requirements that we have for our group identifier: 1. The value must be the same for all members of the same island. 2. The value must be different than the values produced for other islands.

The final step in the solution is simple—group the result by the grp value, and return the minimum and maximum dt values per group:

DECLARE

  @from AS DATE = '20100101',

  @to   AS DATE = '20101231';

 

WITH Dates AS

(

  SELECT DATEADD(day, n-1, @from) AS dt

  FROM dbo.GetNums(DATEDIFF(DAY, @from, @to) + 1) AS Nums

),

Groups AS

(

  SELECT D.dt,

    DATEADD(day, -1*DENSE_RANK() OVER(ORDER BY D.dt), D.dt) AS grp

  FROM dbo.Projects AS P

    JOIN Dates AS D

      ON D.dt BETWEEN P.start_date AND P.end_date

)

SELECT MIN(dt) AS start_period, MAX(dt) AS end_period

FROM Groups

GROUP BY grp;

 

Output:

start_period end_period

------------ ----------

2010-01-24   2010-02-01

2010-02-12   2010-03-12

 

Cheers,

BG

 

Discuss this Blog Entry 2

on May 16, 2010
I've come up with this solution based upon the SQL Server MVP Deep Dives chapter mentioned in the article. If help is needed understanding this query, read about Listing 7 in that chapter. Note that this version assumes that if one project ends the day before another begins, they are considered to be 2 different periods. If they should be considered one period, then s.end_date>=p.start_date in the StartingPoints CTE should be changed to s.end_date>=dateadd(d,-1,p.start_date) and s.start_date<=p.end_date in the EndingPoints CTE should be changed to s.start_date<=dateadd(d,1,p.end_date).

with StartingPoints as
(select start_date ,dense_rank() over(order by start_date) as rownum
from projects p
where not exists
(select *
from projects s
where s.start_date=p.start_date)
),
EndingPoints as
(
select end_date ,dense_rank() over(order by end_date) as rownum
from projects p
where not exists
(select *
from projects s
where s.end_date>p.end_date and s.start_date<=p.end_date)
)
select distinct s.start_date as start_period,e.end_date as end_period
from startingpoints s
join endingpoints e
on s.rownum=e.rownum;

Some additional data for testing it can be made with the following insert.
INSERT INTO dbo.Projects(projectid, title, start_date, end_date) VALUES
(4,'1 of 3 part period','20100501','20100507'),
(5,'2 of 3 part period','20100505','20100514'),
(6,'3 of 3 part period','20100508','20100521'),
(7,'This project is inside #8','20100602','20100603'),
(8,'This project contains #7','20100601','20100604'),
(9,'Part 1 of projects contiguous but not truly overlapping','20100701','20100707'),
(10,'Part 2 of projects contiguous','20100708','20100714'),
(11,'Duplicate of next','20100801','20100802'),
(12,'Duplicate of previous','20100801','20100802'),
(13,'Example of why dense_rank is needed','20100901','20100907'),
(14,'Example of why dense_rank is needed','20100903','20100907'),
(15,'Last example','20101001','20101031');

To see why Dense_Rank is needed instead of Row_Number used in the Listing 7 from which this query originated, change Dense_Rank to Row_Number and see what happens.










































on May 2, 2010
I suggested another solution- based on double recursive CTE in my modest blog:
http://gerireshef.wordpress.com/2010/05/02/packing-date-intervals/

Unfortunately the blog is in Hebrew,
but the SQL is the same SQL all over the world.. :-)



Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×