TSQL Challenge: Packing Date and Time Intervals

The relational model deals with date and time intervals and defines various operations on those. One of the fundamental operations is packing date and time intervals. Packing of intervals means that you want to merge all intervals that overlap into one contiguous interval. SQL Server doesn’t support a native interval type, so most represent an interval with two attributes holding the start and end points of the interval, or one of the points and a duration.

The challenge at hand involves addressing the task of packing intervals using TSQL efficiently, with a standard, set-based solution, that is reusable in other platforms—namely, use only standard SQL constructs.

I’ll provide both small and large sets of rows for sample data. Use the small set to test the validity of your solution, and the large one to test its optimality.

Here’s the DDL and sample data for the small set to test the validity of your solution:

-- DDL and sample data, small
SET NOCOUNT ON;
USE tempdb;

IF OBJECT_ID('dbo.Sessions') IS NOT NULL DROP TABLE dbo.Sessions;

CREATE TABLE dbo.Sessions
(
  id        INT          NOT NULL IDENTITY(1, 1),
  username  VARCHAR(14NOT NULL,
  starttime DATETIME2(3) NOT NULL,
  endtime   DATETIME2(3) NOT NULL,
  CONSTRAINT PK_Sessions PRIMARY KEY(id),
  CONSTRAINT CHK_endtime_gteq_starttime
    CHECK (endtime >= starttime)
);
GO

INSERT INTO dbo.Sessions VALUES('User1', '20111201 08:00:00.000', '20111201 08:30:00.000');
INSERT INTO dbo.Sessions VALUES('User1', '20111201 08:30:00.000', '20111201 09:00:00.000');
INSERT INTO dbo.Sessions VALUES('User1', '20111201 09:00:00.000', '20111201 09:30:00.000');
INSERT INTO dbo.Sessions VALUES('User1', '20111201 10:00:00.000', '20111201 11:00:00.000');
INSERT INTO dbo.Sessions VALUES('User1', '20111201 10:30:00.000', '20111201 12:00:00.000');
INSERT INTO dbo.Sessions VALUES('User1', '20111201 11:30:00.000', '20111201 12:30:00.000');

INSERT INTO dbo.Sessions VALUES('User2', '20111201 08:00:00.000', '20111201 10:30:00.000');
INSERT INTO dbo.Sessions VALUES('User2', '20111201 08:30:00.000', '20111201 10:00:00.000');
INSERT INTO dbo.Sessions VALUES('User2', '20111201 09:00:00.000', '20111201 09:30:00.000');

INSERT INTO dbo.Sessions VALUES('User2', '20111201 11:00:00.000', '20111201 11:30:00.000');
INSERT INTO dbo.Sessions VALUES('User2', '20111201 11:32:00.000', '20111201 12:00:00.000');
INSERT INTO dbo.Sessions VALUES('User2', '20111201 12:04:00.000', '20111201 12:30:00.000');

INSERT INTO dbo.Sessions VALUES('User3', '20111201 08:00:00.000', '20111201 09:00:00.000');
INSERT INTO dbo.Sessions VALUES('User3', '20111201 08:00:00.000', '20111201 08:30:00.000');
INSERT INTO dbo.Sessions VALUES('User3', '20111201 08:30:00.000', '20111201 09:00:00.000');

INSERT INTO dbo.Sessions VALUES('User3', '20111201 09:30:00.000', '20111201 09:30:00.000');
GO

The table holds information about user sessions against some service or application. Suppose that for billing purposes you need to pack the intervals for each user since you’re not supposed to charge the user multiple times for multiple concurrent sessions, but rather only for connection time regardless of number of sessions. The desired result for the given sample data is:

username  starttime               endtime
--------- ----------------------- -----------------------
User1     2011-12-01 08:00:00.000 2011-12-01 09:30:00.000
User1     2011-12-01 10:00:00.000 2011-12-01 12:30:00.000
User2     2011-12-01 08:00:00.000 2011-12-01 10:30:00.000
User2     2011-12-01 11:00:00.000 2011-12-01 11:30:00.000
User2     2011-12-01 11:32:00.000 2011-12-01 12:00:00.000
User2     2011-12-01 12:04:00.000 2011-12-01 12:30:00.000
User3     2011-12-01 08:00:00.000 2011-12-01 09:00:00.000
User3     2011-12-01 09:30:00.000 2011-12-01 09:30:00.000

To test the performance of your solution, use the following code:

-- helper function GetNums
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL
  DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS(SELECT 1 AS c UNION ALL SELECT 1),
    L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
  SELECT TOP (@n) n FROM Nums ORDER BY n;
GO

-- sample data, 5,000,000 rows
DECLARE
  @num_users          AS INT          = 1000,
  @intervals_per_user AS INT          = 5000,
  @start_period       AS DATETIME2(3) = '20110101',
  @end_period         AS DATETIME2(3) = '20110107',
  @max_duration_in_ms AS INT  = 3600000; -- 60 nimutes

TRUNCATE TABLE dbo.Sessions;

WITH C AS
(
  SELECT 'User' + RIGHT('000000000' + CAST(U.n AS VARCHAR(10)), 10) AS username,
      DATEADD(ms, ABS(CHECKSUM(NEWID())) % 86400000,
        DATEADD(day, ABS(CHECKSUM(NEWID())) % DATEDIFF(day, @start_period, @end_period), @start_period)) AS starttime
  FROM dbo.GetNums(@num_users) AS U
    CROSS JOIN dbo.GetNums(@intervals_per_user) AS I
)
INSERT INTO dbo.Sessions WITH (TABLOCK) (username, starttime, endtime)
  SELECT username, starttime,
    DATEADD(ms, ABS(CHECKSUM(NEWID())) % (@max_duration_in_ms + 1), starttime) AS endtime
  FROM C;

This code fills the table with 5,000,000 rows.

Feel free to create indexes to support your solution.

 

Good luck!

BG

 

Discuss this Blog Entry 14

on Jan 19, 2011
Hi Geri,

Your code doesn't parse due to the following part: "...And (S.StartTime And S.EndTime>T2_1.Time)..."

Can you please re-send a workable query, perhaps also send me directly to itzik@solidq.com.

Thanks!





on Jan 27, 2011
Nice one Stefan! Your solution runs for 40 seconds on the machine I'm using to test the solutions. I just posted a status update with the run times of the different solutions. You can find it here: http://www.sqlmag.com/blogs/puzzled-by-t-sql/tabid/1023/entryid/76106/Clarifications-Regarding-TSQL-Challenge-Packing-Date-and-Time-Intervals.aspx.
on Jan 20, 2011
Here's the corrected solution I got from Geri:

-- indexing
CREATE INDEX idx_user_start_end ON dbo.Sessions(username, starttime, endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username, endtime, starttime);

-- solution
With T1 As
(
Select UserName, StartTime Time
From Sessions

Union All

Select UserName, EndTime
From Sessions
),
T2 As
(
Select Row_Number() Over(Partition By UserName Order By Time) Nm,
UserName, Time
From T1
),
T3 As
(
Select A.Nm-Row_Number() Over(Partition By A.UserName Order By A.Time,B.Time) Nm1,
A.UserName,
A.Time StartTime,
B.Time EndTime
From T2 A
Inner join T2 B
On A.UserName = B.UserName
And A.Nm=B.Nm - 1
Where Exists
( Select *
From Sessions S
Where S.UserName = A.UserName
And (S.StartTime < B.Time And S.EndTime > A.Time) )
Or A.Time=B.Time
)
Select UserName,
Min(StartTime) StartTime,
Max(EndTime) EndTime
From T3
Group By UserName, Nm1
Order By UserName, StartTime;














































on Jan 27, 2011
Very interesting challenge. I really can't understand how you manage to get a solution that runs in 9 seconds.

I really look forward to seeing your solution.

/SG

-- stefan_g v1

-- on my dual core laptop the solution below runs in 47 seconds (not counting index creation)
-- It uses a parallell execution plan so it should be even faster with more cores

-- Create index
CREATE INDEX idx_user_start ON dbo.Sessions(username, starttime);

-- Create a table with one row per interval
-- endtime is the end of each interval. The end of an interval is identified by finding the end of a session
-- that is not overlapped by any other session.
-- to speed up the search for overlapping sessions we only search the sessions that overlap the hour we are searching for
if object_id('tempdb..#intervals') is not null drop table #intervals
;with
cte1 as (
select *, datediff(hour, 0, starttime)+n-1 as hash
from sessions
cross apply dbo.GetNums(datediff(hour, 0, endtime)-datediff(hour, 0, starttime)+1) t
)
, cte2 as (
select distinct username, endtime
from sessions a
where not exists(
select *
from cte1 b
where a.username=b.username
and b.starttime<=a.endtime
and b.endtime>a.endtime
and b.hash=datediff(hour, 0, a.endtime)
)
)
select *, row_number() over (partition by username order by endtime) as rn
into #intervals
from cte2

-- add information about the start of each interval by locating the first session after each interval
;with
cte1 as (
select a.username, a.endtime, isnull(b.endtime,'19000101') as pe
from #intervals a
left join #intervals b
on a.username=b.username and a.rn=b.rn+1
)
select username, (select min(b.starttime) from sessions b where a.username=b.username and b.starttime>a.pe) as starttime, endtime
from cte1 a



















































on Jan 20, 2011
Hi Peso,

No, it doesn't mean "one query" solution. It means:
* Not to use iterative constructs like cursors or loops where you interact with one row at a time, as opposed to interacting with the set as a whole.
* Not to rely on consuming data in specific physical order for the solution to work correctly.

Other than that, remember that one of the puzzle's requirements is for the solution to use only standard SQL constructs. That's for posterity, since the problem is so fundamental. To paraphrase Fermat, I have discovered a truly remarkable solution which this note is too small to contain. Well, not really, but I always wanted to say that. ;) Is there an Andrew Wiles among you that will find the fast solution? My standard solution runs for 18 seconds on my moderate laptop, and I have a nonstandard one that runs for 9 seconds. I won’t let you work 500 years on the problem, but I will allow you to work until, say, the first day of Spring before I present mine unless a solution as fast or faster is found beforehand.

Of course, if you have a T-SQL specific solution that runs really fast, please do share...









on Jan 26, 2011
Stefan, by standard I mean constructs defined by ISO and ANSI SQL. The idea is for the solution to be cross platform. Such that you can copy-paste to run it on Oracle/DB2/etc. (or with very minor revisions).
Recursive CTEs therefore are perfectly valid. Window functions also. The solution has to run on SQL Server, though, so constructs that are not available in SQL Server are a problem. Though if you have a solution that relies on those, of course, it's interesting to see it...
Hope this clarifies things...

on Jan 25, 2011
Malpashaa, your solutions produce incorrect results when run against the big sample data. Even though the sample data is created with randomization, you should get only a bit over 1,000 rows in the result. That’s because there are 1,000 users, each with about 5,000 intervals of up to one hour, during a period of one week. Hence, most users will have one packed interval and some will have a bit more.
Your queries return over 100,000 rows in the result. I haven’t looked into the logic of your solutions yet since I’m hoping you can find the bugs, fix them, and send correct versions first.


Alejandro Mesa (not verified)
on Jan 29, 2011
Very nice indeed, Stefan!

--
AMB


on Jan 29, 2011
Hi Will,

The answer is no. It's just the sample data I provides in this case, but the intervals can be of any duration; in the milliseconds, or in the years.

on Jan 29, 2011
Is anything to be gained from taking advantage of the 60-minute maximum for a session in the test data, Itzik?
on Jan 24, 2011
These are my proposed solutions:
/* Indexes */
CREATE NONCLUSTERED INDEX Sessions__IX__username__endtime__starttime
ON dbo.Sessions(username, endtime, starttime);

CREATE NONCLUSTERED INDEX Sessions__IX__username__starttime__endtime
ON dbo.Sessions(username, starttime, endtime);

/* V1 */

WITH StartTimesCTE AS
(
SELECT U.username, ST.starttime,
ROW_NUMBER() OVER(PARTITION BY U.username
ORDER BY ST.starttime) AS row_num
FROM (SELECT DISTINCT username
FROM dbo.Sessions) AS U
CROSS APPLY
(SELECT DISTINCT S1.starttime
FROM (SELECT S.starttime, ROW_NUMBER() OVER(ORDER BY S.starttime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S1
LEFT OUTER JOIN
(SELECT S.endtime, ROW_NUMBER() OVER(ORDER BY S.starttime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S2
ON S2.row_num = S1.row_num - 1
AND S2.endtime >= S1.starttime
WHERE S2.row_num IS NULL) AS ST
)
, EndTimesCTE AS
(
SELECT U.username, ET.endtime,
ROW_NUMBER() OVER(PARTITION BY U.username
ORDER BY ET.endtime) AS row_num
FROM (SELECT DISTINCT username
FROM dbo.Sessions) AS U
CROSS APPLY
(SELECT DISTINCT S1.endtime
FROM (SELECT S.endtime, ROW_NUMBER() OVER(ORDER BY S.endtime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S1
LEFT OUTER JOIN
(SELECT S.starttime, ROW_NUMBER() OVER(ORDER BY S.endtime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S2
ON S2.row_num = S1.row_num + 1
AND S2.starttime <= S1.endtime
WHERE S2.row_num IS NULL) AS ET
)
SELECT ST.username, ST.starttime, ET.endtime
FROM StartTimesCTE AS ST
INNER JOIN
EndTimesCTE AS ET
ON ET.username = ST.username
AND ET.row_num = ST.row_num
ORDER BY username, starttime, endtime;

/* V2 */

SELECT U.username, ST.starttime, ET.endtime
FROM (SELECT DISTINCT username
FROM dbo.Sessions) AS U
CROSS APPLY
(SELECT ST.starttime, ROW_NUMBER() OVER(ORDER BY ST.starttime) AS row_num
FROM (SELECT DISTINCT S1.starttime
FROM (SELECT S.starttime, ROW_NUMBER() OVER(ORDER BY S.starttime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S1
LEFT OUTER JOIN
(SELECT S.endtime, ROW_NUMBER() OVER(ORDER BY S.starttime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S2
ON S2.row_num = S1.row_num - 1
AND S2.endtime >= S1.starttime
WHERE S2.row_num IS NULL) AS ST) AS ST
CROSS APPLY
(SELECT ET.endtime, ROW_NUMBER() OVER(ORDER BY ET.endtime) AS row_num
FROM (SELECT DISTINCT S1.endtime
FROM (SELECT S.endtime, ROW_NUMBER() OVER(ORDER BY S.endtime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S1
LEFT OUTER JOIN
(SELECT S.starttime, ROW_NUMBER() OVER(ORDER BY S.endtime) AS row_num
FROM dbo.Sessions AS S
WHERE S.username = U.username) AS S2
ON S2.row_num = S1.row_num + 1
AND S2.starttime <= S1.endtime
WHERE S2.row_num IS NULL) AS ET) AS ET
WHERE ET.row_num = ST.row_num
ORDER BY username, starttime, endtime;



























































































on Jan 19, 2011
I guess my contribution should appear here:

With T1 As
(Select UserName,
StartTime Time
From Sessions
Union All
Select UserName,
EndTime
From Sessions),
T2 As
(Select Row_Number() Over(Partition By UserName Order By Time) Nm,
UserName,
Time
From T1 T1_1),
T3 As
(Select T2_1.Nm-Row_Number() Over(Partition By T2_1.UserName Order By T2_1.Time,T2_2.Time) Nm1,
T2_1.UserName,
T2_1.Time StartTime,
T2_2.Time EndTime
From T2 T2_1
Inner join T2 T2_2
On T2_1.UserName=T2_2.UserName
And T2_1.Nm=T2_2.Nm-1
Where Exists (Select *
From Sessions S
Where S.UserName=T2_1.UserName
And (S.StartTime And S.EndTime>T2_1.Time))
Or T2_1.Time=T2_2.Time)
Select UserName,
Min(StartTime) StartTime,
Max(EndTime) EndTime
From T3
Group By UserName,
Nm1
Order By UserName,
StartTime;



































on Jan 20, 2011
Does set-based mean "one-query solution", or just the absence of a cursor?

I have a solution that runs in 61-63 second and uses 1,116,205 reads but have 3 queries; one insert, one merge and finally one select to display the result.



on Jan 26, 2011
When you say "use only standard SQL constructs" what do you mean ?

For example, are recursive cte:s allowed ?, window functions ?, window functions that are standardized but not implemented in SQL server?

Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×