Executive Summary:
Dealing with temporal data can be tricky. Create a temporal histogram to show the distribution of events over time by first relaxing some of the requirements to solve a specific case, then enhancing the solution to make it more generic.
|
Recently I received a request from a customer
to come up with a solution that produces
temporal histograms—histograms showing
the distribution of events over time periods. The
problem was an interesting challenge and seemed like
a generic need, so I decided to cover it in my column.
First, I’ll explain the problem in terms of inputs and
desired output. Next, I’ll provide a solution that
handles only a specific case of the problem. Then, I’ll
show you how
to enhance the
solution to make
it more generic.
The Challenge
Suppose that you have a table called Events in your
database, containing information about events in
time. These events can be appointments, sessions, or
anything that has start and end points in time. The
Events table has three columns: event_id is the primary
key, event_start is the start point in time of the event,
and event_end is the end point. Run the code in Web
Listing 1 to create the Events table in the tempdb database and
populate it with sample data.
You need to write a table function that accepts the
following inputs:
@from_dt—start point of a datetime range
@to_dt—end point of a datetime range
@date_part—a datetime part from the enumeration:
‘minute’, ‘hour’, ‘day’, ‘week’, ‘month’,
‘quarter’, ‘year’
@num_parts—the number of datetime parts to be
covered in each step
The table function should produce a histogram
showing the number of active events during each fixed
interval of time within the requested datetime range.
The intervals of time, or steps, are based on the input @date_part and @num_parts.
The problem is best explained through an example.
Given the inputs @from_dt = ‘20080501 00:00’, @
to_dt = ‘20080511 00:00’, @date_part = ‘day’, @
num_parts = 1, you’re supposed to produce the output
in Web Table 1. Each row in the output represents a
different fixed interval of time (from_dt - to_dt) within
the requested datetime range. Because the requested
date part is ‘day’, and the number of parts is 1, each
step in the histogram represents one day. The fourth
column in the output (num_events) holds the count of
events from the Events table that were active during the
current interval.
Regarding the histogram step boundary points, one
of the requirements from my customer was to produce
round points in time (round in respect to the input @
datepart), except for the extreme boundary points that
must be the ones provided by the user. For example,
given the inputs @from_dt = ‘20080501 12:30’, @to_dt
= ‘20080510 10:00’, @date_part = ‘day’, @num_parts
= 1, the step boundary points should be those in Web Table 2. Notice that the first step’s low boundary point
is 2008-05-01 12:30:00.000 and the last step’s high
boundary point is 2008-05-10 10:00:00.000, whereas
all the other step boundary points represent whole days
(in terms of day units).
Producing the Histogram Steps
You can start by creating a table function (call it
fn_HistSteps) that returns the histogram steps table
based on the previously mentioned input parameters.
The function will return a row for each step with the
step boundary points. Once defined, you can join the
function with the Events table to match steps and
events, group the result of the join by step, and count
the number of events in each step.
So that you don’t have to deal with too many
aspects of the problem at once, you can first relax some
of the requirements. For example, take the @date_part and @num_parts inputs out of the equation, and solve
the task for a specific interval—say, one day. After you
manage to solve the problem for a specific interval, you
can add the logic required to handle the requested @
date_part and @num_parts inputs.
In my solution I used an auxiliary table of numbers
that you create and populate by running the code in
Web Listing 2. This code creates a table called Nums
with a single column called n, and populates the table
with integers in the range 1 through 1,000,000.
To create the first version of the fn_HistSteps function,
run the code in Web Listing 3. The function is an
inline table-valued function based on a single query with multiple common table expressions (CTEs). The first
CTE defined by the function’s code is called C0 and it
has two columns: floor_from_dt and diff. The former is a
floor of the input @from_dt value in terms of day units;
that is, midnight of the input @from_dt value. The latter
is the number of days in the range @from_dt - @to_dt.
The second CTE is called C1; it’s in charge of producing
steps with round boundary points. This task is
achieved by joining Nums and Steps, and returning all
n values that are smaller than or equal to diff (number
of days in the input range). The starting point of each
step (from_dt) is calculated by adding n-1 days to
floor_from_dt, and the ending point of each step (to_
dt) is calculated by adding n days to floor_from_dt.
The third CTE is called C2; it’s in charge of
adjusting the extreme boundary points (start point of
first step and end point of last step) if they need adjustment.
Remember that the previous CTE (C1) produced
round boundary points, although the requirement
was that the extreme boundary points would be those
provided by the user as the input datetime range
boundary points. Note that as a result of adjusting the
extreme boundary points, C2 might end up with rows representing irrational steps where to_dt isn’t greater
than from_dt. Those rows will be eliminated by the
outer query. The outer query simply returns all rows
from C2 representing the histogram steps, excluding
the irrational steps produced by a previous CTE.
Continued on page 2
Prev. page  
[1]
2
next page