Work around SQL Server’s incomplete support
|Executive Summary: Microsoft’s SQL Server has limited support for temporal data. Learn how to work around this limitation.|
Temporal data can be difficult to handle in SQL Server queries, because SQL Server implements only a subset of the features that are defined by ANSI SQL to support temporal data. Even in SQL Server 2008, temporal data types represent only a moment in time—no support exists for an INTERVAL data type that represent a time quantity. In addition, no native support is available for operations on temporal data, such as checking whether two time periods overlap, or merging intersecting time periods. You must create your own custom solutions for such operations. For example, you can represent a period of time either as two points in time (start time and end time) or as a start point in time and an integer representing the duration in a predetermined unit (e.g., a second).
A friend and colleague, Davide Mauri, recently presented me with a temporal querying problem. The challenge involves changes that take place in attribute values of entities (e.g., customers). Suppose that you keep track of the validity periods of customers’ attribute values, such as phone numbers and addresses. You have a table called Phones where you keep track of the validity periods for phone numbers, a table called Addresses with validity periods of addresses, and so on. The challenge is to produce all periods with distinct combination of attribute values (phone number, address, etc.).
Run the code in Listing 1 to create the Phones and Addresses tables and populate them with sample data. Notice that I didn’t include an attribute in the tables to hold a customer ID; I wanted to simplify the problem and focus on the temporal querying logic.
For each phone number and address, the pair of attributes startdt and enddt represent the value’s validity period. The attribute value’s validity period is greater than or equal to startdt and less than enddt (startdt = period < enddt). A NULL in the enddt value stands for no end date or still applicable. As I mentioned earlier, the challenge is to produce all periods of time with distinct phone and address values. Table 1, page 22, shows the desired output. Before you read Davide’s solution and my solution, try to solve the problem on your own.
Davide’s complete solution appears in Listing 2, page 22. As you can see, the solution is expressed as a single statement that defines multiple common table expressions (CTEs). For simplicity, I’ll explain Davide’s solution one CTE at a time. Note that the code uses language elements that were introduced in SQL Server 2005 (CTEs and the PIVOT operator); therefore you can’t run the code if you’re using a previous version of SQL Server. However, you can easily adapt the solution to earlier versions of SQL Server by converting the use of CTEs to derived tables, and use pivoting techniques that were supported in SQL Server 2000.
The code in Davide’s solution first defines a CTE called Timestamps that represents all timestamps where the validity period of an attribute value either starts or ends. The code defining the Timestamps CTE has four queries, each returning either a start time or an end time of an attribute value, and all result sets are unified with UNION ALL set operations. Table 2 shows the output produced by the query that defines the Timestamps CTE. Notice in Table 2 that NULL end times were substituted by the value 99991231, and that duplicate timestamp values may appear.
The second step in the solution is to define the CTE TSDR (short for timestamps with dense ranks). The query defining this CTE calculates dense rank values to position the timestamps chronologically, and it uses the DISTINCT clause to get rid of duplicate timestamps. The output of the query defining the TSDR CTE is shown in Table 3. Notice in the output that after the removal of duplicate timestamps, only 5 out of 8 entries remain.
The third step in the solution is to define the CTE Intervals. The query defining this CTE joins two instances of TSDR (aliased as Cur and Nxt) to produce all possible periods of time from the timestamps in TSDR. The JOIN condition used for this purpose is Nxt.pos = Cur.pos + 1, meaning that each point in time that is considered as the start of a period will be matched with the nearest future point as the end of the period. The output of the query defining the Intervals CTE is shown in Table 4.
The fourth step in the solution is to define the CTE called UnifiedValues. This CTE is independent of the previously defined CTEs; the code defining this CTE queries the Phones and Addresses tables directly, and it uses a UNION ALL set operation to unify the validity periods of phones and addresses. The code returns a string with the attribute type (‘phone’ | ‘address’) in the attr column, and the attribute value in the val column. The code uses the COALESCE function to replace a NULL in the enddt attribute to the value 99991231. The output of the query defining the UnifiedValues CTE is shown in Table 5.
The fifth step in the solution is to define the CTE ValidValues. The query defining this CTE joins the Intervals CTE with the UnifiedValues CTE to match possible periods (from Intervals) and validity periods (from UnifiedValues) that overlap. The join predicate ensures that the two periods overlap. Two periods (e.g., p1 and p2) overlap if p1.end > p2.start and p1.start < p2.end. The query defining the ValidValues CTE returns for each pair of overlapping periods the start time and end time of the period from intervals, the attribute type (‘phone’ | ‘address’), and the attribute value. The output of the query defining the ValidValues CTE is shown in Table 6.
The code defining the CTE ValidValues returns the correct result, only with a separate row for each distinct period and attribute. The final step in the solution is achieved by the outer query. This query pivots the attributes such that you get a row for each distinct period, a column for each attribute, and the attribute value in the intersection of period and attribute.
Davide and I took similar approaches in the first part of our solutions, but we took different approaches in the second part. My complete solution appears in Listing 3.
Continue on Page 2
The first step in my solution defining the CTE Timestamps is almost identical to Davide’s Timestamps CTE, except that mine uses a UNION operation as the last set operation instead of UNION ALL. This means that duplicate timestamps are removed at this step in my solution.
In the second step (in my solution, defining CTE TSRN—timestamps with row numbers), I didn’t need to use the combination of the DENSE_RANK function and the DISTINCT clause. Instead, I simply assigned row numbers to position the timestamps chronologically.
The third step in my solution (i.e., defining the Intervals CTE) is identical to Davide’s. Then our solutions start to differ. My solution reaches the last step at this point, implemented by the outer query. The outer query joins the Intervals CTE, which contains all distinct possible periods with each of the attribute tables (Phones and Addresses, in our case) containing validity periods, to match all overlapping validity periods to each possible period. A LEFT OUTER JOIN is used because there’s no guarantee that each possible period will find a match in each attribute table.
Maintaining the Solutions
As you can see from this problem, treatment of time periods can be challenging and quite tricky—especially because SQL Server has incomplete support for temporal data. If you want to use Davide’s or my solution to support new attributes, you’ll need to create a new table for each attribute, then revise the solution to include the attributes. For either solution, you’ll need to revise the Timestamps CTE, adding the timestamps from the new attribute table. For Davide’s solution, you’ll need to revise the UnifiedValues CTE, adding a UNION ALL set operation and a query against the new attribute table to pull the validity periods of the new attribute. In addition, you’ll need to add the new attribute name to the IN clause of the PIVOT operator in the outer query. For my solution, you’ll need to revise the outer query, adding a LEFT OUTER JOIN to the new attribute table.
Did you manage to come up with a different solution than mine or Davide’s? If so, share your solution by posting a comment to this article.