In mathematics, an interval is the set of values between some lower value and some upper value. There are different types of intervals that you might need to represent in your database, such as temporal (e.g., sessions, contracts, projects, appointments, periods of validity), spatial (e.g., segments of a road), and numeric (e.g., temperature ranges).
Starting with SQL Server 2008, you can represent spatial intervals using the spatial data types GEOMETRY and GEOGRAPHY and operate on those types with methods. There's also indexing and optimization support for spatial queries.
As for other types of intervals, such as the very common temporal type, SQL Server doesn't yet provide any special support. Most people represent such intervals with two attributes holding the lower and upper values and use predicates involving those attributes for interval-related querying. There's no special indexing or optimization support for intervals.
Interval-related querying involves identifying common relations between intervals. One of the most common queries is to check whether two intervals intersect (e.g., return all sessions that were active during a certain period of time represented by the inputs @l and @u). A test for intersection is a composition of eleven out of the thirteen relations defined by James F. Allen. Specifically, the eleven relations are meets, overlaps, finished-by, contains, starts, equals, started-by, during, finishes, overlapped-by, and met-by. Unfortunately, the classic methods people use to identify interval intersection and some of the other relations suffer from fundamental optimization problems. The result is that interval-related queries tend to perform very inefficiently.
But all hope is not lost—a group of researchers from the University of Munich (Hans-Peter Kriegel, Marco Pötke, and Thomas Seidl) created an ingenious model called the Relational Interval Tree (RI-tree), and Laurent Martin from France added some further improvements, allowing you to efficiently handle intervals in SQL Server. However, the model and the further optimizations involve some math and computer science that could be a bit complex for some people.
The potential exists to integrate the model within the SQL Server database engine and make it seamless to the user. As a user, you would use basic syntax to create a new kind of index and leave your queries unchanged. The rest of the responsibility would be SQL Server's—your queries would simply run faster. Such support doesn't yet exist in SQL Server, but I'm hopeful that Microsoft will embrace the idea and include such support in the future.
In this article, I start by describing the traditional representation of intervals in SQL Server, the classic queries against intervals, and the fundamental optimization problems of such queries. I then describe the RI-tree model and further optimizations. Finally, I describe the potential for integration of the RI-tree model in SQL Server.
The following resources contain my feature enhancement request to Microsoft, an academic paper describing the RI-tree model, and articles describing further optimizations:
- Microsoft Connect submission suggesting integration in SQL Server
- Academic paper describing the RI-tree model (Kriegel, Pötke, and Seidl of University of Munich): "Managing Intervals Efficiently in Object-Relational Databases"
- Articles describing optimizations to the RI-tree model (Laurent Martin): "A Static Relational Interval Tree" and "Advanced interval queries with the Static Relational Interval Tree"
Many thanks to Kriegel, Pötke, and Seidl for creating such an ingenious model and to Laurent Martin for the improvements and for acquainting me with the model.
Traditional Representation of Intervals
The traditional representation of intervals in SQL Server is with two attributes holding the lower and upper values of each interval and another attribute holding the key. Of course there can be other attributes in the table serving other purposes. In my examples, I use a table called Intervals with 10,000,000 rows representing intervals in the traditional way.
Use the code in Listing 1 and Listing 2 to create and populate such a table in your environment. Listing 1 creates a database called IntervalsDB, a helper function called GetNums that generates a sequence of integers in a requested range, and a staging table called Stage with 10,000,000 rows.
-- create sample database IntervalsDB
SET NOCOUNT ON;
USE master;
IF DB_ID('IntervalsDB') IS NOT NULL DROP DATABASE IntervalsDB;
CREATE DATABASE IntervalsDB;
GO
USE IntervalsDB;
GO
-- create helper function dbo.GetNums
-- purpose: table function returning sequence of integers between inputs
@low and @high
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM L5)
SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
FROM Nums
ORDER BY rownum;
GO
-- create staging table dbo.Stage with 10,000,000 intervals
CREATE TABLE dbo.Stage
(
id INT NOT NULL,
lower INT NOT NULL,
upper INT NOT NULL,
CONSTRAINT PK_Stage PRIMARY KEY(id),
CONSTRAINT CHK_Stage_upper_gteq_lower CHECK(upper >= lower)
);
DECLARE
@numintervals AS INT = 10000000,
@minlower AS INT = 1,
@maxupper AS INT = 10000000,
@maxdiff AS INT = 20;
WITH C AS
(
SELECT
n AS id,
@minlower + (ABS(CHECKSUM(NEWID())) %
(@maxupper - @minlower - @maxdiff + 1)) AS lower,
ABS(CHECKSUM(NEWID())) % (@maxdiff + 1) AS diff
FROM dbo.GetNums(1, @numintervals) AS Nums
)
INSERT INTO dbo.Stage WITH(TABLOCK) (id, lower, upper)
SELECT id, lower, lower + diff AS upper
FROM C;
You'll use the same source data from the staging table to first fill the traditional representation of intervals and later the RI-tree representation. Run the code in Listing 2 to create the Intervals table with the traditional representation of intervals and fill it with the sample data from the staging table.
CREATE TABLE dbo.Intervals
(
id INT NOT NULL,
lower INT NOT NULL,
upper INT NOT NULL,
CONSTRAINT PK_Intervals PRIMARY KEY(id),
CONSTRAINT CHK_Intervals_upper_gteq_lower CHECK(upper >= lower)
);
CREATE INDEX idx_lower ON dbo.Intervals(lower) INCLUDE(upper);
CREATE INDEX idx_upper ON dbo.Intervals(upper) INCLUDE(lower);
INSERT INTO dbo.Intervals WITH(TABLOCK) (id, lower, upper)
SELECT id, lower, upper
FROM dbo.Stage;
I use integers to represent the lower and upper values of the interval because the RI-tree model, which I discuss later, works with integers. To represent date and time values, you need to map them to integers—for example, by computing the difference between some base point and the target value in terms of the desired granularity of the type. Of course, using the traditional representation of intervals, you could use any type that implements total order.
Observe the two indexes that are defined in the Intervals table. One index is created on the column lower as the key and includes the column upper, and the other is created on the column upper as the key and includes the column lower.
Most interval-related queries—such as the one that identifies intersecting intervals—involve two range predicates. For example, given an input interval identified by the variables @l (for lower) and @u (for upper), the intervals that intersect with the input interval are those that satisfy the following predicate: lower <= @u AND upper >= @l. Herein lies the fundamental optimization problem—a seek in an index can be based on only one range predicate. Other range predicates have to be evaluated as residual predicates. Therefore, the optimizer will have to pick one of the two indexes to work with and perform a seek based on one of the predicates against the leading index key to filter the qualifying rows. While scanning the remaining rows in the index leaf, evaluate the other predicate to determine which rows to return.
This is such an important point to grasp that it's worth spending a bit more time on it to make sure it's well-understood. Consider the following list sorted by X, Y:
X, Y
1, 10
1, 20
1, 40
1, 50
2, 20
2, 30
2, 50
3, 20
3, 40
3, 50
3, 50
3, 60
4, 10
4, 10
4, 50
4, 60
Suppose this sorted list represents data in the leaf level of a B-tree index. Consider a query with the following filter: X >= 3 and Y <= 40. Based on the predicate X >= 3, it's possible with an index seek to go straight to the leaf row (3, 20) and scan only the rows that satisfy this predicate. However, there's no escape from scanning all remaining rows in the index leaf and evaluating the remaining predicate Y <= 40 as a residual predicate.
You can easily observe the optimization problem with a query looking for intersection against the Intervals table. First, use the following code to turn on STATISTICS IO and STATISTICS TIME:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
Next, run the following query to check for intervals that intersect with an input interval that resides roughly in the middle of the entire range:
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT id
FROM dbo.Intervals
WHERE lower <= @u AND upper >= @l
OPTION (RECOMPILE);
I use the RECOMPILE query option because otherwise variable values won't be sniffed. Figure 1 shows the plan for this query.
The inputs in this query represent an interval roughly in the middle of the entire range, so it doesn't really matter which of the two indexes the optimizer chooses to use. In this case, it seems that the optimizer chose to use the index idx_upper. As for observing the optimization problem, notice that the Seek Predicates section contains only the predicate against the column upper; the predicate against the column lower appears under the section Predicate—meaning residual predicate. The end result is that about half of the leaf level of the index has to be scanned. The statistics that I got for this query on my system were logical reads: 11256; CPU time: 482 ms. This isn't very efficient, to say the least.
As I mentioned, when looking for an interval in the middle of the entire range it doesn't really matter which index the optimizer uses. However, when looking for an interval that's close to the beginning of the range, naturally it's more efficient to use idx_lower, which involves scanning a smaller section at the beginning of the index leaf. Try it by using the inputs @l = 80 and @u = 100; make sure you use the RECOMPILE option so that the optimizer will be able to sniff the variable values. For this query, I got the statistics logical reads: 3; CPU time: 0 ms.
Similarly, when querying the end of the data, the optimizer chooses idx_upper, scanning a small section closer to the end of the index leaf. For example, try running the query with the inputs @l = 9999900 and @u = 9999920. The statistics I got for this query are logical reads: 3; CPU time: 0 ms.
As you can determine from this exercise, when using parameters in a stored procedure and not specifying RECOMPILE, this query is very sensitive to parameter sniffing problems. At any rate, the important conclusion is that unless users always query only a small section of the entire range that's close to either the beginning or the end of the range, the queries will suffer from serious optimization problems.
Relational Interval Tree
The RI-tree model is an ingenious model created by Kriegel, Pötke, and Seidl that enables very efficient querying of intervals. Implementing the model involves computing an attribute for each interval (stored as a column in the table), creating two indexes, and using new queries to look for intersection and other interval relations.
At the heart of the model is a virtual backbone binary tree whose nodes are the integer values in the range that needs to be covered. For example, if the intervals you need to represent can start or end anywhere in the range 1 to 31, the virtual backbone tree in your case is the one in Figure 2 (ignore the interval and the fork node shown in the figure for now).
You can use a LOG function to compute the height of the tree. To cover a range starting with 1 and ending with @max, the height of the binary tree is @h = CEILING(LOG(@max + 1, 2)). The root of the tree is POWER(2, @h-1). The reason the backbone tree is virtual is because the complete tree isn't persisted anywhere; instead, its nodes are used only when relevant, as you'll learn shortly.
Fork node. A fundamental component in the RI-tree model is what's called the fork node, which you compute for and store with each interval that you need to represent in your database. Figure 2 illustrates how the fork node is computed for some sample interval. You descend the virtual backbone tree starting with the root with bisection; the first node that you find within the interval is the fork node.
Listing 3 contains an implementation of this algorithm based on the RI-tree model in a T-SQL user-defined function (UDF) called forkNode for a tree with a height of 31 (covering the range 1 to 2147483647).
-- Based on "Managing Intervals Efficiently in Object-Relational Databases"
CREATE FUNCTION dbo.forkNode(@lower AS INT, @upper AS INT) RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
DECLARE @node AS INT = 1073741824; -- @node = 2^(h-1),
h = height of tree, #values: 2^h - 1
DECLARE @step AS INT = @node / 2;
WHILE @step >= 1
BEGIN
IF @upper < @node
SET @node -= @step;
ELSE IF @lower > @node
SET @node += @step;
ELSE
BREAK;
SET @step /= 2;
END;
RETURN @node;
END;
GO
For example, invoke the function with 11 and 13 as inputs, and you get 12 as the output fork node:
SELECT dbo.forkNode(11, 13);
The downside of the fork node computation in Listing 3 is that it uses a T-SQL UDF with iterative logic. The combination is highly inefficient, especially when a fork node needs to be computed for each interval that you want to store in the database. To give you a sense of the inefficiency, it took 304 seconds on my system to populate a table with 10,000,000 rows, out of which the computation of the fork node took 297 seconds.
In the iterative algorithm for computing the fork node, the CPU complexity is proportional to the height of the tree. In the RI tree model, Kriegel, Pötke, and Seidl provide a variation where the height and the range of the tree expand dynamically according to the intervals that are added. But this variation involves maintaining additional parameters for the tree and modifying some of them with every addition of an interval, which results in support for only single-row insertions; in addition, the insertions are slow, and they can quickly result in a bottleneck.
Laurent Martin came up with a way to compute the fork node using a scalar expression, enabling support for a large static RI-tree, and using very efficient multi-row insertions. I briefly describe Martin's computation here; for more detail, see "A Static Relational Interval Tree."
As Figure 3 illustrates, the fork node of a given interval is the lowest common ancestor of the boundaries of the interval represented by the values in the columns lower (11) and upper (13).
Examine the binary representation of the nodes in Figure 3. Observe the following:
- For any given node, let L = leading bits before trailing 0s; for example, for node 12 (01100), L = 011.
- Given a node X whose leading bits are L, all nodes in X's left subtree have L-1 as their leading bits (e.g., for L = 011, L-1 = 010), and all nodes in X's right subtree have L as their leading bits.
- For some nonleaf node X and X-1, the ancestor is the same, so for the purposes of computing the ancestor of X, X can be replaced with X-1.
- If Z is a leaf node, Z and Z-1 differ only in the last bit: for Z it's 1 and for Z-1 it's 0.
Based on these observations, the fork node can be computed as: matching prefix of (lower - 1) and upper, concatenated with 1, and padded with trailing zeros.
To achive this computation in SQL Server, consider the following steps, keeping in mind the interval [11, 13] in Figure 3 as an example:
- Let A = (lower - 1) ^ upper
- Let B = POWER(2, FLOOR(LOG(A, 2)))
- Let C = upper % B
- Let D = upper - C
Step 1 computes a value (call it A) that marks the bits that are different in (lower - 1) and upper as 1s. To achieve this, you apply a bitwise XOR operator between (lower - 1) and upper. In our example, 11 ^ 13 in binary representation is 01010 ^ 01101 = 00111.
Step 2 computes a value (call it B) where the leftmost bit that's different in (lower - 1) and upper is set to 1 and all other bits are set to 0. In other words, this step identifies the leftmost bit in A that's turned on. Note that based on the aforementioned observations, the leftmost bit that's different in (lower - 1) and upper will be 0 in (lower - 1) and 1 in upper. In our example, the result of this step in binary form is 00100.
Step 3 keeps the trailing bits from upper after the bit that's set to 1 in B. In our example, this step computes 01101 % 00100 = 01.
Step 4 sets the trailing bits in upper after the set bit in B to 0s. In our example, 01101 - 00001 = 01100. And voila—D is the fork node!
Putting it all together, use the following formula (expressed as a T-SQL expression in SQL Server 2012) to compute the fork node:
upper - upper % POWER(2, FLOOR(LOG((lower - 1) ^ upper, 2)))
This expression encapsulates the computation of the fork node in an ingenious way. When I described it to a friend of mine who is an expert in T-SQL, he got so excited that he jokingly told me he wants to be like Martin when he grows up!
Now that you have a scalar expression to compute the fork node based on the columns lower and upper, you can implement it as a computed column (call it node) in the table that holds your intervals. Based on the RI-tree model, you'll need two indexes: one on the keylist (node, lower) and another on the keylist (node, upper). Note that depending on your needs, you might want to add leading columns to the keylist for equality-based filters in your queries, as well as columns to the index include list for coverage purposes.
The code in Listing 4 creates the IntervalsRIT table, populates it with the sample data from the staging table, and creates the primary key constraint and the indexing based on the RI-tree model.
-- Based on "A Static Relational Interval Tree"
CREATE TABLE dbo.IntervalsRIT
(
id INT NOT NULL,
node AS upper - upper % POWER(2, FLOOR(LOG((lower - 1) ^ upper, 2)))
PERSISTED NOT NULL,
lower INT NOT NULL,
upper INT NOT NULL,
CONSTRAINT PK_IntervalsRIT PRIMARY KEY(id),
CONSTRAINT CHK_IntervalsRIT_upper_gteq_lower CHECK(upper >= lower)
);
CREATE INDEX idx_lower ON dbo.IntervalsRIT(node, lower);
CREATE INDEX idx_upper ON dbo.IntervalsRIT(node, upper);
INSERT INTO dbo.IntervalsRIT WITH(TABLOCK) (id, lower, upper)
SELECT id, lower, upper
FROM dbo.Stage;
The populated IntervalsRIT table and its indexes are the RI-tree–based representation of your intervals replacing the previous Intervals table and its indexes. Recall that with the iterative forkNode function, it took 297 seconds on my system just to compute the fork nodes for 10,000,000 intervals. With Martin's optimized formula, it took only 6 seconds.
Querying the RI-tree model. Your IntervalsRIT table has your intervals stored, along with the fork node, in a column named node. Next, you'll learn how to query the table to identify intervals that intersect with some input interval [@l, @u]. In my examples, I'll use the input interval [@l = 11, @u = 13].
Based on the RI-tree model, there are three disjoint groups of intervals that can intersect with an input interval. I'll refer to them as the left, middle, and right groups.
Left group. Consider the path in the virtual backbone tree descending from the root to @l, as Figure 4 shows. Let leftNodes be the set of all nodes w that appear on the path and are to the left of the input interval.
For all nodes w in leftNodes, the following are true (please refer to Figure 4):
- An interval registered at a node d in w's left subtree can't intersect with the input interval. If it did, it would have registered at an ancestor of d.
- An interval registered at w could be one of two cases: (1) If upper < @l, clearly the interval doesn't intersect with the input interval; (2) if upper >= @l, the interval must intersect with the input interval because we already know that lower <= @u (an interval registered at a node that appears to the left of the input interval obviously starts before the input interval ends).
Therefore, the intervals belonging to the left group are the ones registered at a node in leftNodes and haveupper >= @l.
Listing 5 contains the definition of a table function called leftNodes that returns the nodes in the aforementioned set leftNodes.
-- Based on "Managing Intervals Efficiently in Object-Relational Databases"
-- rightNodes function
CREATE FUNCTION dbo.leftNodes(@lower AS INT, @upper AS INT)
RETURNS @T TABLE
(
node INT NOT NULL PRIMARY KEY
)
AS
BEGIN
DECLARE @node AS INT = 1073741824;
DECLARE @step AS INT = @node / 2;
-- descend from root node to lower
WHILE @step >= 1
BEGIN
-- right node
IF @lower < @node
SET @node -= @step;
-- left node
ELSE IF @lower > @node
BEGIN
INSERT INTO @T(node) VALUES(@node);
SET @node += @step;
END
-- lower
ELSE
BREAK;
SET @step /= 2;
END;
RETURN;
END;
GO
-- rightNodes function
CREATE FUNCTION dbo.rightNodes(@lower AS INT, @upper AS INT)
RETURNS @T TABLE
(
node INT NOT NULL PRIMARY KEY
)
AS
BEGIN
DECLARE @node AS INT = 1073741824;
DECLARE @step AS INT = @node / 2;
-- descend from root node to upper
WHILE @step >= 1
BEGIN
-- left node
IF @upper > @node
SET @node += @step
-- right node
ELSE IF @upper < @node
BEGIN
INSERT INTO @T(node) VALUES(@node);
SET @node -= @step
END
-- upper
ELSE
BREAK;
SET @step /= 2;
END;
RETURN;
END;
GO
The following query returns all intervals in the left group (those registered at nodes in leftNodes and that intersect with the input interval):
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.leftNodes(@l, @u) AS L
ON I.node = L.node
AND I.upper >= @l;
Examine the query plan in Figure 5.
The plan performs a seek in the index idx_upper per node in leftNodes. The important thing is that now there's one equality predicate and one range predicate, both of which can be applied as the seek predicates.
Right group. The right group is simply the symmetric group to the left group. Consider the path in the virtual backbone tree descending from the root to @u, as Figure 6 shows. Let rightNodes be the set of all nodes w that appear on the path and are to the right of the input interval.
For all nodes w in rightNodes, the following are true (please refer to Figure 6):
- An interval registered at a node d in w's right subtree can't intersect with the input interval. If it did, it would have registered at an ancestor of d.
- An interval registered at w could be one of two cases: (1) If lower > @u, clearly the interval doesn't intersect with the input interval; (2) if lower <= @u, the interval must intersect with the input interval because we already know that upper >= @l (an interval registered at a node that appears to the right of the input interval obviously ends after the input interval starts).
Therefore, the intervals belonging to the right group are the ones registered at a node in rightNodes and have lower <= @u.
Listing 5 contains the definition of a table function called rightNodes that returns the nodes in the aforementioned set rightNodes.
The following query returns all intervals in the right group (those registered at nodes in rightNodes and that intersect with the input interval):
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.rightNodes(@l, @u) AS R
ON I.node = R.node
AND I.lower <= @u;
Figure 7 shows the plan for this query. This plan is symmetric to the plan in Figure 5, only this time the index idx_lower is used.
Middle group. Let middleNodes be the set of nodes w that reside within the input interval, as Figure 8 shows.
Any interval registered at w has lower <= @u and upper >= @l; hence, it intersects with the input interval. Use the following query to return all intervals in the middle group:
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT id
FROM dbo.IntervalsRIT
WHERE node BETWEEN @l AND @u;
The optimizer can use either idx_upper or idx_lower to process this query efficiently. Looking at the plan for this query in Figure 9, it seems that the optimizer chose to use idx_upper in this case.
Listing 6 contains the code based on the RI-tree model that puts it all together, unifying the results of the queries returning the intersections in all three groups (left, middle, and right).
-- Based on "Managing Intervals Efficiently in Object-Relational Databases"
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.leftNodes(@l, @u) AS L
ON I.node = L.node
AND I.upper >= @l
UNION ALL
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.rightNodes(@l, @u) AS R
ON I.node = R.node
AND I.lower <= @u
UNION ALL
SELECT id
FROM dbo.IntervalsRIT
WHERE node BETWEEN @l AND @u
OPTION (RECOMPILE);
The statistics that I got for the unified query are the following: logical reads: 81 against IntervalsRIT + 2 against the table variable returned by leftNodes + 2 against the table variable returned by rightNodes; CPU time: 16 ms. That's a big improvement compared with the traditional query shown at the beginning of the article; recall that the statistics for the traditional query were logical reads: 11256; CPU time: 482 ms.
Optimized computation of ancestors. The downside of the implementation of the leftNodes and rightNodes functions in the previous sections is that it uses iterative logic, which isn't too efficient in T-SQL. Imagine the overhead to find intersections with not just one input interval but rather with a whole set of input intervals.
Laurent Martin came up with a very elegant solution that uses set-based logic. He describes his solution in detail in "A Static Relational Interval Tree" and "Advanced interval queries with the Static Relational Interval Tree." I'll describe a variation of the solution that I think is a bit easier to understand. The core logic and performance is pretty much the same as in Martin's solution.
Given an input interval [@l, @u], recall that leftNodes is the set of ancestors of @l that appear to the left of the input interval. Similarly, rightNodes is the set of ancestors of @r that appear to the right of the input interval. Martin made the following observation concerning the ancestors of any given node @node. Suppose @node = 13 (in binary 01101). You can determine the parent of a node by clearing the node's rightmost bit and setting the bit to the left of it to 1. You can repeat this process until you reach the root, to determine all ancestors. For example, the ancestors of 13 are 14, 12, 8, and 16:
01101 (13)
01110 (14)
01100 (12)
01000 (8)
10000 (16)
To compute the ancestors of a given node, you'll use a helper table called BitMasks, like the one in Table 1 (the values in the columns b1 and b3 are presented in binary form).
n b1 b3
--- ------ ------
1 11110 00010
2 11100 00100
3 11000 01000
4 10000 10000
...
For a virtual backbone tree with the height h, you'll fill the table with rows where n is between 1 and h-1. For example, the code in Listing 7 creates the BitMasks table and populates it with values for a tree with h = 31 (n between 1 and 30).
-- Based on "A Static Relational Interval Tree"
CREATE TABLE dbo.BitMasks
(
b1 INT NOT NULL,
b3 INT NOT NULL
);
INSERT INTO dbo.BitMasks(b1, b3)
SELECT -POWER(2, n), POWER(2, n)
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)) AS Nums(n);
You're probably wondering why there's no b2 column in the BitMasks table. In the variation of the solution that I describe, I use only b1 and b3—but Martin also uses another column called b2 in his original solution.
Please refer to Table 1 for the following. To compute the ancestors of an input node @node, you query the BitMasks table applying the expression @node & b1 | b3 in each level, which basically implements the logic described earlier: In each level, clear that level's bit and set the bit to the left of it to 1. However, you need to filter only the levels where there are ancestors—namely, the ones where the rightmost set bit (which is simply b3) is to the left of the rightmost set bit in @node.
SQL Server uses the two's complement storage format to represent integers. As the Wikipedia article states, "A shortcut to manually convert a binary number into its two's complement is to start at the least significant bit (LSB), and copy all the zeros (working from LSB toward the most significant bit) until the first 1 is reached; then copy that 1, and flip all the remaining bits."
This means that given a value @node, you can compute the rightmost set bit with the expression @node & -@node. Back to our filtering needs, to filter only levels representing ancestors of a positive input node, you use the following predicate: b3 > @node & -@node.
Listing 8 contains the definition of the Ancestors inline table-valued function, which returns all ancestors of an input node. To return only left ancestors of a given node, you query the function and filter only rows where the returned node is smaller than the input node. To return only right ancestors, you query the function and filter only rows where the returned node is greater than the input node.
CREATE FUNCTION dbo.Ancestors(@node AS INT) RETURNS TABLE
AS
RETURN
SELECT @node & b1 | b3 as node
FROM dbo.BitMasks
WHERE b3 > @node & -@node;
GO
You use the query in Listing 9 instead of the one presented earlier in Listing 6 to return intervals that intersect with an input interval [@l, @u].
DECLARE @l AS INT = 5000000, @u AS INT = 5000020;
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.Ancestors(@l) AS L
ON L.node < @l
AND L.node >= (SELECT MIN(node) FROM dbo.IntervalsRIT)
AND I.node = L.node
AND I.upper >= @l
UNION ALL
SELECT I.id
FROM dbo.IntervalsRIT AS I
JOIN dbo.Ancestors(@u) AS R
ON R.node > @u
AND R.node <= (SELECT MAX(node) FROM dbo.IntervalsRIT)
AND I.node = R.node
AND I.lower <= @u
UNION ALL
SELECT id
FROM dbo.IntervalsRIT
WHERE node BETWEEN @l AND @u
OPTION (RECOMPILE);
Notice the addition of filters that exclude nodes from the left path that are less than the minimum node in the table, and nodes from the right path that are greater than the maximum node in the table. Figure 10 shows the plan for the query in Listing 9.
Because the Ancestors function is inline, the plan scans the BitMasks table directly, without any overhead related to the function itself, as in the previous case. The statistics that I got for this query are logical reads: 66 against IntervalsRIT + 2 against BitMasks; CPU time: 0 ms.
Here I demonstrated how to query the data to handle interval intersection. To find out how to handle all Allen relations, see "Advanced interval queries with the Static Relational Interval Tree."
Potential for Integration in SQL Server
The RI-tree model and some of the optimizations allow very efficient handling of intervals in SQL Server. However, the math and computer science behind the solution might be a bit complex for some people. There's the potential to integrate the model within the SQL Server engine and make it transparent to the user. This could be achieved in a number of ways.
One option is to introduce a new type of index (call it interval index) that the user is responsible for creating using fairly basic syntax. Behind the scenes, SQL Server can compute the fork node for each interval and create two B-tree indexes like the indexes idx_lower and idx_upper that I described in the previous sections. Optimizer support also needs to be added to detect interval queries based on the predicates in the query filter, and when detected, internally process the request similar to the queries presented in Listing 9 (or Listing 6 based on the original model). With this approach, the only responsibility of the user is to create the interval index, and the engine is responsible for the rest. The original queries remain unchanged and just start running faster.
The indexes idx_lower and idx_upper that I presented earlier represent the most basic possible form of the needed indexes. As described in "Advanced interval queries with the Static Relational Interval Tree," you can handle all Allen interval relations based on the RI-tree model. For some of them, you need one index on the keylist (node, lower, upper) and another on (node, upper, lower). For intersection-only queries, suffice to define the indexes on (node, lower) and (node, upper), respectively (hence the option INTERSECTS_ONLY in the proposed index syntax). Also, if the query has additional equality-based filters on other columns, you need to add them to both indexes as leading keys. Finally, if you need to return additional columns besides the ones in the keylist from the query, and you want the indexes to cover the query, you need to add those columns as included ones. So the proposed interval index with all additional optional elements could look like this:
CREATE INDEX myindex
ON dbo.Intervals[(fcol1, fcol2, ...)] -- leading
equality-based filters
INTERVAL(lower, upper) -- interval columns
[INCLUDE(icol1, icol2, ...)] -- included columns
[WITH (INTERSECTS_ONLY = ON)]; -- determines keylist
Behind the scenes, SQL Server would create the following two regular B-tree indexes:
CREATE INDEX myidx1 ON dbo.Intervals([fcol1, fcol2, ...,]
node, lower[, upper]) [INCLUDE(icol1, icol2, ...)];
CREATE INDEX myidx2 ON dbo.Intervals([fcol1, fcol2, ...,]
node, upper[, lower]) [INCLUDE(icol1, icol2, ...)];
As I mentioned, the optimizer would detect interval queries based on the predicates in the query filter and use these indexes.
Remember that although I used integers in all my examples, in reality you often need to work with date and time intervals. Mapping date and time values to integers and back can add a lot of overhead when you do it in T-SQL. If implemented internally, using natively compiled code with some low-level language, Microsoft can do this much more efficiently—as you can imagine.
Other potential additions in SQL Server based on this model could be to get built-in functions that compute the fork node and ancestors, supporting date and time types, and again, doing it much more efficiently than the user-defined computations. Plus, these functions would allow a "roll your own" approach to advanced users. As I mentioned, I submitted a feature proposal to Microsoft.
Speed Up Interval-Related Requests
The traditional methods that people often use to handle interval-related requests, especially the common intersection request, suffer from fundamental performance problems. When there are two range predicates involved in your query filter, only one can be used as a seek predicate in a B-tree index. Using the ingenious RI-tree model and some further optimizations, you get index seeks based on a single range predicate, which results in significantly faster queries. However, the model and its optimizations could be a bit complex for some people. Because it's purely an engineering problem, and the engineering solution already exists, it could be completely encapsulated within the database engine and made transparent to the user. Again, I'd like to thank Hans-Peter Kriegel, Marco Pötke, and Thomas Seidl for coming up with the RI-tree model, as well as Laurent Martin for his additions.