A custom tool that skirts the limitations that traditional techniques entail
| Executive Summary:|
CreateIntegersTable is a multistatement table-valued function written for Microsoft SQL Server 2005. You can use this custom tool to create a numbers table that works around nearly all the limitations that traditional techniques for building these tables usually involve. For example, traditional techniques might themselves involve cursors, loops, or identity columns. The techniques might need temporary tables or actual database tables, which could place expensive demands on database server resources, and the techniques usually lack the flexibility for a developer who needs a different kind of numbers table: one that has starting—and maybe ending—negative values or increment values greater than 1. CreateIntegersTable skirts nearly all these drawbacks.
In the database world, a numbers table, or tally table, is simply a table of unique integers. These one-column tables usually start with a value of 1, increment by 1, and end at some fixed integer. Developers can often use them to eliminate cursors and loops, parse strings, optimize queries, identify number sequence gaps, and more. A Google search for "tally table" or "numbers table" shows many different ways to build them, but the techniques generally have drawbacks. The techniques might themselves involve cursors, loops, or identity columns; cursors and loops usually run slowly, and identity columns can become a hassle. The techniques might need temporary tables or actual database tables, which could place expensive demands on database server resources, and the techniques lack the flexibility for a developer who needs a different kind of numbers table: one that has starting—and maybe ending—negative values or increment values greater than 1.
To get around almost all these drawbacks, I wrote CreateIntegersTable, the multistatement table-valued function shown in Listing 1. Written for SQL Server 2005, CreateIntegersTable takes input parameters @start_int, @step_int, and @end_int and returns a single-column table variable of type BIGINT. CreateIntegersTable relies on a common table expression (CTE) and simple table inserts. Although it uses one WHILE loop, the loop is executed infrequently—a 60-million–row result set, for example, loops at most 11 times. CreateIntegersTable avoids cursors, temporary tables, database tables, and identity columns. It handles both positive and negative start and end values as well as step values greater than 1. It can even make full use of the BIGINT space, both in the data type of the numbers themselves and in the total number of rows returned. Almost all the variable names in function CreateIntegersTable and ActualEndIntegerCalculator, a related function explained later, include "int," to clarify that these functions deal only with integers and big integers.
Although CreateIntegersTable starts with USE \[master\], the function can go in any database. The script drops the function if it already exists in the target database, then creates a fresh instance of the function. The @start_int and @step_int parameters default to 1.
To simplify the code, the function assumes that the first integer in the integer range is at least 1. The function handles @start_int values greater than 1 later on but a @start_int value less than 1 needs special handling. When the start value is less than 1, the IF test adds ABS(@start_int) to the @end_int value. Because the function builds ranges starting at 1, not 0, it also adds 1 to @end_int to compensate by shifting the end value, as Figure 1 shows.
The IntegersTableFill CTE (callout A) inserts rows into @IntegersTable, the table variable that holds the generated integers, up to a maximum @end_int value of 32,767. Variable @end_int specifies the maximum value in the table; because the function has a default @step_int value of 1, the @end_int value of 32,767 applies to both the maximum value in the table and the number of rows in the table. If the function receives a @step_int value greater than 1, however, the number of rows in the table at this point will be less than 32,767 while the maximum value in the table remains 32,767. Using a CTE is fast, and, for this initial set of rows it has high performance—and the maximum number of rows involved, 32,767, makes the function's other features possible. Parameter @step_int increments the integers correctly. If @end_int is less than 32,767, the CTE WHERE clause uses @end_int instead of 32,767. This behavior improves efficiency, as I explain later.
The CTE OPTION (MAXRECURSION 32767) limit forces an @end_int maximum of 32,767. The MAXRECURSION option allows only 32,767 levels of recursion, which would mean a maximum of 32,767 integers. We could specify (MAXRECURSION 0) or even leave this line out of the function, but then nothing would protect the function from infinite recursion, which would crash it.
The next IF test (callout B) deals with @end_int values that exceed 32,767. Although the IntegersTable CTE that I just described works for values less than or equal to 32,767, numbers larger than 32,767 require something else because of the MAXRECURSION limit. For such end values, the function uses an INSERT INTO with a SELECT (callout E) that doubles the number of integers already in @IntegersTable. In other words, this particular INSERT INTO can handle row insertions only in successive groups of 2n power (2 ^ n) multiples of 65,534 (64KB) rows. Figure 2 shows these calculations.
The function inserts as many integers as we want, up to the BIGINT limit, but starting at integer 32,768, it must do so in steps, using a SELECT statement to double the number of rows already in @IntegersTable before it inserts new rows. Therefore, the function must figure out how many row groups it should handle. The SELECT statement in callout E does just that. Say that @end_int is 5,000,000. From Figure 2, the function should then go up to row group 8, the group that includes row 5,000,000 and ends at row 8,388,352. Based on the way the math works, which I described earlier, using the last number of group 8 as an example, we start with the equation
65534 x 2n = 8388352
and try to get a formula for n, the number of row groups. With natural logs and some algebra, we get
n = (ln(128))/(ln(2))
which yields n = 7. In Figure 2, the row group numbers start at 1, not 0, so add 1 to n, and you get n = 8. A general formula for n would look like this:
n = (ln(x/65534))/(ln(2))
To make the numbers work the way we need, round n up to the next integer if it's not an exact integer, and then add 1. The callout C does all this; we multiply @end_int by 1.0 because the math works reliably with real numbers. With this expression, for @end_int = 5000000, @int_row_groups = 8. The earlier number shift is important here. If @start_int is less than 1, the natural log function at best would have returned the wrong value and at worst would have crashed. Although this statement returns 0 for @end_int less than 32,768, @end_int will never drop below 32,768 at this point in the function, so we're safe.statement at
The function is almost ready to make the row insertions from integers 32,767 to @end_int. At callout D, it initializes variable @current_row_group to 1 and uses this variable as the loop counter for the WHILE loop. Variable @num_row_groups has the number of row groups the function handles. The WHILE loop loops once for every row group. Before the loop starts its work, variable @last_int has the last, or highest, value already inserted into table variable @IntegersTable. With @last_int_inserted, the loop knows the integer at which it should start inserting rows. Row group 1 is a special case, and @last_int_inserted helps there, too. Row group 1 covers rows 1 to 65,534, but the IntegersTableFill CTE already inserted at most 32,767 rows. Variable @last_int_inserted equals 32,767, so in the loop, the first row group should start at integer 32,768. Variable @last_int_inserted makes this possible. If the loop is at the last row group, it uses parameter @end_int to stop. Otherwise, it just keeps going. Setting @end_int this way makes the function more efficient, as I mentioned earlier. For example, if the user picked 16,776,706 for @end_int, the function would end at row group 10. But if the SELECT statement didn't stop at @end_int, the function would insert integers16,776,70 to 33,553,408 and throw them away later. Variable @end_int prevents this waste.
Now the function uses the INSERT INTO statement (callout E) to insert the integers. Earlier, I mentioned that the INSERT INTO statement doubles the number of integers already in @IntegersTable. Look closely at the SELECT statement to see why. As a separate SELECT, it adds @last_int_inserted (the latest, largest integer in @IntegersTable) and possibly @step_int, to every integer already in @IntegersTable. Then the INSERT INTO statement inserts this new row set into @IntegersTable. Thus, for each pass of the WHILE loop, the largest integer inserted is (@last_int_inserted + @last_int_inserted) plus maybe @step_int, doubling the number of integers in @IntegersTable. Only in the last row group does the last result set integer exceed @end_int. The WHERE clause tests for this condition, optimizing the INSERT statement. The SELECT statement uses @step_int almost exactly the way the IntegersTableFill CTE used it, except that here, it subtracts 1 to account for the fact that it operates on a "base" set of integers that starts at 1, not 0. Finally, the loop increments @current_row_group.
When the loop ends, table variable @IntegersTable has a "raw" list of integers, but input parameter @start_int might differ from 1, which is the value the function assumed, and the function needs to compensate. First, if @start_int is less than 1, the UPDATE statement (callout F) shifts the integers back to that @start_int value. The function recalculates @end_int, and the DELETE statement trims @IntegersTable accordingly. The function inserts everything into @FinishedIntegersTable, which it returns to the T-SQL statement that originally called it, and then it ends.
If @step_int is greater than 1 and @end_int is greater than 32,767, the last integer the CTE inserts could have a value greater than 32,767 and the largest integers in the row groups would exceed the integers in the row groups that Figure 2 shows. If CreateIntegersTable handled this scenario, it might become even more efficient, but it would also become more complex.
Put It to Work
Listing 2 shows how to use CreateIntegersTable. Declare a table variable and run an INSERT INTO statement with a SELECT statement that calls the function and specifies parameters. That's it.
Unfortunately, CreateIntegersTable adds entries to the tempdb log file, and for a very large integer list, this could become a big problem because the log file could use a lot of system resources and affect performance. SQL Server 2005 has no way to prevent this situation; if it did, the function would have even better performance and use even fewer resources. I researched for a solution but found nothing. If you know how to prevent this problem in eitheror 2005, please send me an email to tell me about it.
Function CreateIntegersTable returns a set of rows that starts at the value of parameter @start_int—each time, every time. However, given specific @start_int, @step_int, and @end_int parameters, the final integer in the row set that CreateIntegersTable returns might differ from the @end_int parameter. This situation would happen because of mod function (i.e., division-remainder) math and could occur when the @step_int parameter is greater than 1. You can use scalar value function ActualEndIntegerCalculator, which Listing 3 shows, to find the actual, largest integer that CreateIntegersTable will return given a specific set of start, step, and end parameters. This function can go in any database, and this example shows how to use it:
SELECT dbo.ActualEndIntegerCalculator (1455, 45, 22401)
If parameter @start_int is less than or equal to 0, ActualEndIntegerCalculator starts at @start_int and finds the multiple of @step_int closest to @end_int. If @start_int is greater than 0, the calculator function shifts the @start_int and @end_int parameters so that @start_int is 0, calculates @end_int based on this shift, as before, and then adds back @start_int. If @start_int is greater than @end_int, the function returns NULL.
With the CreateIntegersTable and ActualEndIntegerCalculator functions, you have efficient, flexible, high-performance tools to build the numbers tables you want, when you want.