Downloads
19737.zip

NULLs, UNION ALL, and random sorting

Editor's Note:Send your experts-only T-SQL tips to Itzik Ben-Gan at blackbelt@sqlmag.com. If we use your tip in the magazine, you'll receive $100 and an exclusive T-SQL Black Belt shirt.

Sorting isn't always as simple as it seems. From time to time, T-SQL programmers receive requests for output sorted in a unique or unusual way and must handle both the logical aspects and the performance implications of those requests. For example, whenever you provide a sorting solution that involves NULLs, you must determine how to handle them. The sorting ideas of Alexey Ruban—a software engineer, developer, and DBA from Ukraine—and SQL Server Most Valuable Professionals (MVPs) Fernando G. Guerrero and Tibor Karaszi motivated me to write this article.

Getting NULLs to Sort High


NULLs are special, no doubt. Like some children, they seem to require special care and attention. Sorting output by columns that allow NULLs is no exception. NULL stands for either an unknown or an irrelevant value. So, in general, you can't assume that one NULL is the same as another. For example, to find rows that have NULLs in a specific column, you might issue the following query:

SELECT * FROM T1 WHERE colwithnulls = NULL

However, according to the ANSI standard, you won't get any rows in the output, even if the colwithnulls column contains some NULLs. To get all rows that contain NULLs in a certain column, you must use the IS NULL clause instead of = NULL. Or, you can turn off the ANSI_NULLS session option so that SQL Server treats NULL comparisons that use an equals sign as if you had specified IS NULL.

Regardless of the ANSI_NULLS setting, certain operations (e.g., the UNIQUE constraint in SQL Server, GROUP BY, ORDER BY) treat NULLs as if they were equal to one another. The ANSI standard specifies that NULLs should sort into one group as if they were equal, but it doesn't specify whether NULLs should sort lower or higher than actual values (i.e., whether they should appear at the beginning or the end of the output). Therefore, different ANSI-compliant relational database management systems (RDBMSs) implement NULL sorting differently—some placing NULLS at the end and some placing them at the beginning.

SQL Server sorts NULLs as if they're the lowest value. For example, let's select all rows from the Publishers table in the Pubs database, ordering by state. Then, use a new SQL Server 2000 optimization technique, which lets you create indexes on computed columns, to put an index on the state column:

CREATE INDEX idx_nci_state ON Publishers(state)
SELECT * FROM Publishers
ORDER BY state

Figure 1 shows this query's execution plan. The query performs an ordered index scan on the idx_nci_state index you created. This scan, which improves the query's performance, prevents an explicit sort operation from taking place because the index is already sorted by the state column. Note that the Publishers table is very small. If the table were bigger, the optimizer might choose not to use an index. Using an index involves visiting a page to look up the data row for each key in the index, and the I/O involved might cost more than simply scanning the table and performing an explicit sort. Nevertheless, for this demonstration, the Publishers table's size is convenient.

Suppose you want to sort NULLs high. You can use a CASE expression in the ORDER BY clause:

SELECT * FROM Publishers
ORDER BY CASE
           WHEN state IS NULL THEN 1
                   ELSE 0
                   END, state

The first value in the ORDER BY clause will be either 1 (when state is NULL) or 0 (when state has an actual value); this value determines the output's primary order. The values in the state column determine the output's secondary order so that you get the desired order for the rows in the output. Note that the ANSI-92 standard specifies that the ORDER BY clause can use only expressions that appear in the SELECT list, but ANSI-99 removed this limitation. SQL Server has never had that restriction.

Figure 2 shows the execution plan for the above query. Notice that a performance problem exists. The query performed an explicit sort operation based on the 1 and 0 values that the compute scalar operation calculated. You can't use the state column index here because the state column doesn't determine the output's primary order; a computation based on the state column does. If you use SQL Server 2000, you can handle this problem elegantly. Add the following computed column to the Publishers table:

ALTER TABLE Publishers
 ADD statenullslast AS CASE
        WHEN state IS NULL THEN 1
        ELSE 0
        END

Create a composite index on the statenullslast and state columns:

CREATE INDEX idx_nci_statenullslast ON Publishers(statenullslast, state)

Now, try the following query:

SELECT pub_id, pub_name, city, state, country
FROM Publishers
ORDER BY statenullslast, state

Figure 3 shows the execution plan for this query. Note that the plan uses the new index instead of performing an explicit sort operation.

Sorting UNION ALL Inputs Differently


You use a UNION ALL query to vertically join several inputs to make a unified row set. Note that UNION ALL doesn't eliminate duplicates, in contrast to UNION, which does. Suppose you partition the Sales table (in the Pubs database) into three tables, each containing a different year:

SELECT * INTO Sales1992 FROM Sales
        WHERE YEAR(ord_date) = 1992

SELECT * INTO Sales1993 FROM Sales
        WHERE YEAR(ord_date) = 1993

SELECT * INTO Sales1994 FROM Sales
        WHERE YEAR(ord_date) = 1994

You can't sort the inputs separately on different columns because a UNION ALL operation results in a unified set of rows; instead, you can have only one ORDER BY operation for the entire set. For example, the following query isn't legal:

SELECT * FROM Sales1992 ORDER BY
UNION ALL
SELECT * FROM Sales1993 ORDER BY
UNION ALL
SELECT * FROM Sales1994 ORDER BY

Ruban submitted the solution that Listing 1 shows, which lets you overcome this limitation. The sort0 column has a different value (i.e., 1, 2, or 3) for each input, which determines the first order for the output rows. For each group of values in the sort0 column, the query computes another column from the values to determine the group's secondary order. For example, for the first input, sort1 holds the qty column's values and all the other rows have NULL in that column. Similarly, for the second input, sort2 holds the ord_date column's values and all the other rows have NULL in that column, and so on.

Random Sort


You might think a request to sort output randomly is strange; nevertheless, it appears from time to time in public newsgroups. If you need the output rows in random order, Guerrero and Karaszi have provided some ideas about how to solve this problem. Obviously, you need a randomizing function. T-SQL supplies the RAND(int_seed) function, but you need to consider some important factors when you use it.

When you supply an argument to the RAND() function, the function doesn't produce a random value but rather a computation based on the supplied seed (i.e., for the same input value, the function always returns the same output). If you invoke the RAND() function with no input value, the system generates its own seed. Thus, you might think that if you sort your query by the RAND() function's result, you will get a random sort. To see what actually happens, create the following T1 table:

CREATE TABLE T1 ( INT IDENTITY)
INSERT T1 DEFAULT VALUES
INSERT T1 DEFAULT VALUES
INSERT T1 DEFAULT VALUES
INSERT T1 DEFAULT VALUES
INSERT T1 DEFAULT VALUES

Now, run the following query several times:

SELECT *, RAND() AS FROM T1
ORDER BY

Note that each time you run the query, SQL Server evaluates the RAND() function only once. Thus, all the rows will have the same value in the rnd column, making it inefficient for the task. Another option might be to use an integer key column as the seed:

SELECT *, RAND() AS FROM T1
ORDER BY

But keep in mind that the RAND() function always returns the same value for the same input. So if you issue this query several times consecutively, the output will remain the same unless the table changes in the meantime.

One way to solve this problem is to create a temporary table to hold only the table's keys and a float column to hold the random values. You can create a loop that iterates through all the key values, and for each key, invokes the RAND() function with no input, updating the float column with the random value generated. When the loop finishes executing, you join the original table to the temporary table and order the result by the random values' column. For example, to produce random output for the Authors table in the Pubs database, you can use the code that Listing 2 shows. I use the lock hints TABLOCK SERIALIZABLE to prevent modifications to the Authors table until the transaction finishes. (Dejan Sarka also presents an elegant solution to random sorting in "Tips from the SQL Server MVPs," http://www.sqlmag.com, InstantDoc ID 19842.)

Query Performance Counts


As NULLs, UNION ALL, and random sorting show, sorting issues aren't always trivial. When you get a complex sorting task, you first must solve the problem logically, providing the required output, but performance issues also figure into the mix. Computed columns and indexes can improve your queries' performance.