Proper indexing can help speed up sorting operations. In "Tuning of a Different Sort," December 2001, InstantDoc ID 22927, I stressed that sorting (as when you use ORDER BY) applies only to the result set because the data in a relational database table has no specific order. Another T-SQL construct you can use in a query's SELECT list to limit the number of rows returned is the TOP clause. TOP is similar to ORDER BY in that it affects only the result set. Let's look at how to optimize the use of the TOP clause and examine which indexes can help improve the performance of TOP queries. To understand how to get the maximum performance from queries that use TOP, you need to know exactly what those queries are trying to accomplish. Let's begin with some details about the semantics of the TOP operator, then let's look at some internal SQL Server mechanisms that improve the performance of TOP queries.
TOP of the Class
You know that the WHERE clause in a query lets you limit the number of rows that a SELECT statement returns. But to use a WHERE clause, you need to know some of the possible values a column might have so that you can include them in (or exclude them from) the WHERE clause's conditions. What if you want to see only a screenful of rows from a table, but you have no idea what range of values exists in the table? Before SQL Server 7.0, your only choice was to use the option SET ROWCOUNT n to limit SQL Server to sending only n rows to the client. Every SELECT statement using the connection in which ROWCOUNT was set would stop sending rows to the client after n rows had been sent. To remove the row limit, you set ROWCOUNT to 0. For example, the following query returns only five rows from the Orders table:
SELECT * FROM orders
SET ROWCOUNT 0
In SQL Server 2000 and 7.0, you can instead use the TOP keyword in the SELECT list to get just five rows of the result set. The following statement is equivalent to the three-line batch above:
As I show later, you can extend the TOP clause to control the data you want to see. The SQL Server query processor has an internal mechanism to help optimize the performance of TOP queries, so in many cases, the performance of TOP queries is better than the equivalent queries that use ROWCOUNT. Using SET ROWCOUNT instead of TOP has one specific advantage, however. The integer argument you use with SET ROWCOUNT can be a variable, but the value you use with TOP must be a constant. For example, the first batch in Listing 1 returns five rows from the Orders table, but the second batch generates an error.
Keep in mind that data values in your table have no effect on which rows are returned when you use either this simple form of TOP or the SET ROWCOUNT option. SQL Server typically returns rows in whatever order is most efficient. With both these options, SQL Server simply stops sending rows to the client when it reaches the limit, regardless of the data values in the rows returned. Therefore, if you want more control over which rows are returned, you need to use other options.
The most straightforward extension to TOP is to request a percentage of rows from the result set, rather than an absolute number. TOP with PERCENT instructs SQL Server to round up to the nearest integer number of rows. The Northwind database's Orders table has 830 rows, so a statement that requests the TOP 1 PERCENT of those rows returns 9 rows:
One of the most common syntax errors that programmers make when using TOP is to leave out the column list. TOP controls just the number of rows; SQL Server still needs a SELECT list to know which columns to return. You can use an asterisk (*) to specify all columns, or you can specify certain columns, as the query below shows:
Another way of extending TOP is to use an ORDER BY clause to gain some control over the rows you want the query to return. The following query returns the row with the smallest Quantity value in the Order Details table:
ORDER BY quantity
If you want the largest Quantity value, you can use the DESC keyword with ORDER BY:
ORDER BY quantity DESC
In the example above, the quantity in the returned row is 130. However, other orders in the table also have a Quantity value of 130. SQL Server returned only one row because I used TOP 1 in the query. So this output doesn't really answer the question, Which Order Details row had the largest quantity of items ordered? If you want to see all the rows with the same quantity as the one listed, you can use the WITH TIES option to see all rows with a value equal to that of the last row returned. SQL Server lets you use this option only if your SELECT statement includes an ORDER BY clause:
ORDER BY quantity DESC
There's often confusion about the way that TOP WITH TIES works. I like to think of SQL Server carrying out this operation by first returning the number of rows specified by the TOP clause—in this case, one. After the first row is returned, SQL Server checks to see whether any other rows have a value equivalent to the first one returned and returns those as well.
Suppose you want the TOP 24 rows, in descending order by Quantity, and you use TOP WITH TIES. SQL Server would return 24 rows exactly, with the last row returned having a quantity of 91. The table has only one row with that quantity value, so no "extra" rows are returned and you get back exactly 24 rows. However, if you request TOP 25, you get 28 rows back because four rows contain the next lowest quantity value, 90. Only values equal to the last value returned will cause extra rows to come back.
Now let's look at some performance concerns associated with using TOP. If you're using TOP with ORDER BY, you're essentially asking for a sort operation to take place. But does SQL Server need to sort all the data to give you the first few largest or smallest values? The answer depends partially on the available indexes. If the table has a clustered index on the column you're sorting by, then SQL Server doesn't need to do any additional sorting. If the column in the ORDER BY clause has no indexes, SQL Server has to completely sort the data. Figure 1 shows the showplan_text output for the TOP 1 query above, which returns the rows that have the highest Quantity value.
As the output shows, with no index on Quantity, SQL Server must scan the entire table (by using a clustered index scan), sort all the data from highest to lowest (DESC), then find and return the quantity that's at the top of the list. That's a lot of work to find just one or two rows; even a nonclustered index could speed the process. Using TOP with ORDER BY on a column that has a nonclustered index is similar to using the FASTFIRSTROW hint I discussed last month. The leaf level of a nonclustered index stores all the data values in sorted order. A nonclustered index on Quantity has the highest value at one end, so SQL Server can access that value very quickly. The code in Listing 2 demonstrates the I/O cost of finding the TOP quantity value without a nonclustered index on Quantity. Then, it builds a nonclustered index on Quantity and examines the I/O cost of finding the TOP quantity value with the nonclustered index.
When I ran this script, it took 10 logical reads without the index and 4 logical reads with the index—an improvement of 60 percent. Six fewer page reads might not sound like much, but Order Details is a small table. When your tables have thousands of pages or more, you'll appreciate a savings of 60 percent. SQL Server uses the nonclustered index to find the highest quantity, then follows the bookmark to retrieve the data row. The query plan in Figure 2 shows the index scan.
One thing that might surprise you is that SQL Server also takes advantage of the nonclustered index if you use SET ROWCOUNT instead of TOP. If you've built the index as Listing 2 shows, the following batch also results in only four logical reads:
SELECT * FROM \[order details\]
ORDER BY quantity DESC
SET ROWCOUNT 0
So, does using TOP instead of ROWCOUNT yield better performance if you don't want to use special features such as PERCENT or WITH TIES? When the column in the ORDER BY clause has a clustered or nonclustered index, I've detected no difference. However, if no supporting index exists, TOP has an advantage because of an internal feature of the SQL Server 2000 query processor.
Ken Henderson, author of The Guru's Guide to Transact-SQL (Addison-Wesley, 2000), challenged the SQL Server Most Valuable Professionals (MVPs) to find a case in which TOP was reproducibly faster than SET ROWCOUNT. MVP Dejan Sarka came up with a script similar to the ones in Listing 3 and Listing 4 to prove that TOP is faster. Sarka's script gives the desired results only on SQL Server 2000, and the performance difference is in CPU time, not in I/O operations. The SQL Server feature that gives this performance improvement is the TOP-N engine. When running a query that includes TOP and ORDER BY clauses for which no supporting index exists, SQL Server maintains a sorted temporary table in memory. SQL Server replaces the last row of this table only when it finds a new satisfying row. For example, if your query is looking for the TOP 10 largest prices, SQL Server maintains a sorted in-memory temporary table of the 10 largest prices so far, along with pointers or keys for the rows that contain those prices. When SQL Server encounters a new price that's larger than the smallest of the 10 in-memory rows, it removes that smallest one and puts the new row in its proper sorted position in the in-memory table. So the I/O is the same as for a full sort; SQL Server must read all the rows in the table. But the processing effort is greatly reduced because a full sort is unnecessary, and only occasionally does SQL Server need to overwrite a row in the in-memory table.
Listing 3 contains T-SQL code to build a table to illustrate the improved performance of TOP in SQL Server 2000. The table needs to be very large for SQL Server to use the TOP-N engine, so this script populates table t1 with 100,000 rows by copying the table into itself 17 times.
Listing 4 shows the tests comparing TOP 10 SET with ROWCOUNT 10 and uses SET STATISTICS to report both the time taken and the I/O required to run the queries. Ignoring the times for parse and compile operations, Figure 3 shows the results that I got when I ran the script on my laptop. The effect of data caching isn't at issue because the script uses DBCC DROPCLEANBUFFERS to clear the data cache before each query. You can see that the I/O is the same for both queries, with 182 logical reads and 160 physical reads. However, the CPU time required for the TOP 10 query was only about 80 percent of the time needed with SET ROWCOUNT, and the elapsed time was reduced to about two-thirds.
If you have proper indexes in place, you might not think you need to use TOP for retrieving a subset of rows from a result set. However, when no useful index exists, TOP is a performance winner. Add to that the extra functionality of PERCENT and WITH TIES, and you'll find that TOP is a very useful feature indeed.