Downloads
22927.zip

Use indexes to retrieve sorted data more efficiently

Learning how SQL Server indexes are physically organized and how they can help speed up data access in a table is the most important step in understanding query tuning. By adding a good index to a large table that previously had no useful index for a query, you can obtain one of the largest gains possible for any type of SQL Server tuning. If a large table, with tens of thousands of pages or more, has no useful index, SQL Server must scan the entire table to find the data your query needs, even if you're interested in only one row.

Until SQL Server finishes searching the entire table, it doesn't know how many rows qualify. With a useful index, SQL Server might need to access only a handful of pages as it traverses each index level. The difference between accessing tens of thousands of pages and accessing only a handful of pages makes indexes a powerful tuning tool.

But even when you're concerned with data in just one table, you need to address queries other than those that search for rows meeting a certain condition. Sometimes, queries involve organizing the data in a table. Queries involving the GROUP BY, DISTINCT, ORDER BY, and TOP clauses fall into this category. Technically, ORDER BY and TOP apply to the resultset a query generates, not to the table data itself; however, for query-tuning purposes, that distinction isn't relevant.

Let's look first at ORDER BY, which requests data from SQL Server in a particular sorted order. One reason to look at ORDER BY first is that you can apply the same principles that let you tune queries that involve sorting to queries that involve the other organizational clauses I mentioned: GROUP BY, DISTINCT, and TOP. Although ORDER BY applies to a SELECT statement's resultset, SQL Server can sometimes use indexes on the data to help it retrieve the rows in the order you want them so that it doesn't need to further sort the resultset.

Presorting the Data


One obvious way that you can use indexes to make retrieving sorted data more efficient is to create a clustered index on the column that you're going to order by. When you create a clustered index, SQL Server reorganizes the data pages so that the rows are logically stored in clustered-index order. SQL Server doesn't necessarily store the data physically on the disk in clustered-index order, but while creating an index, SQL Server attempts to physically order the data as close to the logical order as possible. Each page in an index's leaf level has a pointer to the page that logically precedes the current page and to the page that logically follows the current page, thereby creating a doubly linked list. The sysindexes table contains the address of the first leaf-level page. Because the data is guaranteed to be logically in clustered-index order, SQL Server can just start at the first page and follow the index pointers from one page to the next to retrieve the data in order.

Now let's look at a query plan involving a sort on a clustered-index column. In the Northwind database, the Customers table has a clustered index on the CustomerID column. (SQL Server automatically created the index when the primary key was defined on the CustomerID column.) If I select all the rows from the Customers table and sort by the CustomerID column, I might expect SQL Server to use the clustered index to retrieve the rows already sorted in the correct order. After running SET SHOWPLAN_TEXT ON, run the following query:

--Query 1:
SELECT * FROM Customers
ORDER BY CustomerID

Here's the SHOWPLAN output, which tells you that SQL Server scanned the clustered index to retrieve the data, just as you'd expect:

StmtText
-----------------------------------------  
|--Clustered Index Scan(OBJECT:
(\[Northwind\].\[dbo\].\[Customers\].
\[PK_Customers\]), ORDERED FORWARD)

For comparison, let's execute Query 2, which requests the data sorted by a column that doesn't have an index—Country:

--Query 2:
SELECT * FROM Customers
ORDER BY Country

The SHOWPLAN output includes the extra step of sorting the data, so you can see that the query is more efficient if a clustered index already exists on the sort column:

StmtText
--------------------------------------------
|--Sort(ORDER BY:(\[Customers\].\[Country\] ASC))
    |--Clustered Index Scan(OBJECT:(\[Northwind\]
.\[dbo\].\[Customers\].\[PK_Customers\]))

Note that the query plan for the first query includes the ORDERED FORWARD option, indicating that the index must be scanned in a particular logical order. The second query plan doesn't include that option; the order in which SQL Server scans the clustered index is irrelevant. The only important occurrence is that all the key values are accessed so that they can then be sorted.

Turn It Around


You're probably aware that T-SQL's ORDER BY clause lets you request that the data be returned sorted in descending order, as Query 3 illustrates:

--Query 3:
SELECT * FROM Customers
ORDER BY CustomerID DESC

SQL Server can use the same clustered index to retrieve the data in both ascending and descending order: Because the pages at the leaf level are stored in a doubly linked list, SQL Server can simply traverse the pages from last to first. The SHOWPLAN output for Query 3 is almost identical to the plan for Query 1:

StmtText
--------------------------------------------
  |--Clustered Index Scan(OBJECT:(\[Northwind\]
.\[dbo\].\[Customers\].\[PK_Customers\]), ORDERED BACKWARD)

The only difference is that the clustered index scan for Query 3 is a BACKWARD scan.

By letting you create an index in descending order, SQL Server 2000 enhances the ability of clustered indexes to support sort operations. You can use the DESC keyword in the CREATE INDEX statement or in a constraint definition for a primary key or unique constraint. But SQL Server already had the ability to traverse a clustered index in either ascending or descending order, so why would you need to store the index in descending order? You wouldn't, at least for a single-column index.

The usefulness of controlling the order of the stored index keys is more evident when you need to sort by multiple columns and you need a different order for each of the sort columns. For example, suppose you have a table containing the names of all your salespeople and the total revenue each generated during the previous year. If you want to list your salespeople by state, with the states in alphabetical order, and within each state you want to list the top-earning salespeople first and the lowest-earning last, your query might look like Query 4:

--Query 4:
SELECT Name, Revenue, State
FROM SalesRevenue
ORDER BY State, Revenue DESC

(Note that you can't execute Query 4 because it refers to the nonexistent SalesRevenue table.)

A clustered index on State and Revenue in SQL Server 7.0 or 6.5 would store all the data in ascending order and could be traversed in either ascending or descending order. With such a clustered index, SQL Server could easily find all the rows in state order by following the clustered-index pages. But then, returning the rows for that state in a different order than they're stored in—listing the highest revenue within each state first—wouldn't be easy. In fact, SQL Server would have to completely re-sort the data in the desired order. For example, in the Northwind database's Order Details table, the clustered index is on the OrderID and ProductID columns. Query 5 asks for the data to be sorted by the clustered-index columns, with the first column ascending and the second column descending:

--Query 5:
SELECT * FROM \[Order Details\]
ORDER BY OrderID, ProductID DESC

The SHOWPLAN output shows that SQL Server has to sort the data even though it scans the clustered index:

StmtText
--------------------------------------------
  |--Sort(ORDER BY:(\[Order Details\].\[OrderID\] ASC,  
\[Order Details\].\[ProductID\] DESC))
    |--Clustered Index Scan(OBJECT:(\[Northwind\]
.\[dbo\].\[Order Details\].\[PK_Order_Details\]))

As a comparison, you can run Query 5 without the DESC keyword so that the resulting order is the order in which all the data is stored. Note that the plan for the query without DESC indicates that no sorting is necessary. Although you can create a SQL Server 2000 index in descending order, the only real reason for doing so is if you know that you frequently need to return data sorted by multiple keys, and the sort direction isn't the same for all the sort keys.

How Fast Is Fast?


In clustered indexes, SQL Server keeps the data rows in logically sorted order. For nonclustered indexes, SQL Server stores the index keys in the leaf level in logically sorted order, along with bookmarks to the actual data rows. Why can't SQL Server take advantage of that sorting in the nonclustered index with queries in which you're sorting by a nonclustered-index key? By default, SQL Server doesn't take a nonclustered index into account when sorting by a column with a nonclustered index. (The exception is the case in which the nonclustered index completely covers the query, so SQL Server doesn't need to follow any bookmarks or access the data pages.)

To demonstrate that SQL Server won't use a nonclustered index for sorting, you can build a nonclustered index on the Order Details table's Discount column:

CREATE INDEX DiscountIndex ON \[Order Details\]
(Discount)

Then, execute Query 6:

--Query 6:
SELECT * FROM \[Order Details\]
ORDER BY Discount

The query plan

StmtText
--------------------------------------------
|--Sort(ORDER BY:(\[Order Details\].\[Discount\] ASC))
    |--Clustered Index Scan(OBJECT:(\[Northwind\]
.\[dbo\].\[Order Details\].\[PK_Order_Details\]))

reveals that SQL Server scans the clustered index (the data itself), then performs a sort. SQL Server's sorting algorithms are so fast that the optimal overall performance comes from retrieving all the data, then performing a complete sort, rather than following any bookmarks from the nonclustered index. Although this efficient sorting lets SQL Server return the complete resultset in minimal time, no data can be returned until all the data is sorted. Consider: The Order Details table has only 2155 rows. If I performed the same kind of sort on a table with hundreds of thousands or even millions of rows from a client application that needs to use the sorted data to fill a drop-down list, I'd be waiting a long time to see results.

Here's a query-tuning trick you can use to reduce the initial wait time for queries that sort data on a nonclustered-index key. Generally, you shouldn't use index hints in production applications if you can avoid them. In fact, in most cases, the need to use a hint might be considered a bug in the SQL Server query optimizer. However, for the case of sorting data on a nonclustered index key, you can safely use an index hint. Using the hint doesn't mean that the optimizer isn't working right, and you don't have to be concerned that the hint might no longer be necessary or beneficial after a service pack "fixes" the optimizer. One form of index hint is called FASTFIRSTROW; you can use it as a table hint in your FROM clause. This hint tells the optimizer to use a plan involving a nonclustered index on the sort column. To use the nonclustered index, SQL Server traverses the index's leaf level, retrieving each index key in sorted order; then it follows the bookmark for every index key to find the corresponding data row. (If no such index exists, SQL Server ignores the hint.) In the Order Details table, this process entails 2155 bookmark-lookup operations. The first few rows return very quickly, so SQL Server can begin to fill in your user's drop-down list. However, the time to complete the sort is much longer than if you didn't use the hint.

You can use the SET STATISTICS TIME ON option to see the difference between using and not using FASTFIRSTROW. The code in Listing 1 turns on this SET option, then executes Query 6 both with and without the FASTFIRSTROW hint. On my Toshiba Tecra 8000 laptop computer (with 256MB of RAM, running SQL Server 2000 Enterprise Edition on Windows 2000 Server), the first query (without the hint) used 30 milliseconds (ms) of CPU time. With the hint, the second query used 70ms of CPU time—more than twice as long as the first query. This dataset is so small that I didn't notice the delay in returning the first row when I didn't use the hint. However, when I've run similar queries on tables with just 20,000 rows, the delay before any data is returned was noticeable when I didn't use the FASTFIRSTROW hint; I got no delay with the hint.

When you have a nonclustered index on the sort column, you can choose whether to use the hint. You decide what's more important to you: Do you want the sort to take the least amount of total processing time, or do you want the sorted data to start coming back to the client as quickly as possible? In other words, do you want to minimize initial response time, or maximize throughput?

I mentioned that using the FASTFIRSTROW hint was just one form of the hint to control SQL Server's use of a nonclustered index on a sort column. Another form of this hint, using the OPTION clause with the FAST N hint, lets you specify how many rows (n) SQL Server should return as quickly as possible. For example, using the hint OPTION (FAST 10) tells SQL Server to return the first 10 rows as quickly as possible, then access the rest of the rows in a way that maximizes throughput.

Both clustered and nonclustered indexes can be useful for sorting more efficiently. If sorting is an important part of your applications' data-management operations, you need to consider those operations carefully when deciding the optimal indexes to build on your tables. Next time, I look at indexing for other organizational operations, such as TOP, GROUP BY, and DISTINCT.