The Tipping Point

Understanding how SQL Server's query optimizer makes decisions

What is in this article?:

  • The Tipping Point

DBAs and SQL Server developers who are new to SQL Server tuning are often puzzled by why the query optimizer in SQL Server would prefer to scan an entire table instead of using a nonclustered index that seems to fit the query perfectly. To explain this mystery, I'll use the following query, which retrieves data from the SalesOrderDetail table in the AdventureWorks database:

SELECT LineTotal

FROM sales.SalesOrderDetail

WHERE ProductID = 870

If you have any version of the AdventureWorks sample database, you can follow along. If you don't, you can download it from CodePlex. (See also, "Performance Tip: Find Your Most Expensive Queries.")

The Demonstration

To begin, click the Include Actual Execution Plan button on the toolbar in SQL Server Management Studio (SSMS). This will allow you to see the execution plan chosen by the optimizer.

Next, type the following command in the query window:

USE AdventureWorks2008R2;

SET STATISTICS IO ON;

GO

SELECT LineTotal

FROM sales.SalesOrderDetail

WHERE ProductID = 870;

Note that you might need to change the database name (AdventureWorks2008R2) to the version of the database you're using.

Be sure to include the SET STATISTICS IO ON statement. It'll cause SSMS to display the query cost (in logical page reads) on the Messages tab in the results pane. If you're unfamiliar with the SET STATISTICS IO ON statement or the Include Actual Execution Plan option, see the article "SQL Server Query Tuning." This article also explains why logical page reads are the best metric to use to measure query cost.

In its entirety, the SalesOrderDetail table has 121,317 rows of data contained in 1,240 8KB data pages. It has a nonclustered index on ProductID and a multi-column clustered index on SalesOrderID and SalesOrderDetailID. With the value of 870 chosen for the ProductID filter, this query will return 4,688 rows or about 4 percent of the total number of rows in the table.

In light of this information, you'd expect the query engine to execute a seek on the ProductID index. However, if you execute this query and examine the Execution Plan tab in the results pane, you'll find that instead of doing a seek on the ProductID index as you expected, the optimizer preferred a full scan of the clustered index. It preferred to read all the data contained in the table rather than just the 4 percent of the rows where the ProductID equals 870. Looking at the Messages tab in the results pane will confirm that the query is reading all 1,240 data pages in the table.

Is it a mistake? No, it's not a mistake. That's the optimum query plan. If you don't believe it, change the query slightly to force the use of the ProductID index:

SELECT LineTotal

FROM sales.SalesOrderDetail

WITH (index=IX_SalesOrderDetail_ProductID)

WHERE ProductID = 870

After you execute this query, look at the Messages tab. You'll see that SQL Server had to read 14,377 data pages to return the same result. Forcing the "right" index was roughly 10 times more expensive than the clustered index scan chosen by the optimizer.

The Tipping Point

The key to this mystery is understanding a crucial difference between clustered and nonclustered indexes. Most DBAs and SQL Server developers know that the clustered index contains the data itself in the leaf level of the index, whereas nonclustered indexes contain pointers to the pages containing the data. However, based on my experiences in the field, it seems to me that the implications of that difference aren't widely understood.

At the end of a seek or scan on a clustered index, the data has been retrieved and nothing further needs to happen. At the end of a seek or scan on a nonclustered index, the query engine has only pointers to the actual data. The data itself has yet to be returned.

When a nonclustered index is used, the data is retrieved one row at a time with a series of single-page I/O operations called Row ID (RID) lookups. A page returned by a RID lookup might contain several qualifying rows. If so, it won't be read just once. It'll be read several times, once for each qualifying row.

The bottom line is that a nonclustered index seek will typically use at least as many page reads as the number of rows returned and, depending on various other conditions, perhaps many more. The point at which the number of page reads required by the RID lookups exceeds the total number of data pages in the table is the point at which a clustered index scan becomes less expensive than the nonclustered index seek. This point is called the tipping point. When the tipping point is exceeded, the optimizer will usually prefer a scan of the clustered index instead of a nonclustered index seek.

It's important to know that nonclustered covering indexes don't use RID lookups, so they don't have this problem. For this reason, they can be lifesavers in some situations. Covering indexes contain all the data required by a specific query or set of queries. The data doesn't need to be retrieved from the base table because all the necessary data is contained in the index itself. If you aren't familiar with covering indexes, you might want to check out the article "SQL Server Tuning - The Covering Index."

 ยป

Please or Register to post comments.

IT/Dev Connections

Las Vegas
September 30th - October 4th

Paul ThurottOur Experts will show you:
• Common SQL Server
Problems
• Best Practices for T-SQL
• SQL Server Integration
Services
• Database Development

Come See Michael Otey & Tim Ford in Person!

Early Registration Now Open

From the Blogs
May 21, 2013
blog

A Common Misconception about MAXDOP

Out of the box, SQL Server is (and has been) able to take advantage of multiple processors/cores without any effort on behalf of administrators....More
May 9, 2013
blog

My ISO 8601-Compliant Signature 2

My family recently just "officially" announced that we're in the process of adopting a child from South Africa. We're quite excited, of course, but there's a ton of paperwork to do—along with the need for gobs of signatures....More
May 8, 2013
blog

Use SSIS for ETL from Hadoop

In this blog post, Mark Kromer walks you through using SSIS as a way to use ETL techniques using Microsoft's Hadoop on Windows (HDInsight) as a source using Hive connectors...More
SQL Server Pro Forums

Get answers to questions, share tips, and engage with the SQL Server community in our Forums.