Find out what makes the optimizer tick
The query optimizer is one of the most complex pieces of code in the SQL Server database engine. The optimizer's job is to generate a query plan, which is a set of steps that SQL Server will take to carry out a user query. The optimizer makes decisions such as which indexes to use for each table, which join method to use and in which order to process the joined tables, whether to build an internal worktable to hold intermediate results, whether to perform an internal sort, and whether to run the query on multiple processors.
In the next few months, I'll tell you about many of the changes to SQL Server's query optimizer that Microsoft introduced in the last two releases. This month, I look at the optimizer's background and describe ways the optimizer has changed. In upcoming articles, I'll show you examples of queries that are optimized differently in SQL Server 2000 and 7.0 than in previous releases and that perform better because of the changes.
The Old Days
Before SQL Server 7.0, the optimizer had very few choices to make. It evaluated each possibly useful index and either chose one of them or chose to perform a table scan. SQL Server could process a JOIN operation in only one way, so the optimizer needed to decide only the order in which to join the tables. SQL Server could process GROUP BY and DISTINCT queries in only one way, and a query couldn't run on multiple processors.
The optimizer in SQL Server 6.5 and earlier releases was so straightforward that once you understood the way it worked and the way SQL Server could use indexes, you could predict fairly accurately what the query plan would (or should) be, even before running the query. If the optimizer didn't produce the plan you thought it should, you could enable a couple of trace flags to see the steps that the optimizer took to determine the plan and all the potential plans the optimizer considered. In addition, the optimizer had so few choices to make that if it didn't make the best choice, Microsoft considered it a bug. In effect, you had an implied guarantee that the optimizer's query plan was always the best plan.
For SQL Server 7.0, Microsoft completely rewrote the optimizer and added many query-processing techniques. The updated optimizer was completely modular so that as Microsoft developers added new processing techniques to the product, they could easily amend the optimizer to consider each new technique in any query plan.
The query optimizer in SQL Server 7.0 and later releases is orders of magnitude more complex than its predecessors. SQL Server can use so many different processing strategies that, for many queries, it's almost impossible to predict which plan the optimizer will choose. These releases also have no equivalent to the trace flags that showed you the plans the optimizer was considering, probably because the optimizer goes through hundreds of thousands of steps to choose each plan, so the output of such trace flags would be too complex to use.
As I mentioned, before SQL Server 7.0, you had a kind of guarantee that the optimizer would find the best plan to process your query as long as the optimizer had all the information it needed. If you, as a programmer, could use a query hint and make the query perform better, it meant that the optimizer hadn't come up with the best plan, and Microsoft usually considered this to be a bug. SQL Server 2000 and 7.0 carry no such guarantee, either implicit or explicit. In these releases, the optimizer's goal is to find a "good-enough" plan. In most cases, this plan will be the best—but not always.
You can think of the current optimizer's processes for coming up with a query plan as cyclical, with each iteration of the cycle trying a more complex plan. For example, the first iteration might evaluate the cost of using only each table's clustered index and a simple nested-loop join. If that approach doesn't produce a good plan, the optimizer might evaluate nonclustered indexes, then different join methods, then try using multiple indexes.
As the optimizer evaluates increasingly complex plans, it calculates the cost-benefit ratio of continuing to search for a better plan than any it has found so far. For example, suppose the optimizer spends 1ms finding a plan that it estimates will take 30ms to execute, then determines that it might find a plan that would run in 25ms—but finding it would take 2 more seconds. SQL Server will usually decide that the cost of continued searching isn't worth the benefit. So, it uses the plan that takes 30ms, which the optimizer decided was good enough.
If you think that plan isn't good enough, you could try modifying your query with hints. You might find that by using hints to force SQL Server to use particular indexes and join methods, you can reduce the execution time to 25ms. The fact that you can find a faster plan is no longer considered an optimizer bug; it's a choice. It was SQL Server's choice to stop the optimizer from continuing to search for a better plan. And you have a choice, too—whether to use the hint in your code for that 5ms performance gain.
Can You Take a Hint?
I don't intend to fully explain all the query hints you can use when writing SQL Server queries. Not all hints affect the plan that the query optimizer chooses; quite a few query hints control the locking mechanisms that SQL Server uses when executing a query, so they're irrelevant to this discussion. However, I want to make sure you understand a couple of important details about hints.
First, a hint isn't really a hint as we know it. In English, a hint is a gentle suggestion, but in SQL Server terms, a hint is more like an order. Unless the order is physically or logically impossible, SQL Server will obey any hint you give it.
Also, bear in mind that a hint that improves your query performance today might make it worse tomorrow. Once you use a hint to tell SQL Server how to process your query, you lose all benefit of the incredibly sophisticated query optimizer. A change in your data distribution because of updates or batch loads might mean that the hinted query plan is no longer optimal, but the optimizer can't decide to disregard the hint and come up with a better plan.
Running the Numbers
One of the most important things the query optimizer needs for decision-making is accurate, up-to-date statistical information about your data values and their distribution. Although the way that SQL Server tracks statistics has changed over the years, the idea is still the same. Statistics give the optimizer a way to make an educated guess about an index's usefulness.
For example, you're probably familiar with the basic structure of a nonclustered index, in which the leaf level has a pointer (also called a bookmark) for every data row in the table. Suppose you have a nonclustered index on a field called lastname. Every lastname value in the table, including duplicates, is in the leaf level of the index, and a pointer indicates where to find that row in the table. The last names in the index's leaf level are in order, so all the names that start with Mc are near each other. However, the corresponding table rows might be on separate data pages, possibly spread out over dozens of pages. A query to find all those surnames might look like this:
WHERE lastname LIKE 'Mc%'
In general, a nonclustered index is useful only if you need to access a very few rows through the index. If you need to access a lot of rows, scanning the whole table might be more efficient because SQL Server wouldn't have to follow all the bookmark pointers. How many rows are "a lot" depends on many factors, including the size of the table, the percentage of rows to be accessed, and the number of rows that fit on a page. I've found that if a query needs to access less than 1 percent of a table's rows, SQL Server can effectively use a nonclustered index that helps locate those rows. Of course, that's just a ballpark figure; every query and index will have their own cutoff point. (For more details about the use of indexes, see my July 2001 column, "Are You in Tune?" InstantDoc ID 21038.)
How can SQL Server know, before your query has executed, whether the table contains a lot of names that start with Mc or only a few? The answer is that it can get an estimate by examining the index statistics. As I mentioned, the format of statistics changed completely in SQL Server 7.0 and again in SQL Server 2000. For details about how SQL Server 2000 stores statistics, see the Microsoft white paper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2000" at http://msdn.microsoft.com/library /default.asp?url=/library/techart/statquery .htm. My October 2001 column, "Statistically Speaking," InstantDoc ID 22075, contains additional information about how SQL Server uses index statistics.
For DBAs, one of the most welcome additions that Microsoft made in SQL Server 7.0 was automatic statistics updates. Before SQL Server 7.0, a DBA had to remember to update the statistics on indexes every time data volumes and distribution changed. For example, if last week you had half a dozen customers in Arkansas and you ran a query looking for customers in that state, a nonclustered index on state might have been useful. This week, you acquired two Arkansas-based companies whose customer lists contain thousands of names. Now, using the index that was appropriate last week might give horrible performance. The optimizer in SQL Server 6.5 wouldn't know about the new Arkansas customers unless you manually updated the statistics. Worse, you had no way to update all the statistics in a database. You could use the UPDATE STATISTICS command on only one table at a time, and you'd have to write your own procedure to access every table in the database if you wanted to update all database statistics as part of regular maintenance.
SQL Server 2000 and 7.0 have a database setting called auto update statistics that's set to TRUE by default. Every time an internally defined percentage of values in an index changes, the optimizer updates statistics before trying to determine the optimal plan. In addition, SQL Server 2000 and 7.0 have a stored procedure called sp_updatestats that updates the statistics on every index in the current database.
A related feature Microsoft added in SQL Server 7.0 lets you exclude individual indexes from automatic statistics updates. You can use the WITH NORECOMPUTE option with the UPDATE STATISTICS command to override the database setting of auto update statistics for individual indexes. These options give you almost total control over index-statistics updates.
Put the Optimizer to Work
The optimizer is now so modular and so easy to enhance that Microsoft frequently adds query-processing techniques in service packs without mentioning them in the README file. For most people, that's not a problem. If SQL Server starts performing better after an update to a new service pack, most people won't complain. However, for someone like me who writes and speaks about the optimizer and the various query-processing techniques, it means a recheck of all my demos and sample files after every service-pack upgrade to make sure that my query plans are what I expect them to be. I don't always remember to do this, and I've been taken by surprise in front of an audience more than once. Next month, I'll tell you about some optimizer changes that have caused me embarrassment. And I'll show you some examples of queries that take advantage of these and other improvements in SQL Server.