When the optimizer can't use its usual tools, give it a clue
Last month, in "Inside Search Arguments," I discussed specific types of search arguments—which I call SARGs—that are most beneficial to the query optimizer when it's determining the best execution plan for a query. In most of the examples I used, if the query didn't contain a SARG that included a specific column, SQL Server wouldn't consider any indexes on that column for the query plan.
Even if the query contains a valid SARG, the optimizer sometimes can't determine the best plan because it can't use the statistics-histogram information during optimization. This can happen when the value the query is comparing to the column is an expression that the optimizer can't evaluate. In these cases, SQL Server has to wait for the execution phase before it can resolve the expression's value.
A common situation in which the optimizer can't use statistics occurs when you're comparing a column to a variable. This month, I look at how the optimizer works with variables. In addition, I examine another SARG construct—stored-procedure parameters—that looks very much like using variables. We'll see how differently the optimizer treats variables and parameters.
To run this month's sample code, you need the same modified Orders table that we used last month. The code in Listing 1 recreates the NewOrders table and builds two unique indexes: one on the Number column and one on the OrderID column. After running Listing 1's code, use the following code to build one more index, a nonunique index on the CustomerID column:
CREATE INDEX customerID_index ON NewOrders(CustomerID)
Before we look at the effect of variables on optimization, consider the following two queries on the NewOrders table:
SELECT * FROM NewOrders WHERE CustomerID = 'DUMON'
SELECT * FROM NewOrders WHERE CustomerID = 'OCEAN'
The first query returns 20 rows, and the second returns 25 rows. Looking at the query plans for each, you'll see that the first plan uses a seek on the CustomerID index and the second uses a table scan. So the optimizer's tipping point falls somewhere between 20 and 25 rows: it decided that a 20-row result was small enough for effective index use, but that for 25 rows, scanning the table would be more efficient.
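One way to inspect both plans without actually executing the queries is the SHOWPLAN_TEXT option (a sketch; the graphical plan display gives the same information):

```sql
-- Display the estimated plans instead of running the queries
SET SHOWPLAN_TEXT ON
GO
SELECT * FROM NewOrders WHERE CustomerID = 'DUMON'  -- plan shows an index seek
SELECT * FROM NewOrders WHERE CustomerID = 'OCEAN'  -- plan shows a table scan
GO
SET SHOWPLAN_TEXT OFF
GO
```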
Besides checking the query plan, you can compare the difference between these two queries by looking at the number of logical reads each required. Turn on the STATISTICS IO option, then execute the preceding two queries again. You should see that the query using the index seek took 22 logical reads and the one using the table scan took 107 reads—the number of pages in the table. If you force the optimizer to use the index on CustomerID, you might find that the resulting plan requires fewer reads—in this case, 27:
SELECT * FROM NewOrders WITH (INDEX = customerID_index) WHERE CustomerID = 'OCEAN'
Keep in mind that the number of page accesses isn't the only cost affecting the optimizer's choices. With an index seek, the data access is random, but a table scan can employ an extent-by-extent sequential read of the table, which is frequently quite efficient. If you force the optimizer to use the index in the following example, which uses a different search value, you can see how bad an idea overriding the optimizer can be:
SELECT * FROM NewOrders WHERE CustomerID = 'SAVEA'
SELECT * FROM NewOrders WITH (INDEX = customerID_index) WHERE CustomerID = 'SAVEA'
When you don't force the optimizer to use the index, a search for the CustomerID value SAVEA takes 107 reads, but when you force the index, SQL Server needs 157 reads.
Now let's look at how the query optimizer processes variables. Consider the following batch:
SET STATISTICS IO ON
DECLARE @name nchar(10)
SET @name = 'DUMON'
SELECT * FROM NewOrders WHERE CustomerID = @name
The query plan shows that the optimizer chose a table scan. If you run the query and look at the statistics, you'll see 107 reads. Although you have a valid SARG in the SELECT statement, when the optimizer evaluates the query, it hasn't yet executed the SET statement that assigns the specific value to the variable. The optimizer optimizes the entire batch at once. You can't put the variable assignment in a separate batch because the scope of a local variable is one batch. Because the optimizer doesn't know what value to look for in the index-statistics histogram, it has to guess. In this case, the estimate is too large for the index to be useful.
In general, the optimizer uses certain rules to produce an estimate when it can't use the index statistics. If the SARG contains an open range (it's looking for the column to be greater than or less than an unknown value), the optimizer estimates that 30 percent of the rows in the table will satisfy the condition. If the SARG contains a closed range (a BETWEEN clause or two overlapping AND conditions), the optimizer guesses that 25 percent of the rows in the table will satisfy the condition. If the SARG contains an equality comparison, the optimizer tries to use the density information in the index statistics; if that's impossible, it guesses that 10 percent of the rows in the table will satisfy the condition.
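These fallback estimates are easy to picture with a few hypothetical SARGs, each comparing the column to a variable whose value the optimizer can't see at optimization time:

```sql
DECLARE @val nchar(10)
SET @val = 'OCEAN'

-- Open range: the optimizer estimates 30 percent of the rows will qualify
SELECT * FROM NewOrders WHERE CustomerID > @val

-- Closed range: the estimate is 25 percent
SELECT * FROM NewOrders WHERE CustomerID BETWEEN 'DUMON' AND @val

-- Equality: density information from the index statistics if available,
-- otherwise 10 percent
SELECT * FROM NewOrders WHERE CustomerID = @val
```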
In most of these cases, the rule-based estimate is too large for the optimizer to consider using an index. In my experience, the optimizer usually doesn't consider using a nonclustered index unless it estimates that only a tiny percentage of the rows in the table will need accessing. That number is usually around 1 or 2 percent, but it can be even smaller.
What Was That Again?
So what's the solution to the problem of poor optimization when a query contains a variable? One possibility is to consider using stored procedures with parameters instead of using an ad hoc query with a variable. So to run the preceding SELECT statement, you could create the following procedure:
CREATE PROC getOrders ( @name nchar(10) ) AS
SELECT * FROM NewOrders WHERE CustomerID = @name
Although the SELECT statement in this procedure looks identical to the previous statement that contains variables, SQL Server processes the two very differently. In this case, the optimizer won't determine a query plan until the first time you execute the procedure. At execution time, you must pass in a specific value for the parameter @name. The optimizer will use that value to determine whether a useful index exists.
Let's say that I call the procedure with a parameter value that occurs very few times:
SET STATISTICS IO ON
EXEC getOrders 'DUMON'
In this case, SQL Server uses the value DUMON to optimize the procedure's SELECT statement. By looking at the query plan for this procedure call (which includes an index seek) or by looking at the number of logical reads (22—far fewer than the number of pages in the table), you can see that SQL Server was able to optimize for this specific value.
The good news—and the bad news—about using stored procedures is that SQL Server saves and reuses the query plan. All subsequent executions of this procedure will use the same plan as the first execution. So because the optimizer decided to use an index seek the first time you called the procedure, it will use an index seek every time you call the procedure, no matter what parameter you use. For example, if you now run
EXEC getOrders 'SAVEA'
you'll notice the plan for the first execution of the procedure isn't ideal for this second execution. IO statistics from executing the procedure with this new parameter show 157 reads, more than for a table scan.
Similarly, if the parameter requires a table scan the first time you execute the procedure, then all subsequent executions will use a table scan. You can demonstrate this behavior for yourself by first running DBCC FREEPROCCACHE to clear all the saved procedure plans. (But don't run this command on a production server!) Now, call the procedure twice again, this time with the more frequently occurring parameter value first:
EXEC getOrders 'SAVEA'
EXEC getOrders 'DUMON'
If you're still watching STATISTICS IO, you should see 107 page reads for both queries, meaning that they're both using a table scan, even though the second execution would perform much better with an index seek.
One solution to consider when plan reuse doesn't give the best performance is forcing SQL Server to recompile the procedure's plan. You can force a recompile for one execution as follows:
EXEC getOrders 'DUMON' WITH RECOMPILE
SQL Server won't save the plan it generates for such an execution because using WITH RECOMPILE tells SQL Server that you want the plan for this execution only. Alternatively, you can create (or recreate) the procedure to recompile with every execution, like this:
ALTER PROC getOrders ( @name nchar(10) ) WITH RECOMPILE AS
SELECT * FROM NewOrders WHERE CustomerID = @name
In this case, SQL Server won't save any plan but instead will generate a new one each time the procedure runs.
If you know that your parameter values might vary widely and require different plans for almost every execution, consider using the WITH RECOMPILE option when creating certain procedures. However, keep in mind that you'll lose all the cost savings of not having to recompile. For complex procedures, recompilation (which includes optimization) can be an expensive and time-consuming process, so you need to weigh the advantages of always getting an ideal query plan against the cost of the recompilation.
In one other situation, even recompiling the procedure might not be a good solution. Consider a procedure that receives a parameter, then manipulates the parameter before using it. You're treating the parameter like a variable, and when compiling the procedure, SQL Server won't know the final value of the parameter. However, in this case, it assumes that the value it will be using is the initial parameter value. Here's a simple example:
ALTER PROC getOrders ( @name nchar(10) ) AS
SET @name = 'SAVEA'
SELECT * FROM NewOrders WHERE CustomerID = @name
Keep in mind that altering a procedure invalidates any cached plans, so the next execution will generate a new plan. The optimizer will still compile the plan based on the parameter, even though the parameter isn't the value that SQL Server ultimately uses during query execution. In the following code, I pass in a very selective parameter value, so the optimizer will determine that SQL Server can use an index:
EXEC getOrders 'DUMON'
By the time the SELECT statement executes, the value of the @name parameter has changed, but the optimizer has already generated the plan to use an index seek. You end up with the worst possible performance because SQL Server must use an index seek to access far too many rows.
In this case, recompiling doesn't help. No matter how many times you recompile, the optimizer will plan based on the parameter that you pass in, not on what happens to the parameter inside the procedure.
There's no easy solution that ensures that your procedures always generate the best plan without unnecessary recompilation. However, being aware of the problems is 80 percent of the solution. One simple recommendation I can make is that you keep your parameters and your variables separate. If you want to manipulate a parameter, copy its value into a local variable inside the procedure, then manipulate the variable. That way, the optimizer knows it's working with a variable and won't assume the value is known.
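A minimal sketch of that pattern, reworking the earlier getOrders procedure (the variable name @localName is mine, not from the original listing):

```sql
ALTER PROC getOrders ( @name nchar(10) ) AS
DECLARE @localName nchar(10)
SET @localName = @name     -- copy the parameter into a local variable
SET @localName = 'SAVEA'   -- manipulate the copy, not the parameter
SELECT * FROM NewOrders WHERE CustomerID = @localName
```

With this version, the optimizer treats @localName as a variable of unknown value and falls back on its general estimates, rather than compiling a plan around a parameter value that the query will never actually use.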
All these situations assume that your WHERE clauses use SARGs on columns that have potentially useful indexes. The problem of non-optimal query plans arises when the optimizer can't use the index statistics, so it can't make an informed decision. If you don't have SARGs in your queries, the best indexes in the world won't help you, and you'll have the overhead of maintaining the indexes without ever getting to use them. Using actual SARGs in your WHERE clauses is an important first step in writing well-tuned queries and procedures.