Strike a balance between performance and security
Since SQL Server's first release, the ability to reuse the saved execution plans for stored procedures has been touted as one of the best things about the product. As I demonstrated last month in "Reusing Query Plans," reusing a saved plan doesn't always produce the best performance. It works when the saved plan is a useful plan, but not when the plan is inappropriate for the current parameter values. SQL Server's query optimizer develops the execution plan for a stored procedure based on the parameters available at the time. So, the plan might be ideal for the initial parameters but not for parameters passed in subsequent procedure calls.
Last month, I discussed one solution to this problem: When you create the procedure, use the WITH RECOMPILE option, which tells SQL Server to rebuild the plan each time the procedure is called. Although this means you always get a good plan for your parameters, it also means you never get the benefit of reusing a saved plan. For a complex procedure, eliminating the recompilation time can make a noticeable performance difference, so you'd want to reuse plans whenever possible. Forcing a recompile for a complex procedure has another drawback. In SQL Server 2000 and 7.0, a recompile always means a recompile of the entire procedure, even if only one statement in the procedure is using a suboptimal execution plan.
A simple solution to allow only partial recompilation is not to use the WITH RECOMPILE option, but to use dynamic execution for those statements in the procedure that need an optimized plan for each set of specific parameter values. Listing 1 shows code similar to the code from last month, but this code copies two tables from the Northwind database into testdb. The script includes an UPDATE statement to change the data distribution so that customer RANCH accounts for a large percentage of the orders in the Orders table. The script then creates a procedure a bit more complicated than the procedure last time. This new procedure executes two SELECT statements using the same CustomerID parameter.
The first SELECT shouldn't cause any performance problems. There's a unique nonclustered index on the customerID column in the Customers table, so no matter what value you pass for customerID, the query should be efficient. The plan for this query is known as a safe plan, because the same plan is valid no matter what actual value you're searching for.
The second query uses the Orders table, in which customerID is non-unique. Some values for customerID can occur only a few times, making the use of the nonclustered index efficient. But other values, such as RANCH, could require an alternative query plan—in this case, a table scan. After running the code in Listing 1, run the following batch to see the problem:
DBCC FREEPROCCACHE SET STATISTICS IO ON EXEC getorders 'BOLID' EXEC getorders 'RANCH'
The first execution of getorders causes the optimizer to generate plans for each query in the procedure, and both query plans will use the nonclustered index for the referenced table. STATISTICS IO will report two logical IOs for selecting the one row from Customers, and five logical IOs for selecting the three rows for customer BOLID in the Orders table. The second execution then tries to use the same plan. It still needs only two logical reads to find the single row for RANCH in the Customers table, but it will need 262 logical IOs to find the 259 rows in the Orders table for RANCH.
Since only the second SELECT statement in the procedure is giving the performance problems, Listing 2 recreates the getorders procedure to use dynamic execution of only the second query. Run Listing 2's code and execute the procedure with each of the same two parameters. Keep in mind that altering the procedure removed any existing plans for getorders from cache.
EXEC getorders 'BOLID' EXEC getorders 'RANCH'
This time, you should see the same logical IO values as before for each SELECT statement except for the second SELECT in the second execution, which is looking for the 259 orders for customer RANCH. This time you should see only 21 logical IOs, corresponding to the number of pages in the table.
Dynamic SQL seems to solve our problem in this case, but it introduces other potential problems. Anytime dynamic SQL is based on passing a parameter into a program or stored procedure with no control over the contents of that parameter, there's a danger of a SQL injection attack. If you haven't yet, familiarize yourself with the dangers of dynamic execution and SQL injection; you can start with the sources listed in the "Related Reading" box. (Another problem with dynamic execution is that users executing the stored procedure need specific permission to access the underlying tables. Normally, users executing the procedure don't have permissions on the underlying tables, but they can perform those actions included in the procedure.)
For a quick example of how SQL injection can cause serious problems, consider the fact that someone could pass in more to the procedure than just the name of a customer. Suppose you passed in a customer name followed by another T-SQL command:
EXEC getorders 'BOLID';SELECT * FROM customers;PRINT ''
If you call the getorders procedure as I just showed, the first SELECT in the procedure won't return any rows because it's specifically looking for a customer ID with a long, meaningless value. However, that isn't an error, so SQL Server goes on to the next statement in the procedure, which builds a string. If you add a line to the procedure to print out the string, it looks like this:
SELECT * FROM orders WHERE CustomerID = 'BOLID';SELECT * FROM customers;PRINT '
This is now a batch containing three valid SQL statements: a SELECT to retrieve rows for a particular customerID, a SELECT of all the rows from the Customers table, and a PRINT statement.
I could use other SQL statements instead of the one that selected all the rows in a table. I could select from another table, causing a serious security breach, or issue an UPDATE or DELETE statement with devastating loss of data. Most of the suggestions in the "Related Reading" articles involve either tightening permissions or controlling the parameter before passing it to the query. In this case, greater parameter control might help. My procedure needed only a customerID value, and all customerID values are five characters long. I changed the procedure definition when I added the dynamic execution in Listing 2 to treat the input value as varchar(500). The original parameter was a char(5), and not many dangerous commands can fit in only five characters.
The reason for using dynamic execution was to recompile the plan for only the second SELECT with each new parameter. Another solution that gives the same behavior is to create a second procedure for the first one to call, so only the second procedure includes WITH RECOMPILE. Run the code that Listing 3 shows to recreate the original procedure and create a second one. Then run the same batch again, calling the getorders procedure with two parameters:
EXEC getorders 'BOLID' EXEC getorders 'RANCH'
You should see the optimal logical IO values when you run this batch.
This still isn't an ideal solution because SQL Server will always recompile the plan for the second SELECT in the procedure. If the procedure were more complex—for example, including joins, aggregations, and function calls—you might want to find a solution that eliminates this repeated compilation and optimization.
Another solution involves creating two new stored procedures that an outer procedure calls, depending on which plan is best. For the examples so far, it would be difficult to predict without running the query which parameters will require which plan because each customerID value occurs a different number of times. For other types of queries, there might be a solution. If you had a range query where the parameter was the endpoint of the range, you might be able to predict that endpoints at the low end of the range would return only a few rows. Thus, using the nonclustered index would be a good plan, but for endpoints higher in the range, a different plan would be better.
The code in Listing 4 rewrites the getorders procedure again and creates two sub-procedures. SQL Server now calls the outer procedure, getorders, when you want to return all the rows that have a customerID value less than the input parameter. For this small amount of data in the Orders table, SQL Server could get faster performance with any value greater than or equal to ANTON by performing a table scan. Note that the two nested procedures are identical. However, the outer procedure calls the first one with a very selective parameter and the second with a not-so-selective parameter, so the plan for the second one will use a table scan. You can test the new procedure with the following batch, keeping in mind that you're now returning all orders for customers whose ID is less than the parameter:
EXEC getorders 'ALFKI' EXEC getorders 'RANCH'
Again, you get an appropriate plan for both parameters, as you can see by the number of logical reads.
This solution isn't perfect, either. It's useful only when you can guess up front which actual parameters SQL Server will treat one way and which it will treat another way, but it doesn't account for changing the plan when the data distribution changes. If you perform updates, inserts, and deletes to the Orders table and change the data distribution, so that now any customerID values less than BINGO should be treated as highly selective, you'll have to rewrite the getorders procedure. You have to constantly monitor performance and retest the procedures to make sure that you're still getting the best possible plan as the data changes.
How do you get the best of both worlds? You don't want to recompile when the existing plan is good, but you need to be able to recompile (and reoptimize) when the existing plan isn't so good. And, you don't want to have to recompile more of the procedure than necessary. There's no perfect solution. When SQL Server knows that an existing plan might not be efficient because of changes to metadata or statistics, it can automatically recompile. But if an existing plan gives bad performance just because you're using a different parameter, SQL Server can know it needs a different plan only by taking the time to reoptimize. The more you know about how to use T-SQL to your advantage for tuning and optimization, the better performance you'll achieve.
| SQL Server Magazine articles: |
"Preventing SQL Injection Attack," August 2004, InstantDoc ID 43012
SELECT TOP(X), "Injection Protection," March 2004, InstantDoc ID 41502
Reader Challenge: "Protecting Against SQL Injection," April 6, 2004, InstantDoc ID 42216
"Rest Secure," August 2001, InstantDoc ID 21268
"The Curse and Blessings of Dynamic SQL," http://www.sommarskog.se/dynamic_sql.html
"Introduction to Dynamic SQL," June 2001, http://www.sqlteam.com/item.asp?ItemID=4599