Reduce I/O and query runtime

As a DBA, you probably tune your database server so that it quickly and efficiently processes queries that applications send. Whether your organization purchases applications from another company or writes applications in-house, you might tend to see the program as a black box that can't be modified. You can conduct performance tuning at the server level by improving disk I/O, increasing memory, and creating or modifying indexes. Ultimately, though, the application's performance depends on the database's design and the queries written for it. Database application developers who understand how SQL Server optimizes and processes queries usually have the best-performing software and the fewest surprises as their applications scale from small installations to large ones. With a few simple tweaks of your SQL Server 6.5 database and queries, you can get optimal performance.

But what about SQL Server 7.0? Because SQL Server 7.0 automatically tunes most of its internal settings, its performance-tuning methodology differs significantly from the methods that you use to tune SQL Server 6.5. The information in this article about selecting indexes and minimizing I/O in SQL Server 6.5 also applies to 7.0, but most of the other suggestions aren't as effective in SQL Server 7.0 because much of its architecture is completely new. You can get more information at http://www.microsoft.com/ sql/productinfo/perftuning.htm, which has several white papers on performance tuning.

What Am I Optimizing?


SQL Server 6.5 uses a cost-based query optimizer. The biggest cost in most queries is I/O operations involving the disk. Because the hard disk is hundreds of times slower than RAM, anything you can do to limit hard-disk use is good. At a basic level, therefore, you're first trying to optimize physical I/O—a page read from the hard disk—and second, logical I/O—a page read from memory.

To optimize your server, first make sure that you allocate as much RAM as possible to SQL Server. In all cases, a logical read is many times faster than a physical read. In an ideal situation, you have enough RAM to cache your entire database. Unfortunately, you can't stop at minimizing physical reads. Even a fast logical read takes some time, and thousands, or even millions, of reads take much more time.

As an example, consider this simple two-table join.

SELECT * FROM Employees a INNER JOIN
   Payroll_Checks b ON (a.empID = b.empID)

One of my clients asked me why this query was running so slowly. After I turned on SHOWPLAN and looked at the query plan, the answer was obvious: The empID column in the Payroll_Checks table had no index. The Employees table had about 10,000 rows and the Payroll_Checks table had about 750,000 rows. Without the index, SQL Server 6.5 performed a table scan on Payroll_Checks 10,000 times. When I canceled the query, the server had already performed 15 million logical I/O operations. Putting an index on the empID column dropped processing time to seconds and logical I/O operations to approximately 750,000.

The Magic Row Sizes


The first step in minimizing I/O is to ensure that the row is as compact as possible. Rows can't span pages in SQL Server, and each page in SQL Server 6.5 can store a maximum of 32 bytes for the row header and 2016 bytes for the data. Each data row also has overhead in the row. The maximum size that any single row can be is 1962 bytes, including its overhead bytes. This limit is imposed so that a row being inserted or deleted can also fit on a transaction log page. Although a single row can't exceed 1962 bytes, two rows can take the whole 2016 bytes available for data on a page. So certain row sizes can help you decrease I/O by large percentages. For example, if your rows are 1009 bytes long, only one row will fit on a page. If you shorten the row size by one byte, two rows can fit on a page. You cut the I/O for a table scan in half simply by removing one byte! Similar situations exist for 673 bytes, 505 bytes, 404 bytes, etc. If you can keep the row size under these amounts, you'll lower I/O by 33 percent, 25 percent, and 20 percent, respectively.

All rows have some overhead bytes, so you need to take those bytes into consideration when you make your calculations. Note that variable-length rows have more overhead than fixed-length rows. To find out whether you have wasted space on a page, run DBCC SHOWCONTIG to determine the average page density and average number of free bytes per page. The most likely candidates for wasted space are tables for which the average number of free bytes is close to the size of a record.

A similarly wasteful situation exists for tables without clustered indexes and in which a large percentage of the rows were deleted. Deletions leave empty space on a page, and because SQL Server doesn't reuse space in a page if the table doesn't have a clustered index, all new rows in the table end up on the last page. As a result, the table will have pages that are less than 100 percent full, so table scans will take more I/O operations. Before you attempt to shrink the row size for such tables, create a clustered index. Then run DBCC SHOWCONTIG again to see how much free space the table has.

Unexpected Update I/Os


Although you might think that your table doesn't need a clustered index because it never has deleted rows, you might be surprised to learn that the UPDATE statement can leave empty spaces in your tables. A seemingly straightforward update such as this one

UPDATE Payroll_Charges SET ChargeAcctCode =
   9102 WHERE ChargeAcctCode = 7102

can potentially create a tremendous number of transaction log entries. The problem stems from the way SQL Server structures the series of operations to prevent violating constraints. Here's a simple example:

UPDATE Authors SET au_id = au_id + 1

If au_id is a primary key (or a unique key), updating the first row will probably violate the uniqueness constraint, especially because au_id is a monotonically increasing value. But the UPDATE statement is valid, so how can SQL Server execute it without violating the constraints? Here SQL Server 6.5 uses a deferred-mode update in which it splits the operation into a row delete followed by a row insert with the new value.

SQL Server handles this situation by writing to the transaction log no-op records that tell which operation needs to be performed. Then, after identifying all the affected rows and writing a delete and an insert record into the log, SQL Server returns to the beginning of the transaction and starts applying the delete operations. When it finishes the deletes, it applies the inserts. The operations are full deletes and full inserts, and they include modifying all affected indexes.

Deferred updates can degrade database and application performance because they cause the transaction log to consume a lot of space and they take more time than direct-mode updates. The longer an update takes, the longer the exclusive locks stay in place and the longer other users must wait to access the page, which increases the likelihood of deadlocks.

SQL Server 6.5 can carry out your update operations in four ways. The fastest is a direct update in- place, where no row movement takes place, and the only information that has to be logged is one record that indicates which bytes changed to new values. The slowest update is the deferred update I just described. The other two types of updates are both direct updates (no extra logging is done) but not in place, so some row movement will occur. A long list of conditions must be met for your update to be an update in place, and I won't document them here. The SQL Server 6.5 Books Online (BOL) section titled The update mode is direct lists some conditions that must exist for SQL Server to use a direct in-place update. However, BOL confuses direct with update-in-place, so the discussion isn't always accurate. The most complete published description of the different types of updates and the complete set of conditions for the various kinds of updates to occur is in Inside SQL Server 6.5 by Ron Soukup (Microsoft Press, 1997).

Two main conditions that must be met for your updates to happen in place are that you can't modify the key columns in a clustered index and you can't mark the table for replication. Modifying the clustered index forces SQL Server to move a row to its proper physical location in the table, a process that it always implements as a delete followed by an insert. And because replication reads the log and creates ODBC commands for the subscribers, the delete/ insert combination is the most straightforward way to describe the update operation. Both situations force the update to not be in-place.

A similar rule applies to variable-length and nullable columns. The updated column in multirow updates must be fixed-length to use an in-place update. SQL Server stores nullable columns as variable-length columns, even if you declare them with a fixed-length data type. SQL Server always uses deferred mode for multirow updates of nullable columns.

Knowing these restrictions and configuring your database accordingly pays handsome dividends when you're trying to squeeze performance from updates. Using methods that involve direct-mode updates can save you log write I/O, logical page read I/O on the log pages, and time during log backups and restores and database recoveries. When you design databases, consider making fixed-length, non-null columns the standard. If you're a programmer, know the properties of the columns you update, keep the effects in mind when you write UPDATE statements, and carefully choose when to issue your statements.

Search Arguments


The easiest way to decrease the I/O your query requires is to decrease the number of rows SQL Server must examine. You do this by including selective criteria in the WHERE clause of the query. Called search arguments, these criteria help the query optimizer by giving it clues as to which method of data access is probably the fastest. Search arguments have the form ColumnName operator \[ColumnName or constant\], where operator is =, <, >, >=, or <=. Search arguments joined by the Boolean operator AND are also valid. The phrase BETWEEN...AND is also valid because you can restate it in terms of >= and <= operators. Here are some examples of search arguments:

LastName = 'Lewis'
LastName >= 'Le' AND LastName <= 'LF'
OrderDate BETWEEN '01/01/99' AND '01/31/99'

Notice that none of the negation operators are listed. To process a query including the clause WHERE (ProductID <> 2), SQL Server looks at each record to determine whether the ProductID equals two. Even an index on ProductID won't help much unless the rows with ProductID <> 2 constitute a very small percentage of the table. In almost all cases, SQL Server will satisfy this query with a table scan, not an index.

In terms of optimizing queries, the LIKE operator can be as inefficient as NOT. If you have a clause such as WHERE LastName LIKE '%Mc%, SQL Server performs a pattern search on the entire column. An index won't help, so you can expect the query optimizer to choose a table scan. The one exception happens when the clause looks like WHERE LastName LIKE 'Le%'. The difference is that this search is logically equivalent to WHERE LastName >= 'Le' AND LastName < 'LF', which is a search argument.

In general terms, search arguments help your query by helping the query optimizer determine how selective an index will be in handling that part of the query. Clauses that use the = operator and the < and > operators are search arguments because they limit the search for the rows in the result set. The = operator limits the search to one row, and the < and > operators limit the search to a range of rows.

A clause's selectivity reflects how well the search argument limits the search and is equal to the percentage you get from dividing the number of rows returned by the number of rows in the table. (This definition is slightly simplified, but it makes this discussion easier.) A low percentage means a clause has high selectivity; a high percentage means a clause has low selectivity. Because the AND operator is associative (i.e., a AND b is the same as b AND a), the query optimizer can choose the most selective clause to drive the query processing because those kinds of clauses tend to limit the I/Os the query performs.

For example, consider the query

SELECT * FROM pubs..authors WHERE au_id =
   '213-46-8915' AND state = 'CA'

Both clauses in the WHERE clause are search arguments. But the state column probably won't have unique values, whereas the au_id column will because it's the table's primary key. Without knowing anything else, you can reasonably guess that au_id = '213-46-8915' has high selectivity and state = 'CA' has medium to low selectivity. Of course, if the state column had only one row with the value of 'CA', both clauses would be equally selective.

The query optimizer decides how selective a search argument is by looking at the corresponding index's statistics. The statistics give an approximate idea of how many rows will match given criteria. Therefore, if the query optimizer knows how many rows are in the table and how many rows each part of the WHERE clause will probably return, deciding which index to use is a relatively simple task. (For more information on SQL Server 7.0's use of statistics, see Kalen Delaney, "SQL Server Statistics: A Useful Query Optimizer Tool," August 1999.) In the previous query, if both au_id and state have an index, SQL Server uses the index on au_id. If au_id has no index but state does, the index on state is the logical choice because it's more selective than a table scan. Without an index on either column, a table scan is the only way to find the rows that match.

(For more information about how the query optimizer works, Microsoft SQL Server 6.5 Unleashed (Sams, 1998) describes in detail several common scenarios. Working through each example takes a while, but you'll write better queries when you know how the query optimizer works. Inside SQL Server 6.5 also has good optimizer information.)

Indexing: More Art than Science


You can't write efficient queries without knowing about the indexes on the table. Without good indexes, even the simplest-looking queries can slow your system tremendously. Your only defense is to know the composition of the data in your tables and to think of your indexes as an integral part of your queries.

Indexes that work well during testing and development might be completely wrong for a production system because the data composition differs from your original assumptions. I've seen systems work well for one customer but perform poorly for another because the data was clustered in the tables in a way that prevented the query optimizer from making good use of indexes. If you get complaints about your application's performance, be aware of this situation and realize that one set of indexes might not fit all your customers.

Now let's look at some broad guidelines for choosing the right kind of indexes and the columns to index. First, because you can have only one clustered index per table, that type of index needs to support the greatest possible number of queries. A clustered index is most useful for range queries because the index's leaf level contains the data sorted in index order. You'll see the most benefit from a clustered index when a query has a WHERE clause that contains >, <, or BETWEEN...AND operators or GROUP BY clauses that list the columns in the same order as the columns in the index. Although it might not help in searching for rows, a clustered index can improve the performance of ORDER BY clauses if the index and the ORDER BY clause have the same columns in the same order.

Because the intermediate level of a clustered index is so small, it works fine when you're searching for unique values. Nonclustered indexes, however, work best for point queries, which find small numbers of rows. WHERE clauses with = operators are prime candidates for nonclustered indexes on the appropriate columns. This index type also works well for the MIN and MAX aggregate functions because finding the first and last entries in a range of values is easy if you look at the index's leaf level. Finally, nonclustered indexes dramatically improve the COUNT function's performance because scanning the index's leaf level is much faster than performing a table scan.

Where Do I Go From Here?


Use the ISQL/window to see how SQL Server reacts when you put various indexes on a table. Enterprise Manager can show you the selectivity of a table's indexes, and SQL Trace can create a script of all the queries sent to the server. By adjusting the indexes, replaying the script, and noting the changes in processing time, you can get a good idea of which indexes are best for your production environment. Just be aware of how many I/O operations your queries use and remember that anything that decreases that number will have a positive effect on overall performance.