Downloads
24700.zip

Make sure you have the right indexes for your needs

You know by now that you have several options when tuning indexes for SELECT queries, but the pertinent factors to consider are different when you're tuning indexes for data modification. When determining the best indexes to use for SELECT queries, the only reason not to create more indexes than you might need is that the extra indexes take up disk space. Having more indexes than you need doesn't degrade performance unless your tables are being modified regularly.

Every time you modify a table's data, you affect the indexes on that table because SQL Server must update the indexes. SQL Server performs this index maintenance automatically, but you need to be aware of the cost. Data modification mostly affects nonclustered indexes because the leaf level of every nonclustered index contains a pointer to every data row. So for every row you insert into a table, a nonclustered index leaf level gets a new pointer; for every row you delete, the leaf level loses a pointer. Remember that a nonclustered index's leaf level keeps the index pointers sorted, so the new row must be inserted in the proper location.

Adding to the complication, these changes can propagate upward in the index. If the new index row fills an index page, the page must split into two pages. Then the level above, which contains pointers to every page, gets a new page pointer. If an index row is removed and is the only row on the page, SQL Server has to remove the leaf-level index page and adjust the index level above.

Pointers About Indexes


I won't cover here all the details about how SQL Server maintains indexes internally. For the purposes of query tuning, the additional details are irrelevant. But keep in mind that SQL Server must make these index adjustments for every nonclustered index, and you can have up to 249 nonclustered indexes on one table. For example, say you have a contacts table with nonclustered indexes on last name, first name, phone number, and city. Every time you insert a new row into the table, SQL Server must add pointers for the new last name, first name, phone number, and city in the right locations in the indexes.

For a clustered index, fewer changes are necessary because the index pages contain pointers only to pages instead of to individual records. SQL Server adds a new index row only when it has to add a whole new page to a table. When someone inserts a new row into a table that has a clustered index, SQL Server must put the row into the proper sorted location, like when you need to add a new pointer to a nonclustered index's leaf level. If the data page has no room for the new row, a page split occurs. (I talk more about page splits later.) When a row is removed, if it's the last one remaining on the page, SQL Server has to adjust the index rows accordingly. So clustered indexes have less overhead, but you can have only one per table. Any additional indexes must be nonclustered.

It might seem that you should completely eliminate indexes if you're going to be modifying data, but you need to consider another aspect of data modification. In particular, for many UPDATE and DELETE operations, you want to modify only a small set of rows, so the UPDATE or DELETE query will contain a WHERE clause, perhaps including search arguments. Having indexes on one or more search arguments in your data-modification queries lets SQL Server find the relevant rows more quickly.

You need to figure out how to build the indexes that SQL Server needs to locate the data without building so many indexes that they'll significantly affect the data-modification operation's performance. Let's look at an example in the Northwind database. First, we'll create a copy of the Order Details table and add multiple indexes. After each new index, we'll delete one row of data. The Order Details table has only nine pages, but even with this small table, you should be able to see the effect indexes can have. I'll leave it to your imagination, or your testing, to determine what the effect would be on tables that are hundreds or thousands of times as large as Order Details.

Here's the code to copy the table:

USE Northwind
SELECT * INTO od
FROM \[order details\]

Now, delete one row to determine baseline statistics before building any indexes; turn on STATISTICS IO to determine how many page accesses are required to find and delete the row:

SET STATISTICS IO ON
GO
DELETE od
WHERE OrderId = 11077 AND ProductId = 66
GO
SET STATISTICS IO OFF
GO

The results show that SQL Server performed 11 page reads. Now, use the following code to build a clustered index on the search argument columns and delete another row:

CREATE UNIQUE CLUSTERED INDEX OD_key ON od(OrderId, ProductID)
GO
SET STATISTICS IO ON
GO
DELETE od
WHERE OrderId = 11077 AND ProductId = 75
GO
SET STATISTICS IO OFF
GO

Because the index is useful in finding the right row to delete, the number of page reads decreases by more than 80 percent to only two pages.

The code in Listing 1 builds additional indexes, deleting a row after building each index to demonstrate how modifications to tables with indexes affect the amount of I/O required. Running this script shows that each index adds more work for SQL Server. By the time you build the fourth nonclustered index, the number of page reads is greater than the number required when no indexes existed. Keep in mind that reading pages isn't the only work that SQL Server is doing. If SQL Server needs to split pages, it must do a lot of work to move rows from one page to another, and that work doesn't show up in the STATISTICS IO values.

Torn in Two


Having a large number of indexes can be great if the table is used only for data retrieval, but as soon as you need to modify the data, you should keep indexes to a minimum. So what can you do if a table is used for both data retrieval and modification? This is one of the most common—and most difficult—problems of database design and tuning. Some people solve it by building two copies of the table, either in the same database, in different databases on the same server, or on separate servers. Each table can have its own set of indexes. One of the tables is used only for querying, and the other only for modifications. The problem then becomes keeping the data in the two tables synchronized. Eventually, you have to propagate the changes made to the data in the modification-only table to the retrieval-only table. (You can use SQL Server replication or other mechanisms to set up and synchronize the copy of the table, but discussing various ways of performing this reconciliation would make an article in itself, so I'll leave that for another time.)

Keeping two separate tables means a more complex system to maintain. But you might be able to tune your table in such a way that your SQL Server can withstand both retrieval and modification on the same data and still provide satisfactory performance. You'll often get better results by just deciding which operations are more important to you and your users. Is it more important that inserts and updates happen quickly, or is data-retrieval speed of greater concern? After you decide which activities should take precedence, you can create indexes as if your users will perform only those types of operations. In many cases, you're better off if you optimize for one set of operations than if you try to make all operations perform quickly. Of course, this is just a general suggestion; you need to thoroughly test any indexes you build, using a test suite that includes representative retrieval and modification operations.

If you decide to index as if data modification is the only type of operation you'll be performing, how can you decide which indexes are the best? Before I give my indexing recommendations, you need to know what happens to indexes as data is modified. First, let's look at INSERT operations.

Inserting Information


When a user inserts a new row into a table, SQL Server must determine where to put that row. When a table is a heap—that is, when the table has no clustered index—SQL Server always inserts a new row wherever room is available in the table. SQL Server keeps track of which pages belong to each table and how much free space is available on those pages so it can efficiently determine the best place to put a new row. If no pages with space are available, SQL Server must allocate a whole new extent (eight contiguous pages, or 64KB) to the table.

A clustered index directs SQL Server to place an inserted row on a specific page based on the new row's value for the clustered-index key columns. A new row can result directly from an INSERT statement or indirectly from an UPDATE statement executed with a delete-followed-by-insert strategy. (I'll talk more about this delete-then-insert strategy in a future column when I discuss UPDATE operations.) SQL Server inserts new rows into their clustered position, splicing in a page through a page split if the current page has no room.

Because the clustered index dictates a particular ordering for the rows in a table, every new row has a specific place it belongs. If there's no room for the new row on the page where it belongs, SQL Server must allocate a new page and link it into the list of pages. If possible, SQL Server allocates this new page from the same extent as the other pages it will be linked to. If the extent is full, SQL Server allocates a new extent to the object, trying to allocate an extent that's as close as possible (on the hard disk) to the extent that contains the original page.

Splitsville


After SQL Server finds the new page, it must split the original page, leaving half the rows on the original page and moving half to the new page. In some cases, SQL Server finds that even after the split, there's no room for the new row—which, because of variable-length fields, might be much larger than any existing rows on the pages. In that case, SQL Server must allocate a second new page and distribute the data from the original page, plus the new large row, across the three pages. (For more information about what data SQL Server moves when a page split occurs, see "Do the Splits," June 2001, InstantDoc ID 20589.)

In a special case, when data is being inserted in a monotonically increasing sequence, a different type of "split" occurs. For example, suppose your rows keep track of orders and each row has an order ID, which SQL Server automatically generates as the next number in an increasing sequence of numbers. If a clustered index is on the order ID column and SQL Server stores the rows on the pages in the order of the clustered key, each new row is placed after the most recently inserted row. That is, the row for order 1002 goes after the row for order 1001, the row for order 1003 goes after 1002, and so on. What happens when a page becomes full in this case? Let's say a page can hold 10 rows and a certain full page contains rows for order numbers 1001 through 1010. What happens when someone tries to insert order 1011? The page where the row should go is full. If SQL Server split that page as described above, rows 1001 through 1005 would stay on the original page and rows 1006 through 1010 would move to the newly allocated page, then row 1011 would be inserted on the new page. After rows 1012 through 1015 were inserted, that page would be full. When order 1016 arrived, another split would occur and SQL Server would allocate a new page for rows 1011 through 1015 plus the new row, 1016. You'd then have a page containing rows 1001 through 1005 and one containing rows 1006 through 1010, and both would be only half full. Because all new rows would have larger clustered key values than all previous rows, the empty space on these pages would never be used.

However, SQL Server can detect when rows are being inserted in a monotonically increasing sequence, so when a page becomes full, SQL Server doesn't split the page's data in half. If SQL Server detects that data is being inserted in increasing sequence, it allocates a new page when one gets full but doesn't move any rows to the new page. In the example above, when the page containing rows 1001 through 1010 became full, the new page would start with row 1011 and would fill up when it received row 1020. You don't need to do anything to cause this special kind of "split" to occur. (Admittedly, this isn't a true split because no existing page is split apart. But because it happens when a page on which a new row should be inserted is full and SQL Server allocates a new page, this process is referred to as a split.) However, if you're aware of how efficiently SQL Server handles space allocation when the table has a clustered index, you'll understand the first of my index recommendations for data modification.

Indexing for Inserts


When inserts are occuring in a table with a clustered index, and the column value on which the clustered index is placed increases with each new row inserted, SQL Server uses one page until it's full, then uses another page. With no clustered index, SQL Server can place new rows anywhere in the table, causing frequent reads from disk if the pages to be used aren't in cache. If the table has a clustered index on a column in which inserts don't occur in any pattern (e.g., last name), again the new rows might be inserted on any page in the table, depending on the value of the clustered-index key column. Building a clustered index for a table that has frequent inserts is almost always a good idea—in particular if the indexed column value increases with each new row inserted. Not only will you get the advantage of efficient space management, you'll get efficient I/O management because SQL Server won't need to read in new pages of the table from disk as frequently. Next time, I'll continue with a discussion of tuning for DELETE and UPDATE operations.