Downloads
25198.zip

Learn how to balance data modification against retrieval

The indexes that work best for data retrieval aren't always the best ones for data modification. To know which indexes are the best for data-modification operations, you need to know how SQL Server uses indexes during those operations. Last month, I told you how SQL Server uses indexes and how INSERT operations affect those indexes, and I gave some basic recommendations for indexing tables with heavy insert activity. Now let's look at the indexing effects and recommendations for the two other data-modification operations, DELETE and UPDATE.

One of my main points last time was that although indexes can help SQL Server find and retrieve data, they can degrade SQL Server's performance of data-modification operations because SQL Server has to maintain the indexes so that they reflect the new data. Because every nonclustered index has a leaf-level index row for every row in the table, each time a new row is inserted into the table, SQL Server must insert a new index row in every nonclustered index. And each time a row is deleted from the table, SQL Server must remove a leaf-level row from every nonclustered index.

Using DELETE and UPDATE can be a two-edged sword. Frequently, you have to perform DELETE and UPDATE operations on only one or a few rows that meet some search condition. So, how do you find the rows you need to modify? Indexes can help SQL Server find the rows of interest, but as I noted earlier, they add overhead during the modification. The trick is to have on your tables only the indexes that help SQL Server find the rows to modify but no extra indexes that add overhead to your data-modification operations.

Determining which indexes can be useful for finding the desired rows is just like indexing for SELECT operations, which I've discussed in several previous columns, so I won't cover that again. Instead, let's look at what happens during the DELETE and UPDATE operations.

DELETE


When you delete rows from a table, you need to consider what happens to the data pages and the index pages. Remember that the data is the clustered index's leaf level, and the deletion of rows from a table that has a clustered index happens in the same way as deletion from the leaf level of a nonclustered index. However, SQL Server manages the deletion of rows from a heap (a table with no clustered index) and from an index's non-leaf pages in a different way.

First, note that SQL Server 2000 and 7.0 don't automatically compress space on a page when a row is deleted from a heap. Compaction doesn't occur until a page needs additional contiguous space for inserting a new row.

Second, in the index's leaf level, when rows are deleted, SQL Server marks them as ghost records. The ghost row stays on the page, but a bit in the row header changes to indicate the row's ghost status, and search operations will never access the row. (The purpose of maintaining the deleted row as a ghost record is mainly to allow greater concurrency optimization during key-range locking, but the details are beyond the scope of this article.) Ghosting happens at the leaf level of both the clustered index (the data) and every nonclustered index on the table. Because SQL Server deletes a row from all nonclustered indexes whenever you delete a row from a table, one deleted row can result in many ghost records. SQL Server doesn't "ghost" rows in non-leaf index pages when they're deleted, but as with heap pages, SQL Server doesn't compress the space until new index rows need space on the page.

When the last row is deleted from a data page, SQL Server deallocates the entire page. (Unless the page is the only one remaining in the table—a table always contains at least one page, even if it's empty.) This deallocation also results in the deletion from the index page of the row that pointed to the old data page. If the deletion of an index row (which might occur as part of a delete-plus-insert update strategy) leaves only one entry in the index page, SQL Server moves that entry to a neighboring page and deallocates the empty page.

UPDATE


SQL Server can update rows in multiple ways, automatically and invisibly choosing the fastest update strategy for the specific operation. In determining the strategy, SQL Server evaluates the number of rows affected, how it will access the rows (by a scan or an index retrieval, and through which index), and whether changes to the index keys will occur. Updates can happen either in place or as a delete followed by an insert.

What happens if an update causes a data row to move to a new location? In SQL Server 2000 and 7.0, a move can happen when an update to a row with variable-length columns makes the row large enough that it no longer fits on its original page. Or, because SQL Server stores rows in order of the clustering key, it can happen when a clustered index column value changes. For example, if you have a clustered index on the lastname column, SQL Server will store a row with a lastname value of Abbot near the beginning of the table. If you then update the lastname value to Zappa, SQL Server has to move the row to near the end of the table. When a row moves because of a change in the clustered-index key value, SQL Server must update every nonclustered index on the table to reflect the new value. When you're deciding which columns to build your clustered index on, remember that the nonclustered index pointers use the clustered-index key as a bookmark for locating the data row. So, it's best to cluster on a nonvolatile column.

If a row moves because it no longer fits on the original page, it still has the same row locator (i.e., the clustering key for the row stays the same), and SQL Server doesn't have to modify any nonclustered indexes. As I've mentioned before, for a table that has a clustered index, the bookmark information in the leaf-level nonclustered index rows is the clustered-index key value. If the table is a heap, the bookmarks in the nonclustered-index leaf rows are the physical locations of the rows. In SQL Server 2000 and 7.0, if a row in a heap moves to a new page, the row leaves a forwarding pointer in the original location. The nonclustered indexes don't need changing; they still refer to the original location, and from there, the forwarding pointer directs SQL Server to the new location.

Forwarding pointers let you modify data in a heap without worrying about SQL Server having to make drastic changes to the nonclustered indexes. If SQL Server has to move a forwarded row again, it updates the original forwarding pointer to point to the new location. That way, you don't end up with a forwarding pointer pointing to another forwarding pointer. Furthermore, if the forwarded row shrinks enough to fit in its original place, SQL Server eliminates the forwarding pointer and moves the record back to that place.

Although SQL Server can manage individual rows efficiently by using forwarding pointers, you still need to be concerned about their overhead. When performing a scan of the entire table, SQL Server follows each forwarding pointer as it accesses a row, then returns to the row after the one that contains the pointer. This process results in two additional page accesses for each forwarded row in the table.

Running the script in Listing 1 lets you see this overhead for yourself. The script first creates a copy of the Northwind database's Customers table, called BigCustomers, then gives the CompanyName column a much longer maximum length. The sp_spaceused procedure shows that the table has 91 rows and 4 data pages (32K of data space). When you set STATISTICS IO on and select all the rows from the table, only four logical reads are required, as you'd expect. However, the script then updates the CompanyName column to make the data fill all 800 bytes. This update increases the size of each row, so most of the enlarged rows need to move to a new page. After the update, the script again runs sp_spaceused, which shows that the table now takes up 13 pages (104K of data space). SQL Server added 9 pages to handle the increased size of the 91 rows.

Finally, the script again selects all the rows from the updated table. Here, you see the real penalty of forwarding pointers. Although the table has only 13 pages, a SELECT of all the rows on those pages requires 153 page reads—almost two reads for every row in the table. Note that you can encounter large numbers of forwarded rows not just from a batch update, as with Listing 1's script, but also when you've updated many individual rows over time, as is likely in a production system.

This problem of needing an enormous number of extra page reads is a concern only when SQL Server is performing scans of most, or all, rows in a table during one operation. If you're accessing the rows individually, as when you have a very selective WHERE clause, the extra overhead is often unnoticeable. You can avoid the increase in reads by building a clustered index on your table because SQL Server uses forwarding pointers only in heap (nonclustered) tables. If you can't have a clustered index, the next best solution is to fix the column length and not allow any change in size.

A future version of SQL Server might include some mechanism for performing a physical reorganization of the data in a heap, which would get rid of forwarding pointers. Note that forwarding pointers exist only in heaps, but DBCC INDEXDEFRAG doesn't work on heap tables. So, you can defragment a nonclustered index but not the heap table itself. With current SQL Server releases, when a forwarding pointer is created, it stays in place forever—with two exceptions. The first exception is the case I mentioned, in which a row shrinks enough to return to its original location. The second exception happens when the entire database shrinks. SQL Server reassigns the bookmarks when it shrinks a file. The shrinking process never generates forwarding pointers. For pages that are removed during the shrinking process, any forwarded rows they contain are effectively "unforwarded."

In SQL Server 2000 and 7.0, updating a row in place is the rule rather than the exception. Unless the row can't stay in the same location because of one of the reasons I mentioned, the row will remain in its place, and no other changes to any indexes are necessary. For more details about the internal mechanics of updating rows, see "SQL Server 7.0 Update Strategies," March 2000, InstantDoc ID 8031. All the information in that article is also relevant to SQL Server 2000.

Recommendations


You can do several things to obtain optimal indexing for data-modification operations. First, it's a good idea to have a clustered index key, which should be on a column (or group of columns) that rarely changes. Second, test all the query types you'll be running to make sure that the indexes that support your retrieval operations don't impose too high a cost on your modification operations. Third, create nonclustered indexes only if they're absolutely necessary to help SQL Server find the rows you need to retrieve or modify. If you have several indexes that all give good performance for retrieval of the same data, keep only one of them—preferably one that also can help your data-modification operations.

This discussion has focused on the internal manipulation necessary for modifying just one row. If one operation modifies multiple rows, you need to be aware of some other side effects. Most of the special considerations you need to take into account when modifying multiple rows within a query are the same for all three data-modification operations (inserts, updates, and deletes). But I'll save that discussion for next time.