If you've upgraded your applications to SQL Server 7.0, you've probably already noticed that many formerly troublesome queries now run in a fraction of the time previously required. Much of this improvement results from a dramatically improved query optimizer, which now has many more options to consider when deciding the best way to execute your query. To make good choices, the optimizer must have meaningful statistical information regarding the distribution of values in the data being queried. In this article, I look at some of the ways that the SQL Server optimizer uses its statistical data. Also, I show how you can look at and interpret SQL Server statistical data, and how SQL Server statistics are maintained.

Index Selection

To determine the most appropriate processing strategy for a particular query, the optimizer gathers and analyzes data on both indexed and nonindexed columns. Whenever you create an index on a table containing data, the optimizer collects and stores statistics for that index. By default, SQL Server automatically updates the statistics whenever it determines that they are out of date. (I discuss the mechanism that SQL Server uses to determine whether the statistics are out of date in the "Updating the Statistics" section.) In addition, the optimizer generates and updates statistics about columns in WHERE clauses.

During query optimization, the optimizer determines whether an index exists for each condition in the WHERE clause. If an index exists, the optimizer first assesses the index's usefulness by determining the selectivity of the clause (i.e., how many rows the query would return), and then estimates the cost of finding the qualifying rows. An index is potentially useful if its first column occurs as a condition in the WHERE clause to establish a lower bound, an upper bound, or both to limit the search. For example, in the following WHERE clause, the condition involving the column name sets a lower and an upper bound for the values in the name column, thereby limiting the number of rows in the result set:

WHERE name LIKE 'M%'

Index Statistics

The optimizer generates two types of statistics: a histogram and density data. Here's a look at how the optimizer gathers and analyzes index data (and data on nonindexed columns) to provide these statistics.

Histogram. A histogram is a graphical representation of a frequency distribution. The histogram is created when an index is created on existing data; running the UPDATE STATISTICS command refreshes the values. If the index is created before data exists in the table, no statistics appear. The histogram would be misleading if statistics were generated when the dispersion of data values significantly differed from the dispersion of the current data in the table.

A histogram consists of a sampling of values for the index key (or the first column of the key for a composite index) based on the current data. In SQL Server 7.0, the histogram can contain as many as 300 values for an index key (or a nonindexed column). When issuing the UPDATE STATISTICS command, you can specify whether you want to gather all occurring values (the FULLSCAN option) or a sample of them. If a table is less than 8MB in size, SQL Server always performs a FULLSCAN. If you don't specify a percentage, SQL Server uses a computed percentage that is a logarithmic function of the table size. When sampling, SQL Server randomly selects pages from the table by following the Index Allocation Map (IAM) chain. (See Inside SQL Server, "The New Space Management," April 1999, for information on IAMs.) After SQL Server selects a page, it uses all the values from that page in the sample. Because disk I/O is the main cost of maintaining statistics, using all values from every page read minimizes maintenance costs.

SQL Server sorts through the values, dividing them into as many as 299 approximately equal intervals. The endpoints of these 299 intervals become the 300 histogram values, which are sometimes referred to as histogram steps. The optimizer's computations assume that the data has an equal number of values between any two adjacent steps, so even if the table or sample has more than 300 rows, the histogram might have fewer than 300 steps to achieve uniformity of step size.

SQL Server stores the histogram's statistics in the statblob field of the Sysindexes table. To fully estimate the usefulness of an index, the optimizer needs to know the number of pages in the table or index; this information is in the dpages column of Sysindexes.

The statblob field in Sysindexes is data type image. (SQL Server stores image data in structures separate from the data row; the data row contains only a pointer to the image data. For simplicity, I refer to the histogram as being stored in the statblob image field.) In addition to the histogram, the statblob field contains the time of the last statistics collection, the number of rows used to produce the histogram and density information, the average row length, and densities for other combinations of columns.

Density data. The statistics information also includes details about the uniqueness, or density, of the data values. Density measures how selective an index is. The more selective an index is, the more useful it is, because higher selectivity means that a query can eliminate more rows from consideration. A unique index is the most selective. By definition, each index entry in a unique index can point to only one row.

The statblob field contains two types of density information: density values and all_density values. The optimizer uses the density and all_density values to determine whether it is better to conduct a table scan or to use an index to access the data. SQL Server computes these values based on the number of rows in the table, the cardinality of the indexed columns (i.e., the number of distinct values), the number of nonfrequent (NF) values (i.e., values that appear no more than once in the histogram steps), and the cardinality of NF values. SQL Server defines these two values as:

density = (NF count/distinct NF count)/(number of rows)
all_density = 1/cardinality of index keys

Density values range from 0 to 1. Highly selective indexes have density values of 0.1 or less. A unique index has a density value of 1/number of rows in the table. For example, a unique index on a table with 8345 rows has a density of 0.00012 (1/8345), which is much less than 0.1. Thus, this unique index is useful because it is selective enough to be more efficient than a table scan.

If the same table has a nonunique nonclustered index with a density of 0.2165, you can expect each index key to point to about 1807 rows (0.2165 * 8345). Driving the query from a nonclustered index means retrieving the pages in index order, one at a time. In other words, SQL Server must individually access an estimated 1807 data pages. Thus, this nonunique nonclustered index is probably not useful because it is not selective enough to be more efficient than a table scan.

In a nonclustered index, the leaf level contains all the index key values, and it is from these key values that SQL Server builds the statistics information. Let's assume that for the nonunique nonclustered index just described, the index leaf pages point directly to the data pages. Also assume that about 40 rows fit on a page, giving you 209 (8345/40) total pages. The chance of two adjacent index entries pointing to the same page is only 0.5 percent (1/209). The number of logical I/Os for just the data pages, not even counting the index I/O, is likely to be close to 1807. In contrast, the entire table can be scanned with just 209 logical I/Os—the number of pages in the table. In this case, the optimizer looks for better indexes or decides that a table scan is the best it can do.

Interpreting the Statistics

Index statistics for a single column consist of one histogram, one density value, and one all_density value. Multicolumn index statistics for one set of columns in a composite index consist of one histogram for the first column in the index, one density value for the first column, and all_density values for each prefix combination of columns (including the first column alone). This means that if the index is on columns a, b, and c, SQL Server will compute an all_density value for column a, an all_density value for the combination of columns a and b, and an all_density value for the combination of columns a, b, and c.

The availability of density information for all columns helps SQL Server decide how useful the index is for joins. For example, suppose that an index consists of three key fields. The density on the first column might be 0.5, which isn't too useful. But as you look at more columns in the key, the number of rows that are pointed to is fewer than (or in the worst case, the same as) the number of rows that are pointed to by just the first column, so the density value goes down. If you're looking at both the first and second columns, the density might be 0.25, which is somewhat better. And if you examine all three columns, the density might be 0.03, which is highly selective. (Referring to the density of only the second column doesn't make sense because the lead column density is always necessary.)

Here's an example of how to interpret multicolumn index statistics. Suppose you have a nonclustered composite index (idx3) on the state and au_lname columns and an existing clustered index on the au_id column in the Authors table of the Pubs database. You can use the DBCC SHOW_STATISTICS command to display the statistics information for this index:

DBCC SHOW_STATISTICS(authors, idx3)

Table 1 contains the results. This output shows that the last update of the statistics for this index was on June 20, 1999. The output also shows that the table currently has 23 rows. State has eight distinct values, but two of them occur in multiple steps, so only six of them are nonfrequent (NF). In this data set, all the NF key values occur only once in the data, so the NF count is the same as the unique NF count, and the computed density value is 6/6/23, or about .0434.

The all_density value for the state column is 1/8, or 0.125. For the combination of state and au_lname, only one duplicate value occurs, so there are 22 unique values, which results in an all_density value of 1/22, or about .0454. An all_density value also exists for the combination of state, au_lname, and au_id. Because this table has a clustered index on au_id, that key appears along with every nonclustered index key. You can use that key to determine density information. The addition of au_id makes the three-valued key unique. In this case, all_density is 1/23, or about .0434; this value is the same as the density for the first column, but that's a coincidence.

SQL Server 7.0 can also keep track of statistics on columns with no indexes. Knowing the density (or the likelihood of a particular value occurring) can help the optimizer determine an optimum processing strategy, even if SQL Server can't use an index to locate the values.

You can use the CREATE STATISTICS command to create statistics on nonindexed columns. Also, you can execute the sp_createstats stored procedure, which creates single-column statistics for all eligible columns for all user tables in the current database. Eligible columns include all columns except computed columns; columns of the ntext, text, or image data types; columns that already have statistics; and columns that are the first column of an index.

Updating the Statistics

If your database option auto update statistics is set to true, SQL Server automatically updates the statistics whenever the query optimizer determines that the index statistics are out of date. Statistics are considered out of date when enough data-modification operations occur to minimize the statistics' usefulness. The number of operations is tied to the size of the table. Typically, the equation is something like

500 + 0.2 * (number of rows in the table)

This equation means that the table must have at least 500 modification operations before updates to the statistics occur during query optimization. For large tables, this threshold can be much higher.

You can manually force updating statistics in two ways. Run the UPDATE STATISTICS command on a table, one specific index, or column statistics. Or you can execute the sp_updatestats stored procedure, which runs UPDATE STATISTICS against all user-defined tables in the current database.

Automatic updates to index and column statistics occur by default because every database has two database options (auto create statistics and auto update statistics) set to TRUE upon creation of the database. You can use the sp_dboption stored procedure to disable either option. However, I don't recommend that you set this option to FALSE unless thorough testing shows that you don't need the automatic updates or your performance overhead is more than you can afford. This database option overrides all individual settings in tables. In other words, if the database option auto update statistics is FALSE, no automatic statistics updating takes place.

You can disable automatic statistics updating for a specific table in two ways. First, you can use the sp_autostats stored procedure to unset (or set) a flag for a table to specify whether to automatically update the statistics for that table. Also, you can use this procedure with only the table name to find out whether a table is set for automatic updates of index statistics. Second, you can execute the UPDATE STATISTICS command with the WITH NORECOMPUTE option, which prevents automatic recomputing of statistics. Running the UPDATE STATISTICS command again without the WITH NORECOMPUTE option enables automatic updates.

To avoid long-term maintenance of unused statistics, SQL Server deletes old statistics automatically created on nonindexed columns. The StatVersion column of the Sysindexes table shows how many times the statistics have been updated. After SQL Server updates the column statistics more than a certain number of times (SQL Server sets this threshold), it drops the statistics rather than update them again. (Statistics that you create are not subject to this process.) If SQL Server needs the statistics in the future, it can recreate them; there is no substantial cost difference between creating statistics and updating them.

The names of system-generated statistics are distinctive, such as _WA_Sys_au_fname_07020F21. You can use the sp_helpindex stored procedure to list the names of column statistics and table indexes. For example, running

sp_helpindex authors

will produce the output shown in Table 2. This example includes two indexes and two sets of statistics. The name of explicitly created statistics is s1. The name of system-generated statistics is _WA_Sys_au_fname_07020F21.

Knowledge Is Power

For the most part, you should leave the statistics-updating mechanisms at their default settings and let SQL Server take care of all statistics' updates. Your SQL Server system can perform well if you never run DBCC SHOW_STATISTICS or even think about statistics maintenance. However, understanding how SQL Server generates and stores statistics and what SQL Server uses them for can give you greater confidence in the decisions the SQL Server optimizer makes. Knowing how SQL Server works on the inside can also help you manage your data and applications and design better queries.