SQL Server uses statistics to determine the type of execution plan it'll use for a query. SQL Server usually does a very good job of estimating statistics, so the developer doesn't need to worry about anything. But occasionally it doesn't, in which case a better understanding of statistics and how SQL Server determines them is essential to solving performance issues. If you don't have a good handle on statistics, you can end up with queries that don't improve despite proper indexes and despite all the other steps you take to optimize them. And worst of all, because these types of problems relate to the values of the data (not its structure), they tend to occur after the database is in use, when many transactions are taking place.
Understanding SQL Server Statistics
In a simplified view, the statistics that SQL Server collects are the number of distinct values occurring for a given column or set of columns. SQL Server determines the statistics by sampling the data in the table. Depending on the size of the table, it might sample just a small percentage of the rows. For example, suppose you have a column in a table that holds the response to a multiple-choice question. In it, a fourth of the values are A, half are B, and the remaining fourth are C. SQL Server might look at 1,000 of the 100,000 rows (just 1 percent) and come up with three distinct values. Because it has correctly estimated the number of distinct values, making decisions based on those estimates will give good results. SQL Server will pick efficient execution plans.
Now suppose that you add an Other option, D, which lets users enter their own values. Only 1 percent of the users choose D, and all the entered values are different. Because SQL Server is sampling 1,000 of 100,000 rows in this example, it should sample about 10 rows in which D was chosen and a fill-in value was recorded. Just as it did for the previous values (A, B, and C), SQL Server will estimate that each sampled value represents 100 identical values. The logic for this is simple:
100,000 total rows / 1,000 sampled rows = 100 actual rows per sampled row
So, each sampled row is assumed to represent the probable values for 100 actual rows. Thus, SQL Server will estimate that there are 13 distinct values: 3 for A, B, and C, and 10 for D. However, each D value is unique, so the actual number of distinct values is much larger: 1,003. As a result, SQL Server might pick a bad execution plan because it has made a bad estimate.
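The sampling arithmetic above can be simulated in a few lines of Python. This is a sketch of the idea, not SQL Server's actual sampler; the exact 25/50/24/1 percent split and the random seed are illustrative assumptions:

```python
import random

# Recreate the example column: 100,000 rows split roughly a fourth A,
# half B, a fourth C, plus 1 percent free-form "Other" answers, all unique.
rows = ["A"] * 25_000 + ["B"] * 50_000 + ["C"] * 24_000
rows += [f"D-{i}" for i in range(1_000)]
random.seed(42)
random.shuffle(rows)

sample = random.sample(rows, 1_000)   # a 1 percent sample

# Each sampled row stands in for 100,000 / 1,000 = 100 actual rows, so the
# sampler assumes every sampled value repeats 100 times and estimates the
# distinct count as the number of distinct values it saw in the sample.
estimated_distinct = len(set(sample))   # roughly 13: A, B, C plus ~10 D values
actual_distinct = len(set(rows))        # 1,003

print(estimated_distinct, actual_distinct)
```

Run it a few times with different seeds and the estimate hovers around 13 while the true distinct count stays at 1,003.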
For the vast majority of the data, a low sampling rate gives good estimates. It's just the 1 percent of the data that needs a much higher sampling rate to accurately estimate its distribution. SQL Server has a concept called density, which is 1 divided by the number of distinct values. For this example, when there are only A, B, and C, SQL Server would determine the density to be 1/3, or 0.33. When the D values are added, the estimated density would be 1/13, or 0.077. But the actual density is 1/1,003, or 0.000997. The estimated value is very far off from the real one: it's 77 times too large. This will likely cause SQL Server to pick the wrong type of execution plan because it's expecting too many rows to match a given value.
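Worked out as a quick sketch, the density figures from the example line up like this:

```python
# Density = 1 / (number of distinct values).
density_before = 1 / 3        # only A, B, and C exist: 0.33
density_estimated = 1 / 13    # sampler sees A, B, C, and ~10 D values: 0.077
density_actual = 1 / 1_003    # A, B, C, and 1,000 unique D values: 0.000997

overestimate = density_estimated / density_actual
print(round(overestimate))    # 77: the estimate is 77 times too large
```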
Using Statistics-Related Commands
To see the statistics for any table, you can issue the command
sp_autostats '<TableName>'
where <TableName> is the name of the table. You can then pick one of the statistics and drill down into it with the command
DBCC SHOW_STATISTICS ('<TableName>', <StatisticName>)
where <StatisticName> is the name of the statistic. This command's results will show you the statistic's details. In particular, the second result set will show the density that it has determined for the columns in the table. The first result set also shows a density value, but SQL Server no longer uses it. It's an old estimate of density used by earlier SQL Server versions.
You can force SQL Server to use different sampling rates by issuing the command
UPDATE STATISTICS <TableName>
[WITH FULLSCAN | SAMPLE <N> PERCENT]
In this command, you can use the FULLSCAN argument to specify that you want to scan all rows in the table or the SAMPLE <N> PERCENT argument, where <N> is the percentage of rows to scan. If you sample 99 percent or fewer rows, SQL Server doesn't promise that it'll use the specified sampling rate, but it will at least match or exceed it. For small tables, it typically samples every row no matter what you specify. If you specify 100 percent or FULLSCAN, SQL Server will use every row to calculate the statistics. In this case, the statistics aren't estimated but are the actual values.
Checking SQL Server's Statistics
Because you can make SQL Server sample at different rates and compare the results of those samplings, it's possible to see how the estimated density changes when the sampling rate is changed. In particular, you can compare SQL Server's default sampling rate with the actual statistics. The result of this comparison will let you know if you can let SQL Server handle updating the statistics or if you need to create your own statistic update scripts. The comparison process is as follows:
1. Enumerate all the statistics for the database by running sp_autostats on every table.
2. Run UPDATE STATISTICS on all the tables using the default sampling rate.
3. Use DBCC SHOW_STATISTICS for each statistic to record the estimated density.
4. Run the UPDATE STATISTICS command with the FULLSCAN option on all the tables.
5. Use DBCC SHOW_STATISTICS to record the actual density values.
6. Compare the estimated values to the actual values. When the estimated density differs from the actual density by a significant amount, SQL Server's default statistics routines can't be relied on for that statistic.
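The comparison in step 6 can be sketched as a small helper. The 10x cutoff here is a hypothetical threshold, not a SQL Server rule; pick whatever "significant" means for your workload:

```python
# Flag a statistic whose estimated density differs from its full-scan
# (actual) density by more than a chosen factor, in either direction.
def needs_custom_update(estimated_density: float, actual_density: float,
                        threshold: float = 10.0) -> bool:
    ratio = (max(estimated_density, actual_density)
             / min(estimated_density, actual_density))
    return ratio > threshold

# The multiple-choice example: 1/13 estimated vs. 1/1,003 actual is 77x apart.
print(needs_custom_update(1 / 13, 1 / 1_003))   # True: needs its own updates
print(needs_custom_update(1 / 3, 1 / 3))        # False: default routine is fine
```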
Manually performing this process is tedious, especially if your database contains many tables. Thankfully, the process can be automated, as demonstrated by the usp_CheckStatistics stored procedure. You can download usp_CheckStatistics by clicking the 136329.zip hotlink at the top of this article.
The usp_CheckStatistics stored procedure will work on any database. To use it, you just need to create and execute the stored procedure in the desired database. It will list the statistics and specify how the estimated values compare to the actual values. Figure 1 shows sample output from the stored procedure. Note that some tables appear more than once in the results because there's one row for each statistic and these tables have more than one statistic collected. Also note that some columns have been removed for clarity.
In Figure 1, the Percentage column is the sampling rate that SQL Server used by default and the Ratio column is the comparison of the estimated density to the actual one. In this example, several tables have statistics that aren't well estimated by SQL Server.
If you have any tables that SQL Server doesn't estimate accurately, you can implement a statistics maintenance plan to make sure they are properly updated. Because SQL Server doesn't let you specify the sampling rate that it uses for its automatic statistics updates, the first task that the maintenance plan will have to do is to turn off SQL Server's automatic updating of the statistics in question. Fortunately, this is easy. You just use the UPDATE STATISTICS command with the NORECOMPUTE option for that table every time you call it:
UPDATE STATISTICS <TableName> WITH FULLSCAN, NORECOMPUTE
Although you could determine which sampling rate produces an estimated value that is as close as possible to the actual value, I recommend using FULLSCAN because the required sampling rate is usually close to 100 percent.
The remaining task is to determine when you should update the statistics. There are two general approaches:
- You can update them periodically (e.g., every night, every weekend) or immediately after a large update (e.g., after an extraction, transformation, and loading (ETL) operation).
- You can watch the same counter (i.e., rowmodctr) that SQL Server uses to determine how many changes (updates, deletes, and inserts) have occurred in the table since the last statistics update. To check the rowmodctr counter, you can run the command
SELECT o.name, x.name, x.rowmodctr
FROM sysindexes x
JOIN sysobjects o ON x.id = o.id
Figure 2 shows sample results. You might decide to update the statistics when rowmodctr exceeds a certain value.
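That decision rule can be sketched as follows. The 20-percent-of-rows threshold and the table figures are hypothetical; the point is simply to refresh statistics once enough of a table has churned:

```python
# Given (rowmodctr, total row count) pairs read from sysindexes, list the
# tables whose statistics are due for a refresh. The 20 percent threshold
# is an assumed policy, not a SQL Server rule.
def tables_to_update(counters: dict[str, tuple[int, int]],
                     fraction: float = 0.20) -> list[str]:
    # counters maps table name -> (rowmodctr, total row count)
    return [name for name, (modctr, total) in counters.items()
            if total > 0 and modctr / total > fraction]

counters = {"Orders": (30_000, 100_000),   # 30 percent modified: refresh
            "Customers": (500, 50_000)}    # 1 percent modified: leave alone
print(tables_to_update(counters))          # ['Orders']
```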
Prevent Performance Problems
By running the usp_CheckStatistics stored procedure in all your databases, you can determine whether SQL Server is selecting appropriate sampling rates. It's quite possible that the results will indicate that you don't need to do anything special. Thereafter, you can run it periodically to reevaluate the statistics in case the data distribution has changed.
Even if you choose not to run usp_CheckStatistics in your databases initially, it can be helpful when there's a performance problem. To troubleshoot a performance problem, you typically run the suspect query in Query Analyzer and look at the execution plan (among other things). The execution plan shows both the actual number of rows and the estimated number of rows. If you notice that those two values differ by a lot (typically by more than an order of magnitude), you should run the stored procedure to make sure that inaccurate statistics aren't causing the problem.