This week's question comes from one of the attendees of SQLConnections who asked this during our "Follow the Rabbit - Q&A" session with which Paul and I always end the conference.
Question: Why can’t SQL Server perform “in-line” statistics updates – basically, updating statistics “on the fly” and as data changes?
Answer: This is a great question and one that is best explained by expanding on the internals of a statistics blob. SQL Server stores statistics in 3 parts: the statistics header, the density vector and the histogram. The statistics header has basic (but very helpful) details such as the last time the statistics were updated (column: Updated), the number of rows in the table (column: Rows) vs. the number of rows used to generate the statistics information (column: Rows Sampled) and a few other items such as the number of steps in the histogram (column: Steps). It’s in the “number of steps” where we get some information as to how SQL Server is able to store “summary” information.
To be effective, statistics have to be relatively small. If a statistic represented every data value it could end up being large (possibly as large as the data) and very expensive to read and maintain. If statistics were as large as the data then it would be faster to read the data and just not bother with a statistic about the data. The end result would be that statistics would be useless because of the cost. As a result, SQL Server uses the histogram to store real data values but not every value can be represented. For a variety of reasons, SQL Server limits the number of “steps” to 200 distinct actual values from the first column of the key plus 1 row to represent nulls – if the first column allows nulls. This means that a histogram can have up to a maximum of 201 steps.
It is within this histogram (and how it works) where I can explain the absolute impossibility for “on the fly” updates. Each step has an actual value of data but to help the statistic to be even more helpful to SQL Server, each step also has information about the data encountered in that step. To explain this I’m going to use a real example from the “Credit” sample database. This is one that many of us use in tuning examples and you can download a SQL Server 2000 or 2008 version of it from our resources page. In this database there’s a table called Member and it has 3 indexes: a clustered and two nonclustered indexes. I’d like to explain the histogram in the context of one of the nonclustered indexes: member_corporation_link.
This member_corporation_link index was defined as follows:
First, to see a statistic – you need to use DBCC SHOW_STATISTICS.
|member_corporation_link||Aug 26 2008 5:18PM||10000||10000||201||*…|
* There’s more information provided but not directly relevant to this article/discussion.
|All density||Average Length||Columns|
There’s a lot to the density vector but its not relevant to this question/example.
** This column does not exist in the histogram. It is added here so that you can see what rows – and how many rows – have been removed to reduce the output shown.
Understanding a statistic
In the output of show_statistics, you can see that this particular table has 10,000 rows with all 10,000 having been sampled for the generation of the statistics (this information is from the statistics_header). You can also see that there are 201 steps in the histogram. You can see this in two ways – Steps (in the statistics header) and as the number of rows in the histogram.
The way to read a histogram is as follows. If the following query were to be executed:
Then, SQL Server would use the histogram to determine the number of rows that match. In this case, the number supplied (403) is actually in the histogram. As a result, SQL Server estimates 4 rows (from the EQ_ROWS column). However, if the following query were to be executed, the “estimate” is handled differently because the value is not a specific step but instead a member of a step.
SELECT * FROM Member WHERE corp_no = 404
For this query, the estimate would be 7. This value comes from reading the step shown for the value of 407. The way to read the step is that there are 14 rows between 403 and 407 (but not including those of 403 and 407) and within that range there are 2 distinct values in the range (column: DISTINCT_RANGE_ROWS). Therefore, the average number of rows for any value within this range would be 7 (column: AVG_RANGE_ROWS).
However, is this really true? Without running the query – you won’t know. And, for the purpose of this post – it doesn’t really matter. However, it is with this example where I can describe how “on the fly” updates are impossible!
Between the values of 403 and 407 there are *3* possible values. However, this table currently only shows rows for 2 of the values? Which two is NOT stored within the statistic; it is only known at the time the data is analyzed. And, if a new row were to enter the set with a value of 404, how would the statistic be updated? Is that a new row for one of the already “distinct” values within the range making the numbers:
OR, is this actually a NEW value within the distinct possible for this range, making the numbers:
It is impossible for SQL Server to accurately reflect the “current” view of the data with only a single value entered. And, in fact, this is the reason for why “on the fly” statistics updates are impossible.
And, I love statistics questions – so, keep ‘em coming!