Why can't SQL Server update statistics on the fly?

This week's question comes from one of the attendees of SQLConnections, who asked it during our "Follow the Rabbit - Q&A" session, with which Paul and I always end the conference.

Question: Why can’t SQL Server perform “in-line” statistics updates – basically, updating statistics “on the fly” and as data changes?

Answer: This is a great question and one that is best explained by expanding on the internals of a statistics blob. SQL Server stores statistics in 3 parts: the statistics header, the density vector and the histogram. The statistics header has basic (but very helpful) details such as the last time the statistics were updated (column: Updated), the number of rows in the table (column: Rows) vs. the number of rows used to generate the statistics information (column: Rows Sampled) and a few other items such as the number of steps in the histogram (column: Steps). It’s in the “number of steps” where we get some information as to how SQL Server is able to store “summary” information.
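Each of these three parts can be viewed on its own: DBCC SHOW_STATISTICS (shown in more detail below) accepts a WITH clause naming just the part you want. A quick sketch, using the table and index from the example later in this post:

DBCC SHOW_STATISTICS ('dbo.member', 'member_corporation_link') WITH STAT_HEADER
DBCC SHOW_STATISTICS ('dbo.member', 'member_corporation_link') WITH DENSITY_VECTOR
DBCC SHOW_STATISTICS ('dbo.member', 'member_corporation_link') WITH HISTOGRAM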

Related: The Importance of SQL Server Statistics

To be effective, statistics have to be relatively small. If a statistic represented every data value, it could end up being large (possibly as large as the data itself) and very expensive to read and maintain; at that point it would be faster to just read the data and not bother with a statistic at all. As a result, SQL Server uses the histogram to store real data values, but not every value can be represented. For a variety of reasons, SQL Server limits the number of "steps" to 200 distinct actual values from the first column of the key, plus 1 step to represent NULLs (if the first column allows NULLs). This means that a histogram can have a maximum of 201 steps.

It is within this histogram (and how it works) that I can explain why "on the fly" updates are impossible. Each step has an actual value of data, but to make the statistic even more helpful to SQL Server, each step also has information about the data encountered in that step. To explain this I'm going to use a real example from the "Credit" sample database. This is one that many of us use in tuning examples, and you can download a SQL Server 2000 or 2008 version of it from our resources page. In this database there's a table called Member, and it has 3 indexes: one clustered and two nonclustered. I'd like to explain the histogram in the context of one of the nonclustered indexes: member_corporation_link.

This member_corporation_link index was defined as follows:

CREATE NONCLUSTERED INDEX [member_corporation_link]
ON [dbo].[member] ([corp_no])

First, to see a statistic, you need to use DBCC SHOW_STATISTICS:

DBCC SHOW_STATISTICS (table_name, index_or_statistics_name)
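For the example in this post, that call looks like this (table and index names from the Credit sample database described above):

DBCC SHOW_STATISTICS ('dbo.member', 'member_corporation_link')

It returns three result sets, one for each part of the statistics blob, in the order shown below.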

Statistics Header

Name                     Updated             Rows   Rows Sampled  Steps  *…
member_corporation_link  Aug 26 2008 5:18PM  10000  10000         201    *…

* There's more information provided, but it's not directly relevant to this article/discussion.

Density Vector

All density  Average Length  Columns
0.0025       0.6008          corp_no
0.0001       4.6008          corp_no, member_no

There's a lot to the density vector, but it's not relevant to this question/example.

Histogram

**   RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
1    NULL          0           8498     0                    1
2    5             0           1        0                    1
3    33            5           2        5                    1
…
137  402           0           8        0                    1
138  403           0           4        0                    1
139  407           14          5        2                    7
140  408           0           7        0                    1
…
200  499           0           7        0                    1
201  500           0           8        0                    1

** This column does not exist in the histogram. It is added here so that you can see which rows – and how many – have been removed to reduce the output shown.

Understanding a statistic

In the output of DBCC SHOW_STATISTICS, you can see that this particular table has 10,000 rows, with all 10,000 having been sampled to generate the statistics (this information is from the statistics header). You can also see that there are 201 steps in the histogram. You can see this in two ways: as Steps (in the statistics header) and as the number of rows in the histogram.

The way to read a histogram is as follows. If the following query were to be executed:

SELECT * FROM Member WHERE corp_no = 403

Then, SQL Server would use the histogram to determine the number of rows that match. In this case, the value supplied (403) is actually a step endpoint (RANGE_HI_KEY) in the histogram. As a result, SQL Server estimates 4 rows (from the EQ_ROWS column). However, if the following query were to be executed, the "estimate" is handled differently because the value is not a step endpoint but instead a member of a step.

SELECT * FROM Member WHERE corp_no = 404

For this query, the estimate would be 7. This value comes from reading the step whose RANGE_HI_KEY is 407. The way to read the step is that there are 14 rows between 403 and 407 (not including the rows equal to 403 or 407 themselves), and within that range there are 2 distinct values (column: DISTINCT_RANGE_ROWS). Therefore, the average number of rows for any value within this range is 14 / 2 = 7 (column: AVG_RANGE_ROWS).

However, is this really true? Without running the query – you won’t know. And, for the purpose of this post – it doesn’t really matter. However, it is with this example where I can describe how “on the fly” updates are impossible!
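If you do want to check, one option is SET STATISTICS PROFILE, which executes the query and returns the plan with the estimated and actual row counts side by side. A quick sketch (the graphical "actual execution plan" in Management Studio shows the same numbers):

SET STATISTICS PROFILE ON
SELECT * FROM Member WHERE corp_no = 404
SET STATISTICS PROFILE OFF
-- Compare the EstimateRows and Rows columns in the output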

Between the values of 403 and 407 there are *3* possible values. However, the table currently has rows for only 2 of those values. Which two are NOT stored within the statistic; that is only known at the time the data is analyzed. And, if a new row were to enter the set with a value of 404, how would the statistic be updated? Is that a new row for one of the already "distinct" values within the range, making the numbers:

**   RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
139  407           15          5        2                    7.5

OR, is this actually a NEW value among the distinct values possible for this range, making the numbers:

**   RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
139  407           15          5        3                    5

It is impossible for SQL Server to accurately reflect the "current" view of the data when only a single value has been entered. And, in fact, this is the reason why "on the fly" statistics updates are impossible.
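What SQL Server does instead (as discussed in the comments below) is rebuild the statistic by re-analyzing the data set as a whole, either automatically or on demand. A minimal sketch of the on-demand version, using the names from this example and assuming the sample database is named Credit:

-- Rebuild this one statistic by reading every row rather than a sample
UPDATE STATISTICS dbo.member member_corporation_link WITH FULLSCAN

-- Let SQL Server regenerate statistics automatically once enough rows have changed
ALTER DATABASE Credit SET AUTO_UPDATE_STATISTICS ON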

Related: Making the Most of Automatic Statistics Updating

And, I love statistics questions – so, keep ‘em coming!

Enjoy,

kt

Comments (5)

on Nov 11, 2011
Hey there Mark - If you think of the specific possibilities for values between 403 and 407, the possible values are 404, 405 and 406. However, the statistic says there are "2" distinct values within the range (specifically, in the table). But what it doesn't store is which of the specific values (404, 405 or 406) doesn't have rows. SQL Server can only "estimate" 7 for ALL three of those values (if we were to run a query). What is NOT stored within the statistic is which of the 3 do have rows and which of the 3 don't. All we know is the estimated number of rows for any value within that step. Definitely let me know if that doesn't help and/or if you need more information! And, great point Shahchi1 - here's the link to our resources page for anyone that might be interested in watching my statistics video: http://www.sqlskills.com/T_MCMVideos.asp. Cheers, kt
on Nov 8, 2011
Good insight, but it then begs the question: what good is the "Auto Update Stats" setting?
on Nov 7, 2011
Kimberly, when it comes to indexes and statistics, I find your insight as a subject matter expert very helpful. Excellent. I encourage anyone reading this article to watch your MCM video on statistics.
on Nov 7, 2011
Kimberly, the following sentence has confused me a little; the rest I get, but this one bit throws me: "Between the values of 403 and 407 there are *3* possible values. However, the table currently has rows for only 2 of those values. Which two are NOT stored within the statistic; that is only known at the time the data is analyzed." If you do not mind, could you explain what you mean by "Which two are NOT stored within the statistic"? Thanks, Mark
on Nov 11, 2011
Hey there pjcwik - The auto update stats setting is one that looks over the entire data set, BUT not on the fly. Instead, when a certain percentage of rows have changed, SQL Server will review the entire set (possibly using sampling) to generate a completely new stat_header, density_vector and histogram. Because it's looking over the entire set, it can generate more accurate steps. So, in general, I do recommend leaving auto update stats on. However, some folks perform their own updates to stats more frequently and, as a result, don't need auto updating. Hope that helps! kt
