Question: I know that the histogram can have a maximum of 201 steps (as per your Why can’t SQL Server update statistics on the fly? article last week). But some statistics have fewer than 200 steps even though the leading column has more than 200 distinct values. More specifically, if there are more than 200 distinct values, why wouldn’t the histogram always have the maximum number of steps?
Answer: This is another great stats question and one that seems really strange at first glance. If SQL Server can store up to 200 steps, and if having more steps generally leads to better statistical information, then why wouldn’t SQL Server always use all 200 steps when there are more than 200 distinct values?
Simply put, SQL Server would keep all 200 steps if there were a benefit in doing so. However, during the final phase of statistics creation (called histogram consolidation), SQL Server goes through and looks at neighboring steps. If the AVG_RANGE_ROWS values are the same, and the steps are similar in terms of the number of rows in their ranges (column: RANGE_ROWS), then calling out a specific but uninteresting value is a bit of a waste in the histogram. As a result, two neighboring steps can be consolidated into one larger step (representing the combination of both sets of rows) because the estimates for both of the ranges were already the same. This is more likely in data sets where the leading column is largely evenly distributed, or where similar patterns repeat throughout various ranges of the data.
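As a quick way to see this effect yourself, here is a hypothetical repro (the table and statistic names are invented, not from this post): a column whose values are evenly distributed across many distinct values typically ends up with far fewer than 200 steps.

```sql
-- Hypothetical repro: evenly distributed data tends to consolidate heavily.
CREATE TABLE dbo.EvenlyDistributed
(
    ID  int IDENTITY PRIMARY KEY,
    Val int NOT NULL
);

-- 100,000 rows, with Val cycling evenly through 1,000 distinct values.
INSERT dbo.EvenlyDistributed (Val)
SELECT TOP (100000)
       (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 1000) + 1
FROM sys.all_columns AS c1
CROSS JOIN sys.all_columns AS c2;

CREATE STATISTICS Stat_Val ON dbo.EvenlyDistributed (Val) WITH FULLSCAN;

-- The Steps value in the header will typically be well under 200,
-- even though the column has 1,000 distinct values.
DBCC SHOW_STATISTICS ('dbo.EvenlyDistributed', 'Stat_Val') WITH STAT_HEADER;
```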
For example, let’s create another statistic on the member table but this time on the column “FirstName.”
NOTE: This column is not what you might be expecting; first names are randomly generated and, as a result, basically unique within the table.
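A sketch of creating that statistic (the statistic name Member_FirstName matches the DBCC SHOW_STATISTICS command below; the WITH FULLSCAN option is an assumption on my part so the histogram reflects every row rather than a sample):

```sql
-- Assumes the Member table from the prior examples.
-- FULLSCAN is an assumption, used so the histogram is built from all rows.
CREATE STATISTICS Member_FirstName ON dbo.Member (FirstName) WITH FULLSCAN;
```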
Reviewing the output from DBCC SHOW_STATISTICS (Member, Member_FirstName), we see the following subset of data from the histogram:
** This column does not exist in the histogram. It is added here so that you can see what rows – and how many rows – have been removed to reduce the output shown.
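If you want to reproduce just the histogram portion (rather than all three result sets that DBCC SHOW_STATISTICS returns by default), you can restrict the output with WITH HISTOGRAM:

```sql
-- Returns only the histogram rows: RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS,
-- DISTINCT_RANGE_ROWS, and AVG_RANGE_ROWS.
DBCC SHOW_STATISTICS ('Member', 'Member_FirstName') WITH HISTOGRAM;
```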
For the value of AKLAVXHXIFNVIN there is exactly 1 row that matches (using column: EQ_ROWS).
For the value of AXARKRNWRSVGONVV there is exactly 1 row that matches (using column: EQ_ROWS).
Between these two values but NOT including those two values there are 52 rows in the range (column: RANGE_ROWS). And, of that range there are 52 distinct values (column: DISTINCT_RANGE_ROWS).
Then, for the value of AVGFXPM there is exactly 1 row that matches (using column: EQ_ROWS).
Between this value (AVGFXPM) and the prior value (AXARKRNWRSVGONVV) there are 25 rows, and all 25 are unique as well. In fact, even the following step says just about the same thing.
If we were to combine steps 4 and 5 into step 6, would it make much of a difference to any of the estimates between these steps?
Below would be the changes necessary to combine these steps:
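As a sketch of the arithmetic involved (the row counts here are taken from the ranges discussed above, but which steps they map to is my assumption): merging two adjacent steps sums their RANGE_ROWS plus the EQ_ROWS of the boundary value that disappears, and likewise for DISTINCT_RANGE_ROWS, so AVG_RANGE_ROWS remains 1.

```sql
-- Illustrative only: merging a step (RANGE_ROWS = 25, DISTINCT_RANGE_ROWS = 25,
-- boundary EQ_ROWS = 1) into a neighboring step
-- (RANGE_ROWS = 52, DISTINCT_RANGE_ROWS = 52).
SELECT
    52 + 1 + 25                         AS combined_RANGE_ROWS,          -- 78
    52 + 1 + 25                         AS combined_DISTINCT_RANGE_ROWS, -- 78
    (52 + 1 + 25) * 1.0 / (52 + 1 + 25) AS combined_AVG_RANGE_ROWS;      -- still 1
```

Because AVG_RANGE_ROWS (RANGE_ROWS divided by DISTINCT_RANGE_ROWS) is unchanged, any estimate that falls inside the merged range comes out the same as before.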
Using EITHER version of the histogram, estimate the number of rows for the following query:
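The original query isn't reproduced here, but a hypothetical query of the same shape would look like this (the literal 'AVORFMZL' is invented purely as an example of a value that would sort inside the range in question):

```sql
-- 'AVORFMZL' is a made-up value falling between the step boundaries above.
SELECT *
FROM dbo.Member
WHERE FirstName = 'AVORFMZL';
```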
Using the original histogram, this value falls within step 5 and the estimate would be 1 row (from AVG_RANGE_ROWS).
Using the consolidated histogram, this value falls within step 4, but the estimate would still be 1 row (from AVG_RANGE_ROWS).
And, if you review the statistics header, you can see that even though this table has more than 200 distinct values, the number of steps stored in the histogram is only 130. SQL Server could have consolidated even further, and if more data were added to this table, some of these steps would have to be removed/consolidated. But, for right now, limiting it to 130 (or even fewer, for many of the steps) won’t change the estimates for this column because it’s evenly distributed. For more heavily skewed data you are more likely (again, not a guarantee!) to see the maximum number of steps.
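To check the step count yourself, the header portion of the output can be requested on its own; the Steps column shows how many histogram steps were kept:

```sql
-- The Steps column of the STAT_HEADER output shows the number of
-- histogram steps that survived consolidation (130 in this example).
DBCC SHOW_STATISTICS ('Member', 'Member_FirstName') WITH STAT_HEADER;
```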
PS – Be sure to read the comments on this post (and the prior one), as there were some good follow-up questions!