This past weekend, I presented at a developer-centric conference in Atlantic Beach, Fla., on the subject matter of things that we as database administrators see as pain points coming from gaps in development. The session broke down into two areas of concern: performance and security. While navigating performance pain points, one of the areas I spent time on was indexing; specifically decisions for indexing keys. I also then took a side trip into Included Columns in non-clustered indexes and the main purpose for Included Columns: Covering Indexes.

Related: Covered Query vs. Covering Index

A question arose concerning what happens behind the scenes between a query run against a table with a Covering Index and one with the same index structure, but without the Included Columns. Since we were wrapping up the talk, I could'nt really take the time I wanted to to answer the question in full. My response was (basically, but more-elegantly stated here with time and editing tools in abundance):

“The covering index affords you the ability to avoid the trip back to the table to satisfy the request since all of the columns requested exist in the non-clustered index. This means you’re also able to avoid any logical or physical reads to go back to the table for that information as well.”

However, that was not the complete picture I wanted to convey because there is also more to it than just avoidance of logical or physical reads. There is also the consideration of the necessary work to "put the data together" between the columns in the non-clustered index and the columns that need to be looked-up in the table.

To illustrate this let's create two identical tables; same schema, same data:

  1. CREATE TABLE Beach.dbo.cruiser_1
  2.  (
  3.   pkid INT IDENTITY(1,1) NOT NULL,
  4.   registration_id VARCHAR(20) NOT NULL,
  5.   last_name VARCHAR(50) NOT NULL,
  6.   first_name VARCHAR(50) NOT NULL,
  7.   address_1 VARCHAR(100) NULL,
  8.   address_2 VARCHAR(100) NULL,
  9.   city VARCHAR(100) NULL,
  10.   state_region VARCHAR(100) NULL,
  11.   postal_code VARCHAR(10) NULL,
  12.   company_id VARCHAR(20) NULL,
  13.   plus_one_notes VARCHAR(200) NULL,
  14.   registration_notes VARCHAR(200) NULL
  15.   );
  17. CREATE TABLE Beach.dbo.cruiser_2
  18.  (
  19.   pkid INT IDENTITY(1,1) NOT NULL,
  20.   registration_id VARCHAR(20) NOT NULL,
  21.   last_name VARCHAR(50) NOT NULL,
  22.   first_name VARCHAR(50) NOT NULL,
  23.   address_1 VARCHAR(100) NULL,
  24.   address_2 VARCHAR(100) NULL,
  25.   city VARCHAR(100) NULL,
  26.   state_region VARCHAR(100) NULL,
  27.   postal_code VARCHAR(10) NULL,
  28.   company_id VARCHAR(20) NULL,
  29.   plus_one_notes VARCHAR(200) NULL,
  30.   registration_notes VARCHAR(200) NULL
  31.   );

Trust me that the data is identical. We don't need to bore ourselves looking at the 10,000 records in each table. I've also created identical clustered indexes on each table; on the pkid column. Since it's an auto-incremented identity column I've also set the fill factor on that column at 100 percent.

  2.  PK_cruiser_1_pkid PRIMARY KEY CLUSTERED
  3.   (
  4.    pkid
  5.   )
  6.  WITH
  7.   (
  8.    PAD_INDEX = OFF
  9.    , FILLFACTOR = 100
  11.    , IGNORE_DUP_KEY = OFF
  12.    , ALLOW_ROW_LOCKS = ON
  13.    , ALLOW_PAGE_LOCKS = ON
  14.   ) ON [PRIMARY];
  16. GO
  18. ALTER TABLE dbo.cruiser_2 ADD CONSTRAINT
  19.  PK_cruiser_1_pkid PRIMARY KEY CLUSTERED
  20.   (
  21.    pkid
  22.   )
  23.  WITH
  24.   (
  25.    PAD_INDEX = OFF
  26.    , FILLFACTOR = 100
  28.    , IGNORE_DUP_KEY = OFF
  29.    , ALLOW_ROW_LOCKS = ON
  30.    , ALLOW_PAGE_LOCKS = ON
  31.   ) ON [PRIMARY];
  33. GO

Finally, let's introduce some differences between the two tables. dbo.cruiser_1 is getting a non-clustered index on last_name and first_name columns with an included column: registration_id.  dbo.cruiser_2 is getting the same non-clustered index but without the included column.

  1. CREATE NONCLUSTERED INDEX [IX_cruiser_1_last_name] ON [dbo].[cruiser_1]
  2. (
  3.  [last_name] ASC,
  4.  [first_name] ASC
  5. )
  7.  (
  8.   [registration_id]
  9.  )
  10. WITH
  11.  (
  12.   PAD_INDEX = OFF
  14.    , SORT_IN_TEMPDB = OFF
  15.    , DROP_EXISTING = OFF
  16.    , ONLINE = OFF
  17.    , ALLOW_ROW_LOCKS = ON
  18.    , ALLOW_PAGE_LOCKS = ON
  19.    , FILLFACTOR = 80
  20.   )
  21. GO
  23. CREATE NONCLUSTERED INDEX [IX_cruiser_2_last_name] ON [dbo].[cruiser_2]
  24. (
  25.  [last_name] ASC,
  26.  [first_name] ASC
  27. )
  28. WITH
  29.  (
  30.   PAD_INDEX = OFF
  32.    , SORT_IN_TEMPDB = OFF
  33.    , DROP_EXISTING = OFF
  34.    , ONLINE = OFF
  35.    , ALLOW_ROW_LOCKS = ON
  36.    , ALLOW_PAGE_LOCKS = ON
  37.    , FILLFACTOR = 80
  38.   )
  39. GO

Let's recap.  We have two identical tables, each with 10,000 records.  The only difference is in their single non-clustered indexes; one has an included column, the other does not.  They have identical primary keys and clustered indexes. 

If I submit the following query against the table with the included column in the clustered index we see the following execution plan:

  1. SELECT [registration_id], last_name, first_name
  2. FROM Cruiser_1
  3. WHERE last_name = 'Jones' AND first_name = 'Mary';

The entire operation occurs within the non-clustered index itself. We never have to travel back to the table for more data. If we execute the identical query against the table with the same index structure, minus the included column in the non-clustered index, we get different behavior:

  1. SELECT [registration_id], last_name, first_name
  2. FROM Cruiser_2
  3. WHERE last_name = 'Jones' AND first_name = 'Mary';

Since the non-clustered index doesn't contain all the information we need to satisfy the query, it's necessary to hit the index for the base information, and then using the row pointer in the non-clustered index, pull in the registration_id from the table and run a Nested Loop Join in order to marry the two working sets of data together before returning them to the end user. We can see the difference in expense for reads when looking at the I/O statistics from the two queries:

Note that these are all logical reads since the pages were already in the buffer. If this was a situation where it was a much larger data set you were contending with on an active server with the pages still on disk, this could be a very expensive operation. 

I hope this better explains the point I was attempting to get across in the session I presented. As a presenter, it's common to struggle for time in front of a class where we can't go as deep as we'd like to. Unless I'm on a SQL Cruise where I have plenty of time with the group for a full week, including dedicated time to expand on things we cover in formal classes, it's difficult to convey the deeper answers to questions posed in class. Please keep that in mind when you sit in one of our sessions, it's not that we don't want to go deep into the question—if we didn't love talking about our topics we would'nt be in front of the room—but we're not always forwarded the luxury of time to answer questions deeply.

Related: The Big Cover-Up: Get the Inside Scoop about Using Covering Indexes