| Executive Summary: Indexing the data warehouse can reduce the amount of time it takes to see query results. When indexing dimensions, you'll want to index on the dimension key. When indexing the fact table, you'll want to index on the date key or the combined data plus time. |
Indexing a data warehouse is tricky. If you have too
few indexes, the data loads quickly but the query
response is slow. If you have too many indexes,
the data loads slowly and your storage requirements
go through the roof but the query response is good.
Indexing in any database, transactional or warehouse,
most often reduces the length of time it takes to see
query results. This is especially true with large tables
and complex queries that involve table joins.
Some of the variables that you’ll want to take
into account when indexing the data warehouse are
the type of data warehouse you have (i.e., primarily
archive or primarily near real-time), how large the
dimensions and fact tables are (and if the fact tables are
partitioned), who will be accessing the data and how
they'll do so, and whether access will be ad hoc or via
structured application interfaces. These variables will
determine how your indexing scheme should be structured.
Here’s a simple plan for indexing the relational
tables that comprise a portion of your data warehouse.
(Although I only explain how to index dimensions and
fact tables in this article, I explain
how to index the staging database in
the Web-exclusive sidebar “Indexing
the Staging Database,” www.sqlmag.com, InstantDoc ID 99494.) Note
that the relational tables are those
that are managed by SQL Server’s
relational data engine, not those
managed by the SQL Server Analysis
Services (SSAS) engine.
Indexing Dimensions
You’ll want to index the dimension
key (primary key), which is a surrogate
key, not a “natural” or transactional
key such as customer name or
customer ID. Note that you shouldn’t
cluster on the dimension key.
The dimension will contain a natural or transactional
key (e.g., a transaction number or identifier),
which we’ll call a business key, from the source system.
Although the business key might not be unique—as
in the case of a type 2 response to slowly changing
dimensions—create a clustered index on the identity
column, which you can see in Figure 1. The Customer
and the Product dimensions have a clustered index
built on the business key. By clustering on this key, you
can enhance the query response when the business key
is used in the WHERE clause. The expression in the
WHERE clause is often used to search the dimensional
data, and having the dimension records pre-sorted
makes the query response faster.
Clustering by the business key might also help you
avoid lock escalation (i.e., row to table, intent-exclusive
to exclusive) during the extraction, transformation,
and loading (ETL) process, which
could happen if the surrogate key was the
cluster key and all the rows were being added
at the end of the file. If the exclusive locks escalated from row to table, that would block read,
other ETL load, or utility operations on the table and
cause diminished query response and even application
timeouts. If your users and applications can live with
some latency, you can get around this problem by
having them query database snapshots and reserving
the original database exclusively for data loads.
In Figure 1, the Date dimension and the Time dimension
have no external datasource or business key. Instead
of creating an identity-style primary key for these two
tables, consider using a smart key, with a YYYYMMDD
format for Date and an HHMMSSSSS format for Time
(you can use fewer second positions, depending on how
fine a time granularity you need to measure), and clustering
on it. The values will maintain index order, range
queries will be simplified in the fact table, and you’ll need
one less join when querying because the primary key will
contain the date (or time).
For large type 2 slowly changing dimensions (i.e.,
where you add a new row to record the change), you
might want to create a four-part non-clustered index
that includes the business key, the record begin date,
the record end date, and the surrogate key. For efficiency
and to prevent escalating storage requirements,
INCLUDE the record end date and the surrogate key
when creating the index instead of making them part
of the index key, as shown in the following command:
CREATE NONCLUSTERED INDEX MyDim_CoveringIndex
ON (NaturalKEY, RecordStartDate)
INCLUDE ( RecordEndDate, SurrogateKEY);
This command creates a covering index that can be
useful during the ETL process and load operations
and for historical queries. When you make RecordEnd-
Date and SurrogateKEY INCLUDEs instead of part
of the index key, the SQL Server engine stores these
two values at only the leaf level of the index tree, thus
reducing the storage requirements. By having these two
columns in the index (i.e., creating a covering index),
the SQL Server relational engine can get the data that
it needs solely from the index during load and some
query operations, without having to access data from
the underlying dimension.
If there are other columns in the dimension that will
be used continuously for searching, sorting, or grouping,
create non-clustered indexes on those columns as
you would in a transactional database. If there’s an
embedded hierarchy in a dimension, such as the Category-
SubCategory-ProductID hierarchy in the Product dimension,
then consider indexing the components of the
hierarchy if it will enhance query performance and won’t
inhibit data loading.
Indexing the Fact Table
Indexing the fact table is similar to indexing a dimension,
although you must account for partitioning.
You’ll want to index and cluster on the date key or a
combined date plus time. Because business intelligence
(BI) analysis always seems to involve a date/time component,
the fact table will have a date (or datetime) key,
and clustering on this key will help with cube-building.
Also, if the data records are already stored in date or
datetime order, historical queries will have an execution
advantage. If the fact table has more than one date or
datetime column, cluster on the column that’s used
most often for querying or cube-building.
If the fact table is partitioned on the date column,
use that column as the clustering key. When you see
the same column to create the clustered index that you
used to create the partitions and creating the index in
the same file group that holds the partitioned fact table,
SQL Server will automatically partition the index the
same way that the fact table is partitioned (i.e., the
index will have the same partitioning function and
column as the fact table). When the index is partitioned
the same way the fact table is partitioned, the table and
its index are said to be aligned, which makes for an
optimal operational situation, especially if you anticipate
creating additional partitions or making frequent
partition switches.
Next, create a non-clustered index on each of the
foreign keys in the fact table, and consider combining
the foreign key and the date key, in that order, similar
to CustomerKEY + DateKEY in Figure 1. Creating a
non-clustered key on the foreign keys works especially
well if one or more of the associated dimensions is a
type 2 slowly changing dimension. Rows with the same
foreign key value will be searched in ascending date
order, which will enhance the historical query response.
Note that you’ll want to retain relational integrity when
dealing with the foreign keys. For more information
about how to do so, see the Web-exclusive sidebar
“Retaining Relational Integrity in the Data Warehouse”
(www.sqlmag.com, InstantDoc ID 99362).
Modifying Your Indexing Scheme
Over time, your data warehouse will change to accommodate
what’s happening in your organization, and
you’ll have to modify your indexing scheme. Most data
warehouse/BI systems will access these relational tables
directly, so you can use tried-and-true transactional
methods for tuning indexes, such as evaluating the
query and data mix and adjusting it accordingly. If
your relational data warehouse is used only to stage
SSAS structures, then you might not need any more
indexes than those we’ve talked about. SSAS tends
to use the same queries over and over again, so you
can run the Index Tuning Wizard and tune exactly for
that set of queries. Start simple, evaluate thoroughly,
and build conservatively when indexing your data
warehouse.
End of Article