One of the most effective things you can do to improve the performance of a SQL Server based application is to tune the queries that SQL Server is executing. Although query tuning comprises many facets, having good indexes on your tables can make an amazing difference in query performance. However, you also need to make sure that SQL Server is actually using those indexes. You can't separate query tuning from table tuning: To tune your queries, you need to have good indexes on your tables; to tune your tables, you need to make sure that their indexes will help with incoming queries.

In this and upcoming articles, I look at what makes a good index and what kinds of queries can use those indexes. I don't discuss index structures in depth, but I give you basic information about the two kinds of indexes that SQL Server supports—clustered and nonclustered. Here's a simple way to understand the differences. A clustered index is like an old-fashioned 20-volume encyclopedia, in which the data is ordered. You can use the fact that the data is sorted to quickly narrow your search, first using the keywords on the spines of the volumes, then using the keywords on the top corners of each page. When you find the entry you're looking for, all the information that the encyclopedia contains about that subject is in that entry. In SQL Server, the clustered index contains the data itself. For example, if you build a clustered index on the last_name column in your employees table, SQL Server rearranges the data rows to create an ordered list of pages and an ordered set of rows on each page, keeping the data rows in last-name order. Because the data can be in only one order, you can have only one clustered index on a table.

A nonclustered index, however, is completely separate from the data. You can compare a nonclustered index to the index in the back of a book. The index entries are in order, so finding a listing for a particular topic is easy, but the index gives you one or more pointers to pages containing the details. For popular topics, the index might contain many pointers (page references); other topics might have one pointer, or none. A nonclustered index in SQL Server contains all the index key values in order. With each key value is a "bookmark" telling where the row that has that key is in the table. For example, if you have a nonclustered index on the first_name column in your employees table, the index's leaf (bottom) level contains every first name value in the table, in sorted order. Unlike the index in the back of a book, if a particular first name occurs multiple times, a SQL Server nonclustered index will contain one subentry for each occurrence of each first_name value. Each occurrence has a bookmark telling SQL Server where to find the row that contains that first name. SQL Server lets you define as many as 249 nonclustered indexes per table, but because SQL Server automatically maintains indexes as you insert, update, and delete data, I recommend that you not have that many. Most tables need only a handful of indexes; I generally limit my tables to 8 to 10 indexes. (For more details about the differences between clustered and nonclustered indexes, see Dusan Petkovic, "Which Is Faster, Index Access or Table Scan?" January 2000.)

The most important difference to remember is that a clustered index stores all the data in sorted order, whereas with a nonclustered index, only the index keys are in order. In a table that has a clustered index on last name, if you're looking for all the data about everyone with the last name Sasaki, you'll find all that data together. Even if your table has thousands of such rows, SQL Server stores them as close together as possible, so the number of pages SQL Server has to read to find all the relevant data is minimal. However, if you want to find all the data for all employees with the first name Joe, a nonclustered index on first name might not be useful. All the index rows for Joe would be together, but each would have a bookmark directing SQL Server to a different place in the table. In many cases, nonclustered indexes aren't useful if your search will return lots of rows. How many is "lots"? The answer to that question depends on how big the table is and on other factors, such as the size of the index keys compared with the row size. But if 1000 rows have the first-name value Joe, using the nonclustered index means that SQL Server might have to perform 1000 page reads to get all the data for all the relevant rows. If your table has only 500 pages, using the nonclustered index would be less efficient than reading every page in the table.

Optimize This

The SQL Server query optimizer determines which indexes SQL Server can use to produce the best performance during query execution. The optimizer uses index statistics to estimate the number of rows that a query will return, then computes how many pages SQL Server will need to access to read all those rows. The query optimizer compares the cost of using each index or combination of indexes against the cost of scanning the whole table, then chooses the plan it estimates will be cheapest.

To tune your tables so that they have the best indexes, you need to know the kinds of queries that users will run against each table. Because you can have only one clustered index per table, most people first determine the best column or columns to build the clustered index on. And because a clustered index means the table contains the data rows in order, you need to know what type of ordering is most important. If your table frequently receives queries that need to sort the data, what's the most common sort key? If your clustered index is on last name and you need to retrieve data sorted by last name, no additional sorting is required—the data is already in order.

Clustered indexes can also help if you frequently retrieve a range of data. For example, a clustered index on postal_code can be useful if you run queries like this:

SELECT * FROM employees
WHERE postal_code BETWEEN 90210 AND 91000

In this case, SQL Server stores all the rows that have postal codes in the desired range together, so the number of required page accesses will be minimal. You'll encounter a similar situation for columns that have many duplicate values. If you need to find the rows for all employees in a specific department, a clustered index on department number will let SQL Server access the rows as quickly as possible.

So, where do you put the clustered index if your table has one column that you frequently sort by, another column that you frequently search by range, and another column with many duplicates? Part of the answer will depend on your determination of which queries are most important and your analysis of which queries users run most often. A query that your boss runs several times an hour will probably take precedence over a query that runs once or twice a month.

One important detail to note is that nonclustered indexes are almost as good as clustered indexes if you're interested in just one row. So a table's primary key, which by definition is unique, might be well served by a nonclustered index. As the following example shows, if your most common queries that reference the primary key are searches for a single value, a nonclustered index is perfectly useful.

SELECT * FROM employees
WHERE  employee_ID = 'A12345'

By default, when you define a primary key in a table definition, SQL Server builds a unique clustered index on the key column. Unless you'll be searching for a range of primary key values or sorting by the primary key, I recommend that you override the default and define the primary key index as nonclustered. Save your clustered index for where it can do the most good.

Any Arguments?

Now, let's look at one of the basic forms of query tuning. When you're determining which index might be useful for a particular query, the most important elements in the query are the WHERE clause conditions. The WHERE clause limits the number of rows that the query will return; the query optimizer tries to determine whether an existing index can help find this limited number of rows.

The query optimizer examines each condition in your WHERE clause to determine whether the condition might be useful in limiting how much data SQL Server has to access—that is, whether the clause is useful as a search argument (SARG). Although people sometimes use the term search argument to mean any condition in a WHERE clause, when I use the acronym SARG, I'm referring to a specific form of search argument. In most cases, the query optimizer can optimize only the conditions that are expressed as SARGs.

A SARG limits a search by specifying an exact match, a range of values, or a conjunction of two or more items joined by AND. A SARG always contains a constant expression (or a variable that resolves to a constant) that is compared to a column in a table. SARGs take the form

column inclusive_operator

or

inclusive_operator column

The column name appears on one side of the operator, and the constant or variable appears on the other side. If a column appears on both sides of the operator, the condition isn't a SARG. SARGs can include the following operators: =, >, <, >=, <=, BETWEEN, and sometimes LIKE. Whether LIKE is allowed in a SARG depends on the position of any wildcards you use. For example, LIKE 'son%' is a SARG, but LIKE '%son' isn't because the wildcard (%) at the beginning doesn't limit the rows that SQL Server has to search. Here are some SARG examples:

name = 'jones'
salary > 40000
60000 < salary
department = 'sales'
name = 'jones' AND salary > 100000

With a non-SARG expression, SQL Server must evaluate every row to determine whether it meets the conditions in the WHERE clause. So an index typically isn't useful for conditions that aren't SARGs. Such non-SARG expressions include negation operators such as NOT, !=, <>, !>, !<, NOT EXISTS, NOT IN, and NOT LIKE.

But don't suppose that any time a negative operator appears in a query, SQL Server can't use an index. Using a non-SARG in your WHERE condition doesn't always result in a table scan. SQL Server can use indexes with some non-SARGs, and if the query contains non-SARGs, some indexes might be useful to other SARGs in the query. However, one of the first and most straightforward changes to make when tuning your queries is to turn non-SARGs into SARGs whenever possible.

Here are some examples of non-SARG conditions:

WHERE SUBSTRING(au_lname, 1, 2) = 'Mc'
WHERE ABS(change_in_quantity) < 4

These two examples don't have a column name isolated on one side of the comparison operator. For these conditions, the optimizer won't consider using an index. However, you can rewrite both of these conditions to a form that might be optimizable. The first condition could become

WHERE au_lname >='Mc' AND au_lname < 'Md'

In this case, an index on au_lname could be useful. You could rewrite the second condition as

WHERE change_in_quantity < 4 AND change_in_quantity > -4

When a condition involves a SARG, the query optimizer considers using an index on the column in the SARG.

Simple changes such as turning a non-SARG into a SARG can potentially have a large impact on the query performance, but that's only the beginning of what you can do to tune queries. In my next few articles, I'll examine other tuning techniques and tools.