T-SQL is a language that lets you formulate requests in a logical manner. By "logical," I mean that when you write a query, you specify what results you want but not how you want to get them. Devising how to process the query is the job of the query optimizer. Every problem you face that requires a T-SQL solution usually has many different solutions that eventually return the same result. In an ideal world, given two different queries that perform the same task, the optimizer would generate the same execution plan—the optimal plan—for both. The SQL Server development team seems to be moving in this direction. With every release of SQL Server, the optimizer devises more sophisticated and efficient plans, and the chances increase that different queries that perform the same task will use the same plan.
However, in many situations, the way you write a query can still dramatically affect the query's performance. To let the optimizer choose among different plans and find an efficient one, you need to do some preparation—namely, creating indexes. Most people create regular indexes as part of their database tuning, but I haven't seen indexed views implemented widely. Indexed views can dramatically improve query performance, especially when you're aggregating data. In some situations, after creating and indexing views, you don't even need to change your original queries that refer to the base tables because the optimizer is smart enough to use those indexes.
Let's look at a couple of problems in which using indexed views yields improved performance. I'm assuming that you're already familiar with the basics of indexed views. (For details about indexed views, see Kalen Delaney's "Introducing Indexed Views," May 2000, InstantDoc ID 8410.)
The New Customers Problem
The first problem involves a typical marketing request—for each period (e.g., month), what is the number of new, existing, and total customers based on order activity? A month's new customers are customers who placed an order for the first time. Existing customers are those who placed orders in earlier months. Use the Orders table in the Northwind database as your initial base data. Table 1 shows an abbreviated version of the desired output. After creating a solution and verifying that your code is logically correct, test its performance against a larger Orders table, which you can create in tempdb by running the code that Listing 1, page 18, shows.
Listing 2, page 18, shows my first solution to this problem. The innermost query defining the derived table M returns the distinct month and customer combinations from Orders. The month is expressed as a datetime value that falls on the first day of the month. The query defining the derived table N uses a CASE expression that returns, for each month and customer combination, a 1 if the customer hadn't placed an order before the current month and a 0 otherwise. In other words, the CASE expression returns 1 if the customer is new. The outermost query groups the rows in N by month and calculates the total number of new customers by summarizing all the 1s in the column named new in the N table. The total number of customers is the number of rows in the group (COUNT(*)) because each row in the group represents a different customer. The number of existing customers is the total number of customers minus the number of new ones.
When I wrote this solution, I didn't take performance into consideration; I thought only of solving the query logically. The query incurred more than 137,000 logical reads and ran for 17 seconds on my laptop against the large Orders table I created in tempdb. Regular indexes provided little help—after I created a composite index on the customerid and orderdate columns, the query ran for 15 seconds. The problem called for a different approach, and this time, I decided to keep performance in mind.
The larger Orders table can have many occurrences of orders for each customer each month. If you had a summarized table that contained only distinct combinations of months and customers, you could write a better-performing query. Summarizing tables is what indexed views are all about. Run the code in Listing 3 to create an indexed view called Vymcusts that contains only distinct combinations of months and customers. Note that you have to include the COUNT_BIG(*) function inside a view that uses aggregations if you want to index the view. When SQL Server knows the number of rows in each group, it can maintain the indexed view efficiently.
Next, I tested several different solutions, all of which performed much better than the first. Let's look at the best-performing solution I came up with. I created another view called Vnewcusts that contains the minimum year/month (ym) value for each customer:
SELECT customerid, MIN(ym) AS min_ym
FROM Vymcusts (NOEXPAND)
GROUP BY customerid
The minimum ym value for a customer is the month of that customer's first order. The NOEXPAND hint tells the optimizer to treat the indexed view as a table and not attempt to use indexes from the base table Orders. Finally, I left-joined Vymcusts (VY) to Vnewcusts (VN). A row from VY finds a match in VN if they both have the same customer and the current month in VY is the minimum month for the customer (VY.ym = VN.min_ym). The outer join returns NULLs in VN.customerid and in all columns from VN if the month isn't the minimum month for the customer. The query groups the result of the outer join by the VY.ym column. SQL Server calculates the total number of new customers by counting the number of non-NULL values in VN.customerid. Listing 4 shows the final query, which ran for 1 second on my laptop and incurred only 280 logical reads. That performance is satisfactory.
Another problem in which I found indexed views useful is calculating column cardinality—the number of distinct values in a column. Calculating the cardinality of each column in a table can be costly because SQL Server needs to rearrange each column's values so that it can easily count the number of distinct values. Let's look at some examples that use a table with three data columns and 1 million rows. Run the script that Listing 5 shows to create a table called T1 in tempdb that has columns called c1, c2, and c3, which contain 50,000, 100,000, and 200,000 distinct values, respectively.
With no suitable indexes, the following query, which calculates the cardinality of each data column, would give very poor performance:
SET STATISTICS IO ON
SELECT COUNT(DISTINCT c1),
COUNT(DISTINCT c2), COUNT(DISTINCT c3) FROM T1
Look at this query's execution plan in Query Analyzer and notice that SQL Server scans T1's clustered index three times and hashes the values in each column to count the number of distinct values. As a result, the query takes a long time to finish and incurs a lot of I/O overhead. On my laptop, the query ran for 23 seconds and incurred 9294 logical reads.
The obvious first step toward better performance is to index each column:
CREATE NONCLUSTERED INDEX idx_nc_c2 ON T1(c2)
CREATE NONCLUSTERED INDEX idx_nc_c3 ON T1(c3)
This way, SQL Server can scan each index separately in an ordered fashion and count the distinct occurrences. Now the query runs for 5 seconds, incurring 4083 logical reads.
You might be satisfied with this improvement, but with tables that have more columns and more rows, you might still need improvement. One way to get better query performance (albeit at the cost of modification performance) is to write INSERT, UPDATE, and DELETE triggers that maintain a distinct count value for each column in a separate, denormalized table. Or, you can handle the problem without using triggers by using indexed views instead of the regular nonclustered indexes.
For each column, you can create an indexed view that contains only the column of interest and the COUNT_BIG(*) function, which is required for indexing views that have aggregations. For example, create the following indexed view for c1:
SELECT c1, COUNT_BIG(*)
AS cnt FROM dbo.T1 GROUP BY c1
CREATE UNIQUE CLUSTERED INDEX
idx_uc_c1 ON V_T1_c1(c1)
Then, run the script that Listing 6 shows to create the indexed views for c2 and c3. Now, rerun the query, measuring duration and I/O and examining the execution plan. The optimizer is smart enough to understand that, to calculate the distinct count of a column, it can simply count the number of rows inside the indexed view that uses the column in its GROUP BY clause. The query now takes less than a second to run and incurs fewer than 1000 logical reads.
Different Points of View
Until the way you write a T-SQL query doesn't matter, you should test several different solutions for the same problem to gain better performance. Indexed views are a great tuning technique because they store results of queries on disk; especially when you're aggregating data, they can save you plenty in I/O costs. They can improve retrieval performance considerably in some situations, but keep in mind that they can degrade modification performance, too.