SQL Server 2008 introduces several interesting enhancements concerning parallel query processing. These enhancements mainly improve the performance of large-scale queries, such as queries against large data warehouses and queries against partitioned tables. In this article I describe three enhancements: few-outer-row optimization, star join optimization, and partitioning enhancements. All my examples use a sample database called testparallel. Run the code in Listing 1 to create this database, as well as a helper table function called GetNums that accepts a number as input and returns a sequence of integers in the range 1 through the input number.
The few-outer-row scenario refers to a nested loop join whose outer input (i.e., the top one) is a parallel scan that filters data based on a predicate and yields a small number of rows. Regardless of the join, a parallel scan uses a parallel page supplier that provides a set of pages to each thread upon request. As soon as a thread finishes processing a set of pages, it requests another set of pages. This way threads that process rows faster will process more rows and all threads should finish more or less at the same time. But after applying the filter as part of the scan, there’s no assurance that the number of qualifying rows in each thread will be similar. In fact, you can end up with some threads that have zero qualifying rows and some with nonzero numbers.
The inner part of a nested loop join isn’t aware that it is processed in parallel; each thread simply processes whichever rows that it obtains from the upper part. If the upper part has multiple threads, this determines how many threads will be used to process the inner part, as well as how many outer rows each thread will process in the inner part. Prior to SQL Server 2008, if the outer input to a nested loop join was the result of a parallel scan with uneven distribution of qualifying rows among threads, there was no attempt made by the optimizer to somehow redistribute the rows to balance the work. This could result in imbalanced distribution of the load among threads, with some threads remaining idle and others doing most of the work.
Let’s consider an example that demonstrates this imbalance in SQL Server 2005 and then the resolution to the imbalance in SQL Server 2008. I used the code in Listing 2 to create the sample data for the example. This code creates two tables called T1 and T2 with a million rows each, both with a clustered index on col1, and no index on col2.
The following query has a join and a filter; it is processed using a nested loop join and fits the few-outer-row scenario:
JOIN T2 ON T1.col1 = T2.col1
WHERE T1.col2 <= 100;
Figure 1 shows the plan I got for this query on SQL Server 2005 SP3. Note that the machine I used to test this code has eight logical CPU cores.
Observe in the plan (by looking at the properties of the scan) that besides the main thread (thread 0), eight threads are used to process the parallel scan, but only one has qualifying rows (100 of them), whereas all the rest have zero qualifying rows. This distribution of rows unfortunately dictates the distribution of work among threads to process the inner part of the join, so only one thread ends up working and the other seven remain idle.
SQL Server 2008 introduces an enhancement called few-outer-row optimization, in which the optimizer detects the few-outer-row scenario, and when detected, adds a Redistribute Streams operator above the scan to redistribute the rows evenly among threads. Then the work is distributed evenly among threads to process the inner part of the nested loop join. Figure 2 shows the plan I got for this query on SQL Server 2008 SP1 Cumulative Update 6. Note the uneven distribution of qualifying rows returned from the scan, but the redistribution of the rows evenly among threads by the Redistribute Streams operator.
Star Join Optimization
Star join optimization is mainly applicable to join queries against data warehouses. The classic data model in a data warehouse is known as the star schema model. In this model you have multiple dimension tables that hold information about the subjects that you analyze the data by (e.g., customer, employee, product, time) and a centric table called the fact table, with the facts—several measures for each applicable combination of dimension keys. Each dimension table typically has a compact surrogate key used in the fact table as a foreign key referencing the corresponding dimension table. The dimension tables are typically fairly small (comparably), whereas the fact table can get quite large—many millions, or in some cases billions, of rows.
Classic queries against data warehouses with a star schema model involve a join between the big fact table and some of the dimension tables, with the join conditions based on the single column foreign key–primary key relationships, and some filters on nonkey attributes from the dimension tables.
SQL Server processes a join between two inputs at a time. In the star join scenario, each join could end up processing a very large number of rows and therefore be inefficient.
Let’s consider an example of such a join to see how it is processed in SQL Server 2005, as well as SQL Server 2008’s improvements. The code in Listing 3 creates the sample data for my examples. This code creates a general form of a data warehouse with a star schema model. It creates three dimension tables called Dim1, Dim2, and Dim3, and a fact table called Fact. The code fills the dimension tables with 100, 50, and 200 rows, respectively, and the fact table with 1,000,000 rows. Following is an example of a classic star join query:
FROM dbo.Fact AS F
JOIN dbo.Dim2 AS D2
ON F.key2 = D2.key2
JOIN dbo.Dim3 AS D3
ON F.key3 = D3.key3
WHERE D2.attr1 <= 3
AND D3.attr1 <= 2;
As you can see, the query joins the fact table with two of the dimension tables based on the foreign key–primary key relationships, and it filters the dimension rows by nonkey attributes. Figure 3 shows the execution plan I got for this query on SQL Server 2005.
Note that all 1,000,000 rows are returned from the scan of the clustered index on Fact and are used to probe the hash table created based on the rows retrieved from the clustered index on Dim3. This means that the join has a lot of rows to process. The result of this join is still fairly large—85,000 rows. This result is used to probe the hash table created based on the rows retrieved from the clustered index on Dim2.
SQL Server 2008 introduces an enhancement called star join optimization. It uses heuristics to detect a star join (e.g., minimum size of fact, single-column joins); when detected, it can add so-called optimized bitmap filters to the plan. Think of a bitmap filter as a compact in-memory representation of a fairly small set of values that can be used in the plan to filter data. When the optimizer detects a star join scenario, it evaluates the use of bitmaps to produce a compact in-memory representation of the qualifying dimension keys after the filtered scan of each dimension. Then later in the plan when scanning the fact table, it can apply a number of filters based on those bitmaps filtering out the bulk of the rows from the fact table. Then what’s left for the join operators to process are much-reduced sets of rows.
SQL Server 2005 does support bitmap filters in certain cases. What’s new in SQL Server 2008 is support for the so-called optimized bitmap filters. The optimized filters can be added dynamically during optimization when a star join scenario is detected, and the optimizer can rely on estimates based on those to be able to make educated decisions in later parts of the plan tree. Optimized bitmap filters can be used only in parallel plans with has joins. Figure 4 shows the plan I got for our star join query on SQL Server 2008.
A small number of rows are returned after the filtering of each dimension. In both cases, the Distribute Streams operator creates bitmaps (called Opt_Bitmap1007 and Opt_Bitmap1006) based on the qualifying keys and broadcasts those bitmaps to all threads. The qualifying rows in both cases are used as the build inputs to the hash tables. The fact table is scanned. From the Predicate used to filter the rows in the scan you can observe that both bitmaps were used to filter the fact table rows. After applying both filters, a small number of rows remain—8,001 in this case. This means that the number of rows left for the joins to process is much smaller compared with the plan in SQL Server 2005, in which star join optimization wasn’t used. The way you can tell that the new optimized bitmap filters were used is by their name: They have the prefix Opt_.
The last enhancement is an improvement in the parallel processing of queries against partitioned tables. To demonstrate this enhancement, run the code in Listing 4 to create and populate a partitioned table called PartitionedTable, then query the table. The code in Listing 4 populates the table with 1,000,000 rows in four partitions (with col1 ranges <= 250,000, > 250,000 and <= 500,000, > 500,000 and <= 750,000, > 750,000).
In SQL Server 2005, as long as your query has only one qualifying partition, parallel query processing usually performs good distribution of the load among threads. As an example, the following query filters rows from only one partition:
WHERE col1 <= 250000
ORDER BY col2;
Figure 5 shows the plan I got for this query on SQL Server 2005. Observe in the plan that there’s a good distribution of the load among the eight threads used besides the main thread. However, when more than one partition qualifies, SQL Server 2005 assigns a thread per partition. This means that if the number of qualifying partitions is smaller than the query degree of parallelism (DOP), some of the threads will remain idle. As an example, in the following query two partitions qualify:
WHERE col1 <= 500000
ORDER BY col2;
Figure 6 shows the execution plan I got for this query on SQL Server 2005. The Constant Scan operator enumerates the qualifying partitions, and then one thread is assigned to work on each partition via the Nested Loops operator. You can see from the properties of the Clustered Index Seek in the inner part of the Nested Loops join that only two threads actually worked. As you can imagine, the distribution of work among threads in such cases can be inefficient.
SQL Server 2008 doesn’t use a thread-per-partition strategy in case multiple partitions qualify. It also doesn’t use the Constant Scan and Nested Loops operators to enumerate the partitions and issue the work per partitions. In plans involving partitioned tables, you’ll see the work as if done against one partition; from the operator properties you can determine how many and which partitions were processed. In terms of parallelism, regardless of the number of qualifying partitions, SQL Server distributes the work among all threads. As an example, Figure 7 shows the plan I got on SQL Server 2008 for the last query. As you can see, in SQL Server 2008 the work was distributed nicely among all threads even though two partitions qualified.
All the enhancements I discussed—few-outer-row optimizations, star join optimizations, and improvements in parallel query processing against partitioned tables—are built-in improvements in the engine, mainly improving the performance of large-scale queries. None of these enhancements require any sort of intervention from the administrator’s side—no need for any knobs, switches, or query revisions. You get a performance boost for your existing queries simply by upgrading to SQL Server 2008.