Learn how the query optimizer sizes up join options
As I discussed in my last two columns, one way that the query optimizer in SQL Server 2000 and 7.0 is so much more complex and powerful than in previous releases is that it has many more query-processing options to choose from. In particular, besides a nested-loop join, which was the only join-processing technique originally available, SQL Server 7.0 introduced merge joins and hash joins. This month, I continue my examination of query plans available to the optimizer by telling you about situations in which each join type is useful.
As I've mentioned before, releases before SQL Server 7.0 had only one way to process a JOIN operation. The optimizer chose one table to process first, and for each row that qualified (based on any WHERE clause conditions involving columns in that table), SQL Server used the JOIN clause to find all matching rows in the second table. The JOIN clause usually performed an equality comparison between a column in the first table and a column in the second. This type of join, called a nested-loop join, is still an alternative for the optimizer in SQL Server 7.0 and later. In fact, you can think of nested-loop joins as being the default type of join. If the optimizer can't find an acceptable plan that uses another technique, it can always fall back on the basic nested-loop join.
The other two join types are merge and hash. Merge joins are appropriate when both input sets to the JOIN operation are ordered by the values in the join column, as would be the case when both have a clustered index on the column you're using to join the tables. SQL Server most often uses hash joins when no useful indexes for joining the tables exist. Hashing is a well-defined data-processing technique that lets you add some organization to a set of data without incurring the overhead of sorting it completely.
When performing a hash join on two tables, SQL Server uses one table (called the build input) to build the hash buckets, each of which contains all the existing data values that generate the same value when the hash function is applied to them. Then, it inspects the other table (called the probe input) one row at a time and tries to find matching values in the hash buckets. For more information about hashing, see my April 2002 column, "Hashing for Performance," InstantDoc ID 24024.
What factors does the optimizer take into account when deciding what join algorithm to use? Remember that the optimizer in SQL Server 7.0 and later releases is trying to find a good plan but not necessarily the best possible plan, because finding the best plan might take more time than it would save. The optimizer evaluates possible plans from the simplest to more complex plans.
The simplest plan is to use the "default" nested-loop join, so the optimizer evaluates that type first. If the estimated time for using such a join is within acceptable parameters, no further analysis is necessary. The optimizer usually chooses this join algorithm if the outer table isn't large or if restrictive WHERE conditions on the table's columns have reduced it to a manageable size. This is one case in which the SQL Server 2000 and 7.0 enhancements of allowing statistics on columns can be valuable.
For example, consider the following pseudo code that contains a join on two large tables:
SELECT <some columns> FROM Big1 JOIN Big2 ON Big1.column1 = Big2.column2 WHERE Big1.column3 = <some value>
Even if no index exists on Big1.column3, statistics on the column (which, by default, SQL Server creates automatically when it processes the query) let the optimizer determine whether the condition is restrictive enough to make the Big1 table a suitable outer table for a nested-loop join.
To find an acceptable plan that uses a nested-loop join, SQL Server must minimize the number of times it accesses the inner table and the number of inner-table pages it accesses. Having a restrictive search condition on the outer table means that SQL Server accesses the inner table only a few times. An index on the column involved (in this case, Big1.column3) will improve performance but isn't necessary to determine whether a nested-loop join is appropriate.
Another condition the optimizer examines is whether the inner table has an index on the join column. A clustered index is always a good indicator that a nested-loop join would perform well, but a nonclustered index could also be useful if the number of matching rows in the inner table for each row in the outer table is small.
As I mentioned, the optimizer usually chooses a merge join when clustered indexes exist on the join column in both tables because the two inputs are already sorted in the same order. When performing a merge join, SQL Server has to make only one pass through each table, thereby drastically reducing the number of page accesses for the second table. In some cases, you'll see a merge join in the query plan even if both inputs don't have clustered indexes on the join column. If one table has a clustered index on the join column and the other is filtered down to a small number of rows with a WHERE condition, the optimizer might decide to perform a sort operation on the small result set and join the two inputs through a merge join. You'll then see a SORT operator in the query plan right before the MERGE JOIN operator.
Keep in mind, however, that SQL Server can perform merge joins only when at least one of the inputs is known to have unique values in the join column. If both inputs can have duplicates, SQL Server can't process a merge join by making only one pass through each table, so the optimizer usually chooses a nested-loop join instead.
Listing 1 shows some code (based on an example from "Indexes and Joins," March 2002, InstantDoc ID 23731) that illustrates that a merge join isn't the best choice when neither table is known to be unique. Listing 1 makes unindexed copies of the Northwind database's Orders and Order Details tables, then builds nonunique clustered indexes on each table. A simple join between them would necessitate a nested-loop join, as you can see by looking at the query plan for the first SELECT statement. Before running the two queries at callout A, the code turns on STATISTICS IO. When you execute the queries, you can see that the plan with the forced merge requires quite a few extra reads to accommodate the possibility of the first table having duplicates in it.
If neither nested-loop joins nor merge joins will give good performance, the optimizer considers hash joins. Note that this option is usually the optimizer's last choice. If you encounter a hash join in a query that you know will run as a regular part of an application, you should consider creating some useful indexes so that the optimizer can choose a more efficient technique. In addition, be aware that SQL Server can't process all joins as hash joins. Hashing depends on the fact that you're looking for an equality between columns in the two input tables. Not all joins are based on an equality, although that kind of join (called an equijoin) is by far the most common. Joins can also be based on an inequality or a range, such as when you're trying to find all the rows from one table that fall between two values in another table. In that case, the optimizer can consider only a nested-loop join.
If the Query Fits
The addition of merge and hash joins is a powerful enhancement that gives the optimizer in SQL Server 2000 and 7.0 many more possibilities to consider. By knowing when each of the three join techniques is most appropriate, you can more fully understand the query plans that the optimizer chooses for your application queries. And by being aware of when the optimizer generally chooses one technique over another, you can more effectively troubleshoot poorly performing queries and focus on the most appropriate places to consider alternative indexing.