Distributed queries involve data stored on another server, whether that server is in the same room or half a world away. Writing a distributed query is easy (so easy, in fact, that you might not even consider it any different from a run-of-the-mill local query), but many potential pitfalls await the unsuspecting DBA. This isn't surprising given how little information about distributed queries is available online. There's even less information available on how to improve their performance.
To help fill this gap, I'll be taking an in-depth look into how SQL Server 2008 treats distributed queries, specifically focusing on how queries involving two or more instances of SQL Server are handled at runtime. Performance is key in this situation, as much of the need for distributed queries revolves around OLTP—solutions for reporting and analysis are much better handled with data warehousing.
The three distributed queries I'll be discussing are designed to be run against a database named SCRATCH that resides on a server named REMOTE. If you'd like to try these queries, I've written a script, CreateSCRATCHDatabase.sql, that creates the SCRATCH database. You can download this script as well as the code in the listings by clicking the Download the Code Here button near the top of the page. After CreateSCRATCHDatabase.sql runs (it'll take a few minutes to complete), place the SCRATCH database on a server named REMOTE. This can be easily mimicked by creating an alias named REMOTE on your SQL Server machine in SQL Server Configuration Manager and having that alias point back to your machine, as Figure 1 shows.
Having configured an alias in this way, you can then proceed to add REMOTE as a linked SQL Server on your machine (in this example, MYSERVER). While somewhat artificial, this setup is perfectly adequate for observing distributed query performance. In addition, some tasks such as profiling the server are simplified because all queries are being performed against a single instance.
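With the alias in place, adding REMOTE as a linked server takes only a couple of system stored procedures. Here's a minimal sketch; the login mapping shown assumes you want to connect with your own Windows credentials, so adjust it to suit your security setup:

```sql
-- Register REMOTE as a linked SQL Server instance.
EXEC sp_addlinkedserver
    @server = N'REMOTE',
    @srvproduct = N'SQL Server';

-- Map logins so the caller connects under his or her own credentials.
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'REMOTE',
    @useself = N'TRUE';
```

You can verify the link with `EXEC sp_testlinkedserver N'REMOTE';` before running the queries in the listings.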
As distributed queries go, the simplest form is one that queries a single remote server, with no reliance on any local data. For example, suppose you run the query in Listing 1 (DistributedQuery1) in SQL Server Management Studio (SSMS) with the Include Actual Execution Plan option enabled.
SELECT TOP 1000 *
FROM REMOTE.Scratch.dbo.Table1 T1
INNER JOIN REMOTE.Scratch.dbo.Table2 T2
  ON T1.ID = T2.ID
As far as the local server (which I'll call LOCAL) is concerned, the entire statement is offloaded to the remote server, more or less untouched, with LOCAL patiently waiting for the results to stream back. As Figure 2 shows, this is evident in the tooltip for the Remote Query node in the execution plan.
Although the query was rewritten before it was submitted to REMOTE, semantically the original query and the rewritten query are the same. Note that the Compute Scalar node that appears between the Remote Query and SELECT in the query plan merely serves to rename the fields. (Their names were obfuscated by the remote query.)
LOCAL doesn't have much influence over how DistributedQuery1 is optimized by REMOTE, nor would you want it to. Consider the case where REMOTE is an Oracle server. SQL Server thrives on creating query plans to run against its own data engine, but it doesn't stand a chance when it comes to rewriting a query to perform well against an Oracle database or any other data source. Admittedly, this is a simplified view of what happens: almost no other data source providers understand T-SQL, so a degree of translation is always required. In addition, the use of T-SQL functions might result in a different approach being taken when querying a non-SQL Server machine. (Because my main focus is on queries between SQL Server instances, I won't go into further details here.)
So, what about the performance of single-server distributed queries? A distributed query that touches just one remote server will generally perform well if that query's performance is good when you execute it directly on that server.
In the real world, it's much more likely that a distributed query is going to be combining data from more than one SQL Server instance. SQL Server now has a much more interesting challenge ahead of it—figuring out what data is required from each instance. Determining which columns are required from a particular instance is relatively straightforward. Determining which rows are needed is another matter entirely.
Consider DistributedQuery2, which Listing 2 shows.
SELECT *
FROM dbo.Table1 T1
INNER JOIN REMOTE.Scratch.dbo.Table2 T2
  ON T1.ID = T2.ID
WHERE T1.GUIDSTR < '001'
In this query, Table1 from LOCAL is being joined to Table2 on REMOTE, with a restriction being placed on Table1's GUIDSTR column by the WHERE clause. Figure 3 presents the execution plan for this query.
The plan first shows that SQL Server has chosen to seek forward (i.e., read the entries between two points in an index in order) through the index on GUIDSTR for all entries less than '001'. For each entry, SQL Server performs a lookup in the clustered index to acquire the data. This approach isn't a surprising choice because the number of entries is expected to be low.
Next, for each row in the intermediate result set, SQL Server executes a parameterized query against REMOTE, with the Table1 ID value being used to acquire the record that has a matching ID value in Table2. SQL Server then appends the resultant values to the row.
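The per-row statement that LOCAL sends to REMOTE looks something like the following. This is a hypothetical reconstruction; the actual text visible in the plan's Remote Query node uses server-generated aliases, and the column list shown here is illustrative:

```sql
-- Parameterized statement issued against REMOTE once per row of the
-- intermediate result set; ? is bound to the current row's T1.ID value.
SELECT "Tbl1002"."ID", "Tbl1002"."GUID"
FROM "Scratch"."dbo"."Table2" "Tbl1002"
WHERE "Tbl1002"."ID" = ?
```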
If you've had any experience with execution plan analysis, SQL Server's approach for DistributedQuery2 will look familiar. In fact, if you remove REMOTE from the query and run it again, the resulting execution plan is almost identical to the one in Figure 3. The only difference is that the Remote Query node (and the associated Compute Scalar node) is replaced by a seek into the clustered index of Table2. In regard to efficiency (putting aside the issues of network roundtrips and housekeeping), this is the same plan.
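For reference, the local-only variant simply drops the linked-server prefix. This sketch assumes you're running in the context of a local copy of the SCRATCH database:

```sql
-- DistributedQuery2 with REMOTE removed, so both tables are local.
SELECT *
FROM dbo.Table1 T1
INNER JOIN dbo.Table2 T2
  ON T1.ID = T2.ID
WHERE T1.GUIDSTR < '001'
```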
Now, let's go back to the original distributed query in Listing 2. If you change '001' in the WHERE clause to '1' and rerun DistributedQuery2, you get the execution plan in Figure 4.
In terms of the local data acquisition, the plan has changed somewhat, with a clustered index scan now being the preferred means of acquiring the necessary rows from Table1. SQL Server correctly figures that it's more efficient to read every row and bring through only those rows that match the predicate GUIDSTR < '1' than to use the index on GUIDSTR. In terms of I/O, a clustered scan is cheap compared to a large number of lookups, especially because the scan will be predominately sequential.
The remote data acquisition changed in a similar way. Rather than performing multiple queries against Table2, a single query was issued that acquires all the rows from the table. This change occurred for the same reason that local data acquisition changed: efficiency.
Next, the record sets from the local and remote data acquisitions were joined with a merge join, as Figure 4 shows. Using a merge join is much more efficient than using a nested-loop join (as was done in the execution plan in Figure 3) because two large sets of data are being joined.
SQL Server exercised some intelligence here. Based on a number of heuristics, it decided that the scan approach was more advantageous than the seek approach. (Essentially, both the local and remote data acquisitions in the previous query are scans.) By definition, a heuristic is a "general formulation"; as such, it can quite easily lead to a wrong conclusion. The conclusion might not be downright wrong (although this does happen) but rather just wrong for a particular situation or hardware configuration. For example, a plan that performs well for two servers that sit side by side might perform horrendously for two servers sitting on different continents. Similarly, a plan that works well when executed on a production server might not be optimal when executed on a laptop.
When looking at the execution plans in Figure 3 and Figure 4, you might wonder at what point SQL Server decides that one approach is better than another. Figure 5 not only shows this pivotal moment but also succinctly illustrates the impreciseness of heuristics.
To create this graph, I tested three queries on my test system, which consists of Windows XP Pro and SQL Server 2008 running on a 2.1GHz Core 2 Duo machine with 3GB RAM and about 50MB per second disk I/O potential on the database drive.
In Figure 5, the red line shows the performance of the first query I tested: DistributedQuery2 with REMOTE removed so that it's a purely local query. As Figure 5 shows, the runtime peaks early on, but it's not until the WHERE clause selects around 1,500 rows from Table1 that SQL Server chooses an alternative approach. This isn't ideal; if performance is an issue and you often find yourself on the wrong side of the peak, you can remedy the situation by adding the index hint WITH (INDEX(PRIMARY_KEY_INDEX_NAME)) to the reference to Table1, where PRIMARY_KEY_INDEX_NAME is the name of your primary key index. Adding this hint will result in a more predictable runtime, albeit at the potential cost of bringing otherwise infrequently referenced data into the cache. Whether such a hint is a help or hindrance will depend on your requirements. As with any T-SQL change, you should consider a hint's impact before implementing it in production.
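Applied to the local form of the query, the index hint looks like this. PRIMARY_KEY_INDEX_NAME is a placeholder for whatever your primary key index is actually called:

```sql
-- Force the clustered (primary key) index so the runtime stays predictable
-- regardless of how many rows the WHERE clause selects.
SELECT *
FROM dbo.Table1 T1 WITH (INDEX(PRIMARY_KEY_INDEX_NAME))
INNER JOIN dbo.Table2 T2
  ON T1.ID = T2.ID
WHERE T1.GUIDSTR < '001'
```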
Let's now look at the blue line in Figure 5, which shows the performance of DistributedQuery2. For this query, the runtime increases steadily to around 20 seconds then drops sharply back down to 3 seconds when the number of rows selected from Table1 reaches 12,000. This result shows that SQL Server insists on using the parameterized query approach (i.e., performing a parameterized query on each row returned from T1) well past the point that switching to the alternative full-scan approach (i.e., performing a full scan of the remote table) would pay off. SQL Server isn't being malicious and intentionally wasting time. It's just that in this particular situation, the parameterized query approach is inappropriate if runtime is important.
If you determine that runtime is important, this problem is easily remedied by changing the inner join to Table2 to an inner merge join. This change forces SQL Server to always use the full-scan approach. The results of applying this change are indicated by the purple line in Figure 5. As you can see, the runtime is much more predictable.
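The join hint is a one-word change to DistributedQuery2; here's a sketch:

```sql
-- MERGE forces the merge join, and with it the full-scan approach,
-- on both sides of the join.
SELECT *
FROM dbo.Table1 T1
INNER MERGE JOIN REMOTE.Scratch.dbo.Table2 T2
  ON T1.ID = T2.ID
WHERE T1.GUIDSTR < '001'
```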
Listing 3 shows DistributedQuery3, which differs from DistributedQuery2 in several ways:
- Rather than joining directly to Table2, the join is to SimpleView—a view that contains the results of a SELECT statement run against Table2.
- The join is on the GUID column rather than on the ID column.
- The result set is now being restricted by Table1's ID rather than GUIDSTR.
SELECT *
FROM dbo.Table1 T1
INNER JOIN REMOTE.Scratch.dbo.SimpleView T2
  ON T1.GUID = T2.GUID
WHERE T1.ID < 5
When executed, DistributedQuery3 produces the execution plan in Figure 6. In this plan, four rows are efficiently extracted from Table1, and a nested-loop join brings in the relevant rows from SimpleView via a parameterized query.
Because the restriction imposed by the WHERE clause has been changed from GUIDSTR to ID, it's a good idea to try different values. For example, the result of changing the 5 to a 10 in the WHERE clause is somewhat shocking. The original query returns 16 rows in around 10ms, but the modified query returns 36 rows in about 4 seconds—that's a 40,000 percent increase in runtime for a little over twice the data. The execution plan resulting from this seemingly innocuous change is shown in Figure 7. As you can see, SQL Server has decided that it's a good idea to switch to the full-scan approach.
If this query were to be executed against Table2 rather than SimpleView, SQL Server wouldn't switch approaches until there were around 12,000 rows from Table1 selected by the WHERE clause. So what's going on here? The answer is that for queries involving remote tables, SQL Server will query the remote instance for various metadata (e.g., index information, column statistics) when deciding on a particular data-access approach to use. When a remote view appears in a query, the metadata isn't readily available, so SQL Server resorts to a best guess. In this case, the data-access approach it selected is far from ideal.
All is not lost, however. You can help SQL Server pick a better approach by providing a hint. In this case, if you change the inner join to an inner remote join, the query returns 36 rows in only 20ms. The resultant execution plan reverts back to one that looks like that in Figure 6.
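As with the merge hint, this is a one-word change, sketched here against the modified DistributedQuery3:

```sql
-- REMOTE specifies that the join is performed on the site of the
-- right-hand (remote) table, sidestepping the full-scan plan.
SELECT *
FROM dbo.Table1 T1
INNER REMOTE JOIN REMOTE.Scratch.dbo.SimpleView T2
  ON T1.GUID = T2.GUID
WHERE T1.ID < 10
```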
Again, I feel it is judicious to stress that using hints shouldn't be taken lightly. The results can vary. More important, such a change could be disastrous if the WHERE clause were to include many more rows from Table1, so always weigh the pros and cons before using a hint. In this example, a better choice would be to rewrite the query to reference Table2 directly rather than going through the view, in which case you might not need to cajole SQL Server into performing better. There will be instances in which this isn't possible (e.g., querying a third-party server with a strict interface), so it's useful to know what problems to watch for and how to get around them.
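Where the rewrite is possible, it's straightforward. This sketch assumes SimpleView exposes Table2's columns unchanged:

```sql
-- Reference the underlying table directly so LOCAL can obtain index
-- and statistics metadata from REMOTE when costing the plan.
SELECT *
FROM dbo.Table1 T1
INNER JOIN REMOTE.Scratch.dbo.Table2 T2
  ON T1.GUID = T2.GUID
WHERE T1.ID < 10
```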
I hope that I've opened your mind to some of the potential pitfalls you might encounter when writing distributed queries and some of the tools you can use to improve their performance. Sometimes you might find that acceptable performance from a distributed query simply isn't achievable, no matter how you spin it, in which case you should consider alternatives such as replication and log shipping.
It became apparent during my research for this article that SQL Server does a great job optimizing local queries but often struggles to make the right choice when remote data is brought into the mix. This will no doubt improve given time, but for the moment you need to stay alert, watch for ill-performing distributed queries, and tweak them as necessary.