Performing query tuning and trying to assess parallelism aspects of a query plan can be tricky if your test machine doesn't have the same number of CPUs as the target machine. In addition, sometimes you simply don't have access to the target machine. You must try to figure out what you can by working with the system that you do have access to at the time -- say, your laptop. With parallelism, SQL Server's choices of whether or not to use parallelism, the degree of parallelism (DOP), and CPU costing aspects of iterators depend on the number of CPUs available in the machine.
For example, you could have a query that generates a serial plan in an environment with four CPUs and a parallel plan when the only different factor in the environment is that it has eight CPUs. Similarly, a certain query pattern (e.g., one with a filter col <= @val) might switch from serial to parallel plan in different tipping points in terms of selectivity because of different costing of iterators based on the number of CPUs.
If your testing environment has more CPUs than the target environment (which isn't very likely), you can restrict the number of CPUs with server configuration options that set the maximum degree of parallelism and process affinity, as well as a MAXDOP query option that sets the maximum degree of parallelism at the query level. But what's more likely to be the case is that your test environment will have fewer CPUs than the target environment. The aforementioned options can restrict the number of CPUs considered, not increase it.
This article describes two options that I use in such cases. One is the -P startup parameter, which is well-known in the SQL Server community. The other is the less-known DBCC OPTIMIZER_WHATIF command, which I recently learned about from my friend and SolidQ colleague Eladio Rincón. Note that neither option is covered in the official SQL Server documentation; because these options are unsupported, you should use them at your own risk.
Sample Data, Environment, and Sample Queries
The code in Listing 1 creates the sample data that I use in this article. This code creates a table called T1 and populates it with 1,000,000 rows. The table has three columns: col1, col2, and filler. SQL Server creates a clustered index called PK_T1 with col1 as the key because of the definition of the PRIMARY KEY constraint.
SET NOCOUNT ON;
IF OBJECT_ID('dbo.T1') IS NOT NULL DROP TABLE dbo.T1;
CREATE TABLE dbo.T1
col1 INT NOT NULL
CONSTRAINT PK_T1 PRIMARY KEY,
col2 INT NOT NULL,
filler BINARY(200) NOT NULL
CONSTRAINT DFT_filler_col1 DEFAULT(0x)
DECLARE @numrows AS BIGINT = 1000000;
L0 AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n
INSERT INTO dbo.T1(col1, col2)
SELECT TOP (@numrows) n AS col1, n AS col2
ORDER BY n;
I ran the examples for this article on a laptop with an Intel Core i5 processor. My machine has four logical CPUs and 8GB of RAM, running SQL Server 2012 CTP3, Enterprise Evaluation Edition (64-bit). Because my laptop has four logical CPUs, SQL Server starts with four schedulers by default.
Consider the three queries in Listing 2 (I'll call them Query 1, Query 2, and Query 3). When I ran the queries from Listing 2 on my laptop, I got the plans that Figure 1 shows.
SELECT col1, col2, filler FROM dbo.T1 WHERE col2 <= 50000 OPTION (RECOMPILE);
SELECT col1, col2, filler FROM dbo.T1 WHERE col2 <= 100000 OPTION (RECOMPILE);
SELECT col1, col2, filler FROM dbo.T1 WHERE col2 <= 500000 OPTION (RECOMPILE);
Observe that Query 1 generated a parallel plan, whereas Query 2 and Query 3 generated a serial plan. It might seem a bit counterintuitive that the plan becomes serial when more rows are filtered, but there's an explanation. The optimizer simply compares the costs of the serial and parallel plans and chooses the one with the lower cost.
Just scanning data doesn't require intra-query parallelism; the read-ahead mechanism takes very good care of efficient scanning. In our case, what can benefit from parallelism is the filtering.
As I mentioned, for the optimizer to even consider a parallel plan over a serial one, the parallel plan cost must be lower. So the optimizer needs to assign a lower cost to a parallel scan than to a serial one. You can observe in Figure 1 that the CPU cost associated with the serial scan iterator (Query 2) is about 1.1, whereas the cost associated with the parallel scan iterator (Query 1) is about 0.55 -- half that of the serial scan.
If you're a curious person, you're probably wondering why the optimizer divided the serial cost by 2 rather than by 4, which is the query DOP shown in the root iterator in the plan. As it turns out, the optimizer defines two degrees of parallelism -- one for execution and another for optimization.
DOP for execution is the one shown in the root iterator in the plan. It determines how many threads can be used in each so-called parallelism zone in the plan. Query 1 got DOP for execution 4 because SQL Server started with four schedulers.
DOP for optimization is used for costing purposes, such as serving as the divisor of the serial iterator's CPU cost. DOP for costing is intentionally computed to be lower than DOP for execution to account for the fact that multiple queries typically need to compete for CPU resources. It's computed as 2 when SQL Server runs with two or three schedulers; otherwise, you'd never get a parallel plan with only two CPUs. With more than three schedulers, DOP for costing is computed as the number of schedulers divided by 2 using integer division. My laptop has four schedulers, and hence DOP for optimization is computed as 2.
I can verify how many schedulers SQL Server started with by running the following query:
Now you know why the optimizer divided the cost of the serial scan iterator by 2 rather than by 4 in the plan for Query 1. But why did it choose a serial scan for Query 2 and Query 3, which filter more rows than Query 1? The answer is that in my environment, in all three cases, the cost of a parallel scan is estimated to be the same -- about 0.55. There's a constant cost associated with each row processed by the exchange iterator that gathers the streams. Hence, the more rows filtered, the higher the total cost of the exchange iterator -- to the point that the total query cost becomes higher than that of the alternative serial plan.
The higher the number of CPUs, the higher the number of schedulers SQL Server starts with. The higher the number of schedulers, the higher the DOP for optimization. The higher the DOP for optimization, the lower the estimated cost of the parallel scan iterator. The lower the cost of the parallel scan iterator, the more rows the exchange iterator can process before the total query cost becomes higher than the alternative serial plan. Pure logical deduction should then lead you to the conclusion that the tipping point from a parallel plan to a serial plan in terms of number of filtered rows increases with more CPUs.
Now that you know why Query 1 generated a parallel plan and Query 2 and Query 3 generated serial plans on a system with four logical CPUs, you know that the plans could be different on a system with a different number of CPUs -- because of the changing tipping point. Suppose that you're on an airplane or a train on your way to a customer. You have your laptop with you, which has four CPUs. But the target system at the customer site has 8, 16, or maybe 32 CPUs. You want to analyze query plans on your laptop, but you need to mimic the target environment as much as possible. I'll cover two options you can use in such a case: the -P startup parameter, which affects the entire instance, and the DBCC OPTIMIZER_WHATIF command, which affects the current session only.
Startup Parameter -P
The first option that I'll describe is a startup parameter for the SQL Server service called -Pn, where n is the number of schedulers you want SQL Server to start with. As far as SQL Server is concerned, for purposes such as computing both DOP for execution and DOP for costing, it's as if the service is running on a machine with that many logical CPUs.
You can specify this startup parameter using SQL Server Configuration Manager (from the Microsoft SQL Server program group under the Configuration folder). After SQL Server Configuration Manager loads, double-click the SQL Server database engine service in question. The SQL Server service Properties dialog box will appear. If you're using the tool in SQL Server 2008 R2 or earlier, you need to enter the startup parameter in the Advanced tab under Startup Parameters, as Figure 2 shows.
Notice the use of a semicolon to separate startup parameters. I used -P8 in this example to indicate that I want the service to start with eight schedulers.
If you're using the tool in SQL Server 2012, you need to enter the startup parameter in the text box in the separate Startup Parameters tab and click Add, as Figure 3 shows.
Figure 3: Service Properties dialog box in SQL Server 2012
You must restart the service for the change to take effect. Run the following query against sys.dm_os_schedulers to verify that SQL Server started with the desired number of schedulers:
Recall that the laptop I used for my examples has four logical CPUs. After starting the service with eight schedulers, I reran the queries from Listing 2 and got the plans that Figure 4 shows.
Compare these plans with the plans in Figure 1. The plans in Figure 4 are the type of plans you'd typically get on a system with eight logical CPUs. Observe that DOP for execution is 8, because the service started with eight schedulers. The cost of the serial scan is four times the cost of the parallel scan; therefore, you can conclude that DOP for costing is 4. As you'll recall, DOP for costing is computed as the number of schedulers divided by 2, and because the number of schedulers is now eight, DOP for costing is 4.
Figure 4: Plans for queries in Listing 2 with startup parameter -P8
Now that my system is running eight schedulers, the estimated cost of the parallel scan is half compared with a system running four schedulers. Therefore, the exchange iterator that gathers the stream can handle more rows before the entire plan cost will be estimated as higher than the alternative serial plan. Note that Query 2 generated a parallel plan in Figure 4, whereas it generated a serial plan in Figure 1.
You can also observe that the estimated CPU costs of the exchange iterators in Query 1 and Query 2 are different because each query handles a different number of filtered rows. In Query 3, the number of rows that would pass through the exchange iterator was such that the total plan cost was higher than that of the alternative serial plan, hence the latter was chosen.
The downside of the -P startup parameter is that it has a global effect on the entire instance and requires restarting your SQL Server service for it to be applied. This brings me to an option that I learned about quite recently.
DBCC Command DBCC OPTIMIZER_WHATIF
I recently learned about an option called DBCC OPTIMIZER_WHATIF from my friend and colleague Eladio Rincón. He discovered it in use at a customer site; the customer had been told that this option, when used with 1 as the first argument and some number n as the second argument, affects the optimizer's choices in terms of parallelism. Conducting a web search on this command revealed very little. A couple of websites list the option as an undocumented DBCC command, but they have very little information about what it does.
Experimenting with this command seems to indicate that it affects the optimizer's choice of DOP for costing. Namely, the optimizer produces a plan as if there were n logical CPUs in the machine, where n is the second argument you provide to the command.
I removed the -P startup parameter discussed in the previous section and restarted the service. It started as usual on my system with four schedulers (verified with a query against sys.dm_os_schedulers). I then ran the following command to instruct the optimizer to create plans in my session as if there were eight logical CPUs on my system:
DBCC OPTIMIZER_WHATIF(1, 8);
I then reran the queries in Listing 2 and got the plans that Figure 5 shows.
Compare these plans with the ones in Figure 4, which were produced when SQL Server was running eight schedulers. Observe that the plans are almost identical and that costing is identical -- so it seems like DOP for costing was computed as if the system were running eight logical CPUs. The one evident difference is that DOP for execution is 4 rather than 8. Therefore, it seems as if DOP for execution is still limited by the number of schedulers in the system. But if all you care about is simply getting the same plans you would get on the target system, it might just do the trick.
To get the session back to normal behavior, run the command with 0 as the second argument:
DBCC OPTIMIZER_WHATIF(1, 0);
Use at Your Own Risk
I find both the -P startup parameter and the DBCC OPTIMIZER_WHATIF command to be great tools when working on parallelism aspects of queries without access to the target system. Unfortunately, they're both undocumented, so you must use them at your own risk. I hope Microsoft will recognize the importance of these tools to the community and will make them public.