In my travels throughout the world as a SQL Server performance consultant, I find that many people have trouble determining where to start when they’re looking for SQL Server performance problems. Many symptoms—and even SQL Server features—can be misleading. That reality, along with the industry outlook that SQL Server DBAs don’t need real training, can be a recipe for inefficient troubleshooting techniques and too much time down the wrong path.
So, I want to show you some techniques that can help you easily determine the root causes of performance problems. Note that I’ll be referencing previously existing articles for details about each area I cover. My hope for this article is to get you in the proper frame of mind and direction when it comes to isolating performance bottlenecks in your server. Four key metrics can point you in the right direction: Wait statistics, File statistics, Top 10 inefficient queries, and Plan reuse.
I always recommend starting with wait stats; this technique has never failed to give me valuable insight into what the SQL Server instance is waiting on, along with what it’s not waiting on. You can read more about wait stats and techniques for using them in “Getting to Know Wait Stats” (InstantDoc ID 96746) and “Dissecting SQL Server’s Top Waits” (InstantDoc ID 98112).
In a nutshell, the wait stats DMV sys.dm_os_waitstatistics can tell you the type of things SQL Server is waiting on the most—although not necessarily what’s causing the waits. (We’ll get to that part shortly.) It’s extremely important to know whether you’re waiting on I/O, CPU, internal latches, and so on, because each wait type will give you clues as to what types of processes are causing the waits in the first place. For example, if you see that PAGEIOLATCH_xx waits are high, you can be pretty sure you’re reading or writing more physical I/O than your storage subsystem can currently handle, for whatever reason.
Now, it’s very possible that you’re doing much more physical I/O than you should be doing, thanks to poorly tuned queries or missing indexes and such. If you’d seen a high rate of PAGELATCH_UP waits, you’d know that you’re getting contention on the internal bitmaps in the data files (e.g., the GAM, SGAM, and PFS pages). High CXPACKET waits indicates that you’re doing lots of parallel activities, some of which might not be totally efficient. If it’s an OLTP system, you might not be getting the concurrency you’d like depending on the number of concurrent users. And LATCHs might indicate that you have some contention in the various portions of SQL Server memory.
Also, always remember to differentiate between signal waits and resource waits. A high percentage of signal waits tells you that most of the time is spent by threads waiting to get their time on the processors and that you might very well be CPU-bound. The point is that the wait stats are telling you where to start or focus your attention for the next step, which is identifying the underlying cause of these waits. You don’t have to spend time troubleshooting I/O issues if you’re CPU-bound (or vice versa).
Keep in mind that because this DMV gives you a snapshot in time, you must take at least two snapshots over a given time period and calculate the delta or difference between them to get any useful information for that time period. The waits by themselves won’t point you to a particular query but will tell you what types of things you’re waiting on or not waiting on. I’ll show you later how to find the actual offending queries.
Another measure I recommend observing on a regular basis is the file statistics that come from the DMF sys.dm_db_virtual_file_stats. For more information about file statistics and physical I/O, see “Getting to Know Virtual File Stats” (InstantDoc ID 96513) and “Is Your Physical Disc I/O Affecting SQL Server Performance?” (InstantDoc ID 103703). This DMF will show you a snapshot in time of the physical I/O (i.e., actual reads and writes to disk), breaking it down by reads and writes with the number of requests for each, along with the amount in megabytes that was read or written. As with the wait stats, you must target a period of time and take the delta between them to see how much physical I/O has occurred during that time period. But once you’ve done that, you can easily see whether you have excessive I/O and—more important—whether it’s waiting too much based on the IOStalls column found in the DMF.
Your first question is probably, “What is excessive?” As always, the answer depends on your environment and conditions. Generally, however, anything that produces high stall time or makes users wait too long is of concern. The amount that you read or write might be dictated by your application, and you’ll need to determine over time what is normal or acceptable and what is not. But, again, the point is that this information will clearly tell you whether you’re waiting on the storage subsystem to satisfy the requests and how many and how large your requests are. You’ll have the information you need to make an intelligent decision from there. Just as before, you need to realize that too much I/O can be a result of such things as poorly tuned queries or missing indexes, and you should always address those concerns before you attempt to beef up your hardware.
Top 10 Queries
I can tell you from experience that most applications have a few queries or stored procedures that represent the majority of performance problems. If you produced a list of all the types of database calls, you would find that the top 10 would probably be 80 to 90 percent of the trouble. So, by finding and addressing these top 10 poorly performing queries, starting with the first one, you can get the biggest bang for your buck in terms of effort spent and final reward.
You can easily detect these in one of two ways. The first way is to use the DMV sys.dm_exec_query_stats, using the techniques in “Are Your SQL Server Statements Performing Well?” (InstantDoc ID 97761). Here, you can easily sort by total or average CPU, Reads, Duration, and so on, to find the top offending statements—and not just the procedure or batch it belongs to. The second way is to parse trace files with the RML utilities, as detailed in “Trace Reporting with RML Utilities” (InstantDoc ID 100670). Both ways have pros and cons, and I’ll leave it up to you to decide which works best for your environment. The key is to look for statements that are executed often and use a lot of resources, thus hurting your overall system performance or concurrency. You’ll be surprised by how much performance will improve after you address some of your top poorest-performing queries.
Last but not least is proper plan reuse. You want to ensure that you minimize compiles associated with new requests to the database by reusing existing plans in the procedure cache. The following query will list each of the plans currently in the procedure cache in the order of the number of times a plan was reused. You can also order by the text to see which plans are similar and should probably reuse previous plans.
SELECT b.\\[cacheobjtype\\], b.\\[objtype\\], b.\\[usecounts\\],
a.\\[dbid\\], a.\\[objectid\\], b.\\[size_in_bytes\\], a.\\[text\\]
FROM sys.dm_exec_cached_plans as b
CROSS APPLY sys.dm_exec_sql_text(b.\\[plan_handle\\]) AS a
ORDER BY \\[usecounts\\] DESC
The article “Do You Have Hidden Cache Problems?” goes into further details.
I can’t stress this enough: Lack of proper plan reuse can cause many performance problems within SQL Server. I can’t tell you how many times I’ve seen poorly formatted ad hoc calls to the database take the server to its knees with all the side effects that accompany it. Again, the key is to identify the calls that don’t reuse the existing plans and come in at a high frequency. Addressing those first will get you the most reward for your effort. The biggest problem with addressing these particular problems is that doing so will likely involve a change to the client application or the middle tier. Even though the changes don’t have to be extensive (in most cases), it usually falls outside the direct responsibility of the DBA and thus might require a more concerted effort to fix the problems.
Take the Time
If you take just a little time to get proficient with these four topics, you’ll have a much greater chance of finding and addressing performance problems in your SQL Server instances with reduced effort. Keep in mind that although throwing hardware at a problem might be a viable solution in some cases, it’s usually far more cost effective and efficient to find and tune the existing poorly performing queries.