SQL Server is no different from any other Windows application in that it has to "play nicely" with Windows' cooperative scheduler and share CPU and other resources with other applications and operations. Consequently, it's hard not to be impressed when you think about SQL Server's ability to handle, say, more than 800 simultaneous queries or operations while other applications (such as antivirus or backup software) are running on the same host at the same time. SQL Server is able to handle multiple concurrent operations despite having to regularly yield processing capabilities back to the OS because it implements its own internal scheduler. This means that whenever a SQL Server process is allowed to take its "turn" on the host, it has a running list of operations that it needs to process and execute.
Related: Troubleshooting Slow Servers
SQL Server's internal scheduler and underlying database engine (collectively referred to as the SQLOS) is a powerful engine capable of not only processing all sorts of simultaneous tasks but also exposing obscene amounts of detail about what's going on with a given SQL Server instance at any given moment. For example, each time SQL Server has to yield resources back to the OS or knows that it'll end up waiting on hardware to complete a task (such as fetching data from disk), the SQLOS actually keeps tabs on when the wait started, why it started, and when it completes. SQL Server exposes this information by means of dynamic management views (DMVs) that act as specialized pseudo-tables that can be easily queried.
By effectively querying DMVs and other forms of instrumentation provided by SQL Server (e.g., extended events, performance counters), you can gain insight into what's going on with a given SQL Server instance and troubleshoot specific problems. Once you master the ability to query DMVs, all sorts of troubleshooting possibilities open up.
Sadly, there's simply not enough space to cover how to query DMVs and other forms of instrumentation here. (Entire books have been written on these topics.) However, querying DMVs to gain insight into a problem really lies at the heart of being able to effectively troubleshoot any SQL Server problem. With that in mind, this article provides a high-level overview of what you can use to troubleshoot common SQL Server problems.
Troubleshooting Performance Problems
Without a doubt, one of the most common problems that requires troubleshooting is SQL Server performance. The problem with performance troubleshooting is that in many cases, performance problems don't manifest until databases grow past a certain size or until concurrent demand against them increases. As such, most performance concerns tend to be more of a systemic issue.
With that said, even the most highly tuned SQL Server environments are still subject to what DBAs affectionately call "queries from hell," such as the odd Cartesian product created by end users or rogue joins created by inattentive developers. Consequently, when servers that normally run without problems become inexplicably nonresponsive or when certain operations that normally run fine become sluggish, a common troubleshooting task is to find out who or what is beating up the server.
To this end, a fantastic way to find out which queries or operations might be hogging system resources or causing other problems is to use the plethora of DMVs available for performance tuning and troubleshooting purposes. Adam Machanic's sp_WhoIsActive stored procedure provides a perfect example of how you can leverage these powerful DMVs to track down performance problems. It's a great stored procedure that even neophyte DBAs can deploy to find out exactly what's active on their servers.
To troubleshoot anything above and beyond simple one-off problems that occur on an irregular basis, you need a contextual background and lots of hands-on experience to interpret the results from querying DMVs, extended events, and performance counters. Otherwise, it's far too easy to confuse symptoms with root causes.
Stated differently, there's no easy or simple way to become proficient in performance tuning, save putting in the time, energy, and effort needed to gain the skills, understanding, and experience. Therefore, if you're really looking to learn more about performance tuning, I highly recommend Troubleshooting SQL Server: A Guide for the Accidental DBA (Red Gate Books, 2011). It provides a great introduction to the key concepts you'll need to understand. For a more advanced and in-depth resource, I recommend Professional SQL Server 2008 Internals and Troubleshooting (Wiley Publishing, 2010). It's a fantastic book with lots of deep-dive content that provides an excellent overview of the inner workings of SQL Server. It also has excellent from-the-trenches insights about best practices for performance tuning. Only with in-depth resources and plenty of hands-on experience will you be able to become truly proficient in performance troubleshooting.
Troubleshooting Architectural Problems
In my experience, deadlocks and excessive locking and blocking aren't performance issues-they're fundamental coding problems or architectural flaws. Granted, these problems can become more pronounced under load and commonly don't show up until there are multiple concurrent users. However, it's a coding failure when otherwise healthy servers with hardware resources to spare aren't capable of processing concurrent tasks. The code doesn't properly comprehend how SQL Server's concurrency model causes locking and blocking to ensure transactional consistency.
Therefore, the biggest problem with architectural flaws is that they commonly lie hidden and dormant until application concurrency increases enough for them to manifest. It's similar to the way a small imperfection in a bike tire isn't a problem while riding around town but can become fatal going down a mountain. Troubleshooting these problems is complicated by the fact that they typically arise at seemingly the worst possible time.
For initial troubleshooting, querying the sys.dm_os_waiting_tasks DMV is an easy way to get a sense of whether queries are waiting excessively on blocks from other simultaneous queries and operations. For example, the query in Listing 1 (below) uses this DMV to check for operations and queries that are blocked. (Note that this query won't catch long-running resource hogs that aren't blocked. To catch those, you'll need to use Machanic's sp_WhoIsActive.)
You can also use DMVs to track overall SQL Server wait statistics, which can provide great insight into whether locking and blocking problems are causing excessive waits on the server. Paul S. Randal provides good examples of how to do this in his "Wait statistics, or please tell me where it hurts" blog post.
In many cases, locking and blocking problems can be mitigated by simply making queries and operations more efficient. Or, stated differently, if longer-running operations are "bumping" into each other and causing locking and blocking problems, then speeding up those operations through the judicious use indexes can minimize the potential for collision.
Still, no amount of performance tuning can fix locking and blocking problems in scenarios where developers have created operations that will deadlock. If the dreaded #1205 Deadlock error messages start showing up, it's time to start looking at root causes and evaluating options for remediation. Toward that end, one of the best ways to track down deadlocks and begin evaluating what corrections will be needed is to use SQL Server Profiler. It provides a full-blown set of specialized tracing capabilities that you can use to not only detect when deadlocks occur but also capture the entire chain of events (known as the deadlock chain) leading up to the deadlock. Figure 1 shows an example of a deadlock chain in a deadlock graph. The query text that was terminated appears at the bottom of the figure. For more information about how to use Profiler's tracing capabilities, I recommend reading Brad McGehee's article "How to Track Down Deadlocks Using SQL Server 2005 Profiler." Although this content is a bit older, it provides a great overview of the specialized techniques you can use to properly track down deadlocks and make sense of all the details and data provided by deadlock graphs.
Finding the source of a deadlock is one thing; properly addressing it is quite another. Common techniques for correcting deadlocks include re-orchestrating the order of operations within problematic queries or stored procedures so that deadlocks are no longer possible (by means of ensuring access to tables in a consistent order). Using locking hints can help correct deadlocking problems in some cases. Just don't be tempted to use NOLOCK hints (no matter what you might read in online forums) unless you're perfectly aware of (and comfortable with) the catastrophic effect that NOLOCK hints can potentially cause in terms of data consistency within your environment.
In some cases, database versioning through the use of snapshot isolation can be used to completely remove locking and blocking problems without the need for any code changes. Kimberly L. Tripp discusses this approach in the video "Snapshot Isolation." Just be aware, though, that you'll want to do extensive testing before rolling this out. You'll also need to ensure proper tuning and performance of your tempdb database and be aware of the additional fragmentation that versioning can potentially cause. (For more information about the fragmentation, see my "Fragmentation from using snapshot isolation" blog post.)
Finally, if you're having deadlocking problems, it's safe to say the database you're dealing with is mission critical-otherwise, it wouldn't have enough concurrent problems to cause deadlocks. As such, don't be afraid to get outside guidance or consulting help in dealing with concurrency problems if you're uncomfortable with what's going on, because troubleshooting this kind of problem can be disastrously expensive if you get it wrong.
Troubleshooting Disaster Recovery
Although troubleshooting lost and corrupted data and systems shouldn't be a common problem, Internet forums are sadly full of people not only asking for help but also providing all sorts of bad advice. Moreover, in cases in which viable backups aren't on hand, the solutions to recover data and systems border more on the kinds of activities associated with forensic recovery.
Although it's possible (under certain conditions) to recover data or systems through various feats of troubleshooting, the best approach is to avoid the situation by keeping regularly verified off-box and offsite backups to ensure optimal contingencies in the face of disaster. By regularly testing your existing disaster recovery plan, you'll be able to ensure that you can sufficiently meet or exceed recovery point objective (RPO) and recovery time objective (RTO) requirements. (For more information about RPOs and RTOs, see my blog post "SQL Server Recovery Time Objectives and Recovery Point Objectives"). Regular testing also helps identify environmental changes and other problems that can silently "break" backups, which is something that you don't want to discover in the midst of a full-blown crisis.
Troubleshooting Errors and Alerts
Setting up alerts and regularly scouring SQL Server logs to ensure that backups, maintenance tasks, and other business-related jobs are running correctly is part of a DBA's job. Being able to efficiently troubleshoot errors and problems when they pop up is an acquired skill that takes a bit of practice to hone. Ironically, because SQL Server usually provides lots of technical details in verbose error messages when exceptions are encountered, SQL Server error messages can commonly seem overwhelming at first. Still, if you persevere and actually read these messages, you'll often find they provide excellent troubleshooting help. However, in cases in which the error messages and details offered aren't as helpful as desired, it's important to point out that for most automated jobs and operations, SQL Server commonly offers additional logging options that can be used to troubleshoot sticky, recurring problems. You can find more information about these logging options in my blog post "Semi-Advanced Logging Options for SQL Server Agent Jobs."
Another great tactic for troubleshooting errors is to punch in the error codes and key pieces of a given error message into online search engines. However, you need to be skeptical of any recommendation, because it might be a red herring or just plain bad advice. As such, I typically prefer to search sites such as ServerFault.com, where visitors can police and vote on posts to weed out much of the bad advice. More important, rather than looking for simple answers that don't require thought, look for contextual and background information that can help you make better sense of error messages. You should use this information to formulate ideas about what's going wrong so that you can address root causes instead of just symptoms.
In a similar vein, the #sqlhelp hash tag on Twitter is a fantastic resource for troubleshooting problems, especially when you've done a bit of research and are either looking for additional confirmation of your hunches or trying to find additional documentation or background. You'll end up getting a timely response from professionals who know their stuff and who do a fantastic job of self-policing to ensure that you'll almost always get great advice and help. For more information about the #sqlhelp hash tag, see Brent Ozar's "How to Use the #SQLHelp Hash Tag on Twitter."
Another common troubleshooting task is tackling connectivity problems. Most often, connectivity problems occur with newly deployed servers or when new users or applications begin interacting with a SQL Server instance. Even when servers have run without connectivity issues for years, it's possible to run into problems in which end users and applications simply cease being able to connect.
With connectivity problems, I typically begin troubleshooting by paying attention to how long it takes the error to generate. As Figure 2 shows, if an error takes 15 to 30 seconds to generate (after seemingly doing nothing), it's typically safe to assume that there's some sort of connectivity problem. If the error occurs within just a few seconds, it's usually safe to assume that the problem is probably related to permissions or the configuration of the server or client.
In the case of 15- to 30-second timeouts, I typically start out by making sure that I can properly ping the host in question before getting too far into other concerns. I've encountered many cases in which SQL Server somehow becomes multi-homed and therefore fails to resolve correctly from some clients. Likewise, ensuring that named instances are or aren't specified as needed can save lots of time when troubleshooting. (There's nothing worse than wondering why a server with only a named instance won't allow new inbound connections when firewalls and everything else have been properly configured.)
A trick that I've commonly employed when troubleshooting stubborn connectivity problems is to fire up Profiler and watch for Audit Login and Failed Audit Login events when remote applications or users try to connect, as Figure 3 shows. If no events occur, I assume that there's a homing, name resolution, firewall, port listener, or named instance networking problem that's preventing connections from finding or reaching the intended server. If an event occurs, I track down errors or other details to troubleshoot what's going on once the connection is made.
The troubleshooting techniques and approaches I covered are by no means an exhaustive list. In addition to the books I already mentioned, I highly recommend SQL Server 2005 Practical Troubleshooting: The Database Engine (Pearson Education, 2007). This book is a bit dated, but the concepts and techniques it presents from Microsoft's product support engineers stand the test of time when it comes to providing insights and guidance into troubleshooting a host of different SQL Server problems. Microsoft's Customer Service and Support (CSS) SQL Support team can also offer help both in the form of formal support instances and its CSS SQL Server Engineers blog. The team's blog posts provide some great insights and guidance on how to deal with common problems. In a similar vein, the vast trove of white papers managed by the SQL Server Customer Advisory Team (SQLCAT) provide a similar wealth of in-depth information that can be invaluable in troubleshooting, as long as the problem you're troubleshooting falls within the domain of something being addressed.
Finally, for some stubborn problems, even a small amount of time spent with an outside consultant can be a worthwhile investment (provided you've found the right consultant). Problems that seem exotic to you might be commonplace enough for a hired gun who specializes in providing SQL Server guidance.
Listing 1: Sample DMV Query to Find Blocked or Waiting Tasks
INNER JOIN sys.dm_exec_sessions s
ON w.session_id = s.session_id
INNER JOIN sys.dm_exec_requests r
ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.plan_handle) q
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) p
w.session_id > 50
AND w.wait_type NOT IN
/* + add your own here*/)