No more firefighting! Control your ad-hoc support calls systematically
It's the call you dread—a user reporting that the application is slow and asking what's wrong with the database. You drop whatever you're doing and flip into "firefighting mode," wondering how long the problem has been going on. Because one of your top priorities is keeping the production databases running smoothly, you have no choice but to do a quick fix as fast as possible; you can't take time to determine the underlying cause of the problem.
There has to be a better way, doesn't there? You bet. You can use proactive monitoring, a simple monitoring regimen that uses performance baselining, benchmarking, and continuous monitoring. Let me show you how to use proactive monitoring and how to set up a no-cost monitoring system that uses Windows System Monitor.
Proactive monitoring is a simple system that lets you catch problems before they become critical. (You probably already use exception monitoring, in which you create automated processes that note only exceptions to normal service, but that approach doesn't provide in-depth information or let you prevent problems.) Proactive monitoring, in contrast, gives you rich short- and long-term information about your environments and applications. It's easy: You take baseline measurements, establish benchmark metrics, and maintain an active monitoring regimen.
As its name implies, proactive monitoring requires you to take action. You need to spend a little time setting it up and a little time understanding the performance profile of the databases and applications. For proactive monitoring to be effective, you must review reports so that you can use the rich information gathered.
Baseline, Benchmark, Monitor
Let's start by defining some terms. A baseline is a set of measurements that tells you how a server and application behave under typical conditions. It's the average of several measurements taken under similar conditions, and it serves as your standard for comparison.
A benchmark records system performance at a predefined level of load on the server, and it lets you compare the production server against those levels when the server's metrics are far above or below normal (i.e., when the server is performing poorly). As with a baseline, you take a benchmark in a controlled environment, measuring key areas against predefined metrics. Because you want to understand the server and application's behavior under several levels or types of load, you'll usually take several benchmarks (as opposed to one baseline).
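To make the baseline idea concrete, here's a minimal Python sketch of averaging several measurement runs into one baseline. The counter names and sample values are made up for illustration; in practice they'd come from your System Monitor logs.

```python
def build_baseline(runs):
    """Each run maps counter name -> measured value; the baseline is the
    per-counter average across all runs taken under similar conditions."""
    counters = runs[0].keys()
    return {c: sum(r[c] for r in runs) / len(runs) for c in counters}

# Three hypothetical measurement runs under similar conditions.
runs = [
    {"Processor: % Processor Time": 22.0, "SQLServer:Databases Transactions/sec": 180.0},
    {"Processor: % Processor Time": 26.0, "SQLServer:Databases Transactions/sec": 200.0},
    {"Processor: % Processor Time": 24.0, "SQLServer:Databases Transactions/sec": 190.0},
]
baseline = build_baseline(runs)
print(baseline["Processor: % Processor Time"])  # 24.0
```

Each counter gets its own average, so the baseline stays a simple per-counter lookup you can compare live readings against.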
Monitoring is the ongoing, continuous, real-time observation of a server for predefined events (sets of circumstances that you've defined as worthy of further investigation or notification). For example, you might want to know when an important business process completes successfully, when a backup finishes, or when certain performance thresholds are hit, so you monitor for these specific events.
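In code terms, a predefined event is just a predicate checked against every new sample. Here's a minimal sketch; the rule names, thresholds, and the sample itself are hypothetical.

```python
def check_events(sample, rules):
    """Return the names of all events whose rule fires for this sample.
    rules maps an event name to a predicate over a sample dict."""
    return [name for name, rule in rules.items() if rule(sample)]

# Hypothetical event rules for illustration only.
rules = {
    "high_tps": lambda s: s["transactions_per_sec"] > 235,
    "deadlocks_seen": lambda s: s["deadlocks_per_sec"] > 0,
}

sample = {"transactions_per_sec": 241, "deadlocks_per_sec": 0}
print(check_events(sample, rules))  # ['high_tps']
```

Keeping the rules in a table like this makes it easy to add or retire events without touching the monitoring loop itself.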
Now let's set up proactive monitoring. You can use third-party solutions, or you can implement a no-cost solution that uses System Monitor. Third-party solutions can simplify proactive monitoring and add value beyond what no-cost or built-in solutions can provide. But to get you started, I'll show you how to set up proactive monitoring with System Monitor. (Note that System Monitor works with Windows 2000 and monitors SQL Server on Windows NT 4.0 and Windows 2000 only.)
Step 1: Build the Baseline
The first step in implementing a proactive monitoring regimen is to establish the database server baseline profile. This profile tells you about the performance of the server under normal conditions, helps you document and understand all significant background processes, and helps you figure out "do not respond" situations so that you can later filter them out. (Otherwise, DBAs might end up ignoring the system because it generates so many false alerts.)
The best baselines use a small number of graphics (ideally just one) to represent performance visually, so that you can see at a glance how the server is performing. When you're baselining, three decisions determine your success:
- Choose a sampling approach that saves the performance data to a log file or displays it in real time. Having both options is ideal: Logs let you go back in time to analyze performance that occurred when you weren't watching the system; real-time displays don't consume ongoing disk space, and they let you give the system 100 percent of your attention when you need to.
- Specify a sampling interval that balances the need for gathering performance data against the disk I/O and space costs it incurs. System Monitor's default is 15 seconds. The longer the interval, the greater the chance that interesting bits of performance data will be missed.
- Choose between local and remote monitoring. Local monitoring, in which the monitoring process runs on the server being monitored, adds CPU and disk overhead to that server. Remote monitoring, which runs on a separate server, avoids that overhead but greatly increases network traffic.
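The interval trade-off is easy to estimate with back-of-the-envelope arithmetic. In this sketch, the bytes-per-counter figure is an assumed average record size, not a documented System Monitor number; plug in what you observe on your own logs.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_log_bytes(counters, interval_sec, bytes_per_counter=16):
    """Rough daily log growth; bytes_per_counter is an assumption
    for illustration, not a documented System Monitor figure."""
    samples = SECONDS_PER_DAY // interval_sec
    return samples * counters * bytes_per_counter

# 11 counters at the default 15-second interval:
print(daily_log_bytes(11, 15) / 1024)  # 990.0 (KB/day)
# The same counters sampled once a minute:
print(daily_log_bytes(11, 60) / 1024)  # 247.5 (KB/day)
```

Quadrupling the interval cuts log growth to a quarter, which is exactly the granularity-versus-space trade the text describes.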
Table 1 lists the System Monitor metrics, or counters, that I recommend you use in establishing a baseline. I can't say what value is "right" in the context of your application because what's "right" varies greatly from system to system. Use the average of multiple baselines to establish typical (baseline) performance and to indicate what's right for your system.
Baselining with System Monitor
Now we're ready to set up System Monitor to capture baseline data. Open Control Panel, Administrative Tools, Performance. Double-click Performance Logs and Alerts in the left panel. Right-click Counter Logs and select New Log Settings. Enter a name for the baseline chart, then click OK. In the Select Counters dialog box, choose the first counter, then click Add. Repeat until all counters are added, then click Close.
Try the default 15-second sampling interval at first, or select a different interval by opening the log's Properties and entering a value under Sample automatically every: _ seconds. Longer intervals consume less space, but they provide less-granular data.
Select the Log Files tab and specify where you want to save the data. You can review the data later by using View Log File Data. System Monitor will look similar to Figure 1 as it captures baseline data. You can see that tracking many counters at once can get unwieldy, so choose counters carefully for your baseline.
Step 2: Benchmarking
After you've established a baseline of a server's performance, you're ready for benchmarking, which shows you how the server performs under several predetermined usage scenarios.
For benchmarking, you use the same monitoring regimen as for baselining. You can use one or more popular benchmark scenarios available in the industry, such as TPC-C or SAP, but you get the best benchmarking results by building usage scenarios that are specific to the intended use of your specific database server and its applications.
You can build your own benchmarking scenarios by using a collection of T-SQL scripts, the osql utility or Query Analyzer, SQL Profiler, and System Monitor. Creating the load test scripts in T-SQL usually takes a few days. You can expect to spend a few more days capturing load test performance data and analyzing the information.
After you've determined the performance characteristics of the server under predefined benchmarks, you'll have a good idea of the type of behavior that you can expect from the system. Use the information you've gained from benchmarking to form the foundation of your ongoing monitoring. For example, say that you learned that the server can accommodate up to 249 transactions per second (TPS) before it begins to bog down. Knowing this, you'll probably want to set up a low-priority notification for when the server reaches about 200 TPS and set up a high-priority notification for when the server reaches 235 TPS. That way you'll be aware of potential problems with the server and be able to proactively deal with trouble situations long before users notice anything. No more firefighting. Now that's power!
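The TPS example reduces to simple arithmetic. This sketch derives the two notification levels from the measured ceiling; the 80 and 94 percent fractions are illustrative judgment calls that happen to land near the values above, not fixed rules.

```python
capacity_tps = 249  # measured ceiling from benchmarking

# Derive notification levels as fractions of measured capacity.
low_priority = round(capacity_tps * 0.80)   # low-priority alert near 200 TPS
high_priority = round(capacity_tps * 0.94)  # high-priority alert near 235 TPS
print(low_priority, high_priority)  # 199 234
```

Deriving the levels from the measured ceiling (rather than hard-coding them) means they stay correct when you re-benchmark after a hardware or schema change.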
Step 3: Ongoing Monitoring
Perhaps the most important component of the proactive management regimen is ongoing and continuous monitoring. Without it, you can't keep an eye on database performance or become aware of developing performance problems.
You can set up an inexpensive monitoring regimen for SQL Server using a combination of SQL Server Agent and System Monitor. SQL Server Agent lets you define which error events to monitor, specify who receives event notifications, and automatically send a notification when the error event occurs.
Setting up SQL Server Agent can be time-consuming and tricky, so you'll need to consult the Alerts topic in SQL Server Books Online (BOL) for that. SQL Server Agent typically monitors only the database server for error events and doesn't provide performance monitoring.
To monitor server performance, use System Monitor to watch the following counters (set to a 15-minute polling frequency):
Network Interface—Bytes total/sec
Physical Disk—Disk Transfers/sec
Processor—% Processor Time
SQLServer:Access Methods—Full Scans/sec
SQLServer:Buffer Manager—Buffer Cache Hit Ratio
SQLServer:Databases—Transactions/sec (for the application database)
SQLServer:General Statistics—User Connections
SQLServer:Latches—Average Latch Wait Time (ms)
SQLServer:Locks—Average Wait Time (ms)
SQLServer:Locks—Number of Deadlocks/sec
SQLServer:Memory Manager—Memory Grants Pending
Specify a value for each counter between the baseline and benchmark values that your testing revealed. For example, you might set a caution notification when a counter reaches 75 percent of the best benchmark value and a warning notification when it passes 90 percent.
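Here's a sketch of that rule of thumb, with one wrinkle worth encoding: for a counter such as Buffer Cache Hit Ratio, where higher is better, you alert as the value falls away from the benchmark rather than climbs toward it. The function and fractions are illustrative, not a prescribed formula.

```python
def alert_levels(benchmark_value, higher_is_worse=True,
                 caution_frac=0.75, warning_frac=0.90):
    """Return (caution, warning) trigger values derived from the best
    benchmark value, per the 75/90 percent rule of thumb."""
    if higher_is_worse:
        # Alert as the counter climbs toward the benchmark ceiling.
        return (benchmark_value * caution_frac, benchmark_value * warning_frac)
    # For 'higher is better' counters, alert as the value drops:
    # a small dip is a caution, a large one a warning.
    return (benchmark_value * warning_frac, benchmark_value * caution_frac)

print(alert_levels(300))  # (225.0, 270.0) -- e.g., Disk Transfers/sec
print(alert_levels(98.0, higher_is_worse=False))  # e.g., Buffer Cache Hit Ratio
```

Encoding the direction alongside each counter keeps you from silently inverting an alert when you copy thresholds between counters.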
To implement alerts, you can use a no-cost tool such as SQL Server's alerts and notifications or System Monitor, or you can buy Microsoft Operations Manager (MOM) or one of various third-party tools. At a minimum, I recommend setting alerts for the following events:
- Errors affecting service, specifically errors with a severity of 19 through 25
- CPU utilization
- Disk utilization
- Scans (SQLServer:Access Methods)
You can send alerts to notify staff by email, pager, or net send. You'll want to set up automated alerts for these areas, too:
- SQL Server log
- SQL Agent log
- Windows Application, Security, and System event logs
- SQL Server job history and trends
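Routing those notifications by priority can be as simple as a lookup table. This sketch is hypothetical; substitute your own priorities and channels.

```python
# Hypothetical routing table mapping alert priority to channels.
ROUTES = {
    "caution": ["email"],
    "warning": ["email", "pager"],
    "critical": ["email", "pager", "net send"],
}

def channels_for(priority):
    """Return the notification channels for an alert, defaulting to email."""
    return ROUTES.get(priority, ["email"])

print(channels_for("warning"))  # ['email', 'pager']
```

A default route guarantees that a misclassified alert still reaches someone instead of vanishing.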
Finally, you'll want to ensure that any custom applications implement good error logging, and then monitor for your applications' custom error messages.
You Owe It to Yourself
Proactively managing SQL Server means building a performance baseline for both server and application, establishing benchmarks that model server performance under predetermined usage scenarios, and then implementing continuous monitoring, ideally triggering alerts when problems develop. Whether you use no-cost or built-in tools or you choose third-party solutions, you owe it to yourself and your company to adopt a proactive monitoring regimen and get the most out of your critical SQL Server applications.