Learn what to monitor, how to monitor, and how to interpret the results of DMV management
Every DBA knows that there are certain aspects of SQL Server and its associated hardware that you simply must monitor if you want to measure overall system performance. However, not everyone knows exactly what to monitor, how to monitor, and—most important—how to interpret the results. In this longer-than-usual edition of "PTO for the Rest of Us," I'll go into some depth to explore one of these crucial puzzle pieces: I'll use a built-in dynamic management view (DMV) in SQL Server 2005 related to physical file access—sys.dm_io_virtual_file_stats()—and I'll generate and use a small amount of custom code that will let you easily report on this DMV's output. Although you'll find this DMV only in SQL Server 2005 or later, it replaces a similar function in SQL Server 2000 called ::fn_virtualfilestats(). This column's techniques, code, and principles will be specific to SQL Server 2005, but you can easily apply them to SQL Server 2000 with minor modifications. For more on the latter application, join the discussion about this article in the SQL Server Magazine performance-tuning forums (http://sqlforums.windowsitpro.com).
What can this wonderful DMV do for you? In a nutshell, this DMV lets you see exactly what type of physical I/O your SQL Server system is experiencing when you're reading from or writing to the files on disk. The physical I/O distinction is important, compared with the metrics you might get from other SQL Server functions and utilities, which include or report on only logical I/O. (Logical I/O is when a page is accessed only in memory, or cache, whereas physical I/O accesses the file on disk.) So, remember that if the I/O accesses only the SQL Server cache (hence a logical I/O), the output of this DMV won't reflect that I/O. Therefore, you can determine with certainty how much you're reading and writing to your database files on disk. This DMV also indicates how long SQL Server has taken to complete the reads or writes. And because the DMV tracks that data at the individual file level, you can use the results to get a terrific handle on several aspects of your system, such as how busy one file is compared with another, or which file waits the longest for physical operations to occur.
Let's take a look at the raw DMV output, which Table 1, shows. Most of this data is self-explanatory, and all of it is occasionally useful, but I'll focus on just a few for the purposes of this article. The data about number of bytes read and written provides an accurate look at how much data you're reading or writing to each of the files. The number of reads and writes specifies how many individual requests were made for that time period. This information can tell you two very important statistics about your system.
First, there will be no doubt as to which database files have the most physical disk I/O—both in terms of the quantity of data (bytes read or written) and the frequency of the I/O (number of reads or writes). Lots of data read in bytes, combined with a fewer number of reads, can indicate poorly optimized queries that read too much data and have difficulty keeping all the data in the cache. Lots of write requests indicate a busy system in terms of changes to the database.
Second, together with the I/O stall counters, this raw data can tell you exactly where your disk bottlenecks are, in terms of the physical disk arrays the files reside on. The I/O stall counters give you an indication in milliseconds of how much time elapses during the read or write operation—or the combined total, depending on the column. So, obviously, the file with the most I/O stalls is spending the most time waiting. If you add up the stalls for all the files that reside on a particular physical array, you can see which part of the storage system is waiting the most. But that information alone doesn't necessarily suggest that a given array is the poorest-performing. You also have to consider the amount of data processed and the frequency of the I/O requests for that time period.
Making It Work
Table 2 shows some data I gathered for two files on my test system. One is a transaction log file, and the other is a data file. Note that you must be careful when comparing different types of I/O. Log-file traffic is typically smaller in size but accessed more frequently than data, which is larger but characterized by less frequent I/O. These differences suit the purpose of this discussion perfectly. As you can see, the average wait per I/O for the data file is a little over 1 millisecond more than that of the log file. However, the average megabyte per write on the data file is almost four times the average megabyte per write on the log files. That result merely speaks to the way data is written to disk—usually in a deferred fashion as opposed to the transaction log's write-ahead logging behavior. When comparing one data file with another, or a group of files that reside on one array versus another, it becomes much easier to spot the offenders. Remember that the number of waits or requests is always relative, and a high number doesn't always indicate a problem. If you're writing a ton of data, you'll most likely see high values in these columns. The goal is to keep the averages as small as feasible; there will always be some latency or wait times because of the way storage systems work. Microsoft's "Predeployment I/O Best Practices" white paper (http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx) has some great information about preparing disk subsystems for use. The paper also includes handy guidelines for determining how much latency is acceptable under certain conditions.
Another effective way to utilize this DMV's information is in day-to-day monitoring of your storage subsystem. You can take some measurements under known conditions, then use that data as a baseline. Afterward, from time to time, you can compare current metrics against the baseline to spot trends or even potential problems before they get out of hand. This easy solution takes only a few lines of code.
The first thing you need to do is to capture the information at periodic intervals. The information returned from the DMV is a snapshot in time, with values that are cumulative since the most recent SQL Server restart. (There's currently no way to reset these counters short of the restart, but such a feature would be useful.) The DMV materializes the data as a resultset from an in-memory structure on demand, and the end user can't directly change that data. Because the data is a snapshot in time, you simply need to capture the data at periodic intervals and store it in a table. You can then perform a delta of any two snapshots and see how much physical I/O occurred and how much time passed between those two points in time. To that end, I've created a relatively simple stored procedure, which Listing 1 shows, to capture the values and store them in a table residing in the tempdb database. If you'd rather use another database, you can modify the code accordingly.
If you want to clear the contents of the table, you can call the stored procedure with an optional parameter that has a value of 1. Just remember that the counters are reset with each SQL Server restart. So, if you have this table in a place other than tempdb, you might want to clear it right after each restart. Otherwise, to get proper deltas, you'll have to add additional logic to the reporting stored procedure, which I'll share in a moment. How often you collect the data is up to you, but I tend to collect it every 10 minutes, which gives me the flexibility to report down to a 10-minute time span. You can create a basic SQL Agent job that executes this stored procedure at the frequency with which you want to populate the table with the snapshots.
Now, you need some way to view the data you're collecting. That's where the reporting stored procedure, which Listing 2 shows, comes into play. When this stored procedure runs with no parameters, it determines the first snapshot and the last snapshot taken, then calculates the difference or delta between them. It also performs certain calculations to display the counters in a format that's easier on the eye and to break the results down to megabytes instead of just bytes. The stored procedure then joins the results on some system tables to display the actual names of the files and databases in the report. An optional parameter returns only the files for a supplied database ID. You can easily enhance this code to allow for any calculations you want to report on or to specify a specific date/time range. You might even want to perform hourly deltas to see what times of day you experience the most physical I/O. And don't be surprised if that time turns out to be in the middle of the night, when no one is around: That's when your backups are probably occurring!
Gather It and Use It
You have an extremely useful tool at your disposable in SQL Server 2005. By spending a few hours getting to know this DMV, along with a few lines of custom code, and implementing it into your environment, you'll save a lot of time and aggravation down the road. You'll have a head start on creating a baseline for your system, especially when it comes to the I/O aspect. But this DMV can also come in handy for troubleshooting performance problems as they appear. If you aren't regularly capturing this information, you're probably missing some crucial snapshots that could help you in a pinch. The overhead of capturing this data every 10 minutes—even on a busy system—is negligible, so there's little excuse for not collecting it. Just remember that the data is no good unless you do something with it!