Three tips for getting and analyzing I/O information
I'm always looking for ways to measure SQL Server's performance and present the information in a readable and useful format. Here are some useful tips I've come across for getting and analyzing I/O information:
- Use sys.dm_io_virtual_file_stats.
- Use SQL Server Performance Monitor to help identify I/O bottlenecks.
- Understand the infamous error message.
A common way to measure SQL Server performance is to examine I/O statistics by using dynamic management functions (DMFs) and dynamic management views (DMVs), which were introduced in SQL Server 2005. These dynamic management objects let you obtain many different types of statistics about a system's state. With this information, you can monitor your system, tune its performance, and troubleshoot any problems that arise.
A particularly useful I/O-related DMF is sys.dm_io_virtual_file_stats, which returns I/O statistics for database files. This DMF's num_of_bytes_read and num_of_bytes_written columns let you easily calculate total I/O. In addition to calculating total I/O, you can use common table expressions (CTEs) to determine the percentage of I/O usage, which tells you where most of the I/O is occurring.
In the web-exclusive article "Querying DMFs to Analyze Performance Stats", Itzik Ben-Gan provides a helpful tutorial on how to use sys.dm_io_virtual_file_stats. He provides several scripts, including the one that Listing 1 shows. This script uses a CTE to return the I/O usage for each database across all drives, regardless of the disk layout of the particular database.
Listing 1: Itzik Ben-Gan's Code That Calculates the Percentage of I/O for Each Database
Determining I/O usage is useful in many scenarios. For example, if you have the opportunity to move files on one physical array to other physical arrays, you can determine which files have the highest I/O so that you can put them on different arrays.
However, I/O usage doesn't help much in determining whether or not a performance problem exists. For instance, suppose you determine the I/O usage of the files on a server. You notice that one file has an I/O usage of 90 percent. Sounds like a problem, right? Not necessarily. If a file is using 90 percent of the I/O but there's no waiting for reads or writes, you should be OK. The more users wait, the more performance is potentially affected. So, in this case, you also need to look at statistics that tell you how long users have to wait for reads and writes to occur. To do so, you can use the io_stall_read_ms and io_stall_write_ms columns in sys.dm_io_virtual_file_stats. These columns can tell you the total time that users waited for reads and writes to occur for a given file.
In the script IO_Percent_by_DriveLetter.sql (Listing 2), I took advantage of the io_stall_read_ms and io_stall_write_ms columns.
Listing 2: IO_Percent_by_DriveLetter.sql
You can use this script to determine which disk is producing the most I\O and which database that disk is related to. This information can be helpful in determining which database file should be moved to another disk. Figure 1 shows sample results. (You can download both scripts by clicking the 103396.zip link near the top of the page.)
Figure 1: Sample results from IO_Percent_by_DriveLetter.sql
Another way to look at I/O waiting is use the io_stall column in sys.dm_io_virtual_file_stats. This column can tell you the total time that users waited for I/O on a given file. In his blog "SQL Server 2005 Emergency Diagnostic and Performance Queries Part 1," Glenn Berry uses the io_stall column analyze a database's I/O usage by the percentage of I/O stalls. You can find this code, which is part of a script, by going to glennberrysqlperformance.spaces.live.com/blog/cns!45041418ECCAA960!893.entry and looking for the comment
-- Analyze DB IO, ranked by IO Stall %
in the script.
As I mentioned previously, Microsoft introduced DMFs and DMVs in SQL Server 2005. If you're using SQL Server 2000, you can check out the article "Examining SQL Server's I/O Statistics" for alternative methods to get and analyze I/O statistics. Even if you're running SQL Server 2005 or later, you might want to read it. It's a good primer on I/O statistics. Plus, the scripts provided will work on SQL Server 2005. (They'll likely work on as well, but I haven't tested them on that version.)
You can use SQL Server Performance Monitor to help identify I/O bottlenecks. There are two I/O-related counters you can check:
- PhysicalDisk(_Total)\Avg Disk Queue Length. This counter, which measures subsystem I/O, is one of the main counters in Performance Monitor. Avg Disk Queue Length is an estimate of the requests on the physical or logical disk that are either in service or waiting for service. As a general rule for hard disks, an Avg Disk Queue Length greater than 2 (per hard disk) for extended periods of time is considered undesirable. If you have a RAID system with eight disks, you don't want an Avg Disk Queue Length greater than 16 (16/8=2). Faster hard disks with quicker access times (and therefore quicker I/O) will allow greater flexibility with these numbers.
- PhysicalDisk(_Total)\Avg Disk sec/Transfer. This counter reflects how much time a disk takes to fulfill requests. A high value might indicate that the disk controller is continually retrying the disk because of failures. These misses increase average disk transfer time. For most disks, high average disk transfer times correspond to values greater than 0.3 seconds, or if the value is more than 15 to 30 milliseconds, an I/O problem exists.
If you'd like more information about Performance Monitor's two I/O-related counters, read "Tips for Using Performance Monitor I/O Counters".
Have you ever encountered the following I/O-related error message in the SQL Error Log and wondered what it was all about?
SQL Server has encountered n occurrence(s) of I/O requests taking longer than 15 seconds to complete on file
This error message was added to SQL Server 2005 SP2 to help identify issues affecting SQL Server's performance. It could indicate a poor disk subsystem, a misconfiguration, or a problem with the host bus adapter (HBA) driver or hardware connecting the server and SAN. It could also indicate fragmentation, excessive I/O requests not being handled properly by the disk, or data files not optimally placed on the disk.
"I/O Requests Taking Longer Than 15 Seconds To Complete" is an excellent article that delves into the origins and causes of this infamous error message. It's definitely worth reading.
The reason why it's so important to understand this error message, especially for DBAs, is that it's often misunderstood by those who are not DBAs. When this error message occurs, system and SAN administrators are often quick to blame the database server, when in actuality, SQL Server is simply being helpful in identifying a potential disk or hardware problem. And managers are more apt to viscerally accept a system or SAN administrator's analysis over that of a DBA. So, the more you understand about this error message, the better you'll be able to lead the system and SAN folks to the real underlying problem.
This happened recently at a client of mine. It took some serious caucusing and long meeting hours to prove that it was in fact a disk problem. I identified the cause right away, but the skeptics needed more convincing and more disk and SAN diagnostic tests. Because the skeptics were fiddling while Rome was burning, several clusters crashed at regular intervals. As it turns out, the subsystem couldn't keep up with the I/O requests generated from SQL Server because an HBA driver was improperly configured on the SAN. The SAN firmware had been upgraded but not the HBA drivers, which caused poor I/O throughput, The SAN firmware and the HBA drivers should have been upgraded simultaneously.
Sing a Happier Tune
I hope these three tips will prove helpful the next time your server is slow and you suspect it's I/O related. The sooner you find and fix the I/O problem, the sooner you'll be singing a happier tune.