Question: I’m trying to narrow down an I/O subsystem bottleneck and I’ve figured out that the bottleneck occurs when checkpoint is happening. How can I figure out which database is causing the checkpoint activity?
Answer: For the readers, checkpoints obviously happen every so often for all databases and are usually the main source of write activity for data files (the other one being the lazywriter background process in the buffer pool).
You can monitor checkpoint I/O activity using Performance Monitor by looking at the “Checkpoint pages/sec” counter in the SQL Server:Buffer Manager object – but as the question states, you can’t tell from that which database is the one that’s doing the checkpoint.
You might be wondering why anyone would care which database is doing the checkpoint – surely you just need to fix the I/O subsystem? No. If a database is suddenly driving a lot more checkpoint I/O, it could be a sign that someone has written some code that is erroneously updating more data than necessary. It could also be a sign that some of the tables in the database have a large amount of fragmentation, meaning the data density is very low (i.e. the page density of the indexes) and so there are a lot more pages being updated than usual.
And what if you can’t do anything about the I/O subsystem, and it’s not bad code or data density? Then you’ll need to do more frequent, manual checkpoints to spread the I/O load over time rather than in periodic bursts that overload the I/O subsystem.
The easiest way to see which database is doing a checkpoint is to use trace flags so that the checkpoint process writes to the error log what it’s doing. Turning on trace flags 3605 and 3502 will let you see which database is doing a checkpoint. Use the command:
The “-1” is needed to make sure that all threads pick up the trace flag. The 3502 turns on checkpoint information and the 3605 lets the info go to the error log.
With these flags on, you’ll see entries in the error log as below:
2011-05-05 18:25:15.880 spid18s Ckpt dbid 17 started (8)
2011-05-05 18:25:15.880 spid18s About to log Checkpoint begin.
2011-05-05 18:25:15.880 spid18s Ckpt dbid 17 phase 1 ended (8)
2011-05-05 18:25:18.830 spid18s About to log Checkpoint end.
2011-05-05 18:25:18.840 spid18s Ckpt dbid 17 complete
The “s” on the end of the spid number shows that this is an automatic checkpoint. If someone manually issues a CHECKPOINT command, it will be logged in the same way but the spid will not have the “s”.
If you want more details about what’s happening during the checkpoint, you can use trace flag 3504 as well. It will expand the output as below:
2011-05-05 18:32:36.970 spid18s Ckpt dbid 17 started (8)
2011-05-05 18:32:36.970 spid18s About to log Checkpoint begin.
2011-05-05 18:32:36.970 spid18s Ckpt dbid 17 phase 1 ended (8)
2011-05-05 18:32:37.100 spid18s FlushCache: cleaned up 1149 bufs with 55 writes in 131 ms (avoided 40 new dirty bufs)
2011-05-05 18:32:37.100 spid18s average throughput: 68.52 MB/sec, I/O saturation: 51, context switches 100
2011-05-05 18:32:37.100 spid18s last target outstanding: 2, avgWriteLatency 2
2011-05-05 18:32:37.100 spid18s About to log Checkpoint end.
2011-05-05 18:32:37.100 spid18s Ckpt dbid 17 complete