Use event tracing to find the cause of a crash

When Microsoft introduced SQL Server 7.0, one of my favorite features was the SQL Server Profiler. I can say the same thing about SQL Server 2000. The differences between the SQL Server 7.0 version of the Profiler and the SQL Server 2000 version might not appear as great as those between SQL Trace in SQL Server 6.5 and SQL Server Profiler in SQL Server 7.0 because the UI still looks familiar. But Microsoft completely redesigned Profiler in SQL Server 2000 for improved performance, functionality, and scalability. I can't describe all of Profiler in this article, or even tell you about all the changes. Instead, I focus on one Profiler feature that has changed dramatically: the Blackbox.

The Blackbox contains a record of the most recent statements executed against the SQL Server system so that in case of a severe crash or other unexplained server behavior, you have a record of what queries ran recently and therefore a clue to what went wrong. In the sidebar "SQL Server's Black Box," April 2000, Itzik Ben-Gan described the creation and use of the blackbox.trc file in SQL Server 7.0. Although the Blackbox recording feature is separate from the tracing mechanism, the two use many of the same internal mechanisms. And, after a server crash, you can use Profiler to view the file that stores the Blackbox information, just as you can view any other trace file. Let's look at how to enable Blackbox recording in SQL Server 2000, using nearly the same mechanism that you use to set up any other server-side trace.

Gathering Information

SQL Server provides a special utility—the sqldiag command—to help you gather the information you need for troubleshooting SQL Server problems. You can find the utility in the same binn directory as all your other SQL Server executables; by default, the directory is \Program Files\Microsoft SQL Server\MSSQL\Binn. When you execute sqldiag from a command prompt while SQL Server is running, this utility gathers a vast amount of information about your OS, your hardware configuration, and your SQL Server configuration. Sqldiag writes the output to a file called sqldiag.txt (the default location is the \log folder of your SQL Server 2000 installation directory). If you enabled Blackbox recording, then executing sqldiag copies the Blackbox file to the same directory the sqldiag.txt file is in.

SQL Server 2000's sqldiag utility lets you specify several parameters to control certain aspects of the information gathering. The full list of parameters is in SQL Server 2000 Books Online (BOL), but the parameters relevant to this discussion are -X and -O output_file.

The -X parameter tells sqldiag not to include the error logs in the sqldiag.txt file. By default, the error log that SQL Server records restarts only when SQL Server starts, and the error log keeps growing until the next time you stop and restart the SQL Server service. By default, SQL Server maintains a history of the previous six error logs, so when you start the SQL Server service, the previous error log is copied into errorlog.1, the previous errorlog.1 is copied to errorlog.2, and so on. If you're gathering information to send to a support provider, then including the contents of several previous error logs can be a good thing. But for my own troubleshooting, I find that capturing the contents of all the error logs makes the sqldiag.txt file unwieldy. I rarely call sqldiag without the -X option.

The -O output_file parameter lets you specify a name other than sqldiag.txt for the sqldiag output. If you specify the -O option, sqldiag renames the blackbox.trc and blackbox_1.trc files based on the output_file name you use. For example, if you specify the output_file as diagreport.txt, sqldiag renames the trace files diagreport.trc and diagreport_1.trc, respectively.

Even if SQL Server isn't running when you execute sqldiag, the Blackbox trace file (if you enabled it) is still available. This availability can be useful if SQL Server is down because of an error; the Blackbox trace file will contain a record of which queries were running immediately before the failure. Running sqldiag before you restart your server can give you a clue to why the server stopped running.

Enabling the Blackbox

To start a trace from the server, you use the sp_trace_create procedure. This procedure usually takes up to five parameters, but when you're creating a trace for enabling Blackbox recording, you use only the first two. (A full discussion of creating server-side traces is beyond the scope of this article, but I'll discuss tracing again in future issues.)

The first parameter is an output parameter that returns a trace ID number if the Blackbox trace starts successfully. You need to save that trace ID value if you want to examine or manipulate the trace. The second parameter lets you specify one of several options to identify the trace type, but for the Blackbox, this parameter's value must be 8, which tells SQL Server to save the trace's output to a Blackbox file.

After you create a trace, you start it by executing the sp_trace_setstatus procedure, using a status value of 1. Starting the trace automatically captures a trace file called blackbox.trc in the default \data directory of your SQL Server installation. You can use a table-valued function called fn_trace_getinfo to look at the trace properties, including the exact name of the file. (A table-valued function is one that returns a rowset, so that you can invoke the function in the FROM clause of a SQL statement.) The following example code creates the Blackbox trace:

DECLARE @rc int, @TraceID int
EXEC @rc = sp_trace_create @TraceID output, 8
SELECT TraceID = @traceID, error = @rc
EXEC sp_trace_setstatus @traceID, 1
SELECT * FROM ::fn_trace_getinfo(@TraceID)

Figure 1 shows the trace information that the fn_trace_getinfo function returns.

Monitoring the Blackbox Trace

When you first enable Blackbox tracing, you'll see the blackbox.trc file in the \data directory, but the file size might appear to stay at 0 initially. In fact, SQL Server Profiler probably won't report a positive size for the blackbox.trc file until SQL Server has executed enough batches to increase the file size to at least 128KB.

SQL Server captures the trace data to an internal location and writes the file to the disk in 128KB chunks. In addition, sqldiag captures only the data that SQL Server has written to the blackbox.trc file, and that file might not include the most recent queries. By default, any query batches submitted after SQL Server writes the last 128KB chunk aren't available in the blackbox.trc file.

When SQL Server stops, either because of an error or because someone manually stops it, SQL Server writes any remaining captured trace data to the blackbox.trc file from the internal location. If you want all the trace data to be available without having to stop the SQL Server service, you must stop and close the Blackbox trace. You can do this with two calls to the sp_trace_setstatus procedure.

To stop and close the trace that has the trace ID of 1 (the trace I showed you in the previous section), you can execute the following statements:

EXEC sp_trace_setstatus 1, 0
EXEC sp_trace_setstatus 1, 2

Setting the status to 0 stops the trace; setting the status back to 1 can restart it. You can set the status to 2 to close the trace, but you must already have stopped it. You can't restart a closed trace; you must completely recreate it. When you've closed the trace, the complete blackbox.trc file is available. Subsequently running sqldiag copies the complete blackbox.trc file to the \log directory as sqldiag.trc. (BOL incorrectly states that sqldiag captures only the last 100 queries; that number is accurate for SQL Server 7.0, but not for SQL Server 2000.)

The Blackbox trace is automatically configured for a maximum file size of 5MB and will roll over and start a new file called blackbox_1.trc (which BOL incorrectly reports as blackbox_01.trc) when the file exceeds the initial 5MB. If this second trace file exceeds 5MB, the trace reverts to writing to the original blackbox.trc file. You always have the last 5MB of trace information available—and up to 10MB if you have a rollover. Because the Blackbox is intended to save only the most recent queries, 10MB is usually sufficient. If you want to save more than 10MB of queries, consider running a trace independent of the Blackbox.

Automatic Tracing

Because you never know when you might have a serious error and need the contents of the Blackbox, you can enable Blackbox tracing to start automatically when SQL Server starts. To do this, you need to create the trace within a stored procedure and use the sp_procoption procedure to mark the trace for autostart, as the code in Listing 1 shows.

Be aware that having a constantly running trace might negatively affect your server's performance. I don't have any numbers yet to quantify this impact, which is far smaller than in SQL Server 7.0, primarily because of the redesign of the tracing mechanism. The only events the Blackbox trace captures are the query batches sent from clients to SQL Server, and capturing batches is one of the least expensive events to trace. Nonetheless, tracing isn't free in terms of resource usage, and if you're trying to tune your SQL Server for maximum performance, you need to consider this added cost.

This article isn't a complete discussion of server-side tracing in SQL Server 2000, but it does alert you to the Blackbox's new behavior and capabilities to record SQL Server's most recent queries. The Blackbox tracing feature is incredibly useful for troubleshooting, but SQL Server 2000 BOL's discussion about the feature is seriously limited. The information I've presented can fill in the gaps and give you the information you need to take advantage of this tool.