One thing software developers often forget after we write brilliant code is that someone will need to actually operate the applications we create. No matter how spectacular the code, we can’t anticipate everything that developers and users might change in an application.
Related: SSIS Logging and Data Auditing
Issues concerning system changes, data changes, and errors are particularly acute in any extraction, transformation, and loading (ETL) work. ETL code, by definition, must bridge two or more systems. Those systems change over time. Even without new development, upgrades, and patches, changes can cause problems. In addition, the user community continually adds data to source systems. Data that’s allowed on the source system could violate a business rule set on the target system, causing your elegantly designed ETL code to throw up like a frat boy on his 21st birthday.
There are two laws of ETL development that are as inviolate as the laws of thermodynamics. They are:
- Source systems change constantly. No one involved in the changes will ever tell the ETL developers that the change is coming until after the change has been made and your extractions fail. This is usually accompanied by someone with a "C" in their title (as in CIO) getting upset over data marts not having data and indiscriminately lopping off various appendages.
- Users continually mess up data input. They do it in new and varied ways that boggle the mind and make no logical sense. You’ll need to deal with all of these issues while management screams at you that they need their reports. You’ll do this until someone takes pity on you and you, blessedly, retire a shattered and broken human being.
Therefore, we need to design systems that can recover from these errors. Those systems must also provide sufficient information to the appropriate people so that they can address these issues on an ongoing, operational basis. This raises the questions of what should be logged, where it should be logged, and how it should be logged. Broadly speaking, there are three main areas for logging:
- Things that go bump in the night
- WTF (as in "Where’s this From?" …. Get your minds out of the gutter)
- Event driven logging, or Audit logs
Let’s take a look at these three logging areas in a bit more depth.
Things That Go Bump in the Night
Things can go bump at any time, but the typical ETL job runs late in the evening or during the early morning hours. That’s when data is extracted from a transactional system and loaded into a data mart or data warehouse for analysis, or when different systems are synchronized during a maintenance window. I’m referring specifically to unrecoverable errors. As good developers, we need to provide the operations staff with enough information for them to find and fix the problem. As self-centered megalomaniacs (we leave out megalomaniacs who aren’t self-centered), we need to provide this data so that we aren’t getting phone calls in the early morning hours or on the weekend asking us to come in and fix a broken system.
Related: SSIS For DBAs
These problems should be (but often aren’t) rare. For example, it’s difficult to imagine how one can programmatically recover from someone kicking the power cord out of the source system’s server. We simply need to build our code with a nod to the fact that it runs on a machine and machines occasionally break.
When these events occur, the error needs to be raised in such a way that it is visible to the folks who can begin to address the problem. So the first thing required here is an interface for the operations staff to view the status of each job as it runs. Let me say the following in big bold letters, so I'm very clear—NEVER USE EMAIL TO REPORT ERRORS!
Why? First, email lists for notification purposes are rarely, if ever, maintained or kept up-to-date. Second, everyone, without exception, puts rules on their email accounts to route such messages to some folder. No matter what this folder is named, the name means "Ignore Until Management Calls." The result is that the error is ignored until you have annoyed users. Third, bad things come in threes. If the first bad thing is that the application died, the second bad thing is that the person who gets the email about it is unavailable (e.g., out sick, on vacation, in some remote location with no cell phone coverage). The third bad thing will be the 2 AM call you get to come in and fix the problem.
Fourth, many times these email messages are sent to groups. It’s an ironclad law of corporate behavior that when everyone is responsible, no one is responsible. So nothing will happen until someone with a "C" in their title (as in CIO) gets a call asking why there’s no data. This isn’t where you want to be.
Ideally, there will be a simple interface with an easy-to-understand "red light, green light" view of these errors. That application should direct the operations staff to detailed messages that will help them. Unfortunately, we typically get rather cryptic error messages back from our applications. Often, we have already stumbled across these problems during the development cycle or while handing off to operations, and we should make it a practice to decipher these messages for the operations staff.
For example, I recently completed a project that used a web service as a source for a data mart. Towards the end of development, I would often get the following exception in my code: An unknown error has occurred. Please contact the application administrator for further information. This is hardly the most informative error message ever created. However, we noticed that invariably this error would arise because the web service server was down. It seemed to me to be a matter of common courtesy, if nothing else, to pass this knowledge to the poor soul who would be getting it at 2 AM, when this data load was scheduled to run. In this case, some simple code wrote the additional details out to a text file, and the path to the text file was then put in the message logged in the Windows Event logs, as shown in Listing 1 below.
It should also be noted that in this application the inner exception messages contain a good deal of information that could help solve the problem. This is a simple method call that can be abstracted to a class file and called in whatever way you typically call a method. In our case, for example, we simply called the method from the catch portion of a try/catch block in a VB script. You can also incorporate this method into a CLR stored procedure and execute it from a try/catch block in a SQL statement, or you might call it from the package's OnError event handler.
Where you raise such errors will depend on where the monitoring system looks for the application status. I have yet to find a monitoring tool that didn’t look at the Windows Event logs for errors, so logging errors to the event logs would be a good place to start. Again, the code to do so is quite simple using the EventLog class as Listing 2 (below) shows.
In this case, we simply note that an error has occurred, raise it to the Application event log, and direct the user to a text file where they can find the details they need to diagnose the problem. In short, this need not be complex, but it does need to provide the operations staff with the data they need to do their jobs. The following are three best practices for providing operations staff with the necessary information:
- Operational staff needs a simple UI to monitor ETL status (NOT EMAIL!).
- Provide error details somewhere. It’s easy to simply write these details to a text file and provide the path to this file via the Windows event logs or the monitoring tool.
- Have some courtesy. If you can provide more details than the exception thrown by the application, do so.
WTF (Where's This From?) Logging
Every ETL project invariably comes across data in an unexpected format or data that is otherwise a knuckleball. For example, you're expecting integers and get "two" instead of "2." None of these problems should rise to the level where the application fails. Instead, they represent business-rule or data quality issues that should be reported back to users so that the inconsistencies can be addressed. Data from source systems can, for example, violate business rules or constraints on the target system. Every database professional has, at some point in their career, come across a table created with poor design or coding practices, with the result that critical data is held in nullable fields. If a more carefully designed BI system attempts to enforce the NOT NULL business rule, an insert or update error will inevitably arise when someone attempts to insert a null value carried erroneously on the source system. This error should not cause the entire ETL process to fail. Instead, the single row causing the error should be caught and the error handled gracefully through the WTF logging process.
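To make the idea concrete, here is a minimal, hypothetical C# sketch of this kind of row-level check: each incoming row is validated, and a failing row returns a reason for the WTF log instead of bringing down the whole load. The field names (WellId, Volume) and the rules themselves are illustrative assumptions, not taken from any real source system.

```csharp
using System;
using System.Collections.Generic;

// Sketch of row-level "WTF" validation: return null for a clean row, or a
// reason string describing the failure so it can be logged with the row.
public static class RowValidator
{
    public static string Validate(IDictionary<string, string> row)
    {
        // Business rule: WellId is required (the target column is NOT NULL).
        if (!row.ContainsKey("WellId") || string.IsNullOrEmpty(row["WellId"]))
            return "WellId is missing; target column is NOT NULL";

        // Data-type rule: Volume must be an integer ("two" is not "2").
        string rawVolume;
        if (!row.TryGetValue("Volume", out rawVolume))
            return "Volume is missing";
        int volume;
        if (!int.TryParse(rawVolume, out volume))
            return "Volume '" + rawVolume + "' is not a valid integer";

        // Sanity rule: a cumulative volume can never be negative.
        if (volume < 0)
            return "Volume " + volume + " is negative";

        return null; // row is clean
    }
}
```

The caller routes any row with a non-null reason to the reject log and continues loading the rest of the batch.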
Data from the source system can violate rules of common sense or industry standards, and a bug in the source system can produce illogical data. For example, in one application, the cumulative production of an oil well would reset to 0 from time to time on the source system because of a source system bug. So, when calculating the average production of a well, we tried to use a simple formula: average daily production = (cumulative volume at the end of the period − cumulative volume at the start of the period) / number of days in the period.
With a cumulative volume reset to zero, we would get a negative number as a result—implying the company was putting oil back in the ground. At the price at the time, about $140/barrel, this seemed unlikely.
These examples illustrate different types of data quality issues, each of which might be handled in different ways using SQL Server Integration Services (SSIS). Those issues can broadly be classified as the following:
- Data from the source system fails based on some constraint or business rule on the target system.
- The source system consistently violates either an industry standard or a corporate standard.
- A bug in the source system produces unexpected or illogical data.
There are several ways to handle these issues as they arise. The first, and my preference, is to work with the folks running the source systems to resolve the problems. This usually involves a dope slap, accompanied by some variation of "@#$, what are you thinking?!" Unfortunately, this method, while effective, violates most HR policies. Your fallback should be more diplomatic negotiations around field mappings, particularly where you want to keep your data mappings within some reasonable semblance of logic and industry standards.
The next thing that should be done is to enforce business rules and logical rules on the target database itself. Although doing so is a standard best practice for database design, it’s remarkable how often developers fail to follow this practice. SSIS provides easy mechanisms for logging violations of these rules or constraints. For example, consider the case where the cumulative volume of oil pumped from a well is reset to zero. A simple check constraint on the target field enforces the rule that the cumulative volume for today must be greater than or equal to the cumulative volume for yesterday. Should this test fail, the insert of today’s data will fail. SSIS can then catch and log this failure for this one row of data, while continuing to load data that complies with the business rule. Figure 1 shows an extremely simple example of how this can be handled.
Data is extracted from an OLE DB source to be loaded directly to an OLE DB target. Should an insert fail, the data will be written to a comma delimited file, along with the error code and the error column.
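Outside of SSIS, the same reject-file pattern can be sketched in a few lines of C#: append the failing row to a comma-delimited file together with the error code and the name of the offending column. The function names are illustrative, and the use of error code 547 in the usage note below (SQL Server's constraint-violation error number) is simply an example.

```csharp
using System;
using System.IO;

// Sketch of a reject log: failed rows are appended to a comma-delimited
// file along with the error code and the column that caused the failure,
// so a user can later trace the row back to its source.
public static class RejectLog
{
    public static string FormatRejectLine(string[] rowValues, int errorCode, string errorColumn)
    {
        // Keep the original values intact so the offending data is traceable.
        return string.Join(",", rowValues) + "," + errorCode + "," + errorColumn;
    }

    public static void Append(string rejectPath, string[] rowValues, int errorCode, string errorColumn)
    {
        File.AppendAllText(rejectPath,
            FormatRejectLine(rowValues, errorCode, errorColumn) + Environment.NewLine);
    }
}
```

For example, a row ("W1", "two") that fails with error 547 on the Volume column would be written out as the line `W1,two,547,Volume`.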
Most frequently, the business process breaks down at this point. You have this data in some type of log, and then what? The best practice is to provide an easy UI that will let a business user address these issues. At a minimum, this UI would provide
- Identifying information that will allow a user to track the offending data to its source.
- A clear reason that the insert or update failed. This will require that you translate the error code. You can view the SQL Server 2008 error codes at msdn.microsoft.com/en-us/library/ms345164.aspx, and the SQL Server 2005 error codes at msdn.microsoft.com/en-us/library/ms345164(SQL.90).aspx. Adding an error description requires a single line of script, as described in the SSIS help topic, Enhancing an Error Output with the Script Component.
- Additional information that will help the user identify the error. For example, simply telling the user "The data value violated the schema constraint" isn’t helpful. You must tell the user why the constraint was violated so that they can do their job to fix the issue.
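Inside an SSIS script component, the usual one-liner assigns ComponentMetaData.GetErrorDescription(Row.ErrorCode) to an output column. As a simplified, stand-alone illustration of that translation step, a lookup table can map raw codes to text a business user can act on. The two SQL Server error numbers below (515 for NULL violations, 547 for constraint violations) are real, but the friendly wording is my own and would be tailored to your loads.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch: translate raw error codes into descriptions a
// business user can act on. Inside SSIS you would normally call
// ComponentMetaData.GetErrorDescription(Row.ErrorCode) instead.
public static class ErrorTranslator
{
    private static readonly Dictionary<int, string> Descriptions = new Dictionary<int, string>
    {
        // Example SQL Server error numbers; extend as your loads require.
        { 515, "A NULL was supplied for a column that does not allow NULLs" },
        { 547, "The value violated a constraint (e.g., cumulative volume went backwards)" }
    };

    public static string Describe(int errorCode)
    {
        string text;
        return Descriptions.TryGetValue(errorCode, out text)
            ? text
            : "Unrecognized error code " + errorCode + "; see the SQL Server error code reference";
    }
}
```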
Audit- and Event-Driven Logging
There will be times when the business requires auditing of data transfers. Most of these requirements can be met with the event-driven logging internal to SSIS. The basic logging screen is relatively simple: select which events to log, as Figure 2 shows.
On the advanced screen, you have more control over what gets logged, as shown in Figure 3.
You should be cautious with SSIS logging. It carries a large I/O overhead and can slow application performance dramatically. Unless absolutely required, I would limit SSIS logging to the OnError and OnTaskFailed events, and limit the logged details to Computer, SourceID, and MessageText for production purposes. For day-to-day operations, this will provide enough information to troubleshoot failures and raise the errors to a level at which they can be seen and handled.
If you begin to see more routine or intermittent errors, or if you're not getting sufficient data to properly diagnose an issue, it's best to add additional events to the log slowly. First, you should avoid simply overloading your own diagnostic efforts with too much data; eventually, someone will need to plow through all of it to ascertain what's causing a problem. Second, the logging itself introduces overhead, in some cases substantial overhead, to the SSIS process, and this additional overhead could cause additional problems to arise. For errors and bugs, I would add the OnWarning and Diagnostic events first. For other issues, such as slow-running loads, you might want to look at the OnProgress, OnPreExecute, and OnPostExecute events as well. Other events can, of course, be used as your environment warrants.
How you should log will vary based on your specific situation. Here, I have limited myself to the simplest text-file logging. You can also log to any OLE DB–compliant destination you choose by selecting the correct tool from the SSIS toolbox, or by creating a custom script or class. You're limited only by your imagination.
Listing 1: Writing Additional Details to a Text File to Be Added to Windows Event Logs
// irrelevant code removed
using (StreamWriter ErrorWrite = new StreamWriter(logPath.ToString()))
{
    ErrorWrite.WriteLine("There has been an error in the load of " + LoadName);
    string errMess = ex.Message;
    if (ex.InnerException == null)
    {
        if (errMess.Contains("An unknown error has occurred. Please contact the application administrator for further information."))
        {
            ErrorWrite.WriteLine("This error typically appears when there is a failure with a call to a web service. Please check the appropriate WCF service for errors.");
        }
    }
    else
    {
        // The inner exceptions often hold the most useful diagnostic detail.
        ErrorWrite.WriteLine("Inner exception: " + ex.InnerException.Message);
        if (ex.InnerException.InnerException != null)
        {
            ErrorWrite.WriteLine("Inner exception: " + ex.InnerException.InnerException.Message);
        }
    }
}
// irrelevant code removed
Listing 2: Using the EventLog Class to Log Errors to Event Logs
StringBuilder PleaseSee = new StringBuilder();
PleaseSee.Append("Please see ");
PleaseSee.Append(logPath.ToString()); // path to the text file holding the error details
PleaseSee.Append(" for further details");
System.Diagnostics.EventLog.WriteEntry(LoadName, PleaseSee.ToString(), System.Diagnostics.EventLogEntryType.Error);