The process for designing SQL Server Integration Services (SSIS) packages is typically iterative. You start by getting the components working individually or in small sets, then concentrate on ensuring that the components will work in the correct sequence. During later iterations, you add in more components or adjust properties to perform error handling. Then, in a final pass, you might add in abstractions, taking advantage of variables and expressions to enable runtime changes for your package. But your work is not yet complete at this stage.

Before you put your package into production, you need to take some more time to review your package with an eye toward preventing, or at least mitigating, performance problems. Sooner or later, the business environment is likely to change, requiring you to perform the same work within a shorter time span or to process higher data volumes than originally planned. I’ll describe areas within a package that can cause performance problems and offer suggestions for modifying the package design to get better performance.

Bear in mind that there are various factors that can affect the performance of SSIS packages. Several of these factors are completely external to SSIS, such as the source systems that you’re using for data extractions, disk speed and configuration, NICs, network bandwidth, and the amount of memory available on the server executing the packages. I won’t address how to resolve bottlenecks caused by external factors, but rather concentrate on specific strategies you can take within the design of your packages. I assume that you already have a general understanding of SSIS architecture and package development practices.

Understanding Control Flow Performance

Every SSIS package has at least one task in the control flow. If you add multiple tasks to the control flow, you can direct the sequence of tasks by adding precedence constraints to connect a series of tasks from beginning to end. You can even group these tasks together in containers. Apart from the Data Flow Task, which I’ll describe in more detail later, the performance of each individual task depends on the external systems with which the task must communicate to get its job done. Therefore, the only way within the package to speed up control flow processing is to have tasks or containers (collectively called executables) run in parallel. Figure 1 illustrates parallelism in Container A and sequential workflow in Container B. In this example, the containers themselves also execute in parallel.

Figure 1: Running executables in parallel to speed up control flow processing
Figure 1: Running executables in parallel to speed up control flow processing

The package property that determines how many executables can run in parallel is MaxConcurrentExecutables, as shown in Figure 2. The default is -1, which means that the control flow engine will use the number of logical processors on the server plus 2. For example, if you’re executing the package with the default setting on a quad-core server, you can have up to six executables running in parallel.

Figure 2: Using the MaxConcurrentExecutables property to configure how many executables can run in parallel
Figure 2: Using the MaxConcurrentExecutables property to configure how many executables can run in parallel

In some cases, you won’t see much change if you try to increase the MaxConcurrentExecutables value. However, one scenario with potential for improvement is when you have a package with tasks that must wait for a response from external systems and your server is dedicated to executing packages. In that case, you might be able to boost parallelism by replacing the MaxConcurrentExecutables value with a higher number. Start by increasing the value to the number of processors plus 3, then test the package in Business Intelligence Development Studio (BIDS) to see whether the number of tasks executing in parallel increased. If so, continue to increment the property value by 1, test, and repeat until no further parallelization is possible, or you have no more tasks to run in parallel.

Understanding Data Flow Performance

The Data Flow Task is used to retrieve data from one or more sources, optionally change the structure or content of the data in transit, and send the data to one or more destinations. In business intelligence (BI) applications, this task is used to perform extraction, transformation, and loading (ETL) operations.

SSIS uses a pipeline engine with an in-memory buffer architecture to efficiently manage Data Flow Task operations. Performance of the pipeline engine largely depends on the number of records moving through the pipeline and the number of buffers required to accomplish the transformations and to move the data to the destinations. Therefore, to optimize performance, you need to understand how data flow components and the Data Flow Task properties affect pipeline throughput and buffer requirements.

Finding Bottlenecks and Establishing Baselines

Before you start making changes to the data flow, you should establish some baselines so that you can identify the slower components in the data flow and better assess the impact of any subsequent changes. Let’s assume that your package has a Data Flow Task with a single source, one transformation, and one destination. You should execute the Data Flow Task multiple times to calculate an average for various conditions, which you record in a table. Here are the steps:

  1. Create a table with the column and row headings shown in Table 1.
  2. Execute the package a total of 10 times, adding the times to column A. This will determine the time required to execute the package when all three components are intact.
  3. Replace the destination component with a Multicast transformation, which allows the task to complete successfully without adding any noticeable overhead to the task execution time. Execute the package a total of 10 times, adding the times to column B.
  4. Remove the first transformation in the data flow and connect the source directly to the Multicast transformation to measure only the data extraction process. Execute the package a total of 10 times, recording the times in column C.
  5. Calculate the average for each of the three base measurements.
  6. Derive the processing time of the transformation component by subtracting the averaged value in column C from the averaged value in column B.
  7. Derive the time required to load the destination by subtracting the averaged value in column B from the averaged value in column A.

Most packages have more than one transformation, and some packages have multiple sources or destinations. For these packages, you simply add more columns to the table as needed and work through combinations of components to isolate processing time for each component.

Based on the test results in the table, you can identify the component that consumes most of the processing time and focus your tuning efforts on it. Furthermore, as you implement changes, you’ll be able to quantify the difference in performance and determine whether the resulting difference justifies altering your package permanently.

Fixing Source Component Bottlenecks

Bringing data into the data flow pipeline is naturally constrained by the speed with which the source can actually deliver the data requested by the source component. The processing time required to extract the data is the fastest possible speed for the Data Flow Task, but it’s pointless to extract data without doing something with it. Any subsequent transformation or destination that you add to the data flow pipeline will add performance overhead, which I’ll discuss separately.

Assuming that you optimized the external factors (e.g., disk and server resources) and identified the source component that’s the bottleneck, what can you do within the data flow design to improve that component’s performance? Here are some guidelines:

Reduce the number of columns. Regardless of the type of data source you’re using, select only the columns that are needed for transformations and columns that must be loaded into the destination. That way, memory isn’t wasted on data that won’t be used downstream.

Reduce the number of rows. When working with a relational source, use a WHERE clause to filter the data to the rows you need rather than rely on a transformation in the pipeline to eliminate rows. When working with large flat files, break up the source file where possible and process the resulting files using sequential Data Flow Tasks to minimize I/O contention.

Reduce column width. The column width is affected by the data type. You might need to recast data types in a relational source query or manually edit column widths for other sources to reduce the column width to the smallest possible size that can still accommodate the data coming from the source. SSIS works more efficiently with smaller data types.

Use the SQL Command option instead of the Table or View option for relational sources. The OLE DB source lets you retrieve data from a table or view in the OLE DB data source or use a SQL command to retrieve data from the OLE DB data source. When you use the Table or View option to retrieve data from a table or view, the data flow pipeline engine issues a command using the OPENROWSET function to get the data. When you use the SQL Command option, the engine uses the faster sp_executesql procedure.

Use the fast parsing mode for Flat File sources. If the data environment doesn’t need to support local-specific formats for integer, date, and time data types, you can boost performance by setting the FastParse property to True for relevant columns in the Advanced Editor for the Flat File source.

Fixing Transformation Bottlenecks

The performance of transformations in the data flow depends on the efficient use of buffers. High-performance transformations, known as synchronous transformations (see Figure 3), operate on data in an existing buffer, thereby avoiding the overhead of copying data from one buffer to another. Synchronous transformations can either manipulate data in place (e.g., Derived Column transformation) or add a new column (e.g., Audit transformation). But even this group of transformations consists of two subgroups: streaming transformations, which are the fastest of all the transformations, and row-based transformations, which must perform an operation row by row and consequently are slower.

Figure 3: Categorizing transformations into the synchronous and asynchronous groups
Figure 3: Categorizing transformations into the synchronous and asynchronous groups

After a row-based transformation completes processing all rows in a buffer, that buffer becomes available for processing by the next transformation in the data flow. If the next transformation is a row-based transformation, the data stays in the same buffer. Performance of the data flow is optimal in this case. If the next transformation in the data flow is an asynchronous transformation, a new buffer is required (which takes time to create and load) and a new thread is introduced into the data flow.

Moving from one memory buffer to another is still faster than the I/O operations required to store data on disk between transformations, which is why SSIS in general performs very well. The problem arises when the data flow engine must handle large data volumes, causing it to run out of memory to allocate for buffers. When memory is no longer available, the data flow engine swaps data to disk and consequently slows the data flow performance.

There are two types of asynchronous transformations—partially blocking transformations and blocking transformations—as you can see in Figure 3. Partially blocking transformations have multiple inputs, but a single output. After the transformation writes a row to the new buffer, the row is available to the next transformation even while the partially blocking transformation continues processing remaining rows. Blocking transformations, by contrast, must read all rows in the pipeline before writing rows to the new buffer, which prevents downstream components from starting. Consequently, blocking transformations are most likely to be the cause of performance problems in the data flow when large volumes of data are involved.

When baseline performance testing identifies transformations as the bottleneck, there might be another way to design the data flow to accomplish the same goal. Consider the following design alternatives:

Perform transformations in the source query. When you’re using a relational source, performing transformations in the source query lets you take advantage of the database engine’s capabilities while freeing up resources on the package execution server if it’s separate from your source server. Data type casting, cleansing (using ISNULL or TRIM functions, for example), aggregations, and sorts are common operations that you can perform in the source query. This approach will work only with relational sources. To use it, you need to set the IsSorted property of the source component’s output to True. You also need to configure the SortKeyPosition properties for each output column using the Advanced Editor for the source component.

Separate aggregate operations. If you need to perform different aggregations (each of which is associated with a separate destination) but you can’t perform the aggregations in the source query, you can create a separate Aggregate transformation for each destination rather than create multiple outputs for the transformation. That way, downstream processing for the more summarized aggregations (with fewer rows) can continue without waiting for the less summarized aggregations (with more rows) to finish. In addition, you can configure the AutoExtendFactor property to tune memory separately for each Aggregate transform.

Remove unneeded columns. After asynchronous operations, there might be columns that were used for a transformation and left in the pipeline even though they won’t be loaded into the destination. By eliminating these columns, you can minimize the memory requirements for the new buffer.

Reduce the number of rows. If the data flow doesn’t have a relational source, you won’t be able to use a WHERE clause in the source query to eliminate rows. However, row reduction is still an important goal for performance tuning, so add a Conditional Split transformation immediately after the source component to filter data as early as possible in the data flow.

Optimize the cache for Lookup transformations. Using the Lookup transformation without a cache is slow, so having all rows available in memory using the Full Cache option is a better approach. However, pipeline processing is blocked until the cache is loaded. So, you should follow some of same recommendations I discussed in the “Source Data Extraction” section: Reduce the number of columns and rows to manage memory usage and use the SQL command to retrieve data more quickly.

Preload the cache for Lookup transformations. If you’re using SQL Server 2008 or later, you can use the Cache Transform transformation to preload the cache in a separate data flow. You can even use a separate package to load the cache as a special type of Raw File that persists until you delete it. This cache file loads into memory much more quickly than loading the cache directly from an OLE DB source.

Replace the Slowly Changing Dimension (SCD) transformation with Merge Join and Conditional Split transformations. The SCD transformation is notoriously slow with large dimensions because it has to perform a row-by-row lookup to check whether a dimension row in the pipeline already exists in the target dimension. You could use a Lookup transformation instead, but the memory and time required to load the cache might still create a performance bottleneck. An alternative design is to use a Merge Join transformation with a LEFT JOIN to match sorted source records on the left with sorted dimension records on the right. You then add a Conditional Split transformation to evaluate columns added from the dimension source. If those columns are null, there is no match, which means the row is a new record. If there’s a match, you use the Conditional Split transformation to separate Type 1 from Type 2 processing.

Use an Execute SQL task instead of an OLE DB Command transformation. Another commonly slow component is the OLE DB Command transformation, which performs operations row by row. If the purpose of the OLE DB Command transformation is to update a table with a value obtained in the pipeline, it might be faster to load the pipeline data into a staging table, then use an Execute SQL task in the control flow to perform a set-based UPDATE operation.

Fixing Destination Component Bottlenecks

Sometimes the performance bottleneck in a package is caused by the destination component. Although external factors can affect the performance of destinations, there are some recommendations for improving the data flow design that relate to destinations:

Optimize the OLE DB destination. When using a table or view as the target for the data access method, you can choose whether to use the Fast Load option. This option performs a bulk insert, which is much faster than the row-by-row insert that would otherwise occur. Also, enabling the Table Lock option will help improve performance. If you have a lot of rows to insert into a table with a clustered index, the data flow engine must first sort all rows in the pipeline before performing the insert. You can get better performance if you specify the number of rows per batch, which will reduce the volume of data to be sorted at one time. Another alternative is to drop the index before loading the data into the destination and rebuild the index after the load is complete.

Use a SQL Server destination instead of an OLE DB destination. If the target is a SQL Server database, you can get up to 25 percent faster performance from the SQL Server destination because the pipeline engine can bypass the network layer during the data load. However, the package must execute on the same server as the target database and all data types in the pipeline must match the data types in the target table.

Set data types explicitly. An OLE DB destination can recast data types using a Data Conversion transformation, but there’s a performance cost. You can avoid adding a Data Conversion transformation to the data flow by casting the data type explicitly in the source query for a relational source or by manually editing the data type for each column in the Flat File Connection Manager for a Flat File source.

Evaluating Buffer Efficiency

After working through any bottlenecks to optimize performance, your next step is to evaluate how efficiently SSIS can place data in buffers. To do this, you must enable the BufferSizeTuning logging event, then execute your package and examine the log. You’ll be able to see how many rows were put into the buffer. You can compare this value to the DefaultBufferMaxRows property, which has a default value of 10,000 rows, as shown in Figure 4. If the buffer is actually getting 1,000 rows instead of 10,000, you can adjust DefaultBufferSize to make the buffer larger so that it can hold more rows. Alternatively, you can eliminate columns or reduce column widths by changing data types to try to get more rows into the buffer.

Figure 4: Checking the DefaultBufferMaxRows property’s value
Figure 4: Checking the DefaultBufferMaxRows property’s value

Another property that you can adjust for performance is EngineThreads. This property’s default value is 10 in SQL Server 2008 and later, but only 5 in SQL Server 2005. If you have a server with multiple processors, you can increase this value. You can test your package by incrementally changing the value to determine whether the additional threads improve performance.

Designing in High Performance

As you develop your SSIS packages, it’s important to consider how the design choices you make can affect package performance. Often, there’s more than one way to accomplish an objective in SSIS, but attaining high performance requires you to understand the performance differences in possible approaches.