The first version of Data Transformation Services (DTS), which Microsoft introduced with SQL Server 7.0, gave database professionals an easy-to-use, low-cost alternative to more expensive products in the data extraction, transformation, and loading (ETL) market. The first versions of most products leave gaps in their coverage, however, and DTS was no exception. Microsoft provided several enhancements in SQL Server 2000 that significantly increase DTS's power and usability. Two new tasks, as well as upgrades to an existing task, are standout improvements. Let's walk through a data ETL scenario that showcases these features as you create a SQL Server data mart from the FoodMart sample database that ships with SQL Server 2000.
Introducing the Create FoodMart 2000 Package
How many times have you wished that you could put SQL Server through its paces on a database larger than Northwind and Pubs? Actually, SQL Server ships with the larger FoodMart sample database, which is the source database for the FoodMart Analysis Services cube. The FoodMart database has just one drawback—it's a Microsoft Access database. I created a set of DTS packages that takes the Access database and moves it to SQL Server. (You can download the files you need to run the sample DTS packages. For download instructions, see "More on the Web," page 54.)This scenario provides a good framework for introducing DTS's key new features.
Before diving into the details, let's look at Figure 1, which shows the Create Foodmart 2000 DTS package. You can break down this package into 15 tasks that you group into five main steps:
- initializing global variables and the package state (Tasks 1—2)
- deleting the FoodMart database if it exists (Tasks 3—6)
- creating the FoodMart database and tables (Tasks 7—10)
- moving data from Access to SQL Server (Task 11)
- cleansing the data, creating star indexes, and adding referential integrity (Tasks 12—15)
Before looking at these steps in detail, let's look at global variables—the glue that holds the package together.
Initializing Global Variables and the Package State
Global variables are the nerve center of a DTS package because they provide a central location for DTS to share information. To create, view, and set global variable values, go to the DTS Package Designer's toolbar, select Package Properties from the menu, then click the Global Variables tab, which Figure 2 shows. SQL Server 2000's enhanced task support for global variables incorporates multiple task types—including ActiveX Script, Dynamic Properties, and Execute SQL tasks—which can set and retrieve global variable values. DTS 2000 and DTS 7.0 also support a wide range of data types, including COM components. The ActiveFoodMartConnections global variable, which Figure 2 shows, is an example of a COM component. This global variable, which I created as an output parameter in Task 4, stores an ADO Recordset object that contains records describing all active FoodMart connections.
Task 1: Initializing global variables. To initialize the package global variables, you can write VBScript code into an ActiveX Script task, as Listing 1 shows. In VBScript, global variable assignments take the form
where name is the global variable's name and Input-value is the value that you assign to the global variable. Note that although I use VBScript for all packages, you can also use any other installed ActiveX scripting language, such as JScript or Perl.
Task 2: Using .ini files to initialize global variables. Now, let's look at the way the new Dynamic Properties task removes one of DTS 7.0's major limitations—the inability to set key package, task, and connection values at runtime from outside the DTS environment. In DTS 7.0, developers had to manually configure packages as they moved through the package life cycle—from development to testing and finally to production. With DTS 2000, the package remains unchanged through the development life cycle; only the parameter settings made outside the package change. In this example, I use Windows .ini files to initialize the global variables. You can also initialize environment variables, database queries, DTS global variables, constants, and data files. Figure 3 shows the global variables that you can initialize. Note that the window also includes Connections, Tasks, and Steps properties. Later in this article, I show you how to initialize both Connections and Tasks properties. Each global variable is linked to one entry within the specified .ini file. Figure 4 shows the Add/Edit Assignment dialog box, in which you initialize the SQLServerName global variable with the SQLServerName key from the C:\Create-foodmart.ini file. Listing 2 shows the Createfoodmart.ini file code. Note that this .ini file is the only parameter in this package that isn't dynamic. You need to place it in the C directory or modify the task to point to the .ini file's new location.
The next two instances of the Dynamic Properties task use these initialized global variables to dynamically set important connection information, the SQL Server database files directory, and the CopyFoodMart DTS package filename, package name, and package GUID. The next four tasks delete active FoodMart database users and drop any existing FoodMart database to make sure that the system is ready for the database creation.
Preparing the Existing Environment
Task 3: Setting the connection parameters. The power of the Dynamic Properties task becomes evident when you set the connection parameters. The Dynamic Properties task uses the global variables that the .ini files have already initialized to initialize SQL Server OLE DB connection properties. DTS in turn uses the connection properties to connect to SQL Server. On the General tab in the Dynamic Properties Task Properties window, which Figure 5 shows, you can see that global variables set three connection parameters and a constant value sets one parameter. Clicking Edit brings you to the Dynamic Properties Task: Package Properties window, which Figure 6 shows. The window displays the specific property (in this case the OLE DB Data Source property) that the global variable is initializing. Clicking Set takes you back to the Add/Edit Assignment dialog box.
Task 4: Getting the FoodMart connection. After you set the connection parameters, you need to drop the existing FoodMart database. If users are logged in to the database, you have to terminate their sessions before you take that action. Figure 7 shows the General tab in the Execute SQL Task Properties window, which resembles the same tab in DTS 7.0. However, the Execute SQL Task Properties window in DTS 2000 incorporates the new Parameters button and the new "?" parameter marker in the SQL query. Clicking the Parameters button takes you to the Input Parameters tab in the Parameter Mapping window, which Figure 8 shows. This window lets you pass input parameters into the Execute SQL task and place output parameters from the Execute SQL task in global variables—actions you can't take in SQL Server 7.0. Let's take a closer look.
In the Parameter Mapping window, any global variable can set the SQL parameter marker, named Parameter 1. For this task, you pass the input FoodMart database name into the query by using the DatabaseName global variable. DTS 2000 packages give you the flexibility to specify the database name at runtime. In contrast, SQL Server 7.0 requires you to use additional SQL statements within the task to accomplish the same goal. Figure 9 shows how you cache the query's output recordset for use in the next task. On the Output Parameters tab, you can store one value at a time by first choosing the Row Value option, then mapping the SELECT LIST values one-to-one with global variables. You can use all values or a subset.
The ability to pass input parameters into the SQL task and place output parameters from the SQL task in global variables, as well as to store one value at a time, might seem minor at first. However, these features let you use the Execute SQL task in more places, providing a high-performance alternative to the DTS data pump transformation capability. As a general rule, set-based operations perform better than transformations. When I assembled DTS packages in SQL Server 7.0, I had to include additional SQL code within each task to set the correct input parameters and use temporary tables to store output parameters. In DTS 2000, you can eliminate from each SQL task the code you had to write in DTS 7.0 for passing input parameters and storing output parameters. In eliminating the code, you reduce the volume and complexity of code and therefore the time required to develop and test your DTS packages.
Task 5: Killing the FoodMart connections. To terminate processes that are accessing the FoodMart database, apply the SQL Server KILL command. Task 5's ActiveX script code loops through the rowset that is stored in the ActiveFoodMartConnections global variable, calling the code that Listing 3 shows. First, the ActiveX script builds the database connection string from DTS global variables, then saves the connection as a DTS global variable that future ActiveX Scripting tasks can use without first having to define it. You can use this connection to build and execute one KILL command for every server process ID (SPID) in the output rowset. After you kill all connections, you're ready to drop the existing FoodMart database.
Task 6: Dropping FoodMart. The ActiveX script that you run for Task 6 retrieves the ADO connection that you cached in the previous task, as Listing 4 shows. Then, you build the DROP DATABASE statement and execute it. Note that you have to build the statement explicitly each time for both the KILL and DROP DATABASE commands because the SQL Data Definition Language (DDL) doesn't support the "?" parameter marker. For that reason, you can't pass the database or SPID as an input parameter at the same time you pass the FoodMart database name. Now that you've finished cleaning up the environment, you're ready to build the new FoodMart database. Note that you designate the workflow from Task 6 to Task 7 as On Completion not On Success. You want the package to continue executing if the DROP DATABASE command failed because the database didn't exist. To change the workflow precedence, highlight the workflow arrow that connects Task 6 to Task 7, right-click, select Properties, then select Completion Success or Failure from the Precedence drop-down combo box.
Creating the FoodMart Database and Tables
You might wonder why I haven't recommended using the Access Upsizing Wizard to move the FoodMart database to SQL Server. Although the Upsizing Wizard, which became available in Access 95, is a helpful tool that easily migrates Access databases to SQL Server, the wizard doesn't work as well for large Access databases such as FoodMart. For these databases, you need to stage an Access-to-SQL Server migration in multiple steps similar to the steps in this example—creating the database, creating the database objects, loading the database, cleansing the data, and adding referential integrity. In deciding which utility to use, you have to take into account such factors as the underlying physical database design, table design, data type selection, and how much flexibility you have in determining when to move and cleanse data.
Task 7: Creating FoodMart. The script that Listing 5, page 52, shows creates the FoodMart2000_Master database that appears in the Data Files tab on the FoodMart2000 Properties window in Figure 10, page 52. Note that the database's size is 25MB, expandable by 10MB. Although the database can grow to 35MB, it reclaims this space when you issue a DBCC ShrinkDatabase operation from the cleanup task. Again, I used an ActiveX Script task rather than an Execute SQL task to specify at runtime the database name and the directory in which I wanted to create the new database files. I used the scripting task because DDL statements don't support parameter markers.
Task 8: Setting the database properties. Set Database Properties is an ActiveX Script task that initializes database level settings by calling the sp_dboption stored procedure. One of the database options settings worth noting here is bulkcopy, which the ActiveX script code sets to true. Bulkcopy's true setting lets the data load faster because it means that SQL Server doesn't log row-insert operations. However, be aware that for nonlogged bulk load to work, your database settings must meet additional conditions. These conditions are well documented in SQL Server Books Online (BOL).
Task 9: Initializing FoodMart's connections. The Initialize Food-Mart Connections task initializes FoodMart's SQL Server OLE DB connection and the parameters required for the Execute Package task. Figure 11 shows the General tab in the Dynamic Properties Task Properties window. You've already set the OLE DB properties, so let's set a task parameter. Clicking Edit on the General tab and highlighting the PackageGuid destination property opens the Package Properties window, which Figure 12 shows. In this window, you can select the task, the PackageID, and the PackageID's default value. Once again, the Dynamic Properties task gives you maximum flexibility for configuring a property at runtime, a capability that's vital when you move a package between environments.
Task 10: Creating tables. After you choose the package properties, you can create 24 database tables and populate them. Note the size of the FoodMart database—too large to use Access's Upsizing Wizard. FoodMart holds enough data to warrant the explicit creation of the database schema to optimize the final database size. The script that creates the tables is available online at http//:www.sqlmag.com (for download instructions, see "More on the Web," page 54). Also available online is a readme.txt file that tells you how to run the Create Foodmart 2000 package. Your next step is to run the initial load.
Task 11: Moving data from Access to SQL Server. Many ETL projects are complex enough to warrant the separation of logic into multiple packages. When you use SQL Server 7.0, linking these multiple packages together in a workflow is a challenge. The technique commonly used—creating an EXECUTE process command, then using the dtsrun command-line interface—is a cumbersome solution. In addition, in SQL Server 7.0 you can't set runtime parameters.
SQL Server 2000 addresses both shortcomings with a new task, the Execute Package task. You use this task to invoke the DTS package that moves data from Access to SQL Server. I examine the package in more detail later in this article. First, let's look at the General tab in the Execute Package Task Properties window, which Figure 13 shows. Task 9 sets the key values for this package at runtime. The window in Figure 12 displays the available properties. Be aware that for all tasks, the minimum properties you need to set are the PackageName, the package FileName, and the PackageGuid so you can dynamically set the package properties to work correctly at runtime.
The Execute Package task incorporates another valuable feature: You can initialize the called package from the task in the Execute Package Task Properties window. To initialize the called package, you can choose either the Inner Package Global Variables tab or the Outer Package Global Variables tab, which Figure 14 shows. For this example, I used Outer Package Global Variables to initialize global variables of the same name within the called package. Figure 15, page 54, shows the called package that you use to copy the data from Access to SQL Server. This package uses a technique similar to the initialization technique that the main package uses. After the initialization task completes, each of the 24 transformation tasks fire and complete independently of one another.
In each transformation, you map the source to the destination column. DTS refers to this action as the data pump. Figure 16, page 54, shows the transformations for the account table in the Transform Data Task Properties window, Transformations tab. You can set one transformation for the entire row, as Figure 16 shows, or map the table column-to-column, as Figure 17, page 54, shows. You might expect that minimizing the number of transformations would significantly speed up the copy task's performance. However, my SQL Server Profiler tests showed that the timing results are similar for both packages. One of the test runs revealed that both techniques use the BULK INSERT command to transfer information to SQL Server. BULK INSERT used as a default command is another new SQL Server 2000 feature. When you use BULK INSERT capabilities, you can greatly improve execution time for your transformation tasks. However, this performance gain comes at a cost: Inserting data in bulk mode doesn't work with the new SQL Server 2000 logging features.
To understand the problem, let's look at Figure 18, page 55, which shows the Options tab for one of the transformations. Note that the Use fast load option is enabled by default for a copy transformation. Disabling this feature changes the method of loading the destination data rows from a nonlogged, bulk-load interface to a logged interface. The quick Profiler timing tests I ran on my machine show that the task runtime is more than 10 times longer when you disable Use fast load. However, when you run a transformation with Use fast load enabled, you can't take advantage of one of the new SQL Server 2000 logging features, which lets you save copies of all rows that fail during the transformation. This logging feature is valuable because it lets you log and later process all failing rows for a particular transformation. ETL processing often requires you to make choices—and a trade-off accompanies every choice. Here, you must decide between set-based processing and row-based processing when you build your transformations. Set-based processing usually provides better performance, whereas row-based processing gives you more flexibility. I use set-based processing in the next two tasks, in which I cleanse the data and create primary keys and referential integrity.
Tasks 12 and 13: Cleansing the data. The FoodMart Access database suffers from data-quality problems. For this exercise, let's look at the snowflake schema for the sales subject area, whose key values and structure Figure 19, page 55, shows. The sales_fact 1997 table holds foreign key references to FoodMart's key dimensions: products, time, customers, promotions, and store geography. I structured the products and store dimensions in a snowflake pattern to reflect the hierarchies for each dimension; for example, each product has a product family, department, category, subcategory, and product brand. Several rows duplicate between sales_fact_1997 (8 rows) and sales_fact_1998 (29 rows). If you want to apply star indexes and referential integrity to the star schema, you have to purge the duplicated data. This challenge is nothing new to developers with data warehouse experience; typically 80 percent of total project time is spent on data cleansing. The ETL developer has to decide whether to use set-based processing or row-based processing for the data-cleansing phase of the project. For this example, I used set-based processing. To cleanse the sales_fact_1997 table, you can run the SQL code that Listing 6 shows.
The first step in cleansing the data is to find all the rows that contain duplicate entries and create a spot to store them; in this example, the code stores the results in a temporary table. Next, it deletes the duplicate entries from the fact table. Then, the code deletes the table that it used to store the duplicates. Note that in using set-based processing to cleanse information before inserting it into the star or snowflake schema, you introduce data loss because you don't re-insert the distinct duplicate rows that contain identical key values into the table. I decided to use set-based processing in this example because I don't know enough about the underlying data to determine which of the duplicate rows is the correct one. In a real project, you place these duplicate rows in a permanent table in a data warehouse or data mart metadata database that you establish to store rows that fail the data-cleansing process. You can then examine these failed rows to determine what exception processing should occur. The data mart database also stores additional information about package execution, source data, and other key information that describes and documents the ETL processes over time. (This subject is worthy of one or more separate articles.) After cleansing the fact tables, you can create star indexes and add referential integrity.
Task 14: Creating star indexes. Task 14 creates a primary key for both the sales_fact_1997 and sales_fact_1998 tables. The primary key, which is also called a star index, is a clustered index that includes each of the fact tables' foreign keys that reference a dimension's primary key. You can realize several benefits from creating primary keys; one significant benefit is that the query optimizer can use this primary key for a clustered index seek rather than a table scan when it builds its access plan. The query optimizer takes advantage of the star index in the code example that Listing 7 shows. Note that the query-access patterns demonstrate how much the star index can speed up your queries; for example, the execution time in the query that Listing 7 shows plummeted by two-thirds when I added the star index. Using a star index in queries for very large databases (VLDBs) carries another important benefit: The query optimizer might decide to implement a "star join," which unions the smaller dimensions together before going against the fact table. Usually, you want to avoid unions for the sake of efficient database optimization. However, a star join is a valid and clever optimization technique when you consider that the fact table might be orders of magnitude larger than its dimensions.
Task 15: Adding referential integrity. The last major task in this package adds referential integrity, which links all the star schema's foreign keys to their associated dimensions' primary keys. As a general rule, adding referential integrity is beneficial because it ensures that the integrity of the data mart is uncompromised during the load phase. Administrators for large data warehouses might choose not to implement this step because of the extra overhead of enforcing referential integrity within the database engine. Cleanup, the final task, uses an ActiveX script to invoke DBCC ShrinkDatabase and to clean up the connection that the global variables are storing. A production-quality DTS package includes additional tasks, such as a mail task that sends the status of the package execution to the DBA team.
Change Is Good
Sometimes little things make a big difference. This maxim is certainly true for SQL Server 2000's DTS enhancements. The Create Foodmart 2000 package showcases two new tasks in particular: the Dynamic Properties and Execute Package tasks, which help DTS programmers implement production-quality packages. And when Microsoft added I/O capabilities to the Execute SQL task, the company established global variables as the hub of activity within a DTS package.