DTS takes the movement of data to a new level

SQL Server 7.0 includes many new tools to support enterprise database systems. Data Transformation Services (DTS) is one of the most exciting new offerings because it helps developers and database administrators (DBAs).

Microsoft recognized the difficulty of using bulk copy program (bcp) to move data into and out of SQL Server and created DTS as a solution. DTS fulfills Microsoft's need for a fast, robust, extensible architecture for moving data out of traditional relational databases and into Microsoft's new online analytical processing (OLAP) engine (i.e., SQL Server OLAP Services, formerly code-named Plato).

The DTS utility service features a highly graphical, wizard-based interface; improved performance; and programmable extensibility through a rich set of publicly available COM interfaces. Also, the new service provides an excellent facility for migrating data from one database to another. DTS provides import, export, and data transformation services between Object Linking and Embedding Database (OLE DB), Open Database Connectivity (ODBC), or ASCII data sources. In addition, the new service satisfies data warehousing needs such as data extracting, scrubbing, validating, aggregating, and summarizing.

DTS takes data migration to a new level by providing the means to import, export, and transform data between multiple sources—interactively and programmatically—through COM interfaces. DTS differs from its predecessors, such as bcp and replication, by being able to interact with non-relational data and to programmatically transform data. The new service lets users interact with most existing data sources—not just those using relational technology—and eliminates much of the complexity that bcp and replication introduced. DTS removes the need for intermediary tables; transformation code now becomes part of the transfer process. This change minimizes the number of processing steps, which reduces the likelihood of failure and ensures a high-performance data transfer.

DTS uses OLE DB technology to support non-relational databases; OLE DB, Microsoft's solution for universal data access (UDA), defines a set of COM interfaces (called data consumers and data providers) for accessing and offering any type of data. SQL Server 7.0 is a native OLE DB provider. DTS is an OLE DB consumer. Additionally, with SQL Server 7.0, Microsoft provides OLE DB interfaces to Oracle 7.3+, Microsoft Jet, and ODBC sources. Microsoft envisions that OLE DB will eventually replace ODBC as the industry standard for accessing data.

Using OLE DB, DTS gains independence. Acting as an OLE DB consumer, DTS does not require SQL Server. Furthermore, the new service interacts with any database (including SQL Server 4.2 and 6.x servers) with a compatible ODBC driver. (Test your configuration for compatibility before you commit to a specific driver.)

DTS provides a robust user interface for developing and administering transformation tasks. For ad hoc transformations and simple transfers, DTS offers import and export wizards to assist users in developing their transformation tasks.

For more complex task development, DTS provides DTS Package Designer, a development tool. From within the designer, you can define multiple data sources and targets and create elaborate transformation scripts using any COM-based scripting language. Additionally, you can define packages, database tasks, and external applications for the system to execute, and you coordinate these tasks using built-in scheduling and workflow functionality.

Evolution


DTS has evolved from the idea of a dedicated data pump between SQL Server and SQL Server OLAP Services and has become an independent data service that can extract, transform, and manage data movement between multiple heterogeneous data sources. (DTS does not run like an NT service. Instead, DTS runs as a COM server. Also, it can run as an external application via dtsrun.exe.) The evolution of DTS means it can perform roles in many different projects. For example, DTS might replace native replication to move data back and forth between two vendors' data stores, such as an Oracle database running under NT and an Informix database running under UNIX. In another example, DTS might scrub and validate text data from a legacy system before loading the data into a Web-based database application.

Although DBAs and developers will find many such uses for DTS, Microsoft has designated DTS as an essential component in its Data Warehousing Framework. DTS is part of the warehousing framework diagram in Figure 1. In data warehousing, developers face three main tasks—data acquisition, data transformation, and data presentation. Microsoft is positioning DTS as the tool for automating data acquisition and data transformation. In addition, DTS provides scripting, task scheduling, and workflow tools to manage such jobs and handle exceptions that might arise.

Data acquisition is the task of getting data from any data source and feeding it into a warehouse. Acting as an OLE DB consumer, DTS extracts data from any data source that provides a native OLE DB or ODBC interface. For example, a company might host an inventory tracking system in a DB2 database running on a mainframe. The company's accounting system might be on an Oracle database running under UNIX, and the company's human resources data might be in a SQL Server database. To pull all this information into a SQL Server warehouse, the developer uses the DTS Package Designer to create OLE DB connections to each data source. The developer then uses the data source connections to map source data fields to destination data fields in the SQL Server warehouse.

Data transformation is a set of tasks for scrubbing, validating, aggregating, and summarizing data after its extraction and before its arrival at a target storage location. DTS provides this functionality through scripting capability that accesses COM interfaces.

DTS offers an infrastructure for managing acquisition and transformation tasks. DTS steps define the logic flow of a transformation; each step has an associated task, and the task reports its result (i.e., success or failure, with more detail available). DTS lets you schedule and execute steps serially, in parallel, or according to precedence—providing flexibility and performance in logic flow. For example, if an entire transformation process consists of six steps, you can start the first three steps in parallel, and have the fourth step start after successful completion of the first three. The remaining two steps can begin in parallel after the fourth step has successfully completed. In addition, you can conditionally execute steps. For example, after the first step returns a status code, you can have DTS use the status code to dynamically determine whether to execute step two or step three. DTS scheduling and workflow facilities fill a huge gap in most enterprise-warehousing solutions.

Architecture


The key to understanding the DTS architecture is to understand the DTS Package, which is a complete, self-contained description of all the tasks and steps for completing an import, export, or transformation process. Figure 2 shows a diagram of the DTS Package. A package always includes a source and a destination data store. Also, a package can invoke SQL tasks and external Windows applications, or manipulate the data stream and invoke COM objects and other packages (through scripting). You accomplish such invocation by defining each operation as a separate task in the package. You schedule each task as a step to execute serially, in parallel, conditionally, according to precedence, or in any combination of these orders.

DTS offers COM-based interfaces (i.e., function calls) for creating, maintaining, and executing packages. From the graphical toolset, DBAs will use the Enterprise Manager to administer DTS packages—executing the packages interactively or using SQL Server Agent to schedule them. Developers will use either DTS Package Designer or the Import and Export wizards to create new packages. For maintaining and creating complex packages, the Package Designer provides a graphical environment in which users define tasks and lay out execution orders. To save a package, you can write the package to a COM structured storage file, save the package within SQL Server in the msdb database, or add the package as metadata to the SQL Server Repository. The preferred location for storing packages is the SQL Server Repository. (The Microsoft Repository specification is the basis of the SQL Server Repository. Microsoft developed this specification in cooperation with industry and data warehousing tool vendors. These partners' intent was to form a means of providing better tool integration by sharing tool-generated data and metadata that describe the tools' data. Developers have more choices for interoperable tools when all the tools use the same repository.)

DTS leverages the repository in two ways. First, it lets developers reuse packages, tasks, and scripts. Second, the repository lets DTS track the line-age of data and packages through an audit trail that shows the data origination and all transformations performed on that data before it reached its destination. (Not having this information often leads to difficulties in confirming the validity of the destination data.)

In addition to providing the graphical interfaces, DTS exposes a complete COM object hierarchy (transformation object model) that you can use to create, maintain, and execute packages within their applications. Figure 3 illustrates this hierarchy, which is the model for the graphical tools that DTS includes. Using any COM-compliant programming language, you can create custom applications and still leverage all the functionality of DTS. This capability sets COM interfaces apart from graphical interfaces.

Tasks define the pieces of work in a package, as the object model shows. The tasks define what work to perform and can consist of executing a script, spawning an external process, spawning an SQL task, or performing a transformation via the Data Pump (which is where the data transformation occurs). As Figure 3 shows, Packages include Steps, which control the scheduling and execution of tasks, and Connections, which the Data Pump uses to communicate with data sources.

Export Wizard


To illustrate DTS in action, we'll export data from the SQL Server pubs database into an Excel spreadsheet. You can launch the DTS Export Wizard from three locations. You can launch it directly from the desktop by clicking Programs in the Start menu and selecting Microsoft SQL Server 7.0, Import and Export Data. You can launch the DTS Export Wizard from within Enterprise Manager by clicking Wizards in the Tools menu. Or, you can launch the wizard by clicking Data Transformation Services in the Enterprise Manager Tools menu, and selecting Export Data. (Each of these launch methods will give you slightly different DTS dialog box interfaces.)

After launching the wizard, choose a data source and target from one of the known data providers. (The selection process is similar to the ODBC Manager application.) For our example, we chose Microsoft OLE DB Provider for SQL Server as the source. We selected our database server (i.e., acorn), entered the appropriate login information (sa and no password), and chose the pubs database as the data source, as Screen 1 shows. In the Choose a Destination dialog box, we selected Microsoft Excel 8.0 as the destination, and chose a file to export the data to. Screen 2 shows the pathname and filename C:\MyDocuments\PubsDataExport Wizard.xls. (The dialog box in Screen 2 means the Excel file exists; if it doesn't exist, DTS dynamically creates a new file and transfers data to it, adding a separate worksheet for each table in the transfer. If the Excel file exists, DTS lets you append data or delete the existing data.)

Now that you've identified the source and destination data stores, you must refine your data selections by defining individual data entities for the system to move. The Export Wizard lets you access data directly from the table or indirectly through a query. The Specify Table Copy or Query dialog box in Screen 3 shows that you do direct table copies from the source database. Alternatively, you can use a query. If you choose the query option, you can define complex joins against any tables and views. In addition, you can define a query as a stored procedure, which provides users access to the Transact SQL (T-SQL) language. Although not applicable to this example, the option for performing object and data transfers between SQL Server databases is in this dialog box. This option replaces SQL Server 6.5's Database/Object Transfer functionality and is available only when both the source and destination data stores are SQL Server 7.0 databases. (You cannot use Database/ Object Transfer to transfer objects from earlier SQL Server versions). Screen 4 shows available object transfer options.

Continuing with our example, we exported the sales, stores, and titles tables as Screen 5 shows. If we click Preview, DTS displays the first 100 records from the data source we selected. For example, when we select the sales table and click Preview, the system displays the sales data (as Screen 6 shows). Although we used the default transformation options for all our tables (performed a direct one-to-one copy), we clicked Transform for the Sales table to illustrate the additional customizable attributes that the wizard exposes.

Clicking Transform for the Sales table lets us see the Column Mappings and Transformations dialog box, which Screen 7 shows. To perform more than a simple one-to-one data copy, use the Column Mappings tab to redefine and customize the destination columns and control how the system applies data to the destination table. An example of such customization is giving a more descriptive name to the destination table. Customization is useful for providing data to end users or for moving data into a data store that allows granular datatyping (e.g., moving from an Access text datatype to a SQL Server varchar(25) datatype). In addition, users can append data to an existing data set, re-create the target data store, or purge and refresh the data.

The Transformations tab, which is the second tab in the Column Mappings and Transformations dialog box, lets you implement simple transformations. Because we performed a one-to-one copy, you see on the Transformations tab in Screen 8 that each DTSDestination data stream value copies directly to its corresponding value in the DTSSource data stream. In addition, you can choose an ActiveX scripting language to implement the transformation. If you don't want to script transformations, you can gain additional control over conversions by accessing the Advanced Transformation Properties dialog box, which Screen 9 shows.

The Save, Schedule and Replicate Package dialog box prompted us to execute and save our package, as Screen 10 shows. Optionally, we can publish our package for replication, which lets a user execute the package at a remote location. In addition, we can run our package immediately or schedule the package for later execution. We saved the package to our local server from the Save DTS Package dialog box, as Screen 11 shows, and executed it immediately. Saving a package in the wizard makes it available for future execution, editing, and enhancements.

The Transferring Data dialog box, which Screen 12 shows, reports the current execution state of each step of the package; if steps execute concurrently, the dialog box shows the states of these steps progressing simultaneously. If a step executes only after other steps complete, the Transferring Data dialog box shows the state of that step as starting after the other steps are finished.

For DBAs and Developers


Much publicity and anticipation surround many of SQL Server 7.0's new features. As is typical with a new release, DBAs and business users are getting many of the enhancements in SQL Server 7.0 that they have been waiting for. DBAs have the new and enhanced administration tools that promise to remove many scalability barriers that large databases impose. Business users have the performance enhancements in the redesigned query optimizer and storage engine. With this new release, database developers are finally getting something, too: DTS.