Downloads
16548.zip

Periodically refresh the data in your data mart

In the first two articles in this series ("Relational to Dimensional," June 2000, and "Building a Cube from a Dimensional Database," October 2000), I described a simple solution to provide reporting and analysis for the fictitious Northwind Traders Company. The solution included creating a new database designed for querying, loading the database from the relational Northwind database, and using SQL Server 2000 Analysis Services to create an OLAP cube. This article completes the series by describing the steps you need to take to periodically refresh the data in the data mart so that you can reload the entire data set without updating the data from the source database.

Changing Dimensions


The dimensional, or star schema, database created for reporting and analyzing sales information across your company has dimensions for suppliers, products, employees, customers, and time, as Figure 1 shows. Except for time, this data changes as business parties begin or end relations with Northwind Traders. Depending on analysis requirements, a variety of rules could apply to these changes. New customers, suppliers, products, and employees are relatively easy to handle. You add them to the dimension tables so that you can capture related transactions in the fact table. When these business parties aren't in the picture, the rules can be more troublesome.

An easy solution to changing dimensions is to ignore those dimension members that go away and delete them and their facts, but this approach skews history. A product that Northwind Traders discontinued still has sales numbers for the time that the company sold the product, and this history factors into the numbers for the product's parents in the product dimension hierarchy.

What do you do if two customer or supplier companies merge into one entity? Although many possibilities exist, they all boil down to business requirements. How you deal with changing dimensions can vary on a case-by-case basis. For information about three options for dealing with slowly changing dimensions, see Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (Wiley & Sons, 1996); Ralph Kimball et al., The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses (Wiley & Sons, 1998); and Amir Netz's Microsoft article "OLAP Services: Managing Slowly Changing Dimensions," at http://msdn.microsoft.com/library/techart/slowly2.htm. Each of the three options has a different effect on the ability to look back into history.

The first option is to simply replace an old value with the new one. For example, if a customer's marital status changes from single to married, you could just change his or her status. But that approach hides the person's historical buying behavior as a single person, which could be very different from the buying behavior of a married person.

The second option for dealing with a changing dimension is to add a dimension member for the customer with a status of married so that historical transactions associated with him or her as a single person still show up and his or her new transactions as a married person are captured moving forward. This option preserves historical information.

The third option is to store a reference to the previous value. In the case of the customer who marries, he or she might have a current marital status of married and a previous marital status of single. Because being married is definite, referring to the previous value isn't necessary. However, this option makes more sense for less cut-and-dried situations. For example, in The Data Warehouse Lifecycle Toolkit, Kimball uses the redrawing of sales district boundaries as a possible candidate for this option. Analysis Services also has features for dealing with changing dimensions.

Following the simple approach of my first two articles, I will preserve history in the dimensional database and add new business parties as changes arise, as in the second option above. This approach lets me add sales numbers to the dimensional database fact table based on the business parties that each transaction involved.

Detecting the Changes


To refresh the dimensional database, all I need to do is determine whether rows exist in the Northwind database—not the Northwind_Star database dimension tables—and if so, add them to the appropriate dimension. This task is fairly easy to accomplish in a set-oriented manner with T-SQL in Data Transformation Services (DTS) packages. Or, you could build logic into an ActiveX task, ActiveX transformation, or Data Driven Query task to check for the existence of specific rows in the dimension tables and insert those rows if they don't exist. However, these methods process the tables one row at a time, which isn't nearly as efficient an approach as set-oriented T-SQL.

Another point to consider in refreshing the dimensional database is that I decided to put the Northwind_Star database on a separate SQL Server system to offload query activity from the production transaction processing (TP) system. If I want to use T-SQL code to join tables in my source and destination systems for loading new data, I need to use linked servers. Figure 2 and Figure 3 show Linked Server Properties dialog boxes for connecting the servers.

The linked server name in Figure 2 is a descriptive name that references the remote server. You can give the server any name you want; the name I used for this linked server references the target database. Also, you don't need the blank fields in this dialog box. The data source is the name of the remote server (this example uses a two-part name for a SQL Server 2000 instance), and the catalog is the name of the database you want to use on the remote server. You configure the linked server in the source SQL Server system (where the Northwind database is) to connect to another SQL Server system (where the Northwind_Star data mart is) because the DTS package's T-SQL tasks join tables between the two databases across servers. For this example, I simply use the sa account on the remote server to access the remote server, as Figure 3 shows.

Building the DTS Package


Now I can build a DTS package to refresh the data mart. Figure 4 shows a package that first clears all the tables in the data mart, then completely loads the data. This package contains steps to populate the time dimension table and the customer cross-reference table. The package populates the time dimension table with a series of dates covering every day from the beginning of 1995 to the end of 2010, which covers the order dates in the Northwind database with room for future growth. The customer cross-reference table assigns an integer value to each customer ID for use as a key in the data mart ("Relational to Dimensional" explains why I assigned such a key). The rest of the package populates the dimension tables and the fact table from the production data in the Northwind database.

I don't want to load the whole data mart every night, so I need a second package that runs periodically after the load package runs and updates the data in the data mart as the dimension members change and as the source system processes orders in the Northwind database. Figure 5 shows the second package. The first step in the new package is to add new customers to the Customer_Xref table in the Northwind_Star database. An ExecuteSQL task in the DTS package inserts from the Northwind source database the new customers' customerid values that don't exist in the Northwind_Star Customer_Xref table. Listing 1 shows the T-SQL code that this task runs.

The next step is to use EXISTS tests in SQL queries to refresh the data mart dimension tables with new product, employee, and supplier information. The DTS package refreshes the data mart dimension tables in a Transform Data task and uses T-SQL statements to join the tables in the Northwind database on one server with tables in the Northwind_ Star database on the other server. Figure 6 shows one of these JOINs. Notice the four-part name that references the table in Northwind_Star via the linked server. A Transform Data task also exists for customer information, using the surrogate customer key from the Customer_Xref table to add new customer information and using the EXISTS test to figure out which rows to insert.

Finally, you use the orders added to the Northwind source database since the last load to update the Orders_Fact table. Deciding how to identify new orders involves various considerations. The order date could be inadequate, depending on the criteria the system uses to assign this date to new orders. For this example, I assume that orders have a date assigned only when they are entered into the system, so I look for orders that are newer than the most recent order date in the data mart and that are more recent than the date on which the package runs. I check the order dates against the current date to force the package to load only data from the previous day and before. Without this check, you might load part of a day's data at noon, then never pick up the orders from noon to midnight because those rows don't qualify as greater than the most recent order date in the data mart. This approach is simplistic and requires that the DTS package be run after the end of the business day. Listing 2 shows the T-SQL code that accomplishes this task.

Refreshing the Cube


So far, I have refreshed the star schema data mart. Now, I need to refresh the Analysis Services cube I created in "Building a Cube from a Dimensional Database." This part of the task is easy if you're using SQL Server 2000 DTS or if you've downloaded Data Transformation Services Task Kit 1 for SQL Server 7.0 to add the OLAP Services processing task (available at http://www.microsoft.com/sql/downloads/dtskit.htm). An Analysis Services processing task lets you set up processing for cubes. In this example, I chose to incrementally update the cube data and dimensions. This approach adds new fact data to the Orders_Fact table, leaving the data that already exists and adding new dimension data. Figure 7 shows the dialog box for the Analysis Services processing task.

The DTS packages are now ready to run. After the initial load, something has to change in the Northwind source database before anything changes in the Northwind_Star dimensional database and the Analysis Services cube. To see what the package does, add a new customer, employee, supplier, product, and order. As the package executes, you'll see a count of the rows each step added. You can then query the Northwind_Star database to see new order facts and new dimension members. You can view the new data through Analysis Manager by browsing the cube data. Of course, you might still have work to do if client applications that are based on this cube need to be refreshed as well.

A Simple Solution to a Complex Problem


With the articles in this series as a guide, you can design a simple reporting and analysis system. Using dimensional database design techniques, Analysis Services, and DTS, you can create a data mart, build an OLAP cube, and incrementally update the data mart. Creating this kind of solution can be complex, but these articles clarify concepts and provide guidance on how to approach a data mart solution.