Build cubes, write MDX queries, optimize DTS, and more
Improve Performance at the Aggregation Level
You can improve OLAP performance when you set a cube's aggregation level. When you build a cube, you set the aggregation level according to the desired speedup in processing queries. (Speedup describes how much faster queries run with precreated aggregations than without aggregations.) The system estimates the speedup based on the I/O amount that the system requires to respond to queries. The total possible number of aggregations is the product of the number of members from each dimension. For example, if a cube has two dimensions and each dimension has three members, then the total possible number of aggregations is 9 (3 x 3). In a typical cube, the number of aggregations possible is extremely large, so calculating all of them in advance isn't desirable because of the required storage space and the time it takes to create the aggregations. Imagine a cube with four dimensions, each with 10,000 members. The total possible number of aggregations is 1016. When you tell SQL Server 7.0 OLAP Services to calculate aggregations for a 20 percent speedup, OLAP Services picks key aggregations (which are distributed across the cube) to minimize the time required to determine any other aggregations at query time.
Using Children to Automatically Update Products
Let's say you want to write an MDX query that shows sales for all hot beverage products for each month of the year. That task sounds simple enough, but what if you add and remove products from your product list each month? How would you write the query so you don't have to update it every time you update your list of products? Here's a trick to help: Use the descendants or children function. The example query that Listing 1 shows uses both of these functions. Try running Listing 1's query in the MDX Sample program. The descendants and children functions are powerful.
Saving DTS Information to a Repository
To save Data Transformation Services (DTS) information into the Microsoft Repository, choose SQL Server Repository as the location for saving the package. Then, use the Advanced tab on the Package Properties to set the scanning options, which Figure 1 shows. Doing so causes DTS to call the OLE DB scanner to load all source and target catalogs into the Repository. If you don't set the scanning options, DTS creates DTS Local Catalogs as the reference for all source and target catalogs, which can make locating the databases impossible. Each subsequent save replicates this reference, so you can't keep comments and other descriptive information updated.
You can run into problems when you try to save certain DTS transformations to a repository. If you use a script to perform a simple transformation and you choose the source columns explicitly (not from a query), all the transformation data is captured, as you can see in the transformation model in "The Open Information Model," March 2000, InstantDoc ID 8060. If you choose a query as the transformation source, that source becomes objects that aren't part of the OLE DB imported data. This choice makes following the connection back to the true source objects difficult. Also, the query isn't parsed to create a connection between the query columns and the columns you select the data from. So in many cases, the connection between source and target is available, but in some, it isn't. You can solve these problems by writing a program to resolve the references in a repository or by using a custom model along with the DTS model to store the source target mappings.
I knew nothing about business intelligence (BI) until I sat through a session about a new feature tentatively called the d-cube (for data cube) during the developer's conference several years ago for the beta version of SQL Server 7.0 (code-named Sphinx). The d-cube feature appeared in SQL Server 7.0 as OLAP Services, which evolved into Analysis Services in SQL Server 2000. At the time, I was sure that OLAP Services would immediately revolutionize the database world. In a nutshell, Microsoft's BI tools are all about letting the right people ask the right questions at the right time, then applying the answers to achieve competitive advantage. You'd think everyone would be using OLAP by now, but most organizations haven't yet applied modern OLAP techniques to their decision making. In fact, many still have no idea what OLAP is. The adoption of BI as a mainstream approach to problem solving has been much slower than I originally anticipated. However, I believe that the adoption rate is beginning to pick up and that more companies will embrace BI for competitive advantage. After all, who doesn't want to make better decisions?
I firmly believe that Analysis Services is an opportunity-packed specialty for SQL Server professionals, and I'm putting my money where my mouth is. I'm not going to let my core skills in SQL Server development rust away, but I do plan to spend most of my R&D time this year focusing on becoming a hard-core Analysis Services expert. Implementing successful OLAP solutions can have a tremendous impact on your client's bottom line, which is fulfilling for a database professional. But most important, I think the demand for skilled Analysis Services engineers will far exceed the supply, which is great for my wallet.
I've found that learning the basics of Analysis Services is relatively simple. The hardest tasks to master are modeling data multidimensionally (you'll need to forget many of the database-normalization lessons you've learned over the years) and using MDX to query the data (MDX is a rich query language, but it's much harder to learn and master than SQL).
You'll need to start somewhere if you're intent on becoming an Analysis Services pro. I suggest you start by attempting to master MDX. As the market for Analysis Services experts grows, the demand for your skills is sure to follow.
Techniques for Creating Custom Aggregations
Custom rollup techniques can solve a variety of problems, but a couple of alternative techniques also let you create custom aggregations. For example, if you need to define an algorithm for aggregating one or more measures across all dimensions but the basic Analysis Services aggregation types won't do, you can use either a calculated member in the measure's dimension or a calculated cell formula that you limit to one measure. Both of these techniques are powerful because you use MDX formulas, which are flexible and extensive, to define them. Calculated cells are possibly the most powerful custom aggregation tool because they control the way existing (noncalculated) dimension members are evaluated and you can limit their effects to almost any subset of a cube. (For more information about calculated cells, see "The Power of Calculated Cells," November 2000, InstantDoc ID 15525.)
Using Loaded Measures to Customize Aggregations
A common technique for customizing the way you aggregate a measure is to define a calculated measure based on a loaded measure, then hide the loaded measure. For example, you might aggregate a Sales measure as a sum, but in two dimensions, you want to aggregate the measure as an average. In the measure definition, you can specify that a measure be named TempSales and be loaded directly from the Sales column in the fact table. You can mark this measure as hidden so that it's invisible to OLAP client applications; then, you can use TempSales in a calculation without TempSales being available to your analysis users. You can then use Analysis Manager to create a new calculated measure named Sales that will return the unmodified TempSales value except when you want the value to be an average of TempSales.
This technique of creating calculated measures and hiding loaded measures is common in SQL Server 7.0 OLAP Services implementations because OLAP Services doesn't support calculated cells or custom rollup techniques. However, calculated measures in both SQL Server 2000 and 7.0 have several drawbacks. For example, you can't use calculated measures when writing back to cube cells. One reason Analysis Services and OLAP Services won't let you write back to a calculated measure is that a calculated measure doesn't map directly back to a column in the fact table, so Analysis Services doesn't know which loaded measure to modify.
Consequently, calculated cells aren't particularly useful in budgeting or modeling applications. Another drawback of calculated members is that you can't use them with the MDX AGGREGATE function. The AGGREGATE function is a common MDX function that you use to aggregate a set of members or tuples. The measure you identify in the set determines the aggregation method that the AGGREGATE function uses. If you use a calculated measure in the set, Analysis Services (and OLAP Services) can't determine the aggregation method, so the AGGREGATE function fails. If you use a technique such as calculated cells to modify a measure's aggregation, the AGGREGATE function works because it is based on the measure's defined aggregation method.
Caution: Large Dimensions Ahead
Be very careful when dealing with dimensions. Look before you leap into Analysis Services' very large dimension feature, which places large dimensions into a separate memory space. This feature is buggy, so avoid it. Also be careful with Relational OLAP (ROLAP) dimensions, which the server reads into memory as needed at runtime. Because you can place a ROLAP dimension only into a ROLAP cube, performance will suffer mightily. In theory, ROLAP mode supports larger dimensions, but it's non-functional in my experience.
Decoding MDX Secrets
I joked in the December 2000 issue that I wished I knew some super-secret MDX command to help solve the problem of creating virtual dimensions on the fly. Well, believe it or not, an MDX function that was undocumented in the initial release of Analysis Services provides a great solution to this problem. The MDX function is CreatePropertySet()—you use it to create a group of calculated members, one for each member-property value. The query that Listing 2 shows, which creates a group of calculated members that are children of the All Store Type member in the FoodMart Sales cube, is a simple example of how to use this function. The query creates one calculated member for each unique Member Card property value for the members of the Customers Name level. The query creates a new set, CardTypes, with the new calculated members and displays it on the rows of the result. Figure 2 shows the query's result set.
Improve Cube Processing by Creating a Time Dimension Table
Some people create a view from the fact table by using the syntax
GROUP BY \[Fact_Table\].\[Date\]
and use the view as a source for the Time dimension. This method has a couple of drawbacks. First, it's inefficient: The fact table is usually much bigger than the dimension table, and accessing a view of the fact table is the same as accessing the underlying base table. Another disadvantage of using the fact table as a source for the Time dimension is that the dimension won't contain a date that had no activity. Thus, this method can create gaps in the dimension sequence by skipping weekends, holidays, and so on. If you want these gaps, remember to exclude irrelevant dates from your Time dimension table.
A better way to create a Time dimension is to create a special Time dimension table in your data warehouse to hold all relevant dates. Simply create the table in Microsoft Excel, then use Data Transformation Services (DTS) to import the table into the data warehouse. This approach to creating a Time dimension significantly improves dimension and cube processing because you don't need to query the fact table to get the Time dimension members. And if the table's date field is of any time data type (e.g., smalldatetime), Analysis Services' and OLAP Services' Dimension Wizard, which you use to create dimensions, detects that the dimension could be a Time dimension and prompts you to confirm its choice, as Figure 3 shows. After you confirm that the dimension is a Time dimension, the Dimension Wizard helps you create the Time dimension's levels (e.g., Year, Quarter, Month, Day), as Figure 4 shows. You can also define the first day and month of the year; the default is January 1.
Transforming Data with DTS
Data Transformation Services (DTS) is widely used as a SQL Server data-transfer tool, but in addition to simple data transfer, DTS offers the ability to perform data transformations to the data you're transferring. The ability to perform data transformations makes DTS more versatile than most other data-transfer tools. DTS's transformations let it perform a variety of tasks that would otherwise require custom programming. For example, by using DTS transformations, you can perform simple conversions such as converting a set of numeric codes into alphabetic codes. Or you can perform more sophisticated jobs such as turning one row into multiple rows or validating and extracting data from other database files as the transformation executes.
DTS transformations are row-by-row transactions, and as such, they add overhead to the transfer process. The amount of added overhead depends mainly on how much work the transformation script must perform. Simple data conversion adds negligible overhead, while more involved transformations that require accessing other database tables add significantly more overhead.
To add a custom transformation to a DTS package, click the Transform button on the Select Source Tables and Views dialog box; you'll see the Column Mappings, Transformations, and Constraints dialog box open. Then, click the Transformations tab to display the Edit Script dialog box, which contains a VBScript template that by default includes code that copies the source columns to the destination columns. You can freely modify this template to create your own custom transformations.
The code in Listing 3 shows how DTS converts the values in the column named CHGCOD from a numeric code in the source database to an alpha code in the target database. You can see that the code tests the CHGCOD column to see whether it's equal to a 1, 2, or 3. If the code finds a 1, it writes an A to the destination table. If the code finds a 2 or 3, it writes a B or C (respectively) to the destination column. The code writes a D to the target column if it finds any other value.
Supporting Disconnected Users
A common shortcoming of analytic applications is that they can't support mobile or disconnected users. Because analytic applications are complex, developers move more application functionality to Web browser–based UIs, which use dynamic HTML (DHTML) and JScript to minimize the amount of application code that workstations download. Unfortunately, disconnected workstations (e.g., laptops) can't run this limited code without a network connection. Because I'm one of those mobile users, I appreciate applications that I can use whether or not I'm connected. The number of users like me is growing; more workers in the enterprise are using laptops instead of desktop computers. Managers, especially, rely on mobility, and they're heavy consumers of analytic applications. To support disconnected users, developers need to enable users to take part or all of an application with them.
I don't have a solution that will make a fancy DHTML Web application run well on a disconnected laptop. But I can tell you about a new feature in SQL Server 2000 Analysis Services that makes supporting disconnected users easier: local-cube Data Definition Language. DDL provides a simple way to create local-cube files in Analysis Services through MDX. These local-cube files let you put part or all of the data from a server-based cube onto a laptop. You can then use the local-cube file to perform the same analysis that you could if you were connected to the OLAP server on a network. To create a local cube without this new MDX syntax, you must construct a verbose SQL-like statement and pass it to ADO through a connection string.
Local-cube DDL is superior to the old connection-string method for three reasons. First, the shortcuts in the DDL syntax make using it simpler than repeating all the details of the server-based cube to create a local cube with the same dimension structures. Second, most OLAP applications don't give users the ability to customize the connection string to the required degree, so developers created custom applications to provide the CREATECUBE functionality. Third, a variation of the new DDL can create session- scoped temporary cubes
Dependency Risk Analysis
Many businesses use a type of analysis called dependency risk analysis. This type of analysis determines whether one group of items in your business (e.g., products) is overly dependent on just one item of another group (e.g., customers). Retailers describe an overly dependent item as at risk. For example, you might want to find out which products depend most on a single customer. To answer this question, you need to find what percentage of total store sales for each product comes from a single customer. To test yourself, find the top 10 highest risk products, and show the percentage and amount of the product's sales that are at risk.
Listing 4 shows a query that defines two new measures. One measure calculates the total of Store Sales for the selected product (e.g., you might want to find the total sales to that product's top customer). The other measure calculates the percentage of the product's total sales that's at risk. The MDX query in Listing 4 uses the PercentAtRisk measure to find the 10 products with the highest percentage of Store Sales at risk. The query then displays both the amount at risk and percentage at risk for each of the top 10 products.
Choosing the Right Client for the Task
The lessons our development team learned from building a Web-based analysis system can provide a valuable guide for deploying OLAP solutions in an enterprise. Microsoft Excel provides a capable, familiar client that you can deploy in a LAN but requires realtime connectivity to the OLAP server. Office Web Components (OWC) works well for deploying an Analysis Services client in an intranet because you can easily control the client platform and open ports securely in an intranet. The Analysis Services Thin Web Client Browser provides a good Internet solution when firewalls are in place and you want minimal impact on the user OS. For any development project, you need to understand the business requirements and needs of the people who will use the products you develop. By outlining requirements and weighing all the options, you can discover the right solution to satisfy your client's requirements.
Using Access as a Data Source
To analyze a relational data source, you need to first publish it as a data source in the Windows 2000 or Windows NT environment by establishing a data source name (DSN). To set up Microsoft Access as a data source, start by accessing the Data Sources (ODBC) settings in Windows NT through the Start menu under Settings, Control Panel. In Windows 2000, choose Start, Settings, Administrative Tools. Double-click to open the Data Sources (ODBC), then select the System DSN tab. Click Add; in the Create New Data Source window, select Microsoft Access Driver (*.mdb). Click Finish to display the ODBC Microsoft Access Setup dialog box. Under Data Source Name, enter the name you choose for your Access data source. In the Setup Wizard's Database section, click Select. In the Select Database dialog box, browse to find the database, select it, then click OK. To finish the source-data selection sequence, click OK in the ODBC Microsoft Access Setup and the ODBC Data Source Administrator dialog boxes.
One of the most common measurements of group or individual performance in a consulting agency is utilization. Decision makers in consulting groups calculate utilization by dividing the total number of hours billed by the total number of business hours available (e.g., 8 hours for each business day). Having a high percentage of utilization is good because it means you're effectively deploying available resources to generate revenue. You can use the cube structure that Figure 5, page 42, shows to create the MDX for a measure that calculates utilization for a selected employee and time period. The query in Listing 5, page 42, calculates utilization as a percentage of hours worked. The meat of the formula is in the definition of the calculated measure, AvailableHours. AvailableHours multiplies the number of work days in the selected time period by 8 hours. You get the total number of work days by eliminating weekend days and holidays from the total number of calendar days. The Utilization measure then divides the total work hours by the available work hours to get a percentage. The result is a percentage that can be more than 100 percent if the average number of work hours exceeds 8 hours per day.
Use Member Properties Judiciously
When you start up the OLAP server, the server loads every dimension—including member keys, names, and member properties—into server memory. Because Analysis Services is limited to 3GB of RAM, this is one of the primary bottlenecks for enterprise-scale deployments. For this reason, limit member properties to the bare essentials, particularly when the level has lots of members.
Get Level Names Right from the Get-Go
When you first build a dimension, level names default to the same names as the column names in the dimension table (except that Analysis Manager replaces special characters with spaces). This means that you wind up with level names like Cust Code, or worse. Then, after the cube is processed, you can't change the level names without reprocessing the dimension, which in turn requires that you reprocess the cube. Because it's painful to rename levels after the cube is processed, many cubes go into production with frighteningly cryptic level names. To compound matters, MDX formulas are often written with dependencies on the unfriendly level names, adding another hurdle to the level-rename task. Cubes are supposed to be easily usable right out of the box, so avoid this pitfall by getting the level names right from the beginning. As soon as you build a dimension, change the default level names to user-friendly names before placing the dimension into the cube.
Aggregating a Selected Group of Members
Sometimes you need to aggregate a group of dimension members for one query. For example, suppose you want to return Unit Sales for the quarters of 1997 for each product family. The solution is easy. But what if you want to run the same query for only the customers in California and Oregon, leaving out Washington? This is a common problem with a simple solution. All you have to do is create a calculated member that aggregates California and Oregon, and select that calculated member in the WHERE clause, as Listing 7 shows.
The Aggregate function aggregates the set of members passed to it and uses the Aggregation method defined for the member's dimension. In this case, the Customers dimension is aggregated with a Sum function that we defined in the OLAP Manager when we built the cube, so the new dimension member \[CA-OR\] is the sum of \[CA\] and \[OR\].
This tip is useful, but be careful. Performance can suffer if you use aggregation heavily in the WHERE clause. If you have a common alternative aggregation, you might be better off creating a second hierarchy for your dimension.
Determining the Percentage of a Product's Contribution
A common business problem is determining percentage of contribution to a group. For example, you might want to know what percentage of the total revenue of a product line a particular product contributed, or what percentage of growth of sales in a country each region in that country contributed.
Here's one way to solve this problem: For each revenue or dimension combination you want to analyze, create a new calculated member. For instance, if you want to analyze Store Sales as a percent of a group in the Product dimension, create a member, as Listing 6 shows. Figure 6 shows the result set from this query.
Avoid Crippled Client Software
Can you imagine using a front-end tool for a relational database management system (RDBMS) that doesn't let you specify an SQL statement? Of course not. Yet somehow that's what developers are faced with in the OLAP space. Remarkably, many shrink-wrap query and reporting tools that work with Analysis Services are crippled in a fundamental sense—they don't let developers supply an MDX SELECT statement. The problem is this: None of the commercial clients, even the most robust, come close to exposing the full power of MDX. Maybe simple cube browsing is all your users require. Nonetheless, to avoid painting yourself into a corner, choose a front-end tool that lets the developer specify custom MDX SELECT statements.
There's a catch to this advice, however. The client tools that don't expose MDX tend not to be tightly bound to Analysis Services—they provide connectivity to other data sources. However, I don't think it's asking too much for these query- and reporting-tool vendors to expose an MDX SELECT query string as a pass-through.
Setting OLAP Cube Aggregation Options
After you create an OLAP cube and choose the storage technique that's optimal for your situation, the OLAP server designs the aggregations and processes the cube. If you choose the Relational OLAP (ROLAP) storage technique, the OLAP server will create the summary tables in the source database after it processes the cube. Otherwise, aggregations are stored in OLAP server native format. You can choose the degree of aggregation by considering the level of query optimization you want versus the amount of disk space required. Figure 7 shows the Storage Design Wizard. For example, I chose 80 percent performance, which produced 124 aggregations and required 22.5MB of storage space for Multidimensional OLAP (MOLAP) storage. The aggregations roll up, so if you choose low performance in favor of conserving disk space, the OLAP server query engine will satisfy queries by summing existing aggregations.
Use Views as the Data Source
Always use views as the data source for dimension tables and fact tables. In addition to providing a valuable abstraction layer between table and cube, views let you leverage your staff's expertise with relational database management systems (RDBMSs). When you use a view as a fact table, you can manage incremental updates by altering the WHERE clause within the view instead of assigning the WHERE clause to an OLAP partition. When you use a view to source a dimension, you can define logic inside the view that otherwise would have to be defined in Analysis Services (e.g., formulated member names, formulated member properties).
Enter Count Estimates
When you first build a dimension, Analysis Services stores the member count for each level as a property of the level. This count is never updated unless you explicitly update it (manually or by using the Tools, Count Dimension Members command). In addition, it's typical for cubes to initially be built against a subset of the data warehouse. In this case, the cube will likely go into production with the count properties understated by an order of magnitude. Here's the gotcha: The Storage Design Wizard uses these counts in its algorithm when you're designing aggregations. When the counts are wrong, the Storage Design Wizard is less effective at creating an optimal set of aggregations. The solution is simple—when you build the dimension, manually enter estimated counts for each level.