An overview of the data-mining features in SQL Server 2000

Data mining is one of the most exciting new features of SQL Server 2000. I define data mining as an automated process of analyzing data to find interesting information. You can use data mining with either relational data sources or OLAP cubes, but the integration of data mining with OLAP offers outstanding benefits because the cube meta data is already structured to free the user from feeding a lot of information to the data-mining algorithm. Although I'm not yet an expert, I've learned enough from my colleague, Greg Bernhardt, to explain and, I hope, demystify the data-mining features of Analysis Services. I'll provide an overview of the data-mining features in the Analysis Manager (Analysis Services' configuration utility) and show you how to use data mining in your own analysis application.

Data-Mining Features


Data mining fills an important gap in Analysis Services' capabilities. Microsoft introduced ad hoc and exploratory analysis capabilities in SQL Server 7.0 with OLAP Services. In ad hoc analysis, an analyst knows which question the client wants answered and simply uses an OLAP cube to retrieve information. For example, a fast-food restaurant manager might ask, "What is the trend of revenue and margins for hamburgers in the last four quarters?"

In exploratory analysis, an analyst might have some idea of a client's interests but no specific question to answer. For example, a company might know that some of its retail stores are not profitable but have no idea why. The analyst navigates the information in an OLAP cube by drilling into more detail, pivoting the dimensions, and following the most interesting information. At Knosys, we call this process surfing the data.

Data mining differs from both ad hoc and exploratory analysis. With data mining, Analysis Services surfs the information for you, seeks relevant data, and presents the data to you. Data mining is a perfect companion to exploratory analysis.

SQL Server 2000 exposes its data-mining capabilities in a new API, OLE DB for Data Mining (OLE DB for DM), a programming interface designed to generalize the capabilities of data mining for various applications. Through OLE DB for DM, Microsoft provides two data-mining algorithms, but other software vendors can also plug in. Theoretically, analysis applications developed on OLE DB for DM will be able to use new algorithms as they become available.

The two algorithms included with SQL Server 2000 are decision-tree and clustering. Decision trees classify information in a tree structure that helps you predict some attribute of the data. For example, you can feed customer information (e.g., income, marital status, and education) to the decision-tree algorithm to help predict whether a customer is a credit risk. You can use the clustering algorithm to find natural groupings in data. For example, you can feed all your customer information into the clustering algorithm and ask for three groupings. The algorithm might discover one natural group of married, highly educated people with low incomes, another group of single people with high incomes, and a third group of poorly educated people with high incomes. Further analysis might disclose that one group has a very specific buying pattern. Using this information, you can create a cost-efficient, targeted marketing campaign. Notice that the second and third groups didn't reference all three inputs because not all inputs were important in identifying those groups. So, for example, the description for the second group didn't include education level because that group wasn't uniquely identified with any particular education level.

Using Analysis Services


If you data-mine OLAP cubes rather than relational data sources, you aren't using OLE DB for DM directly because Analysis Services exposes data-mining features through its own programming interfaces. On the server, you can access data mining through Decision Support Objects (DSO); on the client, you can access data mining through OLE DB for OLAP or ActiveX Data Object Multidimensional (ADO MD).

I'll highlight the decision-tree algorithm (rather than the clustering algorithm), because it lets you create new dimensions based on decision-tree data-mining results. You can then include these dimensions in a new virtual cube, and you can surf existing dimensions by using the data-mining results.

Create a Mining Model


To create a mining model, open Analysis Manager and expand the tree view in the left pane to open the FoodMart 2000 database. You'll see the Mining Models folder. (A mining model defines the specific data you're mining and the type of predictions you're making with that information.) Right-click the Mining Model folder, and select New Mining Model to launch the Mining Model wizard. The wizard first asks you to choose whether you want to mine relational or multidimensional data. Select Multidimensional, and click Next. Then, select the cube to mine (choose the Sales cube for this example). The wizard's third step lets you choose a data-mining algorithm. Select Microsoft Decision Trees, and click Next.

In the next step, you choose the data-mining case. (A case is the data entity being mined.) Select the Customers dimension and the Name level. The next step lets you choose the type of predictions you'll make. Imagine that a data-mining algorithm has input and output. This example uses information about customers as the input and yearly income as the predicted entity or output. Yearly Income is a member property on the Name level of the Customer dimension. Select A member property of the case level, and choose Yearly Income, as Figure 1 shows.

The next step lets you choose which part of the OLAP cube you'll use to predict yearly income. Check the Customers dimension, all the levels in the Customers dimension, and all the levels under Member Properties in the Name level, as Figure 2 shows. (Note that these are the default settings.) Notice that Yearly Income is chosen as an input as well as an output because we are training the model. To train a mining model, the algorithm needs the correct answers (i.e., the actual yearly income for the existing customers).

Click Next to create an OLAP dimension and a virtual cube. Complete the dialog box, as Figure 3 shows. The Mining Model wizard will create a virtual cube based on the Sales cube and will add the new, data-mined dimension called PredictIncome. Click Next to name your mining model and decide whether to process it immediately. Name the mining model IncomeModel, then click Save and Process Now. When you click Finish, Analysis Services will crunch the data and drop you into the mining model editor to see the results.

Figure 4 displays the OLAP Mining Model Editor. The middle pane shows a few nodes of the resulting decision tree (Figure 5 shows the full tree). The upper-right pane in Figure 4, labeled Content Navigator, graphically represents the entire decision tree, using color for data density—darker colors represent more customers. In the middle pane, I've selected one node, Customers.Name.Member Card = Normal. This node represents all customers for which the member property, Member Card, has a value of Normal. The middle right pane shows a histogram of different yearly income levels for customers in this category. Note that 83 percent of them earn between $10,000 and $30,000, meaning that a normal member card is a very good predictor for customers of this income level. (This conclusion shouldn't surprise you, because the model created Member Cards by mining yearly income and education level.)

Analyze a Cube Without Analysis Manager


Now suppose you want to analyze the Trained Income cube from a client application, and you don't have OLAP Administrator privileges to run Analysis Manager. You can use the Multidimensional Expression (MDX) query in Listing 1 to view the same histogram information. The query returns the number of customers for each income level that meet the criterion Member Card = Normal. The query works by first creating a set (CustMembers) of customers with a normal member card. The query then creates a new measure that counts how many of these customers have the currently selected income level.

Create a Mining Model from a Client Application


Try several different models to discover which work best. Users will want to create mining models from their client PCs, not the server, although current client applications don't yet support this capability. Client analysis applications need to support this capability. You can use extensions to the MDX syntax to create mining models through OLE DB for OLAP or ADO MD. You need the Enterprise Edition of SQL Server to create mining models from the client because this capability relies on the Enterprise Edition's custom rollup feature.

Creating a virtual cube with a data-mined dimension takes three steps. First, create the mining model. The command in Listing 2 generates a decision-tree mining model that predicts yearly income based on a customer's gender, marital status, and education.

Note that the FROM clause calls out the cube name, and each PROPERTY keyword calls out a member property of the Name level of the Customers dimension. After running this command, you'll have an empty mining model that is unique to your Analysis Services connection and not shared by other workstations. (To create shared mining models, you need to use either the Analysis Manager or the DSO programming interface.) The following command fills the mining model with data from the Sales cube; it's a simple INSERT INTO statement that specifies the name of the mining model:

INSERT INTO \[Local Find Salary\]

Finally, you'll create a virtual cube based on the mining model and any other dimensions and measures you want to include from the source cube. Listing 3 shows the code to create the cube. In this statement, I used only the Unit Sales measure and the Customers and Time dimensions. The last dimension, PredictIncome, is created as part of the CREATE VIRTUAL CUBE statement. The NOT_RELATED_TO_FACTS keyword tells Analysis Services that this dimension is not from a table related to the cube's fact table. The FROM and COLUMN clauses further specify which mining model and predicted entity of the mining model will be used to create the dimension. (SQL Server Books Online—BOL—documents most information for these statements, but SQL Server 2000 beta 2, which I used in preparing this article, provided no information about the CREATE VIRTUAL CUBE statement.

Web Listing 1 (which you can find by entering InstantDoc ID 9160 at http://www.sqlmag.com/ and opening the 9160.zip file) displays a short Visual Basic (VB) program that demonstrates running these commands. Unfortunately, you have to use a program like this one rather than MDX Sample, because the commands must be executed with a returning recordset, not the standard multidimensional cellset. Note that the listing executes the three commands I just discussed, then opens a Microsoft proprietary schema rowset. (I added the OpenSchema function to the sample program to demonstrate how to access the mining model content.) My sample listed only the caption of each node in the decision tree. The rowset is structured like a parent-child dimension table. Each row of this rowset identifies a node in the decision tree and has a NODE_UNIQUE_NAME and a PARENT_UNIQUE_NAME field. The parent unique name identifies which row contains the parent of the current row. At the beginning of Web Listing 1 is a list of constants that include globally unique IDs (GUIDs), which you use to access all the provider-specific rowsets related to data mining. I didn't use them in my sample, but you might want to experiment with them if you plan to develop an application with data-mining support.

Easier Than It Looks


When I first read the OLE DB for DM specification, I thought it required a lot of information from the programmer or application user to do simple data mining, and I wondered about its usability for the average decision maker. But after closely inspecting the specification, I realized that it seems complex because of the information required to mine relational data sources, not because of OLAP cubes. OLAP data mining is simple, fast, and useful, and the decision-tree algorithm is a great way to help structure large amounts of data so you can gather useful information.