Many analytical tools integrate with Microsoft Excel, and for a simple reason: People who crunch numbers for a living spend a lot of time with spreadsheets. SQL Server 2005's data-mining algorithms offer powerful capabilities for analyzing data, but Microsoft didn't provide any out-of-the box tools for delivering data mining to the desktop until recently. In February 2007, Microsoft delivered the Data Mining Add-ins for Office 2007, which let you take advantage of SQL Server 2005 predictive analytics in Excel 2007 and Microsoft Office Visio 2007. Let's walk through the installation and configuration of the Data Mining Add-ins and look at a comprehensive sample. We'll focus on one add-in tool in particular: the Data Mining Client for Excel.
Installation and Initial Configuration
Before installing the add-ins, make sure Excel 2007 and the Microsoft .NET Framework 2.0 are installed on your machine. Additionally, run Office Setup from the Control Panel Add/Remove Programs applet to ensure that the .NET Programmability Support component is installed under Excel. (This component is included with the default installation of Excel.) Lastly, you'll need access to an instance of SQL Server 2005 Analysis Services, though this instance doesn't have to reside on your machine. I recommend that you also install the AdventureWorks Analysis Services database sample. (The Microsoft article "Running Setup to Install AdventureWorks Sample Databases and Samples," http://msdn2.microsoft.com/en-us/library/ms143804.aspx, explains how to install the sample.)
Next, download and install the Data Mining Add-ins. (For download details, go to http://go.microsoft.com/fwlink/?LinkId=82754.) In the Feature Selection window, which you'll see after you begin the installation process on your local machine, make sure the following features, at a minimum, are set to be installed on the local hard drive:
- Data Mining Client for Excel, which lets you run through the full datamining process within Excel 2007
- Server Configuration Utility, which lets you set Analysis Services server properties specific to data mining and create a database for storing models.
After the installation is complete, click Start, All Programs, SQL Server 2005 DM Add-Ins, Server Configuration Utility.The configuration utility sets the Analysis Services Data Mining\AllowSessionMiningModels server property to True, then creates the DMAddinsDB database to use for creating temporary and permanent data models. You can also view the documentation that comes with the Data Mining Add-ins if you prefer to configure your server and create a database manually.
A Data-Mining Primer
At a high level, data mining is the process of finding information (e.g., patterns, trends) in large volumes of data in an automated or semi-automated manner. The classic example of the value of data mining is its use in cross selling. For example, when I buy a book online, the vendor might query its database for customers who bought the same book and display other titles those customers purchased.
SQL Server 2005 ships with nine data-mining algorithms, each of which is suited to a different type of problem, such as forecasting sales, identifying fraudulent transactions and claims, or segmenting customers into different profiles. Each situation involves the use of historical data to build a model to predict a future state or to better understand the current state. Data-mining literature (including SQL Server Books Online—BOL) typically groups algorithms into different types (or tasks). An example of a data-mining type is Classification, which categorizes historical data around a predictable attribute. The Microsoft Decision Trees algorithm is a classification algorithm.
So, data mining helps solve problems, but how exactly does one mine data? Using BOL as a reference, let's look at the data-mining process in terms of six basic steps:
- Define the business problem.
- Prepare the historical data.
- Explore/validate the historical data.
- Build the data-mining model.
- Explore/validate the model.
- Deploy and update model.
The example I use is similar to Lesson 2: Building a Targeted Mailing Scenario in the data-mining tutorial that ships with SQL Server 2005. (For more information about the tutorial, see SQL Server 2005 BOL "Data Mining Tutorial," http://msdn2.microsoft.com/en-us/library/ms167167.aspx.) Assuming you've installed the AdventureWorks Analysis Services database sample, you can review the BOL datamining tutorial and the accompanying data-mining models for additional insight.
A Data-Mining Scenario
A typical data-mining scenario you might encounter is that of classifying customers so you can build a targeted mailing list. Let's walk through this scenario using the BOL six-step data-mining process as a guide.
Define the business problem. Let's say you're working in the marketing department of a famous bicycle company, AdventureWorks. The marketing director has decided to run a promotion for a new line of bicycles. You've acquired a long list of potential customers that includes key demographic data such as age, gender, marital status, and occupation. However, the marketing budget is tight. You need to minimize the cost of a direct mail campaign by identifying the set of potential customers most likely to purchase a bike, as indicated by customer purchase history. To identify these customers, you'll use the Data Mining Client for Excel.
Prepare the historical data. In this example, the task of preparing historical data has already been done for you. Click Start, All Programs, Microsoft SQL Server 2005 DM Add-ins, Sample Excel Data, to open the sample workbook named DMAddins_SampleData.xlsx. (You might want to make a backup copy of this workbook, which is located by default in the C:\Program Files\Microsoft SQL Server 2005 DM Add-Ins folder.) Navigate to the Source Data worksheet to view the demographic data about the existing customers, including a field indicating whether they've purchased a bike.
Explore and validate the historical data. To explore the historical data, you'll need to understand the various data attributes, group data into buckets to reduce complexity, look for outliers (i.e., problematic data values that are far outside the expected range and that might adversely affect or skew a model), and potentially change the data. Click the Data Mining tab at the top of the Excel Ribbon. Click Explore Data to open the Explore Data Wizard.
Click Next and make sure the ‘Source Data'!'Source Data' Table is selected. In the Select column drop-down menu, select Yearly Income. Click Next, and you'll see an Explore Data chart similar to the one in Figure 1, which distributes incomes across eight buckets. (You can reconfigure the number of buckets as needed.) By exploring data in this manner, you can find and, if necessary, change or delete outliers that might skew the data model. Go ahead and cancel out of the wizard.
Let's assume you don't want to consider yearly incomes above $150,000. In the Data Mining tab of the Ribbon, click Clean Data, then select Outliers to bring up the Outliers Wizard. Click Next, make sure the ‘Source Data'!'Source Data' Table is selected, and click Next again. In the Select column dropdown menu, select Yearly Income, then click Next. In the Specify Thresholds step, change the Maximum value to 150000. Click Next, and in the Outlier Handling step, select Delete rows containing outliers. Click Next one more time, then select Copy sheet data with changes to a new worksheet. A worksheet named Clean Data will automatically be created. Note that you're creating this worksheet just to get an idea of how it's done, so you can delete it whenever you wish.
Within the Clean Data feature, the Relabel option can help you clarify or distinguish data attributes. Click Clean Data and select Re-label. In our example, the possible values for the Home Owner and BikeBuyer columns are Yes and No. If it better suits your needs, you can use Re-label to change the Home Owner values to something more descriptive, such as Rent or Own.
Build the data-mining model. Before you build the data model, you'll need to partition the historical data into two buckets: one that you'll use to build the actual model and another to test the model's accuracy. In the DMAddins_SampleData workbook, you can see the worksheets titled Training Data and Testing Data, which are partitions of the original Source Data worksheet. You'll use these pre-partitioned worksheets in a moment. But first, to get an idea of how this partitioning was done, you're going to create your own partitions on your own temporary worksheets, which we'll label TempTrainingData and TempTestingData. You won't use these partitions to build the actual model, because partitioning is random and each reader's model would be different, so you can delete these temporary worksheets when you're done creating them.
To partition the historical data, click the Data Mining tab of the Ribbon, click Partition Data, then click Next. Make sure the ‘Source Data'!'Source Data' Table is selected and click Next again. In the Select Sampling Type step, stick with the default Split data into training and testing sets option, click Next, and keep the default value of 70.0 for the Percentage of training data field. Click Next again. Name the temporary training worksheet TempTrainingData and the temporary testing worksheet TempTestingData, then click Finish. You now have created two worksheets containing a random sample of the historical data—70 percent in the TempTrainingData worksheet and 30 percent in the TempTestingData worksheet. You can delete these temporary worksheets now.
Next you need to define a connection to an instance of Analysis Services, which is where you'll build the actual data-mining model. In the Data Mining tab of the Ribbon, click Connection. (If no connections have been created, the command label will be <No Connection>.) Click New and enter the connection information for your instance and for the DMAddinsDB database, a database that gets created when you use the Server Configuration Utility I mentioned earlier. Click OK to close the Connect to Analysis Services dialog box, click Make Current, then close the Analysis Services Connections dialog box.
Now you can build the model. In the Data Modeling section of the Data Mining tab, you'll see several commands for creating different types of models (e.g., Classify, Estimate, Cluster). Click Classify to start the Classify Wizard, then click Next and make sure the ‘Training Data'!'Training Data' table is selected. Click Next again, and in the Column to analyze box select BikeBuyer, which then becomes your predictable attribute—the attribute that's affected by other attributes in a manner you wish to understand. In the Input columns data grid, which Figure 2 shows, clear the check box for the ID column, which isn't relevant to helping you understand whether someone is likely to be a bike buyer. Click Next again.
In the Finish step, you can change the default structure, model name, and descriptions. We haven't talked about data-mining structures, but you can think of them as a schema definition of your historical data. By creating a structure, you can apply multiple models to your historical data without having to redefine the historical schema. Leave the Browse model option selected, select the Enable drillthrough option, then click Finish to deploy the data-mining model to the Analysis Services instance for processing.
Explore and validate the model. After you complete the Classify Wizard, the Browse window will be displayed. (If you close this window, you can view it later by clicking Browse in the Data Mining tab.) To make your Decision Tree tab look like the one that Figure 3 shows, you need to make some slight adjustments. First, change the value in the Background drop-down box to Yes. By changing the background value, you can more easily see the likelihood of Bike Buyer being equal to Yes in each node: The darker the shading of the node, the higher the likelihood. Set the Show Level slider to 4 to adjust the number of levels shown in the tree. Then click the Size To Fit toolbar button (the one that has four red arrows in the shape of an X).
By looking at the decision tree, you can now start to make some observations. If you hover over (or click) the All node, you'll see that the 7,000 cases include 696 bike buyers. The first split in the tree shows that age is the most significant factor in influencing the customer's decision to purchase a bike. The second split shows that the next-significant factor is either the number of cars owned or, for customers from 32 to 53 years old, yearly income. The dark background color of the Cars = 0 node shows you where the largest percentage of likely buyers lies within the four levels. To see the cases behind this node, right-click the node and select the Drill Through menu item; a new worksheet will be created in your workbook with the applicable case data.
Next, switch to the Dependency Network tab and select the Bike Buyer node. Using the color coding at the bottom of the tab, you can see that Bike Buyer (in light blue) is the selected (or predictable) attribute and that the orange nodes are the input nodes that predict Bike Buyer. The links, which appear as arrows, point from the input nodes to the predictable node. You can adjust how many links the viewer shows by adjusting the slider (located on the left side of the window). Lowering the slider shows only the strongest links. For example, if you lower the slider to the bottom, only the Age link remains, which is consistent with the first split in the decision tree.
Now you need to test, or validate, the model for accuracy. Recall that the Source Data worksheet has been split into two partitions, one (the Training Data worksheet) to build the model and the other (the Testing Data worksheet) to test it. In the Data Mining tab, you can choose among three commands within the Accuracy and Validation section. Each option takes a set of test cases as input. The Classification Matrix command runs all the test cases through the model and then scores, or classifies, the results of the prediction against the actual values in the test data—in other words, it tells you whether your model predicted Bike Buyer = Yes when it was supposed to. The accuracy chart (aka a lift chart) shows the results of the model's predictive capabilities against a "random guess" and "perfect" model charted for comparison. The profit chart is similar to the accuracy chart but lets you input cost and revenue data to determine the point of maximum return.
For our example, let's work with the Classification Matrix command. Click Classification Matrix in the Data Mining tab to bring up the Classification Matrix Wizard. Click Next three times, then, in the Select Source Data step, make sure the ‘Testing Data'!'Testing Data' table is selected. Click Next again, then click Finish; a worksheet named Classification Matrix will be created. According to this matrix, the model is 89.13 percent accurate in terms of predicting both bike buyers and non-bike buyers. You might be dismayed to discover the model correctly identifies only 16.12 percent of the actual bike buyers. However, keep in mind that the model correctly identifies non-bike buyers 97.37 percent of the time. If you recall, when we defined our business problem, we needed to minimize the cost of a direct mail campaign. This model identifies only a small subset of the likely buyers, but it eliminates most of the likely non-buyers.
I like to think of the accuracy this way: According to the test results, the model identified 120 likely bike buyers (2.63 percent of the non-bike buyers and 16.12 percent of the bike buyers). Of those 120, 49 were bike buyers, meaning the model has an accuracy of 41 percent. That percentage is much better than a random-guess approach (e.g., flipping a coin), which would be accurate about 10 percent of the time (the Training Data worksheet consists of 3,000 rows, of which approximately 10 percent are bike buyers).
On a side note, you could use another technique to make your model more aggressive. This technique, called oversampling, involves manipulating the source data to increase the frequency of rarely occurring data. The add-in's partitioning wizard supports oversampling, and you can learn more about this technique at the Microsoft Web site (see "Chapter 24 - Effective Strategies for Data Mining," in the SQL Server 2000 Resource Kit at http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part6/c2461.mspx?mfr=true). With that said, let's stick with the current model and look at a profit chart for additional verification.
Merely for the purposes of this training example, let's make some assumptions about the size of our potential customer list, the variable and fixed costs, and gross revenue. Click the Profit Chart command in the Data Mining tab, click Next twice, then, in the Specify Profit Chart Parameters step, enter the following values:
- Mining column to predict: BikeBuyer
- Value to predict: Yes
- Target population: 35000
- Fixed cost: 5000
- Individual cost: 5
- Revenue per individual: 35
Click Next and make sure the ‘Testing Data'!'Testing Data' table is selected. Click Next again, then click Finish; a new worksheet will be created. Figure 4 shows the profit chart that's part of this worksheet.
According to the chart, you maximize profit by targeting 11 percent of the target population. The worksheet also indicates the optimal "probability threshold" is 15.14 percent. In other words, when the datamining model makes a prediction, the model also computes a probability for the prediction. The worksheet indicates we should target customers with a probability (to purchase a bike) of 15.14 percent or higher. You'll see this probability value in the final step.
Deploy and update model. In this example, Excel is your client tool, so no deployment is necessary. Let's go ahead and run your model against a small set of potential customers. In the Data Mining tab, click Query, then click Next twice. In the Select Source Data step, make sure the ‘New Customers'!'Table 17' table is selected. Click Next again and check that all the relationships (except BikeBuyer, of course) are properly mapped from the worksheet to your model attributes.
Click Next again, then click Add Output. In the Name field, enter ProbabilityToBuy. Select BikeBuyer from the Columns list, PredictProbability from the Column Functions list, and Yes from the Function Parameters list. Click OK to close the dialog box. Complete the Wizard by clicking Next and then Finish. A new column, ProbabilityToBuy, is added to the New Customers table. Using the probability threshold of 15.14 percent from the Profit Chart worksheet as a guide (and rounding your new column to the nearest hundredth), you should target 19 of the 78 potential customers in this set. You can now run a query against all of the potential customers and send the final results to the marketing director. Congratulations—you're finished!
More uses for the Data Mining Client for excel
We used the Data Mining Client for Excel add-in to walk through the entire datamining process. You can also use the add-in to browse or query an existing model. For example, if you've installed the AdventureWorks Analysis Services database sample, you can establish a connection to the database (using the Connection command at the Data Mining tab), then browse or query any of the models. A practical idea is to use Excel data as input to a data-mining query. With this approach, the model is ultimately built and maintained by IT folks (and might be processed using very large volumes of historical data) but is available to end users for validation, browsing, and querying.