Data mining helps users find hidden patterns and useful business information in large data sets. Corporations can use these patterns and rules to improve their marketing, sales, and customer-support operations by better understanding their customers. These patterns can help solve many business problems, such as figuring out which customers are most likely to leave or discovering what other products certain customers are most likely to be interested in. In these scenarios, the key activities are discovering inherent (but not always obvious) patterns in the data and then, based on the patterns, making predictions.
In today's e-business environment, data mining is beginning to garner more attention. Many organizations are realizing that it has the potential to be an essential component of their IT architecture and business-development strategy. Because data mining is about exploration and analysis, either by automatic or semiautomatic means, large quantities of data can help business analysts uncover meaningful patterns and rules. Corporations have accumulated very large databases from enterprise resource planning (ERP) or customer relationship management (CRM) applications and other operational systems. Data-mining techniques can help you put your data to work as you extract the patterns.
With SQL Server 2000, Microsoft introduced data-mining functionality as part of Analysis Services. Besides adding this functionality to SQL Server 2000, Microsoft joined several leading data-mining providers to initiate the OLE DB for Data Mining API. The API defines a data-mining query language (OLE DB for Data Mining Query Language) that's based on SQL syntax. This language treats data-mining models as a special type of relational table and treats prediction operations as a special kind of join. For explanations of the terms used in this article, see the sidebar "Data-Mining Terminology," page 40. Analysis Services includes the Microsoft data-mining provider, which is based on the OLE DB for Data Mining standard. The new provider includes two data-mining algorithms: Microsoft Decision Trees (MDT) and Microsoft Clustering. Let's look at how you can use each of them to solve typical business problems.
Microsoft Data-Mining Algorithms
Analysis Services ships with the MDT and Microsoft Clustering algorithms, which are the result of many years of research at Microsoft. Let's look briefly at both algorithms. You can find more information about these algorithms at http://citeseer.nj.nec.com/bradley98scaling.html and http://www.acm.org/sigmod/disc/p_scalableclassifsuj.htm.
MDT. The decision tree is probably the most popular technique for predictive modeling. To understand the basic workings of the decision-tree algorithm, let's look at an example. Table 1 shows a set of data that you could use to predict credit risk. In this example, we generated hypothetical information about customers, including their debt level, income level, type of employment, and an evaluation of their credit risk. Figure 1 shows a decision tree that results from this data.
In this example, the decision-tree algorithm might determine that the most significant attribute for predicting credit risk is debt level. The algorithm therefore makes the first split in the decision tree based on debt level. One of the two new nodes (Debt = High) is a leaf node containing three cases with bad credit and no cases with good credit. In this example, a high debt level is a perfect predictor of a bad credit risk. The other node (Debt = Low) is still mixed, having three good credit cases and one bad. The decision-tree algorithm then chooses employment type as the next most significant predictor of credit risk. The split on employment type has two leaf nodes that show that the self-employed people in this example have a higher bad-credit probability.
For the example, we used a small amount of synthetic data to illustrate how the decision tree can use known attributes of credit applicants to predict credit risk. In reality, each credit applicant would typically have far more attributes and the number of applicants would be very large. When the scale of the problem expands, manually extracting the rules that identify good and bad credit risks becomes difficult. But the MDT algorithm can consider hundreds of attributes and millions of records to generate a decision tree that describes rules for credit-risk prediction.
Microsoft Clustering. The Microsoft Clustering algorithm is based on the Expectation and Maximization (EM) algorithm. The EM algorithm iterates between two steps. In the first step—the "expectation" step—the algorithm calculates the cluster membership of each case. The cluster membership is the probability that a case belongs to a given cluster. In the second step ("maximization"), the algorithm uses these cluster memberships to reestimate the models' parameters, such as the location and scale parameters of Gaussian distribution.
Figure 2 shows a few iterations of the EM algorithm for a one-dimensional data set. The algorithm assumes that the data is drawn from a mixture of Gaussian distributions, more commonly known as bell curves. In the first row of Figure 2, the algorithm initializes the mixture distribution, which is the mixture of several bell curves here. In the second and third rows, the algorithm modifies the mixture distribution based on the data. The iteration stops when it meets specified stopping criteria—for example, when it reaches a certain likelihood-of-improvement rate between iterations.
Most clustering algorithms load all data points into memory, which causes serious scalability problems when you're processing a large data set. However, the Microsoft Clustering algorithm uses a scalable framework that selectively stores important portions of the database and summarizes other portions. The algorithm basically loads data into memory buffers in chunks and, based on the updated data-mining model, summarizes cases that are close together in a Gaussian distribution, thereby compressing those cases. Therefore, the Microsoft Clustering algorithm needs to scan the raw data only once.
Building Analysis Services Data-Mining Models
So, how can you use Analysis Services to build data-mining models that give you vital information about your business? Let's walk through some examples to see how you can train and browse data-mining models and make predictions based on the trained models. Table 2 lists four banking scenarios along with the algorithm best suited for each scenario.
To solve the business problems, we use two relational database tables: Customer and Purchases. The Customer table contains demographic information about bank customers. Demographic information includes the customer's age, income, education level, house value, loans, and so on. The Purchases table contains purchase information about the bank products that a customer subscribes to. This table has information about bank products such as checking, money market, and savings accounts. As Figure 3 shows, the Customer table is linked to the Purchases table by CustomerID. In relational terminology, the Purchases table makes a foreign key reference to the Customer table.
Now let's look at various steps for creating, training, and browsing the data-mining model. The first scenario: Identify those customers who are most likely to churn (leave), based on customer demographic information. Then, we'll look at how to solve the fourth scenario by using the clustering algorithm.
Creating Data-Mining Models
When you create a data-mining model, you need to define the model's structure and properties. According to the OLE DB for Data Mining API, you create a new data-mining model by using the CREATE MINING MODEL statement. In relational databases, the CREATE TABLE statement defines a relational table's structure and properties, including column names and data types, constraints, and storage options. Similarly, the CREATE MINING MODEL statement defines a model's keys, columns, and the specific algorithm and parameters to use in the training operation.
You create a data-mining model by using the Mining Model Wizard in Analysis Manager. After you select the data-mining algorithm, define the input table, and specify the input and predictable columns, the Mining Model Wiz-ard automatically generates the CREATE MINING MODEL statements. We used the MDT algorithm to create a data-mining model to solve the first problem in Table 2 because it's a prediction problem. We chose CustomerID as the case key column, which is the case table's identity key column and uniquely identifies a case. As Figure 4 shows, we selected all the demographic information as input columns and the Churn_Yes_No column as the output attribute we wanted to predict. In the wizard's final step, we named this model Model1_MDT_NonNested.
After you click Finish, the wizard generates a CREATE MINING MODEL statement based on OLE DB for Data Mining syntax, then sends it to the Microsoft Data Mining Provider. The generated CREATE MINING MODEL syntax is similar to standard SQL's CREATE TABLE statement. In Figure 5, for example, the keywords LONG, DOUBLE, and TEXT define the columns' data types and are similar to 's int, float, and varchar. However, the statement has a few extensions that aren't part of T-SQL. For example, the keyword KEY designates a Key content type column (or columns), which uniquely identifies a row in the data-mining model. The keywords CONTINUOUS and DISCRETE are two possible values for Attribute content type, specifying a continuous or discrete column type. The keyword PREDICT designates the data-mining model's predictable column, which is the target column that the user wants to find patterns about.
Processing (Training) Data-Mining Models
After you create the data-mining model, the next step (the last step in the wizard) is training the model. To train the model, you use training data to populate the mining model. The CREATE MINING MODEL statement creates only the model's structure. Training adds the model's contents. Training is usually the most time-consuming step in the data-mining process. The algorithm might have to iterate over the training data set several times to find the hidden patterns.
The OLE DB for Data Mining API uses the INSERT statement to specify the training command. This use of the INSERT statement is the same as the standard SQL INSERT, which populates a table with data. Although you're feeding massive quantities of data into the data-mining model, the model doesn't store any of the data; instead, it stores the patterns that it finds within the data. After the model is trained, the client application can browse its content and perform queries on the new data set.
Figure 6 shows the training process statements for the customer-churning example. From the wizard's View Trace Line window, you can see that table SONISLAP.NEWDMDB.CUSTOMER contains the training data. The columns CustomerID, Income, OtherIncome, Loan, and so on are used for training. (A complete discussion of the training process is beyond the scope of this article; we'll concentrate on training in a future article.)
Browsing Mining-Model Contents
After training the model, you can browse the model's content through Analysis Manager's tree browser. This browser displays the content graphically, letting you navigate through different portions of the model. Studying the contents can give analysts important insight into the data and help them understand the patterns and rules within it. Later, the analyst can apply these rules to new data sets to make predictions.
The tree that Figure 7 shows represents the customer-churning pattern that the MDT algorithm found in the training data set. The algorithm found that income is the most significant attribute for predicting whether a customer will churn: People with lower incomes have a higher churn probability. The algorithm divides the customers' income into four branches. The decision-tree algorithm then chooses age as the next most significant predictor. At the third level of prediction, the algorithm selects education level for customers with less than $49,923.75 in income and house value for those with incomes between $100,040.25 and $124,517.25. Based on this information, the bank can predict the probability of churn for each customer.
Prediction Using Mining Models
You can use the trained mining model to make predictions about new data sets—for example, to predict the probability of potential churn for each new customer. And a new SQL Server 2000 Data Transformation Services (DTS) task called Prediction can help you create a DTS package for prediction.
Using the OLE DB for Data Mining API, prediction statements are SELECT statements that join a data-mining model with a new input table. This special kind of join is a PREDICTION JOIN. The general syntax for prediction is
FROM <mining model name> PREDICTION JOIN
<source data query> ON <join condition>
This statement is very similar to a relational join with two tables, except that one of the input "tables" is an already trained mining model.
Using the Clustering Algorithm with a Nested Table
We've seen how the MDT algorithm works; now let's look at how to use Microsoft Clustering in data mining. To solve Table 2's problem of grouping customers according to the banking products they subscribe to, we used a nested table to build a clustering model. You need a nested table to model this business problem because the products that a customer has purchased are nested under that customer (i.e., one customer case includes a set of products). After starting the Mining Model Wizard, we selected the Customer and Purchases tables in the relational database, then we selected the Microsoft Clustering algorithm. We chose CustomerID as the case key column from the Customer table, then selected all the demographic information as input columns from Customer and all the purchase information as input from the Purchases table. After creating the model, we edited it in the Relational Mining Model Editor. For example, the wizard assumed that Purchases is a lookup table, so it initially made the Customer table the "many" side of the relationship. However, we wanted to use Purchases as a nested table, so we clicked the link between these tables in the editor and reversed the relationship. Figure 8 shows the edited nested model. After we processed this data-mining model, using the INSERT INTO statement that the wizard generated, the trained model looked like the one that Figure 9 shows.
From browsing the model content in the cluster browser, we made the following observations with the help of the description in the node path section:
- Cluster 1 customers have the following characteristics: Most are around 50 years old, their average income is $79,000, they have no extra income source, and many of them have a high education level. They own many certificates of deposit, money-market or savings accounts, and so on.
- Cluster 2 includes customers who are, on average, 40 years old, have an average income of $56,000, and have an average additional income source of about $42,000. Customers in this cluster own a large number of certificates of deposit and savings accounts but a small number of credit cards.
- Cluster 3 includes customers whose average age is 65, who have an average income of $56,347, and who have an average additional income source of $42,645. Customers in this cluster own a large number of credit cards and money-market accounts.
Based on this information, the bank's marketing department can send appropriate mailings for different products to different clusters of customers. The response rate from these mailings will likely be much higher than if the bank used a randomly selected portion of the customer base.
The information in this article is by no means exhaustive; it's simply meant to provide an overview of data-mining concepts. By exploring the techniques that this article presents, you can begin to see how the data-mining features in SQL Server 2000 Analysis Services can help your organization find the patterns and rules to improve its marketing, sales, and customer-support operations and to gain a better understanding of its customers. Stay tuned for an upcoming article, in which we take a detailed look at the training performance of these algorithms.