Data mining helps users find hidden patterns and useful business information from large data sets. Corporations can use these patterns and rules to improve their marketing, sales, and customer-support operations by better understanding their customers. The most resource-intensive task in data mining is data-mining model training. A business analyst might need to mine millions of cases with hundreds of attributes and hundreds of states (values) to see the patterns that lead to effective revenue prediction or customer segmentation. But you can optimize the performance of SQL Server 2000's data-mining model training. In "Put Data Mining to Work," November 2001, we introduced some data-mining terms and examined how Microsoft implemented data mining in SQL Server 2000 Analysis Services. We also discussed the Microsoft Decision Trees (MDT) and Microsoft Clustering algorithms in some detail and demonstrated how you can use Analysis Services to build data-mining models to solve sample business problems. In this article, we summarize the results of our testing of the MDT algorithm and suggest some parameters that will get you the most consistent, satisfactory model-training performance.
Microsoft added new functionality to SQL Server 2000 to perform data-mining tasks on both relational data (in relational tables) and multidimensional data (in OLAP cubes). In addition to the data-mining algorithms, this functionality includes extensions toand the OLE DB extension OLE DB for Data Mining, which is the result of a collaboration between Microsoft and other data-mining providers. Through the use of OLE DB's Data Shaping Service and a special column content type called a table column, OLE DB for Data Mining lets you use both nested cases and non-nested (simple) cases for training and prediction. (For more information about column content types, see the Web sidebar "Column Content Types" at http://www.sqlmag.com, InstantDoc ID 23092.)
In testing SQL Server 2000's data-mining modeltraining performance, we conducted hundreds of experiments, focusing on various factors in the training data that affect the scalability of performance. These factors are the training data set's parameter configurations. For example, we varied the number of input attributes, the sample size of the training cases, the number of states for each input attribute, and the number of predictable (output) attributes. With each test, we varied the parameters for the test factor across a range of typical values while holding the other factors constant so that we could isolate the effect on performance of the parameter being tested. We included information about the training times for both non-nested and nested cases. You can find more details about this study in the white paper "Performance Study of Microsoft Data Mining Algorithms" (http://www.unisys.com/windows2000/default-07.asp).
About the Performance Tests
After you create the data-mining model, the next step is training. The model-creation process creates only the model's structure; training adds the model's contents. First, you use sample (training) data to populate the model. Then, you use a data-mining algorithm to evaluate that data and find patterns. Training is usually the most time-consuming step in the data-mining process because the algorithm might have to iterate over the training data set several times to find the hidden patterns. After you train a model, you can query it. Of course, when any input attribute, behavior, or environment changes, or when new data becomes available for training, you need to retrain the model to reflect the changes.
We performed hundreds of experiments that looked at model-training performance based on different parameter configurations of the training data set. We divided the tests into two studies: one for the MDT algorithm and one for the Microsoft Clustering algorithm. We further divided each study into a group of experiments for non-nested cases and a group for nested cases. Each experiment consisted of several tests in which we measured the training time as a function of the test factor as we varied that factor across a range of discrete values. Meanwhile, we held all the other factors constant. For example, we measured the effect of the number of input attributes while holding constant the sample size, number of states of each input attribute, and number of predictable attributes. Each measurement produced a data point in the result graph.
Many factors can influence a case, and the number of cases in most data-mining samples can be in the millions. As the scale of the data set expands, identifying which factors have the greatest influence on performance becomes more difficult. When resource constraints rather than business considerations influence the selection of cases, the algorithm can have even more difficulty identifying the predictable attributes in the data. For example, because of time and resource constraints, someone might use sampling techniques to identify only a subset of input cases when a more thorough selection technique would be a better choice. If the input data isn't homogeneous enough, using a sampling technique might leave behind valuable data.
We ran our tests on a Unisys e-@ction Aquanta ES5045R server with four Pentium III Xeon 550MHz CPUs. On this server, we ran Windows 2000 Advanced Server and SQL Server 2000 Enterprise Edition with Analysis Services. We configured the server with a 512MB cache, 4GB of RAM, and an OSM7700 Fibre Channel data storage unit with five 9GB disks in each RAID 5 disk array. The server was connected to a 100Mbps Ethernet network.
We generated the test data from the typical retail banking scenarios that Table 1 shows. We used what we felt were typical value ranges for each of the factors that influence performance. Decision trees are useful for classifying data and predicting outcomes, especially across a few broad categories. Our performance study for the MDT algorithm consisted of seven experiments: four for non-nested cases and three for nested cases. Each experiment measured the effect that one factor had on the scalability of performance over a typical range of values. For each of the seven experiments, Table 2 shows the factor we tested, the case type (non-nested or nested), and the range of values we used for each test factor. For each case, the MDT algorithm considered hundreds of attributes and millions of records to come up with the decision tree that best described the rules for prediction. (Because of space constraints, we haven't included the details of our Clustering algorithm experiments. See the sidebar "Cluster Analysis Experiment," page 56, for more information.)
Test Results for Non-Nested Cases
In most situations, data-mining analysts gather source data from an online transaction processing (OLTP) system. After you cleanse and transform the data, you can store the training data in a case table. For example, in Scenario 1 of Table 1, an analyst might want to predict customer churn risk based solely on customer demographic information. Figure 1 shows the Customer table, which is the data-source table we used for non-nested cases. This case table contains the customer demographic information. Using the MDT algorithm, we conducted four experiments to measure the model-training performance of a non-nested data-mining model.
Number of input attributes. In this experiment, the sample size of training cases was 1 million, each input attribute had 25 different states, and we selected 1 predictable attribute. We studied the MDT training performance by varying the number of input attributes from 10 to 200. Graph 1 shows that the algorithm training time scales better than linearly when the number of input attributes increases from 10 to 100, but performance begins to decrease at more than 100 input attributes. Training for 50 input attributes took 31 minutes 30 seconds; for 100, it took less than 41 minutes. However, when we increased input attributes to 200, the training time increased to almost 130 minutes—more than three times the 100-attribute level. However, the training performance for 200 input cases is still reasonable, considering the number of cases.
Case sample size. In the second experiment, we varied the sample size from 10,000 to 10 million. We kept the other factors fixed, using a case table with 20 input attributes, 25 states per input attribute, and 1 predictable attribute. As Graph 2, page 55, shows, the training times scaled better than linearly from 10,000 to 5 million cases, but above that number, the relative performance began to decrease. For 1 million cases, training the model took 11 minutes 20 seconds; for 5 million, it took less than 35 minutes; but for 10 million, the time was more than 100 minutes, three times the amount for 5 million cases.
Number of states per input attribute. For this experiment, we fixed the sample size at 1 million cases and input attributes at 20, with 1 predictable attribute. Then, we varied the number of input attribute states from 2 to 50. Graph 3 shows that training times again scaled better than linearly when the tested factor was less than a certain value—in this case, 10. But here, something peculiar happens. When the number of states increased above 10, the training time began to drop rapidly.
What caused the sudden drop-off? We believe that the MDT algorithm has trouble finding useful splits for tree growth above a certain number of input attribute states, resulting in a reduced tree depth and reduced training time. The training-time decrease might be good for performance, but it might also affect the prediction accuracy of the resulting tree because the nodes (classes) in the resulting tree might be broader than expected. When the nodes of a decision tree are too broad, its ability to accurately predict cases is decreased. Consequently, we recommend that you limit the maximum number of input attribute states to 20.
Number of predictable attributes. Here, we varied the number of predictable attributes while maintaining the other factors at 1 million cases, 40 input attributes, and 25 attribute states. Graph 4 shows the resulting training performance as we increased the number of predictable attributes from 1 to 32. Examples of predictable attributes would be whether a customer is going to leave or what the credit risk is for a particular customer if the bank issues a new loan to that customer. Because in this scenario the algorithm needs only one data loop for a prediction, the training time is linear when the number of predictable attributes is less than four (i.e., the number of CPUs in our test server). However, when the number of predictable attributes increases to 16, the training time increases by a factor of almost 10 to 5 hours 24 minutes. The algorithm builds a tree for each predictable attribute; this multiple tree-building process happens in parallel. Training time begins to increase as the number of trees increases, possibly because the overhead of switching between trees during the training process also increases. We recommend that you use data cleansing and normalization to reduce the number of predictable attributes. For nested tables, data cleansing and normalization ensures that the predictable attributes are free of irrelevant and dirty data and minimizes the redundancy of values in the domains of the predictable attributes.
Test Results for Nested Cases
The nested table is a new concept introduced in OLE DB for Data Mining. Using nested tables makes data-mining models more powerful. For example, from Scenario 3 of Table 1, we built a model with a nested table to predict a list of banking products that customers might be interested in, based on their demographic information and a list of the banking products they purchased before. Without using a nested table, analyzing this data-mining problem is difficult unless you denormalize the table. For all nested-case experiments, we built models based on two tables: the main case table (Customer), containing the customer demographic information, and the nested table (Purchases), containing the list of unique banking products that the customers purchased. Figure 2 shows the data-source tables we used for our three nested-case experiments.
Number of nested-table states. For this experiment, we varied the number of unique banking products, which is the number of states of the Purchases table's ProductID column. The main case table (Customer) contained 200,000 customers (cases). From the main case table, we used five input attributes. Each customer purchased from 0 to 50 different products, which are stored in the nested table, Purchases. We varied the number of products in a store from 100 to 1000. As the rate at which each customer bought products during each visit remains constant, the Purchases table's nested key (ProductID) becomes sparse (i.e., contains many nulls).
Graph 5 shows the training times for increasing numbers of states in the nested table. The training took significantly more time than similar situations that didn't use a nested-table model. When the number of states of nested-table keys was less than 200, training time increased. Beyond 255 states, the training time starts to decrease because when the nested-table key number is more than 255, the MDT algorithm uses feature-selection techniques to pick the most important keys (ProductIDs). It uses a marginal (single-node) model—basic statistics gathered from the data set—to predict the remaining products. By default, 255 is the maximum number of trees a mining model can have. When the number of nested-table key states is more than 255, as the customer purchase rate remains the same, the nested-table key becomes sparser (i.e., each product is purchased less frequently). Thus, with few related patterns for each of the keys, the trees become smaller and training time decreases.
Number of products purchased per customer. For this experiment, we fixed the sample size at 200,000 and the number of different products in the nested table at 1000. Then, we varied the average number of products each customer bought from 10 to 100. Graph 6, page 58, shows that the MDT training times scale better than linearly with the number of products purchased per customer. This result is reasonable because training time increases roughly linearly with the number of input attributes.
Sample size of master case table. In this experiment, we fixed the average number of customer purchases at 25 and the number of states in the nested table at 200, and we varied the number of cases (customers) from 10,000 to 200,000. Because the key ratio (the number of purchases per customer) is fixed, the number of rows in the nested table will increase when the number of customers increases. Graph 7, page 58, shows that training time increases linearly with the sample size. Training a model with 200,000 customers in the main case table and 10 million rows in the nested table takes about 4 hours; training a model with 100,000 customers takes about 2 hours. Note that training 10 million cases without nested tables takes only 100 minutes, as Graph 2 shows.
Summary of MDT Results
Our results show that the performance of the MDT algorithm is fast and scalable within certain parameter configurations. Specifically, you get the best performance when the number of input attributes doesn't exceed 100, the number of cases doesn't exceed 10 million, the number of input attribute states doesn't exceed 20, and the number of predictable attributes doesn't exceed the number of CPUs available. Data-mining model-training performance is highly predictable and acceptable within this "sweet spot." As you might expect, our results also showed that performance for non-nested cases is considerably better than for nested cases, although nested cases bring you valuable functionality.
Our test results revealed that, within certain parameters, the training performance you can expect from the two Microsoft algorithms (MDT and Clustering) is both efficient and scalable. These factors become increasingly important when you're dealing with the larger, more complex data sets that e-business and customer relationship management (CRM) applications produce.
With Analysis Services, data mining is no longer a domain reserved for statisticians. The complexity of the data-mining algorithms is invisible to the user. Through Microsoft's OLE DB for Data Mining, much of the theoretical complexity of data mining is also invisible to the database developer. So, what does this mean for SQL Server developers and DBAs? Without having to get a degree in statistics or master special-purpose technology, SQL Server professionals can quickly learn to create and train data-mining models and embed these advanced features into their(BI) applications. Applying the knowledge and techniques we presented in this article can help you predict the resource consumption of the data-mining part of Analysis Services so that you can better plan for its impact on your environment and its deployment in your BI and data-warehousing applications.