Over the years, data-mining technology has proven extremely useful for solving problems ranging from data quality to predictive analytics. Now, the most important business intelligence (BI) question isn't about how useful data mining is, but how you can make the known benefits of data mining available to a larger group of business users who might not have a data-mining background. Historically, data mining has been viewed as a BI topic that only BI experts could handle. But SQL Server 2000 introduced many data-mining features that let experts more easily perform data-mining tasks in their standard SQL Server environments. And SQL Server 2005 makes data-mining functionality available to all users. New tools let end users report and learn from data and give developers the ability to embed data mining in applications.
This month, I want to look at data mining from the developer perspective. For most developers, data mining is a new topic, and although it's exciting, it's not a simple subject to learn. (For more information about enhancements in SQL Server 2005 and Visual Studio 2005 that are designed to make the products work well together, see Bill Sheldon's June 2005 article "Better Together," InstantDoc ID 46104.) To better understand the challenges of wide-scale deployment of data mining in a large enterprise, let's look not at SQL Server 2005 features but instead at the industry-standard lifecycle of a data-mining project.
In Europe and North America, several initiatives are in the works to create a formal, documented data-mining process. The processes emerging from all of these initiatives are similar; most people agree on the main steps or stages involved in deploying data mining, and any differences are only in the detailed tasks within each stage. Figure 1 shows a diagram of a process model, developed by the Cross-Industry Standard Process for Data Mining (CRISP-DM) consortium, which outlines the six phases of a data-mining project.
Typically, data understanding, data preparation, modeling, and evaluation have been the domain of the data-mining expert. To the extent that a solution requires a deep understanding of the data, data-preparation methods, and data-mining modeling techniques, data mining and its associated benefits remain locked in the hands of a few highly trained specialists. But for many business problems, the solutions at each of the phases of data mining are well understood. Once an organization has defined a business scenario and a data-mining solution, it can deploy the data-mining solution to a larger user community, who can benefit from the results. Typically, you can accomplish this broad deployment by embedding data-mining functionality into a smart-client application that you give to all users. The application lets users work with a simple, constrained interface to access the data-mining results.
To understand when mass-market data mining can be valuable, let's look at some real-life examples of problems in which a smart application can add value. For example, in an online storefront or a call center, the ability to make personalized recommendations based on a customer's buying history is one of the most traditional applications of data mining. Likewise, customer churn (a term that describes a customer leaving and going to a competitor) is a problem that many companies face. To be successful, companies must keep the customers they have as well as attract new customers. By identifying customers at risk of churning, companies can better evaluate whether to accept some loss of customers or design strategies to reduce churning and increase customer retention. Data mining can help identify customers likely to churn by comparing customers who have churned with those who haven't and identifying the characteristics that help predict what a new customer might do.
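As a rough illustration of the churn idea, the Python sketch below compares customers who churned with those who didn't and turns the difference into a crude risk score for a new customer. The customer records, the two characteristics, and the averaging rule are all invented for illustration; a real data-mining model would be trained by an analyst on real history with a proper algorithm.

```python
from collections import defaultdict

# Hypothetical history rows: (contract type, support-call volume, churned?)
HISTORY = [
    ("monthly", "many", True),
    ("monthly", "many", True),
    ("monthly", "few", False),
    ("annual", "many", False),
    ("annual", "few", False),
    ("annual", "few", False),
    ("monthly", "few", True),
    ("annual", "many", True),
]


def churn_rate_by(history, index):
    """Map each value of one characteristic to the fraction of customers who churned."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in history:
        value = row[index]
        totals[value] += 1
        if row[-1]:  # last field is the churn flag
            churned[value] += 1
    return {value: churned[value] / totals[value] for value in totals}


contract_rates = churn_rate_by(HISTORY, 0)
calls_rates = churn_rate_by(HISTORY, 1)


def risk_score(contract, calls):
    """Crude score for a new customer: the mean of the per-characteristic churn rates."""
    return (contract_rates[contract] + calls_rates[calls]) / 2


if __name__ == "__main__":
    print(contract_rates)
    print(risk_score("monthly", "many"))
```

In this toy data, customers on monthly contracts churn far more often than those on annual contracts, so a new monthly customer with many support calls gets the highest score.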
In real-life cases such as these, the data-mining technology needs to do its work in the background, without any human intervention, to come up with the most appropriate recommendation. In other words, we're talking about embedded data mining, in which experts build a model and deploy it through developer-built applications, letting end users consume the results without even knowing that they're actually doing data mining. We're making the applications "smarter" through the addition of data-mining functionality.
Extending the Reach of Data Mining
One of the design principles of SQL Server 2005 data mining is not only to support analysts who are setting up, training, and deploying the model but also to make it easy for business users to consistently and repeatedly consume the results of a data-mining model without any special knowledge of the underlying data-mining technology.
SQL Server 2005 adds a powerful but simple SQL-like language called Data Mining Extensions (DMX), which lets application developers create smart applications and easily embed data-mining technology. With DMX, client applications can call predictive models without having to understand the internals of each model or how the models work. Application developers can call the engine and choose the model that provides the best results based on the data analyzed. Returned data is tokenized (meaning that numeric values are returned in a series of attributes), which lets developers work with simple data rather than some new data format.
DMX syntax is designed to be approachable to people who already know SQL, as the following DMX query shows.
NATURAL PREDICTION JOIN
OPENQUERY('CustomerDataSource', 'SELECT * FROM Customers')
ORDER BY PredictProbability([Churned],
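The query above is shown only in part. As a hedged sketch of what a complete prediction query of this shape might look like, the Python helper below assembles the DMX text that an application could send to the server (for example, over an ADOMD.NET or OLE DB connection). The model name CustomerChurnModel, the CustomerID column, the TOP 25 limit, and the assumption that Churned is a true/false column are all hypothetical. The data source is written as a bracketed identifier, the form used in Microsoft's DMX reference, rather than as the quoted string in the listing above.

```python
def build_churn_query(model, data_source, source_sql, target="Churned", top=25):
    """Assemble a DMX prediction query as text.

    All names passed in are placeholders; nothing here contacts a server.
    """
    # Double any single quotes so the source query stays one string literal.
    escaped_sql = source_sql.replace("'", "''")
    return "\n".join([
        f"SELECT TOP {int(top)} t.[CustomerID],",
        f"  PredictProbability([{target}], true) AS ChurnProbability",
        f"FROM [{model}]",
        "NATURAL PREDICTION JOIN",
        f"OPENQUERY([{data_source}], '{escaped_sql}') AS t",
        f"ORDER BY PredictProbability([{target}], true) DESC",
    ])


if __name__ == "__main__":
    print(build_churn_query(
        model="CustomerChurnModel",         # hypothetical mining model
        data_source="CustomerDataSource",   # data source named in the article
        source_sql="SELECT * FROM Customers",
    ))
```

NATURAL PREDICTION JOIN matches input columns to model columns by name, so the application doesn't need to spell out a column mapping or know which algorithm the model uses.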
SQL Server 2005 includes several common data-mining algorithms: Decision Trees, Association Rules, Naïve Bayes, Sequence Clustering, Time Series, Neural Nets, and Text Mining. In addition to this out-of-the-box functionality, vendors can add data-mining algorithms to the data-mining engine through the plug-in architecture. (For more information about this enhancement, see the Microsoft article "SQL Server Data Mining: Plug-In Algorithms" at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/ssdmpia.asp.) The custom algorithms become peers to the algorithms that come with SQL Server 2005. Third-party applications can use DMX to call algorithms and benefit from other SQL Server 2005 features such as scalability.
By using SQL Server data mining, you can analyze transactional data in either relational data warehouses or OLAP cubes to find frequent data combinations. And you can go beyond simple analysis by applying models you create with SQL Server 2005 data mining to produce real-time recommendations based on live data. Once you've chosen the best model, you can put the model to work by using the DMX language. (For more information about the DMX APIs, see the Microsoft article "SQL Server Data Mining Programmability" at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/SQLDMPrgrm.asp.)
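To make "frequent data combinations" concrete, here is a minimal Python sketch that counts how often pairs of items appear together in a handful of transactions and keeps the pairs that meet a minimum count. The baskets and the threshold are invented for illustration, and the counting approach is a simplification, not the Association Rules algorithm that ships with SQL Server 2005.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the items in one purchase.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "eggs", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def frequent_pairs(baskets, min_support):
    """Return {item pair: count} for pairs that occur in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(BASKETS, min_support=3)

if __name__ == "__main__":
    for pair, n in sorted(pairs.items()):
        print(pair, n)
```

A recommendation feature could then suggest the second item of a frequent pair whenever a customer's order contains the first.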
Data mining, once the realm of specialists, is now available to a wider demographic of users. Although analysts must still define the models to ensure you get appropriate results, the built-in models in SQL Server 2005 go a long way toward helping you get up and running with data mining. And once the models are built, you have a simpler standard process for exposing the functionality and results to end users through your applications. Take a look at SQL Server 2005's BI features, and maybe you can wow your boss or business users with the new wave of smarter applications.
For more information about data mining or any of the other great new BI features for developers in SQL Server 2005, take a look at the product information Microsoft has posted at http://www.microsoft.com/sql/bi and keep an eye on the BI resources listed on the SQL Server Developer Center pages at http://msdn.microsoft.com/sql/2005/busint.