In 2002, a study from the University of California, Berkeley suggested that it had taken humans 300,000 years to generate 12 million terabytes of data. Now we can generate that much data in only two or three years. In fact, we have more data than we can handle. We seem to think that all of this data will somehow help us make better business decisions, but it's doing nothing more than taking up space until we can turn it into usable business information.
Recently, the subject of BI without the data warehouse has been discussed in the SQL Server community. Initially I thought this idea was terrible; however, after doing some research I’m finding that there's some validity to this approach, with caveats.
Analyzing Raw Transactional Data
OLTP databases are already approaching or exceeding terabytes in size. Duplicating that data in a data warehouse (or a staging database plus data marts) means transporting it from multiple data sources to a centralized repository, which eats up network bandwidth and time and consumes endless CPU cycles in transformation processes; that seems like a budget-buster of an idea. If you can trust the data in your OLTP database and your non-relational legacy systems, then creating what’s essentially a virtualized data warehouse environment could save you time and rescue your budget.
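As a minimal sketch of the virtualized-warehouse idea, the following Python example uses SQLite in place of a real OLTP system (all table, view, and column names are invented for illustration). It builds a "virtual" fact table as a view over a live transactional table, so an analytical query runs against the source data in place rather than against an ETL-loaded copy:

```python
import sqlite3

# Hypothetical sketch: a "virtual data warehouse" built as a view over
# a live OLTP table. Nothing is copied or transformed ahead of time;
# the view is evaluated against the transactional data at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        amount      REAL
    );
    INSERT INTO orders VALUES
        (1, 100, '2024-01-05', 250.00),
        (2, 101, '2024-01-06', 80.50),
        (3, 100, '2024-02-10', 120.00);

    -- The "virtual" fact table: a monthly sales summary defined
    -- directly over the OLTP table, with no staging or load step.
    CREATE VIEW v_sales_by_month AS
    SELECT substr(order_date, 1, 7) AS month,
           COUNT(*)                 AS order_count,
           SUM(amount)              AS total_sales
    FROM orders
    GROUP BY substr(order_date, 1, 7);
""")

for row in conn.execute("SELECT * FROM v_sales_by_month ORDER BY month"):
    print(row)
# ('2024-01', 2, 330.5)
# ('2024-02', 1, 120.0)
```

In SQL Server terms, the same pattern could be expressed with views, linked servers, or distributed queries over the actual OLTP sources; the trade-off is that analytical load now competes directly with transactional load on those systems.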
Why would you want to analyze raw transactional data before it has been verified and validated? There's one situation in which you’d want to do exactly that: when you work for an enterprise-level organization in which there’s a discrepancy between how management believes the business operates and how it actually operates. Analyzing raw transactional data can help you expose the gap between perception and reality.
Traditionally, the data in a data warehouse is cleansed, decoded, reorganized, reordered, and basically sanitized before it's provided to the business community. By using raw transactional data instead, the bad and the ugly data are exposed alongside the good during analysis. Business decision-makers then have an opportunity to evaluate what’s broken and why, as well as a chance to fix it.
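As one hedged illustration of what exposing the bad with the good can look like, this Python sketch (the field names and validation rules are invented) profiles raw transactional rows and tallies the quality problems instead of silently correcting them, so decision-makers can see how often, and in what ways, the data breaks:

```python
# Hypothetical sketch: profiling raw transactional rows before any
# cleansing. Field names and validation rules are invented.
raw_orders = [
    {"order_id": 1, "region": "WEST", "amount": 250.0},
    {"order_id": 2, "region": "WST",  "amount": 80.5},   # typo in region code
    {"order_id": 3, "region": None,   "amount": 120.0},  # missing region
    {"order_id": 4, "region": "EAST", "amount": -40.0},  # negative amount
]

VALID_REGIONS = {"EAST", "WEST", "NORTH", "SOUTH"}

def profile(rows):
    """Tally data-quality problems instead of silently fixing them."""
    issues = {"bad_region": 0, "missing_region": 0, "negative_amount": 0}
    for row in rows:
        if row["region"] is None:
            issues["missing_region"] += 1
        elif row["region"] not in VALID_REGIONS:
            issues["bad_region"] += 1
        if row["amount"] < 0:
            issues["negative_amount"] += 1
    return issues

print(profile(raw_orders))
# {'bad_region': 1, 'missing_region': 1, 'negative_amount': 1}
```

A traditional ETL process would quietly repair or discard these rows before anyone saw them; keeping the tally visible is what turns the raw data into a perception-versus-reality report.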
I’m not saying that this approach will work for every company, nor is it necessary for every BI project. If your firm is resistant to an “emperor has no clothes” type of message, then you probably won’t want to suggest this approach for BI analysis, unless you want to be demoted to the mailroom. However, if your company is a little more open-minded, or if you and your IT staff need to fix a discrepancy or anomaly in a transactional or non-relational data source, then this approach might work quite well.
BI without the data warehouse is also a valid approach for continuous data loading for dynamic data elements (e.g., clickstream data, streaming stock market results, real-time construction project management data rendered as dashboards or scorecards), assuming that you’ll stretch the definition of BI to include dashboards. By eliminating the data warehouse step in this process, you can analyze the data flow immediately. You no longer have to wait days or weeks for data warehouse content to refresh. Having immediate access to the data as it’s being captured could give your company a competitive edge. Of course, you must ensure that your OLTP source systems have enough horsepower to manage data streaming in and data streaming out, with potentially complex processing occurring in between.
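A minimal sketch of analyzing a stream with no warehouse load step in between, assuming a simple numeric feed (the class name and event shape here are invented): a fixed-size sliding window computes a running metric the moment each value arrives, which is the kind of figure a dashboard or scorecard would display.

```python
from collections import deque

# Hypothetical sketch: stream analysis without a warehouse refresh.
# A fixed-size sliding window keeps only the most recent readings,
# so the metric is available immediately as data streams in.
class SlidingWindowAverage:
    """Running average over the last `size` readings (e.g. stock ticks)."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)          # oldest reading falls off automatically
        return sum(self.window) / len(self.window)

ticker = SlidingWindowAverage(size=3)
for price in [10.0, 12.0, 11.0, 15.0]:
    avg = ticker.add(price)
print(round(avg, 2))
# 12.67 -> average of the last three prices (12.0, 11.0, 15.0)
```

In a batch-oriented warehouse, the same average would only appear after the next scheduled load; here it is current as of the last event, at the cost of doing the computation on the live data path.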
Why This Approach Works
There are many reasons to resist building a traditional data warehouse. ROI and delivery-point and delivery-time objectives can be very hard to quantify. Funding for any major IT project depends on a clear business case and a solid contribution to the bottom line. Data warehouses don’t lend themselves easily to these conditions. A data warehouse project is invariably large and complex, it might not have a defined business case, and it might be a perpetual “work in progress” or “project black hole” that consumes resources and careers. A study done by the Gartner Group in 2002 stated that “80 percent of enterprises implementing data warehouses will not properly plan for their implementation efforts, and will underestimate the costs related to the data acquisition tasks by an average of 50 percent—leading to cancelled projects or data warehouses delivered with inaccurate or incomplete data.” I doubt that we’re doing much better today, and with statistics like those presented in the study, it’s tempting to forgo the data warehouse portion of the project and just use raw source data for BI analysis.