The more technology pervades humans’ daily lives, the more data is generated. Back in 2007, IDC estimated that 45GB of data existed for each person on the planet. Most industry experts agree that the volume of data doubles every two years, so that number is likely much higher now. Many companies want to not only capture this data but also record additional value-added attributes about it. For example, almost every activity that humans (and organizations) perform can be tied to a geographic location, which has spurred the addition of spatial data.

Another trend seen in the past few years is that business leaders are increasingly realizing the value of data-driven decisions. When data is cleansed and accurate, the information gleaned from it can truly deserve the label “actionable information.”

Both the data explosion and the embrace of data-driven decision making are creating a business challenge: how to store and analyze massive volumes of data with a minimal total cost of ownership (TCO). This challenge is giving momentum to the scalable data warehousing industry.

The goal of scalable data warehousing is to easily and cost-effectively expand a company's data warehouse and thus increase overall solution ROI. Scalability is especially important in today’s economy because enterprise hardware is by no means cheap. Thus, companies need their data warehouse hardware and software platforms to scale with their analytic needs, without a complete retooling. In response to this need, the scalable data warehousing industry has produced two main types of solutions: data warehouse appliances and data warehouse reference configurations.

So what’s the difference between a data warehouse appliance and a data warehouse reference configuration? When explaining the difference to clients, I like to use the LEGO analogy: An appliance is a castle that’s been preassembled from a set of LEGO pieces with a bit of “special sauce” added in, whereas a reference configuration is a similar (but not the same) set of LEGO pieces that you have to assemble into a castle. Let’s take a closer look at each type of solution.


Data Warehouse Appliances

A data warehouse appliance is a turnkey solution. Appliances come with hardware and software preconfigured for data warehousing workloads. Several data warehouse appliances use massively parallel processing (MPP) hardware. By using MPP hardware in a shared-nothing architecture, you can create a data warehouse infrastructure in which multiple servers (i.e., nodes) can cooperate to process large quantities of data and queries. To scale out data warehouse appliances, you simply need to purchase additional racks (n nodes per rack).
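To make the shared-nothing idea concrete, here is a minimal Python sketch. It is not how any particular appliance is implemented: the function names, the CRC32-based distribution, and the sample rows are all assumptions chosen for illustration, and the "nodes" are just in-memory lists processed one after another rather than separate servers working concurrently. It shows the two steps that matter: each row is assigned to exactly one node by hashing a distribution key, and a query is answered by having every node aggregate only its own rows before the partial results are merged.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Pick the node that owns a row by hashing its distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(rows, key, node_count):
    """Split rows across nodes; each node holds only its own slice."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key], node_count)].append(row)
    return nodes


def local_totals(node_rows, group_by, measure):
    """Work one node does on its own: aggregate only its local rows."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)


def parallel_sum(nodes, group_by, measure):
    """Merge the per-node partial totals into a single result."""
    merged = defaultdict(float)
    # Real MPP hardware would run the local step on all nodes at once;
    # this sketch simply loops over them.
    for node_rows in nodes:
        partial = local_totals(node_rows, group_by, measure)
        for group, subtotal in partial.items():
            merged[group] += subtotal
    return dict(merged)


# Hypothetical sample data, for illustration only.
sales = [
    {"order_id": 1, "region": "East", "amount": 100.0},
    {"order_id": 2, "region": "West", "amount": 200.0},
    {"order_id": 3, "region": "East", "amount": 50.0},
    {"order_id": 4, "region": "West", "amount": 25.0},
    {"order_id": 5, "region": "North", "amount": 75.0},
]

if __name__ == "__main__":
    four_nodes = distribute(sales, "order_id", 4)
    print(parallel_sum(four_nodes, "region", "amount"))
```

Because no node needs another node's rows to compute its partial totals, adding nodes spreads the same work over more hardware, which is the property that scaling out by the rack relies on.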

Although there’s some disagreement as to which company created the original data warehouse appliance, it’s generally accepted that Netezza (now an IBM company) was responsible for generating the initial mainstream interest. Netezza currently offers the TwinFin and Skimmer appliances.

Other data warehouse appliances that you’ll find in the marketplace include:

Data Warehouse Reference Configurations

A data warehouse reference configuration is essentially a bill of materials. The bill includes each hardware and software component needed to create a desired solution. Most reference configurations use symmetric multiprocessing (SMP) hardware, so they’re great for scaling up a data warehouse.

Although a solution created from a reference configuration doesn’t come preconfigured like an appliance, using a reference configuration offers certain advantages over trying to devise your own solution:

  • It takes the educated guesswork out of estimating the hardware required for a data warehouse deployment.
  • Initial deployment and configuration is more efficient.
  • Each hardware component is equally balanced.

Microsoft is the most well-known provider of data warehouse reference configurations. It offers the SQL Server Fast Track Data Warehouse, a set of reference configurations for participating hardware vendors’ server platforms. The hardware vendors that are currently participating are Bull, Dell, EMC, HP, and IBM. For example, Microsoft offers three SQL Server Fast Track reference configurations that leverage the HP ProLiant DL385 G6, DL585 G6, and DL785 G6 server platforms.

There are a few other reference configuration providers, including:

Deciding Which Is Best

With data warehouse appliances, you get a solution that can be easily scaled out, which means you can grow your data warehouse at a much lower TCO. However, depending on the vendor and configuration, the initial appliance can be a sizable investment. The key value-added business driver of data warehouse appliances is that they offer predictable, guaranteed, linear performance and scalability.

With data warehouse reference configurations, you can build a solution that can be easily scaled up. Solutions built from reference configurations are initially less expensive than a data warehouse appliance. However, as you continue to scale up, a reference-configuration solution’s cost can approach that of an entry-level appliance, depending on the vendor and the configuration. In addition, you need an infrastructure expert to help configure and deploy the hardware and software components.

Which type of solution should you use? It really depends on a few key factors, including the initial volume of data the warehouse needs to store as well as the types and patterns of queries being executed. For example, based on my experience with the Microsoft solutions, any data warehouse that needs to initially store 40TB or more is a strong candidate for the Parallel Data Warehouse appliance. For smaller data warehouse deployments, one interesting approach is to first leverage a reference-configuration solution, then migrate to a data warehouse appliance. The migration results in the reprovisioning of the reference-configuration solution as a “spoke” in a “hub-and-spoke” data warehouse architecture. A great technical article on the hub-and-spoke architecture in the Microsoft data warehouse environment can be found on the Microsoft TechNet website.
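The 40TB rule of thumb can be written down as a small Python helper. This is only a sketch of the volume-based part of the decision: the function name and return strings are hypothetical, the threshold is simply the figure quoted above for the Microsoft solutions, and the query types and patterns that also matter are deliberately left out.

```python
# Rule-of-thumb threshold quoted in the text for the Microsoft solutions.
APPLIANCE_THRESHOLD_TB = 40


def suggest_starting_point(initial_tb):
    """Suggest a starting platform from initial data volume alone.

    Hypothetical helper: a real decision also weighs query types and
    patterns, which this sketch ignores.
    """
    if initial_tb < 0:
        raise ValueError("initial_tb must be non-negative")
    if initial_tb >= APPLIANCE_THRESHOLD_TB:
        return "appliance"
    # Smaller deployments can start here and later be reprovisioned
    # as a spoke when an appliance becomes the hub.
    return "reference configuration"


if __name__ == "__main__":
    for size_tb in (5, 40, 120):
        print(size_tb, "TB:", suggest_starting_point(size_tb))
```

Treat the result as a first filter rather than an answer; a workload well under the threshold might still justify an appliance.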


From Terabytes to Petabytes

The future of scalable data warehousing has some aspects that are pretty easy to predict and some that aren’t so easy to foresee. It’s pretty easy to see that companies are going to continue to need the ability to analyze large volumes of data originating from both within and outside their corporate walls. In addition, as the volume of data increases, the speed at which it can be analyzed will also need to increase. Recent technologies such as MPP, in-memory databases, and solid state disks (SSDs) will help make rapid analysis of large data volumes possible.

One aspect of the market that’s not so clear is the movement to the cloud. Some vendors are already beginning to offer cloud-based data warehousing solutions. For example, Netezza is partnering with AppNexus to provide a cloud-based data warehousing service. But the bigger question is, will the companies of tomorrow embrace the cloud for their data warehousing needs? This is more of a business dynamic than a technical challenge. Data warehouses contain crucial corporate data; placing that data in the cloud will require companies to completely trust the cloud service. It will be interesting to witness how tomorrow's companies view and leverage the cloud for data warehousing and business intelligence (BI).