Exploring the two types of scaling solutions
The more technology pervades people’s daily lives, the more data is generated. Back in 2007, IDC estimated that 45GB of data existed for each person on the planet. Most industry experts agree that the volume of data doubles every two years, so that number is likely much higher now. Many companies want to not only capture this data but also record additional value-added attributes about it. For example, almost every activity that people (and organizations) perform can be tied to a geographic location, which has spurred the addition of spatial data.
Another trend seen in the past few years is that business leaders are increasingly realizing the value of data-driven decisions. When data is cleansed and accurate, the information gleaned from it can truly deserve the label “actionable information.”
Both the data explosion and the embrace of data-driven decision making are creating a business challenge: how to store and analyze massive volumes of data with a minimal total cost of ownership (TCO). This challenge is providing momentum to the scalable data warehousing industry.
The goal of scalable data warehousing is to easily and cost-effectively expand a company's data warehouse and thus increase overall solution ROI. Scalability is especially important in today’s economy because enterprise hardware is by no means cheap. Thus, companies need their data warehouse hardware and software platforms to scale with their analytic needs, without a complete retooling. In response to this need, the scalable data warehousing industry has produced two main types of solutions: data warehouse appliances and data warehouse reference configurations.
So what’s the difference between a data warehouse appliance and a data warehouse reference configuration? When explaining the difference to clients, I like to use the LEGO analogy: An appliance is a castle that’s been preassembled from a set of LEGO pieces with a bit of “special sauce” added in, whereas a reference configuration is a similar (but not the same) set of LEGO pieces that you have to assemble into a castle. Let’s take a closer look at each type of solution.
Data Warehouse Appliances
A data warehouse appliance is a turnkey solution. Appliances come with hardware and software preconfigured for data warehousing workloads. Several data warehouse appliances use massively parallel processing (MPP) hardware. By using MPP hardware in a shared-nothing architecture, you can create a data warehouse infrastructure in which multiple servers (i.e., nodes) can cooperate to process large quantities of data and queries. To scale out data warehouse appliances, you simply need to purchase additional racks (n nodes per rack).
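To make the shared-nothing idea concrete, here is a minimal Python sketch of how an MPP system might spread rows across nodes and answer an aggregate query. It is an illustration only, not how any particular appliance is implemented: the names `node_for`, `distribute`, `local_aggregate`, and `run_query`, the sample `SALES` rows, and the choice of a CRC32 hash on the region column are all assumptions made for this example. The nodes are simulated one after another in a single process, whereas a real appliance would run them in parallel on separate servers.

```python
from collections import defaultdict
from zlib import crc32


def node_for(key: str, node_count: int) -> int:
    """Pick the node that owns a row by hashing its distribution key."""
    return crc32(key.encode("utf-8")) % node_count


def distribute(rows, node_count):
    """Split rows into per-node partitions; no partition is shared."""
    partitions = [[] for _ in range(node_count)]
    for region, amount in rows:
        partitions[node_for(region, node_count)].append((region, amount))
    return partitions


def local_aggregate(partition):
    """Work one node does on its own data: SUM(amount) GROUP BY region."""
    totals = defaultdict(int)
    for region, amount in partition:
        totals[region] += amount
    return dict(totals)


def run_query(rows, node_count):
    """Coordinator: send the query to every node, then merge partial results."""
    merged = defaultdict(int)
    for partition in distribute(rows, node_count):
        for region, subtotal in local_aggregate(partition).items():
            merged[region] += subtotal
    return dict(merged)


# Hypothetical sample data: (region, sale amount)
SALES = [("east", 100), ("west", 250), ("east", 50), ("north", 75), ("west", 25)]

four_nodes = run_query(SALES, node_count=4)
eight_nodes = run_query(SALES, node_count=8)  # scaling out, e.g. adding a rack
```

The query result is the same whether the data sits on four nodes or eight; only the amount of data each node has to scan changes. That property is what lets an appliance scale out by adding racks rather than by replacing the existing hardware.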
Although there’s some disagreement as to which company created the original data warehouse appliance, it’s generally accepted that Netezza (now an IBM company) was responsible for generating the initial mainstream interest. Netezza currently offers the TwinFin and Skimmer appliances.
Other data warehouse appliances that you’ll find in the marketplace include:
- Aster Data Systems' MapReduce Data Warehouse Appliance
- EMC's Greenplum Data Computing Appliance
- Kognitio's WX2 Appliance
- Microsoft's SQL Server 2008 R2 Parallel Data Warehouse
- Oracle Exadata Database Machine
- Teradata Data Warehouse Appliance
- XtremeData's dbX

