Plan for levels of availability and speed of retrieval
Even SQL Server professionals who hear the term “storage design” might think of closets and garages, not data, databases, or data warehouses. However, if you try to build a data warehouse without paying attention to the way the storage is laid out, I guarantee you’ll be back to the drawing board within a few months, trying to figure out what went wrong.
Data warehouse performance is tied to the performance of the underlying storage subsystem. You can design a storage subsystem for maximum performance, or you can take the default and hope it works out. You’re building a high-maintenance system. The selling point of a data warehouse is its timeliness—so every organizational change, every new product line and launch must be reflected there. You (or your staff) will be spending about 80 percent of your time dealing with the extraction, transformation, and loading (ETL) issues that are part and parcel of a data warehouse, so getting the data storage right in the beginning is going to save you many sleepless nights.
Categories of Data
Not all data is created equal, not even in a data warehouse. In most situations there are three categories of data warehouse data: high, moderate, and low. While there are no hard-and-fast rules governing how available data should be, there’s a direct relationship between availability and usability. High availability data is data that’s needed for current business analysis, such as the numbers that reflect the success or failure of a new product launch. Moderate availability data is older data that’s used occasionally, for example, to compare current performance to past performance, such as a comparison of this year’s first quarter (Q1) sales results to the past 5 years of Q1 sales. Low availability data is data with a very limited use, such as information about discontinued products. Once data slips off the radar, you need to think about whether or not you even need to access it. You’ll also need to make decisions about long-term archiving or even disposal.
The level of availability dictates the speed of retrieval. Data that needs to be readily available must be configured for speedy retrieval. The faster your retrieval requirements are, the lower the storage density will be. This translates to cost decisions you must make. You must weigh higher costs for high-performance storage against lower-cost high-density storage. It would be wonderful if you could find storage that’s both high-performance and high-density at the high-density cost, but I’m not aware that such an animal exists—yet. Let’s begin at the beginning, with optimizing storage for high-availability, high-usability data: This is the data that reflects events such as new product launches.
Using the 7D Method to Plan for Storage Configuration
Determining a storage configuration for a data warehouse is like any other project; it’s a multi-step process. You can adapt my trademarked 7D Method to this task. To learn more about the 7D Method, see Seven Steps for Successful Data Warehouse Projects, April 2009, InstantDoc ID 101562. Using the 7D Method steps, first, you must Discover how and why the data warehouse will be used. You’ll have to balance performance expectations with availability and cost constraints, and you’ll have to determine how much fault-tolerance you need to build into the warehouse. Second, you need to Design the configuration. Third, you must Develop the plan that will make the configuration a reality and choose the components. Fourth, you’ll Deploy the configuration plan. Fifth, you’ll manage the Day-to-Day operation of the data warehouse, which from a data storage perspective involves loading, querying, and backup and restore. The sixth and seventh steps, Defend and Decommission, are beyond the scope of this discussion, but don’t forget them.
Discover the Essential Storage Components
To discover how data warehouse storage will work you’ll need to consider things such as the primary query access method for the data warehouse, and whether it will be sequential or random; how much temporary storage will be used and how often; how new data will be introduced into the warehouse, by bulk updates or the “trickle” method; whether new data will simply be added to the warehouse, or whether existing data will be edited, either by overwriting it or by creating an “old value” column in the dimension record to store the previous value (for more on these methods for dealing with change see Data Warehousing: Slowly Changing Dimensions, January 2008, InstantDoc ID 97409 ). Each option adds a new condition to the provisioning picture.
During this discovery process you’ll list the essential physical components of the data warehouse, which should include the database; the data files that contain the actual data records and indexes that reference data records; the transaction log, which records additions and modifications to the data in sequential order, among other things; temporary storage (tempdb), which stores intermediate results and temporary tables during query processing; the database backup area, whether disk or tape; and system files, which include but are not limited to the OS and its swap files, the database management system executables, and any application DDLs or executables that will reside on the warehouse server. Storage for each of these distinct components can be treated differently, depending on how you deal with the considerations raised in the preceding paragraph.
As you know, data is stored on disk, and disk drives fail. The very best way to protect data in the event of such a failure is to use a redundant array of independent disks (RAID). RAID technology distributes data across a series of disks in order to unite these physical devices into one higher-performing logical drive. By distributing the data, access is concurrent instead of sequential, yielding I/O rates faster than rates achieved from a single disk. There are many RAID configurations, but RAID 0 (striping), RAID 1 (which includes RAID 0+1 and 1+0, all ways of mirroring the data), RAID 5 (independently storing data and parity—error correction data—separately), and Advanced Data Guarding (ADG)—which is like RAID 5 on steroids—are used in the data warehouse arena. Because the data warehouse holds such critical information, you really should employ some type of fault-tolerant system.
A typical data warehouse will employ two types of storage, DAS and SAN. There are many articles available which describe the pros and cons of each, but for our discussion it’s enough to realize that DAS commonly uses the 160 MB/sec SCSI protocol and is local to the server, while NAS uses the 100 MB/sec fiber channel protocol and is storage shared across the network.
Design the Storage Subsystem for Querying
After the discovery phase you’ll start the design phase. Of the three major data warehouse operations—loading, querying, and backup/restore—query performance is generally the most critical. If, in fact, query performance is more important to your warehouse than loading or backups, then design the storage subsystem for query retrieval. Rarely is a data warehouse dedicated to a single user (which is the only way, from a storage subsystem perspective, that query performance could be sequential), so you should design for random access. Even if you’re running queries that perform sequential retrievals, if there’s more than one query running at the same time, this means that the disk drives are seeking in order to satisfy the multiple read streams, and that’s random access. Plan to use an array of physical disks to optimize retrievals, and, while you’re at it, use parity-based RAID (RAID 5 or ADG) for fault tolerance.
How much and how often will temporary storage be used? Temp storage is random access, typically with 50 percent reads and 50 percent writes; plan to configure tempdb as RAID 0 or RAID 10. If you’re going to place tempdb on DAS, then limit the number of drives per SCSI bus to the vendor recommendation for maximum performance, which is usually six or seven.
You can eke out a few more performance points if you can place the database and temporary storage on the same set of physical drives, but this may be difficult if your RAID controller won’t let you partition the physical array into multiple logical arrays with differing RAID levels. Ideally, the RAID controller would let you define one logical RAID 5 or ADG array for the database and a second logical RAID 0 or 10 for tempdb. If that can’t happen, plan to place tempdb on a separate drive, or if the data warehouse doesn’t need as much fault tolerance as parity (RAID 5 or ADG) can give it, put the database and tempdb together, configured as RAID 10.
The transaction log is write-intensive and sequential; drive mirroring is a must. Plan to configure the translog as RAID 1 or 10, and put it on a drive separate from the data. System files should be mirrored (RAID 1 or a variant thereof), and placed on DAS, if it’s available, due to the faster access times.
Design the Storage Subsystem for Data Loading
Is data loading going to be a major factor in your data warehouse operations? How will you load the data? Will you use the trickle or continuous updating method, which is common in synchronous or near-real-time data warehouses? See Data Warehouse Workloads and Use Cases, September 2008, InstantDoc ID 99653. Or will you use periodic bulk loading? The answer is usually associated with how much data latency your users can live with.
In the trickle updating scenario, as the source data is updated, so is the data warehouse. This technique gives warehouse users the smallest latency delay. Compared to query processing, updating is a relatively minor event. A page read is done by a thread that directly services the user application; many independent read operations can be conducted concurrently. A database write, on the other hand, happens in cache; afterward, a database function will write the modified memory page to storage. The component that needs to be carefully configured in a trickle-load scenario is the transaction log. Plan to place the translog on a separate drive array from the data files, and make sure the translog is mirrored (RAID 1 or 10).
Bulk updating using a simple recovery model (essentially, without persistent logging) and backing up after the bulk load completes is the norm in many, if not most, data warehouse operations. If you’re planning to bulk-load the data warehouse and log the bulk operations, expect the transaction log to be larger than the amount of data that you’re loading—this is important when you’re sizing the files. As with trickle-load, place the transaction log on a separate drive array, mirrored.
Source data location can have an impact on storage subsystem performance. If the source data is coming in over a network and the network bandwidth is too limited for the data stream, then the performance bottleneck is the network. If possible, relocate a copy of the source data to a local or SAN drive, but preserve the sequential access by placing the source data on a drive array that’s not being accessed during the load operation. Don’t let the source files share a drive with the transaction log, database backups, or temporary storage (tempdb).
Are you presorting the data before loading into the data warehouse? Do it, if you can. Otherwise, the database engine will be sorting it, and using either the destination file group or tempdb in the process. If you cannot presort the data, then set SORT_IN_TEMPDB to ON and configure tempdb as RAID 0 or 10, on its own array, separate from the database and the transaction log. When you’re bulk-loading, is data simply being added to the tables? If so, you can optimize for sequential writes. If the data is actually being updated, as in the case of overwriting it or creating an “old value” column in the dimension record to store the previous value, each record will have to be located on disk before update, resulting in a random access model of bulk data loading. If this is the case, spreading the data over a larger number of drives may improve performance. And one more recommendation for this situation only: consider co-locating the data and tempdb on the same array, even assuming that tempdb will be heavily used. Data access will be randomized in any case, so eliminating the call to a second array may gain you some performance points.
Design the Storage Subsystem for Backup and Restore
Designing for data warehouse backup and restore boils down to one question: are you backing up to tape or to disk? If you’re using some form of tape device, then you don’t need to do anything to the storage subsystem configuration. If you’re backing up to disk, you’ll want to position the backup file on a separate drive array from any of the data warehouse components—the data, the transaction log, tempdb. Configure the backup array as either RAID 5 or ADG. ADG can write at approximately 20MB/sec, so you might consider multiple RAID controllers in order to ensure adequate throughput.
Recommendations for Using the 7D Method
When using the 7D Method I recommend that you spend a lot of time in the discover and design stages, so that when you get to the develop and deploy stages the work goes quickly, with minimum or no reworking. Based on your findings during discovery and the plans you developed in the design stage, you’ll be ready to quickly develop and deploy your data warehouse storage subsystem. Developing the subsystem is no more involved than picking out the parts from your favorite hardware vendor and making sure they all work together, at least on paper, and deploy means, well, deploying. For more information on data warehouse configuration in general, go to Microsoft’s new SQL Server 2008 Fast Track Data Warehouse home page. Here you’ll find resources on how to configure the entire data warehouse setup, from the CPU on out, with capacity calculators and balanced system suggestions. As noted in Implementing a SQL Server Fast Track Data Warehouse, Frost et al. (2009), “The Fast Track approach is specifically focused on building scalable CPU core-balanced configurations to support SSDW sequential I/O data access workloads.” One caveat: the authors mention multiple times the “sequential data access workloads” consistent with SQL Server data warehousing. This concerns me because the only way to ensure true sequential access at the storage subsystem is to single-thread queries. I’m sure this is not happening in any version of SQL Server. I’m also not comfortable with their lack of discussion on fault tolerance, and their lack of options when recommending levels of RAID protection.
The configuration suggestions that I’ve made in this column are for high-access data, which requires a high-performance storage subsystem. Not all the data in a data warehouse falls into this category, however, and it’s counter-intuitive to treat all data equally. Next time I’ll cover the low-availability data, i.e. data that should be stored on high-density, low performance but cost-effective storage subsystems. I hope to also cover long-term archiving and disposal of data that has outlived its usefulness.