A common question I hear from my clients is “how much storage do we need?” Now wait a second, isn’t storage cheap these days? Not quite: consumer storage is cheap; enterprise storage is still not what I would classify as cheap. Add to this the fact that Enterprise Data Warehouses (EDWs) commonly grow into the multi-terabyte range, and planning storage capacity becomes a serious matter. Consulting firms that simply ignore this common client question are missing a value-add opportunity. Your clients need to know how much space to plan for, and if you simply say “roughly X TB” you create two problems. First, the client may purchase too much storage or, worse, not enough. Second, the client will have no idea how much storage to plan for in the coming years. Do both yourself and the client a favor: research the storage requirements of the BI solution you have proposed!

Estimating Data Warehouse Storage

To estimate a data warehouse’s storage requirements, start with a dimensional model designed in any of a number of tools, including Visio, ERwin, or Excel. The main point is to have the actual data types specified, along with their storage-affecting settings such as variable-length string columns. When estimating your data warehouse storage requirements, assume every record uses the maximum allowed size for its data types. The thinking is that it’s better to overestimate than underestimate when it comes to storage (to a degree).
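As a minimal sketch of the worst-case approach, here is a back-of-the-envelope row-size calculation for a hypothetical FactSales table (the column names and widths are illustrative assumptions, not from the text; byte sizes follow standard SQL Server data type storage):

```python
# Worst-case bytes per row for a hypothetical fact table, assuming every
# variable-length column is filled to its declared maximum.
columns = {
    "OrderDateKey": 4,   # int
    "ProductKey": 4,     # int
    "CustomerKey": 4,    # int
    "OrderNumber": 50,   # varchar(50), assumed filled to max length
    "SalesAmount": 8,    # money
    "Quantity": 4,       # int
}

row_bytes = sum(columns.values())
print(f"Worst-case bytes per row: {row_bytes}")  # prints 74
```

Multiplying this per-row figure by the expected record count gives a deliberately pessimistic first estimate, before accounting for indexes and page-level overhead.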

Now, obtain from the client how many records they believe will be processed and stored in the data warehouse based upon business requirements. Note that the storage requirements of fact tables are directly related to their grain definition. If your fact tables must store data by day, you will certainly require more storage than if you only need to store the various measures by week. This is why it’s very important to complete a solid dimensional modeling phase before moving on to other data warehousing tasks such as storage estimates. At this point you should instantiate the dimensional model in a SQL Server relational database.
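To make the grain impact concrete, a quick comparison of daily versus weekly grain over one year, assuming a hypothetical dimensionality of 1,000 products by 50 stores (figures chosen purely for illustration):

```python
# Fact row counts for daily vs. weekly grain over one year,
# assuming one row per product/store combination per period.
products, stores = 1_000, 50

daily_rows = products * stores * 365   # one row per day
weekly_rows = products * stores * 52   # one row per week

print(daily_rows)                      # 18,250,000 rows
print(weekly_rows)                     # 2,600,000 rows
print(daily_rows / weekly_rows)        # roughly 7x more at daily grain
```

The roughly sevenfold difference shows why the grain decision must be locked down before any storage estimate is credible.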

Populate your schema with a sample percentage of the client’s estimates. For example, if the client expects one million records per month and you must retain three years’ worth of data, then you can expect to hold a total of around 36 million records. Based upon those estimates, populate your fact tables with 1, 5, 10, or even 25 percent of the total records expected to be retained. Rebuild your fact tables’ indexes. Finally, obtain the data and index space used by your fact tables and scale those values up according to the sampling percentage you used.
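The extrapolation step can be sketched as follows. The 1 million rows/month and three-year retention figures come from the example above; the sample space-used numbers are hypothetical placeholders for what sp_spaceused would report after loading a 5% sample:

```python
# Extrapolate full-scale storage from a sampled fact table load.
total_rows = 1_000_000 * 12 * 3   # 36 million rows retained over 3 years
sample_pct = 0.05                 # we loaded a 5% sample (1.8M rows)

sample_data_mb = 180.0            # hypothetical data space used by the sample
sample_index_mb = 45.0            # hypothetical index space used by the sample

scale = 1 / sample_pct
est_data_mb = sample_data_mb * scale
est_index_mb = sample_index_mb * scale

print(f"Estimated data space:  {est_data_mb:.0f} MB")   # 3600 MB
print(f"Estimated index space: {est_index_mb:.0f} MB")  # 900 MB
```

Rebuilding the indexes before measuring matters: index space on a freshly bulk-loaded table can be badly fragmented, which would skew the numbers you scale up.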

Estimating Analysis Services Cube Storage

Clients need to know not only how much storage to obtain for the relational dimensional model but also for the cubes (multidimensional objects). Estimating cube storage requirements is a more difficult task; however, the same basic logic applies. Design and build your sample cubes as close as you can to how they will be built in production. The most important settings for estimating cube storage are the partition aggregation and storage mode settings. In a development environment you will probably have only a single partition, so there is a margin of error when estimating cube storage requirements, since a production environment will likely have multiple partitions with various storage settings. Finally, perform a full process on the Analysis Services database once you are done creating the sample cubes and their related settings.

ROLAP and HOLAP with 0% aggregations incur zero storage overhead beyond the cube’s metadata. If you apply aggregations to the HOLAP storage mode, the aggregations are stored in the partition files (cube), thus consuming space. If you apply aggregations to the ROLAP storage mode, indexed views are created (and populated) in the underlying relational database, which means ROLAP with aggregations does in fact incur some storage overhead. MOLAP stores both the grain data and the aggregations in the partition files (cube). Finally, remote partitions can be used to place less frequently accessed data on cheaper disks.

With your sample cubes built and processed, you can begin the cube estimation process. Obtain the sample cubes’ estimated partition size(s) in Management Studio. Next, scale the partition storage estimates up according to the sampling percentage you used (just as we did for the fact table estimates). Note that this method assumes both the storage mode and the aggregations are consistent throughout each cube, which is fairly unlikely in production environments. I mentioned the storage-mode margin of error earlier, but the aggregation percentage affects cube partition storage estimates as well. In the real world aggregation designs change; however, for estimation purposes it’s easiest to assume a static setting of X%.
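The cube-side scaling mirrors the fact table extrapolation. This sketch assumes a 10% sample load and a hypothetical 250 MB partition estimate read from Management Studio; storage mode and aggregation percentage are held constant, per the simplifying assumption above:

```python
# Scale a sample MOLAP partition's estimated size up to production volume.
sample_pct = 0.10             # cube was processed against a 10% data sample
sample_partition_mb = 250.0   # hypothetical estimated size from Management Studio

estimated_full_mb = sample_partition_mb / sample_pct
print(f"Estimated full partition size: {estimated_full_mb:.0f} MB")  # 2500 MB
```

If production will split the measure group across several partitions with different storage modes, repeat the calculation per partition group rather than scaling one blended number.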

By using some variation of the methods above, you will have a solid foundation for providing accurate estimates of the storage required for both dimensional data warehouses and their associated cubes.