Manage nulls to keep analysis applications running efficiently
If you're going to use an OLAP server to develop an analysis application, you must learn the server's query language—in the case of Analysis Services, MDX. And to write efficient, accurate queries, you have to understand how that language treats nulls (empty data cells). The treatment of empties usually varies between OLAP servers. Even though two OLAP servers (e.g., Analysis Services and Hyperion's Essbase) might both support MDX as their query language, they don't necessarily treat empties the same way.
Empty data cells are especially important to understand in an OLAP application because OLAP cubes are sparse. In other words, many cells in an OLAP cube aren't populated with data—no fact-table records correspond to that portion of the cube. For example, if you're analyzing your customers for the Northeast region of the United States, you don't want to see all your company's customers; you want to view customer data (transactions) for the Northeast. Customer data outside that region isn't relevant to your analysis and appears as empty data cells. Once you understand how Analysis Services treats empties, you can use some tricks for writing MDX that efficiently handles empty cells.
Empty cells occur for a couple of reasons. To understand these reasons, let's first define what empty means in an OLAP cube. Analysis Services builds a cube from a fact table. The fact table contains facts (i.e., records) that correspond to cells in the cube. For example, in the FoodMart 2000 Sales cube, the fact table is called sales_fact_1997. Each record in this table represents a particular product that a customer purchased in a sales transaction. The record contains measures (numeric values) such as store_sales, store_cost, and unit_sales for the transaction. The record also contains references to the date, customer, product, promotion, and store for that transaction. But many unique combinations of an individual customer, product, promotion, and store don't appear as records in the fact table. For example, a particular customer won't buy every product at every store. If you wrote an MDX query that requested a combination of a customer, a product, and a store and no such transactions existed, you'd get an empty (null) result.
Nulls can also get into your data in another, less common, way. Your fact table might include a record that contains a null for a measure value such as a transaction containing a customer and product that generated no store sales (e.g., a promotional giveaway). In such a situation, a transaction exists for this particular cross section of the cube, but that measure doesn't apply. I'm fairly sure that the FoodMart Sales cube doesn't have any null measure data, so to test how Analysis Services treats this situation, I added to the sales_fact_1997 table a record for September 1998 that had a null value for store_cost. I reprocessed the cube and tested null handling by using the supplied MDX Sample Application. Then, I used the MDX statement that Listing 1 shows to query for store cost for all months in the Sales cube.
As I suspected, the data that the query returned included a column for September 1998 that showed a 0 for the Store Cost measure. Even though the only record for September 1998 contained a null for Store Cost, Analysis Services considers the cube cell containing a 0 value to be nonempty. Analysis Services inserts the 0 value because of the way the aggregations of an OLAP cube work. Figure 1, page 40, shows a conceptual diagram of an OLAP cube aggregation. The bottom row of boxes represents the leaf-level cells of the cube. These boxes contain an entry for each measure in the cube. At the point in processing when Analysis Services has processed the dimensions but hasn't yet loaded the fact-table data, the diagram shows the cube structure (the lines and titles for December 1997, Jeffers Grits, Store 5) but no boxes. Analysis Services reads the fact table, looks at each record, and creates a box that corresponds to the combination of leaf-level dimension members. Then, Analysis Services uses the measures from the record to initialize the measure values in the box. If the box already exists because another fact-table record corresponded to the same leaf-level dimension members, Analysis Services aggregates the measures with the entries in the box. For measures such as Store Cost that aggregate as a sum, Analysis Services treats nulls in the records as zeros (just as Microsoft Excel would, for example).
After Analysis Services processes the fact table, some of the cube's leaf-level cells have boxes and some don't. Figure 1 shows that December 1997, Jeffers Grits, Store 5 doesn't have a measure box because that combination of dimension members has no associated facts.
When Analysis Services tests for empty cells, it's really testing to see whether a box of measures exists. Any cross section of the cube that has associated facts will have a box, regardless of the measure values in that box. So in my example, when I created a new record that showed a Store Cost of null, it didn't matter that the measure value was null; what mattered was that a record was associated with September 1998.
Dealing with Empties in MDX
As I mentioned, because OLAP cubes are sparse, you need to know how to handle empty cells when you write MDX queries. If you don't figure out how to eliminate empty cells from your result sets, you might end up with serious performance problems and large, unwieldy result sets because your queries will access more data or return more data than necessary. The following four MDX techniques will help you avoid empty cube cells.
The first technique is to use the NON EMPTY keyword. NON EMPTY is a useful keyword because it operates on the results of one entire axis of a query (e.g., the rows). For example, Listing 2 shows a query that uses the CROSSJOIN() function to create every combination of the top 10 customers and all the stores, but the NON EMPTY keyword eliminates the combinations that lead to empty data cells. The result set is a list of the top 10 customers and the stores where they shopped. (In this case, all top 10 customers shopped at store 16.)
A second, more efficient way to eliminate empty cells is to use the NONEMPTYCROSSJOIN() function, which combines two sets and produces only the nonempty combinations of the members. This technique is more efficient than the technique that Listing 2 shows in terms of time and memory because Listing 2's code creates all combinations of the stores and customers before eliminating the empty cells.
Be aware that the NONEMPTYCROSSJOIN() function also eliminates any combination of members that includes a calculated member. Listing 3 shows an example in which the query looks for nonempty combinations of years and measures. Even though the combination of 1997 and Profit is nonempty, the query eliminates the combination because Profit is a calculated member. If you rewrite Listing 3's query so that it uses the NON EMPTY keyword separately from the CROSSJOIN() function, the query includes the calculated measures Profit and Sales Average.
Even though the function eliminates calculated members, NONEMPTYCROSSJOIN() is still useful. In queries that combine large numbers of members from different dimensions, the performance of NONEMPTYCROSSJOIN() can be more than an order of magnitude faster than a typical CROSSJOIN() function. You should always consider using NONEMPTYCROSSJOIN() in large, complex queries.
A third technique for eliminating empties is using the MDX function ISEMPTY(). This function accepts a tuple (a list of dimension members from different dimensions) and tests whether the tuple has an associated box of measures. If no box exists, the function returns True. ISEMPTY() is useful when you're creating conditional logic or filtering members from one dimension. I frequently use the ISEMPTY() function to determine which time dimension members have data loaded in a cube. In the code that Listing 4 shows, I use ISEMPTY() to determine the most recent month that contains data in the Sales cube. Then, with the most recent month as a point of reference, I use the LASTPERIODS() function to display the most recent 3 months. You can use this technique to create reports that shift forward in time when you load new data into the OLAP cube.
The fourth MDX technique uses the COALESCEEMPTY() function to eliminate empties. COALESCEEMPTY() replaces empty values with an alternative formula. The code in Listing 5 shows how to use COALESCEEMPTY() to display empty values as N/A. This technique is equivalent to using conditional logic to replace empty expressions; for example, you could use a conditional expression such as IIF( ISEMPTY( <<expr1>> ), <<expr2>>, <<expr1>>). The difference is that COALESCEEMPTY() can accept a variable list of expressions. The function continually checks each expression until it finds one that isn't empty, then returns to the first nonempty expression from its list. COALESCEEMPTY() requires that all its parameters are string type or numeric type; you can't mix the two types. (This requirement is similar to the IIF() function's requirements.) Note that I use a trick in Listing 5's code to circumvent this limitation and check the tuple to see if it represents an empty data cell and replace it with a string. The trick is to embed the string in another calculated member so that COALESCEEMPTY() doesn't identify the second parameter as a string type.
Empty cells are so common in OLAP that you can't ignore them when you're developing MDX queries. Empties can affect not only the numeric correctness of a result but also Analysis Services' performance and memory utilization. Always consider options such as using NONEMPTYCROSSJOIN() or ISEMPTY() to filter sets and improve the efficiency of your queries. Also, when you count items in a set, be aware from the outset of whether you intend to count the empties. When you effectively manage your empty cells, you improve query performance and make analysis easier.