How to conduct high-powered analyses with simple operations in SQL Server 6.5

Although OLAP Services in SQL Server 7.0 is the hot new tool (see Bob Pfeiff, "OLAP: Resistance Is Futile," page 22, for details), you don't need to feel left out if you're still using SQL Server 6.5. You can use standard SQL-92 syntax in SQL Server 6.5 to slice and dice data for decision support. SQL Server 6.5 includes two extensions to GROUP BY—CUBE and ROLLUP—that let SQL Server optionally generate aggregate groupings in a single query. You can summarize data by using GROUP BY with one of the aggregate functions, sum, avg, min, max, or count. This action produces a summary value set from which you can answer questions about the distribution of data values.

For example, the Titles table in the SQL Server pubs sample database contains several values for each type of book and publisher. If you write a GROUP BY query with an aggregate function, one query can answer questions such as, "How many books of type 'business' are there?" "Which publisher publishes the most books of type 'modern cooking'?"

These kinds of questions are the hallmark of decision-support systems, OLAP, and data warehousing. The standard SQL-92 language doesn't provide a way to produce a result set that would answer every question about the various data value combinations. In this article, I'll demonstrate the functionality the CUBE and ROLLUP options add to SQL Server queries. These options provide the kinds of summary information that data warehousing applications supply. You can use CUBE and ROLLUP to generate and save this summary information so you won't have to write a separate query for each type of summary information you're interested in.

What Is a Cube?


For the examples in this article, I'll use a software distribution business that sells Microsoft Office products to stores in North America. The table product_sales1 tracks the number of each product sold over some period. This example tracks only sales of Word and Excel. Because the business covers North America, this example tracks the number of each product sold to each of the three countries in North America, in each of three languages. Assume we have the raw data listed in Table 1, the product_sales1 table, on page 30.

To answer questions regarding the number of a product sold in one country, you could formulate a GROUP BY query with an aggregate. For example, you could determine the number of products sold in each language in each country with the following query:

select language, country,
        Total_units_sold = sum(units_sold)
from product_sales1
group by language, country

This code would return the results you see in Table 2. From these query results, you could answer: "How many French products were sold in Mexico?" and "In which country were the most Spanish-language products sold?" However, you couldn't answer the question: "Which product sold the best in Canada?" or "How many English-language products were sold?" without doing additional computations. You need to write additional queries to get those answers.

To generate a result set that contains the answer to every question that you could ask regarding the quantities sold per language, per country, or per product, you could generate a cube. A cube represents three columns with changing values (product, language, and country) as its dimensions. A point in the cube's interior represents the quantity sold for specific values of each of the three columns. So from the data in Table 1, the interior point where Word, English, and Canada meet would have the value 49. Figure 1 can help you visualize this point. This figure also shows a cross tab, which is the result of every possible combination of two values.

A cube combines data to produce a result set containing a superset of SQL Server groups. The groups show each value of every grouped column with each possible value of every other grouped column. A cube also contains special values called superaggregates, which represent all values for a particular column. These are the values on the sides and edges of the cube structure in Figure 1. Table 3, page 32, represents the cube for the 18 rows of the product_sales1 table.

In this result set, every existing value for language is combined with every existing value for country, and each combination of language and country is combined with every existing value of product. The rows with are the superaggregates, and they provide additional information. For example, the row with the values , Mexico, French, and a units value of 58 shows that the total of French-language products sold in Mexico for all products is 58 units. The last row with , , Spanish, shows that the total of all products in all countries for the Spanish language was 369 units.

The number of rows SQL Server generates for a cube will depend on the number of attributes you want to group and on the combinations of attributes that exist in the data. Without detailed knowledge of the data, you can't predict the number of rows a cube operation will produce. However, the upper bound will equal the cross product of the number of distinct values + 1 for each attribute. The addition of 1 is for the case of . The product_sales1 example has three attributes: product, country, and language. The 18 rows of the product_sales1 table generate 48 rows in the cube.

Upper bound of number of rows
        = (Number of products+1) * (Number
        of countries+1) * (Number of languages+1)
        = (2+1) * (3+1) * (3+1)
        = 48

This upper bound depends only on the number of possible attribute values and is not related to the number of rows in the base table. Even if the table had 10 million rows of sales data, if the data were limited to two products, three countries, and three languages, the cube would never have more than 48 rows.

The CUBE Operator


In the data in Table 3, the placeholder represents the superaggregates. So if you want to know how many French-language products were sold in Mexico, regardless of the specific product, you would find the row where language = 'French', country = 'Mexico', and product = ''.

A simple GROUP BY column WITH CUBE will not show the value in the super-aggregate rows. The value is a nonvalue. Transact-SQL (T-SQL) will return this value as null, but it's a type of null called a grouping null. When used with the CUBE extension, null means "all existing values for this data column." A new function called GROUPING( ) lets you distinguish this special-purpose null from a traditional null, which means "no data available." I'll demonstrate this function in the next section. Table 4 shows a T-SQL query using CUBE and the results the query would return from the data shown above.

Earlier, I showed the formula to predict the upper limit of possible values that a cube can return. SQL Server will return that upper limit only if every possible combination of product, country, and language appears in the data. If the data set is missing a few combinations for a particular time period, the cube will have fewer rows. You can use SQL Server's ISNULL( ) function to get the to appear in the output instead of a null placeholder.

select units = sum(units_sold), product =
        isnull(product, ''),  
        country = isnull(country, ''),
        language = isnull(language, '')
from product_sales2
group by product, country, language
with cube

The GROUPING Function


You can use the ISNULL( ) function to replace NULL with the more meaningful if the data contains no other NULLs. Suppose the underlying table included a row with NULL for some of the columns (assuming the table definition allowed nulls), and you inserted the following row:

INSERT product_sales values
        NULL,NULL,'US',10)

The last row in the cube above (having 343 as total sales for all U.S. products sold) would be indistinguishable from this data row. To differentiate the two, SQL Server provides the GROUPING( ) function. This function returns 1 (TRUE) if the element is an value and 0 (FALSE) if the element is an actual data value or an explicit null. You can run the previous query again, after inserting null values in the row, to generate the value for the GROUPING( ) function:

SELECT 'Units'=SUM(units_sold),
        product, 'ALL Prods' =GROUPING(product),
        country, 'ALL Countries' =GROUPING(country),
        language, 'ALL Languages'
        =GROUPING(language)
        FROM product_sales2
        GROUP BY product, country, language
WITH CUBE

Table 5 shows the results. Note that the GROUPING() function takes a column name as an argument. A 0 returned for GROUPING(column_name) means the value returned for that column is a real value that occurs in the data. A 1 returned for GROUPING(column_name) means that the value returned for that column is not real; it was a special null returned because of the use of the CUBE (or ROLLUP) option. The first and last rows have the same values for product, country, and language, but different values in the GROUPING column because the first row corresponds to the raw data where product and country are null, and the nulls in the last row refer to all products and all countries.

In addition, you can use the CASE expression in SQL Server to differentiate between the GROUPING NULL, meaning all values, and the NULL meaning no value:

SELECTUnits = SUM(units_sold),
        Product = CASE    WHEN (GROUPING(product)=1) THEN 'ALL'
            ELSE ISNULL(product, 'N/A')
        END,
        Country = CASE    WHEN (GROUPING(country)=1) THEN 'ALL'
        ELSE ISNULL(country, 'N/A')
        END,
        Language = CASE    WHEN (GROUPING(language)=1) THEN 'ALL'
        ELSE ISNULL(language, 'N/A')
        END
        FROM product_sales2
        GROUP BY product, country, language
WITH CUBE

Table 6, page 34, shows the output from this query. You now have a result set for every combination of product, country, and language. To save this CUBE for future queries, you could turn the query into a VIEW or save the results as a new table. Here's how to save the results into a new table called all_combinations:

SELECT Units = SUM(units_sold),
        Product = CASE    WHEN (GROUPING(product)=1) THEN 'ALL'
        ELSE ISNULL(product, 'N/A')
        END,
        Country = CASE    WHEN (GROUPING(country)=1) THEN 'ALL'
        ELSE ISNULL(country, 'N/A')
        END,
        Language = CASE    WHEN (GROUPING(language)=1) THEN 'ALL'
        ELSE ISNULL(language, 'N/A')
        END
        INTO all_combinations
FROM product_sales2
        GROUP BY product, country, language
WITH CUBE

You can use this table to answer any question about this data. For example, if you wanted to know how many Spanish Excel products were sold in all countries, you could use the following query:

SELECT units
FROM all_combinations
WHERE language = 'Spanish'
  and product = 'Excel'
  and country = 'ALL'

Rolling Up the CUBE with ROLLUP


The CUBE option to GROUP BY is powerful, but this option might generate many more result rows than you want, making it hard to find the information you're really looking for. If you use CUBE, you'll get all permutations, including superaggregates, for all attributes for which corresponding data exists. SQL Server provides the ROLLUP option as an alternative to CUBE. ROLLUP returns only the values for a hierarchy of the attributes you specify.

For example, if you change the CUBE query to ROLLUP instead, the results are more compact and possibly easier to interpret when you drill into details for sales by product. Here is the query with ROLLUP:

SELECT Units = SUM(units_sold),
Product = CASE    WHEN (GROUPING(product)=1) THEN 'ALL'
    ELSE ISNULL(product, 'N/A')
    END,
Country = CASE    WHEN (GROUPING(country)=1) THEN 'ALL'
    ELSE ISNULL(country, 'N/A')
    END,
Language = CASE    WHEN (GROUPING(language)=1) THEN 'ALL'
    ELSE ISNULL(language, 'N/A')
    END
FROM product_sales2
GROUP BY product, country, language
WITH ROLLUP

The results of this query are much more compact than the full CUBE output. Although the results of the ROLLUP query won't answer every possible question that you can answer with CUBE, you obtain more information than with a simple GROUP BY. (For example, the previous ROLLUP query won't answer, "How many French products were sold?")The ROLLUP query results are similar to the results of the T-SQL COMPUTE BY operation, which generates something similar to a control-break report. The significant difference between ROLLUP and COMPUTE BY is that ROLLUP produces relational output, which you can save as a view or insert into another table. The COMPUTE BY results are not relational, and the extra nonrelational rows returned with that operation are difficult for many client tools to handle.

The choice between CUBE or ROLLUP depends on how deeply you want to drill down. If you want to save the results in a new table, use CUBE so that you have all possible combinations when you need them. Most of the work of generating the CUBE is in the GROUP BY operation. The additional work SQL Server requires to produce all combinations of grouped values usually will be negligible. In other words, if you're going to GROUP BY multiple columns, adding either CUBE or ROLLUP to the query doesn't cost much in performance overhead.

Simple Operations, Big Results


One benefit of a data warehouse is having large amounts of data available for OLAP and decision-support activities. If you don't have SQL Server 7.0's data warehousing capabilities, you can use SQL Server 6.5's CUBE and ROLLUP extensions to summarize large amounts of data. You need only T-SQL queries without additional tools or add-on products. And after you've produced a cube, you can ask any questions about any combination of the grouped data, without running additional queries. Even if you do have SQL Server 7.0, using CUBE and ROLLUP offloads some of the processing to SQL Server and lets you obtain data warehousing aggregations within SQL Server procedures.