SideBar    Getting Your Feet Wet in SQL Server 2008, The Logical Puzzle
DOWNLOAD THE CODE:
Download the Code 96805.zip

I want to devote this article to a cool concept that's been around for a while but has gained some new functionality in SQL Server 2008. I'm talking about grouping sets, which are very useful for analyzing aggregated data. They let you provide dataanalysis solutions against the relational database as alternatives to Analysis Services when appropriate. Now, I know many of you haven't even thought about checking out SQL Server 2008 yet, but I would urge that you do so. What better way to be prepared for the new version when it arrives? Check out the Web-exclusive sidebar "Getting Your Feet Wet in SQL Server 2008" (InstantDoc ID 96802) for information about downloading a Community Technology Preview (CTP) of SQL Server 2008.

With that in mind, let's examine grouping sets. All the code samples in this series of articles about grouping sets will be against an Orders table created in the tempdb database. (Assuming you have the latest CTP version of SQL Server installed, you can run all the code samples that I'll provide.) Run the code in Listing 1 to create the Orders table and populate it with sample data. As an aside, those of you with a keen eye will notice something special about the INSERT statement in Listing 1 - it utilizes a small but very handy new T-SQL feature in SQL Server 2008 called Row Value Constructors. That is, you can now insert multiple rows into a table with a single INSERT VALUES statement. You simply encapsulate each row's set of values in parentheses and separate the rows with commas.

Grouping Sets
Whenever learning about a new feature, a good way to start is to understand what kind of problems that feature helps you solve and how you solved such problems in the past. A grouping set isn't a new concept - it's nothing more than the set of elements/attributes that you group by. For example, the set of attributes in any traditional GROUP BY query is a grouping set. SQL Server 2008 introduces a new option/subclause called GROUPING SETS, which lets you define multiple grouping sets in the same query. Logically, a query with the GROUPING SETS option produces one result set as if unifying the result sets of multiple GROUP BY queries. It's not that you couldn't achieve the same thing in the past, but with the GROUPING SETS option, you can write much less code, your query will be much more efficient, and you'll have more flexibility than in the past.

As an example, suppose you need to query the Orders table and return total quantity values for each (customer, employee and order year), and for each (customer and year), and for each (employee and year), and also for (the grand total). Yes, you can obtain all result sets with four GROUP BY queries, as you see in Listing 2, page 26. But suppose you want to unify all four result sets into one. In other words, you want one result set unifying four grouping sets: (customer, employee, year), (customer, year), (employee, year), and (). One option is to use UNION ALL set operations between the sets. Because all four result sets originally have different structures/schema, and set operations require that all input sets have the same schemas, you can use NULLs as placeholders for the missing elements. Listing 3 shows the code that unifies all four result sets. The problem with this solution is that the code is lengthy and is very inefficient because the base data (the rows in the Orders table) is going to be fully scanned four times.

SQL Server 2008 lets you write much shorter and more efficient code by using the GROUPING SETS option in the GROUP BY clause, where you simply list your desired grouping sets. Listing 4 shows the new, shorter, and more efficient alternative to Listing 3's solution. Web Table 1 (http://www.sqlmag.com, InstantDoc ID 96805) shows the output of the query in Listing 4.

Note the use of empty parentheses, signifying a grouping set with no attributes. As an aside, I never felt quite comfortable with the fact that when I wanted to aggregate the entire input set of rows, I had to omit the GROUP BY clause altogether. For example, in the past, when you wanted to return the grand total quantity, you wrote a query such

SELECT SUM(qty) AS totalqty
FROM dbo.Orders;

It always felt a bit awkward to call such a query a GROUP BY query; it's a GROUP BY query, but the GROUP BY clause is implicit. As a byproduct of adding native support for multiple grouping sets, SQL Server 2008 lets you write the following explicit alternative to the above query:

SELECT SUM(qty) AS totalqty
FROM dbo.Orders
GROUP BY ();

Back to the query in Listing 4 using the GROUPING SETS option, it's evident that the code is much shorter than the code in Listing 3, with the UNION ALL set operations. As for efficiency, the beauty of the SQL Server 2008 solution is that the base data (the rows in the Orders table) needs to be scanned fewer times, because SQL Server can calculate higher-level aggregates on top of lower-level aggregates. I'll leave you with a full examination of the execution plans for both solutions as an exercise because I find them to be clear and straightforward. But I want to highlight a couple of key points. In the plan for the code in Listing 3, you'll find four branches (one for each GROUP BY query). Each branch starts with a clustered index scan, followed by one aggregate operator operating on the whole input set. (There will also be either sorting or hashing to support the aggregate operation.)

In the plan for the code in Listing 4, you'll find fewer clustered index scans (two in this case), and in some (or all) branches, you'll find multiple aggregate operators, where the non-first aggregate operator in the brunch aggregates the pre-aggregated data. If you compare the costs of the two plans, you'll find that the one for the code in Listing 4 is substantially cheaper than the one for the code Listing 3.

1 vs. Multiple Grouping Sets
Every casual GROUP BY query can be thought of as having a single grouping set. For example, the following two queries are logically equivalent:

-- Query 1
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid,
YEAR(orderdate);

-- Query 2
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( (custid, empid, YEAR(orderdate)) );

Note that you shouldn't confuse the use of GROUPING SETS in Query 2 with the following:

-- Query 3
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( custid, empid,
YEAR(orderdate) );

Syntactically, all I did in Query 3 was remove one layer of parentheses; however, logically, I defined three grouping sets: (custid), (empid), and (YEAR(orderdate)) as opposed to one in Query 2: (custid, empid, YEAR(orderdate) ). As I mentioned earlier, Query 2 is logically equivalent to Query 1 (a single traditional GROUP BY query), whereas Query 3 is logically equivalent to UNION ALL operations between three traditional GROUP BY queries:

SELECT ... GROUP BY custid
UNION ALL
SELECT ... GROUP BY empid
UNION ALL
SELECT ... GROUP BY YEAR(orderdate);

Continued on page 2.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

Very nice article. I can see Group By Grouping Sets can be very useful in OLTP Systems since it provides REAL DATA data analysis comparing to OLAP.

edmundch

Article Rating 5 out of 5

Interesting improvement over the WITH CUBE option of the group by clause. WITH CUBE generate all possible grouping combinations, including those for which you have no insterest. However with the WITH CUBE option there is a T-SQL scalar function called GROUPING which tell in the result sets if NULL value is a real NULL value in the data data or a NULL generated for the purpose of mixing rows having not the same number of columns. Does the GROUPING function applies on NULL data generated by GROUPING SETS clause also ?

pelsql

Article Rating 4 out of 5

OOPS - The online version of the article is incomplete. The end of the article says "Continued on page 2." -and- "End of Article". I have the print copy and see that half the article is missing from the online version. Whats up?

MarkInChicago

Article Rating 3 out of 5