I want to devote this article to a cool concept
that's been around for a while but has gained
some new functionality in SQL Server 2008. I'm
talking about grouping sets, which are very useful for
analyzing aggregated data. They let you provide dataanalysis
solutions against the relational database as
alternatives to Analysis Services when appropriate.
Now, I know many of you haven't even thought
about checking out SQL Server 2008 yet, but I
would urge that you do so. What better way to be
prepared for the new version when it arrives? Check
out the Web-exclusive sidebar "Getting Your Feet
Wet in SQL Server 2008" (InstantDoc ID 96802)
for information about downloading a Community
Technology Preview (CTP) of SQL Server 2008.
With that in mind, let's examine grouping sets.
All the code samples in this series of articles about
grouping sets will be against an Orders table created
in the tempdb database. (Assuming you have the
latest CTP version of SQL Server installed, you
can run all the code samples that I'll provide.) Run
the code in Listing 1 to create the Orders table and
populate it with
sample data. As
an aside, those of
you with a keen
eye will notice
something special
about the INSERT statement in Listing 1 - it utilizes
a small but very handy new T-SQL feature in SQL
Server 2008 called Row Value Constructors. That is,
you can now insert multiple rows into a table with
a single INSERT VALUES statement. You simply
encapsulate each row's set of values in parentheses
and separate the rows with commas.
Grouping Sets
Whenever learning about a new feature, a good way
to start is to understand what kind of problems that
feature helps you solve and how you solved such
problems in the past. A grouping set isn't a new
concept - it's nothing more than the set of elements/attributes that you group by. For example, the set of
attributes in any traditional GROUP BY query is
a grouping set. SQL Server 2008 introduces a new
option/subclause called GROUPING SETS, which
lets you define multiple grouping sets in the same
query. Logically, a query with the GROUPING
SETS option produces one result set as if unifying
the result sets of multiple GROUP BY queries. It's
not that you couldn't achieve the same thing in the
past, but with the GROUPING SETS option, you
can write much less code, your query will be much
more efficient, and you'll have more flexibility than
in the past.
As an example, suppose
you need to query
the Orders table and
return total quantity
values for each (customer,
employee and order year),
and for each (customer
and year), and for each
(employee and year), and
also for (the grand total).
Yes, you can obtain all
result sets with four
GROUP BY queries, as
you see in Listing 2, page
26. But suppose you want
to unify all four result sets
into one. In other words,
you want one result set
unifying four grouping
sets: (customer, employee,
year), (customer, year),
(employee, year), and
(). One option is to use UNION ALL set operations
between the sets.
Because all four result
sets originally have different
structures/schema,
and set operations require
that all input sets have the
same schemas, you can use
NULLs as placeholders
for the missing elements.
Listing 3 shows the code
that unifies all four result
sets. The problem with
this solution is that the
code is lengthy and is
very inefficient because
the base data (the rows in
the Orders table) is going
to be fully scanned four
times.
SQL Server 2008
lets you write much
shorter and more efficient
code by using the
GROUPING SETS
option in the GROUP
BY clause, where you
simply list your desired
grouping sets. Listing 4 shows the new, shorter,
and more efficient alternative
to Listing 3's
solution. Web Table 1 (http://www.sqlmag.com, InstantDoc ID
96805) shows the output
of the query in Listing 4.
Note the use of
empty parentheses, signifying
a grouping set
with no attributes. As an
aside, I never felt quite
comfortable with the fact
that when I wanted to
aggregate the entire input set of rows, I had to omit
the GROUP BY clause altogether. For example, in
the past, when you wanted to return the grand total
quantity, you wrote a query such
SELECT SUM(qty) AS totalqty
FROM dbo.Orders;
It always felt a bit awkward to call such a query a
GROUP BY query; it's a GROUP BY query, but
the GROUP BY clause is implicit. As a byproduct
of adding native support for multiple grouping sets,
SQL Server 2008 lets you write the following explicit
alternative to the above query:
SELECT SUM(qty) AS totalqty
FROM dbo.Orders
GROUP BY ();
Back to the query in Listing 4 using the GROUPING
SETS option, it's evident that the code is much
shorter than the code in Listing 3, with the UNION
ALL set operations. As for efficiency, the beauty of
the SQL Server 2008 solution is that the base data
(the rows in the Orders table) needs to be scanned
fewer times, because SQL Server can calculate
higher-level aggregates on top of lower-level aggregates.
I'll leave you with a full examination of the
execution plans for both solutions as an exercise
because I find them to be clear and straightforward.
But I want to highlight a couple of key points.
In the plan for the code in Listing 3, you'll find
four branches (one for each GROUP BY query).
Each branch starts with a clustered index scan, followed
by one aggregate operator operating on the
whole input set. (There will also be either sorting
or hashing to support the aggregate operation.)
In
the plan for the code in Listing 4, you'll find fewer
clustered index scans (two in this case), and in some
(or all) branches, you'll find multiple aggregate
operators, where the non-first aggregate operator in
the brunch aggregates the pre-aggregated data. If
you compare the costs of the two plans, you'll find
that the one for the code in Listing 4 is substantially
cheaper than the one for the code Listing 3.
1 vs. Multiple Grouping Sets
Every casual GROUP BY query can be thought of
as having a single grouping set. For example, the following
two queries are logically equivalent:
-- Query 1
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY custid, empid,
YEAR(orderdate);
-- Query 2
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( (custid,
empid, YEAR(orderdate)) );
Note that you shouldn't confuse the use of
GROUPING SETS in Query 2 with the following:
-- Query 3
SELECT custid, empid, YEAR(orderdate)
AS orderyear, SUM(qty) AS qty
FROM dbo.Orders
GROUP BY GROUPING SETS( custid, empid,
YEAR(orderdate) );
Syntactically, all I did in Query 3 was remove
one layer of parentheses; however, logically, I
defined three grouping sets: (custid), (empid), and
(YEAR(orderdate)) as opposed to one in Query 2:
(custid, empid, YEAR(orderdate) ). As I mentioned
earlier, Query 2 is logically equivalent to Query 1 (a
single traditional GROUP BY query), whereas Query
3 is logically equivalent to UNION ALL operations
between three traditional GROUP BY queries:
SELECT ... GROUP BY custid
UNION ALL
SELECT ... GROUP BY empid
UNION ALL
SELECT ... GROUP BY YEAR(orderdate);
Continued on page 2.
Prev. page  
[1]
2
next page