An ordered set function is a type of aggregate function; it's similar to a general set function (e.g., SUM, MIN, MAX) in the sense that you can apply it to groups defined by a grouped query and return one value per group. What's special about an ordered set function is that it has ordering relevance. For instance, consider a string concatenation aggregate function. It's a set function because it's applied to a set of values in a group, but unlike SUM, for example, there's ordering relevance to the concatenation. Standard SQL supports the concept of an ordered set function and defines a clause called WITHIN GROUP where you provide the ordering specification, like so:
<ordered set function>(<inputs>) WITHIN GROUP(ORDER BY <order by list>)
SQL Server 2012 doesn't support ordered set functions, but I hope Microsoft will add such support in a future version of the product. I submitted a feature enhancement request to Microsoft asking for support of ordered set functions in a future version of SQL Server. You can find my feature request for ordered set function support on the Microsoft Connect website ; please vote for my request if you agree that it's important.
Related: Inverse Distribution Functions
In this article I describe the types of calculations that can be implemented as ordered set functions, and for which at the moment people need to rely on various workarounds. Some of the calculations are needed frequently (e.g., string concatenation), whereas others are more specialized (e.g., hypothetical set functions). In my examples in this article, I use a table called ExamScores that represents student exam scores. Use the code in Listing 1 to create the table and fill it with sample data.
SET NOCOUNT ON;
IF OBJECT_ID('dbo.ExamScores') IS NOT NULL DROP TABLE dbo.ExamScores;
CREATE TABLE dbo.ExamScores
examid INT NOT NULL,
studentid VARCHAR(10) NOT NULL,
score INT NOT NULL,
CONSTRAINT PK_ExamScores PRIMARY KEY(examid, studentid),
CONSTRAINT CHK_ExamScores_score CHECK(score BETWEEN 0 AND 100)
CREATE INDEX idx_examid_score ON dbo.ExamScores(examid, score);
INSERT INTO dbo.ExamScores(examid, studentid, score) VALUES
(1, 'E', 90),
(1, 'I', 80),
(1, 'M', 85),
(1, 'P', 50),
(1, 'R', 50),
(2, 'B', 75),
(2, 'E', 90),
(2, 'M', 70),
(2, 'N', 40),
(2, 'R', 90),
(2, 'U', 45);
Before I describe the various calculations that can be implemented as ordered set functions, I'd like to give a more concrete example. Suppose that SQL Server supported an aggregate string concatenation function called CONCAT_AGG that accepted two inputs: an expression whose values you want to concatenate and a separator. You could then use the function as follows to return, for each exam, the IDs of the students who took the exam, ordered from highest to lowest score (don't try running this code, because it's not supported by SQL Server 2012):
CONCAT_AGG(studentid, ',') WITHIN GROUP(ORDER BY score DESC) AS students
GROUP BY examid;
Inverse Distribution Functions
Standard SQL defines a subset of ordered set functions called inverse distribution functions. Most people know these computations as percentiles. Given a percent @pct and a set of values S, a percentile is the value in S (or an interpolated value) with respect to which @pct percent of the values are less than. For example, if @pct is equal to 0.5, you're looking for the 50th percentile -- aka the median. That's the value with respect to which 50 percent of the values are less than.
The standard defines two functions that compute percentiles. The PERCENTILE_DISC function uses a discrete distribution model returning a value that's one of the members of the set. The PERCENTILE_CONT function uses a continuous distribution model. If there's no member of the set that represents the requested percentile, the percentile is interpolated from the two existing values surrounding the nonexistent one, assuming continuous distribution.
According to standard SQL, both functions are supposed to be implemented as grouped aggregated functions. For example, to request the median score for each exam, you're supposed to use the following code (don't try running this code, because it's not supported by SQL Server 2012):
PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY score) AS percentiledisc,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY score) AS percentilecont
GROUP BY examid;
Figure 1 shows the expected output.
SQL Server doesn't support ordered set versions of the percentile functions, but it does implement windowed versions of them. It was easier for Microsoft to implement percentile functions as window functions because SQL Server already has the infrastructure for window functions. In the windowed versions of the percentile functions, you define window partitioning based on what you'd typically group by (examid in our example), and you define window ordering based on what you'd typically define as the group ordering (score in our example). The awkward part is that the windowed versions of the functions return the same percentile result with all members of the same partition. Remember that the grouped versions of the functions are supposed to return one value per group. Because there's no real benefit in returning the same fact multiple times, you'll likely want to eliminate the duplicates. One simple way to do so is with a DISTINCT clause, as follows:
SELECT DISTINCT examid,
PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY examid) AS percentiledisc,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY examid) AS percentilecont
It's a bit awkward and unnatural, but it does work in SQL Server 2012.
Implementing a solution for these computations that's supported pre–SQL Server 2012 is no simple task.
Hypothetical Set Functions
Standard SQL defines another subset of ordered set functions called hypothetical set functions. The name might sound a bit fancy, but the concept is actually quite straightforward. You're probably familiar with the RANK and DENSE_RANK window functions. The former computes a rank that's one more than the number of rows that have lower ordering values than the current row's ordering value in the respective partition. The latter computes a rank that's one more than the distinct number of ordering values that are lower than the current row's ordering value in the respective partition.
Hypothetical set functions are grouped and ordered set function versions of RANK and DENSE_RANK. Given an input value @val, the functions compute the rank (or dense rank) of the input with respect to the existing values in the group. The hypothetical set functions use the same logic as the window function counterparts, only they return one result value per group for the given input. Consider the following query as an example (don't try running this code, because it's not supported by SQL Server 2012):
RANK(80) WITHIN GROUP(ORDER BY score DESC) AS rnk,
DENSE_RANK(80) WITHIN GROUP(ORDER BY score DESC) AS densernk
GROUP BY examid;
This query groups the student exam scores by examid, computing the rank and dense rank of the score 80 with respect to the existing scores in each exam group. If these ordered set functions were supported in SQL Server, the output of this query would be the one shown in Figure 2.
It's not difficult to come up with alternative solutions that do currently work in SQL Server; for example:
COUNT(CASE WHEN score > 80 THEN 1 END) + 1 AS rnk,
COUNT(DISTINCT CASE WHEN score > 80 THEN score END) + 1 AS densernk
GROUP BY examid;
However, it would be nice to have built-in support for the standard functions in the future.
Other Possible Use Cases for Ordered Set Functions
Inverse distribution functions and hypothetical set functions are ordered set functions defined by standard SQL. However, the concept is applicable to other computations that people often ask for. Common examples are string concatenation, ordered CLR aggregates, and offset functions. I already provided an example of how a string concatenation function would look as an ordered set function.
Oracle has already implemented such a function, called LISTAGG. In Oracle, you'd use the following code to return for each exam a string with concatenated student IDs ordered from the highest score to the lowest:
LISTAGG(studentid, ',') WITHIN GROUP(ORDER BY score DESC) AS students
GROUP BY examid;
Another common request from SQL Server developers is to support ordered CLR user-defined aggregates (UDAs). Support for ordered set functions in SQL Server could also be applied to CLR aggregates. If added, you'd define the ordering for the aggregate using the standard WITHIN GROUP clause, like so:
dbo.MyCLRAgg(<inputs>) WITHIN GROUP(ORDER BY <order by list>) AS aggresult
GROUP BY grpcol;
Even offset, the functions FIRST_VALUE and LAST_VALUE, which are currently implemented as window functions, would be very useful as ordered set functions applied to groups. Think about how nice it would be if you could ask to return, for each exam, the IDs of the lowest and highest scoring students, like so:
FIRST_VALUE(studentid) WITHIN GROUP(ORDER BY score) AS minscorestudent,
LAST_VALUE(studentid) WITHIN GROUP(ORDER BY score) AS maxscorestudent
GROUP BY examid;
Microsoft Feature Request
Several commonly needed calculations fit the concept of an ordered set function. Because SQL Server 2012 doesn't support ordered set functions, people have to resort to alternative methods to compute them.