What is in this article?:
- Inverse Distribution Functions
- Percentile Continuous Distribution
- Listing 1: DDL and Sample Data
This article starts with a brief reminder of the meaning of the computation of percentiles. I then provide solutions to the computations that are supported in SQL Server versions starting with SQL Server 2005. I hope you had a chance to try to come up with your own solutions first.
Last month, I covered the concept of ordered set functions (see "Ordered Set Functions"). I described the support of the ISO and ANSI SQL standards for ordered set functions, as well as the SQL Server 2012 workarounds for the missing functionality. I described inverse distribution functions (percentiles), hypothetical set functions, and other types of calculations that can benefit from a design based on the ordered set function concept. Regarding inverse distribution functions, I described the implementation of the PERCENTILE_DISC and PERCENTILE_CONT functions as window functions in SQL Server 2012. I explained how you can use these functions to compute percentiles. I also included a challenge for readers to come up with their own solutions to both computations that work in SQL Server versions prior to 2012 -- namely, solutions that don't use the PERCENTILE_DISC and PERCENTILE_CONT functions.
This article starts with a brief reminder of the meaning of the computation of percentiles. I then provide solutions to the computations that are supported in SQL Server versions starting with SQL Server 2005. I hope you had a chance to try to come up with your own solutions first.
I use the same sample data in this article that I used last month -- the ExamScores table, which represents student exam scores. Use the code in Listing 1 to create and populate the table.
Inverse Distribution Calculations
As a reminder, inverse distribution calculations compute percentiles. Loosely speaking, given a percent @pct (a float value in the range 0 through 1) and an ordering element o, the percentile computation returns a value below which @pct percent of the o values fall. For example, consider the ExamScores table. Suppose that you want to find the 50th percentile (aka the median) score for each exam. In this case @pct is 0.5, the ordering element is score, and the computation needs to be applied for each exam group. So for each exam, you're looking for the middle score -- the score below which 50 percent of the scores fall.
There are two distribution models used to compute percentiles: discrete and continuous. I'll cover each separately, starting with the standard's description of the computation, followed by a suggested pre-SQL Server 2012 compatible solution. Again, I hope you had a chance to try to come up with your own solutions first.
Percentile Discrete Distribution
Standard SQL defines the PERCENTILE_DISC function as an ordered set function that computes percentiles assuming a discrete distribution model. The standard describes the computation like so:
If PERCENTILE_DISC is specified, by treating the group as a window partition of the CUME_DIST window function, using the specified ordering of the value expression as the window ordering, and returning the first value expression whose cumulative distribution value is greater than or equal to the argument.
As a reminder, the CUME_DIST distribution function is a window function implemented in SQL Server 2012 that computes a relative rank of a row. The computation is defined as NP/NR, where NP is the number of rows that precede or peer with the current row and NR is the number of rows in the respective partition. Because your solution is supposed to work prior to SQL Server 2012, you can't use the CUME_DIST function.
You can compute NR with the COUNT window function. Computing NP is more complex. You can compute it as one more than the minimum RANK that's greater than the current rank. The implementation of this logic in T-SQL is expensive. What's interesting is that for our purposes of computing percentile discrete distribution, we can use the ROW_NUMBER function instead of NP and still get the correct result. (Many thanks to Adam Machanic for discovering this while reviewing one of my solutions to this task.) It can be a bit tricky to see how the logic works, so to help simplify things, first examine the information in Figure 1. This figure shows for each row in the ExamScores table a number of values: the row number (rownum), the number of rows that precede or peer with the current row (np), the number of rows in the partition (nr), the cumulative distribution (cd), and the result of the computation rownum/nr as an alternative to np/nr to compute cd. Remember that we're looking for "the first value expression whose cumulative distribution value is greater than or equal to the argument." Peers in terms of score have row numbers in the range <previous np + 1> through np. So "the first value expression whose np/nr value is greater than or equal to the argument" will always be the same as "the first value expression whose rownum/nr value is greater than or equal to the argument." In other words, for our computation's purposes we can use rownum/nr instead of np/nr and get the correct result. With this in mind, here's the pre-SQL Server 2012 compatible solution to computing percentile discrete distribution implementing this logic:
WITH C AS
(
SELECT examid, score,
ROW_NUMBER() OVER(PARTITION BY examid ORDER BY score) AS np,
COUNT(*) OVER(PARTITION BY examid) AS nr
FROM dbo.ExamScores
)
SELECT examid, MIN(score) AS percentiledisc
FROM C
WHERE 1.0 * np / nr >= @pct
GROUP BY examid;
Most of the complexity here is in figuring out the logic behind the computation and realizing that rownum can be used instead of np. As you can see from the solution, after you get this logic, the translation to T-SQL is very straightforward. Figure 2 shows the output of this solution with the input @pct = 0.5.

