Median is a common calculation used in statistical analysis of data. Given a set of observations, the median is the value below which half of the observations fall. For example, suppose that a group of students participate in an exam. The median score is the score below which half of the scores fall.
The type of median calculation that I'll focus on in this article is one that assumes a continuous distribution model. If there's an odd number of elements involved, the median is simply the middle point. If there's an even number of elements, the median is the average of the two middle points.
SQL Server 2012 introduces new T-SQL features that provide improved solutions to the classic median problem. In this article, I examine the new solutions and compare them with one of the more efficient older solutions.
For sample data, let's use a table called T1. The table has two columns called grp and val. The task at hand is to compute the median of the values in the val column for each distinct grp value. It's recommended that you create an index on (grp, val) to support the optimization of the solution queries. Use the code in Listing 1 to create the table T1 and the recommended index, and to populate the table with a small set of sample data.
- SET NOCOUNT ON;
- USE tempdb;
- IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
- CREATE TABLE dbo.T1
- id INT NOT NULL IDENTITY
- CONSTRAINT PK_T1 PRIMARY KEY,
- grp INT NOT NULL,
- val INT NOT NULL
- CREATE INDEX idx_grp_val ON dbo.T1(grp, val);
- INSERT INTO dbo.T1(grp, val)
- VALUES(1, 30),(1, 10),(1, 100),
- (2, 65),(2, 60),(2, 65),(2, 10);
Figure 1 shows the desired results you're supposed to get from your solution query against the small set of sample data generated by Listing 1. You'll need a bigger table to evaluate the performance of different solutions. For this task, you'll need a helper function called GetNums that accepts input parameters called @low and @high and returns a sequence of integers between the two. Use the code in Listing 2 to create the GetNums function. Use the code in Listing 3 to populate the table with a large number of rows.
- IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
- CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS ABLE
- L0 AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),
- L1 AS (SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B),
- L2 AS (SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B),
- L3 AS (SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B),
- L4 AS (SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B),
- L5 AS (SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B),
- Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
- FROM L5)
- SELECT @low + rownum - 1 AS n
- FROM Nums
- ORDER BY rownum
- OFFSET 0 ROWS FETCH FIRST @high - @low + 1 ROWS ONLY;
- -- for high density use 10 groups x 1000000 rows per group
- -- for low density use 1000000 groups x 10 rows per group
- @numgroups AS INT = 10,
- @rowspergroup AS INT = 1000000;
- TRUNCATE TABLE dbo.T1;
- DROP INDEX idx_grp_val ON dbo.T1;
- INSERT INTO dbo.T1 WITH(TABLOCK) (grp, val)
- SELECT G.n, ABS(CHECKSUM(NEWID())) % 101
- FROM dbo.GetNums(1, @numgroups) AS G
- CROSS JOIN dbo.GetNums(1, @rowspergroup) AS R;
- CREATE INDEX idx_grp_val ON dbo.T1(grp, val);
Figure 1: Desired Results Against Small Set of Sample Data
Notice that the code in Listing 3 lets you control the number of distinct groups by setting the variable @numgroups, and it lets you control the number of rows per group by setting the variable @rowspergroup. The current values in the code (10 groups with 1,000,000 rows each) are what I used to generate sample data with high group density (small number of groups, each with a lot of rows). To test the performance against sample data with low group density, I used 1,000,000 groups with 10 rows each. So in both density cases I filled the table with 10,000,000 rows, only in each case the number of distinct groups was different (10 for high density and 1,000,000 for low density).
Solution 1 Using the PERCENTILE_CONT Function
SQL Server 2012 introduces a function called PERCENTILE_CONT that computes a requested percentile (e.g., 0.5 for median), assuming a continuous distribution model. In the ANSI SQL standard, the function is defined as a grouped ordered set function, but SQL Server 2012 implements the function as a window function using partitioning instead of grouping.
However, remember that unlike grouped queries, which hide the detail and return one row per group, windowed queries don't hide the detail. Therefore, instead of using the function in a grouped query and returning the requested percentile once per group, you use it in a windowed query and return the same percentile repeated for all members of the same partition. To avoid returning duplicates in the result, you can simply add a DISTINCT clause to your query. So, to compute the requested median against the T1 table using the PERCENTILE_CONT function, you'd use the following code:
- SELECT DISTINCT grp,
- PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY val)
- OVER(PARTITION BY grp) AS median
- FROM dbo.T1;
The solution is very short, and it's certainly simpler than any other solution I can think of. This is in great contrast to what can be said about the query's execution plan. Figure 2 shows the execution plan for this query, reduced to a scale of about 30 percent. Don't worry about not being able to see exactly what's going on there; my point is just to show how elaborate the plan is.
The plan is quite inefficient. There are multiple points in the plan where the optimizer spools the data (writes it to a work table) and then scans the spool twice—once to get the detail rows and once to compute an aggregate. The spooling adds quite a lot of I/O-related overhead. When I tested the solution on my machine with high group density, it ran for 69 seconds and used 58,131,217 logical reads. When I tested it with low group density, it ran for 48 seconds and used 50,022,662 logical reads. For a set with 10,000,000 rows, that's quite slow and expensive.
Solution 2 Using the ROW_NUMBER Function
As you noticed, the solution using the new PERCENTILE_CONT function is very slow. This section describes a solution that uses the ROW_NUMBER function and is supported in SQL Server versions prior to SQL Server 2012. The solution defines two CTEs: one called Counts that uses a simple grouped query to compute the count of rows in each group and another called RowNums that computes row numbers that are partitioned by grp and ordered by val. The column in Counts holding the counts is named cnt and the column in RowNums holding the row numbers is named n. The outer query performs the following steps:
1. Join Counts with RowNumbers based on a match between the grp values in both sides
2. Filter in each group only the rows that need to take part in the median calculation using the predicate: n IN ( ( C.cnt + 1 ) / 2, ( C.cnt + 2 ) / 2 )
3. Group the remaining rows by grp
4. Return the average val per group
Note that the filter in step 2 uses integer divisions and identifies two qualifying middle points when there's an even number of elements and only one qualifying middle point when there's an odd number of elements. For example, if cnt is 100 the expressions in the IN predicate result in 50 and 51. When cnt is 101, both expressions in the IN predicate evaluate to 51. Here's the complete solution query:
- WITH Counts AS
- SELECT grp, COUNT(*) AS cnt
- FROM dbo.T1
- GROUP BY grp
- RowNums AS
- SELECT grp, val,
- ROW_NUMBER() OVER(PARTITION BY grp ORDER BY val) AS n
- FROM dbo.T1
- SELECT C.grp, AVG(1. * R.val) AS median
- FROM Counts AS C
- INNER JOIN RowNums AS R
- on C.grp = R.grp
- WHERE n IN ( ( C.cnt + 1 ) / 2, ( C.cnt + 2 ) / 2 )
- GROUP BY C.grp;
Figure 3 shows the query plan for the solution in a high group density case; Figure 4 shows the plan in a low group density case.
The two plans are similar. In both cases, the index idx_grp_val is scanned twice in order—once to compute the counts and another time to compute the row numbers. The plans differ in the places where some of the aggregate calculations are handled.
The plan is quite efficient in both cases. Most of the cost is associated with the two ordered scans of the data. With high density, the query ran for 13 seconds; with low density, it ran for 11 seconds. In both cases, the number of logical reads was 44,680. This solution is much faster than solution 1, and it performs significantly fewer logical reads. Also, the performance of the solution doesn't depend much on density—the two extreme types of densities performed similarly.
But what if more than 10 seconds of run time isn't good enough? The next section describes a new solution based on another new feature in SQL Server 2012 that performs much better than solution 2.
Solution 3 Using the OFFSET-FETCH Option
The third solution relies on the OFFSET-FETCH option that was introduced in SQL Server 2012. The solution defines a CTE called C based on a query that groups the rows from T1 by grp and computes a count of rows per group, naming the resulting column cnt. The inner query also computes, based on the count of rows in the group, how many rows need to be skipped (offset_val) and how many rows need to be fetched (fetch_val). The resulting column offset_val is computed as (COUNT(*) - 1) / 2 using integer division. For example, if there are 100 rows in the group, the result is 49. If there are 101 rows in the group, the result is 50. The resulting column fetch_val is computed as 2 - COUNT(*) % 2. When there's an even number of elements, the result is 2; when there's an odd number, the result is 1.
The outer query in the solution queries the CTE C. It uses the CROSS APPLY operator to apply to each group in C a table expression called A. The query defining the table expression A queries the table T1; using the OFFSET-FETCH option, the query filters for the current group fetch_val rows after skipping offset_val rows.
Finally, the outer query groups the remaining rows by grp and then computes the average values per group, producing the desired median. Here's the complete solution query:
- WITH C AS
- SELECT grp,
- COUNT(*) AS cnt,
- (COUNT(*) - 1) / 2 AS offset_val,
- 2 - COUNT(*) % 2 AS fetch_val
- FROM dbo.T1
- GROUP BY grp
- SELECT grp, AVG(1. * val) AS median
- FROM C
- CROSS APPLY ( SELECT O.val
- FROM dbo.T1 AS O
- where O.grp = C.grp
- order by O.val
- OFFSET C.offset_val ROWS FETCH NEXT C.fetch_val ROWS ONLY ) AS A
- GROUP BY grp;
Figure 5 shows the plan generated by the solution for a high density case; Figure 6 shows the plan for a low density case.
In both cases, the plan performs an ordered scan in the index idx_grp_val to compute the counts and derivative values (offset_val and fetch_val) per group. Then the plan uses a Nested Loop operator that performs per group an Index Seek in the same index. The seek reaches the leaf level of the index to the beginning of the current group's section, then scans offset_val rows and returns fetch_val rows. Finally, the plan handles the AVG aggregate to compute the median from the filtered values.
An interesting thing about this solution is that it efficiently utilizes parallelism, using multiple threads for both the computation of the counts and the derivative values, and to apply the query with the OFFSET-FETCH option per group. The solution especially excels when there's high group density, because in that case the number of seek operations is very small (10). With high density, the solution ran for only 1 second on my system and performed only 33,821 logical reads. With low density, the number of seek operations was much higher (1,000,000), taking 6 seconds to complete and performing 3,348,753 logical reads.
In this article, I covered three solutions for computing the median. The first solution used the PERCENTILE_CONT function that was introduced in SQL Server 2012. This solution is very slow, mainly because of the excessive use of work tables. The second solution used the ROW_NUMBER function. It's a much faster solution than the first one. The performance of the solution is fairly similar for both high and low group densities. The third solution used the OFFSET-FETCH option. This solution is even faster than the second. The performance of this solution does depend on the group density—the higher the density, the better it performs.