Aggregates with an Outer Reference

Before getting started I’d like to thank Simon Sabin, Paul White and Umachandar Jayachandran (UC) for their input in a discussion we held on the topic.

Related: Partial Aggregate

T-SQL doesn’t support a certain form of aggregate expressions that include outer references. As an example, consider the following query:

USE AdventureWorks2008R2;

SELECT CustomerID, SalesOrderID, SubTotal,
  (SELECT AVG(O2.SubTotal - O1.SubTotal)
   FROM Sales.SalesOrderHeader AS O2
   WHERE O2.CustomerID = O1.CustomerID
     AND O2.SalesOrderID  O1.SalesOrderID) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1;

This query attempts to return, for each order, the average of all differences between the current order value and the values of all other orders by the same customer. However, when you try running this query, you get the following error:

Msg 8124, Level 16, State 1, Line 4
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.

The source for the restriction is in the SQL-92 standard, and SQL Server inherited it from the Sybase codebase. The problem is that SQL Server needs to figure out which query will compute the aggregate.

As you can see, the error message doesn’t say that outer references in aggregate expressions are banned altogether, but those that involve both outer and inner references are (even though the message text isn’t very clear about that). Here’s a simple example showing an aggregate expression with an outer reference that is allowed since you’re not mixing things:

SELECT
  (SELECT AVG(O1.SubTotal) - AVG(O2.SubTotal)
   FROM Sales.SalesOrderHeader AS O2) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1
WHERE CustomerID = 29825;

Since there are only outer references in the first aggregate expression, it’s clear to SQL Server that the entire expression (AVG(O1.SubTotal)) should be resolved against the outer query, which is logically equivalent to:

SELECT AVG(O1.SubTotal) -
  (SELECT AVG(O2.SubTotal)
   FROM Sales.SalesOrderHeader AS O2) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1
WHERE CustomerID = 29825;

Since there are only inner references in the second aggregate expression, It’s also clear to SQL Server that the entire expression (AVG(O2.SubTotal)) should be resolved against the inner query. But when you try mixing both outer and inner references, that’s when SQL Server gets confused and complains:

SELECT
  (SELECT AVG(O1.SubTotal - O2.SubTotal)
   FROM Sales.SalesOrderHeader AS O2) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1
WHERE CustomerID = 29825;

Msg 8124, Level 16, State 1, Line 2
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.

As mentioned, the message text is a bit misleading. It sounds like if you have an outer reference, only one outer column reference is allowed in the aggregate expression. But as the following supported example shows, that’s not really the case:

SELECT
  (SELECT AVG(O1.SubTotal + O1.TaxAmt) - AVG(O2.SubTotal + O2.TaxAmt)
   FROM Sales.SalesOrderHeader AS O2) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1
WHERE CustomerID = 29825;

In short, the mixing of inner and outer references in aggregate expressions is what SQL Server seems to have a problem with.

Now let’s get back to the unsupported query I presented initially:

SELECT CustomerID, SalesOrderID, SubTotal,
  (SELECT AVG(O2.SubTotal - O1.SubTotal)
   FROM Sales.SalesOrderHeader AS O2
   WHERE O2.CustomerID = O1.CustomerID
     AND O2.SalesOrderID  O1.SalesOrderID) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1;

To remind you, this query attempts to return, for each order, the average of all differences between the current order value and the values of all other orders by the same customer. You want SQL Server to resolve the O2.SubTotal reference against the inner query and the O1.SubTotal reference against the outer query and apply the AVG function to the differences, but SQL Server won’t allow you to mix those. However, SQL Server will allow mixing those in expressions that are not aggregate expressions, e.g., predicates with correlations.

A simple workaround is to add another instance of the table as an inner instance and correlate that instance to the outer table by the table’s key. This way you can refer to elements from the outer row implicitly by referring to the respective elements from the new inner instance. Then the original inner instance can be joined to the new inner instance instead of being correlated to the outer one. But implicitly, it’s as if the original inner instance was correlated to the outer one. And then you can apply your aggregate expression mixing elements from the two inner instances—original and new. I bet this sounds confusing. Hopefully the code that implements this workaround will make things clearer:

SELECT CustomerID, SalesOrderID, SubTotal,
  (SELECT AVG(O2.SubTotal - T.SubTotal)
   FROM Sales.SalesOrderHeader AS T
     JOIN Sales.SalesOrderHeader AS O2
       ON T.SalesOrderID = O1.SalesOrderID
      AND O2.CustomerID = T.CustomerID
      AND O2.SalesOrderID  T.SalesOrderID) AS AvgDiff
FROM Sales.SalesOrderHeader AS O1;

This query is supported and addresses the original request.

You will face the same problem when using the APPLY operator, where the right table expression includes aggregate expressions that mix inner and outer references. Here’s an example for an unsupported query:

SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal,
  A.AvgDiff, A.MinDiff, A.MaxDiff
FROM Sales.SalesOrderHeader AS O1
  CROSS APPLY (SELECT
                 AVG(O2.SubTotal - O1.SubTotal) AS AvgDiff,
                 MIN(O2.SubTotal - O1.SubTotal) AS MinDiff,
                 MAX(O2.SubTotal - O1.SubTotal) AS MaxDiff
               FROM Sales.SalesOrderHeader AS O2
               WHERE O2.CustomerID = O1.CustomerID
                 AND O2.SalesOrderID  O1.SalesOrderID) AS A;

The workaround is the same:

SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal,
  A.AvgDiff, A.MinDiff, A.MaxDiff
FROM Sales.SalesOrderHeader AS O1
  CROSS APPLY (SELECT
                 AVG(O2.SubTotal - T.SubTotal) AS AvgDiff,
                 MIN(O2.SubTotal - T.SubTotal) AS MinDiff,
                 MAX(O2.SubTotal - T.SubTotal) AS MaxDiff
               FROM Sales.SalesOrderHeader AS T
                 JOIN Sales.SalesOrderHeader AS O2
                   ON T.SalesOrderID = O1.SalesOrderID
                  AND O2.CustomerID = T.CustomerID
                  AND O2.SalesOrderID  T.SalesOrderID) AS A;

If you have other workarounds of your own you are welcome to share those.

Cheers,

BG

Discuss this Blog Entry 3

on Jun 20, 2011
Hello, your previous to last statement doesn't work however little modification and it works: SELECT DISTINCT CustomerID, SalesOrderID, SubTotal, AVG(AvgDiff) OVER(PARTITION BY CustomerID, SalesOrderID, SubTotal) FROM Sales.SalesOrderHeader AS O1 CROSS APPLY ( SELECT (O2.SubTotal - O1.SubTotal) AS AvgDiff FROM Sales.SalesOrderHeader AS O2 WHERE O2.CustomerID = O1.CustomerID AND O2.SalesOrderID <> O1.SalesOrderID ) AS CA
on Jul 24, 2011
my goood shame on me .. different results
on Mar 4, 2014

I believe this can also be solved by wrapping up the inner SELECT into another derived table. This appears to have about half the query cost and also eliminates duplicating the formula for the calculation. Here is the code with the extra wrapper (tested on SQL Server 2005):

SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal,
A.AvgDiff, A.MinDiff, A.MaxDiff
FROM Sales.SalesOrderHeader AS O1
CROSS APPLY (SELECT
AVG(O2.SubTotal - T.SubTotal) AS AvgDiff,
MIN(O2.SubTotal - T.SubTotal) AS MinDiff,
MAX(O2.SubTotal - T.SubTotal) AS MaxDiff
FROM Sales.SalesOrderHeader AS T
JOIN Sales.SalesOrderHeader AS O2
ON T.SalesOrderID = O1.SalesOrderID
AND O2.CustomerID = T.CustomerID
AND O2.SalesOrderID <> T.SalesOrderID) AS A;

SELECT O1.CustomerID, O1.SalesOrderID, O1.SubTotal,
A.AvgDiff, A.MinDiff, A.MaxDiff
FROM Sales.SalesOrderHeader AS O1
CROSS APPLY (
SELECT
AVG(Diff) AS AvgDiff,
MIN(Diff) AS MinDiff,
MAX(Diff) AS MaxDiff
FROM
(
SELECT
(O2.SubTotal - O1.SubTotal) AS Diff
FROM Sales.SalesOrderHeader AS O2
WHERE O2.CustomerID = O1.CustomerID
AND O2.SalesOrderID <> O1.SalesOrderID
) AS RawDiff) AS A;

Please or Register to post comments.

What's Puzzled By T-SQL Blog?

T-SQL tips and logical puzzles from Itzik Ben-Gan.

Blog Archive

Sponsored Introduction Continue on to (or wait seconds) ×