Filtering rows isn’t as simple or straightforward as it seems
Query filters perform a seemingly simple and straightforward task—filtering rows—but if you’ve been writing SQL code for some time, you know that there’s a lot of depth and many confusing aspects to query filters. Working with query filters is far from trivial. Over the next couple of months I’ll discuss query filters that T-SQL supports, query filters that standard SQL defines but that aren’t implemented in SQL Server (as of SQL Server 2008 R2), query filters that database platforms other than SQL Server support, and query filters that aren’t standard and that no platform currently supports but that are useful. This month I focus on predicate-based filters. Next month I’ll cover other types of filters.
In my examples I use a sample database called InsideTSQL2008. You can download the source code to create and populate the sample database from www.InsideTSQL.com/books/source_code/InsideTSQL2008.zip.
T-SQL is based on standard SQL (both ISO and ANSI standards), which in turn is based on the relational model, which in turn is based on two mathematical branches: set theory and predicate logic. Understanding some of the key aspects of the mathematical foundations that T-SQL is ultimately based on can give you important insights into the language and can help you write better and more correct code.
A predicate is a statement or an expression that either holds or doesn’t hold. An example of a predicate is region = 'WA'. When evaluating a predicate for a particular entity instance (e.g., of a customer), the expression represents a proposition. For example, suppose that customer 43 is from Walla Walla, Washington, United States, and customer 42 is from Vancouver, British Columbia, Canada. The proposition that customer 42’s region is equal to Washington ('BC' = 'WA') is false. The proposition that customer 43’s region is equal to Washington ('WA' = 'WA') is true. In other words, you can think of a predicate as a general form of the more specific proposition, or as a parameterized proposition. The predicate can be true for some propositions but false for others.
The relational model relies on predicates for several purposes: as a modeling strategy, to define data integrity, and to filter rows in queries. Use of predicates as a modeling strategy involves listing examples for propositions that you need to represent in your database, taking out the data and keeping the headers (predicates), and defining the relations based on the predicates. An example of the use of predicates to enforce data integrity occurs in CHECK constraints. Examples of using predicates to filter rows in queries include the ON, WHERE, and HAVING query clauses that T-SQL supports; the standard FILTER clause for aggregates that T-SQL doesn’t support; and the nonstandard QUALIFY clause that Teradata supports. I cover all these clauses in this article.
NULLS and Three-Valued Predicate Logic
Part of the complexity and confusion of working with predicates is because SQL and its various implemented dialects don’t by default enforce a law in logic known as the law of excluded middle—instead, SQL uses three-valued predicate logic (or what most people know as simply three-valued logic). When the law of excluded middle is enforced, predicates can evaluate only to true or false. Some people believe that a valid relational model must enforce this law. However, SQL also supports cases in which predicates evaluate to unknown—this approach has to do with SQL’s support for the NULL mark, which represents the missing value concept from Codd’s relational model. Although Codd’s relational model defines two different kinds of missing values—missing and applicable, as well as missing and inapplicable—SQL implements only one mark representing all kinds of missing values.
Given a predicate that compares values (e.g., region = 'WA', when both sides aren't NULL), the predicate evaluates to true or false as I explained earlier. But when one or both sides are NULL, the predicate evaluates to unknown.
Determining how SQL treats true and false cases is pretty straightforward and intuitive, but treatment of unknown cases can be quite tricky. For example, it’s important to understand that all query filters (ON, WHERE, and HAVING, which T-SQL supports, as well as FILTER and QUALIFY, which T-SQL doesn’t support) accept true cases, meaning that they reject both false and unknown cases. The IF and WHILE statements, as well as the WHEN clause of CASE expressions, treat predicates similarly to query filters in the sense that they deal with true cases in one manner and false and unknown cases in another manner. The CHECK constraint, which enforces declarative integrity as part of a table definition, rejects false cases—meaning that it accepts both true and unknown cases.
So, for example, consider the predicate salary > 0. If this predicate appears in a query WHERE clause, a row with salary NULL is rejected. But if the same predicate appears in a CHECK constraint in a table, a row with salary NULL is accepted.
As I mentioned, you get unknown even if you compare two NULLs. For example, the predicate region = NULL will never yield true (I won’t get into nonstandard SET options that change the standard NULL treatment). For this reason, SQL introduced the IS NULL and IS NOT NULL predicates. If you want to filter rows in which a certain attribute is NULL, instead of using the form attribute = NULL, use the form attribute IS NULL. Similarly, if you want to return all rows in which a certain attribute isn’t NULL, instead of using the form attribute <> NULL, use attribute IS NOT NULL. It’s clear that when negating true with a NOT operator you get false, and when negating false you get true. But note that when negating unknown, you still get unknown. This is another reason why the IS NULL and IS NOT NULL predicates are so important.
As an example, suppose that you need to query the Sales.Customers table and return all customers that aren’t from the Washington region. Some customers have NULL in the region, but this is an example for a NULL representing the concept of missing and inapplicable—meaning a customer from a location where the region part of the location hierarchy is inapplicable (e.g., from the city London in the country UK). If you issue the following query, you get only customers for which the region is not NULL and different from WA:
SELECT custid, city, region, country<br>FROM Sales.Customers<br>WHERE region <> 'WA';
But for our purposes, we want to consider a NULL as being different from WA, because we know exactly why we placed a NULL in this attribute—when the region was inapplicable. This is an example in which SQL’s default treatment of unknown in predicates isn’t suitable for us, and when we need to intervene by explicitly asking to return cases in which the attribute is NULL (as opposed to being equal to NULL):
SELECT custid, city, region, country<br>FROM Sales.Customers<br>WHERE region <> 'WA'<br> OR region IS NULL;
Things become even trickier when you compare two attributes and you need to consider them as equal both when the values aren't NULL and the same, as well as when both are NULLs. If you use the form ON T1.col1 = T2.col1, you get unknown when both sides are NULLs, and the row is then filtered out. Some people use a trick in which they substitute NULLs with a value that they know can’t appear in the data—such as ON COALESCE(T1.col1, -1) = COALESCE(T2.col1, -1). However, besides the fact that the expression is awkward and relies on your knowledge of the data, this technique might not use indexes efficiently. A form that the optimizer tends to treat better is ON T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL). In a very similar manner, if you want to use a predicate that checks that the two sides are different, including one side being NULL and the other not, you should use the form ON T1.col1 <> T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL) OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL). However, this form is still awkward and convoluted. The SQL standard addresses this need by defining two very interesting predicates: IS NOT DISTINCT FROM and IS DISTINCT FROM. The former returns true when both sides are equal, as well as when both are NULLs, and false otherwise. The latter returns TRUE when the two are different and when one is NULL and one isn’t, and false otherwise. These predicates use two-valued logic—they never yield unknown. Unfortunately, T-SQL doesn’t support these predicates. If you’d like to see these predicates in SQL Server, you can vote for Steve Kass’s proposal to add them; go to the Microsoft Connect page at connect.microsoft.com/SQLServer/feedback/details/286422.
ON, WHERE Outer Join Confusion
Sometimes it can be very confusing to determine whether to place a certain query filter in the ON clause of a certain join table operator or in the query’s WHERE clause. Let me start by saying that with inner joins it makes no difference in terms of the final result. SQL Server is also aware of this fact, and as long as there are no outer joins involved, it can internally rearrange the order in which it evaluates the predicates. So with inner joins, my recommendation is to write in a way that feels intuitive to you. Some people prefer to specify predicates that compare attributes between tables in the ON filter and the rest of the predicates in the WHERE filter. Others prefer to think of predicates that intuitively relate to one join table operator in the ON filter, and those that seem logically to filter the result of all table operators in the WHERE filter. Some people might also prefer to write in a way that’s safer in cases in which the join could be revised in the future to an outer join. But the final result of the inner join query will be the same regardless.
When using outer joins, placing the predicate in the ON or WHERE filter can have different logical meanings and can produce different results. So you want to make sure you understand what the meaning is in each case and what you’re trying to achieve. With outer joins you mark tables as preserved, meaning that you want to preserve all rows from the marked tables. Using the keywords LEFT, RIGHT, and FULL, you mark the left, right, or both tables as preserved, respectively. In an outer join a predicate placed in the ON clause serves only a matching purpose with respect to rows from the preserved side—not a filtering purpose. A predicate in the WHERE clause serves a filtering purpose, as usual.
As an example, the following two inner join queries produce the same output:
SELECT C.custid, O.orderid, O.orderdate<br>FROM Sales.Customers AS C<br> JOIN Sales.Orders AS O<br> ON C.custid = O.custid<br> AND O.orderdate = '20080212';<br><br>SELECT C.custid, O.orderid, O.orderdate<br>FROM Sales.Customers AS C<br> JOIN Sales.Orders AS O<br> ON C.custid = O.custid<br>WHERE O.orderdate = '20080212';
However, the following two outer join queries produce different outputs:
SELECT C.custid, O.orderid, O.orderdate<br>FROM Sales.Customers AS C<br> LEFT JOIN Sales.Orders AS O<br> ON C.custid = O.custid<br> AND O.orderdate = '20080212';<br><br>SELECT C.custid, O.orderid, O.orderdate<br>FROM Sales.Customers AS C<br> LEFT JOIN Sales.Orders AS O<br> ON C.custid = O.custid<br>WHERE O.orderdate = '20080212';
The first query uses both the predicate that compares the customer IDs from both sides, as well as the predicate that compares the order date with February 12, 2008, as matching predicates. All rows from the preserved table (Customers in our case) will be returned. However, a row from the Orders table will be matched with the respective Customers table only if the ON predicate is true—meaning only if the order’s customer ID is the same as the customer’s customer ID, and the order’s order date is February 12, 2008. In other words, you get all customers in the output, and if the customer placed orders on February 12, 2008, you’ll see those orders.
The second query’s outer join logically returns customers and their orders, including customers who didn’t place orders, but then the WHERE filter keeps rows in which the orderdate filter evaluates to true. Both false cases (in which the orderdate is not NULL and different from February 12, 2008) and unknown cases (in which the orderdate is NULL even if it’s a NULL produced by the outer join) are filtered out. All outer rows here are filtered out by the WHERE filter, and therefore the outer join is nullified, practically becoming an inner join query. In fact, SQL Server’s optimizer realizes this fact and doesn’t bother to process the join as an outer one. You can see this in the query’s execution plan.
In short, the second query is doubtless a bug. Either you planned to write it as an inner join in which you return only customers who placed orders on the date of interest along with their orders placed on that date, or you planned to write it as an outer join with the filter on orderdate as part of the ON clause, returning all customers, but matching orders only if they were placed on the date of interest.
Another similar bug has to do with multi-join queries. A left outer join that's later followed by either an inner join or a right outer join (and refers to elements from the non-preserved side of the join in the subsequent filters) nullifies the outer join. Similarly, a right outer join that's later followed by either an inner join or a left outer join nullifies the outer join. Here’s an example:
SELECT C.custid, O.orderid, OD.productid, OD.qty<br>FROM Sales.Customers AS C<br> LEFT OUTER JOIN Sales.Orders AS O<br> ON C.custid = O.custid<br> JOIN Sales.OrderDetails AS OD<br> ON O.orderid = OD.orderid;
Here, the first join is an outer one, returning customers and their orders, including customers who didn’t place orders as outer rows with NULLs in the order attributes. However, the next inner join removes the outer rows when comparing the NULL order IDs in those rows with order IDs from the OrderDetails table. Fortunately, both standard SQL and T-SQL support separating some of the table operators to their own independent unit:
SELECT C.custid, O.orderid, OD.productid, OD.qty<br>FROM Sales.Customers AS C<br> LEFT OUTER JOIN<br> (Sales.Orders AS O<br> JOIN Sales.OrderDetails AS OD<br> ON O.orderid = OD.orderid)<br> ON C.custid = O.custid;
This time the query performs a left outer join between Customers and the result of the inner join between Orders and OrderDetails, preserving customers without orders.
The HAVING filter is also based on a predicate, but unlike the other filters based on predicates, the HAVING filter is evaluated at the group level in grouped queries. It filters whole groups as opposed to individual rows. Common filtering tasks that are applicable to the HAVING clause are filters based on aggregate functions. For the most part, filters in the HAVING clause are pretty straightforward.
There’s an interesting filtering clause defined by standard SQL for aggregate functions that T-SQL doesn’t support, probably because it has a simple workaround. The idea is to allow an aggregate function to filter only a subset of the rows in a group based on a predicate. The clause in standard SQL is called FILTER; it takes the general form of
<aggregate_function> FILTER (WHERE <predicate>)
As an example, the following query uses three different filtered aggregate expressions to separate the aggregated values for each order year (don’t try to run the query, because it’s not supported):
SELECT empid,<br> SUM(val) FILTER (WHERE YEAR(orderdate) = 2006) AS sum_2006,<br> SUM(val) FILTER (WHERE YEAR(orderdate) = 2007) AS sum_2007,<br> SUM(val) FILTER (WHERE YEAR(orderdate) = 2008) AS sum_2008<br>FROM Sales.OrderValues<br>GROUP BY empid;
Simple workarounds include using standard CASE expressions that T-SQL supports, as in the following example:
SELECT empid,<br> SUM(CASE WHEN YEAR(orderdate) = 2006 THEN val END) AS sum_2006,<br> SUM(CASE WHEN YEAR(orderdate) = 2007 THEN val END) AS sum_2007,<br> SUM(CASE WHEN YEAR(orderdate) = 2008 THEN val END) AS sum_2008<br>FROM Sales.OrderValues<br>GROUP BY empid;
or nonstandard elements that T-SQL supports, such as the PIVOT table operator.
Standard SQL doesn’t allow referring to window functions (e.g., ROW_NUMBER) in any of the query’s filtering clauses. In fact, window functions are allowed only in the SELECT and ORDER BY query clauses. The reason for this limitation is that window functions are supposed to operate on the final query result set, and the final result set is achieved only at the SELECT phase. SQL thus avoids ambiguity in regards to which set of rows the window function is supposed to apply to. However, this limitation means you can’t achieve basic filtering if you need the filter predicate to refer to a window function. For example, the following isn’t supported:
SELECT custid, orderdate, orderid, empid<br>FROM Sales.Orders<br>WHERE ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) <= 3;
Of course, you could use a table expression such as a common table expression (CTE) or derived table as a workaround, but doing so would add another layer of complexity to the query. Being able to directly filter by a window function would be useful.
Teradata came up with an answer to this need by introducing a new filtering clause called QUALIFY that’s logically processed after the SELECT clause and is therefore allowed to refer to window functions directly, as well as to column aliases assigned in the SELECT clause. So in Teradata you can issue a query such as
SELECT custid, orderdate, orderid, empid<br>FROM Sales.Orders<br>QUALIFY ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) <= 3;
as well as
SELECT custid, orderdate, orderid, empid,<br> ROW_NUMBER() OVER(PARTITION BY custid ORDER BY orderdate DESC, orderid DESC) AS rownum<br>FROM Sales.Orders<br>QUALIFY rownum <= 3;
Both of these solutions are very elegant and straightforward, and they don’t require an extra layer in the form of a table expression. If you want to vote for this feature’s inclusion in SQL Server, go to the Microsoft Connect page at connect.microsoft.com/SQLServer/feedback/details/532474.
In this article I covered query filters that are based on predicates. I touched on some of the interesting and confusing aspects of working with filters in T-SQL, as well as some of the filtering clauses that T-SQL doesn’t support but standard SQL and other database platforms do. Next month I’ll cover query filters that aren’t based on predicates, but rather on logical order and number of rows.