To me, one of the most exciting improvements in SQL Server 2016 is the introduction of the batch mode Window Aggregate operator, which significantly improves the performance of most window functions. This article is the third in a three-part series on the topic. In Part 1, I discussed the concept of batch processing and explained how you can leverage its benefits irrespective of whether your data is organized as columnstore. I also covered frameless window aggregates and demonstrated the performance improvements you get with the Window Aggregate operator. In Part 2, I covered ranking and statistical window functions. This month I conclude the series with coverage of aggregate window functions with a frame, as well as offset window functions. If you need a reminder about working with these functions, you can find coverage of aggregate window functions with a frame here and of offset window functions here.

Sample Data

In this article I will use the same sample data that I used in the first two articles in the series. If you don’t have it available already use the following code to create it:

                              
-- testwindow database
SET NOCOUNT ON;
IF DB_ID(N'testwindow') IS NULL CREATE DATABASE testwindow;
GO
USE testwindow;
GO

-- GetNums helper function
DROP FUNCTION IF EXISTS dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Transactions table with 10M rows (200 accounts x 50000 transactions per act)
-- Traditional rowstore B-tree index
DROP TABLE IF EXISTS dbo.Transactions;

CREATE TABLE dbo.Transactions
(
  actid  INT   NOT NULL,
  tranid INT   NOT NULL,
  val    MONEY NOT NULL,
  CONSTRAINT PK_Transactions PRIMARY KEY(actid, tranid)
);
GO

DECLARE
  @num_partitions     AS INT = 200,
  @rows_per_partition AS INT = 50000;

INSERT INTO dbo.Transactions WITH (TABLOCK) (actid, tranid, val)
  SELECT NP.n, RPP.n,
    (ABS(CHECKSUM(NEWID())%2)*2-1) * (1 + ABS(CHECKSUM(NEWID())%5))
  FROM dbo.GetNums(1, @num_partitions) AS NP
    CROSS JOIN dbo.GetNums(1, @rows_per_partition) AS RPP;
GO

-- TransactionsCS
-- Clustered columnstore index
DROP TABLE IF EXISTS dbo.TransactionsCS;
SELECT * INTO dbo.TransactionsCS FROM dbo.Transactions;
CREATE CLUSTERED COLUMNSTORE INDEX idx_cs ON dbo.TransactionsCS;

-- TransactionsDCS
-- Traditional rowstore B-tree index
--   + dummy empty filtered nonclustered columnstore index
--     to enable using batch mode operators
DROP TABLE IF EXISTS dbo.TransactionsDCS;
SELECT * INTO dbo.TransactionsDCS FROM dbo.Transactions;
ALTER TABLE dbo.TransactionsDCS
  ADD CONSTRAINT PK_TransactionsDCS PRIMARY KEY(actid, tranid);
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs_dummy ON dbo.TransactionsDCS(actid)
  WHERE actid = -1 AND actid = -2;

As a reminder, the sample data is created in a database called testwindow. There are three tables holding identical copies of the data, which consists of 10,000,000 rows representing bank account transactions (500 accounts x 20,000 transactions per account). The table Transactions uses only rowstore representation of the data with a clustered B-tree index defined on the key list: (actid, tranid). Queries against this table currently use only row mode processing due to the requirement to have at least one columnstore index present to enable batch mode processing. So I’ll use this table to demonstrate row mode processing over rowstore. The table TransactionsCS has only a clustered columnstore index, so I’ll use it when I want to demonstrate batch mode processing over columnstore. The table TransactionsDCS uses the same organization as Transactions (clustered B-tree index), but in addition has a dummy empty filtered columnstore index to enable batch processing. I’ll use this table when I want to demonstrate batch mode processing over rowstore.

Aggregate Window Functions with a Frame

Aggregate window functions with a frame allow you to compute very classic data analysis calculations like running totals, ytd, mtd, moving averages and many others. As long as you use the frame delimiters UNBOUNDED and CURRENT ROW, there’s potential that SQL Server’s optimizer will use the batch mode Window Aggregate operator. With other delimiters, the optimizer is more limited. The upcoming sections cover different frame categories and the way they get optimized.

ROWS with Delimiters UNBOUNDED and CURRENT ROW

Probably the most commonly used frame specification is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (or, in short, ROWS UNBOUNDED PRECEDING). It represents the rows from the beginning of the partition (no low boundary point) and until the current row. This is the frame that you specify when you want to compute a classic running total. The frame specification ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING is optimized similarly, although it’s used less often.

The following query (call it Query 1) against the table Transactions (rowstore) computes a simple running total:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                ROWS UNBOUNDED PRECEDING) AS balance
FROM dbo.Transactions;

The execution plan for this query is shown in Figure 1.

Figure 1: Plan for Query 1 (ROWS, row mode over rowstore)

Since there’s no columnstore index present on the Transactions table, the optimizer is currently limited to using row mode processing. The plan is a serial one. It performs an ordered scan of the clustered index, and this way avoids the need to explicitly sort the rows for the window function’s calculation. The plan uses the Segment and Sequence Project operators to compute row numbers in order to know which rows are part of the current row’s frame. (See Part 2 for details on how row numbers are computed.) The plan then uses another Segment operator to segment the rows by the partitioning column actid for the aggregate.

Next, the plan writes rows into a spool (the Window Spool operator). The window spool is supposed to hold the frame of rows for the Stream Aggregate operator to aggregate. Since the frame specification is ROWS UNBOUNDED PRECEDING, theoretically, the window spool is supposed to hold 1 row for the first row in the partition, 2 rows for the second, and so on, and n rows for the nth row. Had this been the case, the spool would have had to hold (n + n2) / 2 rows for a partition with n rows, and consequently scale in a quadratic manner.

Fortunately, whenever the frame starts with UNBOUNDED PRECEDING, the optimizer uses fast-track optimization where it takes the previous row’s result, and adds the current row’s value to compute the new running total. What’s curious is why you see 10,000,000 rows going into the spool and twice the number going out, instead of the same number.

As it turns out, since the Stream Aggregate operator originally was written for grouped queries, it computes the grouped result and discards the detail. Since window functions aren’t supposed to discard the detail, the plan adds the detail row into the spool beside the grouped row in order to return both the detail and the result of the aggregate. So, although fast-track optimization results in linear scaling, there’s still room for better optimization by both avoiding writing two rows per underlying row into the spool and by eliminating the spool altogether when using this classic frame.

I got the following performance statistics for this query on my machine: duration: 28 sec, CPU: 28 sec, logical reads: 31K, writes: 0.

I’ll use a similar query (call it Query 2) against TransactionsCS to demonstrate batch mode processing over columnstore with the new Window Aggregate operator:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                ROWS UNBOUNDED PRECEDING) AS balance
FROM dbo.TransactionsCS;

The execution plan for this query is shown in Figure 2.

Figure 2: Plan for Query 2 (ROWS, batch mode over columnstore)

This time the plan is parallel and uses batch mode processing. It pulls the data from a columnstore index, hence the I/O cost drops; however, it does need to sort the data for the window function’s calculation, albeit a batch mode Sort operator. Then most of the magic happens in the batch mode Window Aggregate operator, replacing the following five operators from the previous plan: Segment, Sequence Project, Segment, Window Spool and Stream Aggregate. In addition to the obvious advantages of batch processing, this operator has an optimized code path for each window function. The inefficiencies related to the spool and the writing of the two rows per underlying row are gone. Finally, the Compute Scalar operator returns a NULL if the row count is zero, and the Gather Streams operator gathers the rows from the multiple threads into one stream of rows.

I got the following performance statistics for this query: duration: 8 sec, CPU: 19 sec, logical reads: 6k, writes: 0. Notice the drop in the run time from 28 seconds to 8 seconds.

The most expensive part in this plan seems to be the explicit sorting. You could avoid the need for sorting by having the data presorted in a B-tree index, but you do want a columnstore index present—even if it’s a dummy one—to enable batch mode processing. To demonstrate this batch mode over rowstore strategy, I’ll use a similar query (call it query 3) against TransactionsDCS (rowstore with dummy columnstore):

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                ROWS UNBOUNDED PRECEDING) AS balance
FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 3.

Figure 3: Plan for Query 3 (ROWS, batch mode over rowstore)

The plan is serial. It pulls the data presorted from the B-tree clustered index using row mode processing. It then converts the rows to batches, and uses the batch mode Window Aggregate operator to calculate the window function. Note that if you have both rowstore and nondummy columnstore indexes on the table, in this scenario the optimizer picks the batch mode over rowstore strategy.

Here are the performance statistics I got for this query: duration: 7 sec, CPU: 7 sec, logical reads: 31k, writes: 0. What’s still very much missing in this plan is parallelism. At the date of this writing, SQL Server 2016 doesn’t support an efficient combination of a parallel ordered Index Scan and Window Aggregate operators. Some additional work needs to be done in the engine and hopefully we’ll see such support in the future. Imagine dropping the run time that you see here even further!

RANGE with Delimiters UNBOUNDED and CURRENT ROW

I generally don’t like to raise people’s expectations too much before I show them something, but I will make an exception here. The next thing I will demonstrate is to me the most significant and exciting benefit that you get from the Window Aggregate operator. It has to do with an inefficiency in the row mode processing of window aggregates when using the RANGE window frame unit. Consider the following query (call it Query 4):

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                RANGE UNBOUNDED PRECEDING) AS balance
FROM dbo.Transactions;

The logical difference between ROWS and RANGE is in the handling of ties in the ordering values when the ordering isn’t unique. For example, suppose that the table also included a column holding the transaction date, and that you used that column in the window order clause. Whereas the ROWS option computes the running total without including ties, the RANGE option includes ties. So, with RANGE, all rows with the same date would get the same running total. As you can imagine, when the ordering is unique within the partition—such as is our case, since it’s based on the transaction ID—ROWS and RANGE have the same meaning.

The plan for this query is shown in Figure 4.

Figure 4: Plan for Query 4 (RANGE, row mode over rowstore)

This plan doesn’t use a row number computation, but rather two Segment operators—one for the window partitioning and another to tell when the ordering value changes. The rest seems similar to the computation when using ROWS, as shown earlier for Query 1 in Figure 1. Thus, you would expect Query 4 and Query 1 to perform similarly. To quote Hobie Doyle from Hail Caesar!, "Would that it were so simple … "

There’s a hidden performance problem in the plan. Whenever you use the RANGE option, SQL Server uses an on-disk spool for the Window Spool operator, as opposed to the in-memory spool that it uses for the ROWS option in the fast-track mode. An on-disk spool is essentially a worktable in tempdb with all of the usual I/O- and latching-related overhead. Consequently, a calculation with RANGE can take an order of magnitude longer than a similar calculation with ROWS, even when the ordering is unique. Again, this is something that you cannot see in the plan. Here are the performance statistics that I got for this plan: duration: 221 sec, CPU: 204 sec, logical reads: 104M, writes: 31K. Notice that the duration of this query is an order of magnitude longer than Query 1. Also notice the excessive number of logical reads. If you enable STATISTICS IO before running the queries, the output will include an entry for the spool called worktable. This entry will show 0 as the number of logical reads when you use ROWS and a very large number of reads when you use RANGE. That’s one way to tell that you got the on-disk spool. Another is to run an Extended Events session with the event window_spool_ondisk_warning.

To add insult to injury, if you omit the clause with the window frame unit, standard SQL defines that RANGE UNBOUNDED PRECEDING is the default. So the following query is semantically equivalent to Query 4, with the same poor performance:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid) AS balance
FROM dbo.Transactions;

I can’t tell you how many times I’ve seen people using RANGE unknowingly, myself included. It’s especially hard to notice when you test your code with a small set of sample data and the ordering is unique within the partition, as in our example. Imagine how many systems have such code with queries running an order of magnitude longer than they should. As it turns out, the Window Aggregate operator solves this problem, too. It basically optimizes ROWS and RANGE similarly.

The following query (call it Query 5) demonstrates using batch mode over columnstore:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                RANGE UNBOUNDED PRECEDING) AS balance
FROM dbo.TransactionsCS;

The plan looks similar to the one shown earlier for Query 2 in Figure 2. Here are the performance statistics that I got for this query: duration: 9 sec, CPU: 21 sec, logical reads: 6k, writes: 0.

The following query (call it Query 6) demonstrates using batch mode over rowstore:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                RANGE UNBOUNDED PRECEDING) AS balance
FROM dbo.TransactionsDCS;

The plan looks similar to the one shown earlier for Query 3 in Figure 3. Here are the performance statistics that I got for this query: duration: Performance statistics: duration: 7 sec, CPU: 7 sec, logical reads: 31k, writes: 0.

The performance of queries 1 through 6 is summarized in Figure 5.

Figure 5: Performance of aggregates with UNBOUNDED and CURRENT ROW

Now, this is something. This picture is so beautiful that I almost want to make a tattoo out of it. (Don’t worry, I won’t.)

Delimiters Other than UNBOUNDED and CURRENT ROW

When using aggregate window functions with a frame whose delimiters are different than UNBOUNDED and CURRENT ROW, even when you query columnstore data, there will be use of row mode operators to compute the window function. Still, some of the work can possibly be done with batch mode operators.

Suppose that you compute an aggregate window function with a frame that starts with UNBOUNDED PRECEDING (fast-track) but doesn’t end with CURRENT ROW; for example: SUM(val) with the frame ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. If you query Transactions (rowstore), you get row mode over rowstore optimization similar to the plan that you got for Query 1 in Figure 1 since it’s still considered fast-track.

If you query TransactionsCS (columnstore), the part in the plan that computes row numbers is optimized with the batch mode Window Aggregate operator, but the aggregate itself is optimized using the traditional row mode operators for the fast-track case.

If you query TransactionsDCS (rowstore + dummy columnstore), the optimizer just chooses to handle everything with row mode processing, and you get a plan similar to the one you got for Query 1 in Figure 1.

If the frame does not start with UNBOUNDED PRECEDING, technically the case does not qualify for fast track optimization. However, as long as the aggregate is considered cumulative (like SUM, COUNT, AVG, but not like MIN, MAX), the optimizer can compute a single non-fast-track calculation based on two fast-track calculations; for example, SUM(val) with the frame ROWS BETWEEN 99 PRECEDING AND CURRENT ROW can be computed as SUM(val) with the frame ROWS UNBOUNDED PRECEDING minus SUM(val) with the frame ROWS BETWEEN UNBOUNDED PRECEDING AND 100 PRECEDING. The following query (call it Query 5) against the table Transactions (rowstore) demonstrates this strategy with row mode optimization over row store:

                              
SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid ORDER BY tranid
                ROWS BETWEEN 99 PRECEDING AND CURRENT ROW) AS last100
FROM dbo.Transactions;

The plan for this query is shown in Figure 6.

Figure 6: Plan for Query 5 (ROWS 99 PRECEDING AND CURRENT ROW, row mode over rowstore)

The upper part of the plan computes the ComulativeBottom aggregate (ROWS UNBOUNDED PRECEDING) and the lower part of the plan computes the CumulativeTop aggregate (ROWS BETWEEN UNBOUNDED PRECEDING AND 100 PRECEDING). The Compute Scalar operator that follows subtracts CumulativeTop from CumulativeBottom to compute the final aggregate. Here are the performance statistics that I got for this query: duration: 59 sec, CPU: 59 sec, logical reads: 31K, writes: 0.

As for a similar query against TransactionsCS (columnstore), the plan uses the batch mode Window Aggregate operator to handle the CumulativeBottom calculation. It then uses the Window Aggregate operator to compute row numbers and the traditional fast-track row mode processing for the CumulativeTop calculation. Here are the performance statistics that I got for this query: duration: 34 sec, CPU: 34 sec, logical reads: 5k, writes: 0. Notice that it cuts the run time to half compared to the completely traditional row mode processing.

Similarly, if you query TransactionsDCS (rowstore + dummy columnstore), the plan handles the ComulativeBottom calculation with the Window Aggregate operator, and then a row number calculation with the Window Aggregate operator, and the ComulativeTop calculation with traditional fast-track row mode operators. The difference from the previous case is that the data is pulled from a B-tree and therefore there’s no need for explicit sorting. Here are the performance statistics that I got for this case: duration: 33 sec, CPU: 33 sec, logical reads: 5k, writes: 0.

Suppose that you compute the SUM(val) window function with the frame ROWS BETWEEN 99 PRECEDING AND 1 PRECEDING. Against Transactions (rowstore) the plan computes row numbers and two fast-track aggregates all using row mode operators.

Against TransactionsCS (columnstore) the plan computes row numbers with the Window Aggregate operator, but the two aggregates with fast-track row mode operators.

Against TransactionsDCS (rowstore + dummy columnstore) the optimizer chooses a plan that uses only row mode processing as it did for the query against Transactions.

When you use a noncumulative aggregate like MIN and MAX with a frame that doesn’t start with UNBOUNDED PRECEDING (for example, ROWS BETWEEN 99 PRECEDING AND 1 PRECEDING), the optimizer cannot use the trick where it computes two fast-track aggregates and derives the final calculation from the two.

Against Transactions (rowstore), both the row number and the aggregate are computed with row mode operators; there’s no fast-track optimization, so all applicable frame rows need to be written to the window spool. Against TransactionsCS (columnstore) the plan computes row numbers with the Window Aggregate operator and the aggregate using row mode operators.

Against TransactionsDCS (rowstore + dummy columnstore) the plan computes both the row numbers and the aggregate using row mode operators just like against Transactions.

Offset Window Functions

T-SQL supports two pairs of offset window functions. The LAG and LEAD functions return an element from a preceding or following row, respectively. The FIRST_VALUE and LAST_VALUE functions return an element from the first or last row in the window frame, respectively.

The LAG and LEAD functions

The following query against Transactions (call it Query 6) demonstrates the traditional row mode over rowstore optimization of the LAG function (LEAD gets optimized similarly):

                              
SELECT actid, tranid, val,
  LAG(val) OVER(PARTITION BY actid ORDER BY tranid) AS prevval
FROM dbo.Transactions;

Internally, the LAG and LEAD functions are converted to the LAST_VALUE function with a frame that contains only the requested row. For example, LAG with the default offset 1 is converted to LAST_VALUE with the frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING. Therefore, Query 6 is equivalent both in meaning and in optimization to the following query (call it Query 7):

                              
SELECT actid, tranid, val,
  LAST_VALUE(val) OVER(PARTITION BY actid ORDER BY tranid
                       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prevval
FROM dbo.Transactions;

The execution plan for both Query 6 and Query 7 is shown in Figure 7.

Figure 7: Plan for Query 6 (LAG/LEAD, row mode over rowstore)

As you can see, the plan is similar to the one you got for Query 1 in Figure 1. Here are the performance statistics that I got for this query: duration: 33 sec, CPU: 32 sec, logical reads: 31K, writes: 0.

If a columnstore index is present on the queried table, SQL Server can optimize LAG and LEAD with the batch mode Window Aggregate operator as long as the offset is 1 (the default). For example, the following query (call it Query 8) against TransactionsCS demonstrates batch mode over columnstore optimization:

                              
SELECT actid, tranid, val,
  LAG(val) OVER(PARTITION BY actid ORDER BY tranid) AS prevval
FROM dbo.TransactionsCS;

The plan for this query is shown in Figure 8.

Figure 8: Plan for Query 8 (LAG/LEAD, batch mode over columnstore)

As you can see, the plan is similar to the one you got for Query 2 in Figure 2. The plan is parallel. Since the data is pulled from a columnstore index, it has to be sorted. Most operators use batch processing. Here are the performance statistics I got for this query: duration: 10 sec, CPU: 19 sec, logical reads: 6K, writes: 0. Notice the drop in the query run time from 33 seconds to 10 seconds.

The following query (call it Query 9) against TransactionsDCS demonstrates batch mode over rowstore optimization:

                              
SELECT actid, tranid, val,
  LAG(val) OVER(PARTITION BY actid ORDER BY tranid) AS prevval
FROM dbo.TransactionsDCS;

The plan for this query is shown in Figure 9.

Figure 9: Plan for Query 9 (LAG/LEAD, batch mode over rowstore)

As you can see, the plan is similar to the one you got for Query 3 in Figure 3. The plan is serial. The B-tree index scan uses row mode processing, but since it pulls the data ordered, there’s no need for explicit sorting. Then the batch mode Window Aggregate operator applies the window function calculation. Here are the performance statistics that I got for this query: duration: 7 sec, CPU: 7 sec, logical reads: 31K, writes: 0. The run time dropped to just 7 seconds!

As mentioned, currently SQL Server can use the batch mode Window Aggregate operator for LAG and LEAD only when used with the offset 1. With a different offset, these functions are optimized with row mode operators. For example, the following query (call it Query 10) demonstrates using the LAG function with the offset 2:

                              
SELECT actid, tranid, val,
  LAG(val, 2) OVER(PARTITION BY actid ORDER BY tranid) AS prev2val
FROM dbo.TransactionsDCS;

The plan for this query is similar to the one you got for Query 6 in Figure 7. Here are the performance statistics that I got for this query: duration: 31 sec, CPU: 31 sec, logical reads: 31K, writes: 0

Note that some parts of the plan, in the right conditions, can be optimized with batch mode operators. For example, if you query TransactionsCS with offset 2, the parts in the plan that involve reading from a columnstore index, sorting and computing row numbers can still use batch mode operators. But the calculation of the window function itself will be handled with traditional row mode operators.

Figure 10 summarizes the performance the LAG and LEAD functions with an offset of 1.

Figure 10: Performance of LAG and LEAD functions

The FIRST_VALUE and LAST_VALUE Functions

As mentioned, the LAG and LEAD functions are internally converted to LAST_VALUE. When using the LAST_VALUE function in a mode that is equivalent to LAG and LEAD with an offset of 1 (with ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING or 1 FOLLOWING AND 1 FOLLOWING), you get the same optimization as for LAG and LEAD, as described in the previous section. Otherwise (when using LAST_VALUE with an offset that is different than 1, or when using the FIRST_VALUE function with any offset), the actual window function calculation is optimized with traditional row mode operators. The parts in the plan that involve reading from a columnstore index, sorting, and computing row numbers, can still use batch mode operators.

Conclusion

The batch mode Window Aggregate operator, which was introduced in SQL Server 2016, shows impressive performance improvements for most window functions. It can be used with frameless aggregate window functions, ranking and statistical window functions, aggregate window functions with a classic frame (UNBOUNDED and CURRENT ROW as delimiters), and the offset window functions LAG and LEAD with the classic offset of 1.

This operator can be used even when the data is pulled from a rowstore representation of the data, but you need at least one columnstore index present in the queried table. You could have both columnstore and rowstore representations of the data coexist, and let the optimizer choose the best source for any given query. If you have a case where you only want rowstore representation of the data, you can create a dummy filtered columnstore index just to enable batch mode processing. This operator shows much better scaling with degree of parallelism (DOP) n > 1 versus DOP 1 compared to row mode processing. The tag line “It Just Runs Faster” in Microsoft’s performance improvements in SQL Server 2016 is indeed very true. Let’s hope to see many more improvements like this and also in the functionality of window functions in the future.