An easy stored procedure lets you control data flow and efficiently access millions of records
Web applications commonly use record paging to present large amounts of information to users. For example, an Internet search engine usually returns tens of thousands of results for a user's query. If the engine returned all those results at once, the receiving system would be overwhelmed. But paging breaks the data into fixed-size blocks that make the results manageable and reduces the amount of information that moves between the server and the client at one time. The application gives users only a few records at a time, preferably giving them the most relevant information first. In addition to making the data easier to understand and browse, paging also helps improve application performance—retrieving and displaying large amounts of information creates unnecessary overhead that can slow your system. If the system pages records correctly, a search engine's users likely won't even need to browse beyond the first one or two pages. (See also, "Paging in SQL Server 2005.")
Unfortunately, many programmers are unaware of some of the important performance considerations of paging. In a typical IIS and SQL Server environment, the most common way to implement paging is to use standard ADO Recordset paging features such as the AbsolutePage, PageSize, and PageCount properties. For relatively small amounts of data (from dozens to hundreds of records), these features work well and the overhead they generate doesn't dramatically affect performance. However, as the number of records increases, this technique becomes less efficient and produces noticeable decreases in application performance.
In large-volume applications such as a procurement application that displays large numbers of orders, a dating site that serves thousands of users, or an e-commerce site that displays hundreds of products as a result of a user's search, you need to use advanced server-side paging techniques. This article presents a simple example of a coding technique that I use for tables containing several million records.
Limits of ADO Recordset Paging
The cause of paging problems when you're accessing large numbers of records lies in the way ADO handles data. To retrieve information from the database, the ADO infrastructure needs to maintain a pointer to the data. This pointer, called a cursor, lets the client (e.g., the Active Server Pages—ASP—environment) retrieve records one by one.
The ADO Recordset object supports two types of cursors: server-side (the default) and client-side. The application can leave all the data at the SQL Server and use the server-side cursor to retrieve each record in sequence when it's needed. Or the application can transfer all the data to the client and use the client-side cursor to retrieve the data record by record from the client's buffer memory space. If you need to use or display only some of the records that result from a query, as is the case in a paging scenario, a server-side cursor is more efficient because SQL Server transfers only the required page to the client, leaving the rest of the records on the database server. A server-side cursor limits the data transferred to the client to the 20 or 30 records that constitute a particular page.
However, to use some Recordset object paging features (e.g., PageCount), you need a client-side cursor. To tell ADO to use a client-side cursor, you set the ClientLocation property of the Recordset object from adUseServer to adUseClient. Listing 1's Visual Basic (VB) code shows how to use a Recordset object with a client-side cursor and with a server-side cursor, respectively. Switching the ClientLocation property to adUseClient causes all a query's returned records to transfer to the client so that the Recordset object can determine the total number of pages the data will need.
For example, imagine that you've launched a query that returns 5000 records from the database. If the application pages these records with a page size of 20 records per page and the user is looking at page 1, the application needs to transfer only the first 20 records to the client. When the user moves to the second page, the application needs to transfer only records 21 through 40. However, when you're using a client-side cursor, ADO transfers all 5000 records every time the user looks at the data, even though the user needs only 20 records. This complete transfer results in a noticeable and annoying delay and can seriously impair performance if the number of records is very large.
A Paging Alternative
Because the performance problems associated with ADO Recordset paging were affecting my application's performance, I implemented an alternative paging technique that I use for large Web applications that typically page several thousand records. This simple technique relies on SQL Server's efficient use of indexes to select the top records in a result set.
You know that the following query selects the first 10 records in the Northwind database's Orders table:
SELECT TOP 10 * FROM Orders
You use this syntax to select a block of 10 records from a larger list. To select a given page of records, you simply have to know how many records are on each page and which page you want to access to return the information your user needs. So if the page size is set at 10 records and you want to retrieve page 3 of the result set, you can use the following query:
OrderID NOT IN (SELECT TOP
20 OrderID FROM Orders)
This query returns the block of 10 records that come after the first 20 records in the list—records 21 through 30. You can generalize this query as the following pseudo code shows:
(current_page - 1))
OrderID FROM Orders)
This technique works well for large numbers of records, but its performance decreases as the page number you're looking at increases. The reason is the relative inefficiency of the IN operator. If you're looking for page 500 in a recordset that has a page size of 10, for example, the subquery that follows the IN operator looks like this:
(SELECT TOP 4990 OrderID FROM Orders)
The server now has to compare each OrderID it attempts to retrieve against 4990 other OrderIDs. That's a lot of work. Still, SQL Server can process such queries quickly because its indexes organize the data in a way that makes searches efficient and fast. (Because OrderID is a primary key, SQL Server indexes it by default.) Although performance decreases as the page number increases, the slowing becomes noticeable only at large page numbers. This gradual performance reduction isn't much of a concern because users are unlikely to look at more than the first few pages, and for the first pages, the query's performance is extremely good. If the application sorts and filters the data properly, users will most likely find what they're looking for on the first or second page. If they don't find the records they need quickly, most users will initiate a new search by using different sorting and filtering parameters rather than trying to find the information they're looking for on page 2479, for example.
As you might have noticed, having a primary key defined on the table you're selecting from is crucial in this situation. The primary key provides for each record a unique identifier that lets the query differentiate between the records it's supposed to select and those it's supposed to ignore. Also, elements such as WHERE and ORDER BY clauses are missing from this example paging query. A generalized query that includes these elements would look like the query that Listing 2 shows.
Sorting is generally inefficient and significantly slows queries. Because SQL Server can perform sorting and filtering most efficiently on indexed columns and because SQL Server always indexes the primary key column by default, you can further optimize this query. Instead of selecting all the table's columns, you can restrict the search to the primary key until you've narrowed the window of records to the page you're looking for and only then retrieve all the table's columns, as the code in Listing 3, page 39, shows. If you have an index that includes the primary key, the fields that you're sorting on, and the fields that you're filtering on, SQL Server can use its index tables to find the records you're trying to retrieve. This type of index is called a covering index because it includes all the fields that the query needs. A query that runs against a covering index is faster than a query that runs against the table. When you finally retrieve all the records you're searching for (i.e., those that the SELECT * portion of the query specifies), the primary key index lets SQL Server access those particular records quite efficiently.
Listing 3's query provides an efficient, simple way of retrieving a particular block of records. You can add to the query by including the GROUP BY and HAVING clauses. To simplify using this query, I encapsulated the functionality in a stored procedure called SELECT_WITH_PAGING, which Listing 4, page 39, shows. The parameters that this stored procedure accepts (and their types) are fields_to_return (string), primary_key (string), table_name (string), page_number (integer, default 1), page_size (integer), get_record_count (true/false), filter_conditions (string), sort_columns (string), and group_by (string).
So, for example, if your page size is 10 records and you want to retrieve page 3 of a list that includes user IDs and ship-to names from the Northwind Orders table sorted by date, you'd issue the following statement:
3, 10, 1, '', 'OrderDate'
When you execute this query, notice that it returns a second recordset that contains the total number of records that the filtered query would normally retrieve. This second recordset is useful if you want to display the total number of pages for a user, which is the case in most situations; in my experience, users like to see page numbers and the total number of pages rather than just links to "previous page" and "next page." To retrieve the total number of records, I set the sixth parameter to 1. If you don't reset this parameter, you can't know how many records you have because the query will return only 10 records (the page size) from the database.
Note that the table_name parameter can also be a join of two or more tables or even a subquery. For example, both of the following strings are valid as parameters to the stored procedure:
ON A.CustomerID = B.CustomerID'
'(SELECT * FROM Northwind.dbo.Orders
WHERE OrderDate > ''8/1/1996'') AS tbl'
The technique in this article allows simple, efficient, server-side paging of large numbers of records, and I've used it successfully for tables that contain millions of rows. In one case, retrieving the first few pages of a table with 25 million records took more than 40 seconds when I used the ADO Recordset object; when I used this stored procedure, the same query took less than 1 second. This approach is particularly useful when an excessive amount of data and performance considerations make using the ADO Recordset paging features impractical.