Clear the way for large-scale snapshot replication

Editor's Note: This article is the first of a three-part series showing how to optimize snapshot replication. It shows how to evaluate snapshot replication's performance and lays out the steps for optimizing the performance. The second and third parts of the series will detail the full implementation of these optimization steps.

If you're an application or database developer, you're probably familiar with the excessive length of time that utilities such as bulk copy program (bcp), Data Transformation Services (DTS), and snapshot replication need to copy a massive amount of data to multiple locations. For example, a snapshot of a database that holds 500MB of data defined with key constraints and indexes might take 2 hours during the production day or 1.5 hours at night to transfer between two 4-way, high-end database servers with 1GB of RAM and RAID 5. This performance is unacceptable in a 24*7 load-balanced server farm or a distributed enterprise data warehouse whose available off-peak time is too short to accommodate the transfer of multiple massive data sets to all servers. So how can you speed up your data distribution?

I choose to optimize the replication of a large snapshot of data. Snapshot replication copies a set of data out of a source SQL Server at the moment when the replication is invoked, even while other queries or updates exist on the same set of data. The snapshot replication then copies this set of data, immediately or at a later time, into one or multiple target SQL Servers. The target servers see the data as it existed in the source server at the moment of copy until the next snapshot replication refreshes the entire set of data. DTS copies a set of data out of an ODBC (or OLE DB) source and into another ODBC (or OLE DB) target immediately. The source and the target can be two completely different ODBC data providers. Bcp copies one SQL Server table or view out to a native- or character-formatted file, or it copies one native- or character-formatted file into a SQL Server table or view.

Snapshot replication actually uses bcp and takes place in two stages. The first stage copies (bcps out) a snapshot of a selected data source at a scheduled time. The second stage distributes (bcps in) the snapshot to one or multiple target servers on separate schedules. The DTS Copy SQL Server Objects task also uses bcp (by supporting the BULK INSERT task), but it can't schedule the copy out and copy in separately, and it can't share one snapshot among multiple targets; a DTS job can copy to only one target. Bcp itself is a raw command-line utility: it lacks a user-friendly interface, can't copy from one SQL Server to another directly, and can copy only one table per execution. Snapshot replication and DTS are user-friendly and functionally powerful alternatives to the bcp utility. (To find out why snapshot replication might be more appropriate than transactional replication for your project, see the sidebar "Why Not Transactional Replication?" page 46.)

How Snapshot Replication Works

To lay the groundwork for the optimization process, let's briefly review the basics of snapshot replication, which replicates data in two stages. In the first stage, the Snapshot Agent copies out selected tables (called published articles) from the source database (the publisher) to a distribution area (the distributor) as a set of .bcp files by running the snapshot.exe command-line utility. Note that in SQL Server 2000, you can also define an article based on a view, stored procedure, or user-defined function (UDF).

In the second stage, the Distribution Agent copies the set of .bcp files into the target databases (the subscribers) by executing the distrib.exe command-line utility. The snapshot.exe and distrib.exe utilities use bcp to copy the articles one at a time. The Snapshot Agent copies the data out to .bcp files by executing a simple SELECT statement that might contain a column filter and a WHERE clause (to filter rows). Then, the Snapshot Agent writes the result set to .bcp files. When the Distribution Agent copies (inserts) .bcp files into database tables, it has to delete the existing rows, possibly log the deletions and insertions, and maintain the table indexes.
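To make the three kinds of script files concrete, here's a rough sketch of what the generated scripts might contain for a simple article. The table, column, and index names are illustrative placeholders, not the actual output of snapshot.exe:

```sql
-- Sketch of a generated .sch script (names are placeholders):
-- the default sch step clears the existing rows in the target table
DELETE FROM [ADDRESS]
GO

-- The bcp step then bulk-inserts the article's .bcp file into [ADDRESS].

-- Sketch of the matching .idx script, applied last by the idx step:
CREATE UNIQUE CLUSTERED INDEX [PK_ADDRESS]
    ON [ADDRESS] ([address_id])
GO
```

The real files that Figures 5 and 6 show are richer than this sketch, but the division of labor is the same: .sch prepares the table, .bcp supplies the rows, and .idx restores the indexes.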

Before you begin this two-stage process for replicating your data, you need to establish a framework for it. Let's look at the default steps for setting up snapshot replication.

Defining and configuring the publications and subscriptions. You can easily define and configure snapshot replication by using Enterprise Manager or system stored procedures. Typically, you create a new publication by first selecting the articles you want to include, then setting properties for the publication and the articles. Next, you use Enterprise Manager to add subscriptions, one at a time, that subscribe to all articles of the publication. Alternatively, you can use the sp_addsubscription stored procedure to select only those publication articles you want to subscribe to or all the articles. You call the procedure by setting the @article parameter to each article name, one at a time, or by specifying all for all articles. For example, in a Query Analyzer window, you issue the call

EXEC sp_addsubscription
    @publication = N'pubs',
    @article = N'authors',
    @subscriber = N'CGIS1',
    @destination_db = N'pubs',
    @sync_type = N'automatic',
    @update_mode = N'read only',
    @offloadagent = 0,
    @dts_package_location = N'distributor'

to subscribe only to the authors article for subscriber CGIS1. If you set @article = N'all', you subscribe to all articles. The subscription can be a push or a pull subscription. A push subscription is created at the publisher and is suitable for a tightly integrated, connected, and centrally administrated environment, such as a development server (the publisher) connected to a production server farm (the subscribers) in a data center. (The Distribution Agent in this case runs at the distributor.) A pull subscription is created at the subscriber and is best for a loosely coupled, connectionless, more autonomous relationship, such as that between a traveling salesperson's laptop and the office server. (In this case, you off-load the Distribution Agent to run at the subscriber.) The optimization technique I describe in this article applies to both pull and push subscriptions.
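For comparison, a pull subscription is created with a call that runs at the subscriber rather than the publisher. A minimal sketch, in which the server and database names are placeholders, might look like this:

```sql
-- Sketch: create a pull subscription at the subscriber
-- (publisher and database names are placeholders)
EXEC sp_addpullsubscription
    @publisher = N'CGIS1',
    @publisher_db = N'pubs',
    @publication = N'pubs',
    @independent_agent = N'true'
```

The @independent_agent setting must match the publication's own configuration, which I discuss in the next section.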

Assigning Snapshot and Distribution agents. After you've defined and configured the publication and subscriptions, Enterprise Manager's Create Publication Wizard assigns a Snapshot Agent to each publication. Further, by default, Enterprise Manager's Create Push Subscription Wizard assigns only one Distribution Agent to distribute all publications in a database that a subscriber subscribes to. For example, a subscriber might subscribe to both publication A and publication B from the same publisher and publication database. By default, only one Distribution Agent serves these two publications to the subscriber. If the Snapshot agents of these publications run at the same time, you can't distribute the two snapshots in different schedules. To make the distribution schedule more flexible for each publication in a large-scale replication environment, I prefer to assign one Distribution Agent to publication A and one to publication B. To assign a separate Distribution Agent to each publication in a subscription, go to the publication's Properties window, click the Subscription Options tab, then select the Use a Distribution Agent that is independent of other publications from this database option, as Figure 1 shows. For each publication, you can view the result of this configuration in Enterprise Manager by following the path that the left pane in Figure 2 shows. The right pane shows one Snapshot Agent and two Distribution agents (one for each publication in each subscription) for the pubs1 publication in the Pubs database highlighted in the left pane.
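If you script the publication instead of using the wizard, the Subscription Options setting in Figure 1 corresponds to the @independent_agent parameter of sp_addpublication. A minimal sketch, with placeholder names:

```sql
-- Sketch: create a publication that gets its own Distribution Agent,
-- independent of other publications in the same database
EXEC sp_addpublication
    @publication = N'pubs1',
    @independent_agent = N'true',
    @snapshot_in_defaultfolder = N'true'
```

Scripting the setting this way is handy when you define many publications and want the per-publication agent assignment applied consistently.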

Assigning Snapshot and Distribution Agent jobs. Enterprise Manager automatically defines a SQL Server job for each Snapshot Agent and each Distribution Agent. The agents carry out the replication by executing steps in their jobs. By default, a Snapshot Agent's job consists of three steps—Log agent startup message, Run agent, and Detect nonlogged agent shutdown. Figure 3 shows these three steps on the Steps tab in the CGIS1-EGH-address-28 Properties-CGIS1 window. Figure 4 shows the command that runs during the Run agent step—which executes snapshot.exe. The command's full syntax is

snapshot -Publisher [CGIS1] -PublisherDB [EGH]
    -Distributor [CGIS1] -Publication [address]
    -DistributorSecurityMode 1

Snapshot.exe copies the schemas, indexes, and records of published articles from the publisher to the distributor. Then, in the distributor's Snapshot folder, snapshot.exe saves the article schemas in .sch T-SQL script files (as the code snippet in Figure 5, page 48, shows) and saves the indexes in .idx files (as Figure 6, page 48, shows). Finally, the records are stored in .bcp files. Each article has one set of .sch, .idx, and .bcp files.

Also by default, a Distribution Agent's job consists of the same three steps as that of a Snapshot Agent. However, its Run agent step executes distrib.exe, which the code in Figure 7, page 48, shows. Distrib.exe's full syntax is

distrib -Subscriber [HERTSCHEN3] -SubscriberDB [sde]
    -Publisher [CGIS1] -Distributor CGIS1
    -DistributorSecurityMode 1 -PublisherDB [EGH]

First, distrib.exe applies the schemas in all the .sch files to a subscription database (for convenience, I call it the sch step). Next, it uses bcp to copy the records from all the .bcp files to the target tables that the sch step creates (the bcp step). Finally, it applies the indexes in all the .idx files to the copied target tables (the idx step).

Snapshot.exe generates each set of .sch and .idx files according to the article properties you set in the Snapshot tab of the respective article's Properties dialog box, as Figure 8, page 49, shows. For example, selecting the Delete data in the existing table that matches the row filter statement option in Figure 8's ADDRESS Properties dialog box adds the DELETE FROM [ADDRESS] statement to the .sch file that Figure 5 shows for the ADDRESS article. However, if you select the Delete all data in the existing table (using TRUNCATE) option in Figure 8, a TRUNCATE TABLE [ADDRESS] statement results instead. In addition, selecting the Copy...indexes options in Figure 8 adds the CREATE INDEX statements to the .idx file that Figure 6 shows for the ADDRESS article. Conversely, clearing all Copy...indexes options empties the .idx file. Note that in SQL Server 2000, indexes are always saved to the .idx file even if you clear them in Enterprise Manager. To force an empty .idx file, use the sp_addarticle or sp_changearticle stored procedures with schema_option = '0x01', as the following example shows:

EXEC sp_changearticle 'pubs', 'authors', 'schema_option', '0x01', 1, 1

EXEC sp_addarticle @publication = N'pubs', @article = N'authors',
    @schema_option = 0x0000000000000001, ...

Where Bottlenecks Occur

The Snapshot and Distribution agents perform their tasks according to design, but a couple of these tasks can cause serious bottlenecks that compromise performance. Let's evaluate the agents' performance by applying a standard set of metrics.

Setting up performance criteria. To quantify an application's execution time, you can either empirically measure its execution or mathematically evaluate its algorithm. In this section, I do both an empirical measurement and a mathematical evaluation of the agents' steps to compare the results from these different methods. To quantify the execution time mathematically, you represent the execution time of the algorithm by the order of magnitude the time grows when the input-data size increases. This growing trend in execution time shows how well an algorithm can scale with the load. For example, if the time taken to process data of size n grows linearly with the data size (e.g., processing 1000 records takes 2 seconds and processing 10,000 records takes 20 seconds), the algorithm's execution time grows in linear order with the data size. This growing trend as a function of the data size is called the time complexity of the algorithm and is denoted as O(n) (or order n) for linear-order time complexity. An algorithm that always finishes in a constant amount of time no matter how much data you input has a constant-time complexity, which I denote as O(c).

Now, let's use these formulas to evaluate the Snapshot Agent's performance. Regardless of the number of subscriptions that you need to copy data to, snapshot.exe executes only once for each publication at each scheduled time. Generating .sch and .idx script files in this command takes a constant time (complexity O(c)), independent of the number of records (denoted as n) in the publication. The time that snapshot.exe takes to generate .bcp files, however, is directly proportional to the size of the publication (time complexity O(n)). My empirical measurement on a 500MB publication that has 6 million records shows that snapshot.exe takes 7 seconds to generate the .sch and .idx files, but takes 2 minutes and 38 seconds to generate .bcp files. These results confirm that the speed of generating .bcp files depends on the size of the publication. Compared to the Distribution Agent's command (distrib.exe), which I evaluate next, the Snapshot Agent's command (snapshot.exe) is still very fast, even in generating large .bcp files. Its performance is free of bottlenecks.

Because the Snapshot Agent doesn't slow down the snapshot replication process, the Distribution Agent must be the culprit. Applying the formulas to this agent's steps shows you where the bottlenecks occur.

First bottleneck: deleting target records. Distrib.exe executes once for each subscription. The command's functionality is exactly the opposite of snapshot.exe's functionality; the command applies (in order) all the .sch, .bcp, and .idx files of a publication to a target subscription database. In doing so, the command carries the overhead of logging all the DELETE statements in the sch step and indexing all the target tables in the idx step. A DELETE statement in the default sch step deletes all the records (assuming no filter has been specified) in a target table; that process has a complexity of O(n), where n is the number of records in that target table. In addition, each deletion records each deleted row in the transaction log and deletes the corresponding row from the clustered index and nonclustered index B-trees. According to the execution plan that Figure 9 shows, the DELETE statement copies each row to tempdb (Table Spool) once for each nonclustered index and deletes each row from each index B-tree. Each copy or delete in this execution plan costs about 10 percent of the overall cost of the DELETE statement. Therefore, for a table with m indexes, each index contributes roughly 1/m of the total processing cost. The overall complexity of the DELETE statement in the sch step thus is O(m * n). In the example I provided earlier (using a 4-CPU development server—with 1GB of RAM and RAID 5—to process a 500MB publication that has 6 million records), the typical duration for deleting all records in the target tables is 58 minutes. You can see that deleting your target records creates a bottleneck that grows with a publication's size and number of indexes.
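You can inspect this kind of plan yourself without actually deleting any data. A sketch, using the ADDRESS table from the example article:

```sql
-- Sketch: display the estimated execution plan for the sch step's
-- DELETE without executing it (ADDRESS is the example article's table)
SET SHOWPLAN_ALL ON
GO
DELETE FROM [ADDRESS]
GO
SET SHOWPLAN_ALL OFF
GO
```

In the plan output, look for one Table Spool and one index-delete operator per nonclustered index; their estimated costs are what drive the O(m * n) behavior described above.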

Second bottleneck: bulk copying the source data. After the sch step is complete, the bcp step bulk-inserts all records in the .bcp files into respective indexed target tables. You can tune a Distribution Agent's execution by setting parameters in its profile. (To access the profile, right-click a Distribution Agent—CGIS1:pubs in the right pane of Figure 2—then select Agent Profile.) For example, raising the value of BcpBatchSize (the number of records copied per bcp batch) in the profile can reduce distribution history logging. Or you can adjust the value of MaxBcpThreads (the number of concurrent bcp threads in the bcp step) to match the number of CPUs on the distribution server. However, I didn't find the profile helpful for improving performance and scalability during the distribution of my test data. Moreover, distrib.exe lacks a useful bcp parameter—the ORDER hint. When you set it, the ORDER hint causes bcp to sort the data according to the clustered index (if any) on the target table before copying the data into the target table. In effect, the data is replicated in the clustered index order, and records copied later never cause SQL Server to enforce the clustered index order by rearranging earlier records. Consequently, ordered bcp is significantly faster than dropping the clustered index before and recreating it after an unordered bcp.
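Although distrib.exe doesn't expose the ORDER hint, you can see its effect with T-SQL's BULK INSERT statement, which accepts the same hint. A sketch, in which the file path and key column are placeholders:

```sql
-- Sketch: an ordered bulk insert. The rows in the data file are
-- presorted by the clustered index key, so SQL Server can load them
-- without re-sorting or rearranging earlier records.
BULK INSERT pubs..authors
FROM 'C:\snapshot\authors.bcp'
WITH (
    DATAFILETYPE = 'native',
    ORDER (au_id ASC),   -- must match the clustered index key order
    TABLOCK
)
```

If the ORDER hint doesn't match the clustered index, SQL Server ignores it and falls back to an unordered load, so the hint is safe but only helps when the file really is presorted.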

Let's mathematically evaluate the best-case scenario for the bcp step by assuming it has the ORDER hint set. The bcp step in this scenario essentially involves inserting n sorted records (i.e., records ordered by the clustered index) into each target table and inserting each row of this table's indexed columns into each of the m index B-trees (assuming m indexes exist). The complexity is O(m * n) + O(n * log(n)), where O(n * log(n)) is a typical time complexity of sorting and O(m * n) is the time complexity of inserting n records and maintaining m indexes for each of the n inserted records. Keep in mind that this is the best-case scenario time complexity for the bcp step. Using the earlier example of a 500MB, 6 million-record publication running on a 4-CPU development server with 1GB RAM and RAID 5, the bcp step takes 55 minutes, nearly the same as the time that the sch step takes. This result confirms the similarity in complexity between the sch and bcp steps. Like the sch step, the bcp step creates a bottleneck that grows with the publication's size and number of indexes.

Optimizing Performance

Clearly, the bottlenecks that form when you try to copy massive amounts of data can render large-scale snapshot replication impractical. However, you can take steps to avoid these bottlenecks.

Solving the DELETE bottleneck. You can easily eliminate the bottleneck that DELETE statements cause in the sch step by selecting the Delete all data in the existing table (using TRUNCATE) option in the Snapshot tab of an article's Properties dialog box, which Figure 8 shows. The result is a TRUNCATE <table> statement instead of a DELETE FROM <table> statement in the .sch file. TRUNCATE TABLE deallocates all pages associated with the table in one operation and logs only the deallocation operation. Therefore, it always takes a constant time (complexity O(c)), which will scale up to any number of records and indexes in a publication. Testing with the same example publication shows that when you select the TRUNCATE option, the sch step takes less than a second instead of 58 minutes. Using the TRUNCATE option is possible, however, only under one condition: no foreign key constraint can reference the target table. If the target table is referenced by a foreign key constraint, the corresponding TRUNCATE statement in the sch step will fail. To fully optimize its sch step, the Distribution Agent has to drop all foreign key constraints associated with all target tables before the sch step, then recreate them afterward.
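The workaround for referenced tables can be sketched in T-SQL like this; the table, column, and constraint names are placeholders:

```sql
-- Sketch: make TRUNCATE possible on a table referenced by a foreign key
-- (table and constraint names are placeholders)
ALTER TABLE [ORDERS] DROP CONSTRAINT [FK_ORDERS_ADDRESS]

TRUNCATE TABLE [ADDRESS]   -- now succeeds; logs only page deallocations

ALTER TABLE [ORDERS]
    ADD CONSTRAINT [FK_ORDERS_ADDRESS]
    FOREIGN KEY ([address_id]) REFERENCES [ADDRESS] ([address_id])
```

Recreating the constraint after the data is reloaded also forces SQL Server to revalidate it, so a failed constraint check will surface at the recreate step rather than silently leaving bad data in place.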

Solving the bcp bottleneck. Unfortunately, you can't fix the bcp bottleneck through configuration the way you can configure the TRUNCATE statement. The only way to optimize the bcp step is to drop all indexes and key constraints before the bcp step and recreate them after the bcp step. Dropping and recreating the indexes and key constraints reduces the complexity of bcp to a linear time proportional to the size of the publication—O(n)—because bcp no longer has to do sorting and indexing. This optimized bcp step will scale up to any number of indexes in a publication. The publication in my example comprises 10 clustered indexes, 29 nonclustered indexes, and 1 foreign key constraint. Dropping all indexes and keys takes 2 minutes, 8 seconds; recreating all indexes and keys takes 10 minutes, 36 seconds, and completing the bcp step takes 28 minutes—a total of 41 minutes in contrast to 55 minutes. These tests confirm that the optimization result of the bcp step isn't as dramatic as that of the sch step because of the bcp step's linear time complexity. The overall benefit of eliminating both the sch and bcp bottlenecks in the example is a savings of 72 minutes.
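The drop-and-recreate pattern for indexes can be sketched as follows; the index and table names are placeholders, and a primary key that surfaced as a constraint (rather than an index) would need ALTER TABLE DROP CONSTRAINT instead:

```sql
-- Sketch: drop indexes before the bcp step, recreate them afterward
-- (index and table names are placeholders)
DROP INDEX [ADDRESS].[IX_ADDRESS_CITY]
DROP INDEX [ADDRESS].[PK_ADDRESS]

-- ...the bcp step then loads the unindexed heap in linear time...

CREATE UNIQUE CLUSTERED INDEX [PK_ADDRESS]
    ON [ADDRESS] ([address_id])
CREATE NONCLUSTERED INDEX [IX_ADDRESS_CITY]
    ON [ADDRESS] ([city])
```

Recreate the clustered index first: building it after the nonclustered indexes would force SQL Server to rebuild every nonclustered index a second time.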

Failed shortcut to optimization. To implement these optimizations for the sch and bcp steps, you might be tempted to try a shortcut. For example, suppose you set the @creation_script parameter of the sp_addarticle stored procedure to a custom script file when you add each article to the publication. Such a custom script file would drop all indexes and keys in each target table. (For information about creating a custom script file, see the Web sidebar "Custom Scripting Feature in SQL Server 2000," InstantDoc ID 24661.) Remember that the Distribution Agent applies sch, bcp, and idx steps sequentially. The goal is to replace the .sch file with a custom script file while retaining the original .bcp and .idx script files. The task looks easy because replacing the .sch file requires no significant implementation except preparing the custom script files.

Unfortunately, this idea doesn't work for two reasons. First, you might expect to use the .idx file that the Snapshot Agent created to recreate foreign key constraints and indexes that the custom script dropped. However, the .idx file doesn't include primary and unique key constraints because in the .idx file, the Snapshot Agent captures these key constraints as indexes instead. For example, in Figure 6, UNIQUE CLUSTERED INDEX [PK_ADDRESS] comes from a primary key constraint named [PK_ADDRESS] but survives as an index only. The second and more significant reason is that although the @schema_option parameter of the sp_addarticle stored procedure lets you disable the Snapshot Agent's scripting and use the provided @creation_script parameter, the @creation_script parameter's setting disables the generation of .sch and .idx script files as well. Therefore, no .idx file will be available for recreating the indexes that the @creation_script file dropped. Setting the @creation_script parameter to a custom script file doesn't achieve the goal.

Customizing Snapshot Replication

Now, let's sketch out conceptually the steps for customizing—thereby optimizing—snapshot replication. I'll detail the full implementation of these steps in the second article of this series. Remember that each Distribution Agent runs within a job, as Figure 3 shows. The job's Run agent step is a command that applies to the whole publication; you can't interrupt this step between articles to customize it. However, Enterprise Manager lets you customize a job by inserting new steps and modifying the flow of the steps. You can insert a step—which I call the drop step—before the Run agent step to drop all indexes and keys and insert another step—called the recreate step—to recreate all indexes and keys. Because you insert these steps, the bcp substep within the Run agent step (which executes distrib.exe) never sees indexes and keys and hence is optimized to a linear time complexity. Further, because the inserted recreate step will be responsible for recreating all the indexes and keys, you won't need the .idx file anymore. So, on the Snapshot tab of an article's Properties dialog box, which Figure 8 shows, clear all Copy indexes options to empty the .idx file and select the Delete all data in the existing table (using TRUNCATE) option to customize the .sch file. Because the drop step dropped all foreign key constraints, the sch substep within the Run agent step can run TRUNCATE statements successfully and therefore is fully optimized to a constant time complexity. So, the optimized steps occur in the following order: drop, sch (TRUNCATE), bcp, idx (empty), and recreate. Figure 10 illustrates the steps of a customized Distribution Agent job.
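Inserting the drop and recreate steps can also be scripted against msdb's job procedures. The following is only a rough sketch; the job name, step position, and helper procedure are placeholders (the full implementation, including rewiring each step's on-success and on-failure flow, is the subject of the second article):

```sql
-- Sketch: insert a 'drop' step ahead of a Distribution Agent job's
-- Run agent step (job name and helper procedure are placeholders;
-- existing steps at and after @step_id are renumbered automatically)
EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'CGIS1-pubs-pubs1-HERTSCHEN3-0',
    @step_id = 2,
    @step_name = N'Drop indexes and keys',
    @subsystem = N'TSQL',
    @database_name = N'pubs',
    @command = N'EXEC dbo.usp_DropIndexesAndKeys'  -- hypothetical helper
```

A matching recreate step would be inserted after the Run agent step the same way, calling a companion helper procedure that rebuilds the indexes and key constraints.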

Scale to Your Environment

The only drawback to the optimization is that between the drop and recreate steps of a Distribution Agent job, the drop step has dropped all indexes and keys of all target tables at the subscriber. The dropped indexes and keys can cause concurrent queries to execute more slowly and might compromise key constraints for the duration of that window. However, for a mostly read-only data warehouse server farm, the main requirement is to synchronize all data warehouses and data marts quickly within limited off-peak operating hours while keeping concurrent queries to a minimum. The optimization lends itself to this environment very well.