Learn what goes into making systems truly available
High availability is a hot phrase in the IT industry, but do you know what it means and how to accomplish it? IT professionals attempt to implement high availability every day by buying a lot of expensive hardware and software without tackling the real problem. Many people, both in technology jobs and in management, believe that high availability is a canned technology solution that they can put in place and forget about. But technology is only a small piece of the puzzle called high availability.
True high availability combines people and processes with technology. Although you can build highly available solutions without one of these components, attention to all three lets you provide enterprise-class availability even in the smallest environments. Without the people in place to manage the solution, along with processes designed to ensure that systems remain highly available, the technology won't be up to the task.
How do you measure availability? IT professionals define availability as
A = (F - (D + R))/F
where A is availability, F is mean time between failures, D is mean time to detect failure and decide on a remedy, and R is mean time to repair. The factors you have the most control over are the mean time to repair, mean time to detect a failure, and mean time to decide on a remedy. To detect a failure, you need technology and trained personnel. A highly trained person can prevent some failures and can detect others before they cause any problems, implementing remedies so that the system doesn't go offline at all. This preventative maintenance encompasses strong code-deployment processes that deliver baseline performance numbers for an operations team to monitor against. One of the most crucial abilities for a DBA is knowing when the system is performing as expected and when something abnormal is occurring. If load-testing processes aren't in place, you can derive performance numbers only by observing the application after it's in production, which can give you false results if an application experiences problems from the instant it's deployed. The performance-testing results of an application make DBAs better equipped to rapidly diagnose problems in production.
The only factor you can't control is the mean time between failures—it's entirely unpredictable. Therefore, your investments in high availability should be in people and processes to reduce detection and decision-making time and in the technology to reduce detection and repair time.
You should take away this set of rules from any discussion of high availability:
- Business needs dictate high availability.
- People manage and maintain high-availability solutions.
- Processes ensure those solutions will work.
- Technology provides physical support for high-availability needs.
The simple diagram that Figure 1 shows summarizes these rules. Without the people and processes, you're missing the vital core of high availability.
SQL Server High Availability
Availability is often measured in terms of "nines"—your system's percentage of yearly uptime. For four nines (99.99 percent), you can't be offline for more than 52 minutes per year. Five nines means that you can't be offline for more than 5.26 minutes per year. And six nines gives you an offline margin of just 32 seconds per year. Depending on your business definition, this downtime might include scheduled downtime. As you try to achieve more "nines" of availability, your processes must improve along with your technology solutions.
By leveraging three core technologies in SQL Server 2000—failover clustering, log shipping, and replication—you can achieve five nines of availability. I mention another aspect of availability in the sidebar "Disaster Recovery Means Availability, Too."
Any discussion of high availability needs to start with the editions of SQL Server you can use. To achieve highly available configurations, you must use the Enterprise Edition of SQL Server 2000. Only Enterprise Edition supports failover clustering and log shipping. Although it's possible to achieve high availability with the Standard Edition, you're limited to using replication and custom log shipping. Let's look at these three core technologies for availability.
The primary high-availability technology in SQL Server 2000 is failover clustering. In "Clustering SQL Server," page 27, Brian Knight walks you through the details of setting up a two-node SQL Server 2000 cluster on Windows 2000 Enterprise Edition. Before SQL Server 2000, failover clustering wasn't pretty and rarely worked well. However, Microsoft completely rewrote failover clustering in SQL Server 2000, and the technology became easy to implement because of SQL Server 2000's many new capabilities. You manage the cluster the same way you manage a single-machine SQL Server installation. Unless you're directly interacting with the desktop and Cluster Administrator (the utility that lets you set up clustering), SQL Server doesn't expose the fact that it's running on a cluster. This transparency makes administration of a cluster simple.
Although administrative procedures don't change when you go from a single-machine configuration to a multimachine configuration, you gain an important capability. In the event of hardware failure, the cluster automatically moves the SQL Server 2000 application to another available piece of hardware in the cluster and brings it online. If you've used certain development practices (which I describe later), your end users will never know that you had a hardware failure. This seamlessness is important because the availability rate at the server level is a relatively useless number. Only the availability at the data-consumer's desktop is important.
When the piece of hardware (called a node) that SQL Server is running on fails, SQL Server shuts down. The Microsoft Cluster service moves the ownership of the disk space to another node and starts up SQL Server. The failover of resources in a cluster occurs within about 15 seconds under normal circumstances. However, many people don't realize the implications of SQL Server shutting down and restarting. When SQL Server starts, it runs through a specific, sequential process called restart recovery. The crucial part of restart recovery is the undo-redo phase for a database. The redo phase is always short; you control it through the recovery- interval configuration option. In the undo phase, you're at the mercy of how an application was coded. Long-running transactions can severely degrade the cluster's failover time.
For example, say a user initiates an operation in the database that takes 6 hours to complete. At 5:59:00 into the process, the node running SQL Server fails. The Cluster service takes over, moves resources, and brings SQL Server online within 15 seconds. Although the cluster failed over in 15 seconds, users now have to wait nearly 6 hours before their data becomes available again because of the undo phase of recovery. Had the developer written this application so that it ran in recoverable chunks with a maximum duration of 3 to 5 seconds, the cluster would have failed over and data would have become available to users within 20 seconds of the primary node failure.
The other area where close developer support can produce an environment that gives significantly higher end-user availability is in the connection and transaction handling inside an application. Regardless of the hardware a SQL Server is running on, it will always have the same name. If a user is connected to a clustered SQL Server that suddenly fails over, that user is disconnected. To continue processing, all the user has to do is reestablish the connection to SQL Server. Developers can use this opportunity to make cluster failovers transparent to end users by programming the application to detect a disconnection from SQL Server, wait a few seconds, then reconnect. Once reconnected, the application should be able to reissue any transaction that hadn't completed before the failover. Leveraging the application this way allows for a resilient system architecture that ensures that no user data is lost even during a system failure.
Now, let me dispel some common misconceptions about clusters. A SQL Server cluster is a hardware-protection mechanism only. A SQL Server cluster won't protect you from data failures, application-logic failures, user errors, or data corruption. To protect against these errors, you need to leverage database backups. A SQL Server cluster also doesn't increase performance or scalability. You might have multiple pieces of hardware acting as one resource, but SQL Server can't access resources on another piece of hardware, and an application can't scale any further on a cluster than it can on one node.
Because failover clustering only protects against hardware failures without any ability to offload processing or produce scalability, it isn't all you need for high availability. Log shipping, another SQL Server high-availability technology, is a fancy name for automating a database backup and restoring to a different server. Several articles by Ron Talmage (see the Related Reading box) cover log shipping in detail, so I won't examine the internal mechanisms of the technology here.
Log shipping provides a resilient option for protecting against hardware failure. Log shipping works by copying a primary server's backup files to a secondary server, then restoring the files on the secondary server. The basic design of log shipping provides for a primary server and a secondary server that are independent, not sharing any hardware components. This configuration removes the single point of hardware failure present in a cluster that uses a shared drive array.
You can use log shipping to provide significantly higher data redundancy with many more environments than is possible with failover clustering. For example, Windows Server 2003 Datacenter Edition supports a maximum of eight nodes in a cluster. But the only thing that limits the number of secondary servers you can have with log shipping is how much infrastructure you can manage; you can create as many copies of the backup as you want. Log shipping can also operate across much longer distances than failover clustering. Most failover clusters are limited to approximately 80 to 100 miles. With log shipping, only the length of the cable between the primary and secondary servers limits the geographic dispersion of your secondary servers.
Although log shipping is a component of a high-availability solution, it's missing an essential feature. Log shipping doesn't have a mechanism for automatically detecting failure and initiating a failover to the secondary server. Log shipping also has a lag time during the failover when all backups are applied to the secondary server before it can be recovered and made available. Because log shipping relies on copies of the backups from the primary server, you risk losing some committed transactions if your primary server has a complete hardware failure. With failover clustering, once SQL Server is back online, clients can reconnect without any change in connection strings. Log shipping doesn't have this transparency; clients need to connect to a server with a different name than the primary server.
You access another feature of log shipping through the restore options you use when applying subsequent transaction log backups. If you use the WITH STANDBY option for restore, you can make the secondary server available for read-only activities. This secondary server is an excellent place to move reporting and other read operations from the primary server and gives you a level of scalability that you can't get from a failover cluster.
Replication, the third high-availability technology in SQL Server, is complex but not complicated. Because of its complexity, people don't use it as frequently as log shipping or failover clustering for high availability. For purposes of high availability, I restrict this discussion to transactional replication in the case of one Publisher—the primary server—sending data to one or more Subscribers—the secondary servers.
Failover clustering operates at the server level, ensuring that the entire SQL Server is available. Log shipping works at a more granular level, ensuring the availability of one database within a server. Replication moves to the most granular level: It ensures availability transaction by transaction and can even provide availability for only a subset of one table in a database. Whereas log shipping works by restoring transaction log backups from the primary server to a secondary server, replication works within the transaction log to send individual transactions to a secondary server as they occur. Because of this granular function, replication can provide extremely rapid failover from a primary to a secondary server. In most stable configurations, you can keep the data on the secondary server current to within 1 or 2 seconds of the primary.
Although it can accomplish extremely low latency, replication has no scheduling mechanism that requires a maximum latency between primary and secondary server. The replication engine replicates the data and applies it to the secondary server whenever it gets there. Although the replication engine will deliver the data eventually, the uncertain delivery time causes the most problems with replication implementations.
Fortunately, the delivery time is entirely within the control of developers, DBAs, and other IT professionals. You can't replicate a transaction until it's been committed, so developers who write large transactions create latency problems in replication on two fronts. First, because these long-running transactions aren't replicated to the secondary server until they're complete, the secondary (i.e., the replication target) falls significantly behind the primary. Secondly, when a transaction is committed, packaging up the transaction, transmitting it, and executing it on the secondary server takes a long time. Additionally, transactional replication guarantees that transactions will be committed on the secondary server in exactly the same order as on the primary. Because the replication engine serializes all committed transactions, all subsequent committed transactions will queue up behind a large transaction, waiting for their turn to execute.
This might seem like a severe limitation, and it has a significant impact on the replication engine, but it doesn't affect availability too badly. Replication implementations for high availability generally employ at least three servers. The Publisher is the primary server where the applications send all transactions. The Subscriber is the secondary server, which acts as a standby for the primary and is available if the primary fails. The third server is the Distributor, whose sole function is to guarantee delivery of transactions to the secondary server. As soon as transactions are copied from the Publisher to the Distributor, the Distributor delivers them to the Subscriber. This process lets an application immediately fail over to the secondary server—even though not all transactions might have been committed on the secondary—because the Distributor guarantees their delivery. In general, how you manage this latency in data availability will dictate how rapidly you can fail over from the primary server to the secondary. Under ideal conditions, clustering can accomplish a failover in about 15 seconds, log shipping can accomplish a failover in about 1 to 2 minutes, but replication can give you instantaneous failover.
Like log shipping, replication can maintain as many secondary servers as you want to deploy. The only limitation comes from your processing capacity, network bandwidth, and ability to manage the infrastructure. As with log shipping, you can use a secondary server to offload read operations to provide additional scalability. The secondary server remains 100 percent available to read operations even while the Distributor is applying transactions. It remains available because— unlike the transaction-log restores that log shipping relies on, which can't happen while another user is accessing the database—the replication engine issues the same transactions on the secondary server as the application on the primary issued. Also like log shipping, replication doesn't have a single point of hardware failure because it's using a physically separate server, and it provides no automated failover. As a side effect of the replication engine essentially reprocessing transactions on the secondary server, the replication engine can't propagate data corruption from the primary server to the secondary. Note that, as with every other high-availability technology, replication won't protect you from user error. This is the one availability problem that you can't handle with technology.
Sounds like replication is a great choice for high availability, but it isn't that simple. Although you have physically separate hardware components, because replication binds multiple machines together, failure of one server can cause the other server to fail. This is called a soft failure, and it usually results from the transaction log filling the available disk space and taking the database offline. Issuing a transaction-log backup doesn't solve this problem, because although such a backup removes committed transactions from the log, it doesn't remove transactions on replicated tables from the log until they arrive at the next server upstream. Therefore, replicated transactions on the Publisher can't be removed from the transaction log until they've been successfully written to the Distributor. And you can't remove transactions from the Distributor until they've been successfully written to all Subscribers. If a Subscriber fails and remains offline for a long time, the data volume on the Distributor might grow until it exceeds available disk space and causes the Distributor to go offline. This in turn prevents transactions from being removed from the transaction log on the Publisher until they've been written to the Distributor. If the transaction log fills up and can't expand, the primary server goes offline. If you're implementing replication for high availability, you need to closely monitor it and fix failures in the environment rapidly to prevent this cascading-failure scenario.
Three Ways to Available Systems
You can use SQL Server 2000's three core technologies alone or in concert to produce highly available architectures. Failover clustering protects against hardware failures while providing an automated mechanism to restart SQL Server on functioning hardware. But a failover cluster has a single point of failure in the shared drive array, can't protect from data errors, doesn't provide any scalability features, and is extremely sensitive to the duration of transactions. Log shipping protects against hardware failures and can prevent some data failures; it also provides a rudimentary mechanism to offload reporting from the primary server to a secondary server and increase scalability on the system. But log shipping is also sensitive to transaction duration and has the drawback of maintaining a separate copy of the data with no easy way to fail back from a secondary server to the primary. Replication can protect against hardware failures and SQL Server data corruption while providing a high degree of scalability in any environment. But replication can't protect against user error, is sensitive to the size and duration of transactions, and is relatively complex.
The bottom line is that no perfect solution exists. Each of these options can increase your availability and provide more data protection, and you can use them together to produce even higher degrees of availability. Your challenge is to figure out how to combine them to minimize each technology's drawbacks.
| BRIAN MORAN and DAVID SAPERY|
"SQL Server Clustering," June 1999, InstantDoc ID 5331
"Log Shipping in SQL Server 2000, Part 1," December 2001, InstantDoc ID 23056
"Log Shipping in SQL Server 2000, Part 2," January 2002, InstantDoc ID 23230
"Log Shipping with SQL Server 7.0," December 2000, InstantDoc ID 15804
Lab Report: "Database Replication," February 2002, InstantDoc ID 23331
JEANNINE HALL GAILEY
"Control Replication with ActiveX," July 2003, InstantDoc ID 39079
Microsoft's "SQL Server 2000 Operations Guide"