I recently spoke at WindowsIT Pro’s Keeping Mission-Critical Applications Available online event.
My session, High Availability Isn’t Disaster Recovery, focused on a number of topics and concepts to the point where I wanted to provide a blog post with additional resources and links for organizations and IT pros who are either managing or contemplating HA solutions – in order to ensure that they’re not exposed to common flaws and misconceptions about how HA compares to DR.
If you didn’t have the opportunity to attend any of the sessions in this event, they’ll be available (free) for roughly 3 months for on-demand viewing starting (roughly) the day after the event is over.
Otherwise, what follows are a set of additional resources, links, and content that anyone who is dealing with HA solutions should skim.
In my presentation I stressed the importance of RPOs and RTOs – something I’ve already blogged about in my post on SQL Server Recovery Time Objectives and Recovery Point Objectives.
Stated simply, if you haven’t defined RPOs and RTOs, then your best efforts in responding to a disaster are still likely to meet with failure, as management has likely been working under the assumption (as crazy as it is) that everything was protected and that there really wasn’t much of an opportunity for down-time and lost-data. As such, RPOs and RTOs represent a ‘reality check’ for management who might not give much thought about the true cost and scope of data-loss and down time while also expressly quantifying acceptable levels of data-loss and down-time per system/database/etc.
Or, in even simpler terms: RPOs and RTOs are essential not only as a CYA mechanism, but as a means for helping define and set expectations – something that’s essential to tackle any time we’re talking about the potential for disaster.
Another common problem I run into as a SQL Server Consultant is cases where many IT pros mistakenly assume that since they’ve implemented vast/expensive High Availability solutions to ensure the redundancy and fault-tolerance of their systems and data, that they’re protected from disaster.
I’ve written on this before – for SQL Server Magazine: Confusing High Availability with Disaster Preparedness.
As such, a large portion of my presentation was focused upon demonstrating ways in which disasters could ‘defeat’ High Availability Solutions. And, in terms of specifics, my presentation focused on how two kinds of corruption, logical corruption and physical corruption, can defeat many High Availability solutions.
Simply stated, logical corruption (something that I also like to call ‘soft’ corruption, because it’s either caused by software glitches or bugs in your applications or by soft-skinned-air-breathers (or ‘meatbags’)) is something like DROPing or TRUNCATEing a table – or doing an UPDATE against a table without a WHERE clause – it’s something that quickly renders your data BAD. And the problem with HA solutions is that they don’t care about the state of your data – they’re simply tasked with making it available and redundant. But highly available and redundant data that’s bad data is still busted – or a disaster.
Unlike logical corruption which is almost always software or humans, physical corruption is corruption at the IO subsystem level – or where SQL Server asks the OS to write data (either for your data files OR even for your log file) to disk and the disk, controllers, drivers, or something else mangles that request and renders gibberish instead of well-persisted zeroes and ones.
As I pointed out in my presentation, dealing with logical corruption can be a SERIOUSLY ugly task. High Availability solutions just don’t help in dealing with most logical corruption – unless you’re willing to potentially sacrifice substantial LOST data in many cases.
I touched upon some of the ugliness involved in using backups (to create a copy of your non-disaster’d database and to try and copy values out of that copy back over/into production – but this approach is tedious and error prone) along with an overview of how third party log reader agents a while back in a competitive review article for SQL Server Magazine.
Unfortunately, the problem with third-party log reader agents, or tools that ‘magically’ read SQL Server Log Files and then give you the ability to generate scripts that can roll-out/back unwanted operations while ensuring that desired operations are rolled forward, is that they’re terribly hard to create.
Kalen Delany touched upon this a bit in a post for SQL Server Magazine where she asked Can the Transaction Log Tell us What Happened? She then followed up on this on her own blog – where comments from other DBAs and SQL Gurus helped point out just how hard it is to be able to actively READ the log. And that’s because, for all intents and purposes, SQL Server Log files are ‘proprietary’ and ‘internal’ storage mechanisms that are subject to change.
As such, vendors who attempt to offer the ability to read log files and offer DEPENDABLE information from them have to spend a decent amount of time working closely with Microsoft to stay abreast of changes and ensure that their applications work correctly. This, in turn, incurs some serious costs and overhead for products that everyone should buy – but which (strangely) not everyone seems to purchase or deploy.
At any rate to my knowledge, there are only TWO vendors today that provide Log Reader Agents/Solutions. I’ve used both and love them both – even though they both come with different price-points (because those different price-points are indicative of different features, options, etc). And those solutions are:
In my presentation, I also covered techniques for addressing physical corruption. Effectively, handling these kinds of problems in HA environments takes a couple of interesting turns or twists.
To learn more about those last two bullets, I recommend that you check out my free presentation on SQL Server Disaster Recovery and Prevention (which should be available until roughly July 25th) – where I’ve outlined best practices for detecting, assessing, and addressing physical corruption. Likewise, I’ve also made a blog post available that covers SQL Server Disaster Recovery and Prevention Resources.
Finally, as part of my presentation I ‘harped’ upon common best-practices that I iterate over and over again to clients and during presentations. Specifically, I touched upon the fact that many HA solutions or initiatives (as well as DR plans) tend to forget about the importance of environment – meaning that ensuring application up-time requires more than just ensuring that you have your data and systems to run it on. Or, more specifically, if you haven’t addressed issues with connection strings, permissions, certificates, paths, firewall ports, and other things prior to initiating an HA failover or a DR recovery operation, then you have no idea if things will work as needed.
As such, I touched upon the need/requirement to not only test HA/DR solutions initially for initial validation but to regularly test these plans/operations as well. Otherwise you run the VERY serious risk that a subtle environment change somewhere has busted your HA or DR solution in a way that will be unpleasant to deal with in the case of a true disaster.
Likewise, as I regularly point out, if you’re REGULARLY testing HA failover and DR plans, then you’re REGULARLY building familiarity and skill with the commands, operations, and tasks necessary to recover from a disaster – something that’s worth it’s weight in gold when an actual disaster strikes (because there’s no worse time to learn about how to restore a database, for example, than when your database is down, end-users are ticked, and your manager is standing over your shoulder).
And, of course, in order to make sure that Disaster Recovery operations go as smoothly as possible in both regular testing and disaster scenarios, you need to clearly document your HA and DR plans as granularly and succinctly as possible.