In this first post for the Practical SQL Server blog, I wanted to lay the groundwork for topics and concepts that I’m sure will show up over and over again when addressing future topics such as High Availability, Disaster Recovery, Performance Tuning, Budgeting, and so on.
Based upon my SQL Server consulting experience, many organizations haven’t actually sat down to quantify their Recovery Time Objectives (RTOs) or their Recovery Point Objectives (RPOs)—especially when it comes to their SQL Server databases.
Without properly identifying RTOs and RPOs, IT professionals risk a huge mismatch between what they’re able to recover during a disaster and what the business needs in order to survive that disaster.
Stated differently, if you haven’t actually quantified and defined these objectives, then it’s safe to say that you’re at risk.
Within the IT community, Recovery Time Objectives are commonly defined as the amount of time it takes to recover from a disaster – or to get a system back online after it goes offline or crashes. Similarly, Recovery Point Objectives are commonly described as the amount of data that was lost during the outage and recovery period.
Personally, I’m not a big fan of those definitions – because they fail to address the fact that RTOs and RPOs are both centered on the notion of ‘Objectives’ instead of actual down time and data loss.
In other words, when used correctly, RTOs and RPOs cease being mere buzzwords (or an annoyance put in place by management) and can become very effective tools for addressing the very real potential for disaster and proactively ensuring data protection and business continuity. More specifically, when leveraged correctly, RTOs and RPOs represent a great way for businesses (meaning IT and management) to work together to establish acceptable windows for downtime and data loss, and then begin working towards solutions that meet (or exceed) those objectives.
Even better, once RTOs and RPOs are defined, IT departments can take these benefits to the next level by codifying them into full-blown Service Level Agreements (SLAs), which shift the focus from a discussion of acceptable amounts of loss to a proactive focus on overall uptime and availability. This, in turn, helps IT professionals clearly establish their commitment to business continuity and growth, and can be a pivotal component in transitioning from a tactical (or reactive) approach to systems management to a more strategic (or proactive) approach.
Practically speaking then, the question is how do you go about figuring out what your objectives should be when it comes to restoring availability and dealing with data loss?
To address that issue, I typically like to think of RTOs and RPOs in terms of downtime—which really comes in two forms:
In other words, when IT professionals address RTOs and RPOs, they commonly focus only on how much lost time and lost data is acceptable, rather than on the total amount of disruption that a disaster can incur.
To put this into better perspective, consider an example centered on a medical billing application with, say, 20 or 30 semi-active users who regularly log data into the application during business hours. Then assume that the system goes down. Something similar to the following will characterize how things play out in most organizations:
Then, let’s assume a happy ending: After 35 minutes of downtime, the database is brought back online with only 15 minutes of data being lost from before the crash. In such an event, for example, the following considerations help contribute to addressing the true, or total, cost of this outage:
Consequently, when it comes to figuring out what your own RTOs and RPOs should be, you need to consider the potential costs associated with lost time and lost data for your own business. There is no ‘one size fits all’ approach to RTOs or RPOs, nor is it adequate to just assume that you can achieve zero loss of time or data, because you may not have the budget, solutions, or resources needed to meet such lofty goals.
As such, one way that I commonly recommend that organizations estimate the potential costs of lost time and data is simply to take average monthly business revenue and divide that figure down into revenue per day, per hour, and per minute as necessary. While this is a vastly oversimplified approach to calculating the potential costs of outages, it typically does a great job of underscoring just how at-risk many organizations are when they don’t have any plans or solutions in place to address this potential for loss. Likewise, another big benefit of this overly simplistic approach is that it can help IT professionals secure budget and resource allocations by making a very clear case to management of the risks involved.
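To make that back-of-the-envelope calculation concrete, here is a minimal Python sketch. The function names, the 30-day month, the revenue figure, and the idea of pricing lost data as extra minutes of re-work are all assumptions made for illustration; they are not figures from any real organization, and this is a rough estimate rather than a true costing model.

```python
def revenue_rates(monthly_revenue, days_per_month=30, hours_per_day=24):
    """Break average monthly revenue down into per-day, per-hour, and per-minute rates.

    days_per_month and hours_per_day are assumptions; adjust them to match
    how your business actually earns revenue.
    """
    per_day = monthly_revenue / days_per_month
    per_hour = per_day / hours_per_day
    per_minute = per_hour / 60
    return {"day": per_day, "hour": per_hour, "minute": per_minute}


def outage_cost(monthly_revenue, downtime_minutes, lost_data_minutes=0, **rate_options):
    """Estimate the revenue at risk from a single outage.

    Assumption: each minute of lost data costs about one minute of revenue
    (for example, the time needed to re-enter that data).
    """
    per_minute = revenue_rates(monthly_revenue, **rate_options)["minute"]
    return per_minute * (downtime_minutes + lost_data_minutes)


# Hypothetical business earning 432,000 per month, hit by an outage with
# 35 minutes of downtime and 15 minutes of lost data.
rates = revenue_rates(432_000)
estimate = outage_cost(432_000, downtime_minutes=35, lost_data_minutes=15)
print(f"Revenue per minute: {rates['minute']:.2f}")
print(f"Estimated cost of the outage: {estimate:.2f}")
```

A business that only earns revenue during working hours might pass `hours_per_day=8`, which triples the per-minute rate and shows how sensitive the estimate is to these assumptions.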
Once you’ve figured out how much data loss and downtime will cost or impact your business, you’re then able to formulate objectives for how to mitigate those costs. That is the exact role and nature of RTOs and RPOs: they specify the objectives (or goals) you’d like to meet in minimizing those costs. From this point, you’re then able to compare the kinds of budgets and solutions available to meet these objectives and begin putting technical solutions into place that meet your RTOs and RPOs, which in turn help you meet your service level agreements.
However, without testing, RPOs and RTOs are just an expression of how potentially expensive outages can be, because
Furthermore, without regular testing you also:
As such, in future posts we’ll look at some practical approaches to validation, testing, and documentation that you can leverage to make sure that you can meet (or exceed) your RTOs and RPOs.