We’ve all rolled our eyes during school fire drills — they were loud, they were long, and they interrupted what we were doing. Worst of all, they were fake. But all that drama had a purpose. We’re all invincible until we’re not.
Look, we get it. No one wants to spend their time and resources practicing for a “what-if” situation. There’s always so much else to do instead.
But what happens when you actually need your disaster recovery (DR) plan? A failover scheme that fails isn’t a good look for anybody.
So how do you make sure your disaster recovery plan works? These seven tips should help.
1: Make someone the owner of stress testing
It’s easy to skip stress testing a DR plan if nobody’s in charge of making sure it happens. So add this to someone’s job description. This person is now responsible for scheduling regular stress tests, overseeing the tests, and deciding how to move forward based on the outcomes.
They should also be the one who updates the existing DR plan whenever you roll out a new product or feature (or make other business changes) that would change your DR needs.
Be sure to communicate that the outcome of the test is not a reflection of their performance. Often, people in charge of stress testing are self-conscious about the results, worried that they’re failing at their job if the test fails.
In reality, the point of the test is to identify failures and remediate them. A successful person in this role will identify weaknesses and develop a more comprehensive plan that evolves with the business.
2: Review your existing plan
If you haven’t reviewed your existing DR plan in a year or more, do that now. Does it reflect the current priorities of your business? If not, don’t bother testing it; instead, focus on updating it so that it does.
Doing this will require you to think about business questions that go beyond the IT department. Figure out what’s necessary for your business to function: which apps have to run, how to implement application failover, which hardware has to be online, what dependencies flow from those apps and that hardware, etc.
To be sure you’re getting a full picture, consult with leaders from other departments to make sure you understand the needs of the business as a whole. That understanding should dictate recovery point objectives (RPO) and recovery time objectives (RTO) so that you have a DR plan that meets everyone’s needs during a disaster.
(Note: Tech teams tend to err on the side of overdoing DR plans, but complexity leads to fragility. The goal here is right-sizing.)
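One way to keep that inventory explicit is to write down each app’s RTO, RPO, and dependencies, then check that no app is expected to recover faster than something it depends on. The following is a minimal sketch; the service names and the minute values are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Service:
    name: str
    rto_minutes: int  # longest tolerable time to restore the service
    rpo_minutes: int  # longest tolerable window of lost data
    depends_on: List[str] = field(default_factory=list)


def find_rto_conflicts(services: List[Service]) -> List[Tuple[str, str]]:
    """Return (service, dependency) pairs where the dependency is allowed
    to take longer to recover than the service that needs it."""
    by_name = {s.name: s for s in services}
    conflicts = []
    for service in services:
        for dep_name in service.depends_on:
            dependency = by_name[dep_name]
            if dependency.rto_minutes > service.rto_minutes:
                conflicts.append((service.name, dependency.name))
    return conflicts


# Hypothetical inventory, for illustration only.
inventory = [
    Service("checkout", rto_minutes=15, rpo_minutes=5,
            depends_on=["orders-db", "auth"]),
    Service("orders-db", rto_minutes=60, rpo_minutes=5),
    Service("auth", rto_minutes=10, rpo_minutes=30),
]

conflicts = find_rto_conflicts(inventory)
print(conflicts)  # [('checkout', 'orders-db')]
```

Here the checkout app is supposed to be back in 15 minutes, but the database it relies on is allowed an hour, so one of the two targets has to change.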
3: Get buy-in from leadership
Any time you’re asking for major input from other departments, you’ll need buy-in from leadership. Otherwise, your busy colleagues may not take your request seriously.
And even if your leadership already understands the theoretical importance of a disaster recovery plan, it’s probably wise to alert them that you’re changing your policy around stress tests (in that you will now be doing them).
Without that buy-in, stress testing will always get pushed aside in favor of projects and tasks that seem more urgent. While it may be obvious to you that that’s a short-sighted policy, you may have to make the case to non-IT leaders. (How? Keep reading.)
4: Frame disaster recovery & stress testing in financial terms
At their core, disaster recovery plans focus on one question: how long can your company afford to be offline? Five minutes? An hour? A day?
Help executive leaders think about disaster recovery in financial terms: how much revenue will the company lose in each minute of downtime? How long will it be able to pay vendors? What kind of reputational hit will it take?
Once you have answers to these questions, you can put together a DR plan that allows for no more than the acceptable amount of downtime and ensures that critical business functions are online as quickly as possible.
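To make the conversation concrete, it can help to put rough numbers on the questions above. This is a minimal sketch that assumes lost revenue scales linearly with minutes offline; the revenue-per-minute figure and the loss budget are hypothetical, and the model ignores harder-to-quantify costs such as reputational damage.

```python
def downtime_cost(revenue_per_minute: float, minutes_offline: float) -> float:
    """Estimated revenue lost, assuming a constant loss per minute."""
    return revenue_per_minute * minutes_offline


def max_tolerable_minutes(loss_budget: float, revenue_per_minute: float) -> float:
    """Longest outage that stays within a given loss budget."""
    return loss_budget / revenue_per_minute


REVENUE_PER_MINUTE = 500.0  # hypothetical figure, for illustration only

scenarios = {"5 minutes": 5, "1 hour": 60, "1 day": 24 * 60}
costs = {label: downtime_cost(REVENUE_PER_MINUTE, minutes)
         for label, minutes in scenarios.items()}

for label, cost in costs.items():
    print(f"{label}: ${cost:,.0f}")
# 5 minutes: $2,500
# 1 hour: $30,000
# 1 day: $720,000

# With a hypothetical $50,000 loss budget, the outage ceiling is 100 minutes.
print(max_tolerable_minutes(50_000, REVENUE_PER_MINUTE))  # 100.0
```

The ceiling that falls out of the second function is a starting point for the RTO discussion, not a substitute for it.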
Keep in mind that there’s no one-size-fits-all DR solution, and that the various options come with different price tags. Full online backup will always be more expensive than offline data replication, but sometimes it’s the best option for your business needs.
5: Stress test during scheduled downtime
Most companies schedule downtime into their SLAs. Take advantage of yours to plan a stress test. The main advantage of doing this is that, if something goes wrong, you have an opportunity to roll back to prod and reassess. Which brings us to our next point.
6: Accept that things will go wrong
The whole point of stress testing is acknowledging that your DR plan may not work – that there may be some problem you didn’t foresee.
“Failed” tests can be frustrating, but it’s much better to discover a problem during a stress test than during an actual disaster.
When you discover a problem, take time to understand what that problem is, fix it, roll back to production, and try again.
Remember: you’re doing this test because the business has evolved and you expect things to fail. This is part of the process of keeping your DR plan healthy and relevant to the business. If you never find any failures, you should seriously question your plan.
7: Lay out explicit rules for disaster recovery
Once you’re confident that your DR plan works from a technical perspective, it’s important to define when exactly you’ll put it into play.
For example, what are the triggers that set your plan into motion? Will you allow your team a certain amount of time (say, four hours) to try to fix a problem before failing over? Is that time different during peak hours versus off-hours? During certain times of the year? Can certain components or apps go down without triggering the DR protocol?
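Rules like these are easier to agree on, and to test, once they are written down precisely. Here is a minimal sketch of one such rule set. The four-hour off-peak window comes from the example above; the one-hour peak window, the peak hours, and the component names are assumptions made up for illustration.

```python
from datetime import datetime, time, timedelta

# All values below are hypothetical; set them from your own business rules.
PEAK_START = time(9, 0)
PEAK_END = time(18, 0)
FIX_WINDOW_PEAK = timedelta(hours=1)
FIX_WINDOW_OFF_PEAK = timedelta(hours=4)
NON_CRITICAL = {"internal-wiki", "reporting"}  # may go down without failover


def should_fail_over(component: str, outage_start: datetime, now: datetime) -> bool:
    """Return True once the team's fix-it window for this outage has run out."""
    if component in NON_CRITICAL:
        return False
    is_peak = PEAK_START <= now.time() < PEAK_END
    window = FIX_WINDOW_PEAK if is_peak else FIX_WINDOW_OFF_PEAK
    return now - outage_start >= window


# Peak hours: 90 minutes into the outage, past the one-hour window.
print(should_fail_over("checkout",
                       datetime(2024, 1, 15, 10, 0),
                       datetime(2024, 1, 15, 11, 30)))  # True

# Off-hours: two hours in, still inside the four-hour window.
print(should_fail_over("checkout",
                       datetime(2024, 1, 15, 20, 0),
                       datetime(2024, 1, 15, 22, 0)))  # False
```

Seasonal rules (for example, a shorter window during a holiday sales period) could be added the same way, as another lookup keyed on the date.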
As with your DR plan more broadly, these rules of engagement should be based on larger business goals and imperatives. Be sure to get sign-off from stakeholders in various parts of the business.
Keep in mind, too, that IT alone can’t fully test a DR plan. You’ll need people in other departments to test the functionalities they depend on. You’ll need customers (or customer-like actors) to test customer-facing features. Ensure all stakeholders know what they’re expected to do during a stress test and how they’re supposed to communicate their findings to you.
Even better: list everyone you need to participate in a DR plan, randomize the list, identify two or three key people, and run an application failover test without them. This will let you simulate a disaster where the entire team isn’t available to help recover (as will likely be the case in the real world).
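The selection step can be as simple as a shuffle. The sketch below is one interpretation of the idea: shuffle the roster, then sit out the first couple of key people who come up. The names and the choice of who counts as "key" are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple


def pick_unavailable(participants: List[str], key_people: Set[str],
                     count: int = 2,
                     seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Shuffle the roster and sit out the first `count` key people found.

    Returns (excluded, remaining). Pass a seed to make the draw repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    excluded = [person for person in shuffled if person in key_people][:count]
    remaining = [person for person in participants if person not in excluded]
    return excluded, remaining


# Hypothetical roster, for illustration only.
roster = ["Ana", "Ben", "Chloe", "Dev", "Elif", "Farid"]
key = {"Ana", "Dev", "Farid"}

excluded, remaining = pick_unavailable(roster, key, count=2)
print("Sitting this test out:", excluded)
print("Running the failover:", remaining)
```

Drawing the names at random keeps the test from quietly depending on the same few people every time.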
The cost-benefit analysis of stress testing & disaster recovery
It’s important to have a disaster recovery plan, but it’s equally important to make sure that it works.
There’s no one way to handle DR or to stress test a DR plan. As with any business decision, the right way for you will depend on how much downtime your company can afford, how much risk you can handle, and how much money you’re willing and able to spend. Similarly, there’s no secret trick to doing this; it’s simply a matter of following the steps you lay out for yourself and adjusting as needed.