Don’t use AWS out of the box and expect resilient systems

Too many companies assume that if they have data backups, business continuity (BC) and disaster recovery (DR) are handled. As anyone who’s been through an outage can tell you, it’s not that easy.

BC and DR are both critically important — and not at all the same. Each requires different work from different sectors of a company to set up. Both need to be fully in place, with teams trained on the plans and processes, before an incident. AWS can help, but you have to know what you’re setting up first.

Every company’s digital, physical and geographic storage needs are different. Your plan for keeping business afloat and data secure during disaster relies just as much on technology as it does on teamwork.

Business continuity vs disaster recovery — why they’re different and how to do both

If it were only about technology, our work wouldn’t be very interesting. The challenge lies in fitting the technology to the needs of the business and the dynamics of the team. That’s where you have an opportunity to provide the kinds of solutions that can truly save a business.

Disaster recovery lives on the technology side, but it’s entirely dependent on the needs of the business. Knowing what needs to be recovered, from where and how fast — that’s what matters most. Business continuity builds on that, developing a plan that makes the best use of the roles and personalities available to recover from disaster so quickly and seamlessly that business remains continuous. One depends on the other, and both depend on a deep understanding of the business.

Let the business determine disaster recovery, not IT

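A disaster quickly reveals whether a company’s IT department is operating in a vacuum. Both business and technology need a clear understanding of what is being recovered and how quickly. That means agreeing on a recovery point objective (RPO), the maximum amount of data the business can afford to lose, measured in time, and a recovery time objective (RTO), the maximum time a system can stay down. Without well-defined RPOs and RTOs, valuable time is spent fighting over expectations instead of fixing the issues.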

Too often, IT determines what its goals are based on its capabilities — instead of the business stating its needs and IT adapting to meet them. Then, when systems do go down, the company realizes these RTOs and RPOs fall short of what is actually needed.
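The mismatch described above can be made concrete with a small check. The following Python sketch compares business-stated targets (RPO as the tolerable window of lost data, RTO as the tolerable downtime) with what IT can currently deliver. The class names, the `gaps` function and all of the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTarget:
    """Targets stated by the business for one system (hypothetical values)."""
    system: str
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


@dataclass
class RecoveryCapability:
    """What IT can currently deliver for the same system."""
    backup_interval: timedelta   # time between successive backups
    measured_restore: timedelta  # restore time observed in the last test


def gaps(target: RecoveryTarget, capability: RecoveryCapability) -> list[str]:
    """Return human-readable gaps between business need and IT capability."""
    found = []
    # Worst case, a failure lands just before the next backup runs,
    # so the data-loss window equals the full backup interval.
    if capability.backup_interval > target.rpo:
        found.append(
            f"{target.system}: backup interval {capability.backup_interval} "
            f"exceeds RPO {target.rpo}"
        )
    if capability.measured_restore > target.rto:
        found.append(
            f"{target.system}: measured restore {capability.measured_restore} "
            f"exceeds RTO {target.rto}"
        )
    return found


if __name__ == "__main__":
    target = RecoveryTarget("orders-db", rpo=timedelta(minutes=15), rto=timedelta(hours=1))
    nightly = RecoveryCapability(
        backup_interval=timedelta(hours=24),
        measured_restore=timedelta(minutes=30),
    )
    for gap in gaps(target, nightly):
        print(gap)
```

In this made-up case the nightly backup meets the RTO but misses the RPO by a wide margin, which is exactly the kind of shortfall that is better discovered in a planning meeting than during an outage.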

By being realistic about what’s needed ahead of time, IT can proactively set up the technology needed to deliver it. It can also make sure it’s not keeping anything that doesn’t need to be kept, saving the company money over the long run. This is where the available technologies really shine, enabling IT to deliver on the needs of the business. It’s not the whole job, however. IT also needs to be sure that what it’s architecting can actually be recovered after a disaster.

High-availability systems aren’t a substitute for disaster recovery

Public cloud providers like AWS offer an almost limitless number of ways to avoid outages. Proper load balancing and redundancy prevent many lapses in service in the first place, and make it possible to bring applications back up fast enough that a failure seems like a blip and not a crash.

Still, you have to use those options. AWS lets you determine your level of redundancy and availability. Spin up a single resource in a single datacenter, or Availability Zone, and you’ll save some cost at the expense of availability. Or take the opposite approach, and design a complex system that runs in multiple Availability Zones in multiple geographic regions with the ability to scale automatically and even self-heal. AWS will even let you replicate your data across regions, and Amazon S3 is designed for eleven nines (99.999999999%) of durability, reducing your chances of losing significant data in a systems failure to roughly the same odds as a mass-extinction event happening in the next week!
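The trade-off between cost and redundancy comes down to simple arithmetic. As a rough sketch in Python, the function below estimates the availability of several redundant copies of a resource, assuming each copy fails independently of the others. That assumption is optimistic, since real failures are often correlated, and the 99% figure for a single copy is a hypothetical input, not an AWS number.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant replicas, assuming independent failures.

    The service is down only when every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # Hypothetical: each zone on its own is available 99% of the time.
    for copies in (1, 2, 3):
        a = combined_availability(0.99, copies)
        print(f"{copies} zone(s): availability {a:.6f}, "
              f"about {downtime_hours_per_year(a):.2f} hours down per year")
```

Each extra copy multiplies the chance of a total outage by the failure rate of a single copy, which is why a second zone buys far more than its share of the bill would suggest.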

However, just because you’ve architected out the possibility of a systems failure causing an outage doesn’t mean your systems will be disaster-proof. Human error, misconfiguration, failures at external vendors (such as networking outages) and malicious activity can all wreak havoc. Having your data and systems fully operational and available does little good when a ransomware attack encrypts all of your business-critical information.

Fortunately, AWS allows you to prepare for disaster recovery as well, with a suite of services centered around automatically and regularly backing up your systems. Many key services such as Elastic Block Store (EBS) and Relational Database Service (RDS) can support RPOs of as little as five minutes, and most of these backups can be restored in well under 30 minutes. Even more importantly, backups of your systems can be shipped to other locations and other AWS accounts, in case your primary AWS account is breached. A combination of Infrastructure-as-Code, configuration management and CI/CD tools can enable you to stand up an entirely new environment from backups in record time with no configuration drift.

Still, architecting for high availability and building in automated DR can never replace teams that communicate well and act quickly. Having a team that can respond capably and nimbly to any scenario is far more important than trying, and failing, to plan for every possible failure.

Laying the best BC and DR plans

The key to any successful plan is ownership. Every member of your team needs to know who is responsible and when. Clear ownership prevents finger-pointing and recriminations, certainly, but it also makes sure everyone knows when they have the authority to leap into action, and when they’ll be held accountable if they don’t.

Once you’ve identified who is responsible, you need to clearly outline for what. This is where your business’s specific security and compliance standards will come into play, as they’ll determine many of the requirements of your plan. Clear documentation here will keep everyone moving in the right direction.

Build a strong operations mindset for business continuity

As with planning a data migration, the what matters more than the how: map every application, know what each one relies on to function and know where its data is backed up. Process and documentation are everything during an emergency. Success or failure depends on your team’s ability to jump into action almost automatically, without any questions about what to do or where to find it.
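A dependency map like the one described above also tells you the order in which to bring things back. As a minimal sketch, the Python below uses the standard library’s `graphlib.TopologicalSorter` (Python 3.9 or later) to turn a map of applications and their dependencies into a restore order. The application names and the `recovery_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each application maps to the set of things
# that must be running before it can start.
DEPENDS_ON = {
    "storefront": {"orders-api", "auth"},
    "orders-api": {"orders-db"},
    "auth": {"users-db"},
    "orders-db": set(),
    "users-db": set(),
}


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a restore order in which every dependency precedes its dependents.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is itself worth finding before an emergency.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"{step}. restore {name}")
```

Keeping the map in a reviewable file, and regenerating the order from it, means the documented sequence cannot quietly fall out of step with the systems it describes.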

If you promise that things are recoverable, you’re promising that they can be recovered fast. Waiting days to get apps and servers back up and running isn’t feasible; in most cases the damage is already done.

Even the most thorough, tested plans will have faults. A well-aligned team with great communication is the difference between a stumble and total failure. Stressing the importance of operations now will save you during tense scenarios.

There is no business continuity or disaster recovery plan without testing

There’s a word for plans that haven’t been put into practice: Hopes. If you want your team to be equipped with more than a hope when disaster strikes, you need to test every aspect of your DR plan. You can’t go back in time and make sure the right data was backed up or the right person was notified. You need to know now if something needs fixing.

Testing is often overlooked, both because IT teams are too focused on the work in front of them, and because no one is eager to really think about what happens in a worst-case scenario. That’s why testing needs to be as clear and mandatory a part of your planning as anything else. It’s the only way to keep everything up-to-date, and to keep panic from setting in when something does go awry.
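One way to make testing mandatory is to track it like any other deliverable. The Python sketch below flags systems whose recovery plan has never been exercised, failed its last test, or was last tested too long ago. The `DrillRecord` structure, the `needs_attention` function and the 90-day threshold are assumptions for illustration, and the right interval depends on your own compliance requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DrillRecord:
    """Outcome of the most recent recovery test for one system (hypothetical)."""
    system: str
    last_drill: Optional[date]  # None means the plan has never been exercised
    passed: bool


def needs_attention(
    records: list[DrillRecord],
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed policy, adjust to taste
) -> list[tuple[str, str]]:
    """Return (system, reason) pairs for plans that cannot be trusted yet."""
    flagged = []
    for record in records:
        if record.last_drill is None:
            flagged.append((record.system, "never tested"))
        elif not record.passed:
            flagged.append((record.system, "last drill failed"))
        elif today - record.last_drill > max_age:
            flagged.append((record.system, "drill overdue"))
    return flagged


if __name__ == "__main__":
    records = [
        DrillRecord("orders-db", date(2024, 5, 1), passed=True),
        DrillRecord("auth", None, passed=False),
        DrillRecord("storefront", date(2023, 1, 1), passed=True),
    ]
    for system, reason in needs_attention(records, today=date(2024, 6, 1)):
        print(f"{system}: {reason}")
```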

Train your team, and test their training, until the concept of an emergency is boring. The more rote their response, the more bulletproof your business continuity.

Want to know exactly what happens in a crisis? Contact us, and we’ll help you document a response.
