Building Robust Disaster Recovery for the Cloud

Multiple regions can be more testable and cost-effective than multiple availability zones

Thunder Technologies
7 min read · Jun 18, 2020


Your mission-critical cloud workload should be protected against a disaster at the datacenter hosting your production cloud instances.

How do you protect against the remote possibility that an AWS or GCP region will go down, without spending a lot of time and money?

One approach is to leverage a cloud’s multiple availability zones within a given region. Vendors offer turnkey applications like databases in which dual instances are automatically deployed and data is replicated between them across availability zones (also called multi-AZ). In case of a failure of the zone hosting the primary instance, the backup instance automatically takes over.

You might however ask yourself — and your cloud vendor — the following: how do I test this? How do you know it will work? You frequently test your own code; you should be in a position to test any infrastructure you deploy as well.

Certainly you can power off your application and watch it fail over. But is that a close enough simulation of what you are protecting against?

What you are protecting against is the entire primary zone instantly becoming unavailable, taking with it all of the hardware and software services it supports, as well as the tens of thousands of other applications that other users have deployed there. Will the user console still accept commands to that region in order to drive the failover? Will the standby availability zone have enough capacity to absorb all of the workload? Can the cloud vendor's failover-management code handle such a huge volume of simultaneous operations? When was the last time they ran a real test by shutting down an availability zone?

Have they ever tried it?

Without answers to these questions, you might be uncomfortable with the possibility that the solution will not function as expected. In that case, you might have to reach out to your cloud vendor's support team, who would most likely be inundated with other requests.

That is why we think a better approach to disaster recovery is to replicate instances across regions: you can manage the process yourself and, more importantly, regularly validate the failover in a reasonable simulation of what might actually happen.

No, this does not mean asking your vendor to shut down an entire region. It means taking the steps needed to maintain an up-to-date duplicate of your workload in a different region, and regularly testing that it can start on the assumption that the primary is no longer available.

You can do this yourself: the steps are reasonably straightforward. Essentially, the strategy is to replicate a snapshot of the primary instance's volumes to the DR region, create a new volume from that snapshot, and swap it in for the existing volume of a copy of the instance in the DR region.

Here are the steps using the AWS command-line interface; the same approach applies to other clouds such as Google Cloud Platform. In this example, assume you have deployed a Bitnami LAMP server of size t3.2xlarge in region us-east-2 (Ohio) and have chosen ca-central-1 (Canada) as your failover region. First, create an identical instance in the DR region:

aws ec2 run-instances --image-id ami-3863e45c --instance-type t3.2xlarge --region ca-central-1

Next, create a snapshot of the volume(s) underlying the block devices of the primary instance:

aws ec2 create-snapshot --volume-id vol-0451a152f71e29f7b --region us-east-2
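If you script these steps, the waiting and ID lookup that follow each command can be automated with the CLI's built-in waiters and `--query` filters. A sketch, using the example volume and region from this article (`latest_snapshot_id` is a hypothetical helper name):

```shell
# Hypothetical helper: find the most recent snapshot of a volume and block
# until it reaches the "completed" state, then print its ID.
latest_snapshot_id() {
  local volume_id="$1" region="$2"
  local snap_id
  # Most recent snapshot taken of this volume
  snap_id=$(aws ec2 describe-snapshots \
      --filters "Name=volume-id,Values=${volume_id}" \
      --query 'sort_by(Snapshots,&StartTime)[-1].SnapshotId' \
      --output text --region "${region}")
  # Built-in waiter: polls until the snapshot state is "completed"
  aws ec2 wait snapshot-completed --snapshot-ids "${snap_id}" --region "${region}"
  echo "${snap_id}"
}

# Example (uncomment to run against your own account):
# SNAP_ID=$(latest_snapshot_id vol-0451a152f71e29f7b us-east-2)
```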

Wait for the snapshot to be completed, and locate its ID. Then, copy the snapshot to the DR region:

aws ec2 copy-snapshot --source-region us-east-2 --source-snapshot-id snap-0d57054959357bc83 --region ca-central-1

Wait for the snapshot to be completed at the DR site, and locate its ID. Then, create a volume from that snapshot:

aws ec2 create-volume --snapshot-id snap-08b2d1db2d9b62a40 --availability-zone ca-central-1a --region ca-central-1

Wait for that new volume to become available, and locate its ID. Then, detach the existing volume from the DR instance and replace it with the newly replicated volume with the latest data from the primary:

aws ec2 detach-volume --volume-id vol-017c92fedadc2bfb9 --region ca-central-1
aws ec2 attach-volume --volume-id vol-051ca3131c8a6f5c4 --instance-id i-049f4525c23eeaf71 --device /dev/sda1 --region ca-central-1

The backup instance is now up-to-date, so test that it can start with a brief power-on and power-off. This also proves that your DR site has the capacity for your backup instances. Ideally, while it is powered on, attempt to connect to the application, such as MySQL, to make sure it recovers:

aws ec2 start-instances --instance-id i-049f4525c23eeaf71 --region ca-central-1
# optional application test here
aws ec2 stop-instances --instance-id i-049f4525c23eeaf71 --region ca-central-1
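For the optional application test, one approach is to poll the replicated MySQL until it accepts connections. This is a sketch: the host IP is a placeholder for the started DR instance's public address, and credentials are assumed to come from your MySQL client configuration.

```shell
# Hypothetical application-level check: poll MySQL on the DR instance until
# it answers, or give up after a number of tries.
wait_for_mysql() {
  local host="$1" tries="${2:-30}"
  for _ in $(seq 1 "$tries"); do
    # "mysqladmin ping" succeeds once the server is accepting connections
    if mysqladmin --host="$host" --connect-timeout=5 ping >/dev/null 2>&1; then
      echo "MySQL is up on ${host}"
      return 0
    fi
    sleep 10
  done
  echo "MySQL did not come up on ${host}" >&2
  return 1
}

# wait_for_mysql 203.0.113.10   # placeholder public IP of the DR instance
```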

Repeat the steps starting from the first snapshot creation regularly to keep the backup copy up-to-date. In case of a true disaster, just start the backup instance in the DR region, in this case ca-central-1. There is no dependency on the primary region at all for this operation.
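Put together, the whole refresh cycle can be scripted. Here is a sketch using the example IDs from above; the `aws ec2 wait` subcommands block until each resource reaches the named state. Treat this as a starting point under this article's assumptions (a single volume, a stopped DR instance), not production code:

```shell
#!/usr/bin/env bash
# Sketch of the DR refresh cycle described in this article.
# Replace the example IDs below with your own.
set -euo pipefail

PRIMARY_REGION=us-east-2
DR_REGION=ca-central-1
PRIMARY_VOLUME=vol-0451a152f71e29f7b
DR_INSTANCE=i-049f4525c23eeaf71
DR_AZ=ca-central-1a
DEVICE=/dev/sda1

refresh_dr_copy() {
  # 1. Snapshot the primary volume and wait for completion
  local snap
  snap=$(aws ec2 create-snapshot --volume-id "$PRIMARY_VOLUME" \
         --region "$PRIMARY_REGION" --query SnapshotId --output text)
  aws ec2 wait snapshot-completed --snapshot-ids "$snap" --region "$PRIMARY_REGION"

  # 2. Copy the snapshot to the DR region and wait again
  local dr_snap
  dr_snap=$(aws ec2 copy-snapshot --source-region "$PRIMARY_REGION" \
            --source-snapshot-id "$snap" --region "$DR_REGION" \
            --query SnapshotId --output text)
  aws ec2 wait snapshot-completed --snapshot-ids "$dr_snap" --region "$DR_REGION"

  # 3. Create a volume from the copied snapshot and wait for it
  local new_vol
  new_vol=$(aws ec2 create-volume --snapshot-id "$dr_snap" \
            --availability-zone "$DR_AZ" --region "$DR_REGION" \
            --query VolumeId --output text)
  aws ec2 wait volume-available --volume-ids "$new_vol" --region "$DR_REGION"

  # 4. Swap it onto the stopped DR instance
  local old_vol
  old_vol=$(aws ec2 describe-instances --instance-ids "$DR_INSTANCE" \
            --region "$DR_REGION" \
            --query 'Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId' \
            --output text)
  aws ec2 detach-volume --volume-id "$old_vol" --region "$DR_REGION"
  aws ec2 wait volume-available --volume-ids "$old_vol" --region "$DR_REGION"
  aws ec2 attach-volume --volume-id "$new_vol" --instance-id "$DR_INSTANCE" \
      --device "$DEVICE" --region "$DR_REGION"
}

# refresh_dr_copy   # uncomment to run against your own account
```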

You can script this yourself, and in fact simulate the primary being down as part of the overall steps. With the AWS CLI, you can override the URL of the API server for a particular region using the --endpoint-url option, and specify a bogus IP address instead to simulate that site being down. This simulates a true site failure as closely as possible. For example, when snapshotting the primary instance, instead run:

aws ec2 create-snapshot --volume-id vol-0451a152f71e29f7b --region us-east-2 --endpoint-url https://10.20.30.40/

This will hang and time out, which is exactly what would happen if us-east-2 were down. There is plainly no similar way to simulate an availability-zone failure.
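The same trick can gate an automated runbook: probe the primary's (overridden) endpoint with a short timeout before deciding to fail over. `--cli-connect-timeout` and `--cli-read-timeout` are standard AWS CLI global options; `primary_reachable` is a hypothetical helper name:

```shell
# Hypothetical helper: returns success only if the EC2 endpoint answers
# within 5 seconds. Pass a bogus endpoint URL (as above) to simulate the
# primary region being down.
primary_reachable() {
  local endpoint="${1:-https://10.20.30.40/}"
  aws ec2 describe-regions --region us-east-2 \
      --endpoint-url "$endpoint" \
      --cli-connect-timeout 5 --cli-read-timeout 5 >/dev/null 2>&1
}

# Example:
# if ! primary_reachable; then echo "primary unreachable; starting DR failover"; fi
```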

But what about the cost? Some cloud vendors include the price of data replication in the cost of multi-availability-zone applications such as Amazon RDS. Using calculator.aws to estimate costs:

  • A MySQL RDS multi-AZ deployment of size t3.2xlarge with no read replica is $800 per month (using on-demand pricing); adding a read replica doubles it to $1,600 per month

However, inter-region failover does not necessarily cost more:

  • A single Bitnami LAMP server of size t3.2xlarge from AWS Marketplace costs $0.397/hr, or about $300 per month; a backup instance in the DR region is almost always powered off, so its cost is negligible
  • Replication fees are $0.02/GB. If you replicate 64 GB per day, or about 2 TB per month, the cost is an additional $40. Cross-region replication is differential: only the changes are replicated, not the entire volume.
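These figures are easy to sanity-check with quick arithmetic (assuming roughly 730 hours per month; the small differences from the article's numbers are rounding):

```shell
# Back-of-the-envelope check of the costs quoted above.
awk 'BEGIN {
  instance    = 0.397 * 730          # on-demand $/hr * hours/month
  replication = 64 * 30 * 0.02       # GB/day * days * $/GB
  printf "instance: ~$%.0f/mo, replication: ~$%.0f/mo\n", instance, replication
}'
```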

What’s missing in a cross-region DR scenario is the automation of the setup, replication, and failover. Though the steps shown above demonstrate how you can do it yourself, you might not want to spend precious time customizing them for all of your existing and any new instances.

While multi-AZ RDS orchestrates the failover between zones, there are many solutions on cloud Marketplaces to orchestrate cross-region failover for you. Our products, Thunder for EC2 on AWS Marketplace (https://go.aws/35SuxHu) and Thunder for GCP on GCP Marketplace (https://bit.ly/2XRHU8M) are by far the most cost-effective: at $20 per month flat fee they cost less per year than the nearest competitor charges per month.

We charge so little because we develop exclusively for the cloud, so we have no overhead for capital equipment, real estate, or traditional corporate infrastructure. Also, since our products merely orchestrate the straightforward approach described in this article, a high price would be hard to justify for any solution. Most importantly, cross-region replication is not cost-effective if the price to manage it makes it more expensive than multi-AZ failover. At an additional $20 per month, Thunder for EC2 and Thunder for GCP add only marginally to your DR preparation spend. At, say, $200 per month or more, it would no longer make financial sense.

By deploying a cross-region disaster recovery protection solution, you will have accomplished the following:

  • built a robust, testable solution that as closely as possible simulates the real operation
  • taken control of disaster recovery protection in your own hands (this is not a criticism of our cloud partners, but no one can expect a strategy that is difficult to test to work flawlessly the first time)
  • potentially reduced your costs over RDS, or at a minimum kept them relatively similar if you replicate a lot of data; low cost management solutions from Thunder Technologies keep those costs low
  • built a generic solution for any application; RDS for example only applies to databases
  • significantly widened the scope of failures you can tolerate: availability zones are dozens of miles apart, while regions are hundreds or thousands of miles apart

A cross-region disaster recovery strategy certainly involves some trade-offs, but they can be mitigated. These trade-offs include:

  • your data replicates asynchronously, so some will be lost on a failover, but you can replicate frequently to keep the gap as small as possible
  • you pay for inter-region data transfer, but only for the differential
  • you have to manage the infrastructure yourself, as AWS and GCP have no native solution for it, but inexpensive solutions like ours exist

Finally, if you want hands-on experience with a cross-region DR automation solution, check out our online demos of Thunder for EC2 (https://bit.ly/36cx6EM) and Thunder for GCP (https://bit.ly/30Dve7g). These demos include a step-by-step tutorial of the provisioning, replication, testing, and failover scenarios using instances in our own account, at no cost or obligation to you. The tutorials are brief, not only because the approach is straightforward, but because they run in our account and turn off automatically after 30 minutes … because we’re paying for them!

However you protect your mission-critical applications against disaster, we look forward to providing helpful information in this space to keep your business humming in the cloud.


Thunder Technologies

Thunder Technologies provides robust, cost-effective disaster recovery automation for the public cloud