All businesses, from the smallest startups to the Fortune 500, need to protect their mission-critical cloud workloads from permanent loss due to natural disasters and other hazards in the region hosting their applications.
Any discussion of disaster recovery protection, as a form of insurance, must include a transparent calculation of the expenses involved so you have a full understanding of the cost/benefit ratio. So let’s talk about, ahem, money.
As a baseline, let’s start with AWS RDS, a cloud database that can be configured with redundant instances across two availability zones within the same region. Data is replicated between zones at no cost, and the redundant instance can be started in case the primary zone fails. The cost for the simplest such instance (db.m5.xlarge with 20 GB SSD allocated, in ca-central-1) is $275 per month.
However, as I detailed in a previous post, there seems to be no realistic way to test this scenario. At a minimum, how do you know that AWS has the capacity to absorb the redundant RDS instances of all its customers in case of a zone failure?
You can optionally configure RDS with a hot standby that takes over immediately in case of a failure. However, you are continuously charged for this standby, which essentially doubles your cost from $275 to $550 per month.
Cross-region replication and failover is an alternative DR architecture. It is arguably superior to RDS in three ways: it is easily testable (just power on duplicate instances in your DR region regularly); it separates your backup copy by a much larger distance (availability zones are at most tens of miles apart, while regions are hundreds if not thousands); and it protects more than databases (RDS applies only to database engines, while any real cloud workload includes other applications that also need protection).
Cross-region replication is accomplished by replicating snapshots of the EBS volumes underlying an EC2 instance to a remote region; these snapshots are then converted to volumes that can be attached to a standby instance. The standby instance stays powered off except during brief tests. You do, however, bear the snapshot replication cost, generally $0.02 USD per gigabyte.
Crucially, when replicating a snapshot, only the differential data written since the previous snapshot is copied between regions. As a result, you are charged only for the data written since the last replicated snapshot, not for the entire size of the disk every time.
As an example, consider the 20 GB disk in the RDS example above. The first time you replicate the disk to a remote region AWS will copy the entire 20 GB as there is no previous snapshot to which it can compare. This will incur a cost of 20 x $0.02 = $0.40.
If you choose to replicate hourly, and your application writes 1 GB of new data per hour, each subsequent snapshot sends only that 1 GB: 24 snapshots/day x 1 GB x $0.02/GB = $0.48 per day.
Over the course of a month with a similar write load, the cost is $0.48 x 30 days + $0.40 for the initial snapshot = $14.80, far lower than the $275 per month for a hot RDS standby.
Even in the worst case, where your application rewrites the entire disk every hour, your monthly cost will be 20 GB x $0.02/GB x 24 hours/day x 30 days/month = $288, roughly the $275 incremental cost of the RDS hot standby.
It is important to note that modern applications, and especially databases, lay out data in ways that keep these differentials small. MySQL, for example, creates a separate file for each table in a database, sized only to house the data stored in that table. This means that, at snapshot time, any table not modified since the last snapshot will not be part of the differential snapshot and will not incur replication cost.
Consider a hypothetical situation where many tables are read but rarely written, and a single 1 GB table receives most of the updates. If you can stand to lose at most one day’s worth of data, you can replicate daily rather than hourly. Your monthly replication fee would then be in the neighborhood of 1 GB x $0.02/GB x 30 days + the one-time full snapshot replication at $0.40 = $1.00!
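The arithmetic in these scenarios can be checked with a small calculator. The function below is an illustrative sketch of mine that assumes the $0.02/GB price and folds the one-time full copy into the first month:

```python
def monthly_replication_cost(disk_gb, delta_gb, snapshots_per_day,
                             days=30, price_per_gb=0.02):
    """First-month replication cost in USD: one full copy of the disk,
    then a differential copy of delta_gb per snapshot."""
    initial_full_copy = disk_gb * price_per_gb
    differentials = delta_gb * price_per_gb * snapshots_per_day * days
    return round(initial_full_copy + differentials, 2)

print(monthly_replication_cost(20, 1, 24))   # hourly, 1 GB/hour -> 14.8
print(monthly_replication_cost(20, 20, 24))  # hourly full rewrite -> 288.4
print(monthly_replication_cost(20, 1, 1))    # daily, 1 GB/day -> 1.0
```

After the first month the $0.40 full-copy term drops out, so steady-state costs are slightly lower still.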
While I’ve focused on AWS, Google Compute Engine in Google Cloud Platform is similar: replicating snapshots across regions optimizes data transfer to minimize costs.
Replicating snapshots, though, is not the entire effort required for effective business continuity. Something has to automate the periodic replication, and to turn the replicated data into duplicate EC2 instances or GCP virtual machines in the DR region. You can painstakingly click through the console yourself, but time is, of course, money.
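For illustration, one replication cycle on AWS could be scripted along these lines. This is a hedged sketch, not our product’s code: it assumes you pass in two boto3 EC2 clients (source region and DR region) and handle scheduling (cron, EventBridge), error handling, and old-snapshot pruning elsewhere:

```python
def replicate_to_dr_region(ec2_src, ec2_dst, volume_id, src_region):
    """One replication cycle: snapshot an EBS volume, wait for the
    snapshot to complete, then copy it to the DR region.  AWS copies
    only the blocks changed since the last copied snapshot."""
    snapshot_id = ec2_src.create_snapshot(
        VolumeId=volume_id,
        Description="periodic DR snapshot",
    )["SnapshotId"]
    ec2_src.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
    return ec2_dst.copy_snapshot(
        SourceRegion=src_region,
        SourceSnapshotId=snapshot_id,
        Description="DR copy of " + snapshot_id,
    )["SnapshotId"]
```

With boto3 you would create the two clients with `boto3.client("ec2", region_name=...)` and call this on a schedule; at failover time the copied snapshot is turned into a volume and attached to a standby instance in the DR region.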
While there are many solutions on the AWS and GCP Marketplaces, almost all are very expensive. For small businesses with modest cloud spend, a high subscription fee essentially negates the cost savings of cross-region replication that I have just described. That is why we priced our solutions, Thunder for EC2 and Thunder for GCP, at a flat $20 per month with a free trial: the cost of managing the replication stays small relative to the data replication itself.
If you’re a small business with, say, 10 applications in the cloud, each writing 1 GB per hour and replicating hourly, your replication cost will be about $144 per month. Adding the $20 subscription for Thunder for EC2 or Thunder for GCP, plus the roughly $15 per month Amazon or Google charges for the micro instance that runs our automation software, still keeps the solution economical, both relative to the cost of the data and in comparison to RDS. Add, however, a $200 or $500 monthly subscription, as our competitors charge, and the cost/benefit ratio declines significantly.
We can charge so little because developing software for the cloud costs so little: no equipment to buy, no lab space to rent, no corporate overhead. Our business model involves just a few seasoned technologists using published APIs to automate a straightforward procedure that can appeal to the huge addressable market of AWS and GCP users.
Our product for AWS optionally keeps track of your replication spend in the UI. If given read permission to your usage report bucket, each instance can show the amount and cost of data replicated.
As this is one of my longer articles, I will close with a summary of the various DR approaches and their tradeoffs, focusing on AWS, though the same analysis applies to GCP. For the sake of comparison, I model a small business with 10 applications in the cloud, each similar in size and capacity to the single RDS instance above, for a baseline spend of $275 x 10 = $2,750 per month.
Earlier I compared disaster recovery to insurance: essentially a premium you pay on top of your regular spend to guarantee continuity in case of catastrophe. While there is a solution with no premium, you must accept on faith that it will function properly, and it applies only to databases. Other solutions carry double-digit premiums, unjustifiably high given the remote nature of the threat. Only our products, Thunder for EC2 and Thunder for GCP, offer testable, application-independent disaster recovery automation for both AWS and GCP at the lowest possible premium.