From SaaS to Serverless

Thunder Technologies
Mar 25, 2021
Logos used with permission as an AWS and GCP partner

In our last article we introduced our upcoming release of Thunder for EC2 Serverless, based on our cost-effective AWS Marketplace solution for disaster recovery automation, now architected for serverless computing. This article details how we converted the logic from a SaaS environment that runs 24/7 in an EC2 instance to a lightweight AWS Lambda function.

Cloud disaster recovery for workloads hosted in EC2 is a straightforward process. On a periodic basis, snapshot an instance’s EBS volumes; replicate those snapshots to another region; then attach volumes based on those snapshots to a duplicate instance in the DR region that was provisioned from a copy of the primary’s AMI. Step-by-step details are furnished in a previous post.

This logic, however, involves a lot of waiting. First, each replication job runs only occasionally, maybe hourly for an aggressive recovery point objective, but possibly daily to minimize replication costs. Replicating snapshots also takes time, depending on the volume of data replicated at each job. Even copying the primary’s AMI to the DR region during initial provisioning can be lengthy.

A SaaS solution spending most of its time waiting does not use its CPU cycles efficiently. As a result it incurs unnecessary additional cost paid to AWS for hosting the solution as an EC2 instance.

We embarked on our journey to rearchitect our solution for Lambda by creating a function with all of our automation logic that is invoked every 15 minutes through AWS CloudWatch Events, with one target for each EC2 instance being protected. We then divided the automation logic into discrete units, with each unit determining whether it should run during a given invocation based on the state of the instance at the time of the particular invocation:

  • If no snapshots exist, take a snapshot of each volume attached to the instance, then replicate them and exit; do not wait for the replication to complete because it could take longer than the maximum 15-minute Lambda timeout. The caller does not need to wait, as the replication takes place in the background directly in AWS.
  • If replicated snapshots exist and are completed (having been created by the previous step in an earlier invocation), create a volume from each snapshot and swap those volumes with those currently attached to the DR instance; tag the instance as requiring testing and exit
  • If the instance requires testing, power on the instance, run any deep test that has been configured (more on that in a later article), then power it off, untag it, and exit immediately; no need to wait for it to power off
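The three units above amount to a small state machine that is re-evaluated on every invocation. A minimal Python sketch of that dispatch follows. The names `InstanceState`, `Step`, and `choose_step` are hypothetical, and checking the testing tag first is an assumption made here so that a freshly swapped instance is not swapped again on the next invocation.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    """The unit of work selected for one invocation (hypothetical names)."""
    SNAPSHOT_AND_REPLICATE = "snapshot_and_replicate"
    SWAP_VOLUMES = "swap_volumes"
    TEST_DR_INSTANCE = "test_dr_instance"
    NOTHING_TO_DO = "nothing_to_do"


@dataclass
class InstanceState:
    """State discovered at the start of an invocation (hypothetical fields)."""
    snapshots_exist: bool      # has a snapshot job been started?
    replicas_completed: bool   # have all replicated snapshots finished copying?
    needs_testing: bool        # is the DR instance tagged as requiring testing?


def choose_step(state: InstanceState) -> Step:
    """Pick exactly one unit of work, or none, from the discovered state."""
    # Assumption: the testing tag is checked first so that volumes swapped
    # in an earlier invocation are tested rather than swapped again.
    if state.needs_testing:
        return Step.TEST_DR_INSTANCE
    if not state.snapshots_exist:
        return Step.SNAPSHOT_AND_REPLICATE
    if state.replicas_completed:
        return Step.SWAP_VOLUMES
    # Replication is still running in the background: exit immediately.
    return Step.NOTHING_TO_DO
```

Because each invocation performs at most one step and never blocks on AWS, no single run needs to approach the Lambda timeout.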

At any given invocation, if there is nothing to do, the function exits immediately, incurring essentially no cost. This will frequently be the case, for example if it is not time to run a job, or if a long-running snapshot replication is still in progress.

Users can specify at what frequency they wish to replicate each instance, for example, every four hours. The SaaS solution in this case merely created a Linux cron job on its EC2 instance to run every four hours. Thunder for EC2 Serverless cannot run its own internal cron job; instead, when it is invoked, it checks the timestamp of the snapshot it previously took of the primary instance; if that snapshot is older than the required frequency, it is time for a new replication job, otherwise the function exits immediately. If an error is encountered — for example a snapshot replication failed — it cleans up the error and tries again.
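The timestamp check that stands in for the cron job can be sketched in a few lines of Python. This is an illustration only: `replication_due` and its parameters are hypothetical names, and treating a snapshot that is exactly as old as the configured frequency as due is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_due(last_snapshot_at: Optional[datetime],
                    frequency: timedelta,
                    now: datetime) -> bool:
    """Return True when a new replication job should start."""
    # No previous snapshot means the instance has never been replicated.
    if last_snapshot_at is None:
        return True
    # Otherwise compare the age of the last snapshot with the frequency.
    return now - last_snapshot_at >= frequency


# Example: a four-hour frequency checked by an invocation at 12:00 UTC.
now = datetime(2021, 3, 25, 12, 0, tzinfo=timezone.utc)
last = datetime(2021, 3, 25, 7, 30, tzinfo=timezone.utc)
due = replication_due(last, timedelta(hours=4), now)
```

Since the function is invoked every 15 minutes, a job starts at the first invocation after the frequency has elapsed rather than at an exact clock time.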

And if the DR instances are already powered on, there must have been a failover or the user is testing their own DR infrastructure, in which case the function also exits immediately in order not to interfere.
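A minimal Python sketch of that guard, assuming the caller has already looked up the state name EC2 reports for each DR instance (for example `running` or `stopped`). The helper name `dr_site_in_use` is hypothetical, and counting `pending` as powered on is an assumption.

```python
from typing import Iterable

# EC2 instance state names treated as "powered on" in this sketch.
ACTIVE_STATES = {"pending", "running"}


def dr_site_in_use(dr_instance_states: Iterable[str]) -> bool:
    """Return True if any DR instance is starting or running.

    The caller would then exit without touching the DR infrastructure.
    """
    return any(state in ACTIVE_STATES for state in dr_instance_states)
```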

As a result, all of the idle waiting time, and its associated cloud cost, has been excised from the solution. Because straightforward logic changes let the code determine the correct operation for a given invocation from the current state of the infrastructure it discovers, there is no loss of functionality. For modest environments with aggressive RPOs (and our target market is small- and medium-sized businesses moving their mission-critical workloads to the cloud) the function might execute for 20000 seconds per day, or 600000 seconds per month. Lambda’s free tier covers 400000 GB-seconds per month; assuming a 1 GB memory allocation, that leaves 200000 billable seconds, which at the published rate of $0.0000166667 per GB-second comes to roughly $3.33 per month, and a smaller memory allocation costs proportionally less.
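The arithmetic can be checked with a few lines of Python. This sketch assumes Lambda's published x86 pricing of $0.0000166667 per GB-second and a monthly free tier of 400000 GB-seconds, ignores the separate per-request charge, and treats the memory allocation as a parameter because the article does not state one. The function name is hypothetical.

```python
def monthly_compute_cost(seconds_per_month: float,
                         memory_gb: float,
                         price_per_gb_second: float = 0.0000166667,
                         free_gb_seconds: float = 400_000) -> float:
    """Estimate the monthly Lambda compute charge in US dollars."""
    # Lambda bills duration multiplied by the configured memory size.
    gb_seconds = seconds_per_month * memory_gb
    # Only usage above the free tier is billable.
    billable = max(0.0, gb_seconds - free_gb_seconds)
    return billable * price_per_gb_second


# 20000 seconds per day over a 30-day month, as in the text above.
seconds = 20_000 * 30
cost_at_1gb = monthly_compute_cost(seconds, memory_gb=1.0)
cost_at_128mb = monthly_compute_cost(seconds, memory_gb=0.125)
```

Under these assumptions the 1 GB case comes to roughly $3.33 per month, while at 128 MB the same usage stays entirely inside the free tier.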

We had originally looked at Lambda when embarking on the initial development of our solution, but felt that the original five-minute function timeout was too restrictive. Now that it has been extended to 15 minutes, and with all of the supporting services around it, Lambda makes sense for management infrastructure that can work in occasional discrete units driven by a state machine. Our solution protects EC2 instances, which probably host the bulk of the workload on AWS and are certainly the only option to underpin realtime, 24/7 data processing applications and those that require a sophisticated management interface. For workloads that don’t require EC2, ironically including our own that protects EC2, our experience shows that the reasoning and mechanics behind the move to Lambda are clear.

By substituting Google Cloud Functions for AWS Lambda in the content above, we will also have a serverless version of our Thunder for GCP solution on GCP Marketplace.

In our next article we will detail how jettisoning SaaS drove the need to find new solutions to host a management interface, logging, and other infrastructure, and how going 100% cloud-native seamlessly filled those gaps. In the meantime, if you are interested in beta-testing Thunder for EC2 Serverless, please write to info@thundertech.io.

Thunder Technologies provides robust, cost-effective disaster recovery automation for the public cloud