When we tell people how PawaIT Solutions Ltd fully automates business continuity and disaster recovery on Google Cloud Platform, we’re commonly asked the same question: “Wait, isn’t that built into the cloud?”

You can be forgiven if you asked it too. Google Cloud Platform provides a lot of built-in redundancy, and it can be easy to assume they’ve got it all covered. But at the end of the day, the major cloud providers leave it to you to mitigate the most critical disasters.

Understanding your disaster recovery (DR) needs for cloud-based workloads requires assessing normal IT risks through a cloud-specific lens. In this blog post, we’ll do just that.

Let’s revisit some cloud disasters from the recent past. Luckily, disasters in the cloud (just like disasters in life) aren’t a daily occurrence. But they happen frequently enough that most cloud practitioners can tell you a story about one they’ve experienced themselves. Here are a couple of stories that we like to tell.

AZURE SOUTH CENTRAL REGION OUTAGE

On September 4th, 2018, San Antonio had some pretty impressive weather, and a lightning strike took out power to this Azure region.

Normally, power outages at data centers aren’t the end of the world; they have backup generators that automatically kick in. But the lightning strike sent a power surge through the data center’s cooling systems, and they were fried. Temperatures in the data center got hot. Too hot. Hardware failures started happening. Microsoft decided they needed to shut everything down.

It took 24 hours for Microsoft to repair the damage and to start bringing services back online. Some services weren’t restored for 3 days. Source: Microsoft

AWS NORTHERN VIRGINIA REGION OUTAGE

On February 28, 2017, an Amazon employee was performing a routine maintenance operation when he mistyped a command. With that one mistake, he took down a massive number of servers supporting Amazon’s Simple Storage Service (S3). Source: AWS

The services provided by these clouds are either zonal, regional, or global. A zonal service is hosted within a given availability zone, and is therefore vulnerable to an outage within that availability zone (e.g. a fire in the data center). A regional service spans availability zones within a region, and is therefore not vulnerable to an outage of a single availability zone. But it is vulnerable to an outage of multiple availability zones in the region (e.g. an earthquake or a regional power grid outage).

Likewise, a global service is not vulnerable to a single regional outage. You’d need multiple regions to fail to bring it down. In general, low-level infrastructure services (like virtual machines and virtual hard drives) are zonal. Higher-level platform services (like hosted databases) are often regional. And global services are pretty rare (they’re really hard to engineer).
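The zonal, regional, and global distinction can be captured in a few lines. The following is a minimal Python sketch, not any provider's API: the names `Scope`, `Outage`, and `survives` are made up for illustration, and it assumes the outage hits the zone or region where the service is actually hosted.

```python
from enum import IntEnum


class Scope(IntEnum):
    """How widely a cloud service is spread (hypothetical names)."""
    ZONAL = 1     # lives in one availability zone
    REGIONAL = 2  # spans availability zones within one region
    GLOBAL = 3    # spans multiple regions


class Outage(IntEnum):
    """How much is down, assuming it hits where the service runs."""
    SINGLE_ZONE = 1
    WHOLE_REGION = 2      # multiple zones in a region, or the entire region
    MULTIPLE_REGIONS = 3


def survives(scope: Scope, outage: Outage) -> bool:
    """A service rides out an outage only if it is spread wider than the failure."""
    return scope > outage


# List which outages each kind of service survives on its own.
for scope in Scope:
    print(scope.name, [o.name for o in Outage if survives(scope, o)])
```

In this simplified model a zonal service survives none of the outages by itself, which is why the resilience has to come from how you deploy it.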

So, to understand what kind of disasters you’ve mitigated, you need to understand the architecture of the cloud services you’re consuming and what they’ve already mitigated for you. If you’re consuming a zonal service, for example, that doesn’t mean your application can’t be resilient to an availability zone outage. It just means you have to build that resilience yourself; it’s not automatic.
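To make "build that resilience yourself" concrete, here is a small Python sketch under one simplifying assumption: the application stays available as long as at least one replica runs in a zone that has not failed. The function name and the zone names are hypothetical placeholders, not real identifiers.

```python
def is_available(replica_zones: set, failed_zones: set) -> bool:
    """True if at least one replica sits in a zone that is still up."""
    return bool(replica_zones - failed_zones)


# Hypothetical deployments of the same zonal service.
single_zone_app = {"zone-a"}
multi_zone_app = {"zone-a", "zone-b"}
outage = {"zone-a"}

print(is_available(single_zone_app, outage))  # False
print(is_available(multi_zone_app, outage))   # True
```

A real deployment also needs data replication and a way to route traffic to the surviving replica; the sketch only covers the placement question.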

Don’t forget about the cyber threats

Most organizations work hard to ensure that they aren’t vulnerable to an attacker gaining access to their accounts. But search online for “cloud accounts hacked” and you’ll find countless stories where it has happened. A defense-in-depth approach to security requires that you lock down your backups as well, just in case.

How do you know if you’re covered?

Assessing your readiness for a cloud disaster boils down to answering 3 simple questions.

  1. If we suffer an extended outage of an availability zone, are we still available? If not, how would we recover?
  2. If we suffer an extended outage of multiple availability zones (or an entire region), are we still available? If not, how would we recover?
  3. If an attacker gains access to our cloud account, could they corrupt (e.g. with ransomware) or delete our data and our backups?
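One way to keep track of the answers is a simple checklist. This is a hedged Python sketch only: the `Readiness` fields and the `find_gaps` function are hypothetical names invented here, and it treats a question as covered when you are either still available or have a recovery plan.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Readiness:
    """Answers to the three questions (field names are hypothetical)."""
    available_during_zone_outage: bool
    zone_recovery_plan: bool
    available_during_region_outage: bool
    region_recovery_plan: bool
    backups_safe_from_account_compromise: bool


def find_gaps(r: Readiness) -> List[str]:
    """Return one message per question that has no satisfactory answer."""
    gaps = []
    if not (r.available_during_zone_outage or r.zone_recovery_plan):
        gaps.append("Zone outage: not available and no recovery plan")
    if not (r.available_during_region_outage or r.region_recovery_plan):
        gaps.append("Region outage: not available and no recovery plan")
    if not r.backups_safe_from_account_compromise:
        gaps.append("Account compromise: attacker could corrupt or delete backups")
    return gaps


# Example: covered for a zone outage and for attackers, but not for a region outage.
example = Readiness(True, True, False, False, True)
print(find_gaps(example))
```

An empty list only means the three questions have answers on paper.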

If your answers to these questions are satisfactory for your business needs, you’re mostly good. You just need to answer one more question: have you tested it?