Date: November 7, 2018
Tags: Azure, azure outage post mortem
What happened? Here’s our Azure Outage Post Mortem Part 2
My previous blog post says that Cloud-to-Cloud or Hybrid-Cloud would give you the most isolation from just about any issue a CSP could encounter. However most of the downtime caused by this natural disaster could have been avoided if Availability Zones was available in the South Central region. Microsoft published a Preliminary RCA of the September 4th South Central Outage.
The most important part of that whole summary is as follows…
“DESPITE ONSITE REDUNDANCIES, THERE ARE SCENARIOS IN WHICH A DATACENTER COOLING FAILURE CAN IMPACT CUSTOMER WORKLOADS IN THE AFFECTED DATACENTER.”
What Does That Mean To You?
If your applications all run in the same datacenter, you are susceptible to the same type of outage in the future. In Microsoft’s defense, this really shouldn’t be news to you. This has always been true whether you run in Azure, AWS, Google or even your own datacenter. Failure to plan ahead with data replication to a different datacenter and a plan in place to quickly recover your applications is simply a lack of planning on your part.
Microsoft didn’t publish exact Availability Zone locations. If you believe this map published here, you could guess that they are probably anywhere from a 2-10 miles apart from each other.
In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications such as SQL Server have built in replication technology. However for a broad range of applications, operating systems and data types, do investigate block level replication SANless cluster solutions. SANless cluster solutions have traditionally been used for multisite clusters. But the same technology can also be used in the cloud across Availability Zones, Regions, or Hybrid-Cloud for high availability and disaster recovery.
Implementing a SANless cluster that spans Availability Zones, be it Azure, AWS or Google, is a pretty simple process given the right tools. As part of the azure outage post mortem, here are a few resources to help get you started.
Lessons From Azure Outage Post Mortem
If you are in Azure, you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.
You can also create replication jobs that have “Multi-VM Consistency”. This means that servers must be recovered from the exact same point in time can be put together in this consistency group and they will have the exact same recovery point. Essentially, if you build a SANless cluster with DataKeeper in a single region for high availability, you have two options for DR. One is you could extend your SANless cluster to a node in a different region, or you could use ASR to replicate both nodes in a consistency group.
What’s The Difference?
The trade off with ASR is that the RPO and RTO is not as good as you will get with a SANless multi-site cluster. Although it is easy to configure and works with just about any application. Just be careful. If your application exceeds 10 MBps in disk write activity on a regular basis, ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.
For a while after Managed Disks were released, ASR did not fully support them until about a year later. Full support for Managed Disks was a big hurdle for many people looking to use ASR. Fortunately since about February of 2018, ASR fully supports Managed Disks. However, there is another problem that was just introduced.
With the introduction of Availability Zones, ASR is once again caught behind the times. They currently don’t support VMs that have been deployed in Availability Zones.
I went ahead and tried it anyway. It seems possible to configure replication and I was able to do a test failover.
Read more about my analysis of the Azure Outage Post Mortem
Reproduced with permission from Clusteringformeremortals.com