Azure Outage Post Mortem Part 3 - SIOS SANless clusters

Date: November 8, 2018

Tags: azure outage post mortem, Microsoft

Concluding The Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based upon limited information coming from blog posts and twitter. I just attended a session at Ignite which gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

The official Root Cause Analysis will be published soon. In the meantime, here are some tidbits of information gleaned from the session.

The Cause

From the azure outage post mortem, the outage was NOT caused by a lightning strike as previously reported. Instead, due to the nature of the storm, there were electrical storm sags and swells. As a result, it locked out a chiller plant in the 1^st datacenter. During this first outage they were able to recover the chiller quickly with no noticeable impact. Shortly thereafter, there was a second outage at a second datacenter which was not recovered properly. That began an unfortunate series of events.

2nd Outage

During this outage, Microsoft states that “Engineers didn’t triage alerts correctly – chiller plant recovery was not prioritized”. There were numerous alerts being triggered at this time. Unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that of course redundant chiller systems are in place. However, the cooling systems were not set to automatically failover. Recently installed new equipment had not been fully tested. So it was set to manual mode until testing had been completed.

After 45 minutes, the ambient cooling failed, hardware shutdown, air handlers shut down because they thought there was a fire. Staff had been evacuated due to the false fire alarm. During this time temperature in the data center was increasing. Some hardware was not shut down properly, causing damage to some storage and networking.

After manually resetting the chillers and opening the air handlers, the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was there was damage to storage. Microsoft’s primary concern is data protection. Microsoft will work to recover data to ensure no data loss. This of course took some time, which extend the overall length of the outage. The good news is that no customer data was lost. The bad news is that it seemed like it took 24-48 hours for things to return to normal. This was based upon what I read on Twitter from customers complaining about the prolonged outage.

Assumptions

Everyone expected that this outage would impact customers hosted in the South Central Region. But what they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discusses some of the extended reach of the outage.

Azure Service Manager (ASM)

This controls Azure “Classic” resources, AKA, pre-ARM resources. Anyone relying on ASM could have been impacted. It wasn’t clear to me why this happened. It appears that South Central Region hosts some important components of that service which became unavailable.

Visual Studio Team Service (VSTS)

Again, it appears that many resources that support this service are hosted in the South Central Region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps this blog post.

POSTMORTEM: VSTS 4 SEPTEMBER 2018

Azure Active Directory (AAD)

When the South Central region failed, AAD did what it was designed to due and started directing authentication requests to other regions. As the East Coast started to wake up and online, authentication traffic started picking up. Now normally AAD would handle this increase in traffic through autoscaling. But the autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exasperating the situation was a bug in Office clients which made them have very aggressive retry logic, and no backoff logic. This additional authentication traffic eventually brought AAD to its knees.

They ran out of time to discuss this further during the Ignite session. One feature that they will be introducing will be giving users the ability to failover Storage Accounts manually in the future. So in the case where recovery time objective (RTO) is more important than (RPO), the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate data center if Microsoft experience another extended outage in the future.

What You Can Do Now

Until that time, you will have to rely on other replication solutions such as SIOS DataKeeper Azure Site Recovery. Or application specific replication solutions which has the ability to replicate data across regions and put the ability to enact your disaster recovery plan in your control.

Read more about our azure outage post mortem
Reproduced with permission from Clusteringformeremortals.com