SIOS SANless clusters

Major Cloud Outage Impacts Google Compute Engine – Were You Prepared? 

June 7, 2019 by Jason Aw Leave a Comment


Google first reported an “Issue” on June 2, 2019, at 12:25 PDT. As is now common in any type of disaster, reports of this outage first appeared on social media, which has become the most reliable place to get early information when a disaster strikes.

Twitter is quickly becoming the first source of information on everything from revolutions and natural disasters to cloud outages.

Many services that rely on Google Compute Engine were impacted. I have three teenage kids at home, and I knew something was up when all three emerged from their caves, aka bedrooms, at the same time with worried looks on their faces. Snapchat, YouTube and Discord were all offline!

They must have thought this was surely the first sign of the apocalypse. I reassured them it was not the beginning of a new dark age and suggested they instead go outside and do some yard work. That scared them back to reality, and they quickly scurried away to find something else to occupy their time.

All kidding aside, many services were reported as down or available only in certain areas. The dust is still settling on the cause and scope of the outage, but it was clearly significant, impacting many customers and services including Gmail and other G Suite services, Vimeo and more.

Many services were impacted by this outage: Gmail, YouTube and Snapchat, just to name a few.

While we wait for the official root-cause analysis of this latest Google Compute Engine outage, Google has reported that “high levels of network congestion in the eastern USA” caused the downtime. We will have to wait and see what they determine caused those network issues. Was it human error, a cyber-attack, a hardware failure, or something else?

Were You Prepared For This Cloud Outage?

As I wrote during the last major cloud outage, if you are running business-critical workloads in the cloud, regardless of the cloud service provider, it is incumbent upon you to plan for the inevitable outage. The multi-day Azure outage of September 4, 2018 was caused by the failure of a secondary HVAC system to kick in during a power surge related to an electrical storm. While the failure occurred within a single datacenter, the outage exposed multiple services that had dependencies on that one datacenter, making the datacenter itself a single point of failure.

Have A Sound Disaster Recovery Plan

Leverage the cloud’s infrastructure to minimize risk by continuously replicating critical data between Availability Zones, Regions, or even cloud service providers. In addition to data protection, having a procedure in place to rapidly recover business-critical applications is an essential part of any disaster recovery plan. Replication and recovery options range from services provided by the cloud vendors themselves, such as Azure Site Recovery, to application-specific solutions like SQL Server Always On Availability Groups, to third-party solutions like SIOS DataKeeper that protect a wide range of applications running on both Windows and Linux.

Having a disaster recovery strategy that is wholly dependent on a single cloud provider leaves you susceptible to a scenario that impacts multiple regions within that cloud. Multi-datacenter or multi-region disasters are unlikely, but as we saw with this recent outage and the Azure outage last fall, even a failure local to a single datacenter can have a wide-reaching impact across multiple datacenters or even regions within a cloud. To minimize your risk, consider a multi-cloud or hybrid-cloud scenario where the disaster recovery site resides outside your primary cloud platform.

The cloud is just as susceptible to outages as your own datacenter, so you must take steps to prepare for disasters. I suggest you start by looking at your most business-critical apps first. What would you do if they were offline and the cloud portal to manage them was unavailable? Could you recover? Would you meet your RTO and RPO? If not, it may be time to re-evaluate your disaster recovery strategy.

“By failing to prepare, you are preparing to fail.”

― Benjamin Franklin

Reproduced with permission from Clusteringformeremortals.com

Filed Under: Clustering Simplified, Datakeeper Tagged With: cloud outage, disaster recovery

Managing a Real-Time Recovery in a Major Cloud Outage

January 19, 2019 by Jason Aw Leave a Comment


Disasters happen, making sudden downtime a reality. But there are things all customers can do to survive virtually any cloud outage.

Stuff happens. Failures—both large and small—are inevitable. What is not inevitable is extended periods of downtime.

Consider the day the South Central US Region of Microsoft’s Azure cloud experienced a catastrophic failure. A severe thunderstorm led to a cascading series of problems that eventually knocked out an entire data center. In what some have called “The Day the Azure Cloud Fell from the Sky,” most customers were offline, not just for a few seconds or minutes, but for a full day. Some were offline for over two days. While Microsoft has since addressed the many issues that led to the outage, the incident will long be remembered by IT professionals.

That’s the bad news. The good news is that there are things all Azure customers can do to survive virtually any outage, from a single server failing to an entire data center going offline. In fact, Azure customers who implement robust high-availability and/or disaster recovery provisions, complete with real-time data replication and rapid, automatic failover, can expect no data loss and little or no downtime whenever catastrophe strikes.


Managing The Cloud Outage

This article examines four options for providing disaster recovery (DR) and high availability (HA) protections in hybrid and purely Azure cloud configurations. Two of the options are specific to the Microsoft SQL Server database, which is a popular application in the Azure cloud; the other two options are application-agnostic. The four options, which can also be used in various combinations, are compared in the table and include:

  • The Azure Site Recovery (ASR) Service
  • SQL Server Failover Cluster Instances with Storage Spaces Direct
  • SQL Server Always On Availability Groups
  • Third-party Failover Clustering Software


RTO and RPO 101

Before describing the four options, it is necessary to have a basic understanding of the two metrics used to assess the effectiveness of DR and HA provisions: Recovery Time Objective and Recovery Point Objective. Those familiar with RTO and RPO can skip this section.

RTO is the maximum tolerable duration of an outage. Online transaction processing applications generally have the lowest RTOs, and those that are mission-critical often have an RTO of only a few seconds. RPO is the maximum period during which data loss can be tolerated. If no data loss is tolerable, then the RPO is zero.
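To make the two objectives concrete, here is a minimal, hypothetical sketch (the function name and the 30-second RTO are invented for illustration) of how an outage can be checked against both objectives:

```python
from datetime import timedelta

def meets_objectives(outage_duration, data_loss_window, rto, rpo):
    """Return True only if an outage stayed within both objectives.

    outage_duration  - how long the service was actually down
    data_loss_window - span of committed data that could not be recovered
    rto              - maximum tolerable downtime
    rpo              - maximum tolerable data loss (zero means none)
    """
    return outage_duration <= rto and data_loss_window <= rpo

# A mission-critical OLTP app: seconds of downtime allowed, no data loss.
rto = timedelta(seconds=30)
rpo = timedelta(0)

# A 20-second outage with no lost transactions passes...
print(meets_objectives(timedelta(seconds=20), timedelta(0), rto, rpo))  # True
# ...but the same outage with 5 seconds of unreplicated writes does not.
print(meets_objectives(timedelta(seconds=20), timedelta(seconds=5), rto, rpo))  # False
```

The point of the sketch is that RTO and RPO are independent constraints: a recovery can be fast enough yet still fail its RPO, which is exactly the tradeoff the next section describes.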

The RTO will normally determine the type of HA and/or DR protection needed. Low recovery times usually demand robust HA provisions that protect against routine system and software failures, while longer RTOs can be satisfied with basic DR provisions designed to protect against more widespread, but far less frequent disasters.

The data replication used with HA and DR provisions can create the need for a potential tradeoff between RTO and RPO. In a low-latency LAN environment, where replication can be synchronous, the primary and secondary datasets can be updated concurrently. This enables full recoveries to occur automatically and in real-time, making it possible to satisfy the most demanding recovery time and recovery point objectives (a few seconds and zero, respectively) with no tradeoff necessary.

Across the WAN, by contrast, forcing the primary to wait for the secondary to confirm the completion of every transaction would adversely impact performance. For this reason, data replication across the WAN is usually asynchronous. This creates a tradeoff between RTO and RPO that normally results in longer recovery times. Here’s why: to satisfy an RPO of zero, manual processes are needed to ensure all data (e.g., from a transaction log) has been fully replicated to the secondary before failover can occur. This extra effort lengthens the recovery time, which is why such configurations are often used for DR rather than HA.
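The tradeoff can be seen in a toy simulation. This is an illustrative sketch, not any product’s replication engine; the 50 ms delay and the names `commit_sync`/`commit_async` are assumptions made for the example:

```python
import time
from collections import deque

WAN_DELAY = 0.05  # stand-in for a 50 ms WAN round trip (assumed figure)

class Secondary:
    """Toy DR-site replica that pays a network delay for every record."""
    def __init__(self):
        self.log = []

    def apply(self, record):
        time.sleep(WAN_DELAY)      # simulate the WAN round trip
        self.log.append(record)

def commit_sync(primary_log, secondary, record):
    # Synchronous: the commit waits for the secondary, so the secondary
    # is never behind (RPO = 0), but every write pays the WAN latency.
    secondary.apply(record)
    primary_log.append(record)

def commit_async(primary_log, pending, record):
    # Asynchronous: commit locally and queue the record for later shipping.
    # Anything still in `pending` when the primary fails is lost, which
    # is the source of the RPO/RTO tradeoff described above.
    primary_log.append(record)
    pending.append(record)

primary, secondary, pending = [], Secondary(), deque()

start = time.perf_counter()
for i in range(5):
    commit_sync(primary, secondary, f"txn-{i}")
sync_elapsed = time.perf_counter() - start     # roughly 5 x WAN_DELAY

start = time.perf_counter()
for i in range(5, 10):
    commit_async(primary, pending, f"txn-{i}")
async_elapsed = time.perf_counter() - start    # near zero

print(f"sync:  {sync_elapsed:.2f}s, secondary holds {len(secondary.log)} records")
print(f"async: {async_elapsed:.2f}s, {len(pending)} records not yet replicated")
```

Running it shows the synchronous loop taking roughly five WAN round trips while the asynchronous loop returns almost instantly, leaving five records exposed in the pending queue: low commit latency is purchased with a non-zero RPO.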

Azure Site Recovery (ASR) Service

ASR is Azure’s DR-as-a-service (DRaaS) offering. ASR replicates both physical and virtual machines to other Azure sites, potentially in other regions, or from on-premises instances to the Azure cloud. The service delivers a reasonably rapid recovery from system and site outages, and also facilitates planned maintenance by eliminating downtime during rolling software upgrades.

Like all DRaaS offerings, ASR has some limitations, the most serious being the inability to automatically detect and failover from many failures that cause application-level downtime. Of course, this is why the service is characterized as being for DR and not for HA.

With ASR, recovery times are typically 3-4 minutes depending, of course, on how quickly administrators are able to manually detect and respond to a problem. As described above, the need for asynchronous data replication across the WAN can further increase recovery times for applications with an RPO of zero.

SQL Server Failover Cluster Instance with Storage Spaces Direct

SQL Server offers two of its own HA/DR options: Failover Cluster Instances (discussed here) and Always On Availability Groups (discussed next).

FCIs afford two advantages: the feature is available in the less expensive Standard Edition of SQL Server, and, when paired with Storage Spaces Direct, it does not depend on the shared storage that traditional HA clusters require. This latter advantage is important because shared storage is simply not available in the cloud, from Microsoft or any other cloud service provider.

A popular choice for storage in the Azure cloud is Storage Spaces Direct (S2D), which supports a wide range of applications, and its support for SQL Server protects the entire instance and not just the database. A major disadvantage of S2D is that the servers must reside within a single data center, making this option suitable for some HA needs but not for DR. For multi-site HA and DR protections, the requisite data replication will need to be provided by either log shipping or a third-party failover clustering solution.

SQL Server Always On Availability Groups

While Always On Availability Groups is SQL Server’s most capable offering for both HA and DR, it requires licensing the more expensive Enterprise Edition. This option is able to deliver a recovery time of 5-10 seconds and a recovery point of seconds or less. It also offers readable secondaries for querying the databases (with appropriate licensing), and places no restrictions on the size of the database or the number of secondary instances.

An Always On Availability Groups configuration that provides both HA and DR protections consists of a three-node arrangement with two nodes in a single Availability Set or Zone, and the third in a separate Azure Region. One notable limitation is that only the database is replicated and not the entire SQL instance, which must be protected by some other means.

In addition to being cost-prohibitive for some database applications, this approach has another disadvantage: being application-specific, it requires IT departments to implement separate HA and DR provisions for all other applications. Using multiple HA/DR solutions can substantially increase complexity and cost (for licensing, training, implementation, and ongoing operations), which is another reason organizations increasingly prefer application-agnostic third-party solutions.

Third-party Failover Clustering Software

With its application-agnostic and platform-agnostic design, failover clustering software can provide a complete HA and DR solution for virtually all applications, on both Windows and Linux, in private, public, and hybrid cloud environments.

Being application-agnostic eliminates the need for having different HA/DR provisions for different applications. Being platform-agnostic makes it possible to leverage various capabilities and services in the Azure cloud, including Fault Domains, Availability Sets and Zones, Region Pairs, and Azure Site Recovery.

As complete solutions, the software includes, at a minimum, real-time data replication, continuous monitoring capable of detecting failures at the application level, and configurable policies for failover and failback. Most solutions also offer a variety of value-added capabilities that enable failover clusters to deliver recovery times below 20 seconds with minimal or no data loss to satisfy virtually all HA/DR needs.
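The monitoring-and-failover loop at the heart of such software can be sketched in a few lines. This is an illustrative toy, not SIOS’s actual implementation; the node dictionaries, the consecutive-failure threshold, and the function names are all assumptions made for the example:

```python
import time

def check_health(node):
    # An application-level probe. A real cluster would run a
    # service-specific check (e.g. execute a test SQL query),
    # not merely ping the host.
    return node.get("healthy", False)

def monitor(nodes, active, max_failures=3, interval=1.0):
    """Minimal failover loop: promote a standby only after the active
    node fails several consecutive probes, so a transient network blip
    does not trigger an unnecessary failover."""
    failures = 0
    while True:
        if check_health(nodes[active]):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                # Configurable failover policy: promote the first healthy standby.
                for name, node in nodes.items():
                    if name != active and check_health(node):
                        return name       # the new active node
        time.sleep(interval)

# Demo: the active node is down and one standby is healthy.
nodes = {"primary": {"healthy": False}, "standby": {"healthy": True}}
new_active = monitor(nodes, "primary", max_failures=3, interval=0.01)
print(new_active)  # → standby
```

Note that this toy loops forever if no healthy standby exists; production clustering software adds quorum, fencing, and failback policies on top of this basic pattern.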

Making It Real

All four options, whether operating separately or in concert, have roles to play in making the continuum of DR and HA protections more effective and affordable for the full spectrum of enterprise applications, from those that can tolerate some data loss and extended periods of downtime to those that require real-time recovery to achieve five-nines uptime with minimal or no data loss.

To survive the next cloud outage in the real world, make certain that whatever DR and/or HA provisions you choose are configured with at least two nodes spread across two sites. Also be sure to understand how well those provisions satisfy each application’s recovery time and recovery point objectives, as well as any limitations that might exist, including manual processes required to detect certain failures and trigger failovers in ways that ensure both application continuity and data integrity.

About Jonathan Meltzer

Jonathan Meltzer is Director, Product Management, at SIOS Technology. He has over 20 years of experience in product management and marketing for software and SaaS products that help customers manage, transform, and optimize their human capital and IT resources.

Reproduced from RTinsights

Filed Under: News and Events Tagged With: Azure, Cloud, cloud outage, cybersecurity, microsoft azure, multi-cloud, recovery, server failover, SQL, storage
