December 18, 2020
Calculating Application Availability In The Cloud
When deploying business critical applications in the cloud, you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.
When considering availability, you must consider the key components that make access to your application possible, which I’ll call the availability chain. The components of the availability chain are compute, network, storage, and the application and its dependent services.
Your application is only as available as your weakest link, and your overall availability drops with each additional link you add to the chain, because the availabilities of all the links are multiplied together. Let’s examine each of the links.
The three major cloud service providers have some similarities. One thing all three platforms have in common is the service level agreement (SLA) they will commit to for compute.
The SLA for all three public cloud providers for VMs when you have two or more VMs configured across different availability zones is 99.99%. Keep in mind that this SLA only guarantees the remote accessibility of one of the VMs at any given time. It makes no promises as to the availability of the services or application(s) running inside the VM. If you deploy a single VM within a single datacenter, this SLA varies from “90% of each hour” (AWS) to 99.5% (Azure and GCP) or 99.9% (Azure single VM when using Premium SSD).
True high availability starts at 99.99%, so the first step in ensuring your application is available is to make sure the application is distributed across two or more VMs that span availability zones. With two VMs spread across two availability zones, giving you 99.99% availability of at least one of those VMs, you could theorize that if you had three VMs spread across three availability zones your availability would be even greater than 99.99%. Although the cloud providers’ SLA will never guarantee beyond 99.99% availability regardless of the number of availability zones in use, if you use pure statistics and assume failures are independent, you might come to the conclusion that your availability could jump to as high as 99.999999%, or 8-nines of availability, which is about 26.30 milliseconds of downtime per month.
1-(.0001*.0001) = .99999999
99.999999% availability with three availability zones?
Don’t go around quoting that number. Just keep in mind that if two availability zones can give you 99.99% availability, it stands to reason that three availability zones are going to give you something significantly more than 99.99% availability.
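The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, the month length of 30.44 days is an assumption, and the formula assumes failures are completely independent, which real availability zones do not guarantee.

```python
# Sketch of the "pure statistics" calculation: a redundant group is up
# as long as at least one of its independent copies is up.
SECONDS_PER_MONTH = 30.44 * 24 * 60 * 60  # assumed average month length


def parallel_availability(single: float, copies: int) -> float:
    """Availability of a group that only fails when every copy fails."""
    return 1.0 - (1.0 - single) ** copies


def downtime_ms_per_month(availability: float) -> float:
    """Expected downtime in milliseconds over an average month."""
    return (1.0 - availability) * SECONDS_PER_MONTH * 1000.0


# Same numbers as the article: 1 - (.0001 * .0001)
combined = parallel_availability(0.9999, 2)
print(f"{combined:.8f}")                            # 0.99999999
print(f"{downtime_ms_per_month(combined):.1f} ms")  # 26.3 ms
```

Treat the result as an upper bound on what redundancy can buy you, not as a number any provider will put in an SLA.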
Compute is just one link in the availability chain. We still have to address network, storage and other dependent services, which all represent possible points of failure.
In order for your application to be available, every network hop between the client and the application and all the resources that the application depends on, must be available and working within tolerable latency ranges. You need to understand the network links between database servers, application servers, web servers and clients to know precisely where the network might fail. Remember, the more links in your availability chain the lower your overall availability will be.
Although network availability between VMs in the same vNet is covered under the standard compute SLA, there are other network services that you may be utilizing. A load balancer is one common example of a network service that would impact overall application availability.
Building on what we have learned so far, let’s take a look at the availability of an application that is deployed across two availability zones.
99.99% compute availability
99.99% load balancer availability
.9999 * .9999 = .9998
99.98% availability = ~9 minutes downtime per month
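As a quick check of the numbers above, here is a minimal Python sketch that multiplies the links of a chain together and converts the result into minutes of downtime. The helper names and the 30.44-day month are assumptions made for this example.

```python
from math import prod

MINUTES_PER_MONTH = 30.44 * 24 * 60  # assumed average month length


def chain_availability(links):
    """Availability of links in series: every link must be up at once."""
    return prod(links)


def downtime_minutes_per_month(availability):
    """Expected minutes of downtime over an average month."""
    return (1.0 - availability) * MINUTES_PER_MONTH


# 99.99% compute and 99.99% load balancer, as in the example above
two_links = chain_availability([0.9999, 0.9999])
print(f"{two_links:.4f}")                                   # 0.9998
print(f"{downtime_minutes_per_month(two_links):.1f} min")   # 8.8 min
```

Every additional 99.99% link in series adds roughly another 4.4 minutes of potential downtime per month.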
Now that we have addressed compute and network availability, let’s move on to storage.
Now here is where the story gets a little hairy. Have a look at the following storage SLAs
It seems pretty clear that Azure and Google are giving you a 99.9% SLA on block storage solutions. AWS doesn’t mention EBS specifically here. They only talk about VMs and measure single-instance VM availability by the hour instead of by the month as the other cloud providers do. For the sake of discussion, let’s use the 99.9% availability guarantee that both Azure and GCP have published.
Building upon our previous example, let’s add some storage to the equation.
99.99% compute availability
99.99% load balancer availability
99.9% managed disk
.9999 * .9999 * .999 = .9988
99.88% availability = ~53 minutes of downtime per month.
53 minutes of downtime is a lot more than the 9 minutes of downtime we calculated in our previous example. What can we do to minimize the impact of the 99.9% storage availability? We have to build more redundancy in the storage!
Fortunately, we usually include storage redundancy when planning for application availability. For instance, when we stand up web servers, each web server will typically store data on the locally attached disk. When deploying domain controllers, Microsoft Active Directory takes care of replicating AD information across all the domain controllers. In the case of something like SQL Server, we leverage things like Always On Availability Groups or SIOS DataKeeper to keep the data in sync across locally attached disks.
The more copies of the data we have distributed across different availability zones, the more likely we will be able to survive a failure.
For example, an application that stores its data across two different disks in different availability zones will benefit from the redundancy and instead of 99.9% availability it is more likely to achieve 99.9999% availability of the storage.
1-(.001*.001) = .999999
If we throw that into the previous equation, the picture starts to look a little brighter.
.9999 * .9999 * .999999 = .9998
99.98% availability = ~9 minutes of downtime per month
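The before and after storage scenarios can be compared side by side. The sketch below is illustrative only: each link is described as a hypothetical (availability, copies) pair, redundant copies are assumed to fail independently, and a month is assumed to be 30.44 days long.

```python
from math import prod

MINUTES_PER_MONTH = 30.44 * 24 * 60  # assumed average month length


def chain(links):
    """Availability of a chain of (availability, redundant_copies) links.

    Each link is up if any one of its copies is up, and the chain is up
    only when every link is up.
    """
    return prod(1.0 - (1.0 - a) ** copies for a, copies in links)


def downtime_minutes(availability):
    return (1.0 - availability) * MINUTES_PER_MONTH


# Compute, load balancer, and storage on one disk versus two disks
single_disk = chain([(0.9999, 1), (0.9999, 1), (0.999, 1)])
two_disks = chain([(0.9999, 1), (0.9999, 1), (0.999, 2)])

print(f"single disk: {single_disk:.4f}, {downtime_minutes(single_disk):.0f} min/month")
print(f"two disks:   {two_disks:.4f}, {downtime_minutes(two_disks):.0f} min/month")
```

With these inputs the single-disk chain comes out at roughly 53 minutes of downtime per month and the two-disk chain at roughly 9, which matches the figures above.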
By duplicating the data across multiple AZs, and therefore multiple disks, we have effectively mitigated the downtime associated with cloud storage.
Application And Dependent Services Availability
You’ve done all you can do to ensure compute, network, and storage availability. But what about the application itself? Some applications can scale out and provide redundancy by load balancing between multiple instances of the same application. Think of your typical web server farm where you may typically load balance web requests between five servers. If you lose one server, the load balancer simply removes it from its rotation until it is once again responsive.
Other applications require a little more care and monitoring. Take SQL Server for instance. Typically Always On Availability Groups or Failover Cluster Instances are used to monitor database availability and take recovery actions should a database become unresponsive due to application or system level failures. While there is no published SLA for SQL Server availability solutions, it is commonly accepted that when configured properly for high availability, a SQL Server can provide 99.99% availability.
You may also rely on other cloud based services, like hosted Active Directory, hosted DNS, or microservices. These services, and even the availability of the cloud portal itself, should all be factored into your overall availability equation.
Application availability is the product of all the moving parts. Skimping in just one area can drag down the overall availability of your application, no matter how well the other areas are protected. Take your time and investigate all the links in your availability chain for weakness, including compute, network, storage, application and dependent services.
In general, the numbers presented here are hopefully worst case scenarios, and your actual availability should exceed the published SLAs. Do your homework and be wary of any service that cannot guarantee 99.99% availability, the typical threshold of what is considered highly available.
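One way to do that homework is to list every link in the chain with its published SLA and flag anything below the 99.99% threshold. The sketch below is a hypothetical helper, and the link names and figures are simply the ones used in this article, not values read from any provider API.

```python
from typing import Dict, List

HA_THRESHOLD = 0.9999  # the 99.99% bar used in this article


def weak_links(links: Dict[str, float], threshold: float = HA_THRESHOLD) -> List[str]:
    """Return the names of links below the threshold, weakest first."""
    below = [name for name, sla in links.items() if sla < threshold]
    return sorted(below, key=lambda name: links[name])


availability_chain = {
    "compute (two AZs)": 0.9999,
    "load balancer": 0.9999,
    "managed disk (single copy)": 0.999,
    "SQL Server (configured for HA)": 0.9999,
}

print(weak_links(availability_chain))  # ['managed disk (single copy)']
```

Any link that shows up in the output is a candidate for added redundancy, as was done with storage earlier.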
Human error and security were not addressed in this article. You can make your application as highly available as possible. However, if you have not taken steps to secure your application against external threats and stupid human mistakes then all bets are off when it comes to availability.
December 8, 2020
5 Signs That It Will Take More Than A Blog Post To Fix Your High Availability
The signs are there. The warning lights are flashing. In your gut, you can sense it. Maybe you can’t sleep. Your problems with high availability are deep. But, maybe you are not quite sure.
1. If you think your cloud SLA is all you need for high availability
Cloud solutions have provided great advancements in increased hardware availability and resilience. However, application high availability requires more than just selecting the right hypervisor or cloud provider. Your strategy for high availability cannot stop with the SLA provided by the cloud or a virtualization provider. As quoted by Wired, “The almost four-day Amazon outage of April 2011 did not breach Amazon’s EC2 SLA, which as a FAQ explains, ‘guarantees 99.95% availability of the service within a Region over a trailing 365 period.’” In this DZone article, our own David Bermingham breaks down the differences between cloud SLAs and application availability in detail. If you want a highly available infrastructure, it must include monitoring, recovery, and resilience at the data and application layers as well.
2. If you are just using the high availability clustering that came with your open source operating system
Chances are you didn’t select your database based on what was bundled with the OS, so why would you select your HA solution based on that criterion alone? Bundled tools go a long way in providing extra assurance, possibilities, and capabilities. However, despite the ease of access, bundled tools and OS clustering software are not always capable of meeting your SLA, RPO, RTO, and availability requirements. If your enterprise has a combination of operating systems, your team will likely need help navigating different tools and understanding how they integrate together. It’s kind of like choosing the hedge clippers and push reel mower left on the curb to shape “Azalea” on the 13th hole par 5 (at Augusta). Both tools are designed to cut, but how much time do you have? How are you going to handle the complexity? Which would you trust? Your strategy for high availability requires more than just considering the conveniences of what is bundled with the OS. Otherwise, you’d be running MySQL instead of SAP HANA.
3. If you think that enterprise application licensing, such as SQL Enterprise or Oracle Enterprise, is the same thing as enterprise high availability
In addition to increased cost, many enterprise application licenses also increase the ability of the application to recover in some high availability scenarios. However, it is highly unlikely that your entire enterprise is based on a single application. Your high availability is going to require more than just a highly available database solution. You’ll need an enterprise grade application monitoring and recovery solution with a breadth of support for all of your applications and databases. In addition, you’ll need the ability to manage and replicate not just database data, but critical application and configuration data as well. Availability for a single database or a simple application is one thing – but HA for a complex, multipart application and supporting database is very different. More services, more parts that need to be coordinated, more complex architecture to orchestrate, more specific best practices to adhere to before, during and after failover/switchover. More than what your enterprise license paid for.
4. If your downtime is growing and your uptime is shrinking
The pace of life is ever increasing in many fields. When was the last time your team recovered from backup, manually restarted the applications that were deemed critical, or restarted a set of failed virtual machines or nodes? The pace of your outage events cannot continue to outpace sustainability, or your team’s ability to move beyond firefighting to fire prevention and fire proofing. “You can only run so hard so long” (Carey Nieuwhof). Some of you have been firefighting for too long, and your outages are becoming more common than your uptime.
5. If your first failover test was on the production server
A recent client remarked that it is simply impossible to test for every possible disaster scenario. As new software is created, deployed, updated, and patched, the challenges in higher availability are increasing. But your live production data is not the place to find out what does not play well together. And while Go-Live and Post-Go-Live will always have their share of surprises, the inability to actually fail over and run on the backup node should not be one of them.
Scouring blogs can provide you with helpful tips and insights to define, redefine, and improve your higher availability. But, if the warning signs are going off that you’ve traded true availability for some semblance of ‘just enough’, then it will take more than a blog post, or scouring every blog post in the availability world for that matter, to fix your HA.
– Cassius Rhue, Vice President, Customer Experience
Reproduced with permission from SIOS
November 20, 2020
SIOS AppKeeper now available in the AWS Marketplace
Making it easier to add automated remediation to your DevOps environment.
Today we are excited to announce that our SIOS AppKeeper solution is now available on the AWS Marketplace, a digital catalog with thousands of software listings from independent software vendors that make it easy to find, test, buy and deploy software that runs on Amazon Web Services (AWS). Now it is easier than ever for end-users and AWS Partner Network (APN) members to try, acquire, and deploy SIOS AppKeeper to add automated remediation to their DevOps environments. Click here to see AppKeeper in the AWS Marketplace.
SIOS AppKeeper continuously monitors and protects your applications running on Amazon EC2. We’ve been selling AppKeeper in Japan since 2017 and brought the SaaS service to the U.S. market earlier this year. We created AppKeeper in response to the demand we were hearing from our customers who were moving to the cloud and were concerned about reducing potential downtime while struggling with limited resources. Click here if you would like to see a video on how easy it is to install and use AppKeeper.
How often are Amazon EC2 users experiencing downtime? According to our customer data, the average customer with only three Amazon EC2 instances experiences downtime at least once a month. That could be due to software configuration mistakes, etc.
Going beyond application monitoring to offer automated remediation
Many AWS users are deploying application performance monitoring (APM) solutions, such as those from AppDynamics, Datadog, Dynatrace or New Relic, to monitor their AWS environments. But these only alert you to the fact that something happened, and why it happened. They don’t do anything to reduce your downtime.
That’s where AppKeeper comes in. If AppKeeper detects downtime with any application services running on Amazon EC2, it automatically responds by restarting affected services and rebooting instances if necessary. AppKeeper addresses 85% of application service failures, and its automated recovery reduces the need for expensive outsourced monitoring or distractions for your IT team. Learn more about APM automation from AppKeeper.
AWS customers who are already using an APM solution and want to extend the functionality to include automatic remediation, if and when Amazon EC2 downtime is detected, can take advantage of AppKeeper’s webhooks API to integrate with their chosen APM solution.
Why we decided to list SIOS AppKeeper to the AWS Marketplace
Here at SIOS Technology Corporation we have had a strategic partnership with AWS since 2014, primarily around our SIOS DataKeeper and SIOS LifeKeeper high-availability solutions. SIOS Technology is an APN Advanced Partner today, and we share hundreds of joint customers.
Now that we have customer proofpoints for the effectiveness of SIOS AppKeeper (here are some recent case studies that you might enjoy), we wanted to make it easier for Amazon customers and APN partners to try, buy and use AppKeeper. By many estimates there are over 200,000 active AWS customers using software from the AWS Marketplace, all of whom are taking advantage of how easy the AWS Marketplace makes it to discover, acquire and use complementary solutions as they continue on their cloud journeys.
And our friends at Amazon couldn’t have said it better: “As our customers migrate more and more applications to the cloud, they are looking for flexibility in balancing the level of availability with costs across all of their applications,” said Chris Grusz, Director, AWS Marketplace, Amazon Web Services, Inc. “We’re delighted to welcome SIOS AppKeeper to AWS Marketplace and to provide our customers with more choice when performance changes occur.”
AWS customers who are interested in protecting their EC2 applications from unnecessary downtime can now quickly try out AppKeeper for themselves, and can acquire AppKeeper under their Amazon Enterprise Discount Plan, if they have one in place. Pricing for SIOS AppKeeper starts at only US$40 per instance, per month.
Partners are now integrating AppKeeper into their customer solutions
A variety of partners are now integrating AppKeeper into their customer solutions, and having AppKeeper available in the AWS Marketplace means it will be easier for APN members to evaluate whether the solution is a fit for their business and their customers. Managed Service Providers (MSPs) are starting to include AppKeeper in how they monitor and manage their customers’ AWS environments, as a way to reduce downtime and their own operational costs. Other ISVs are integrating AppKeeper’s automated remediation functionality into their own cloud management solutions, and AWS Consulting Partners are packaging AppKeeper as they develop and deploy applications on AWS for their customers.
APN members who are interested in evaluating whether AppKeeper is a fit for their business should contact us by email at email@example.com.
We hope you will try out SIOS AppKeeper for yourself (we have a 14-day free trial and an easy installation process), and join the many customers who are now relaxing knowing that they have automated remediation in place to reduce any Amazon EC2 downtime that they might experience.