SIOS SANless clusters


Archives for December 2020

How To Clone Availability In The Cloud With Better Outcomes

December 30, 2020 by Jason Aw Leave a Comment

Tips from the movies – Multiplicity

Multiplicity is a 1996 American science fiction comedy film starring Michael Keaton as Doug Kinney, a busy construction worker struggling to make time for his family and his demanding job. When a scientist offers to clone him, Doug agrees, hoping it will make meeting his schedule and commitments easier. But then the copies of him begin making copies of themselves, and by the time the last copy is made, the point is clear: cloning may not be all it's cracked up to be, or at the very least comes with some strong warnings, challenges, and side effects. The famous original Star Trek episode "The Trouble with Tribbles" illustrates a similar point.

Like cloning on the big screen (or small), cloning in the cloud is a great tool, but not without its challenges.

Tips for how to get better outcomes when you clone availability in the cloud

1. Clone operational systems

This sounds obvious, but I have seen it happen more than once in real enterprise environments. If you clone your non-functional system, the clone will be equally non-functional and problematic when you restore it. Be sure that the clone you make was from an operational and functional system.

2. Sync data to disk and resync on restore

File system integrity is critical. If you don’t ensure your application and/or VM are in a consistent state, most vendors will not guarantee the resulting created image. Since snapshots only capture data that has been written to your volume at the time the snapshot command is issued, this might exclude any data that has been cached by any applications or the operating system. Making sure data has been properly synced to the file system is an important step, and absolutely critical in a cluster environment.
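The "sync before snapshot" idea can be sketched in a few lines of Python (a minimal illustration, assuming a Linux host; the file path and data are hypothetical):

```python
import os

def flush_to_disk(path: str, data: bytes) -> None:
    """Write data and force it through OS caches so that a snapshot
    taken immediately afterwards captures a consistent copy."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's user-space buffer
        os.fsync(f.fileno())  # commit the kernel page cache to disk

flush_to_disk("/tmp/snapshot-demo.dat", b"state to snapshot")
```

Cluster replication software typically handles this step for you; the point is simply that buffers which have not been synced are invisible to the snapshot.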

File system integrity is also critical to keep in mind when you restore from an image. If you are using data replication and you restore an image as source or target in the cluster, making sure the two nodes are in sync is paramount. Failing to do so may lead to file system errors on failover or switchover, or even potential data loss. Clone availability in the cloud to get the result you want.

3. Stop your instance

Many environments do not require you to stop an instance to create an image, and some, such as AWS, will power down the node for you before making the copy. However, many tools and guides recommend making sure applications are stopped and file system writes are fully synced, to avoid damage, loss of integrity, or images that have trouble starting, stopping, or running installed applications.

4. Label everything in the cloud (nodes, disks, NICs,  everything)

While creating a clone is a free operation, the resulting disks and components typically are not. AWS states, for example, that you are "charged for the snapshots until you deregister the image and delete the snapshots." When things aren't labeled, knowing what is or isn't in use, and why it was created, can become problematic. It also becomes subject to the fleeting memories and divided attention of existing team members. Label everything.

5. Prune clones and snapshots often (cost savings and headache savings)

Pruning old snapshots and clones not only saves cost, it also reduces headaches. Older snapshots run the risk of reintroducing vulnerabilities that have been addressed or resolved in newer copies. As VP of Customer Experience at SIOS Technology Corp., I saw the consequences firsthand when we worked with a customer who restored from a snapshot. They ran into several problems as they restarted the application. After troubleshooting, we determined that the clone was running an older version of security software. The cached credentials and metadata stored in the user profile were no longer in sync with the actual application data stored on the externally mounted data drives.
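As a sketch of what pruning automation might look like, the age filter can be expressed in a few lines. The field names mirror the shape of the AWS describe-snapshots response, but the 90-day cutoff and the snapshot IDs are illustrative assumptions, and the actual delete call is deliberately left out:

```python
from datetime import datetime, timedelta, timezone

def snapshots_to_prune(snapshots, max_age_days=90):
    """Return the snapshots older than max_age_days.
    Each entry is shaped like an AWS describe-snapshots record:
    {'SnapshotId': str, 'StartTime': tz-aware datetime}."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [s for s in snapshots if s["StartTime"] < cutoff]

now = datetime.now(timezone.utc)
snaps = [
    {"SnapshotId": "snap-old", "StartTime": now - timedelta(days=200)},
    {"SnapshotId": "snap-new", "StartTime": now - timedelta(days=5)},
]
print([s["SnapshotId"] for s in snapshots_to_prune(snaps)])  # ['snap-old']
```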

6. Limit or restrict cloning of clones in the cloud

Lastly, not everything you do in the cloud needs to be cloned. Consider limiting the types of workloads that you will clone, and restrict the number of roles that can create clones in your environment.

In the movie, when Doug's clones spark their own series of duplications, an already overwhelmed Doug (Michael Keaton) is forced to exert extra energy to manage his many clones while trying to hide the mess he created from his wife. Achieving clone availability in the cloud with better outcomes is not difficult. Clone carefully to avoid creating more work, and more risk, from a tool that was supposed to make your work easier and your environment safer.

– Cassius Rhue, Vice President, Customer Experience

Reproduced from SIOS

Filed Under: Clustering Simplified Tagged With: #SANLess Clusters for SQL Server Environments, clone, clusters

New Product Release: SIOS Protection Suite for Linux 9.5.1

December 26, 2020 by Jason Aw Leave a Comment

SIOS is continually updating our products to meet our customers' evolving needs for high availability for mission-critical applications. We are excited to announce the general availability of SIOS Protection Suite for Linux version 9.5.1! This release adds support for a wider range of platforms and includes enhancements to our command-line interface.

Key updates include

      • Support for the following operating systems and platforms: VMware vSphere 7, Red Hat Enterprise Linux (RHEL) 8.2, Oracle Linux 8.2, CentOS 8.2, SUSE Linux Enterprise Server (SLES) 12 SP5, RHEL 7.8, CentOS 7.8, Oracle Linux 7.8, SLES 15 SP2
      • CLI Auto Install with enhanced setup script –  for faster, easier implementation
      • Expanded CLI support for ARKs and Cloning – enables simple, consistent deployment of multiple clusters

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, sios protection suite, SIOS Protection Suite for Linux, Virtual Environments

Six Reasons Your Cloud Migration Has Stalled

December 22, 2020 by Jason Aw Leave a Comment

More and more customers are seeking to take advantage of the flexibility, scalability and performance of the cloud. As the number of applications, solutions, customers, and partners making the shift increases, be sure that your migration doesn’t stall.

Avoid the Following Six Reasons Cloud Migrations Stall

1. Incomplete cloud migration project plans

Project planning is widely thought to be a key contributor to project success. Planning plays an essential role in helping guide stakeholders, diverse implementation teams, and partners through the project phases. It helps identify desired goals, align resources and teams to those goals, reduce risks, avoid missed deadlines, and ultimately deliver a highly available solution in the cloud. Incomplete plans and incomplete planning are often a big cause of stalled projects: at the eleventh hour a key dependency is identified, or during an unexpected server reboot an application monitoring and HA gap is discovered (see below). Be sure that your cloud migration has a plan, and work the plan.

2. Over-engineering on-premises

"This is how we did it on our on-premises nodes," was the phrase that started a recent customer conversation. The customer engaged with Edmond Melkomian, Project Manager for SIOS Professional Services, when their attempts to migrate to the cloud stalled. During a discovery session, Edmond was able to uncover a number of over-engineered items related to on-premises versus cloud architecture. For some projects, reproducing what was done on premises can be a recipe for bloat, complexity, and delays. Analyze your architecture and migration plans and ruthlessly eliminate over-engineered components and designs, especially with networking and storage.

3. Under-provisioning

Controlling cost and preventing sprawl are critical aspects of cloud migrations. However, some customers, anxious about per-hour charges and the associated costs of disks and bandwidth, fall into the trap of under-provisioning. In this trap, resources are improperly sized: disks with the wrong speed characteristics, compute resources with the wrong CPU or memory footprint, or clusters with the wrong number of nodes. Issues then arise when User Acceptance Testing (UAT) begins and anticipated workloads create a logjam on undersized resources, or a cost-optimized target node cannot properly handle resources in a failover scenario. While resizing virtual machines in the cloud is a simple process, these sizing issues often create delays while architects and chief financial officers assess the impact of re-provisioning resources.

4. Internal IT processes

Every great enterprise company has a set of internal processes, and chances are your team and company are no exception. IT processes in particular can have a large impact on the success of your cloud migration strategy. In the past, many companies had long requisition and acquisition processes, including bids, sizing guides, order approvals, server prep and configuration, and final deployment. The cloud has dramatically altered the way compute, storage, and network resources, among others, are acquired and deployed. However, if your processes haven't kept up with the speed of the cloud, your migration may hit a snag when plans change.

5. Poor High Availability planning

Another reason that cloud migrations can stall involves high availability planning. High availability requires more than a bundle of tools or enterprise licenses.  HA requires a careful, thorough and thoughtful system design.  When deploying an HA solution your plan will need to consider capacity, redundancy, and the requirements for recovery and correction. With a plan, requirements are properly identified, solutions proposed, risks thought through, and dependencies for deployment and validation managed. Without a plan, the project and deployment are vulnerable to risks, single point of failure issues, poor fit, and missing layers and levels of application protection or recovery strategies.  Often when there has been a lack of HA planning, projects stall while the requirements are sorted out.

6. Incomplete or invalid testing

Ron, a partner migrating his end customer to the cloud, planned to go live over an upcoming three-day weekend. The last decision point for 'go/no-go' was a batch of user acceptance testing on the staging servers. The first test failed. To make up for time lost to other migration snags, Ron and team had skipped a number of test cases related to integrating the final collection of security and backup software on the latest OS with supporting patches. The simulated load, the first on the newly minted servers, tripped a series of issues within Ron's architecture, including a kernel bug, a CPU and memory provisioning issue, and storage layout and capacity issues. The project was delayed more than four weeks to rebuild customer confidence, complete proper testing and validation, resize the architecture, and apply software and OS fixes.

The promises of the cloud are enticing, and a well planned cloud migration will position you and your team to take advantage of these benefits. Whether you are beginning or in the middle of a cloud migration, we hope this article helps you be more aware of common pitfalls so you can hopefully avoid them.

– Cassius Rhue, Vice President, Customer Experience

Reproduced from SIOS

Filed Under: Clustering Simplified Tagged With: Amazon AWS, Amazon EC2, Azure, Cloud

Calculating Application Availability In The Cloud

December 18, 2020 by Jason Aw Leave a Comment

When deploying business critical applications in the cloud, you want to make sure they are highly available. The good news is that if you plan properly, you can achieve 99.99% (4-nines) of availability or more. However, calculating your true availability may not be as straightforward as it seems.

When considering availability, you must account for the key components that make access to your application possible, which I'll call the availability chain. The components of the availability chain are:

  • Compute
  • Network
  • Storage
  • Application
  • Dependent services

Your application is only as available as your weakest link, and your potential downtime compounds with each additional link you add to the chain. Let's examine each of the links.

Compute Availability

Each of the three major cloud service providers has some similarities. One thing in common across all three platforms is the service level agreement (SLA) they will commit to for compute.

The SLA for all three public cloud providers for VMs, when you have two or more VMs configured across different availability zones, is 99.99%. Keep in mind, this SLA only guarantees the remote accessibility of one of the VMs at any given time; it makes no promises as to the availability of the services or application(s) running inside the VM. If you deploy a single VM within a single datacenter, this SLA varies from "90% of each hour" (AWS) to 99.5% (Azure and GCP) or 99.9% (Azure single VM when using Premium SSD).

True high availability starts at 99.99%, so the first step in ensuring your application is available is to distribute it across two or more VMs that span availability zones. With two VMs spread across two availability zones, giving you 99.99% availability of at least one of those VMs, you could theorize that three VMs spread across three availability zones would push availability even higher. Although the cloud providers' SLAs will never guarantee beyond 99.99% availability regardless of the number of availability zones in use, pure statistics might lead you to conclude that your availability could jump as high as 99.999999%, or 8-nines of availability: about 26.30 milliseconds of downtime per month.

1-(.0001*.0001) = .99999999

99.999999% availability with three availability zones?

Don't go around quoting that number. But keep in mind: if two availability zones can give you 99.99% availability, it stands to reason that three availability zones will give you something meaningfully more than 99.99% availability.
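The compounding logic can be checked with a few lines of Python (a sketch: `parallel_availability(0.9999, 2)` reproduces the 1-(.0001*.0001) figure above, and adding a third independent zone pushes the result higher still):

```python
def parallel_availability(single_az: float, zones: int) -> float:
    """Probability that at least one of `zones` independent replicas is up."""
    return 1 - (1 - single_az) ** zones

print(round(parallel_availability(0.9999, 2), 8))  # 0.99999999, i.e. 8-nines
print(parallel_availability(0.9999, 3) > parallel_availability(0.9999, 2))  # True
```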

Compute is just one link in the availability chain. We still have to address network, storage and other dependent services, which all represent possible points of failure.

Network Availability

In order for your application to be available, every network hop between the client and the application and all the resources that the application depends on, must be available and working within tolerable latency ranges. You need to understand the network links between database servers, application servers, web servers and clients to know precisely where the network might fail. Remember, the more links in your availability chain the lower your overall availability will be.

Although network availability between VMs in the same vNet is covered under the standard compute SLA, there are other network services that you may be utilizing. Here are just a few examples of network services which, if used, would impact overall application availability.

Express Route – 99.95%
VPN Gateway – 99.9% through 99.95%
Load Balancer – 99.99%
Traffic Manager – 99.99%
Elastic Load Balancer – 99.99%
Direct Connect – 99.9% – 99.99%

Building on what we have learned so far, let’s take a look at the availability of an application that is deployed across two availability zones.

99.99% compute availability

99.99% load balancer availability

.9999 * .9999 = .9998

99.98% availability = ~9 minutes downtime per month

Now that we have addressed compute and network availability, let’s move on to storage.

Storage Availability

Now here is where the story gets a little hairy. Have a look at the following storage SLAs:

https://azure.microsoft.com/en-us/support/legal/sla/storage/v1_5/

https://cloud.google.com/storage/sla

https://aws.amazon.com/compute/sla/

It seems pretty clear that Azure and Google are giving you a 99.9% SLA on block storage solutions. AWS doesn't mention EBS specifically here; they only talk about VMs, and they measure single-instance VM availability by the hour instead of by the month as the other cloud providers do. For the sake of discussion, let's use the 99.9% availability guarantee that both Azure and GCP have published.

Building upon our previous example, let’s add some storage to the equation.

99.99% compute availability

99.99% load balancer availability

99.9% managed disk

.9999 * .9999 * .999 = .9988

99.88% availability = ~53 minutes of downtime per month.
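The serial-chain arithmetic above can be reproduced with a short helper (a sketch; it assumes a 30-day, 43,200-minute month, which yields roughly 52 minutes rather than the ~53 quoted above, which uses a slightly longer month):

```python
def chain_availability(*links: float) -> float:
    """Availability of a serial chain: every link must be up simultaneously."""
    total = 1.0
    for link in links:
        total *= link
    return total

def downtime_minutes_per_month(availability: float, month_minutes: int = 43_200) -> float:
    """Expected downtime for a 30-day (43,200-minute) month."""
    return (1 - availability) * month_minutes

a = chain_availability(0.9999, 0.9999, 0.999)  # compute * load balancer * managed disk
print(round(a, 4))                              # 0.9988
print(round(downtime_minutes_per_month(a)))     # 52
```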

53 minutes of downtime is a lot more than the 9 minutes of downtime we calculated in our previous example. What can we do to minimize the impact of the 99.9% storage availability? We have to build more redundancy in the storage!

Fortunately, we usually include storage redundancy when planning for application availability. For instance, when we stand up web servers, each web server will typically store data on its locally attached disk. When deploying domain controllers, Microsoft Active Directory takes care of replicating AD information across all the domain controllers. In the case of something like SQL Server, we leverage things like Always On Availability Groups or SIOS DataKeeper to keep the data in sync across locally attached disks.

The more copies of the data we have distributed across different availability zones, the more likely we will be able to survive a failure.

For example, an application that stores its data across two different disks in different availability zones will benefit from the redundancy and instead of 99.9% availability it is more likely to achieve 99.9999% availability of the storage.

1 – (.001 * .001) = .999999

If we throw that into the previous equation, the picture starts to look a little brighter.

.9999 * .9999 * .999999 = .9998

99.98% availability = ~9 minutes of downtime

By duplicating the data across multiple AZs, and therefore multiple disks, we have effectively mitigated the downtime associated with cloud storage.

Application And Dependent Services Availability

You’ve done all you can do to ensure compute, network, and storage availability. But what about the application itself? Some applications can scale out and provide redundancy by load balancing between multiple instances of the same application. Think of your typical web server farm where you may typically load balance web requests between five servers. If you lose one server, the load balancer simply removes it from its rotation until it is once again responsive.

Other applications require a little more care and monitoring. Take SQL Server for instance. Typically Always On Availability Groups or Failover Cluster Instances are used to monitor database availability and take recovery actions should a database become unresponsive due to application or system level failures. While there is no published SLA for SQL Server availability solutions, it is commonly accepted that when configured properly for high availability, a SQL Server can provide 99.99% availability.

Other cloud-based services you may rely on, such as hosted Active Directory, hosted DNS, microservices, or even the availability of the cloud portal itself, should all be factored into your overall availability equation.

Summary

Application availability is the product of all the moving parts. Skimping in just one area can disproportionately impact the overall availability of your application. Take your time and investigate all the links in your availability chain for weakness, including compute, network, storage, application, and dependent services.

In general, the numbers presented here are worst-case scenarios, and your actual availability should exceed the published SLAs. Do your homework and be wary of any service that cannot guarantee 99.99% availability, the typical threshold for what is considered highly available.

Human error and security were not addressed in this article. You can make your application as highly available as possible, but if you have not taken steps to secure your application against external threats and simple human mistakes, then all bets are off when it comes to availability.

Reproduced with permission from Clusteringformeremortals

Filed Under: Clustering Simplified Tagged With: Application availability, Cloud

Using Datadog for Amazon EC2 Monitoring? Pair with SIOS AppKeeper for Automated Remediation

December 11, 2020 by Jason Aw Leave a Comment

Have you ever thought to yourself, “It would be nice if Datadog could monitor our Amazon EC2 services and automatically restart them when it detects a failure?”  I thought the same thing, and decided to try it out for myself.

SIOS AppKeeper automatically monitors Amazon EC2 instances for failures and automatically restarts instances or even reboots services when failures are detected.  I thought to myself, “What if we combined the monitoring capabilities of Datadog with AppKeeper’s automated remediation capabilities?”

It worked, and here is how I did it.

If you are already using Datadog and are interested in trying this out for yourself, please sign up at the end of this article for access to our API.

Here are the steps I took to set up AppKeeper to receive alerts from Datadog and restart the webserver on Amazon EC2 when downtime is detected.

To run this experiment successfully, we already had a Datadog account, an AppKeeper account, and an NGINX webserver running on Amazon EC2 (Amazon Linux 2).

How to integrate Datadog with AppKeeper to provide automated remediation

Step One: Get the Restart API Token from AppKeeper

Request the API Token for the Datadog integration from this form:

https://mk.sios.jp/BC_AppKeeper_Datadog_api_application

If you request it from the form, the token will be sent to the email address you provide.

Step Two: Create the tenant in AppKeeper

The next step was to register the AWS account to which the monitored instance belongs in AppKeeper. (AppKeeper refers to the registered AWS accounts as “tenants.”)

https://sioscoati.zendesk.com/hc/en-us/articles/900000123406-Quick-Start-Guide#h_39404cfb-4a76-450f-99c2-e197cc63e50d

Step Three: Create an IAM Role in AWS

I then created an IAM Role in AWS (you need this to set up your AppKeeper account).  Here are instructions if you are unfamiliar with this process.

Step Four:  Add the tenant in AppKeeper

The next step was to add the “tenant” in AppKeeper (AppKeeper considers an AWS account a “tenant”).  Here is a link to detailed instructions on doing this.

Step Five: Set up the Synthetics Test in Datadog

I then needed to configure Datadog's synthetic monitoring for the Nginx server (EC2 instance) that we want to monitor. Here's how to do that:

Open the Datadog dashboard and select UX Monitoring > Synthetic Tests from the menu.

Click the [New Test] button in the upper right corner and select [New API Test] to create a synthetic monitoring test case.

Enter the following information in the form to create the test case.

  1. Choose Request Type
    Select “HTTP”.
  2. Define Request:
    Set the following values.
    URL : GET http://{{ EC2 IP address }}
    Name : AppKeeper Datadog Integration Test (any name)
    Locations : Tokyo

 

3. Specify test frequency
No Change

4. Define assertion
Click on “New Assertion” and set the following values

When : [status code] [is] [200]

5. Define Alert Condition
No Change

6. Notify Your Team
No Change

Step Six: Run the Synthetics test in Datadog

Once the above inputs are completed, press "Create Test" to create the synthetic monitoring test case.

The results are visible and we can see that the webserver is working properly in the “Test Results” section.

That was all that had to be done to configure Synthetics monitoring using Datadog.

Step Seven: Set AppKeeper to receive Synthetics alerts

Next I had to set AppKeeper as the notification destination.  From the Datadog menu, go to Integrations and select the Integrations tab.

In the search box, enter “Webhooks” to find the Webhooks integration.

Click “Available” to enable the Webhooks integration in your Datadog account. (Once enabled, it will appear in the “Installed” column.)

Click on “Configure” to open the Webhooks integration configuration page.

In the "Webhooks" section at the bottom of the page, click "New +" to create a new Webhooks notification destination. For the parameters, enter the following:

Name : The name of the integration (any name)

URL : https://api.appkeeper.sios.com/v2/integration/{{ AWS account ID }}/actions/recover

Payload :

{
  "instanceId": "{{ EC2 Instance ID }}",
  "name": "nginx"
}

Custom Headers: Check the box and enter the following:

{
  "Content-type": "application/json",
  "accept": "application/json",
  "appkeeper-integration-token": "{{ the external integration token obtained in Step One }}"
}

When you are done, press “Save.”
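For reference, the URL, payload, and headers above can be assembled programmatically. This is just a sketch for eyeballing the exact request Datadog will send: the account ID, instance ID, and token values are placeholders, and Datadog performs the actual POST when the webhook fires.

```python
import json

# Placeholders: substitute your AWS account ID, the EC2 instance to recover,
# and the integration token obtained in Step One.
AWS_ACCOUNT_ID = "123456789012"
INSTANCE_ID = "i-0123456789abcdef0"
INTEGRATION_TOKEN = "token-from-step-one"

url = f"https://api.appkeeper.sios.com/v2/integration/{AWS_ACCOUNT_ID}/actions/recover"
payload = json.dumps({"instanceId": INSTANCE_ID, "name": "nginx"})
headers = {
    "Content-type": "application/json",
    "accept": "application/json",
    "appkeeper-integration-token": INTEGRATION_TOKEN,
}
print(url)
```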

Step Eight: Connecting AppKeeper to the Synthetics test

Next, I had to configure AppKeeper (the registered Webhooks integration) to be called when an alert of the Synthetics monitoring occurs.

Open the test case that you set up in Step Five from UX Monitoring > Synthetic Tests in the menu.

Select “Edit test details” from the top-right gearbox and enter the following values in the “5. Notify Your Team” box to save the changes.

@webhook-{{ Name of Webhook integration in Datadog }}

Note: You can enable "renotify if the monitor has not been resolved" so that Datadog retries the notification if AppKeeper fails to recover the service on the first attempt. It is not required for testing purposes, but we recommend setting it to [10 minutes] (the minimum interval).

Setup is now complete.

Step Nine: Confirm the integration by running the test again

I then confirmed that AppKeeper would restore the webserver if Datadog detected it to be down.

Open the Synthetics monitoring test case you just set up from UX Monitoring > Synthetic Tests in Datadog.

Click “Resume Test” in the upper right corner and turn on the Synthetics monitoring.

Now Datadog will perform Synthetics monitoring at regular intervals.

The Test Results show that the server is successfully accessed.

Next, I created a pseudo-failure of the web server to test AppKeeper’s automated remediation.

Since it is difficult to cause a real failure, I stopped the service to create a situation in which the web page cannot be viewed. To do this, I connected via SSH to the EC2 instance where the Nginx server is installed and stopped Nginx.

sudo systemctl stop nginx

After a short wait, Datadog detected that the web server is no longer accessible.

The Synthetic Tests page in Datadog also shows that the test case has failed.

If the test case fails, Datadog will notify AppKeeper that the Synthetics monitoring has failed.

When AppKeeper receives the notification, it will automatically attempt to restart Nginx.

So, if you wait a little while, you see that Datadog’s Synthetics monitoring check will pass again.

Also, if you log in to your AppKeeper dashboard, you’ll see that the recovery has been performed.

—

In this exercise I used a web server (Nginx) as an example to automate the process of detecting a failure with Datadog and restoring the service with AppKeeper.

Similar automation could be achieved by integrating Datadog with EventBridge and Lambda or by creating custom scripts.

However, if you frequently add target instances or restart a wide variety of services, the cost and complexity of maintaining EventBridge and Lambda or scripts will increase.

AppKeeper’s proven integration with Datadog and the ease with which you can add target instances to your application makes it easy to add automation to your DevOps environment to reduce your downtime.

If you are currently using Datadog and would like to try out AppKeeper's Restart API, please first sign up for our 14-day free trial (you can purchase a subscription once the trial ends). Then request a free evaluation token; we'll walk you through the process and help you get started.

Apply for an evaluation token

Thank you.  I hope you will take this opportunity to learn more about SIOS AppKeeper, which provides automatic monitoring and recovery of applications running on EC2.

—  Tatsuya Hirao on the SIOS Technology technical team.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Datadog, SIOS Appkeeper
