Using Datadog for Amazon EC2 Monitoring? Pair with SIOS AppKeeper for Automated Remediation

December 11, 2020 by Jason Aw

Have you ever thought to yourself, “It would be nice if Datadog could monitor our Amazon EC2 services and automatically restart them when it detects a failure?” I thought the same thing, and decided to try it out for myself.

SIOS AppKeeper automatically monitors Amazon EC2 instances for failures and automatically restarts services or even reboots instances when failures are detected. I thought to myself, “What if we combined the monitoring capabilities of Datadog with AppKeeper’s automated remediation capabilities?”

It worked, and here is how I did it.

If you are already using Datadog and are interested in trying this out for yourself, please sign up at the end of this article for access to our API.

Here are the steps I took to set up AppKeeper to receive alerts from Datadog and restart the webserver on Amazon EC2 when downtime is detected.

To run this experiment successfully, we already had a Datadog account, an AppKeeper account and an NGINX web server running on Amazon EC2 (using Amazon Linux 2).

How to integrate Datadog with AppKeeper to provide automated remediation

Step One: Get the Restart API Token from AppKeeper

Request the API Token for the Datadog integration from this form:

https://mk.sios.jp/BC_AppKeeper_Datadog_api_application

If you request it from the form, the token will be sent to the email address you provide.

Step Two: Create the tenant in AppKeeper

The next step was to register the AWS account to which the monitored instance belongs in AppKeeper. (AppKeeper refers to the registered AWS accounts as “tenants.”)

https://sioscoati.zendesk.com/hc/en-us/articles/900000123406-Quick-Start-Guide#h_39404cfb-4a76-450f-99c2-e197cc63e50d

Step Three: Create IAM Role in AWS

I then created an IAM Role in AWS (you need this to set up your AppKeeper account). Here are instructions if you are unfamiliar with this process.

Step Four: Add the tenant in AppKeeper

The next step was to add the “tenant” in AppKeeper (AppKeeper considers an AWS account a “tenant”). Here is a link to detailed instructions on doing this.

Step Five: Set up the Synthetics Test in Datadog

I then needed to configure Datadog’s external (synthetic) monitoring for the Nginx server (EC2 instance) that we wanted to monitor. Here’s how to do that:

Open the Datadog dashboard and select UX Monitoring > Synthetic Tests from the menu.

Click the [New Test] button in the upper right corner and select [New API Test] to create an external monitoring test case.

Enter the following information in the form to create the test case.

1. Choose Request Type
Select “HTTP”.

2. Define Request
Set the following values.
URL : GET http://{{ EC2 IP address }}
Name : AppKeeper Datadog Integration Test (any name)
Locations : Tokyo

3. Specify test frequency
No Change

4. Define assertion
Click on “New Assertion” and set the following values.
When : [status code] [is] [200]

5. Define Alert Condition
No Change

6. Notify Your Team
No Change
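For readers who want to see what this test checks, here is a minimal Python sketch of the same idea: send one GET request and pass only if the status code is 200. It is an illustration, not how Datadog implements Synthetic Tests; the function names `fetch_status` and `assertion_passes` are made up for this example.

```python
import urllib.error
import urllib.request
from typing import Optional

EXPECTED_STATUS = 200  # mirrors the assertion "[status code] [is] [200]"


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send one GET request; return the HTTP status code, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar problems.
        return None


def assertion_passes(status: Optional[int]) -> bool:
    """The check passes only when the status code is exactly 200."""
    return status == EXPECTED_STATUS
```

Calling `assertion_passes(fetch_status("http://<EC2 IP address>"))` from any machine that can reach the instance gives a rough local equivalent of a single test run.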

Step Six: Run the Synthetics test in Datadog

Once the above inputs are completed, press “Create Test” to create the test case for external monitoring.

The results are visible and we can see that the webserver is working properly in the “Test Results” section.

That was all that had to be done to configure Synthetics monitoring using Datadog.

Step Seven: Set AppKeeper to receive Synthetics alerts

Next I had to set AppKeeper as the notification destination. From the Datadog menu, go to Integrations and select the Integrations tab.

In the search box, enter “Webhooks” to find the Webhooks integration.

Click “Available” to enable the Webhooks integration in your Datadog account. (Once enabled, it will appear in the “Installed” column.)

Click on “Configure” to open the Webhooks integration configuration page.

In the “Webhooks” column at the bottom of the page, click “New +” to create a new Webhooks notification destination. For the parameters, enter the following

Name : The name of the integration (any name)

URL : https://api.appkeeper.sios.com/v2/integration/{{ AWS account ID }}/actions/recover

Payload :

{
"instanceId": "{{ EC2 Instance ID }}",
"name": "nginx"
}

Custom Headers: Check the box and enter the following

{
"Content-type": "application/json",
"accept": "application/json",
"appkeeper-integration-token": "{{ API token obtained in Step One }}"
}

When you are done, press “Save.”
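To make the webhook settings concrete, the sketch below builds the same request in Python using the URL pattern, payload fields and headers listed above. It assumes the webhook is delivered as an HTTP POST, and the helper name `build_recover_request` and all argument values are placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request

# URL pattern taken from the Webhooks configuration above.
RECOVER_URL = "https://api.appkeeper.sios.com/v2/integration/{account_id}/actions/recover"


def build_recover_request(account_id: str, instance_id: str,
                          service_name: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the recovery request described in this step."""
    payload = {"instanceId": instance_id, "name": service_name}
    headers = {
        "Content-type": "application/json",
        "accept": "application/json",
        "appkeeper-integration-token": token,
    }
    return urllib.request.Request(
        RECOVER_URL.format(account_id=account_id),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",  # assumption: the webhook is sent as a POST
    )


# Placeholder values for illustration only.
request = build_recover_request("123456789012", "i-0123456789abcdef0", "nginx", "example-token")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Printing the request this way is a quick check that the account ID, instance ID and token end up where the configuration expects them before you save the webhook.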

Step Eight: Connecting AppKeeper to the Synthetics test

Next, I had to configure AppKeeper (the registered Webhooks integration) to be called when an alert of the Synthetics monitoring occurs.

Open the test case that you set up in Step Five from UX Monitoring > Synthetic Tests in the menu.

Click the gear icon in the top-right corner, select “Edit test details,” enter the following value in the “Notify Your Team” box, and save the changes.

@webhook-{{ Name of Webhook integration in Datadog }}

Note: You can also set “renotify if the monitor has not been resolved” so that the recovery is retried if AppKeeper’s first attempt fails. It is not required for testing purposes, but we recommend setting it to [10 minutes] (the minimum interval).

Setup is now complete.

Step Nine: Confirm the integration by running the test again

I then confirmed that AppKeeper would restore the webserver if Datadog detected it to be down.

Open the Synthetics monitoring test case you just set up from UX Monitoring > Synthetic Tests in Datadog.

Click “Resume Test” in the upper right corner and turn on the Synthetics monitoring.

Now Datadog will perform Synthetics monitoring at regular intervals.

The Test Results show that the server is successfully accessed.

Next, I created a pseudo-failure of the web server to test AppKeeper’s automated remediation.

Since it is difficult to cause a real failure, I stopped the service and created a situation in which you cannot view the web page. To do this I connected to the EC2 instance where the Nginx server is installed using SSH and stopped Nginx.

sudo systemctl stop nginx

After a short wait, Datadog detected that the web server is no longer accessible.

The Synthetic Tests page in Datadog also shows that the test case has failed.

If the test case fails, Datadog will notify AppKeeper that the Synthetics monitoring has failed.

When AppKeeper receives the notification, it will automatically attempt to restart Nginx.

So, if you wait a little while, you will see Datadog’s Synthetics monitoring check pass again.

Also, if you log in to your AppKeeper dashboard, you’ll see that the recovery has been performed.
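If you would like to measure how long the recovery takes rather than watch the dashboards, a small polling loop is enough. The sketch below is one way to do it in Python. The name `wait_for_recovery` and the default interval and timeout are assumptions, and the `is_up` callable is whatever health check you choose (for example, an HTTP GET that expects status 200).

```python
import time
from typing import Callable, Optional


def wait_for_recovery(is_up: Callable[[], bool],
                      interval: float = 5.0,
                      timeout: float = 600.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> Optional[float]:
    """Poll is_up() until it returns True.

    Returns the seconds elapsed until the service was seen up again,
    or None if it did not come back within `timeout` seconds.
    """
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Start the loop right after running `sudo systemctl stop nginx`. The returned value approximates the time from the failure to the restart, including Datadog’s detection delay. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.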

—

In this exercise I used a web server (Nginx) as an example to automate the process of detecting a failure with Datadog and restoring the service with AppKeeper.

Similar automation could be achieved by integrating Datadog with EventBridge and Lambda or by creating custom scripts.

However, if you frequently add target instances or restart a wide variety of services, the cost and complexity of maintaining EventBridge and Lambda or scripts will increase.
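To give a sense of what that do-it-yourself route involves, here is a rough Python sketch of a Lambda handler that restarts a service through AWS Systems Manager Run Command. It is not part of the setup in this article and rests on several assumptions: the event shape (`detail.instanceId`, `detail.name`) is hypothetical, the instance must be managed by Systems Manager, and the function’s IAM role must allow `ssm:SendCommand`.

```python
# Services this handler is willing to restart; every new service means another edit here.
ALLOWED_SERVICES = {"nginx"}


def lambda_handler(event, context, ssm_client=None):
    """Restart one systemd service on one EC2 instance via SSM Run Command."""
    # Hypothetical event shape: {"detail": {"instanceId": "...", "name": "nginx"}}
    detail = event["detail"]
    instance_id = detail["instanceId"]
    service = detail["name"]

    # Refuse anything unexpected so arbitrary text never reaches the shell command.
    if service not in ALLOWED_SERVICES:
        raise ValueError(f"service not allowed: {service!r}")

    if ssm_client is None:
        import boto3  # imported lazily; provided by the AWS Lambda Python runtime
        ssm_client = boto3.client("ssm")

    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"systemctl restart {service}"]},
    )
    return {"commandId": response["Command"]["CommandId"]}
```

Even this small handler needs an allow-list, IAM permissions and an event mapping that you keep up to date yourself, which is the maintenance burden described above.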

AppKeeper’s proven integration with Datadog and the ease with which you can add target instances make it easy to add automation to your DevOps environment and reduce your downtime.

If you are currently using Datadog and would like to try out AppKeeper’s Restart API, please first sign up for our 14-day free trial here (you can purchase a subscription once you have installed the free trial). Then click here to request an evaluation token. We’ll walk you through the process and provide you with a free evaluation token to help you get started.

Apply for an evaluation token

Thank you. I hope you will take this opportunity to learn more about SIOS AppKeeper, which provides automatic monitoring and recovery of applications running on EC2.

— Tatsuya Hirao on the SIOS Technology technical team.

Reproduced with permission from SIOS

5 Signs That It Will Take More Than A Blog Post To Fix Your High Availability

December 8, 2020 by Jason Aw

The signs are there. The warning lights are flashing. In your gut, you can sense it. Maybe you can’t sleep. Your problems with high availability are deep. But, maybe you are not quite sure.

1. If you think your cloud SLA is all you need for high availability

Cloud solutions have provided great advancements in increased hardware availability and resilience. However, application high availability requires more than just selecting the right hypervisor or cloud provider. Your strategy for high availability cannot stop with the SLA provided by the cloud or a virtualization provider. As quoted by Wired, “The almost four-day Amazon outage of April 2011 did not breach Amazon’s EC2 SLA, which as a FAQ explains, ‘guarantees 99.95% availability of the service within a Region over a trailing 365 period.’” In this DZone article, our own David Bermingham breaks down the differences between cloud SLAs and application availability in detail. If you want a highly available infrastructure, it must include monitoring, recovery, and resilience at the data and application layers as well.
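To put the 99.95% figure quoted above in perspective, a quick calculation converts an availability percentage into the downtime it permits. This Python sketch is plain arithmetic and says nothing about how any provider defines or measures availability; the function name is made up for the example.

```python
def allowed_downtime_hours(availability_percent: float, period_days: int = 365) -> float:
    """Hours of downtime permitted by an availability percentage over a period."""
    return (100.0 - availability_percent) / 100.0 * period_days * 24


print(round(allowed_downtime_hours(99.95), 2))  # 4.38
```

An almost four-day outage is roughly 96 hours, far more than 4.38 hours. That gap suggests an SLA’s scope and definitions, not just its headline percentage, decide what counts as a breach.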

2. If you are just using the high availability clustering that came with your open source operating system

If so, consider this: you probably didn’t select your database based on what was bundled with the OS, so why would you select your HA solution on that criterion alone? Bundled tools go a long way in providing extra assurance, possibilities, and capabilities. However, despite the ease of access, bundled tools and OS clustering software are not always capable of meeting your SLA, RPO, RTO, and availability requirements. If your enterprise runs a combination of operating systems, your team will likely need help navigating different tools and understanding how they integrate. It’s kind of like choosing the hedge clippers and push reel mower left on the curb to shape “Azalea,” the par-5 13th hole at Augusta. Both tools are designed to cut, but how much time do you have? How are you going to handle the complexity? Which would you trust? Your strategy for high availability requires more than just considering the convenience of what is bundled with the OS; otherwise, you’d be running MySQL instead of SAP HANA.

3. If you think that enterprise application licensing, such as SQL Enterprise or Oracle Enterprise, is the same thing as enterprise high availability

In addition to increased cost, many enterprise application licenses also increase the ability of the application to recover in some high availability scenarios. However, it is highly unlikely that your entire enterprise is based on a single application. Your high availability is going to require more than just a highly available database solution. You’ll need an enterprise grade application monitoring and recovery solution with a breadth of support for all of your applications and databases. In addition, you’ll need the ability to manage and replicate not just database data, but critical application and configuration data as well. Availability for a single database or a simple application is one thing – but HA for a complex, multipart application and supporting database is very different. More services, more parts that need to be coordinated, more complex architecture to orchestrate, more specific best practices to adhere to before, during and after failover/switchover. More than what your enterprise license paid for.

4. If your downtime is growing and your uptime is shrinking

The pace of life is ever increasing in many fields. When was the last time your team recovered from backup, manually restarted the applications that were deemed critical, or restarted a set of failed virtual machines or nodes? The pace of your outage events cannot continue to outpace sustainability, or your team’s ability to move beyond firefighting to fire prevention and fireproofing. As Carey Nieuwhof puts it, “You can only run so hard so long.” Some of you have been firefighting for too long, and your outages are becoming more common than your uptime.

5. If your first failover test was on the production server

A recent client remarked that it is simply impossible to test for every possible disaster scenario. As new software is created, deployed, updated, and patched, the challenges in higher availability are increasing. But your live production data is not the place to find out what does not play well together. And while Go-Live and Post-Go-Live will always have their share of surprises, the inability to actually fail over and run on the backup node should not be one of them.

Scouring blogs can provide you with helpful tips and insights to define, redefine, and improve your higher availability. But, if the warning signs are going off that you’ve traded true availability for some semblance of ‘just enough’, then it will take more than a blog post, or scouring every blog post in the availability world for that matter, to fix your HA.

– Cassius Rhue, Vice President, Customer Experience

Reproduced with permission from SIOS

9 Signs You Have an Application Availability Problem

November 27, 2020 by Jason Aw

You’ve heard the saying “recognizing a problem is the first step in solving it.” But, many small, medium, and surprisingly, even large enterprise businesses aren’t aware that their application availability isn’t what it should be.

Read on for these nine signs that you still have an application availability problem:

1. You spend more time restarting an application than using it

Application crashes may be a fact of life, but if your application is down more often than it is up, that is a problem.

2. You’ve started to snooze through the alert storm in your inbox or control center

You have deployed alerts for application or server downtime, but the alert storm has so overwhelmed your inbox that you have silenced them all.

3. You have one data center for all your critical operations

A single data center for operations may sound convenient, but one well intended but misdirected construction crew has been known to turn single data centers into costly unavailability zones.

4. Your idea of data protection involves backup retrieval and archives

Your data protection strategy is critical. Data replication technology and site-to-site, region-to-region replication have become mainstays, so if your replication or data protection strategy is non-existent or involves a lengthy jog to the vault, this could be a big problem.

5. Your recovery procedures always require manual intervention

Manual intervention itself is not a problem. Some events are so difficult and complex that some amount of manual effort could be required. But, if manual intervention is always the first, second and third order of business after a server or application outage, that is a problem.

6. Your RTO is measured in days not hours or minutes

How are you measuring your recovery time objective (RTO)? Do you measure it in days rather than hours or minutes? True, every business has its own tolerance level for RTO. However, your RTO should not be a function of server rebuilds and gross instabilities in your architecture.

7. You don’t know your RPO because your standby is never reliably in sync

You’ve checked the box on reliable monitoring and recovery of your application, and taken it a step further to provide a standby cluster ready system. Great job. But, before I let you off the hook, what is your recovery point objective (RPO)? An RPO should be something more accurate than “somewhere between day 0 and last night.”
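One way to replace guesses with numbers is to record three timestamps for every outage or failover test: when the failure happened, when the standby last received data, and when service was restored. The Python sketch below derives the achieved recovery point and recovery time from those timestamps and compares them with your objectives. The function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time: datetime, last_replicated: datetime) -> timedelta:
    """Achieved recovery point: how much recent data the standby was missing."""
    return failure_time - last_replicated


def recovery_time(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Achieved recovery time: how long the application was unavailable."""
    return service_restored - failure_time


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured value is within the stated objective."""
    return actual <= objective


# Hypothetical timestamps from a single failover test.
failure = datetime(2020, 11, 27, 3, 0, 0)
last_sync = datetime(2020, 11, 27, 2, 59, 30)
restored = datetime(2020, 11, 27, 3, 4, 0)

print(data_loss_window(failure, last_sync))  # 0:00:30
print(recovery_time(failure, restored))      # 0:04:00
print(meets_objective(recovery_time(failure, restored), timedelta(minutes=5)))  # True
```

If you cannot fill in the `last_sync` value for your own environment, that is a strong hint that the standby is not reliably in sync.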

8. Single points of failure don’t just exist, they are the norm

Where are your single points of failure? Your budget may not allow you to eliminate every single point of failure, but if you can identify a single point of failure in every major category and every critical component of your enterprise…

9. Your last disaster made local, regional, or national news

If the last major storm, grid failure, or failure event put a blight on your business due to downtime, then higher availability is the next order of business.

Downtime costs your business in terms of customers, productivity, and peace of mind. Unaddressed risks have a definite impact on your business and reputation. If these warning signs are there, you may have an availability problem. And if you ignore them, you’ll likely have even bigger problems soon thereafter, hence the importance of application availability.

— Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

APM Automation – The Missing Ingredient For Application Performance Monitoring Solutions

November 12, 2020 by Jason Aw

Companies that move to the cloud to host their applications understand that while they have outsourced the hosting of their applications to third-party cloud vendors such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform, they still need to monitor and manage those applications themselves, usually with an Application Performance Monitoring (APM) solution. With yesterday’s client-server computing applications, IT departments had almost complete control over the servers, the networks, and the end-user computing environments. But today’s cloud environments are more complex, with many more moving parts often outside of your control.

Some companies have embarked on digital transformations, pushing customer interactions into critical, web-based applications. It is now more important than ever to quickly respond to any application performance and downtime issues via an APM automation solution.

How to select an APM solution

Many companies turn to Application Performance Monitoring (APM) solutions such as those from AppDynamics, Datadog, Dynatrace, or New Relic. An APM solution should identify any performance bottlenecks in your code, and help you fix those issues before your users are impacted.

Good APM solutions will let you know what happened, why, and how to prevent it from happening in the future. An APM solution will alert you when the application or systems being monitored meet a certain condition (load, response times, etc.). Once you receive an alert, you should be able to identify why the application is not performing properly. Armed with this information, you can provide your development team with detailed diagnostics that will allow them to address the issue and prevent it from happening again.

But how do you select the right APM solution? A quick search on Google for “cloud APM solutions” returns 5,830,000 results! That can be overwhelming to anyone unfamiliar with the space. Thankfully, another Google search will also provide you with a lot of advice and resources on how to select an APM solution that is right for you. You should look for third-party, non-vendor advice to help you frame your requirements and develop a short list of choices that meet those requirements. Gartner has been watching this category for a while and publishes its APM Magic Quadrant every year. It is a good resource for understanding how to evaluate APM solutions, and it gives a good overview of the top vendors.

Add automated remediation to your APM requirements list

Here at SIOS Technology Corporation, we are always working with customers who are migrating their applications to the cloud. They often want to know how to protect their applications from unnecessary downtime and ask us for our advice. The choice of how to protect their applications is a function of the criticality of those applications (more critical applications often require failover solutions, etc.). But we also help them understand why their applications might be vulnerable.

It used to be that backup and data protection was a separate function (one that was needed only if the APM solution identified downtime). But in today’s complex cloud environments we believe that organizations should look for a holistic approach when it comes to monitoring and managing their critical applications. If a traditional APM solution identifies when something happens and lets you diagnose why it happened, then why doesn’t it prevent unnecessary downtime where possible?

We believe that automation is the missing ingredient from most cloud APM solutions. Many of our customers tell us how they are being overwhelmed by receiving too many alerts from their APM solutions, each requiring them to stop and understand what happened and why. They quickly understand what to ignore and what to pay attention to (and good APM solutions help them do this through machine learning). And if and when their applications go down, the APM solution alerts them to the downtime and diagnoses why to help prevent it from happening again. But the APM solution won’t reduce their immediate downtime.

Save yourself from downtime. Talk to SIOS experts about how SIOS high availability clustering software can monitor the entire application stack – server, storage, network, and application layers – to ensure applications are operating. SIOS Protection Suite includes application recovery kits that monitor and, in the event of a failure, orchestrate failover according to application-specific best practices. SIOS clusters uniquely failover across cloud availability zones and regions for disaster recovery.

Reproduced with permission from SIOS

Six Reasons Not to Buy SIOS High Availability Software . . . If You Dare

October 25, 2020 by Jason Aw

You need SIOS Protection Suite (for Linux or Windows) or SIOS DataKeeper Cluster Edition for high availability protection for business critical applications.

UNLESS

1. You prefer free solutions only.

I get it. There are definitely times when I do the same thing when I need to learn a new skill, get a quick tip, drop a few pounds, or set up a quick demo. Rather than signing up for a subscription, purchasing a license, or investing in a combination of the two, I have gone the free route.

However, the saying often holds true: you get what you pay for. Free trials are fine. Permanently free high availability is like gas station sushi. Is the risk really worth it? Be sure that free doesn’t prevent you from using the full range of capabilities available for optimizing uptime and increasing availability. Make sure you aren’t passing over a reasonably priced high availability solution that is proven to protect your mission-critical applications.

2. Being a single-solution shop is more important than meeting your HA needs.

We were a “Ford tough” family for decades. Seriously. I understand what it is like to be a one solution shop. My dad owned a Ford truck for work, a Ford Mustang for leisure, a Ford 3600 tractor for the farm, and a Ford minivan for family travel. There was even a season where we received model toy cars with the brandished blue oval as well.

But when my wife and I were branching out on our own family needs, we broke away from the single solution to address needs that fell outside the Ford wheelhouse (at the time). You may be a single-solution buyer, but if your needs have changed and the HA provider or solution hasn’t kept up, consider whether expanding the solution set will eliminate risks, improve success, or be worth the investment in a complementary solution for those new needs. When we needed a reliable, gas-efficient, sleek, family-friendly, and economical solution for our family, we supplemented Ford tough with a Honda Odyssey. If you are a single-solution shop and you are not worried about vendor lock-in, best of luck.

3. You are more of a do it yourself-er coder.

You like coding. You like to write a lot of scripts, and don’t mind pulling out your bash, ksh, perl, python, powershell, batch or command tool kit and wiring things up yourself. You value the joy in flexibility and adding your own tweaks.

I love writing code as well, but there are times when the last thing I want to do is spend time writing a lot of code and scripts for a problem that is solved, proven, and off-the-shelf ready. For the do-it-yourself admin, off-the-shelf may not be your preference, but consider whether 20 years of expertise and experience should be rehashed and re-architected for your enterprise. And if you still need to get your code-writing fix in, SIOS provides the Generic Application Recovery Kit.

4. You need Ubuntu support (or Solaris).

Your environment is unique. You have customers who’ve cut their teeth on Solaris and are hanging on to it for dear life. Or you’ve got those who have fully embraced the Linux realm and have moved to Ubuntu. In either case, you look at the SIOS products matrix and Ubuntu isn’t currently a match for your SIOS version. Bummer!

While this is true, consider the rich and vast features and flavors of support that are still available. While there are parts of your enterprise that have dug in on Solaris and others that have raced to embrace Ubuntu and newer variants of Linux, it is more likely that you need a solution capable of supporting RHEL, OEL, SuSE, CentOS and possibly Windows as well. Be sure not to single out a high availability solution by what it doesn’t provide and consider the depth of what it does.

5. You don’t run a hybrid of anything in your environment.

I heard it in the middle of a movie last week. The lead character commented on the idea of moving forward with some new idea of an overly excited owner. The classic line: “Sometimes the juice isn’t worth the squeeze.” In your mind you feel that you aren’t running a hybrid environment. Your applications are critical, but not complex. The moving parts are simple- a database, front end and a supporting application. It makes sense that you might not want to “complicate” things with additional processes, products, solutions or services, and you may feel like the juice isn’t worth the squeeze.

Before you make that final decision on a high availability solution, assess whether a non-hybrid environment is the same as a simple environment. Consider whether the moving parts are as simple as you imagine, or whether a solution with failover orchestration would help reduce your overall RTO and improve your RPO.

6. Endorsements from HA experts and experience don’t matter.

I bought a set of headphones online in mid-April. As I suspected, I discovered that anyone can do bluetooth headphones. But, not everyone can do them well. Ergonomically, the “new to market” headphones are a nightmare. Pairing was a breeze, but accidental unpairing is a constant battle. The sound quality is amazing, but that amplifies my annoyance when the headphones randomly chirp – loudly and clearly – for system sounds or at the end of a song.

You may believe that high availability and application monitoring can be done by anyone and that experience doesn’t matter. However, consider your own experiences and mine and ask if you’d really want to trust your enterprise environment to a group that just started thinking about the complexities of hybrid environments, or the dependencies and application-centric knowledge needed for the applications you use most frequently.

When deciding the right High Availability Software for your environment, consider carefully whether you want to go without the many best in class features, hardened and tested solutions, knowledgeable experts, broad swath of supported applications and environments, and industry leading experience and decades of insight. Then after careful consideration, choose wisely.

-Cassius Rhue, Vice President, Customer Experience, SIOS

Reproduced with permission from SIOS