SIOS SANless clusters

How to Understand & Respond to Availability Alerts

January 29, 2021 by Jason Aw

Houston We Have a Problem (or How to Understand & Respond to Availability Alerts)

A Successful Failure

“Houston, we have a problem!”  It is an iconic line that reminds countless space buffs and movie fans of the great difficulty, potential disaster, and perilous state of the Apollo 13 space mission – a mission NASA now calls “A Successful Failure.”  Ignoring your own application availability alerts may not go down in history as a defining moment, but it can wreak similar havoc.

Now back to 1970:

“A routine stir of an oxygen tank ignited damaged wire insulation inside it, causing an explosion that vented the contents of both of the Service Module’s (SM) oxygen tanks to space. Without oxygen, needed for breathing and for generating electric power, the SM’s propulsion and life support systems could not operate. The Command Module’s (CM) systems had to be shut down to conserve its remaining resources for reentry, forcing the crew to transfer to the Lunar Module (LM) as a lifeboat. With the lunar landing canceled, mission controllers worked to bring the crew home alive.”

An explosion of oxygen tanks triggered alarms, warnings, pressure and voltage drops, interrupted communications, and then the now famous radio communication between the astronauts and Mission Control.  But what if, after the explosion, the crew did nothing? What if they never checked on the explosion, never responded to the warnings and gauges, and never informed Mission Control that there was an issue?  What if Mission Control, after being notified or alerted back at their dashboard in the control center, never attempted to provide any assistance?  What if the team buried their heads in the sand, or resigned themselves to fate and chance, and never tried to learn, improvise, or improve from the failure they encountered?  The result would have been tragic!  The story might have made a documentary, but hardly a blockbuster movie featuring an iconic line.

What Do You Do When an Alert is Triggered in Your Environment?

Space walks are a far cry from our own day to day activities, unless of course you work for NASA, but recent blogs on Apollo 13 do spark a question applicable to availability.  What do you do when there is an alert triggered in your environment? Do you just ignore it?  Do you downplay it, waiting to see if the alerts, log messages, or other indicators will just go away?  Do you contact your vendor support to understand how you can disable these alerts, warnings, and messages?  Or do you say, “We have a problem here and we need to work it out”?

As VP of Customer Experience at SIOS Technology Corp., I have seen both sides of alerts and indicators.  We have painstakingly walked with customers who chose to ignore warnings, turning off critical alerts that indicated issues ranging from application thresholds to network instability to potential data inconsistency.  And we have also seen customers who tuned into their alerts, investigated why their alarms were going off, uncovered the root cause, and enjoyed the fruit of their labor.  This fruit is most often the sweet reward of improved stability, innovation and learning, or an averted disaster.

4 things you can do when your availability product triggers an alert

1. Determine the type and criticality of the availability alert.

Is the alert indicative of a warning, an error, or a critical issue? A good way for you and your team to understand criticality is to consult the available documentation. Check the product documentation, online forums, knowledge base articles (KBA), and internal team data and process manuals.

2. Assess the immediacy of the alert. 

For warnings and errors, how likely are they to progress into a critical issue or event?  For critical issues and alerts, the immediacy may be obvious, but an assessment, even of critical events, will provide some guidance on your next steps: self-correction, issue isolation, or immediate escalation.

3. Consult additional sources. 

What other sources can you access to make a determination about the alert condition? For example, if the alert is storage related, are there other tools that can expose the health of your storage?  If the issue is a network alert, are there hypervisor tools, traffic tools, NIC statistics, or other specialized monitoring tools deployed to help with analysis?

4. Contact support.

In other words, if you are unsure, alert Mission Control. After determining the type, assessing the immediacy, and consulting additional sources, it is a good idea to contact your vendor for support.  A warning about a threshold for API calls may seem innocent. But if the API calls will fail once such a limit is reached, this could be cause for immediate action. Getting the authority of the specialist can be helpful in keeping peace of mind and avoiding disaster.

An experienced vendor like SIOS can help you quickly identify the causes of problems and recommend the best solution.
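
The four steps above can be sketched as a small triage routine. This is only an illustration of the decision flow, not a feature of any SIOS product: the `Alert` fields, the `Severity` levels, and the returned recommendations are all hypothetical names chosen for this example, and real rules would come from your product documentation and runbooks.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Step 1: the type and criticality of the alert."""
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


@dataclass
class Alert:
    source: str                      # e.g. "storage", "network" (hypothetical labels)
    severity: Severity
    likely_to_escalate: bool         # step 2: could this become critical soon?
    confirmed_by_other_tools: bool   # step 3: do additional sources agree?


def next_step(alert: Alert) -> str:
    """Suggest a next step for an alert using illustrative rules only."""
    if alert.severity is Severity.CRITICAL:
        # Step 4: when in doubt, alert Mission Control.
        return "escalate to vendor support now"
    if alert.likely_to_escalate or alert.confirmed_by_other_tools:
        return "isolate the issue and open a support case"
    return "monitor and document"


if __name__ == "__main__":
    api_limit = Alert("api", Severity.WARNING, likely_to_escalate=True,
                      confirmed_by_other_tools=False)
    print(next_step(api_limit))
```

The API threshold warning from step 4 is the interesting case: it is only a warning, but because it is likely to progress into failures, the sketch routes it to support rather than leaving it to be ignored.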

Repeatedly ignoring problems in your availability environment can lead to unexpected but no less devastating results. Addressing the problems indicated by alerts, log messages, warning indicators, or other installed and configured indicators gives your customers, your business, your teams, and yourself the “opportunity to solve the problems” before they become a disaster. At the same time, it strengthens your availability strategy and infrastructure.  Which will you choose?

–  Cassius Rhue, VP, Customer Experience

Reproduced from SIOS

 

Filed Under: Clustering Simplified Tagged With: Application availability, application monitoring, disaster recovery, High Availability, high availability - SAP, SQL Server High Availability

Should I Still Use Zabbix In AWS?

January 16, 2021 by Jason Aw

Amazon EC2 monitoring

For mission-critical applications, ERPs, and databases, such as SQL Server, SAP, HANA, and Oracle, your application monitoring needs are best served by clustering software like SIOS Protection Suite that monitors the full application stack (on-premises or in the cloud). If it detects an application issue, it orchestrates the failover of application operation to a standby node automatically.

However, for applications that don’t require high availability clustering, Zabbix has a high market share as an integrated OSS monitoring tool.  Although it has been most widely used in on-premises environments, there are many examples of Zabbix being used in AWS environments.  Given that AWS has its own monitoring services, such as Amazon CloudWatch, why should you use Zabbix?  This section explains the benefits of using Zabbix to monitor EC2 instances and other instances, as well as the configuration process.

Why use Zabbix instead of Amazon CloudWatch?

In an AWS environment, all of the infrastructure is operated by AWS, but you must be responsible for the operation of the Amazon EC2 instances themselves and the applications built on Amazon EC2. In other words, you must monitor the applications to ensure that they are operating properly, and you must take action when a problem occurs.  For non-mission-critical applications, Zabbix is a good candidate for this kind of monitoring tool.

Zabbix has the advantage of being able to monitor not only on-premises, but also cloud and virtual environments in an integrated manner.

Whereas the standard Amazon CloudWatch is limited to monitoring AWS resources (CPU, memory, etc.), Zabbix allows you to monitor even the state of your applications in detail. The following is a list of other advantages of Zabbix.

Integrated monitoring of environments with multiple AWS accounts

Amazon CloudWatch performs monitoring on a per-AWS-account basis.  Zabbix can monitor an environment that spans multiple AWS accounts, so business systems built across several accounts can be monitored together.  It can also detect anomalies not only with simple threshold-based alerts, but also with multiple thresholds and conditions in combination.
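
To illustrate what "multiple thresholds and conditions in combination" means, here is a minimal sketch in plain Python. It is not Zabbix trigger-expression syntax, and the metric names and threshold values are assumptions chosen for the example.

```python
def compound_trigger(cpu_percent: float, queue_depth: int, error_rate: float) -> bool:
    """Fire only when a combination of conditions holds (illustrative thresholds)."""
    # High CPU alone is not a problem; high CPU with a growing backlog is.
    overloaded = cpu_percent > 90 and queue_depth > 100
    # A high error rate is a problem regardless of load.
    failing = error_rate > 0.05
    return overloaded or failing


if __name__ == "__main__":
    print(compound_trigger(cpu_percent=95, queue_depth=10, error_rate=0.0))
    print(compound_trigger(cpu_percent=95, queue_depth=500, error_rate=0.0))
```

Combining conditions this way is one approach to cutting down false alarms: a busy but healthy server stays quiet, while a busy server that is falling behind raises an alert.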

Detailed notifications can be configured to suit actual operating conditions

Amazon CloudWatch can notify you with a message in the event of an anomaly.  But if your system is down for planned maintenance, for example, you don’t need that message.  Zabbix lets you configure such cases so that unwanted messages are suppressed.  This way you can ensure that you are only notified when something is really wrong and needs to be addressed.

No retention limit for metrics (monitoring log)
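
The idea of suppressing notifications during maintenance can be sketched as follows. This is a generic illustration rather than Zabbix's own maintenance feature: the function name and the representation of a window as a (start, end) pair are assumptions made for the example.

```python
from datetime import datetime
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a planned maintenance period


def should_notify(problem_time: datetime, maintenance_windows: Iterable[Window]) -> bool:
    """Return False when the problem falls inside any planned maintenance window."""
    return not any(start <= problem_time < end for start, end in maintenance_windows)


if __name__ == "__main__":
    windows = [(datetime(2021, 1, 16, 2, 0), datetime(2021, 1, 16, 4, 0))]
    print(should_notify(datetime(2021, 1, 16, 3, 0), windows))   # during maintenance
    print(should_notify(datetime(2021, 1, 16, 9, 0), windows))   # normal operation
```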

With Amazon CloudWatch, metrics can be stored for up to 15 months.  Moreover, you can only store metrics in hourly increments for 15 months, and if the monitoring interval is set to less than 60 seconds, you can only store them for a maximum of 3 hours.  Zabbix allows for long-term storage of metrics without changing the granularity of information.
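
The retention tiers can be expressed as a small lookup. The 3-hour and 15-month figures come from the text above; the intermediate tiers (15 days for 1-minute data and 63 days for 5-minute data) are taken from the AWS CloudWatch documentation as I understand it and should be checked against the current documentation before you rely on them.

```python
from datetime import timedelta


def cloudwatch_retention(period_seconds: int) -> timedelta:
    """Approximate how long CloudWatch keeps data points of a given period.

    Tiers are assumptions based on published AWS documentation; verify them
    against the current docs.
    """
    if period_seconds < 60:
        return timedelta(hours=3)     # high-resolution metrics
    if period_seconds < 300:
        return timedelta(days=15)     # 1-minute data points
    if period_seconds < 3600:
        return timedelta(days=63)     # 5-minute data points
    return timedelta(days=455)        # 1-hour data points, about 15 months


if __name__ == "__main__":
    for period in (10, 60, 300, 3600):
        print(period, cloudwatch_retention(period))
```

In other words, the finer the granularity, the sooner the detail is lost, which is the limitation the paragraph above contrasts with Zabbix.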

How to monitor AWS environment with Zabbix

If you want to use Zabbix in AWS, you will need to create an Amazon EC2 instance and a DB instance and install Zabbix.  After installation, the process of configuring Zabbix is basically the same as on-premises.  You will need to set up the following:

  1. User account (in addition to the Admin user of Zabbix, you will need to create a user for production use)
  2. Zabbix host agent (determines where the data is collected from)
  3. Items (setting what data to collect)
  4. Triggers (defining which states of the data are abnormal)
  5. Actions (defining the actions to be taken when an error occurs)

In addition, you can configure AWS-specific settings, such as creating a user in AWS IAM with the necessary permissions for Zabbix, which will allow Zabbix to monitor applications and other aspects of your AWS environment.
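
How hosts, items, triggers, and actions relate to one another can be sketched with a tiny data model. This is a conceptual illustration only and does not use the Zabbix API or its configuration format; every class, field, and item key below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trigger:
    name: str
    is_abnormal: Callable[[Dict[str, float]], bool]  # which state counts as abnormal
    action: Callable[[str], str]                      # what to do when it fires


@dataclass
class Host:
    name: str                                         # where the data is collected from
    items: Dict[str, Callable[[], float]]             # what data to collect
    triggers: List[Trigger] = field(default_factory=list)

    def evaluate(self) -> List[str]:
        """Collect every item once, then run the action of each firing trigger."""
        values = {key: collect() for key, collect in self.items.items()}
        return [t.action(self.name) for t in self.triggers if t.is_abnormal(values)]


if __name__ == "__main__":
    web = Host(
        name="web-01",
        items={"nginx.procs": lambda: 0.0},  # pretend the process count is zero
        triggers=[Trigger("nginx down",
                          lambda v: v["nginx.procs"] < 1,
                          lambda host: f"restart nginx on {host}")],
    )
    print(web.evaluate())
```

Keeping the three concerns separate is what lets the same collected item feed several triggers, and the same trigger drive different actions in different environments.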

Use the right tool for your monitoring needs

Not every corporate system operates in isolation; many systems are linked together to exchange data and ensure consistency as a whole.  In these environments, Zabbix is a great tool for monitoring and detecting anomalies across multiple servers and systems.  For example, if a DB-based web application has an anomaly on the web application server, it is possible to disable the data.

On the other hand, Zabbix has a lot of configuration options, so you have to design the operation carefully: what to monitor and how, which conditions are abnormal, and what to do about them. Of course, for critical systems such a design is essential. However, for relatively simple systems, such as “if a process stops, just restart it,” there is no match for Zabbix monitoring.

For mission-critical applications, SIOS Protection Suite includes application recovery kits that provide application-specific monitoring of the entire application environment, server, storage and network as well as failover orchestration according to application-specific best practices on Amazon EC2.

Don’t trust your application availability and monitoring to just anyone.  Get in touch with the availability experts at SIOS to see how we can help you.

Reproduced from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, application monitoring, Zabbix

5 Signs That It Will Take More Than A Blog Post To Fix Your High Availability

December 8, 2020 by Jason Aw

The signs are there. The warning lights are flashing.  In your gut, you can sense it. Maybe you can’t sleep.  Your problems with high availability are deep. But, maybe you are not quite sure.

1. If you think your cloud SLA is all you need for high availability

Cloud solutions have provided great advancements in hardware availability and resilience. However, application high availability requires more than just selecting the right hypervisor or cloud provider. Your strategy for high availability cannot stop with the SLA provided by the cloud or a virtualization provider. As quoted by Wired, “The almost four-day Amazon outage of April 2011 did not breach Amazon’s EC2 SLA, which as a FAQ explains, ‘guarantees 99.95% availability of the service within a Region over a trailing 365 period.’” In this DZone article, our own David Bermingham breaks down the differences between cloud SLAs and application availability in detail. If you want a highly available infrastructure, it must include monitoring, recovery, and resilience at the data and application layers as well.

2. If you are just using the high availability clustering that came with your open source operating system

Chances are you didn’t select your database based on what was bundled with the OS, so why would you select your HA solution based on that criterion alone? Bundled tools go a long way in providing extra assurance, possibilities, and capabilities. However, despite the ease of access, bundled tools and OS clustering software are not always capable of meeting your SLA, RPO, RTO, and availability requirements. If your enterprise has a combination of operating systems, your team will likely need help navigating different tools and understanding how they integrate together. It’s kind of like choosing the hedge clippers and push reel mower left on the curb to shape “Azalea,” the par-5 13th hole at Augusta. Both tools are designed to cut, but how much time do you have? How are you going to handle the complexity? Which would you trust? Your strategy for high availability requires more than just considering the conveniences of what is bundled with the OS; otherwise, you’d be running MySQL instead of SAP HANA.

3. If you think that enterprise application licensing, such as SQL Enterprise or Oracle Enterprise, is the same thing as enterprise high availability

In addition to increased cost, many enterprise application licenses also increase the ability of the application to recover in some high availability scenarios. However, it is highly unlikely that your entire enterprise is based on a single application. Your high availability is going to require more than just a highly available database solution. You’ll need an enterprise grade application monitoring and recovery solution with a breadth of support for all of your applications and databases. In addition, you’ll need the ability to manage and replicate not just database data, but critical application and configuration data as well. Availability for a single database or a simple application is one thing – but HA for a complex, multipart application and supporting database is very different. More services, more parts that need to be coordinated, more complex architecture to orchestrate, more specific best practices to adhere to before, during and after failover/switchover. More than what your enterprise license paid for.

4. If your downtime is growing and your uptime is shrinking

The pace of life is ever increasing in many fields. When was the last time your team recovered from backup, manually restarted the applications that were deemed critical, or restarted a set of failed virtual machines or nodes? The pace of your outage events cannot continue to outpace sustainability, or your team’s ability to move beyond firefighting to fire prevention and fire proofing. “You can only run so hard so long” (Carey Nieuwhof). Some of you have been firefighting for too long, and your outages are becoming more common than your uptime.

5. If your first failover test was on the production server

A recent client remarked that it is simply impossible to test for every possible disaster scenario. As new software is created, deployed, updated, and patched, the challenges in higher availability are increasing. But your live production data is not the place to find out what does not play well together. And while Go-Live and Post-Go-Live will always have their share of surprises, the inability to actually fail over and run on the backup node should not be one of them.
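
To put the 99.95% figure from sign 1 in perspective, the nominal downtime budget behind an availability percentage is simple arithmetic. The function below is a sketch of that calculation only; its name and parameters are chosen for this example.

```python
def allowed_downtime_hours(sla_percent: float, period_days: int = 365) -> float:
    """Hours of downtime permitted by an availability percentage over a period."""
    return (1 - sla_percent / 100) * period_days * 24


if __name__ == "__main__":
    # 99.95% over a trailing 365 days leaves a budget of roughly 4.4 hours.
    print(round(allowed_downtime_hours(99.95), 2))
```

What counts as "unavailable" is defined by the provider's SLA terms, not by whether your application was reachable. That is how an outage your users experienced for days may still not count as a breach, as the Wired quote describes, and why the SLA alone is not an application availability strategy.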

Scouring blogs can provide you with helpful tips and insights to define, redefine, and improve your higher availability. But, if the warning signs are going off that you’ve traded true availability for some semblance of ‘just enough’, then it will take more than a blog post, or scouring every blog post in the availability world for that matter, to fix your HA.

– Cassius Rhue, Vice President, Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Amazon AWS, Application availability, application monitoring, High Availability, high availability - SAP, SQL Server High Availability

9 Signs You Have an Application Availability Problem

November 27, 2020 by Jason Aw

You’ve heard the saying “recognizing a problem is the first step in solving it.”  But, many small, medium, and surprisingly, even large enterprise businesses aren’t aware that their application availability isn’t what it should be.

Read on for these nine signs that you still have an application availability problem:

1. You spend more time restarting an application than using it

Application crashes may be a fact of life, but if your application is down more often than it is up, that is a problem.

2. You’ve started to snooze through the alert storm in your inbox or control center

You have deployed alerts for application or server downtime, but the alert storm has so overwhelmed your inbox that you have silenced them all.

3. You have one data center for all your critical operations

A single data center for operations may sound convenient, but one well intended but misdirected construction crew has been known to turn single data centers into costly unavailability zones.

4. Your idea of data protection involves backup retrieval and archives

Your data protection strategy is critical. Data replication technology and site to site, region to region replication has become a mainstay, so if your replication or data protection strategy is non-existent or involves a lengthy jog to the vault this could be a big problem.

5. Your recovery procedures always require manual intervention

Manual intervention itself is not a problem. Some events are so difficult and complex that some amount of manual effort could be required.  But, if manual intervention is always the first, second and third order of business after a server or application outage, that is a problem.

6. Your RTO is measured in days not hours or minutes

How are you measuring your recovery time objective (RTO)? Do you measure your RTO in days or hours instead of minutes per month?  True, every business has a tolerance level for their RTO.  However, your RTO should not be a function of server rebuilds and gross instabilities in your architecture.

7. You don’t know your RPO because your standby is never reliably in sync

You’ve checked the box on reliable monitoring and recovery of your application, and taken it a step further to provide a standby cluster ready system.  Great job.  But, before I let you off the hook, what is your recovery point objective (RPO)? An RPO should be something more accurate than “somewhere between day 0 and last night.”

8. Single points of failure don’t just exist, they are the norm

Where are your single points of failure?  Your budget may not allow you to eliminate every single point of failure, but if you can identify a single point of failure in every major category and every critical component of your enterprise…

9. Your last disaster made local, regional, or national news 

If the last major storm, grid failure, or failure event put a blight on your business due to downtime, then higher availability is the next order of business.
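
Signs 6 and 7 both come down to measuring what you actually achieve against a stated objective. A minimal sketch of that measurement follows; it assumes you can obtain three timestamps (when the outage began, when service was restored, and when the standby last received a replicated write), and the function names are illustrative.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Recovery time: how long the application was unavailable."""
    return service_restored - outage_start


def achieved_rpo(last_replicated_write: datetime, outage_start: datetime) -> timedelta:
    """Recovery point: how much recent data the standby is missing."""
    return max(outage_start - last_replicated_write, timedelta(0))


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    return achieved <= objective


if __name__ == "__main__":
    outage = datetime(2020, 11, 27, 12, 0, 0)
    rto = achieved_rto(outage, datetime(2020, 11, 27, 12, 4, 0))
    rpo = achieved_rpo(datetime(2020, 11, 27, 11, 59, 30), outage)
    print(rto, meets_objective(rto, timedelta(minutes=5)))
    print(rpo, meets_objective(rpo, timedelta(minutes=1)))
```

If you cannot fill in the `last_replicated_write` timestamp for your standby, that is sign 7 in practice: the RPO is unknown rather than merely large.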

Downtime costs your business in terms of customers, productivity, and peace of mind.  Unaddressed risks have a definite impact on your business and reputation.  If these warning signs are there, you may have an availability problem.  And if you ignore them, you’ll likely have even bigger problems soon thereafter, hence the importance of application availability.

— Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Amazon AWS, Application availability, application monitoring, High Availability, high availability - SAP, SQL Server High Availability

Reducing downtime for WordPress sites hosted on Amazon EC2

October 19, 2020 by Jason Aw

 

 

Going from ignorance to bliss with SIOS AppKeeper

WordPress is an open-source content management system (CMS) used by millions of companies to create websites, blogs, or apps.  According to estimates, there are over 75 million websites today that use WordPress and many companies are beginning to host their WordPress instances on Amazon EC2. Users love WordPress for its flexibility and the ease with which you can create and modify layouts.  If you are using WordPress for your website, then you are in good company.

With so many users relying on WordPress to power their websites, you can imagine that there is a rich set of third-party tools (plugins and services) designed to meet the needs of those users.  Some of these plugins add security functionality, such as scanners that probe for vulnerabilities, because more plugins can lead to more vulnerabilities.

Trust, but verify.  Why monitoring WordPress uptime matters.

Deploying a website or application running on WordPress without monitoring it properly would be like leaving your car running outside with the keys in it.  You’ll want to protect your investment.  For companies managing WordPress sites (or any applications, for that matter), there are three primary reasons to monitor:

  1. To understand the visitors and optimize their experience;
  2. To monitor the speed of the site and ensure that it meets expected service level agreements (SLAs); and
  3. To ensure that you maximize uptime.  Downtime can mean (serious) lost revenue for any e-commerce sites running on WordPress.

You believe your WordPress site is working properly, but you really want to know what is going on.  The goal of monitoring should be to know quickly what is going on and why, allowing you to respond quickly to any issues.

There is a wide range of tools available to help WordPress users monitor their sites.  Some are very focused on WordPress, such as ManageWP and JetPack, while others are industry-standard solutions that apply to many different CMSs and applications.  Some go “deep” and are focused on one element of monitoring, such as Google Analytics and its focus on visitor analytics, while others try to go “broad” and address all three key aspects of monitoring.  What you decide to use depends on your budget, your requirements, and your technical capabilities.

Here at SIOS, we believe that the best-of-breed approach makes sense.  We focus on monitoring applications and ensuring that our customers experience as little downtime as possible with those applications.  Many of our customers are using SIOS AppKeeper today to monitor and protect their WordPress sites running on Amazon EC2.

SIOS AppKeeper – simple but powerful monitoring and automated remediation for WordPress sites

Many WordPress monitoring solutions (from free plugins to low-cost freemium services) will tell you when your WordPress site is down.  And depending on the sophistication (and cost) of your monitoring solution, it may tell you why your WordPress site is down.  But will it help you reduce downtime and automatically restart your services or reboot your instances when downtime is experienced?

Many companies host their WordPress sites on Amazon EC2 using either Apache or NGINX webservers.  SIOS AppKeeper is a SaaS service that can be configured to automatically discover WordPress sites or applications running on Amazon EC2 instances and their services, and then automatically take any number of actions if and when downtime is experienced.  So instead of only getting alerts that something is wrong, you get notified that something happened and was automatically addressed.

Downtime matters.  If you are running an e-commerce site using WordPress, then downtime will result in lost revenue.  How much revenue?  Simply divide your annual revenues by 365 days and 24 hours (Annual revenue/365/24) to understand your revenue per hour.  In 2013 Google experienced a 5-minute outage that cost them $545,000 in revenue. Now, you may not be Google, but you certainly do want to eliminate downtime wherever possible.
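
The revenue-per-hour formula above translates directly into a few lines of Python. The revenue figure in the example is made up purely to keep the arithmetic easy to follow, and the calculation assumes revenue is spread evenly across the year.

```python
def revenue_per_hour(annual_revenue: float) -> float:
    """Annual revenue / 365 / 24, as in the formula above."""
    return annual_revenue / 365 / 24


def downtime_cost(annual_revenue: float, downtime_minutes: float) -> float:
    """Estimated revenue lost during an outage of the given length."""
    return revenue_per_hour(annual_revenue) * downtime_minutes / 60


if __name__ == "__main__":
    annual = 8_760_000  # hypothetical annual revenue in dollars
    print(revenue_per_hour(annual))      # revenue earned per hour
    print(downtime_cost(annual, 30))     # cost of a 30-minute outage
```

For a site with seasonal or time-of-day peaks, an outage at the wrong moment costs more than this average suggests, so treat the result as a floor rather than a forecast.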

Now imagine what happens when you receive an alert that your WordPress site is down.  Are you ready to respond immediately?  Do you know what should be addressed to get your WordPress site back up and running?  According to our customer research, the average customer using only three Amazon EC2 instances experiences downtime at least once a month.

SIOS AppKeeper monitors Amazon EC2 and alerts you to any downtime AND takes action to remediate the situation, by either restarting your Amazon EC2 services or rebooting your instances.

AppKeeper addresses over 85% of our customers’ Amazon EC2 downtime issues automatically.  This means that you get notified that a failure was identified and addressed, without you having to drop everything or lose any significant revenue.
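
The general pattern described here (try the least disruptive fix first, then escalate) can be sketched as below. This is not AppKeeper's implementation or API; the three callables are placeholders you would supply yourself, for example a health check against your site and wrappers around your own restart and reboot tooling.

```python
from typing import Callable


def remediate(is_healthy: Callable[[], bool],
              restart_service: Callable[[], None],
              reboot_instance: Callable[[], None]) -> str:
    """Escalate from a service restart to an instance reboot, then to a human."""
    if is_healthy():
        return "healthy"
    restart_service()                 # least disruptive action first
    if is_healthy():
        return "recovered by service restart"
    reboot_instance()                 # heavier action only if the restart failed
    if is_healthy():
        return "recovered by instance reboot"
    return "escalate to a human"


if __name__ == "__main__":
    checks = iter([False, True])      # unhealthy at first, healthy after the restart
    print(remediate(lambda: next(checks),
                    lambda: print("restarting web server"),
                    lambda: print("rebooting instance")))
```

Returning a description of what happened, instead of only fixing the problem silently, matches the point made above: you still get notified, but the notification says the failure was already addressed.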

Today hundreds of companies rely on AppKeeper to keep their cloud environments running. We invite you to check out the video below to see how easy it is to install and use AppKeeper.

Video: Installing AppKeeper and recovering from AWS EC2 failures Demo

And if you like what you see, please feel free to sign up for a free 14-day trial of AppKeeper. AppKeeper starts at only US$40 per instance per month.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Amazon EC2, AppKeeper, Application availability, application monitoring
