SIOS SANless clusters

SIOS Technology Helps Telcos Navigate the Complexities of High Availability

November 30, 2022 by Jason Aw

In today’s connected world, high availability is not only essential — it is considered a lifeline of society and the economy. No internet means no cloud. Just a few minutes of downtime can cost companies millions of dollars and damage customer relationships. While cloud computing, edge computing, and IoT present many opportunities, they are also challenging to navigate.

In this ongoing series about high availability (HA) and disaster recovery (DR) for various industries, Harry Aujla, EMEA Pre-Sales Director of SIOS Technology Corp., discusses the importance of high availability in the telecom sector. He points out the challenges telcos are currently facing with high availability and how SIOS Technology is helping.

Check out the whole interview above to learn more.

Key highlights from this video interview:

  • The telecom sector is unique because it has a direct impact on most aspects of our professional and personal lives. We live in an interconnected society, so even if critical communication services are down for just a few minutes, it can cause havoc and lead to millions of dollars in lost revenue.
  • Telecom is constantly innovating and evolving with technologies such as 5G rollouts and cloud computing. Maintaining uptime and high availability remains critical for both customer-facing and backend systems.
  • Aujla discusses how software-defined systems and decentralization are solving some of the challenges telcos face, particularly when considering agility and reliability.
  • Continuity is essential in the telco market, particularly since telcos deal with infrastructure-critical systems, which are often protected with a fault tolerance solution. Aujla tells us that telcos therefore often have an SLA of 99.99% availability (roughly 52 minutes of downtime per year).
  • Software-defined systems have enabled telcos to embrace digital transformation into the cloud, and to leverage the cost savings and flexibility that come with cloud computing.
  • For VoIP systems, the cloud platform provider will only provide high availability services to a certain level within the solution stack, and the responsibility of maintaining SLAs within the cloud falls on VoIP providers and customers. Aujla explains how this is addressed by high availability tools.
  • For telcos, billing systems are considered critical. If customers cannot view or pay their bills, it will cause disruption for all parties involved, potentially affecting revenue and customer satisfaction. SIOS aims to deliver application availability and 99.99% SLAs.
  • IoT is a key use case of cloud computing, with data-capturing devices constantly feeding back into a central brain. Aujla explains why it makes sense to leverage the flexibility and agility cloud platforms offer.
  • IoT endpoints typically do not need high availability themselves; the data they collect does. However, the volume of data being collected today presents numerous challenges for the underlying infrastructure in terms of servers, storage, and networking. Aujla explains how this is driving cloud consumption and edge computing.
  • Cloud providers are geared towards availability of the infrastructure itself rather than monitoring what is happening inside the cloud instance where the IoT-centric applications could be running. For this reason, Aujla feels that it is up to customers to take responsibility for their application availability in the cloud.
  • SIOS can cluster and monitor applications and databases in the cloud and, should it detect a failure, fail over to the next available node in the cluster. Aujla explains how this improves the overall high availability SLA for protected applications.
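
To make the availability figures in these highlights concrete, the short sketch below converts an availability percentage into a yearly downtime budget. It is plain arithmetic, not a SIOS tool; the function name and the 365.25-day average year are assumptions chosen for illustration.

```python
# Minutes in an average year (365.25 days, an assumption that accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare three common targets: three, four, and five nines.
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% works out to roughly 52.6 minutes per year, while a budget of about five minutes per year corresponds to 99.999%.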

Solutions

  • Learn more about SIOS solutions for data protection 
  • Check out SIOS DataKeeper Cluster Edition
  • Check out SIOS LifeKeeper for Linux

Connect with Harry Aujla (LinkedIn)

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

Create A Disaster Recovery Plan For Critical Applications

November 26, 2022 by Jason Aw

Business-critical applications can be anything from your database servers to your SAP infrastructure. These are the applications your business relies on, and organizations can’t afford to have them down. In this episode, we sat down with Dave Bermingham, Director of Customer Success at SIOS Technology, to discuss how to build a concrete disaster recovery strategy for business-critical applications.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Cloud, disaster recovery

The simple days of HA & DR are gone

November 20, 2022 by Jason Aw

Flipping through the TV channels, I stumbled on the scene in the movie “He’s Just Not That Into You” in which Drew Barrymore says what most of us in 2022 are feeling about technology, and especially about high availability and disaster recovery:

“I miss the days when you had one phone number and one answering machine and that one answering machine had one cassette tape and that one cassette tape either had a message from a guy or it didn’t. And now you just have to go around checking all these different portals just to get rejected by seven different technologies. It’s exhausting.”

Sometimes, don’t you wish there was only one cloud, or maybe even no cloud platform; one DB running on one OS; and only a front-end application to worry about? But the world has changed; it is moving faster and becoming more complicated. Advances in technology, the fallout of mergers and acquisitions, and the increasing appetites and pace of our 24/7 society, with billions of consumers looking for the latest deal and the best experience, mean that the simple days are gone.

4 hard truths about your availability

  1. Your solution isn’t as simple as you think

Of course your enterprise environment isn’t simple.  You have legacy systems and applications, the kind that have been around almost since punch cards.  You have new systems, made for the new generation of applications and databases.  In addition you have solutions that were created a decade ago to bridge the gap or span the time between migrating from one platform to another, but despite your best efforts, these systems linger. Added to these challenges is a growing set of systems and IT resources from the merger and acquisition of Company U.  Delivering HA is not as simple as you think in the new era.

  2. Bad architecture is a bigger problem than you realize

As VP of Customer Experience, I have seen the damage caused by bad architecture. While deploying HA software can definitely help improve the availability of an application or database, HA software will never fully overcome incomplete requirements, poor networking, lack of redundant hardware, or other missing architectural components. Our team once worked with a customer to correct an undersized environment that left their system unstable during peak operating times. Because of their bad architecture, which included networking and hardware instability, their teams frequently found themselves scrambling to recover from avoidable downtime issues. To have a complete, sound, highly available, and resilient solution, you will need to deploy great software as part of a sound architecture.

  3. Your admins need more help than they’ll admit

Developing an enterprise-grade, resilient HA solution, built on a solid architecture with the ability to grow, is not a simple process. Designing and architecting for resilience and for application and data availability is not as easy as grabbing a box of cake mix off the shelf. Throw in an array of tools, processes from different teams, a mixture of SLAs, and the varieties of OS, applications, databases, and platforms, and you have a recipe for needing help. Recently, I interviewed a 20-year veteran working in an enterprise support environment. He described how many of his peers, and at times even he himself, have not been able to handle the weight of maintaining critical enterprise availability. Your admins need help not only when they have been up since 2am dealing with a catastrophic, multi-system, multi-application, nearly complete data center collapse, but also in the day-to-day hard work of enterprise availability in one of the most technologically complex eras ever.

  4. Your solution may not be as highly available as you think

“While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware.” There are many other reasons for application downtime that aren’t covered by cloud provider SLAs including:

  • Software issues and bugs
  • Human errors
  • Software failure
  • System or application hangs

As VP of Customer Experience, I have seen a thing or two, including a denial of service attack caused by a failed exit in a recursion routine, system exhaustion, security software quarantining healthy, critical applications, kernel panics, and virtual machines that randomly reboot. If your HA strategy relies solely on the SLAs of your hypervisor, your solution may not be as highly available as you think. You need to protect critical applications with clustering software that can monitor and detect issues, respond to problems reliably, and, if necessary, move operations to a standby server to ensure that your products and services remain reliable and available when and where they are needed.

Our single data center has become a series of cloud platforms, spanning dozens of data centers. Our skunkworks application has become part of the bevy of critical front-end, middleware, and backend solutions that we must manage across Windows, Linux, and a few different *Nix varieties. The march of technology means that our high availability has become more complex and requires better architecture. It also means that our teams need more help to manage it all, and if we aren’t careful, it could mean that we remain vulnerable and exposed. Which of the four truths is your team facing most?

Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

Explaining the Subtle but Critical Difference Between Switchover, Failover, and Recovery

November 9, 2022 by Jason Aw

High availability is a speciality and, like most specialities, it has its own vocabulary and terminology. Our customers are typically very knowledgeable about IT, but if they haven’t been working in an HA environment, some of our common HA terminology can cause a fair amount of confusion, both for them and for us. These terms sound simple but have very specific meanings in the context of HA. Three of them are discussed here: switchover, failover, and recovery.

What is a Switchover?

A switchover is a user-initiated action via the high availability (HA) clustering solution user interface or CLI. In a switchover, the user manually initiates the action to change the source or primary server for the protected application. In a typical switchover scenario, all running applications and dependencies are stopped in an orderly fashion, beginning with the parent application and concluding when all of the child/dependencies are stopped. Once the applications and their dependencies are stopped, they are then restarted in an orderly fashion on the newly designated primary or source server.

For example, suppose you have resources Alpha, Beta, and Gamma. Resource Alpha depends on resources Beta and Gamma. Resource Beta depends on resource Gamma. In a switchover event, resource Alpha is stopped first, followed by Beta, and then finally Gamma. Once all three are stopped, the switchover continues to bring the resources into an operational state on the intended server. The process starts with resource Gamma, followed by Beta, and then finally the start up operations complete for resource Alpha.
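
The ordering in this example can be sketched with a small dependency graph. This is only an illustration using Python's standard `graphlib` module (Python 3.9 or later), not how any particular HA product implements it; the function names and the node labels passed to `switchover_plan` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources it depends on (from the example above).
DEPENDS_ON = {
    "Alpha": {"Beta", "Gamma"},
    "Beta": {"Gamma"},
    "Gamma": set(),
}


def start_order(depends_on):
    """Dependencies first: a resource starts only after everything it needs."""
    return list(TopologicalSorter(depends_on).static_order())


def stop_order(depends_on):
    """Reverse of the start order: parents stop before their dependencies."""
    return list(reversed(start_order(depends_on)))


def switchover_plan(depends_on, source, target):
    """Orderly stop on the current primary, then orderly start on the new one."""
    steps = [f"stop {name} on {source}" for name in stop_order(depends_on)]
    steps += [f"start {name} on {target}" for name in start_order(depends_on)]
    return steps


if __name__ == "__main__":
    for step in switchover_plan(DEPENDS_ON, "Node A", "Node B"):
        print(step)
```

Deriving both orders from one graph keeps the stop and start sequences consistent when a dependency is added or removed.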

Traditionally, a switchover operation requires more time, as resources must be stopped in a graceful and orderly manner. A switchover is often performed to update software versions while maintaining uptime, to perform maintenance work (via rolling upgrades) on the primary production node, or to do DR testing.

Key Takeaway: If there was no failure to cause the action, then it was a switchover.

What is a Failover?

A failover operation is typically a non-user initiated action in response to a server crash or unexpected/unplanned reboot. Consider the scenario of an HA cluster with two nodes, Node A and Node B. In this scenario, all critical applications Alpha, Beta, and Gamma are started and operational on Node A. A failover is what takes place when Node A experiences an unexpected/unplanned reboot, power-off, halt, or panic.

Once the HA software detects that Node A is no longer functioning and operationally available within the cluster (as defined by the solution), it will trigger a failover operation to restore access to the critical applications, resources, services, and dependencies on the available cluster node, Node B in this case. In a failover scenario, because Node A has experienced a crash (or other simulated immediate failure), there are no processes to stop on Node A. Consequently, once proper detection and fencing actions have been processed, Node B will immediately begin the process of restoring resources. As in the switchover case, the process starts with resource Gamma, followed by Beta, and then finally the start up operations complete for resource Alpha.

Traditionally, a failover operation requires less time than a switchover. This is because the processing of a failover does not require any resources to be stopped (or quiesced) on the previous primary (in-service or active) node.

Key Takeaway: A failover occurs in response to a system failure.
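
As a rough sketch of the sequence described above (detect the failure, fence the failed node, then start resources on the surviving node), consider the following. The heartbeat timestamp, the five-second timeout, and all function names are assumptions for illustration; real HA software defines its own detection and fencing mechanisms.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    last_heartbeat: float  # seconds on a shared monotonic clock (hypothetical)


def is_failed(node: NodeStatus, now: float, timeout: float = 5.0) -> bool:
    """Treat a node as failed once its heartbeat is older than the timeout."""
    return now - node.last_heartbeat > timeout


def failover_plan(failed_node: str, standby_node: str, start_order: list) -> list:
    """No stop steps: the failed node is fenced, then resources start on the standby."""
    steps = [f"fence {failed_node}"]
    steps += [f"start {name} on {standby_node}" for name in start_order]
    return steps


if __name__ == "__main__":
    # Dependencies first, as in the switchover case.
    start_order = ["Gamma", "Beta", "Alpha"]
    node_a = NodeStatus("Node A", last_heartbeat=100.0)
    if is_failed(node_a, now=110.0):
        for step in failover_plan(node_a.name, "Node B", start_order):
            print(step)
```

Compared with a switchover, the plan has no stop phase on the old primary, which is why a failover typically completes faster.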

What is Recovery?

A recovery event is easy to confuse with a failover. A recovery event occurs when a process, server, communication path, disk, or even cluster resource fails and the high availability software operates in response to the identified failure. Most HA software solutions are capable of multiple ways of handling a recovery event. The most prominent methods include:

  1. Graceful restart locally, then a graceful restart on the remote
    1. A restart is always attempted locally. If recovery is successful, no further action occurs.
    2. If the local restart fails, resources are gracefully moved to the remote node.
  2. Graceful restart locally, then a forced restart on the remote
    1. A restart is always attempted locally. If recovery is successful, no further action occurs.
    2. If the local restart fails, resources are moved to the remote node by fencing the primary node.
  3. Forced restart on the remote
    1. A restart is never attempted locally.
    2. Resources are always forced to the next available cluster node, as described in the second step of method 2.
  4. Forced server restart, no remote failover
    1. A restart is always attempted locally.
    2. If the local restart fails, the primary node is restarted to attempt to recover services.
    3. Resources will not fail over to a remote system.
  5. Policy based local restart, then remote
    1. Policies may govern the number of local retries before a remote recovery is attempted.

Due to the number of variations in recovery policy, it is easy to see a recovery event that resembles the behavior of a switchover. This is often the case in methods 1 and 5. In these scenarios, applications and services are gracefully stopped in an orderly fashion before being started on the remote node. With methods 2 and 3, customers will often see behavior similar to a failover, because the primary server is restarted or fenced by the HA software. Method 4 is rarely used and is a hybrid of a switchover and a failover. It begins with a graceful stop of the applications and services, followed by a restart of the applications and services (much like a switchover). However, if the local restart of the applications and services fails, the system will be restarted (much like a failover), but without actually failing over to the remote cluster node. When it is used, Method 4 is typically invoked in cases where an unbalanced cluster is present, or together with a policy based methodology.
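
The five methods can be summarized as a small decision function. This is a simplified sketch of the list above, not any vendor's implementation: the enum names, the action strings, and the `max_local_retries` parameter are hypothetical, and a failed local restart is modeled as a single yes/no outcome.

```python
from enum import Enum


class Method(Enum):
    GRACEFUL_LOCAL_THEN_GRACEFUL_REMOTE = 1
    GRACEFUL_LOCAL_THEN_FORCED_REMOTE = 2
    FORCED_REMOTE = 3
    FORCED_SERVER_RESTART = 4
    POLICY_LOCAL_THEN_REMOTE = 5


def recovery_actions(method, local_restart_succeeds, max_local_retries=3):
    """Return the ordered actions the HA software would take for one failure."""
    if method is Method.FORCED_REMOTE:
        # Method 3 never tries locally; it looks like a failover.
        return ["fence primary node", "start resources on remote node"]
    if local_restart_succeeds:
        return ["restart resources locally"]
    # Only the policy based method retries locally more than once in this sketch.
    attempts = max_local_retries if method is Method.POLICY_LOCAL_THEN_REMOTE else 1
    actions = ["local restart failed"] * attempts
    if method in (Method.GRACEFUL_LOCAL_THEN_GRACEFUL_REMOTE,
                  Method.POLICY_LOCAL_THEN_REMOTE):
        # Methods 1 and 5 resemble a switchover: graceful stop, then remote start.
        actions += ["stop resources gracefully on primary node",
                    "start resources on remote node"]
    elif method is Method.GRACEFUL_LOCAL_THEN_FORCED_REMOTE:
        # Method 2 resembles a failover once the local restart has failed.
        actions += ["fence primary node", "start resources on remote node"]
    else:
        # Method 4 restarts the server and never moves to the remote node.
        actions += ["restart primary node"]
    return actions


if __name__ == "__main__":
    for method in Method:
        print(method.name, "->", recovery_actions(method, local_restart_succeeds=False))
```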

Key Takeaway: A recovery event depends on the method chosen.

HA terminology is an area where common terms can take on different meanings from vendor to vendor. As you deploy and maintain your cluster solution with enterprise applications, be sure that you understand your solution provider’s terms for failover, switchover, and recovery. And, while you are at it, make sure you know whether the restaurant will put the sauce on the side (in a saucer) or on the side (of your mashed potatoes).

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, failover clustering, High Availability

Webinar: Disaster Recovery for SQL Server on Public Clouds

September 20, 2022 by Jason Aw

Register for the On-Demand Webinar

Running your SQL Server instances on any of the major public cloud platforms requires solid strategies for disaster recovery and high availability. Learn how to plan for disaster recovery and high availability and how to decide what’s best for your environment.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, SQL Server, Webinar
