SIOS SANless clusters

SIOS SANless clusters High-availability Machine Learning monitoring

How APM Tools and High Availability Clusters Improve Network Resilience

April 13, 2026 by Jason Aw

Network resilience refers to a network’s ability to maintain connectivity and continue functioning even when disruptions occur. For organizations that rely heavily on technology, maintaining this resilience has become an operational necessity. A recent analysis by Siemens found that even a single hour of downtime can cost organizations millions of dollars. Downtime can interrupt production, breach service level agreements (SLAs), halt transactions, and generate significant expenses related to overtime, external consultants, incident investigations, and regulatory penalties.

In some industries, such as financial services, the consequences of weak network resilience can ripple far beyond a single organization. Global economies depend on financial institutions that operate stable and efficient IT systems capable of supporting trillions of dollars in transactions each year. Any perception that these systems are unreliable can affect entire markets. As a result, regulatory bodies such as the Basel Committee and the US Federal Reserve enforce strict standards around operational resilience. Similarly, organizations operating in sectors such as healthcare, telecommunications, and critical infrastructure must follow guidelines that ensure strong levels of network reliability and continuity.

Resilient Organizations Invest in Smart Infrastructure

IT environments, whether deployed on-premises, in the cloud, or across hybrid architectures, continue to grow in size and complexity. As a result, IT teams need tools that provide better visibility and enable smarter decision-making. Modern IT operations rely increasingly on data-driven insights and automation to support the work of IT professionals.

For this reason, forward-thinking organizations are investing in technologies that strengthen resilience and improve operational awareness. Two technologies that work particularly well together are application performance monitoring (APM) platforms and high availability (HA) clustering solutions.

APM tools play a key role by collecting and analyzing performance data across the IT environment. This data helps organizations better understand the health and behavior of their systems, allowing administrators to establish more accurate thresholds for alerts and automated responses. High availability clusters complement this capability by ensuring services can fail over to standby systems when disruptions occur. These clusters may rely on shared storage in traditional SAN-based environments or use software-based SANless clustering that replicates data between nodes.
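As a minimal sketch of how collected performance data can inform an alert threshold, the example below derives a latency threshold from a baseline of samples: a high percentile of normal behavior plus a safety margin. The sample values, the 95th percentile, and the 20% margin are illustrative assumptions, not settings from any particular APM product.

```python
import statistics

def derive_threshold(samples_ms, percentile=95, margin=0.20):
    """Alert threshold = chosen percentile of baseline latency, plus a safety margin.

    Hypothetical helper: the percentile and margin are illustrative assumptions.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two baseline samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99
    cut_points = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cut_points[percentile - 1] * (1 + margin)

# Hypothetical response-time baseline (ms), as an APM agent might collect it
baseline_ms = [120, 118, 125, 130, 122, 119, 121, 140, 135, 128]
threshold = derive_threshold(baseline_ms)
print(f"Alert when latency exceeds {threshold:.1f} ms")  # Alert when latency exceeds 165.3 ms
```

A percentile-based baseline tolerates occasional spikes better than a fixed number picked by intuition; teams would recompute it as the environment and its normal load change.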

Combining APM and HA for Greater Network Resilience

When APM tools and HA clusters are deployed together, organizations gain stronger capabilities for improving network resilience. Monitoring insights from APM platforms can inform automation and operational decisions, while HA clusters ensure workloads continue running even when failures occur.

This combination supports capabilities such as automated failover, predictive analytics, self-healing processes, and faster incident response. These capabilities help organizations maintain higher uptime and deliver consistent application performance.

In multi-cloud environments, this approach becomes even more valuable. If a cloud provider experiences an outage, services can fail over to an alternate cloud environment. Organizations can also distribute workloads across multiple clouds to eliminate single points of failure and improve overall system resilience.

As enterprises continue moving toward more autonomous IT operations, the data gathered by APM tools provides a detailed view of system performance and health. This information allows IT teams to define precise policies and operational thresholds, enabling confident and informed decision-making when issues arise.

Using Monitoring Data to Support Failover Decisions

Consider a scenario where an IT administrator must decide whether to initiate a failover to prevent a potential outage. The cost of manually initiating the failover may exceed $50,000 due to operational disruption and recovery procedures. However, waiting too long could result in a far more expensive failure.

Without clear data, decision-makers may hesitate to act. They may worry about triggering a costly intervention based on incomplete information or intuition alone. Reliable performance data helps eliminate this uncertainty by providing objective evidence that supports informed action.

With accurate monitoring insights, teams can determine whether system conditions truly justify failover. If intervention becomes necessary, they can confidently act with data-backed justification.

This is where the combination of APM tools and HA clustering becomes particularly valuable. Together, they help maintain service continuity when performance degradation, unexpected incidents, or large-scale disruptions threaten operations. APM monitoring provides visibility into the health of infrastructure components, allowing administrators to identify issues early and respond before downtime occurs. If failover becomes necessary, the decision is guided by clearly defined parameters based on the organization’s risk tolerance.
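One simple way to turn "clearly defined parameters based on the organization's risk tolerance" into a rule is an expected-cost comparison: fail over when the estimated probability of an outage, multiplied by what that outage would cost, exceeds the cost of failing over. The $50,000 failover cost comes from the scenario above; the outage cost, the probability estimates, and the `risk_weight` knob are hypothetical inputs an organization would derive from its own APM data and business impact analysis.

```python
def should_fail_over(p_outage, outage_cost, failover_cost, risk_weight=1.0):
    """Fail over when the expected loss from waiting exceeds the cost of failing over.

    risk_weight > 1.0 models a more risk-averse organization (an illustrative assumption).
    """
    if not 0.0 <= p_outage <= 1.0:
        raise ValueError("p_outage must be between 0 and 1")
    expected_loss = p_outage * outage_cost * risk_weight
    return expected_loss > failover_cost

# $50,000 failover cost from the scenario above; the other inputs are hypothetical estimates
print(should_fail_over(p_outage=0.10, outage_cost=2_000_000, failover_cost=50_000))  # True
print(should_fail_over(p_outage=0.01, outage_cost=2_000_000, failover_cost=50_000))  # False
print(should_fail_over(p_outage=0.01, outage_cost=2_000_000, failover_cost=50_000, risk_weight=3.0))  # True
```

The value of the rule is less in the arithmetic than in agreeing on it in advance, so that nobody has to argue the case from intuition during an incident.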

The Advantages of HA Clusters with APM

When HA clusters are integrated with an organization’s APM platform, mission-critical applications and services can fail over automatically with minimal disruption. Automated failover reduces the risk of delays or errors that can occur during manual recovery efforts and allows operations to continue while underlying issues are addressed.
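Automated failover still needs a guard against reacting to a single noisy sample. A common pattern, shown here as a minimal sketch, is to request failover only after a metric breaches its threshold for several consecutive checks. The class name, the threshold, and the three-check rule are hypothetical; a real deployment would hand the decision to the clustering software's own failover mechanism rather than act on it directly.

```python
class FailoverTrigger:
    """Hypothetical monitor that requests failover after N consecutive threshold breaches."""

    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required_breaches = required_breaches
        self.consecutive = 0

    def observe(self, value):
        """Record one APM sample; return True when failover should be requested."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a healthy sample resets the streak
        return self.consecutive >= self.required_breaches


trigger = FailoverTrigger(threshold=165.0)
samples = [150, 170, 160, 172, 180, 190]
decisions = [trigger.observe(s) for s in samples]
print(decisions)  # [False, False, False, False, False, True]
```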

Today, many organizations are adopting SANless clustering approaches. These solutions provide the same failover capabilities as traditional SAN-based clusters but without the cost and complexity of shared storage infrastructure. SANless clusters replicate data across nodes and operate efficiently in on-premises, cloud, or hybrid environments.

They also support geographically distributed deployments across multiple data centers or regions, which is essential for effective disaster recovery planning.

Whether an organization operates in a highly regulated industry or simply wants to strengthen its reliability and operational stability, combining APM monitoring with high availability clustering offers a practical and effective strategy. Together, these technologies provide a straightforward and cost-efficient way to improve uptime, strengthen resilience, and meet the growing expectations for reliable IT services.

Strengthen Network Resilience with High Availability Clustering

Keep your applications running even when failures occur. SIOS high availability clustering helps organizations maintain uptime, automate failover, and protect critical systems from downtime.

Request a demo to see how SIOS can help strengthen your network resilience.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

Ten Questions to Consider for Better High Availability Cluster Maintenance

April 24, 2023 by Jason Aw

Maintenance is a part of every company’s lifecycle. Every infrastructure is constantly changing, even one that is approaching end of life. Your team has likely had a lot of success doing what you’ve done in the past, but as systems become more complex, what you have deemed success in the past may need a refresh. Here are ten questions to improve cluster maintenance, maximize high availability, and minimize downtime.

How to Ensure High Availability During System Maintenance

  1. What are the best days for the business stakeholders?

Start with the days that are off-limits. Unlike unplanned downtime, these are known windows when multiple teams, systems, and interconnected resources are simply unavailable for planned activities. For example, one company is required to perform monthly system compliance checks and safety inspections. During this time, inspectors, auditors, and similar parties shut down business operations.

  2. What are the best dates for the team to schedule maintenance?

As VP of Customer Experience, I have worked with our team alongside a number of teams that have blackout dates for certain events and activities. Your team is likely responsible for more than one set of systems and servers, and reports to multiple teams with critical applications and infrastructure. Knowing which days are best for the team helps you avoid distractions, conflicts, and lost time due to known resource constraints.

  3. What dates and times coordinate best with partners, consultants, and non-company contractors?

Critical infrastructure typically involves many providers and vendors who are not part of the company’s own staff. These resources include OS, security, and HA vendors and consultants, as well as architects from the infrastructure providers and other partners. Knowing in advance which days work best for them, and which days are covered by your support tiers, is critical to proper scheduling and staffing.

With the rise of global teams, finding a time that works for all of these resources is another important question to answer. What is the best time for resources in EST, IST, EMEA, and other regions?
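One way to answer the time zone question is to list the hours that fall inside everyone's working day. The sketch below uses Python's standard `zoneinfo` module (which needs the system time zone database or the `tzdata` package). The zones are assumptions: America/New_York for EST, Asia/Kolkata for IST, and Europe/London as a stand-in for EMEA, which actually spans several zones. The 08:00 to 18:00 working day is also assumed, and the check is coarse: it looks only at the local time at the start of each UTC hour.

```python
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumed representative zones; EMEA spans several time zones, so Europe/London is a stand-in
TEAM_ZONES = {
    "EST": ZoneInfo("America/New_York"),
    "IST": ZoneInfo("Asia/Kolkata"),
    "EMEA": ZoneInfo("Europe/London"),
}
WORK_START, WORK_END = 8, 18  # assumed local working hours (08:00 to 18:00)

def shared_hours(day):
    """Return the UTC hours on `day` whose start falls inside working hours in every zone."""
    midnight_utc = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    slots = []
    for h in range(24):
        slot = midnight_utc + timedelta(hours=h)
        # astimezone() applies each zone's offset, including daylight saving time
        if all(WORK_START <= slot.astimezone(tz).hour < WORK_END for tz in TEAM_ZONES.values()):
            slots.append(slot)
    return slots

for slot in shared_hours(date(2023, 5, 10)):
    print(slot.strftime("%H:%M UTC"))
```

On the sample date this leaves a single shared hour, which is a useful warning in itself: a multi-hour maintenance window will almost certainly fall outside someone's working day, so plan on-call coverage accordingly.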

  4. What is the intended scope of the maintenance? What is the desired outcome of the maintenance activities? Think holistically.

Think beyond maintenance of the application itself to include the entire environment where it runs. Recently, a customer who was planning to upgrade their application decided to upgrade their OS at the same time. Unfortunately, this seemingly slight change in scope had larger-than-expected consequences: their application did not support the newly upgraded OS, and problems ensued. Be sure that the scope of the maintenance window is well defined and that the expected outcomes for that scope are detailed. It is not enough to say “the environment works.” Detail expected versions, behavior, and measurable outcomes wherever possible. See more about IT Resilience.

  5. What is the length of time for the maintenance window (anticipated, allowed)?

Ideally, we would all have unlimited time to perform maintenance, but with customers located around the world, there is little tolerance for planned downtime windows, even for critical tasks. As you plan for maintenance, how much downtime do you anticipate? Can you realistically stay within the maximum allowed window? If not, you will need to replan the maintenance events.
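A quick sanity check for this question is to add up the estimated duration of each task, reserve time for a possible rollback, and compare the total with the maximum allowed window. The task names, durations, rollback reserve, and two-hour limit below are hypothetical.

```python
from datetime import timedelta

def fits_window(tasks, rollback_reserve, allowed):
    """Return (fits, anticipated): whether planned work plus the rollback reserve fits
    in the allowed window, and the total anticipated downtime."""
    anticipated = sum((duration for _, duration in tasks), timedelta())
    return anticipated + rollback_reserve <= allowed, anticipated

# Hypothetical task estimates for one maintenance event
tasks = [
    ("Stop application services", timedelta(minutes=10)),
    ("Apply application upgrade", timedelta(minutes=45)),
    ("Validate versions and behavior", timedelta(minutes=30)),
    ("Restart and hand back to users", timedelta(minutes=10)),
]
fits, anticipated = fits_window(tasks, rollback_reserve=timedelta(minutes=60), allowed=timedelta(hours=2))
print(f"anticipated {anticipated}, fits with rollback reserve: {fits}")
# anticipated 1:35:00, fits with rollback reserve: False
```

Here the work itself fits in two hours, but not once time is held back for a rollback, which is exactly the situation that calls for replanning.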

  6. What’s the rollback plan?

While we hope nothing goes wrong, we are dealing with software, complex environments and configurations, and many moving pieces handled by numerous teams. A rollback plan, that is, a means of returning the systems to their pre-maintenance versions and settings, is essential. Make sure that if something goes wrong, you can roll back, for example from full backups or machine images. See more about disaster recovery.

  7. Who are the individual team members involved, and what are their roles and responsibilities? Are all the required roles and responsibilities clearly defined?

As VP of Customer Experience, I have seen our team take part in a maintenance activity that was unexpectedly delayed because key team members were missing. As you lay out your plan and architecture, be sure to identify the team members as well as the IT roles and responsibilities required. As Sr. Support Engineer Greg Tucker reminds customers, HA touches every layer of your environment, including storage, network, compute, OS, security, and policies.

  8. Where is the maintenance plan documented? When was the last time the plan was reviewed, updated, and tested?

Success is wonderful, but it can also make you complacent. After years of success, your process may no longer be well documented or actively followed. Answering these questions helps ensure that your team continues to succeed.

  9. What issues were resolved in test/QA prior to the production plans?

Kudos for continuing to test maintenance steps. Be sure that issues resolved in test environments are properly added to the production maintenance plans. The SIOS Customer Success team has seen customers perform QA tests, uncover false assumptions, and make the necessary corrections, but then fail to add those corrections to their production checklist.

  10. Who or what is missing from your plans?

Now that you’ve looked over the plans, timing, teams, roles, and architecture, one last question remains: who or what is missing? As a last step, review your plans and ask, “Who is missing from our plans?” Also ask, “What is missing from our plans?” As VP of Customer Experience, I have worked with our team to review activity plans for countless customers. One of the most memorable maintenance plan reviews uncovered a series of rollback steps that included restoring servers from cloned images and restoring data from backup. However, the image cloning and data backup steps were not included in the task list. They had been overlooked and assumed to have been done earlier in the process.
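The overlooked-backup story suggests a simple automated check: record which earlier tasks each rollback step depends on, then confirm that every prerequisite actually appears in the task list. This is a minimal sketch; the step names and the dictionary layout are hypothetical.

```python
def missing_prerequisites(task_list, rollback_steps):
    """Return {rollback step: [prerequisite tasks absent from the task list]}."""
    planned = set(task_list)
    gaps = {}
    for step, prerequisites in rollback_steps.items():
        absent = [p for p in prerequisites if p not in planned]
        if absent:
            gaps[step] = absent
    return gaps

# Hypothetical plan mirroring the review described above
task_list = ["Notify stakeholders", "Stop application", "Upgrade application", "Validate"]
rollback_steps = {
    "Restore servers from cloned images": ["Clone server images"],
    "Restore data from backup": ["Take full data backup"],
    "Restart application": ["Stop application"],
}
print(missing_prerequisites(task_list, rollback_steps))
# {'Restore servers from cloned images': ['Clone server images'], 'Restore data from backup': ['Take full data backup']}
```

An empty result does not prove the plan is complete, but a non-empty one points directly at the "what is missing" gap.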

System Maintenance is a Critical Element to Maintaining High Availability

System maintenance is a critical and necessary part of maintaining computer systems. The maintenance could be to correct errors, introduce new software functionality, or adapt a system to a new use case. When the systems in question are business-critical systems that are essential for the organization to maintain business continuity, having a well-thought-out plan is essential. Consider these ten questions, and others of your own, to make sure that your maintenance satisfies the needs of the business without unnecessary risk or delay.

Contact SIOS today for High Availability and Disaster Recovery Solutions.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Clustering Software, clusters, High Availability

White Paper: High Availability Clusters in VMware vSphere without Sacrificing Features or Flexibility

October 22, 2022 by Jason Aw

Six key facts you should know about high availability protection in VMware vSphere

Many large enterprises are moving important applications from traditional physical servers to virtualized environments, such as VMware vSphere, to take advantage of key benefits such as configuration flexibility, data and application mobility, and efficient use of IT resources. Realizing these benefits with business-critical applications, such as SQL Server or SAP, can pose several challenges.

This paper explains these challenges and highlights six key facts you should know about HA protection in VMware vSphere environments that can save you money.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

New Options for High Availability Clusters, SIOS Cements its Support for Microsoft Azure Shared Disk

September 24, 2022 by Jason Aw

Microsoft introduced Azure Shared Disk in Q1 of 2022. Shared Disk allows you to attach a managed disk to more than one host. Effectively, this means that Azure now has the equivalent of SAN storage, enabling highly available clusters to use shared disks in the cloud!

A major advantage of using Azure Shared Disk with a SIOS LifeKeeper cluster hierarchy is that you are no longer required to have either a storage quorum or a witness node to avoid so-called split-brain, which occurs when communication between nodes is lost and several nodes may change data simultaneously. Fewer nodes mean less cost and complexity.

SIOS has introduced an Application Recovery Kit (ARK) for our LifeKeeper for Linux product, called the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit, which allows Azure Shared Disks to be used in conjunction with SCSI-3 reservations. This ARK guarantees that a shared disk is writable only from the node that currently holds the SCSI-3 reservation on that disk.

When installing SIOS LifeKeeper, the installer detects that it’s running in Microsoft Azure and automatically installs the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit to enable support for Azure Shared Disk.

Resource creation within LifeKeeper is straightforward (Figure 1). Once locally mounted, the Azure Shared Disk is simply added into LifeKeeper as a file-system type resource. LifeKeeper assigns it an ID (Figure 2) and manages the SCSI-3 locking automatically.

Figure 1: Creation of /sapinst within LifeKeeper.

Figure 2: /sapinst created and extended to both cluster nodes.

SCSI-3 reservations guarantee that Azure Shared Disk is writable only on the node that holds the reservation (Figure 3). In a scenario where cluster nodes lose communication with each other, the standby server will come online, creating a potential split-brain situation. However, because of the SCSI-3 reservations, only one node can access the disk at a time, which prevents an actual split-brain scenario. Only one system will hold the reservation, and it will either become the new active node (in which case the other will reboot) or remain the active node. Nodes that do not hold the Azure Shared Disk reservation simply end up with the resource in a “Standby State” because they cannot acquire the reservation.

Figure 3: Output from LifeKeeper logs when trying to mount a disk that is already reserved.
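To illustrate why a reservation prevents split-brain even when the nodes cannot see each other, here is a deliberately simplified in-memory model: only one node can hold the reservation, and writes from any other node are refused. This is a hypothetical model of the behavior described above, not LifeKeeper's implementation and not the SCSI-3 command set; on a real shared disk the storage device enforces the reservation, and which node wins depends on timing (here `node-a` simply asks first).

```python
class SharedDisk:
    """Toy model of a shared disk that allows writes only from the reservation holder.

    Hypothetical simplification of SCSI-3 persistent reservation behavior; real
    reservations are enforced by the storage device, not by application code.
    """

    def __init__(self):
        self.holder = None  # node currently holding the reservation

    def reserve(self, node):
        """Grant the reservation if it is free or already held by this node."""
        if self.holder in (None, node):
            self.holder = node
            return True
        return False

    def write(self, node, data):
        if node != self.holder:
            raise PermissionError(f"{node} does not hold the reservation")
        return f"{node} wrote {data!r}"


disk = SharedDisk()
states = {}
# After losing contact, both nodes try to bring the file system resource in service
for node in ["node-a", "node-b"]:
    states[node] = "Active" if disk.reserve(node) else "Standby State"
print(states)  # {'node-a': 'Active', 'node-b': 'Standby State'}

try:
    disk.write("node-b", b"data")
except PermissionError as err:
    print(err)  # node-b does not hold the reservation
```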

Microsoft’s definition of Azure Shared Disks: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-shared

At present, SIOS supports Locally Redundant Storage (LRS), and we’re working with Microsoft to test and support Zone-Redundant Storage (ZRS). Ideally, we’d like to know when there is a ZRS failure so that we can fail over the resource hierarchy to the node closest to the active storage.

At present, SIOS expects Azure Shared Disk support to arrive in its next release, LifeKeeper for Linux 9.6.2, in Q3 2022.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

Introduction To Clusters – Part 2

November 23, 2021 by Jason Aw

What Types of Clusters Are There and How Do They Work?

An Overview of HA Clusters, Load Balancing Clusters, and HPC Clusters

Clustering helps improve the reliability and performance of software and hardware systems by creating redundancy to compensate for unforeseen system failures. If a system is interrupted by a hardware or software failure or a natural disaster, the impact on business and revenue can be major, and getting things back up and running costs crucial time and money.

This is where clustering comes in. There are three main types of clustering solutions – HA clusters, load balancing clusters, and HPC clusters. Which type will best increase system availability and performance for your business? Let’s have a look at the three types of clustering solutions in more detail below.

What is HA Clustering?

High availability clustering, also known as HA clustering, is effective for mission-critical business applications, ERP systems, and databases, such as SQL Server, SAP, and Oracle, that require near-continuous availability.

HA clustering can be divided into two types: the active-active configuration and the active-standby (also called active-passive) configuration.

Let’s take a look at the difference between these two HA clustering types.

HA Clustering Type 1: Active-Active Configuration

In the active-active configuration, processing is performed on all nodes in the cluster. For example, in a two-node cluster, both nodes are active. If one node stops, its processing is taken over by the other.

However, if each node is operating at close to 100% and one node stops, it will be difficult for the remaining node to take on the additional processing load. Therefore, capacity planning that leaves a margin of spare capacity on each node is important for active-active HA clustering.
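The capacity point can be made concrete with a little arithmetic. If N nodes each run at utilization u and one fails, the survivors must carry the total load N × u on N - 1 nodes, so each node's normal utilization must stay at or below (N - 1)/N of its capacity: 50% for a two-node cluster. The sketch below checks this under the simplifying assumption that the failed node's work spreads evenly across the survivors, which real workloads may not do.

```python
def survivor_utilization(node_utilizations):
    """Per-node utilization after any one node fails, assuming its load spreads evenly
    across the survivors (a simplifying assumption)."""
    n = len(node_utilizations)
    if n < 2:
        raise ValueError("need at least two nodes")
    return sum(node_utilizations) / (n - 1)

for loads in ([0.45, 0.45], [0.70, 0.70], [0.60, 0.60, 0.60]):
    after = survivor_utilization(loads)
    verdict = "OK" if after <= 1.0 else "overloaded"
    print(f"{len(loads)} nodes at {loads[0]:.0%}: survivors at {after:.0%} ({verdict})")
# 2 nodes at 45%: survivors at 90% (OK)
# 2 nodes at 70%: survivors at 140% (overloaded)
# 3 nodes at 60%: survivors at 90% (OK)
```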

HA Clustering Type 2: Active-Standby Configuration

Let’s use our two-node example again. In the active-standby configuration, one node is configured as the active node and the other node is configured as the standby node. The active node and the standby node exchange signals called “heartbeats” to constantly check whether they are operating normally.

If the standby node cannot receive the heartbeat of the active node, the standby node determines that the active node has stopped and takes over its processing. This mechanism is called “failover.” Conversely, once the stopped node has been recovered, the mechanism that transfers processing back to it is called “failback.”
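The heartbeat logic can be sketched as a standby node that records when it last heard from the active node and promotes itself once a timeout passes in silence. The class, the 5-second timeout, and the simulated clock are hypothetical. Production clustering software typically adds safeguards this sketch omits, such as requiring several missed heartbeats, often over more than one communication path, before acting.

```python
import time

class StandbyNode:
    """Hypothetical standby node that takes over when the active node's heartbeats stop."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed timeout for illustration
        self.clock = clock
        self.last_heartbeat = clock()
        self.role = "standby"

    def on_heartbeat(self):
        """Called whenever a heartbeat from the active node arrives."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Called periodically; fail over if no heartbeat arrived within the timeout."""
        silent_for = self.clock() - self.last_heartbeat
        if self.role == "standby" and silent_for > self.timeout_s:
            self.role = "active"  # failover: this node takes over processing
        return self.role


# Simulated clock so the example runs instantly
now = 0.0
node = StandbyNode(timeout_s=5.0, clock=lambda: now)
now = 3.0
node.on_heartbeat()   # heartbeat received at t=3 s
now = 7.0
print(node.check())   # standby: only 4 s since the last heartbeat
now = 9.0
print(node.check())   # active: 6 s without a heartbeat exceeds the 5 s timeout
```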

In an active-standby configuration, when a failure occurs, the simple switch from the active node to the standby node makes recovery relatively easy. However, keep in mind that the standby node’s resources sit idle, effectively wasted, while the active node is operating normally.

Two Components of HA Clustering: Application and Storage

For an HA cluster to be effective, two areas need to be addressed: application orchestration and storage protection. Clustering software monitors the health of the application being protected and, if it detects an issue, moves operation of that application over to the standby node. The standby node needs access to the most up-to-date version of the data, preferably identical to the data that the primary node was accessing before the incident. This can be accomplished in two ways: shared storage or shared-nothing storage. In the shared storage model, both cluster nodes access the same storage, typically a SAN. In shared-nothing (aka SANless) configurations, local storage on all nodes is mirrored using replication software.
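In the shared-nothing model, replication is what keeps the standby's copy current. With synchronous replication, a write is acknowledged to the application only after both the primary's and the standby's copies have it, so the standby never falls behind. The toy model below shows only that acknowledgment rule; the class names are hypothetical, and real block-level replication software also handles write ordering, resynchronization, and network failures.

```python
class Volume:
    """Toy block device: maps block numbers to bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data


class SyncMirror:
    """Hypothetical synchronous mirror: a write is acknowledged only after both copies have it."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, block, data):
        self.primary.write(block, data)
        # In a real system this is a network round trip to the standby node; if it
        # raises, the exception propagates and the application never receives "ack"
        self.standby.write(block, data)
        return "ack"


primary, standby = Volume(), Volume()
mirror = SyncMirror(primary, standby)
mirror.write(0, b"order #1001")
mirror.write(1, b"order #1002")
print(primary.blocks == standby.blocks)  # True: the standby holds identical data
```

The cost of this guarantee is latency: every write waits for the standby, which is why synchronous replication is usually paired with a fast local network.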

Clustering software products vary widely in their ability to monitor and detect issues that may cause application failure and in their ability to orchestrate failovers reliably. Many clustering products only detect whether the application server is operational, but do not detect a wide range of software, services, network, and other issues that can cause application failure.

Application Awareness is Essential

Similarly, complex ERP and database applications have multiple component parts that must be stored on the correct server or instance, started up in the right order, and brought online in accordance with complex best practices. Choose clustering software that includes application recovery kits, specialized software designed specifically to maintain best practices for application- and database-specific requirements.
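Starting components "in the right order" is a dependency problem. As a generic illustration (not how any SIOS recovery kit is implemented), a start order can be derived from declared dependencies with a topological sort, which Python 3.9+ provides in the standard `graphlib` module; stopping then runs in reverse. The component names are hypothetical, loosely modeled on an ERP-style stack.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical components, each mapped to the components that must start before it
dependencies = {
    "virtual_ip": set(),
    "shared_filesystem": set(),
    "database": {"virtual_ip", "shared_filesystem"},
    "central_services": {"database"},
    "application_server": {"central_services"},
}

# static_order() yields each component after all of its dependencies
# (and raises graphlib.CycleError if the dependencies are circular)
start_order = list(TopologicalSorter(dependencies).static_order())
stop_order = list(reversed(start_order))  # shut down in the reverse order

print("start:", start_order)
print("stop: ", stop_order)
```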

There are multiple ways to configure an HA Cluster:

Traditional Two Node Clusters with Shared Storage

Two servers are clustered with shared storage.

Two Node SANless Cluster

Clusters can be configured using a local LAN and high-speed synchronous block-level replication.

Real-time replication can be used to synchronize storage on the primary server with storage on a standby server located in the same data center, in your disaster recovery site, or both. This lets you build flexible high availability and disaster recovery configurations with two nodes or multiple nodes. SIOS block-level replication is highly optimized for performance. You can even use high-speed, locally attached storage such as PCIe flash devices on your physical servers to achieve very low cost, high performance, high availability configurations. Both your data on the flash device and your application are protected.

SAN-based cluster with a third node

Third Node for Disaster Protection

This configuration uses a SAN-based cluster and adds a third, SANless node in a remote data center or the cloud to achieve full disaster recovery protection. In the event of a disaster, the standby remote physical server is brought into service automatically with no data loss, eliminating the hours needed for restoration from backup media.

What is a Load Balancing Cluster?

A load balancing cluster uses a load balancer to distribute processing across multiple nodes that operate together as a single system, improving performance. While it can isolate a failed node so that one node’s failure does not affect the entire system, the load balancer itself is a critical single point of failure, so this is not a high availability option. It is effective only for applications such as web server load balancing. If the load balancer itself fails, the entire system stops.
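A minimal sketch of the behavior described above: a round-robin balancer that skips nodes marked unhealthy, so one failed node does not take down the service. The balancer object itself is still a single point of failure, which is exactly the limitation noted above. Node names and the health-marking methods are hypothetical; real load balancers detect failures with their own health checks.

```python
import itertools

class RoundRobinBalancer:
    """Distribute requests across healthy nodes in turn (hypothetical example)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        self.healthy.add(node)

    def next_node(self):
        # Try each node at most once per call so a fully failed pool raises instead of looping forever
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark_down("web-2")
print([lb.next_node() for _ in range(4)])  # ['web-1', 'web-3', 'web-1', 'web-3']
```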

What is HPC Clustering?

You can also use clustering for performance instead of high availability. High-performance computing (HPC) clusters combine the processing power of multiple nodes (sometimes thousands) to deliver the CPU performance needed in CPU-intensive scientific and technical environments, such as large-scale simulations, CAE analysis, and parallel processing.

Are you ready to find the right HA clustering solution for your business?

Learn more about SIOS High Availability clustering here.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters

Recent Posts

  • Inheriting DataKeeper
  • High Availability vs. Fault Tolerance: Key Differences Explained
  • Business Continuity Planning for High Availability and Disaster Recovery
  • 3 Common Configuration Mistakes That Cause Clusters to Break
  • Guide: Deploying a Multi-Zone and Multi-Region SQL Server FCI in Azure

Most Popular Posts

Maximise replication performance for Linux Clustering with Fusion-io
Failover Clustering with VMware High Availability
Create a 2-Node MySQL Cluster Without Shared Storage
SAP for High Availability Solutions For Linux
Bandwidth To Support Real-Time Replication
The Availability Equation – High Availability Solutions
Choosing Platforms To Replicate Data - Host-Based Or Storage-Based?
Guide To Connect To An iSCSI Target Using Open-iSCSI Initiator Software
Best Practices to Eliminate SPoF In Cluster Architecture
Step-By-Step How To Configure A Linux Failover Cluster In Microsoft Azure IaaS Without Shared Storage
Take Action Before SQL Server 2008/2008 R2 Support Expires
How To Cluster MaxDB On Windows In The Cloud

Copyright © 2026