SIOS SANless clusters


Keeping Buildings Safe: High Availability in Maintenance and Security Systems

March 13, 2026 by Jason Aw


In this episode of TFiR: Let’s Talk, host Swapnil Bhartiya speaks with Dave Bermingham, Director of Customer Success at SIOS Technology, about why high availability and resiliency are critical for building maintenance and security systems. Bermingham explains how these systems differ from, but often interact with, other building technologies, and why uninterrupted operation is essential to occupant safety and building functionality. The conversation explores how organizations can balance security with accessibility, the role of emerging technologies such as AI, machine learning, and IoT in improving reliability, and best practices for ensuring system availability through redundancy, monitoring, and risk planning.

Author: Beth Winkowski, SIOS Technology Corp. Public Relations

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

Designing High Availability Through Modularity and Abstraction

March 6, 2026 by Jason Aw


Thus far, this series has explored parallels between technical design and rhetoric. The “rhetoric” of a technical solution, the strategy of communicating meaning and purpose, is presented via the design patterns and concepts. The design patterns and concepts exist as a conceptual foundation, upon which the meaning is translated into an applied form when put into practice during implementation.

As previously discussed, the continuity and integrity of this conceptual foundation are paramount to ensuring that solutions are kept up to a standard that is conducive to maintenance, improvement, and long-term reliability. External factors influencing a solution’s design challenge the goal of upholding the conceptual foundations put forth in a solution’s design. These external factors can conflict with the standing principles, and thus, tools, applications, and platforms used in a solution must be chosen mindfully.

In the third and final part of this blog series, modularity and abstraction will be explored as a means to put boundaries in place and ensure that projects with a wide scope can continue to reap the benefits of a well-formed, rhetorically sound design.

High Availability Design Principles: Why Modularity and Abstraction Matter

Before addressing modularization and abstraction as strategies, it is important to understand why these should be implemented. Starting broadly with an analogy, a speaker trying to convince their audience to agree with their plan might first need to outline multiple foundational points. In doing so, each pillar of their argument’s foundation gets put forth and justified.

The speaker first must set up the “A implies B” and “C implies D” basis, upon which they can form the argument “B and D imply E”. This strategy ensures that the reasoning in which “A implies B” does not cross-contaminate and detract from the separate point “C implies D”. This strategy is frequently used because it allows each component of the speaker’s argument to stand independently of others. If the argument “C implies D” is flawed, it can be reconciled while the argument “A implies B” remains sound.

The reason for this structure is the same reason technical systems are decentralized – a problem in a point of sale system can be remediated without the need to expand the remediation efforts to the databases, APIs, network architecture, and so on. The strategies referenced above are, of course, modularity and abstraction.

Modularity in High Availability Architectures

First, addressing modularity, this is the practice of creating systems from components that are self-contained. In the rhetorical sense, the arguments “A implies B” and “C implies D” are simply modules of reasoning that get assembled into the argument as a whole.

More technically, modularized components (such as the point of sale system in the previous example) allow issues to be addressed entirely within the module where the issue originates. Each module in the solution acts as a building block, and a problem in a single building block can be resolved without dismantling the entire solution.
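The building-block idea can be sketched in code. This is a minimal illustration, not anything from the SIOS products; all class and method names here are hypothetical, chosen only to mirror the point of sale example above. A bug fixed inside one module (the validation in `ring_up`) never requires touching the sibling module.

```python
class PointOfSale:
    """Self-contained module: owns its own state and error handling."""

    def __init__(self):
        self.transactions = []

    def ring_up(self, amount):
        if amount <= 0:  # a fix to this check stays inside this module
            raise ValueError("amount must be positive")
        self.transactions.append(amount)
        return sum(self.transactions)


class Inventory:
    """A sibling building block: unaffected by changes inside PointOfSale."""

    def __init__(self, stock):
        self.stock = dict(stock)

    def remove(self, item):
        self.stock[item] -= 1
        return self.stock[item]


# The modules interact only through their small public interfaces.
pos = PointOfSale()
inv = Inventory({"widget": 10})
total = pos.ring_up(5.00)
left = inv.remove("widget")
```

Because neither class reaches into the other's internals, remediating a defect in one block is a local change, exactly as described above.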

Abstraction as a Strategy for Scalable Infrastructure Design

Closely related to modularity is “abstraction”. Abstraction is the practice of ensuring that the design of the overall solution is independent of, and agnostic to, the design of the modules that compose it.

Further, abstraction as a design strategy also holds that each module is independent and agnostic to the design of every other module. When a solution is designed around abstracted elements, those elements can be reused across use cases, and the understanding gained from one element carries over to the rest of the project.
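A small sketch makes the “independent and agnostic” property concrete. The interface and backends below are illustrative assumptions, not SIOS APIs: the overall solution (`protect`) depends only on the abstract interface, so any conforming module can be swapped in without changing the rest of the design.

```python
from abc import ABC, abstractmethod


class ReplicationBackend(ABC):
    """Abstract interface: the solution depends on this, not on any
    concrete module's internals."""

    @abstractmethod
    def replicate(self, block: bytes) -> bool:
        ...


class SynchronousBackend(ReplicationBackend):
    def replicate(self, block):
        return True  # illustrative: committed on the target first


class AsynchronousBackend(ReplicationBackend):
    def replicate(self, block):
        return True  # illustrative: queued for the target


def protect(data: bytes, backend: ReplicationBackend) -> bool:
    # Agnostic to which backend is in use; only the interface matters.
    return backend.replicate(data)
```

Either backend can be reused anywhere the interface is accepted, which is how understanding one abstracted element amplifies understanding of the whole.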

Designing High Availability That “Stays Out of the Way”

When designs are built of modular components, boundaries are drawn. These boundaries ensure that each module can “stay out of the way” of the other modules. When the components are abstracted, the contents of each module can be understood more easily.

In turn, the boundaries serve as a structure by which the design can be understood, and the abstraction within these boundaries serves as an entry point to understand the foundations of the use case. The structure provided via modularity and abstraction mirrors the role of rhetoric in providing a framework by which purpose is understood.

Managing Complex Network Architectures with Modular HA Solutions

As technical solutions are being developed to address more complex problems, the need for a solid framework in that solution’s design grows as well. Network architecture, often the culmination of many solutions that are complex in their own right, serves as a fantastic example of the increasingly complex problem and growing requirement for solid frameworks in design. Furthermore, network architecture often suffers from continual growth as it has to absorb the sprawling web of systems that contribute to the purpose of a growing business.

Layered on top of this, the solution architecture must then employ solutions for High Availability and/or Disaster Recovery. This creates a hot spot for design conflicts to arise, but can be easily mitigated with the strategies of modularization and abstraction.

Applying Modularity and Abstraction with SIOS High Availability Software

The benefits of High Availability software can be achieved without the baggage of complexity and hacked-together solutions. SIOS LifeKeeper, as an example of a design-compliant High Availability tool, is built so that the principles of its operation mesh seamlessly with the environment in which it is used.

LifeKeeper is modular, as it does not impose requirements outside of the LifeKeeper-protected systems. LifeKeeper also facilitates the abstraction of infrastructure components to bite-sized elements – systems that work together to ensure availability are grouped into a “cluster”.

Through this abstraction, the rhetoric of the environment remains strong – understanding the makeup of one cluster lays the foundation to understand all clusters. Layers of the design can be understood for their purpose; there is no need for asterisks and special considerations on how implementations differ across the design. As the clusters act independently of other clusters or external solution components, a boundary can be drawn where the design elements of each respective layer are contained, avoiding conflict with other layers of the infrastructure.

Building Long-Term Resilient Infrastructure with SIOS Protection Suite

As with any software or tool, SIOS Protection Suite (SIOS LifeKeeper and/or SIOS DataKeeper) influences the design of the environments in which it is used. Though these patterns are introduced by virtue of having a LifeKeeper- and DataKeeper-protected environment, the patterns were carefully selected to ensure that they enable abstraction and modularity within the solution as a whole. As a result of this layered abstraction, introducing these utilities facilitates integration with the existing IT infrastructure while maintaining cohesion in the solution’s design.

As a result of the design patterns employed, clusters protected by SIOS Protection Suite (LifeKeeper and/or DataKeeper) compose an abstract and modular element that fits seamlessly into existing designs and solutions. LifeKeeper and DataKeeper do more than simplify the administration of single systems or each respective cluster; LifeKeeper and DataKeeper work with the principles at play in a deployment.

Creating infrastructure becomes simplified and more efficient as the use of SIOS Protection Suite allows for a simple method of understanding the system’s role in the design, while at the same time providing a simple method for implementing High Availability and Disaster Recovery. Administrators may use LifeKeeper and DataKeeper as a tool to improve their ability to understand, operate, and improve upon the solution for years to come.

See how high availability can support your infrastructure’s design—without adding complexity. Request a demo today!

Author: Philip Merry, CX Software Engineer at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

The Critical Role of QA and Production Environments in High Availability

March 2, 2026 by Jason Aw


For IT teams managing modern applications, maintaining high availability while rolling out updates can be a challenge. An integral piece of achieving reliability is the separation of Quality Assurance (QA) and production environments. While it may seem like a trivial practice, it is important for catching potential issues and instilling confidence in maintenance tasks.

QA Environments as the Testing Ground for High Availability

The QA environment serves as a replica of the production environment. This provides a sandbox where new features, configuration changes, and patches can be thoroughly tested. Beyond functional testing, a QA environment allows for process validation, performance benchmarking, load testing, and security validation.

These are critical activities for identifying bottlenecks, vulnerabilities, or integration issues before they have the chance to impact end users or compromise your environment. For distributed systems or cloud architectures, QA environments can help simulate network latency, database replication delays, and other operational edge cases that can disrupt business operations if not tested.
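Latency simulation of the kind described above can be as simple as wrapping the transport under test with an artificial delay. The helper names, delay, and timeout values below are illustrative assumptions for a QA harness, not part of any SIOS tooling.

```python
import time


def replicate_with_timeout(send, payload, timeout_s=0.5):
    """Return True if the send path completes within the replication
    timeout, False if the link is too slow (the failure QA should catch)."""
    start = time.monotonic()
    send(payload)
    elapsed = time.monotonic() - start
    return elapsed <= timeout_s


def slow_link(payload, delay_s=0.05):
    # QA-only wrapper: injects artificial network latency before "sending".
    time.sleep(delay_s)


# Exercised in QA: the delayed link still meets a 500 ms budget...
ok = replicate_with_timeout(slow_link, b"block")
# ...but a tighter budget would be exceeded, surfacing the edge case
# before it ever reaches production.
too_slow = replicate_with_timeout(slow_link, b"block", timeout_s=0.01)
```

Running the same workload against progressively tighter budgets in QA reveals where replication delays would disrupt production.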

Production Environments and the End-User Experience

The production environment is where end users rely on systems to perform consistently. Any unplanned downtime or failure can have direct business consequences, from lost revenue to reputational damage.

By keeping production isolated from ongoing development and testing, IT teams can ensure operational stability. Properly configured production environments should include redundancy strategies, failover mechanisms, and monitoring tools that were validated through testing in the QA environment before deployment.

Smooth Transitions Through Structured Deployment Pipelines

High availability doesn’t have to be just about keeping systems up. It can include making updates predictable. QA environments can support structured deployment pipelines, enabling various strategies like staged rollouts and blue-green releases. Rollback procedures, pre-validated in QA, allow teams to recover quickly if unexpected issues arise. A structured approach makes updates predictable and helps maintain customer trust.
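A blue-green release can be reduced to a pointer swap, which is what makes rollback fast and predictable. The sketch below is a minimal illustration under assumed names (`BlueGreenRouter`, `deploy_to_idle`); real pipelines swap load-balancer targets rather than a field, but the control flow is the same.

```python
class BlueGreenRouter:
    """Minimal blue-green sketch: traffic points at one environment at a
    time; the other receives and validates the new release first."""

    def __init__(self):
        self.live = "blue"    # environment currently serving traffic
        self.idle = "green"   # environment that stages the new release

    def deploy_to_idle(self, version, health_check):
        # Validate the release in the idle environment (as in QA) before
        # it ever sees production traffic.
        if not health_check(version):
            return False      # bad release never goes live
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self):
        # Pre-validated rollback: just swap the pointer back.
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter()
router.deploy_to_idle("v2", health_check=lambda v: True)   # green goes live
```

Because the previous environment is left intact, recovering from an unexpected issue is a single swap rather than a redeploy.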

Operational Benefits of Separating QA and Production Environments

Having separate QA and production environments can also support compliance, audit readiness, and cross-team coordination. Clear boundaries between testing and live systems can help operations and development collaborate efficiently. It also helps provide a repeatable framework for monitoring, troubleshooting, and disaster recovery planning.

QA and Production Environments in a High Availability Strategy

QA and production environments play a vital role in keeping systems running smoothly. By keeping environments separate, testing thoroughly, and managing deployments carefully, IT teams can reduce downtime, maintain high availability, and make transitions between updates seamless. These practices help ensure systems stay dependable and resilient as they evolve.

Ready to improve high availability across QA and production environments? Request a demo to see how SIOS helps teams deploy updates confidently and keep critical systems running.

Author: Tristan Allen, Associate Customer Experience Software Engineer at SIOS Technology

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

The Danger of Turn It Off, Turn It Back On Again Thinking in High Availability

February 23, 2026 by Jason Aw


“Turn it off, turn it back on again.” Anyone who has had experience troubleshooting any kind of computer issue has heard this piece of advice. It is notorious for being the most common tech solution, and for turning anyone into a master IT troubleshooter. The problem is that it is never actually the solution; it just happens to solve most things. By turning it off and turning it back on again, we quickly get back up and running, but we never really find out what the problem was in the first place.

Why “Turn It Off and Back On Again” Is Risky in High Availability Systems

Additionally, in the world of high availability, “turn it off” can be a huge problem. Even minutes of downtime can be costly for companies whose critical infrastructure must remain up. Because of this, working in tech support for SIOS, we don’t often give this notorious piece of tech advice, but we do have our own version.

Many who have called in for tech support at SIOS for a Windows DataKeeper mirroring issue will have been told to run the command “cleanupmirror.” In the right situation, this is an excellent command for quickly getting someone out of a major problem. The command essentially completely deletes the mirror configuration and any possible remnants of it, so that we can recreate the mirror fresh, free from whatever problem plagued it previously. Note that this does not actually remove any data, just the replication between the systems.

The command does not require downtime, but it does mean that the systems are not highly available until the mirror finishes resyncing. This is one of our go-to troubleshooting steps in support, but like “turn it off, turn it back on again,” it can sometimes hide a more serious underlying issue, and it can sometimes be overkill.
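The delete-and-recreate sequence described above can be sketched as a command plan. The EMCMD verbs shown (CLEANUPMIRROR, CREATEMIRROR) and the `A`/`S` sync-mode convention reflect DataKeeper's command-line interface as best understood here and should be verified against the current DataKeeper documentation; the wrapper itself, its name, and the dry-run behavior are purely illustrative.

```python
import subprocess


def rebuild_mirror(volume, target_ip, mode="A", dry_run=True):
    """Plan (or run) the mirror rebuild: delete the broken mirror
    configuration, then recreate it fresh.

    mode: 'A' = asynchronous, 'S' = synchronous (assumed convention).
    With dry_run=True nothing is executed; the planned commands are
    returned for review."""
    commands = [
        ["emcmd", ".", "CLEANUPMIRROR", volume],
        ["emcmd", ".", "CREATEMIRROR", volume, target_ip, mode],
    ]
    if dry_run:
        return commands
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop on the first failure
    return commands


# Preview the recovery sequence for volume E without touching anything.
planned = rebuild_mirror("E", "10.0.0.2")
```

Reviewing the plan before execution mirrors the support guidance above: the command gets you out of a bind quickly, but it should be a deliberate step, not a reflex.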

Today, I wanted to talk about one such case, where running cleanupmirror got the customer out of an immediate problem, but almost made us miss a fairly serious issue, which could affect a wide range of customers, but had a really easy workaround and fix.

A Real-World DataKeeper Mirroring Issue During Migration

When the support team joined, the customer had already been troubleshooting this for quite some time, and they were starting to panic. They were doing their final switchover tests as part of their migration when DataKeeper mirroring started having issues. At this point, their critical infrastructure was down, and they were worried it was going to start affecting their business. This was a high-stress situation, but fortunately, the support engineers here did an excellent job. They balanced the pressure, rush, and need to find a good solution, and ran the tried and true “cleanupmirror” command, followed by recreating the mirror in working order. They got the customer out of a bind, and everybody moved on. Fortunately, they also asked the customer to send in logs, “for good measure.”

The logs on this case were somewhat confusing. The logs indicated that a volume had been resized, but the customer had claimed that they had not performed any resizing activity on the call. Sometimes customers leave out important information, so we thought that maybe they had left that detail out on the call, but the resize didn’t make any sense. The change in size was very small, and it happened to all volumes at the same time as the first switchover. It wouldn’t have made sense for the customer to resize their terabytes of large drives by subtracting less than a gigabyte all at once, perfectly in sync with the first switchover, so we looked a little deeper. It turned out that the target drives were slightly larger than the source drives, and there was an issue in our product with how it handled mismatched drive sizes.

Identifying the Root Cause Prevented Repeat Downtime

Once we figured this out, we realized that all that was needed to resolve this issue was to continue the mirror. This is a common, quick, and easy operation that would have taken seconds to completely fix the issue. No days-long resync before we got back to having high availability. Additionally, once we found this issue, it was a very quick and easy fix to implement for the next product version.

It turned out that the customer had a unique migration scenario, which required them to make the targets slightly larger, because matching up the sizes was impossible. They still had several systems left to migrate, and if we had left the case at “cleanupmirror,” they would have run into this issue every time. Because we found the root cause, we were able to give them a quick and easy workaround, and an even quicker preventative measure they could take before executing the first switchover. We were also able to publish a solution, so that the next customer who ran into this would be able to solve it in minutes.

Why Root Cause Analysis Matters in High Availability

So, what is the big problem with “turn it off, turn it back on again”? It hides the root cause. So, does that mean that you should never use it? It is still some of the best tech advice there is. Often, you really don’t need to know what the root cause is, and turning it off and back on again gets you out of a pinch really quickly.

The important part for an IT professional is that when you don’t need to get out of a pinch, and you can afford some time to investigate first, you should. When you don’t, you should go back later and look at the logs to try to see if you can figure out what happened.

So, please, turn it off and turn it back on to your heart’s content. Be the magician who solved that one problem in minutes, and leave everyone wondering how you did that. But… every once in a while… take some time to go back and figure out why you needed to turn it off and back on again… and consider the possibility that there could have been an even easier solution.

To learn more about how SIOS DataKeeper and high availability solutions can help you avoid hidden issues like this, request a demo from our team today.

Author: Carter Chandler, CX Associate Software Engineer at SIOS Technology

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

Common Customer Misconceptions

February 11, 2026 by Jason Aw


High availability (HA) and disaster recovery (DR) are often misunderstood.

Listen to the full conversation in this podcast as Greg Tucker unravels common customer misconceptions about HA/DR—from overestimating what cloud providers guarantee to misjudging the complexity of solutions like SQL Server Always On Availability Groups. Greg shares real examples of avoidable mistakes, explains how SIOS helps customers rethink their HA/DR assumptions, and offers practical advice for IT leaders looking to make informed, resilient infrastructure decisions.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

