SIOS SANless clusters

Broadcom/VMware: Time To Decouple High Availability From Your Hypervisor

March 24, 2026 by Jason Aw

If you are an IT Architect, Admin, or Site Reliability Engineer (SRE) managing critical workloads on VMware, your 2026 likely began with a singular headache: The Renewal. Since the Broadcom acquisition, the “Broadcom Tax” has become a well-known cost. Between the elimination of perpetual licenses, mandatory shifts to massive subscription bundles, and aggressive 72-core minimums, “standardizing on VMware” has become an exercise in forced over-provisioning.

But there is a risk greater than the price hike: the cost of application downtime.

The “VM Restart” Fallacy: Why VMware HA Isn’t True High Availability

For years, the industry has mistaken “VMware HA” for true High Availability. If a host fails, VMware restarts the VM on another server. While this is a fast reboot, it is not High Availability.

VMware HA only monitors the physical server’s “heartbeat” to determine whether the host is operational or not. It is blind to the world inside the VM. It cannot detect a database that is hung, application services that are deadlocked, or storage that is unavailable.
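To make the gap concrete, here is a minimal Python sketch (not SIOS or VMware code) contrasting a host-only check with one that also looks inside the VM. The names `HealthReport` and `check_node` and the two probe callables are hypothetical; in practice the application probe might be a test query against the database or a request to a service endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthReport:
    host_up: bool
    app_healthy: bool

    @property
    def needs_recovery(self) -> bool:
        # A host-only monitor would consider host_up alone and miss a hung app.
        return not (self.host_up and self.app_healthy)


def check_node(host_probe: Callable[[], bool],
               app_probe: Callable[[], bool]) -> HealthReport:
    """Run both probes; any exception from a probe counts as a failed check."""
    def safe(probe: Callable[[], bool]) -> bool:
        try:
            return bool(probe())
        except Exception:
            return False

    host_up = safe(host_probe)
    # The application cannot be healthy on a host that is down.
    app_healthy = host_up and safe(app_probe)
    return HealthReport(host_up, app_healthy)
```

With a responsive host and a hung database, `check_node(lambda: True, lambda: False).needs_recovery` is true, while a monitor that only looked at `host_up` would report nothing wrong.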

Today’s mission-critical ecosystems—SAP HANA, SQL Server, Oracle, and AI-driven GPU systems—require more than a “power cycle” approach. They require application-level protection.

SIOS LifeKeeper: True HA via Application-Aware Intelligence

SIOS LifeKeeper provides visibility across your application environment: network, storage, OS, and database layers. It ensures rapid, Application-Aware Failover in compliance with application-specific best practices to deliver reliable uptime, not just a fast reboot.

While Broadcom’s licensing model effectively taxes your growth and tethers you to their ecosystem, SIOS offers true architectural freedom. Our platform-agnostic licensing allows you to migrate workloads to AWS, Azure, or alternative hypervisors without losing your HA protection. With SIOS, you aren’t just buying software; you’re securing an exit strategy from vendor lock-in.

Slashing TCO After VMware Pricing Changes: Protect the App, Not the Hypervisor

Broadcom not only requires you to buy subscription licenses, but it often requires you to upgrade your entire VMware stack or purchase bloated subscription tiers just to access the HA features needed for a single Tier-1 application.

Why upgrade your entire infrastructure license to protect one SQL Server or SAP instance? SIOS provides enterprise-class HA that lives with your application, regardless of which VMware “bundle” Broadcom mandates. SIOS also gives you the flexibility to purchase subscription or perpetual licenses.

Eliminate the Cost and Complexity of SANs and vSAN Dependencies

Many new VMware bundles push customers toward vSAN. In environments where every millisecond counts, SIOS DataKeeper allows you to build clusters using local, high-performance NVMe storage. You get the protection of a cluster without the proprietary complexity or the “storage tax” of a virtual SAN.

SIOS delivers the capabilities—such as advanced data replication—that VMware typically gates behind its most expensive tiers. By decoupling HA from the hypervisor, you can maintain world-class uptime on more economical VMware licenses, potentially saving six or seven figures on your next renewal.

VMware HA vs. SIOS LifeKeeper and DataKeeper

Feature | VMware HA (vSphere Foundation) | SIOS LifeKeeper & DataKeeper
Failover Trigger | Host/hardware failure only. | Application, OS, storage, or network failure.
App Intelligence | None. It’s a “black box” restart. | Recovery Kits for SAP, SQL, Oracle, & more.
Cloud Flexibility | Requires specific VMware Cloud stacks. | Native in AWS, Azure, GCP, or hybrid.
Storage Model | Dependent on vSAN or shared storage. | SANless clusters via local NVMe/SSD.
Licensing | Complex, core-based, bundle-heavy. | Predictable, portable, and application-focused. Your choice of perpetual or subscription.

Reclaim Your Infrastructure Freedom with Application-Level High Availability

SIOS gives you the flexibility to maintain high availability on your own terms while you evaluate your long-term relationship with Broadcom.

By choosing SIOS, you gain the freedom to move workloads between VMware, Nutanix, or the Public Cloud without rewriting scripts or retraining your team. You get uptime determined by the health of the application environment, not just the server’s power light.

If your upcoming renewal feels like a dead end, it’s time to move your High Availability out of the hypervisor and into the application layer.

Request a demo today to see how SIOS delivers application-level high availability across VMware, cloud, and hybrid environments.

Author: Margaret Hoagland, VP Global Sales and Marketing at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability

ARKs and Their Use Cases

February 1, 2026 by Jason Aw

Application Recovery Kits (ARKs) play a critical role in application-aware high availability (HA).

Cassius Rhue demystifies ARKs, explaining how this software add-on extends high availability beyond basic infrastructure protection by delivering intelligent, application-aware recovery. Cassius walks through real-world use cases, highlights the industries that benefit most from ARKs, and clears up common misconceptions about how they work.

Listen to the full conversation in this podcast to better understand the real business value of application-aware HA.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

The Importance of Proper Memory Allocation in HA Environments

December 23, 2025 by Jason Aw

Proper memory allocation is a critical yet often overlooked component in any highly available (HA) environment. When a server begins to experience memory allocation issues, the effects can ripple throughout the entire cluster, degrading application performance, slowing replication, and even causing failovers to fail. In more severe cases, memory exhaustion can interrupt SIOS tools such as DataKeeper and LifeKeeper, further increasing the risk of unpredictable and unintended behavior. Understanding the role memory plays in HA environments is key to maintaining stability, performance, and predictable failover behavior.

Below, we will explore why proper memory allocation matters, what symptoms to watch for, and how memory-related issues can impact the reliability of your cluster in LifeKeeper/DataKeeper environments.

Common Symptoms of Memory Allocation Issues

1. Replication Stalls or Unexpected Mirror Hangs/Application Termination

One of the most noticeable effects of low memory is degraded replication performance. Products like DataKeeper depend on consistent access to system memory for buffering write operations. When memory is constrained, queues begin to fill, replication slows, and in some cases the mirror may hang due to resource exhaustion. This can lead to resync operations that take significantly longer than expected, especially in environments with high write rates. In addition, a non-graceful termination of the DataKeeper application can leave certain processes unmonitored or unhandled, leading to unexpected behavior when the DataKeeper service is started again.

2. Slow Application Response or Service Delays

When a system is running low on memory, the operating system may begin paging or swapping active processes. In HA environments running applications such as SQL Server, this can cause slow queries, delayed responses, and high disk activity as memory pages are constantly moved. These delays often cascade into longer failover times, as services take longer to gracefully stop or restart during a failover event.

3. Increased Risk of False Failovers

High availability solutions depend on timely heartbeat communication between nodes. When memory is exhausted, threads responsible for sending or processing heartbeat messages may be delayed. Even small delays can make a healthy node appear unresponsive, leading to unnecessary failovers or, in worst-case scenarios, split-brain events.
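A small sketch shows why a stall matters. The function below is a generic timeout check, not LifeKeeper's actual heartbeat logic, and the interval and missed-beat threshold are illustrative values rather than product defaults.

```python
def node_appears_dead(last_heartbeat: float, now: float,
                      interval: float, max_missed: int) -> bool:
    """Return True if more than max_missed intervals passed with no heartbeat.

    All times are in seconds on the same clock (e.g. time.monotonic()).
    """
    if interval <= 0 or max_missed < 1:
        raise ValueError("interval must be > 0 and max_missed >= 1")
    return (now - last_heartbeat) > interval * max_missed


# A healthy node whose heartbeat thread is stalled by paging for 3.5 seconds
# exceeds a 3 x 1-second budget and is indistinguishable from a failed node.
stalled = node_appears_dead(last_heartbeat=0.0, now=3.5, interval=1.0, max_missed=3)
on_time = node_appears_dead(last_heartbeat=0.0, now=2.9, interval=1.0, max_missed=3)
```

The peer has no way to tell "dead" from "too starved of memory to answer in time", which is why memory pressure can surface as a failover.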

4. Kernel or System Logs Showing Memory Pressure

Memory starvation often results in specific system messages (Windows or Linux). These may include warnings about low available memory, paging activity spikes, or processes being terminated by the OS to reclaim memory. For systems running replication drivers or HA services, these warnings often precede more significant issues.

5. Unpredictable Performance in Virtual or Cloud Environments

In virtualized environments, memory issues can appear even when a VM reports “available” RAM. Hypervisors like VMware, Hyper-V, or cloud platforms may throttle memory access through techniques such as ballooning or overcommitment. This can silently degrade VM performance, causing replication delays and heartbeat issues without any obvious indication of the root cause.

Tools for Diagnosing Memory Allocation Issues in HA Environments

  • Performance Monitor / Task Manager (Windows)
    Useful for identifying memory pressure, paging activity, and process-level consumption. Look for:

    • High committed memory values
    • Large paging file usage
    • Processes consuming excessive RAM
  • Event Viewer (Windows) or journalctl / dmesg (Linux)
    Memory pressure often leaves clues in system logs. Watch for:

    • “Low Memory” warnings
    • Failed memory allocations
    • Replication driver warnings indicating resource exhaustion
  • top, htop, or free (Linux)
    These tools can reveal memory saturation, swap usage, and services using disproportionate amounts of RAM.
  • Hypervisor Tools (vSphere for VMware, Hyper-V Manager for Hyper-V, cloud platform consoles)
    These tools identify ballooning, swapping, host-level contention, or overcommitment that occurs when demand for memory exceeds what the host can supply.
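On Linux, the figures that free and top report come from /proc/meminfo, so a basic pressure check can be scripted. The sketch below parses text in that format and flags low available memory or heavy swap use. The function names and the 10% / 25% thresholds are assumptions chosen for illustration, not SIOS recommendations.

```python
from typing import Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB as reported}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: Dict[str, int],
                    min_available_pct: float = 10.0,
                    max_swap_used_pct: float = 25.0) -> List[str]:
    """Return human-readable warnings; an empty list means no threshold was hit."""
    warnings: List[str] = []
    total = info.get("MemTotal", 0)
    if total > 0:
        available_pct = 100.0 * info.get("MemAvailable", 0) / total
        if available_pct < min_available_pct:
            warnings.append(f"only {available_pct:.1f}% of RAM available")
    swap_total = info.get("SwapTotal", 0)
    if swap_total > 0:
        used_pct = 100.0 * (swap_total - info.get("SwapFree", 0)) / swap_total
        if used_pct > max_swap_used_pct:
            warnings.append(f"{used_pct:.1f}% of swap in use")
    return warnings
```

On a live system the input would come from reading /proc/meminfo; keeping the parser separate from the file read makes it easy to test against captured output from a node that misbehaved.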

When to Reevaluate Memory Allocation?

You may need to increase or adjust memory allocation when:

  • Replication regularly enters PAUSED states or hangs under load.
  • Paging or swapping becomes a consistent pattern during peak workload.
  • Your application servers (e.g., SQL Server) frequently consume most of the available RAM.
  • The cluster experiences intermittent failovers with no underlying hardware failures.
  • You are operating in a cloud or virtual environment where host contention is possible.
  • Your system logs “Resource Exhaustion” events.
  • Critical services terminate unexpectedly.

In HA environments, memory isn’t just for performance; it helps ensure predictable failover behavior and prevents cascading service interruptions.

Why Proper Memory Allocation Is Key to HA Reliability

Memory pressure can negatively affect nearly every layer of an HA environment, from replication drivers to application performance and failover timing. Proper memory allocation helps ensure predictable performance, stable cluster communication, and reliable recovery when a failover occurs. By proactively monitoring and planning memory usage, organizations can avoid unnecessary downtime and maintain the high availability their systems demand. If memory allocation challenges are impacting HA performance or failover behavior, request a SIOS demo to see how we can help strengthen reliability.

Author: Aidan Macklen, Associate Product Support Specialist at SIOS Technology Corp.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

Think Before You Script: Best Practices for Gen/App Recovery

September 20, 2025 by Jason Aw

SIOS Recovery Kits provide a wealth of best practices for application-aware monitoring and recovery. In general, each SIOS recovery kit provides a step-by-step programmatic approach to restoring the application, database, or service in accordance with High Availability (HA) best practices. The SIOS Recovery Kits provide the intelligence needed to restore operation after a normal system shutdown, after an unexpected system failure or crash, and even in the case where the application, database, or service itself crashes or becomes unavailable. In addition, each recovery kit includes experiential wisdom and improvements from over two decades in the field.

However, if a customer still needs to roll their own script for providing HA, SIOS LifeKeeper for Windows and SIOS LifeKeeper for Linux include an option for script integration via the Generic Application (Gen/App) Recovery Kit.

Best Practices for Writing Gen/App Recovery Scripts

1. Use Modern, Supported Scripting Languages for Gen/App Recovery

A common practice is to carry old scripts over to new systems and architectures. However, it is essential to make sure they are written in a modern, supported scripting language.

2. Avoid Hardcoded Values in Gen/App Scripts

Hardcoded values cause portability issues and make long-term maintenance harder. Avoid hardcoding anything that is subject to change in future deployments, such as directory paths or user names.
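One common way to do this is to read deployment-specific settings from the environment. The sketch below is a generic illustration in Python; the `MYAPP_*` variable names, the `ScriptConfig` fields, and the 60-second default are all hypothetical and not part of any SIOS interface.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScriptConfig:
    app_home: str
    service_user: str
    start_timeout_s: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ScriptConfig:
    """Build the config from environment-style settings instead of literals."""
    env = os.environ if env is None else env
    missing = [key for key in ("MYAPP_HOME", "MYAPP_USER") if not env.get(key)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    try:
        timeout = int(env.get("MYAPP_START_TIMEOUT", "60"))
    except ValueError as exc:
        raise ValueError("MYAPP_START_TIMEOUT must be an integer") from exc
    return ScriptConfig(env["MYAPP_HOME"], env["MYAPP_USER"], timeout)
```

Moving the script to a server with a different install path then means changing one setting rather than editing and re-testing the script.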

3. Practice Code Reuse to Improve Gen/App Script Quality

Duplicate code is a common problem in customer-developed scripts.  Duplicate code creates quality, maintenance, and troubleshooting problems.  Practice code reuse, such as inheritance, functions, and subroutines.

4. Choose Meaningful Names for Functions and Variables

Descriptive variables are more helpful than single-character variables such as ‘n’ or ‘i’. When looking at code months or years later, will the variable ‘n’ mean as much as iReturnCode?

5. Remove Unused Functions and Variables to Prevent Code Bloat

While meaningful names for functions and variables are important, avoid cluttering the code with unused variables and functions. Declaring variables and not using them can create confusion during future updates and troubleshooting. While the days of 8 MB of memory are long gone, additional variables or functions that provide limited reuse or no additional value are still burdensome and create code bloat.

6. Verify All Input Parameters for Reliable Gen/App Execution

In the rush to get something working, don’t ignore input variable validation.  Verify all input to the script and to functions.  Don’t assume that if “we got here,” all of our inputs are valid.
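As a rough illustration, a script can reject bad arguments before it touches the application. The argument layout here (a resource tag followed by an action) and the action names are assumptions made for the example, not the Gen/App calling convention.

```python
import re
from typing import List, Tuple

VALID_ACTIONS = ("start", "stop", "check")  # hypothetical action names
TAG_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def validate_args(argv: List[str]) -> Tuple[str, str]:
    """Return (resource_tag, action) or raise ValueError with a clear reason."""
    if len(argv) != 2:
        raise ValueError(f"expected 2 arguments (tag, action), got {len(argv)}")
    tag, action = argv
    # fullmatch rejects embedded whitespace, newlines, and shell metacharacters.
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"invalid resource tag: {tag!r}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {VALID_ACTIONS}")
    return tag, action
```

Failing early with a specific message is far easier to troubleshoot than a recovery that half-runs with an empty or malformed value.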

7. Log Helpful and Actionable Messages

Consider what output needs to be logged for status/progress, error conditions, or troubleshooting.  Each message should be thoughtfully considered and appropriately worded to provide helpful feedback to operators and future developers.

8. Check Return Codes on All Method/Function/API Calls and Take Defensive Action

Commands executed within the body of the script or function produce return codes that indicate pass, fail, or some other condition. Be sure to check, log, and properly handle both expected and unexpected return codes from methods, functions, and API calls.
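Practices 7 and 8 tend to go together. The following sketch wraps an external command so that every outcome (success, non-zero exit, timeout, missing binary) is checked and logged with enough detail to act on. It is a generic pattern using Python's standard library; the logger name `genapp` and the helper names are hypothetical.

```python
import logging
import subprocess
import sys
from typing import List

log = logging.getLogger("genapp")  # hypothetical logger name


def run_checked(cmd: List[str], timeout_s: float = 30.0) -> bool:
    """Run cmd; return True only on exit code 0, logging every failure mode."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        log.error("command timed out after %.0fs: %s", timeout_s, cmd)
        return False
    except OSError as exc:
        # Covers a missing executable, bad permissions, and similar launch errors.
        log.error("could not launch %s: %s", cmd, exc)
        return False
    if result.returncode != 0:
        log.error("command %s exited with code %d: %s",
                  cmd, result.returncode, result.stderr.strip())
        return False
    log.info("command succeeded: %s", cmd)
    return True


def demo() -> None:
    """Exercise the success and failure paths with the current interpreter."""
    print(run_checked([sys.executable, "-c", "raise SystemExit(0)"]))
    print(run_checked([sys.executable, "-c", "raise SystemExit(3)"]))
```

Returning a plain boolean keeps the caller's decision simple, while the log line records which command failed, how, and what it wrote to stderr.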

9. Use Defensive Programming Techniques

Apply best practices for defensive programming, including least privilege access, input validation, error handling, etc.

10. Test Gen/App Recovery Scripts Beyond the Happy Path

Working code is not enough.  Develop a robust validation plan and test the code extensively, especially beyond the happy path when everything is expected to work.

11. Use Version Control for Script Management and Troubleshooting

Use version control and code management tools.  Version control is essential for troubleshooting, management, and tracking the inevitable fixes required for your scripts.

12. Catch Errors Early with Code Inspections and Peer Reviews

Use code inspections and peer reviews to increase the resilience and robustness of the code.  Code reviews help find problems early and reduce the cost, risk, and burden of late-stage failures and bugs.

13. Verify Permissions Required for Execution in Gen/App Recovery

Having well-organized, modern, reviewed, inspected, tested, and controlled code is an essential part of a well-crafted gen/app script.  However, the best-coded script will fail to execute if it does not have the right permissions.  Ensure that the script has the correct permissions to execute standalone as well as under the service/user accounts of the HA solution.

14. Comment Code Clearly to Explain Logic and Business Use Cases

Provide comments that help explain the business logic and use case, describe expected function inputs and returns, and contribute to overall understanding.  Well-written code still needs comments, especially if it is not obvious what business logic or requirement is being addressed.  An example comment block could look like:

    Name:
    Purpose:
    Preconditions:
    Postconditions:
    Returns:
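Filled in, such a block might look like the following. The helper itself is a hypothetical example written in Python; the point is how the header documents intent and contract rather than restating the code.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float,
               poll_s: float = 0.5) -> bool:
    """
    Name:           wait_until
    Purpose:        Give a service time to come up before declaring the start
                    failed, so a slow but healthy start is not treated as an error.
    Preconditions:  timeout_s >= 0 and poll_s > 0; predicate is cheap to call
                    and has no side effects.
    Postconditions: predicate has been called at least once.
    Returns:        True if predicate returned True before the timeout expired,
                    False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```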

Ready to Simplify Gen/App Recovery with Confidence?

Don’t leave high availability to chance. With SIOS LifeKeeper and the Generic Application (Gen/App) Recovery Kit, you can safeguard critical applications, streamline recovery, and reduce downtime.

Request a demo today to see how SIOS can help you achieve reliable, cost-effective high availability and disaster recovery.

Author: Cassius Rhue, VP, Customer Experience at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

How to Assess if My Network Card Needs Replacement

May 21, 2025 by Jason Aw

A network interface card (NIC), often referred to as a network card, is a vital component of any server infrastructure. It enables systems in a cluster to communicate with each other and the outside world. If your NIC is experiencing issues, it can compromise the health of your cluster, lead to false node failures, or increase the risk of split-brain scenarios. Recognizing the signs of a failing NIC early can save time, reduce downtime, and maintain high availability.

In this blog, we’ll explore how to assess whether your network card needs replacement, the symptoms to look out for, and the tools that can aid you in diagnosing the issue.

Common Symptoms of a Failing NIC

1. Intermittent Connectivity

One of the first signs of NIC failure is unstable or sporadic connectivity. You may notice dropped packets, high latency, or difficulty reaching external hosts. These issues can cause nodes in a LifeKeeper cluster to temporarily lose connection and trigger unnecessary failovers.

2. Degraded Network Speed

If a system shows slow replication, sluggish application response, or delayed heartbeat communication, the cause may be a faulty NIC that is no longer operating at its rated speed (e.g., negotiating 1 Gbps instead of 10 Gbps). In clustered environments, slow replication is especially concerning because it delays data synchronization between nodes. This not only increases recovery time in the event of a failover but also raises the risk of data loss or inconsistent state across systems if a complete failure occurs before the replication finishes.

3. System Logs Showing Network Errors

Frequent kernel or system log messages related to the NIC driver or interface, such as “link down,” “NIC reset,” or “device not responding,” are red flags. These messages indicate the OS is having trouble communicating with the card at a hardware or driver level.
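Counting how often these messages appear makes a trend easier to spot than reading logs by eye. The sketch below scans log lines for the three phrases mentioned above. The regular expressions are illustrative assumptions; actual wording varies by driver and OS, so treat them as a starting point to adapt.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative patterns only: real driver messages differ between vendors.
NIC_ERROR_PATTERNS = {
    "link down": re.compile(r"link (is )?down", re.IGNORECASE),
    "NIC reset": re.compile(r"(nic|adapter) reset|reset adapter", re.IGNORECASE),
    "device not responding": re.compile(r"not responding", re.IGNORECASE),
}


def count_nic_errors(lines: Iterable[str]) -> Counter:
    """Count log lines matching each pattern (at most once per line per pattern)."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in NIC_ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

The input could be the output of dmesg or journalctl split into lines; a count that keeps rising for one interface is a stronger signal than a single event.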

4. Unusual Heat or Physical Damage

While not common, physical inspection may reveal damage such as scorch marks or excessive heat emission. Hardware issues at this level can quickly degrade performance or cause complete failure.

5. Issues in Virtual or Cloud Environments

In virtualized and cloud environments, NIC behavior can be affected not just by the underlying hardware but also by the configuration of the hypervisor or virtual networking layer. For example, virtual NICs assigned through VMware or Hyper-V may show degraded performance if incompatible/outdated drivers are used, or even if the VM is assigned an adapter type that is not optimized for the desired workload.

Network Card Troubleshooting Tools for Windows and Linux

Diagnosing NIC issues early helps minimize downtime and prevent unnecessary failovers. The following are essential tools for identifying hardware or driver-related NIC issues, including options for both Linux and Windows environments:

  • ethtool (Linux):
    Use this to view NIC statistics, driver information, and current link status. A high number of transmit/receive errors, dropped packets, or auto-negotiation failures could indicate a deteriorating NIC.
  • PowerShell cmdlets (Windows):
    Get-NetAdapter and Get-NetAdapterStatistics allow you to inspect link status, speed, and adapter health on Windows systems. Combined with Get-NetEventSession, you can also track event logs related to NIC behavior over time.
  • dmesg / journalctl (Linux) or Event Viewer (Windows):
    These tools help uncover system or kernel-level alerts. Look for messages such as “NIC reset,” “link down,” or “device not responding.” In Windows, these might appear under “System” or “Application” logs and indicate driver crashes or hardware unresponsiveness.
  • ping / iperf (Cross platform):
    Useful for testing basic connectivity and throughput. If packet loss, jitter, or unexpected latency spikes occur during tests, it could point to faulty hardware or cabling.
  • Network Bonding Failover Behavior:
    When using bonded or teamed interfaces for redundancy, observe whether one interface is triggering failover events more frequently than the others. This could mean that interface's NIC is silently degrading, even if no system errors are reported.
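Statistics output of the kind `ethtool -S <interface>` prints (one `name: value` counter per line) is easy to post-process. This sketch parses such text and keeps only non-zero counters whose names suggest errors or drops. Counter names are driver-specific, so the keyword filter is an assumption to adjust for your hardware.

```python
from typing import Dict, Sequence


def parse_counters(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines into a dict, skipping headers and non-numeric rows."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def suspicious_counters(counters: Dict[str, int],
                        keywords: Sequence[str] = ("err", "drop")) -> Dict[str, int]:
    """Return non-zero counters whose name contains any of the keywords."""
    return {
        name: value
        for name, value in counters.items()
        if value > 0 and any(key in name.lower() for key in keywords)
    }
```

Absolute values matter less than growth: comparing two snapshots taken some minutes apart shows whether errors are still accumulating.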

When to Replace Your NIC?

It may be time to replace your NIC if:

  • You observe consistent or worsening symptoms outlined above.
  • Logs and tools confirm hardware or driver issues that persist after driver updates or firmware reinstallation.
  • The issue follows the NIC when moved to another system (if removable).
  • The card is outdated and unsupported by the current OS or clustering tools.
  • You are in a highly available (HA) environment where the continuity of service is critical. In these cases, it is best practice to proactively move services or resources to nodes with verified healthy NICs while troubleshooting, to avoid risking a failover delay or unexpected downtime.

Preventative Measures to Avoid Network Card Failures

To avoid NIC-related failures:

  • Use redundancy: Implement bonding or teaming across multiple NICs.
  • Keep firmware up to date: Periodically check for driver and firmware updates from your hardware vendor.
  • Monitor proactively: Use tools and third-party network monitoring to catch early signs of NIC degradation.
  • Regular testing: Validate link speed and latency as part of regular cluster health checks.
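For the regular-testing step, a health check can reduce a batch of round-trip measurements to a few numbers that are easy to compare over time. The sketch below assumes the measurements have already been collected (for example from ping) with `None` marking a lost probe; the function name is hypothetical, and jitter is computed here as the mean absolute difference between consecutive successful round trips, which is one simple definition among several.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarize round-trip times in milliseconds; None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("no probe results supplied")
    ok = [rtt for rtt in rtts_ms if rtt is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(ok)) / len(rtts_ms)
    avg_ms = mean(ok) if ok else float("nan")
    deltas = [abs(later - earlier) for earlier, later in zip(ok, ok[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0
    return {"loss_pct": loss_pct, "avg_ms": avg_ms, "jitter_ms": jitter_ms}
```

Recording these values at each cluster health check gives a baseline, so a NIC that is slowly degrading shows up as a drift rather than a surprise.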

Final Thoughts on Maintaining Network Interface Card Health

The NIC may not be the most glamorous piece of hardware, but its health is critical to a stable, highly available environment. Knowing when and how to assess a network card’s performance helps prevent unexpected downtime, ensures seamless failover behavior, and keeps your cluster communication resilient.

SIOS Technology Corporation provides high availability cluster software that protects & optimizes IT infrastructures with cluster management for your most important applications. Request a demo today.

Author: Aidan Macklen, Customer Experience Engineer Intern at SIOS Technology Corp.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability
