How to Assess if My Network Card Needs Replacement

May 21, 2025 by Jason Aw

A network interface card (NIC), often referred to as a network card, is a vital component of any server infrastructure. It enables systems in a cluster to communicate with each other and the outside world. If your NIC is experiencing issues, it can compromise the health of your cluster, lead to false node failures, or increase the risk of split-brain scenarios. Recognizing the signs of a failing NIC early can save time, reduce downtime, and maintain high availability.

In this blog, we’ll explore how to assess whether your network card needs replacement, the symptoms to look out for, and the tools that can aid you in diagnosing the issue.

Common Symptoms of a Failing NIC

1. Intermittent Connectivity

One of the first signs of NIC failure is unstable or sporadic connectivity. You may notice dropped packets, high latency, or difficulty reaching external hosts. These issues can cause nodes in a LifeKeeper cluster to temporarily lose connection and trigger unnecessary failovers.
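
As a rough illustration of how dropped packets and latency can be quantified, the sketch below runs ping and parses its summary. It assumes the summary format printed by Linux iputils ping; the function names are hypothetical, and Windows output would need different patterns.

```python
import re
import subprocess


def parse_ping_summary(output):
    """Extract packet loss (%) and average RTT (ms) from iputils-style ping output."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    # The rtt line looks like: rtt min/avg/max/mdev = a/b/c/d ms
    rtt = re.search(r"= [\d.]+/([\d.]+)/", output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_rtt_ms": float(rtt.group(1)) if rtt else None,
    }


def ping_peer(host, count=20):
    """Ping a peer node and return the parsed summary (Linux/macOS '-c' flag assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    return parse_ping_summary(result.stdout)


# Sample summary used only for illustration, not a real measurement.
SAMPLE = """\
--- 10.0.0.2 ping statistics ---
20 packets transmitted, 18 received, 10% packet loss, time 19030ms
rtt min/avg/max/mdev = 0.212/0.348/1.905/0.371 ms
"""

if __name__ == "__main__":
    print(parse_ping_summary(SAMPLE))
```

Running a check like this between cluster nodes at intervals, rather than once, makes intermittent loss easier to spot.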

2. Degraded Network Speed

If a system is underperforming on network-related tasks, with slow replication, sluggish application response, or delayed heartbeat communication, the cause may be a faulty NIC that is no longer operating at its rated speed (e.g., a 10 Gbps card that negotiates a 1 Gbps link). In clustered environments, slow replication is especially concerning because it delays data synchronization between nodes. This not only increases recovery time in the event of a failover but also raises the risk of data loss or inconsistent state across systems if a complete failure occurs before replication finishes.
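
One way to check for a link that has negotiated below its rated speed on Linux is to read the value the kernel exposes under /sys/class/net. The sketch below is a minimal example under that assumption; the function names and the rated speed you pass in are illustrative, and some drivers or virtual adapters do not report a speed at all.

```python
from pathlib import Path


def read_link_speed_mbps(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if it is unavailable."""
    try:
        value = int((Path(sysfs_root) / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        # The file may be missing, or unreadable while the link is down.
        return None
    # Unknown speed is reported as a non-positive value.
    return value if value > 0 else None


def is_degraded(negotiated_mbps, rated_mbps):
    """True when a known negotiated speed is below the speed the card is rated for."""
    return negotiated_mbps is not None and negotiated_mbps < rated_mbps


if __name__ == "__main__":
    speed = read_link_speed_mbps("eth0")  # "eth0" is a placeholder interface name
    print(speed, is_degraded(speed, rated_mbps=10000))
```

A lower-than-rated speed can also come from the switch port or cable, so treat the result as a prompt to investigate rather than proof of a bad card.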

3. System Logs Showing Network Errors

Frequent kernel or system log messages related to the NIC driver or interface, such as “link down,” “NIC reset,” or “device not responding,” are red flags. These messages indicate the OS is having trouble communicating with the card at a hardware or driver level.
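
To get a feel for how often such messages occur, log lines can be counted per category. The sketch below is an assumption-laden example: the patterns are based on the phrases quoted above, the sample lines are invented, and real driver messages vary, so the patterns would need adjusting for your hardware.

```python
import re
from collections import Counter

# Illustrative patterns only; actual wording depends on the NIC driver.
PATTERNS = {
    "link_down": re.compile(r"link (is )?down", re.IGNORECASE),
    "reset": re.compile(r"(nic|adapter) reset|reset adapter", re.IGNORECASE),
    "not_responding": re.compile(r"not responding", re.IGNORECASE),
}


def count_nic_events(lines, iface=None):
    """Count log lines matching each pattern, optionally limited to one interface."""
    counts = Counter()
    for line in lines:
        if iface and iface not in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, e.g. taken from `journalctl -k` output.
    sample = [
        "kernel: e1000e: eth0 NIC Link is Down",
        "kernel: eth0: device not responding",
        "kernel: eth1 link down",
        "sshd: session opened for user admin",
    ]
    print(count_nic_events(sample, iface="eth0"))
```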

4. Unusual Heat or Physical Damage

While not common, physical inspection may reveal damage such as scorch marks, or the card may run unusually hot. Hardware issues at this level can quickly degrade performance or cause complete failure.

5. Issues in Virtual or Cloud Environments

In virtualized and cloud environments, NIC behavior can be affected not just by the underlying hardware but also by the configuration of the hypervisor or virtual networking layer. For example, virtual NICs assigned through VMware or Hyper-V may show degraded performance if incompatible or outdated drivers are used, or if the VM is assigned an adapter type that is not optimized for the workload.

Network Card Troubleshooting Tools for Windows and Linux

Diagnosing NIC issues early helps minimize downtime and prevent unnecessary failovers. The following are essential tools for identifying hardware or driver-related NIC issues, including options for both Linux and Windows environments:

ethtool (Linux):
Use this to view NIC statistics, driver information, and current link status. A high number of transmit/receive errors, dropped packets, or failed auto-negotiations could indicate a deteriorating NIC.
PowerShell cmdlets (Windows):
Get-NetAdapter and Get-NetAdapterStatistics let you inspect link status, speed, and adapter health on Windows systems. Combined with Get-NetEventSession, they also help you track NIC-related events over time.
dmesg / journalctl (Linux) or Event Viewer (Windows):
These tools help uncover system or kernel-level alerts. Look for messages such as “NIC reset,” “link down,” or “device not responding.” In Windows, these might appear under “System” or “Application” logs and indicate driver crashes or hardware unresponsiveness.
ping / iperf (Cross platform):
Useful for testing basic connectivity and throughput. If packet loss, jitter, or unexpected latency spikes occur during tests, it could point to faulty hardware or cabling.
Network Bonding Failover Behavior:
When using bonded or teamed interfaces for redundancy, observe whether one interface is triggering failover events more frequently than the others. This could mean the failing NIC is silently degrading, even if no system errors are reported.
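
As an example of putting the ethtool statistics to work, the sketch below parses the "name: value" lines printed by `ethtool -S` and keeps the non-zero counters whose names suggest errors or drops. Counter names differ between drivers, so the keyword list is an assumption, and the helper names are hypothetical.

```python
import re
import subprocess


def parse_ethtool_stats(output):
    """Turn `ethtool -S` output into a {counter_name: value} dictionary."""
    stats = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\S.*?):\s*(-?\d+)\s*$", line)
        if match:  # the "NIC statistics:" header has no value and is skipped
            stats[match.group(1)] = int(match.group(2))
    return stats


def suspicious_counters(stats, keywords=("err", "drop", "crc", "fifo")):
    """Return non-zero counters whose names contain any of the given keywords."""
    return {
        name: value
        for name, value in stats.items()
        if value > 0 and any(word in name.lower() for word in keywords)
    }


def ethtool_stats(iface):
    """Run `ethtool -S <iface>` (Linux, may need root) and parse the result."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool_stats(result.stdout)


if __name__ == "__main__":
    sample = "NIC statistics:\n     rx_packets: 1000\n     rx_crc_errors: 12\n     tx_dropped: 0\n"
    print(suspicious_counters(parse_ethtool_stats(sample)))
```

A single non-zero counter is not conclusive on its own; counters that keep climbing between checks are the stronger signal.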

When to Replace Your NIC?

It may be time to replace your NIC if:

You observe consistent or worsening symptoms outlined above.
Logs and tools confirm hardware or driver issues that persist after driver updates or firmware reinstallation.
The issue follows the NIC when moved to another system (if removable).
The card is outdated and unsupported by the current OS or clustering tools.
You are in a highly available (HA) environment where continuity of service is critical. In these cases, it is best practice to proactively move services or resources to nodes with verified healthy NICs while you troubleshoot, rather than risk a delayed failover or unexpected downtime.

Preventative Measures to Avoid Network Card Failures

To avoid NIC-related failures:

Use redundancy: Implement bonding or teaming across multiple NICs.
Keep drivers and firmware up to date: Periodically check for driver and firmware updates from your hardware vendor.
Monitor proactively: Use tools and third-party network monitoring to catch early signs of NIC degradation.
Regular testing: Validate link speed and latency as part of regular cluster health checks.
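
For proactive monitoring on Linux, one simple approach is to compare interface error and drop counters between two points in time. The sketch below assumes the standard /proc/net/dev layout (two header lines, then eight receive and eight transmit fields per interface); the function names and the sampling interval are illustrative choices.

```python
import time


def parse_proc_net_dev(text):
    """Extract rx/tx error and drop counters per interface from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue
        values = [int(x) for x in fields]
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters


def counter_growth(before, after):
    """Return, per interface, only the counters that increased between snapshots."""
    growth = {}
    for iface, now in after.items():
        prev = before.get(iface)
        if prev is None:
            continue
        delta = {key: now[key] - prev[key] for key in now if now[key] > prev[key]}
        if delta:
            growth[iface] = delta
    return growth


if __name__ == "__main__":
    with open("/proc/net/dev") as handle:
        first = parse_proc_net_dev(handle.read())
    time.sleep(60)  # sampling interval chosen arbitrarily for the example
    with open("/proc/net/dev") as handle:
        second = parse_proc_net_dev(handle.read())
    print(counter_growth(first, second))
```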

Final Thoughts on Maintaining Network Interface Card Health

The NIC may not be the most glamorous piece of hardware, but its health is critical to a stable, highly available environment. Knowing when and how to assess a network card’s performance helps prevent unexpected downtime, ensures seamless failover behavior, and keeps your cluster communication resilient.

SIOS Technology Corporation provides high availability cluster software that protects & optimizes IT infrastructures with cluster management for your most important applications. Request a demo today.

Author: Aidan Macklen, Customer Experience Engineer Intern at SIOS Technology Corp.

Reproduced with permission from SIOS

SIOS Technology to Demonstrate High Availability Clustering Software for Mission-Critical Applications at Red Hat Summit, Milestone Technology Day and XPerience Day, and SQLBits 2025

May 18, 2025 by Jason Aw

All practitioners are invited to provide input on high availability and disaster recovery trends as SIOS gathers insights for its 2025 HA/DR Practices Survey Report

SAN MATEO, Calif. – May 6, 2025 – SIOS Technology Corp., a leading provider of application high availability (HA) and disaster recovery (DR) solutions, today announced it will demonstrate its high availability clustering software for business-critical applications at four leading technology events this spring. SIOS also announced that it is inviting all IT practitioners to participate in its newly launched 2025 HA/DR Practices Survey, designed to gather insights into current trends, challenges, and strategies for ensuring application uptime and data protection.

Milestone Technology Day 2025 Benelux – May 8, 2025 – Eindhoven-Best, Netherlands
Red Hat Summit – May 19–22, 2025 – Boston, MA – Booth #854
Milestone XPerience Day – June 4, 2025 – London, UK
SQLBits 2025 – June 18–21, 2025 – ExCeL London, UK

At each event, SIOS experts will demonstrate how SIOS LifeKeeper and DataKeeper software provide high availability and disaster recovery for critical applications like SQL Server, SAP, and Oracle. Attendees will learn how SIOS clustering software ensures application uptime, eliminates data loss, and simplifies HA/DR across physical, virtual, cloud, and hybrid environments.

SIOS clustering software enables IT teams to create highly available application environments without the need for shared storage. Through intelligent application monitoring, real-time data replication, and automated failover and recovery, SIOS ensures business continuity with minimal complexity and reduced cost. With support for Windows and Linux in any infrastructure, SIOS solutions are trusted by enterprises worldwide to protect mission-critical operations.

SIOS Launches Survey to Gather Insights on HA/DR Practices

As part of its commitment to advancing resilience strategies in the enterprise, SIOS is launching its 2025 HA/DR Practices Survey to collect insights into the challenges, priorities, and real-world strategies used by IT professionals to ensure application uptime and data protection. The results will be compiled into the SIOS 2025 State of High Availability and Disaster Recovery Report, providing valuable benchmarks for the industry.

All practitioners, including attendees of the Red Hat Summit, Milestone Technology Day, Milestone XPerience Day, and SQLBits, are invited to participate in the survey here.

# # #

About SIOS Technology Corp.

SIOS Technology Corp. high availability and disaster recovery solutions ensure availability and eliminate data loss for critical Windows and Linux applications operating across physical, virtual, cloud, and hybrid cloud environments. SIOS clustering software is essential for any IT infrastructure with applications requiring a high degree of resiliency, ensuring uptime without sacrificing performance or data – protecting businesses from local failures and regional outages, planned and unplanned. Founded in 1999, SIOS Technology Corp. (https://us.sios.com) is headquartered in San Mateo, California, with offices worldwide.

SIOS, SIOS Technology, SIOS DataKeeper, SIOS LifeKeeper and associated logos are registered trademarks or trademarks of SIOS Technology Corp. and/or its affiliates in the United States and/or other countries. All other trademarks are the property of their respective owners.

Media Contact:

Beth Winkowski
Winkowski Public Relations, LLC for SIOS
978-649-7189
bethwinkowski@US.SIOS.com

Reproduced with permission from SIOS

Application Intelligence in Relation to High Availability

May 12, 2025 by Jason Aw

Application Intelligence in the context of High Availability (HA) refers to the system’s ability to understand and respond intelligently to the behavior and health of applications in real time to maintain continuous service availability.

What is Application Intelligence?

Application intelligence involves monitoring, analyzing, and reacting to several factors: application state (whether the application is up or down); performance metrics such as response time, error rates, throughput, and memory usage; application dependencies, such as databases or external services; and user behavior or patterns. This gives a more holistic view of the application, using various data points to make educated decisions about the state of the application itself, not just the infrastructure. Take the example of a web server: it is not enough to know that the server is running. Is the site accessible without any errors? Is the response slow? Are users refreshing multiple times while trying to access it? Is the database the website relies on also up, running, and accessible? These are all examples of the factors that application intelligence weighs.
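
The web server example can be sketched as a small decision function that combines several observations into one verdict. This is only an illustration of the idea, not how any particular product evaluates health; the field names, state labels, and the two-second threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    process_running: bool
    http_status: Optional[int]         # None when no HTTP response was received
    response_seconds: Optional[float]  # None when no response was received
    database_reachable: bool


def assess(obs, slow_after_seconds=2.0):
    """Classify application health from several signals, not just process state."""
    if not obs.process_running:
        return "down"
    if obs.http_status is None or obs.http_status >= 500:
        return "failed"    # process is up, but the site is not serving requests
    if not obs.database_reachable:
        return "degraded"  # a dependency is missing
    if obs.response_seconds is not None and obs.response_seconds > slow_after_seconds:
        return "degraded"  # serving, but too slowly
    return "healthy"


if __name__ == "__main__":
    print(assess(Observation(True, 200, 0.12, True)))
    print(assess(Observation(True, 503, 0.05, True)))
```

An infrastructure-only check would report both observations above as "up", which is exactly the gap application intelligence is meant to close.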

How LifeKeeper Uses Application Intelligence

So, how does LifeKeeper use application intelligence to enhance high availability for critical applications? LifeKeeper uses application-specific recovery kits (ARKs) that contain knowledge of each application (SAP, SQL, PostgreSQL, Oracle, etc.). This allows LifeKeeper to handle the startup and shutdown procedures of each application, monitor the health and status of both the application and its dependencies, and orchestrate intelligent failover and failback operations without corrupting any data. Users can group related resources in a hierarchical relationship within LifeKeeper, which allows LifeKeeper to understand the dependencies between application components (when a service relies on an IP or database, for example). This ensures that LifeKeeper failovers happen in the correct order and that recovery actions don’t break the application or leave it in an inconsistent state.
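
The idea of deriving a safe start order from a resource hierarchy can be illustrated with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+) and a made-up hierarchy; it is not LifeKeeper's implementation or data model, only a generic example of dependency ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical hierarchy: each resource maps to the resources it depends on.
DEPENDS_ON = {
    "application": {"database", "virtual_ip"},
    "database": {"filesystem"},
    "filesystem": set(),
    "virtual_ip": set(),
}


def start_order(depends_on):
    """Return resources ordered so that every dependency starts first."""
    return list(TopologicalSorter(depends_on).static_order())


def stop_order(depends_on):
    """Stopping reverses the start order, so dependents stop before dependencies."""
    return list(reversed(start_order(depends_on)))


if __name__ == "__main__":
    print("start:", start_order(DEPENDS_ON))
    print("stop: ", stop_order(DEPENDS_ON))
```

Resources with no dependency between them, such as the filesystem and the virtual IP here, may appear in either relative order.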

Additionally, LifeKeeper performs deep health checks. It does not just determine whether the server is up; it also runs more detailed checks, such as whether a database is accepting connections or whether a web service is returning expected responses. It can even monitor whether expected background processes are running. LifeKeeper also uses application-specific configuration files to keep configuration consistent across nodes and to ensure that application settings are preserved or restored correctly. Lastly, LifeKeeper can use custom scripts to fine-tune these deep checks, which lets it intelligently support less common or homegrown applications as well.
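
A custom check for "is the service accepting connections" can be as small as a TCP connection attempt. The sketch below is a generic example using Python's socket module; the function name, host, and port are placeholders, and it is not taken from any LifeKeeper script.

```python
import socket


def accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 5432 is the default PostgreSQL port; adjust for the service being checked.
    print(accepts_connections("127.0.0.1", 5432))
```

A successful connection only shows that something is listening. A deeper check would go on to issue a query or request and validate the response.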

PostgreSQL ARK: A Real-World Example of Application Intelligence

To take a deeper dive, consider how the PostgreSQL ARK uses application intelligence. The ARK uses PostgreSQL-specific logic to monitor, start, stop, and fail over the database. That logic draws on knowledge of the PostgreSQL startup and shutdown commands, awareness of critical configuration files such as postgresql.conf and pg_hba.conf, and an understanding of the data directory layout and lock file behavior.

Intelligent Monitoring and Ordered Failover for PostgreSQL

Additionally, the ARK doesn’t just check that PostgreSQL is running. It also checks whether the database is responding to queries, whether the correct data directory is accessible, and whether there is any corruption in the transaction logs. It uses dependency tracking to make sure that the resources PostgreSQL depends on, such as the virtual IP for client connections and the mounted storage for its data directory, are available. This lets LifeKeeper bring up resources in the correct order during a failover: mounting the disk first, bringing up the IP, then starting PostgreSQL, and finally verifying service health.
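
The ordered bring-up described here can be sketched as a list of steps that run in sequence and are undone in reverse if one fails. This is a simplified illustration, not the ARK's actual logic; the step names and the start and stop callables are placeholders for whatever would really mount storage, assign the IP, and start the database.

```python
def bring_up(steps):
    """Run (name, start, stop) steps in order; on failure, stop what was started.

    Each start callable returns True on success. Returns (ok, log).
    """
    started = []
    log = []
    for name, start, stop in steps:
        if start():
            log.append(f"started {name}")
            started.append((name, stop))
            continue
        log.append(f"failed {name}")
        # Roll back in reverse so dependents are stopped before dependencies.
        for done_name, done_stop in reversed(started):
            done_stop()
            log.append(f"stopped {done_name}")
        return False, log
    return True, log


if __name__ == "__main__":
    # Placeholder callables; the final step stands in for a service health check.
    steps = [
        ("disk", lambda: True, lambda: None),
        ("virtual_ip", lambda: True, lambda: None),
        ("postgresql", lambda: True, lambda: None),
        ("health_check", lambda: True, lambda: None),
    ]
    ok, log = bring_up(steps)
    print(ok, log)
```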

Preventing Split-Brain and Ensuring Data Integrity

Lastly, LifeKeeper uses application intelligence to avoid split-brain scenarios, in which more than one node believes it is the primary. It does so by refusing to start two active PostgreSQL servers against the same data directory, and it avoids data corruption by not failing over while writes are still in progress. These are examples of the ways LifeKeeper and its ARKs apply application intelligence to make the combined product as resilient as possible.

Strengthen Application Resilience with Intelligent High Availability

In summary, LifeKeeper’s built-in application intelligence enables precise, fast, and reliable failover and recovery by understanding how applications behave and what they need to run correctly.

Ensure application resilience and uninterrupted service—request a demo or start your free trial today to experience how SIOS LifeKeeper uses application intelligence to protect your critical workloads.

Author: Cassy Hendricks-Sinke, Principal Software Engineer, Team Lead

Reproduced with permission from SIOS

Transitioning from VMware to Nutanix

May 8, 2025 by Jason Aw

10 Considerations for Choosing a High Availability Solution in a Nutanix Environment

If you’re planning a move from VMware to Nutanix, making sure your critical applications stay up and running should be at the top of your list. While Nutanix offers great benefits like simplified management and better performance, its built-in high availability only covers the virtual machine—not the applications themselves. This paper shares ten key insights to help you plan ahead and avoid downtime during and after your migration. You’ll get practical guidance on choosing the right clustering solutions for both Windows and Linux, how to handle shared storage in Nutanix, and what to consider if you’re running a mix of operating systems. Whether you’re moving to Nutanix AHV, or managing hybrid environments, learn how to simplify your HA strategy, reduce risk, and keep your most important systems protected.

Reproduced with permission from SIOS

Are my servers disposable? How High Availability software fits in cloud best practices

May 2, 2025 by Jason Aw

In this VMblog article, “Are my servers disposable? How High Availability software fits in cloud best practices,” Philip Merry, a software engineer at SIOS Technology, explores how the shift to cloud computing has changed the role and perception of servers in modern IT environments. With the rise of automation and infrastructure-as-code, servers have become increasingly disposable, easily created, destroyed, and replaced, aligning with cloud best practices like those outlined in the AWS Well-Architected Framework. However, Merry emphasizes that while infrastructure can be treated as temporary, the applications running on it remain critical and must be continuously available. To bridge this gap, high availability (HA) software plays a vital role, allowing IT teams to maintain uptime and reliability by decoupling application continuity from the underlying server hardware. This approach empowers organizations to embrace the flexibility of cloud environments without compromising on the stability and performance of their essential applications.

Author: Beth Winkowski, SIOS Technology Corp. Public Relations

Reproduced with permission from SIOS