SIOS SANless clusters


Think Before You Script: Best Practices for Gen/App Recovery

September 20, 2025 by Jason Aw


SIOS Recovery Kits provide a wealth of best practices for application-aware monitoring and recovery.  In general, each SIOS Recovery Kit provides a step-by-step programmatic approach to restoring the application, database, or service in accordance with High Availability (HA) best practices.  The SIOS Recovery Kits provide the intelligence needed to restore operation after a normal system shutdown, after an unexpected system failure or crash, and even in cases where the application, database, or service itself crashes or becomes unavailable.  In addition, each recovery kit includes experiential wisdom and improvements gained from over two decades in the field.

However, if a customer still needs to roll their own script for providing HA, SIOS LifeKeeper for Windows and SIOS LifeKeeper for Linux include an option for script integration via the Generic Application (Gen/App) Recovery Kit.

Best Practices for Writing Gen/App Recovery Scripts

1. Use Modern, Supported Scripting Languages for Gen/App Recovery

A common practice is to carry old scripts forward onto new systems and architectures.  However, it is essential to make sure you are using a modern, supported scripting language.

2. Avoid Hardcoded Values in Gen/App Scripts

Hardcoded values cause portability issues as well as long-term maintenance challenges.  Avoid hardcoding values that are likely to change in future deployments, such as directory paths, user names, or similar settings.
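
As a minimal sketch (the paths, variable names, and start command below are hypothetical), environment-specific values can be supplied as script parameters or environment variables rather than embedded in the script:

```bash
#!/bin/bash
# Hypothetical restart helper: the application home and service account are
# passed in by the caller instead of being hardcoded into the script.
APP_HOME="${1:?Usage: $0 <app_home> [app_user]}"
APP_USER="${2:-appsvc}"

if [ ! -d "$APP_HOME" ]; then
    echo "ERROR: application home '$APP_HOME' not found" >&2
    exit 1
fi

# Run the start command as the configured service account.
su - "$APP_USER" -c "$APP_HOME/bin/start.sh"
```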

3. Practice Code Reuse to Improve Gen/App Script Quality

Duplicate code is a common problem in customer-developed scripts.  Duplicate code creates quality, maintenance, and troubleshooting problems.  Practice code reuse, such as inheritance, functions, and subroutines.
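
As an illustration, a single shared helper can replace duplicated start/stop/status logic; the service name below is a placeholder:

```bash
#!/bin/bash
# Reusable helper: run a command, log what happened, and return its exit status.
run_logged() {
    local description="$1"; shift
    echo "$(date '+%F %T') INFO  ${description}"
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "$(date '+%F %T') ERROR ${description} failed (rc=${rc})" >&2
    fi
    return "$rc"
}

# The same helper serves start, stop, and status paths, so the logging and
# error handling are written once rather than duplicated three times.
run_logged "starting application" systemctl start myapp.service
run_logged "checking application" systemctl is-active --quiet myapp.service
```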

4. Choose Meaningful Names for Functions and Variables

Descriptive variable names are more helpful than single-character names such as ‘n’ or ‘i’.  When looking at code months or years later, will the variable ‘n’ mean as much as iReturnCode?
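
A brief illustration of the difference (the start command shown is a placeholder):

```bash
# Hard to interpret months later:
#   /opt/myapp/bin/start.sh; n=$?; [ $n -ne 0 ] && exit $n

# Self-describing when revisited later:
/opt/myapp/bin/start.sh
iReturnCode=$?
if [ "$iReturnCode" -ne 0 ]; then
    echo "myapp start failed (rc=${iReturnCode})" >&2
    exit "$iReturnCode"
fi
```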

5. Remove Unused Functions and Variables to Prevent Code Bloat

While meaningful names for functions and variables are important, avoid cluttering the code with unused variables and functions.  Declaring variables and not using them creates confusion during future updates and troubleshooting.  While the days of 8 MB of memory are long gone, variables or functions that provide limited reuse or no additional value are still burdensome and create code bloat.

6. Verify All Input Parameters for Reliable Gen/App Execution

In the rush to get something working, don’t ignore input variable validation.  Verify all input to the script and to functions.  Don’t assume that if “we got here,” all of our inputs are valid.
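
A minimal sketch of the pattern, assuming a script that takes a resource tag and an action as arguments (the names are hypothetical):

```bash
#!/bin/bash
# Validate every input before acting on it; never assume the caller was correct.
TAG_NAME="$1"
ACTION="$2"

if [ -z "$TAG_NAME" ] || [ -z "$ACTION" ]; then
    echo "Usage: $0 <tag_name> <start|stop|status>" >&2
    exit 2
fi

case "$ACTION" in
    start|stop|status) ;;   # accepted values; anything else is rejected below
    *) echo "ERROR: unknown action '$ACTION'" >&2; exit 2 ;;
esac
```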

7. Log Helpful and Actionable Messages

Consider what output needs to be logged for status/progress, error conditions, or troubleshooting.  Each message should be thoughtfully considered and appropriately worded to provide helpful feedback to operators and future developers.
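
A small logging helper along these lines (the log file path and messages are examples only):

```bash
#!/bin/bash
# Minimal logging helper: timestamp, severity, and a message that tells the
# operator what happened and what to check next.
LOGFILE="${LOGFILE:-/var/log/genapp_myapp.log}"

log() {
    local level="$1"; shift
    printf '%s %-5s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$level" "$*" | tee -a "$LOGFILE"
}

log INFO  "restore starting for myapp on $(hostname)"
log ERROR "myapp did not respond on port 8080 after 30s; check the application log before retrying"
```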

8. Check Return Codes on All Method/Function/API Calls and Take Defensive Action

Commands executed within the body of a script or function return codes that indicate pass, fail, or some other condition.  Be sure to check, log, and properly handle both expected and unexpected return codes from methods, functions, and API calls.
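
A hedged sketch of the idea (the service name is a placeholder):

```bash
#!/bin/bash
# Check, log, and handle the return code of every external command.
systemctl start myapp.service
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "myapp started successfully"
else
    echo "ERROR: 'systemctl start myapp.service' returned rc=${rc}" >&2
    # Defensive action: surface the failure to the HA framework rather than
    # continuing as if the start had succeeded.
    exit 1
fi
```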

9. Use Defensive Programming Techniques

Apply best practices for defensive programming, including least privilege access, input validation, error handling, etc.
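
A few shell-level defensive idioms, shown as a sketch rather than a complete policy (the control command and account names are hypothetical):

```bash
#!/bin/bash
set -u              # treat unset variables as errors
set -o pipefail     # a failing command anywhere in a pipeline fails the pipeline

# Fail fast on missing prerequisites instead of discovering them mid-recovery.
command -v myapp_ctl >/dev/null 2>&1 \
    || { echo "ERROR: myapp_ctl not found in PATH" >&2; exit 1; }

# Least privilege: verify the expected service account exists before using it,
# and run application commands as that account rather than as root.
id -u appsvc >/dev/null 2>&1 \
    || { echo "ERROR: service account 'appsvc' does not exist" >&2; exit 1; }
```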

10. Test Gen/App Recovery Scripts Beyond the Happy Path

Working code is not enough.  Develop a robust validation plan and test the code extensively, especially beyond the happy path when everything is expected to work.

11. Use Version Control for Script Management and Troubleshooting

Use version control and code management tools.  Version control is essential for troubleshooting, management, and tracking the inevitable fixes required for your scripts.

12. Catch Errors Early with Code Inspections and Peer Reviews

Use code inspections and peer reviews to increase the resilience and robustness of the code.  Code reviews help find problems early and reduce the cost, risk, and burden of late-stage failures and bugs.

13. Verify Permissions Required for Execution in Gen/App Recovery

Having well-organized, modern, reviewed, inspected, tested, and controlled code is an essential part of a well-crafted gen/app script.  However, the best-coded script will fail to execute if it does not have the right permissions.  Ensure that the script has the correct permissions to execute standalone as well as under the service/user accounts of the HA solution.
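
For example (the path and the ‘ha_svc’ account are illustrative only):

```bash
# Set ownership and execute permission on the script.
chown root:ha_svc /opt/myapp/ha/restart_myapp.sh
chmod 750 /opt/myapp/ha/restart_myapp.sh

# Verify it runs standalone...
/opt/myapp/ha/restart_myapp.sh status

# ...and under the account the HA software will actually use to invoke it.
sudo -u ha_svc /opt/myapp/ha/restart_myapp.sh status
```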

14. Comment Code Clearly to Explain Logic and Business Use Cases

Provide comments that help explain the business logic and use case, describe expected function inputs and returns, and contribute to overall understanding.  Well-written code still needs comments, especially if it is not obvious what business logic or requirement is being addressed.  An example comment block could look like:
Name:

Purpose:

Preconditions:

Postconditions:

Returns:
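
In shell comment syntax, such a header might look like the following (the contents are illustrative):

```bash
#-----------------------------------------------------------------------
# Name:           restart_myapp.sh
# Purpose:        Restart the myapp service after a local failure is
#                 detected, then confirm it is accepting requests.
# Preconditions:  myapp is installed under APP_HOME, its data volume is
#                 mounted, and the script runs with sufficient privileges.
# Postconditions: myapp is running and answering on its service port, or
#                 the failure has been logged and the script exits non-zero.
# Returns:        0 on success, 1 on restart failure, 2 on invalid input.
#-----------------------------------------------------------------------
```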

Ready to Simplify Gen/App Recovery with Confidence?

Don’t leave high availability to chance. With SIOS LifeKeeper and the Generic Application (Gen/App) Recovery Kit, you can safeguard critical applications, streamline recovery, and reduce downtime.

Request a demo today to see how SIOS can help you achieve reliable, cost-effective high availability and disaster recovery.

Author: Cassius Rhue, VP, Customer Experience at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

How to Assess if My Network Card Needs Replacement

May 21, 2025 by Jason Aw


A network interface card (NIC), often referred to as a network card, is a vital component of any server infrastructure. It enables systems in a cluster to communicate with each other and the outside world. If your NIC is experiencing issues, it can compromise the health of your cluster, lead to false node failures, or increase the risk of split-brain scenarios. Recognizing the signs of a failing NIC early can save time, reduce downtime, and maintain high availability.

In this blog, we’ll explore how to assess whether your network card needs replacement, the symptoms to look out for, and the tools that can aid you in diagnosing the issue.

Common Symptoms of a Failing NIC

1. Intermittent Connectivity

One of the first signs of NIC failure is unstable or sporadic connectivity. You may notice dropped packets, high latency, or difficulty reaching external hosts. These issues can cause nodes in a LifeKeeper cluster to temporarily lose connection and trigger unnecessary failovers.

2. Degraded Network Speed

If a system is underperforming on network-related tasks such as slow replication, sluggish application response, or delayed heartbeat communication, it may be due to a faulty NIC that is no longer operating at its rated speed (e.g., 1 Gbps vs. 10 Gbps). In clustered environments, slow replication is especially concerning because it delays data synchronization between nodes. This not only increases recovery time in the event of a failover but also raises the risk of data loss or inconsistent state across systems if a complete failure occurs before the replication finishes.

3. System Logs Showing Network Errors

Frequent kernel or system log messages related to the NIC driver or interface, such as “link down,” “NIC reset,” or “device not responding,” are red flags. These messages indicate the OS is having trouble communicating with the card at a hardware or driver level.

4. Unusual Heat or Physical Damage

While not common, physical inspection may reveal damage such as scorch marks or excessive heat emission. Hardware issues at this level can quickly deteriorate performance or cause complete failures, which is certainly not desirable in any environment.

5. Issues in Virtual or Cloud Environments

In virtualized and cloud environments, NIC behavior can be affected not just by the underlying hardware but also by the configuration of the hypervisor or virtual networking layer. For example, virtual NICs assigned through VMware or Hyper-V may show degraded performance if incompatible/outdated drivers are used, or even if the VM is assigned an adapter type that is not optimized for the desired workload.

Network Card Troubleshooting Tools for Windows and Linux

Diagnosing NIC issues early helps minimize downtime and prevent unnecessary failovers. The following are essential tools for identifying hardware- or driver-related NIC issues, including options for both Linux and Windows environments; a combined Linux command-line sketch follows the list:

  • ethtool (Linux):
    Use this to view NIC statistics, driver information, and up-to-date link status. A high number of transmit/receive errors, dropped packets, or failed auto negotiations could indicate a deteriorating NIC.
  • PowerShell cmdlets (Windows):
    Get-NetAdapter and Get-NetAdapterStatistics allow you to inspect link status, speed, and adapter health on Windows systems. Combined with Get-NetEventSession, you can also track event logs related to NIC behavior over time.
  • dmesg / journalctl (Linux) or Event Viewer (Windows):
    These tools help uncover system or kernel-level alerts. Look for messages such as “NIC reset,” “link down,” or “device not responding.” In Windows, these might appear under “System” or “Application”  logs and indicate driver crashes or hardware unresponsiveness.
  • ping / iperf (Cross platform):
    Useful for testing basic connectivity and throughput. If packet loss, jitter, or unexpected latency spikes occur during tests, it could point to faulty hardware or cabling.
  • Network Bonding Failover Behavior:
    When using bonded or teamed interfaces for redundancy, observe whether one interface is triggering failover events more frequently than the others. This could mean the failing NIC is silently degrading, even if no system errors are reported.
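
The sketch below pulls several of these Linux checks together; the interface name, peer address, and iperf3 server are placeholders to replace with your own values:

```bash
# Quick Linux-side health pass for a suspect interface.
ip -s link show eth0                        # link state plus RX/TX error and drop counters
ethtool eth0                                # negotiated speed, duplex, and link status
ethtool -S eth0 | grep -Ei 'err|drop|crc'   # driver-level error statistics (if supported)
dmesg | grep -i eth0 | tail -n 20           # recent kernel messages: resets, link flaps
ping -c 100 -i 0.2 192.0.2.10               # packet loss and latency to a cluster peer
iperf3 -c 192.0.2.10 -t 30                  # sustained throughput against an iperf3 server
```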

When to Replace Your NIC?

It may be time to replace your NIC if:

  • You observe consistent or worsening symptoms outlined above.
  • Logs and tools confirm hardware or driver issues that persist after driver updates or firmware reinstallation.
  • The issue follows the NIC when moved to another system (if removable).
  • The card is outdated and unsupported by the current OS or clustering tools.
  • You are in a highly available (HA) environment where the continuity of service is critical. In these cases, it is especially best practice to proactively move services or resources to nodes with verified healthy NICs while troubleshooting to avoid risking a failover delay or unexpected downtime.

Preventative Measures to Avoid Network Card Failures

To avoid NIC-related failures:

  • Use redundancy: Implement bonding or teaming across multiple NICs.
  • Keep firmware up to date: Periodically check for driver and firmware updates from your hardware vendor.
  • Monitor proactively: Use tools and third-party network monitoring to catch early signs of NIC degradation.
  • Regular testing: Validate link speed and latency as part of regular cluster health checks.

Final Thoughts on Maintaining Network Interface Card Health

The NIC may not be the most glamorous piece of hardware, but its health is critical to a stable, highly available environment. Knowing when and how to assess a network card’s performance helps prevent unexpected downtime, ensures seamless failover behavior, and keeps your cluster communication resilient.

SIOS Technology Corporation provides high availability cluster software that protects & optimizes IT infrastructures with cluster management for your most important applications. Request a demo today.

Author: Aidan Macklen, Customer Experience Engineer Intern at SIOS Technology Corp.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability

Application Intelligence in Relation to High Availability

May 12, 2025 by Jason Aw


Application Intelligence in the context of High Availability (HA) refers to the system’s ability to understand and respond intelligently to the behavior and health of applications in real time to maintain continuous service availability.

What is Application Intelligence?

So, what is Application Intelligence? Application intelligence involves monitoring, analyzing, and reacting to several factors: application state (whether the application is up or down), performance metrics (response time, error rates, throughput, and memory usage), application dependencies (such as databases or external services), and user behavior or usage patterns. Application Intelligence takes a more holistic view of the application, using these data points to make educated decisions about the state of the application itself, not just the infrastructure. Take the example of a web server: it is not enough to know that the server is running. Is the site accessible without errors? Is the response slow? Are users refreshing multiple times while trying to access it? Is the database the website relies on also up, running, and accessible? All of the above are factors that application intelligence considers in order to be effective.

How LifeKeeper Uses Application Intelligence

So, how does LifeKeeper use application intelligence to enhance high availability for critical applications? Let’s break it down.  LifeKeeper uses application-specific recovery kits (ARKs) that contain knowledge for each application (SAP, SQL, PostgreSQL, Oracle, etc.). This allows LifeKeeper to handle the startup/shutdown procedures of each application, monitor the health and status of both the application and any dependencies, as well as orchestrate intelligent failover/failback operations without corrupting any data. Users can group together related resources in a hierarchical relationship within LifeKeeper, which allows LifeKeeper to understand the dependencies between different application components (when a service relies on an IP or database, for example). This ensures LifeKeeper failovers happen in the correct order and recovery actions don’t break the application or leave it in an inconsistent or broken state.

Additionally, LifeKeeper does deep health checks, not just determining if the server is up, but also more detailed checks, such as whether a database is accepting connections or if a web service is returning expected responses. It can even monitor if certain expected background processes are running. LifeKeeper also uses application-specific configuration files to ensure data configuration consistency across nodes and that application settings are preserved or restored correctly.  Lastly, LifeKeeper has the ability to use custom scripts to further fine-tune these deep checks to support less common or homegrown applications intelligently as well.

PostgreSQL ARK: A Real-World Example of Application Intelligence

To take a deeper dive, we can look at how the PostgreSQL ARK uses Application Intelligence.  The PostgreSQL ARK uses specific logic to monitor, start, stop, and fail over PostgreSQL via knowledge of the specific PostgreSQL startup and shutdown commands, awareness of critical config files like postgresql.conf and pg_hba.conf, and understanding of the data directory layout and lock file behavior.

Intelligent Monitoring and Ordered Failover for PostgreSQL

Additionally, it doesn’t just check that PostgreSQL is running; it also checks whether the database is responding to queries, whether the correct data directory is accessible, and whether there is any corruption in the transaction logs.  It uses dependency tracking to make sure that the resources PostgreSQL depends on are available, such as the virtual IP for client connections and the mounted storage for its data directory.  This ensures that LifeKeeper can bring up resources in the correct order in case of a failover: mounting the disk first, bringing up the IP, and then starting PostgreSQL before verifying service health.
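
As a hedged illustration of this kind of deep check, a script using standard PostgreSQL client tools (not the ARK’s internal implementation) might do something like the following; the data directory and port are assumptions:

```bash
#!/bin/bash
# Deep health check: not just "is a postgres process running", but
# "is the database reachable and answering queries from its expected data directory".
PGDATA="${PGDATA:-/var/lib/pgsql/data}"
PGPORT="${PGPORT:-5432}"

# 1. Is the server accepting connections?
pg_isready -p "$PGPORT" -q \
    || { echo "ERROR: PostgreSQL is not accepting connections" >&2; exit 1; }

# 2. Does it answer a trivial query?
psql -p "$PGPORT" -d postgres -Atc 'SELECT 1;' >/dev/null 2>&1 \
    || { echo "ERROR: PostgreSQL connected but did not answer a query" >&2; exit 1; }

# 3. Is the expected data directory present and in use?
[ -f "$PGDATA/postmaster.pid" ] \
    || { echo "ERROR: no postmaster.pid found under $PGDATA" >&2; exit 1; }

echo "PostgreSQL health check passed"
```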

Preventing Split-Brain and Ensuring Data Integrity

Lastly, LifeKeeper uses application intelligence to avoid split-brain scenarios (a phenomenon where more than one node thinks it is the ‘primary’ node) by never starting two active PostgreSQL servers against the same data directory, and it avoids data corruption by not failing over while writes are still in progress. These are examples of the different ways LifeKeeper and the various ARKs implement application intelligence to make the combined product as resilient as possible.

Strengthen Application Resilience with Intelligent High Availability

In summary, LifeKeeper’s built-in application intelligence enables precise, fast, and reliable failover and recovery by understanding how applications behave and what they need to run correctly.

Ensure application resilience and uninterrupted service—request a demo or start your free trial today to experience how SIOS LifeKeeper uses application intelligence to protect your critical workloads.

Author: Cassy Hendricks-Sinke, Principal Software Engineer, Team Lead

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

High Availability & the Cloud: The More You Know

October 25, 2021 by Jason Aw


While researching reasons to migrate to the cloud, you’ve probably learned that the benefits of cloud computing include scalability, reliability, availability, and more. But what, exactly, do those terms mean? Let’s consider high availability (HA), as it is often the ultimate goal of moving to the cloud for many companies.

The idea is to make your products, services, and tools accessible to your customers and employees at any time from anywhere using any device with an internet connection. That means ensuring your critical applications are operational – even through hardware failures, software issues, human errors, and sitewide disasters – at least 99.99% of the time (that’s the definition of high availability).
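
For a sense of scale, that 99.99% target is a small downtime budget: (1 − 0.9999) × 365.25 days × 24 hours × 60 minutes ≈ 52.6 minutes of allowable downtime per year, or roughly 4.4 minutes per month.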

While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware. There are many reasons for application downtime that aren’t covered by SLAs. For this reason, you need to protect these applications with clustering software that will detect issues and reliably move operations to a standby server if necessary. As you plan what and how you will make solutions available in the cloud, remember that it is important that your products and services and cloud infrastructure are scalable, reliable, and available when and where they are needed.

Quick Stats on High Availability in the Cloud in 2021

Now that we’ve defined availability in the cloud context, let’s look at its impact on organizations and businesses. PSA, these statistics may shock you, but don’t fret. We’ve also got some solutions to these pressing and costly issues.

  1. As much as 80% of Enterprise IT will move to the cloud by 2025 (Oracle).
  2. The average cost of IT downtime is between $5,600 and $11,600 per minute (Gartner; Comparitech).
  3. Average IT staffing to employee ratio is 1:27 (Ecityworks).
  4. 22% of downtime is the result of human error (Cloudscene).
  5. In 2020, 54% of enterprises’ cloud-based applications moved from an on-premises environment to the cloud, while 46% were purpose-built for the cloud (Forbes).
  6. 1 in 5 companies don’t have a disaster recovery plan (HBJ).
  7. 70% of companies have suffered a public cloud data breach in the past year (HIPAA).
  8. 48% of businesses store classified information on the cloud (Panda Security).
  9. 96% of businesses experienced an outage in a 3-year period (Comparitech).
  10.  45% of companies reported downtime from hardware failure (PhoenixNAP).

What You Can Do – Stay Informed

If you are interested in learning the fundamentals of availability in the cloud or hearing about the latest developments in application and database protection, join us. The SIOS Cloud Availability Symposium is taking place Wednesday, September 22nd (EMEA) and Thursday, September 23rd (US) in a global virtual conference format for IT professionals focusing on the availability needs of the enterprise IT customer. This event will deliver the information you need on application high availability clustering, disaster recovery, and protecting your applications now and into the future.

Cloud Symposium Speakers & Sessions Posted

We have selected speakers presenting a wide range of sessions supporting availability for multiple areas of the data application stack. Check out the sessions posted and check back for additional presentations to be announced! Learn more

Register Now

Whether you are interested in learning the fundamentals of availability in the cloud or hearing about the latest developments in application and database protection, this event will deliver the information you need on application high availability clustering, disaster recovery, and protecting your applications now and into the future.

Register now for the SIOS Cloud Availability Symposium.

Reproduced from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, Cloud, cloud migration, disaster recovery, High Availability

Beginning Well is Great, But Maintaining Uptime Takes Vigilance

September 28, 2021 by Jason Aw


Author Isabella Poretsis states, “Starting something can be easy, it is finishing it that is the highest hurdle.” It is great to have a kickoff meeting.  It is invigorating and exciting. Managers and leaders look out at the greenfield with excitement, and optimism is high.  But this moment of kickoff, and even the champagne-popping moment of a successful deployment, are just the beginning. Maintaining uptime requires ongoing vigilance.

High availability and the elusive four nines of uptime for your critical applications and databases aren’t momentary occurrences, but rather, a constant endeavor to end the little foxes that destroy the vineyard.  Staying abreast of threats, up-to-date on the updates, and properly trained and prepared is the work from which your team “is never entitled to take a vacation.”

For those who want to stay vigilant in maintaining uptime, here are five tips:

1. Monitor the Environment 

Very little in enterprise software still follows the “set it and forget it” mindset.  Everything, since the day you uncorked the grand opening champagne to now, has been moving toward a state of decline.  If you aren’t monitoring the servers, workloads, network traffic, and hardware (virtual or physical), you may lose uptime and stability.

2. Perform Maintenance

One thing that I have noticed in more than twenty years of software development and services is that all software comes with updates.  Apply them.  Remember to execute sound maintenance policies, including taking and verifying backups. One tech writer suggested that the only update you regret is the one you failed to make.

3. Learn Continuously

My first introduction to high availability came when, as an intern fresh from the CE-211 lab, I unplugged one end of the Token Ring for a server in our lab.  The administrator was in my face in minutes.  After an earful, he gave me an education.  Ideally, you and your team want to learn without taking down your network, but you absolutely want to keep learning.  Look into paid courses on existing technology, new releases, and emerging infrastructure.  Check your vendors for courses and materials related to your processes, environment, software deployments, and enterprise.  Free courses for many topics also exist if money is an issue.

4. Multiply the learning

In addition to continuous learning, make a plan to multiply the learning.  As VP of Customer Experience at SIOS, I have seen the tremendous difference between teams who share their learning and those who don’t.  Teams that share their learning avoid the knowledge gaps that lead to downtime.  The best way to know that you learned something is to teach it to somebody else. As you learn, share it with team members to reduce the risk of downtime due to error (or, for that matter, vacation).

5. End well . . .before the next beginning

All projects, servers, and software have an ending.  End well.  Decommission correctly.  Begin the next phase, deployment, or software relationship well by closing up loose ends and documenting what went well, what did not, and what to do next.  Treat your existing vendors well; you just may need them again later.  Understand the existing systems and high availability solutions before proceeding with a new deployment.  A proper ending helps you begin again from a better starting place, headed toward a stronger outcome.

Keeping the system highly available is a continuous process.  “Set it and forget it” is a nice catchphrase, but the reality is that uptime takes vigilance, continual monitoring, proper maintenance, and constant learning.

-Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, clusters, disaster recovery, High Availability

