SIOS SANless clusters

3 Challenges of Maintaining High Availability with a Legacy Infrastructure

June 9, 2026 by Jason Aw

High availability (HA) is critical for organizations that rely on continuous access to applications, services, and data. Whether supporting customer-facing platforms or internal business operations, downtime can quickly lead to financial loss, productivity issues, and reputational damage. While many companies continue to use legacy infrastructure due to cost, compatibility, or business requirements, maintaining high availability in older environments becomes increasingly difficult over time. Legacy IT systems often introduce technical limitations and operational risks that modern platforms are designed to avoid.

One of the most common issues with legacy infrastructure is the growing incompatibility between software packages, libraries, and system components. Older technologies are often built on tightly coupled dependencies that were designed years ago, before loosely coupled design became common practice. Over time, these systems become difficult to update because the software has drastically changed or isn’t maintained anymore.

Here are some examples of what issues you can run into with older infrastructure:

  • Updating one package or library can unintentionally break another component that relies on an older version. I’ve run into this myself before, where one dependency leads to another and another, and hours pass as you’re recompiling a dozen packages!
  • Lack of documentation on how the services interact can make it difficult to upgrade.
  • Lastly, modern monitoring, security, or automation tools may not integrate cleanly with outdated systems. In HA environments, even small compatibility issues can trigger major disruptions.

Plan for infrastructure modernization as part of your high availability strategy

Maintaining high availability with older infrastructure presents both technical and operational challenges. Package incompatibilities, limited vendor support, and declining internal expertise can all threaten system stability and increase the risk of downtime.

While legacy systems may continue to serve important business functions, organizations should proactively plan for infrastructure modernization, improve documentation practices, and invest in knowledge transfer before critical expertise is lost. A strong HA strategy is about ensuring long-term reliability, security, and operational resilience for the future.

Author: Cassy Hendricks-Sinke, Senior System Engineer, IT Operations, SIOS

Reproduced with permission from SIOS

 

Filed Under: Clustering Simplified Tagged With: High Availability

LifeKeeper Generic Applications for High Availability and Disaster Recovery

June 4, 2026 by Jason Aw

Keys to Success for Protecting Business-Critical Applications

High Availability and Disaster Recovery have to cover a broad range of use cases. There are as many use cases as there are organizations, far exceeding the capabilities of any single High Availability and Disaster Recovery solution to provide out-of-the-box support for every scenario. While many common applications have a wide array of High Availability and Disaster Recovery solutions available, more specific use cases limit the selection available to protect business-critical applications.

Of course, LifeKeeper cannot cover every use case out of the box. LifeKeeper, however, provides a versatile and flexible framework that can be adapted to a wide range of use cases to remedy this limitation. While powerful, this framework can appear complex to an outsider. This blog is here to help give a leg up when starting to conceptualize a Generic Application Recovery Kit for your specific use case.

Related Blogs and Background Reading Recommendations

This blog assumes familiarity with the LifeKeeper Resource Hierarchy framework and LifeKeeper clustering in general. For background on these topics, the blogs listed below provide useful context. This blog also builds on a previous blog, linked below, about one way to close the gap between possible use cases and supported protection mechanisms: the “Quick Service Protection Application Recovery Kit” (QSP ARK) within LifeKeeper.

  • Linux Clustering / Windows Clustering (Writing credits to Ms. Hoagland, Vice President of SIOS Global Sales and Marketing, and the SIOS Marketing Team)
  • Application Intelligence in Relation to High Availability (Writing credits to Ms. Hendricks-Sinke, Senior Software Engineer at SIOS)
  • Resource actions and background on The Generic Application Recovery Kit (Writing credits to Mr. Birmingham, Senior Technical Evangelist)
  • Choosing Between GenApp and QSP: Tailoring High Availability for Your Critical Applications (Writing credits to Ms. Hendricks-Sinke, Senior Software Engineer at SIOS).

This blog, however, will explore the options available when the QSP ARK cannot meet the demands for High Availability and Disaster Recovery for a particular application or use case.

Conceptualizing Applications and Defining the Approach

Asking the Smallest Question About Application Health

System administration and software engineering are both fields with lots of nuance. There can be so many different elements behind a question that a simple, straightforward answer can be difficult to obtain. In conversation, this can be easily navigated. In code, complex answers are difficult to accommodate. Asking the “smallest” question is the practice of targeting an inquiry to the smallest element possible, while ensuring that the answer has clearly defined criteria.

“Is the application running?” This is a “big” question; it may require a verbose answer. Yes, the application is running, but it is not responding. Yes, the application is running, but it is running on that other system – not the one you’re talking about. The criteria of the answer are ambiguous, and the answer is nuanced – a level of detail that developers would rather not have to handle.

“Is the application’s process running, and is the application actively responding to queries?”

Though longer to say, it is a smaller question. It clearly defines the conditions under which the answer is yes or no. While this change is an improvement, it is not yet the “smallest” question. It falls victim to the pitfall of asking “Are both X and Y true?” A yes or no answer cannot give the level of detail needed to determine the truth of X and Y independently. The smallest question requires specificity; it must provide full insight into the status of the smallest element of the greater whole. “Is the application’s process running on the desired system?” That’s a small question – in this case, this is the smallest question. Keep in mind, there might be multiple “smallest” questions – in this example, “Is the application responding to queries?” would also qualify.

While questions can be broken down almost indefinitely, there is a limit. Asking “the smallest question” comes with the implication of “asking the smallest question that still provides useful, actionable information.” Asking “Am I on the train to Philadelphia?” is sufficient; going further to ask “Am I on the train to Philadelphia, and which direction is Philadelphia?” provides more information – but it is not actionable. I cannot change the direction of the train. The answer to “Am I on the train to Philadelphia?” already tells me whether I need to call work to inform my boss that I will be late.

Though clear in this example, this is less obvious when developing for a generic application. Throughout the process of protecting the generic application, one must still keep perspective on the bigger picture. This, like anything else, is a skill – with practice and collaboration comes the ability to determine when a question is the smallest question, and when further nuance stops providing additional useful information.

Broad questions that have been broken into smaller, specific, and targeted inquiries about individual elements are the basis on which Generic Application Recovery Kits are built. Each “big question” can be answered through a composite of the answers provided for each of the elements implicated within.
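As a minimal sketch of this idea, the Python below keeps each “smallest question” as its own yes/no check and builds the answer to the “big question” from the individual answers. The check names and the stub functions are hypothetical placeholders, not part of LifeKeeper or of any application’s API; a real check would call the protected application’s own utilities.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real checks. In practice each one would call
# the protected application's own API or command-line utilities.
def process_is_running_here() -> bool:
    return True

def responds_to_queries() -> bool:
    return False

def ask_smallest_questions(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Ask every small question separately and keep each answer."""
    return {name: check() for name, check in checks.items()}

def application_is_healthy(answers: Dict[str, bool]) -> bool:
    """The 'big question' is a composite of the small answers."""
    return all(answers.values())

answers = ask_smallest_questions({
    "process running on this node": process_is_running_here,
    "responding to queries": responds_to_queries,
})
print(answers)
print("healthy:", application_is_healthy(answers))
```

Because each answer is kept separately, a script can tell “running but not responding” apart from “not running at all” and react differently to each.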

Once the questions are broken into their smallest elements, the information that needs to be relayed becomes significantly clearer. Knowing the information needed, the remaining work in developing a Generic Application Recovery Kit is all a matter of how to get the information that is needed from the information that is provided. One must work with the information that is given.

Working with Application APIs and LifeKeeper APIs

Often, applications provide Graphical User Interfaces (GUIs) to display information or show changes that occurred to the application. While fantastic for human-driven use, this is less useful when the administration is being done by an application. GUIs are for people to use, and applications (foregoing a massive amount of programming effort and unnecessary complexity) are not equipped to interface with the GUI of another application as a human would. For the purposes of LifeKeeper and a Generic Application Resource, the exchange of information between the Generic Application Recovery Kit’s action scripts and the application being protected must be done by an Application Programming Interface, or an “API”.

LifeKeeper provides its own API for interacting with LifeKeeper, the hierarchy, and the resources within the hierarchy. In the case of LifeKeeper’s API, the command-line utilities contained within the product are the easiest to use in a Generic Application. As a general recommendation, only the command-line utilities outlined in LifeKeeper Product Documentation (Linux Commands Documentation / Windows Commands Documentation) should be used. Even with this recommendation, these commands should be used with care and attention to detail to ensure that unintended actions are not taken.

Of course, LifeKeeper is not the only factor in the Generic Application. The application being protected will also need an API presented so action scripts can leverage the application’s API to achieve the desired outcomes. Developing a Generic Application Recovery Kit does require knowledge of the protected application’s API and the use of that API within the action scripts that compose the Generic Application Recovery Kit.

Using Return Codes and Output Streams in Recovery Scripts

Whether it is the API for LifeKeeper or the protected application, information will be primarily output in two ways:

  • Return Codes
  • Output Streams (sometimes called “STDOUT/STDERR output” or just “terminal output”)

How Return Codes Help Determine Success or Failure

Return codes, in the broadest sense, provide a quick way to see if a utility succeeded or failed. Typically (in the context of a shell environment), a return code of 0 indicates success, while a nonzero return code indicates failure.

Depending on the application, the exact value of the return code may give more insight into the error encountered. Often, the outcome of actions performed via an application’s API can be surmised simply by checking the return code.

In more nuanced cases, it may be that the return code is simply used to tell the program which course of action to take following a call to the application’s API. Return codes are especially useful when dealing with utilities that concern the state of some underlying element.
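To illustrate, here is a minimal Python sketch that runs a utility and picks a course of action from its return code alone. The commands are stand-ins (a Python one-liner that exits with a chosen code), and the meaning attached to exit code 3 is an assumption made up for this example; each real utility documents its own return codes.

```python
import subprocess
import sys

def run_and_classify(command: list) -> str:
    """Run a command and choose a course of action from its return code."""
    result = subprocess.run(command)
    if result.returncode == 0:
        return "success"
    # What each nonzero value means is specific to the utility being called;
    # treating 3 as "not running" is invented for this example.
    if result.returncode == 3:
        return "not running"
    return "failure"

# Stand-in commands: a Python one-liner that exits with a chosen code.
ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
stopped = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(run_and_classify(ok))
print(run_and_classify(stopped))
```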

How Output Streams Provide More Detailed Application Information

Output streams, while more complex to make use of in a program, are sometimes necessary for information exchange or to verify outcomes. If running a utility to get the system’s hostname, the return code indicates only whether the utility succeeded in retrieving the hostname; it does not indicate what that hostname is. That information arrives on the output stream. In some cases, an API utility may return a successful return code because the requested information was obtained, but that information still has to be evaluated for validity based on the circumstances.
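The hostname case can be sketched in Python as follows. The commands are stand-ins that simply print a value (a real script would call the system’s own hostname utility), and the validity rule used here, non-empty with no spaces, is an assumption chosen only to show that a return code of 0 does not by itself make the output usable.

```python
import subprocess
import sys

def read_hostname(command):
    """Return the hostname printed by `command`, or None if it is unusable."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        return None  # the utility itself reported failure
    name = result.stdout.strip()
    if not name or " " in name:
        return None  # return code 0, but the output is not a usable name
    return name

# Stand-in commands that print a value and exit with return code 0.
good = [sys.executable, "-c", "print('node-a')"]
empty = [sys.executable, "-c", "print('')"]

print(read_hostname(good))
print(read_hostname(empty))
```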

Whether using return codes or output streams, developing a Generic Application requires the use of the information at hand. When thinking of ways to achieve resource actions (outlined in the next section) or determine information about an application or LifeKeeper resource, try to think in terms of return codes and output streams, not GUIs. It can be helpful to imagine trying to convey the information over the phone. That is, information is best communicated, actions are best defined, and scenarios are best handled when the inputs and outputs of utilities are stated exactly as they will be provided or reported.

Building a Foundation for Generic Application Protection

This section kept the strategies very conceptual. These strategies lay the foundation for thinking about an application through the responses provided to questions and actions issued upon that application. Going forward, the approach will become more specific to LifeKeeper and the process of creating a Generic Application Recovery Kit. In the meantime, these strategies develop like any other skill, through practice. In technical communication, writing procedures, or any capacity in which you find yourself, practicing these conceptualization strategies will benefit you not only in the short term but over time as well.

Need help protecting a business-critical application that does not fit a standard high availability model? SIOS can help you evaluate your environment and determine the right LifeKeeper approach. Request a demo today.

Author: Philip Merry, L3 Support Engineer at SIOS Technology Corp.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

SIOS Enterprise Support Guide: What Your Plan Covers

May 30, 2026 by Jason Aw

What’s Included in Your SIOS Enterprise Support Plan?

Here are some quick tips for what is covered and not covered with Enterprise level support, and where to go for additional information based on three common scenarios.

24/7 Support for Critical System Downtime

Scenario 1: System Down After Hours
Joan’s team: It’s 7 pm EST on Sunday. The routine switchover between SIOS LifeKeeper cluster nodes should have been simple. But something unexpected happened, and the switchover failed. Despite all of the team’s efforts to resolve the issue, the cluster remains down. Joan needs help, but she is not sure whether her SIOS Technical support plan covers weekends or how long it will take to get a support person on the phone.

Customers who have purchased (or renewed) their Enterprise level support prior to an incident have access to receive support 24 hours a day, 7 days a week.  This support includes weekends and holidays to address Critical Issues.  Critical Issues mean down production systems or applications, where Customer data cannot be accessed using SIOS Programs.  For all Priority 1 (critical) issues, where normal operation results in the loss of access to your production data, SIOS provides a 2-hour response time.

If Joan has valid Enterprise support, she will be able to reach out to the SIOS support team, and her after-hours issue will be covered.

Installation and Configuration Support

Scenario 2: Installation Assistance Needed
Scott’s team: It’s 4 pm EST on Thursday. The approvals have been completed for the new infrastructure project, including the required high availability configuration for critical applications and data. At the kickoff, the stakeholders moved the date for go-live. As a result, the team needs to get the systems installed and configured quickly to avoid service interruption.  Scott’s team knows how to configure the application and server, but they want to be doubly sure they install the HA solution correctly. They need help, but Scott’s not sure that their support plan covers help with installation errors.

Since Scott’s team is in the deployment phase, the new infrastructure project involves systems that have not been validated or successfully put into production. If Scott’s team has valid SIOS Enterprise level support, he will have access to SIOS product documentation and installation pointers. Assistance with installation and configuration, however, is not covered under Scott’s Enterprise support. Instead, he can contact his SIOS sales representative to arrange a paid Professional Services installation engagement. This engagement will ensure that Scott’s team gets the assistance they need to properly install, configure, and validate their cluster. SIOS provides a wide range of professional services designed to help customers quickly and cost-effectively implement, manage, and maintain their HA environments.

Root Cause Analysis (RCA) After Failover

Scenario 3: Post-Failover RCA Support
Amol’s team: It’s 2 am EST on Tuesday.  An alert has been sent out to the entire application team at AjaxBjax Corp. The cluster protecting the company’s most critical application system is conducting a failover.  Amol checks the application dashboard and discovers that the failover was successful and all applications are functioning.  However, Amol knows that management will want some explanations and assurances.  Amol wants to make sure that all application services are up and functioning, but he isn’t sure that their support plan covers whatever this is.

Amol’s team is looking for an RCA and the confidence that their system is going to continue to be operational. Amol’s data is accessible, and his application is fully functional. His system is not a critical down production server, nor a P1 issue.  However, if AjaxBjax Corp has valid Enterprise support for their cluster, they will be able to reach out to the SIOS support team for guidance around the clock (US East), Monday through Friday, for RCA issues.  Amol’s 2 am call will be routed to one of the knowledgeable SIOS support centers, where the team will begin working with Amol.

Additional Questions About Contacting SIOS Support

Amol and Joan were able to contact support via the Support Hotline (US: 877.457.5113; International: +1.803.808.4270) with coverage included by their Enterprise Support. Scott was able to receive the help he needed, not from the Support team, but through the purchase of services to assist with configuration and installation. But what about other scenarios? Where can Scott, Amol, Joan, and others find out more about their support levels and support details, or whether their product has reached the maintenance or extended support phases?

When you need to find additional information about your support agreement, you can consult the SIOS Technical Support Agreement (TSA), which is included with each order.  The TSA is also conveniently located on our download site, and can be requested via an email to the SIOS Support Team at support@us.sios.com.  Additionally, product schedules and support tier information can be found online at the Product Lifecycle page.

Customers who already know what’s covered under their plan, but need help with a problem, answers to a general question, root cause analysis, the latest software, or pointers to more information can open a new case via the Support Portal website or via email to the Support Inbox at support@us.sios.com.  Once your case is created, the team will work to provide timely responses and resolution.

Author: Cassius Rhue VP, Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability

Why a Sandbox Environment Is Essential for High Availability

May 25, 2026 by Jason Aw

Convincing Management to Invest in Non-Production Infrastructure

Convincing management to invest in non-production infrastructure is not a job for the faint of heart.  Handled casually, discussions regarding an additional test cluster or sandbox environment quickly deteriorate into complaints about paying double for an environment (infrastructure, software, IT resources, applications, and licenses), and accusations that test clusters  “generate zero revenue”.  The cost discussion expands into a mixture of assertions that backup, DevOps, and software runbooks have rendered test environments obsolete.

However, the cost of not having an exact replica of your production environment for testing is often exponentially higher than the cost of an additional test cluster.  These extra costs often hide in the form of unplanned outages, corrupted data, emergency fixes, and stressed-out engineering teams.

10 Questions to Help Justify a Sandbox Environment

If you are struggling to get budget approval for a proper sandbox environment, pose these 10 questions to your leadership team. They shift the conversation from the cost of a duplicate cluster to the value of insuring the business against loss.

1. How much does downtime actually cost our organization?

Start with the bottom line. If a deployment fails and the production HA cluster goes dark, what is the cost to the organization?  How much do we lose per hour?  What’s our company’s burn rate per business unit?

This question moves the conversation beyond vague statements to the cost per minute of lost revenue, idle employee salaries during an outage, and the harder-to-measure cost of reputational damage. If a production outage costs $300,000 per hour, preventing just one four-hour outage annually saves $1.2 million. Armed with tangible business numbers, the ROI of implementing a sandbox to derisk a costly outage becomes crystal clear.
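The arithmetic in that example can be written as a tiny Python helper so the numbers are easy to swap for your own. The $300,000 per hour and four-hour figures come from the example above; the function name and parameters are illustrative only.

```python
def outage_cost(cost_per_hour: float, hours: float, outages_per_year: int = 1) -> float:
    """Cost of outages over a year: hourly cost x duration x number of outages."""
    return cost_per_hour * hours * outages_per_year

# One four-hour outage at $300,000 per hour, as in the example above.
avoided = outage_cost(300_000, 4)
print(f"${avoided:,.0f}")  # prints $1,200,000
```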

2. How many maintenance activities do we perform each month?

It’s simple: frequency equals risk exposure.  Risk exposure equals additional costs.  If you are deploying updates, patches, or configuration changes weekly, you are rolling the dice 52 times a year.  Refer back to question 1: How much would an hour of downtime due to a bad patch update cost the organization?  Now multiply that by your maintenance frequency.
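One hedged way to put a number on “frequency equals risk exposure” is an expected-value estimate: changes per year, times the chance that any one change causes an outage, times the cost of that outage. The 2% failure probability and one-hour outage below are assumptions made purely for illustration, not measured figures; substitute your own incident history.

```python
def expected_annual_exposure(changes_per_year: int,
                             failure_probability: float,
                             outage_hours: float,
                             cost_per_hour: float) -> float:
    """Expected yearly loss from changes that go wrong in production."""
    return changes_per_year * failure_probability * outage_hours * cost_per_hour

# 52 weekly changes; the 2% failure rate and 1-hour outage are assumed values.
exposure = expected_annual_exposure(52, 0.02, 1, 300_000)
print(f"${exposure:,.0f}")
```

Even with a low assumed failure rate, weekly changes add up to a six-figure expected loss at this hourly cost, which is the figure to set against the price of a sandbox.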

As Tristan Allen, Associate Software Engineer at SIOS, reminds customers, a sandbox that is a replica of production provides an invaluable environment “where new features, configuration changes, and patches can be thoroughly tested. Beyond functional testing, a QA environment allows for process validation, performance benchmarking, load testing, and security validation. These are critical activities for identifying bottlenecks, vulnerabilities, or integration issues before they have the chance to impact end users or compromise your environment.”

The velocity of releases and maintenance updates increases the necessity for a safety net.

3. How confident are we in deploying to production?

Does the team hold its breath every time they touch production with an update?  How many times have we heard the phrase, “It was only a one-line change”?  Off-by-one and null-pointer errors are small mistakes that have historically led to major downtime.  How confident are you in your team’s ability to ensure newly deployed packages are free from coding errors, logic flaws, architectural issues, third-party incompatibilities, or sequencing mistakes?

How confident is your team in the health of your production environment? If your production environment is brittle, a sandbox cluster allows you to validate the deployment process itself, significantly reducing the cost and stress of emergency rollbacks, as well as validating fixes beforehand.

4. What is our risk tolerance for applying security patches directly in production?

Security patches are non-negotiable, but sometimes they conflict with existing libraries or configurations. Applying a kernel patch or a database update directly to production is a gamble.

As VP of Customer Experience, I worked directly with a customer to roll back a kernel update that had been applied directly to production.  While the update fixed one problem, it had unexpected side effects that greatly impacted the storage layer, leading to deadlocks, application crashes, and other bottlenecks.

If you are having a hard time justifying a full QA cluster, ask your management team: Are we willing to risk a critical business application to apply a security patch? A sandbox allows you to apply these patches in an identical environment first, ensuring that “fixing” security doesn’t “break” the business.  Beyond patches, it allows you to deploy new applications and updates to explore any security vulnerabilities or risks that may arise.

5. What is the financial and operational impact of data corruption?

Downtime is temporary; data loss can be permanent. Incompatible changes to underlying storage, application logic errors, or problems in device drivers can silently corrupt data in a way that isn’t immediately obvious.  Do you want your production environment to be the place where you discover that the update to your backup tool means you can no longer back up or restore your critical application data?

By the time you realize the error in production, you might be weeks deep into corrupted records.  Or you may hit a crisis and realize that your backups cannot be restored on the newly updated software.  A sandbox allows you to run data integrity tests, data migrations, schema updates, driver changes, and even replication software scenarios against a copy of real data, ensuring that if data is lost or mangled, it happens in a safe environment, not the one billing your customers.

6. Can we afford for third-party integrations to fail silently?

Your application likely relies on APIs, third-party authentication, third-party applications, or some other form of dependency. These behave differently under load and especially in clustered environments.

Incompatible changes often arise not from your code, but from how your code interacts with the infrastructure. If a change works on a developer’s laptop but fails when distributed across three nodes, that is a disruption that stops business. A sandbox catches the “it works on my machine” bugs before they reach the customer.

7. How prepared are we for a true DR scenario?

Most organizations have a Disaster Recovery (DR) plan on paper, but a plan that hasn’t been tested is just a hypothesis. The only way to validate a DR strategy is to execute it, simulating a total site failure or data corruption event. Without a sandbox cluster, testing your DR plan requires you to target your production environment.  This introduces risk, expense, dangerous logistics, and downtime.

Without a sandbox cluster, you must intentionally take your revenue-generating systems offline to verify they can come back online. This requires massive coordination between network, storage, database, and application teams.  The cost for this exercise in production resembles a running water meter on a leaky system.

In addition to the downtime, the process of testing DR scenarios in production also introduces risk and complexity.  The risk involves working with live data and making sure there is strict adherence to all the data protection steps.  The complexity isn’t usually the failover; it’s the restoration. Once you have successfully failed over to a secondary site or backup node, getting the production cluster back to its original state (failback) is a complex, high-risk operation.

Remind management that the cost of a sandbox would allow your teams to simulate catastrophic failures and execute full recovery procedures during business hours without impacting users. Teams could work together to refine the “Run Book”, find and resolve process flaws safely, and practice thoroughly so that when a real disaster strikes, the team is executing a well-choreographed routine rather than a dangerous first-time experiment.

9. How do we onboard new vendors and train existing teams?

Exceptional organizations have an IT onboarding process for new team members, vendors, and service providers.  These organizations understand that a well-structured onboarding framework is essential for new team members.  They value and prioritize creating learning management systems and a culture rich with comprehensive resources that help newcomers understand the critical HA environments they will be managing, maintaining, and updating.  They also understand the value of continuous learning and a proactive approach to keeping the team’s skills sharp.

Without a sandbox system that is a direct replica of production, your IT Onboarding must leverage your production clusters.  That means the new college grad is learning how to run patch management, security software, and application updates in an HA environment on the company’s breadwinner.  When they reach a spot that is unclear to them in the run book, or coincidentally missing, the cost to productivity and risk of reputational injury to them and the business can be devastating.

In advocating for a sandbox environment, emphasize the nature of ongoing onboarding of vendors, partners, and managed service providers, and the risks of not having a place for those individuals and teams to learn about the business or explore procedures.  If your organization does not have a sandbox system, consider asking your leadership a few questions:

  • Where will our new team members go to understand the environment they will be managing, maintaining, and updating?
  • How will they keep their skills current?
  • What systems do we utilize to properly onboard the next team when necessary?

10. Is the cost of the HA tool insurance cheaper than the disaster?

Finally, address the elephant in the room: the cost of the tools and hardware.

High Availability clustering software and the associated compute costs are not free. However, compare the annual cost of the sandbox license and infrastructure against the cost of a single major downtime, rollback, or data loss event.  In almost every scenario, the cost of prevention will be a fraction of the cost of the cure.

A Sandbox Environment Is a Business Continuity Investment

As Tristan Allen, Associate Software Engineer at SIOS, concludes in his blog:

QA and production environments play a vital role in keeping systems running smoothly. By keeping environments separate, testing thoroughly, and managing deployments carefully, IT teams can reduce downtime, maintain high availability, and make transitions between updates seamless.

If your management team is having trouble understanding the benefits of a full sandbox, try asking them a few of these questions.  By asking these questions, you move the discussion away from an overly simplified cost conversation and toward a focused dialogue related to business continuity, making the approval of that budget line item much easier for management to sign.  A sandbox cluster is not a luxury item; it is a risk mitigation asset for the business.

Request a demo to see how SIOS helps you reduce downtime risk with resilient high availability and disaster recovery solutions.

Author: Cassius Rhue, VP of Customer Experience at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

Inheriting DataKeeper

May 19, 2026 by Jason Aw

What It Means to Inherit a DataKeeper Environment

The concept of inheritance often brings to mind assets passed down from one individual to another. Webster and other dictionaries define inheritance as:

“An inheritance consists of the assets, property, and sometimes debts left behind by a deceased person, distributed to beneficiaries via a will, trust, or state intestacy laws. It commonly includes cash, real estate, stocks, bonds, personal items (jewelry, cars), and business interests.”

In the world of IT, inheritance takes on a digital twist. When a Systems Administrator inherits a cluster that utilizes tools like DataKeeper, they’re not dealing with tangible assets like jewelry or real estate, but rather digital resources: think configurations, roles, and critical volume resources. And while this inheritance is hopefully the result of someone retiring from the company or receiving a well-earned promotion, we’ll cross our fingers it isn’t due to someone transitioning to that “great BIG data center in the sky.” (Yes, humor is a coping mechanism for IT professionals!)

So, if you’re the lucky recipient of an existing 1×1 cluster with a SQL Server role and associated DataKeeper Volume resources, where do you begin? What steps should you take to ensure a smooth onboarding and knowledge transfer process?

To help guide that transition, here are some key questions you should ask yourself or your Management Team:

Account Administration Questions

Account Manager Details

  • Who is the current Account Manager responsible for this account?
    • What are their contact details (email, phone, etc.)?

Licensing Information

  • What is the status of your licensing agreement, contracts, and renewals?
  • Are there any upcoming licensing expirations or renewal deadlines I should be aware of?
  • Where can I access the licensing portal, and do I have the necessary credentials?

DataKeeper Administration Questions

Comprehending the Environment

  • Assess the current infrastructure, including the Windows Server Failover Clustering setup, servers, storage, etc.
  • What current workloads and applications are DataKeeper protecting?

Configuration and Management

  • Become familiar with DataKeeper configuration.
    • Which mirror types (asynchronous or synchronous) are in use?
    • How are the cluster nodes set up?
    • What storage is involved?

Maintenance and Software Updates

  • How do I stay informed about new releases, patches, and updates for DataKeeper?

Testing Failover and Recovery

  • Test failovers periodically to confirm that HA and DR configurations are working as expected.
  • Is the mirrored data consistent and recoverable in the event of a disaster?

[Image: Screenshot of Failover Cluster Manager showing a running SQL Server role and DataKeeper Volume S online.]

Understand Resource Ownership and Dependencies

Once you’ve understood as much as possible about your inheritance, your next step is to begin taking care of what you’ve been given, as depicted in the illustration above. When “inheriting” ownership of a SQL Server cluster, it is crucial to identify and communicate with all cross-functional teams impacted by cluster administration. There are numerous areas to consider; a few key ones to focus on include:

SQL Server or Application Team

  • Be proactively notified about any planned changes to SQL Server names or instances.
  • Be informed of large SQL inserts or operations that may impact cluster performance.
  • Provide details on the locations of the database files, backups, and snapshots.

Networking Team

  • Communicate plans to move a SQL role or related resource to a different network.
  • Share information on new IP addresses or other network-related changes that could affect cluster operations.

Storage Team

  • Be cautious when making changes to the source and target volumes (e.g., resizing, formatting, or adding partitions), as these can affect DataKeeper replication.
  • Is there ample bandwidth for the existing mirror?
    • Are you able to collaborate with the Networking Team to ensure bandwidth is sufficient and isolated from other applications to avoid bottlenecks?
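
As a starting point for the bandwidth conversation, the sustained rate a mirror needs can be estimated from how much data changes on the source volume. The sketch below is generic arithmetic, not a DataKeeper tool or API: the function name, the daily change volume, and the peak factor are all assumptions for illustration, and a real assessment should use change rates measured on your own volumes.

```python
def required_mbps(changed_gb_per_day: float, peak_factor: float = 1.0) -> float:
    """Estimate megabits per second needed to replicate a daily change volume.

    changed_gb_per_day uses decimal gigabytes (1 GB = 1e9 bytes).
    peak_factor scales the daily average up to allow for busy periods.
    """
    if changed_gb_per_day < 0 or peak_factor <= 0:
        raise ValueError("inputs must be non-negative / positive")
    bits_per_day = changed_gb_per_day * 1e9 * 8
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 1e6 * peak_factor


# Hypothetical example: 100 GB of changed data per day, with peaks
# assumed to run at three times the daily average.
average = required_mbps(100)
peak = required_mbps(100, peak_factor=3.0)
print(f"Average: {average:.2f} Mbps, assumed peak: {peak:.2f} Mbps")
```

Comparing the peak figure against the link actually available to the mirror gives the Storage and Networking Teams a concrete number to discuss. Writes are rarely spread evenly across the day, so the peak factor is a rough allowance rather than a measurement.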

Why Runbooks Matter in a DataKeeper Environment

Runbooks are an essential part of smooth operations, giving cluster administrators a reliable reference for resolving issues in environments that use DataKeeper and related technologies. Ideally, a well-crafted runbook should be a “living document” that evolves over time, reflecting changes in infrastructure, workflows, and best practices. If previous administrators have done their due diligence, your runbook should have comprehensive coverage in the following areas:

  • Break/Fix: working through known issues, which could be anywhere in the “stack,” e.g., from the physical layer all the way up to the application layer
  • Workflows: deploying software and managing routine day-to-day cluster operations
  • Maintenance: how patch management, database backups, and similar tasks are performed
  • Vendor support: how to reach SIOS, Microsoft, AWS, and other providers
    • Most importantly, when to “reach out” to them

Key Takeaways For Inheriting DataKeeper

This blog highlights several important talking points for navigating such transitions, including account administration, resource ownership, cross-functional collaboration, and the value of runbooks. However, it’s important to note that these are just a few considerations among many others that may fall outside the scope of this discussion. Every environment is unique, and successful cluster administration requires a thorough understanding of the specific infrastructure, dependencies, and workflows involved.

Enjoy your “inheritance” . . .

Don’t spend it all in one place . . .

Request a demo to see how SIOS DataKeeper can help simplify cluster administration and support high availability.

Author: Greg Tucker, Senior Product Support Engineer at SIOS

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: DataKeeper
