clusters Archives - SIOS SANless clusters

Ten Questions to Consider for Better High Availability Cluster Maintenance

April 24, 2023 by Jason Aw

Maintenance is a part of every company’s lifecycle. Every infrastructure is constantly changing, even one that is moving toward end of life. Your team has likely had a lot of success doing what it has done in the past, but as systems become more complex, what you deemed success in the past may need a refresh. Here are ten questions to improve cluster maintenance, maximize high availability, and minimize downtime.

How to Ensure High Availability During System Maintenance

What are the best days for the business stakeholders?

Unlike unplanned downtime, these are known windows in which multiple teams, systems, and interconnected resources are simply not available for planned activities. For example, one company is required to do monthly system compliance checks and safety inspections. During this time, business operations are shut down while inspectors, auditors, and similar parties do their work.

What are the best dates for the team to schedule maintenance?

As VP of Customer Experience, I have worked closely with a number of teams that have blackout dates for certain events and activities. Your team is likely responsible for more than one set of systems and servers, and reports to multiple teams with critical applications and infrastructure. Understanding which days are best for the team helps you avoid distractions, conflicts, and lost time due to known resource constraints.

What dates and times coordinate best with partners, consultants, and non-company contractors?

Critical infrastructure typically includes many other providers and vendors who are not directly related to the company’s staffing. These resources include OS, security and HA vendors and consultants, as well as architects from the infrastructure providers and other partners. Understanding in advance what days are best or included in your support tiers is critical to proper scheduling and staffing.

With the rise in global teams, finding the right time for all of these resources is another important question to answer. What is the best time for resources in EST, IST, EMEA, and other regions?
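One way to approach that question is to compute it. The sketch below is a minimal illustration, not a scheduling tool. It assumes fixed UTC offsets (EST as UTC-5, IST as UTC+5:30, and CET at UTC+1 standing in for EMEA), ignores daylight saving time, and uses hypothetical function names.

```python
# Assumed fixed UTC offsets in minutes; daylight saving time is ignored.
UTC_OFFSET_MINUTES = {"EST": -300, "IST": 330, "CET": 60}
MINUTES_PER_DAY = 24 * 60


def utc_minutes(zone, start_hour, end_hour):
    """UTC minutes-of-day covered by local hours [start_hour, end_hour)."""
    offset = UTC_OFFSET_MINUTES[zone]
    return {
        (local_minute - offset) % MINUTES_PER_DAY
        for local_minute in range(start_hour * 60, end_hour * 60)
    }


def shared_minutes(zones, start_hour, end_hour):
    """UTC minutes-of-day when every zone is inside its local availability."""
    windows = [utc_minutes(zone, start_hour, end_hour) for zone in zones]
    return set.intersection(*windows)


if __name__ == "__main__":
    zones = ["EST", "IST", "CET"]
    for start, end in [(9, 17), (7, 19)]:
        overlap = shared_minutes(zones, start, end)
        print(f"{start}:00-{end}:00 local: {len(overlap)} shared minutes")
```

With these assumed offsets, a 9:00 to 17:00 local working day leaves no time when all three regions are available, while widening availability to 7:00 to 19:00 yields a 90-minute shared slot.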

What is the intended scope of the maintenance? What is the desired outcome of the maintenance activities? Think holistically.

Think beyond simple maintenance of the application to include the entire environment where it is running. Recently, a customer who was planning to upgrade their application decided to upgrade their OS at the same time. Unfortunately, this slight change in scope came with larger-than-expected consequences. Their application did not support the newly upgraded OS, and problems ensued. Be sure that the scope of the maintenance window is well defined and that outcomes for that scope are detailed. It is not enough to say that the environment works. Detail expected versions, behavior, and measurable outcomes wherever possible. See more about IT Resilience.

What is the length of time for the maintenance window (anticipated, allowed)?

Ideally, we would all love to have unlimited time to perform maintenance, but having customers located around the world means there is little tolerance for planned downtime windows – even for critical tasks. As you plan for maintenance, what length of downtime is anticipated? Can you realistically meet your maximum allowed windows? If not, then you will need to replan the maintenance events.
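The comparison between anticipated and allowed time can be made explicit before the window is booked. The following is a minimal sketch under stated assumptions: the task names and durations are hypothetical, and the worst case is assumed to be a rollback that starts only after every planned task has already run.

```python
def assess_window(task_minutes, rollback_minutes, allowed_minutes):
    """Compare anticipated and worst-case duration with the allowed window.

    task_minutes: mapping of task name -> estimated minutes (hypothetical).
    rollback_minutes: estimated time to return to the pre-maintenance state.
    allowed_minutes: maximum downtime the business has agreed to.
    """
    anticipated = sum(task_minutes.values())
    # Assumption: worst case is a full rollback after all tasks have run.
    worst_case = anticipated + rollback_minutes
    return {
        "anticipated": anticipated,
        "worst_case": worst_case,
        "fits": worst_case <= allowed_minutes,
    }


if __name__ == "__main__":
    tasks = {"stop application": 10, "apply patches": 45, "restart and verify": 25}
    print(assess_window(tasks, rollback_minutes=60, allowed_minutes=120))
    print(assess_window(tasks, rollback_minutes=60, allowed_minutes=180))
```

In this example the planned work takes 80 minutes, which fits a 120-minute window on its own, but the 140-minute worst case does not. That is the signal to shorten the scope, speed up the rollback, or negotiate a longer window.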

What’s the rollback plan?

While we hope nothing goes wrong, we should be aware that we are dealing with software, complex environments and configurations, and lots of moving pieces being handled by numerous teams. A rollback plan – that is, a means of returning the systems to the pre-maintenance versions and settings – is essential. Be sure that if something goes wrong you have a rollback plan, for example full backups or machine images. See more about disaster recovery.

Who are the individual team members involved, what are their roles and responsibilities? Are all the required roles and responsibilities clearly defined?

As VP of Customer Experience, I saw our team take part in a maintenance activity that encountered an unforeseen delay because key team members were missing. As you lay out your plan and architecture, be sure to identify the team members as well as the IT roles and responsibilities required. As Sr. Support Engineer Greg Tucker reminds customers, HA touches every layer of your environment, including storage, network, compute, OS, security, policies, etc.

Where is the maintenance plan documented? When was the last time the plan was reviewed, updated, and tested?

Success is wonderful, but it can also make you complacent or comfortable. After years of success, your process may no longer be well documented or actively being followed. Answering these questions can make sure your team continues to have success.

What issues were resolved in test/QA prior to the production plans?

Kudos for continuing to test maintenance steps. Be sure that issues resolved in test environments are properly added to the production maintenance plans. The SIOS Customer Success team has seen customers perform QA tests, uncover false assumptions and make necessary corrections, but fail to place those corrections in their production checklist.

Who or what is missing from your plans?

Now that you’ve looked over the plans, timing, teams, roles, and architecture, one last question remains: who or what is missing? As a last step, look over your plans and ask, “Who is missing from our plans?” and “What is missing from our plans?” As VP of Customer Experience, I have worked with our team to review activity plans for countless customers. One of the most memorable maintenance plan reviews uncovered a series of steps within the rollback plan that included restoring servers from cloned images and data from backup. However, the image cloning and data backup steps were not included in the task list. They had been overlooked and assumed to have been done earlier in the process.

System Maintenance is a Critical Element to Maintaining High Availability

System maintenance is a critical and necessary part of maintaining computer systems. The maintenance could be to correct errors, introduce new software functionality, or adapt a system to a new use case. When the systems in question are business-critical systems that the organization needs to maintain business continuity, having a well-thought-out plan is essential. Consider these ten questions and others of your own to make sure that your maintenance satisfies the needs of the business without unnecessary risk or delay.

Contact SIOS today for High Availability and Disaster Recovery Solutions.

Reproduced with permission from SIOS

White Paper: High Availability Clusters in VMware vSphere without Sacrificing Features or Flexibility

October 22, 2022 by Jason Aw

Six key facts you should know about high availability protection in VMware vSphere

Many large enterprises are moving important applications from traditional physical servers to virtualized environments, such as VMware vSphere, in order to take advantage of key benefits such as configuration flexibility, data and application mobility, and efficient use of IT resources. Realizing these benefits with business-critical applications, such as SQL Server or SAP, can pose several challenges.

This paper explains these challenges and highlights six key facts you should know about HA protection in VMware vSphere environments that can save you money.

Reproduced with permission from SIOS

New Options for High Availability Clusters, SIOS Cements its Support for Microsoft Azure Shared Disk

September 24, 2022 by Jason Aw

Microsoft introduced Azure Shared Disk in Q1 of 2022. Shared Disk allows you to attach a managed disk to more than one host. Effectively this means that Azure now has the equivalent of SAN storage, enabling Highly Available clusters to use shared disk in the cloud!

A major advantage of using Azure Shared Disk with a SIOS LifeKeeper cluster hierarchy is that you are no longer required to have either a storage quorum or a witness node to avoid so-called split-brain, which occurs when communication between nodes is lost and several nodes are potentially changing data simultaneously. Fewer nodes mean less cost and complexity.

SIOS has introduced an Application Recovery Kit (ARK) for our LifeKeeper for Linux product, called the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit, that allows Azure Shared Disks to be used in conjunction with SCSI-3 reservations. This ARK guarantees that a shared disk is only writable from the node that currently holds the SCSI-3 reservations on that disk.

When installing SIOS LifeKeeper, the installer will detect that it’s running in Microsoft Azure and automatically install the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit to enable support for Azure Shared Disk.

Resource creation within LifeKeeper is straightforward (Figure 1). Once locally mounted, the Azure Shared Disk is simply added into LifeKeeper as a file-system type resource. LifeKeeper will assign it an ID (Figure 2) and manage the SCSI-3 locking automatically.

[Figure 1] Creation of /sapinst within LifeKeeper.

[Figure 2] /sapinst created and extended to both cluster nodes.

SCSI-3 reservations guarantee that Azure Shared Disk is only writable on the node that holds the reservations (Figure 3). In a scenario where cluster nodes lose communication with each other, the standby server will come online, causing a potential split-brain situation. However, because of the SCSI-3 reservations, only one node can access the disk at a time, which prevents an actual split-brain scenario. Only one system will hold the reservation, and it will either become the new active node (in which case the other will reboot) or remain the active node. Nodes that do not hold the Azure Shared Disk reservation will simply end up with the resource in a “Standby State” because they cannot acquire the reservation.
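The arbitration described above can be illustrated with a toy model. This is a sketch only: it does not issue real SCSI-3 Persistent Reservation commands and does not reflect LifeKeeper internals. The class, method, and node names are hypothetical, and the reservation is reduced to a single exclusive holder.

```python
import threading


class SharedDisk:
    """Toy model of a shared disk with one exclusive reservation holder."""

    def __init__(self):
        self._lock = threading.Lock()  # makes reserve/write atomic
        self.holder = None
        self.blocks = {}

    def reserve(self, node):
        """Grant the reservation if it is free or already held by this node."""
        with self._lock:
            if self.holder in (None, node):
                self.holder = node
                return True
            return False

    def write(self, node, block, data):
        """Allow writes only from the node that holds the reservation."""
        with self._lock:
            if self.holder != node:
                raise PermissionError(f"{node} does not hold the reservation")
            self.blocks[block] = data


def resolve_partition(disk, nodes):
    """After a communication loss, every node tries to reserve the disk."""
    return {node: "active" if disk.reserve(node) else "standby" for node in nodes}


if __name__ == "__main__":
    disk = SharedDisk()
    disk.reserve("node-a")  # node-a is the current active node
    print(resolve_partition(disk, ["node-b", "node-a"]))
```

However many nodes believe they should be active, only the reservation holder can write, so the disk contents cannot diverge.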

[Figure 3] Output from LifeKeeper logs when trying to mount a disk that is already reserved.

Link to Microsoft’s definition of Azure Shared Disks https://docs.microsoft.com/en-us/azure/virtual-machines/disks-shared

At present, SIOS supports Locally-Redundant Storage (LRS), and we’re working with Microsoft to test and support Zone-Redundant Storage (ZRS). Ideally, we’d like to know when there is a ZRS failure so that we can fail over the resource hierarchy to the node closest to the active storage.

At present, SIOS expects Azure Shared Disk support to arrive in its next release, LifeKeeper 9.6.2 for Linux, in Q3 2022.

Reproduced with permission from SIOS

Introduction To Clusters – Part 2

November 23, 2021 by Jason Aw

What Types of Clusters Are There and How Do They Work?

An Overview of HA Clusters, and Load Balancing Clusters

Clustering helps improve reliability and performance of software and hardware systems by creating redundancy to compensate for unforeseen system failure. If a system is interrupted due to hardware or software failure or natural disaster, this can have a major impact on business and revenue, wasting crucial time and expense to get things back up and running.

This is where clustering comes in. There are three main types of clustering solutions – HA clusters, load balancing clusters, and HPC clusters. Which type will best increase system availability and performance for your business? Let’s have a look at the three types of clustering solutions in more detail below.

What is HA Clustering?

High Availability clustering, also known as HA clustering, is effective for mission-critical business applications, ERP systems, and databases, such as SQL Server, SAP, and Oracle, that require near-continuous availability.

HA clustering can be divided into two types: active-active configuration and active-standby configuration.

Let’s take a look at the difference between these two HA clustering types.

HA Clustering Type 1: Active-Active Configuration

In the active-active configuration, processing is performed on all nodes in the cluster. For example, in the case of two-node clustering, both nodes are active. If one node stops, its processing will be taken over by the other.

However, if each node is operating at close to 100% and one node stops, it will be difficult for another node to take on the additional processing load. Therefore, capacity planning with a margin is important for HA clustering.
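The margin can be worked out with simple arithmetic. The sketch below is illustrative and rests on two assumptions that are not from the article: every node has the same capacity, and the load of a failed node is spread evenly across the survivors. Function names are hypothetical.

```python
def max_safe_utilization(node_count):
    """Highest per-node utilization that still tolerates one node failure.

    With N equal nodes, the N-1 survivors must absorb all of the load,
    so each node should run at no more than (N - 1) / N of capacity.
    """
    if node_count < 2:
        raise ValueError("an HA cluster needs at least two nodes")
    return (node_count - 1) / node_count


def survives_single_failure(utilizations):
    """Check whether any one node can fail without overloading the rest.

    utilizations: per-node load as a fraction of capacity (1.0 == 100%).
    """
    for failed_index, failed_load in enumerate(utilizations):
        survivors = [
            load for index, load in enumerate(utilizations) if index != failed_index
        ]
        extra_per_node = failed_load / len(survivors)
        if any(load + extra_per_node > 1.0 for load in survivors):
            return False
    return True


if __name__ == "__main__":
    print(max_safe_utilization(2))               # two-node cluster
    print(survives_single_failure([0.45, 0.45])) # margin is sufficient
    print(survives_single_failure([0.60, 0.60])) # survivor would need 120%
```

Under these assumptions, a two-node active-active cluster should keep each node at or below 50% utilization, and a four-node cluster at or below 75%.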

HA Clustering Type 2: Active-Standby Configuration

Let’s use our two-node example again. In the active-standby configuration, one node is configured as the active node and the other node is configured as the standby node. The active node and the standby node exchange signals called “heartbeats” to constantly check whether they are operating normally.

If the standby node cannot receive the heartbeat of the active node, the standby node determines that the active node has stopped and will take over the processing of the active node. This mechanism is called “failover”. Conversely, the mechanism that recovers the stopped operating node and transfers the processing back to the recovered active node is called “failback.”
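The heartbeat and failover logic can be sketched in a few lines. This is a simplified illustration rather than how any particular clustering product works: the class name, the single timeout threshold, and the explicit timestamps passed in by the caller are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyNode:
    """Standby node that promotes itself when heartbeats stop arriving."""

    heartbeat_timeout: float   # seconds of silence tolerated (assumed value)
    last_heartbeat: float = 0.0
    role: str = "standby"

    def on_heartbeat(self, now):
        """Record the time at which a heartbeat from the active node arrived."""
        self.last_heartbeat = now

    def tick(self, now):
        """Periodic check: fail over if the active node has been silent too long."""
        silent_for = now - self.last_heartbeat
        if self.role == "standby" and silent_for > self.heartbeat_timeout:
            self.role = "active"  # failover: take over processing
        return self.role


if __name__ == "__main__":
    node = StandbyNode(heartbeat_timeout=3.0)
    node.on_heartbeat(now=10.0)
    print(node.tick(now=12.0))  # 2 s of silence, within the timeout
    print(node.tick(now=13.5))  # 3.5 s of silence, standby takes over
```

Passing the time in explicitly keeps the sketch deterministic. A real implementation would also have to decide how many missed heartbeats to tolerate and how to avoid both nodes becoming active at once.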

In an active-standby configuration, when a failure occurs, the simple switch from the active node to the standby node makes recovery relatively easy. However, keep in mind that the resources of the standby node sit idle while the active node is operating normally.

Two Components of HA Clustering: Application and Storage

For an HA cluster to be effective, two areas need to be addressed: application orchestration and storage protection. Clustering software monitors the health of the application being protected and, if it detects an issue, moves operation of that application over to the standby node. The standby node needs access to the most up-to-date versions of data – preferably identical to the data that the primary node was accessing before the incident. This can be accomplished in two ways: shared storage or shared-nothing storage. In the shared storage model, both cluster nodes access the same storage – typically a SAN. In shared-nothing (aka SANless) configurations, local storage on all nodes is mirrored using replication software.

Clustering software products vary widely in their ability to monitor and detect issues that may cause application failure and in their ability to orchestrate failovers reliably. Many clustering products only detect whether the application server is operational, but do not detect a wide range of software, services, network, and other issues that can cause application failure.

Application Awareness is Essential

Similarly, complex ERP and database applications have multiple component parts that have to be stored on the correct server or instance, started up in the right order, and brought online in accordance with complex best practices. Choose clustering software that offers application recovery kits, specialized software designed specifically to maintain best practices for application- and database-specific requirements.
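Starting components in the right order is a dependency-ordering problem. The sketch below illustrates the idea with Python's standard graphlib module. It is not how an application recovery kit is implemented, and the component names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter


def startup_order(dependencies):
    """Return components ordered so each starts after what it depends on.

    dependencies: mapping of component -> set of components it requires.
    """
    return list(TopologicalSorter(dependencies).static_order())


def shutdown_order(dependencies):
    """Stop components in the reverse of the startup order."""
    return list(reversed(startup_order(dependencies)))


if __name__ == "__main__":
    # Hypothetical resource hierarchy for a database-backed application.
    hierarchy = {
        "virtual-ip": set(),
        "filesystem": set(),
        "database": {"filesystem"},
        "application": {"database", "virtual-ip"},
    }
    print(startup_order(hierarchy))
    print(shutdown_order(hierarchy))
```

Several valid orders can exist for the same hierarchy. What matters is that the filesystem is available before the database and that the application comes up last.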

There are multiple ways to configure an HA Cluster:

Traditional Two Node Clusters with Shared Storage

Two servers are clustered with shared storage.

Two Node SANless Cluster

Clusters can be configured using local LAN and high speed synchronous block-level replication.

Real-time replication can be used to synchronize storage on the primary server with storage on a standby server located in the same data center, in your disaster recovery site, or both. This allows you to build high availability and disaster recovery configurations flexibly, with two nodes or multiple nodes. SIOS block-level replication is highly optimized for performance. You can even use high-speed locally attached storage, such as PCIe flash storage devices, on your physical servers to achieve very low cost, high performance, high availability configurations. Both your data on the flash device and your application are protected.

Third Node for Disaster Protection

This configuration uses a SAN-based cluster and adds a third, SANless node in a remote data center or the cloud to achieve full disaster recovery protection. In the event of a disaster, the standby remote physical server is brought into service automatically with no data loss, eliminating the hours needed for restoration from backup media.

What is a Load Balancing Cluster?

A load balancing cluster uses a load balancer to distribute processing across multiple nodes so that they can be used as a single system with improved performance. While it can isolate a failed node to prevent that failure from affecting the entire system, the load balancer itself is a critical single point of failure: if it fails, the entire system stops. For that reason, load balancing is not a high availability option. It is only effective for applications such as web server load balancing.
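The behavior described here, distributing requests and isolating a failed node, can be sketched as a round-robin balancer. This is a simplified, in-memory illustration with hypothetical class and node names. It is not modeled on any specific load balancer product.

```python
class RoundRobinBalancer:
    """Distribute requests across nodes in turn, skipping nodes marked down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next_index = 0

    def mark_down(self, node):
        """Isolate a failed node so it receives no further requests."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Return a recovered node to the rotation."""
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Pick the next healthy node; fail if none is left."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next_index]
            self._next_index = (self._next_index + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web1", "web2", "web3"])
    print([balancer.route() for _ in range(4)])
    balancer.mark_down("web2")
    print([balancer.route() for _ in range(3)])
```

The sketch also shows the weakness: every request passes through the one balancer object, so if it is unavailable, nothing is routed at all, even when every node behind it is healthy.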

What is HPC Clustering?

You can also use clustering for performance instead of high availability. High-Performance Computing clusters, or HPC clusters, combine the processing power of multiple (sometimes thousands of) nodes to get the CPU performance needed in CPU-intensive environments such as scientific and technological environments requiring large-scale simulations, CAE analysis, and parallel processing.

Are you ready to find the right HA clustering solution for your business?

Learn more about SIOS High Availability clustering here.

Reproduced with permission from SIOS

Disaster Recovery Made Simple

October 22, 2021 by Jason Aw

Heard the term disaster recovery (DR) thrown around often? DR is a strategy and set of policies, procedures, and tools. It ensures critical IT systems, databases, and applications continue to operate and be available to users when a man-made or natural disaster happens. It typically involves moving application operation to a redundant DR environment that is geographically separated from the primary environment. While the IT team owns the disaster recovery strategy, DR is an important component of every organization’s Business Continuity Plan. The latter is a strategy and set of policies, procedures, and tools to ensure business operations continue through an interruption in service.

It may sound confusing at first. But we’ve collected some quick facts to make disaster recovery simple to understand:

Point 1. Implement an IT disaster recovery plan (DRP)

A DRP is a strategy and set of policies, procedures, and tools that ensure critical IT systems, databases, and applications continue to operate and be available to users when a disaster strikes the organization’s primary computing environment. While the IT team owns the disaster recovery strategy, DR is an important component of every organization’s Business Continuity Plan.

Point 2. Ensure Geographic Separation

An essential part of application disaster recovery is ensuring there is a redundant, geographically separated application environment available. You need efficient block-level replication and/or clustering software that can fail over operation to that environment in the event of a disaster. If your application is running in a cloud, your clustering environment should fail over across cloud regions and availability zones for disaster recovery.

Point 3. Test, test, and test some more

In a recent Spiceworks survey, 59 percent of organizations indicated they had experienced one to three outages (that is, any interruption to normal levels of IT-related service) over the course of one year. Another 11 percent had experienced four to six, and 7 percent had experienced seven or more. In short, a DR event is nearly inevitable. Be sure you conduct regular testing to ensure you know exactly what will happen when it does.

Point 4. Understand Your Risk

The disaster in DR does not need to be a full-fledged hurricane, tornado, flood, or earthquake that impacts your business. Disasters come in many forms, including a cyber-attack, fire, theft, or vandalism. In fact, simple human error still rates among the leading causes of IT data center downtime. In short, a disaster is any crisis that results in a down system for a long duration and/or major data loss on a large scale that impacts your IT infrastructure, data center, and your business.

Point 5. Ensure Your DRP has a Checklist

The checklist should include critical IT systems and networks, prioritized by their recovery time objective (RTO). Document the steps needed to restart, reconfigure, and recover systems and networks. Employees should know where to locate the DRP and how to execute basic emergency steps in the event of an unforeseen incident.
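A checklist prioritized this way can be kept as structured data rather than free text. The following is a minimal sketch. The system names, RTO values, and recovery steps are hypothetical placeholders, and sorting by RTO alone is an assumed prioritization rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecoveryItem:
    """One entry in the DRP checklist (all example values are hypothetical)."""

    name: str
    rto_minutes: int                 # recovery time objective for this system
    steps: tuple = field(default=())  # documented restart/recovery steps


def recovery_checklist(items):
    """Order systems so the tightest RTO is recovered first."""
    return sorted(items, key=lambda item: item.rto_minutes)


if __name__ == "__main__":
    items = [
        RecoveryItem("reporting", 480, ("restore from backup",)),
        RecoveryItem("core network", 15, ("fail over routers", "verify DNS")),
        RecoveryItem("ERP database", 60, ("bring DR node online", "run checks")),
    ]
    for position, item in enumerate(recovery_checklist(items), start=1):
        print(f"{position}. {item.name} (RTO {item.rto_minutes} min): "
              + "; ".join(item.steps))
```

Keeping the documented steps next to each system means the same record answers both what to recover first and how to recover it.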

Point 6. Substantiate DRPs through testing

Testing a DRP identifies deficiencies and provides opportunities to fix problems before a disaster occurs. Testing can offer proof that the plan is effective and that it will enable you to meet recovery point and recovery time objectives (RPOs and RTOs). Since IT systems and technologies are constantly changing, DR testing also helps ensure a disaster recovery plan is up to date.
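The result of a DR test can be checked against the objectives with a small calculation. The sketch below assumes three timestamps are recorded during the test (the last successful replication, the simulated failure, and the moment service was restored). The function name and the example values are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_dr_test(last_replicated, failure, restored, rpo, rto):
    """Compare measured data loss and downtime with the RPO and RTO.

    Data loss is measured as the time between the last replicated write
    and the failure; downtime as the time from failure to restored service.
    """
    data_loss = failure - last_replicated
    downtime = restored - failure
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "meets_rpo": data_loss <= rpo,
        "meets_rto": downtime <= rto,
    }


if __name__ == "__main__":
    result = evaluate_dr_test(
        last_replicated=datetime(2021, 10, 22, 9, 59, 30),
        failure=datetime(2021, 10, 22, 10, 0, 0),
        restored=datetime(2021, 10, 22, 10, 12, 0),
        rpo=timedelta(minutes=1),
        rto=timedelta(minutes=10),
    )
    print(result)
```

In this example the 30 seconds of lost data is inside the one-minute RPO, but the 12-minute recovery misses the 10-minute RTO, which is exactly the kind of deficiency a test is meant to surface before a real disaster.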

Choose a failover clustering technology that makes DR testing simple by facilitating fast, simple, reliable switchover of application operation to DR nodes and back.

When you look at those statistics, you know you are living on borrowed time if you don’t have a disaster recovery plan in place. The SIOS disaster recovery solution is a multi-site, geographically dispersed cluster that meets RPOs and RTOs with ease. What makes SIOS different from many other DR providers is that it offers one solution that meets both high availability and disaster recovery needs. To learn more about our DR solutions, check out the insights page here.

Reproduced with permission from SIOS