How to recreate the file system and mirror resources to ensure the size information is correct

November 11, 2022 by Jason Aw Leave a Comment

When working with high availability (HA) clustering, it’s essential to ensure that the configurations of all nodes in the cluster are consistent with one another. These ‘mirrored’ configurations help to minimize the failure points in the cluster, providing a higher standard of HA protection. For example, we have seen situations in which the mirror size was updated on the source node but the same information was not updated on the target node. The mirror size mismatch prevented LifeKeeper from starting the resource on the target node during a failover. Below are the recommended steps for recreating the mirror resource on the target node with the same size information as the source:

Steps:

  1. Verify – from the application’s perspective – that the data on the source node is valid and consistent.
  2. Back up the file system on the source (which is the source of the mirror).
  3. Run /opt/LifeKeeper/bin/lkbackup -c to back up the LifeKeeper configuration on both nodes.
  4. Take all resources out of service. In our example the resources are in service on node sc05, which is the source of the mirror (sc06 is the target system/target of the mirror).
    1. In the right pane of the LifeKeeper GUI, right-click on the DataKeeper resource that is in service.
    2. Click Out of Service from the resource popup menu.
    3. A dialog box will confirm that the selected resource is to be taken out of service. Any resource dependencies associated with the action are noted in the dialog. Click Next.
    4. An information box appears showing the results of the resource being taken out of service. Click Done.
  5. Verify that all resources are out of service and the file systems are unmounted.
    1. Use the command cat /proc/mdstat on the source to verify that no mirror is configured.
    2. Use the mount command on the source to make sure the file system is no longer mounted.
    3. Use /opt/LifeKeeper/bin/lcdstatus -q on the source to make sure the resources are all OSU.
  6. In the LifeKeeper GUI, break the dependency between the IP resource (VIP) and the file system resource (/mnt/sps). Right-click on the VIP resource and select Delete Dependency.

Then, select the File System resource (/mnt/sps) for the Child Resource Tag.

This will result in two hierarchies: one with the IP resource (VIP), and one with the file system resource (/mnt/sps) and the mirror resource (datarep-sps).

  7. Delete the hierarchy with the file system and mirror resources. Right click on /mnt/sps and select Delete Resource Hierarchy.
  8. On the source, perform ‘mount <device> <directory>’ on the file system.

Example: mount /dev/sdb1 /mnt/sps
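
Because this procedure exists to correct size information, it can be worth confirming the device and file system sizes before recreating the mirror. A minimal check, using the device and mount point from this example (substitute your own):

lsblk /dev/sdb1    # block device size as the kernel sees it; run on both nodes
df -h /mnt/sps     # mounted file system size; run on the source after mounting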

  9. Via the GUI, recreate the mirror and file system resources using the following settings:
    1. Recovery Kit: Data Replication
    2. Switchback Type: Intelligent
    3. Server: The source node
    4. Hierarchy Type: Replicate Existing Filesystem
    5. Existing Mount Point: <select your mount point>. It is /mnt/sps for this example.
    6. Data Replication Resource Tag: <Take the default>
    7. File System Resource Tag: <Take the default>
    8. Bitmap File: <Take the default>
    9. Enable Asynchronous Replication: Yes
  10. Once created, you can Extend the mirror and file system hierarchy:
    1. Target server: Target node
    2. Switchback Type: Intelligent
    3. Template Priority: 1
    4. Target Priority: 10
  11. Once the pre-extend checks complete, select Next and then provide these values:
    1. Target disk: <Select the target disk for the mirror>.  It is /dev/sdb1 in our example.
    2. Data Replication Resource Tag: <Take the default>
    3. Bitmap File: <Take the default>
    4. Replication Path: <Select the replication path in your environment>
    5. Mount Point: <Select the mount point in your environment>.  It is /mnt/sps in our example.
    6. Root Tag: <Take the default>

When the resource extend is done, select “Finish” and then “Done”.

  12. In the LifeKeeper GUI recreate the dependency between the IP resource (VIP) and the file system resource (/mnt/sps). Right click on the VIP resource and select Create Dependency. Select /mnt/sps for the Child Resource Tag.
  13. At this point the mirror should be performing a full resync of the file system. In the right pane of the LifeKeeper GUI, right-click on the VIP resource, select “In Service” to restore the IP resource (VIP), select the source system where the mirror is in service (sc05 in our example), and verify that the application restarts and the IP is accessible.
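
Once the hierarchy is back in service, the resync can be monitored with the same commands used earlier in this procedure, for example:

cat /proc/mdstat                  # on the source; shows the mirror and resync progress
/opt/LifeKeeper/bin/lcdstatus -q  # resources on the source should now show as in service (ISP)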

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, SIOS LifeKeeper

Explaining the Subtle but Critical Difference Between Switchover, Failover, and Recovery

November 9, 2022 by Jason Aw Leave a Comment

High availability is a speciality and, like most specialities, it has its own vocabulary and terminology. Our customers are typically very knowledgeable about IT, but if they haven’t been working in an HA environment, some of our common HA terminology can cause a fair amount of confusion – for them and for us. These terms are simple-sounding but have very specific meanings in the context of HA. Three of these terms are discussed here – switchover, failover, and recovery.

What is a Switchover?

A switchover is a user-initiated action via the high availability (HA) clustering solution user interface or CLI. In a switchover, the user manually initiates the action to change the source or primary server for the protected application. In a typical switchover scenario, all running applications and dependencies are stopped in an orderly fashion, beginning with the parent application and concluding when all of the child/dependencies are stopped. Once the applications and their dependencies are stopped, they are then restarted in an orderly fashion on the newly designated primary or source server.

For example, suppose you have resources Alpha, Beta, and Gamma. Resource Alpha depends on resources Beta and Gamma, and resource Beta depends on resource Gamma. In a switchover event, resource Alpha is stopped first, followed by Beta, and then finally Gamma. Once all three are stopped, the switchover continues to bring the resources into an operational state on the intended server. The process starts with resource Gamma, followed by Beta, and then finally the start-up operations complete for resource Alpha.
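
The ordering can be pictured as a pair of wrapper scripts. This is an illustration only – the helper names are hypothetical and the clustering software performs these steps for you:

# On the current source node: stop parent-first
stop_alpha && stop_beta && stop_gamma
# On the new source node: start child-first, in reverse order
start_gamma && start_beta && start_alpha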

Traditionally, a switchover operation requires more time because resources must be stopped in a graceful and orderly manner. A switchover is often performed to update software versions while maintaining uptime, to perform maintenance work (via rolling upgrades) on the primary production node, or to do DR testing.

Key Takeaway: If there was no failure to cause the action, then it was a switchover.

What is a Failover?

A failover operation is typically a non-user-initiated action taken in response to a server crash or unexpected/unplanned reboot. Consider the scenario of an HA cluster with two nodes, Node A and Node B, where all critical applications Alpha, Beta, and Gamma are started and operational on Node A. A failover is what takes place when Node A experiences an unexpected/unplanned reboot, power-off, halt, or panic. Once the HA software detects that Node A is no longer functioning and operationally available within the cluster (as defined by the solution), it will trigger a failover operation to restore access to the critical applications, resources, services, and dependencies on the available cluster node, Node B in this case. In a failover scenario, because Node A has experienced a crash (or other simulated immediate failure), there are no processes to stop on Node A; consequently, once proper detection and fencing actions have been processed, Node B will immediately begin the process of restoring resources. As in the switchover case, the process starts with resource Gamma, followed by Beta, and then finally the start-up operations complete for resource Alpha. Traditionally, a failover operation requires less time than a switchover, because it does not require any resources to be stopped (or quiesced) on the previous primary (in-service or active) node.

Key Takeaway: A failover occurs in response to a system failure.

What is Recovery?

A recovery event is easy to confuse with a failover. A recovery event occurs when a process, server, communication path, disk, or even cluster resource fails and the high availability software operates in response to the identified failure. Most HA software solutions are capable of multiple ways of handling a recovery event. The most prominent methods include:

  1. Graceful restart locally, then a graceful restart on the remote
    1. A restart is always attempted locally; if recovery is successful, no further action occurs.
    2. If the local restart fails, resources are gracefully moved to the remote node.
  2. Graceful restart locally, then a forced restart on the remote
    1. A restart is always attempted locally; if recovery is successful, no further action occurs.
    2. If the local restart fails, resources are moved to the remote node by fencing the primary node.
  3. Forced restart on the remote
    1. A restart is never attempted locally.
    2. Resources are always forced to the next available cluster node as described in method 2b.
  4. Forced server restart, no remote failover
    1. A restart is always attempted locally.
    2. If the local restart fails, the primary node is restarted to attempt to recover services.
    3. Resources will not fail over to a remote system.
  5. Policy-based local restart, then remote
    1. Policies may govern the number of local retries before a remote recovery attempt occurs.
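
To make the retry-then-failover idea in method 5 concrete, here is an illustrative shell-style sketch; the helper names and retry count are hypothetical and are not actual HA product commands or configuration:

# Illustrative pseudocode only – not real cluster configuration
RETRIES=3                                    # hypothetical policy value
for attempt in $(seq 1 "$RETRIES"); do
    restart_application_locally && exit 0    # hypothetical helper; stop here if the local restart succeeds
done
fail_over_to_remote_node                     # hypothetical helper; runs only after local retries are exhausted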

Due to the number of variations in recovery policy, it is easy to see a recovery event that resembles the behavior of a switchover. This is often the case in methods 1 and 5, where applications and services are gracefully stopped in an orderly fashion before being started on the remote node. In methods 2 and 3, customers will often see behavior similar to a failover, because the primary server is restarted or fenced by the HA software. Method 4 is rarely used, but it is a hybrid of both a switchover and a failover. Method 4 begins with a graceful stop of the applications and services, followed by a restart of the applications and services (much like a switchover). However, if the local restart of the applications and services fails, the system will be restarted (much like a failover), but without actually failing over to the remote cluster node. While rare, method 4 is often invoked in cases where an unbalanced cluster is present, or used with a policy-based methodology.

Key Takeaway: The behavior of a recovery event depends on the recovery method chosen.

HA terminology between vendors is an area where common terms can take on different meanings. As you deploy and maintain your cluster solution with enterprise applications, be sure that you understand the solution provider’s terms for failover, switchover, and recovery. And, while you are at it, make sure you know whether the restaurant will put the sauce on the side (in a saucer), or on the side (on your mashed potatoes).

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, failover clustering, High Availability

White Paper: High Availability Clusters in VMware vSphere without Sacrificing Features or Flexibility

October 22, 2022 by Jason Aw Leave a Comment

Six key facts you should know about high availability protection in VMware vSphere

Many large enterprises are moving important applications from traditional physical servers to virtualized environments, such as VMware vSphere, in order to take advantage of key benefits such as configuration flexibility, data and application mobility, and efficient use of IT resources. Realizing these benefits with business-critical applications, such as SQL Server or SAP, can pose several challenges.

This paper explains these challenges and highlights six key facts you should know about HA protection in VMware vSphere environments that can save you money.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

High Availability Protection for Financial Services

October 18, 2022 by Jason Aw Leave a Comment

Historically, financial services organizations ran critical applications on mainframes. However, today’s financial institutions may have critical workloads in a variety of configurations, from all on-premises to hybrid to all in-cloud infrastructures, depending on the server types and bandwidth available to accommodate these configurations. Learn how to plan and ensure that applications and other related resources in HA/DR environments meet the organization’s needs for scalability, reliability, and configuration flexibility.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, White Paper

New Options for High Availability Clusters, SIOS Cements its Support for Microsoft Azure Shared Disk

September 24, 2022 by Jason Aw Leave a Comment

Microsoft introduced Azure Shared Disk in Q1 of 2022. Shared Disk allows you to attach a managed disk to more than one host. Effectively this means that Azure now has the equivalent of SAN storage, enabling Highly Available clusters to use shared disk in the cloud!

A major advantage of using Azure Shared Disk with a SIOS Lifekeeper cluster hierarchy is that you will no longer be required to have either a storage quorum or a witness node to avoid so-called split-brain, which occurs when the communication between nodes is lost and several nodes are potentially changing data simultaneously. Fewer nodes means less cost and complexity.

SIOS has introduced an Application Recovery Kit (ARK) for our LifeKeeper for Linux product, called the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit, that allows Azure Shared Disks to be used in conjunction with SCSI-3 reservations. This ARK guarantees that a shared disk is only writable from the node that currently holds the SCSI-3 reservations on that disk.

When installing SIOS Lifekeeper, the installer will detect that it’s running in Microsoft Azure and automatically install the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit to enable support for Azure Shared Disk.

Resource creation within Lifekeeper is straightforward and simple (Figure 1). Once locally mounted, the Azure Shared Disk is simply added into Lifekeeper as a file-system type resource. Lifekeeper will assign it an ID (Figure 2) and manage the SCSI-3 locking automatically.
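
For illustration, preparing the shared disk before protecting it might look like the following on the first cluster node; the device name /dev/sdc and the xfs file system are assumptions for this example, so use the values from your own environment:

sudo mkfs.xfs /dev/sdc        # create a file system on the Azure Shared Disk (first node only)
sudo mkdir -p /sapinst
sudo mount /dev/sdc /sapinst  # mount locally, then add /sapinst to Lifekeeper as a file system resource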

Figure 1: Creation of /sapinst within Lifekeeper.

Figure 2: /sapinst created and extended to both cluster nodes.

SCSI-3 reservations guarantee that the Azure Shared Disk is only writable on the node that holds the reservations (Figure 3). In a scenario where cluster nodes lose communication with each other, the standby server will come online, creating a potential split-brain situation. However, because of the SCSI-3 reservations only one node can access the disk at a time, which prevents an actual split-brain scenario. Only one system will hold the reservation, and it will either become the new active node (in which case the other will reboot) or remain the active node. Nodes that do not hold the Azure Shared Disk reservation will simply end up with the resource in a “Standby State” because they cannot acquire the reservation.
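
Lifekeeper manages these reservations automatically, but as an illustration they can also be inspected with the standard sg_persist utility from the sg3_utils package (the device name here is an assumption for this example):

sg_persist --in --read-keys /dev/sdc          # list the registered reservation keys
sg_persist --in --read-reservation /dev/sdc   # show which key currently holds the reservation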

Figure 3: Output from Lifekeeper logs when trying to mount a disk that is already reserved.

Link to Microsoft’s definition of Azure Shared Disks: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-shared

At present SIOS supports Locally-redundant Storage (LRS), and we’re working with Microsoft to test and support Zone-Redundant Storage (ZRS). Ideally we’d like to know when there is a ZRS failure so that we can fail over the resource hierarchy to the node closest to the active storage.

At present SIOS expects Azure Shared Disk support to arrive in the next release of Lifekeeper for Linux, version 9.6.2, in Q3 2022.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

