SIOS SANless clusters


White Paper: High Availability Clusters in VMware vSphere without Sacrificing Features or Flexibility

October 22, 2022 by Jason Aw


Six key facts you should know about high availability protection in VMware vSphere

Many large enterprises are moving important applications from traditional physical servers to virtualized environments, such as VMware vSphere, to take advantage of key benefits such as configuration flexibility, data and application mobility, and efficient use of IT resources. Realizing these benefits with business-critical applications, such as SQL Server or SAP, can pose several challenges.

This paper explains these challenges and highlights six key facts you should know about HA protection in VMware vSphere environments that can save you money.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: clusters, High Availability

High Availability Protection for Financial Services

October 18, 2022 by Jason Aw


Historically, financial services organizations ran critical applications on mainframes. Today's financial institutions, however, may run critical workloads in a variety of configurations, from all on-premises to hybrid to all in-cloud infrastructures, depending on the server types and bandwidth available. Learn how to plan HA/DR environments so that applications and other related resources meet the organization's needs for scalability, reliability, and configuration flexibility.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, White Paper

How to use Azure Site Recovery (ASR) to replicate a Windows Server Failover Cluster (WSFC) that uses SIOS DataKeeper for cluster storage

October 14, 2022 by Jason Aw


Intro

So you have built a SQL Server Failover Cluster Instance (FCI), or maybe an SAP ASCS/ERS cluster, in Azure. Each node of the cluster resides in a different Availability Zone (AZ); or maybe you have strict latency requirements and are using Proximity Placement Groups (PPGs), with all of your nodes residing in the same Availability Set. Regardless of the scenario, you now have a much higher level of availability for your business-critical application than if you were running a single instance.

Now that you have high availability (HA) covered, what are you going to do for disaster recovery? Regional disasters that take out multiple AZs are rare, but as recent history has shown us, Mother Nature can really pack a punch. You want to be prepared should an entire region go offline.

Azure Site Recovery (ASR) is Microsoft’s disaster recovery-as-a-service (DRaaS) offering that allows you to replicate entire VMs from one region to another. It can also replicate virtual machines and physical servers from on-prem into Azure, but for the purpose of this blog post we will focus on the Azure Region-to-Region DR capabilities.

Setting up Azure Site Recovery

We are going to assume you have already built your cluster using SIOS DataKeeper. If not, here are some pointers to help get you started.

Failover Cluster Instances with SQL Server on Azure VMs

SIOS DataKeeper Cluster Edition for the SAP ASCS/SCS cluster share disk

We are also going to assume you are familiar with Azure Site Recovery. Instead of yet another guide on setting up ASR, I suggest you read the latest documentation from Microsoft. This article will focus instead on some things you may not have considered and the specific steps required to fix your cluster after a failover to a different subnet.

Paired Regions

Before you start down the DR path, you should be aware of the concept of Azure paired regions. Every region in Azure has a preferred DR region. If you want to learn more about paired regions, the documentation provides great background. There are some solid benefits to using your paired region, but ultimately it is up to you to decide which region will host your DR site.

Cloud Witness Location

When you originally built your cluster you had to choose a witness type for your quorum. You may have selected a File Share Witness or a Cloud Witness. Typically either of those witness types should reside in an AZ that is separate from your cluster nodes.

However, when you consider that, in the event of a disaster, your entire cluster will be running in your DR region, there is a better option. You should use a cloud witness, and place it in your DR region. By placing your cloud witness in your DR region, you provide resiliency not only for local AZ failures, but it also protects you should the entire region fail and you have to use ASR to recover your cluster in the DR region. Through the magic of Dynamic Quorum and Dynamic Witness, you can be sure that even if your DR region goes offline temporarily, it will not impact your production cluster.
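For reference, configuring a cloud witness takes a single PowerShell command on the cluster; the storage account named below would be one you create in your DR region, and both the account name and key are placeholders:

Set-ClusterQuorum -CloudWitness -AccountName &lt;StorageAccountName&gt; -AccessKey &lt;StorageAccountAccessKey&gt;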

Multi-VM Consistency

When using ASR to replicate a cluster, it is important to enable Multi-VM Consistency to ensure that each cluster node’s recovery point is from the same point in time. That ensures that the DataKeeper block level replication occurring between the VMs will be able to continue after recovery without requiring a complete resync.

Crash Consistent Recovery Points

Application consistent recovery points are not supported in replicated clusters. When configuring the ASR replication options do not enable application consistent recovery points.

Keep IP Address After Failover?

When using ASR to replicate to your DR site, there is a way to keep the IP addresses of the VMs the same; Microsoft describes it in the article entitled Retain IP addresses during failover. If you can keep the IP address the same after failover, it will simplify the recovery process, since you won't have to fix any cluster IP addresses or DataKeeper mirror endpoints, which are based on IP addresses.

However, in my experience, I have never seen anyone actually follow the guidance above, so recovering a cluster in a different subnet will require a few additional steps after recovery before you can bring the cluster online.

Your First Failover Attempt

Recovery Plan

Because you are using Multi-VM Consistency, you have to failover your VMs using a Recovery Plan. The documentation provides pretty straightforward guidance on how to do that. A Recovery Plan groups the VMs you want to recover together to ensure they all failover together. You can even add multiple groups of VMs to the same Recovery Plan to ensure that your entire infrastructure fails over in an orderly fashion.

A Recovery Plan can also launch post-recovery scripts to help complete the recovery successfully. The steps I describe below can all be scripted as part of your Recovery Plan, thereby fully automating the complete recovery process. We will not be covering that process in this blog post, but Microsoft documents it.

Static IP Addresses

As part of the recovery process you want to make sure the new VMs have static IP addresses. You will have to adjust the interface properties in the Azure Portal so that the VM always uses the same address. If you want to add a public IP address to the interface you should do so at this time as well.
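If you prefer to script this step, the Azure CLI can pin the recovered NIC to a static private address. This is a sketch with placeholder names; ipconfig1 is the default IP configuration name, and yours may differ:

az network nic ip-config update --resource-group &lt;dr-rg&gt; --nic-name &lt;recovered-vm-nic&gt; --name ipconfig1 --private-ip-address 10.1.0.10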

Network Configuration

After the replicated VMs are successfully recovered in the DR site, the first thing you want to do is verify basic connectivity. Is the IP configuration correct? Are the instances using the right DNS server? Is name resolution functioning correctly? Can you ping the remote servers?

If there are any problems with network communications then the rest of the steps described below will be bound to fail. Don’t skip this step!
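A quick sanity check from PowerShell on one of the recovered nodes might look like the following (names are illustrative):

ipconfig /all                # confirm the IP address and DNS server settings
nslookup &lt;remote-node-name&gt;  # confirm name resolution is using the right DNS server
ping &lt;remote-node-name&gt;      # confirm basic reachability between the nodes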

Load Balancer

As you probably know, clusters in Azure require you to configure a load balancer for client connectivity to work. The load balancer does not fail over as part of the Recovery Plan. You need to build a new load balancer based on the cluster that now resides in this new vNet. You can do this manually or script this as part of your Recovery Plan to happen automatically.
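As a rough sketch, creating a Standard internal load balancer with the Azure CLI might look like this. All names are placeholders, and the ports are assumptions for a SQL Server FCI (1433 for client traffic, 59999 as the health probe port):

az network lb create -g &lt;dr-rg&gt; -n &lt;dr-lb&gt; --sku Standard --vnet-name &lt;dr-vnet&gt; --subnet &lt;dr-subnet&gt; --backend-pool-name &lt;pool&gt; --private-ip-address 10.1.0.20
az network lb probe create -g &lt;dr-rg&gt; --lb-name &lt;dr-lb&gt; -n &lt;probe&gt; --protocol tcp --port 59999
az network lb rule create -g &lt;dr-rg&gt; --lb-name &lt;dr-lb&gt; -n &lt;rule&gt; --protocol tcp --frontend-port 1433 --backend-port 1433 --backend-pool-name &lt;pool&gt; --probe-name &lt;probe&gt; --floating-ip true

You would then add the recovered cluster nodes to the backend pool and make sure the cluster IP resource's ProbePort parameter matches the probe port.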

Network Security Groups

Running in this new subnet also means that you have to specify what Network Security Group you want to apply to these instances. You have to make sure the instances are able to communicate across the required ports. Again, you can do this manually, but it would be better to script this as part of your Recovery Plan.
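Again assuming a SQL Server FCI as the workload, a minimal Azure CLI sketch (placeholder names, illustrative ports) might be:

az network nsg rule create -g &lt;dr-rg&gt; --nsg-name &lt;dr-nsg&gt; -n AllowClusterPorts --priority 100 --access Allow --protocol Tcp --destination-port-ranges 1433 59999
az network vnet subnet update -g &lt;dr-rg&gt; --vnet-name &lt;dr-vnet&gt; -n &lt;dr-subnet&gt; --network-security-group &lt;dr-nsg&gt;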

Fix the IP Cluster Addresses

If you are unable to make the changes described earlier to recover your instances in the same subnet, you will have to complete the following steps to update your cluster IP addresses and the DataKeeper addresses for use in the new subnet.

Every cluster has a core cluster IP address. What you will see if you launch the WSFC UI after a failover is that the cluster won’t be able to connect. This is because the IP address used by the cluster is not valid in the new subnet.

If you open the properties of that IP Address resource you can change the IP address to something that works in the new subnet. Make sure to update the Network and Subnet Mask as well.

Once you fix that IP address, you will have to do the same for any other cluster IP addresses used by your cluster resources.
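If you would rather script this than click through the WSFC UI, the same change can be made with the FailoverClusters PowerShell module; the resource name and addresses here are illustrative. Take the IP Address resource offline before making the change and bring it back online afterward:

Get-ClusterResource "Cluster IP Address" | Set-ClusterParameter -Multiple @{"Address"="10.1.0.15";"SubnetMask"="255.255.255.0"}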

Fix the DataKeeper Mirror Addresses

SIOS DataKeeper mirrors use IP addresses as mirror endpoints. These are stored in the mirror and mirror job. If you recover a DataKeeper based cluster in a different subnet, you will see that the mirror comes up in a Resync Pending state. You will also notice that the Source IP and the Target IP reflect the original subnet, not the subnet of the DR site.

Fixing this issue involves running a command from SIOS called CHANGEMIRRORENDPOINTS. The usage for CHANGEMIRRORENDPOINTS is as follows.

emcmd <NEW source IP> CHANGEMIRRORENDPOINTS <volume letter> <ORIGINAL target IP> <NEW source IP> <NEW target IP>
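For illustration, a hypothetical invocation, assuming a mirrored volume E, an original target IP of 10.0.0.6, and new source and target IPs of 10.1.0.5 and 10.1.0.6, would look like this:

emcmd 10.1.0.5 CHANGEMIRRORENDPOINTS E 10.0.0.6 10.1.0.5 10.1.0.6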


After the command runs, the DataKeeper GUI will be updated to reflect the new IP addresses, and the mirror will return to a Mirroring state.

Conclusions

You have now successfully configured and tested disaster recovery of your business-critical applications using a combination of SIOS DataKeeper for high availability and Azure Site Recovery for disaster recovery. If you have questions, or would like to consult with SIOS on designing and implementing high availability and disaster recovery for business-critical applications such as SQL Server, SAP ASCS and ERS, SAP HANA, or Oracle, please reach out to us.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Azure, SQL Server Failover Cluster Instance

The Generic Application Recovery Kit

October 6, 2022 by Jason Aw


The SIOS Protection Suite for Linux comes with an array of handy Application Recovery Kits (ARKs) covering major databases such as SAP HANA and Oracle, as well as IP addresses, file systems, and NAS/NFS shares and exports. Every SIOS-supplied ARK has restore (start), remove (stop), quickCheck and recover scripts; these are not readily configurable beyond any options asked for during configuration and addition into a protected hierarchy.

These ARKs are developed, maintained, quality checked and in some cases “support certified” by the application vendors themselves. What do you do if you have an application or service that’s not covered by an existing SIOS ARK?

Enter the generic ARK. The generic ARK can be added into a hierarchy and configured in a similar manner to other SIOS ARKs; the special thing about it is that you provide the restore, remove and quickCheck scripts yourself, plus an optional recover script.

You can use any configured scripting language to create your scripts (BASH and Perl are common). Let's investigate these scripts a little further:

restore: This is the script that is used to start your service or application

remove: This is the script used to stop your service or application

quickCheck: This script is used to determine whether your application or service is functioning as you would expect it to

recover: This script is used to attempt a recovery following a failure. Certain applications and services lend themselves to being restarted, or to having specific commands run, to recover from a failure scenario

By default, the quickCheck script runs every 180 seconds. If the quickCheck script detects a failure of the application it calls the recover script. The recover script tries to restart the application on the current node. Should the recover script fail to restart the application, or a recover script is not provided, the remove script is then executed. This initiates a failover to the standby node.
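To make this concrete, here is a minimal BASH quickCheck sketch. It assumes a hypothetical service whose process is named myapp; a real script might instead test a TCP port or run an application-level query:

#!/bin/bash
# quickCheck: return 0 if the application is healthy, non-zero otherwise.
if pgrep -x myapp > /dev/null 2>&1; then
    exit 0   # healthy: LifeKeeper takes no further action
else
    exit 1   # failed: LifeKeeper calls recover (or remove, triggering a failover)
fi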

Templates for the Generic Application Kit

SIOS provides example templates for the Generic Application Kit. These examples are installed with the LifeKeeper software and can be found here:

quickCheck, remove and restore

/opt/LifeKeeper/lkadm/subsys/gen/app/templates/actions/

recovery

/opt/LifeKeeper/lkadm/subsys/gen/app/templates/recovery

There are examples for quickCheck, remove and restore in both BASH (.sh) and Perl (.pl). The example scripts are self-documented, with comments throughout. If you're familiar with either BASH or Perl, you will be able to follow what each script is doing. A return code of 0 indicates a successful run; other values indicate a failure. The result of a script triggers the next action that LifeKeeper takes.

Setup within LifeKeeper

After you have created your scripts, you can create a Generic Application by clicking the green plus sign to create a new resource. Choose "Generic Application" to launch the configuration wizard, which walks you through the following steps:

  • Add a resource and select Generic Application.
  • Select the Restore script.
  • Select the Remove script.
  • Select the QuickCheck script.
  • Select the Recovery script (none in this example).
  • Enter the Application Info. This is a way to pass information to the GenAPP scripts; for example, in our GenAPP for the generic load balancer we use this field to pass the port that the load balancer is listening on.
  • Select whether or not you want to bring the GenAPP online once it is created. Sometimes you want to leave the GenAPP offline so that you can create any dependencies that might be required.
  • Give the resource that will be created a name.

Once you have entered all the information, the resource will be created.

So you can see that creating a GenAPP to protect almost any application is straightforward. A GenAPP allows you to protect any application, even custom applications built in-house.

If you would like to learn more about how SIOS can help you keep your business critical applications available please contact us!

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Generic Application Recovery

How to convert from SIOS NFS resources to EFS

October 2, 2022 by Jason Aw


As many customers look into migrating their SAP solutions to AWS, they may also want to convert existing Network File System (NFS) shares for the /sapmnt or /usr/sap/&lt;SID&gt; file systems into Amazon Elastic File System (EFS) shares. EFS shares are hosted as cloud file storage that can be managed like any local file system, and data placed in an EFS share gains much stronger protection thanks to the high availability and durability EFS provides.

Steps for Converting an Existing SAP Hierarchy using NFS to EFS

Companies that are currently using SIOS LifeKeeper for Linux clusters to protect SAP on premises can easily convert their SAP hierarchy from NFS to EFS using the following straightforward steps. The process should take only about 20 minutes.

In this example, the SIOS LifeKeeper Linux solution is protecting an NFS export share /exports/sapmnt/EDM with local mount point /sapmnt/EDM (i.e. 12.1.4.10:/exports/sapmnt/EDM /sapmnt/EDM) (Figure 1).
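Before making any changes, it is worth confirming the existing export and local mount on the primary node. Two quick checks:

exportfs -v          # the NFS export /exports/sapmnt/EDM should be listed
findmnt /sapmnt/EDM  # shows the local NFS mount and its 12.1.4.10:/exports/sapmnt/EDM source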

  1. Create the EFS share (reference: https://docs.aws.amazon.com/efs/latest/ug/gs-step-two-create-efs-resources.html)
    • Make sure to update the security group to the one your instances are using; this must be done before you can mount the share.
    • The IP address of the file system can be found under the Networking tab.
  2. Mount the EFS share on the primary (ISP) node to a temporary location (i.e. /sapmnttmp)
    a. mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport &lt;ipaddress-filesystem&gt;:/ /sapmnttmp
  3. Add SAP_NFS_CHECK_IGNORE=1 to /etc/default/LifeKeeper on both nodes

At this point the EFS file system is mounted alongside the existing NFS mounts. LifeKeeper is still checking the NFS mounts that were already there; since we know the new mount is an EFS file system, it is safe to set this flag so that LifeKeeper ignores the NFS warning caused by this new, as-yet-unrecognized file system.

  4. Use the LifeKeeper lkbackup tool to create a backup copy of the LifeKeeper configuration
    a. /opt/LifeKeeper/bin/lkbackup -c -n
  5. Use the LifeKeeper GUI or CLI to stop the SAP resources (perform an Out-of-Service) (Figure 2)
    a. /opt/LifeKeeper/bin/perform_action -t SAP-EDM_ASCS00 -a remove
  6. Stop any remaining sapstartsrv, SAP ERS, saposcol and saphostexec processes on the standby and primary nodes
    a. lsof -c sap
    i. kill &lt;pid&gt;
  7. Copy the NFS data from the NFS export to the new EFS location
    a. cp -pra /exports/sapmnt /sapmnttmp
    b. cp -pra /exports/usr/sap/EDM/ASCS00 /sapmnttmp
  8. Take the hanfs resources out of service (OSU)
    a. /opt/LifeKeeper/bin/perform_action -t hanfs-/exports/sapmnt/EDM -a remove (Figure 3)
    b. /opt/LifeKeeper/bin/perform_action -t hanfs-/exports/usr/sap/EDM/ASCS00 -a remove (Figure 4)
  9. Use umount to unmount the existing /sapmnt and other NFS local mounts
    a. umount /exports/sapmnt/EDM
    b. umount /exports/usr/sap/EDM/ASCS00
  10. Use the LifeKeeper GUI or CLI to take the associated datarep-sapmnt resources OSU
    a. /opt/LifeKeeper/bin/perform_action -t datarep-EDM -a remove (Figure 5)
    b. /opt/LifeKeeper/bin/perform_action -t datarep-ASCS00 -a remove (Figure 6)
  11. Add mount entries for EFS into /etc/fstab on node 1, replacing the SAP exports mounts with EFS
    a. &lt;ipaddress-filesystem&gt;:/sapmnt/EDM /sapmnt/EDM nfs nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 0 0
    b. &lt;ipaddress-filesystem&gt;:/ASCS00 /usr/sap/EDM/ASCS00 nfs nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 0 0
  12. Use the umount command to unmount the temporary /sapmnttmp mount point on node 1
    a. umount /sapmnttmp
  13. Unmount the SAP mounts
    a. umount -l /sapmnt/EDM
    b. umount -l /usr/sap/EDM/ASCS00
  14. Use the mount command to remount the EFS file systems on node 1
    a. mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport &lt;ipaddress-filesystem&gt;:/sapmnt/EDM /sapmnt/EDM
    b. mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport &lt;ipaddress-filesystem&gt;:/ASCS00 /usr/sap/EDM/ASCS00
    Now you are ready to delete the dependencies between the SAP resources and the old SIOS HANFS and NFS resources.
  15. Use the GUI to delete the dependencies between the SAP resources and the HANFS and NFS resources
    a. Delete the nfs-/exports/ dependencies from SAP-EDM_ASCS00
    b. Delete the hanfs-/ child dependencies from ip-12.1.4.10
    c. Delete the child dependency, ip-12.1.4.10, from the nfs-/exports/ file systems
    d. Use the LifeKeeper GUI or CLI to start the SAP resources (bring the SAP resource in-service)
    e. The new hierarchies will look similar to the following (Figure 7)
    f. Remove the hanfs-/exports and nfs-/exports hierarchies (Figure 8)
  16. Add mount entries for EFS into /etc/fstab on node 2
    a. &lt;ipaddress-filesystem&gt;:/sapmnt/EDM /sapmnt/EDM nfs nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 0 0
    b. &lt;ipaddress-filesystem&gt;:/ASCS00 /usr/sap/EDM/ASCS00 nfs nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 0 0
  17. Use the mount command to remount the EFS file systems on node 2 (refer to step 13 and unmount the SAP mounts on node 2 first)
    a. mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 10.0.147.83:/sapmnt /sapmnt
    b. mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 10.0.147.83:/ASCS00 /usr/sap/EDM/ASCS00
  18. Verify the SAP resources start properly, including the IP and EC2 resources (see the sample checks after this list)
  19. Use the LifeKeeper GUI to perform a switchover to the target (standby) node
  20. Done
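As a quick sanity check for step 18, the following commands (a sketch; adapt the paths to your SID) confirm that the mounts are now EFS-backed and that the LifeKeeper hierarchy is in service:

df -hT /sapmnt/EDM /usr/sap/EDM/ASCS00   # both mounts should show file system type nfs4 (EFS)
/opt/LifeKeeper/bin/lcdstatus -q         # SAP, IP and EC2 resources should report ISP (in service)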

Conclusion

Converting an NFS file system to EFS is a sure-fire way to provide much more protection for your data while taking advantage of AWS cloud resources. It also simplifies the resource hierarchies, making your file systems easier to read and manage. The steps above will enable a much faster and smoother transition of your data into the cloud.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: EFS, NFS
