April 29, 2022
How to Achieve High Availability in the Cloud Using WSFC
Microsoft Windows Server includes Windows Server Failover Clustering (WSFC) software to ensure the availability of critical applications. In an on-premises environment, primary and standby nodes in the cluster are connected to the same shared storage. However, this infrastructure cannot be taken directly to the cloud. Shared storage that spans both primary and standby systems is essential in WSFC, but shared storage cannot be used with public cloud services such as IaaS (Infrastructure as a Service) in AWS, Azure, or Google Cloud.
Geographically Separated Shared Storage for WSFC is Not Available in the Cloud
When migrating on-premises applications to the cloud, companies prefer to move their entire infrastructure to the cloud, including WSFC, without changing the on-premises operation process. This allows them to minimize disruption by applying the same WSFC skills and know-how in the cloud.
The servers that make up the cluster are divided into the primary node – where the application runs – and standby node(s). WSFC software monitors the application and server node to ensure they are operational. If WSFC detects something wrong with the primary node, it switches operation of the application to the standby node in a process called “failover”.
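The monitor-and-failover loop described above can be sketched in a few lines. This is a conceptual illustration only, not WSFC internals; the node structure and failure threshold are assumptions for the example.

```python
def check_heartbeat(node, failure_count, max_failures=3):
    """Return (updated_failure_count, should_failover).

    `node` is a dict with an 'alive' flag standing in for a real health
    probe. A healthy response resets the counter; repeated missed
    heartbeats past the threshold trigger failover to the standby node.
    """
    if node["alive"]:
        return 0, False
    failure_count += 1
    return failure_count, failure_count >= max_failures
```

In practice the cluster software runs a check like this continuously against both the application and the server node, and only initiates failover after several consecutive failures to avoid reacting to a transient glitch.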
In a WSFC environment, the primary server and the standby server are connected to shared storage – typically storage called SAN (Storage Area Network) or iSCSI-SAN storage.
To fail over operations from the primary server to the standby server, the network link must be switched so the standby server can read from and write to the SAN that the primary server normally uses. This makes it possible to restart the service in a short time, allowing the standby server to access the same data as the primary node and meet low Recovery Point Objectives (RPOs).
See related content: Disaster Recovery Fundamentals.
However, when migrating WSFC to the cloud, there is no SAN available. For example, neither Amazon Web Services (AWS) nor Microsoft Azure lets you attach a single disk to multiple nodes (servers) to use as shared storage. The same applies to IaaS offerings from other cloud providers.
It is possible to build an HA cluster configuration based on WSFC without shared storage, but it requires extremely advanced skills, such as creating your own program to recover data on the standby node. The operation is complicated and it is not easy to verify when an incident occurs.
Data Replication Software Solves the Problem
To rectify this problem, you can install data replication software that is specialized for HA clusters – such as SIOS DataKeeper Cluster Edition – and synchronize storage among local servers. Data on the local disks of the primary and standby nodes are synchronized in real-time using host-based, block-level replication. With this method, you do not need shared storage. Instead, you can build an HA cluster configuration using familiar WSFC without disrupting established processes.
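The host-based, block-level replication described above can be illustrated with a minimal sketch. This is conceptual only, not DataKeeper's actual algorithm: divide each volume into fixed-size blocks and ship only the blocks that differ.

```python
BLOCK_SIZE = 4096  # bytes per block; real products track dirty blocks in a bitmap

def changed_blocks(primary: bytes, standby: bytes):
    """Yield (block_index, data) for each block that differs between volumes."""
    for i in range(0, max(len(primary), len(standby)), BLOCK_SIZE):
        p = primary[i:i + BLOCK_SIZE]
        s = standby[i:i + BLOCK_SIZE]
        if p != s:
            yield i // BLOCK_SIZE, p

def replicate(primary: bytes, standby: bytearray):
    """Apply only the changed blocks so the standby matches the primary."""
    for idx, data in changed_blocks(primary, bytes(standby)):
        standby[idx * BLOCK_SIZE:idx * BLOCK_SIZE + len(data)] = data
```

Because only modified blocks cross the wire, the standby's local disk stays synchronized in real time without the nodes ever sharing physical storage, which is what lets WSFC treat the replicated volume as if it were a SAN.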
With DataKeeper, synchronized nodes appear as a SAN in the WSFC management screen (Failover Cluster Management). If your operations managers have used WSFC, they will require little to no training with this approach.
High Availability in the Cloud Surpasses On-Premises HA with SIOS DataKeeper and WSFC
DataKeeper Cluster Edition is a software add-on that seamlessly integrates with Windows Server Failover Clustering (WSFC) to add performance-optimized, host-based synchronous or asynchronous replication. In the unlikely event that the primary node malfunctions, WSFC orchestrates the failover of operations to the standby node(s), which access their replicated local storage as if it were shared storage. This simple mechanism makes it possible to move to AWS without changing the operations of the existing system.
Without compromising familiar WSFC operations, it is possible to guarantee high availability in the cloud using DataKeeper that is equivalent to or better than on-premises high availability. The advantage of this cluster configuration is that it is very simple and can be easily applied to any cloud environment.
Seamless Integration with WSFC
SIOS DataKeeper Cluster Edition seamlessly integrates with and extends Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data replication mechanism. While WSFC manages the software cluster, SIOS performs the data replication to enable disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or impractical, such as in cloud, virtual, and high-performance storage environments.
Reproduced with permission from SIOS
April 26, 2022
The single best way to deploy quorum/witness
During a recent meeting, a customer asked a question about High Availability (HA) and the need for a quorum/witness. Their question was, “What is the best way to deploy quorum/witness?” The answer is simple: there is no single best way to deploy quorum. To understand why, let’s start by defining three key things: a witness resource, a quorum resource, and a split-brain scenario.
What is split brain?
In a normal cluster environment, the protected application is running on the primary node in the cluster. In the event of an application failure on the primary node, the clustering software moves the application operation to a secondary or remote node, which assumes the role of primary. At any given time, there is only one primary node.
Split brain is a condition that occurs when members of a cluster are unable to communicate with each other, but are in a running and operable state, and subsequently take ownership of common resources simultaneously. In effect, you have two bus drivers fighting for the steering wheel. Split-brain, due to its destructive nature, can cause data loss or data corruption and is best avoided through use of fencing, quorum, witness, or a quorum/witness functionality for cluster arbitration.
In most cluster managers, quorum is maintained when a majority of the voting members (nodes plus any witness) can communicate with one another, and quorum is lost when the members that remain reachable no longer constitute that majority.
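In simple-majority terms, the quorum rules reduce to a single check: a partition keeps quorum only if it can see a strict majority of all votes. A minimal illustration, not any particular cluster manager's implementation:

```python
def has_quorum(visible_votes: int, total_votes: int) -> bool:
    """A partition retains quorum only with a strict majority of all votes."""
    return visible_votes > total_votes // 2

# Two-node cluster plus a witness: 3 total votes.
# A node that can still reach the witness sees 2 of 3 votes and keeps
# quorum; the isolated node sees only its own vote and must stand down,
# which is exactly how split brain is avoided.
```

Note that an even split (2 of 4 votes visible) fails the strict-majority test on both sides, which is why even-membered clusters need a witness to break ties.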
What is a witness resource (or server)?
A witness resource is a server, network endpoint, or device that is used to achieve and maintain quorum when a cluster has an even number of members. A cluster with an odd number of members, using cluster majority, does not need a witness resource because the members themselves serve to arbitrate majority membership.
What is quorum and a quorum resource?
A quorum resource is a resource (device, system, block storage, file storage, file share, etc) that serves as a means for arbitration of the cluster state and membership. In some cluster managers, quorum is a resource within the cluster that aids or is required for any cluster state and cluster membership decisions. In other cluster managers, quorum functions as a tie-breaker to avoid split-brain.
More than One Way to Deploy a Quorum
Given the critical nature of quorum it is essential that HA architectures deploy quorum/witness resources properly, and fortunately (or unfortunately) there is no single, best way to deploy quorum. There are several factors that may shape the way in which your witness and quorum resources behave. These factors include:
1. Whether or not your deployment will be on-premises, cloud, or hybrid
Deploying in an on-premises datacenter where additional storage devices, such as Fibre Channel storage, power control devices or connections, or traditional STONITH devices are present will provide customers with additional options for quorum and witness functionality that may not reside in the cloud. Likewise, cloud and hybrid environments present differences in what can be deployed and what use cases quorum is being deployed to prevent. Additionally, latency requirements and differences may limit what types of devices and resources are available for a quorum/witness configuration.
2. Your recovery objectives
Recovery objectives are also important to consider when designing and architecting your quorum and witness resources. In an example two-node cluster (node A and node B), when node A loses connectivity to node B, what is the highest priority for recovery? If the witness/quorum resources are on the same network as node A, the result could be that node A remains online but severed from clients, while node B is unable to establish quorum and take over. Likewise, if the quorum device lived only in the region, data center, or network with node B, a loss could result in a failover of resources to a defunct network or data center, away from a functional and operational primary node.
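The placement trade-off can be made concrete with a small sketch. This is illustrative only, assuming every node and the witness each carry one vote:

```python
def surviving_partition(partitions, witness_partition):
    """Return the partition that keeps quorum after a network split.

    `partitions` is a list of node-name lists; `witness_partition` is the
    index of the partition that can still reach the witness. Returns None
    when no partition holds a strict majority (the cluster halts).
    """
    total_votes = sum(len(p) for p in partitions) + 1  # +1 for the witness
    for i, p in enumerate(partitions):
        votes = len(p) + (1 if i == witness_partition else 0)
        if votes > total_votes // 2:
            return p
    return None

# Two-node split: whichever side the witness can still see survives.
# If the witness shares node A's network, node B can never take over
# during a partition -- exactly the placement risk described above.
```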
3. Redundancy of Available Data Centers (or Regions) Within Your Infrastructure
The redundancy of the data center or region is also an important factor in your HA topology with quorum/witness. If your data center has only two levels of redundancy, you must understand the tradeoff between placement of the quorum/witness in the same data center as the primary or standby cluster node. If the data center has more than two redundant tiers, such as a third availability zone or access to a second region, this option would provide a higher level of redundancy for the cluster.
4. Disaster Recovery Requirements
Understanding your true disaster recovery requirements is also a major factor in your design. If your cluster manager software requires access to the quorum/witness in order to recover from a total data center outage (or region failure) then you’ll need to understand this impact on your design. Many high availability software packages have tools or methods for this scenario, but if your software does not, your design and placement of quorum/witness may need to accommodate this reality.
5. Number Of Members Within the Cluster, and Their Location
An additional quorum/witness server is typically not required when the cluster contains an odd number of nodes. However, using only two nodes in a cluster, or deploying a DR node that is not always available, may change your architecture. As VP of Customer Experience, I have worked with customers who have deployed three-node architectures but, for cost savings, automate periodic shutdown of the third server.
6. Operating System and Cluster Manager
The final factor to mention on quorum/witness is the cluster manager and operating system. Not all HA software and cluster managers are equal when it comes to deployment of quorum/witness or arbitration of quorum status. Some clustering software requires shared disks for arbitration; others are more flexible, allowing shares (NFS, SMB, EFS, Azure Files, and S3). Being aware of what your cluster manager requires, and the modes that it supports with regard to quorum (simple majority, witness, file share, etc.), will impact not only what you deploy, but how you deploy.
The single best way to deploy a quorum/witness server is to understand your vendor’s definition of quorum/witness and their available options, know your requirements, factor in the limitations or opportunities presented by your data center (or cloud environment) and architect the solution that provides your critical systems the highest level of protection against split-brains, false failovers, and downtime.
-Cassius Rhue, VP, Customer Experience
Reproduced from SIOS
April 21, 2022
Measuring and Improving Write Throughput Performance on GCP Using SIOS DataKeeper for Windows
This post documents my findings on write performance to a replicated disk in GCP. But first, some background information. A customer expressed concern that DataKeeper was adding a tremendous amount of overhead to their write performance when testing a synchronous mirror between Google zones in the same region. The original test they performed had the bitmap file on the C drive, which was a persistent SSD. In this configuration they were only pushing about 70 MBps. They tried relocating the bitmap to a GCP Extreme persistent disk, but the performance did not improve.
Moving the Bitmap to a Local SSD
I suggested that they move the bitmap to a local SSD, but they were hesitant because they believed the extreme disk they were using for the bitmap had latency and throughput that was as good or better than the local SSD, so they doubted it would make a difference. In addition, adding a local SSD is not a trivial task since it can only be added when the VM is originally provisioned.
Selecting the Instance Type
As I set out to complete my task, the first thing I discovered was that not every instance type supports a local SSD. For instance, the E2-Standard-8 does not support local SSD. For my first test I settled on a C2-Standard-8 instance type, which is considered “compute optimized”. I attached a 500 GB persistent SSD and started running some write performance tests, and quickly discovered that I could only get the disk to write at about 140 MBps rather than its max speed of 240 MBps. The customer confirmed that they saw the same thing. It was perplexing, but we decided to move on and try a different instance type.
The second instance type we selected was an N2-Standard-8. With this instance type we were able to push the disk to its maximum throughput speed of 240 MBps when not replicating the disk. I moved the bitmap to the local SSD I had provisioned and repeated the same tests on a synchronous mirror (DataKeeper v8.8.2) and got the results shown below.
Diskspd test parameters
diskspd.exe -c96G -d10 -r -w100 -t8 -o3 -b64K -Sh -L D:\data.dat
The 64K and 4K write sizes incur overhead that could be considered acceptable for synchronous replication. The 8K write size seems to incur more significant overhead, although the average latency of 3.183 ms is still quite low.
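Since the results table is not reproduced here, the overhead comparison can be expressed generically as the percentage of throughput lost to the synchronous mirror. The figures in this example are placeholders for illustration, not the actual test results:

```python
def replication_overhead(baseline_mbps: float, mirrored_mbps: float) -> float:
    """Percent of write throughput lost to synchronous replication."""
    return (baseline_mbps - mirrored_mbps) / baseline_mbps * 100

# Hypothetical figures: a 240 MBps unmirrored baseline dropping to
# 216 MBps under a synchronous mirror is a 10% overhead.
print(round(replication_overhead(240, 216), 1))  # 10.0
```

Comparing this percentage across the 4K, 8K, and 64K write sizes is what reveals which block sizes pay the largest penalty for the mirror.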
-Dave Bermingham, Director, Customer Success
Reproduced with permission from SIOS
April 17, 2022
How COVID-19 Impacts High Availability
Compared to friends, family, and those who have required treatment, hospitalization, or intensive care, my COVID symptoms have been mild. This is likely the result of reasonably good health, both doses of the vaccine, a booster shot, and early detection and treatment. And my heart goes out to every family who has lost a loved one to any aspect of this pandemic, and to all those who have lost opportunities and special moments. As I and several members of our SIOS team recover from COVID-19, we wanted to share five things that your IT team may be dealing with as they fight COVID and enterprise downtime, and five things you can do to help them.
Five COVID Concerns Facing IT Teams
So, what should IT team leads, stakeholders, and managers do when their teams experience an issue with COVID-19?
Five Ways to Help IT Teams Battling COVID
As the pandemic continues, we all hope for a future that greatly resembles normalcy, including less illness, fear and worry. In the meantime, being more aware of the concerns your team members are facing during COVID illness and recovery will greatly help you proactively prepare and weather the current storm. In addition, key lessons learned from this pandemic can be applied across a number of other organizational, employee life, and global concerns.
Reproduced with permission from SIOS
April 13, 2022
How to Get the Most from Your Tech Support Call
Technical support experts share their tips on how to fast-track issue resolution
SIOS provides high availability protection for our customers’ most critical applications, databases, and ERPs. When our customers call tech support, there is no time to waste. We’ve earned a reputation (and several awards) for our HA/DR expertise and support excellence.
We’ve asked our tech support team to share the following five questions that can fast-track your issue resolution.
Fast, Accurate Diagnosis
Thorough and accurate tech support is similar to diagnosing an illness. Imagine asking your doctor to treat a headache. The human body is a complex interaction of multiple systems. The source of your problem may not be obvious or even in your head. To diagnose the issue and recommend a treatment, your doctor typically begins with questions aimed at identifying the circumstances that caused your symptoms.
Failover clustering also involves multiple systems at every layer of the IT infrastructure – network, storage, OS, application, database, and server. And like your real headache, your HA issue is often caused by something unrelated to your HA clustering software. Like your doctor, a good support professional will ask a variety of questions to characterize your issue. The more information you can provide about your support issue, the faster and more effectively it can be diagnosed and resolved.
Fast-Tracking Issue Resolution
As an IT best practice, consider logging key information and system changes as an ongoing business exercise. By putting answers to the following key questions at your fingertips, this process will speed the diagnosis and fast-track issue resolution. (It may also help you prevent issues from occurring in the first place).
For more than 20 years, the SIOS Customer Experience team has been helping enterprise customers implement HA/DR solutions for a wide range of use cases. We value our customers and encourage them to contact us whenever they have questions about their HA/DR environments.
Reproduced with permission from SIOS