Archives for April 2022

How to Achieve High Availability in the Cloud Using WSFC

April 29, 2022 by Jason Aw

Figure 1: Failover Cluster for High Availability in Azure

Microsoft Windows Server includes Windows Server Failover Clustering (WSFC) software to ensure the availability of critical applications. In an on-premises environment, primary and standby nodes in the cluster are connected to the same shared storage. However, this infrastructure cannot be taken directly to the cloud. Shared storage that spans both primary and standby systems is essential in WSFC, but shared storage cannot be used with public cloud services such as IaaS (Infrastructure as a Service) in AWS, Azure, or Google Cloud.

Geographically Separated Shared Storage for WSFC is Not Available in the Cloud

When migrating on-premises applications to the cloud, companies prefer to move their entire infrastructure to the cloud, including WSFC, without changing the on-premises operation process. This allows them to minimize disruption by applying the same WSFC skills and know-how in the cloud.

Figure 2: Failover Clustering in AWS EC2. Cluster nodes are in geographically separated Availability Zones for DR protection.

The servers that make up the cluster are divided into the primary node – where the application runs – and standby node(s). WSFC software monitors the application and server node to ensure they are operational. If WSFC detects something wrong with the primary node, it switches operation of the application to the standby node in a process called “failover”.
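The monitoring and failover decision described above can be sketched in a few lines. This is only an illustration, not how WSFC is implemented: the `Node` class, the `pick_active` function, and the threshold of three missed heartbeats are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    missed_heartbeats: int = 0  # consecutive health checks that went unanswered


def pick_active(nodes: list[Node], active: str, max_missed: int = 3) -> str:
    """Return the node that should run the application after a health check."""
    by_name = {node.name: node for node in nodes}
    if by_name[active].missed_heartbeats < max_missed:
        return active  # the primary still looks healthy, so nothing changes
    for node in nodes:
        if node.name != active and node.missed_heartbeats < max_missed:
            return node.name  # failover: the first healthy standby takes over
    raise RuntimeError("no healthy node available to host the application")
```

With a healthy primary the function returns the current primary; once the primary has missed the assumed number of heartbeats, it returns the first healthy standby.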

In a WSFC environment, the primary server and the standby server are connected to shared storage, typically a SAN (Storage Area Network) or iSCSI SAN.

To fail over operations from the primary server to the standby server, the network link must be switched so that the standby server can read from and write to the SAN that the primary server normally uses. Because the standby server then accesses the same data as the primary node, the service can be restarted in a short time and low Recovery Point Objectives (RPOs) can be met.

See related content: Disaster Recovery Fundamentals.

However, when migrating WSFC to the cloud, there is no SAN available. For example, in Amazon Web Services (AWS) and Microsoft Azure you cannot attach the same storage to multiple nodes (servers) to use as shared storage. The same applies to IaaS for other cloud services.

It is possible to build an HA cluster configuration based on WSFC without shared storage, but it requires extremely advanced skills, such as creating your own program to recover data on the standby node. The operation is complicated and it is not easy to verify when an incident occurs.

Data Replication Software Solves the Problem

To rectify this problem, you can install data replication software that is specialized for HA clusters – such as SIOS DataKeeper Cluster Edition – and synchronize storage among local servers. Data on the local disks of the primary and standby nodes are synchronized in real-time using host-based, block-level replication. With this method, you do not need shared storage. Instead, you can build an HA cluster configuration using familiar WSFC without disrupting established processes.

With DataKeeper, synchronized nodes appear as a SAN in the WSFC management screen (Failover Cluster Management). If your operations managers have used WSFC, they will require little to no training with this approach.

High Availability in the Cloud Surpasses On-Premises HA with SIOS DataKeeper and WSFC

DataKeeper Cluster Edition is a software add-on that seamlessly integrates with Windows Server Failover Clustering (WSFC) to add performance-optimized host-based synchronous or asynchronous replication. In the unlikely event that the primary node malfunctions, WSFC will orchestrate the failover of operations to the standby node(s), which access the replicated storage as if it were shared storage. This simple mechanism makes it possible to move to AWS without changing the operations of the existing system.

Without compromising familiar WSFC operations, it is possible to guarantee high availability in the cloud using DataKeeper that is equivalent to or better than on-premises high availability. The advantage of this cluster configuration is that it is very simple and can be easily applied to any cloud environment.
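To make the difference between synchronous and asynchronous replication concrete, here is a toy model. It is a sketch under stated assumptions, not DataKeeper's implementation: the `Disk` and `Mirror` classes are hypothetical, and "synchronous" is taken to mean that a write is acknowledged only after both the source and the target hold the block.

```python
class Disk:
    """A local disk modeled as a list of fixed-size blocks."""

    def __init__(self, block_count: int) -> None:
        self.blocks = [b""] * block_count


class Mirror:
    """Replicates block writes from a source disk to a target disk."""

    def __init__(self, source: Disk, target: Disk, synchronous: bool) -> None:
        self.source = source
        self.target = target
        self.synchronous = synchronous
        self.pending: list[tuple[int, bytes]] = []  # writes not yet on the target

    def write(self, index: int, data: bytes) -> None:
        """Return once the write may be acknowledged to the application."""
        self.source.blocks[index] = data
        if self.synchronous:
            # Assumed sync behavior: the target is updated before the ack.
            self.target.blocks[index] = data
        else:
            # Assumed async behavior: ack now, ship the block later.
            self.pending.append((index, data))

    def drain(self) -> None:
        """Apply queued asynchronous writes to the target."""
        for index, data in self.pending:
            self.target.blocks[index] = data
        self.pending.clear()
```

In this model a failover right after an acknowledged synchronous write finds the same block on the standby, while an asynchronous mirror can lose whatever is still in `pending`.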

Seamless Integration with WSFC

SIOS DataKeeper Cluster Edition seamlessly integrates with and extends Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data replication mechanism. While WSFC manages the software cluster, SIOS performs the data replication to enable disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or impractical, such as in cloud, virtual, and high-performance storage environments.

Reproduced with permission from SIOS

The single best way to deploy quorum/witness

April 26, 2022 by Jason Aw

During a recent meeting, a customer asked a question about High Availability (HA) and the need for, and feasibility of, quorum/witness. Their question was, “What is the best way to deploy quorum/witness?” The answer is simple: there is no single best way to deploy quorum. To understand why, let’s start by defining three key things: a witness resource, a quorum resource, and a split-brain scenario.

What is split brain?

In a normal cluster environment, the protected application runs on the primary node in the cluster. If the application or that primary node fails, the clustering software moves the application operation to a secondary or remote node, which assumes the role of primary. At any given time, there is only one primary node.

Split brain is a condition that occurs when members of a cluster are unable to communicate with each other, but are in a running and operable state, and subsequently take ownership of common resources simultaneously. In effect, you have two bus drivers fighting for the steering wheel. Split-brain, due to its destructive nature, can cause data loss or data corruption and is best avoided through use of fencing, quorum, witness, or a quorum/witness functionality for cluster arbitration.

In most cluster managers, quorum is maintained when:

All servers are able to see the same state for all cluster peers and the witness
All servers are able to see the same state for all of the cluster peers, though not the witness
All servers are able to see the witness resource, though not each other, and avoid split-brain scenarios

In most cluster managers, quorum is lost when:

Servers are unable to see all cluster peers and the witness server
Servers are unable to see a majority of cluster peers, even though they can see the witness server
Servers are unable to access or maintain access to the quorum resource to successfully arbitrate quorum membership and resource access

What is a witness resource (or server)?

A witness resource is a server, network endpoint, or device that is used to achieve and maintain quorum when a cluster has an even number of members. A cluster with an odd number of members, using cluster majority, does not need a witness resource, as all members of the cluster serve to arbitrate majority membership.
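The quorum conditions listed above, and the role of the witness in an even-sized cluster, can be read as simple vote counting. The sketch below is a minimal model, assuming each node and the witness hold one vote and that a partition keeps quorum only with a strict majority of all votes. `ClusterView` and `has_quorum` are hypothetical names, and real cluster managers differ in how they count and weight votes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterView:
    """What one partition of the cluster can currently see."""
    total_nodes: int       # nodes configured in the cluster
    visible_nodes: int     # nodes reachable in this partition, including itself
    has_witness: bool      # whether a witness vote is configured at all
    witness_visible: bool  # whether this partition can reach the witness


def has_quorum(view: ClusterView) -> bool:
    """Return True if this partition holds a strict majority of all votes."""
    total_votes = view.total_nodes + (1 if view.has_witness else 0)
    my_votes = view.visible_nodes
    if view.has_witness and view.witness_visible:
        my_votes += 1
    return my_votes * 2 > total_votes
```

Under these assumptions, a node in a two-node cluster that sees only the witness keeps quorum (2 of 3 votes), a node in a four-node cluster that sees only the witness does not (2 of 5 votes), and a two-node cluster without a witness cannot break a tie at all.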

What is quorum and a quorum resource?

A quorum resource is a resource (device, system, block storage, file storage, file share, etc.) that serves as a means for arbitration of the cluster state and membership. In some cluster managers, quorum is a resource within the cluster that aids or is required for any cluster state and cluster membership decisions. In other cluster managers, quorum functions as a tie-breaker to avoid split-brain.

More than One Way to Deploy a Quorum

Given the critical nature of quorum, it is essential that HA architectures deploy quorum/witness resources properly, and fortunately (or unfortunately) there is no single best way to deploy quorum. There are several factors that may shape the way in which your witness and quorum resources behave. These factors include:

1. Whether or not your deployment will be on-premises, cloud, or hybrid

Deploying in an on-premises data center where additional storage devices (such as Fibre Channel storage), power control devices or connections, or traditional STONITH devices are present gives customers additional options for quorum and witness functionality that may not exist in the cloud. Likewise, cloud and hybrid environments differ in what can be deployed and in which failure scenarios quorum is being deployed to prevent. Additionally, latency requirements and differences may limit what types of devices and resources are available for a quorum/witness configuration.

2. Your recovery objectives

Recovery objectives are also important to consider when designing and architecting your quorum and witness resources. In an example two-node cluster (node A and node B), when node A experiences a loss of connectivity to node B, what is the highest priority for recovery? If the witness/quorum resources are in the same network as node A, this could result in node A remaining online, but severed from clients, while node B is unable to assess quorum and take over. Likewise, if the quorum device lived only in the region, data center, or network with node B, a loss could result in a failover of resources to a defunct network or data center, or away from a functional and operational primary node.

3. Redundancy of Available Data Centers (or Regions) Within Your Infrastructure

The redundancy of the data center or region is also an important factor in your HA topology with quorum/witness. If your data center has only two levels of redundancy, you must understand the tradeoff between placement of the quorum/witness in the same data center as the primary or standby cluster node. If the data center has more than two redundant tiers, such as a third availability zone or access to a second region, this option would provide a higher level of redundancy for the cluster.
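The tradeoff in witness placement can be worked through with a small simulation. This is a sketch under simplifying assumptions: a two-node cluster with one witness, one vote each, strict-majority quorum, and a failure that cuts one site off from every other site. The site names and the `quorum_holders` function are hypothetical.

```python
def quorum_holders(witness_site: str, isolated_site: str) -> list[str]:
    """Return the nodes that keep quorum when `isolated_site` is cut off."""
    voters = {"node_a": "site_a", "node_b": "site_b", "witness": witness_site}
    holders = []
    for name, site in voters.items():
        if name == "witness":
            continue  # the witness only lends its vote; it hosts no application
        on_isolated_side = site == isolated_site
        # Count every voter on the same side of the partition as this node.
        votes = sum(
            1 for other_site in voters.values()
            if (other_site == isolated_site) == on_isolated_side
        )
        if votes * 2 > len(voters):
            holders.append(name)
    return holders
```

With the witness in `site_a` and `site_a` isolated, quorum stays with `node_a` even though it is cut off, and `node_b` cannot take over. With the witness in a third site, whichever node remains connected to it keeps quorum.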

4. Disaster Recovery Requirements

Understanding your true disaster recovery requirements is also a major factor in your design. If your cluster manager software requires access to the quorum/witness in order to recover from a total data center outage (or region failure) then you’ll need to understand this impact on your design. Many high availability software packages have tools or methods for this scenario, but if your software does not, your design and placement of quorum/witness may need to accommodate this reality.

5. Number Of Members Within the Cluster, and Their Location

An additional quorum/witness server is typically not required when the cluster contains an odd number of nodes. However, using only two nodes in a cluster, or deploying a DR node that is not always available, may change your architecture. As VP of Customer Experience, I have worked with customers who have deployed three-node architectures but, for cost savings, automate periodic shutdown of the third server.

6. Operating System and Cluster Manager

The final factor to mention on quorum/witness is the cluster manager and operating system. Not all HA software and cluster managers are equal when it comes to deployment of quorum/witness or arbitration of quorum status. Some clustering software requires shared disks for arbitration; others are more flexible, allowing shares (NFS, SMB, EFS, Azure Files, and S3). Being aware of what your cluster manager requires, and the modes that it supports with regard to quorum (simple majority, witness, file share, etc.), will impact not only what you deploy, but how you deploy.

The single best way to deploy a quorum/witness server is to understand your vendor’s definition of quorum/witness and their available options, know your requirements, factor in the limitations or opportunities presented by your data center (or cloud environment) and architect the solution that provides your critical systems the highest level of protection against split-brains, false failovers, and downtime.

-Cassius Rhue, VP, Customer Experience

Reproduced from SIOS

Measuring and Improving Write Throughput Performance on GCP Using SIOS DataKeeper for Windows

April 21, 2022 by Jason Aw

Background

This post documents my findings on write performance to a replicated disk in GCP. But first, some background information. A customer expressed concern that DataKeeper was adding a tremendous amount of overhead to their write performance when testing with a synchronous mirror between Google Cloud zones in the same region. The original test they performed was with the bitmap file on the C drive, which was a persistent SSD. In this configuration they were only pushing about 70 MBps. They tried relocating the bitmap to an extreme GCP disk, but the performance did not improve.

Moving the Bitmap to a Local SSD

I suggested that they move the bitmap to a local SSD, but they were hesitant because they believed the extreme disk they were using for the bitmap had latency and throughput that was as good as or better than the local SSD, so they doubted it would make a difference. In addition, adding a local SSD is not a trivial task since it can only be added when the VM is originally provisioned.

Selecting the Instance Type

As I set out to complete my task, the first thing I discovered was that not every instance type supported a local SSD. For instance, the E2-Standard-8 does not support local SSD. For my first test I settled on a C2-Standard-8 instance type, which is considered “compute optimized”. I attached a 500 GB persistent SSD and started running some write performance tests and quickly discovered that I could only get the disk to write at about 140 MBps rather than the max speed of 240 MBps. The customer confirmed that they saw the same thing. It was perplexing, but we decided to move on and try a different instance type.

The second instance type we selected was an N2-Standard-8. With this instance type we were able to push the disk to its maximum throughput speed of 240 MBps when not replicating the disk. I moved the bitmap to the local SSD I had provisioned and repeated the same tests on a synchronous mirror (DataKeeper v8.8.2) and got the results shown below.

The Results

Diskspd test parameters

diskspd.exe -c96G -d10 -r -w100 -t8 -o3 -b64K -Sh -L D:\data.dat
diskspd.exe -c96G -d10 -r -w100 -t8 -o3 -b8K -Sh -L D:\data.dat
diskspd.exe -c96G -d10 -r -w100 -t8 -o3 -b4K -Sh -L D:\data.dat

[Chart: write throughput in MBps]

The Data

Write Size	MBps	MBps Percent Overhead
64k-Mirror	240.01	0.00%
64k-NoMirror	240.02
8k-Mirror	58.87	39.18%
8k-NoMirror	96.8
4k-Mirror	29.34	21.84%
4k-NoMirror	37.54

Write Size	AvgLat (ms)	AvgLat Percent Overhead
64k-Mirror	6.247	-0.02%
64k-NoMirror	6.248
8k-Mirror	3.183	39.21%
8k-NoMirror	1.935
4k-Mirror	3.194	21.88%
4k-NoMirror	2.495
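The overhead columns in both tables can be reproduced from the raw figures. The formulas below are inferred from the numbers rather than stated in the post: throughput overhead appears to be measured relative to the unmirrored throughput, and latency overhead relative to the mirrored latency. The function names are hypothetical.

```python
# (mirror, no mirror) pairs copied from the two tables above.
THROUGHPUT_MBPS = {"64k": (240.01, 240.02), "8k": (58.87, 96.8), "4k": (29.34, 37.54)}
AVG_LATENCY_MS = {"64k": (6.247, 6.248), "8k": (3.183, 1.935), "4k": (3.194, 2.495)}


def throughput_overhead_pct(mirror: float, no_mirror: float) -> float:
    """Throughput lost to mirroring, as a percentage of unmirrored throughput."""
    return (no_mirror - mirror) / no_mirror * 100


def latency_overhead_pct(mirror: float, no_mirror: float) -> float:
    """Latency added by mirroring, as a percentage of mirrored latency."""
    return (mirror - no_mirror) / mirror * 100


for size in THROUGHPUT_MBPS:
    t = throughput_overhead_pct(*THROUGHPUT_MBPS[size])
    l = latency_overhead_pct(*AVG_LATENCY_MS[size])
    print(f"{size}: throughput overhead {t:.2f}%, latency overhead {l:.2f}%")
```

Rounded to two decimals, these formulas give the same percentages as the tables for all three write sizes.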

Conclusions

The 64k and 4k write sizes both incur overhead that could be considered “acceptable” for synchronous replication. The 8k write size seems to incur a more significant amount of overhead, although the average latency of 3.183 ms is still pretty low.
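One way to read the throughput and latency figures together is through Little's Law. With `-t8 -o3`, the test keeps a fixed 8 × 3 = 24 writes in flight, so throughput should be roughly the bytes in flight divided by the average latency. The sketch below checks that against the measured numbers. It assumes that the MB/s figures are counted in units of 2^20 bytes and that a single target file is in use; `expected_mbps` is a hypothetical helper.

```python
THREADS = 8                 # from -t8
OUTSTANDING_PER_THREAD = 3  # from -o3


def expected_mbps(block_kib: int, avg_latency_ms: float) -> float:
    """Estimate throughput as bytes in flight divided by average latency."""
    in_flight = THREADS * OUTSTANDING_PER_THREAD
    mib_in_flight = in_flight * block_kib / 1024
    return mib_in_flight / (avg_latency_ms / 1000)


# (block size in KiB, average latency in ms, measured MB/s) from the tables.
MEASURED = [
    (64, 6.247, 240.01), (64, 6.248, 240.02),
    (8, 3.183, 58.87), (8, 1.935, 96.8),
    (4, 3.194, 29.34), (4, 2.495, 37.54),
]

for block_kib, latency_ms, measured in MEASURED:
    estimate = expected_mbps(block_kib, latency_ms)
    print(f"{block_kib}k at {latency_ms} ms: estimated {estimate:.1f}, measured {measured}")
```

Each estimate lands within about one percent of the measured value. With a fixed number of writes in flight, any latency added by the mirror translates directly into lost throughput, which would explain why the two overhead columns are nearly identical.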

-Dave Bermingham, Director, Customer Success

Reproduced with permission from SIOS

How COVID-19 Impacts High Availability

April 17, 2022 by Jason Aw

Compared to friends, family, and those who have required treatment, hospitalization, or intensive care, my COVID symptoms have been mild. This is likely the result of reasonably good health, both doses of the vaccine, a booster shot, and early detection and treatment. My heart goes out to every family who has lost a loved one to any aspect of this pandemic, and to all those who have lost opportunities and special moments. As I and several members of our SIOS team recover from COVID-19, we wanted to share five things that your IT team may be dealing with as they fight COVID and enterprise downtime, and five things you can do to help them.

Five COVID Concerns Facing IT Teams

Personal and Family Concerns and Fears

Initially, my symptoms were barely noticeable: a slight irritation in my throat and a little sinus drainage, which I self-diagnosed as seasonal allergies. But when the issues worsened, accompanied by a bad cough, I became worried. Of course, we’d all like to think that our work performance and responsibilities remain unchanged, but the reality may be a little harder to assess. Despite initial negative tests, I continued to develop symptoms that eventually impacted my ability to work, increased my personal health concerns, and raised a number of fears. If your team has been directly affected by COVID-19, understand that they are likely dealing with personal concerns, fears, and worries in addition to the real health challenges that may impact their schedules, tasks, and activities.

In the midst of their personal concerns, each team member is likely also dealing with larger concerns, namely concerns about family. During my illness, thankfully, my children all remained well. However, my wife was not so lucky. She became ill three days after my symptoms began and remained ill longer, with more severe symptoms and setbacks. While we have the benefits of a large family unit, a licensed teenage driver, and an extra car not driven by COVID-positive parents, your team may not have these luxuries. And even if they do, it does not give them freedom from concern or reduce the amount of time and mental energy they need to apply to sanitize the home, keep their children in school and healthy, and deal with regulations, mandates, and close contact issues. Not to mention concerns over income and expenses. Team members facing personal and family concerns may experience difficulty concentrating, short tempers, and difficulty meeting deadlines and schedules.
FODO – Fear of Disappointing Others

Even without COVID-19 illness, businesses worldwide are feeling the impact of a smaller workforce. The events aptly described as the “Great Shift”, “Great Resignation”, or “Great Shuffle” have already dramatically reshaped workforces, including those dealing with HA, leaving teams with fewer people to carry on critical tasks. This deficit in team members can lead those with COVID to battle a Fear of Disappointing Others (FODO). Sick team members may continue to try to work out of loyalty to the team or a fear of disappointing bosses, peers, or stakeholders. This FODO often leads workers who are already functioning in a stressed environment (see the personal and family concerns above) to attempt to maintain pre-COVID levels of activity. While heroic, this is also counterproductive to personal and professional recovery.
Fatigue

As I continue to deal with COVID-19 symptoms, one of the biggest issues I continue to face is fatigue. Initially, that fatigue, which was driven by FODO, prevented me from getting adequate rest and recovery. Because I had seen how shorthanded our team was and witnessed others try to brave their illness to keep up with demand, I tried to do the same. But, without warning, I found myself drained, not at the end of the day, but for periods of time throughout the day. For me, starting the day before 5 AM and continuing to focus on work, tasks, strategy, and personnel matters for 8 to 12 hours was normal. (We can debate later whether that was ever healthy.) Now some days felt like climbing Everest before 8 AM. The best advice I received was from a friend and co-worker who said, “Don’t fight it. When your body says rest, rest!”
Brain Fog

Around the same time that I started feeling sick, a colleague shared that they felt like they were in a fog following their bout with COVID symptoms. Like me, they were fully vaccinated and their symptoms and duration were mild. In fact, they actually never tested positive. Nevertheless, they spent days with what we both termed “brain fog”: slowness to recall details and a sense of knowing the answer but lacking some mental sharpness, somehow different from physical and mental fatigue. In some instances, it appears as a slower response to a question, a pause in the keystrokes, or a delay before the light comes on in the room.
Failed Recovery

Five days into COVID, I woke up from an early night’s rest feeling better than ever. I jumped into my regular routine and by noon discovered that I had not fully recovered. Instead I was exhausting a small store of energy gained by sleeping well the night before. Trying to fight through this exhaustion created a new setback in my recovery. The following day I felt worse than before. The agony of a failed recovery and a concern about how to avoid more setbacks was added to my fatigue and fog.

So, what should IT team leads, stakeholders, and managers do when their teams experience an issue with COVID-19?

Five Ways to Help IT Teams Battling COVID

Practice Empathy

Be mindful that COVID affects each person and family differently. Some of your coworkers and administrators will have minor issues, no symptoms, and no complications. Others, such as single parents, multi-generational families, or families with children or vulnerable persons, will have many more issues and concerns. Know that the virus also impacts each person uniquely. Even within my own family, my symptoms and those of my wife were different. While I experienced greater fatigue, she experienced more headaches. Have patience for coworkers who may be dealing with brain fog, juggling work schedules, caring for sick loved ones, or dealing with myriad issues related to COVID.
Assess Needs

Unlike the flu or common cold, COVID recovery is irregular. A team member may show up at work one day feeling much improved and stay home sick the next. Your business still has technical needs and requirements for high availability and disaster recovery. However, with persons in and out of availability due to illness, be sure to understand the current roles and responsibilities required within the team. When an individual is out sick, be sure to assess their role, their impact to the team, their level of responsibility to the infrastructure, etc. You may also need to assess who within the team or organization can provide coverage in the event of a critical downtime event.
Prioritize Issues

Help your team by prioritizing key issues. Under normal circumstances, your IT team is balancing dozens of requests ranging from the trivial (USB keyboard) to the critical (issues related to downtime, security threats, or storage issues). While it may be obvious to you and the team, other stakeholders may need to understand the status of the IT team and how operations will be handled until a return to more “normal” staffing occurs.
Be Sure Your Processes Are Up to Date

As team members swap in and out, it is critical that IT maintenance and management processes are kept up to date. These processes will help each member of the team service your enterprise effectively and efficiently when performing a task that is not their normal responsibility. It will also reduce the amount of time each team member needs to spend researching the status of the systems they are covering while a coworker recuperates.
Give People Time

I’ve rushed back into the routine more than I should have, only to suffer the consequences of setbacks and greater fatigue on the following day. As a leader or individual contributor on a team, be sure to give yourself and your team time to “get back to normal.”

As the pandemic continues, we all hope for a future that greatly resembles normalcy, including less illness, fear and worry. In the meantime, being more aware of the concerns your team members are facing during COVID illness and recovery will greatly help you proactively prepare and weather the current storm. In addition, key lessons learned from this pandemic can be applied across a number of other organizational, employee life, and global concerns.

Reproduced with permission from SIOS

How to Get the Most from Your Tech Support Call

April 13, 2022 by Jason Aw

Technical support experts share their tips on how to fast-track issue resolution

SIOS provides high availability protection for our customers’ most critical applications, databases, and ERPs. When our customers call tech support, there is no time to waste. We’ve earned a reputation (and several awards) for our HA/DR expertise and support excellence.

We’ve asked our tech support team to share the following five questions that can fast-track your issue resolution.

Fast, Accurate Diagnosis

Thorough and accurate tech support is similar to diagnosing an illness. Imagine asking your doctor to treat a headache. The human body is a complex interaction of multiple systems. The source of your problem may not be obvious or even in your head. To diagnose the issue and recommend a treatment, your doctor typically begins with questions aimed at identifying the circumstances that caused your symptoms.

Failover clustering also involves multiple systems at every layer of the IT infrastructure – network, storage, OS, application, database, and server. And like your real headache, your HA issue is often caused by something unrelated to your HA clustering software. Like your doctor, a good support professional will ask a variety of questions to characterize your issue. The more information you can provide about your support issue, the faster and more effectively it can be diagnosed and resolved.

Fast-Tracking Issue Resolution

As an IT best practice, consider logging key information and system changes as an ongoing business exercise. Having answers to the following key questions at your fingertips will speed the diagnosis and fast-track issue resolution. (It may also help you prevent issues from occurring in the first place.)

Can you describe the error you are receiving? What is the exact symptom you are witnessing that is causing concern?
When did it happen (the time, and the time zone you are in)?
A typical diagnostic method is to examine log files from the machine with issues. Log files can be hundreds of lines of message strings or command output. By tracking the precise time you noticed the problematic symptoms, we can significantly narrow the log file examination.
Have you uploaded the logs, or are you able to?
Providing an explanation and description of the error, along with the timeframe in which it happened, goes a long way in diagnosis, provided the logs can be uploaded to the support ticket. In some IT environments, uploading the logs requires using corporate-approved file sharing, while dark sites prohibit electronic distribution of system logs. If logs cannot be provided externally, be sure that the full logs are captured and archived for reference and review with the support agent as the case progresses. Applications and systems, especially those under duress, can produce exhaustive and extensive logs that can overwrite critical information.
Which system was the primary cluster node at the time?
Given the interconnected nature of clustering, it is important to inform your tech support representative of whether the cluster node you are calling about was functioning as the primary or secondary node at the time of the issue.
What have you tried to do to remedy the issue?
Great physicians know that their patients have likely tried a home remedy or over-the-counter medication prior to the visit. Knowing this information is helpful in diagnosis and treatment. The same applies with great support technicians. Sharing not only what you were trying to do at the time of the issue, but how you tried to resolve your errors can help them craft a better treatment and recovery plan, and make sure that their recommendations for recovery protect your critical data and applications.
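The point above about recording the precise time and time zone can be illustrated with a short sketch that narrows a log to the minutes around an incident. Everything here is an assumption made for illustration: the log format (a UTC timestamp such as `2022-04-13T14:03:40` at the start of each line), the `lines_near` function, and the five-minute window. Real product logs will look different.

```python
from datetime import datetime, timedelta, timezone


def lines_near(log_lines, incident_local, utc_offset_hours, window_minutes=5):
    """Keep log lines whose UTC timestamp is within the window of the incident."""
    # Convert the reported local time to UTC using the caller's UTC offset.
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    incident_utc = incident_local.replace(tzinfo=local_zone).astimezone(timezone.utc)
    window = timedelta(minutes=window_minutes)
    kept = []
    for line in log_lines:
        # Assumed format: the first 19 characters are an ISO-style UTC timestamp.
        stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        stamp = stamp.replace(tzinfo=timezone.utc)
        if abs(stamp - incident_utc) <= window:
            kept.append(line)
    return kept
```

An incident reported as 10:00 local time at UTC-4 maps to 14:00 UTC, so only lines stamped between 13:55 and 14:05 UTC are kept. Without the time zone, the same report could point at the wrong part of the log entirely.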

For more than 20 years, the SIOS Customer Experience team has been helping enterprise customers implement HA/DR solutions for a wide range of use cases. We value our customers and encourage them to contact us whenever they have questions about their HA/DR.

Reproduced with permission from SIOS