What is “Split Brain” and How to Avoid It

June 23, 2022 by Jason Aw

As we have discussed, in a High Availability cluster environment there is one active node and one or more standby node(s) that will take over service when the active node either fails or stops responding.

This arrangement sounds reasonable until the network layer between the nodes is considered: what happens if the network path between the nodes goes down?

Neither node can now communicate with the other, and the standby server may promote itself to active on the basis that it believes the active node has failed. Both nodes then consider themselves ‘active’, since each sees the other as dead. Data integrity and consistency are compromised because data on both nodes is changing independently. This is referred to as “Split Brain”.

To avoid a split brain scenario, a Quorum node (also referred to as a ‘Witness’) should be added to the cluster. Adding a quorum node to a cluster with an even number of nodes creates an odd total number of nodes (3, 5, 7, etc.), allowing the nodes to vote on which should act as the active node within the cluster.

For example, suppose the server rack containing Node B loses LAN connectivity. With a third node added to the cluster environment, the system can still determine which node should be active: Node A and the quorum node can still communicate and together hold two of the three votes, so Node A keeps (or takes over) the active role, while Node B, holding only its own vote, remains standby.

Quorum/Witness functionality is included in the SIOS Protection Suite. At installation, Quorum / Witness is selected on all nodes (not only the quorum node) and a communication path is defined between all nodes (including the quorum node).

The quorum node doesn’t host any active services. Its only role is to participate in node communication in order to determine which nodes are active and to provide a ‘tie-break’ vote in the event of a communication outage.

SIOS also supports I/O fencing and storage-based quorum devices; in these configurations an additional quorum node is not required.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: split brain

How does Data Replication between Nodes Work?

June 19, 2022 by Jason Aw

In the traditional datacenter scenario, data is commonly stored on a storage area network (SAN). The cloud environment doesn’t typically support shared storage.

SIOS DataKeeper presents ‘shared’ storage by using replication technology to maintain a copy of the currently active data. It creates a NetRAID device that behaves like a RAID 1 device (data mirrored across devices).

Data changes are replicated from the Mirror Source (the disk device on the active node, Node A) to the Mirror Target (the disk device on the standby node, Node B).

To guarantee consistency of data across both devices, only the active node has write access to the replicated device (the /datakeeper mount point in this example). While the device is acting as a Mirror Target (i.e., on the standby node), access to it is not allowed.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: data replication

How a Client Connects to the Active Node

June 15, 2022 by Jason Aw

As discussed earlier, once a High Availability cluster has been configured, two or more nodes run simultaneously and users connect to the “active” node. When an issue occurs on the active node, a “failover” condition occurs and the “standby” node becomes the new “active” node. When a failover occurs, there must be a mechanism that either allows a client to detect the failover condition and reconnect, or transfers the user’s active client session to the new active node seamlessly.

A Virtual IP Address

Usually a “virtual” IP address is created when a cluster is configured, and clients communicate with the active node through that address. When a failover occurs, the virtual IP address is reassigned to the new active node, and clients reconnect to the same virtual IP address.

As an example, assume there are two nodes, A and B, with IP addresses 10.20.1.10 and 10.20.2.10. We will define a virtual IP address of 10.20.0.10, which is assigned to whichever node is currently active.

This is similar to assigning a second IP address to a single network interface card on one node. If the command ip a is entered on the active node, both IP addresses will appear against the same interface.
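
An illustrative sketch of what that output might look like (hedged: the interface name, MAC address, and prefix lengths are placeholders and will differ in a real deployment):

    $ ip a show dev eth0
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 0a:1b:2c:3d:4e:5f brd ff:ff:ff:ff:ff:ff
        inet 10.20.1.10/24 scope global eth0
           valid_lft forever preferred_lft forever
        inet 10.20.0.10/24 scope global eth0
           valid_lft forever preferred_lft forever

Here 10.20.1.10 is Node A’s own address and 10.20.0.10 is the virtual IP currently held by the active node.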

The ARP Protocol

When a client attempts to find a server using an IP address, the client typically uses ARP (Address Resolution Protocol) to find the MAC (Media Access Control) address of the target machine.

When a client broadcasts an ARP request for the virtual IP address, the active node answers with its MAC address, and the client then connects to that node.
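
As a quick sanity check (a hedged example; the MAC address shown is a placeholder), a Linux client can inspect its neighbour (ARP) cache to see which MAC address the virtual IP currently resolves to:

    $ ip neigh show to 10.20.0.10
    10.20.0.10 dev eth0 lladdr 0a:1b:2c:3d:4e:5f REACHABLE

After a failover, this entry is refreshed to the MAC address of the new active node once the client’s ARP cache is updated.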

ARP Alternatives for the Cloud Environment

In the cloud environment, however, it is not possible to identify the active node using ARP, because many layers are abstracted in the virtual environment. An alternative method based on the network infrastructure of the specific cloud environment may be required. There are normally several options, and a selection should be made from the following list; an illustrative sketch of the route table approach follows the list.

  • AWS Route Table Scenario
  • AWS Elastic IP Scenario
  • AWS Route53 Scenario
  • Azure Internal Load Balancer Scenario
  • Google Cloud Internal Load Balancer Scenario
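
As a rough sketch of the AWS route table scenario (hedged: the route table ID and instance ID below are placeholders, and in a SIOS cluster this update is performed automatically on failover rather than by hand), the virtual IP is directed to the current active node by updating a VPC route for the /32 address:

    # illustrative only: point the virtual IP (10.20.0.10/32) at the new active node
    aws ec2 replace-route \
        --route-table-id rtb-0123456789abcdef0 \
        --destination-cidr-block 10.20.0.10/32 \
        --instance-id i-0123456789abcdef0
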
Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Active Node, Cloud

Public Cloud Platforms and their Network Structure Differences

June 11, 2022 by Jason Aw

There are several public cloud platforms, including Amazon Web Services (AWS), Microsoft Azure and Google Cloud. While there are many similarities in their infrastructures, there are some differences. In many cases a VPC (Virtual Private Cloud) or a VNet (Virtual Network) tied to a region is created, and one or more VPCs may be defined for a logical group of applications. By doing so, different systems are divided into separate, unconnected networks unless the VPCs are specifically connected.

Under a VPC, many different subnets can be defined. Depending on their purpose, some subnets are configured as “public” subnets, which are accessible from the internet, and some as “private” subnets, which are not.

Some cloud providers (such as Azure and Google Cloud) allow subnets to span Availability Zones (different datacenters), while others (such as AWS) do not. In the latter case, a subnet must be defined for each Availability Zone.
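
For example (a hedged sketch; the VPC ID, CIDR blocks, and zone names are placeholders), creating one subnet per Availability Zone for a two-node cluster in AWS might look like this:

    # illustrative only: one subnet in each Availability Zone
    aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
        --cidr-block 10.20.1.0/24 --availability-zone us-east-1a
    aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
        --cidr-block 10.20.2.0/24 --availability-zone us-east-1b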

In this guide, we’ll use different Availability Zones for each node. Once the basic functionality of the SIOS product is understood, it might be appropriate to explore different scenarios (similar to those in use in your own network infrastructure) that involve distributing workloads across different subnets, modifying the IP ranges for these subnets, changing the manner in which the network is connected to the Internet, etc.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: public cloud platforms

How Workloads Should be Distributed when Migrating to a Cloud Environment

June 7, 2022 by Jason Aw

Determining how workloads (nodes) should be distributed is a common topic of discussion when migrating to the public cloud with High Availability in mind. If workloads are located within an on-premises environment, more often than not their locations are dictated by the locations of established datacenters; in many cases, choosing another location in which to host a workload is simply not an option. With a public cloud offering, there is a wide range of geographical regions as well as Availability Zones to choose from.

An Availability Zone is generally analogous to one or more datacenters (physical locations) located in the same physical region (e.g., California). These datacenters may be in different areas but are connected by high-speed networks to minimize latency between them. (Note that hosting services across several datacenters within an Availability Zone should be transparent to the user.)

As a general rule, the greater the physical distance between workloads, the more resilient the environment becomes. It’s a reasonable assumption that natural disasters such as earthquakes won’t affect different regions at the same time (e.g., both U.S. west coast and east coast at the same time). However, there is still a chance of experiencing service outages across different regions simultaneously due to system-wide failures (some cloud providers have previously reported simultaneous cross-region outages such as in the US & Australia). It may be appropriate to consider creating a DR (disaster recovery) plan defined across different cloud providers.

Another consideration is the cost of protecting the resources. Generally, the greater the distance between workloads, the higher the data transfer costs. In many cases, data transfer between nodes within the same datacenter (Availability Zone) is free, while transferring data across Availability Zones might cost $0.01/GB or more. This cost might double (or more) when data is transferred across regions (i.e., $0.02/GB). In addition, greater latency between nodes should be anticipated as the physical distance between workloads increases. Weighing these factors, it is generally recommended to distribute workloads across Availability Zones within the same region.
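
As a rough illustration using those figures (actual pricing varies by provider and region): replicating about 1,000 GB of changed data per month adds roughly $10/month when mirrored across Availability Zones at $0.01/GB, and roughly $20/month across regions at $0.02/GB, on top of the higher latency.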

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: availability, Cloud, workloads
