June 27, 2022
New Options for High Availability Clusters, SIOS Cements its Support for Microsoft Azure Shared Disk

Microsoft introduced Azure Shared Disk in Q1 of 2022. Shared Disk allows you to attach a managed disk to more than one host. Effectively, this means that Azure now has the equivalent of SAN storage, enabling highly available clusters to use shared disks in the cloud. A major advantage of using Azure Shared Disk with a SIOS LifeKeeper cluster hierarchy is that you are no longer required to have either a storage quorum or a witness node. This lets you avoid so-called split brain, which occurs when communication between nodes is lost and several nodes are potentially changing data simultaneously. Fewer nodes mean less cost and complexity.

LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit

SIOS has introduced an Application Recovery Kit (ARK) for our LifeKeeper for Linux product, called the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit. It allows Azure Shared Disks to be used in conjunction with SCSI-3 reservations. The ARK guarantees that a shared disk is writable only from the node that currently holds the SCSI-3 reservation on that disk.

When installing SIOS LifeKeeper, the installer detects that it is running in Microsoft Azure and automatically installs the LifeKeeper SCSI-3 Persistent Reservations (SCSI3) Recovery Kit to enable support for Azure Shared Disk. Resource creation within LifeKeeper is straightforward (Figure 1): once locally mounted, the Azure Shared Disk is simply added into LifeKeeper as a file-system type resource. LifeKeeper assigns it an ID (Figure 2) and manages the SCSI-3 locking automatically.

[Figure 1: Creating the file-system resource in LifeKeeper]
[Figure 2: The ID LifeKeeper assigns to the resource]

SCSI-3 reservations guarantee that the Azure Shared Disk is writable only on the node that holds the reservation (Figure 3). If the cluster nodes lose communication with each other, the standby server will attempt to come online, creating a potential split-brain situation. However, because only one node can hold the SCSI-3 reservation and access the disk at a time, an actual split brain is prevented. The node that holds the reservation either becomes the new active node (in which case the other node reboots) or remains the active node. Nodes that do not hold the reservation on the Azure Shared Disk simply end up with the resource in a “Standby” state, because they cannot acquire the reservation.

[Figure 3: Only the reservation holder can write to the Azure Shared Disk]

Microsoft’s definition of Azure Shared Disks: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-shared

What You Can Expect

At the moment, SIOS supports Locally-Redundant Storage (LRS). We are working with Microsoft to test and support Zone-Redundant Storage (ZRS); ideally we would like to know when there is a ZRS failure so that we can fail over the resource hierarchy to the node closest to the active storage. SIOS expects Azure Shared Disk support to arrive in the next release, LifeKeeper 9.6.2 for Linux.

Reproduced with permission from SIOS
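As a hedged illustration of the reservation behaviour described above (not SIOS tooling), the `sg_persist` utility from the standard Linux sg3_utils package can be used to observe which node currently holds the SCSI-3 persistent reservation on a shared disk. The device name `/dev/sdc` is an assumption for the example; LifeKeeper manages the reservations itself, so these commands are read-only checks.

```bash
# List the reservation keys registered on the shared disk.
sudo sg_persist --in --read-keys /dev/sdc

# Show the current reservation holder and the reservation type.
# Only the node reported here is able to write to the disk.
sudo sg_persist --in --read-reservation /dev/sdc
```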
June 23, 2022
What is “Split Brain” and How to Avoid It

As we have discussed, in a High Availability cluster environment there is one active node and one or more standby node(s) that will take over service when the active node either fails or stops responding. This sounds reasonable until the network layer between the nodes is considered. What if the network path between the nodes goes down? Neither node can communicate with the other, and in this situation the standby server may promote itself to become the active server on the basis that it believes the active node has failed. This results in both nodes becoming ‘active’, as each sees the other as dead. As a result, data integrity and consistency are compromised because data on both nodes is changing. This is referred to as “split brain”.

To avoid a split-brain scenario, a quorum node (also referred to as a ‘witness’) should be added to the cluster. Adding the quorum node to a cluster consisting of an even number of nodes creates an odd number of nodes (3, 5, 7, etc.), with nodes voting to decide which should act as the active node within the cluster. In the example below, the server rack containing Node B has lost LAN connectivity. In this scenario, through the addition of a third node to the cluster environment, the system can still determine which node should be the active node.

Quorum/Witness functionality is included in the SIOS Protection Suite. At installation, Quorum/Witness is selected on all nodes (not only the quorum node) and a communication path is defined between all nodes (including the quorum node). The quorum node doesn’t host any active services; its only role is to participate in node communication in order to determine which nodes are active and to provide a ‘tie-break vote’ in case of a communication outage. SIOS also supports I/O Fencing and Storage as quorum devices, and in these configurations an additional quorum node is not required.

Reproduced with permission from SIOS
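The following is a minimal conceptual sketch (not SIOS code) of why adding a witness to create an odd node count avoids a tie: a partition may only stay (or become) active when it can still see a strict majority of the cluster's votes. The node counts are illustrative.

```bash
#!/bin/bash
# Conceptual quorum check: majority = more than half of all votes.

TOTAL_NODES=3     # e.g. Node A, Node B, and the quorum/witness node
REACHABLE=2       # nodes this partition can still communicate with (itself included)

MAJORITY=$(( TOTAL_NODES / 2 + 1 ))

if [ "$REACHABLE" -ge "$MAJORITY" ]; then
    echo "Quorum held ($REACHABLE of $TOTAL_NODES votes): safe to remain or become active."
else
    echo "No quorum ($REACHABLE of $TOTAL_NODES votes): this node must not promote itself."
fi
```

With two nodes and no witness, a network split leaves each side with exactly half the votes, so neither side can distinguish “the other node died” from “the link died”; the third vote breaks that tie.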
June 19, 2022
How does Data Replication between Nodes Work?

In the traditional datacenter scenario, data is commonly stored on a storage area network (SAN). The cloud environment doesn’t typically support shared storage. SIOS DataKeeper presents ‘shared’ storage using replication technology to create a copy of the currently active data. It creates a NetRAID device that works like a RAID1 device (data mirrored across devices).

Data changes are replicated from the Mirror Source (the disk device on the active node, Node A in the diagram below) to the Mirror Target (the disk device on the standby node, Node B in the diagram below). To guarantee consistency of data across both devices, only the active node has write access to the replicated device (the /datakeeper mount point in the example below). Access to the replicated device (the /datakeeper mount point) is not allowed while it is a Mirror Target (i.e., on the standby node).

Reproduced with permission from SIOS
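As a rough analogy only (these are standard Linux `mdadm` commands, not DataKeeper commands), a local RAID1 mirror shows the mirroring idea behind a NetRAID device: every write lands on two devices, and only one mounted view of the data is written to. DataKeeper replicates to the target node over the network and is configured through its own tools; the device names below are assumptions.

```bash
# Build a two-way mirror from two local disks (illustrative device names).
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# Put a filesystem on the mirror and mount it where the application writes.
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /datakeeper   # only the active node mounts the mirror for writes

# Show both members of the mirror and their synchronisation state.
cat /proc/mdstat
```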
June 15, 2022
How a Client Connects to the Active Node

As discussed earlier, once a High Availability cluster has been configured, two or more nodes run simultaneously and users connect to the “active” node. When an issue occurs on the active node, a “failover” condition occurs and the “standby” node becomes the new “active” node. When a failover occurs there must be a mechanism that either allows a client to detect the failover condition and reconnect, or transfers the user’s active client session to the new active node seamlessly.

A Virtual IP Address

Usually a “virtual” IP address is created when a cluster is configured, and the client communicates with the active node using this virtual IP address. When a failover occurs, the virtual IP address is reassigned to the new active node and the client reconnects to the same virtual IP address. As an example, let us assume that there are two nodes, A and B, with IP addresses of 10.20.1.10 and 10.20.2.10. In this example, we define a virtual IP address of 10.20.0.10, which should be considered to be assigned to the current active node. This is similar to assigning a second IP address to one network interface card on one node. If the command ip a is entered on the active node, both the node’s own IP address and the virtual IP address will appear in the output.

The ARP Protocol

When a client attempts to find a server using an IP address, the client typically uses ARP (Address Resolution Protocol) to find the MAC (Media Access Control) address of the target machine. When the client broadcasts a request for the target IP address, the active node answers with its MAC address, and the client then connects to that address.

ARP Alternatives for the Cloud Environment

In the cloud environment, however, it is not possible to identify the active node using ARP, because many layers are abstracted in the virtual environment. An alternative method based on the network infrastructure in use in the specific cloud environment may be required. There are normally several options, and the appropriate one depends on the cloud platform in use.

Reproduced with permission from SIOS
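A minimal sketch of the traditional (non-cloud) virtual IP takeover described above, using the article’s 10.20.0.10 virtual address: the interface name eth0 and the /16 prefix are assumptions, and this is not the mechanism SIOS uses in cloud environments.

```bash
# On the node becoming active: add the virtual IP as a secondary address.
sudo ip addr add 10.20.0.10/16 dev eth0

# Send gratuitous ARP so switches and clients update their ARP caches and
# start sending traffic for 10.20.0.10 to this node's MAC address.
sudo arping -U -c 3 -I eth0 10.20.0.10

# Verify that both the node's own IP and the virtual IP now appear on eth0.
ip a show dev eth0
```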
June 11, 2022
Public Cloud Platforms and their Network Structure Differences

There are several public cloud platforms, including Amazon Web Services (AWS), Microsoft Azure and Google Cloud. While there are many similarities in their infrastructures, there are some differences.

In many cases a VPC (Virtual Private Cloud) or a VNET (Virtual Network) tied to a region is created, and one or more VPCs may be defined for a logical group of applications. By doing so, different systems are divided into separate, unconnected networks unless the VPCs are specifically connected. Under a VPC, many different subnets can be defined. Depending on their purpose, some subnets are configured as “public” subnets, which are accessible from the internet, and some are configured as “private” subnets, which are not. Some cloud providers (such as Azure and Google Cloud) allow subnets to span Availability Zones (different datacenters), while others (such as AWS) do not. In the latter case, a subnet will need to be defined for each Availability Zone.

In this guide, we’ll use different Availability Zones for each node. Once the basic functionality of the SIOS product is understood, it might be appropriate to explore different scenarios (similar to those in use in your own network infrastructure) that involve distributing workloads across different subnets, modifying the IP ranges for these subnets, changing the manner in which the network is connected to the Internet, etc.

Reproduced with permission from SIOS
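As an illustrative sketch (not from the article): because AWS subnets cannot span Availability Zones, a two-AZ cluster needs one subnet per AZ. The CIDR blocks, region and AZ names below are assumptions chosen for the example.

```bash
# Create a VPC for the cluster and capture its ID.
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
         --query 'Vpc.VpcId' --output text)

# Create one subnet in each Availability Zone for the cluster nodes.
aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.0.1.0/24 \
    --availability-zone us-east-1a
aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block 10.0.2.0/24 \
    --availability-zone us-east-1b
```

In Azure or Google Cloud, by contrast, a single subnet can serve nodes placed in different zones, so the equivalent layout needs only one subnet.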