Clustering Simplified Archives

Benefits of SIOS Protection Suite/LifeKeeper for Linux

June 3, 2022 by Jason Aw

SIOS software supports multiple operating system versions and flavors (both Linux and Windows).
- Consistent user experience to protect mission-critical resources regardless of the operating system.
SIOS delivers high availability solutions for multiple environments, including on-premises, virtualized (VMware), and cloud environments; the cloud options cover public platforms from AWS, Azure, and Google Cloud as well as privately hosted clouds. The same tools can be used across different environments.
- Hybrid environments are also supported. It is possible to configure an on-premises node as the primary location, with the secondary node hosted in a private cloud and a third node located on a public cloud platform to provide an additional disaster recovery (DR) option.

SIOS provides application-aware protection mechanisms including Application Recovery Kits (ARKs), available for the world’s leading providers of enterprise applications and databases.
- Our comprehensive library of ARKs protects the broadest range of applications ‘off-the-shelf’.
- Other ‘non-standard’ or legacy applications can be protected with the built-in ‘GenApp’ ARK or by developing a custom ARK either in-house or in collaboration with SIOS engineers.

Unlike open-source tools, which force the user to manually establish environmental parameters in advance and then type a complex set of command-line parameters, SIOS provides a series of wizard-based installation and configuration screens that enable intuitive selection of the resources requiring protection and help in selecting the type and extent of protection to be provisioned.
- Wizards scan the system and environment to identify the resources to protect. In most cases the operator only needs to confirm default selections and enter environment-specific values such as host IDs to complete the installation.
- This reduces the likelihood of misconfiguring the HA solution, and with it the risk of unexpected downtime when a system or application fails.
SIOS provides a comprehensive range of technical support options including 24×7 critical support, offering support options tailored to the available budget, systems complexity or criticality of the application requiring high availability.

Reproduced with permission from SIOS

SIOS Protection Suite/LifeKeeper for Linux – Integrated Components

May 29, 2022 by Jason Aw

SIOS Protection Suite includes the following software components to protect an organization’s mission-critical systems.

SIOS LifeKeeper

SIOS LifeKeeper provides a complete fault-resilient software solution that enables high availability for servers, file systems, applications, and processes. LifeKeeper does not require any customized, fault-tolerant hardware. It simply requires two or more systems grouped in a network, plus site-specific configuration data that is created to provide automatic fault detection and recovery.

In the case of a failure, LifeKeeper migrates protected resources from the failed server to a designated standby server. Users experience a brief interruption during the actual switchover. However, LifeKeeper restores operations on the standby server without operator intervention.

SIOS DataKeeper

SIOS DataKeeper provides an integrated data replication capability for LifeKeeper environments. This feature enables LifeKeeper resources to operate in shared and non-shared storage environments.

Application Recovery Kits (ARKs)

Application Recovery Kits (ARKs) include tools and utilities that allow LifeKeeper to manage and control a specific application or service. When an ARK is installed for a specific application, LifeKeeper is able to monitor the health of the application and automatically recover the application if it fails. These Recovery Kits are non-intrusive and require no changes within the application in order for it to be protected by LifeKeeper.

There is a comprehensive library of ‘off-the-shelf’ Application Recovery Kits available as part of the SIOS Protection Suite portfolio. The types and quantity of ARKs supplied vary based on the edition of SIOS Protection Suite purchased.

Reproduced with permission from SIOS

High Availability, RTO, and RPO

May 25, 2022 by Jason Aw

High availability (HA) is an information technology term that refers to computer software or a component that is operational and available more than 99.99% of the time. End users of such an application or system experience less than about 52.6 minutes of service interruption per year. This level of availability is typically achieved through high availability clustering, a configuration that reduces application downtime by eliminating single points of failure through the use of redundant servers, networks, storage, and software.
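The yearly downtime figure follows directly from the availability percentage. The short Python sketch below shows the arithmetic; the function name is only illustrative, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% available -> {max_downtime_minutes(target):.2f} minutes/year")
```

For the 99.99% target this works out to roughly 52.56 minutes per year, which is where the figure of about 52 and a half minutes comes from.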

What are recovery time objectives (RTO) and recovery point objectives (RPO)?

In addition to 99.99% availability, high availability environments also meet stringent recovery time and recovery point objectives. Recovery time objective (RTO) is a measure of the time elapsed from application failure to restoration of application operation and availability; it reflects how long a company can afford to have that application down. Recovery point objective (RPO) is a measure of how up-to-date the data is when application availability has been restored after an outage. It is often described as the maximum amount of data loss that can be tolerated when a failure happens. SIOS high availability clusters deliver an RPO of zero and an RTO of minutes.
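To make the two definitions concrete, here is a minimal Python sketch that derives the achieved recovery time and data-loss window for a single outage. The function name and all timestamps are hypothetical and are not taken from any SIOS tooling; the results are what you would compare against your RTO and RPO targets.

```python
from datetime import datetime, timedelta


def achieved_recovery_metrics(last_replicated_write, failure_time, service_restored):
    """Return (recovery_time, data_loss_window) for one outage.

    recovery_time is compared against the RTO, data_loss_window against the RPO.
    """
    recovery_time = service_restored - failure_time
    # Writes made after the last replicated write and before the failure are lost.
    data_loss_window = max(failure_time - last_replicated_write, timedelta(0))
    return recovery_time, data_loss_window


# Hypothetical outage: replication was fully caught up when the node failed,
# and the application was back on the standby node two minutes later.
failure = datetime(2022, 5, 25, 12, 0, 0)
recovery_time, data_loss = achieved_recovery_metrics(
    last_replicated_write=failure,
    failure_time=failure,
    service_restored=failure + timedelta(minutes=2),
)
print(recovery_time, data_loss)
```

With replication caught up at the moment of failure, the data-loss window is zero, which corresponds to an RPO of zero; the two-minute restore corresponds to an RTO measured in minutes.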

What is a high availability cluster?

In a high availability cluster, important applications are run on a primary server node, which is connected to one or more secondary nodes for redundancy. Clustering software, such as SIOS LifeKeeper, monitors clustered applications and dependent resources to ensure they are operational on the active node. System-level monitoring is accomplished via periodic heartbeats between cluster nodes. If the primary server fails, the secondary server initiates recovery after the heartbeat timeout interval is exceeded. For application-level failures, the clustering software detects that an application is not available on the active node. It then moves the application and dependent resources to the secondary node(s) in a process called a failover, where operation continues and meets stringent RTOs.
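The heartbeat timeout logic described above can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not LifeKeeper's implementation; the class name, the 15-second timeout, and the injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from a peer node and reports when they stop arriving."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()    # treat start-up as the first heartbeat

    def record_heartbeat(self):
        """Call whenever a heartbeat from the peer is received."""
        self._last_seen = self._clock()

    def peer_failed(self):
        """True once no heartbeat has arrived within the timeout interval."""
        return self._clock() - self._last_seen > self.timeout_seconds


# Hypothetical usage on a standby node with an assumed 15-second timeout.
monitor = HeartbeatMonitor(timeout_seconds=15.0)
monitor.record_heartbeat()
if monitor.peer_failed():
    print("heartbeat timeout exceeded: start recovery on this node")
```

A real cluster would typically send heartbeats over more than one communication path so that a single network fault is not mistaken for a node failure.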

In a traditional failover cluster, all nodes in the cluster are connected to the same shared storage, typically a storage area network (SAN). After a failover, the secondary node is granted access to the shared storage, enabling it to meet stringent RPOs.

Reproduced with permission from SIOS

SIOS Protection Suite for Linux Evaluation Guide for AWS Cloud Environments

May 21, 2022 by Jason Aw

Get Started Evaluating SIOS Protection Suite for Linux in AWS

Use this step-by-step guide to configure and test a two-node cluster in AWS to protect resources such as Oracle, SQL Server, PostgreSQL, NFS, SAP, and SAP HANA.

Before You Begin Your Evaluation

Review these links to understand key concepts you’ll need before you begin your failover clustering project in AWS.

High Availability, RTO, and RPO – Gain a clear understanding of the fundamental concepts of HA.

SIOS Protection Suite for Linux – Integrated Components – Understand the components included in SIOS Protection Suite: SIOS LifeKeeper, DataKeeper, and Application Recovery Kits.

Benefits of SIOS Protection Suite for Linux – Learn how SIOS Protection Suite for Linux provides superior data protection.

How Workloads Should be Distributed when Migrating to a Cloud Environment – Unlike on-premises environments, public cloud offerings give you a wide range of geographical regions and availability zones to choose from when deploying your workloads. Learn best practices for choosing a workload distribution strategy for the cloud.

Public Cloud Platforms and their Network Structure Differences – Public cloud providers have their own networking structures. Learn the basics of how these structures affect your cluster configuration.

How a Client Connects to the Active Node – Learn the key clustering mechanisms that detect and manage the response to application failures.

How does Data Replication between Nodes Work? – Ensuring redundancy of storage is essential. Learn how SIOS DataKeeper efficiently replicates data between nodes.

What is “Split Brain” and How to Avoid It – Learn how to ensure you maintain one active node and one or more standby node(s) even when the network connection fails.

Configuring Network Components

This section outlines the computing resources required for each node, the network structure, and the process required to configure these components.

Creating an Instance on AWS EC2 from Scratch

Configure Linux Nodes to Run SIOS Protection Suite for Linux

Install SIOS Protection Suite for Linux

Login and Basic Configuration

Protecting Critical Resources

Once the IP resource is protected, initiate a switchover (where the “standby” node becomes the “active” node) to test the functionality.

Performance Of Azure Shared Disk With Zone Redundant Storage (ZRS)

May 16, 2022 by Jason Aw

On September 9th, 2021, Microsoft announced the general availability of Zone-Redundant Storage (ZRS) for Azure Disk Storage, including Azure Shared Disk.

What makes this interesting is that you can now build shared storage based failover cluster instances that span Availability Zones (AZ). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for Zone Redundant Storage, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ, leaving users susceptible to outages should an AZ go offline.

There are, however, a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

- Only premium solid-state drives (SSDs) and standard SSDs are supported. Azure Ultra Disks are not supported.
- Azure Shared Disks with ZRS are currently available only in the West US 2, West Europe, North Europe, and France Central regions.
- Disk caching, both read and write, is not supported with premium SSD Azure Shared Disks.
- Disk bursting is not available for premium SSDs.
- Azure Site Recovery support is not yet available.
- Azure Backup is available through Azure Disk Backup only.
- Only server-side encryption is supported; Azure Disk Encryption is not currently supported.

I also found an interesting note in the documentation.

“Except for more write latency, disks using ZRS are identical to disks using LRS, they have the same scale targets. Benchmark your disks to simulate the workload of your application and compare the latency between LRS and ZRS disks.”

While the documentation indicates that ZRS will incur some additional write latency, it is up to the user to determine just how much additional latency they can expect. A link to a disk benchmark document is provided to help guide you in your performance testing.

Following the guidance in the document, I used DiskSpd to measure the additional write latency you might experience. Of course, results will vary with workload, disk type, instance size, etc., but here are my results.

Metric	Locally Redundant Storage (LRS)	Zone Redundant Storage (ZRS)
Write IOPS	5099.82	4994.63
Average Latency (ms)	7.830	7.998

The DiskSpd test that I ran used the following parameters.

diskspd -c200G -w100 -b8K -F8 -r -o5 -W30 -d10 -Sh -L testfile.dat
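For readers less familiar with DiskSpd, the sketch below spells out what each switch in that command does and rebuilds the same command line from the annotated list. The descriptions reflect my reading of the DiskSpd documentation, and the Python names are just for illustration.

```python
THREADS = 8                 # -F8: total number of worker threads
OUTSTANDING_PER_THREAD = 5  # -o5: outstanding I/O requests per thread

# Each switch from the command above, paired with a short description.
DISKSPD_FLAGS = [
    ("-c200G", "create a 200 GiB test file"),
    ("-w100", "100% writes, no reads"),
    ("-b8K", "8 KiB block size"),
    (f"-F{THREADS}", "total number of threads"),
    ("-r", "random rather than sequential I/O"),
    (f"-o{OUTSTANDING_PER_THREAD}", "outstanding I/O requests per thread"),
    ("-W30", "30-second warm-up before measurement starts"),
    ("-d10", "10-second measured duration"),
    ("-Sh", "disable software caching and hardware write caching"),
    ("-L", "collect latency statistics"),
]


def build_command(target="testfile.dat"):
    """Reassemble the DiskSpd command line from the annotated switches."""
    return " ".join(["diskspd"] + [flag for flag, _ in DISKSPD_FLAGS] + [target])


# Total I/O requests kept in flight against the single target file.
total_outstanding_io = THREADS * OUTSTANDING_PER_THREAD

print(build_command())
for flag, meaning in DISKSPD_FLAGS:
    print(f"{flag:8} {meaning}")
print("I/O requests in flight:", total_outstanding_io)
```

The combination of 8 threads and 5 outstanding requests per thread keeps 40 writes in flight at once, which matters when interpreting the IOPS and latency numbers together.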

I wrote to a P30 disk with ZRS and a P30 disk with LRS, both attached to a Standard DS3 v2 (4 vCPUs, 14 GiB memory) instance. The shared ZRS P30 was also attached to an identical instance in a different AZ and added as shared storage to an empty cluster application.

An overhead of roughly 2% (about 2% fewer write IOPS and about 2% higher average latency) seems like a reasonable price to pay to have your data replicated synchronously across three AZs. However, I did wonder what would happen if you moved the clustered application to the remote node, effectively putting your disk in one AZ and your instance in a different AZ.

Here are the results.

Metric	Locally Redundant Storage (LRS)	Zone Redundant Storage (ZRS)	ZRS when writing from the remote AZ
Write IOPS	5099.82	4994.63	4079.72
Average Latency (ms)	7.830	7.998	9.800

In that scenario I measured about a 25% increase in average write latency relative to LRS (about 22.5% relative to ZRS written from the local AZ). If you experience a complete failure of an AZ, both the storage and the instance will fail over to the secondary AZ and you shouldn’t experience this increase in latency at all. However, other failure scenarios that aren’t AZ-wide could very well leave your clustered application running in one AZ with your Azure Shared Disk running in a different AZ. In those scenarios you will want to move your clustered workload back to a node that resides in the same AZ as your storage as soon as possible to avoid the additional overhead.
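The percentages quoted here can be reproduced from the measurements in the tables. The Python sketch below does that, and also checks the numbers against Little's law (throughput is roughly the number of requests in flight divided by latency). It assumes the latency values are in milliseconds, which is how DiskSpd normally reports them, and that the test kept 8 × 5 = 40 writes in flight, per the -F8 and -o5 switches.

```python
# Measurements copied from the tables above (latency assumed to be in ms).
RESULTS = {
    "LRS": {"iops": 5099.82, "latency_ms": 7.830},
    "ZRS": {"iops": 4994.63, "latency_ms": 7.998},
    "ZRS remote AZ": {"iops": 4079.72, "latency_ms": 9.800},
}
QUEUE_DEPTH = 8 * 5  # -F8 threads x -o5 outstanding requests per thread


def percent_change(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def predicted_iops(queue_depth, latency_ms):
    """Little's law: throughput = requests in flight / time per request."""
    return queue_depth / (latency_ms / 1000.0)


baseline = RESULTS["LRS"]
for name, row in RESULTS.items():
    latency_delta = percent_change(baseline["latency_ms"], row["latency_ms"])
    iops_delta = percent_change(baseline["iops"], row["iops"])
    estimate = predicted_iops(QUEUE_DEPTH, row["latency_ms"])
    print(f"{name}: latency {latency_delta:+.1f}%, IOPS {iops_delta:+.1f}% vs LRS, "
          f"Little's law estimate {estimate:.0f} IOPS (measured {row['iops']:.0f})")
```

The estimates land within a fraction of a percent of the measured IOPS in all three cases, which suggests the drop in IOPS is simply the extra write latency showing up at a fixed queue depth rather than a separate throughput limit.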

Microsoft documents how to initiate a storage account failover to a different region when using GRS, but there is no way to manually initiate the failover of a storage account to a different AZ when using Zone Redundant Storage. You should monitor your failover cluster instance to ensure you are alerted any time a cluster workload moves to a different server and plan to move it back just as soon as it is safe to do so.
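As a rough illustration of that kind of monitoring, the Python sketch below scans a series of ownership observations, reports each move, and flags when the workload is no longer on its preferred node. How you collect the observations depends on your clustering software and is out of scope here; the function name, node names, and timestamps are all hypothetical.

```python
def placement_alerts(observations, preferred_node):
    """Report ownership changes and whether a move back is still pending.

    observations: iterable of (timestamp, owner_node) pairs in time order.
    Returns (alerts, needs_failback).
    """
    alerts = []
    previous_owner = None
    for timestamp, owner in observations:
        if previous_owner is not None and owner != previous_owner:
            alerts.append((timestamp, f"workload moved from {previous_owner} to {owner}"))
        previous_owner = owner
    # The workload should return to the node in the same AZ as the storage.
    needs_failback = previous_owner is not None and previous_owner != preferred_node
    return alerts, needs_failback


# Hypothetical samples: the workload moves to node-b during maintenance.
samples = [
    ("09:00", "node-a"),
    ("09:05", "node-a"),
    ("09:10", "node-b"),
]
alerts, needs_failback = placement_alerts(samples, preferred_node="node-a")
for when, message in alerts:
    print(when, message)
print("move back to preferred node:", needs_failback)
```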

You can find yourself in this situation unexpectedly, but it will also certainly happen during planned maintenance of the clustered application servers when you do a rolling update. Awareness is the key to help you minimize the amount of time your storage is performing in a degraded state.

I hope in the future Microsoft allows users to initiate a manual failover of a ZRS disk the same as they do with GRS. The reason they added the feature to GRS was to put the power in the hands of the users in case automatic failover did not happen as expected. In the case of Zone Redundant Storage, I could see people wanting to try to tie together storage and application, ensuring they are always running in the same AZ, similar to how host based replication solutions like SIOS DataKeeper do it.