SIOS SANless clusters



Availability SLAs: FT, High Availability and Disaster Recovery – Where to start

May 14, 2022 by Jason Aw Leave a Comment


It’s fair to say that in this modern era where many aspects of our lives are technology-driven, we live in a very instantaneous world.  For example, at a click of a button, our weekly grocery order arrives on our doorstep.  We can instantly purchase tickets for events or travel.  Or even these days, order a brand-new car without having to go anywhere near a showroom and deal with a pushy salesperson. We are spoilt in this world of convenience.

But let’s spare a thought for all the vendors and the service providers who must underpin this level of service.  They have to maintain a high level of investment to ensure that their underlying infrastructures (and specifically their IT infrastructures) are built and operated in a way where they can support this “always-on” expectation.  Applications and databases have to be always running, to meet both customer demand and maximise company productivity and revenue.  The importance of IT business continuity is as critical as it’s ever been.

Many IT availability concepts are floated about such as fault tolerance (FT), high availability (HA) and disaster recovery (DR).  But this can raise further questions.  What’s the difference between these availability concepts?  Which of them will be right for my infrastructure?  Can they be combined or interchanged? The first and foremost step for any availability initiative is to establish a clear application/database availability service level agreement (SLA).  This then defines the most suitable availability approach.

What is an SLA?

To some extent, we all know what an SLA is, but for this discussion, let's make sure we're all on the same wavelength. The availability SLA is a contract between a service provider and their end-user that defines the expected level of application/database uptime and accessibility a vendor is to ensure, and outlines the penalties involved (usually financial) if the agreed-upon service levels are not met.

In the IT world, the SLA is forged from two measures of criticality to the business – Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). Very simply, the RTO defines how quickly we need the application operation to be restored in the event of a failure. The RPO defines how current our data needs to be in a recovery scenario. Once you can identify these metrics for your applications and databases, they will define your SLA.

The SLA is measured as a percentage, so you may come across terms such as 99.9% or 99.99% available. These are measures of how many minutes of uptime and availability IT will guarantee for the application in a given year. In general, more protection means more cost. It's therefore critical to estimate the cost of an hour of downtime for the application or database and use this SLA as a tool for selecting a solution that makes good business sense.
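The percentage-to-minutes arithmetic behind those SLA figures is easy to sketch. This short Python snippet (illustrative only, not a SIOS tool) converts an availability percentage into an annual downtime allowance:

```python
def allowed_downtime_minutes(sla_percent, period_days=365):
    """Annual downtime allowance implied by an availability SLA percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

for sla in (99.9, 99.99, 99.999):
    print(f"{sla}%  ->  {allowed_downtime_minutes(sla):.2f} minutes of downtime/year")
```

Running it reproduces the figures discussed below: roughly 52 minutes a year at 99.99%, and just over 5 minutes a year at 99.999%.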

Once we have our SLA, we can make a business decision about which type of solution – FT, HA, DR, or a combination thereof — is the most suitable approach for our availability needs.

What is Fault Tolerance (FT)?

FT delivers a very impressive availability SLA at 99.999%.  In real-world terms, an FT solution will guarantee no more than 5.25 minutes of downtime in one year.  Essentially, two identical servers are run in parallel to each other, processing transactions on both servers at the same time in an active-active configuration in what’s referred to as a “lockstep” process. If the primary server fails, the secondary server continues processing, without any interruption to the application or any data loss.  The end-user will be blissfully unaware that a server failure has occurred.

This sounds fantastic!  Why would we need anything else?  But hold on: as awesome as FT sounds on paper, there are some caveats to consider.

The “lockstep” process is a strange beast.  It’s very fussy about the type of server hardware it can run on, particularly in terms of processors.  This limited hardware compatibility list forces FT solutions to sit in the higher end of the cost bracket, which could very much be in the hundreds of thousands of dollars by the time you factor in two or more FT clusters with associated support and services.

Software Error Vulnerability

FT solutions are also designed with hardware fault tolerance in mind and don’t pay much attention to any potential application errors.  Remember, FT solutions are running the same transactions and processes at the same time, so if there’s an application error on the primary server, this will get replicated on the secondary server too.

What is High Availability (HA)?

For most SLAs, FT is simply too expensive to purchase and manage for average use cases.  In most cases, HA solutions are a better option. They provide nearly the same level of protection at a fraction of the cost.  HA solutions provide a 99.99% SLA which equates to about 52 minutes of downtime in one year, by deploying in an Active-Standby manner.  The reduced SLA is introduced as there’s a small period of downtime where the Active server has to switch over to the Standby server before operations are resumed.  OK, this is not as impressive as an FT solution, but for most IT requirements, HA meets SLAs, even for supercritical applications such as CRM and ERP systems.

Equally important, High Availability solutions are more application agnostic, and can also manage the failover of servers in the event of an application failure as well as hardware or OS failures. They also allow a lot more configuration flexibility.  There is no FT-like hardware compatibility list to deal with, as on most occasions they will run on any platform where the underlying OS is supported.

How does Disaster Recovery (DR) fit into the picture?

Like FT and HA, DR can also be used to support critical business functions. However, DR can be used in conjunction with FT and HA.  Fault Tolerance and High Availability are focussed on maintaining uptime on a local level, such as within a datacentre (or cloud availability zone).  DR delivers a redundant site or datacentre to failover to in the event a disaster hits the primary datacentre.

What does it all mean?

At the end of the day, there’s no wrong or right availability approach to take.  It boils down to the criticality of the business processes you’re trying to protect and the basic economics of the solution.  In some scenarios, it’s a no-brainer.  For example, if you’re running a nuclear power plant, I’d feel more comfortable that the critical operations are being protected by an FT system. Let’s face it, you probably don’t want any interruptions in service there.  But for most IT environments, critical uptime can also be delivered with HA at a much more digestible price point.

How to choose: FT, HA and DR? 

  • First and foremost, understand your business operations in detail and identify the cost of downtime.
  • Once your SLAs are established, weigh up the costs of the availability solution of choice against the cost of any potential downtime.
  • When choosing your availability solution, look at ease of deployment and ease of use, as these will also impact the overall TCO of the availability solution.
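The weighing described in the second bullet can be sketched as a back-of-the-envelope comparison. The `total_annual_exposure` helper and all the dollar figures below are hypothetical, purely to illustrate the trade-off:

```python
def total_annual_exposure(downtime_cost_per_hour, sla_percent, solution_annual_cost):
    """Expected annual downtime cost at a given SLA, plus the solution's own cost."""
    allowed_downtime_hours = 365 * 24 * (1 - sla_percent / 100)
    return downtime_cost_per_hour * allowed_downtime_hours + solution_annual_cost

# Hypothetical figures: $10,000/hour of downtime, HA at $25k/year vs FT at $200k/year
ha_exposure = total_annual_exposure(10_000, 99.99, solution_annual_cost=25_000)
ft_exposure = total_annual_exposure(10_000, 99.999, solution_annual_cost=200_000)
```

With these example numbers, HA comes out well ahead: the extra few minutes of potential downtime it concedes cost far less than the FT price premium. The conclusion flips only when an hour of downtime is extraordinarily expensive.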

IT systems are robust, but they can go wrong at the most inconvenient times.   FT, HA and DR are your insurance policies to protect you when delivering SLAs to customers in this instant and convenience-led world.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, Fault Tolerance, High Availability

How to Avoid IO Bottlenecks: DataKeeper Intent Log Placement Guidance for Windows Cloud Deployments

May 9, 2022 by Jason Aw Leave a Comment


When deploying SIOS DataKeeper, it is important to place the intent log (bitmap file) on the lowest-latency disk available to avoid an IO bottleneck and ensure optimal application performance. In AWS, GCP and Azure, the lowest-latency disk available is an ephemeral drive. However, in Azure the difference between using an ephemeral drive and a Premium SSD is minimal, so it is not necessary to use the ephemeral drive when running DataKeeper in Azure. In AWS and GCP, however, it is imperative to relocate the intent log to the ephemeral drive, otherwise write throughput will be significantly impacted.

When leveraging an ephemeral disk for the bitmap file there is a tradeoff. The nature of an ephemeral drive is that the data stored on it is not guaranteed to be persistent. In fact, if the cloud instance is stopped from the console, the ephemeral drive attached to the instance is discarded and a new drive is attached to the instance. In this process the bitmap file is discarded and a new, empty bitmap file is put in its place.

There are certain scenarios where, if the bitmap file is lost, a complete resync will occur. For instance, if the primary server of a SANless cluster is shut down from the console, a failover will occur, but when the server comes back online a complete resync will take place from the new source of the mirror to the old source. This happens automatically, so the user does not have to take any action, and the active node stays online during this resync period.

There are other scenarios where bitmap file placement can also impact performance. For instance, if you are replicating NVMe drives you will want to carve out a small partition on the NVMe drive to hold the bitmap file. A general rule of thumb is that the bitmap file should be on the fastest, lowest latency disk available on the instance. It should also be located on a disk that is not overly taxed with other IO operations.
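One rough way to identify the lowest-latency disk is to time small synchronous writes on each candidate location. The `probe_write_latency` helper below is a hypothetical sketch, not a SIOS tool; dedicated benchmarks such as fio or diskspd give far more rigorous numbers:

```python
import os
import time

def probe_write_latency(path, block_size=4096, iterations=200):
    """Median latency of small synchronous writes to a scratch file in `path`."""
    buf = os.urandom(block_size)
    test_file = os.path.join(path, "latency_probe.tmp")
    latencies = []
    fd = os.open(test_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, buf)
            os.fsync(fd)  # force the write to disk, mimicking frequent bitmap updates
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(test_file)
    latencies.sort()
    return latencies[len(latencies) // 2]  # median, in seconds per write

# Compare candidate locations, e.g. an ephemeral drive vs a network-attached disk:
# print(probe_write_latency("/ephemeral"), probe_write_latency("/data"))
```

The disk with the lowest median synchronous-write latency is the natural home for the bitmap file, provided it is not already saturated with other IO.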

Information on how to relocate the intent log, as well as additional detail on how the intent log is used, can be found in the DataKeeper documentation.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: DataKeeper, Intent Logs

Leading Media Platform Protects Critical SAP S/4 HANA in AWS EC2 Cloud

May 5, 2022 by Jason Aw Leave a Comment


SIOS Chosen Based on Certifications and Validations for SAP, Amazon Web Services and Red Hat Linux

This leading media organization reaches virtually all adults in Singapore weekly via the widest range of media platforms in the country, spanning digital, television, radio, print and out-of-home media, and more than 50 products and brands in four languages (English, Mandarin, Malay and Tamil).

The Environment

The company uses an SAP S/4 HANA application and HANA database to power operations throughout the organization, unifying operations across multiple departments. They were running these critical applications and database in a Red Hat Linux environment in their on-premises data center. Protecting these essential systems from downtime and disasters is a top priority for this organization’s IT team.

The Challenge

The media organization’s IT team recognized that they could realize significant cost savings by moving their SAP applications and HANA database into the AWS EC2 cloud. However, for the migration to be successful, they needed a high availability (HA) and disaster recovery (DR) solution that would let them “lift and shift” their existing SAP landscape to AWS without disruption.

The Evaluation

The company’s IT team wanted an HA/DR solution that they could rely on to meet their 99.99% availability SLA in the cloud. The solution needed to be certified by both SAP and AWS and support a Red Hat Linux operating system. Finally, to ensure they could deliver full DR protection for these critical workloads, they needed a clustering solution that could fail over across multiple AWS availability zones (AZs). An AWS solution architect recommended that the organization use SIOS LifeKeeper for Linux clustering software. The company’s IT team had a short timeline to complete the project and needed to choose an HA vendor who could meet their requirements without impeding their progress.

The Solution

The organization chose SIOS LifeKeeper because it not only met all of their criteria, but also because the SIOS team got organized very quickly, enabling them to keep their cloud migration project on schedule. SIOS LifeKeeper is certified by both AWS and SAP for high availability on Red Hat and in the AWS EC2 cloud environment. The SIOS solution also met another key criterion: it is able to replicate data and provide redundancy across AWS availability zones. The organization’s IT team was impressed with SIOS’ dedicated local support team, who were available to answer questions and provide support 24 hours a day, 7 days a week. The organization currently has five pairs of failover clusters using SIOS LifeKeeper for Linux to protect S/4 HANA and SAP HANA applications running across multiple availability zones in AWS EC2.

The Results

“The availability has been reliable; we have not had any extended downtime since using the software. By enabling us to migrate these important workloads to the cloud without sacrificing HA/DR or application performance, SIOS allowed us to achieve significant cost savings,” he said. SIOS LifeKeeper’s ease of use saved Mediacorp even more by minimizing IT admin time.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: SAP S/4 HANA application

Protect Systems from Downtime

May 1, 2022 by Jason Aw Leave a Comment


In today’s business environment, organizations rely on applications, databases, and ERP Systems such as SAP, SQL Server, Oracle and more. These applications unify and streamline your most critical business operations. When they fail, they cost you more than just money. It is critical to protect these complicated systems from downtime.

Proven High Availability & Disaster Recovery

SIOS has 20+ years of experience in high availability and disaster recovery.  SIOS knows there isn’t a one-size-fits-all solution. Today’s data systems are a combination of on-premises, public cloud, hybrid cloud and multi-cloud environments. The applications themselves can create even more complexity. And configuring open-source cluster software can be painstaking, time consuming, and prone to human error.

SIOS has solutions that provide high availability and disaster recovery for critical applications. These solutions have been developed based on our real-world experience across different industries and use cases. Our products include SIOS DataKeeper Cluster Edition for Windows and SIOS LifeKeeper for Linux or Windows. These powerful applications provide failover protection. The Application Recovery Kits included with LifeKeeper speed up application configuration by automating setup and validating inputs.

System Protection On-Premises, in the Cloud or in Hybrid Environments

SIOS provides the protection you need for business-critical applications and reduces complexity of managing them, whether on-premises, in the cloud, or in hybrid cloud environments. Learn more about us in the video below or contact us to learn more about high availability and disaster recovery for your business critical applications.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

How to Achieve High Availability in the Cloud Using WSFC

April 29, 2022 by Jason Aw Leave a Comment


Figure 1: Failover Cluster for High Availability in Azure

Microsoft Windows Server includes Windows Server Failover Clustering (WSFC) software to ensure the availability of critical applications. In an on-premises environment, primary and standby nodes in the cluster are connected to the same shared storage. However, this infrastructure cannot be taken directly to the cloud. Shared storage that spans both primary and standby systems is essential in WSFC, but shared storage cannot be used with public cloud IaaS (Infrastructure as a Service) offerings from AWS, Azure, or Google Cloud.

Geographically Separated Shared Storage for WSFC is Not Available in the Cloud

When migrating on-premises applications to the cloud, companies prefer to move their entire infrastructure to the cloud, including WSFC, without changing the on-premises operation process. This allows them to minimize disruption by applying the same WSFC skills and know-how in the cloud.

Figure 2: Failover Clustering in AWS EC2. Cluster nodes are in geographically separated Availability Zones for DR protection.

The servers that make up the cluster are divided into the primary node – where the application runs – and standby node(s). WSFC software monitors the application and server node to ensure they are operational. If WSFC detects something wrong with the primary node, it switches operation of the application to the standby node in a process called “failover”.

In a WSFC environment, the primary server and the standby server are connected to shared storage – typically a SAN (Storage Area Network) or iSCSI SAN array.

To fail over operations from the primary server to the standby server, the storage connection must be switched so that the standby server can read from and write to the SAN that the primary server normally uses. In this way, it is possible to restart the service in a short time, allowing the standby server to access the same data as the primary node and meet low Recovery Point Objectives (RPOs).

See related content: Disaster Recovery Fundamentals.

However, when migrating WSFC to the cloud, there is no SAN available. For example, in Amazon Web Services (AWS) and Microsoft Azure you cannot attach a single disk to multiple nodes (servers) to use as shared storage. The same applies to IaaS offerings from other cloud providers.

It is possible to build an HA cluster configuration based on WSFC without shared storage, but it requires extremely advanced skills, such as writing your own program to recover data on the standby node. The operation is complicated, and its behaviour is not easy to verify when an incident occurs.

Data Replication Software Solves the Problem

To rectify this problem, you can install data replication software that is specialized for HA clusters – such as SIOS DataKeeper Cluster Edition – and synchronize storage among local servers. Data on the local disks of the primary and standby nodes are synchronized in real-time using host-based, block-level replication. With this method, you do not need shared storage. Instead, you can build an HA cluster configuration using familiar WSFC without disrupting established processes.
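Conceptually, host-based, block-level synchronous replication behaves like the simplified sketch below. This is purely illustrative – DataKeeper's real implementation is a kernel-level driver operating below the file system, not Python, and the `ReplicatedVolume` class is invented for this example:

```python
class ReplicatedVolume:
    """Toy model of synchronous block-level replication between two nodes."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.primary_disk = {}  # block index -> bytes; stands in for the local disk
        self.standby_disk = {}  # stands in for the standby node's local disk

    def write(self, block_index, data):
        """Synchronous write: the application is only acknowledged once the
        standby node holds the same block (in reality, shipped over the network)."""
        self.primary_disk[block_index] = data
        self.standby_disk[block_index] = data
        return True  # acknowledgement back to the application

    def failover(self):
        """On failover, the standby's copy becomes the active volume, so the
        application sees the same data it last wrote."""
        self.primary_disk, self.standby_disk = self.standby_disk, self.primary_disk
```

Because every acknowledged write already exists on the standby's local disk, a failover exposes identical data without any shared storage, which is exactly the property WSFC expects from a SAN.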

With DataKeeper, synchronized nodes appear as a SAN in the WSFC management screen (Failover Cluster Management). If your operations managers have used WSFC, they will require little to no training with this approach.

High Availability in the Cloud Surpasses On-Premises HA with SIOS DataKeeper and WSFC

DataKeeper Cluster Edition is a software add-on that seamlessly integrates with Windows Server Failover Clustering (WSFC) to add performance-optimized, host-based synchronous or asynchronous replication. In the unlikely event that the primary node fails, WSFC orchestrates the failover of operations to the standby node(s), which access the replicated local storage as if it were shared storage. This simple mechanism makes it possible to move to AWS without changing the operations of the existing system.

Without compromising familiar WSFC operations, it is possible to guarantee high availability in the cloud using DataKeeper that is equivalent to or better than on-premises high availability. The advantage of this cluster configuration is that it is very simple and can be easily applied to any cloud environment.

Seamless Integration with WSFC

SIOS DataKeeper Cluster Edition seamlessly integrates with and extends Windows Server Failover Clustering (WSFC) by providing a performance-optimized, host-based data replication mechanism. While WSFC manages the software cluster, SIOS performs the data replication to enable disaster protection and ensure zero data loss in cases where shared storage clusters are impossible or impractical, such as in cloud, virtual, and high-performance storage environments.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified

