Failover Clustering With VMware High Availability: Overkill Or A Perfect Match?

November 28, 2018 by Jason Aw Leave a Comment

Implementing high availability (HA) at the VMware layer is a useful solution that protects against some types of failures. However, VMware HA alone simply doesn’t cover all the bases. Let’s explore how failover clustering pairs with VMware High Availability.

According to Gartner Research, most unplanned outages are caused by application failure (40 percent of outages) or admin error (40 percent). Hardware, network, power, or environmental problems cause the rest (20 percent total). VMware HA focuses on protection against hardware failures, but a good application-clustering solution picks up the slack in other areas.

Having A Good Strategy Is Essential For Failover Clustering with VMware High Availability

Here are a few things to consider when architecting the proper HA strategy for your VMware environment.

Shorten Outages With Application-level Monitoring And Clustering

What about recovery speed? In a perfect world, there would be no failures, outages or downtime. But if an unplanned outage does occur, the next best thing is to get up and running again fast. This equation represents the total availability of your environment:
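
A common way to write it, splitting downtime into a detection term and a recovery term, is:

    Availability = MTBF / (MTBF + MTTD + MTTR)

Here MTBF is the mean time between failures, MTTD is the mean time to detect a failure, and MTTR is the mean time to recover once the failure has been detected. Anything that shrinks MTTD or MTTR pushes availability up.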

As you can see, detection time is a crucial piece of the equation. Here’s another place where VMware HA alone doesn’t quite cut it. VMware HA treats each virtual machine (VM) as a “black box” and has no real visibility into the health or status of the applications that are running inside. The VM and OS running inside might be just fine, but the application could be stopped, hung, or misconfigured, resulting in an outage for users.
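
As a deliberately simplified sketch of what application-level monitoring adds (generic Python, not SIOS product code; the port, timeout, and restart command are assumptions for illustration), the loop below probes the application itself rather than the VM:

    import socket
    import subprocess
    import time

    APP_HOST = "127.0.0.1"   # probe the application running inside this VM
    APP_PORT = 1433          # hypothetical: SQL Server's default TCP port
    CHECK_INTERVAL = 10      # seconds between health probes
    RESTART_CMD = ["systemctl", "restart", "mssql-server"]   # hypothetical recovery action

    def app_is_healthy() -> bool:
        """Return True if the application accepts a TCP connection within 3 seconds."""
        try:
            with socket.create_connection((APP_HOST, APP_PORT), timeout=3):
                return True
        except OSError:
            return False

    while True:
        if not app_is_healthy():
            # The VM and the OS may be fine, but the application itself is down;
            # this is exactly the condition a hypervisor-level monitor cannot see.
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(CHECK_INTERVAL)

A real clustering agent does far more (escalation, failover to a standby node, quorum handling), but the key point is the same: the health check runs against the application, inside the guest.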

Even when a host server failure is the issue, you must wait for VMware HA to restart the affected VMs on another host in the VMware cluster. That means that applications running on those VMs are down until 1) the outage is detected, 2) the OS boots fully on the new host system, 3) the applications restart, and 4) users reconnect to the apps.
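
To make that sequence concrete, here is a purely hypothetical tally; the actual durations depend entirely on your hardware, guest OS, and applications:

    total downtime = detection + OS boot + application restart + user reconnect
                   ≈ 30 s + 120 s + 90 s + 30 s
                   = 270 s, or roughly 4.5 minutes per incident

The OS boot step is usually the largest single contributor.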

By clustering at the application layer between multiple VMs, you are not only protected against application-level outages, you also shorten your outage-recovery time. The application can simply be restarted on a standby VM that is already booted up and waiting to take over. To maximize availability, the VMs involved should live on different physical servers, or better yet, in separate VMware HA clusters or even separate data centers.

Eliminate Storage As A Potential Single Point Of Failure (SPOF)

Traditional clustering solutions, including VMware HA, require shared storage and typically protect applications or services only within a single data center. Technically, the shared-storage device represents an SPOF in your architecture. If you lose access to the back-end storage, your cluster and applications are down for the count. The goal of any HA solution is to increase overall availability by eliminating as many potential SPOFs as possible.

So how can you augment a native VMware HA cluster to provide greater levels of availability? To protect your entire stack, from hardware to applications, start with VMware HA. Next, you need a way to monitor and protect the applications. Clustering at the application level (i.e., within the VM) is the natural choice. Be sure to choose a clustering solution that supports host-based data replication (i.e., a shared-nothing configuration) so that you don’t need to go through the expense and complexity of setting up SAN-based replication. SAN replication solutions also typically lock you into a single storage vendor. On top of that, to cluster VMs by using shared storage, you generally need to enable Raw Device Mapping (RDM). This means that you lose access to many powerful VMware functions, such as vMotion.
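
To make “host-based replication” concrete, here is a toy sketch of the idea in Python. Real products (DRBD, SIOS DataKeeper, and others) work at the block-device or volume level and add write ordering, resynchronization, and synchronous/asynchronous modes, so treat the names, port, and framing below purely as illustrative assumptions:

    import socket
    import struct

    HEADER = struct.Struct("!QI")        # (offset, length) prefix for each replicated write
    STANDBY_ADDR = ("10.0.0.2", 7789)    # hypothetical standby node; the primary would
                                         # connect here with socket.create_connection()

    def ship_write(sock: socket.socket, offset: int, data: bytes) -> None:
        """Primary side: send one local disk write to the standby over the network."""
        sock.sendall(HEADER.pack(offset, len(data)) + data)

    def apply_write(sock: socket.socket, replica_path: str) -> None:
        """Standby side: receive one write and apply it to the local replica volume."""
        offset, length = HEADER.unpack(sock.recv(HEADER.size, socket.MSG_WAITALL))
        payload = sock.recv(length, socket.MSG_WAITALL)
        with open(replica_path, "r+b") as vol:   # replica file/volume assumed to exist
            vol.seek(offset)
            vol.write(payload)

Because each node writes to its own local storage, there is no shared LUN to lose, no RDM requirement, and the standby copy can sit in another data center.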

Going with a shared-nothing cluster configuration eliminates the storage tier as an SPOF while still allowing you to use vMotion to migrate your VMs between physical hosts. It’s a win-win! A shared-nothing cluster is also an excellent solution for disaster recovery because the standby VM can reside at a different data center.

Cover All The Bases 

Application-failover clustering, layered over VMware HA, offers the best of both worlds. You can enjoy built-in hardware protection and application awareness, greater flexibility and scalability, and faster recovery times. Even better, the solution doesn’t need to break the bank.

Want to understand how Failover Clustering With VMware High Availability could work for your projects? Check in with SIOS.
Reproduced with permission from LinuxClustering

Filed Under: Clustering Simplified Tagged With: failover clustering with vmware high availability, High Availability, Virtualization, VMware

Microsoft Wants Your Input On The Next Version Of Windows Server

March 13, 2018 by Jason Aw Leave a Comment

Windows Server has a new UserVoice page: http://windowsserver.uservoice.com/forums/295047-general-feedback with subsections:

  • Clustering: http://windowsserver.uservoice.com/forums/295074-clustering
  • Storage: http://windowsserver.uservoice.com/forums/295056-storage
  • Virtualization: http://windowsserver.uservoice.com/forums/295050-virtualization
  • Networking: http://windowsserver.uservoice.com/forums/295059-networking
  • Nano Server: http://windowsserver.uservoice.com/forums/295068-nano-server
  • Linux Support: http://windowsserver.uservoice.com/forums/295062-linux-support

This is where YOU get to provide Microsoft with your feedback directly.

Reproduced with permission from https://clusteringformeremortals.com/2015/05/12/microsoft-wants-your-input-on-the-next-version-of-windows-server/

Filed Under: Clustering Simplified Tagged With: Clustering, Linux Support, Microsoft, Nano Server, Networking, storage, UserVoice, Virtualization, Windows Server

Making Sense Of Virtualization Availability Options

January 21, 2018 by Jason Aw Leave a Comment

What Are The Virtualization Availability Options?

Microsoft Windows Server 2008 R2 and vSphere 4.0 have just been released, so let’s take a look at some virtualization availability options to consider for your virtual servers and the applications running on them.

I will also take this opportunity to describe some of the features that enable virtual machine availability. I have grouped these features by their functional roles to help highlight their purpose.

Planned Downtime

Live Migration and VMware’s vMotion are both solutions that allow an administrator to move a virtual machine from one physical server to another with no perceivable downtime. There is one key thing to remember: the move must be a planned event, because the virtual machine’s memory is synchronized between the servers before the actual switchover occurs. This is true of both Microsoft’s and VMware’s solutions. Also keep in mind that both of these technologies require shared storage to hold the virtual hard disks (VMDK and VHD files), which limits Live Migration and vMotion to local area networks. It also means that any planned downtime for the storage array itself must be handled in a different way, which is important to note if you want to limit the impact on your virtual machines.

Unplanned Downtime

Microsoft’s Windows Server Failover Clustering and VMware’s High Availability (HA) are solutions that protect virtual machines in the event of unplanned downtime. Both work in a similar way: they monitor virtual machines for availability and, if a failure occurs, move the affected VMs to a standby node. Because the failure is unplanned, there is no opportunity to synchronize memory beforehand, so the virtual machines are rebooted as part of the recovery process.

Disaster Recovery

How do I recover my virtual machines in the event of a complete site loss? The good news is that virtualization makes this process a whole lot easier: a virtual machine is simply a file that can be picked up and moved to another server. Up to this point, VMware and Microsoft are pretty similar in their availability features and functionality. However, here is where Microsoft really shines. VMware offers Site Recovery Manager, which is a fine product, but it supports only SRM-certified, array-based replication solutions. Also, the failover and failback process is not trivial and can take the better part of a day to do a complete round trip from the DR site back to the primary data center. It does have some nice features, such as DR testing. In my experience, though, Microsoft has the better solution when it comes to disaster recovery.

Microsoft’s Hyper-V DR solution

Microsoft’s Hyper-V DR solution is Windows Server Failover Clustering in a multi-site cluster configuration (see video demonstration). In this configuration, the performance and behavior are the same as a local area cluster, yet it can span data centers. Essentially, you can move your virtual machines across data centers with little to no perceivable downtime. Failback is the same process: just point and click to move the virtual machine resource back to the primary data center. There is no built-in “DR testing” feature, although I think being able to run an actual DR test in a matter of a minute or two, with no perceivable downtime, is preferable.

Host-Based Replication Vendors

One other thing I like about WSFC multi-site clusters is that the replication options include not only array-based replication vendors, but also host-based replication vendors. This really gives you a wide range of replication solutions in all price ranges and does not require that you upgrade your existing storage infrastructure.

Fault Tolerance

Fault tolerance basically eliminates the need to reboot a virtual machine in the event of an unexpected failure. VMware has the edge here in that it offers VMware FT, and a few other third-party hardware and software vendors play in this space as well. There are plenty of limitations and requirements when it comes to implementing FT systems. This is an option if you need a hardware component failure to result in zero downtime rather than the minute or two it takes to boot up a VM in a standard HA configuration. You probably want to make sure that your existing servers are already chock full of hot-standby CPUs, RAM, power supplies, and so on, and that you have redundant paths to the network and storage; otherwise you may be throwing good money after bad. Fault tolerance is great for protection from hardware failures, but what happens if your application or the virtual machine’s operating system is behaving badly? That is when you need application-level clustering, as described below.

Application Availability

Everything I have discussed up to this point really only takes into consideration the health of your physical servers and your virtual machines as a whole. This is all well and good, but what happens if your virtual machine blue screens? Or what if that latest SQL service pack broke your application? In those cases, none of these solutions are going to do you one bit of good. For your most critical applications, you really must cluster at the application layer: look into clustering solutions that run within the OS on the virtual machine rather than within the hypervisor. In the Microsoft world, this means MSCS/WSFC or third-party clustering solutions. Your storage options, when clustering within the virtual machine, are limited to either iSCSI targets or host-based replication solutions. Currently, VMware does not really have a solution to this problem; it defers to solutions that run within the virtual machine for application-layer monitoring.

Summary

With the advent of virtualization, it is really not a question of whether you need availability, but of which virtualization availability options will help you meet your SLA and/or DR requirements. I hope this information helps you make sense of the options available to you.

Reproduced with permission from https://clusteringformeremortals.com/2009/08/14/making-sense-of-virtualization-availability-options-2/

Read our success stories to understand how SIOS can help you

Filed Under: Clustering Simplified Tagged With: DataKeeper, DR, Virtualization, Virtualization Availability Options, Vmotion, VMware, VMware High Availability
