The simple days of HA & DR are gone

Date: November 20, 2022

Tags: disaster recovery, High Availability

Flipping through the TV channels, I stumbled on the scene in the movie “He’s Just Not That Into You” where Drew Barrymore says what most of us in 2022 are feeling about technology, and especially about high availability and disaster recovery:

“I miss the days when you had one phone number and one answering machine and that one answering machine had one cassette tape and that one cassette tape either had a message from a guy or it didn’t. And now you just have to go around checking all these different portals just to get rejected by seven different technologies. It’s exhausting.”

Sometimes, don’t you wish there was only one cloud, or maybe even no cloud platform; one DB running on one OS; and only a front end application to worry about? But the world has changed. It is moving faster and becoming more complicated. Advances in technology, the fallout of mergers and acquisitions, and the increasing appetites and pace of our 24/7 society, with billions of consumers looking for the latest deal and the best experience, mean that the simple days are gone.

4 hard truths about your availability

Your solution isn’t as simple as you think

Of course your enterprise environment isn’t simple. You have legacy systems and applications, the kind that have been around almost since punch cards. You have new systems, made for the new generation of applications and databases. In addition, you have solutions that were created a decade ago to bridge the gap during a migration from one platform to another, and despite your best efforts, these systems linger. Added to these challenges is a growing set of systems and IT resources from the merger and acquisition of Company U. In the new era, delivering HA is not as simple as you think.

Bad architecture is a bigger problem than you realize

As VP of Customer Experience, I’ve seen the damage caused by bad architecture. While deploying HA software can definitely help improve the availability of an application and its database, HA software will never fully overcome incomplete requirements, poor networking, a lack of redundant hardware, or other missing architectural components. Our team once worked with a customer to correct an undersized environment that left their system unstable during peak operating times. Because of their bad architecture, which included networking and hardware instability, their teams frequently found themselves scrambling to recover from avoidable downtime. To have a complete, sound, highly available, and resilient solution, you need to deploy great software as part of a sound architecture.

Your admins need more help than they’ll admit

Developing an enterprise-grade, resilient HA solution, built on a solid architecture with the ability to grow, is not a simple process. Designing and architecting for resilience and for application and data availability is not as easy as grabbing a box of cake mix off the shelf. Throw in an array of tools, processes from different teams, a mixture of SLAs, and the varieties of OS, applications, databases, and platforms, and you have a recipe for needing help. Recently, I interviewed a 20-year veteran working in an enterprise support environment. He described how many of his peers, and at times even he, have not been able to handle the weight of maintaining critical enterprise availability. Your admins need help not only when they have been up since 2am dealing with a catastrophic, multi-system, multi-application, nearly complete data center collapse, but also in the day-to-day hard work of enterprise availability in one of the most technologically complex eras ever.

Your solution may not be as highly available as you think

“While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware.” There are many other reasons for application downtime that aren’t covered by cloud provider SLAs, including:

Software issues and bugs
Human errors
Software failure
System or application hangs
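
To put rough numbers on that gap, here is a minimal sketch of how availability compounds when an application depends on several layers that must all be up at once. The percentages are hypothetical and are not taken from any provider's SLA, and the calculation assumes the layers fail independently of one another.

```python
# Hypothetical figures for illustration only; not taken from any real SLA.
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(*layers: float) -> float:
    """Availability of a stack in which every layer must be up.

    Assumes the layers fail independently, so their availabilities multiply.
    """
    result = 1.0
    for availability in layers:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


infrastructure = 0.9999  # assumed figure for the hardware layer
application = 0.995      # assumed figure for software bugs, human error, hangs

stack = combined_availability(infrastructure, application)
print(f"Infrastructure alone: {downtime_minutes_per_year(infrastructure):.0f} min/year")
print(f"Whole stack:          {downtime_minutes_per_year(stack):.0f} min/year")
```

Under these assumed figures, almost all of the expected downtime comes from the layers the infrastructure SLA does not cover.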

As VP of Customer Experience, I’ve seen a thing or two, including a denial of service caused by a failed exit in a recursion routine, system exhaustion, security software quarantining healthy, critical applications, kernel panics, and virtual machines that randomly reboot. If your HA strategy relies solely on the SLAs of your hypervisor, your solution may not be as highly available as you think. You need to protect critical applications with clustering software that can monitor and detect issues, respond to problems reliably, and, if necessary, move operations to a standby server so that your products and services remain reliable and available when and where they are needed.
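
The monitor, detect, respond, and fail over cycle described above can be sketched in a few lines. This is a simplified illustration and not how any particular clustering product works: the `FailoverMonitor` class, its `check_health` and `failover` callbacks, and the threshold of three consecutive failed probes are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Hypothetical monitor that fails over after repeated failed health checks."""

    check_health: Callable[[], bool]  # hypothetical probe of the primary application
    failover: Callable[[], None]      # hypothetical action that promotes the standby
    max_failures: int = 3             # assumed threshold of consecutive failures
    failures: int = 0
    failed_over: bool = False

    def tick(self) -> str:
        """Run one monitoring cycle and return the resulting state."""
        if self.failed_over:
            return "standby-active"
        if self.check_health():
            self.failures = 0  # a healthy probe resets the failure count
            return "healthy"
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failover()
            self.failed_over = True
            return "failed-over"
        return "degraded"


# Simulated probe results: one healthy check followed by three failures.
probe_results = iter([True, False, False, False])
events: list[str] = []

monitor = FailoverMonitor(
    check_health=lambda: next(probe_results),
    failover=lambda: events.append("promote standby"),
)
states = [monitor.tick() for _ in range(4)]
print(states)  # ['healthy', 'degraded', 'degraded', 'failed-over']
```

Requiring several consecutive failures before acting is one way to avoid failing over on a single missed probe. A real deployment would also need to handle split-brain, data replication, and failback, which this sketch leaves out.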

Our single data center has become a series of cloud platforms, spanning dozens of data centers. Our skunkworks application has become part of the bevy of critical front end, middleware, and backend solutions that we must manage across Windows, Linux, and a few different *nix varieties. The march of technology means that our high availability has become more complex and requires better architecture. It also means that our teams need more help to manage it all, and if we aren’t careful, it could mean that we remain vulnerable and exposed. Which of the four truths is your team facing most?

Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS