Glossary: Disaster Recovery

May 23, 2021 by Jason Aw

Glossary of Terms: Disaster Recovery

Definition: Disaster recovery provides a structured approach for responding to unplanned incidents that threaten a company’s IT infrastructure. Disaster recovery protection typically involves ensuring redundancy and geographic separation of business-critical operations, data, and systems.

Reproduced from SIOS

Cloud Availability: The Biggest Trap of 2021

April 12, 2021 by Jason Aw

Author Carey Nieuwhof hooked me with a blog topic of the biggest trap for 2021. While not directly speaking to HA, the topic alone made me reflect on some of the trends of 2020. Cloud innovations are numerous and begin at the most fundamental levels of the infrastructure, not to mention advances in AI, machine learning, compute capacity and algorithms, memory management and sharing, and a battery of others. All of these advances add up to making the current-generation cloud the most robust, reliable and available data center. These centers, optimized with redundant power, cooling, a legion of IoT devices for monitoring and alerting, redundant networking, high-speed interconnects, massive servers, storage, and disks, are impressive, and quite possibly the biggest trap that may be looming in 2021.

The biggest trap of 2021 will be believing that cloud availability alone is the same as or enough for higher availability. This is a complex trap to dissect. The list of named advances that make up the backbone of many data centers is indeed vast and impressive, and it is only a fraction of the technological innovations that exist driving the cloud. So, what makes this massively redundant, high capacity, and AI driven infrastructure a trap? Namely, that hardware and infrastructure availability still leave your enterprise at risk.

The Top Risks of Cloud High Availability

Disks

Disks have gotten faster and more intelligent. New eye-popping advances in chip sets, access technology, manufacturing, storage capacity and RAID technology mean that cloud vendors are able to put up gaudy numbers for speed, access, and redundancy. This reduces the risk of single points of failure (SPOFs) in the disk infrastructure and provides confidence that the loss of a single disk, or even a momentary loss of power to the disks, will not cause a lack of availability.

Storage Arrays

The storage arrays and enclosures housed within the data center providing access to the disks have also greatly improved. No longer the big eyesore of blinking lights and airboat-sized fans, these units are small in size but loaded with capacity and performance enhancements. You’ll be hard pressed to find a modern chassis that isn’t built with redundant power and redundant disk capabilities, and able to provide near zero replication between connected storage units, even between units that are dispersed at greater distances. In addition, these units have added the benefits of AI to predict failures, proactively resolve problems, and optimize workloads to reduce performance bottlenecks.

Servers

Remember when big-name manufacturers and tech prognosticators were predicting game-changing technology that would reshape the landscape of the future? It seems like decades ago that people were predicting server technology advances such as reduced footprint, faster and more complex chipsets, NVMe, battery efficiencies, cooling advances, storage advancements, in-memory and persistent memory advances, GPUs and bare metal provisioning. That future has arrived and been surpassed. Servers are now accelerating the pace of cloud computing capabilities and increasing the ability of the cloud to promote redundancy, reliability and robustness.

Networking

Advances in the networking solutions, tools, software and equipment also make the list of things that make cloud availability stronger in 2020. Over the last few years, vendors have released solutions that have expanded the speed, possible topologies, capacities and distance capabilities of inter- and intra- cloud networks. Like so many other technologies, vendors are automating traffic flow and patterns using AI and Machine Learning, taking advantage of advances in manufacturing to build in device redundancy that can be leveraged for availability and reliability.

Applications

Applications are still a vulnerable part of the cloud architecture when left unprotected. Applications that are not protected by an application-aware higher availability module or framework, or a SIOS Application Recovery Kit (ARK), run the risk of being down at the most critical moment in your business lifecycle. A SIOS ARK provides the application in the cloud with critical application-aware monitoring and recovery, as well as failover and disaster recovery orchestration in the event of a failure.

Databases

While a number of databases have increased their robustness, and some have even jumped in to offer replication enhancements, these databases are still a risk on their own. Databases with replication technology still need orchestration, automation, and the intelligence to make sure that they are highly available to the application components that need them. What good is it if your database continues to hum along in your primary Region and Availability Zone if your application has actually failed over to a different Region or DR site? Supplement databases with replication, such as the SAP HANA database, with the automation and best practices of the SIOS Technology Corp HANA ARK and the SAP certified SAP S/4 HANA ARK. Protect databases that do not have replication technology, or whose technology is limited, with the combination of the SIOS Protection Suite, SIOS DataKeeper for Linux and the associated ARK.
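To make the split-site risk concrete, here is a minimal sketch of a placement check. It is only an illustration, not how any SIOS product works: the component names, site labels, and the `misplaced_components` helper are all hypothetical.

```python
from typing import Dict, List


def misplaced_components(active_site: Dict[str, str],
                         anchor: str = "application") -> List[str]:
    """Return components whose active site differs from the anchor's site.

    `active_site` maps a component name to the site where it is currently
    active. All names and site labels here are hypothetical.
    """
    target = active_site[anchor]
    return sorted(name for name, site in active_site.items() if site != target)


# Hypothetical state after a failover: the application moved to the DR site,
# but the database is still active in the primary region.
state = {
    "application": "dr-site",
    "database": "primary-region",
    "shared-storage": "dr-site",
}

for name in misplaced_components(state):
    print(f"{name} is not active where the application is running")
```

An orchestration layer would act on a result like this by moving the lagging component, rather than just reporting it.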

Storage

In the realm of disks and storage it can be tempting to believe that capacity and the redundancy of software and hardware RAID mean that you are highly available. However, storage is only available if it is accessible to the applications and Virtual Machines that need it. What technology do you have deployed to monitor and recover mounted cloud shares and volumes such as EFS and ANF? Unplanned downtime and its associated chaos can be as near as an unintended unmount or offline operation by a well-intentioned user.
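As a rough illustration of monitoring mounted shares, the sketch below checks a list of required mount points against a mount table in the Linux `/proc/mounts` format. The mount paths and sample table are made up for the example, and a real monitor would also need to attempt recovery, which this sketch does not do.

```python
from typing import Iterable, List, Set


def mounted_paths(mounts_text: str) -> Set[str]:
    """Collect mount points from text in /proc/mounts format
    (device, mount point, filesystem type, options, ...)."""
    paths = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            paths.add(fields[1])  # the second field is the mount point
    return paths


def missing_mounts(required: Iterable[str], mounts_text: str) -> List[str]:
    """Return the required mount points that are not currently mounted."""
    present = mounted_paths(mounts_text)
    return [path for path in required if path not in present]


# Hypothetical mount table and required shares, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
fs-example:/ /mnt/efs/app-data nfs4 rw,relatime 0 0
"""
REQUIRED = ["/mnt/efs/app-data", "/mnt/anf/sapmnt"]

for path in missing_mounts(REQUIRED, SAMPLE_MOUNTS):
    print(f"ALERT: {path} is not mounted")
```

On a Linux host the sample text could be replaced with the contents of `/proc/mounts`. With the sample above, only `/mnt/anf/sapmnt` is reported.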

Virtual Machines

Hypervisor technology has made managing your virtual machine push-button easy. Integrated cloud solutions promise to monitor whether the VM is available and provide options such as restart or migrate. These solutions are not enough to cover issues with your Virtual Machine that may stall, delay, or degrade your availability. In addition to what your cloud vendor provides, you need a monitoring and availability solution that understands how to monitor VM health, including:

Disk capacity
CPU deadlocks
Memory contention and errors
Resource exhaustion

A VM that runs without the ability to process application requests may escape the eye of your cloud-only monitoring, but shouldn’t escape the watchful monitoring of your higher availability solution.
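The idea of looking inside the VM, rather than only asking whether it is powered on, can be sketched as a simple threshold check. The metric names and limits below are assumptions chosen for illustration. Detecting true CPU deadlocks or memory errors needs deeper probes than this sketch attempts.

```python
import shutil
from typing import Dict, List

# Hypothetical limits; real values depend on the workload.
LIMITS = {"disk_used": 0.90, "memory_used": 0.95, "cpu_load_per_core": 4.0}


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def failing_checks(sample: Dict[str, float],
                   limits: Dict[str, float] = LIMITS) -> List[str]:
    """Return the names of metrics that exceed their limit.

    A metric missing from `sample` is also reported, on the assumption
    that "no data" should not be read as "healthy".
    """
    failing = []
    for name, limit in limits.items():
        value = sample.get(name)
        if value is None or value > limit:
            failing.append(name)
    return failing


# Hypothetical sample from a VM that is "up" but short on memory.
sample = {"disk_used": 0.42, "memory_used": 0.97, "cpu_load_per_core": 0.8}
print(failing_checks(sample))
```

A VM in this state would still look available to a power-state check, which is the gap the paragraph above describes.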

Data Center

Let’s get real for a moment. All the advances in data center availability, redundancy and reliability do not negate the need for eliminating your data center as a single point of failure (SPOF). As VP of Customer Experience, I have worked with a customer who deployed best-in-class redundancy within their private cloud data center, much like the major public cloud vendors. And if not for the high availability and data replication solution provided by SIOS Technology Corp, this customer would have experienced major downtime when a tropical storm ripped through their area, taking out power, backup generators, cooling, and networking.

However, with SIOS Technology, the customer was able to preemptively fail over ahead of the storm to a data center further inland. Cooling failures, construction mishaps, as well as human and natural disasters are continual reminders that a single data center isn’t the same as higher availability.

Don’t fall into the biggest trap of 2021. Make sure you have true high availability rather than assuming the cloud has you covered.

– Cassius Rhue, VP, Customer Experience

Reproduced from SIOS

Stages of IT Disaster Recovery Grief

March 8, 2021 by Jason Aw

Disaster recovery grief can hit you out of nowhere if you haven’t implemented the right enterprise availability architecture. Meet our friend Dave in IT to walk us through the 5 stages of disaster grief.

Stage 1: Denial

Dave in IT: “Uh oh. What’s that alert? It’s just a little application crash, right? No big deal. I’ll have things up and running in no time.”

In the land of enterprise availability, there is no such thing as a little application crash or no big deal. Companies have SLAs with real money on the line. Your selective reality is probably not the perspective of your customers and stakeholders.

Stage 2: Anger

Dave in IT: “Are you kidding me? Of all the… [censored]… times, today the application won’t start. Ughh. I hate this [censored]… [censored]… application. Wait, what’s this new alert? Seriously, now the datacenter is down!”

It gets messy really, really fast in fast-paced, high-stakes environments. When unchecked alerts and failures happen, problems can mount quickly along with pressure, frustration and anger.

Stage 3: Bargaining

Dave in IT: “Hey Art in Applications, this is Dave in IT. Do you guys have any backups for the App1 environment? . . . Art, are you sure? Could you just check again? I know you’ve checked twice, but can you check one more time? I’ll buy drinks on Taco Tuesday!”

Dave in IT: “Hey Donna DBA, this is Dave in IT. Art in Applications said you might help me out. Did you by chance set up any database replication for that finance database or the inventory management system? . . . Are you sure? Umh, do you remember if we have any way to recover from a, umh . . . datacenter crash?”

When my daughter gets in trouble, bargaining is her first go-to. Okay, second. The first is to disappear, but you’re too smart to just walk away from the flames. Dave in IT isn’t the only one to realize that bargaining and begging are a poor substitute for a well-defined strategy for high availability and disaster recovery. Skip the bargaining and begging about your disaster because “80% of the people don’t care, and 20% are glad it’s you” (paraphrased from Les Brown).

Stage 4: Sadness

Dave in IT: “This is just great. The application server crashed, the datacenter is down, and backups, if I can find them and if I can load them, will take hours to get restored. There is no way I’m getting out of this… where did I put that updated resume?”

Of course you have backups, and you’ve validated them. But there is an RTO and RPO impact of going back to those backups. Are you able to absorb this time? That is, of course, after your data center recovers.
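A quick back-of-the-envelope calculation shows what that impact can look like. This is a sketch with made-up times. The function name and every number below are hypothetical, and it assumes the restore can only begin once the data center is back.

```python
from datetime import datetime, timedelta
from typing import Tuple


def backup_recovery_impact(failure_at: datetime,
                           last_backup_at: datetime,
                           datacenter_recovery: timedelta,
                           restore_duration: timedelta) -> Tuple[timedelta, timedelta]:
    """Return (data loss window, total downtime) when recovering from backup.

    Everything written after the last backup is lost, and the restore is
    assumed to start only after the data center has recovered.
    """
    data_loss_window = failure_at - last_backup_at
    downtime = datacenter_recovery + restore_duration
    return data_loss_window, downtime


# Hypothetical incident: nightly backup at 02:00, failure at 14:00,
# three hours to get the data center back, four hours to restore.
loss, downtime = backup_recovery_impact(
    failure_at=datetime(2021, 3, 8, 14, 0),
    last_backup_at=datetime(2021, 3, 8, 2, 0),
    datacenter_recovery=timedelta(hours=3),
    restore_duration=timedelta(hours=4),
)
print(f"Data lost: {loss}, downtime: {downtime}")
```

In this made-up case Dave loses twelve hours of data and is down for seven, which is the time the business has to be able to absorb.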

Stage 5: Acceptance

Dave in IT: “It’s been two hours. I never knew we had this many Executive stakeholders before. No way I’m making it to my two-year anniversary after this. Well, I guess I’ll clean out my office tomorrow. No way I’m making it through this!”

Failures happen. Datacenters go down. Applications fail. There is no denying the possibility of losing a data center, having a server fail, or an application crash. This type of acceptance is normal, a part of improving your availability. Accepting that you may lose your job or worse because you failed to implement an availability strategy is something the experts at SIOS Technology Corp. want to make sure you avoid.

Don’t be like Dave in IT. Avoid the stages of disaster grief, and the hours of disaster recovery and downtime, by architecting and implementing an enterprise availability architecture that includes the best of hybrid, on-premises, or cloud coupled with the best solution for monitoring, recovery, and system failover automation.

– Cassius Rhue, VP Customer Experience

Reproduced from SIOS

Why Does High Availability Have To Be So Complicated?

March 1, 2021 by Jason Aw

Why Does Love (or High Availability) Have To Be So Complicated?

It’s the Hallmark movie season, I mean Christmas season, I mean Hallmark Christmas movie season… (don’t judge too harshly, I’m a father of six young ladies, a hopeless romantic, and married to an amazing spouse who enjoys a good holiday laugh and happy ending). If you are in the Hallmark movie season, you know that it is highly likely that you’ll hear the phrase, “Why is love so complicated?” It will be spoken just before the heartbroken young person has developed feelings for a new love interest, and is ready to dance the night away in their arms, just as the old flame walks into the party. If you aren’t into the Hallmark holiday romances, maybe it isn’t love that you are wondering about. Perhaps you want to know: “Why does high availability have to be so complex?”

Ten Reasons That High Availability Is So ‘Gosh Darn’ Complicated:

The speed of innovation

Cloud computing, edge computing, hyperconverged infrastructure, multi-cloud, containers, and machine learning are changing the landscape of enterprise availability at a blistering pace. By conservative estimates, AWS currently has over 175 services, and “provides a highly reliable, scalable, low-cost infrastructure platform in the cloud that powers hundreds of thousands of businesses in 190 countries around the world.” Choosing an HA solution that allows consistent management across all of these environments, with infrastructure and application awareness, is an important way to reduce complexity.

Randomness of disasters

Someone once said, “make your solution disaster proof, and the universe will build a better disaster.” Not only are we seeing innovations in the realm of technology, but also in the world of disasters. Resource starvation, cooling system disasters, natural disasters, power grid failures, and a host of new and random disasters often make it harder to insulate the entirety of your enterprise. Last year’s solutions will likely need updates to handle this year’s unprecedented outages. It’s important to work with a vendor that has focused on high availability for many years – who has firsthand experience with finding solutions to the randomness of disasters.

Application complexity

As technology moves ahead in the realm of virtualization and cloud computing, applications are following suit. As these application vendors add new options to take advantage of the cloud, they are also adding additional complexity. Your applications should be protected by solutions designed for higher availability and clustering in AWS, Azure, GCP or other environments. Look for vendors who provide greater application awareness and understanding of best practices, and who deliver availability solutions that take into account how the application was architected and can optimize the application’s orchestration in the cloud.

Advances in threats

The threats to your enterprise also impact your availability. Systems have always had to handle attacks from intruders and hackers, and even self-inflicted damage. These attacks have become more sophisticated, and the solutions and methods to avoid being victimized often impact the layout, architecture, and software that is deployed within your organization. This software has to “play nice” with your availability solution and your applications. As VP of Customer Experience for SIOS Technology, I have seen how an overly aggressive virus scanner can impact your application and your availability solution. Ensure you understand the impact of your security systems on your HA/DR environment and choose an HA solution that works with, not against, your security goals.

Regulatory requirements

Data breaches impact the architecture for your application, hypervisor and environment, but so too do regulatory requirements. Businesses that have become global now have to make sure they are compliant with data handling regulations in multiple countries. This can impact what region your solutions can be deployed in, and how many zones you can use for redundancy. Additionally, regulatory requirements can impact which teams can support your organization, which may affect your choices for availability software and support.

Shrinking windows

In the world of 24/7 searches, shopping, gaming, banking, and research, the windows are shrinking. Queries must run faster and take less time. Responses have to be quicker and have better data. This means that the allowable downtime for your environment is shrinking faster than you previously imagined. It also means that maintenance windows are tighter, packed, and have to be optimized and highly coordinated. Work with an HA vendor that can provide guidance on optimizing your cluster configuration for both application performance and fast recovery time.

Increasing competitive pressure

I grew up in a small town. The hardware store had one competitor. The grocery store had one competitor. The bookstore, antique shop, car dealership, rental office, and bank all had one competitor. Today, you have thousands upon thousands of competitors who want nothing more than to see your customers in their checkout carts. This competition impacts the complexity of your entire business. It weighs heavily on what can and cannot be done in maintenance windows, with upgrades, and at what speed you innovate. Environments that may have been refreshed once every five years have moved to the cloud where optimizations and advancements in processor speed and memory can be had in seconds or minutes. Systems that once had a single run book covering a simple list of applications now look closer to “War and Peace” and cover the growing number of processes, products, services and intelligence being added to increase profits while simultaneously working to reduce risks and downtime.

High availability solution costs

We all wish we had an unlimited budget, but the reality is that what you have available is often somewhere between a little and not enough. Teams are often forced to balance consumption versus fixed cost, license costs for applications on the standby clusters, and associated costs for availability software. Enterprise licenses often add a ‘tough to swallow’ price tag for a standby server in an availability environment. Architecting an availability solution is never free, even if you are a hard core ‘DIY’ team. DIY comes with additional costs in maintenance, management, source control, testing, deployment, version management and version control, patches, and patch management. While your team of experts may be clearly up for the challenge, your business likely would prefer their highly valued talents be applied to creating more revenue opportunities.

Business growth

Growth of your business due to innovation means that your teams are now responsible for more critical applications, more sites, more offices, and more data that needs to be accessible and highly available. As your business grows and thrives, the challenges that come with scaling up and scaling out add to the complexities mentioned previously, and also expand what you have to prepare and plan for.

Team turnover

The complexity of the environments, speed of innovation, growth of your business, advances in the application tier, and growth in the competitive landscape bring with them the challenge of retaining top talent to keep your infrastructure running smoothly. Most companies understand that availability is a merger of people, process, product, and architecture among other things. So finding ways to reduce the complexity of clustering environments with automated configuration, documented run books, and products with consistent HA strategies across the infrastructure is key both to retaining the talent that installs and manages your infrastructure and to mitigating the risks and heavy lifting for those responsible for the key components of availability.

Let’s face it, love takes hard work, good communication, time, investment, skill and determination. There are no shortcuts to a successful relationship. The same can be said about achieving the best outcomes in an ever emerging, increasingly complex, and fluid technology space within your enterprise. Availability, clustering, disaster recovery and uptime are so ‘gosh darn’ hard because they require a serious, dedicated, non-stop, top-to-bottom cultural shift accounting for the speed of innovation, the complexity of applications and orchestration, competition and growth, and the other components of keeping applications, databases, and critical infrastructure available to those who need them, when they need them.

– Cassius Rhue, Vice President, Customer Experience

How to Understand & Respond to Availability Alerts

January 29, 2021 by Jason Aw

Houston We Have a Problem (or How to Understand & Respond to Availability Alerts)

A Successful Failure

Houston, we have a problem! It is an iconic line that reminds countless space buffs and movie fans about the great difficulty, potential disaster, and the perilous state of the Apollo 13 space mission – a mission NASA now calls “A Successful Failure.” Ignoring your own application availability alerts may not go down in history as a defining moment, but it can wreak similar havoc.

Now back to 1970:

“A routine stir of an oxygen tank ignited damaged wire insulation inside it, causing an explosion that vented the contents of both of the Service Module’s (SM) oxygen tanks to space. Without oxygen, needed for breathing and for generating electric power, the SM’s propulsion and life support systems could not operate. The Command Module’s (CM) systems had to be shut down to conserve its remaining resources for reentry, forcing the crew to transfer to the Lunar Module (LM) as a lifeboat. With the lunar landing canceled, mission controllers worked to bring the crew home alive.”

An explosion of oxygen tanks triggered alarms, warnings, pressure and voltage drops, interrupted communications, and then the now famous radio communication between the astronauts and Mission Control. But what if, after the explosion, the crew did nothing? What if they never checked on the explosion, never responded to the warnings and gauges, and never informed Mission Control of there being an issue? What if Mission Control, after being notified or alerted back at their dashboard in the control center, never attempted to provide any assistance? What if the team buried their heads in the sand, or resigned themselves to fate and chance, never tried to learn, improvise, or improve from the failure they encountered? The result would have been tragic! It may have made it to a documentary, but hardly a blockbuster movie featuring an iconic line.

What Do You Do When an Alert is Triggered in Your Environment?

Space walks are a far cry from our own day to day activities, unless of course you work for NASA, but recent blogs on Apollo 13 do spark a question applicable to availability. What do you do when there is an alert triggered in your environment? Do you just ignore it? Do you downplay it, waiting to see if the alerts, log messages, or other indicators will just go away? Do you contact your vendor support to understand how you can disable these alerts, warnings, and messages? Or do you say, “We have a problem here and we need to work it out”?

As VP of Customer Experience at SIOS Technology Corp., I have experienced both sides of alerts and indicators. We have painstakingly walked with customers who chose to ignore warnings, turning off critical alerts that indicated issues, ranging from application thresholds to network instability to potential data inconsistency. And we have also seen customers who have tuned into their alerts, investigated why their alarms were going off, uncovered the root cause and enjoyed the fruit of their labor. This fruit is most often the sweet reward of improved stability, innovation and learning, or an averted disaster.

4 things you can do when your availability product triggers an alert

1. Determine the type and criticality of the availability alert.

Is the alert or error indicative of a warning, an error, or a critical issue? A good place to assist you and your team with understanding criticality is to consult with available documentation. Check the product documentation, online forums, knowledge base articles (KBA), and internal team data and process manuals.

2. Assess the immediacy of the alert.

For warnings and errors, how likely are they to progress into a critical issue or event? For critical issues and alerts, this may be obvious, but an assessment, even of critical events, will provide some guidance on your next steps: self-correction, issue isolation, or immediate escalation.

3. Consult additional sources.

What other sources can you access to make a determination about the alert condition? For example, if the alert is storage related, are there other tools that can expose the health of your storage? If the issue is a network alert, are there hypervisor tools, traffic tools, NIC statistics, or other specialized monitoring tools deployed to help with analysis?

4. Contact support.

In other words, if you are unsure, alert Mission Control. After determining the type, assessing the immediacy, and consulting additional sources, it is a good idea to contact your vendor for support. A warning about a threshold for API calls may seem innocent. But if the API calls will fail once such a limit is reached, this could be cause for immediate action. Getting the authority of the specialist can be helpful in keeping peace of mind and avoiding disaster.
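The four steps above can be sketched as a small triage function. The mapping from inputs to next steps is an assumption made for illustration, not a rule from any vendor, and the `Severity`, `Action`, and `next_step` names are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


class Action(Enum):
    SELF_CORRECT = "self-correction"
    ISOLATE = "issue isolation"
    ESCALATE = "immediate escalation"


def next_step(severity: Severity,
              likely_to_worsen: bool,
              confirmed_by_other_source: bool) -> Action:
    """Pick a next step from the alert type, its immediacy, and other sources.

    The policy below is one possible reading of the steps, not a standard.
    """
    if severity is Severity.CRITICAL:
        return Action.ESCALATE
    if likely_to_worsen:
        # For example, a warning about an API-call threshold where calls
        # start failing once the limit is reached.
        return Action.ESCALATE
    if severity is Severity.ERROR or confirmed_by_other_source:
        return Action.ISOLATE
    return Action.SELF_CORRECT


# An innocent-looking warning that is likely to become an outage.
print(next_step(Severity.WARNING, likely_to_worsen=True,
                confirmed_by_other_source=False).value)
```

Whatever policy a team chooses, writing it down in a form like this makes it harder to quietly ignore an alert.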

An experienced vendor like SIOS can help you quickly identify the causes of problems and recommend the best solution.

Repeatedly ignoring problems in your availability environment can lead to unexpected, but no less devastating, results. Addressing the problems indicated by alerts, log messages, warning indicators, or other installed and configured indicators gives your customers, your business, your teams, and yourself the “opportunity to solve the problems” before they become a disaster, and at the same time strengthens your availability strategy and infrastructure. Which will you choose?

– Cassius Rhue, VP, Customer Experience

Reproduced from SIOS