downtime Archives - SIOS SANless clusters

How to Reduce Downtime in SAP

August 12, 2022 by Jason Aw

How to reduce downtime in SAP is an important topic that should be addressed during initial solution design. Changes can be made to existing SAP landscapes, but they are trickier in production environments, where downtime will be an issue.

There are several typical components in an SAP landscape that can be considered single points of failure: ASCS (Central Services), the HANA DB, NFS nodes and SAP application servers. Ideally these should be protected by using redundant servers in a High Availability configuration.

HA/DR Goals for SAP

Core goals when designing components of High Availability/Disaster recovery for SAP should be:

● Minimize Downtime
● Eliminate Data loss
● Maintain data integrity
● Enable flexible configuration

In today’s modern cloud environments the infrastructure of the underlying hardware is typically well protected from failures by using multiple redundant NICs, redundant storage and hardware availability zones – however, this still doesn’t guarantee that your SAP application will be running and responding to requests.

Using a high availability solution such as the SIOS Protection Suite introduces intelligent High Availability coupled with local disk replication to ensure that your SAP applications and services are continually monitored, protected and have the ability to automatically switch to redundant hardware when a failure is detected.

Now let’s consider a simple example of an SAP configuration that’s not HA protected. It might look something like this (figure 1):

Suppose this environment processes transactions from a web server that sells clothing to customers. SAP is used to process sales, track orders, track inventory and drive automated ordering and similar processes based on these transactions.

Now let’s imagine that this sales processing environment (pictured above) was configured in the cloud without HA because the architect thought that highly redundant hardware in the cloud environment was good enough to protect from failures. If that HANA DB experiences an issue and shuts down, let’s look at the steps that are typically required to get the database back up and running:

● Even if HANA is configured with HANA System Replication, failover to the secondary HANA DB system isn’t automated. Someone who knows HANA has to perform it manually, and only after the failure has been detected and they have been notified of the outage.
● Real-time transactions from the web server will be suspended until the issue is resolved.

If this small clothing retailer transacts about $10 million annually from web-based sales, that equates to roughly $1,140 per hour in sales when averaged over the 8,760 hours in a year. Peak times would cost much more per hour.
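The hourly figure is simple arithmetic, and it can be handy to redo it with your own numbers. The snippet below is a minimal sketch that assumes sales are spread evenly over a non-leap year; the function names and the `peak_multiplier` parameter are illustrative assumptions, not part of any SAP or SIOS tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def average_hourly_revenue(annual_revenue: float) -> float:
    """Average revenue per hour, assuming sales are spread evenly over the year."""
    return annual_revenue / HOURS_PER_YEAR


def estimated_outage_cost(annual_revenue: float,
                          outage_hours: float,
                          peak_multiplier: float = 1.0) -> float:
    """Rough lost-sales estimate for an outage.

    peak_multiplier is a hypothetical knob for outages that land in busy
    periods (for example 3.0 if peak hours sell three times the average).
    """
    return average_hourly_revenue(annual_revenue) * outage_hours * peak_multiplier


hourly = average_hourly_revenue(10_000_000)
print(f"Average revenue per hour: ${hourly:,.0f}")
```

For the $10 million retailer this works out to about $1,142 per hour, close to the rounded figure quoted above. Note that this only counts lost sales; it does not include staff time, recovery effort or reputational damage.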

This report from IBM suggests that the average cost of downtime is $10k per hour.

Figure 2: SAP Landscape with HA/DR

If HA software had been in use (figure 2), HANA DB failover would have been automatic, the interruption to the web server would have stayed within the configured timeouts, and no sales would have been lost. An alert would be generated, and the cause could be investigated and diagnosed in a more leisurely manner than in a system-down situation.
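To make the idea of automatic failover a little more concrete, here is a minimal sketch of the kind of decision logic involved: probe the active node, and switch to the standby after a configured number of consecutive failures. This is an illustration only. The `FailoverMonitor` class, the node names and the threshold are all hypothetical, and this is not how SIOS Protection Suite or HANA System Replication is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Toy failover decision logic (hypothetical, for illustration only)."""
    primary: str
    secondary: str
    failure_threshold: int = 3     # consecutive failed probes tolerated
    consecutive_failures: int = 0
    active: str = ""

    def __post_init__(self) -> None:
        # Traffic starts on the primary node.
        self.active = self.primary

    def record_probe(self, healthy: bool) -> str:
        """Record one health-probe result; return the node that should serve traffic."""
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failure_threshold
                and self.active == self.primary):
            # Switch to the standby and raise an alert for later diagnosis.
            self.active = self.secondary
            self.consecutive_failures = 0
            print(f"ALERT: {self.primary} failed {self.failure_threshold} "
                  f"probes in a row; failing over to {self.secondary}")
        return self.active


monitor = FailoverMonitor(primary="hana-node-a", secondary="hana-node-b")
for probe_ok in [True, False, False, False, True]:
    active_node = monitor.record_probe(probe_ok)
print(f"Serving from: {active_node}")
```

A single failed probe does not trigger a switch here; the threshold plays the role of the "configured timeouts" mentioned above, trading a slightly longer detection time for fewer false failovers.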

Scale up the customer size and it’s very likely that any system down situation would start to cost hundreds of thousands of dollars and consume significant people resources to resolve.

Another IBM report suggests a staggering 44% of respondents had bi-monthly unplanned outages and another 35% had monthly unplanned outages.

Planned outages are another potential problem, with 46% of respondents reporting monthly planned outages and a further 29% reporting yearly planned outages. Having applications and services protected by HA software can also mitigate these planned outages by allowing services to be moved to running systems during maintenance activities.

Learn more about high availability for SAP and S/4HANA.

Disney’s Encanto – Lessons on High Availability, IT Teams & downtime

March 3, 2022 by Jason Aw

Lessons on High Availability, IT Teams, and defeating downtime from Disney’s Encanto

Over the weekend I joined the masses of people who have tuned in to Disney’s Encanto and became a fan of the story, a student of the lessons and opportunities, and an absolute fan of Lin-Manuel Miranda. What does Disney’s Encanto provide in relation to High Availability, Clustering, and Resiliency?

[Warning: movie spoilers ahead]

In Encanto you quickly learn that the Family Madrigal is a special family. In one of the opening songs, “The Family Madrigal”, we learn that all of the members of the family have unique and special gifts: superhuman strength, the ability to hear for miles, prophecy and prediction, the power to conjure beautiful flowers and plants, the ability to shape-shift, the ability to heal, and the ability to control the weather. Well, everyone it seems has a ‘gift’ except Mirabel.

Lesson 1: You don’t need superhuman gifts to make a difference.

Mirabel, while not gifted like the other siblings and members of the family, is the central figure in understanding the health and disease of the family. Moreover, she is able to help the family put things back together when it all falls apart, without the other gifts. You need High Availability, but you don’t have to break the budget, develop supernatural abilities, or depend on a miracle to achieve it.

As the movie continues, Pepa’s youngest son Antonio is readied for his gift ceremony. However, during the party and celebration Mirabel notices cracks in the foundation of Casita. But her warnings go unheeded.

Lesson 2: Don’t ignore the cracks.

When Mirabel sees the cracks it leads her on a quest to find out what is endangering Casita and how she can help. Initially, she is ignored by the others and even rebuked. How will you respond if you see cracks or shortcomings in your IT infrastructure, or cracks in your architecture and design? Will you ignore the cracks, pretend they aren’t seen or even rebuke the team for finding them? Don’t ignore the cracks. Responding to the first sign of an issue is most often the perfect way to prevent a greater issue.

On her quest to find answers and save the miracle’s magic, Dolores tells Mirabel to talk to her super-strong older sister, Luisa, who initially suggests that everything is okay and that there is absolutely nothing wrong. But Luisa eventually begins to reveal that the weight of knowing there is an issue is becoming too much for her to carry alone.

Lesson 3: The weight of HA is too big for a single person or team.

As Luisa put it, “It is pressure that breaks the camel’s back, pressure that’ll never stop.” Developing a High Availability solution, and designing and architecting for resilience and data availability, is not a simple process, and it is definitely not a task for a single person or single team. Your DBA, IT Admin, and ERP Administrators cannot handle the weight of maintaining critical enterprise availability alone. Likewise, a one-dimensional approach cannot carry the weight of four (4) nines of availability. Instead, it takes a fully aligned team working in concert with a complete HA solution to understand, design, develop, and deploy the tools and techniques. How well are the roles and responsibilities on your IT teams distributed and defined? Ensure no one is bearing the responsibility for HA alone.

When Mirabel seeks Bruno for the answers she is looking for, everyone says, “We Don’t Talk About Bruno.” Bruno’s gift is precognition, but because of his warnings and seemingly negative visions, he disappeared.

Lesson 4: Don’t be afraid of the person who sees trouble ahead.

As VP of Customer Experience, I’ve helped customers perform health assessments for their infrastructure and clustering solutions. When the health check completes, not all customers are happy to hear that they have issues to resolve. We all do all we can to avoid the bad news. But, ignoring upgrades, forgetting to do maintenance, and downplaying risks identified by the Bruno of your team will not make the trouble disappear. In fact, it may make your worst fears a reality.

Mirabel eventually finds a secret passage leading to Bruno and discovers that Bruno never left, but felt that he had to destroy his vision of her to protect her and himself.

Lesson 5: Corporate culture can crush or create higher availability

Your culture can either crush or create a space for higher availability and resiliency.

Mirabel asks Bruno if he has been patching the cracks in Casita, but Bruno replies that he is afraid of the cracks.

Lesson 6: Don’t be afraid of the cracks

HA requires continuous, coordinated ongoing effort. An essential part of the effort is finding solutions and fixes for those IT cracks that could jeopardize your application or the gaps between architecture and execution.

Even as Bruno (or Hernando) tries to patch the cracks, it is apparent that the foundational issues are too much for spackle and superficial solutions.

Lesson 7: Spackle won’t fix a foundational problem

Take a look at your infrastructure and look at the ways in which problems are being addressed. Are you deploying workarounds, band-aids, and temporary “hacks”, or are you looking at architectural and foundational solutions that address the root cause of the problem with your clusters, enterprise availability, and execution during disasters?

Lesson 8: Find your Jorge

If you’ve been deploying more hacks and workarounds than root cause solutions, find your Jorge. Find a skilled team member, partner, or solution provider and give them permission to grapple with implementing the foundational solution that will fix the problem or strengthen the infrastructure.

Bruno sees another vision that Casita could be saved if Mirabel hugged Isabela. Mirabel offers Isabela an opportunity to blossom, but Abuela doesn’t see it that way. An argument between Mirabel and Abuela ensues, and Abuela blames Mirabel for the cracks in ‘Casita’. Mirabel blames Abuela for her impossible demands, unrealistic expectations, and misplaced hopes.

Lesson 9: Blame creates more problems

Pass the Blame is a great party game, but it is not great for HA, cluster resilience, or data protection. I once helped a customer whose organization illustrated the unproductiveness of blame. After a proof of concept cluster hit an issue causing a delay, the Project Manager blamed the application team for the delay. The applications team blamed the backup administrator, who in turn blamed the infrastructure admin. Throughout the blaming session, their cluster remained unavailable, the proof-of-concept remained stalled, and the only progress being made was in the cracks of anger growing between teams. It was only when they put these differences aside that they could make the adjustments they needed to resolve their issue and continue with a successful POC.

‘Casita’ collapses and Mirabel runs away. Later, Alma finds Mirabel, and after reconciling they join the family and village in building Casita back better than ever.

Lesson 10: Build it back stronger

Of course, the final scenes of Encanto are filled with lessons in the confession of Alma (Abuela) such as:

Don’t hide what you see or ignore the cracks
Tell the truth to the people who matter (or in your case the team and business)
Build a culture that allows others to be more than their role or gift
Seek (and accept) help earlier
Pain doesn’t need to be a prison
Leaders have power to bring people together or push them away

But the most important of the final lessons is to build back better, stronger, and together. After every unplanned or planned outage, there will be lessons learned from root cause analysis, experiences and fresh understanding. As a result of this, there will also be an opportunity to build back a stronger solution and architecture for your high availability and disaster recovery.

Consider the case of a customer who was able to create a standard deployment pipeline and QA system after discovering an outage was caused by code deployed directly to production. Or another customer who uncovered that disk and database warnings were being suppressed for weeks before the outage. Don’t waste the time and opportunity that comes when you have downtime. Be sure to work together to avoid the silos, dependencies on single strengths, or placing the hope of your infrastructure on the wrong thing.

Of course, you should watch the whole movie for yourself, but there are even more lessons for HA as you walk through the magic and music of the movie and pick up on the lives and lessons from a few of the other characters:

Camilo: Be adaptable and flexible
Luisa: Strength is good, but don’t deny the places where you feel weak and vulnerable
Pepa: You’ve got control of your attitudes and actions
Dolores: Listen to everything, but also take action
Bruno: If you love what you do, you’ll keep looking for solutions
Isabela: It doesn’t have to be perfect to work. HA is an ongoing battle: design, develop, test, deploy, repeat
Kid with the coffee: Too much coffee isn’t good for you
Mirabel: You need someone with hope and vision to succeed
Alma: Don’t blow your second chances and don’t lose sight of what really matters

The movie closes with a great reunion as Mirabel and the Madrigals stand in front of the finished house. When Mirabel touches the doorknob, ‘Casita’ springs back to life, and the home and the magical gifts of the family all return. Try these ten lessons for High Availability from Encanto, enjoy the movie, and remember “There is nothing you can’t do… together” with your team of customers, partners, solution providers, and administrators.

Install Service Packs Into Cluster While Also Minimizing Planned Downtime

January 24, 2018 by Jason Aw

How To Install Service Packs Into A Cluster While Also Minimizing Planned Downtime

I answer this question about installing service packs into a cluster rather often, to the extent that I thought I should probably put a link to the answer on my blog.

http://support.microsoft.com/default.aspx/kb/174799?p=1

This article tells you everything you need to know. By following the instructions in the article, you minimize the amount of planned downtime. At the same time, you give yourself the opportunity to “test” the update on one node before you upgrade both nodes. If the upgrade does not go well on the first node, at least the application is still running on the second node until you can figure out what went wrong.
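The general idea (update the node that is not running the application, verify it, then move the application over and update the other node) can be sketched as below. This is my own illustration of the pattern, not a transcription of the steps in the linked article; the `rolling_upgrade` function, the `apply_update` callback and the node names are all hypothetical.

```python
from typing import Callable


def rolling_upgrade(active: str,
                    passive: str,
                    apply_update: Callable[[str], bool]) -> tuple[str, list[str]]:
    """Update a two-node cluster one node at a time.

    apply_update is a hypothetical callback that installs the update on the
    named node and returns True if the node is healthy afterwards.
    Returns the node serving the application at the end, plus a step log.
    """
    log = [f"apply update on passive node {passive}"]
    if not apply_update(passive):
        # The application never left the active node, so nothing is down.
        log.append(f"update failed on {passive}; application stays on {active}")
        return active, log

    # The only application interruption is this single switchover.
    log.append(f"move application from {active} to {passive}")
    active, passive = passive, active

    log.append(f"apply update on former active node {passive}")
    if not apply_update(passive):
        log.append(f"update failed on {passive}; application stays on {active}")
    return active, log


serving_node, steps = rolling_upgrade("node1", "node2", lambda node: True)
for step in steps:
    print(step)
print(f"Application now running on: {serving_node}")
```

The point of the ordering is that a bad update is discovered on a node that is not serving users, so the application keeps running on the untouched node while you investigate.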

Cluster at application layer vs hypervisor layer?

This is just one of the side benefits that you get when you cluster at the application layer vs. clustering at the hypervisor layer. If this were simply a VM in an availability group, you would have to schedule downtime to complete the application upgrade. You would also have to hope that it all went well, as the only fallback is to restore the VM from backup. As I discussed in earlier articles, there is a benefit to clustering at the hypervisor level. But you have to understand what you are giving up as well.

Reproduced with permission from Clusteringformeremortals.com