SIOS SANless clusters

Achieving High Availability for SAP HANA

February 14, 2025 by Jason Aw Leave a Comment

Countless businesses rely on SAP ERP systems for their mission-critical, high-availability applications. However, with the 2027 deadline for migrating these systems to the new HANA environment looming, it’s vital that these enterprises consider how they will achieve high availability under the new regime – ideally, before they’re faced with unplanned downtime.

It’s vital that businesses start thinking about this change early, as achieving the top-tier “five nines” standard of high availability – 99.999% uptime – under the HANA environment comes with many challenges. Fortunately, these can be overcome with well-designed architecture and the right technical expertise.

This Database Trends & Applications article, written by Ian Allton, Solutions Architect at SIOS Technology Corp. and published in Big Data Quarterly, looks to help enterprises make the right start in their transition to HANA by walking through three steps to achieving best practices for high availability with an SAP HANA database.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, high availability - SAP

Why SIOS HANA Multitarget Automation is a Bigger Deal Than you Think

July 3, 2023 by Jason Aw Leave a Comment

Larry (not his real name) was a SIOS customer who had deployed a replication solution for high availability and disaster recovery (HA/DR) in the past. When he launched a PoC to test a two-node replication solution for Linux using SIOS LifeKeeper and DataKeeper replication, his top priority was protecting data integrity. Larry’s PoC test list included the standard items: database start/stop, migrating the database to the backup node, maintenance activities, and server failover, just to name a few. Larry was adamant that the solution be capable of both fast server switchover (i.e., graceful migration) and fast failover (i.e., sudden, forced migration) of applications, databases, storage, and services from one server to another. But he was even more forceful and passionate that such activities should not cause data loss.

Protect Data Integrity By Avoiding Split Brain

In addition to these standard tests, Larry added specific tests to try to force a “split brain” scenario. Split brain is a condition that occurs when members of a cluster are unable to communicate with each other but are still running and operable, and subsequently take ownership of common resources simultaneously. In effect, you have two bus drivers fighting for the steering wheel. Due to its destructive nature, split brain can cause data loss or data corruption and is best avoided through use of a mechanism that determines which node should remain active (driving the bus) and which node(s) should stop writing to disk.

While split brain scenarios are relatively uncommon in clusters that deploy quorum or quorum-plus-witness capabilities, the difficulty of split brain resolution increases sharply with every node added to the cluster configuration. In a multitarget configuration with three or more nodes, the clustering software not only has to orchestrate a failover to the correct node, it also has to automatically switch replication from the new primary node to the tertiary node to maintain DR protection, all while arbitrating properly between nodes. In other clustering solutions, those complex actions have to be manually scripted, manually updated in the event of a failover, and updated again to restore normal operation, and it only gets harder when a split brain occurs.
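
For a sense of what that orchestration has to keep consistent, the replication topology itself can be inspected with SAP’s own tooling on the current primary node. The following is a minimal sketch only; the SID (SPS, hence the spsadm OS user), instance number (00), and exact output are illustrative assumptions rather than details from Larry’s environment:

    # Switch to the <sid>adm OS user (assumed here to be spsadm) on the current primary
    su - spsadm

    # Show this node's replication role and registered site name
    hdbnsutil -sr_state

    # Show the topology and sync state of every registered secondary
    HDBSettings.sh systemReplicationStatus.py

In a healthy multitarget setup, the status output lists each secondary with its replication mode (sync or async) and whether it is in sync: exactly the state that the clustering software must re-establish automatically after every failover.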

Due to the features and improvements in SIOS LifeKeeper and the SAP HANA Application Recovery Kit (ARK), Larry had difficulty introducing a split brain scenario. When he finally managed to contrive one, however, he benefited greatly from understanding the logic the SIOS products use to protect his data. Larry realized the high level of sophistication designed into the data protection provided by SIOS clustering software. He selected SIOS LifeKeeper.

The SIOS HANA Multitarget Automation Difference

Scenarios like Larry’s are just one of nine reasons SIOS’ HANA multitarget automation is a bigger deal than you think. Here are all nine:

  1. Enhanced Protection
    SIOS’ solution simplifies the protection of a HANA database resource in a multitarget scenario. Wizard-based options quickly detect the current configuration and add the information to the LifeKeeper configuration precisely. Error messages are concise and informative, helping users resolve issues quickly and save time.
  2. Streamlined Administration
    Natalie (not her real name) was responsible for a HANA multinode configuration. When servers failed or required maintenance, Natalie leveraged different scripts and tools to perform the required actions. This, however, was not scalable. After moving to SIOS LifeKeeper, Natalie and her team had a simple UI for all core tasks such as stopping and restarting HANA and HANA system replication. Additionally, if a disaster strikes, the team can use the single, simplified SIOS UI instead of searching for the latest runbook, finding a copy of the right scripts, or calling Natalie at 2 AM.
  3. Simplified Monitoring
    SIOS’ intuitive status reports in the UI gave the team a quick way to determine replication status. Using a single tool, rather than a collection of monitoring boards and homemade scripts, simplifies administration and saves time.
  4. Automated Recovery
    Some HANA HSR solutions are capable of performing a failover of HANA replication between two nodes. However, an administrator often has to re-register the replication after a system failover. In the case of three or more nodes, will the administrator understand how to update the registration on the third or fourth node? Will they remember to use sync and async appropriately? The SIOS solution, capable of handling three or even four nodes for multitarget replication, seamlessly automates the registration of target nodes after a failure (the manual commands this automation replaces are sketched after this list).
  5. Flexibility and Scalability
    The ability to protect a HANA cluster in two-, three-, or four-node combinations means that customers have the flexibility to dial up their level of both availability and disaster recovery. Two-node customers, with quorum, can protect availability against a disaster and handle maintenance activities with near-zero downtime by leveraging the HANA takeover-with-handshake feature. Customers deploying three nodes can dial up additional disaster recovery functionality by deploying the third node with async replication in a different data center or region. For added benefit, three-node customers can deploy a fourth node, with storage quorum, to enable high availability and disaster recovery in the event of an entire data center loss.
  6. Data Protection
    Let’s go back to Larry’s issue. He was running HANA on primary node A with multitarget replication to nodes B and C. What happens when manual efforts end in disaster? Which node was the primary? Were things in sync when node A crashed? How do you avoid bringing up the wrong node? In addition to adding support for three or more nodes in a multitarget HSR configuration, the new HANA ARK includes additional admin tools to help in the event of a disaster or an unfortunate split brain event.

    The HANA_DATA_OUT_OF_SYNC_<tag> flag prevents users from accidentally restoring the database on the wrong system.  The HANA_LAST_OWNER_<tag> flag helps administrators know when an action was taken on the primary system while standby nodes were not in sync.  This flag tells the administrator that this node was the last owner and should be where replication is resumed.  HANA_DATA_CONSISTENCY_UNKNOWN_<tag> helps SIOS to automatically resolve and restore replication when all communications between standbys were temporarily lost and then restored.  When used with best practices, quorum deployment, and proper tuning, these tools allow administrators like Larry to avoid split brains and recover safely if and when they occur.

  7. Reporting, Performance and Disaster Recovery
    Of course, the true benefit of multitarget replication is in the extra nodes and the functionality they unlock. Using three nodes in the same data center can unlock the potential for more reporting via the logreplay_readaccess parameter (illustrated in the sketch after this list), while still maintaining a node at a DR site. In addition, SIOS’ support for different replication modes gives users the option to have sync nodes and async nodes for better performance across data centers (or regions).
  8. Continuous Testing
    How often does your team test homemade scripts? How often is your runbook reviewed with respect to configuration, administration, and 2 AM scenarios? The HANA multitarget solution was not only thoroughly tested by SIOS engineers, QA, and Customer Experience experts during development; it also continues to be tested and validated for HANA failover and recovery processes with each release and update.
  9. Extensive Documentation
    Some time ago, our team worked with a customer on cluster administration. While his predecessor was very knowledgeable about their environment, staff promotions and reorganization had left many IT folks responsible for systems they knew little about. When asked about runbooks and documentation of their configuration, the customer was unable to find details from the previous team or previous administrators. In addition to rock-solid automation, administration, monitoring, recovery, and data protection, the SIOS multitarget solution includes detailed, easy-to-use documentation about the implementation, operation, and management of a HANA multitarget system controlled by LifeKeeper.
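
To make the value of items 4 and 7 concrete, here is roughly what re-pointing a target at a new primary looks like when done by hand with hdbnsutil. This is a sketch under stated assumptions only: the host names (nodeB as the new primary, nodeC as the DR target), SID SPS, instance number 00, and site name SITEC are hypothetical and not taken from Larry’s or Natalie’s environments:

    # On the DR node (nodeC), as the spsadm OS user, after nodeB has taken over:
    # re-register this node as an async secondary of the new primary.
    hdbnsutil -sr_register \
        --remoteHost=nodeB \
        --remoteInstance=00 \
        --replicationMode=async \
        --operationMode=logreplay \
        --name=SITEC

    # For a same-data-center reporting secondary (item 7), the operation mode
    # logreplay_readaccess allows read access on the secondary instead:
    #   --operationMode=logreplay_readaccess

Done manually, this re-registration has to be repeated, in the right direction and with the right mode, every time the primary moves; the SIOS HANA ARK performs it automatically after a failover.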

Leveraging SIOS’ total solution means that customers can benefit from consistent, timely monitoring and detection; fast, reliable, and efficient recovery; and a fully automated solution that guarantees high availability and disaster recovery protection. Contact us for more information on SAP HANA multitarget automation.

-By Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, high availability - SAP, SAP S/4HANA

SAP on Azure High Availability Best Practices

November 23, 2022 by Jason Aw Leave a Comment

SAP on Azure High Availability Best Practices

In the following video, Bala Anbalagan, senior SAP architect for Microsoft with 20 years of experience in SAP, explains the best practices for configuring high availability to protect SAP solutions in Azure. He also reviews the mistakes often made when implementing HA solutions in the cloud and key factors that users should know about when configuring SIOS LifeKeeper.

Configuring SAP High Availability Solutions in the Cloud

Bala explains that every SAP user should remember that a high availability solution is indispensable, especially in the cloud. Any cloud provider will need to make changes in their environments. Even though they have high service levels for their hardware infrastructure, there will be brief periods of downtime that can bring your SAP systems down completely.

It is also critical that users configure SAP HA properly. The main purpose of installing HA solutions is to protect against downtime, but if you don’t do it properly, you are just wasting time and money, regardless of the cloud you’re running in. It is essential to follow the configuration rules of your cloud provider. If you misconfigure your HA or fail to test failover and failback, it can result in a business disruption when you are least expecting it – particularly during a period of high utilization.

SIOS LifeKeeper can detect errors during the configuration process. For example, it sends warnings if you only configure a single communication channel, as you always want a redundant communication channel, or a secondary network connection, between the nodes in the HA cluster. If you use SIOS DataKeeper, it will also show warnings if something is wrong with the configuration during the replication process.
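
As a rough illustration, the redundancy that this warning refers to can also be checked from the command line. The path below assumes a default LifeKeeper installation, and the exact output format may differ by version, so treat this as a sketch rather than authoritative syntax:

    # On either cluster node, as root (default install location assumed)
    /opt/LifeKeeper/bin/lcdstatus

    # The status output includes the communication paths defined between the
    # cluster nodes and their state; a properly configured cluster should show
    # at least two ALIVE paths between every pair of nodes, so that a single
    # NIC or network failure cannot isolate the nodes from each other.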

What makes configuring SIOS straightforward?

SIOS has a straightforward configuration process. You install LifeKeeper on each of your cluster nodes and use the application-specific recovery kit (ARK) modules that come with LifeKeeper, choosing the module that matches the application you want to protect. The process is easy to follow through the GUI: intelligence is built in, and you rarely need to change details by hand because most of the information is detected automatically, further simplifying the setup process.

Knowing which ARK to use and how to use it is important in the configuration process. The ARK is a software module that provides application-specific intelligence to the LifeKeeper software. SIOS provides separate ARKs for different applications. For example, for SAP HANA, you install the SIOS SAP HANA ARK to enable LifeKeeper to automate configuration steps, detect failures, and manage a reliable failover for SAP HANA while maintaining SAP’s best practices.

Biggest Mistakes in Implementing HA for SAP in Azure

Users commonly implement HA for SAP solutions in Azure with the same process they use in an on-premises environment. They need to change their mindset. Always follow the recommendations provided by the cloud provider: read the documentation and keep the parameters as the cloud provider recommends.

Another common mistake is adding too much complexity. Some customers put everything into a single cluster, but clusters should be kept separate for different servers. Making a cluster too large adds unnecessary complexity and potential risk.

Thorough testing in every aspect is critical when it comes to HA clustering. Testing HA configurations before going live, and then periodically (and frequently) afterward, is the best thing you can do to prevent unexpected downtime.

Learn more about SAP high availability best practices in the video below, or contact us for more information about implementing high availability and disaster recovery for your essential applications in the cloud.


Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Azure, Cloud, high availability - SAP

Understanding and Avoiding Split Brain Scenarios

September 23, 2021 by Jason Aw Leave a Comment

Split brain. Most readers of our blogs will have heard the term, in the computing context that is, yet we cannot help but sympathize with those whose first mental image is of the chaos that would result if someone had two brains, both equally in control at the same time.

What is a Failover Cluster Split Brain Scenario?

In a failover cluster split brain scenario, the nodes cannot communicate with each other, and the standby server may promote itself to become the active server because it believes the active node has failed. This results in both nodes becoming ‘active’, as each sees the other as failed. As a result, data integrity and consistency are compromised, because data on both nodes is changing independently. This is referred to as split brain.

There are two types of split-brain scenarios which may occur for an SAP HANA resource hierarchy if appropriate steps are not taken to avoid them.

  • HANA Resource Split Brain: The HANA resource is Active (ISP) on multiple cluster nodes. This situation is typically caused by a temporary network outage affecting the communication paths between cluster nodes.
  • SAP HANA System Replication Split Brain: The HANA resource is Active (ISP) on the primary node and Standby (OSU) on the backup node, but the database is running and registered as the primary replication site on both nodes. This situation is typically caused by a failure to stop the database on the previous primary node during failover, by having Autostart enabled for the database, or by a database administrator manually running “hdbnsutil -sr_takeover” on the secondary replication site outside of the clustering software environment. (A quick command-line check for this condition is sketched after this list.)
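
A quick way to check for the second condition from the command line is to ask each node for its own view of the replication state. This sketch borrows the SID (SPS, hence the spsadm OS user) from the log examples below; substitute the values for your own system:

    # On each cluster node, as the spsadm OS user:
    hdbnsutil -sr_state

    # "mode: primary" reported on both nodes indicates the system replication
    # split brain described above; in a healthy pair, exactly one node reports
    # "mode: primary" and the other reports its replication mode
    # (sync, syncmem, or async).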

Avoiding Split Brain Issues

Recommendations for avoiding or resolving each type of split-brain scenario in the SIOS Protection Suite clustering environment are given below.

HANA Resource Split Brain Resolution

While in the HANA resource split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136363:WARNING: 
A temporary communication failure has occurred between servers 
hana2-1 and hana2-2. 
Manual intervention is required in order to minimize the risk of 
data loss. 
To resolve this situation, please take one of the following resource 
hierarchies out of service: HANA-SPS_HDB00 on hana2-1 
or HANA-SPS_HDB00 on hana2-2. 
The server that the resource hierarchy is taken out of service on 
will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

  1. Investigate the database on each cluster node to determine which instance contains the most up-to-date or relevant data. This determination must be made by a qualified database administrator who is familiar with the data.
  2. The HANA resource on the node containing the data that needs to be retained will remain Active (ISP) in LifeKeeper, and the HANA resource hierarchy on the node that will be re-registered as the secondary replication site will be taken entirely out of service in LifeKeeper. Right-click on each leaf resource in the HANA resource hierarchy on the node where the hierarchy should be taken out of service and click Out of Service …
  3. Once the SAP HANA resource hierarchy has been successfully taken out of service, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

SAP HANA System Replication Split Brain Resolution

While in this split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136364:WARNING: 
SAP HANA database HDB00 is running and registered as 
primary master on both hana2-1 and hana2-2. 
Manual intervention is required in order to 
minimize the risk of data loss. To resolve this situation, 
please stop database instance 
HDB00 on hana2-2 by running the command ‘su - spsadm -c 
“sapcontrol -nr 00 -function Stop”’ 
on that server. Once stopped, 
it will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

  1. Investigate the database on each cluster node to determine whether important data exists on the Standby node which does not exist on the Active node. If important data has been committed to the database on the Standby node while in the split-brain state, the data will need to be manually copied to the Active node. This determination must be made by a qualified database administrator who is familiar with the data.
  2. Once any missing data has been copied from the database on the Standby node to the Active node, stop the database on the Standby node by running the command given in the LifeKeeper warning message:

    su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

    where <sid> is the lower-case SAP System ID for the HANA installation and <Inst#> is the instance number for the HDB instance (e.g., the instance number for instance HDB00 is 00). A quick way to confirm that the instance has stopped is sketched after these steps.

  3. Once the database has been successfully stopped, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.
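
As the check referenced in step 2, the standard sapcontrol process listing can confirm that the HDB instance on the Standby node really is down before waiting for LifeKeeper to re-register it. The instance number 00 matches the example message above; substitute your own:

    # On the Standby node, as the <sid>adm OS user:
    sapcontrol -nr 00 -function GetProcessList

    # All HDB processes (hdbdaemon, hdbnameserver, hdbindexserver, ...) should
    # be reported as GRAY / "Stopped"; if any still show GREEN / "Running",
    # repeat the Stop command before expecting LifeKeeper to re-register this
    # node as the secondary replication site.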

Being aware of common split-brain scenarios and taking these steps to mitigate them can save you time and protect data integrity.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: high availability - SAP, Linux, SAP S/4HANA, split brain

Fifty Ways to Improve Your High Availability

April 5, 2021 by Jason Aw Leave a Comment

I love the start of another year. Well, most of it. I love the optimism, the mystery, the potential, and the hope that seems to usher its way into life as the calendar flips to another year. But there are some downsides with the turn of the calendar. Every year the start of the New Year brings “____ ways to do ____” lists. My inbox is always filled with “Twenty ways to lose weight,” “Ten ways to build your portfolio,” “Three tips for managing stress,” and “Nineteen ways to use your new iPhone.” The onslaught of lists for self-improvement, culture change, stress management, and weight loss abounds for nearly every area of life and work, including “Thirteen ways to improve your home office.” But what about high availability? You only have so much time every week, so how do you make your HA solution more efficient and robust than ever? Where is your list? Here it is: fifty ways to make your high availability architecture and solution better:

  1. Get more information from the cluster faster
  2. Set up alerts for key monitoring metrics
  3. Add analytics.  Multiply your knowledge
  4. Establish a succinct architecture from an authoritative perspective
  5. Connect more resources. Link up with similar partners and other HA professionals
  6. Hire a consultant who specializes in high availability
  7. 100x existing coverage. Expand what you protect
  8. Centralize your log and management platforms
  9. Remove busywork
  10. Remove hacks and workarounds
  11. Create solid repeatable solution architectures
  12. Utilize your platforms: Public, private, hybrid or multi-cloud
  13. Discover your gaps
  14. Search for Single Points of Failure (SPOFs)
  15. Refuse to implement incomplete solutions
  16. Crowdsource ideas and enhancements
  17. Go commercial and purpose built
  18. Establish a clear strategy for each life cycle phase
  19. Clarify decision making process
  20. Document your processes
  21. Document your operational playbook
  22. Document your architecture
  23. Plan staffing rotation
  24. Plan maintenance
  25. Perform regular maintenance (patches, updates, security fixes)
  26. Define and refine on-boarding strategies
  27. Clarify responsibility
  28. Improve your lines of communication
  29. Over communicate with stakeholders
  30. Implement crisis resolution before a crisis
  31. Upgrade your infrastructure
  32. Upsize your VM: CPU, memory, and IOPS
  33. Add redundancy at the zone or region level
  34. Add data replication and disaster recovery
  35. Go OS and Cloud agnostic
  36. Get training for the team (cloud, OS, HA solution, etc)
  37. Keep training the team
  38. Explore chaos testing
  39. Imitate the best in class architectures
  40. Be creative.  Innovation expands what you can protect and automate.
  41. Increase your automation
  42. Tune your systems
  43. Listen more
  44. Implement strict change management
  45. Deploy QA clusters.  Test everything before updating/upgrading production
  46. Conduct root cause analysis exercises on any failures
  47. Address RCA and Closed Loop Corrective Action reports
  48. Learn your lesson the first time.  Reuse key learnings.
  49. Declutter.  Don’t run unnecessary services or applications on production clusters
  50. Be persistent.  Keep working at it.

So, what are the ideas and ways that you have learned to increase and improve your enterprise availability? Let us know!

-Cassius Rhue, VP, Customer Experience

Reproduced from SIOS

Filed Under: Clustering Simplified Tagged With: Application availability, High Availability, high availability - SAP, SQL Server High Availability
