November 23, 2022
SAP on Azure High Availability Best Practices
In the following video, Bala Anbalagan, senior SAP architect for Microsoft with 20 years of experience in SAP, explains the best practices for configuring high availability to protect SAP solutions in Azure. He also reviews the mistakes often made when implementing HA solutions in the cloud and key factors that users should know about when configuring SIOS LifeKeeper.
Configuring SAP High Availability Solutions in the Cloud
Bala explains that every SAP user should remember that a high availability solution is indispensable, especially in the cloud. Any cloud provider will need to make changes in their environments. Even though they have high service levels for their hardware infrastructure, there will be brief periods of downtime that can bring your SAP systems down completely.
It is also critical that users configure SAP HA properly. The main purpose of installing HA solutions is to protect against downtime, but if you don’t do it properly, you are just wasting time and money, regardless of the cloud you’re running in. It is essential to follow the configuration rules of your cloud provider. If you misconfigure your HA or fail to test failover and failback, it can result in a business disruption when you are least expecting it, particularly during a period of high utilization.
What makes configuring SIOS straightforward?
SIOS has a pretty straightforward configuration process. Basically, you just need LifeKeeper installed on each of your cluster nodes, and you use different SIOS application-specific recovery kit (ARK) modules (which come with LifeKeeper) depending on the application you want to recover. The process is also very easy to follow with a straightforward GUI: intelligence is built in, and you don’t need to change the details in the GUI. It automatically detects most of the information, further simplifying the setup process.
Knowing which ARK to use and how to use it is important in the configuration process. The ARK is a software module that provides application-specific intelligence to the LifeKeeper software. SIOS provides separate ARKs for different applications. For example, for SAP HANA, you install the SIOS SAP HANA ARK to enable LifeKeeper to automate configuration steps, detect failures, and manage a reliable failover for SAP HANA while maintaining SAP’s best practices.
Biggest Mistakes in Implementing HA for SAP in Azure
Users commonly implement HA for SAP solutions in Azure with the same process they use in an on-premises environment. They need to change their mindset. Always make sure to follow the recommendations provided by the cloud provider: read the documentation and keep the parameters as the cloud provider recommends.
Another common mistake is adding too much complexity. Some customers put everything into a single cluster, but clusters should be separated for different servers. Making a cluster too large adds unnecessary complexity and potential risk.
Thorough testing in every aspect is critical when it comes to HA clustering. Testing HA configurations before going live, as well as periodically (and frequently) afterward, is the best thing you can do to prevent unexpected downtime.
Learn more about SAP high availability best practices in the video below or contact us for more information about implementing high availability and disaster recovery for your essential applications in the cloud.
Reproduced with permission from SIOS
November 20, 2022
The simple days of HA & DR are gone
Flipping through the TV channels, I stumbled on the scene in the movie “He’s Just Not That Into You” with Drew Barrymore, saying what most of us in 2022 are feeling about technology, and especially about high availability and disaster recovery:
“I miss the days when you had one phone number and one answering machine and that one answering machine had one cassette tape and that one cassette tape either had a message from a guy or it didn’t. And now you just have to go around checking all these different portals just to get rejected by seven different technologies. It’s exhausting.”
Sometimes, don’t you wish there was only one cloud, or maybe even no cloud platform; one DB running on one OS; and only a front end application to worry about? But the world has changed, and it is moving faster and becoming more complicated. Advances in technology, the fallout of mergers and acquisitions, and the increasing appetites and pace of our 24/7 society, with billions of consumers looking for the latest deal and the best experience, mean that the simple days are gone.
4 hard truths about your availability
Of course your enterprise environment isn’t simple. You have legacy systems and applications, the kind that have been around almost since punch cards. You have new systems, made for the new generation of applications and databases. In addition you have solutions that were created a decade ago to bridge the gap or span the time between migrating from one platform to another, but despite your best efforts, these systems linger. Added to these challenges is a growing set of systems and IT resources from the merger and acquisition of Company U. Delivering HA is not as simple as you think in the new era.
As VP of Customer Experience, I’ve seen the damage caused by bad architecture. While deploying HA software can definitely help improve an application and database’s availability, HA software will never fully overcome incomplete requirements, poor networking, lack of redundant hardware, or other missing architectural components. Our team once worked with a customer to correct an undersized environment that left their system unstable during peak operating times. Because of their bad architecture, which included networking and hardware instability, their teams frequently found themselves scrambling to recover from avoidable downtime issues. In order to have a complete and sound, highly available and resilient solution, you will need to deploy great software as a part of a sound architecture.
Developing an enterprise-grade, highly available, resilient solution built on a solid architecture with the ability to grow is not a simple process. Designing and architecting for resilience and for application and data availability is not as easy as grabbing a box of cake mix off the shelf. Throw in an array of tools, processes from different teams, a mixture of SLAs, and the varieties of OS, applications, databases, and platforms and you have a recipe for needing help. Recently, I interviewed a 20-year veteran working in an enterprise support environment. He described how many of his peers, and even at times himself, have not been able to handle the weight of maintaining critical enterprise availability. Your admins need help not only when they have been up since 2am dealing with a catastrophic, multi-system, multi-application, nearly complete data center collapse, but also in the day-to-day hard work of enterprise availability in one of the most technologically complex eras ever.
“While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware.” There are many other reasons for application downtime that aren’t covered by cloud provider SLAs including:
As VP of Customer Experience, I’ve seen a thing or two, including a denial of service attack caused by a failed exit in a recursion routine, system exhaustion, security software quarantine of healthy, critical applications, kernel panics, and virtual machines that randomly reboot. If your HA strategy is relying solely on the SLAs of your hypervisor, your solution may not be as highly available as you think. You need to protect critical applications with clustering software that can monitor and detect issues, respond to problems reliably, and if necessary move operations to a standby server to ensure that your products and services remain reliable and available when and where they are needed.
Our single data center has become a series of cloud platforms, spanning dozens of data centers. Our skunk work application has become a part of the bevy of critical front end, middleware and backend solutions that we must manage across Windows, Linux, and a few different *Nix varieties. The march of technology means that our high availability has become more complex and requires better architecture. It also means that our teams need more help to manage it all, and if we aren’t careful it could mean that we remain vulnerable and exposed. Which of the four truths is your team facing most?
Cassius Rhue, VP Customer Experience
Reproduced with permission from SIOS
November 15, 2022
What Does the New Driver in SIOS LifeKeeper for Windows Do For You?
Making data protection in shared and SAN-less environments stronger for years to come.
What do Coca-Cola, KitKat, SalesForce, and SIOS LifeKeeper for Windows have in common? Here are a few hints:
These companies made significant improvements to their iconic products, services and solutions to better serve their customers, adapt and prepare for the future, and capitalize on their strengths. In a similar fashion, SIOS has made dramatic improvements to our SIOS LifeKeeper for Windows product.
Prior to LifeKeeper for Windows version 8.9.0, shared storage functionality, including I/O fencing and drive identification and management, was handled by the NCR_LKF driver. Starting with the SIOS LifeKeeper for Windows release version 8.9.0, SIOS Technology Corp. redesigned the shared storage driver architecture. Beginning with the current release, the NCR_LKF driver has been removed and replaced by the SIOS ExtMirr driver, the engine behind the SANless storage replication of SIOS DataKeeper / SIOS DataKeeper Cluster Edition.
Five significant benefits of the NCR_LKF architectural change in SIOS LifeKeeper for Windows:
The ExtMirr driver provides a more modern filter driver to manage the shared storage functionality. While the NCR_LKF driver focused on “keeping the lights on” and the “data safe”, the architecture of the driver lagged behind more modern drivers. The ExtMirr driver maintains that data protection, while being more compatible, more modern, and more easily supported in newer versions of the Windows OS.
The driver used in both SIOS DataKeeper and SIOS DataKeeper Cluster Edition includes a robust fencing architecture. While the NCR_LKF driver was capable of I/O fencing, the new driver is more robust and has been tested in SAN and SANless environments. The enhanced I/O fencing leverages volume lock and node ownership information within the protected volume.
Leveraging the I/O fencing of the ExtMirr driver used in the DataKeeper products means that the LifeKeeper for Windows solution becomes more tightly integrated with the DataKeeper product line. The ExtMirr driver also includes the latest Microsoft driver signing and works seamlessly with operating systems that enforce driver signing and Secure Boot.
The ExtMirr driver gives customers and administrators a large set of command-line utilities for obtaining and administering the status of the volume. The emcmd commands are native to both of the SIOS DataKeeper products. They can now be used for easier administration with the SIOS LifeKeeper shared volume configurations. Customers and partners who leverage both shared storage and replicated configurations with the LifeKeeper for Windows products now have a single command line set of tools to know and use. The emcmd tools replace the previous volume.exe, volsvc, and similar NCR_LKF filter driver tools for administration (lock, unlock, etc).
With the addition of the ExtMirr driver into SIOS LifeKeeper for Windows, the shared storage configurations, as well as replication configurations, will now see a boost in updates, new features, and fixes. While the NCR_LKF driver provided a solid foundation and stable base for I/O fencing, switching to the ExtMirr driver means that customers will see the same strength and stability, with faster updates for new product support.
Aligning the two products to a single driver may not be as flashy as the SalesForce Classic to Lightning update, but it adds significant functionality, increases the strength and longevity of both the SIOS DataKeeper and SIOS LifeKeeper solutions, and will make data protection in shared and SAN-less environments stronger for years to come.
Cassius Rhue, VP Customer Experience
Reproduced with permission from SIOS
November 11, 2022
How to recreate the file system and mirror resources to ensure the size information is correct
When working with high availability (HA) clustering, it’s essential to ensure that the configurations of all nodes in the cluster are parallel with one another. These ‘mirrored’ configurations help to minimize the failure points on the cluster, providing a higher standard of HA protection. For example, we have seen situations in which the mirror size was updated on the source node but the same information was not updated on the target node. The mirror size mismatch prevented LifeKeeper from starting on the target node in a failover. Below are the recommended steps for recreating the mirror resource on the target node with the same size information as the source:
Then, select the File System resource (/mnt/sps) for the Child Resource Tag.
This will result in two hierarchies, one with the IP resource (VIP) and one with the file system resource (/mnt/sps) and the mirror resource (datarep-sps).
Example: mount /dev/sdb1 /mnt/sps
When the resource “extend” is done, select “Finish” and then “Done”.
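The mismatch described above (a mirror size updated on the source node but not on the target) is the kind of drift a simple consistency check can flag before a failover exposes it. The following is only a minimal sketch in Python: the function name, the node names, and the sizes are hypothetical, and the sketch assumes the recorded sizes have already been collected from each node by some other means. It does not call any LifeKeeper API.

```python
def find_size_mismatches(recorded_sizes):
    """Return the mirrors whose recorded size differs between nodes.

    recorded_sizes maps a mirror tag to {node_name: size_in_bytes}.
    How the sizes are collected from each node is outside this sketch.
    """
    mismatches = {}
    for tag, size_by_node in recorded_sizes.items():
        # More than one distinct value means the nodes disagree.
        if len(set(size_by_node.values())) > 1:
            mismatches[tag] = dict(size_by_node)
    return mismatches


# Hypothetical data: the source was resized, the target was not.
sizes = {
    "datarep-sps": {"source-node": 21474836480, "target-node": 10737418240},
    "datarep-other": {"source-node": 5368709120, "target-node": 5368709120},
}

for tag, size_by_node in find_size_mismatches(sizes).items():
    print(f"{tag}: size mismatch {size_by_node}")
```

Running a check like this after any resize, and again as part of periodic failover testing, would surface the problem while both nodes are still healthy.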
Reproduced with permission from SIOS
November 9, 2022
Explaining the Subtle but Critical Difference Between Switchover, Failover, and Recovery
High availability is a speciality and, like most specialities, it has its own vocabulary and terminology. Our customers are typically very knowledgeable about IT, but if they haven’t been working in an HA environment, some of our common HA terminology can cause a fair amount of confusion, for them and for us. The terms are simple-sounding but have very specific meanings in the context of HA. Three of these terms are discussed here: switchover, failover, and recovery.
What is a Switchover?
A switchover is a user-initiated action via the high availability (HA) clustering solution user interface or CLI. In a switchover, the user manually initiates the action to change the source or primary server for the protected application. In a typical switchover scenario, all running applications and dependencies are stopped in an orderly fashion, beginning with the parent application and concluding when all of the child/dependencies are stopped. Once the applications and their dependencies are stopped, they are then restarted in an orderly fashion on the newly designated primary or source server.
For example, suppose you have resources Alpha, Beta, and Gamma. Resource Alpha depends on resources Beta and Gamma. Resource Beta depends on resource Gamma. In a switchover event, resource Alpha is stopped first, followed by Beta, and then finally Gamma. Once all three are stopped, the switchover continues to bring the resources into an operational state on the intended server. The process starts with resource Gamma, followed by Beta, and then finally the start up operations complete for resource Alpha.
Traditionally, a switchover operation requires more time as resources must be stopped in a graceful and orderly manner. A switchover is often performed to update software versions while maintaining uptime, to perform maintenance work (via rolling upgrades) on the primary production node, or to do DR testing.
Key Takeaway: If there was no failure to cause the action, then it was a switchover.
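The ordering in the Alpha, Beta, Gamma example can be sketched as a topological sort over the dependency graph. This is an illustrative Python sketch only, not how any particular HA product implements it; the function names and the node names "primary" and "standby" are assumptions made for the example. It uses the standard library's `graphlib` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources it depends on (from the example).
DEPENDS_ON = {
    "Alpha": {"Beta", "Gamma"},
    "Beta": {"Gamma"},
    "Gamma": set(),
}


def start_order(depends_on):
    """Dependencies first: a resource starts only after what it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def stop_order(depends_on):
    """Parents first: the reverse of the start order."""
    return list(reversed(start_order(depends_on)))


def switchover_plan(depends_on, source, target):
    """Stop everything gracefully on source, then start it on target."""
    steps = [("stop", name, source) for name in stop_order(depends_on)]
    steps += [("start", name, target) for name in start_order(depends_on)]
    return steps


for action, resource, node in switchover_plan(DEPENDS_ON, "primary", "standby"):
    print(f"{action:5} {resource:5} on {node}")
```

The plan stops Alpha, Beta, then Gamma on the old primary, and starts Gamma, Beta, then Alpha on the new one, matching the order described above.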
What is a Failover?
A failover operation is typically a non-user initiated action in response to a server crash or unexpected/unplanned reboot. Consider the scenario of an HA cluster with two nodes, Node A and Node B. In this scenario, all critical applications Alpha, Beta, and Gamma are started and operational on Node A. In this scenario, a failover is what takes place when Node A experiences an unexpected/unplanned reboot, power-off, halt, or panic. Once the HA software detects that Node A is no longer functioning and operationally available within the cluster (as defined by the solution), it will trigger a failover operation to restore access of the critical applications, resources, services and dependencies on the available cluster node, Node B in this case. In a failover scenario, because Node A has experienced a crash (or other simulated immediate failure) there are no processes to stop on Node A, and consequently once proper detection and fencing actions have been processed, Node B will immediately begin the process of restoring resources. As in the switchover case, the process starts with resource Gamma, followed by Beta, and then finally the start up operations complete for resource Alpha. Traditionally, a failover operation requires less time than a switchover. This is because the processing of a failover does not require any resources to be stopped (or quiesced) on the previous primary (in-service or active) node.
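To make the contrast with a switchover concrete, here is a minimal Python sketch of a failover plan. Detection is modelled as a missed-heartbeat timeout, which is an assumption for illustration: the text only says that failure is detected "as defined by the solution". The timeout value and the function names are likewise hypothetical. The plan contains no stop steps, because the failed node has nothing left to stop.

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources it depends on (from the example).
DEPENDS_ON = {
    "Alpha": {"Beta", "Gamma"},
    "Beta": {"Gamma"},
    "Gamma": set(),
}

HEARTBEAT_TIMEOUT_S = 15.0  # assumed value, not a product default


def node_has_failed(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Treat a node as failed once its heartbeat is older than the timeout."""
    return (now - last_heartbeat) > timeout


def failover_plan(depends_on, failed_node, standby_node):
    """Fence the failed node, then start resources on the standby.

    Resources start in dependency order, and nothing is stopped first.
    """
    steps = [f"fence {failed_node}"]
    for name in TopologicalSorter(depends_on).static_order():
        steps.append(f"start {name} on {standby_node}")
    return steps


# Node A last sent a heartbeat at t=100s; it is now t=120s.
if node_has_failed(last_heartbeat=100.0, now=120.0):
    for step in failover_plan(DEPENDS_ON, "Node A", "Node B"):
        print(step)
```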
Key Takeaway: A failover occurs in response to a system failure.
What is Recovery?
A recovery event is easy to confuse with a failover. A recovery event occurs when a process, server, communication path, disk, or even cluster resource fails and the high availability software operates in response to the identified failure. Most HA software solutions are capable of multiple ways of handling a recovery event. The most prominent methods include:
Due to the number of variations in recovery policy, it is easy to see a recovery event that resembles the behavior of a switchover. This is often the case in methods 1 and 5. In these scenarios, applications and services are gracefully stopped in an orderly fashion before being started on the remote node. With methods 2 and 3, customers will often see a behavior similar to a failover: the primary server is restarted or fenced by the HA software, which creates an observable behavior similar to a failover. Method 4 is rarely used and is a hybrid of a switchover and a failover. It begins with a graceful stop of the applications and services, followed by a restart of the applications and services (much like a switchover). However, if the local restart of the applications and services fails, the system will be restarted (much like a failover), but without actually failing over to the remote cluster node. While rare, Method 4 is typically invoked in cases where an unbalanced cluster is present, or used with a policy-based methodology.
Key Takeaway: A recovery event depends on the method chosen.
HA terminology between vendors is an area where common terms can take on different meanings. As you deploy and maintain your cluster solution with enterprise applications, be sure that you understand the solution provider’s terms for failover, switchover, and recovery. And, while you are at it, make sure you know whether the restaurant will put the sauce on the side (in a saucer), or on the side (your mashed potatoes).
Reproduced with permission from SIOS