SIOS SANless clusters


SAP on Azure High Availability Best Practices

November 23, 2022 by Jason Aw Leave a Comment

In the following video, Bala Anbalagan, senior SAP architect for Microsoft with 20 years of experience in SAP, explains the best practices for configuring high availability to protect SAP solutions in Azure. He also reviews the mistakes often made when implementing HA solutions in the cloud and key factors that users should know about when configuring SIOS LifeKeeper.

Configuring SAP High Availability Solutions in the Cloud

Bala explains that every SAP user should remember that a high availability solution is indispensable, especially in the cloud. Any cloud provider will need to make changes in its environment, and even with high service levels for the hardware infrastructure, there will be brief periods of downtime that can bring your SAP systems down completely.

It is also critical that users configure SAP HA properly. The main purpose of installing HA solutions is to protect against downtime, but if you don’t do it properly, you are just wasting time and money, regardless of the cloud you’re running in. It is essential to follow the configuration rules of your cloud provider. If you misconfigure your HA or fail to test failover and failback, it can result in a business disruption when you are least expecting it – particularly during a period of high utilization.

SIOS LifeKeeper can detect errors during the configuration process. For example, it sends warnings if you configure only a single communication channel, as you always want a redundant communication channel, or a secondary network connection, between the nodes in the HA cluster. If you use SIOS DataKeeper, it will also show warnings if something is wrong with the configuration during the replication process.

What makes configuring SIOS straightforward?

SIOS has a straightforward configuration process. Basically, you just need LifeKeeper installed on each of your cluster nodes, and you use different SIOS application-specific recovery kit (ARK) modules (which come with LifeKeeper) depending on the application you want to protect. The process is also very easy to follow with a straightforward GUI – the intelligence is built in, and you don’t need to change the details in the GUI. It automatically detects most of the information, further simplifying the setup process.

Knowing which ARK to use and how to use it is important in the configuration process. An ARK is a software module that provides application-specific intelligence to the LifeKeeper software. SIOS provides separate ARKs for different applications. For example, for SAP HANA, you install the SIOS SAP HANA ARK to enable LifeKeeper to automate configuration steps, detect failures, and manage a reliable failover for SAP HANA while maintaining SAP’s best practices.

Biggest Mistakes in Implementing HA for SAP in Azure

Users commonly implement HA for SAP solutions in Azure using the same process they follow in an on-premises environment. They need to change their mindset: always follow the recommendations provided by the cloud provider – that is, read the documentation and keep the parameters at the values the cloud provider recommends.

Another common mistake is adding too much complexity. Some customers put everything into a single cluster, but separate clusters should be used for different servers. Making a cluster too large adds unnecessary complexity and potential risk.

Thorough testing in every aspect is critical when it comes to HA clustering. Testing HA configurations before going live, as well as periodically (and frequently) afterward, is the best thing you can do to prevent unexpected downtime.
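Part of that periodic testing can be automated. As a minimal sketch (the virtual IP and port below are hypothetical placeholders, not values from any SIOS tool), a small script can confirm that the clustered service answers on its virtual IP after each failover test:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical values - substitute your cluster's virtual IP and service port:
# is_reachable("10.0.0.100", 1433)
```

Running a check like this from a client-side vantage point, before and after a planned switchover, catches the common case where the cluster nodes look healthy but the service is not actually reachable.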

Learn more about SAP high availability best practices in the video below, or contact us for more information about implementing high availability and disaster recovery for your essential applications in the cloud.


Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Azure, Cloud, high availability - SAP

The simple days of HA & DR are gone

November 20, 2022 by Jason Aw Leave a Comment

Flipping through the TV channels I stumbled on the scene in the movie “He’s Just Not That Into You” with Drew Barrymore, saying what most of us in 2022 are feeling about Technology and especially high availability and disaster recovery:

“I miss the days when you had one phone number and one answering machine and that one answering machine had one cassette tape and that one cassette tape either had a message from a guy or it didn’t. And now you just have to go around checking all these different portals just to get rejected by seven different technologies. It’s exhausting.”

Sometimes, don’t you wish there were only one cloud (or maybe no cloud platform at all), one DB running on one OS, and only a front-end application to worry about? But the world has changed, moving faster and becoming more complicated. Advances in technology, the fallout of mergers and acquisitions, and the increasing appetites and pace of our 24/7 society, with billions of consumers looking for the latest deal and the best experience, mean that the simple days are gone.

4 hard truths about your availability

  1. Your solution isn’t as simple as you think

Of course your enterprise environment isn’t simple. You have legacy systems and applications, the kind that have been around almost since punch cards. You have new systems, made for the new generation of applications and databases. In addition, you have solutions that were created a decade ago to bridge the gap while migrating from one platform to another, but despite your best efforts, these systems linger. Added to these challenges is a growing set of systems and IT resources from the merger and acquisition of Company U. Delivering HA is not as simple as you think in the new era.

  2. Bad architecture is a bigger problem than you realize

As VP of Customer Experience, I’ve seen the damage caused by bad architecture. While deploying HA software can definitely help improve an application and database’s availability, HA software will never fully overcome incomplete requirements, poor networking, lack of redundant hardware, or other missing architectural components. Our team once worked with a customer to correct an undersized environment that left their system unstable during peak operating times. Because of their bad architecture, which included networking and hardware instability, their teams frequently found themselves scrambling to recover from avoidable downtime issues. To have a complete, sound, highly available, and resilient solution, you need to deploy great software as part of a sound architecture.

  3. Your admins need more help than they’ll admit

Developing an enterprise-grade, highly available, resilient HA solution, built on a solid architecture with the ability to grow, is not a simple process. Designing and architecting for resilience and for application and data availability is not as easy as grabbing a box of cake mix off the shelf. Throw in an array of tools, processes from different teams, a mixture of SLAs, and the varieties of OS, applications, databases, and platforms, and you have a recipe for needing help. Recently, I interviewed a 20-year veteran working in an enterprise support environment. He described how many of his peers, and at times even he himself, have not been able to handle the weight of maintaining critical enterprise availability. Your admins not only need help when they have been up since 2am dealing with a catastrophic, multi-system, multi-application, nearly complete data center collapse, but also in the day-to-day hard work of enterprise availability in one of the most technologically complex eras ever.

  4. Your solution may not be as highly available as you think

“While public cloud providers typically guarantee some level of availability in their service level agreements, those SLAs only apply to the cloud hardware.” There are many other reasons for application downtime that aren’t covered by cloud provider SLAs, including:

  • Software issues and bugs
  • Human errors
  • Software failure
  • System or application hangs

As VP of Customer Experience, I’ve seen a thing or two, including a denial-of-service attack caused by a failed exit in a recursion routine, system exhaustion, security software quarantining healthy, critical applications, kernel panics, and virtual machines that randomly reboot. If your HA strategy relies solely on the SLAs of your hypervisor, your solution may not be as highly available as you think. You need to protect critical applications with clustering software that can monitor and detect issues, respond to problems reliably, and, if necessary, move operations to a standby server to ensure that your products and services remain reliable and available when and where they are needed.
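The monitor-detect-respond loop described above can be sketched in a few lines. This is an illustrative model only, not SIOS code: the `check`, `restart_local`, and `failover` callables are placeholders for whatever health checks and recovery actions your clustering software provides. The point is the decision order: verify health, retry locally a bounded number of times, and only then move operations to the standby.

```python
from typing import Callable

def supervise(check: Callable[[], bool],
              restart_local: Callable[[], bool],
              failover: Callable[[], None],
              max_local_retries: int = 3) -> str:
    """Run one recovery decision: healthy, recovered locally, or failed over."""
    if check():
        return "healthy"
    for _ in range(max_local_retries):
        # Attempt a local restart; re-check health before declaring success.
        if restart_local() and check():
            return "recovered-locally"
    failover()  # move operations to the standby server
    return "failed-over"
```

Real clustering software adds fencing, quorum, and dependency ordering on top of this skeleton, which is exactly why relying on hypervisor SLAs alone is not enough.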

Our single data center has become a series of cloud platforms spanning dozens of data centers. Our skunkworks application has become part of a bevy of critical front-end, middleware, and backend solutions that we must manage across Windows, Linux, and a few different *nix varieties. The march of technology means that our high availability has become more complex and requires better architecture. It also means that our teams need more help to manage it all, and if we aren’t careful, it could mean that we remain vulnerable and exposed. Which of the four truths is your team facing most?

Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

What Does the New Driver in SIOS LifeKeeper for Windows Do For You?

November 15, 2022 by Jason Aw Leave a Comment

Making data protection in shared and SAN-less environments stronger for years to come.

What do Coca-Cola, KitKat, Salesforce, and SIOS LifeKeeper for Windows have in common? Here are a few hints:

  • Coca-Cola relaunched a campaign using product redesigns of its iconic brands to adapt to the future, specifically to focus on social themes.
  • Kit-Kat rebranded its candy bar in the UK to commemorate and celebrate the booming social media, YouTube, and general technology wave, and capitalize on the brand strength of the Android (KitKat) OS.
  • Salesforce revamped its base product to create a sleeker, more modern, and faster interface to serve its customer’s needs.

These companies made significant improvements to their iconic products, services and solutions to better serve their customers, adapt and prepare for the future, and capitalize on their strengths.  In a similar fashion, SIOS has made dramatic improvements to our SIOS LifeKeeper for Windows product.

Prior to LifeKeeper for Windows version 8.9.0, shared storage functionality, including I/O fencing and drive identification and management, was handled by the NCR_LKF driver. Starting with version 8.9.0, SIOS Technology Corp. redesigned the shared storage driver architecture: the NCR_LKF driver has been removed and replaced by the SIOS ExtMirr driver, the engine behind the SANless storage replication of SIOS DataKeeper / SIOS DataKeeper Cluster Edition.

Five significant benefits of this driver architecture change in SIOS LifeKeeper for Windows:

  1. A more modern driver

The ExtMirr driver provides a more modern filter driver to manage the shared storage functionality.  While the NCR_LKF driver focused on “keeping the lights on” and the “data safe”, the architecture of the driver lagged behind more modern drivers.  The ExtMirr driver maintains that data protection, while being more compatible, more modern, and more easily supported in newer versions of the Windows OS.

  2. More robust I/O fencing

The driver used in both SIOS DataKeeper and SIOS DataKeeper Cluster Edition includes a robust fencing architecture. While the NCR_LKF driver was capable of I/O fencing, the new driver is more robust and has been tested in SAN and SANless environments. The enhanced I/O fencing leverages volume lock and node ownership information within the protected volume.

  3. Tighter integration and compatibility

Leveraging the I/O fencing for the ExtMirr driver used in the DataKeeper products means that the LifeKeeper for Windows solution increases in integration with the DataKeeper product line.  The ExtMirr driver also includes the latest Microsoft driver signing and works seamlessly with Operating Systems that enforce driver signing and Secure Boot.

  4. Easier administration

The ExtMirr driver gives customers and administrators a large set of command-line utilities for obtaining volume status and administering volumes. The emcmd commands are native to both of the SIOS DataKeeper products, and they can now also be used with SIOS LifeKeeper shared volume configurations for easier administration. Customers and partners who use both shared storage and replicated configurations with the LifeKeeper for Windows products now have a single set of command-line tools to know and use. The emcmd tools replace the previous volume.exe, volsvc, and similar NCR_LKF filter driver tools for administration (lock, unlock, etc.).

  5. More frequent updates and fixes

With the addition of the ExtMirr driver to SIOS LifeKeeper for Windows, shared storage configurations, as well as replication configurations, will now see a boost in updates, new features, and fixes. While the NCR_LKF driver provided a solid foundation and stable base for I/O fencing, switching to the ExtMirr driver means that customers will see the same strength and stability, with faster updates for new product support.

Aligning the two products to a single driver may not be as flashy as the Salesforce Classic to Lightning update, but it adds significant functionality, increases the strength and longevity of both the SIOS DataKeeper and SIOS LifeKeeper solutions, and will make data protection in shared and SAN-less environments stronger for years to come.

Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: SIOS LifeKeeper, Windows HA

How to recreate the file system and mirror resources to ensure the size information is correct

November 11, 2022 by Jason Aw Leave a Comment

When working with high availability (HA) clustering, it’s essential to ensure that the configurations of all nodes in the cluster are consistent with one another. These ‘mirrored’ configurations help minimize the failure points in the cluster, providing a higher standard of HA protection. For example, we have seen situations in which the mirror size was updated on the source node but the same information was not updated on the target node. The mirror size mismatch prevented LifeKeeper from starting on the target node during a failover. Below are the recommended steps for recreating the mirror resource on the target node with the same size information as the source:
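One way to catch this class of mismatch early is to compare the usable size of the mirrored device as seen from each node before you depend on a failover. As a minimal sketch (the device path is an example; on a real cluster you would run this on both nodes and compare the two numbers), the size of a file or block device can be read by seeking to its end:

```python
import os

def device_size_bytes(path: str) -> int:
    """Return the size of a file or block device by seeking to its end."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # SEEK_END returns the offset of the end of the device/file in bytes.
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

# Example (run on each node and compare):
# device_size_bytes("/dev/sdb1")
```

If the numbers differ between source and target, resolve the size difference before recreating the mirror, or the resync and failover will hit the same mismatch described above.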

Steps:

  1. Verify – from the application’s perspective – that the data on the source node is valid and consistent.
  2. Back up the file system on the source (which is the source of the mirror).
  3. Run /opt/LifeKeeper/bin/lkbackup -c to back up the LifeKeeper configuration on both nodes.
  4. Take all resources out of service. In our example the resources are in service on node sc05, and sc05 is the source of the mirror (sc06 is the target system/target of the mirror).
    1. In the right pane of the LifeKeeper GUI, right-click on the DataKeeper resource that is in service.
    2. Click Out of Service from the resource popup menu.
    3. A dialog box will confirm that the selected resource is to be taken out of service. Any resource dependencies associated with the action are noted in the dialog. Click Next.
    4. An information box appears showing the results of the resource being taken out of service. Click Done.
  5. Verify that all resources are out of service and the file systems are unmounted.
    1. Use the command cat /proc/mdstat on the source to verify that no mirror is configured.
    2. Use the mount command on the source to make sure the file system is no longer mounted.
    3. Use /opt/LifeKeeper/bin/lcdstatus -q on the source to make sure the resources are all OSU.
  6. In the LifeKeeper GUI, break the dependency between the IP resource (VIP) and the file system resource (/mnt/sps): right-click on the VIP resource and select Delete Dependency, then select the file system resource (/mnt/sps) for the Child Resource Tag. This will result in two hierarchies: one with the IP resource (VIP), and one with the file system resource (/mnt/sps) and the mirror resource (datarep-sps).
  7. Delete the hierarchy with the file system and mirror resources: right-click on /mnt/sps and select Delete Resource Hierarchy.
  8. On the source, mount the file system with ‘mount <device> <directory>’. Example: mount /dev/sdb1 /mnt/sps
  9. Via the GUI, recreate the mirror and file system resources with the following values:
    1. Recovery Kit: Data Replication
    2. Switchback Type: Intelligent
    3. Server: The source node
    4. Hierarchy Type: Replicate Existing Filesystem
    5. Existing Mount Point: <select your mount point>. It is /mnt/sps in this example.
    6. Data Replication Resource Tag: <Take the default>
    7. File System Resource Tag: <Take the default>
    8. Bitmap File: <Take the default>
    9. Enable Asynchronous Replication: Yes
  10. Once created, extend the mirror and file system hierarchy:
    1. Target Server: The target node
    2. Switchback Type: Intelligent
    3. Template Priority: 1
    4. Target Priority: 10
  11. Once the pre-extend checks complete, select Next, followed by these values:
    1. Target Disk: <Select the target disk for the mirror>. It is /dev/sdb1 in our example.
    2. Data Replication Resource Tag: <Take the default>
    3. Bitmap File: <Take the default>
    4. Replication Path: <Select the replication path in your environment>
    5. Mount Point: <Select the mount point in your environment>. It is /mnt/sps in our example.
    6. Root Tag: <Take the default>

    When the resource extend is done, select Finish and then Done.

  12. In the LifeKeeper GUI, recreate the dependency between the IP resource (VIP) and the file system resource (/mnt/sps): right-click on the VIP resource, select Create Dependency, and select /mnt/sps for the Child Resource Tag.
  13. At this point the mirror should be performing a full resync of the file system. In the right pane of the LifeKeeper GUI, right-click on the VIP resource and select In Service to restore the IP resource (VIP) on the source system where the mirror is in service (sc05 in our example), then verify that the application restarts and the IP is accessible.
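The cat /proc/mdstat verification above can also be scripted. A small illustrative helper (not part of LifeKeeper; the parsing rule is an assumption based on the usual /proc/mdstat layout, where each configured array appears on a line like "md0 : active raid1 ...") reports whether any md mirror is still configured:

```python
def mirrors_configured(mdstat_text: str) -> bool:
    """True if any md device line (e.g. 'md0 : active raid1 ...') is present."""
    for line in mdstat_text.splitlines():
        # Array lines start with the device name (md0, md1, ...) and a colon.
        if line.startswith("md") and ":" in line:
            return True
    return False

# On the source node you might run:
# with open("/proc/mdstat") as f:
#     assert not mirrors_configured(f.read()), "mirror still configured!"
```

A check like this is handy in a pre-flight script that refuses to continue with the resource deletion while a mirror is still active.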

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability, SIOS LifeKeeper

Explaining the Subtle but Critical Difference Between Switchover, Failover, and Recovery

November 9, 2022 by Jason Aw Leave a Comment

High availability is a speciality, and like most specialities, it has its own vocabulary and terminology. Our customers are typically very knowledgeable about IT, but if they haven’t been working in an HA environment, some of our common HA terminology can cause a fair amount of confusion – for them and for us. The terms are simple-sounding but have very specific meanings in the context of HA. Three of these terms are discussed here – switchover, failover, and recovery.

What is a Switchover?

A switchover is a user-initiated action via the high availability (HA) clustering solution user interface or CLI. In a switchover, the user manually initiates the action to change the source or primary server for the protected application. In a typical switchover scenario, all running applications and dependencies are stopped in an orderly fashion, beginning with the parent application and concluding when all of the child/dependencies are stopped. Once the applications and their dependencies are stopped, they are then restarted in an orderly fashion on the newly designated primary or source server.

For example, suppose you have resources Alpha, Beta, and Gamma, where resource Alpha depends on resources Beta and Gamma, and resource Beta depends on resource Gamma. In a switchover event, resource Alpha is stopped first, followed by Beta, and then finally Gamma. Once all three are stopped, the switchover continues to bring the resources into an operational state on the intended server. The process starts with resource Gamma, followed by Beta, and then finally the startup operations complete for resource Alpha.
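The ordering rule in the Alpha/Beta/Gamma example is a dependency sort. This sketch is illustrative, not LifeKeeper code: dependencies are started first (depth-first), and the stop order is simply the reverse.

```python
def start_order(deps: dict[str, list[str]]) -> list[str]:
    """Return a start order in which every resource follows its dependencies."""
    order, seen = [], set()

    def visit(res: str) -> None:
        if res in seen:
            return
        seen.add(res)
        for dep in deps.get(res, []):  # start dependencies (children) first
            visit(dep)
        order.append(res)

    for res in deps:
        visit(res)
    return order

deps = {"Alpha": ["Beta", "Gamma"], "Beta": ["Gamma"], "Gamma": []}
starts = start_order(deps)       # ["Gamma", "Beta", "Alpha"]
stops = list(reversed(starts))   # ["Alpha", "Beta", "Gamma"]
```

The same two orderings drive both a switchover (graceful stop, then start) and the start half of a failover.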

Traditionally, a switchover operation requires more time as resources must be stopped in a graceful and orderly manner. A switchover is often performed when there is a need to update software versions while maintaining uptime, performing maintenance work (via rolling upgrades) on the primary production node, or doing DR testing.

Key Takeaway: If there was no failure to cause the action, then it was a switchover.

What is a Failover?

A failover operation is typically a non-user-initiated action in response to a server crash or an unexpected/unplanned reboot. Consider the scenario of an HA cluster with two nodes, Node A and Node B, where all critical applications Alpha, Beta, and Gamma are started and operational on Node A. A failover is what takes place when Node A experiences an unexpected/unplanned reboot, power-off, halt, or panic. Once the HA software detects that Node A is no longer functioning and operationally available within the cluster (as defined by the solution), it will trigger a failover operation to restore access to the critical applications, resources, services, and dependencies on the available cluster node, Node B in this case. In a failover scenario, because Node A has experienced a crash (or other simulated immediate failure), there are no processes to stop on Node A; consequently, once proper detection and fencing actions have been processed, Node B will immediately begin the process of restoring resources. As in the switchover case, the process starts with resource Gamma, followed by Beta, and then finally the startup operations complete for resource Alpha. Traditionally, a failover operation requires less time than a switchover, because the processing of a failover does not require any resources to be stopped (or quiesced) on the previous primary (in-service or active) node.

Key Takeaway: A failover occurs in response to a system failure.

What is Recovery?

A recovery event is easy to confuse with a failover. A recovery event occurs when a process, server, communication path, disk, or even cluster resource fails and the high availability software operates in response to the identified failure. Most HA software solutions are capable of multiple ways of handling a recovery event. The most prominent methods include:

  1. Graceful restart locally, then a graceful restart on the remote
    1. A restart is always attempted locally; if recovery is successful, no further action occurs. If a local restart fails, the next operation occurs.
    2. If a local restart fails, resources are gracefully moved to the remote node.
  2. Graceful restart locally, then a forced restart on the remote
    1. A restart is always attempted locally; if recovery is successful, no further action occurs. If a local restart fails, the next operation occurs.
    2. Resources are moved to the remote node by fencing the primary node.
  3. Forced restart on the remote
    1. A restart is never attempted locally.
    2. Resources are always forced to the next available cluster node, as described in method 2b.
  4. Forced server restart, no remote failover
    1. A restart is always attempted locally.
    2. If a local restart fails, the primary node is restarted to attempt to recover services.
    3. Resources will not fail over to a remote system.
  5. Policy-based local restart, then remote
    1. Policies may govern the number of local retries before a remote recovery attempt occurs.

Due to the number of variations in recovery policy, it is easy to see a recovery event that resembles the behavior of a switchover. This is often the case in methods 1 and 5: in these scenarios, applications and services are gracefully stopped in an orderly fashion before being started on the remote node. In methods 2 and 3, customers will often see behavior similar to a failover, because the primary server is restarted or fenced by the HA software. Method 4 is an option that is rarely used, but it is a hybrid of both a switchover and a failover. It begins with a graceful stop and restart of the applications and services (much like a switchover); however, if the local restart fails, the system is restarted (much like a failover), but without actually failing over to the remote cluster node. While rare, method 4 is often invoked when an unbalanced cluster is present, or used with a policy-based methodology.
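The five methods differ mainly in two knobs: whether a local restart is attempted, and what happens when it fails. As a rough illustrative model (not vendor code; the parameter names are my own), methods 1-4 can be parameterized like this:

```python
from typing import Callable

def recover(local_restart_ok: Callable[[], bool],
            attempt_local: bool,
            allow_remote: bool,
            reboot_on_local_failure: bool = False) -> str:
    """Model one recovery event; returns where the service ends up."""
    if attempt_local and local_restart_ok():
        return "recovered-locally"      # local restart succeeded
    if allow_remote:
        return "moved-to-remote"        # methods 1-3 on local failure
    if reboot_on_local_failure:
        return "primary-rebooted"       # method 4: restart server, never fail over
    return "failed"

# Method 1/2: recover(check, attempt_local=True, allow_remote=True)
# Method 3:   recover(check, attempt_local=False, allow_remote=True)
# Method 4:   recover(check, attempt_local=True, allow_remote=False,
#                     reboot_on_local_failure=True)
```

Method 5 is the same shape with a policy-driven retry count wrapped around the local attempt, which is why its observable behavior often matches method 1.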

Key Takeaway: The behavior of a recovery event depends on the recovery method chosen.

HA terminology between vendors is an area where common terms can take on different meanings. As you deploy and maintain your cluster solution with enterprise applications, be sure that you understand your solution provider’s definitions of failover, switchover, and recovery. And, while you are at it, make sure you know whether the restaurant will put the sauce on the side (in a saucer) or on the side (of your mashed potatoes).

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, failover clustering, High Availability
