Beginning Well is Great, But Maintaining Uptime Takes Vigilance

September 28, 2021 by Jason Aw

Author Isabella Poretsis states, “Starting something can be easy, it is finishing it that is the highest hurdle.” It is great to have a kickoff meeting. It is invigorating and exciting. Managers and leaders look out at the greenfield with excitement, and optimism is high. But this moment of kickoff, and even the Champagne-popping moment of a successful deployment, are just the beginning. Maintaining uptime requires ongoing vigilance.

High availability and the elusive four nines of uptime for your critical applications and databases aren’t momentary occurrences, but rather, a constant endeavor to end the little foxes that destroy the vineyard. Staying abreast of threats, up-to-date on the updates, and properly trained and prepared is the work from which your team “is never entitled to take a vacation.”
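To make “four nines” concrete, the short sketch below converts an availability target into a yearly downtime budget. It assumes a 365-day year and treats all downtime as counting against the target; the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a target availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare two, three, and four nines.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, four nines leaves a budget of under an hour per year, which is why a single unmonitored failure or skipped patch can consume it.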

For those who want to stay vigilant in maintaining uptime, here are five tips:

1. Monitor the Environment

Very little in enterprise software still follows the “set it and forget it” mindset. Everything, since the day you uncorked the grand opening champagne to now, has been moving toward a state of decline. If you aren’t monitoring the servers, workloads, network traffic, and hardware (virtual or physical), you may lose uptime and stability.

2. Perform Maintenance

One thing that I have always noticed in more than twenty years of software development and services is that all software comes with updates. Apply them. Remember to execute sound maintenance policies, including taking and verifying backups. One tech writer suggested the only update you regret is the one you failed to make.

3. Learn Continuously

My first introduction to high availability came when I unplugged one end of the Token Ring for a server in our lab as an intern, fresh from the CE-211 lab. The administrator was in my face in minutes. After an earful, he gave me an education. Ideally, you and your team want to learn without taking down your network, but you do absolutely want to keep learning. Look into paid courses on existing technology, new releases, emerging infrastructure. Check your vendors for courses and items related to your process, environment, software deployments and company enterprise. Free courses for many things also exist if money is an issue.

4. Multiply the learning

In addition to continuous learning, make a plan to multiply the learning. As the VP of Customer Experience at SIOS, I have seen the tremendous difference between teams who share their learning and those who don’t. Teams that share their learning avoid gaps in knowledge that compromise uptime. The best way to know that you learned something is to teach it to somebody else. As you learn, share the learning with team members to reduce the risk of downtime due to error and, for that matter, vacation.

5. End well . . . before the next beginning

All projects, servers, and software have an ending. End well. Decommission correctly. Begin the next phase, deployment, software relationship, etc. well by closing up loose ends and documenting what went well, what did not, and what to do next. Treat your existing vendors well. You just may need them again later. Understand the existing systems and high availability solutions before proceeding with a new deployment. This proper ending helps you begin again from a better starting place, headed towards a stronger outcome.

Keeping the system highly available is a continuous process. “Set it and forget it” is a nice catchphrase, but the reality is that uptime takes vigilance, continual monitoring, proper maintenance, and constant learning.

-Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

Understanding and Avoiding Split Brain Scenarios

September 23, 2021 by Jason Aw

Split brain. Most readers of our blogs will have heard the term in the computing context, yet we cannot help but sympathize with those whose first mental image is of the chaos that would result if someone had two brains, both equally in control at the same time.

What is a Failover Cluster Split Brain Scenario?

In a failover cluster split brain scenario, neither node can communicate with the other, and the standby server may promote itself to become an active server because it believes the active node has failed. This results in both nodes becoming ‘active’, as each sees the other as having failed. As a result, data integrity and consistency are compromised because data on both nodes is changing. This is referred to as split brain.
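The scenario described above can be reproduced with a toy model. This is a minimal sketch, not how any particular clustering product decides on promotion: the `Node` class, the node names, and the naive "promote when the peer is unreachable" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    active: bool


def react_to_lost_peer(node: Node, peer_reachable: bool) -> None:
    """Naive rule: a standby that cannot reach its peer assumes it failed and promotes itself."""
    if not peer_reachable and not node.active:
        node.active = True


def is_split_brain(nodes: list[Node]) -> bool:
    """Split brain means more than one node believes it is the active server."""
    return sum(node.active for node in nodes) > 1


primary = Node("node-a", active=True)
standby = Node("node-b", active=False)

# Network partition: both nodes are healthy, but neither can reach the other.
for node in (primary, standby):
    react_to_lost_peer(node, peer_reachable=False)

print("split brain:", is_split_brain([primary, standby]))
```

The point of the sketch is that neither node did anything wrong by its own rules; the problem comes from each node being unable to tell a failed peer from an unreachable one.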

There are two types of split-brain scenarios which may occur for an SAP HANA resource hierarchy if appropriate steps are not taken to avoid them.

HANA Resource Split Brain: The HANA resource is Active (ISP) on multiple cluster nodes. This situation is typically caused by a temporary network outage affecting the communication paths between cluster nodes.
SAP HANA System Replication Split Brain: The HANA resource is Active (ISP) on the primary node and Standby (OSU) on the backup node, but the database is running and registered as the primary replication site on both nodes. This situation is typically caused by either a failure to stop the database on the previous primary node during failover, having Autostart enabled for the database, or a database administrator manually running “hdbnsutil -sr_takeover” on the secondary replication site outside of the clustering software environment.
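The two definitions above can be expressed as a small decision function. This is an illustrative sketch only: the `NodeStatus` fields are hypothetical names for the two facts the text relies on (the resource state, ISP or OSU, and whether the database is running and registered as the primary replication site), and it does not query a real cluster.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    resource_state: str           # "ISP" (Active) or "OSU" (Standby), as in the text
    db_running_as_primary: bool   # database running and registered as the primary site


def classify(a: NodeStatus, b: NodeStatus) -> str:
    """Map the observed state of two cluster nodes onto the two split-brain types."""
    if a.resource_state == "ISP" and b.resource_state == "ISP":
        return "HANA Resource Split Brain"
    one_active_one_standby = {a.resource_state, b.resource_state} == {"ISP", "OSU"}
    if one_active_one_standby and a.db_running_as_primary and b.db_running_as_primary:
        return "SAP HANA System Replication Split Brain"
    return "no split brain detected"


print(classify(NodeStatus("ISP", True), NodeStatus("ISP", True)))
print(classify(NodeStatus("ISP", True), NodeStatus("OSU", True)))
print(classify(NodeStatus("ISP", True), NodeStatus("OSU", False)))
```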

Avoiding Split Brain Issues

Recommendations for avoiding or resolving each type of split-brain scenario in the SIOS Protection Suite clustering environment are given below.

While in a split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136363:WARNING: 
A temporary communication failure has occurred between servers 
hana2-1 and hana2-2. 
Manual intervention is required in order to minimize the risk of 
data loss. 
To resolve this situation, please take one of the following resource 
hierarchies out of service: HANA-SPS_HDB00 on hana2-1 
or HANA-SPS_HDB00 on hana2-2. 
The server that the resource hierarchy is taken out of service on 
will become the secondary SAP HANA System Replication site.
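
If these messages are collected by a log pipeline, the colon-separated prefix and the server names can be pulled out with a few lines of Python. This is a sketch based only on the sample message above: the field names (`severity`, `component`, `check`, `resource_tag`, `message_id`) are guesses at what each position means, not documented names.

```python
import re

SAMPLE = """EMERG:hana:quickCheck:HANA-SPS_HDB00:136363:WARNING:
A temporary communication failure has occurred between servers
hana2-1 and hana2-2."""


def parse_split_brain_warning(log: str) -> dict:
    """Split the prefix into fields and extract the two server names, if present."""
    text = " ".join(log.split())  # collapse the line wrapping seen on the console
    header, _, body = text.partition(":WARNING:")
    severity, component, check, resource_tag, message_id = header.split(":")
    match = re.search(r"between servers (\S+) and (\S+?)\.", body)
    return {
        "severity": severity,
        "component": component,
        "check": check,
        "resource_tag": resource_tag,
        "message_id": message_id,
        "servers": match.groups() if match else None,
    }


print(parse_split_brain_warning(SAMPLE))
```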

Recommendations for resolution:

Investigate the database on each cluster node to determine which instance contains the most up-to-date or relevant data. This determination must be made by a qualified database administrator who is familiar with the data.
The HANA resource on the node containing the data that needs to be retained will remain Active (ISP) in LifeKeeper, and the HANA resource hierarchy on the node that will be re-registered as the secondary replication site will be taken entirely out of service in LifeKeeper. Right-click on each leaf resource in the HANA resource hierarchy on the node where the hierarchy should be taken out of service and click Out of Service …
Once the SAP HANA resource hierarchy has been successfully taken out of service, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

SAP HANA System Replication Split Brain Resolution

While in this split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136364:WARNING: 
SAP HANA database HDB00 is running and registered as 
primary master on both hana2-1 and hana2-2. 
Manual intervention is required in order to 
minimize the risk of data loss. To resolve this situation, 
please stop database instance 
HDB00 on hana2-2 by running the command 'su - spsadm -c 
"sapcontrol -nr 00 -function Stop"' 
on that server. Once stopped, 
it will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

Investigate the database on each cluster node to determine whether important data exists on the Standby node which does not exist on the Active node. If important data has been committed to the database on the Standby node while in the split-brain state, the data will need to be manually copied to the Active node. This determination must be made by a qualified database administrator who is familiar with the data.
Once any missing data has been copied from the database on the Standby node to the Active node, stop the database on the Standby node by running the command given in the LifeKeeper warning message:

su - <sid>adm -c "sapcontrol -nr <Inst#> -function Stop"

where <sid> is the lower-case SAP System ID for the HANA installation and <Inst#> is the instance number for the HDB instance (e.g., the instance number for instance HDB00 is 00).
Once the database has been successfully stopped, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.
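
For runbooks that cover several HANA installations, the stop command can be assembled from the System ID and instance number rather than typed by hand. The helper below is a hypothetical convenience, not part of LifeKeeper or SAP tooling; it only builds the command string shown in the warning message and does not execute anything.

```python
import shlex


def build_stop_command(sid: str, instance_number: str) -> str:
    """Build the 'su - <sid>adm -c "sapcontrol ..."' command for one HDB instance."""
    if not sid.isalnum():
        raise ValueError(f"unexpected SAP System ID: {sid!r}")
    if len(instance_number) != 2 or not instance_number.isdigit():
        raise ValueError(f"instance number must be two digits, got {instance_number!r}")
    os_user = f"{sid.lower()}adm"  # the lower-case SID followed by 'adm'
    inner = f"sapcontrol -nr {instance_number} -function Stop"
    # shlex.quote wraps the inner command in single quotes so the shell passes it as one argument.
    return f"su - {os_user} -c {shlex.quote(inner)}"


# Example values taken from the sample warning message (SID "SPS", instance HDB00).
print(build_stop_command("SPS", "00"))
```

Review the generated command and run it only on the Standby node, after any data that must be kept has been copied to the Active node.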

Being aware of common split-brain scenarios and taking these steps to mitigate them can save you time and protect data integrity.

Reproduced with permission from SIOS

High Availability Architecture and Best Practices

September 16, 2021 by Jason Aw

13 Little Known Facts about High Availability

1. Hypervisor HA is not the Same as Application HA

A key misconception is, “I have high availability because I have redundancy in my hardware or hypervisor.” However, hardware and hypervisor redundancy does not guarantee high availability for applications. Nor does it guarantee that orchestration of applications will be properly executed in a failure.

2. In High Availability, Bigger Does Not Equal Better

If you are a powerlifter, bigger weights and fewer reps are better. The same is true if we are talking about hugs. (You remember hugs: the things we used to do when we saw a friend from a different town whom we hadn’t seen in a while.) But bigger doesn’t always mean better. A bigger kidney stone, for example, is definitely not better. In high availability, creating a bigger, more complex solution doesn’t always mean that you’ll have increased your availability. It might mean you have the same availability or less. It may also mean that you have a bigger, more complex system with a lot of moving pieces to sort through in an outage.

3. Everything fails… sometimes

Application programming languages date back to the 1950s. And while the languages, processors, IDEs, and quality of the code have improved, the reality is that “all applications fail at some point.” Failures due to exceptions, bugs, unhandled terminations, accidental terminations, resource exhaustion, and more still happen. Having an active/active or active/passive application availability strategy is still necessary.

4. Focus on ‘why’ as much as ‘how’

Our natural tendency to jump into task completion mode is a necessary asset, but it needs to be tempered and guided by the answer to our questions of why. Adding a solution to an environment without understanding the business, application, database, and stakeholder requirements will lead to one of the following:

Failure
Over expenditures
Underperformance
Confusion and over architecture
All of the above

Instead of focusing solely on getting availability implemented, spend the necessary resources and effort to understand the business needs and the answers to “why.”

5. Unpatched issues are a common source of regret

Whether you patch or you don’t, there will be consequences. The consequence of unpatched issues is regret. As VP of Customer Experience, I have seen firsthand the downtime caused by customers failing to address known issues in a timely manner.

6. Undocumented issues cause downtime too

Picture the scene. A new admin is looking into servers on the network. The usage reports indicate the server is not active and no clients are connected. Not recognizing the server and finding no “tags”, documentation, or other identifiers, the new admin believes that it should be shut down. Unfortunately, the undocumented and uncommunicated instance is actually a standby server whose removal will cause downtime when the primary crashes unexpectedly. This isn’t a fictional story. It is the true story of a new admin who incorrectly identified a server as an idle QA system and shut it down prior to a patching exercise.

7. Complacency is also an enemy

We’d all love it if availability on premises, in the cloud, or anywhere in between were something that we could “set and forget.” But few things in life, if any, are really as simple as “set it and forget it.” One of the biggest enemies of your availability in the future is your success with high availability now. When disasters are few and far between, and teams feel confident that they have realized sustained stability, complacency can step in. Success tempts us to think nothing is going to change, and complacency with respect to high availability is therefore an enemy of high availability. Things around your enterprise and within your enterprise are changing. The cloud is changing, your business needs are changing, and the applications and operating systems are also changing.

8. Change is hard

Change is hard. Just ask anyone with a sweet tooth who’s been trying to give up that second slice of cake before bedtime. Similar resistance occurs even in high availability. Teams, even those who experience disasters, are often reluctant to change even if the change is good. They need a vision, an understanding of why, and support. Other teams, those with solutions in place, are reluctant to improve high availability for fear of introducing instability or exposing themselves to new risk.

9. All change is not good change

Change is good when change is good. When considering a change to the high availability solution and architecture, it is critical that changes are analyzed against the goals, the requirements, and within the scope of increasing availability. Changes that increase stability, add protection for critical components, eliminate workarounds, optimize the availability of services, and are thoroughly tested are good changes.

10. Cheaper is not always better

Cheaper is not always better. While cheaper solutions typically have a lower price tag, they may also come with a number of limitations that make them less than ideal. When there is a lower price tag, beware of limitations such as a lack of application awareness, limited orchestration, hidden complexity, manual recovery and failover, and limited to no user validation. Cheaper solutions may also fail to include customer support. Be sure to understand whether your cheaper solution includes support, or whether support is an additional and substantial add-on cost.

The same applies to cheaper deployments with reduced compute, disk or storage. While the price tag and monthly cost might be lower, your solution may also be functioning at a less than ideal capacity.

11. Loud does not equal effective

Ever heard the story of the boy who cried wolf? An application monitoring solution that produces an alert storm is, sooner or later, a solution that gets ignored. Having a solution that provides alerts is great, but if that solution triggers critical alerts in error or in excess, it is ineffective.
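
One common way to keep alerts from turning into a storm is to suppress repeats of the same alert for a cooldown period. The sketch below illustrates that idea only; the class name, the alert keys, and the five-minute cooldown are assumptions, and real monitoring tools usually offer their own grouping and throttling settings.

```python
class AlertSuppressor:
    """Send the first alert for a key, then stay quiet until the cooldown expires."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_send(self, alert_key: str, now: float) -> bool:
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same alert fired recently: suppress the repeat
        self._last_sent[alert_key] = now
        return True


suppressor = AlertSuppressor(cooldown_seconds=300)

# A flapping disk check fires every 60 seconds; only some of these reach a human.
for timestamp in range(0, 601, 60):
    if suppressor.should_send("disk-full:db01", now=timestamp):
        print(f"t={timestamp}s: page the on-call engineer")
```

Timestamps are passed in explicitly so the behavior is easy to test; a production version would more likely read a clock.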

12. High Availability is a culture and a mindset, not just a product or hardware solution

Software, hardware, processes, solutions, and services are all a part of high availability. However, without buy-in across IT functions and business units, high availability will be fraught with frustration and will constantly be the source of budget discussions instead of discussions on value, business stability, increased customer satisfaction, and diminished risk.

13. Now is not too late

Hope is not a strategy for high availability, and hoping that you will not have a critical disaster or application failure does not need to be your strategy. Designing and architecting a highly available enterprise architecture can be made possible now, even if it has been weeks or months since the last disaster.

Contact SIOS to learn more about high availability solutions for your application.

– Cassius Rhue, VP, Customer Experience

Reproduced from SIOS

12 Questions to Uncomplicate Your Cloud Migration

September 10, 2021 by Jason Aw

Cloud migration best practices

“The cloud is becoming more complicated.” It was the first statement in an hour-long webinar detailing the changes and opportunities that come with the boom in cloud computing and cloud migration. The presenter continued with an outline of cloud-related things that traditional IT is now facing in its journey to AWS, Azure, GCP, or other providers.

There were nine areas that surfaced as complications in the traditional transition to cloud:

Definitions
Pricing
Networking
Security
Users, Roles, and Profiles
Applications and Licensing
Services and Support
Availability
Backups

As VP of Customer Experience for SIOS Technology Corp., I’ve seen how these areas can impact a transition to cloud. To mitigate these complications, consumers are turning to managed service providers, cloud solution architects, contractors and consultants, and a bevy of related services, guides, blog posts, and articles. Often, in the process of turning to outside or outsourced resources, the complications of the cloud are not entirely removed. Instead, companies and the teams they have employed to assist or to transition them to cloud still encounter roadblocks, speed bumps, hiccups, and setbacks.

Most often these complications and slowdowns in migrating to the cloud come from twelve unanswered questions:

What are your goals for moving to the cloud?
What is your current on-premises architecture? Do you have a document, list, flow chart, or cookbook?
Are all of your application, database, availability and related vendors supported on your target cloud provider platform?
What are your current on-premises risks and limitations? What applications are unprotected, what are the most common issues faced on-premises?
Who is responsible for the cloud architecture and design? How will this architecture and design account for your current definitions and the definitions of the cloud provider?
Who are the key stakeholders, and what are their milestones, business drivers, and deadlines for the business project?
Have you shared your project plan and milestones with your vendors?
What are the current processes, governance, and business requirements?
What is the migration budget and does it include staff augmentation, training, and services? What are your estimates for ongoing maintenance, licensing, and operating expenses?
What are your team’s existing skills and responsibilities?
Who will be responsible for updating governance, processes, new cloud models, and the various traditional roles and responsibilities?
What are the applications, services, or functions that will move from IaaS to SaaS models?

Know Your Goals for the Cloud

So, how will answering these twelve questions improve your cloud migration? As you can see from the questions, understanding your goals for the cloud is the first and most important step. It is nearly universally accepted that “a cloud service provider such as AWS, Azure, or Google can provide the servers, storage, and communications resources that a particular application will require,” but for many customers, this only eliminates “the need for computer hardware and personnel to manage that hardware.” Because of this, customers are often focused on equipment or data center consolidation or reduction, without considering that there are additional cloud opportunities and gaps that they still need to address. For example, cloud does eliminate management of hardware, but it “does not eliminate all the needs that an application and its dependencies will have for monitoring and recovery,” so if your goal was to get all your availability from the cloud, you may not reach that goal, or it may require more than just moving from on premises to an IaaS model. Knowing your goals will go a long way in helping you map out your cloud journey.

Know Your Current On-Premises Architecture

A second critical category of questions needed for a proper migration to the cloud (or any new platform) concerns the current on-premises architecture. This step helps identify not only the critical applications that need availability, but also their underlying dependencies and any changes required for those applications, databases, and backup solutions based on the storage, networking, and compute changes of the cloud. Answering this question is also a key step in assessing the readiness of your applications and solutions for the cloud and quantifying your current risks.

A third area that will greatly benefit from working through these questions occurs when you discuss and quantify current limitations. Frequently, we see this phase of discovery opening the door to limitations of current solutions that do not exist in the cloud. For example, our services team recently worked with a customer impacted by performance issues in their SQL database cluster. A SIOS expert assisting with their migration inquired about the solution, architecture, and VM sizing decisions. After a few moments, a larger, more appropriately sized instance was deployed, correcting limitations that the customer had accepted due to their on-premises restrictions on compute, memory, and storage. Similarly, we have worked with customers who were storage sensitive. They would run applications with smaller disks and a frequent resizing policy, due to disk capacity constraints. While storage costs should be considered, running with minimal margins can become a limitation of the past.

Understand Business and Governance Changes

The final group of questions helps your team understand schedules, business impacts, deadlines, and governance changes that need to be updated or replaced because they may no longer apply in the cloud. Migrating to the cloud can be a smooth transition and journey. However, failing to assess where you are on the journey and when you need to complete it can turn it into a nightmare. Understanding timing is important and can be keenly aided by considering stakeholders, application vendors, business milestones, and business seasons. Selfishly, SIOS Technology Corp. wants customers to understand their milestones because, as a service provider, it minimizes the surprises. But we also encourage customers to answer these questions because they often uncover misalignment between departments and stakeholders. The DBAs believe that the cutover will happen on the last weekend of the month, but Finance is intent on closing the books over the final weekend of the same month; or the IT team believes that cutover can happen on Monday, but the applications team is unavailable until Wednesday; and, perhaps most importantly, the legal team hasn’t combed through the list of new NDAs, agreements, licensing, and governance changes necessary to pull it all together.

As customers work through the questions, with safety and empathy, what often emerges is a puzzle of pieces, ownership, processes, and decision makers that needs to be put back together using the cloud provider box top and honest conversations on budget, staffing, training, and services. The end result may not be a flawless migration, but it will definitely be a successful migration.

For help with your cloud migration strategy and high availability implementation, contact SIOS Technology Corp.

– Cassius Rhue, VP, Customer Experience

Learn more about common cloud migration challenges.

Read about some misconceptions about availability in the cloud.

Reproduced from SIOS

3 Steps to Effective IT System Redundancy

September 5, 2021 by Jason Aw

In some industries duplicate tasks can be a waste of company resources and could introduce unintended human error and loss of time. But in the IT world of managing systems and data, the duplication process referred to as “redundancy” is critical to the continued success of your organization.

1. Protect Both Your Devices and Software with Redundancy Tools

IT tools that provide redundancy ensure your system and software assets are protected from loss or corruption. They should also provide for a timely recovery from interruptions to your business.

Redundancy in IT systems means having the ability to duplicate your system components, whether on hardware, VMs, or the cloud. At the user level, a simple example is making a copy of the user’s PC system and storing it on another PC as a spare in case the user’s PC fails.

This same concept can be applied to any other computer component, including servers, storage devices, and networking equipment. For example, “mirroring” is the mechanism for writing the same data to multiple disks, making those disks redundant.
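
As a rough illustration of mirroring, the toy class below writes every block to all member disks and keeps serving reads after one member fails. It is an in-memory sketch with made-up names, not a storage driver, and it ignores resynchronizing a replaced disk.

```python
class MirroredVolume:
    """Toy mirror: every write goes to all healthy member disks."""

    def __init__(self, disk_count: int = 2) -> None:
        self.disks: list[dict[int, bytes]] = [{} for _ in range(disk_count)]
        self.failed: set[int] = set()

    def write(self, block: int, data: bytes) -> None:
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data  # same data lands on each surviving member

    def fail_disk(self, index: int) -> None:
        self.failed.add(index)

    def read(self, block: int) -> bytes:
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]  # any surviving member can answer
        raise IOError("all mirror members have failed")


volume = MirroredVolume(disk_count=2)
volume.write(0, b"payroll records")
volume.fail_disk(0)    # simulate losing the first disk
print(volume.read(0))  # the second disk still holds the data
```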

Redundancy enables you to recover from a device failure by switching to a spare device as soon as possible. Businesses rely heavily on their IT systems, and a service outage caused by a system failure can cause considerable downtime of operations. As a result, redundancy is indispensable for the IT system to remain resilient to failure and reduce the risk of business interruption. Depending on your organization’s size and geographical locations, this could be difficult, time-consuming, and costly.

2. Keep All Data Current and Synchronized with Clustering

Having redundant devices with the same specifications and environment (operating system and software) does not automatically safeguard against the loss of user files, emails, and mission-critical application data when there is a failure. This is true not only for an individual user’s PC but also at the larger enterprise scale, across multiple servers and storage devices. Failure of a data storage device could subject your business operations to significant delays while the latest data is inaccessible. For large applications like SQL Server, Oracle, or SAP, recovery time could be significant.

Unfortunately, many companies believe their risk is reduced by simply backing up their data. However, until a production device suddenly fails, most people do not realize how difficult it is to actually restore the data to a standby device from the backup copy.

In stark contrast, with a standby device that already has the ability to use the same data that was on the failed production device, all you have to do is start up the standby machine and switch to it. The recovery work will be much easier. This is possible with a High Availability (HA) cluster system.

Clustering helps improve reliability and performance of software and hardware systems by creating redundancy to compensate for unforeseen system failure. The HA cluster system consists of redundant servers in the active and standby systems and external storage (e.g., shared disk) that both servers can access. In the unlikely event that the operating server fails, by switching to the standby server, the service can be continued with the combination of the standby server and the external storage containing the latest data.

By the way, the same function can be achieved with “replication”, which synchronizes data between disks inside the servers in real time. Replication is also an excellent measure for disaster recovery because it does not require the installation of expensive external storage and keeps the latest data on both instances. Depending on the location of the secondary instance, data is replicated either synchronously or asynchronously. Be aware that how the data is replicated impacts Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
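
The effect of the replication mode on the recovery point can be seen in a small model. This is a simplified sketch with invented names: in synchronous mode a write only completes once the replica has it, while in asynchronous mode writes queue up and are shipped later, so a failure can strand some of them on the failed server.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary disk with one replica."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.replica: list[str] = []
        self._pending: deque[str] = deque()

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            self.replica.append(block)   # write completes only once the replica has it
        else:
            self._pending.append(block)  # shipped to the replica later

    def ship_pending(self, max_blocks: int) -> None:
        """Asynchronous mode: send up to max_blocks queued writes to the replica."""
        for _ in range(min(max_blocks, len(self._pending))):
            self.replica.append(self._pending.popleft())

    def writes_lost_on_failover(self) -> int:
        """Writes that exist only on the primary if it fails right now."""
        return len(self.primary) - len(self.replica)


for mode in (True, False):
    volume = ReplicatedVolume(synchronous=mode)
    for block in ("order-1", "order-2", "order-3"):
        volume.write(block)
    volume.ship_pending(max_blocks=1)  # the link only keeps up with one write
    label = "synchronous" if mode else "asynchronous"
    print(f"{label}: {volume.writes_lost_on_failover()} write(s) at risk")
```

The model leaves out the cost side of the trade-off: synchronous replication makes every write wait for the replica, which is why asynchronous replication is often chosen for distant secondary sites.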

3. Automate Failover

Whether you utilize an HA cluster system or replication, the best practice is to avoid manual switch-over of your server when a failure occurs. Instead, automate the process so that it is performed without delay in a process called failover. Configuring an automated failover of the HA cluster system / replication minimizes downtime as much as possible and reduces human error.
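
A bare-bones version of automated failover might look like the sketch below: the standby counts consecutive missed heartbeats and takes over once a threshold is exceeded. The class name, node labels, and threshold are illustrative assumptions, and the sketch deliberately omits the safeguards a real cluster needs to tell a failed primary from an unreachable one.

```python
class FailoverMonitor:
    """Promote the standby after too many consecutive missed heartbeats."""

    def __init__(self, missed_heartbeats_allowed: int) -> None:
        self.allowed = missed_heartbeats_allowed
        self.missed = 0
        self.active_node = "primary"

    def record_heartbeat(self, received: bool) -> None:
        if received:
            self.missed = 0  # any successful heartbeat resets the counter
            return
        self.missed += 1
        if self.missed > self.allowed and self.active_node == "primary":
            self.active_node = "standby"  # automated switch-over, no operator needed


monitor = FailoverMonitor(missed_heartbeats_allowed=2)

# One heartbeat result per check interval: a single blip, then a real outage.
for received in (True, False, True, False, False, False):
    monitor.record_heartbeat(received)
    print(f"heartbeat received={received} -> active node: {monitor.active_node}")
```

Tolerating a small number of missed heartbeats avoids failing over on a brief network blip, at the cost of a slightly longer detection time.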

SIOS SAN-based and SANLess clustering solutions provide high availability and disaster recovery for mission-critical applications in physical, virtual, cloud or hybrid cloud environments. For further information, refer to our Windows and Linux high availability products.

Reproduced from SIOS