Beginning Well is Great, But Maintaining Uptime Takes Vigilance

September 28, 2021 by Jason Aw

Author Isabella Poretsis states, “Starting something can be easy, it is finishing it that is the highest hurdle.” It is great to have a kickoff meeting. It is invigorating and exciting. Managers and leaders look out at the greenfield with excitement, and optimism is high. But the moment of kickoff, and even the champagne-popping moment of a successful deployment, are just the beginning. Maintaining uptime requires ongoing vigilance.

High availability and the elusive four nines of uptime for your critical applications and databases aren’t momentary occurrences but a constant endeavor to catch the little foxes that destroy the vineyard. Staying abreast of threats, current on updates, and properly trained and prepared is the work from which your team “is never entitled to take a vacation.”

For those who want to stay vigilant in maintaining uptime, here are five tips:

1. Monitor the Environment

Very little in enterprise software still follows the “set it and forget it” mindset. Everything, from the day you uncorked the grand opening champagne until now, has been moving toward a state of decline. If you aren’t monitoring the servers, workloads, network traffic, and hardware (virtual or physical), you may lose uptime and stability.

2. Perform Maintenance

One thing I have noticed in more than twenty years of software development and services is that all software comes with updates. Apply them. Remember to execute sound maintenance policies, including taking and verifying backups. One tech writer suggested that the only update you regret is the one you failed to make.

3. Learn Continuously

My first introduction to high availability came when I unplugged one end of the Token Ring for a server in our lab as an intern, fresh from the CE-211 lab. The administrator was in my face in minutes. After an earful, he gave me an education. Ideally, you and your team want to learn without taking down your network, but you absolutely do want to keep learning. Look into paid courses on existing technology, new releases, and emerging infrastructure. Check your vendors for courses and items related to your process, environment, software deployments, and company enterprise. Free courses for many things also exist if money is an issue.

4. Multiply the Learning

In addition to continuous learning, make a plan to multiply the learning. As the VP of Customer Experience at SIOS, I have seen the tremendous difference between teams that share their learning and those that don’t. Teams that share their learning avoid the gaps in knowledge that compromise uptime. The best way to know that you learned something is to teach it to somebody else. As you learn, share the learning with team members to reduce the risk of downtime due to error or, for that matter, vacation.

5. End Well . . . Before the Next Beginning

All projects, servers, and software have an ending. End well. Decommission correctly. Begin the next phase, deployment, software relationship, etc. well by closing up loose ends and documenting what went well, what did not, and what to do next. Treat your existing vendors well. You just may need them again later. Understand the existing systems and high availability solutions before proceeding with a new deployment. This proper ending helps you begin again from a better starting place, headed toward a stronger outcome.

Keeping the system highly available is a continuous process. “Set it and forget it” is a nice catchphrase, but the reality is that uptime takes vigilance, continual monitoring, proper maintenance, and constant learning.

-Cassius Rhue, VP, Customer Experience

Reproduced with permission from SIOS

SIOS Protection Suite for Linux Quick Service Protection

February 6, 2021 by Jason Aw

How to add custom application support to SIOS Protection Suite - SIOS Protection Suite for Linux Quick Service Protection

Using SIOS Protection Suite for Linux Quick Service Protection Resource

On a recent engagement with the SIOS Professional Services team, a customer asked how to protect a custom application with the SIOS Protection Suite for Linux solution. One of the highly experienced high availability experts at SIOS Technology Corp. worked to understand the customer’s application and laid out the methods SIOS provides for custom application support.

SIOS Protection Suite for Linux provides multiple methods for adding high availability and application monitoring to custom applications. These options include the following:

Creating a custom application recovery kit (ARK)¹
Creating a generic application resource hierarchy
Creating a quick service protection resource

Type	Coding Complexity	Monitoring	Recovery
Custom Application Recovery Kit Resource¹	Highest	Highest	Highest
Generic Application Resource	Medium	High	High
Quick Service Protection Resource	Low	Medium	Medium

Definitions Used in Chart

Monitoring – defined as the ability to determine the availability, accessibility, and functioning of the protected application, database, or service. A low level of application, database, or service monitoring provides basic coverage, such as a check for a running process, the existence of a pid_file, or that the status command returns a ‘true’ result when executed. Note: A ‘true’ or ‘0 (zero)’ return code does not mean that the application, database, or service is running, only that the command executed was able to complete successfully with a positive (‘true’ or ‘0 (zero)’) status result. The highest level of monitoring indicates that application-specific knowledge is applied to determine the health and functioning of the application beyond lower-level methods such as process status, ps output, or systemd status returns. The highest level of monitoring typically applies knowledge of the recommended order of health check operations, knowledge of dependencies, and analysis of the results obtained from status and monitoring commands.

Recovery – defined as the ability to restart a failed application, database, or service. A low level of recovery capability implies that commands for a restart are issued and the expected output is obtained from the issuance of the command. The highest level of recovery indicates that application-specific knowledge is applied to determine how to initiate an orderly restart of the application, database, or service, which may require knowledge of the recommended order of operations, dependencies, rollbacks, or other related remediation of a failed service.
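To make the difference between monitoring levels concrete, here is a minimal Python sketch. It is not how SIOS Protection Suite implements its checks. The function names, the injectable `run` and `probe` parameters, and the idea that the application listens on a TCP port are all assumptions made for illustration. The only real command used is `systemctl is-active --quiet`, which exits with 0 when systemd considers the unit active.

```python
import socket
import subprocess


def unit_is_active(service, run=subprocess.run):
    """Low-level check: does systemd report the unit as active?

    An exit code of 0 only means systemd considers the unit active.
    It does not prove that the application is answering requests.
    """
    result = run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0


def port_accepts_connections(host, port, timeout=2.0):
    """Slightly deeper, hypothetical check: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_level(service, host, port, run=subprocess.run,
                 probe=port_accepts_connections):
    """Combine both checks into a coarse health verdict."""
    if not unit_is_active(service, run=run):
        return "down"
    if not probe(host, port):
        # systemd is satisfied, but the application is not reachable.
        return "active-but-unresponsive"
    return "healthy"
```

A call such as `health_level("example", "127.0.0.1", 8080)` would separate "systemd says active" from "the application actually responds". The port number here is made up, since the example service in this article does not define one.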

Solution: Quick Service Protection Resource

In this engagement, the customer’s application had systemd compatibility. Based on their overall requirements for avoiding coding, minimal monitoring needs, and simple recovery procedures, we recommended the Quick Service Protection (QSP) Resource.

The QSP resource quickly adds support for a systemd service to SIOS Protection Suite for Linux resource protection. In the case of Customer Example.com, they have a systemd-compatible service with the minimal definition needed to start and stop their application.

[Unit]
Description=SIOS ‘as-is’ Example Service 2020
After=network.target

[Service]
Type=simple
Restart=always
RestartSec=3
User=root
ExecStart=/example_app/bin/exampleapp start
ExecStop=/example_app/bin/exampleapp stop

[Install]
WantedBy=multi-user.target

Example.com systemd file

SIOS recommends that, before attempting to protect the resource with the SIOS Protection Suite for Linux product, you verify via systemctl that the example application stops and starts as expected:

# systemctl status example
* example.service - SIOS ‘as-is’ Example Service 2020
   Loaded: loaded (/usr/lib/systemd/system/example.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

# systemctl start example
# systemctl status example
* example.service - SIOS ‘as-is’ Example Service 2020
   Loaded: loaded (/usr/lib/systemd/system/example.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-08-21 14:53:27 EDT; 5s ago
 Main PID: 19937 (exampleapp)
   CGroup: /system.slice/example.service
           `-19937 /usr/bin/perl /example_app/bin/exampleapp start

# systemctl stop example
# systemctl status example
* example.service - SIOS ‘as-is’ Example Service 2020
   Loaded: loaded (/usr/lib/systemd/system/example.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

After verifying that the application functions correctly via systemd, restart the service and ensure that the service is running.

# systemctl start example
# systemctl status example
* example.service - SIOS ‘as-is’ Example Service 2020
   Loaded: loaded (/usr/lib/systemd/system/example.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-08-21 15:59:44 EDT; 3min 2s ago
 Main PID: 30740 (exampleapp)
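If you repeat this check on several nodes, the start, stop, and restart sequence above can be scripted. The following Python sketch is one possible way to do that, not a SIOS tool. The function names and the injectable `run` parameter are assumptions, and the script would need to run with enough privilege to start and stop the unit.

```python
import subprocess


def is_active(service, run=subprocess.run):
    """True when `systemctl is-active --quiet` exits with 0."""
    return run(["systemctl", "is-active", "--quiet", service]).returncode == 0


def verify_start_stop(service, run=subprocess.run):
    """Start, stop, then start the unit again, checking state after each step.

    Returns a list of (action, ok) pairs. The final start leaves the
    service running, which matches the state expected before the resource
    is created.
    """
    results = []
    steps = (("start", True), ("stop", False), ("start", True))
    for action, expect_active in steps:
        run(["systemctl", action, service])
        results.append((action, is_active(service, run=run) == expect_active))
    return results


if __name__ == "__main__":
    for action, ok in verify_start_stop("example"):
        print(f"{action}: {'ok' if ok else 'UNEXPECTED STATE'}")
```

Passing `run` as a parameter keeps the sketch testable without touching a real service, at the cost of a slightly longer signature.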

Refer to the SIOS Protection Suite for Linux Quick Service Protection documentation for additional details on the resource create process.

Using the SPS-L UI, select the Create option in the Global UI Resource Toolbar.

Once the create wizard is launched, select the Quick Service Protection option in the Create Resource Wizard window.

In the next prompt for ‘Switchback Type’, choose whether you will use intelligent switchback or automatic switchback.

After selecting the ‘Switchback Type’, the Server dialogue appears allowing you to choose the primary server for the custom application.

(Note: If the service requires storage, be sure to choose the same primary server previously selected for the storage resources.)

In the Service Name dialog box, find the service for your custom application.

Once you’ve selected the correct service, example, determine whether you will enable monitoring or disable the monitoring service. Refer to the documentation to gain an understanding of the monitoring provided by the QSP resource.²

Next, choose a resource tag. A resource tag should be a meaningful name that will help your IT team quickly identify which SPS-L resource protects your application or service.

Lastly, follow the final dialogue to complete the resource creation process. Once the resource is created, use the UI to extend the resource to additional servers. If necessary, create dependencies between the newly protected custom service/application and any other required resources such as storage or IP resources.

NOTES:

¹ Creating a custom application recovery kit can be accomplished via an engagement with the SIOS Technology Corp. Professional Services Team. For more information, contact professional-services@us.sios.com

² The QSP Recovery Kit quickCheck can only perform a simple health check (using the “status” action of the service command). QSP doesn’t guarantee that the service is provided or that the process is functioning. If complicated starting and/or stopping is necessary, or more robust health checking operations are necessary, using a Generic Application or Custom Application ARK is recommended.

Reproduced from SIOS

How To Choose A Cloud When You Need High Availability

January 8, 2021 by Jason Aw

Understand the cloud market

A number of analyst firms are predicting an ever-increasing number of deployments of applications, databases, and solutions in the cloud. According to Gartner, firms are “moving to the cloud at an increasing rate.”^[1] In fact, Gartner and other analysts expect the pace of cloud migration and deployment to continue to accelerate, driven in large part by the pace of innovation in the cloud. In a TechTarget article, Kurt Marko of MarkoInsights notes that the pace of innovation that is “being undertaken in the cloud likely can’t be replicated on premises due to the elastic, scalable, and on-demand nature of managed public cloud services.”

More and more companies that once used the cloud only for DevOps applications and databases not essential to their business are now moving mission-critical applications, ERPs, and databases that require high availability protection to the cloud.

If you are considering a move to the cloud – and it seems likely that you are – there are several keys to understand when you need high availability.

Familiarize yourself with the cloud high availability options

To plan for the proper availability solution for a cloud or hybrid cloud deployment, consider what the pain points are with regard to both availability (99.9% uptime) and high availability (99.99% uptime). You also need to understand the options that are available for high availability with an eye toward your plans to migrate to the cloud. Notable analysts and experts suggest looking for solutions that will not only reduce the pain of migrating your workloads, but will also provide a balanced and comprehensive approach to availability throughout the lifespan of your cloud architecture. It is also wise to consider solutions that can provide protection and high availability for portions of your workload that may one day repatriate from the cloud back to your on-premises environment.
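The gap between 99.9% and 99.99% is easier to appreciate as a downtime budget. The short Python calculation below is a back-of-the-envelope sketch: it assumes a 365-day year and counts all downtime equally, and the function name is ours rather than anything from a vendor SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted in the period at a given uptime target."""
    return period_minutes * (1 - uptime_percent / 100)


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% leaves roughly 525.6 minutes (about 8.8 hours) per year, while 99.99% leaves roughly 52.6 minutes.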

Here are ten things to consider when comparing your availability options in the cloud:

1. The deployment method. Is it possible to deploy the availability solution you are considering using an image, CLI, UI, or other repeatable method such as a CloudFormation template or packaged scripts?

2. The system requirements. Most notably, consider the operating system (OS), disk, CPU, and memory requirements.

3. The deployment environments. Do your availability options support on-premises only, one or more public clouds, or a mixture such as a hybrid cloud deployment? Is there a SaaS offering available as well?

4. The breadth and depth of application protection. “Breadth” meaning what types of applications, databases, front-ends, networking, and infrastructure components can be protected? Is there a flexible framework for adding new applications and variants? “Depth” meaning – is the solution application-aware – and able to maintain application-specific best practices throughout the application failover/failback processes?

5. Performance requirements. We often think of RTO and RPO, but what about the other performance needs of your solution? Will your availability solution cause performance issues on failover?

6. Resilience requirements. How large a cluster can the availability solution support? How many faults and failures can it detect and recover from? How will replication be handled while keeping metadata in sync?

7. Supportability and maintenance. Does the availability vendor have experience with a wide range of availability needs and configurations? Do they have longevity, and a support system designed to address issues that may go beyond their solution? Can they help you minimize disruption and planned downtime during your system management and maintenance (patches, upgrades, and general maintenance)?

8. Total cost of ownership. There are entire industries and services dedicated to helping you calculate the total cost of ownership, so we won’t cover that here. Suffice it to say, your calculations will be unique to your organization, cloud provider, applications, and IT team. You should consider whether your availability solution vendor can help you identify strategies for saving on utilization, licensing, and other costs. Does the solution automate manual tasks or reduce IT labor time?

9. Licensing and pricing model. How do you consume the cost of the software? Is there a subscription fee, subscription model, pay-as-you-go offering, bring your own license (BYOL), or combination of flexible options? How will you enable the product licensing? Is there a license server, licensing service, or encrypted key based on virtual machine deployment details, such as address, hostname, or MAC address?

10. The impact on IT staff. How much training will the solution require? How much manual intervention will be needed in the event of an application failure event or disaster? Will it require specialized scripting that needs to be maintained? Who will be responsible for ongoing maintenance?

Weigh the benefits and trade-offs

Like every important decision, you need to understand your tradeoffs and choose the best balance to meet your needs. For example, I recently asked a friend to recommend a good walking shoe. I bought a pair he raved about – noting how lightweight they were, how strong and durable the fabric, and how stylish they were. I went for my first long walk-run in them, and I donated my first pair of “one run” shoes immediately thereafter. When I went to ‘Fleet Feet’ to get an expert’s opinion I ended up with a heavier shoe, with more breathable fabric (also less durable), and an unrivaled level of hideousness. I made a tradeoff between appearance and function that worked for my needs and budget.

As with running shoes, there is no silver bullet solution that will be the right fit for every company, every application, every database, and every possible server and architecture. You are officially free to stop looking for it. Instead, settle into the activity of weighing the trade-offs to determine what is the right fit for your company’s needs. For example, if you’re sure you will be a full Microsoft shop, the importance of GCP and AWS support should be a little lower in your evaluation process.
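One simple way to make this weighing explicit is a weighted scoring sheet. The Python sketch below is purely illustrative: the criteria, weights, solution names, and 1-to-5 scores are invented for the example and do not describe any real product. It only shows how lowering the weight of AWS and GCP support, as a Microsoft-centric shop might, changes the ranking.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (same keys in both dicts)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights for a shop that expects to stay on Azure:
# Azure support matters most, AWS and GCP support matter least.
weights = {
    "azure_support": 5,
    "aws_support": 1,
    "gcp_support": 1,
    "application_awareness": 4,
    "low_training_effort": 3,
}

# Hypothetical 1-5 scores for two made-up candidate solutions.
candidates = {
    "solution_a": {"azure_support": 5, "aws_support": 1, "gcp_support": 1,
                   "application_awareness": 4, "low_training_effort": 3},
    "solution_b": {"azure_support": 3, "aws_support": 5, "gcp_support": 5,
                   "application_awareness": 3, "low_training_effort": 4},
}

ranked = sorted(candidates,
                key=lambda name: weighted_score(candidates[name], weights),
                reverse=True)

for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], weights):.2f}")
```

With these made-up numbers, solution_a ranks first even though solution_b supports more clouds, because multi-cloud support carries little weight for this team. A score like this is a conversation aid, not a substitute for a proof of concept.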

Take your IT infrastructure dynamics into account

Think holistically about availability in your entire IT infrastructure – both on premises and in the cloud. The reasons to do so are best explained with another analogy. In 2018, I was the coordinator for an outreach program feeding the homeless and hungry in Columbia, South Carolina. Our group met once a week to serve a meal and a message of hope to over 100 men, women and children. When we considered expanding – adding more days of the week, more hours, or additional services, we had to think well beyond simple scheduling requirements. Knowing that we were providing a critical service to clients who depend on us, we had to consider all the factors that affected our ability to deliver those services consistently for the long-term, such as: cost, ages of our team members, outside obligations, alternative methods to achieve our goals, risk factors, and other dynamics within our parent organization.

When you are choosing your solution, after you’ve understood the market, familiarized yourself with options, and weighed the trade-offs, the last step is to take into account the various other dynamics in your overall environment. Will the solution meet the needs of your business as a whole? Will your critical data be protected from loss? Will your end-user productivity be protected from downtime? What training will be required to move to the cloud, and how will that impact your ability to manage or maintain the solution that you choose? What IT roles will be added, removed, or changed in your cloud journey? Will any responsibilities for application availability move to any line-of-business owners? And how will the shifts in responsibilities or team makeup improve or decrease your overall potential for success? Consider whether your team needs to take a step-by-step approach, migrating smaller workloads first.

As VP of Customer Experience, I have seen a wide range of cloud migration planning, some of it straightforward and some extremely disruptive. In one instance, a customer’s move to the cloud was highly contentious because management saw it as an opportunity to eliminate an entire IT department. I’m not suggesting that you play politics, but you should be aware of all of the factors at play in these complex projects.

Migrating to the cloud is supposed to save money, time and resources while affording improvements in availability and resilience. Regardless of which cloud you choose, make sure that you consider these tips and select the corresponding availability solution that gives you the flexibility to deliver the protection you need in the configuration you want.

Learn more about cloud high availability options with SIOS.

– Cassius Rhue, VP of Customer Experience, SIOS

Reproduced with permission from SIOS

How To Clone Availability In The Cloud With Better Outcomes

December 30, 2020 by Jason Aw

Tips from the movies – Multiplicity

Multiplicity is a 1996 American science fiction comedy film starring Michael Keaton as Doug Kinney, a busy construction worker struggling to make time for his family and his demanding job. When a scientist offers to clone him, Doug agrees, just to make meeting his schedule and commitments easier. But then the copies of him begin making copies of themselves. By the time the last copy is made, the point is clear. Cloning may not be all it’s cracked up to be, or at the very least comes with some strong warnings, challenges, and side effects. The famous original Star Trek episode “The Trouble with Tribbles” illustrates a similar point.

Like cloning on the big screen (or small), cloning in the cloud is a great tool, but not without its challenges.

Tips for how to get better outcomes when you clone availability in the cloud

1. Clone operational systems

This sounds obvious, but I have seen it go wrong more than once in real enterprise environments. If you clone a non-functional system, the clone will be equally non-functional and problematic when you restore it. Be sure that the clone you make is from an operational and functional system.

2. Sync data to disk and resync on restore

File system integrity is critical. If you don’t ensure your application and/or VM are in a consistent state, most vendors will not guarantee the resulting created image. Since snapshots only capture data that has been written to your volume at the time the snapshot command is issued, this might exclude any data that has been cached by any applications or the operating system. Making sure data has been properly synced to the file system is an important step, and absolutely critical in a cluster environment.

File system integrity is also critical to keep in mind when you restore from an image. If you are using data replication and you restore an image as source or target in the cluster, making sure the two nodes are in sync is paramount. Failing to do so may lead to file system errors on failover or switchover, or even potential data loss. Clone availability in the cloud to get the result you want.
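The point about cached data can be illustrated at the file level. The Python sketch below uses the standard `os.fsync` and `os.sync` calls to push buffered writes to disk before a snapshot is requested. It is a minimal illustration under our own assumptions, with hypothetical function names. It does not quiesce the application, and it is not a substitute for your vendor's or cloud provider's documented snapshot procedure.

```python
import os


def write_durably(path, data):
    """Write bytes to a file and force them to stable storage before returning."""
    with open(path, "wb") as handle:
        handle.write(data)
        handle.flush()              # move data from Python's buffer to the OS
        os.fsync(handle.fileno())   # ask the OS to write its cache for this file


def flush_all_filesystems():
    """Flush pending writes for all file systems, like the `sync` command.

    os.sync() is available on Unix; on other platforms this is a no-op.
    """
    if hasattr(os, "sync"):
        os.sync()
```

In practice, an application-aware step (stopping the service or asking the database to flush) would come first. A call like `flush_all_filesystems()` only covers what the operating system already knows about.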

3. Stop your instance

Many environments do not require you to stop an instance to create an image, and some, such as AWS, will power down the node before making the copy. However, many tools and sites recommend making sure applications are stopped and file system access is properly synced to avoid damage, loss of integrity, or images that have trouble starting, stopping, or running installed applications.

4. Label everything in the cloud (nodes, disks, NICs, everything)

While creating a clone is a free operation, the resulting disks and components typically are not. AWS states, for example, that you are “charged for the snapshots until you deregister the image and delete the snapshots.” When things aren’t labeled, knowing what is in use, what is not, and why it was created becomes problematic. It also becomes subject to the fleeting memories or poor concentration of existing team members. Label everything.

5. Prune clones and snapshots often (cost savings and headache savings)

Pruning old snapshots and clones is not only good for the cost savings, but it is also good for reducing headaches. Older snapshots run the risk of reintroducing vulnerabilities that have been addressed or resolved in newer copies. As VP of Customer Experience at SIOS Technology Corp., I saw the consequences firsthand when we worked with a customer who restored from a snapshot. They ran into several problems as they restarted the application. After troubleshooting, we determined that the clone was running an older version of security software. The cached credentials and metadata stored in the user profile were no longer in sync with the actual application data stored on the externally mounted data drives.
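Labeling and pruning work best together, because labels are what make a pruning report trustworthy. The Python sketch below flags snapshots that are either older than a cutoff or missing required labels. It works on plain dictionaries rather than any cloud provider's API. The field names (`id`, `created`, `labels`), the 30-day default, and the required label keys are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_review(snapshots, now, max_age_days=30,
                        required_labels=("owner", "purpose")):
    """Return (snapshot_id, reasons) for snapshots that need attention."""
    cutoff = now - timedelta(days=max_age_days)
    flagged = []
    for snap in snapshots:
        reasons = []
        if snap["created"] < cutoff:
            reasons.append(f"older than {max_age_days} days")
        labels = snap.get("labels", {})
        missing = [key for key in required_labels if not labels.get(key)]
        if missing:
            reasons.append("missing labels: " + ", ".join(missing))
        if reasons:
            flagged.append((snap["id"], reasons))
    return flagged


# Hypothetical inventory, as it might be exported from a cloud console.
now = datetime(2021, 1, 1, tzinfo=timezone.utc)
inventory = [
    {"id": "snap-old", "created": now - timedelta(days=90),
     "labels": {"owner": "dba-team", "purpose": "pre-upgrade"}},
    {"id": "snap-unlabeled", "created": now - timedelta(days=1), "labels": {}},
    {"id": "snap-ok", "created": now - timedelta(days=1),
     "labels": {"owner": "dba-team", "purpose": "test"}},
]

for snap_id, reasons in snapshots_to_review(inventory, now):
    print(snap_id, "->", "; ".join(reasons))
```

The sketch only reports candidates. Deleting anything is deliberately left to a person, who can check whether an old snapshot is still the only copy of something important.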

6. Limit or restrict cloning of clones in the cloud

Lastly, not everything you do in the cloud needs to be cloned. Consider limiting the types of workloads that you will clone, and restrict the number of roles that can create clones in your environment.

In the movie, when Doug’s clones spark their own series of duplications, an already overwhelmed Doug (Michael Keaton) is forced to exert extra energy to manage his many clones while trying to hide the mess he created from his wife. Achieving clone availability in the cloud with better outcomes is not difficult. Clone carefully to avoid making more work and adding risk from a tool that was supposed to make your work easier and your environment safer.

– Cassius Rhue, Vice President, Customer Experience

Reproduced from SIOS

Convert Azure Clusters To Managed Disks

September 11, 2018 by Jason Aw

Why You Should Convert Azure Clusters To Managed Disks

You may have heard about the recent storage outage that impacted some instances in the US East region back on March 16th. A root cause analysis of the outage is posted under the title “March 16th US East Storage Outage.”

Customer Impact

A subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing their storage account in a single Storage scale unit.

You might be asking, “What is a single Storage scale unit?” Well, you can think of it as a single storage cluster, or single SAN, or however you want to think about it. I don’t think Azure publishes their exact infrastructure, although you can probably assume that behind the scenes they are using Scale Out File Servers for backend storage.

Survive The Outage With Minimal Downtime

So the question is, how could I have survived this outage with minimal downtime? If you read further down that root cause analysis you come across this little nugget.

Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident.

Hence, it is time to convert Azure clusters to Managed Disks.

What Are Managed Disks?

On February 8th Corey Sanders announced the GA of Managed Disks.

Managed Disks would have helped in this outage because, by leveraging an Availability Set combined with Managed Disks, each of the instances in your Availability Set is connected to a different “Storage scale unit”. So in this particular case, only one of the cluster nodes would have failed, leaving the remaining nodes to take over the workload.

Prior to Managed Disks being available (anything deployed before 2/8/2017), there was no way to ensure that the storage attached to your servers resided on different Storage scale units. Sure, you could use different storage accounts for each instance, but in reality that did not guarantee that those Storage Accounts provisioned storage on different Storage scale units. That is one more reason to convert Azure clusters to Managed Disks.

So while an Availability Set ensured that your instances reside in different Fault Domains and Update Domains to ensure the availability of the instance itself, the additional storage attached to each instance really represented a single point of failure. Although the storage itself is highly resilient, with three copies of your data and geo-redundant options available, in this case with a power failure the entire Storage scale unit went down along with all the servers attached to it.
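The single point of failure described here can be modeled in a few lines. The Python sketch below is a toy model, not a query against Azure: the node names and scale unit names are hypothetical, and Azure does not expose this mapping directly. It simply checks whether the loss of any one storage scale unit would take every cluster node down with it.

```python
def survivors_after_unit_failure(node_to_unit, failed_unit):
    """Nodes whose storage is NOT on the failed scale unit."""
    return sorted(node for node, unit in node_to_unit.items()
                  if unit != failed_unit)


def has_storage_single_point_of_failure(node_to_unit):
    """True if losing some single scale unit would leave no node running."""
    units = set(node_to_unit.values())
    return any(not survivors_after_unit_failure(node_to_unit, unit)
               for unit in units)


# Hypothetical placements. Without Managed Disks, both nodes may happen to
# land on the same scale unit; with Managed Disks in an Availability Set,
# they are spread across different units.
unmanaged = {"node1": "scale-unit-a", "node2": "scale-unit-a"}
managed = {"node1": "scale-unit-a", "node2": "scale-unit-b"}

print(has_storage_single_point_of_failure(unmanaged))  # True
print(has_storage_single_point_of_failure(managed))    # False
print(survivors_after_unit_failure(managed, "scale-unit-a"))  # ['node2']
```

In the second placement, the failure of one scale unit leaves a surviving node to take over the workload, which is the behavior the root cause analysis describes for Managed Disks in an Availability Set.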

So long story short: convert Azure clusters to Managed Disks as soon as possible to help minimize downtime.

https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-migrate-to-managed-disks

And if you really want to minimize downtime you should consider Hybrid Cloud Deployments that span cloud providers or on-prem to cloud!

Reproduced with permission from Clusteringformeremortals.com