Strategies for Optimizing IT Systems for High Availability

June 5, 2024 by Jason Aw Leave a Comment


Maintaining high availability (HA) of IT systems is essential for organizational success. From critical database management to ensuring seamless customer experiences, achieving uninterrupted operations presents unique challenges that require strategic planning. Here are some key strategies that organizations can leverage to optimize their IT systems for high availability.

Common Challenges in Optimizing IT Systems for High Availability

A few areas commonly pose challenges for IT systems. One that comes up very often is compatibility with antivirus (AV) solutions. The issue frequently stems from the AV software being overprotective and quarantining files that are critical to the application or to the HA solution itself. Verifying compatibility between solutions is always important, but it is worth going a step further: everyone who administers the system should be familiar with how the AV solution works and should understand the procedure for configuring it, or requesting changes to it, so that critical applications are not interrupted.

In addition to the AV solution, firewall configuration also comes up. HA solutions typically transmit additional traffic over the network to orchestrate cluster behavior, so specific firewall rules usually need to be added to accommodate the HA solution; otherwise, blocked cluster communication can trigger erroneous cluster-recovery actions.
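
As a quick sanity check on that firewall configuration, a short script can verify that cluster communication ports are actually reachable between nodes before the cluster goes live. This is a minimal sketch, assuming a hypothetical list of peers and ports; consult your HA vendor’s documentation for the actual ports its heartbeat and replication traffic use.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def audit_cluster_ports(peers, ports):
    """Return the (peer, port) pairs that could not be reached --
    likely candidates for a missing firewall rule."""
    return [(peer, port) for peer in peers for port in ports
            if not port_reachable(peer, port)]

# Hypothetical example -- node names and port numbers depend on your HA product:
# blocked = audit_cluster_ports(["node2.example.com"], [5404, 5405])
```

Running a check like this from each node against its peers before enabling the HA software can surface a blocked port before it manifests as a false failover.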

Finally, access control becomes slightly more complex when configuring highly available systems. The individual teams (e.g., the DB team, SAP team, or cloud team, however responsibilities are distributed) each need permissions over their respective domains, but any administrators who manage the HA solution gain additional privileges through it (e.g., initiating failover of an application, creating communication paths between nodes, or locking and unlocking storage). It is therefore important to consider the actions available through the HA solution when delegating access permissions. It may be appropriate to restrict HA controls to root-level users, or to define a procedure for taking actions through the HA solution so that teams are notified and actions can be tracked. Either way, under the principle of least privilege, HA solutions add complexity that should be considered to ensure that applications and systems are accessible and mutable only by the delegated parties.

The Role of Failover and Disaster Recovery Strategies in Ensuring System Uptime

Failover capabilities and disaster recovery (DR) strategies both have significant impacts on the uptime of critical systems. HA provides failover capabilities that keep a single-server issue from causing an outage for an application suite, and when configured properly, failover can be nearly seamless: recovery proceeds on the faulting system while a standby system takes over the primary role and picks up the load. Disaster recovery can be tightly interwoven with the HA strategy. If redundancy is already being configured, why not ensure that it spans fault domains? Done properly, applications can be both highly available and fault tolerant. From an IT perspective, properly configured HA and DR strategies ensure that systems are utilized to their fullest potential with minimal downtime: a natural disaster or technological failure in one region where applications are hosted is far less likely to propagate to other regions. Leveraging planned redundancy in tandem with your disaster recovery plan can cover more functionality requirements with fewer resources, since careful planning lets a single standby site provide both redundancy and fault tolerance.

Balancing Cost-Effectiveness and High Availability: Strategies for Organizations

Configuring a clustered environment or a highly available system can get costly. Usually at least one standby system runs alongside the primary system, accruing costs despite not handling a workload, but those costs can be mitigated. Here are a few suggested approaches:

  • Consider using a managed shared storage solution. If you don’t need redundant copies of data, you can save on storage by using shared storage. Something like Amazon EFS could mean paying for half the storage compared with a replicated-disk configuration.

  • Consider the use case for a DR system. Often these systems are simply stop-gap solutions while a primary site is recovered. Resources don’t run on the DR site for long periods, so, depending on the workload, you may be able to provision a smaller system on your DR site to save on compute costs. You would need to communicate this design decision to stakeholders so everyone is aware the DR site is not a long-term hosting solution, but provided your workload and workforce can handle the added restriction, savings on instance sizes can be realized. In the same vein, orchestrator and/or quorum systems that only coordinate within a cluster, and never host workloads, can often be significantly smaller than the systems workloads are delegated to.

  • Consider scaling up or scaling out. Scaling up means increasing the compute capacity of a single machine; in cloud environments, this means moving a smaller instance to a larger instance type when the workload overwhelms it. Scaling out means increasing the number of workers sharing your application’s load when the compute power is necessary. The use case dictates whether scaling up or scaling out is the better solution, but familiarity with the software and environment at hand will let you make those decisions and configure systems to act appropriately when the time comes. Also consider the aggressiveness of your descaling rules: to save costs, ensure instances scale back down to an appropriate resource pool, and evaluate the rules that dictate scale-down behavior so you are not leaving excessive resources provisioned longer than needed.

  • Establish strong communication between IT teams, stakeholders, cybersecurity teams, and HA vendors. A solid basis of communication facilitates a cooperative rollout of any technologies or upgrades to the environment, and keeping communication active keeps all teams apprised of activities occurring on the systems. Keeping every team up to date is crucial and makes it much easier to diagnose issues or begin a rollback procedure if necessary. Strong communication also ensures that best practices are shared efficiently between teams, so they work cooperatively rather than operating on different principles.
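
To make the descaling point concrete, here is a minimal sketch of a conservative scale-down rule: capacity is released only after utilization has stayed below a threshold for a full observation window, so a brief dip does not trigger a scale-down the workload will immediately regret. The threshold and window values are illustrative, not recommendations.

```python
def should_scale_down(samples, threshold=0.30, window=5):
    """Return True only if the last `window` utilization samples (0.0-1.0)
    are all below `threshold`; any recent spike blocks the scale-down."""
    recent = samples[-window:]
    return len(recent) == window and all(s < threshold for s in recent)

# A single spike inside the window keeps the capacity provisioned,
# and too few samples means we do not yet have evidence to descale.
```

Tuning `threshold` and `window` against your actual workload is exactly the evaluation of scale-down rules described above.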

Implementing High Availability: Best Practices

The first and most important practice I would recommend for anyone deploying systems is to maintain a test environment. Keep the test environment as close to identical to the production environment as possible, and perform dry runs of any procedure that will occur in production, so teams are well versed in procedures and runbooks when a production rollout occurs.

This practice feeds into the other best practices I would offer. By maintaining your test environment, you are also maintaining a system that can be used to pre-test any changes. The test environment is the perfect place to verify product compatibility and ensure that any considerations for mutual operation between technologies are well established. A recurring example is configuring exclusions for antivirus software: when these exclusions are not configured, the production environment can suffer outages because the antivirus quarantines a file that is accessed very frequently.

Finally, make sure you are auditing your configuration regularly. Review aspects such as security groups, access controls, firewall rules, and software compatibility (especially between the HA solution, protected applications, and antivirus). Maintain a strong log of the findings and of any changes made as a result of these audits; keeping track of these details gives you a solid record to review if a configuration change seems to be causing an issue. When requesting support from vendors, these audits can also be a fantastic tool to share to reach a full root cause analysis sooner. Most of all, the audits provide a record of how things should be configured: if anything drifts from the approved configuration, you can refer back to past audit results to re-align systems with the organization’s standard for system configuration.

SIOS understands optimizing IT systems for high availability is crucial for organizational success. By addressing compatibility challenges with antivirus solutions and fine-tuning firewall configurations, organizations can enhance system resilience and uptime. Contact us today for more information.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

Choosing Between GenApp and QSP: Tailoring High Availability for Your Critical Applications

May 17, 2024 by Jason Aw Leave a Comment


GenApp or QSP? Both solutions are supported by LifeKeeper and help protect critical applications against downtime, but understanding the nuances between them is important to choosing the correct one for your specific needs. Here are some features, benefits, and potential use cases to help you decide which may work best in your environment.

GenApp, short for Generic Application, is a resource type that allows you to manage custom applications within LifeKeeper. With this flexible framework, you can use your own scripts to perform the variety of tasks your application might require to automate the failover and recovery process. This flexibility allows granular control over how LifeKeeper handles startup, shutdown, monitoring, logging, and more to ensure your application’s high availability.

QSP, or Quick Service Protection, is designed to be a quick and easy way to protect an OS service. QSP automates the monitoring, failover, and recovery of these services, with built-in adjustable timeouts for each action. Additionally, you can create dependency relationships so that services are started and stopped in conjunction with the applications that require them.

How do I choose the right solution?  

The first thing to determine is whether your application can be recovered by stopping and restarting its service or daemon. If so, QSP is probably the best and quickest solution for keeping your application up and running: it requires no coding, and within minutes you can add the application as a QSP resource within the LifeKeeper GUI. It is also part of the core product, so any code updates are included in new product releases. However, if your application requires anything beyond simple health-check and restart capabilities at the OS service level to recover properly, you will want to explore GenApp. Creating the custom scripts for the GenApp resource type requires deeper technical skill and long-term maintenance, but the flexibility to perform whatever tasks are needed to keep your application running smoothly is critical, especially for niche applications. These tasks could be anything from monitoring, logging, and cleanup to configuration changes.
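
The QSP-style "check, restart, then escalate" behavior described above can be sketched in a few lines. This is a conceptual illustration of the control flow, not LifeKeeper’s actual implementation; the callables and the attempt limit are hypothetical.

```python
def recover(check, restart, max_attempts=3):
    """Service-level recovery in miniature: if the health check fails,
    attempt a bounded number of local restarts; if none succeed,
    report that failover to another node is needed."""
    if check():
        return "healthy"
    for _ in range(max_attempts):
        restart()
        if check():
            return "recovered"
    return "failover"
```

A GenApp, by contrast, would replace `check` and `restart` with arbitrary custom scripts: log rotation, cleanup tasks, configuration changes, or whatever else the application needs.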

Want more technical details?

GenApp and QSP are supported on both LifeKeeper for Linux and LifeKeeper for Windows; more technical details can be found at the links below.

  • GenApp for LifeKeeper for Linux
  • GenApp for LifeKeeper for Windows
  • QSP for LifeKeeper for Linux
  • QSP for LifeKeeper for Windows

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: Generic Application, High Availability, Quick Service Protection

Four Tips For Choosing The Right High Availability Solution

April 15, 2024 by Jason Aw Leave a Comment


High Availability and Lebron is the Greatest Of All Time (G.O.A.T) Debate

I was losing at Spades. I was losing at Kahoot. I was losing at a game of basketball, and all to the same friendly competitor, Brandon. So, to distract him, I went back to my go-to debate: “Lebron is the greatest of all time!” The next tension-filled minutes were filled with back-and-forth rants tinged with the names of some of basketball’s greats: Michael Jordan, Julius Erving, Wilt Chamberlain, Bob Cousy, Shaq, Bill Russell, Jerry West, Steph Curry, Kevin Durant, Kobe Bryant, Magic and Worthy, and Lebron. He jousted with, “How can you even say Lebron is the greatest? Kobe had a killer instinct!” Our verbal sparring would expand to the requirements: what makes someone part of the conversation of greatness, or even a candidate for the discussion? Do they need longevity, scoring records, defensive prowess, other accolades and honors? How many Most Valuable Player awards should they have as a minimum? What about the transcendence of their era? What about this or that? And of course, my friend Brandon is always quick to throw in titles!

How to Choose the Best High Availability Solution

But what does this have to do with high availability? Glad you asked. How often have you been asked to provide or choose the best availability solution from a sea of contenders? You’ve decided that the last weekend ruined by an unplanned application crash or a down production server was the last weekend that will be ruined by a lack of automated monitoring and recovery. But which solution is best among great names like Microsoft Failover Clustering, SUSE High Availability Extension, Pacemaker, NEC ClusterPro, VMware HA, SIOS Protection Suite, and SIOS AppKeeper? Here are four things I learned in sparring over the Greatest Of All Time that will help you with your high availability quandary.

The Requirements for HA

First, what are the requirements? If I wanted the best pure shooter of all time, I’d easily and readily include Steph Curry. If I wanted the most intimidating physical presence, I’m going with someone like Shaq. If I need the best teammate, assist leader, or all-around great, then I think Lebron James, Magic Johnson, Jerry West, and Larry Bird are in the conversation. Likewise, before you start spinning up an HA solution, understand what you need. Is data replication essential or optional? Do you need SQL Server, or are you equally inclined to use other databases? What other applications and packages are necessary? Do you need a solution that can usher you into the cloud, but first has to tame legacy, VMware, and physical systems? Will you be an all-Windows application shop, or a mixture? Think of your team as well. Do you have high turnover that makes management of multiple solutions difficult, training courses essential, and live people in support critical? Do you need ease of use, or is robustness the priority? Where do the longevity and stability of the offering, product, and company fit?

Second, how are you prioritizing your requirements? How will you prioritize the greats against the established requirements? My friend Brandon is always quick to throw in titles. He always counters: how many titles does Lebron have? Titles are king in his debate. I typically, and sarcastically, counter that even the 12th man on the bench gets a ring. I highlight the fact that Robert Horry, an outstanding power forward, has more titles than Lebron and MJ. Have frank and honest conversations about the priority of the requirements. As you pick an HA solution, how important are ease of use, OS support, and breadth of application support compared to RTO/RPO? Which features and requirements are must-haves, which are should-haves, and which are merely nice to have? As VP of Customer Experience, I once encountered a customer who insisted that the cluster software support 32 nodes, despite the fact that they had no plans to build clusters larger than two or three. Prioritize the list.

Measuring RPO and RTO for Disaster Recovery

Third, how are you measuring those requirements? How will you measure the greats against the established requirements? Stats in basketball are fun, informative, and often misleading. Brandon often reminds me to check how scoring titles were won as often as I tout how many were won. We often drop barbs about who is better to start or close the game, and how to really measure drive, intensity, and a will to win. Likewise, when you comb through the literature and pore over the proof-of-concept details, determine and define how you will measure things like RPO and RTO. Is RTO based on the client reconnect time or the time the application is restarted? Are you measuring RTO for a failover (server crash), a recovery (application crash), a manual switchover (administrative action), or all of the above? If application performance is important to you, what does that measurement look like? Is it read performance, write performance, or the client’s actual or characterized workload? Think about where benchmarks fit in, or whether they do at all. Also, be honest about what you are comparing the numbers to. Measuring for faster database query times during normal operation and on recovery is important, but what if the rest of the solution creates lags that are experienced higher up in the user experience?

Evaluating High Availability and Disaster Recovery

Lastly, keep evaluating. From the time Julius rocked the baby to sleep on the baseline, to the days when Jordan took off from the free-throw line, to the time Steph Curry shot from a step inside the half-court line, the game of basketball has been evolving. The “Jordan Rules” and “Bad Boy era” swagger have been replaced with a ruleset that favors and highlights the combination of skill, power, and finesse. Likewise, the landscape of technology is constantly changing. The solution that made the top ten when Solaris and MP-RAS servers ruled the day may not have adapted to the nimbleness of Linux, Windows, or other variants. The SAN-based solution that harnessed the power of Fibre Channel may be obsolete in the cloud and SANless world. So keep evaluating greatness. Keep monitoring how the solutions in the top ten are moving with the trends, or better yet, still making them.

While my debate with Brandon rages on, and generations from now even our children will likely not have settled on a winner, you can select the right HA solution to meet your enterprise availability needs. Contact a SIOS representative to help you understand, prioritize, and measure the SIOS Protection Suite’s ability to exceed your requirements.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: High Availability

Video: Application High Availability Will Become Universal | Predictions From SIOS Technology

February 5, 2024 by Jason Aw Leave a Comment


SIOS Technology is a high availability (HA) and disaster recovery (DR) solutions company providing application availability for mission-critical databases, applications, and services across Windows and Linux systems and a variety of cloud platforms. Cassius Rhue, VP of Customer Experience at SIOS Technology, shares his 2024 predictions.

As reliance on applications continues to rise, there will be increasing pressure on IT teams to deliver efficient high availability and disaster recovery for applications that were traditionally considered non-essential in addition to mission-critical ones. Due to this shift, we will likely see an expansion of high-availability software solutions and services to meet this expectation.

With more companies expanding into the cloud and across different operating systems, more teams are also expected to cover a diverse set of operating systems, applications, and cloud platforms. Teams will be looking for applications and solutions that are consistent across these different operating systems and cloud environments to reduce complexity and improve cost efficiency.

HA solutions will also need to be consistent across operating systems and cloud environments, and we will see a drive toward cloud-agnostic HA. Companies need HA and DR solutions to be simple, automated, quick, and intelligent. As more organizations migrate to the cloud, they will need to ensure they do not lose data in the process. HA solutions will need to bridge the gap between old systems and more modern ones.

2024 will see an increased focus on data retention, security access controls, and permissions, prompting organizations to integrate more enhanced security measures into their high availability and disaster recovery solutions, services, and strategies. As the volume of data being collected continues to increase, organizations will also need more information about why failures have occurred. Automation and orchestration tools will likely play a central role in streamlining root cause analysis and providing intelligent responses.

SIOS Technology will continue to focus on its customers in the coming year, helping them avoid and reduce downtime and ensuring their data and applications are available when the business needs them most. The company will continue to optimize its solutions, providing additional adjacent services to benefit customers, as well as helping application providers and cloud providers form an effective HA strategy.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: disaster recovery, High Availability

Ensuring Access To Critical Educational Applications

January 24, 2024 by Jason Aw Leave a Comment


Education and information technology (IT) are increasingly inextricable. Whether the IT in question is an application supporting a classroom whiteboard, the database supporting a university registration system, the learning management systems (LMS), or the building maintenance system controlling student access to the labs, dorms, and dining halls — if key components of your IT infrastructure suddenly go dark, neither teachers, administrators, nor students can accomplish what they are there to accomplish. The mission of the institution is interrupted. If the interruptions are too frequent, if the experiences of students, teachers, and administrators suffer, the reputation of the institution itself can suffer as well.

An IT infrastructure designed to ensure the high availability (HA) of applications crucial to the educational experience can minimize the risk of disruption and reputational loss that could occur if, for any reason, these systems become unresponsive. In this instance, an HA infrastructure is defined as one capable of ensuring the availability of key applications no less than 99.99% of the time. Put another way, your critical applications won’t be unexpectedly offline for more than about four minutes per month.
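
The "four minutes" figure falls straight out of the arithmetic. A small helper makes the relationship between an availability target and its monthly downtime budget explicit (using a 30-day month for simplicity):

```python
def max_monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of unplanned downtime per month permitted by an availability target."""
    return days * 24 * 60 * (1 - availability)

# 99.99% ("four nines") allows roughly 4.3 minutes per 30-day month,
# while 99.9% ("three nines") allows about 43 minutes.
```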

How do you achieve HA? That question is readily answered, but it is not the only question you need to ask. Just as important is this: Which applications are so critical that they warrant an HA configuration?

At its heart, an IT infrastructure configured for HA has one or more sets of secondary servers and storage subsystems housed in a geographically distinct location (a remote data center if your primary server resides on-premises, or a separate availability zone [AZ] if your servers reside in the cloud). If something causes the applications running on the primary server to stop responding, the HA software managing your application will immediately fail the application over to the secondary server, where your critical applications start up again from the point at which the primary server stopped responding. Depending on the size and performance characteristics of the primary server you plan to replicate, that secondary server may be costly, so it’s unlikely you’re going to configure all your academic applications for HA. Once you determine which applications warrant the investment in HA, you’ll know where you need to build out an HA environment.

Choices for Achieving High Availability

Once you’ve chosen the applications you intend to protect, your options for achieving HA become clearer. Are they running on Windows or Linux? Does your database management system (DBMS) have built-in support for an HA configuration? If so, what are its limitations? If your critical applications are running on Windows and SQL Server, for example, you could enable HA using the Availability Group (AG) feature of SQL Server itself. Alternatively, you could configure HA using a third-party SANless clustering tool, which offers options that the AG services in SQL Server do not. If you’re trying to protect database servers from multiple vendors, or if some of your critical applications run on Windows while others run on Linux, management will be easier with an HA solution that supports multiple DBMS and OS platforms. A cluster solution that accommodates diverse DBMS and OS platforms simplifies management, in contrast to the potential complexity of juggling multiple database-native HA services concurrently.

Ensuring High Availability via database-native HA solutions

If you’re using a database-native HA solution, such as the AG feature of SQL Server, the software will synchronously replicate all the data in your primary SQL Server database to an identical instance of that database on the secondary system server. If something causes the primary server to stop responding, the monitoring features in the AG component will automatically cause the secondary server to take over. Because the AG feature has replicated all the data in real time, the secondary server can take over immediately and there is virtually no interruption of service or loss of data.
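
The guarantee described here, that no acknowledged write is lost on failover, follows from acknowledging a write only after both copies have applied it. The toy model below illustrates that ordering; it is a conceptual sketch of synchronous replication, not how SQL Server AGs are implemented.

```python
class SyncReplicatedStore:
    """Toy model of synchronous replication between a primary and a secondary."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value  # replicate BEFORE acknowledging the write
        return "ack"

    def failover(self):
        """Promote the secondary; it holds every acknowledged write."""
        self.primary = self.secondary
        return self.primary
```

Because the acknowledgment comes only after the secondary has the data, a crash of the primary at any point leaves the secondary with every write the application believes succeeded.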

Many database-native HA tools operate in a similar manner. There are a few caveats, though, when considering a database-native approach: If the HA services are bundled into the DBMS itself, they may replicate only the data associated with that DBMS. If other critical data resides on your primary server, that will not be replicated to the secondary server in a database-native HA scenario. There may be other limitations on what the database-native services will replicate as well. If you use the Basic AG functionality that is bundled into SQL Server Standard Edition, for example, each AG can replicate only a single SQL database to a single secondary location. You could create multiple Basic AGs if your applications involve multiple SQL databases, but you cannot control whether each AG fails over at the same time in a failover situation — and problems may arise if they do not. One way around this limitation would be to use the Always On AG functionality bundled into SQL Server Enterprise Edition, which enables the replication of multiple SQL databases to multiple secondary servers, but that can get very expensive from a licensing perspective if your applications don’t otherwise use any of the features of SQL Server Enterprise Edition.

Other database-native HA solutions may have similar constraints, so be sure to understand them before investing in such an approach.

Ensuring High Availability via SANless Clustering

As an alternative to the database-native approach to HA, you could use a third-party tool to create a SANless cluster. Just as in the AG configuration described above, the SANless clustering software automates the synchronous replication of data from the primary to the secondary server; it also orchestrates the immediate failover to the secondary server if the primary server becomes unresponsive. Because failover takes only seconds, administrator, faculty, and student access to your critical applications will remain virtually uninterrupted.

The critical differences between the SANless clustering and a database-native approach lie in the practical details. The SANless clustering approach is database agnostic. It replicates any data on a designated storage volume. That could include multiple databases from multiple vendors, text files, video files, or any other educational asset whose availability is important. This can save an institution a considerable amount of money if a database-native approach to HA would otherwise require an upgrade to a more expensive edition of the database. Finally, as noted earlier, if you are trying to protect applications and data running in multiple operating environments, a SANless clustering approach may be more manageable than individual database–native approaches. You can use SANless clustering to ensure HA in either Windows or Linux environments, which can eliminate the complexities that could accompany the deployment of database-native approaches that differ among operating environments.

Reproduced with permission from SIOS

Filed Under: Clustering Simplified Tagged With: DR, High Availability
