SIOS SANless clusters

Archives for January 2019

Ensure High Availability for SQL Server on Amazon Web Services

January 30, 2019 by Jason Aw

Database and system administrators have long had a wide range of options for ensuring that mission-critical database applications remain highly available. Public cloud infrastructures, like those provided by Amazon Web Services, offer their own additional high availability options backed by service level agreements. But configurations that work well in a private cloud might not be possible in the public cloud, and poor choices in the AWS services used, or in how they are configured, can cause failover provisions to fail when they are actually needed. This article outlines the options available for ensuring high availability for SQL Server in the AWS cloud.

Options

For database applications, AWS gives administrators two basic choices, each with different high availability (HA) and disaster recovery (DR) provisions: Amazon Relational Database Service (RDS) and Amazon Elastic Compute Cloud (EC2).

RDS

RDS is a fully managed service suitable for mission-critical applications. It offers a choice of six different database engines, but its support for SQL Server is not as robust as it is for other engines such as Amazon Aurora, MySQL and MariaDB. Here are some of the common concerns administrators have about using RDS for mission-critical SQL Server applications (a short provisioning sketch follows the list):

  • Only a single mirrored standby instance is supported,
  • Agent Jobs are not mirrored and, therefore, must be created separately in the standby instance,
  • Failures caused by the database application software are not detected,
  • Performance-optimized in-memory database instances are not supported,
  • Depending on the Availability Zone assignments (over which the customer has no control) performance can be adversely affected,
  • SQL Server’s more expensive Enterprise Edition is needed for the data mirroring feature available only with Always On Availability Groups.
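
For readers who want to see what the RDS route looks like in practice, here is a minimal, hypothetical boto3 sketch of provisioning a Multi-AZ SQL Server instance. The identifiers, sizes and credentials are placeholders, and the MultiAZ flag is what limits you to the single mirrored standby noted above.

```python
# Hypothetical example only: identifiers, sizes and credentials are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-sqlserver",  # hypothetical name
    Engine="sqlserver-se",                     # SQL Server Standard Edition
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,                      # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",            # use a secrets manager in practice
    LicenseModel="license-included",
    MultiAZ=True,  # RDS mirrors to one standby; additional replicas are not supported
)
```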

Elastic Compute Cloud

The other basic choice is the Elastic Compute Cloud with its substantially greater capabilities. This makes it the preferred choice when HA and DR are of paramount importance. A major advantage of EC2 is the complete control it gives admins over the configuration, and that control presents some additional choices.

Picking The Operating System

Perhaps the most consequential choice is which operating system to use: Windows or Linux. Windows Server Failover Clustering is a powerful, proven and popular capability that comes standard with Windows. But WSFC requires shared storage, and that is not available in EC2. Because Multi-AZ, and even Multi-Region, configurations are required for robust HA/DR protection, separate commercial or custom software is needed to replicate data across the cluster of server instances. Microsoft’s Storage Spaces Direct (S2D) is not an option here, as it does not support configurations that span Availability Zones.

The need for additional HA/DR provisions is even greater for Linux, which lacks a fundamental clustering capability like WSFC. Linux gives admins two equally unattractive choices for high availability: either pay for the more expensive Enterprise Edition of SQL Server to implement Always On Availability Groups, or struggle to make complex do-it-yourself HA configurations built from open source software work well.

Comparisons

Both of these choices undermine the cost-saving rationale for using open source software on commodity hardware in public cloud services. SQL Server for Linux is only available for the more recent (and more expensive) versions, beginning with SQL Server 2017. And the DIY HA alternative can be prohibitively expensive for most organizations. Indeed, making Distributed Replicated Block Device, Corosync, Pacemaker and, optionally, other open source software work as desired at the application level under all possible failure scenarios can be extraordinarily difficult, which is why only very large organizations have the wherewithal (skill set and staffing) to even consider taking on the task.

Owing to the difficulty involved implementing mission-critical HA/DR provisions for Linux, AWS recommends using a combination of Elastic Load Balancing and Auto Scaling to improve availability. But these services have their own limitations that are similar to those in the managed Relational Database Service.

All of this explains why admins are increasingly choosing to use failover clustering solutions designed specifically for ensuring HA and DR protections in a cloud environment.

Failover Clustering Purpose-Built for the Cloud

The growing popularity of private, public and hybrid clouds has led to the advent of failover clustering solutions purpose-built for a cloud environment. These HA/DR solutions are implemented entirely in software that creates, as implied by the name, a cluster of servers and storage with automatic failover to assure high availability at the application level.

Most of these solutions provide a complete HA/DR package that combines real-time block-level data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more sophisticated solutions also offer advanced capabilities, such as support for Always On Failover Cluster Instances in the less expensive Standard Edition of SQL Server for both Windows and Linux, WAN optimization to maximize multi-region performance, and manual switchover of primary and secondary server assignments to facilitate planned maintenance, including the ability to perform regular backups without disruption to the application.
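
To make the monitoring-and-failover idea concrete, here is a conceptual sketch (not SIOS product code) of what continuous application monitoring with an automatic failover policy boils down to. The probe, threshold and promotion logic are stand-ins for what a real clustering agent would do.

```python
# Conceptual sketch only: continuous application monitoring with a
# simple automatic-failover policy.
import time

CHECK_INTERVAL = 5      # seconds between health probes
FAILURE_THRESHOLD = 3   # consecutive failed probes before failing over


def is_healthy(node: str) -> bool:
    """Placeholder probe; a real agent would run a test query against the
    database, check the service, the listener, replication state, etc."""
    return True  # stubbed for illustration


def promote(standby: str) -> None:
    print(f"promoting {standby} to primary and redirecting clients")


def monitor(primary: str, standbys: list[str]) -> None:
    failures = 0
    while True:
        failures = 0 if is_healthy(primary) else failures + 1
        if failures >= FAILURE_THRESHOLD and standbys:
            promote(standbys[0])  # automatic failover
            primary, standbys = standbys[0], standbys[1:] + [primary]
            failures = 0
        time.sleep(CHECK_INTERVAL)


# monitor("server-1", ["server-2", "server-3"])  # runs until interrupted
```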

Most failover clustering software is application-agnostic, enabling organizations to have a single, universal HA/DR solution. This same capability also affords protection for the entire SQL Server application, including the database, logons, agent jobs and so on, in an integrated fashion. Although these solutions are generally also storage-agnostic, enabling them to work with shared storage area networks, shared-nothing SANless failover clustering is usually preferred for its ability to eliminate potential single points of failure.

Support for Always On Failover Cluster Instances (FCIs) in the less expensive Standard Edition of SQL Server, with no compromises to availability or performance, is a major advantage. In a Windows environment, most failover clustering software supports FCIs by leveraging the built-in WSFC feature, which makes the implementation quite straightforward for both database and system administrators. Linux is becoming increasingly popular for SQL Server and many other enterprise applications, and some failover clustering solutions now make implementing HA/DR provisions there just as easy as it is on Windows by offering application-specific integration.

Typical Three-Node SANless Failover Cluster

The example EC2 configuration in the diagram shows a typical three-node SANless failover cluster configured within a Virtual Private Cloud (VPC), with all three SQL Server instances in different Availability Zones. To protect against a local disaster taking an entire region offline, one of the AZs is located in a different AWS region.

This three-node SANless failover cluster, with one active and two standby server instances, can handle two concurrent failures with minimal downtime and no data loss.

A three-node SANless failover cluster affords carrier-class HA and DR protection. The basic operation is the same across the LAN and/or WAN for Windows or Linux. Server #1 is initially the primary or active instance, replicating data continuously to both servers #2 and #3. If server #1 experiences a problem, an automatic failover is triggered to server #2, which then becomes the primary and replicates data to server #3.

Failure Detected

If the failure was caused by an infrastructure outage, the AWS staff would immediately begin diagnosing and repairing whatever caused the problem. Once it is fixed, server #1 could be restored as the primary, or server #2 could continue in that capacity, replicating data to servers #1 and #3. Should server #2 fail before server #1 is returned to operation, as shown, server #3 would become the primary after a manual failover. Of course, if the failure was caused by the application software or certain other aspects of the configuration, it would be up to the customer to find and fix the problem.

SANless failover clusters can be configured with only a single standby instance, of course. But such a minimal configuration does require a third node to serve as a witness. The witness is needed to achieve a quorum for determining the assignment of the primary. This important task is normally performed by a domain controller in a separate AZ. Keeping all three nodes (primary, secondary and witness) in different AZs eliminates the possibility of losing more than one vote if any zone goes offline.
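
The quorum arithmetic is easy to verify. The short sketch below is a toy example rather than any product's voting implementation; it shows that with three votes spread across three AZs, the loss of any single zone still leaves a two-vote majority.

```python
# Toy majority-vote check: three votes spread across three Availability Zones.
def has_quorum(votes_up: int, total_votes: int) -> bool:
    return votes_up > total_votes // 2


nodes = {"primary": "AZ-a", "secondary": "AZ-b", "witness": "AZ-c"}

for failed_az in ("AZ-a", "AZ-b", "AZ-c"):
    surviving = [name for name, az in nodes.items() if az != failed_az]
    print(f"{failed_az} down -> {surviving} "
          f"quorum={has_quorum(len(surviving), len(nodes))}")
# Every case keeps 2 of 3 votes, so the cluster can still elect a primary.
```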

It is also possible to have two- and three-node SANless failover clusters in hybrid cloud configurations for HA and/or DR purposes. One such three-node configuration is a two-node HA cluster located in an enterprise data center with asynchronous data replication to AWS or another cloud service for DR protection—or vice versa.

In clusters within a single region, where data replication is synchronous, failovers are normally configured to occur automatically. For clusters with nodes that span multiple regions, where data replication is asynchronous, failovers are normally controlled manually to avoid the potential for data loss. Three-node clusters, regardless of the regions used, can also facilitate planned hardware and software maintenance for all three servers while providing continuous DR protection for the application and its data.

Maximise High Availability for SQL Server

By offering 55 Availability Zones spread across 18 geographic Regions, the AWS Global Infrastructure affords enormous opportunity to maximize high availability for SQL Server by configuring SANless failover clusters with multiple, geographically dispersed redundancies. This global footprint also enables SQL Server applications and data to be located near end users to deliver satisfactory performance.
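
As a starting point for planning such geographically dispersed placements, a short boto3 sketch like the one below can enumerate the Regions and Availability Zones visible to an account. The counts grow over time, so current output will differ from the 2019 figures above.

```python
# Assumes AWS credentials are configured for the account being inspected.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
region_names = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in region_names:
    zones = boto3.client("ec2", region_name=region).describe_availability_zones()
    az_names = [z["ZoneName"] for z in zones["AvailabilityZones"]]
    print(f"{region}: {len(az_names)} AZs {az_names}")

print(f"{len(region_names)} regions visible to this account")
```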

With a purpose-built solution, carrier-class high availability need not mean paying a carrier-like high cost. Because purpose-built failover clustering software makes effective and efficient use of EC2's compute, storage and network resources, while being easy to implement and operate, these solutions minimize both capital and operational expenditures, making high availability more robust and more affordable than ever before.

Reproduced from TheNewStack

Filed Under: News and Events Tagged With: High Availability, high availability for sql server, SQL Server

Options for When Public Cloud Service Levels Fall Short

January 27, 2019 by Jason Aw

All public cloud service providers offer some form of guarantee regarding availability, typically ranging from 95.00% to 99.99% uptime during the month. These guarantees may or may not be sufficient, depending on each application's requirement for uptime. Most impose some type of "penalty" on the service provider for falling short of those thresholds.

Most cloud service providers offer a 99.00% uptime threshold, which equates to about seven hours of downtime per month. For many applications, those two 9's might be enough. But for mission-critical applications, more 9's are needed, especially given that many common causes of downtime are excluded from the guarantee.
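
The arithmetic behind those figures is straightforward, as the short Python sketch below shows for a 30-day month.

```python
# Allowed downtime per 30-day month at common SLA thresholds.
HOURS_PER_MONTH = 30 * 24  # 720

for sla_percent in (95.00, 99.00, 99.90, 99.99, 99.999):
    downtime_hours = HOURS_PER_MONTH * (1 - sla_percent / 100)
    print(f"{sla_percent:>7}% uptime -> "
          f"{downtime_hours * 60:8.1f} minutes of downtime/month")

# 99.00% ("two 9s") allows roughly 7.2 hours per month, in line with the
# "about seven hours" above; 99.999% ("five 9s") allows under half a minute.
```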

There are, of course, cost-effective ways to achieve five-9’s high availability and robust disaster recovery protection in configurations using public cloud services, either exclusively or as part of a hybrid arrangement. This article highlights limitations involving HA and DR provisions in the public cloud. It explores three options for overcoming these limitations, and describes two common configurations for failover clusters.

Caveat Emptor in the Cloud

While all cloud service providers (CSPs) define “downtime” or “unavailable” somewhat differently, these definitions include only a limited set of all possible causes of failures at the application level. Generally included are failures affecting a zone or region, or external connectivity. All CSPs also offer credits ranging from 10% for failing to meet four-9’s of uptime to around 25% for failing to meet two-9’s of uptime.

Redundant resources can be configured to span the zones and/or regions within the CSP's infrastructure, which helps to improve application-level availability. But even with such redundancy, some limitations remain that are often unacceptable for mission-critical applications, especially those requiring high transactional throughput. These limitations include each master being able to create only a single failover replica, the use of the master dataset for backups, and the use of event logs to replicate data. These and other limitations can increase recovery time during a failure and make it necessary to schedule at least some planned downtime.

Significant Limitations

The more significant limitations involve the many exclusions to what constitutes downtime. Here are just a few examples, taken from actual public cloud service level agreements, of causes of application-level failures that are excluded from "downtime" or "unavailability":

  • factors beyond the CSP’s reasonable control (in other words, some of the stuff that happens regularly, such as carrier network outages and natural disasters)
  • the customer’s software, or third-party software or technology, including application software
  • faulty input or instructions, or any lack of action when required (in other words, the inevitable mistakes caused by human fallibility)
  • problems with individual instances or volumes not attributable to specific circumstances of “unavailability”
  • any hardware or software maintenance as provided for pursuant to the agreement

To be sure, it is reasonable for CSPs to exclude certain causes of failure. But it would be irresponsible for system administrators to use these exclusions as excuses; application-level availability must be ensured by some other means.

What Public Cloud Service Levels Are Available?

Provisioning resources for high availability in a way that does not sacrifice security or performance has never been a trivial endeavor. The challenge is especially difficult in a hybrid cloud environment, where the private and public cloud infrastructures can differ significantly, making configurations difficult to test and maintain and potentially resulting in failover provisions that fail when actually needed.

For applications where the service levels offered by the CSP fall short, there are three additional options available based on the application itself, features in the operating system, or through the use of purpose-built failover clustering software.

Three Options for Improving Application-level Availability

HA/DR

The HA/DR options that might appear to be the easiest to implement are those specifically designed for each application. A good example is Microsoft’s SQL Server database with its carrier-class Always On Availability Groups feature. There are two disadvantages to this approach, however. The higher licensing fees, in this case for the Enterprise Edition, can make it prohibitively expensive for many needs. And the more troubling disadvantage is the need for different HA/DR provisions for different applications, which makes ongoing management a constant (and costly) struggle.

Uptime-Related Features Integrated Into The Operating System

The second option for improving availability beyond the public cloud service levels involves using uptime-related features integrated into the operating system. Windows Server Failover Clustering, for example, is a powerful and proven feature that is built into the OS. But on its own, WSFC might not provide a complete HA/DR solution because it lacks a data replication feature. In a private cloud, data replication can be provided using some form of shared storage, such as a storage area network. But because shared storage is not available in public clouds, implementing robust data replication requires using separate commercial or custom-developed software.

For Linux, which lacks a feature like WSFC, the need for additional HA/DR provisions and/or custom development is considerably greater. Using open source software like Pacemaker and Corosync requires creating (and testing) custom scripts for each application. These scripts often need to be updated and retested after even minor changes are made to any of the software or hardware being used. But because getting the full HA stack to work well for every application can be extraordinarily difficult, only very large organizations have the wherewithal needed to even consider taking on the effort.

Purpose-Built Failover Cluster

Ideally there would be a “universal” approach to HA/DR capable of working cost-effectively for all applications running on either Windows or Linux across public, private and hybrid clouds. Among the most versatile and affordable of such solutions is the third option: the purpose-built failover cluster. These HA/DR solutions are implemented entirely in software that is designed specifically to create, as their designation implies, a cluster of virtual or physical servers and data storage with failover from the active or primary instance to a standby to assure high availability at the application level.

Benefits Of These Solutions

These solutions provide, at a minimum, a combination of real-time data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more robust ones offer additional advanced capabilities, such as a choice of block-level synchronous or asynchronous replication, support for Failover Cluster Instances (FCIs) in the less expensive Standard Edition of SQL Server, WAN optimization for enhanced performance and minimal bandwidth utilization, and manual switchover of primary and secondary server assignments to facilitate planned maintenance.

Although these general-purpose solutions are generally storage-agnostic, enabling them to work with storage area networks, shared-nothing SANless failover clusters are normally preferred based on their ability to eliminate potential single points of failure.

Two Common Failover Clustering Configurations

Every failover cluster consists of two or more nodes. Locating at least one of the nodes in a different datacenter is necessary to protect against local disasters. Presented here are two popular configurations: one for disaster recovery purposes; the other for providing both mission-critical high availability and disaster recovery. High transactional performance is often a requirement for highly available configurations. The example application is a database.

The basic SANless failover cluster for disaster recovery has two nodes with one primary and one secondary or standby server or server instance. This minimal configuration also requires a third node or instance to function as a witness, which is needed to achieve a quorum for determining assignment of the primary. For database applications, replication to the standby instance across the WAN is asynchronous to maintain high performance in the primary instance.

The SANless failover cluster affords rapid recovery in the event of a failure in the primary, resulting in a basic DR configuration suitable for many applications. It is capable of detecting virtually all possible failures, including those not counted as downtime in public cloud services, and it works in private, public and hybrid cloud environments.

For example, the primary could be in the enterprise datacenter with the secondary deployed in the public cloud. Because the public cloud instance would be needed only during planned maintenance of the primary or in the event of its failure—conditions that can be fairly quickly remedied—the service limitations and exclusions cited above may well be acceptable for all but the most mission-critical of applications.

Three-Node SANless Failover Clusters

This three-node SANless failover cluster has one active and two standby server instances. It is capable of handling two concurrent failures with minimal downtime and no data loss.

The figure shows an enhanced three-node SANless failover cluster that affords both five-9's high availability and robust disaster recovery protection. As with the two-node cluster, this configuration will also work in a private, public or hybrid cloud environment. In this example, servers #1 and #2 are located in an enterprise datacenter with server #3 in the public cloud. Within the datacenter, replication across the LAN can be fully synchronous to minimize the time it takes to complete a failover and thereby maximize availability.

When properly configured, three-node SANless failover clusters afford truly carrier-class HA and DR. The basic operation is application-agnostic and works the same for Windows or Linux. Server #1 is initially the primary or active instance that replicates data continuously to both servers #2 and #3. If it experiences a failure, the application would automatically failover to server #2, which would then become the primary replicating data to server #3.

Recovery

Immediately after a failure in server #1, the IT staff would begin diagnosing and repairing whatever caused the problem. Once fixed, server #1 could be restored as the primary with a manual failback, or server #2 could continue functioning as the primary replicating data to servers #1 and #3. Should server #2 fail before server #1 is returned to operation, as shown, server #3 would become the primary. Because server #3 is across the WAN in the public cloud, data replication is asynchronous and the failover is manual to prevent “replication lag” from causing the loss of any data.

SANless failover clustering software is able to detect virtually all possible failures at the application level, readily overcoming the CSP limitations and exclusions mentioned above and making it possible for this three-node configuration to be deployed entirely within the public cloud. To afford the same five-9's high availability based on immediate and automatic failovers, servers #1 and #2 would need to be located within a single zone or region where the LAN facilitates synchronous replication.

For appropriate DR protection, server #3 should be located in a different datacenter or region, where the use of asynchronous replication and manual failover/failback would be needed for applications requiring high transactional throughput. Three-node clusters can also facilitate planned hardware and software maintenance for all three servers while continuing to provide DR protection for the application and its data.

By offering multiple, geographically dispersed datacenters, public clouds afford numerous opportunities to improve availability and enhance DR provisions. SANless failover clustering software makes effective and efficient use of compute, storage and network resources while being easy to implement and operate. These purpose-built solutions minimize both capital and operational expenditures, resulting in high availability that is more robust and more affordable than ever before.

# # #

About the Author

Cassius Rhue is Director of Engineering at SIOS Technology.  He leads the software product development and engineering team in Lexington, SC. Cassius has over 17 years of software engineering, development and testing experience. He also holds a BS in Computer Engineering from the University of South Carolina.

Article from DRJ.com

Filed Under: News and Events Tagged With: public cloud service levels

Managing Cost of Cloud for High-Availability Applications

January 23, 2019 by Jason Aw

Shortly after contracting with a cloud service provider, a bill arrives that causes sticker shock. There are unexpected and seemingly excessive charges, and those responsible seem unable to explain how this could have happened. The situation is urgent because the amount threatens to bust the IT budget unless cost-saving changes are made immediately. So how do we manage the cost of cloud for high-availability applications?

This cloud services sticker shock is often caused by mission-critical database applications, which tend to be the most costly for a variety of reasons. These applications need to run 24/7. They require redundancy, which involves replicating the data and provisioning standby server instances. Data replication requires data movement, including across the wide area network (WAN). And providing high availability can mean higher costs to license Windows to get Windows Server Failover Clustering (versus using free open source Linux), or to license the Enterprise Edition of SQL Server to get Always On Availability Groups.

Before offering suggestions for managing the cost of cloud for high-availability applications, it is important to note that the goal here is not to minimize those costs, but rather to optimize the price/performance for each application. In other words, it is appropriate to pay more when provisioning resources for those applications that require higher uptime and throughput performance. It is also important to note that a hybrid cloud infrastructure, with applications running in whole or in part in both the private and public cloud, will likely be the best way to achieve optimal price/performance.

Understanding Cloud Service Provider Business And Pricing Models

The sticker shock experience demonstrates the need to thoroughly understand how cloud services are priced before the cost of cloud for high-availability applications can be managed. Only then can the available services be utilized in the most cost-effective manner.

All cloud service providers (CSPs) publish their pricing. Unless specified in the service agreement, that pricing is constantly changing. All hardware-based resources, including physical and virtual compute, storage, and networking services, inevitably have some direct or indirect cost. These are all based to some extent on the space, power, and cooling these systems consume. For software, open source is generally free. But all commercial operating systems and/or application software will incur a licensing fee. And be forewarned that some software licensing and pricing models can be quite complicated. So be sure to study them carefully.

In addition to these basic charges for hardware and software, there are potential à la carte costs for various value-added services. This includes security, load-balancing, and data protection provisions. There may also be "hidden" costs for I/O to storage or among distributed microservices, or for peak utilization that occurs only rarely during "bursts."

Because every CSP has its own unique business and pricing model, the discussion here must be generalized. And, in general, the most expensive resources involve compute, software licensing, and data movement. Together they can account for 80% or more of the total costs. Data movement might also incur separate WAN charges that are not included in the bill from the CSP.

Storage and networking within the CSP’s infrastructure are usually the least costly resources. Solid state drives (SSDs) normally cost more than spinning media on a per-terabyte basis. But SSDs also deliver superior performance, so their price/performance may be comparable or even better. And while moving data back to the enterprise can be expensive, moving data from the enterprise to the public cloud can usually be done cost-free (notwithstanding the separate WAN charges).
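
A rough, illustrative calculation shows why the SSD can still win on price/performance; the dollar and IOPS figures below are placeholders, not quotes from any CSP.

```python
# Illustrative price/performance comparison; figures are placeholders.
options = {
    "spinning disk": {"usd_per_tb_month": 45.0, "iops": 500},
    "ssd":           {"usd_per_tb_month": 100.0, "iops": 3000},
}

for name, o in options.items():
    cost_per_iops = o["usd_per_tb_month"] / o["iops"]
    print(f"{name:14s} ${o['usd_per_tb_month']:6.2f}/TB-month "
          f"-> ${cost_per_iops:.3f} per provisioned IOPS")

# Even at roughly twice the per-terabyte price, the SSD's much higher IOPS
# can make its price/performance comparable or better, as noted above.
```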

Formulating Strategies For Optimizing Price/Performance

Managing the cost of cloud for high-availability applications requires meticulous attention. Here are some suggestions for managing resource utilization in the public cloud in ways that can lower costs while maintaining appropriate service levels for all applications, including those that require mission-critical uptime and high throughput.

In general, right-sizing is the foundational principle for managing resource utilization for optimal price/performance. When Willie Sutton was purportedly asked why he robbed banks, he replied, “Because that’s where the money is”. In the cloud, the money is in compute resources, so that should be the highest priority for right-sizing.

For new applications, start with minimal virtual machine configurations for compute resources. Add CPU cores, memory and/or I/O only as required to achieve satisfactory performance. All virtual machines for existing applications should eventually be right-sized. Begin with those that cost the most. Reduce allocations gradually while monitoring performance constantly until achieving diminishing returns.
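
As an illustration of that monitor-and-reduce loop, here is a hedged boto3 sketch that checks an instance's average CPU utilization over the past two weeks and flags it as a right-sizing candidate. The instance ID and the 20% threshold are assumptions, not recommendations.

```python
# Hedged right-sizing check: flag instances with low average CPU utilization.
import datetime
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical
cw = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - datetime.timedelta(days=14),
    EndTime=now,
    Period=3600,
    Statistics=["Average"],
)

points = [p["Average"] for p in stats["Datapoints"]]
avg_cpu = sum(points) / len(points) if points else 0.0

if avg_cpu < 20.0:
    print(f"{INSTANCE_ID}: avg CPU {avg_cpu:.1f}% -> candidate for a smaller type")
else:
    print(f"{INSTANCE_ID}: avg CPU {avg_cpu:.1f}% -> leave as is")
```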

It is worth noting that a major risk associated with right-sizing is the potential for under-sizing, which can result in unacceptably poor performance. Unfortunately, the best way to assess an application's actual performance is with a production workload, making the real world the right place to right-size. Fortunately, the cloud mitigates this risk by making it easy to quickly resize configurations on demand. So right-size aggressively where needed, but be prepared to react quickly in response to each change.

Storage, in direct contrast to compute, is generally relatively inexpensive in the cloud. But be careful using cheap storage, because I/O might incur a separate—and costly—charge with some services. If so, make use of potentially more cost-effective performance-enhancing technologies such as tiered storage, caching, and/or in-memory databases, where available, to optimize the utilization of all resources.

Software licenses can be a significant expense in both private and public clouds. For this reason, many organizations are migrating from Windows to Linux, and from SQL Server to less-expensive commercial and/or open source databases. But for those applications for which “premium” operating system and/or application software is warranted, check different CSPs to see if any pricing models might afford some savings for the configurations required.

Finally, all CSPs offer discounts, and combinations of these can sometimes achieve a savings of up to 50%. Examples include pre-paying for services, making service commitments, and/or relocating applications to another region.

Creating And Enforcing Cost Containment Controls

Self-provisioning for cloud services might be popular with users. But without appropriate controls, this convenience makes it too easy to over-utilize resources, including those that cost the most.

Begin the effort to gain better control by taking full advantage of the monitoring and management tools all CSPs offer. Expect a learning curve, of course, because the CSP's tools may be very different from, and potentially more sophisticated than, those being used in the private cloud.

One of the more useful cost containment tools involves the tagging of resources. Tags consist of key/value pairs and metadata associated with individual resources, and some can be quite granular. For example, each virtual machine, along with the CPU, memory, I/O, and other billable resources it uses, might have a tag. Other useful tags might show which applications are in a production versus development environment, or to which cost center or department each is assigned. Collectively, these tags could constitute the total utilization of resources reflected in the bill.

Organizations that make extensive use of public cloud services might also be well served to create a script that loads information from all available monitoring, management, and tagging tools into a spreadsheet or similar application for detailed analyses and other uses, such as chargeback, compliance, and trending/budgeting (a sketch of such a script appears below). Ideally, information from all CSPs and the private cloud would be normalized for inclusion in a holistic view to enable optimizing price/performance for all applications running throughout the hybrid cloud.
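
A hedged sketch of such a script, using AWS as the example CSP: it pulls last month's spend grouped by a cost-center tag from the Cost Explorer API into a CSV for further analysis. The tag key and the dates are assumptions.

```python
# Hedged example: export monthly cost grouped by a (hypothetical) CostCenter tag.
import csv
import boto3

ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-01-01", "End": "2019-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "CostCenter"}],  # hypothetical tag key
)

with open("cloud_costs_by_cost_center.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["cost_center", "usd"])
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            tag_value = group["Keys"][0]  # e.g. "CostCenter$finance"
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            writer.writerow([tag_value, amount])
```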

Handling The Worst-Case Use Case: High Availability Applications

In addition to the reasons cited in the introduction for why high-availability applications are often the most costly, all three major CSPs—Google, Microsoft, and Amazon—have at least some high availability-related limitations. Examples include failovers normally being triggered only by zone outages and not by many other common failures; master instances only being able to create a single failover replica; and the use of event logs to replicate data, which creates a “replication lag” that can result in temporary outages during a failover.

None of these limitations is insurmountable, of course, given a sufficiently large budget. The challenge is finding a common and cost-effective solution for implementing high availability across public, private, and hybrid clouds. Among the most versatile and affordable of such solutions is the storage area network (SAN)-less failover cluster. These high-availability solutions are implemented entirely in software that is purpose-built to create, as the name implies, a shared-nothing cluster of servers and storage with automatic failover across the local area network and/or WAN to assure high availability at the application level. Most of these solutions provide a combination of real-time block-level data replication, continuous application monitoring, and configurable failover/failback recovery policies.

Some of the more robust SAN-less failover clusters also offer advanced capabilities, for example WAN optimization to maximize performance and minimize bandwidth utilization, support for the less expensive Standard Edition of SQL Server, manual switchover of primary and secondary server assignments for planned maintenance, and the ability to perform routine backups without disruption to the applications.

Maintaining The Proper Perspective

While trying out some of these suggestions in your hybrid cloud, endeavor to keep the monthly CSP bill in its proper perspective. With the public cloud, all costs appear on a single invoice. By contrast, the total cost to operate a private cloud is rarely presented in such a complete, consolidated fashion. And if it were, that total cost might also cause sticker shock. A useful exercise, therefore, might be to understand the all-in cost of operating the private cloud—taking nothing for granted—as if it were a standalone business such as that of a cloud service provider. Then those bills from the CSP for your mission-critical applications might not seem so shocking after all.

Article from www.dbta.com

Filed Under: News and Events Tagged With: Cloud, cost of cloud for high availability applications, High Availability

Challenges in High Availability For Mission-Critical Applications in SME

January 20, 2019 by Jason Aw

Survey On State Of Performance And High Availability For Mission-Critical Applications In SME

According to a new survey by SIOS Technology, in partnership with ActualTech Media Research, a full 98% of cloud deployments experience some type of performance issue every year.

The survey was designed to understand current challenges and trends related to the state of performance and high availability for mission-critical applications in small, medium and large companies. A total of 390 IT professionals and decision-makers responded, collectively representing a cross-section of those responsible for managing databases, infrastructure, architecture, cloud services and software development. Tier-1 applications explicitly identified include Oracle, Microsoft SQL Server and SAP/HANA.

There are some clear trends, along with a few surprises that we didn't see coming and that might surprise you as well.

■ Small companies are leading the way to the public cloud with 54% planning to move more than half their mission-critical applications there by the end of 2018, which compares to 42% of large companies

■ For companies of all sizes, having complete control over the application environment was cited by 60% of the respondents as a key reason for why their mission-critical workloads remain on premises

■ Most (86%) organizations are using some form of failover clustering or other high availability mechanism for their mission-critical applications

■ Almost as many (95%) report having experienced a failure in their failover provisions

It's evident that organizations are finally moving their critical applications to the cloud, and at a greater pace than we could have imagined a few years ago. But they're still in the early days of adoption, placing mature operations a few years away. Here are some more details.

Misery Loves Company

A mere 2% of respondents claimed they never experience any application performance issues that affect end users. The rest of us mere mortals do experience such issues. Here is how often, on average:

  • Daily (18%)
  • 2-3 times per week (17%)
  • Once per week (10%)
  • 2-3 times per month (15%)
  • Once per month (11%)
  • 3-5 times per year (18%)
  • Only once per year (8%)

The responses were reasonably consistent among Decision Makers, IT Staff, and Data & Development Staff with one notable exception: Decision Makers perceive a lower occurrence of performance issues than staff does. Nearly half (46%) of Decision Makers responded that performance issues occur 3-5 times per year or less (compared to 23-25% for staff). Only 11% responded that issues occur daily (compared to 20-21% for staff).

Rapid Response to the Rescue

One possible explanation for this apparent discrepancy is that IT staff are made aware of performance problems by an automated alert, followed by a rapid response to find and fix the cause.

The survey asked about high availability provisions failing (something that is certain to affect performance!). 77% learn of the problem via an alert from monitoring tools. Another 39% learn from a user complaint. (Note that multiple responses were permitted.)

As for remediation, it takes more than 5 hours to fix a problem only 3% of the time. Nearly a quarter (23%) are fixed in less than an hour and over half (56%) are fixed in 1-3 hours. Finally, 18% are fixed in 3-5 hours. Small companies are able to resolve problems more quickly (31% in less than an hour) than large ones (only 11% in less than an hour). This is likely because the former utilize the public cloud more extensively and have less complex configurations.

Culprits In The Cloud

When asked about the cause of performance issues that arise in the cloud, the main culprits are the application or the database being used. Together they account for 64% of the issues. It is important to note that this question did not distinguish who is responsible for managing the application and/or database, which would likely be the cloud service provider for a managed service. Additional causes include issues with the service provider (17%) or the infrastructure (15%). In 4% of the cases, the issue remained a mystery.

Written by Jerry Melnick, President and CEO of SIOS Technology
Reproduced from APMdigest

Filed Under: News and Events Tagged With: High Availability

Managing a Real-Time Recovery in a Major Cloud Outage

January 19, 2019 by Jason Aw

Disasters happen, making sudden downtime a reality. But there are things all customers can do to survive virtually any cloud outage.

Stuff happens. Failures—both large and small—are inevitable. What is not inevitable is extended periods of downtime.

Consider the day the South Central US Region of Microsoft’s Azure cloud experienced a catastrophic failure. A severe thunderstorm led to a cascading series of problems that eventually knocked out an entire data center. In what some have called “The Day the Azure Cloud Fell from the Sky,” most customers were offline, not just for a few seconds or minutes, but for a full day. Some were offline for over two days. While Microsoft has since addressed the many issues that led to the outage, the incident will long be remembered by IT professionals.

That's the bad news. The good news is that there are things all Azure customers can do to survive virtually any outage, from a single server failing to an entire data center going offline. In fact, Azure customers who implement robust high-availability and/or disaster recovery provisions, complete with real-time data replication and rapid, automatic failover, can expect to experience no data loss and little or no downtime whenever catastrophe strikes.

Managing The Cloud Outage

This article examines four options for providing disaster recovery (DR) and high availability (HA) protections in hybrid and purely Azure cloud configurations. Two of the options are specific to the Microsoft SQL Server database, which is a popular application in the Azure cloud; the other two options are application-agnostic. The four options, which can also be used in various combinations, are compared in the table and include:

  • The Azure Site Recovery (ASR) Service
  • SQL Server Failover Cluster Instances with Storage Spaces Direct
  • SQL Server Always On Availability Groups
  • Third-party Failover Clustering Software

(Table: comparison of the four DR and HA options)

RTO and RPO 101

Before describing the four options, it is necessary to have a basic understanding of the two metrics used to assess the effectiveness of DR and HA provisions: Recovery Time Objective and Recovery Point Objective. Those familiar with RTO and RPO can skip this section.

RTO is the maximum tolerable duration of an outage. Online transaction processing applications generally have the lowest RTOs, and those that are mission-critical often have an RTO of only a few seconds. RPO is the maximum period during which data loss can be tolerated. If no data loss is tolerable, then the RPO is zero.

The RTO will normally determine the type of HA and/or DR protection needed. Low recovery times usually demand robust HA provisions that protect against routine system and software failures, while longer RTOs can be satisfied with basic DR provisions designed to protect against more widespread, but far less frequent disasters.

The data replication used with HA and DR provisions can create the need for a potential tradeoff between RTO and RPO. In a low-latency LAN environment, where replication can be synchronous, the primary and secondary datasets can be updated concurrently. This enables full recoveries to occur automatically and in real-time, making it possible to satisfy the most demanding recovery time and recovery point objectives (a few seconds and zero, respectively) with no tradeoff necessary.

Across the WAN, by contrast, forcing the primary to wait for the secondary to confirm the completion of updates for every transaction would adversely impact performance. For this reason, data replication in the WAN is usually asynchronous. This can create a tradeoff between accommodating RTO and RPO that normally results in an increase in recovery times. Here's why: to satisfy an RPO of zero, manual processes are needed to ensure all data (e.g. from a transaction log) has been fully replicated on the secondary before the failover can occur. This extra effort lengthens the recovery time, which is why such configurations are often used for DR and not HA.
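
A back-of-the-envelope calculation illustrates the tradeoff; the latency figures below are illustrative, not measurements of any particular cloud.

```python
# With synchronous replication the primary waits one round trip per commit,
# so link latency lands directly on transaction time. Figures are illustrative.
LOCAL_COMMIT_MS = 1.0  # time to write locally

for link, round_trip_ms in (("LAN (same zone)", 0.5),
                            ("WAN (cross-region)", 70.0)):
    sync_commit_ms = LOCAL_COMMIT_MS + round_trip_ms  # wait for the replica
    async_commit_ms = LOCAL_COMMIT_MS                 # replica catches up later
    print(f"{link:20s} sync commit ~{sync_commit_ms:5.1f} ms, "
          f"async commit ~{async_commit_ms:4.1f} ms "
          f"(async risks losing in-flight transactions on failover)")
```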

Azure Site Recovery (ASR) Service

ASR is Azure’s DR-as-a-service (DRaaS) offering. ASR replicates both physical and virtual machines to other Azure sites, potentially in other regions, or from on-premises instances to the Azure cloud. The service delivers a reasonably rapid recovery from system and site outages, and also facilitates planned maintenance by eliminating downtime during rolling software upgrades.

Like all DRaaS offerings, ASR has some limitations, the most serious being the inability to automatically detect and failover from many failures that cause application-level downtime. Of course, this is why the service is characterized as being for DR and not for HA.

With ASR, recovery times are typically 3-4 minutes depending, of course, on how quickly administrators are able to manually detect and respond to a problem. As described above, the need for asynchronous data replication across the WAN can further increase recovery times for applications with an RPO of zero.

SQL Server Failover Cluster Instance with Storage Spaces Direct

SQL Server offers two of its own HA/DR options: Failover Cluster Instances (discussed here) and Always On Availability Groups (discussed next).

FCIs afford two advantages: The feature is available in the less expensive Standard Edition of SQL Server, and it does not depend on having shared storage like traditional HA clusters do. This latter advantage is important because shared storage is simply not available in the cloud—from Microsoft or any other cloud service provider.

A popular choice for storage in the Azure cloud is Storage Spaces Direct (S2D), which supports a wide range of applications, and its support for SQL Server protects the entire instance and not just the database. A major disadvantage of S2D is that the servers must reside within a single data center, making this option suitable for some HA needs but not for DR. For multi-site HA and DR protections, the requisite data replication will need to be provided by either log shipping or a third-party failover clustering solution.

SQL Server Always On Availability Groups

While Always On Availability Groups is SQL Server’s most capable offering for both HA and DR, it requires licensing the more expensive Enterprise Edition. This option is able to deliver a recovery time of 5-10 seconds and a recovery point of seconds or less. It also offers readable secondaries for querying the databases (with appropriate licensing), and places no restrictions on the size of the database or the number of secondary instances.

An Always On Availability Groups configuration that provides both HA and DR protections consists of a three-node arrangement with two nodes in a single Availability Set or Zone, and the third in a separate Azure Region. One notable limitation is that only the database is replicated and not the entire SQL instance, which must be protected by some other means.

In addition to being cost-prohibitive for some database applications, this approach has another disadvantage: being application-specific, it requires IT departments to implement other HA and DR provisions for all other applications. The use of multiple HA/DR solutions can substantially increase complexity and costs (for licensing, training, implementation and ongoing operations), making this another reason why organizations increasingly prefer using application-agnostic third-party solutions.

Third-party Failover Clustering Software

With its application-agnostic and platform-agnostic design, failover clustering software is able to provide a complete HA and DR solution for virtually all applications in private, public and hybrid cloud environments, for both Windows and Linux.

Being application-agnostic eliminates the need for having different HA/DR provisions for different applications. Being platform-agnostic makes it possible to leverage various capabilities and services in the Azure cloud, including Fault Domains, Availability Sets and Zones, Region Pairs, and Azure Site Recovery.

As complete solutions, the software includes, at a minimum, real-time data replication, continuous monitoring capable of detecting failures at the application level, and configurable policies for failover and failback. Most solutions also offer a variety of value-added capabilities that enable failover clusters to deliver recovery times below 20 seconds with minimal or no data loss to satisfy virtually all HA/DR needs.
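
One way to use the recovery figures cited in this article is to encode them and filter by an application's RTO. The sketch below does exactly that, using only the ballpark numbers given above; the FCI with Storage Spaces Direct option is omitted because no recovery time is cited for it.

```python
# Recovery-time ballpark figures as cited in this article (seconds).
OPTIONS_RTO_SECONDS = {
    "Azure Site Recovery (DRaaS)": 240,       # "typically 3-4 minutes"
    "Always On Availability Groups": 10,      # "5-10 seconds", Enterprise Edition
    "Third-party failover clustering": 20,    # "below 20 seconds"
}


def candidates(required_rto_seconds: int) -> list[str]:
    """Options whose cited recovery time fits within the application's RTO."""
    return [name for name, rto in OPTIONS_RTO_SECONDS.items()
            if rto <= required_rto_seconds]


print(candidates(30))    # tight RTO: Availability Groups or third-party clustering
print(candidates(600))   # relaxed RTO: all three listed options qualify
```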

Making It Real

All four options, whether operating separately or in concert, can have roles to play in making the continuum of DR and HA protections more effective and affordable for the full spectrum of enterprise applications, from those that can tolerate some data loss and extended periods of downtime to those that require real-time recovery to achieve five-9's of uptime with minimal or no data loss.

To survive the next cloud outage in the real world, make certain that whatever DR and/or HA provisions you choose are configured with at least two nodes spread across two sites. Also be sure to understand how well those provisions satisfy each application's recovery time and recovery point objectives, as well as any limitations that might exist, including the manual processes required to detect all possible failures and to trigger failovers in ways that ensure both application continuity and data integrity.

About Jonathan Meltzer

Jonathan Meltzer is Director, Product Management, at SIOS Technology. He has over 20 years of experience in product management and marketing for software and SaaS products that help customers manage, transform, and optimize their human capital and IT resources.

Reproduced from RTinsights

Filed Under: News and Events Tagged With: Azure, Cloud, cloud outage, cybersecurity, microsoft azure, multi-cloud, recovery, server failover, SQL, storage
