February 10, 2019
Step-By-Step: How To Configure SQL Server 2008 R2 Failover Cluster Instance In Azure
Still figuring out how to ensure your SQL Server instances remain highly available once you make the move to Azure? Today, most people have business critical SQL Server 2008/2008 R2 configured as a clustered instance (SQL Server FCI) in their data center. When looking at Azure, the lack of shared storage might make it seem that you can’t bring your SQL Server FCI to the cloud. However, that is not the case thanks to SIOS DataKeeper.
SIOS DataKeeper enables you to build a SQL Server FCI in Azure, AWS, Google Cloud, or anywhere else shared storage is not available. DataKeeper has been enabling SANless clusters for Windows and Linux since 1999. Microsoft documents the use of SIOS DataKeeper for SQL Server FCI in their documentation: High availability and disaster recovery for SQL Server in Azure Virtual Machines.
I’ve written about SQL Server FCIs running in Azure before, but I never published a step-by-step guide specific to SQL Server 2008/2008 R2. The good news is that it works just as well with SQL 2008/2008 R2 as it does with SQL 2012/2014/2016/2017, as well as the soon-to-be-released 2019. Regardless of the version of Windows Server (2008/2012/2016/2019) or SQL Server (2008/2012/2014/2016/2017), the configuration process is similar enough that this guide should get you through any of those configurations.
If your flavor of SQL or Windows is not covered in any of my guides, don’t be afraid to jump in and build a SQL Server FCI and reference this guide.
This guide uses SQL Server 2008 R2 with Windows Server 2012 R2. As of this writing I did not see an Azure Marketplace image of SQL 2008 R2 on Windows Server 2012 R2, so I had to download and install SQL 2008 R2 manually. Personally, I prefer this combination, but it’s fine if you need to use Windows Server 2008 R2. If you do, don’t forget to install the hotfix described in this article, which allows Windows Server 2008 R2 to participate in an FCI in Azure.
Provision Azure Instances
I’m not going to go into great detail here with a bunch of screenshots. The Azure portal UI tends to change frequently, so screenshots get stale quickly. Instead, I will just cover the important topics that you should be aware of.
Fault Domains Or Availability Zones?
In order to ensure your SQL Server instances are highly available, you have to make sure your cluster nodes reside in different Fault Domains (FD) or in different Availability Zones (AZ). In addition, your File Share Witness (see below) needs to reside in an FD or AZ different from the ones your cluster nodes reside in.
Here is my take on it. AZs are the newest Azure feature, but they are only supported in a handful of regions so far. AZs give you a higher SLA (99.99%) than FDs (99.95%) and protect you against the kind of cloud outages I describe in my post Azure Outage Post-Mortem. If you can deploy in a region that supports AZs then I recommend you use AZs.
In this guide I used AZs which you will see when you get to the section on configuring the load balancer. However, if you use FDs everything will be exactly the same, except the load balancer configuration will reference Availability Sets rather than Availability Zones.
What Is A File Share Witness You Ask?
Without going into great detail, Windows Server Failover Clustering (WSFC) requires you to configure a “Witness” to ensure failover behaves properly. WSFC supports three kinds of witnesses: Disk, File Share, and Cloud. Since we are in Azure, a Disk Witness is not possible. Cloud Witness is only available with Windows Server 2016 and later, so that leaves us with a File Share Witness. If you want to learn more about cluster quorums check out my post on the Microsoft Press Blog, From the MVPs: Understanding the Windows Server Failover Cluster Quorum in Windows Server 2012 R2.
Add Storage To Your SQL Server Instances
As you provision your SQL Server instances you will want to add additional disks to each instance. At a minimum you will need one disk for the SQL data and log files and one disk for tempdb. Whether or not you should have a separate disk for log and data files is somewhat debated when running in the cloud. On the back end, the storage all comes from the same place and your instance size limits your total IOPS. In my opinion, there really isn’t any value in separating your log and data files since you cannot ensure that they are running on two physical sets of disks. I’ll leave that for you to decide, but I put log and data all on the same volume.
Normally a SQL Server 2008 R2 FCI would require you to put tempdb on a clustered disk. However, SIOS DataKeeper has this really nifty feature called a DataKeeper Non-Mirrored Volume Resource. This guide does not cover moving tempdb to this non-mirrored volume resource, but for optimal performance you should do this. There really is no good reason to replicate tempdb since it is recreated upon failover anyway.
As far as the storage is concerned you can use any storage type, but certainly use Managed Disks whenever possible. Make sure each node in the cluster has the identical storage configuration. Once you launch the instances you will want to attach these disks and format them NTFS. Make sure each instance uses the same drive letters.
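If you prefer to script the disk preparation, here is a minimal PowerShell sketch; the disk number, drive letter, and volume label are placeholders, and the same values should be used on both nodes:

```powershell
# Bring the new data disk online, create a single partition, and format it NTFS
# (disk number, drive letter F: and the label are examples - keep them identical on each node)
Get-Disk -Number 2 |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -DriveLetter F -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "SQLData" -Confirm:$false
```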
It’s not a hard requirement, but if at all possible use an instance size that supports accelerated networking. Also, make sure you edit the network interface in the Azure portal so that your instances use a static IP address. For clustering to work properly you want to make sure you update the settings for the DNS server so that it points to your Windows AD/DNS server and not just some public DNS server.
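If you would rather script the static IP and DNS settings than click through the portal, here is a sketch using the Az PowerShell module (the older AzureRM module has equivalently named cmdlets with the AzureRm prefix). The NIC name, resource group, and IP addresses are placeholders:

```powershell
# Make the NIC's private IP static and point DNS at the Windows AD/DNS server
$nic = Get-AzNetworkInterface -Name "sql1-nic" -ResourceGroupName "sql-cluster-rg"
$nic.IpConfigurations[0].PrivateIpAllocationMethod = "Static"
$nic.IpConfigurations[0].PrivateIpAddress = "10.0.0.4"
$nic.DnsSettings.DnsServers.Add("10.0.0.10")   # your AD/DNS server
Set-AzNetworkInterface -NetworkInterface $nic
```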
By default, the communications between nodes in the same virtual network are wide open. However, if you have locked down your Azure Security Group, you will need to know what ports must be open between the cluster nodes and adjust your security group. In my experience, almost all the issues you will encounter when building a cluster in Azure are caused by blocked ports.
DataKeeper requires a handful of ports to be open between the clustered instances; the specific ports are listed in the SIOS DataKeeper documentation.
Failover clustering has its own set of port requirements that I won’t attempt to document here. This article seems to have that covered: http://dsfnet.blogspot.com/2013/04/windows-server-clustering-sql-server.html
In addition, the Load Balancer described later will use a probe port that must allow inbound traffic on each node. The port that is commonly used and described in this guide is 59999.
And finally if you want your clients to be able to reach your SQL Server instance you want to make sure your SQL Server port is open, which by default is 1433.
Remember, these ports can be blocked by the Windows Firewall or Azure Security Groups, so be sure to check both to ensure they are accessible.
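On the Windows Firewall side, the two ports discussed above can be opened with a couple of rules on each node; this is only a sketch, and any DataKeeper and cluster ports your environment needs can be added the same way:

```powershell
# Allow inbound SQL Server traffic and the load balancer health probe
New-NetFirewallRule -DisplayName "SQL Server 1433" -Direction Inbound -Protocol TCP -LocalPort 1433 -Action Allow
New-NetFirewallRule -DisplayName "ILB Probe 59999" -Direction Inbound -Protocol TCP -LocalPort 59999 -Action Allow
```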
Join The Domain
A requirement for SQL Server 2008 R2 FCI is that the instances must reside in the same Windows Server domain. So if you have not done so already, make sure you have joined the instances to your Windows domain.
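A minimal sketch of the domain join from PowerShell, assuming a hypothetical domain name:

```powershell
# Join the node to the domain and reboot; you will be prompted for domain credentials
Add-Computer -DomainName "contoso.local" -Credential (Get-Credential) -Restart
```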
Local Service Account
When you install DataKeeper it will ask you to provide a service account. You must create a domain user account and then add that user account to the Local Administrators Group on each node. When asked during the DataKeeper installation, specify that account as the DataKeeper service account. Note – Don’t install DataKeeper just yet!
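For reference, here is a sketch of creating that account and granting it local admin rights. The account name, domain, and password prompt are hypothetical; run the New-ADUser line where the Active Directory module is available and the net localgroup line on each cluster node:

```powershell
# Create the DataKeeper service account in the domain
New-ADUser -Name "dksvc" -SamAccountName "dksvc" `
    -AccountPassword (Read-Host -AsSecureString "Enter password") -Enabled $true

# On each cluster node, add the account to the local Administrators group
net localgroup Administrators 'CONTOSO\dksvc' /add
```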
Domain Global Security Groups
When you install SQL 2008 R2, you will be asked to specify two Global Domain Security Groups. Look ahead at the SQL install instructions and create those groups now. Set up a domain user account and place it in each of these security groups. Remember to specify this account as part of the SQL Server cluster installation.
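If you prefer PowerShell over Active Directory Users and Computers, a sketch follows; the group and account names are placeholders, so use whatever naming convention your domain follows:

```powershell
# Create the two Global security groups SQL Server 2008 R2 setup will ask for
New-ADGroup -Name "SQL-DBEngine" -GroupScope Global -GroupCategory Security
New-ADGroup -Name "SQL-Agent"    -GroupScope Global -GroupCategory Security

# Place the SQL Server service account (hypothetical name) in each group
Add-ADGroupMember -Identity "SQL-DBEngine" -Members "sqlsvc"
Add-ADGroupMember -Identity "SQL-Agent"    -Members "sqlsvc"
```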
You must enable both the Failover Clustering feature and .NET 3.5 on each of the two cluster nodes. As you enable Failover Clustering, be sure to enable the optional “Failover Cluster Automation Server”; it is required for a SQL Server 2008 R2 cluster on Windows Server 2012 R2.
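These features can also be enabled from PowerShell; a minimal sketch is below. The -Source path for .NET 3.5 is a placeholder for mounted OS media, which is sometimes required:

```powershell
# Failover Clustering plus the optional Failover Cluster Automation Server
Install-WindowsFeature Failover-Clustering -IncludeManagementTools
Install-WindowsFeature RSAT-Clustering-AutomationServer

# .NET Framework 3.5 (point -Source at the \sources\sxs folder of the OS media if needed)
Install-WindowsFeature NET-Framework-Core -Source "D:\sources\sxs"
```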
Create The Cluster And DataKeeper Volume Resources
We are now ready to start building the cluster. The first step is to create the base cluster. Because of the way Azure handles DHCP, we MUST create the cluster using PowerShell and not the Cluster UI. We use PowerShell because it lets us specify a static IP address as part of the creation process. If we used the UI, it would see that the VMs use DHCP and would automatically assign a duplicate IP address, so we want to avoid that situation by using PowerShell as shown below.
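The original script is not reproduced here, but a minimal sketch of the cluster creation looks like this; the node names, cluster name, and static IP are placeholders, and the address must be an unused IP in the cluster subnet:

```powershell
# Create the base cluster with a static IP address and no clustered storage
New-Cluster -Name cluster1 -Node sql1,sql2 -StaticAddress 10.0.0.100 -NoStorage
```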
After the cluster is created, run Test-Cluster. This is required before SQL Server will install.
You will get warnings about Storage and Networking, but you can ignore those as they are expected in a SANless cluster in Azure. If there are any other warnings or errors you must address those before moving on.
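Validation can be run from PowerShell as well; a minimal sketch using the placeholder node names from above:

```powershell
# Run cluster validation; Storage and Network warnings are expected in a SANless Azure cluster
Test-Cluster -Node sql1,sql2
```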
After the cluster is created you will need to add the File Share Witness. On the third server we specified as the file share witness, create a file share and give Read/Write permissions to the cluster computer object we just created above. In this case the cluster computer object (cluster1 in the example above) needs Read/Write permissions at both the share and NTFS security levels.
Once the share is created, you can use the Configure Cluster Quorum Wizard as shown below to configure the File Share Witness.
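The witness can also be configured with PowerShell; this sketch assumes a hypothetical witness server named witness1, the cluster name used in the example above, and placeholder share paths:

```powershell
# On the witness server: create the folder and share, granting the cluster computer account access
New-Item -Path "C:\Quorum" -ItemType Directory
New-SmbShare -Name "Quorum" -Path "C:\Quorum" -FullAccess 'CONTOSO\cluster1$'
icacls "C:\Quorum" /grant 'CONTOSO\cluster1$:(OI)(CI)F'   # NTFS permissions as well

# On one of the cluster nodes: point the cluster at the file share witness
Set-ClusterQuorum -NodeAndFileShareMajority "\\witness1\Quorum"
```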
It is important to wait until the basic cluster is created before we install DataKeeper, since the DataKeeper installation registers the DataKeeper Volume Resource type in failover clustering. If you jumped the gun and installed DataKeeper already, that is okay. Simply run the setup again and choose Repair Installation.
The screenshots below walk you through a basic installation. Start by running the DataKeeper Setup.
The account you specify below must be a domain account and must be part of the Local Administrators group on each of the cluster nodes.
When presented with the SIOS License Key Manager, you can browse out to your temporary key. Or, if you have a permanent key, you can copy the System Host ID and use that to request your permanent license. If you ever need to refresh a key, the SIOS License Key Manager is installed as a separate program that you can run to add a new key.
Create DataKeeper Volume Resource
Once DataKeeper is installed on each node you are ready to create your first DataKeeper Volume Resource. The first step is to open the DataKeeper UI and connect to each of the cluster nodes.
If everything is done correctly the Server Overview Report should look something like this.
You can now create your first Job as shown below.
After you choose a Source and Target you are presented with the following options. For a local target in the same region the only thing you need to select is Synchronous.
Choose Yes and auto-register this volume as a cluster resource.
Once you complete this process, open up the Failover Cluster Manager and look under Disks. You should see the DataKeeper Volume resource in Available Storage. At this point WSFC treats this as if it were a normal cluster disk resource.
Slipstream SP3 Onto SQL 2008 R2 Install Media
SQL Server 2008 R2 is only supported on Windows Server 2012 R2 with SQL Server SP2 or later. Unfortunately, Microsoft never released SQL Server 2008 R2 installation media that includes SP2 or SP3. Instead, you must slipstream the service pack onto the installation media BEFORE you do the installation. If you try to do the installation with the standard SQL Server 2008 R2 media, you will run into all kinds of problems. I don’t remember the exact errors, but I do recall they didn’t really point to the actual problem, and you will waste a lot of time trying to figure out what went wrong.
As of the date of this writing, Microsoft does not have a Windows Server 2012 R2 with SQL Server 2008 R2 offering in the Azure Marketplace. You probably will be bringing your own SQL license if you want to run SQL 2008 R2 on Windows Server 2012 R2 in Azure. If they add that image later, or if you choose to use the SQL 2008 R2 on Windows Server 2008 R2 image you must first uninstall the existing standalone instance of SQL Server before moving forward.
I followed the guidance in Option 1 of this article to slipstream SP3 onto my SQL 2008 R2 installation media. You will of course have to adjust a few things, as that article references SP2 instead of SP3. Make sure you slipstream SP3 onto the installation media we will use for both nodes of the cluster.
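As an illustration only (the referenced article is the authoritative procedure and may differ in its details), one common way to slipstream the service pack is to extract it and point setup at it with the /PCUSource parameter; the package file name and paths below are examples:

```powershell
# Extract SQL Server 2008 R2 SP3 to a local folder (file name may differ for your download)
.\SQLServer2008R2SP3-KB2979597-x64-ENU.exe /x:C:\SQL2008R2SP3

# Run SQL Server setup from the original media, pulling in the extracted service pack files
D:\setup.exe /PCUSource=C:\SQL2008R2SP3
```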
Install SQL Server On The First Node
Using the SQL Server 2008 R2 media with SP3 slipstreamed, run setup and install the first node of the cluster as shown below.
If you use anything other than the Default instance of SQL Server, you will have some additional steps not covered in this guide. The biggest difference is locking down the port that SQL Server uses. By default a named instance of SQL Server does NOT use 1433. Once you lock down the port, you also need to use that port instead of 1433 in the firewall settings and the Load Balancer settings.
Here make sure to specify a new IP address that is not in use. This is the same IP address we will use later when we configure the Internal Load Balancer.
As I mentioned earlier, SQL Server 2008 R2 utilizes AD Security Groups. Go ahead and configure SQL server now as shown below before you continue.
Specify the Security Groups you created earlier.
Make sure the service accounts you specify are a member of the associated Security Group.
Specify your SQL Server administrators here.
If everything goes well, you are now ready to configure SQL server on the second node of the cluster.
Install SQL Server On The Second Node
On the second node, run the SQL Server 2008 R2 with SP3 install and select Add Node to a SQL Server FCI.
Proceed with the installation as shown in the following screenshots.
Assuming everything went well, you should now have a two node SQL Server 2008 R2 cluster configured that looks something like the following.
However, you probably will notice that you can only connect to the SQL Server instance from the active cluster node. The problem is that Azure does not support gratuitous ARP, so your clients cannot connect directly to the Cluster IP Address. Instead, the clients must connect to an Azure Load Balancer, which redirects the connection to the active node. To make this work there are two steps: first, create the Load Balancer; then, update the SQL Server cluster IP address to respond to the Load Balancer probe and use a 255.255.255.255 subnet mask. Those steps are described below.
Create The Azure Load Balancer
I’m going to assume your clients can communicate directly to the internal IP address of the SQL cluster so we will create an Internal Load Balancer (ILB) in this guide. If you need to expose your SQL Instance on the public internet you can use a Public Load Balancer instead.
In the Azure portal create a new Load Balancer following the screenshots as shown below. The Azure portal UI changes rapidly, but these screenshots should give you enough information to do what you need to do. I will call out important settings as we go along.
Here we create the ILB. The important thing to note on this screen is you must select “Static IP address assignment” and specify the same IP address that we used during the SQL Cluster installation.
Since I used Availability Zones I see Zone Redundant as an option. If you used Availability Sets your experience will be slightly different.
In the Backend pool be sure to select the two SQL Server instances. You DO NOT want to add your File Share Witness in the pool.
Here we configure the Health Probe. Most Azure documentation has us using port 59999, so we will stick with that port for our configuration.
Here we will add a load balancing rule. In our case we want to redirect all SQL Server traffic to TCP port 1433 of the active node. It is also important that you select Floating IP (Direct Server Return) as Enabled.
Run Powershell Script To Update SQL Client Access Point
Now we must run a PowerShell script on one of the cluster nodes to allow the Load Balancer probe to detect which node is active. The script also sets the subnet mask of the SQL cluster IP address to 255.255.255.255 so that it avoids IP address conflicts with the Load Balancer we just created.
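The script is essentially the one in Microsoft's load balancer documentation; a sketch is below. The cluster network name, IP resource name, and IP address are placeholders (Get-ClusterNetwork and Get-ClusterResource will show yours), and 59999 matches the probe port configured on the load balancer:

```powershell
# Placeholders - substitute the names and address from your own cluster
$ClusterNetworkName = "Cluster Network 1"
$IPResourceName     = "SQL IP Address 1 (sqlcluster)"
$ILBIP              = "10.0.0.201"    # same address as the ILB front end / SQL cluster IP

Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple `
    @{Address=$ILBIP;ProbePort=59999;SubnetMask="255.255.255.255";Network=$ClusterNetworkName;EnableDhcp=0}
```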
This is what the output will look like if run correctly.
If you get to this point and you still cannot connect to the cluster remotely, you wouldn’t be the first person. There are a lot of things that can go wrong in terms of security, load balancer, SQL ports, etc. I wrote this guide to help troubleshoot connection issues.
In fact, in this very installation I ran into some strange issues in terms of my SQL Server TCP/IP Properties in SQL Server Configuration Manager. When I looked at the properties, I did not see the SQL Server Cluster IP address as one of the addresses it was listening on, so I had to add it manually. I’m not sure if that was an anomaly. It certainly was an issue I had to resolve before I could connect to the cluster from a remote client.
As I mentioned earlier, one other improvement you can make to this installation is to use a DataKeeper Non-Mirrored Volume Resource for TempDB. If you set that up please be aware of the following two configuration issues people commonly run into.
The first issue: if you move tempdb to a folder on the first node, you must be sure to create the exact same folder structure on the second node. If you don’t, SQL Server will fail to come online after a failover since it can’t create tempdb.
The second issue occurs anytime you add another DataKeeper Volume Resource to a SQL Cluster after the cluster is created. You must go into the properties of the SQL Server cluster resource and make it dependent on the new DataKeeper Volume resource you added. This is true for the TempDB volume and any other volumes that you may decide to add after the cluster is created.
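The dependency can be set in the Failover Cluster Manager properties dialog or from PowerShell; a sketch with placeholder resource names:

```powershell
# Make the SQL Server resource depend on the newly added DataKeeper volume
# (check Get-ClusterResource or Failover Cluster Manager for the exact resource names)
Add-ClusterResourceDependency -Resource "SQL Server" -Provider "DataKeeper Volume F"
```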
Read here to know how to configure SQL servers and ensure High Availability
Reproduced with permission from Clusteringformeremortals.com
February 1, 2019
About High Availability Applications For Business Operations – An Interview with Jerry Melnick
We are in conversation with Jerry Melnick, President & CEO, SIOS Technology Corp. Jerry is responsible for directing the overall corporate strategy for SIOS Technology Corp. and leading the company’s ongoing growth and expansion. He has more than 25 years of experience in the enterprise and high availability software markets. Before joining SIOS, he was CTO at Marathon Technologies where he led business and product strategy for the company’s fault tolerant solutions. His experience also includes executive positions at PPGx, Inc. and Belmont Research. There he was responsible for building a leading-edge software product and consulting business focused on supplying data warehouse and analytical tools.
Jerry began his career at Digital Equipment Corporation. He led an entrepreneurial business unit that delivered highly scalable, mission-critical database platforms to support enterprise-computing environments in the medical, financial and telecommunication markets. He holds a Bachelor of Science degree from Beloit College with graduate work in Computer Engineering and Computer Science at Boston University.
What is the SIOS Technology survey and what is the objective of the survey?
SIOS Technology Corp. with ActualTech Media conducted a survey of IT staff to understand current trends and challenges related to the general state of high availability applications in organizations of all sizes. An organization’s HA applications are generally the ones that ensure that a business remains in operation. Such systems can range from order taking systems to CRM databases to anything that keeps employees, customers, and partners working together.
We’ve learned that the news is mixed when it comes to how well high availability applications are supported.
Who responded to the survey?
For this survey, we gathered responses from 390 IT professionals and decision makers from a broad range of company sizes in the US. Respondents included those managing databases, infrastructure, architecture, systems, and software development, as well as those in IT management roles.
What were some of the key findings uncovered in the survey results?
The key findings based on the survey results are discussed in the responses that follow.
Tell us about the Enterprise Application Landscape. Which applications are in use most; and which might we be surprised about?
We focused on tier 1 mission critical applications, including Oracle, Microsoft SQL Server, SAP/HANA. For most organizations operating these kinds of services, they are the lifeblood. They hold the data that enables the organization to achieve its goals.
56% of respondents to our survey are operating Oracle workloads while 49% are running Microsoft SQL Server. Rounding out the survey, 28% have SAP/HANA in production. These are all clearly critical workloads in most organizations, but there are others. For this survey, we provided respondents an opportunity to tell us what, beyond these three big applications, they are operating that can be considered mission critical. Respondents that availed themselves of this response option indicate that they’re also operating various web databases, primarily from Amazon, as well as MySQL and PostgreSQL databases. To a lesser extent, organizations are also operating some NoSQL services that are considered mission critical.
How often does an application performance issue affect end users?
Application performance issues are critical for organizations. 98% of respondents indicate these issues impact end users in some way, ranging from daily (experienced by 18% of respondents) to just once per year (experienced by 8% of respondents), and everywhere in between. Application performance issues lead to customer dissatisfaction and can lead to lost revenue and increased expenses. But there appears to be some disagreement around such issues depending on your perspective in the organization. Respondents holding decision maker roles have a more positive view of the performance situation than others. Only 11% of decision makers report daily performance challenges compared to around 20% of other respondents.
Is it easier to resolve cloud-based application performance issues?
Most IT pros would like to fully eliminate the potential for performance issues in applications that operate in a cloud environment. But the fact is that such situations can and will happen. There is a variety of tools available in the market to help IT understand and address application performance issues, and IT departments have, over the years, cobbled together troubleshooting toolkits. In general, the fewer tools you need to work with to resolve a problem, the more quickly you can bring services back into full operation. That’s why it’s particularly disheartening to learn that only 19% of respondents turn to a single tool to identify cloud application performance issues. This leaves 81% of respondents having to use two or more tools. But it gets worse: 11% of respondents need to turn to five or more tools in order to identify performance issues with their cloud applications.
So now we know cloud-based application performance issues can’t be totally avoided, how long until we can expect a fix?
The real test of an organization’s ability to handle such issues comes when measuring the time it takes to recover when something does go awry. 23% of respondents can typically recover in less than an hour. Fifty-six percent (56%) of respondents take somewhere between one and three hours to recover, and 23% take three or more hours. This isn’t to say that these people are recovering from a complete failure; they are reacting to a performance fault somewhere in the application, one serious enough to warrant attention. A goal for most organizations is to reduce the amount of time it takes to troubleshoot problems, which in turn reduces the time it takes to correct them.
Do future plans about moving HA applications to the cloud show stronger migration?
We requested information from respondents around their future plans as they pertain to moving additional high availability applications to the cloud. Nine percent (9%) of respondents indicate that all of their most important applications are already in the cloud. By the end of 2018, one-half of respondents expect to have more than 50% of their HA applications migrated to the cloud, while 29% say that they will have less than half of their HA applications in the cloud. Finally, 12% of respondents say that they will not be moving any more HA applications to the cloud in 2018.
How would you sum up the SIOS Technology survey results?
Although this survey and report represent people’s thinking at a single point in time, there are some potentially important trends that emerge. First, it’s clear that organizations value their mission-critical applications, as they’re protecting them via clustering or other high availability technology. A second takeaway is that even with those safeguards in place, there’s more work to be done, as those apps can still suffer failures and performance issues. Companies need to look at the data and ask themselves whether they’re doing everything they can to protect their crucial assets. You can download the report here.
Contact us if you would like to enjoy High Availability Applications in your project.
Reproduced from Tech Target
January 30, 2019
Ensure High Availability for SQL Server on Amazon Web Services
Database and system administrators have long had a wide range of options for ensuring that mission-critical database applications remain highly available. Public cloud infrastructures, like those provided by Amazon Web Services, offer their own, additional high availability options backed by service level agreements. But configurations that work well in a private cloud might not be possible in the public cloud. Poor choices in the AWS services used and/or how these are configured can cause failover provisions to fail when actually needed. This article outlines the various options available for ensuring High Availability for SQL Server in the AWS cloud.
For database applications, AWS gives administrators two basic choices, each of which has different high availability (HA) and disaster recovery (DR) provisions: Amazon Relational Database Service (RDS) and Amazon Elastic Compute Cloud (EC2).
RDS is a fully managed service suitable for mission-critical applications. It offers a choice of six different database engines, but its support for SQL Server is not as robust as it is for other engines such as Amazon Aurora, MySQL and MariaDB, which raises a number of common concerns for administrators running mission-critical SQL Server applications on RDS.
Elastic Compute Cloud
The other basic choice is the Elastic Compute Cloud with its substantially greater capabilities. This makes it the preferred choice when HA and DR are of paramount importance. A major advantage of EC2 is the complete control it gives admins over the configuration, and that presents admins with some additional choices.
Picking The Operating System
Perhaps the most consequential choice is which operating system to use: Windows or Linux. Windows Server Failover Clustering is a powerful, proven and popular capability that comes standard with Windows. But WSFC requires shared storage, and that is not available in EC2. Because Multi-AZ, and even Multi-Region, configurations are required for robust HA/DR protection, separate commercial or custom software is needed to replicate data across the cluster of server instances. Microsoft’s Storage Spaces Direct (S2D) is not an option here, as it does not support configurations that span Availability Zones.
The need for additional HA/DR provisions is even greater for Linux, which lacks a fundamental clustering capability like WSFC. Linux gives admins two equally bad choices for high availability: Either pay more for the more expensive Enterprise Edition of SQL Server to implement Always On Availability Groups; or struggle to make complex do-it-yourself HA Linux configurations using open source software work well.
Both of these choices undermine the cost-saving rationale for using open source software on commodity hardware in public cloud services. SQL Server for Linux is only available for the more recent (and more expensive) versions, beginning in 2017. And the DIY HA alternative can be prohibitively expensive for most organizations. Indeed, making Distributed Replicated Block Device, Corosync, Pacemaker and, optionally, other open source software work as desired at the application level under all possible failure scenarios can be extraordinarily difficult, which is why only very large organizations have the wherewithal (skill set and staffing) to even consider taking on the task.
Owing to the difficulty involved implementing mission-critical HA/DR provisions for Linux, AWS recommends using a combination of Elastic Load Balancing and Auto Scaling to improve availability. But these services have their own limitations that are similar to those in the managed Relational Database Service.
All of this explains why admins are increasingly choosing to use failover clustering solutions designed specifically for ensuring HA and DR protections in a cloud environment.
Failover Clustering Purpose-Built for the Cloud
The growing popularity of private, public and hybrid clouds has led to the advent of failover clustering solutions purpose-built for a cloud environment. These HA/DR solutions are implemented entirely in software that creates, as implied by the name, a cluster of servers and storage with automatic failover to assure high availability at the application level.
Most of these solutions provide a complete HA/DR solution that includes a combination of real-time block-level data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more sophisticated solutions also offer advanced capabilities like support for Always On Failover Cluster Instances in the less expensive Standard Edition of SQL Server for both Windows and Linux, WAN optimization to maximize multi-region performance, manual switchover of primary and secondary server assignments to facilitate planned maintenance, and the ability to perform regular backups without disruption to the application.
Most failover clustering software is application-agnostic, enabling organizations to have a single, universal HA/DR solution. This same capability also affords protection for the entire SQL Server application, including the database, logons, agent jobs, etc., all in an integrated fashion. Although these solutions are generally also storage-agnostic, enabling them to work with shared storage area networks, shared-nothing SANless failover clustering is usually preferred for its ability to eliminate potential single points of failure.
Support for Always On Failover Cluster Instances (FCIs) in the less expensive Standard Edition of SQL Server, with no compromises to availability or performance, is a major advantage. In a Windows environment, most failover clustering software supports FCIs by leveraging the built-in WSFC feature. It makes the implementation quite straightforward for both database and system administrators. Linux is becoming increasingly popular for SQL Server and many other enterprise applications. Some failover clustering solutions now make implementing HA/DR provisions just as easy as it is for Windows by offering application-specific integration.
Typical Three-Node SANless Failover Cluster
The example EC2 configuration in the diagram shows a typical three-node SANless failover cluster configured as a Virtual Private Cloud (VPC) with all three SQL Server instances in different Availability Zones. To eliminate the potential for an outage in a local disaster affecting an entire region, one of the AZs is located in a different AWS region.
A three-node SANless failover cluster affords carrier-class HA and DR protections. The basic operation is the same in the LAN and/or WAN for Windows or Linux. Server #1 is initially the primary or active instance that replicates data continuously to both servers #2 and #3. If it experiences a problem, an automatic failover is triggered to server #2, which then becomes the primary, replicating data to server #3.
If the failure was caused by an infrastructure outage, the AWS staff would begin immediately diagnosing and repairing whatever caused the problem. Once fixed, server #1 could be restored as the primary, or server #2 could continue in that capacity replicating data to servers #1 and #3. Should server #2 fail before server #1 is returned to operation, as shown, server #3 would become the primary after a manual failover. Of course, if the failure was caused by the application software or certain other aspects of the configuration, it would be up to the customer to find and fix the problem.
SANless failover clusters can be configured with only a single standby instance, of course. But such a minimal configuration does require a third node to serve as a witness. The witness is needed to achieve a quorum for determining the assignment of the primary. This important task is normally performed by a domain controller in a separate AZ. Keeping all three nodes (primary, secondary and witness) in different AZs eliminates the possibility of losing more than one vote if any zone goes offline.
It is also possible to have two- and three-node SANless failover clusters in hybrid cloud configurations for HA and/or DR purposes. One such three-node configuration is a two-node HA cluster located in an enterprise data center with asynchronous data replication to AWS or another cloud service for DR protection—or vice versa.
In clusters within a single region, where data replication is synchronous, failovers are normally configured to occur automatically. For clusters with nodes that span multiple regions, where data replication is asynchronous, failovers are normally controlled manually to avoid the potential for data loss. Three-node clusters, regardless of the regions used, can also facilitate planned hardware and software maintenance for all three servers while providing continuous DR protection for the application and its data.
Maximise High Availability for SQL Server
By offering 55 availability Zones spread across 18 geographical Regions, the AWS Global Infrastructure affords enormous opportunity to maximize High Availability for SQL Server by configuring SANless failover clusters with multiple, geographically-dispersed redundancies. This global footprint also enables all SQL Server applications and data to be located near end-users to deliver satisfactory performance.
With a purpose-built solution, carrier-class high availability need not mean paying a carrier-like high cost. Because purpose-built failover clustering software makes effective and efficient use of EC2’s compute, storage and network resources, while being easy to implement and operate, these solutions minimize capital and operational expenditures, resulting in high availability being more robust and more affordable than ever before.
Reproduced from TheNewStack
January 27, 2019
Options for When Public Cloud Service Levels Fall Short
All public cloud service providers offer some form of guarantee regarding availability. These may or may not be sufficient, depending on each application’s requirement for uptime. These guarantees typically range from 95.00% to 99.99% of uptime during the month. Most impose some type of “penalty” on the service provider for falling short of those thresholds.
Most cloud service providers offer a 99.00% uptime threshold, which equates to about seven hours of downtime per month. For many applications, those two 9’s might be enough. But for mission-critical applications more 9’s are needed, especially given that many common causes of downtime are excluded from the guarantee.
There are, of course, cost-effective ways to achieve five-9’s high availability and robust disaster recovery protection in configurations using public cloud services, either exclusively or as part of a hybrid arrangement. This article highlights limitations involving HA and DR provisions in the public cloud. It explores three options for overcoming these limitations, and describes two common configurations for failover clusters.
Caveat Emptor in the Cloud
While all cloud service providers (CSPs) define “downtime” or “unavailable” somewhat differently, these definitions include only a limited set of all possible causes of failures at the application level. Generally included are failures affecting a zone or region, or external connectivity. All CSPs also offer credits ranging from 10% for failing to meet four-9’s of uptime to around 25% for failing to meet two-9’s of uptime.
Redundant resources can be configured to span the zones and/or regions within the CSP’s infrastructure, which helps improve application-level availability. But even with such redundancy, there remain some limitations that are often unacceptable for mission-critical applications, especially those requiring high transactional throughput performance. These limitations include each master being able to create only a single failover replica, the use of the master dataset for backups, and the use of event logs to replicate data. These and other limitations can increase recovery time during a failure and make it necessary to schedule at least some planned downtime.
The more significant limitations involve the many exclusions to what constitutes downtime. Actual public cloud service level agreements exclude a number of causes of application-level failure from their definitions of “downtime” or “unavailability”.
To be sure, it is reasonable for CSPs to exclude certain causes of failure. But it would be irresponsible for system administrators to use these as excuses. It is necessary to ensure application-level availability by some other means.
What Public Cloud Service Levels Are Available?
Provisioning resources for high availability in a way that does not sacrifice security or performance has never been a trivial endeavor. The challenge is especially difficult in a hybrid cloud environment where the private and public cloud infrastructures can differ significantly. This makes configurations difficult to test and maintain, and can result in failover provisions failing when actually needed.
For applications where the service levels offered by the CSP fall short, there are three additional options available based on the application itself, features in the operating system, or through the use of purpose-built failover clustering software.
Three Options for Improving Application-level Availability
The HA/DR options that might appear to be the easiest to implement are those specifically designed for each application. A good example is Microsoft’s SQL Server database with its carrier-class Always On Availability Groups feature. There are two disadvantages to this approach, however. The higher licensing fees, in this case for the Enterprise Edition, can make it prohibitively expensive for many needs. And the more troubling disadvantage is the need for different HA/DR provisions for different applications, which makes ongoing management a constant (and costly) struggle.
Uptime-Related Features Integrated Into The Operating System
The second option for improving public cloud service levels involves using uptime-related features integrated into the operating system. Windows Server Failover Clustering, for example, is a powerful and proven feature that is built into the OS. But on its own, WSFC might not provide a complete HA/DR solution because it lacks a data replication feature. In a private cloud, data replication can be provided using some form of shared storage, such as a storage area network. But because shared storage is not available in public clouds, implementing robust data replication requires using separate commercial or custom-developed software.
For Linux, which lacks a feature like WSFC, the need for additional HA/DR provisions and/or custom development is considerably greater. Using open source software like Pacemaker and Corosync requires creating (and testing) custom scripts for each application. These scripts often need to be updated and retested after even minor changes are made to any of the software or hardware being used. But because getting the full HA stack to work well for every application can be extraordinarily difficult, only very large organizations have the wherewithal needed to even consider taking on the effort.
Purpose-Built Failover Cluster
Ideally there would be a “universal” approach to HA/DR capable of working cost-effectively for all applications running on either Windows or Linux across public, private and hybrid clouds. Among the most versatile and affordable of such solutions is the third option: the purpose-built failover cluster. These HA/DR solutions are implemented entirely in software that is designed specifically to create, as their designation implies, a cluster of virtual or physical servers and data storage with failover from the active or primary instance to a standby to assure high availability at the application level.
Benefits Of These Solutions
These solutions provide, at a minimum, a combination of real-time data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more robust ones offer additional advanced capabilities, such as a choice of block-level synchronous or asynchronous replication, support for Failover Cluster Instances (FCIs) in the less expensive Standard Edition of SQL Server, WAN optimization for enhanced performance and minimal bandwidth utilization, and manual switchover of primary and secondary server assignments to facilitate planned maintenance.
Although these general-purpose solutions are generally storage-agnostic, enabling them to work with storage area networks, shared-nothing SANless failover clusters are normally preferred based on their ability to eliminate potential single points of failure.
Two Common Failover Clustering Configurations
Every failover cluster consists of two or more nodes. Locating at least one of the nodes in a different datacenter is necessary to protect against local disasters. Presented here are two popular configurations: one for disaster recovery purposes; the other for providing both mission-critical high availability and disaster recovery. High transactional performance is often a requirement for highly available configurations. The example application is a database.
The basic SANless failover cluster for disaster recovery has two nodes with one primary and one secondary or standby server or server instance. This minimal configuration also requires a third node or instance to function as a witness, which is needed to achieve a quorum for determining assignment of the primary. For database applications, replication to the standby instance across the WAN is asynchronous to maintain high performance in the primary instance.
The SANless failover cluster affords a rapid recovery in the event of a failure in the primary, resulting in a basic DR configuration suitable for many applications. It is capable of detecting virtually all possible failures, including those not counted as downtime in public cloud services. As such it will work in a private, public or hybrid cloud environment.
For example, the primary could be in the enterprise datacenter with the secondary deployed in the public cloud. Because the public cloud instance would be needed only during planned maintenance of the primary or in the event of its failure—conditions that can be fairly quickly remedied—the service limitations and exclusions cited above may well be acceptable for all but the most mission-critical of applications.
Three-Node SANless Failover Clusters
The figure shows an enhanced three-node SANless failover cluster that affords both five-9’s high availability and robust disaster recovery protection. As with the two-node cluster, this configuration will also work in a private, public or hybrid cloud environment. In this example, servers #1 and #2 are located in an enterprise datacenter with server #3 in the public cloud. Within the datacenter, replication across the LAN can be fully synchronous to minimize the time it takes to complete a failover and thereby maximize availability.
When properly configured, three-node SANless failover clusters afford truly carrier-class HA and DR. The basic operation is application-agnostic and works the same for Windows or Linux. Server #1 is initially the primary or active instance that replicates data continuously to both servers #2 and #3. If it experiences a failure, the application would automatically failover to server #2, which would then become the primary replicating data to server #3.
Immediately after a failure in server #1, the IT staff would begin diagnosing and repairing whatever caused the problem. Once fixed, server #1 could be restored as the primary with a manual failback, or server #2 could continue functioning as the primary replicating data to servers #1 and #3. Should server #2 fail before server #1 is returned to operation, as shown, server #3 would become the primary. Because server #3 is across the WAN in the public cloud, data replication is asynchronous and the failover is manual to prevent “replication lag” from causing the loss of any data.
SANless failover clustering software is able to detect all possible failures at the application level and readily overcomes the CSP limitations and exclusions mentioned above, making it possible for this three-node configuration to be deployed entirely within the public cloud. To afford the same five-9’s high availability based on immediate and automatic failovers, servers #1 and #2 would need to be located within a single zone or region where the LAN facilitates synchronous replication.
For appropriate DR protection, server #3 should be located in a different datacenter or region, where the use of asynchronous replication and manual failover/failback would be needed for applications requiring high transactional throughput. Three-node clusters can also facilitate planned hardware and software maintenance for all three servers while continuing to provide DR protection for the application and its data.
By offering multiple, geographically-dispersed datacenters, public clouds afford numerous opportunities to improve availability and enhance DR provisions. SANless failover clustering software makes effective and efficient use of all compute, storage and network resources while being easy to implement and operate. These purpose-built solutions minimize capital and operational expenditures, resulting in high availability being more robust and more affordable than ever before.
# # #
About the Author
Cassius Rhue is Director of Engineering at SIOS Technology. He leads the software product development and engineering team in Lexington, SC. Cassius has over 17 years of software engineering, development and testing experience. He also holds a BS in Computer Engineering from the University of South Carolina.
Article from DRJ.com
January 23, 2019
Cost of Cloud for High-Availability Applications
Shortly after contracting with a cloud service provider, a bill arrives that causes sticker shock. There are unexpected and seemingly excessive charges. Those responsible seem unable to explain how this could have happened. The situation is urgent because the amount threatens to bust the IT budget unless cost-saving changes are made immediately. So how do we manage the Cost of Cloud for High-Availability Applications?
This cloud services sticker shock is often caused by mission-critical database applications, which tend to be the most costly for a variety of reasons. These applications need to run 24/7. They require redundancy, which involves replicating the data and provisioning standby server instances. Data replication requires data movement, including across the wide area network (WAN). And providing high availability can result in higher costs to license Windows to get Windows Server Failover Clustering (versus using free open source Linux), or to license the Enterprise Edition of SQL Server to get Always On Availability Groups.
Before offering suggestions for managing Cost of Cloud for High-Availability Applications, it is important to note that the goal here is not to minimize those costs. But instead to optimize the price/performance for each application. In other words, it is appropriate to pay more when provisioning resources for those applications that require higher uptime and throughput performance. It is also important to note that a hybrid cloud infrastructure—with applications running in whole or in part in both the private and public cloud—will likely be the best way to achieve optimal price/performance.
Understanding Cloud Service Provider Business And Pricing Models
The sticker shock experience demonstrates the need to thoroughly understand how cloud services are priced in order to manage the cost of cloud for high-availability applications. Only then can the available services be utilized in the most cost-effective manner.
All cloud service providers (CSPs) publish their pricing. Unless specified in the service agreement, that pricing is constantly changing. All hardware-based resources, including physical and virtual compute, storage, and networking services, inevitably have some direct or indirect cost. These are all based to some extent on the space, power, and cooling these systems consume. For software, open source is generally free. But all commercial operating systems and/or application software will incur a licensing fee. And be forewarned that some software licensing and pricing models can be quite complicated. So be sure to study them carefully.
In addition to these basic charges for hardware and software, there are potential à la carte costs for various value-added services. This includes security, load-balancing, and data protection provisions. There may also be “hidden” costs for I/O to storage or among distributed microservices, or for peak utilization that occurs only rarely during “bursts.”
Because every CSP has its own unique business and pricing model, the discussion here must be generalized. And, in general, the most expensive resources involve compute, software licensing, and data movement. Together they can account for 80% or more of the total costs. Data movement might also incur separate WAN charges that are not included in the bill from the CSP.
Storage and networking within the CSP’s infrastructure are usually the least costly resources. Solid state drives (SSDs) normally cost more than spinning media on a per-terabyte basis. But SSDs also deliver superior performance, so their price/performance may be comparable or even better. And while moving data back to the enterprise can be expensive, moving data from the enterprise to the public cloud can usually be done cost-free (notwithstanding the separate WAN charges).
Formulating Strategies For Optimizing Price/Performance
Managing the cost of cloud for high-availability applications requires careful attention. Here are some suggestions for managing resource utilization in the public cloud in ways that can lower costs while maintaining appropriate service levels for all applications, including those that require mission-critical uptime and throughput.
In general, right-sizing is the foundational principle for managing resource utilization for optimal price/performance. When Willie Sutton was purportedly asked why he robbed banks, he replied, “Because that’s where the money is”. In the cloud, the money is in compute resources, so that should be the highest priority for right-sizing.
For new applications, start with minimal virtual machine configurations for compute resources. Add CPU cores, memory and/or I/O only as required to achieve satisfactory performance. All virtual machines for existing applications should eventually be right-sized. Begin with those that cost the most. Reduce allocations gradually while monitoring performance constantly until achieving diminishing returns.
It is worth noting that a major risk associated with right-sizing is the potential for under-sizing, which can result in unacceptably poor performance. Unfortunately, the best way to assess an application’s actual performance is with a production workload, making the real world the right place to right-size. Fortunately, the cloud mitigates this risk by making it easy to quickly resize configurations on demand. So right-size aggressively where needed, but be prepared to react quickly in response to each change.
Storage, in direct contrast to compute, is generally relatively inexpensive in the cloud. But be careful using cheap storage, because I/O might incur a separate—and costly—charge with some services. If so, make use of potentially more cost-effective performance-enhancing technologies such as tiered storage, caching, and/or in-memory databases, where available, to optimize the utilization of all resources.
Software licenses can be a significant expense in both private and public clouds. For this reason, many organizations are migrating from Windows to Linux, and from SQL Server to less-expensive commercial and/or open source databases. But for those applications for which “premium” operating system and/or application software is warranted, check different CSPs to see if any pricing models might afford some savings for the configurations required.
Finally, all CSPs offer discounts, and combinations of these can sometimes achieve a savings of up to 50%. Examples include pre-paying for services, making service commitments, and/or relocating applications to another region.
Creating And Enforcing Cost Containment Controls
Self-provisioning for cloud services might be popular with users. But without appropriate controls, this convenience makes it too easy to over-utilize resources, including those that cost the most.
Begin the effort to gain better control by taking full advantage of the monitoring and management tools all CSPs offer. This is likely to involve a learning curve, of course, because the CSP’s tools may be very different from, and potentially more sophisticated than, those being used in the private cloud.
One of the more useful cost containment tools involves the tagging of resources. Tags consist of key/value pairs and metadata associated with individual resources, and some can be quite granular. For example, each virtual machine, along with the CPU, memory, I/O, and other billable resources it uses, might have a tag. Other useful tags might show which applications are in a production versus development environment, or to which cost center or department each is assigned. Collectively, these tags could constitute the total utilization of resources reflected in the bill.
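As a hypothetical, Azure-flavored illustration of tagging (the resource names and tag values are made up):

```powershell
# Tag a VM so its costs can be attributed to a cost center and environment
$vm = Get-AzResource -Name "sql1" -ResourceGroupName "sql-cluster-rg" `
      -ResourceType "Microsoft.Compute/virtualMachines"
Set-AzResource -ResourceId $vm.ResourceId -Tag @{ CostCenter = "DBA"; Environment = "Production" } -Force
```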
Organizations that make extensive use of public cloud services might also be well-served to create a script that loads information from all available monitoring, management, and tagging tools into a spreadsheet or similar application for detailed analyses and other uses, such as chargeback, compliance, and trending/budgeting. Ideally, information from all CSPs and the private cloud would be normalized for inclusion in a holistic view to enable optimizing price/performance for all applications running throughout the hybrid cloud.
Handling The Worst-Case Use Case: High Availability Applications
In addition to the reasons cited in the introduction for why high-availability applications are often the most costly, all three major CSPs—Google, Microsoft, and Amazon—have at least some high availability-related limitations. Examples include failovers normally being triggered only by zone outages and not by many other common failures; master instances only being able to create a single failover replica; and the use of event logs to replicate data, which creates a “replication lag” that can result in temporary outages during a failover.
None of these limitations is insurmountable, of course—with a sufficiently large budget. The challenge is finding a common and cost-effective solution for implementing high availability across public, private, and hybrid clouds. Among the most versatile and affordable of such solutions is the storage area network (SAN)-less failover cluster. These high-availability solutions are implemented entirely in software that is purpose-built to create, as implied by the name, a shared-nothing cluster of servers and storage with automatic failover across the local area network and/or WAN to assure high availability at the application level. Most of these solutions provide a combination of real-time block-level data replication, continuous application monitoring, and configurable failover/failback recovery policies.
Some of the more robust SAN-less failover clusters also offer advanced capabilities, such as WAN optimization to maximize performance and minimize bandwidth utilization, support for the less-expensive Standard Edition of SQL Server, manual switchover of primary and secondary server assignments for planned maintenance, and the ability to perform routine backups without disruption to the applications.
Maintaining The Proper Perspective
While trying out some of these suggestions in your hybrid cloud, endeavor to keep the monthly CSP bill in its proper perspective. With the public cloud, all costs appear on a single invoice. By contrast, the total cost to operate a private cloud is rarely presented in such a complete, consolidated fashion. And if it were, that total cost might also cause sticker shock. A useful exercise, therefore, might be to understand the all-in cost of operating the private cloud—taking nothing for granted—as if it were a standalone business such as that of a cloud service provider. Then those bills from the CSP for your mission-critical applications might not seem so shocking after all.
Article from www.dbta.com