August 18, 2020
Step-By-Step: How to configure a SANless MySQL Linux failover cluster in Amazon EC2
In this step-by-step guide, I will take you through all the steps required to configure a highly available, 2-node MySQL cluster (plus witness server) in Amazon’s Elastic Compute Cloud (Amazon EC2). The guide includes screenshots, shell commands, and code snippets as appropriate. I assume that you are somewhat familiar with Amazon EC2 and already have an account. If not, you can sign up today. I’m also going to assume that you have basic familiarity with Linux system administration and failover clustering concepts like Virtual IPs, etc.
Failover clustering has been around for many years. In a typical configuration, two or more nodes are configured with shared storage to ensure that in the event of a failover on the primary node, the secondary or target node(s) will access the most up-to-date data. Using shared storage not only enables a near-zero recovery point objective, it is a mandatory requirement for most clustering software. However, shared storage presents several challenges. First, it is a single point of failure. If the shared storage – typically a SAN – fails, all nodes in the cluster fail. Second, SANs can be expensive and complex to purchase, set up, and manage. Third, shared storage in public clouds, including Amazon EC2, is either not possible or not practical for companies that want to maintain high availability (99.99% uptime), near-zero recovery time and recovery point objectives, and disaster recovery protection.
The following demonstrates how easy it is to create a SANless cluster in the cloud to eliminate these challenges while meeting stringent HA/DR SLAs. The steps below use a MySQL database with Amazon EC2, but the same steps could be adapted to create a 2-node cluster in AWS to protect SQL Server, SAP, Oracle, or any other application.
NOTE: Your view of features, screens and buttons may vary slightly from screenshots presented below
This article will describe how to create a cluster within a single Amazon EC2 region. The cluster nodes (node1, node2 and the witness server) will reside in different Availability Zones for maximum availability. This also means that the nodes will reside in different subnets.
The following IP addresses will be used:
node1: 10.0.0.4 (Subnet1, Availability Zone us-west-2a)
node2: 10.0.1.4 (Subnet2, Availability Zone us-west-2b)
witness: 10.0.2.4 (Subnet3, Availability Zone us-west-2c)
Virtual IP: 10.1.0.10
Step 1: Create a Virtual Private Cloud (VPC)
First, create a Virtual Private Cloud (aka VPC). A VPC is an isolated network within the Amazon cloud that is dedicated to you. You have full control over things like IP address blocks and subnets, route tables, security groups (i.e. firewalls), and more. You will be launching your Amazon EC2 instances (VMs) into your VPC.
From the main AWS dashboard, select “VPC”
Under “Your VPCs”, make sure you have selected the proper region at the top right of the screen. In this guide the “US West (Oregon)” region will be used, because it is a region that has 3 Availability Zones. For more information on Regions and Availability Zones, click here.
Give the VPC a name, and specify the IP block you wish to use. 10.0.0.0/16 will be used in this guide:
You should now see the newly created VPC on the “Your VPCs” screen:
Step 2: Create an Internet Gateway
Next, create an Internet Gateway. This is required if you want your Instances (VMs) to be able to communicate with the internet.
On the left menu, select Internet Gateways and click the Create Internet Gateway button. Give it a name, and create:
Next, attach the internet gateway to your VPC:
Select your VPC, and click Attach:
Step 3: Create Subnets (Availability Zones)
Next, create 3 subnets. Each subnet will reside in its own Availability Zone. The 3 Instances (VMs: node1, node2, witness) will be launched into separate subnets (and therefore Availability Zones) so that the failure of an Availability Zone won’t take out multiple nodes of the cluster.
The US West (Oregon) region, aka us-west-2, has 3 availability zones (us-west-2a, us-west-2b, us-west-2c). Create 3 subnets, one in each of the 3 availability zones.
Under VPC Dashboard, navigate to Subnets, and then Create Subnet:
Give the first subnet a name (“Subnet1”), select the availability zone us-west-2a, and define the network block (10.0.0.0/24):
Repeat to create the second subnet in availability zone us-west-2b:
Repeat to create the third subnet in availability zone us-west-2c:
Once complete, verify that the 3 subnets have been created, each with a different CIDR block, and in separate Availability Zones, as seen below:
Step 4: Configure Route Tables
Update the VPC’s route table so that traffic to the outside world is sent to the Internet Gateway created in a previous step. From the VPC Dashboard, select Route Tables. Go to the Routes tab, and by default only one route will exist which allows traffic only within the VPC.
Add another route:
The Destination of the new route will be “0.0.0.0/0” (the internet) and for Target, select your Internet Gateway. Then click Save:
Next, associate the 3 subnets with the Route Table. Click the “Subnet Associations” tab, and Edit:
Check the boxes next to all 3 subnets, and Save:
Verify that the 3 subnets are associated with the main route table:
Later, we will come back and update the Route Table once more, defining a route that will allow traffic to communicate with the cluster’s Virtual IP, but this needs to be done AFTER the linux Instances (VMs) have been created.
Step 5: Configure Security Group
Edit the Security Group (a virtual firewall) to allow incoming SSH and VNC traffic. Both will later be used to configure the linux instances as well as installation/configuration of the cluster software.
On the left menu, select “Security Groups” and then click the “Inbound Rules” tab. Click Edit:
Add rules for both SSH (port 22) and VNC. VNC generally uses ports in the 5900 range, depending on how you configure it, so for the purposes of this guide, we will open the 5900-5910 port range. Configure accordingly based on your VNC setup:
Step 6: Launch Instances
We will be provisioning 3 Instances (Virtual Machines) in this guide. The first two VMs (called “node1” and “node2”) will function as cluster nodes with the ability to bring the MySQL database and its associated resources online. The 3rd VM will act as the cluster’s witness server for added protection against split-brain.
Go to the main AWS dashboard, and select EC2:
Create your first instance (“node1”). Click Launch Instance:
Select your linux distribution. The cluster software used later supports RHEL, SLES, CentOS and Oracle Linux. In this guide we will be using RHEL 7.X:
Size your instance accordingly. For the purposes of this guide and to minimize cost, t2.micro size was used because it’s free tier eligible. See here for more information on instance sizes and pricing.
Next, configure instance details. IMPORTANT: make sure to launch this first instance (VM) into “Subnet1“, and define an IP address valid for the subnet (10.0.0.0/24) – below 10.0.0.4 is selected because it’s the first free IP in the subnet.
Next, add an extra disk to the cluster nodes (this will be done on both “node1” and “node2”). This disk will store our MySQL databases and will later be replicated between nodes.
NOTE: You do NOT need to add an extra disk to the “witness” node. Only “node1” and “node2”. Add New Volume, and enter in the desired size:
Define a Tag for the instance, Node1:
Associate the instance with the existing security group, so the firewall rules created previously will be active:
IMPORTANT: If this is the first instance in your AWS environment, you’ll need to create a new key pair. The private key file will need to be stored in a safe location as it will be required when you SSH into the linux instances.
Repeat the steps above to create your second linux instance (node2). Configure it exactly like Node1. However, make sure that you deploy it into “Subnet2” (us-west-2b availability zone). The IP range for Subnet2 is 10.0.1.0/24, so an IP of 10.0.1.4 is used here:
Make sure to add a 2nd disk to Node2 as well. It should be the same exact size as the disk you added to Node1:
Give the second instance a tag: “Node2”:
Repeat the steps above to create your third linux instance (witness). Configure it exactly like Node1&Node2, EXCEPT you DON’T need to add a 2nd disk, since this instance will only act as a witness to the cluster, and won’t ever bring MySQL online.
Make sure that you deploy it into “Subnet3” (us-west-2c availability zone). The IP range for Subnet3 is 10.0.2.0/24, so an IP of 10.0.2.4 is used here:
NOTE: default disk configuration is fine for the witness node. A 2nd disk is NOT required:
Tag the witness node:
It may take a little while for your 3 instances to provision. Once complete, you’ll see them listed as running in your EC2 console:
Step 7: Create Elastic IP
Next, create an Elastic IP, which is a public IP address that will be used to connect into your instance from the outside world. Select Elastic IPs in the left menu, and then click “Allocate New Address”:
Select the newly created Elastic IP, right-click, and select “Associate Address”:
Associate this Elastic IP with Node1:
Repeat this for the other two instances if you want them to have internet access or be able to SSH/VNC into them directly.
Step 8: Create Route Entry for the Virtual IP
At this point all 3 instances have been created, and the route table will need to be updated one more time in order for the cluster’s Virtual IP to work. In this multi-subnet cluster configuration, the Virtual IP needs to live outside the range of the CIDR allocated to your VPC.
Define a new route that will direct traffic destined for the cluster’s Virtual IP (10.1.0.10) to the primary cluster node (Node1).
From the VPC Dashboard, select Route Tables, click Edit. Add a route for “10.1.0.10/32” with a destination of Node1:
Step 9: Disable Source/Dest Checking for ENIs
Next, disable Source/Dest Checking for the Elastic Network Interfaces (ENI) of your cluster nodes. This is required in order for the instances to accept network packets for the virtual IP address of the cluster.
Do this for all ENIs.
Select “Network Interfaces”, right-click on an ENI, and select “Change Source/Dest Check”.
Step 10: Obtain Access Key ID and Secret Access Key
Later in the guide, the cluster software will use the AWS Command Line Interface (CLI) to manipulate a route table entry for the cluster’s Virtual IP to redirect traffic to the active cluster node. In order for this to work, you will need to obtain an Access Key ID and Secret Access Key so that the AWS CLI can authenticate properly.
In the top-right of the EC2 Dashboard, click on your name, and underneath select “Security Credentials” from the drop-down:
Expand the “Access Keys (Access Key ID and Secret Access Key)” section of the table, and click “Create New Access Key”. Download Key File and store the file in a safe location.
Step 11: Configure Linux OS
Connect to the linux instance(s):
To connect to your newly created linux instances (via SSH), right-click on the instance and select “Connect”. This will display the instructions for connecting to the instance. You will need the Private Key File you created/downloaded in a previous step:
Here is where we will leave the EC2 Dashboard for a little while and get our hands dirty on the command line, which as a Linux administrator you should be used to by now.
You aren’t given the root password to your Linux VMs in AWS (or the default “ec2-user” account either), so once you connect, use the “sudo” command to gain root privileges:
Unless you already have a DNS server set up, you’ll want to create host file entries on all 3 servers so that they can properly resolve each other by name. Edit /etc/hosts.
Add the following lines to the end of your /etc/hosts file:
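Based on the IP addresses used in this guide, the entries would be:

```
10.0.0.4    node1
10.0.1.4    node2
10.0.2.4    witness
```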
Edit /etc/sysconfig/selinux and set “SELINUX=disabled”:
By default, these Linux instances will have a hostname that is based upon the server’s IP address, something like “ip-10-0-0-4.us-west-2.compute.internal”
You might notice that if you attempt to modify the hostname the “normal” way (i.e. editing /etc/sysconfig/network, etc.), it reverts back to the original after each reboot! I found a great thread in the AWS discussion forums that describes how to actually get hostnames to remain static across reboots.
Details here: https://forums.aws.amazon.com/message.jspa?messageID=560446
Comment out modules that set hostname in “/etc/cloud/cloud.cfg” file. The following modules can be commented out using #.
Next, also change your hostname in /etc/hostname.
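On a typical RHEL 7.x cloud-init configuration, the relevant modules are set_hostname and update_hostname (module names can vary with cloud-init versions, so treat this as a sketch):

```
# In /etc/cloud/cloud.cfg, comment out the hostname modules:
cloud_init_modules:
# - set_hostname
# - update_hostname

# Then set the desired name in /etc/hostname, e.g. on the first node:
node1
```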
Reboot Cluster Nodes
Reboot all 3 instances so that SELinux is disabled, and the hostname changes take effect.
Install and Configure VNC (and related packages)
In order to access the GUI of our linux servers, and to later install and configure our cluster, install VNC server, as well as a handful of other required packages (cluster software needs the redhat-lsb and patch rpms).
The following URL is a great guide to getting VNC Server running on RHEL 7 / CentOS 7. For RHEL 7.x/CentOS 7.x:
NOTE: This example configuration runs VNC on display 2 (:2, aka port 5902) and as root (not secure). Adjust accordingly!
For RHEL/CentOS 6.x systems:
Open a VNC client, and connect to <ElasticIP>:2. If you can’t connect, it’s likely your linux firewall is in the way. Either open the VNC port we are using here (port 5902), or for now, disable the firewall (NOT RECOMMENDED FOR PRODUCTION ENVIRONMENTS):
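On RHEL/CentOS 7.x with firewalld, the commands might look like this (a sketch; adjust the port to match your VNC display):

```shell
# Open the VNC port used in this guide:
firewall-cmd --permanent --add-port=5902/tcp
firewall-cmd --reload

# Or, temporarily disable the firewall (NOT recommended for production):
systemctl stop firewalld
systemctl disable firewalld
```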
Partition and Format the “data” disk
When the linux instances were launched, an extra disk was added to each cluster node to store the application data we will be protecting. In this case it happens to be MySQL databases.
The second disk should appear as /dev/xvdb. You can run the “fdisk -l” command to verify. You’ll see that /dev/xvda (the OS disk) is already partitioned, while /dev/xvdb is not:

# fdisk -l
Disk /dev/xvda: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
#  Start  End   Size  Type  Name
1  2048   4095  1M    BIOS boot partition
Here I will create a partition (/dev/xvdb1), format it, and mount it at the default location for MySQL, which is /var/lib/mysql:
# mkfs.ext4 /dev/xvdb1
On node1, mount the filesystem:
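The mount step might look like this (assuming /var/lib/mysql as the mount point, consistent with the replication setup later in this guide):

```shell
# Create the mount point if it doesn't exist yet, then mount the new filesystem:
mkdir -p /var/lib/mysql
mount /dev/xvdb1 /var/lib/mysql
```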
Step 12: Install EC2 API Tools
The EC2 API Tools (EC2 CLI) must be installed on each of the cluster nodes, so that the cluster software can later manipulate Route Tables, enabling connectivity to the Virtual IP.
The following URL is an excellent guide to setting this up:
Here are the key steps:
If java isn’t already installed (run “which java” to check), install it:
Example (Based on default config of RHEL 7.2 system. Adjust accordingly)
You’ll need your AWS Access Key and AWS Secret Key. Keep these values handy, because they will be needed later during cluster setup too! Refer to the following URL for more information:
Test CLI utility functionality:
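For example, listing the available regions is a quick smoke test (ec2-describe-regions is part of the classic EC2 API tools; output will vary based on your account and credentials):

```shell
# If this returns a list of regions, the CLI and credentials are working:
ec2-describe-regions
```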
Step 13: Install and Configure MySQL
Next, install the MySQL packages, initialize a sample database, and set “root” password for MySQL. In RHEL7.X, the MySQL packages have been replaced with the MariaDB packages.
Create a MySQL configuration file. We will place this on the data disk that will later be replicated (/var/lib/mysql):
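A minimal configuration might look like the following. The exact path (/var/lib/mysql/my.cnf) and settings are illustrative; the key point is that the file lives on the replicated disk so both nodes see the same configuration:

```
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
```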
Move the original MySQL configuration file aside, if it exists:
On “node2”, you ONLY need to install the MariaDB/MySQL packages. The other steps aren’t required. On “node2”:
Step 14: Install and Configure the Cluster
At this point, we are ready to install and configure our cluster. SIOS Protection Suite for Linux (aka SPS-Linux) will be used in this guide as the clustering technology. It provides both high availability failover clustering features (LifeKeeper) as well as real-time, block level data replication (DataKeeper) in a single, integrated solution. SPS-Linux enables you to deploy a “SANLess” cluster, aka a “shared nothing” cluster meaning that cluster nodes don’t have any shared storage, as is the case with EC2 Instances.
Install SIOS Protection Suite for Linux
Perform the following steps on ALL 3 VMs (node1, node2, witness):
Download the SPS-Linux installation image file (sps.img) and obtain either a trial license or purchase permanent licenses. Contact SIOS for more information.
You will loopback mount it and run the “setup” script inside, as root (or first “sudo su -” to obtain a root shell) For example:
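The mount-and-run sequence might look like this (the mount point is illustrative):

```shell
# Loopback-mount the installation image and run the setup script as root:
mkdir -p /mnt/sps
mount -o loop sps.img /mnt/sps
cd /mnt/sps
./setup
```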
During the installation script, you’ll be prompted to answer a number of questions. You will hit Enter on almost every screen to accept the default values. Note the following exceptions:
Install Witness/Quorum package
The Quorum/Witness Server Support Package for LifeKeeper (steeleye-lkQWK) combined with the existing failover process of the LifeKeeper core allows system failover to occur with a greater degree of confidence in situations where total network failure could be common. This effectively means that failovers can be done while greatly reducing the risk of “split-brain” situations.
Install the Witness/Quorum rpm on all 3 nodes (node1, node2, witness):
On ALL 3 nodes (node1, node2, witness), edit /etc/default/LifeKeeper, set NOBCASTPING=1
Install the EC2 Recovery Kit Package
SPS-Linux provides specific features that allow resources to failover between nodes in different availability zones and regions. Here, the EC2 Recovery Kit (i.e. cluster agent) is used to manipulate Route Tables so that connections to the Virtual IP are routed to the active cluster node.
Install the EC2 rpm (node1, node2):
Install a License key
On all 3 nodes, use the “lkkeyins” command to install the license file that you obtained from SIOS:
On all 3 nodes, use the “lkstart” command to start the cluster software:
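Using the /opt/LifeKeeper/bin path seen elsewhere in this guide, the commands would look like this (the license file path is illustrative):

```shell
# Install the license, then start the cluster software:
/opt/LifeKeeper/bin/lkkeyins /tmp/evaluation.lic
/opt/LifeKeeper/bin/lkstart
```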
Set User Permissions for LifeKeeper GUI
On all 3 nodes, create a new linux user account (i.e. “tony” in this example). Edit /etc/group and add the “tony” user to the “lkadmin” group to grant access to the LifeKeeper GUI. By default only “root” is a member of the group, and we don’t have the root password here:
Open the LifeKeeper GUI
Make a VNC connection to the Elastic IP (Public IP) address of node1. Based on the VNC configuration from above, you would connect to <Public_IP>:2 using the VNC password you specified earlier. Once logged in, open a terminal window and run the LifeKeeper GUI using the following command:
You will be prompted to connect to your first cluster node (“node1”). Enter the linux userid and password specified during VM creation:
Next, connect to both “node2” and “witness” by clicking the “Connect to Server” button highlighted in the following screenshot:
You should now see all 3 servers in the GUI, with a green checkmark icon indicating they are online and healthy:
Create Communication Paths
Right-click on “node1” and select Create Comm Path
Select BOTH “node2” and “witness” and then follow the wizard. This will create comm paths between:
node1 <—> node2
node1 <—> witness
node2 <—> witness
The icons in front of the servers have changed from a green “checkmark” to a yellow “hazard sign”. This is because we only have a single communication path between nodes.
If the VMs had multiple NICs (information on creating EC2 instances with multiple network interfaces can be found here, but won’t be covered in this article), you would create redundant comm paths between each server.
Verify Communication Paths
Use the “lcdstatus” command to view the state of cluster resources. Run the following commands to verify that you have correctly created comm paths on each node to the other two servers involved:
# /opt/LifeKeeper/bin/lcdstatus -q -d node1
MACHINE  NETWORK  ADDRESSES/DEVICE    STATE  PRIO
node2    TCP      10.0.0.4/10.0.1.4   ALIVE  1
witness  TCP      10.0.0.4/10.0.2.4   ALIVE  1

# /opt/LifeKeeper/bin/lcdstatus -q -d node2
MACHINE  NETWORK  ADDRESSES/DEVICE    STATE  PRIO
node1    TCP      10.0.1.4/10.0.0.4   ALIVE  1
witness  TCP      10.0.1.4/10.0.2.4   ALIVE  1

# /opt/LifeKeeper/bin/lcdstatus -q -d witness
MACHINE  NETWORK  ADDRESSES/DEVICE    STATE  PRIO
node1    TCP      10.0.2.4/10.0.0.4   ALIVE  1
node2    TCP      10.0.2.4/10.0.1.4   ALIVE  1
Create a Data Replication cluster resource (i.e. Mirror)
Next, create a Data Replication resource to replicate the /var/lib/mysql partition from node1 (source) to node2 (target). Click the “green plus” icon to create a new resource:
After the resource has been created, the “Extend” (i.e. define backup server) wizard will appear.
Use the following selections:
The cluster will look like this:
Create Virtual IP
Next, create a Virtual IP cluster resource. Click the “green plus” icon to create a new resource:
Extend the IP resource with these selections:
The cluster will now look like this, with both Mirror and IP resources created:
Configure a Ping List for the IP resource
By default, SPS-Linux monitors the health of IP resources by performing a broadcast ping. In many virtual and cloud environments, broadcast pings don’t work. In a previous step, we set “NOBCASTPING=1” in /etc/default/LifeKeeper to disable that behavior. Instead, we will configure a Ping List: a list of IP addresses to be pinged during IP health checks for this IP resource.
In this guide, we will add the witness server (10.0.2.4) to our ping list.
Right-click on the IP resource (ip-10.1.0.10) and select Properties:
You will see that initially, no ping list is configured for our 10.1.0.0 subnet. Click “Modify Ping List”:
Enter “10.0.2.4” (the IP address of our witness server), click “Add address” and finally click “Save List”:
Create the MySQL resource hierarchy
Next, create a MySQL cluster resource. The MySQL resource is responsible for stopping/starting/monitoring of your MySQL database.
Before creating MySQL resource, make sure the database is running. Run “ps -ef | grep sql” to check.
If it’s running, great – nothing to do. If not, start the database back up:
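On RHEL 7.x, where MariaDB replaces MySQL, starting the database with the configuration file created earlier might look like this (the config path is the assumed /var/lib/mysql/my.cnf; once the cluster resource exists, LifeKeeper will manage start/stop):

```shell
# Start MySQL/MariaDB manually using the replicated configuration file:
mysqld_safe --defaults-file=/var/lib/mysql/my.cnf &
```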
To create the MySQL resource, click the “green plus” icon to create a new resource, then follow the wizard with these selections:
Extend the MySQL resource with the following selections:
As a result, your cluster will look as follows. Notice that the Data Replication resource was automatically moved underneath the database (dependency automatically created) to ensure it’s always brought online before the database:
Create an EC2 resource to manage the route tables upon failover
SPS-Linux provides specific features that allow resources to failover between nodes in different availability zones and regions. Here, the EC2 Recovery Kit (i.e. cluster agent) is used to manipulate Route Tables so that connections to the Virtual IP are routed to the active cluster node.
To create, click the “green plus” icon to create a new resource:
Extend the EC2 resource with the following selections:
The cluster will look like this. Notice how the EC2 resource is underneath the IP resource:
Create a Dependency between the IP resource and the MySQL Database resource
Create a dependency between the IP resource and the MySQL Database resource so that they failover together as a group. Right click on the “mysql” resource and select “Create Dependency”:
On the following screen, select the “ip-10.1.0.10” resource as the dependency. Click Next and continue through the wizard:
At this point the SPS-Linux cluster configuration is complete. The resource hierarchy will look as follows:
Step 15: Test Cluster Connectivity
At this point, all of our Amazon EC2 and Cluster configurations are complete! Cluster resources are currently active on node1:
Test connectivity to the cluster from the witness server (or another linux instance if you have one) SSH into the witness server, “sudo su -” to gain root access. Install the mysql client if needed:
Test MySQL connectivity to the cluster:
Execute the following MySQL query to display the hostname of the active cluster node:
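Connecting through the Virtual IP and querying the hostname might look like this (with resources active on node1, the query should return “node1”):

```shell
# Connect through the Virtual IP and ask MySQL which host is serving the session:
mysql --host=10.1.0.10 --user=root --password -e "select @@hostname;"
```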
Using the LifeKeeper GUI, fail over from Node1 to Node2. Right-click on the mysql resource underneath node2, and select “In Service…”:
After failover has completed, re-run the MySQL query. You’ll notice that the MySQL client has detected that the session was lost (during failover) and automatically reconnects:
Execute the following MySQL query to display the hostname of the active cluster node, verifying that now “node2” is active:
Reproduced with permission from SIOS
August 2, 2020
Why is AWS EC2 Application Monitoring So Hard?
Congratulations! You’ve migrated your core applications to the AWS cloud. Or, you are developing new “cloud-native” applications and hosting them in the cloud. Perhaps you are taking advantage of Amazon EC2’s scalability and its elastic architecture. Either way, you now want to ensure that those applications stay up and running, or that you are alerted quickly if and when something happens.
Because something will happen. Our customer data shows that companies using only three EC2 instances experience downtime at least once a month. That means unhappy users unable to access their applications. You need a monitoring solution to tell you what’s going on.
How to narrow down EC2 application monitoring solutions
The first step in your search for the perfect EC2 monitoring solution should be to understand your requirements and your own technical capabilities. Monitoring solutions are not all alike.
Are you interested in a feature-rich solution that monitors a wide array of systems? Or one that focuses on a core set of systems, such as your EC2 environment?
What do you want to do with the output from your application monitoring solution? Do you want as much information as possible to help your developers troubleshoot issues? Or are you looking for quick alerts and assistance in remediating any failures?
And what is your technical appetite to install and manage another application? Do you love scripting? Or do you want something that is “set-it-and-forget-it”?
A search for “application performance monitoring solutions” on Google returns 1,170,000,000 results! Jump into the Amazon AWS Marketplace and you’ll find 453 products listed in the DevOps – Monitoring category. Having a clear sense of your requirements and your own technical capabilities will help you narrow down your search.
Monitoring applications running on Amazon EC2 with Amazon CloudWatch or other APM solutions
If you are hosting your applications on Amazon EC2, then you might consider using Amazon CloudWatch. How familiar are you with standard and custom metrics? You should know that you need quite a lot of technical expertise to run Amazon CloudWatch properly. Amazon CloudWatch is a great solution for users who need data and actionable insights to respond to system-wide performance changes, optimize resources, and get a unified view of their operational health. But this all comes at a price in terms of the knowledge and experience needed to configure and manage Amazon CloudWatch properly.
Another choice is for you to evaluate and acquire one of the many commercially available application performance monitoring (“APM”) solutions on the market, such as from AppDynamics, Datadog, Dynatrace, or New Relic. But keep in mind your requirements. How broadly do you need to monitor? And what do you intend to do with that information? Are you ready to be overwhelmed with alerts? And be aware that many APM solutions do nothing to help you recover beyond pinpointing the issue. You still have to drop everything to manually restart services or reboot your instances.
Monitor applications running on Amazon EC2 using SIOS AppKeeper
But there is another way. SIOS AppKeeper is a SaaS service that can be configured to automatically discover your EC2 instances and their services. It then automatically takes any number of actions if and when downtime is experienced. So instead of getting an alert that something is wrong, you get notified that something happened and was automatically addressed.
SIOS AppKeeper starts at only US $40 per instance per month. We invite you to view this short video to see how easy it is to install and use AppKeeper.
One of our customers, Hobby Japan, a publishing company in Tokyo, was initially using Amazon CloudWatch but their understaffed IT team couldn’t respond fast enough to alerts. They wanted to leverage automation and moved to SIOS AppKeeper. Since moving to AppKeeper they haven’t experienced any issues or unexpected downtime with their EC2 instance. Here’s a link to a case study on Hobby Japan.
Monitoring your cloud applications shouldn’t be a full-time job. You want a monitoring solution that is easy to install and use, doesn’t overwhelm you with alerts, and hopefully takes care of systems impairments automatically. We encourage you to try a 14-day free trial of SIOS AppKeeper by signing up here.
July 30, 2020
Planning is Key to Enterprise Availability (and to a Happy Marriage)
Planning dates, getaways, and fabulously romantic dinners is a great part of loving your spouse well. Seminars and workshops overflowing with tips for improving your relationship abound in nearly every area of the world.
But listen in on the training session provided by SIOS Technology Corp. Project Manager for Professional Services, Edmond Melkomian, and you’ll quickly learn that planning dinners and anniversary retreats aren’t the only way to love your spouse well.
In a recent class on SIOS Protection Suite for Linux, Edmond shared three tips that help you love your spouse well in an enterprise world: plan, plan, plan.
1. “Plan to plan” your enterprise availability solution
In his course, Edmond Melkomian asked students to name the first thing you should do when deploying an enterprise solution. His answer: “Plan, plan, plan.” It seems obvious, but the first step is to start making the plan. A fairly decent start for a plan includes developing the details for each of the project phases, such as milestones, checkpoints, risks, risk mitigation strategies, stakeholders, timelines, and stakeholder communication plans. A decent plan will also include details about kickoff, sign-off and closure, and resources (staffing, management, legal/contracts).
Plan to create, review, modify, and update your plan throughout the solution lifecycle.
2. Plan what to deploy for enterprise availability
Plan what to deploy. It is likely that a large portion of your enterprise infrastructure exists beyond the realm of the current team’s lifespan with your company. As you migrate to the cloud, or update your availability strategy, it is worth the time and effort to make a plan regarding what to deploy. Focus your plan on ensuring that you deploy redundancy at all critical components, network, compute, storage, power, cooling, and applications. All data centers and cloud providers typically ensure cooling, power, and network redundancy to start.
A number of firms offer architectural teams, cloud solution providers, availability experts, application architects, and migration specialists who help teams discover critical and sometimes hidden dependencies as well as high risk areas vulnerable to Single Points of Failure (SPOF’s). This investigative work will feed into your plan of what to deploy and/or update in your availability strategy.
Plan on reviewing what you need to deploy.
3. Plan to keep a QA/pre-production cluster for reliable availability
When I was on the SIOS Technology Corp. development team, I’ll never forget a Friday night call with a long-time but frantic customer. Earlier in the month, the customer had unsuccessfully deployed a new software solution into a production environment. The result was a massive failure. He called our 800 number at 4:30pm (EST) on Friday. Why do I recall that exact time? Friday was date night. My wife and I had dinner plans, a babysitter for the six girls on standby (by the hour), and hopes for a romantic and relaxing evening. I was just about to head out for the day when the phone rang. After a tense first hour, we were back up and running. This unfortunate episode could have been avoided or mitigated by keeping a UAT or QA system on hand.
As Harrison Howell, the Software Engineer for Customer Experience at SIOS Technology Corp., noted in his blog post 6-common-cloud-migration-challenges, the limits of on-prem no longer apply in the cloud.
Customers coming from an on-prem system need to remember that resources are no longer a limiting factor. In the cloud, systems can be effortlessly copied and run in isolation of production, something not trivial on-premises. On-demand access to IT resources allows UAT of HA and DR to expand beyond “shut down the primary node”. Networks can be sabotaged, kernels can be panicked, even databases can be corrupted and none of this will impact production! Identifying and testing these scenarios improves HA and DR posture.
Plan on deploying and keeping a UAT system for HA and DR testing. As Harrison mentions, “identifying and testing [these scenarios] improves [your] HA and DR posture,” and that improves your chances of a successful date night.
4. Plan regular maintenance and updates (including documentation)
Lastly, plan time for regular maintenance and updates to maintain enterprise availability. Your enterprise needs to remain highly available to remain highly profitable and successful. Environments don’t remain stagnant: patches, security updates, expansion, and general maintenance are a regular occurrence from inception to retirement. Creating a plan for how and when you will incorporate updates and maintenance into your enterprise will ensure that you not only stay up to date, but that you minimize risk and downtime while doing it. Be sure to include in your plan the use of a test system. Develop a planned routine and process for validating patches, kernel and OS updates, and security software, and don’t forget to update the project documentation and future plans as you go and grow.
If you can remember to plan for a highly redundant, highly reliable, and highly available system upfront, plan to keep a QA/pre-production cluster after go-live, and plan for regular maintenance and updates, you will also be able to keep your plans with your spouse for date night. And not just date night: you’ll also be able to keep your nights free from 3am wake-up calls due to down production systems. This is my tip for loving your spouse well.
I love my wife and so I help customers deploy SIOS Technology Corp.’s DataKeeper Cluster Edition and SIOS Protection Suite for Windows and Linux products as a part of a highly available enterprise protection solution. Contact SIOS.
— Cassius Rhue, VP, Customer Experience
Article reproduced with permission from SIOS
|July 22, 2020||
Backup, replication, and high availability (HA) clustering are fundamental parts of IT risk management, and they are as indispensable as the wheels on a car. Replication is also essential to IT data protection.
Backup and HA Cluster Environments Are Not Mutually Exclusive
While backup, replication, and failover are all important, there are key distinctions among them that need to be understood to ensure they are applied properly.
For example, while you can use replication to maintain a continuously up-to-date copy of data, without considering it in the larger data protection environment, you will also copy problem data (such as virus-infected data).
In such cases, a backup is essential to bring the data back to the last known good point. Replication, on the other hand, gives you immediate access to an image of the data from just before the system failure (that is, superior RTO and RPO), in a way that simply storing data generation by generation for eDiscovery-style retrieval cannot.
Therefore, SIOS Protection Suite includes both SIOS LifeKeeper clustering software and DataKeeper replication software. SIOS LifeKeeper is an HA failover cluster product that monitors application health and orchestrates application failover and DataKeeper is block-based storage replication software. However, just because it is an HA cluster does not mean that backup is unnecessary. Consider the precautions and points to note when backing up in an HA cluster environment using SIOS Protection Suite.
Five Points of Backup in a High Availability Clustering Environment
Consider the following five points as the target of backup acquisition:
Backup the OS
To back up the OS, it is common to use a standard OS utility or third-party backup software. Since this requires no special consideration for a high availability environment, we will not cover it here.
Backup the SIOS Protection Suite Clustering Software
The SIOS LifeKeeper and DataKeeper programs included in SIOS Protection Suite can also be backed up with an OS-standard utility or third-party backup software. If you do not back them up and the programs are lost due to a disk failure or similar event, you will need to reinstall them; some administrators may decide that reinstallation is an acceptable alternative to backing up the program files.
Backup the SIOS Protection Suite Configuration Information
SIOS LifeKeeper comes with a simple command called lkbackup that enables you to back up its configuration information. lkbackup can be run while SIOS LifeKeeper and its related resources are in service, and it will not impact running applications.
This command is typically executed in three main cases. If you back up the configuration information with lkbackup, then even if the configuration is lost due to a disk failure, or corrupted due to an operational mistake, you can quickly return to the original operational state.
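The lkbackup step could be scripted along the following lines. This is a hedged sketch, not an official SIOS procedure: it assumes LifeKeeper's default install path of /opt/LifeKeeper and the documented -c (create archive) and -f (archive file) options of lkbackup; the backup path is a placeholder.

```shell
#!/bin/sh
# Hypothetical sketch: archive the LifeKeeper configuration with lkbackup
# before maintenance. Assumes the default install path /opt/LifeKeeper;
# /var/backups is a placeholder destination.
LKBIN=/opt/LifeKeeper/bin
BACKUP=/var/backups/lk-config-$(date +%Y%m%d).tar.gz

if [ -x "$LKBIN/lkbackup" ]; then
    # Create a configuration archive; protected services keep running.
    "$LKBIN/lkbackup" -c -f "$BACKUP"
else
    echo "lkbackup not found; is SIOS LifeKeeper installed?" >&2
fi
```

Running a script like this before every planned change gives you a dated archive to restore from if a configuration change goes wrong.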
Backup Operational Programs
Backing up operational programs means backing up the business applications protected by your HA cluster. As with the OS and the clustering software above, you can create and restore a backup image using an OS-standard utility or third-party backup software.
Backup Business Application Data
In an HA cluster environment, shared storage that can be accessed by both the active and standby servers is provided. During normal operation, the shared storage is used by the active cluster node. Application data (for example, database data) is usually stored in this shared storage, but the following points should be kept in mind when backing it up.
For shared storage configuration
When backing up data in a cluster configuration where storage is shared by the active and standby nodes, the data can only be accessed from the active node (the standby node cannot access it). As a result, the backup must also run on the active node. In this case, ensure that the active node has sufficient processing power to handle backup activity alongside a potential failover.
For data replication configuration
In a data replication configuration, backups are normally taken on the active node. However, by temporarily pausing the mirror and releasing the lock on the target volume, you can also run the backup on the standby node. Note that in this case the standby's data is temporarily out of sync with the active node until the mirror is resumed and resynchronized.
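A standby-side backup in a replicated configuration could be sketched as follows. This is a hypothetical outline only: it assumes SIOS DataKeeper for Linux's mirror_action command under the default /opt/LifeKeeper/bin path, and "datarep-mysql" is a placeholder mirror tag; consult the SIOS documentation for the exact procedure in your environment.

```shell
#!/bin/sh
# Hypothetical sketch: pause the DataKeeper mirror so the standby's copy of
# the data can be backed up, then resume replication afterwards.
# "datarep-mysql" is a placeholder mirror tag; paths assume a default install.
LKBIN=/opt/LifeKeeper/bin
TAG=datarep-mysql

if [ -x "$LKBIN/mirror_action" ]; then
    "$LKBIN/mirror_action" "$TAG" pause    # stop replication; target becomes accessible
    # ... mount the target volume read-only and run the backup here ...
    "$LKBIN/mirror_action" "$TAG" resume   # resync; standby is briefly out of date
else
    echo "mirror_action not found; is SIOS DataKeeper installed?" >&2
fi
```

The window between pause and resume is exactly the "temporarily out of sync" period described above, so keep it as short as the backup allows.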
Backing up a cluster node from an external backup server
To perform a cluster node backup from an external backup server, use either the virtual or real IP address of the cluster node. The points to note in each case are as follows.
Backing up using the virtual IP address of a cluster node
From the backup server’s perspective, the backup is executed against the node indicated by LifeKeeper’s virtual IP address. In this case, the backup server does not need to be aware of which node is the active node.
Backing up using the real IP address of the cluster node
From the backup server’s perspective, the backup is performed against the node’s real IP address, without using LifeKeeper’s virtual IP address. Since the shared storage cannot be accessed from the standby cluster node, the backup server and client must determine which node is currently active.
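The advantage of the virtual-IP approach can be illustrated with a simple pull-style sketch. This is a hedged example, not a SIOS-specific tool: the hostname "mysql-vip.example.com" and the paths are placeholders, and it uses plain rsync to pull data from whichever node currently holds the virtual IP.

```shell
#!/bin/sh
# Hypothetical sketch: the backup server pulls data over the cluster's
# virtual IP, so it always reaches whichever node is currently active.
# "mysql-vip.example.com" and the paths below are placeholder values.
VIP=mysql-vip.example.com
SRC=/var/lib/mysql
DEST=/backups/mysql/$(date +%Y%m%d)

CMD="rsync -az ${VIP}:${SRC}/ ${DEST}/"
echo "Would run: $CMD"    # dry run; replace the echo with the command itself to execute
```

With the real-IP approach, the script would instead need an extra step to query the cluster and pick the active node's address before running the transfer, which is exactly the bookkeeping the virtual IP avoids.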
When combining backup, replication, and failover clustering, a well-tested and verified configuration is indispensable. Be sure to perform sufficient operational verification in advance on your side before relying on it in production.
Reproduced with permission from SIOS