SIOS SANless clusters


How To Trigger Email Alerts From Windows Performance Monitor

November 10, 2018 by Jason Aw


Windows Performance Counter Alerts can be configured to fire on any Performance Monitor (Perfmon) counter through the use of a user-defined Data Collector Set. However, if you wish to be notified via email when an alert is triggered, you have to use a combination of Perfmon, Task Scheduler and good ol' PowerShell. Follow the steps below to trigger email alerts from Windows Performance Monitor.

Step 1 – Write A PowerShell Script

The first thing you need to do is write a PowerShell script that, when run, sends an email. While researching this I discovered many ways to accomplish the task. What I'm about to show you is just one way; feel free to experiment and use what is right for your environment.

In my lab, I do not run my own SMTP server, so I wrote a script that leverages my Gmail account. You will see in my PowerShell script that the password of the email account that authenticates to the SMTP server is in plain text. If you are concerned that someone may gain access to your script and discover your password, you will want to encrypt your credentials. Gmail requires an SSL connection, so your password should be safe on the wire, just like with any other email client.
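If you do decide to encrypt your credentials, one common approach (a sketch, not part of the original article; the C:\Alerts\smtp.cred path is my own placeholder) is to serialize a PSCredential with Export-Clixml, which protects the password with DPAPI for the current user on the current machine:

# One-time setup: run this as the same account that will run the scheduled task,
# because only that user on this machine can decrypt the file
Get-Credential | Export-Clixml -Path C:\Alerts\smtp.cred

# Then, in the alert script, load the credential instead of hard-coding it
$cred = Import-Clixml -Path C:\Alerts\smtp.cred
$SMTPClient.Credentials = $cred.GetNetworkCredential()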

Here is an example of a PowerShell script that, when used in conjunction with Task Scheduler and Perfmon, sends an email alert automatically when any user-defined performance counter threshold condition is met. In my environment I saved it as C:\Alerts\Alerts.ps1.

# Perfmon passes the alert details to this script as five arguments
$counter = $Args[0]      # counter alert name (not used in the message body)
$dtandtime = $Args[1]    # date and time of the alert
$ctr_value = $Args[2]    # the performance counter that fired
$threshold = $Args[3]    # the threshold that was exceeded
$value = $Args[4]        # the counter's current value
$FileName = "$env:ComputerName"
$EmailFrom = "sios@medfordband.com"
$EmailTo = "dave@medfordband.com"
$Subject = "Alert From $FileName"
$Body = "Date and Time of Alert: $dtandtime`nPerfmon Counter: $ctr_value`nThreshold Value: $threshold`nCurrent Value: $value"
$SMTPServer = "smtp.gmail.com"
# Authenticate to Gmail over TLS on port 587 and send the message
$SMTPClient = New-Object Net.Mail.SmtpClient($SMTPServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "ChangeMe123")
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)
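To test the script before wiring it up to Task Scheduler, you can invoke it by hand with five sample arguments (the values below are purely hypothetical):

PS C:\> C:\Alerts\Alerts.ps1 "alert" "11/10/2018 12:00:00 PM" "\Processor(_Total)\% Processor Time" "90" "97"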

An email generated by that PowerShell script shows the date and time of the alert, the Perfmon counter, the threshold value, and the current value.

You probably noticed that this PowerShell script takes five arguments and assigns them to variables used in the output. It also saves the computer name to a variable that is used in the subject line. Because of this, the script can be used to send an email for any Perfmon alerting counter, on any server, without additional customization.

Step 2 – Set Up A Scheduled Task

In Task Scheduler, we are going to create a new Task.


Give the Task a name; you will need to remember it for Step 3.


Notice that there are no Triggers. This Task will actually be triggered through the Perfmon Counter Alert which we will set up in Step 3.


You want to define a new action on the Actions tab. The action will be to start a program with the following inputs. Please adjust for your specific environment.

Program/script: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
Add arguments: -File C:\Alerts\Alerts.ps1 $(Arg0)
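If you would rather script this step than click through the GUI, the ScheduledTasks PowerShell module (Windows Server 2012 and later) can create an equivalent task. A minimal sketch, assuming you named the task EmailAlert:

# Build the action; single quotes keep $(Arg0) literal so Task Scheduler,
# not PowerShell, substitutes the Perfmon alert text at run time
$action = New-ScheduledTaskAction `
    -Execute 'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe' `
    -Argument '-File C:\Alerts\Alerts.ps1 $(Arg0)'
Register-ScheduledTask -TaskName 'EmailAlert' -Action $action -User 'SYSTEM'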


Step 3 – Create The Performance Counter

Create a new Data Collector Set


Add whichever Performance Counters you would like to monitor and set the Alerting threshold.


Once you have created the Data Collector Set, go into its Properties and make sure the alert threshold and sample interval are set properly for each Performance Counter. Keep in mind that if you sample every 10 seconds, you should expect to receive an email every 10 seconds for as long as the performance counter exceeds the threshold you set.
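The Data Collector Set alert can also be created from the command line with logman. A sketch, assuming a collector named Alerts that fires when total CPU exceeds 90%, sampled every 10 seconds; run "logman create alert /?" to confirm the flags on your version of Windows:

logman create alert Alerts -th "\Processor(_Total)\% Processor Time>90" -si 00:00:10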


If you select the "Log an entry in the application event log" option, don't expect to see any entries in the normal Application event log. They are written to the Microsoft-Windows-Diagnosis-PLA/Operational log under Applications and Services Logs.


Finally, we have to set an Alert Task that triggers the Scheduled Task (EmailAlert) we created in Step 2. Notice that we also pass some Task arguments, which the PowerShell script uses to describe in the email the exact error condition associated with the Alert.


Once the Data Collector is configured properly you will want to start it.


If you configured everything correctly you should start seeing emails any time an alert threshold is met.

If it doesn’t seem to be working, check the following…

  • Run the PowerShell script manually to make sure it works. You may need to set some of the variables by hand for testing purposes. In my case it took a little tweaking to get the script to work properly, so start there.
  • Check the Task History to make sure the Alert Counter is triggering the Task.
  • Run the Task manually and see if it triggers the PowerShell script (see the commands below).
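For the last two checks the command line can save a few clicks. The first command runs the scheduled task by hand; the second shows whether the Data Collector Set exists and is running. Both assume the task and collector names used in this post:

schtasks /run /tn EmailAlert
logman query Alerts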

Step 4 – Set The Performance Counter To Run Automatically

If you think you are all set to trigger email alerts from Windows Performance Monitor, you have one more step. The Perfmon Counter Alert will not start automatically after a server reboot. For it to survive a reboot, run the following at a command prompt. Note that "Alerts" referenced below is the name of my user-defined Data Collector Set.

schtasks /create /tn Alerts /sc onstart /tr "logman start Alerts" /ru system

There are a few edge cases where you might have to create another Trigger to start the Data Collector Set. For example, the SIOS DataKeeper Perfmon counters only collect data on the source of the mirror. If you try to start the Data Collector Set on the target server, you will see that it fails to start. However, if your cluster fails over, the old target becomes the source of the mirror, so you will want to start monitoring DataKeeper counters on the new source. You could create a cluster Generic Script Resource that starts the Data Collector Set upon failover, but that is a topic for another time.

The easier way to ensure the counter is running on the new source is to set up a Scheduled Task that is triggered by an Event ID indicating that the server is becoming the source of the mirror. In this case I set up a trigger on both systems so that each time Event ID 23 occurs, the trigger runs logman to start the Data Collector Set. Event ID 23 is logged on a system whenever it becomes the source of the mirror, so the Data Collector Set will begin automatically after every failover. A scripted version of this trigger appears below.
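That event-triggered task can also be created from the command line. A sketch, assuming the DataKeeper event source ExtMirr and the Data Collector Set name Alerts used above:

schtasks /create /tn StartAlertsOnSource /sc onevent /ec System /mo "*[System[Provider[@Name='ExtMirr'] and EventID=23]]" /tr "logman start Alerts" /ru system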


That's it. You can now receive email alerts directly from your server should any of the Perfmon counters you care about start getting out of hand.

Reproduced with permission from Clusteringformeremortals.com

Filed Under: Clustering Simplified Tagged With: trigger email alerts from windows performance monitor, Windows Performance Monitor

How To Trigger Email Alerts From Windows Event Using Windows Server 2016

November 9, 2018 by Jason Aw


Step-By-Step: How To Trigger An Email Alert From A Windows Event That Includes The Event Details Using Windows Server 2016

Introduction

Triggering email alerts from a Windows Event on Windows Server 2016 only requires a few steps. The core of the setup is a Scheduled Task, in which you specify the action that will occur when the Task is triggered. Since Microsoft has decided to deprecate the "Send an e-mail" option, the only choice we have is to start a program. In our case, that program will be a PowerShell script that collects the Event Log information and parses it, so we can send an email that includes the important Log Event details.

This was verified on Windows Server 2016, but I suspect it will work on Windows Server 2012 R2 and Windows Server 2019 as well. If you get it working on any other platform, please comment and let us know whether you had to change anything.

Step 1 – Write A PowerShell Script

The first thing to do is write a PowerShell script that, when run, sends an email. While researching this I discovered many ways to accomplish the task; what I'm about to show you is just one way, so feel free to experiment and use what is right for your environment.

In my lab I do not run my own SMTP server, so I had to write a script that leverages my Gmail account. You will see in my PowerShell script that the password to the email account that authenticates to the SMTP server is in plain text. If you are concerned that someone may gain access to your script and discover your password, you will want to encrypt your credentials. Gmail requires an SSL connection, so your password should be safe on the wire, just like with any other email client.

Here is an example of a PowerShell script that, when used in conjunction with Task Scheduler, sends an email alert automatically when any specified Event is logged in the Windows Event Log. In my environment, I saved this script as C:\Alerts\DataKeeper.ps1.

# Event IDs to look for (commonly monitored SIOS DataKeeper events)
$EventId = 16,20,23,150,219,220

# Grab the most recent matching event from the System log
$A = Get-WinEvent -MaxEvents 1 -FilterHashTable @{ LogName = "System"; ID = $EventId }
$Message = $A.Message
$EventID = $A.Id
$MachineName = $A.MachineName
$Source = $A.ProviderName

# Compose and send the email
$EmailFrom = "sios@medfordband.com"
$EmailTo = "sios@medfordband.com"
$Subject = "Alert From $MachineName"
$Body = "EventID: $EventID`nSource: $Source`nMachineName: $MachineName`nMessage: $Message"
$SMTPServer = "smtp.gmail.com"
$SMTPClient = New-Object Net.Mail.SmtpClient($SMTPServer, 587)
$SMTPClient.EnableSsl = $true
$SMTPClient.Credentials = New-Object System.Net.NetworkCredential("sios@medfordband.com", "mySMTPP@55w0rd")
$SMTPClient.Send($EmailFrom, $EmailTo, $Subject, $Body)

An email generated by that PowerShell script shows the Event ID, source, machine name, and message of the logged event.
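Before relying on the script, you can confirm that matching events actually exist (Get-WinEvent throws a "No events were found" error when nothing matches the filter) and then run the script by hand:

# Verify that matching events exist in the System log
Get-WinEvent -MaxEvents 5 -FilterHashTable @{ LogName = "System"; ID = 16,20,23,150,219,220 }

# Then run the alert script manually
C:\Alerts\DataKeeper.ps1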

You probably noticed that this PowerShell script uses the Get-WinEvent cmdlet to grab the most recent Event Log entry matching the log name and Event IDs specified. It then parses the event and assigns the EventID, Source, MachineName and Message to variables that are used to compose the email. The log name and Event IDs specified here should be the same as the ones you specify when you set up the Scheduled Task trigger in Step 2.

Step 2 – Set Up A Scheduled Task

In Task Scheduler, create a Task as shown in the following steps.

  1. Create the Task. Make sure the task is set to Run whether the user is logged on or not.
  2. On the Triggers tab, choose New to create a Trigger that will begin the task "On an event". In my example, I am triggering any time DataKeeper (ExtMirr) logs an important event to the System log. Create a custom event and a New Event Filter (a sketch of the underlying filter query appears after this list). For my trigger I am triggering on the commonly monitored SIOS DataKeeper (ExtMirr) Event IDs 16, 20, 23, 150, 219 and 220. You will need to set up your trigger for the specific Events you want to monitor. You can put multiple Triggers in the same Task if you want to be notified about events that come from different logs or sources.
  3. Once the Event Trigger is configured, configure the Action that occurs when the task runs. In our case, we are going to run the PowerShell script that we created in Step 1.
  4. The default Condition parameters should be sufficient.
  5. Finally, on the Settings tab, make sure you allow the task to be run on demand and to "Queue a new instance" if a task is already running.
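For reference, the custom event filter in item 2 boils down to an XPath query against the System log. You can sanity-check the query with Get-WinEvent before pasting it into the Event Filter; a sketch, with the ExtMirr provider name taken from the example above:

# Returns the most recent matching events if the filter is correct
Get-WinEvent -LogName System -MaxEvents 5 -FilterXPath "*[System[Provider[@Name='ExtMirr'] and (EventID=16 or EventID=20 or EventID=23 or EventID=150 or EventID=219 or EventID=220)]]"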

Step 3 (If Necessary) – Fix The Microsoft Windows DistributedCOM Event ID: 10016 Error

In theory, if you did everything correctly, you should now be able to trigger email alerts from a Windows Event. However, I ran into a weird permission issue on one of my servers. Here is the fix to my problem; I hope it helps you too.

In my case, when I manually triggered the event or ran the PowerShell script directly, everything worked as expected and I would receive an email. However, when one of the monitored Event IDs was logged in the event log, no email was sent. The only clue I had was Event ID 10016, which was logged in my System event log every time I expected the Task trigger to detect a logged event.

Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 10/27/2018 5:59:47 PM
Event ID: 10016
Task Category: None
Level: Error
Keywords: Classic
User: DATAKEEPER\dave
Computer: sql1.datakeeper.local
Description:
The application-specific permission settings do not grant Local Activation permission 
for the COM Server application with CLSID 
{D63B10C5-BB46-4990-A94F-E40B9D520160}
and APPID 
{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
to the user DATAKEEPER\dave SID (S-1-5-21-25339xxxxx-208xxx580-6xxx06984-500) 
from address LocalHost 
(Using LRPC) running in the application container Unavailable SID (Unavailable). 
This security permission can be modified using the Component Services administrative tool.

Many of the Google search results for that error indicate that it is benign and include instructions on how to suppress it rather than fix it. However, I was pretty sure this error was the cause of my failure, so I set out to fix it.

After much searching, I stumbled upon this newsgroup discussion. The response from Marc Whittlesey pointed me in the right direction. This is what he wrote…

There are 2 registry keys you have to set permissions before you go to the DCOM Configuration in Component services: CLSID key and APPID key.

I suggest you to follow some steps to fix issue:

1. Press Windows + R keys and type regedit and press Enter.
2. Go to HKEY_Classes_Root\CLSID\*CLSID*.
3. Right click on it then select permission.
4. Click Advance and change the owner to administrator. Also click the box that will appear below the owner line.
5. Apply full control.
6. Close the tab then go to HKEY_LocalMachine\Software\Classes\AppID\*APPID*.
7. Right click on it then select permission.
8. Click Advance and change the owner to administrators.
9. Click the box that will appear below the owner line.
10. Click Apply and grant full control to Administrators.
11. Close all tabs and go to Administrative tool.
12. Open component services.
13. Click Computer, click my computer, and then click DCOM.
14. Look for the corresponding service that appears on the error viewer.
15. Right click on it then click properties.
16. Click security tab then click Add User, Add System then apply.
17. Tick the Activate local box.

So use the relevant keys here and the DCOM Config should give you access to the greyed out areas:
CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160}

APPID {9CA88EE3-ACB7-47C8-AFC4-AB702511C276}
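Before editing any permissions, you can confirm which application those GUIDs belong to by reading the default registry values. A quick sketch; on my server both resolved to the RuntimeBroker mentioned below:

# Read the (default) value of each key to see the friendly application name
(Get-ItemProperty 'Registry::HKEY_CLASSES_ROOT\CLSID\{D63B10C5-BB46-4990-A94F-E40B9D520160}').'(default)'
(Get-ItemProperty 'Registry::HKEY_CLASSES_ROOT\AppID\{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}').'(default)'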

I was able to follow Steps 1-15 pretty much verbatim. However, when I got to Step 16 I couldn't tell exactly what he wanted me to do. At first I granted the DATAKEEPER\dave user account Full Control on the RuntimeBroker, but that didn't fix things. Eventually I selected "Use Default" on all three permissions, and that fixed the issue.

I'm not sure how or why this happened, but I figured I had better write it all down in case it happens again, because it took me a while to figure out.

Step 4 – Automating The Deployment

If you need to enable the same alerts on multiple systems, export your Task to an XML file and Import it on your other systems.
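The export side can be scripted as well. A minimal sketch using the task name from this post; the file share path matches the import example below:

# Write the task definition to a share so other systems can import it
Export-ScheduledTask -TaskName "DataKeeperAlerts" | Out-File '\\myfileshare\tasks\DataKeeperAlerts.xml'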


Or better yet, automate the import as part of your build process with a PowerShell script, after making your XML file available on a file share, as shown in the following example.

PS C:\> Register-ScheduledTask -Xml (Get-Content '\\myfileshare\tasks\DataKeeperAlerts.xml' | Out-String) -TaskName "DataKeeperAlerts" -User datakeeper\dave -Password MyDomainP@55W0rd -Force


In my next post, I will show you how to be notified when a specified service either starts or stops. Of course, you could just monitor for Event ID 7036 from the Service Control Manager, but that would notify you whenever any service starts or stops. We will need to dig a little deeper to make sure we get notified only when the services we care about start or stop.


Reproduced from Clusteringformeremortals.com

Filed Under: Clustering Simplified Tagged With: trigger email alerts from windows event using windows server 2016, Windows Server 2016

Azure Outage Post Mortem Part 3

November 8, 2018 by Jason Aw


Concluding The Azure Outage Post-Mortem Part 3

My previous blog posts, Azure Outage Post-Mortem – Part 1 and Azure Outage Post-Mortem Part 2, made some assumptions based upon limited information coming from blog posts and Twitter. I just attended a session at Ignite that gave a little more clarity as to what actually happened. Sometime tomorrow you should be able to view the session for yourself.

BRK3075 – Preparing for the unexpected: Anatomy of an Azure outage

The official Root Cause Analysis will be published soon. In the meantime, here are some tidbits of information gleaned from the session.

The Cause

The outage was NOT caused by a lightning strike as previously reported. Instead, due to the nature of the storm, there were electrical sags and swells, which locked out a chiller plant in the first datacenter. During this first outage they were able to recover the chiller quickly, with no noticeable impact. Shortly thereafter there was a second outage at a second datacenter, which was not recovered properly, and that began an unfortunate series of events.

2nd Outage

During this outage, Microsoft states that "Engineers didn't triage alerts correctly – chiller plant recovery was not prioritized". There were numerous alerts being triggered at the time, and unfortunately the chiller being offline did not receive the priority it should have. The RCA as to why that happened is still being investigated.

Microsoft states that redundant chiller systems are of course in place. However, the cooling systems were not set to fail over automatically: recently installed new equipment had not been fully tested, so it was set to manual mode until testing was complete.

After 45 minutes, the ambient cooling failed, hardware shut down, and the air handlers shut down because they detected what appeared to be a fire. Staff had been evacuated due to the false fire alarm, and during this time the temperature in the datacenter kept increasing. Some hardware was not shut down properly, causing damage to some storage and networking equipment.

After manually resetting the chillers and opening the air handlers, the temperature began to return to normal. It took about 3 hours and 29 minutes before they had a complete picture of the status of the datacenter.

The biggest issue was the damage to storage. Microsoft's primary concern is data protection, so it worked to recover data and ensure no data loss. This of course took some time, which extended the overall length of the outage. The good news is that no customer data was lost. The bad news is that, judging from customers complaining on Twitter about the prolonged outage, it took 24-48 hours for things to return to normal.

Assumptions

Everyone expected that this outage would impact customers hosted in the South Central Region. What they did not expect was that the outage would have an impact outside of that region. In the session, Microsoft discussed some of the extended reach of the outage.

Azure Service Manager (ASM)

ASM controls Azure "Classic" (pre-ARM) resources. Anyone relying on ASM could have been impacted. It wasn't clear to me why this happened, but it appears that the South Central Region hosts some important components of that service, which became unavailable.

Visual Studio Team Services (VSTS)

Again, it appears that many of the resources that support this service are hosted in the South Central Region. This outage is described in great detail by Buck Hodges (@tfsbuck), Director of Engineering, Azure DevOps, in this blog post:

POSTMORTEM: VSTS 4 SEPTEMBER 2018

Azure Active Directory (AAD)

When the South Central region failed, AAD did what it was designed to do and started directing authentication requests to other regions. As the East Coast started to wake up and come online, authentication traffic started picking up. Normally AAD would handle this increase in traffic through autoscaling, but autoscaling has a dependency on ASM, which of course was offline. Without the ability to autoscale, AAD was not able to handle the increase in authentication requests. Exacerbating the situation was a bug in Office clients that gave them very aggressive retry logic with no backoff. This additional authentication traffic eventually brought AAD to its knees.

They ran out of time to discuss this further during the Ignite session. One feature they will be introducing is the ability for users to fail over Storage Accounts manually. So in cases where the recovery time objective (RTO) is more important than the recovery point objective (RPO), the user will have the ability to recover their asynchronously replicated geo-redundant storage in an alternate datacenter if Microsoft experiences another extended outage in the future.

What You Can Do Now

Until that time, you will have to rely on other replication solutions, such as SIOS DataKeeper, Azure Site Recovery, or application-specific replication solutions that can replicate data across regions and put the ability to enact your disaster recovery plan in your control.

Reproduced with permission from Clusteringformeremortals.com

Filed Under: Clustering Simplified Tagged With: azure outage post mortem, Microsoft

Azure Outage Post Mortem Part 2

November 7, 2018 by Jason Aw

What happened? Here's our Azure Outage Post Mortem Part 2

In my previous blog post I said that cloud-to-cloud or hybrid-cloud configurations would give you the most isolation from just about any issue a CSP could encounter. However, most of the downtime caused by this natural disaster could have been avoided if Availability Zones had been available in the South Central region. Microsoft has published a Preliminary RCA of the September 4th South Central outage.

The most important part of that whole summary is as follows…

“DESPITE ONSITE REDUNDANCIES, THERE ARE SCENARIOS IN WHICH A DATACENTER COOLING FAILURE CAN IMPACT CUSTOMER WORKLOADS IN THE AFFECTED DATACENTER.”

What Does That Mean To You?

If your applications all run in the same datacenter, you are susceptible to the same type of outage in the future. In Microsoft's defense, this really shouldn't be news to you. It has always been true whether you run in Azure, AWS, Google or your own datacenter. Failing to plan ahead, with data replication to a different datacenter and a plan in place to quickly recover your applications, is simply a lack of planning on your part.

Microsoft didn't publish exact Availability Zone locations, but if you believe the map published here, you could guess that they are probably anywhere from 2 to 10 miles apart from each other.


In all but the most extreme cases, replicating data across Availability Zones should be sufficient for data protection. Some applications, such as SQL Server, have built-in replication technology. For a broad range of other applications, operating systems and data types, you will want to investigate block-level replication SANless cluster solutions. SANless cluster solutions have traditionally been used for multisite clusters, but the same technology can also be used in the cloud across Availability Zones, across regions, or in hybrid-cloud configurations for high availability and disaster recovery.

Implementing a SANless cluster that spans Availability Zones, be it in Azure, AWS or Google, is a pretty simple process given the right tools. Here are a few resources to help get you started.

Step-by-Step: Configuring a File Server Cluster in Azure that Spans Availability Zones

How to Build a SANless SQL Server Failover Cluster Instance in Google Cloud Platform

MS SQL Server v.Next on Linux with Replication and High Availability #Azure #Cloud #Linux

Deploying Microsoft SQL Server 2014 Failover Clusters in #Azure Resource Manager (ARM)

SANless SQL Server Clusters in AWS

SANless Linux Cluster in AWS Quick Start

Lessons From Azure Outage Post Mortem

If you are in Azure, you may also want to consider Azure Site Recovery (ASR). ASR lets you replicate the entire VM from one Azure region to another region. ASR will replicate your VMs in real-time and allow you to do a non-disruptive DR test whenever you like. It supports most versions of Windows and Linux and is relatively easy to set up.

You can also create replication jobs that have "Multi-VM Consistency". This means that servers that must be recovered to the exact same point in time can be put together in a consistency group, and they will have the exact same recovery point. Essentially, if you build a SANless cluster with DataKeeper in a single region for high availability, you have two options for DR: extend your SANless cluster to a node in a different region, or use ASR to replicate both nodes in a consistency group.


What’s The Difference?

The trade-off with ASR is that the RPO and RTO are not as good as what you will get with a SANless multi-site cluster, although ASR is easy to configure and works with just about any application. Just be careful: if your application regularly exceeds 10 MBps of disk-write activity, ASR will not be able to keep up. Also, clusters based on Storage Spaces Direct cannot be replicated with ASR and in general lack a good DR strategy when used in Azure.

ASR also did not fully support Managed Disks until about a year after they were released, which was a big hurdle for many people looking to use ASR. Fortunately, since about February of 2018, ASR fully supports Managed Disks. However, another problem has just been introduced.

With the introduction of Availability Zones, ASR is once again caught behind the times: it currently does not support VMs that have been deployed in Availability Zones.

Support matrix for replicating from one Azure region to another

I went ahead and tried it anyway. It seems possible to configure replication and I was able to do a test failover.

I used ASR to replicate SQL1 and SQL3 from Central to East US 2 and did a test failover. Other than not placing the VMs in Availability Zones in East US 2, it seems to work.

I'm hoping to find out more about this limitation at the Ignite conference. This limitation is not as critical as the Managed Disks one was, because Availability Zones are not yet widely available. Hopefully ASR will add support for Availability Zones as more regions bring them online and they are more widely adopted.

Reproduced with permission from Clusteringformeremortals.com

Filed Under: Clustering Simplified Tagged With: Azure, azure outage post mortem

Azure Outage Post-Mortem Part 1

November 6, 2018 by Jason Aw


Azure Outage Post-Mortem

The first official post-mortems are starting to come out of Microsoft in regards to the Azure outage that happened last week. This first Azure outage post-mortem addresses the Azure DevOps outage specifically (Azure DevOps was previously known as Visual Studio Team Services, or VSTS). It gives us some additional insight into the breadth and depth of the outage, confirms the cause, and offers some insight into the challenges Microsoft faced in getting things back online quickly. Additionally, it hints at some features and functionality Microsoft may consider pursuing to handle this situation better in the future.

As I mentioned in my previous article, features such as the new Availability Zones being rolled out in Azure might have minimized the impact of this outage. In the post-mortem, Microsoft confirms what I previously said:

The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.

Other Preventions To Take

Until Availability Zones are rolled out across more regions, the only disaster recovery options you have are cross-region, hybrid-cloud or even cross-cloud asynchronous replication. Software-based #SANless clustering solutions available today enable such configurations, providing a very robust RTO and RPO even when replicating over great distances.

With SaaS/PaaS solutions, you depend on the Cloud Service Provider (CSP) to have an iron-clad HA/DR solution in place. In this case, it seems a pretty significant deficiency was exposed. We can only hope that it leads all CSPs to take a hard look at their SaaS/PaaS offerings and address any HA/DR gaps that might exist. Until then, it is incumbent upon the consumer to understand the risks and do what they can to mitigate the risk of extended outages, or simply choose not to use PaaS/SaaS until the risks are addressed.

RTO or RPO?

The post-mortem really gets to the root of the issue…what do you value more, RTO or RPO?

I fundamentally do not want to decide for customers whether or not to accept data loss. I’ve had customers tell me they would take data loss to get a large team productive again quickly, and other customers have told me they do not want any data loss and would wait on recovery for however long that took.

It will be impossible for a CSP to make that decision for a customer. A CSP won't want to lose customer data unless the original data is completely lost and unrecoverable, in which case a near real-time async replica is as good as you are going to get in terms of RPO for an unexpected failure.

However, was this outage really unexpected and without warning? Modern satellite imagery and improvements in weather forecasting gave fair warning that there were going to be significant weather-related events in the area.

Hurricane Florence is bearing down on the Southeast US as I write this post. If your datacenter is in its path, take proactive measures to move workloads out of the impacted region. The benefits of proactive disaster recovery over reactive disaster recovery are numerous: no data loss, ample time to address unexpected issues, and the ability to manage human resources so that employees can take care of their families rather than be at work.

Again, enacting a proactive disaster recovery would be a hard decision for a CSP to make on behalf of all its customers, since planned migrations across regions incur some amount of downtime. That decision will have to be put in the hands of the customer. Take lessons from this Azure outage post-mortem to educate your customers.

Hurricane Florence satellite image taken from the new GOES-16 satellite, courtesy of Tropical Tidbits

Get Protected

So what can you do to protect your business-critical applications and data? Let's glean some lessons from the Azure outage post-mortem. Cross-region, cross-cloud or hybrid-cloud models with software-based #SANless cluster solutions go a long way toward addressing your HA/DR concerns, with excellent RTO and RPO for cloud-based IaaS deployments. Beyond application-specific solutions, software-based block-level volume replication solutions such as SIOS DataKeeper and SIOS Protection Suite replicate all data, providing a data protection solution for both Linux and Windows platforms.

My oldest son just started his undergrad degree in Meteorology at Rutgers University. Imagine a day when artificial intelligence (AI) and machine learning (ML) process weather-related data from NOAA and trigger a planned disaster recovery migration two days before the storm strikes. I think I just found a perfect topic for his Master's thesis. Or better yet, have him and his smart friends at WeatherWatcher LLC get funding for a tech startup that applies AI and ML to weather-related data to control proactive disaster recovery events.

I think we are just at the cusp of IT analytics solutions that apply advanced machine-learning technology to cut the time and effort needed to ensure delivery of critical application services. SIOS iQ is one of the solutions leading the way in that field.

Batten down the hatches and get ready. Hurricane season is just starting and we are already in for a wild ride. If you would like to discuss your HA/DR strategy reach out to me on Twitter @daveberm.

Reproduced with permission from Clusteringformeremortals.com

Filed Under: Clustering Simplified Tagged With: Azure, azure outage post mortem

