cluster Archives - SIOS SANless clusters

How to get the most out of your “GET” commands in DataKeeper

September 5, 2024 by Jason Aw

This is part 2 of the three-part DataKeeper dashboard blog series and follows up on the DataKeeper UI vs. Car Dashboards blog. As with your car, when the indicators (traffic light colors) on your dashboard flash, you pop the hood to find out what they mean. Starting points:

Battery Light = check battery, cable connections, alternator, etc.
Oil Light = check the dipstick for low, high, or no oil (no oil: DON’T MOVE YOUR CAR)
Coolant Light = Is there any coolant/water in the overfill tank?

“Popping the hood” on your DataKeeper Cluster Edition software is similar, and it often means using the command line interface for DataKeeper administration. With EMCMD (Extended Mirroring Command) for SIOS DataKeeper, the GET commands account for about a third of the most commonly used commands out of approximately 48. Note that they are informational only and will not impact your nodes. Below are a few helpful GET commands and how to use them to identify the reasons for warning colors (traffic light colors) in your DataKeeper User Interface (DataKeeper.msc) in the areas of Storage, Networking, and Other.

Note:

As always, work from an elevated (Administrator) command prompt and start with:

cd %extmirrbase% (This is just a shortcut to the install path <root>\Program Files (x86)\SIOS\DataKeeper)

Obtaining Status of your DataKeeper Mirrors

The commands below are a great place to start your initial triage:

getmirrorvolinfo (Other) – The most heavily used tool. It reports the Mirror roles (Source/Target) and which of the 5 different states a Mirror is in. It can be run on the source or the target to check whether the mirror configuration exists.

getserviceinfo (Other) – reports
- the Driver Version and DataKeeper Service Version
- the time the SIOS DataKeeper Service was started/restarted
getcompletevolumelist (Storage/Network) – a complete list of
- all disks
- DataKeeper and non-DataKeeper Volumes
- their Role (Source/Target) and Total Capacity (in bytes)
getjobinfo (Storage/Network) – echoes the job information listed in the DataKeeper console
getvolumeinfo (Storage/Network) – a useful utility when
- comparing the Total Drive Capacity of the Source (diskmgmt.msc) and Target Volumes (bytes)
- checking IP addresses and the present Mirror status
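
To make the roles and states reported by getmirrorvolinfo easier to read during triage, a small decoder can help. The sketch below is illustrative only. It assumes the terse one-line output form "<volume>: <role> <source> <target> <state>" and the numeric codes shown in the tables, neither of which comes from this post, so check both against the SIOS EMCMD documentation for your version. The function name and the sample line are hypothetical.

```python
# Sketch only. ASSUMPTIONS (not from this post): the terse output is one line
# per mirror in the form "<volume>: <role> <source> <target> <state>", and the
# numeric codes below match your DataKeeper version. Verify both against the
# SIOS EMCMD documentation before relying on them.
ROLE_NAMES = {0: "None", 1: "Source", 2: "Target"}
STATE_NAMES = {
    0: "No mirror",
    1: "Mirroring",
    2: "Resyncing",
    3: "Broken",
    4: "Paused",
    5: "Resync pending",
}


def decode_mirror_line(line):
    """Turn one line of getmirrorvolinfo-style output into labeled fields."""
    volume, role, source, target, state = line.split()
    role, state = int(role), int(state)
    return {
        "volume": volume.rstrip(":"),
        "role": ROLE_NAMES.get(role, f"Unknown ({role})"),
        "source": source,
        "target": target,
        "state": STATE_NAMES.get(state, f"Unknown ({state})"),
    }


# Hypothetical sample line, typed by hand rather than captured from a real node.
info = decode_mirror_line("D: 1 BOX1 BOX2 1")
print(info["role"], info["state"])  # prints "Source Mirroring"
```

Unknown codes are reported as "Unknown (n)" rather than raising an error, so an unexpected value still shows up in the output.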

Cross-referencing DataKeeper “GET” commands to various Windows Server Commands

Getjobinfo and “ipconfig /all” (Networking)
- Are the IP addresses identical in both outputs? If the Mirror Replication is to be segmented to another network, the getjobinfo output must match the ipconfig /all output. Executing a changemirrorendpoints will correct a discrepancy (to be discussed in another blog)
  - The mirror endpoints in the getjobinfo output should also match the IP addresses located in the ipconfig /all output
After resizing the existing Source and Target Volumes, does the emcmd . getvolumeinfo command report a different Total Space than the Size shown by DISKPART? In DISKPART, an “extend” or “extend filesystem” may be required so that the resized volumes are recognized by the Operating System
Getcompletevolumelist has information similar to that of the Disk Management applet (diskmgmt.msc)
- File System type
- Volume Total Space (bytes)
- Getvolumeinfo provides similar output
Getserviceinfo, as well as “net start”, can show the status of the SIOS DataKeeper Service (whether it is started or stopped)
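
The getjobinfo versus “ipconfig /all” comparison can also be done mechanically. The sketch below is one possible approach, not a SIOS tool. It assumes you have copied the mirror endpoint IPs for the local node out of the getjobinfo output by hand and saved the “ipconfig /all” output as text. The function names and the sample addresses are hypothetical.

```python
import re

# Matches anything shaped like a dotted IPv4 address (no range validation).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ipv4_addresses(text):
    """Return the set of IPv4-looking strings found in a block of text."""
    return set(IPV4_PATTERN.findall(text))


def missing_endpoints(endpoint_ips, ipconfig_text):
    """Return the endpoint IPs that do not appear in the ipconfig output."""
    local = ipv4_addresses(ipconfig_text)
    return sorted(ip for ip in endpoint_ips if ip not in local)


# Hypothetical ipconfig /all excerpt for the local node.
ipconfig_sample = (
    "   IPv4 Address. . . . . . . . . . . : 10.0.0.5(Preferred)\n"
    "   Subnet Mask . . . . . . . . . . . : 255.255.255.0\n"
)

# Hypothetical endpoints read from getjobinfo for this node.
print(missing_endpoints(["10.0.0.5", "10.0.1.5"], ipconfig_sample))  # prints ['10.0.1.5']
```

An empty list means every endpoint you supplied is bound to a local adapter. Anything returned is a candidate for the changemirrorendpoints correction mentioned above.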

Take Charge of DataKeeper: Apply Your GET Command Knowledge Now

Now that you are armed with some basic knowledge about the lights on your dashboard, you will become a DIYer when it comes to DataKeeper Administration in blog 3 of the DataKeeper Dashboard series.

Reproduced with permission from SIOS

How to Set Up a DataKeeper Cluster Edition (DKCE Cluster)

January 30, 2024 by Jason Aw

What is a DKCE cluster?

DKCE is an acronym for DataKeeper Cluster Edition. DKCE is SIOS software that combines the use of DataKeeper with features of Windows Failover Clustering to provide high availability through migration-based data replication.

The steps to create a DKCE cluster

For this example, I will set up a three-node cluster with the third node maintaining node majority.

Step 1: You must have DataKeeper installed on two of your three systems to set up a DKCE cluster. Follow our quick start guide to complete this install: https://docs.us.sios.com/dkce/8.10.0/en/topic/datakeeper-cluster-edition-quick-start-guide

Step 2: Add the servers that you plan on managing in the Server Manager. This will need to be done on all servers you plan on adding to the cluster.

On your server navigate to Server Manager

Click “Add other servers to manage”

I added the servers by name here. To add them this way, verify your system name and IP entries in the hosts file, located here: C:\Windows\System32\drivers\etc\hosts

After all servers have been added, you can verify by navigating to “All servers” in Server Manager
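
Checking the hosts file entries from Step 2 by eye is easy to get wrong across three servers, so a short script can help. The sketch below is a minimal example of one way to do it. The server names Box1, Box2, and Box3 are the ones used later in this walkthrough, while the IP addresses and function names are hypothetical.

```python
def parse_hosts(text):
    """Map each host name found in hosts-file text to its IP address."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip  # host names are case-insensitive
    return mapping


def missing_hosts(text, expected_names):
    """Return the expected server names that have no hosts-file entry."""
    known = parse_hosts(text)
    return [name for name in expected_names if name.lower() not in known]


# Hypothetical hosts-file content; the addresses are made up for illustration.
hosts_sample = """
# cluster nodes
12.0.0.1   Box1
12.0.0.2   Box2   # target
"""

print(missing_hosts(hosts_sample, ["Box1", "Box2", "Box3"]))  # prints ['Box3']
```

On a real server you would read the text from C:\Windows\System32\drivers\etc\hosts instead of using the sample string.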

Step 3: You may notice a WinRM error. To bypass it, run the following command in PowerShell as an administrator; it adds the servers in your cluster as trusted hosts. The command needs to be run on every system in your cluster.

Set-Item WSMan:\localhost\Client\TrustedHosts -Value '<name of server 1>,<name of server 2>'

Step 4: Install Failover Clustering

Follow these steps to install failover clustering.

Step 5: Navigate to the Failover Cluster Manager

Step 6: Click “Create Cluster”

Step 7: Next, add the servers that should be in the cluster and click “Add” after each entry.

Step 8: The list should be similar to the one in the following image

Step 9: Choose “Run all tests” for the validation test, and click Next.

Step 10: Once the tests have been completed, click “Finish”

Step 11: Name your cluster (I have named this one “Cluster1”) and click “Next”

Step 12: Verify that “Add all eligible storage to the cluster” is checked and click “Next”

Step 13: Once step 12 is completed, click “Finish”

Step 14: In Failover Cluster Manager, the cluster will initially be offline. I will be assigning an unused IP to bring it online. In the “Cluster Core Resources” right-click on the IP address resource and select Properties.

My subnet mask is /28, so in the properties panel I will choose an available IP within that range, 12.0.0.14. Click “Apply”

In the “Cluster Core Resources” right-click on the cluster and select “Bring Online”

The resources should now be online
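
To see why 12.0.0.14 is a valid choice with a /28 mask in Step 14, you can list the usable range with Python’s standard ipaddress module. This is only a sanity check on the arithmetic. The function name is hypothetical, and whether a given address is actually unused still has to be confirmed on your network.

```python
import ipaddress


def usable_range(address_with_prefix):
    """Return (network, first usable host, last usable host, host count)."""
    network = ipaddress.ip_interface(address_with_prefix).network
    hosts = list(network.hosts())  # excludes network and broadcast addresses
    return network, hosts[0], hosts[-1], len(hosts)


network, first, last, count = usable_range("12.0.0.14/28")
print(network, first, last, count)  # prints "12.0.0.0/28 12.0.0.1 12.0.0.14 14"
```

A /28 leaves 4 host bits, so there are 16 addresses in the block and 14 of them are usable. 12.0.0.14 is the last usable address.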

Step 15: Navigate to DataKeeper

Step 16: Right-click on Job and click “Create Job” to begin creating our first mirror

Give the job a name (I am naming my job “job1”) and click “Create Job”

Choose the source and volume to replicate data from. I have chosen Box1 as my source and volume D. Click “Next”

Next, choose a server and a volume to be the target. I have chosen Box2, and volume D.

A prompt will ask whether to auto-register the volume you have created as a WSFC volume. Select “Yes” to make this volume highly available.

In DataKeeper you can now see that the volume is currently mirroring.

Step 17: In Failover Cluster Manager, navigate to Storage, then Disks. You will see that the volume you auto-registered now appears as a WSFC volume.

Step 18: Let’s verify which Owners should be checked. Right-click on the volume, and click “Properties”

Since I need the third node to act as a witness and maintain node majority, Box3 must be unchecked (or remain unchecked).
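
The reason for a third node is easiest to see with the majority arithmetic. The sketch below uses a deliberately simplified model in which every node has one vote and the cluster stays up while more than half the votes are reachable. Real Windows Failover Clustering quorum has more options than this, and the function names are hypothetical.

```python
def votes_needed(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(total_votes, votes_online):
    """True if the reachable votes still form a majority."""
    return votes_online >= votes_needed(total_votes)


# Three voting nodes: losing one still leaves a majority of 2 out of 3.
print(votes_needed(3), has_quorum(3, 2))  # prints "2 True"

# Two voting nodes and no witness: losing one leaves 1 out of 2, not a majority.
print(votes_needed(2), has_quorum(2, 1))  # prints "2 False"
```

In this model, Box3 contributes a vote without owning the mirrored volume, which is why it stays unchecked as an owner.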

Step 19: Now we can test a migration through Failover Cluster Manager. Navigate to File Explorer and create a new text file in the volume that is currently being mirrored. Do this on your Source.

Navigate to Failover Cluster Manager, click “DataKeeper Volume D” and select “Move Available Storage” from the Actions pane.

Right-click “Best Possible Node”. This should automatically migrate to your target.

In Failover Cluster Manager verify the owner of “DataKeeper Volume D” is now the target node

Navigate to DataKeeper to verify that your target is now the Source, and vice versa.

Successful DKCE Cluster Setup

You have completed the setup of a DKCE cluster.

SIOS provides resources and training for all our products.

Reproduced with permission from SIOS

High Availability Lessons from Disney and Pixar’s Soul

July 7, 2022 by Jason Aw

In Disney and Pixar’s Soul, the main character Joe Gardner (voiced by Jamie Foxx) has dreamed of being a professional jazz pianist. However, despite his many attempts, and to his mother’s dismay, he finds himself miles away from his dream, living as “a middle-aged middle school band teacher.” But then, thanks to a last-minute opportunity to play in jazz legend Dorothea Williams’ quartet, his dreams seem like they are finally about to become a reality. That is, until a fateful misstep sends him to The Great Before, a place where souls get their interests, personalities, and quirks, and Joe is forced to work with 22, an ancient soul with no interest in living on Earth, to “somehow return to Earth before it’s too late” (D23.com).

Disney and Pixar’s Soul is a great movie with lots of interesting and relatable characters, humorous, descriptive and sometimes disturbingly relatable takes on life, purpose and living. But, it is also a movie with rich leadership lessons, life lessons, and lessons on higher availability.

Seven thoughts on higher availability from Disney and Pixar’s Soul

1. Pay attention to what’s going on

In Disney and Pixar’s Soul, Joe lands his dream gig. But as Joe starts walking and sharing the great news, he is so engaged with his phone that he walks into the street, nearly gets crushed under a ton of bricks, and then wanders dangerously towards an open but clearly marked manhole. So what’s the lesson for higher availability? Pay attention. Pay attention to the alerts and error messages from your monitoring and recovery solutions. Pay attention to the changes being made by your hosting providers, and especially to critical notices from vendors, partners, and security teams. Alerts and warnings are there for a reason; failing to address them or take the appropriate action when you see the warning could lead you into a deep hole.

2. Don’t fall into a hole

Oblivious to the warnings, or ignoring them, Joe finally meets his end when he falls into an open manhole and becomes a soul. This immediately alters his dreams and plans. So, what hole could your enterprise be poised to fall into? Are there open holes lurking in the path of your enterprise, such as coverage holes, versioning gaps, holes in maintenance plans and reality, or even a black hole with vendor responsiveness? Look around your environment: what holes could you fall into beyond the obvious single points of failure? Is there a warning that you have an open hole related to unprotected critical applications, communication gaps between your teams, or even holes in your process and crisis management? Don’t fall into a hole that could damage or even end your high availability.

3. Don’t rush high availability

After becoming a soul, Joe begins actively trying to get back to his own body. When he gets paired with 22, she takes him to Moonwind, who agrees to try to help him find his body, which they do. But Joe becomes too eager to jump back into his body, despite Moonwind’s caution. In his rush, both he and 22 fall back to earth, but Joe ends up in the body of a cat and 22 ends up in his body. Like Joe, if we aren’t patient, the jump happens too soon and we end up in a precarious or even worse situation. We may not be in the body of a cat, but we may also be far from the best position necessary to maintain HA. Jumping too soon looks like:

Deploying software without an architecture or holistic solution
Deploying in production without testing in QA
Deploying into the cloud without understanding the cloud or what the cloud means by HA
Deploying into production based on a timeline and not completed acceptance tests
Deploying without a purpose built, commercial grade solution for application monitoring and orchestration

4. Don’t quit too soon – high availability is never easy

When Connie, a young trombone player, comes to her teacher’s apartment, she tells Joe (who is actually 22 in Joe’s body) that she is frustrated and just wants to give up and quit. But after a few moments, she plays one last piece on the trombone and realizes that it is too soon to quit. In higher availability, we are all a lot like Connie. Sometimes a difficulty makes us feel like we are at the end of our rope and want to quit. Sometimes an outage will make us feel certain that it’s time to throw in the towel. Don’t be so quick to quit. HA is never easy, never! But it is always too soon to quit striving to end downtime, so like Connie, maybe we just need to keep at it. Which leads me to the next lesson.

5. You haven’t tried everything

In the movie, 22 is a soul who hasn’t lived yet. She believes that she has tried all the possible things to give her a spark, but when she falls into Joe’s body she realizes there is a lot that she hasn’t tried. In creating a higher availability solution, it can be easy to feel like you’ve tried everything and every product, but most likely you haven’t. A fresh perspective, or looking at the challenges and problems with a new set of eyes, may help you improve your system and enterprise availability.

Some things to try for higher availability can be simple, such as:

Set up additional alerts for key monitoring metrics
Add analytics
Perform regular maintenance (patches, updates, security fixes)
Document your processes
Document your operational playbook
Improve your lines of communication

Other ideas may require more work, research, time and money but could be worth it if you haven’t explored them in the past.

Ways to improve your higher availability with more time and effort include:

Remove hacks and workarounds
Create solid, repeatable solution architectures
Go commercial and purpose built
Hire a consultant
Audit and document your architecture
Upsize your VM: CPU, memory, and IOPS
Add additional redundancy at the zone or region level

6. Ask more (and better) questions

After Joe, as Mr. Mittens, accidentally cuts a path down the middle of his hair, Mr. Mittens and Joe have to take a trip to see Dez, Joe’s barber. While Joe is in the barber’s chair, he and Dez begin having a conversation about purpose, life, existence, and more. After the haircut, 22 asks Dez why they never had conversations like this before, about Dez’s life. Dez responds that Joe had never asked before. Sometimes we can get so narrowly focused on solutions, on methods for the cloud or on-premises, on languages and architectures, and on telling others what we are doing that we forget to ask questions that can open up a whole new world. As Joe asked questions, he learned more about Dez and about himself. Perhaps the lesson for better HA is to start asking more questions about our solution, about the architecture, about the business goals and challenges, about the end customer goals, about our teams, and even about our roles and responsibilities within the bigger picture.

Some simple questions to increase our availability include:

If a disaster happens tomorrow, what system, process, product, or solution would be the cause?
What is the single most important thing to protect? Application, data, metadata, all of the above?
What RPO can our applications and databases tolerate?
What won’t our customers tolerate?
What am I missing?
Where do we have this architecture documented?
What don’t I understand?

7. Perseverance pays off

“The count’s off,” says Terry. Tasked with keeping track of the entrants to The Great Beyond, Terry is meticulously counting the number of souls that should be arriving or have arrived. After Joe takes a detour to The Great Before, Terry grows determined to find the missing soul and fix the tally. When he begins his work, he is in a long corridor of file cabinets that stretch as far and as high as the eye can see. But after a while, he finds Joe’s file and discovers that Joe has found a loophole, and that is why the count was off. The same perseverance displayed by Terry will also pay off in the realm of higher availability. In the face of daunting uncertainty, a plethora of log files, and an ocean of possible failure scenarios, persevering to uncover and remedy problems before they occur, or to analyze and remediate them effectively after they occur, will lead us to the better outcomes we desire. Similarly, a lack of diligence and perseverance will mean that the same problem will likely resurface later, even in a new environment with new software.

As the movie Soul ends, Joe returns to the Great Before, finds and then convinces 22 to take her Earth pass and take the plunge. Reminiscent of when she fell to earth with Joe, she takes another plunge. To the dismay of my children, the movie ends without describing what 22 makes of her life or the new opportunities that follow. She simply leaps from the Great Before with an anticipation of what will happen next. Perhaps we too stand at a moment where we can take the plunge… a moment in the “Great Before” and an opportunity to make this a year of additional higher availability.

– Cassius Rhue, VP Customer Experience

Reproduced with permission from SIOS

Introduction To Clusters – Part 1

November 18, 2021 by Jason Aw

What is clustering in the first place?

Clustering is a technology that allows you to connect multiple servers so that they act as a single functional unit.

Types of clustering

You can cluster servers for several purposes. For example, you can combine the processing power of multiple small servers for high performance. You can also distribute processing work to multiple nodes using a load balancer for added efficiency.

High availability (HA) clustering is a process of combining server nodes to protect important applications from downtime and data loss.

In a traditional shared storage failover cluster, a primary node and a secondary or remote node share the same storage.

HA Clustering

High availability (HA) clustering is a mechanism that reduces downtime by eliminating single points of failure (SPOF). In an HA cluster, important applications are run on a primary node which is connected to one or more secondary or remote nodes in a cluster. Clustering software monitors the health of the application, server, and network. In the event of a failure on the primary node, it moves application operations over to a secondary node in a process called a failover, where operation continues.

High Availability

Application high availability is a measure of how much time in a given year an application is available and operational. In general, HA clusters provide 99.99% (four nines) availability, which allows a little more than 52 minutes of downtime over the course of a year.
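
The 52-minute figure follows directly from the percentage. As a quick check, the sketch below converts an availability percentage into allowed downtime per year, assuming a 365-day year. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} minutes per year")
```

For 99.99% this works out to about 52.56 minutes, which matches the "little more than 52 minutes" above.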

It is important to note that in a traditional HA cluster, all of the cluster nodes are connected to the same shared storage – typically a SAN. In this way, after a failover, the secondary node is accessing the same data as the primary node and operation can continue.

SANless Clusters

However, many companies prefer to use a SANless cluster for several reasons. First, shared storage represents a critical single point of failure. Second, shared storage is often not an option in public cloud environments. Third, SANs can sometimes impede performance of database applications, such as SQL Server, Oracle, and SAP.

Instead of shared storage, these companies use efficient, host-based, block-level replication to synchronize local storage on all cluster nodes. In the event of a failover, the secondary node is connected to local storage with an identical copy of the primary storage. This not only eliminates the SAN SPOF risk but also enables the addition of fast disk (SSD) to local on-premises storage for cost-efficient high performance. SANless clustering also enables companies to migrate on-premises HA environments to the cloud with minimal effort or disruption of ongoing business processes.

Reproduced from SIOS

Disaster Recovery Made Simple

October 22, 2021 by Jason Aw

Heard the term disaster recovery (DR) thrown around often? DR is a strategy and set of policies, procedures, and tools. It ensures critical IT systems, databases, and applications continue to operate and be available to users when a man-made or natural disaster happens. It typically involves moving application operation to a redundant DR environment that is geographically separated from the primary environment. While the IT team owns the disaster recovery strategy, DR is an important component of every organization’s Business Continuity Plan. The latter is a strategy and set of policies, procedures, and tools to ensure business operations continue through an interruption in service.

It may sound confusing at first. But we’ve collected some quick facts to make disaster recovery simple to understand:

Point 1. Implement an IT disaster recovery plan (DRP)

A DRP is a strategy and set of policies, procedures, and tools that ensure critical IT systems, databases, and applications continue to operate and be available to users when a disaster strikes the organization’s primary computing environment. While the IT team owns the disaster recovery strategy, DR is an important component of every organization’s Business Continuity Plan.

Point 2. Ensure Geographic Separation

An essential part of application disaster recovery is ensuring there is a redundant, geographically separated application environment available. You need efficient block-level replication and/or clustering software that can fail over operation to it in the event of a disaster. If your application is running in a cloud, your clustering environment should fail over across cloud regions and availability zones for disaster recovery.

Point 3. Test, test, and test some more

In a recent Spiceworks survey, 59 percent of organizations indicated they had experienced one to three outages (that is, any interruption to normal levels of IT-related service) over the course of one year. Another 11 percent had experienced four to six, and 7 percent had experienced seven or more. In short, a DR event is nearly inevitable. Be sure you conduct regular testing so you know exactly what will happen when it does.

Point 4. Understand Your Risk

The disaster in DR does not need to be a full-fledged hurricane, tornado, flood, or earthquake that impacts your business. Disasters come in many forms, including a cyber-attack, fire, theft, or vandalism. In fact, simple human error still rates among the leading causes of IT data center downtime. In short, a disaster is any crisis that results in a down system for a long duration and/or major data loss on a large scale that impacts your IT infrastructure, data center, and your business.

Point 5. Ensure Your DRP has a Checklist

The checklist should include critical IT systems and networks, prioritized by their expected time for recovery (recovery time objective, RTO). Document the steps needed to restart, reconfigure, and recover systems and networks. Employees should know where to locate the DRP and how to execute basic emergency steps in the event of an unforeseen incident.
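
A checklist prioritized by RTO can be kept as simple structured data. The sketch below shows one possible shape for it. The system names, RTO values, and recovery steps are made up for illustration and are not from SIOS guidance.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    system: str
    rto_minutes: int  # target time to restore this system
    recovery_steps: list = field(default_factory=list)


def prioritized(items):
    """Order checklist items so the tightest RTO comes first."""
    return sorted(items, key=lambda item: item.rto_minutes)


# Hypothetical entries; replace with your own systems and objectives.
checklist = [
    ChecklistItem("Reporting server", 240, ["Restore VM", "Verify reports"]),
    ChecklistItem("Order database", 15, ["Fail over to DR node", "Verify writes"]),
    ChecklistItem("Core network", 30, ["Switch to backup link"]),
]

for item in prioritized(checklist):
    print(f"{item.rto_minutes:>4} min  {item.system}: {'; '.join(item.recovery_steps)}")
```

Sorting by RTO puts the systems that must come back first at the top, which is the order people should work through the list during an incident.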

Point 6. Substantiate DRPs through testing

Testing a DRP should identify deficiencies and provide opportunities to fix problems before a disaster occurs. Testing can offer proof that the plan is effective and that it will enable you to meet recovery point and recovery time objectives (RPOs and RTOs). Since IT systems and technologies are constantly changing, DR testing also helps ensure a disaster recovery plan is up to date.

Choose a failover clustering technology that makes DR testing simple by facilitating fast, simple, reliable switchover of application operation to DR nodes and back.
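
Recording the result of each DR test against its objectives makes it clear whether the plan held up. The sketch below is a minimal example, assuming you measure the recovery time and the data-loss window in minutes during a test. The function name, parameter names, and numbers are hypothetical.

```python
def meets_objectives(recovery_minutes, data_loss_minutes, rto_minutes, rpo_minutes):
    """Compare one DR test's measurements with the RTO and RPO targets."""
    return {
        "rto_met": recovery_minutes <= rto_minutes,
        "rpo_met": data_loss_minutes <= rpo_minutes,
    }


# Hypothetical test: recovered in 12 minutes and lost 2 minutes of data,
# measured against a 15-minute RTO and a 1-minute RPO.
result = meets_objectives(
    recovery_minutes=12, data_loss_minutes=2, rto_minutes=15, rpo_minutes=1
)
print(result)  # prints {'rto_met': True, 'rpo_met': False}
```

A failed check like the RPO above is the kind of deficiency that testing is meant to surface before a real disaster does.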

When you look at those statistics, you know you are living on borrowed time if you don’t have a disaster recovery plan in place. The SIOS disaster recovery solution is a multi-site, geographically dispersed cluster that meets RPOs and RTOs with ease. What makes SIOS different from many other DR providers is that it offers one solution that meets both high availability and disaster recovery needs. To learn more about our DR solutions, check out the insights page here.

Reproduced with permission from SIOS