April 27, 2025
Data Recovery Strategies for a Disaster-Prone World

Working in a position with its roots in software engineering, system administration, and customer support, one has a unique opportunity to see a variety of configurations and a myriad of issues. Such a position also gives one perspective on users’ various needs, pain points, and concerns in a way that someone working in a purely engineering role might not be exposed to. After almost 5 years on the support team, I have noticed patterns across the teams with which I have worked. Further, when called to help on various configurations, I have been able to draw parallels between the different use cases and root causes.

As a result, there is a foundation that I like to ensure is set when it is time to begin collaborating with a new team. Setting this foundation means ensuring administration practices facilitate working optimally with an HA/DR suite, and ensuring teams know how to design for High Availability and how to leverage the utilities beyond the software on their systems to achieve success. This foundation can be crucial to ensuring a team knows how to meet or exceed their operational standards.

It seemed appropriate to summarize the common questions and their answers to serve as a resource for those who are new to High Availability but interested in implementing a solution, or who simply want to change to a new High Availability solution. Whether you are a student just now starting to study system administration/systems engineering, or you are a veteran software engineer who has been asked to expand the scope of your role to include system architecture planning, the points below can aid in your journey to get the most out of a high availability/disaster recovery suite.
Without further ado, the questions below summarize the common talking points I have seen in my role, and will help make your search for understanding key concepts and finding a fitting solution easier.

What is Disaster Recovery and what does it entail?

Disaster recovery, when coupled with high availability, works to optimize the recovery time objective (RTO) – how long a service is inaccessible before being restored – and the recovery point objective (RPO) – how much data you can stand to lose when restoring from a backup.

How does Disaster Recovery differ from traditional approaches to weathering outages?

Traditionally, without a highly available infrastructure, an environment experiencing a disaster may have a lengthy recovery time objective. Systems need to be restored, issues may need to be resolved, and applications must be started by administrators. Depending on the severity of the issue, it could take hours or more to get back up and running. Teams must work efficiently and communicate tightly to ensure service is restored without mistakes, lest they risk additional delay in returning to operation.

Additionally, the data lost during this sort of outage could be significant. If backups were not taken recently, or if the copies of up-to-date data are not accessible, then teams could be relying on data that has gone “stale” and experience operational setbacks on an organizational scale due to the loss of critical data. To look at things from a customer perspective, how long are you willing to wait to obtain access to an online service when you need it? As a customer, how accepting are you if an online storefront loses record of your transactions?

When introducing a highly available infrastructure, a means to mirror storage, and a means to orchestrate the high availability, the factors influencing RTO and RPO are all optimized, and a disaster can be weathered with far more grace.
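To make the two objectives concrete, here is a minimal Python sketch that measures the achieved recovery point and recovery time for a single incident. The function name `recovery_metrics` and all of the timestamps are hypothetical, chosen only for illustration; they are not taken from any real outage or from any SIOS product.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_copy, outage_start, service_restored):
    """Return (data-loss window, downtime) for one incident.

    The data-loss window is what gets compared against the RPO;
    the downtime is what gets compared against the RTO.
    """
    data_loss_window = outage_start - last_good_copy
    downtime = service_restored - outage_start
    return data_loss_window, downtime


# Hypothetical incident: the outage begins at noon.
outage = datetime(2025, 1, 1, 12, 0)

# Assumed manual recovery: last backup at midnight, four hours to restore.
manual = recovery_metrics(
    last_good_copy=datetime(2025, 1, 1, 0, 0),
    outage_start=outage,
    service_restored=outage + timedelta(hours=4),
)

# Assumed clustered recovery: mirror is one second behind, failover in two minutes.
clustered = recovery_metrics(
    last_good_copy=outage - timedelta(seconds=1),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=2),
)

print("manual:   ", manual)
print("clustered:", clustered)
```

Comparing the two tuples shows why both the age of the last good copy and the time needed to restore service matter: the first drives data loss, the second drives downtime.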
A highly available infrastructure is redundant, so a standby system is available to take over operation. Further, the orchestrator – software to manage the clustered environment – is able to systematically start services on a standby system with greater responsiveness, reliability, and efficiency than a manual intervention can achieve. As a result, the recovery time objective is reduced, and rather than taking hours to recover from a disaster, it can take mere minutes or less.

Another facet of highly available infrastructure is the redundancy of data. Disks can be “mirrored”, in which disks that are attached to different systems all receive the exact same data in real time. As a result, the data available on the aforementioned standby system can be an exact copy, effectively maintaining a backup of the data as it was immediately before a disaster occurs. In turn, when service is restored, applications are running with a near-zero recovery point objective, keeping the recovery point as close as possible to the most current state of operation when it is time for the orchestrator to move operations to the standby system.

What are the most common mistakes organizations make when designing high availability disaster recovery (HADR) strategies, and how can they avoid them?

One of the most common missteps observed is the lack of a QA/Testing environment. The SIOS Customer Experience team has responded to multiple such instances, where organizations attempt application or operating system patching, upgrades, or just routine maintenance and experience issues due to inadequate planning or some sort of unfortunate incompatibility. Then the environment experiences downtime, and a maintenance procedure turns into a recovery procedure. This introduces delays, complications, and the potential for a spiraling issue to occur within a production environment.
By far, the biggest recommendation that can be offered to organizations is to create a one-to-one copy of the production environment that operates in a quality assurance capacity. Every procedure that needs to occur on production should first go through a “dress rehearsal” in the QA environment. This gives organizations the freedom to exercise the planned operations and make improvements without risking the productive capacity of their infrastructure. Practicing operations in a safe, low-stakes environment ensures that teams are ready to operate in the production environment without the risk of encountering an unexpected issue and having to go “off script” to respond quickly and correctly while under pressure. If a problem happens in the QA environment, then support teams can be contacted, and the issue can be investigated with the safety of this issue being insulated from affecting business operations. This can greatly improve the potential for solutions to be found and implemented into operations in a controlled, planned, and effective manner. The aforementioned benefit of the QA environment is important for any organization; however as organizations adopt more complex maintenance strategies, the existence of this test environment becomes all the more important. The use of this testing environment not only facilitates smoother upgrade procedures but also allows companies to mitigate risk when adopting maintenance models that introduce complexity for the return of improved system availability during maintenance activities. In any scenario, testing the maintenance plan in a QA environment, improving the plan based on findings from the “dress rehearsal”, and using the experience gained from this practice enables organizations to manage production systems while minimizing the risk of encountering issues. 
What is the importance of eliminating single points of failure?

Another common obstacle that teams can experience arises from having a “weakest link” in the architecture that does not benefit from the degree of planning that other facets of the environment receive. This is best described with an example. The SIOS Customer Experience team once worked with a customer who designed extensively around keeping SAP applications running in their environment and were very well insulated from issues affecting the systems running the SAP applications. Unfortunately, this customer invested much of the planning effort into protecting their applications and did not afford that same planning effort to other aspects of their environment. As a result, all the systems relied on a single internal DNS system that resolved hosts within their private network. Despite all of the effort in protecting SAP, when an issue occurred on their DNS system, the whole environment experienced significant issues when name resolution was no longer available. Effectively, the effort placed into protecting their SAP applications did not help their environment weather the issue, simply because the DNS was a “weak link” that all of the other systems relied upon to function properly.

When planning environments, it is crucial to step back and look at the bigger picture – pay attention to the weakest links that show up in an architecture. Improving the weakest links uplifts the potential for the entire environment to weather a disaster.

For organizations relying heavily on cloud services, how can they protect against Zone or region-wide disasters?

Protecting against zone or region-wide disasters can be done by distributing resources geographically. For example, one might host their primary application server in the US-East region.
Then, to be protected against an outage affecting the US-East region, there are standby systems hosted in a “Disaster Recovery Site” that is far away from the US-East region – maybe the US-West region. While this does introduce some additional steps to ensure cross-region communication, the effort is invaluable as this provides protection against zone and region-wide levels of disaster. A total outage of the cloud provider’s US-East region can be withstood by bringing applications in service in the US-West region. Protection against outages that occur in a specific region doesn’t need to be complicated, and ensuring a Disaster Recovery site exists to assume operations will improve application availability and data redundancy in production environments.

How do you recommend organizations balance the complexity and cost of implementing robust HA/DR strategies with the need for business agility?

There is a common assumption that HA/DR solutions are either complex or expensive, or both. In the wake of this assumption, it is essential to keep a strong perspective on the stakes at hand. Systems are operational for some business purpose, and this translates into the production of revenue. When systems are down due to an outage, there is much more cost than just the lost revenue. Without an HA/DR strategy in place, an outage requires employees to be actively troubleshooting the issue, producing a cost of employee-hours to factor into the cost of downtime, perhaps even at hours when employees are not well-rested and prepared to do their best work. In addition to this, there is a lingering collateral cost in terms of interruption of regular duties and delay/slowness when employees have to task switch into resolving production issues and then switch back to their regular duties. Even further, there are reputational costs that could cause failure to recognize opportunities for revenue. For instance, what comes to mind if you think of “CrowdStrike”?
Even if this doesn’t immediately bring to mind the issues and related bad press that CrowdStrike experienced in July of 2024, at the time of writing this (March 25th, 2025), their stock prices have only just returned to the levels they were at before the issue on July 19th, 2024. Taking into account the opportunity cost of configuring an HA/DR solution, the aforementioned factors can vastly change the analysis. Commonly, SIOS customers find that the implementation of an HA/DR solution saves them money in the long run.

Additionally, backed by decades of improvement and iteration on the HA/DR offerings from SIOS Technology, configuring such a solution is more approachable and less complex than ever. If there are factors at play that still bring concern over the complexity of introducing an HA/DR solution to a production environment, SIOS Technology has professional services offerings that can help to train teams, perform installation and configuration activities, or simply validate existing configurations. With these opportunities, bringing High Availability into a system architecture is not only less complex than it has ever been, but it can be implemented faster than ever before. Finally, for organizations concerned about complexity due to unique configurations or trying to reach the absolute maximum utility of an HA/DR solution, our world-class support team is available to help bring any implementation to its full potential.

How do SIOS Technology’s solutions play a role in helping organizations implement the disaster recovery approach that you advocate for?

SIOS Technology’s solutions can meet all of the aspects addressed previously. To recount some of them: modern approaches to disaster recovery are adopted by way of our LifeKeeper and DataKeeper products, which together we call SIOS Protection Suite.
Whether on Linux or Windows, these products are available to provide cluster-wide orchestration of resources to ensure a quick and efficient response to disasters while also ensuring data is replicated and available on standby systems. LifeKeeper monitors applications for faults and communicates between nodes to ensure systems are valid targets for application recovery. DataKeeper replicates data in real time to ensure standby systems are able to inherit applications in the event of an issue and continue operation on the latest available data. Hand in hand, these products work to minimize the length of time applications are down and minimize the loss of data in the event of a disaster.

These products also integrate fully within your environment. There are mechanisms to provide efficient networking control so clients can always resolve the connection to the application servers. The solutions at play will monitor not only applications or specific components of a system, but also an entire system and environment. Through the use of “quorum” functionality, environments are monitored at a “big picture” level to ensure applications are restored on the correct systems and data is protected. There are protections in place for a myriad of disaster scenarios, so SIOS Protection Suite is able to respond appropriately.

SIOS Protection Suite is also able to work across regions, providing the protection we discussed against zone or region-level disasters. Applications can be migrated across regions, and data can be replicated across regions with the same ease as it can be replicated within the same region. Additionally, environments can be multi-tiered. Multiple nodes can be hosted in the primary region and act as either active or standby systems, providing fast responsiveness to system-level issues, while a disaster recovery site in a different region can also be maintained to ensure there is protection from region-level disasters with the same speed and efficacy of protection.
Finally, the SIOS Protection Suite product benefits from decades of real-world use. It has been put through its paces in a wide range of scenarios and deployment configurations, and has benefited from years of ease-of-use improvements. As a result, this is a solution that is flexible, easily adopted, and fits seamlessly into production environments. The complexity of designing and configuring an HA/DR solution is avoided by adopting SIOS Protection Suite and enjoying the benefits of a rich development history with countless improvements, coupled with the world-class support team that is available to help with any questions or concerns that may arise.

In addition to all of this, there are also opportunities to undergo collaborative installation or validation procedures for SIOS Protection Suite offerings, ensuring your environment is ready for whatever the world can throw at it. Finally, for teams that need strongly experienced staff and want to get the most out of SIOS Protection Suite and its components, SIOS offers training engagements where teams work with our staff to understand the components at play. The active discussion builds the deep understanding staff need to hit the ground running and implement the solution to its highest potential.

Protect your business from downtime and data loss: request a demo or start your free trial to see SIOS in action.

Author: Philip Merry, CX – Software Engineer at SIOS Technology Corp.

Reproduced with permission from SIOS
April 21, 2025
DataKeeper and Baseball: A Strategic Take on Disaster Recovery

Over the course of my career, DataKeeper has been becoming the industry standard in “think tank” and “water cooler” chatter when it comes to Data Protection and Disaster Recovery. How about the great American pastime of baseball and its comparison to DataKeeper? I’m a huge fan of the sport, and although these two things are seemingly unrelated, there are some similarities to be drawn.

Building a Winning Game Plan for Data Protection

First and foremost, both Baseball and DataKeeper require a well-defined “game plan”. In baseball, teams have practiced and devised a plan to outcompete their opponents in hopes of a victory. Similarly, DataKeeper requires a well-thought-out strategy to ensure data is protected and can be recovered should something catastrophic occur.

Secondly, teamwork remains paramount. Infielders, outfielders, managers, and the batboy each have a specific role to ensure the best chance of victory. With DataKeeper, multiple teams may be involved, e.g., Database Administrators, Infrastructure staff, Customer Experience/Support, and Management, just to name a few. All should be thoroughly invested in effectively protecting and recovering data.

Where Baseball and DataKeeper Differ: The Stakes Are Higher in IT

There are some differences that can’t be overlooked. While losing a baseball game, especially if it’s the World Series, Game 7, the last inning, 2 outs, 3 balls – 2 strikes, can be a “bummer”, the stakes are much, much higher with DataKeeper. Losing data can have serious consequences for a business. While baseball players require a unique skill set of athleticism, DataKeeper is a solution that requires knowledge of Enterprise Systems and related processes.

In summary, while baseball and DataKeeper may seem totally different, there are some parallels we can draw. Both require:
Whether you’re a fan of baseball or an IT professional, it is evident that both require a level of skill and dedication to succeed.

What’s Your Data Protection Game Plan?

Check out the game plans/solutions that are offered at us.sios.com/solutions/

PLAY BALL . . .

Reproduced with permission from SIOS
April 15, 2025
Budgeting for SQL Server Downtime Risk

In this TechRadar Pro article “Budgeting for SQL Server Downtime Risk,” SIOS’ Dave Bermingham emphasizes the importance of aligning business continuity plans with realistic budgets to mitigate interruptions in mission-critical SQL Server deployments. He advises organizations to assess the significance of each SQL Server instance, understand the potential impacts of downtime (including lost revenue, reduced productivity, data corruption, and legal penalties), and allocate appropriate resources, whether on-premises, cloud, or hybrid, to ensure preparedness for disasters.

Reproduced from SIOS
April 10, 2025
Migrating from SIOS DataKeeper for Linux to DRBD

SIOS introduced the Distributed Replicated Block Device (DRBD) Recovery Kit in SIOS LifeKeeper for Linux version 9.9.0. Migrating from SIOS DataKeeper for Linux to DRBD is a simple process for those who want to experiment with DRBD features within LifeKeeper as well as for those who were previously more acquainted with DRBD.

Understanding DRBD and Its Benefits in LifeKeeper

DRBD is a software-based, shared-nothing, replicated storage solution mirroring the content of block devices (hard disks, partitions, logical volumes, etc.) between hosts. The LifeKeeper for Linux DRBD Recovery Kit provides the ability to configure and control DRBD resources for high availability.

Comparing SIOS DataKeeper for Linux and DRBD

SIOS DataKeeper for Linux provides an integrated data mirroring capability for LifeKeeper environments. It’s an alternative for customers who want to build a high availability cluster (using SIOS LifeKeeper) without shared storage or who simply want to replicate business-critical data in real time between servers. SIOS DataKeeper provides synchronous or asynchronous volume-level mirroring to replicate data from the primary server (mirror source) to one or more backup servers (mirror targets).

Steps for creating your PostgreSQL resource are excluded from this blog, but more information on configuring PostgreSQL with SIOS LifeKeeper can be found here.

How to Migrate Your PostgreSQL Database to DRBD
lkcli resource remove --tag pgsql-demo
cp -pra /pgsql-demo* /backup/
lkcli resource create drbd --tag drbd-pgsql-demo --device /dev/mapper/singledrbd-lk1 --fstype ext3 --mount_point /tmp/pgsql-demo
Be sure to select the same fstype as the previous DataKeeper for Linux resource. The devices selected should also be large enough to hold the data and logs of the PostgreSQL database.
lkcli resource extend drbd --tag drbd-pgsql-demo --dest node-a --device /dev/xvdc3 --mode synchronous --laddr 10.15.29.165 --raddr 10.15.27.49
lkcli resource remove --tag /pgsql-demo
chown postgres:postgres /tmp/pgsql-demo
cp -pra /backup/* /tmp/pgsql-demo
lkcli resource remove --tag /tmp/pgsql-demo
lkcli dependency delete --parent /pgsql-demo --child datarep-pgsql-demo
Break the dependency between the file system and the DRBD resource.
lkcli dependency delete --parent /tmp/pgsql-demo --child drbd-pgsql-demo
lkcli dependency create --parent /pgsql-demo --child drbd-pgsql-demo
lkcli resource restore --tag pgsql-demo
BEGIN restore of "pgsql-demo" on server "node-b"
waiting for server to start.... done
server started
END successful restore of "pgsql-demo" on server "node-b"
psql -p <port> -h <socket directory> -U <db user>
For example: psql -p 3308 -h /pgsql-demo/socket -U psql
lkcli resource delete --tag /tmp/pgsql-demo
lkcli resource delete --tag datarep-pgsql-demo
15. Verify switchovers and the connection

Why Migrate from SIOS DataKeeper for Linux to DRBD?

Migrating from SIOS DataKeeper for Linux to DRBD is a simple process for those who want to experiment with DRBD features within LifeKeeper as well as for those who were previously more acquainted with DRBD or want to take advantage of DRBD’s faster async replication speed and a broader array of kernel support.

Ready to get started with DRBD? Contact SIOS today to learn how LifeKeeper can help you migrate smoothly and leverage the full potential of DRBD for high availability and disaster recovery.

Author: Cassius Rhue, VP of Customer Experience at SIOS Technology Corp.

Reproduced with permission from SIOS
April 3, 2025
Why is Storageless/Nodeless Quorum Dangerous for Cluster Availability?

Generally, a quorum is defined as a body or group of people who are present to make decisions. In LifeKeeper, Quorum enforces a consensus that uses the status of nodes in a cluster to decide the next step in handling a node failure within the cluster. LifeKeeper quorum can be operated under three modes: Storage, Majority, and TCP Remote (TCP Remote is only available with LifeKeeper for Linux).
Understanding the Importance of Quorum in Clusters

Quorum’s purpose is to maintain the availability of applications by taking remedial actions to navigate unplanned situations. It accomplishes this by lessening the risk of split-brain situations and reducing downtime by maintaining communication between all the nodes in the cluster.

The Risks of Operating Without Quorum in Your Cluster

There is a risk involved when using a cluster configured without Quorum. The following scenarios address the effect of not having a quorum and the importance of implementing it.

Scenario 1: Reducing downtime

Unintentional downtime can happen when one or more systems are not available for use as a result of an unavoidable event, for example, a crash or a temporary failure in network communication. With quorum modes like Storage or TCP Remote configured, access to storage devices and/or ports can be used to keep track of the status of communication in the cluster. This additional measure can prevent an unnecessary failover that could cause significant downtime. In other cases, Quorum will take measures to either shut down or reboot the server to restore it to a healthy state and avoid longer downtime.

Scenario 2: Split Brain

A split-brain is when multiple systems in the cluster believe they are the primary server. This can happen when a primary server loses communication with its secondary server, and the secondary server believes the primary system went down. This leads to two active primary systems in the cluster. If Majority quorum were configured, another system would be provisioned as the witness to serve as a vote for which system should serve as the primary system, preventing the split-brain from happening.

Why Proper Quorum Configuration Matters

Operating a cluster without storage or majority quorum is dangerous because it increases the risk of experiencing data loss or prolonged downtime as a result of a split-brain and/or a network outage.
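The majority vote described in Scenario 2 can be sketched in a few lines of Python. This is a generic illustration of strict-majority voting with a witness, not LifeKeeper’s actual implementation; the function name `has_quorum` and the vote counts are assumptions made for the example.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True if this node can see a strict majority of all votes.

    reachable_votes counts the node itself plus every voter (peer or
    witness) it can still communicate with.
    """
    return 2 * reachable_votes > total_votes


# Hypothetical cluster: primary + secondary + witness = 3 votes.
# The primary loses its network link and can only see itself.
primary_keeps_running = has_quorum(reachable_votes=1, total_votes=3)

# The secondary can still reach the witness, so it sees 2 of 3 votes.
secondary_may_take_over = has_quorum(reachable_votes=2, total_votes=3)

print("isolated primary has quorum:", primary_keeps_running)
print("secondary + witness has quorum:", secondary_may_take_over)
```

Because only one side of a partition can hold a strict majority, at most one system is allowed to act as primary. In a two-vote cluster with no witness, neither side can reach a majority after a partition, which is why the witness vote matters.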
Using Quorum can provide counteractive measures by making sure the cluster is always healthy and that any unhealthy system is handled appropriately.

Contact SIOS today to learn how our high availability solutions can help you configure quorum the right way and keep your clusters protected.

Author: Alexus Gore, Customer Experience Software Engineer at SIOS Technology Corp.

Reproduced with permission from SIOS