Every organization needs to plan for disaster recovery, ensuring that there are predetermined actions in place when things go horribly wrong. In every organization I have had the pleasure of working for, defining disaster recovery has been an essential function of the IT department and one of its main mandates for the business. Two very effective measures for evaluating a disaster recovery plan are Recovery Time Objective (RTO) and Recovery Point Objective (RPO):
- The Recovery Time Objective is the maximum acceptable amount of downtime due to an outage.
- The Recovery Point Objective is the maximum acceptable period of data loss due to an outage.
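To make the two definitions concrete, here is a minimal Python sketch that measures both values for a single outage. The timestamps, the 15-minute RPO, the one-hour RTO, and the function name `measure_outage` are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def measure_outage(last_recovery_point, outage_start, service_restored):
    """Return (data_loss_window, downtime) for one outage."""
    # Data written after the last recovery point is lost when the outage hits.
    data_loss_window = outage_start - last_recovery_point
    # Downtime runs from the start of the outage until service is restored.
    downtime = service_restored - outage_start
    return data_loss_window, downtime


# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 15, 9, 50)
outage = datetime(2024, 1, 15, 10, 0)
restored = datetime(2024, 1, 15, 10, 45)

# Hypothetical objectives for this system.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)

data_loss, downtime = measure_outage(last_backup, outage, restored)
print(f"Data loss window: {data_loss} (within RPO: {data_loss <= rpo})")
print(f"Downtime: {downtime} (within RTO: {downtime <= rto})")
```

In this made-up incident, 10 minutes of data is lost and the system is down for 45 minutes, so both objectives are met.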
Recovery Time Objective
RTO, or the Recovery Time Objective, refers to the maximum amount of time that it should take for an organization to recover an application or system from an outage. To determine an acceptable RTO, IT must work with organization leaders to prioritize systems and applications according to their respective risk for business loss. High-priority systems can potentially have an RTO of seconds if IT is given the resources to invest in appropriate failover infrastructure.
How to Determine Your RTO
The point of defining an RTO is to determine how much risk the organization is willing to absorb with systems down. That means considering tangible damages, such as the potential revenue loss caused by system outages, as well as intangible damages, such as harm to the company’s reputation due to outages.
You can follow these guidelines to determine your RTO:
- Identify critical systems and processes: The first step is to identify which systems and processes are critical to the organization’s operations. This includes anything that is essential for the organization to function, such as financial systems, customer-facing systems, and production systems.
- Assess the impact of system failures: Next, the organization should assess the potential impact of system failures on the organization’s operations. As mentioned before, this would include tangible and intangible factors such as lost revenue, lost productivity, and damage to the organization’s reputation.
- Define the RTO: Based on the assessed impact of system failures, the organization can then determine the RTO for each critical system and process. The RTO should be set at a level that is acceptable to the organization, considering the potential impact of system failures and the organization’s overall risk tolerance.
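The three steps above can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the system names, the dollar figures, the `reputation_weight` multiplier used as a rough stand-in for intangible damage, and the tier thresholds. Real values would come from conversations with organization leaders.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class System:
    name: str
    revenue_loss_per_hour: float  # tangible damage estimate
    reputation_weight: float      # intangible multiplier; 1.0 means none


def hourly_impact(system):
    """Combine tangible and intangible damage into one hourly figure."""
    return system.revenue_loss_per_hour * system.reputation_weight


# Hypothetical tiers: (minimum hourly impact, RTO), highest impact first.
RTO_TIERS = [
    (10_000, timedelta(minutes=15)),
    (1_000, timedelta(hours=4)),
    (0, timedelta(hours=24)),
]


def assign_rto(system):
    """Return the RTO of the first tier whose threshold the system meets."""
    impact = hourly_impact(system)
    for threshold, rto in RTO_TIERS:
        if impact >= threshold:
            return rto
    return RTO_TIERS[-1][1]


# Step 1: identify critical systems (hypothetical inventory).
systems = [
    System("internal-wiki", 100, 1.0),
    System("online-store", 8_000, 1.5),
    System("payroll", 1_500, 1.0),
]

# Steps 2 and 3: assess impact, then define an RTO per system.
ranked = sorted(systems, key=hourly_impact, reverse=True)
rto_by_system = {s.name: assign_rto(s) for s in ranked}

for name, rto in rto_by_system.items():
    print(f"{name}: RTO {rto}")
```

A tier table like this keeps the decision explicit and easy to revisit when the organization's risk tolerance changes.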
Once defined, the RTOs serve as guideposts when developing a disaster recovery plan. It is important to note that non-critical systems should still be considered but would fall more in the category of business continuity.
Recovery Point Objective
RPO, or the Recovery Point Objective, refers to the maximum amount of data, measured as a period of time, that can be lost in the recovery process. An RPO of 10 minutes, for instance, would mean that losing the most recent 10 minutes of data would be considered acceptable.
The achievable value is generally determined by the frequency with which recovery points are created for a given system. High-priority systems can achieve an RPO of zero data loss if recovery points are captured continuously, for example through synchronous replication to a standby system.
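Because the worst case is an outage that strikes just before the next recovery point is created, the interval between recovery points is a reasonable upper bound on data loss. The short Python sketch below checks a set of schedules against their RPOs on that assumption. The data set names, intervals, and RPO values are hypothetical.

```python
from datetime import timedelta


def meets_rpo(recovery_point_interval, rpo):
    """True if the worst-case data loss stays within the RPO.

    Worst case: the outage occurs just before the next recovery point,
    so everything written since the previous one is lost.
    """
    return recovery_point_interval <= rpo


# Hypothetical schedules: name -> (interval between recovery points, RPO).
schedules = {
    "orders-db": (timedelta(minutes=5), timedelta(minutes=10)),
    "file-share": (timedelta(hours=24), timedelta(hours=4)),
}

violations = [
    name
    for name, (interval, rpo) in schedules.items()
    if not meets_rpo(interval, rpo)
]
print("Schedules that cannot meet their RPO:", violations)
```

Here the nightly backup of the file share cannot satisfy a four-hour RPO, so either the backup frequency or the objective would need to change.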
Determining an acceptable RPO may include considering factors such as the importance of your systems to your business operations, the sensitivity of your data, and the potential consequences of lost or corrupted data. As with RTO, following these general guidelines should help in defining this metric:
- Identify critical data: The first step is to identify which data is critical to the organization’s operations. This includes anything that is essential for the organization to function, such as financial data, customer data, and production data.
- Assess the impact of data loss: Next, the organization should assess the potential impact of data loss on the organization’s operations. That means asking “What would that data loss cost us?” The answer could be money, time, resources, or a combination of all those things.
- Define the RPO: Based on the assessed impact of data loss, the organization can then determine the RPO for each critical data set. The RPO should be set at a level that is acceptable to the organization, considering the potential impact of data loss and the organization’s overall risk tolerance.
Once RPOs have been defined, a data backup and recovery plan should be developed as a sub-plan to the larger disaster recovery plan. This plan should include the steps to be taken in the event of an outage to restore critical data. These steps would include procedures for data backups, restoration, whom to contact, and when.
From here, let’s dig into how RTO and RPO fit into a broader disaster recovery plan.
Determine What To Plan For
As part of a disaster recovery plan, you need to consider what types of disaster or outage your organization is preparing for. For example, if your organization is primarily concerned with natural disasters, you may need to design your disaster recovery plan to handle longer periods of downtime. If you are more concerned with cyber-attacks or hardware failures, you may be able to design a disaster recovery plan that can recover operations more quickly.
If natural disasters are common in your organization’s geographic location, then your disaster recovery plan, and by extension your data backup and recovery plan, should include appropriate accommodations. Solutions to those types of outages could include cloud resources or secondary sites outside of the geographic region affected by the natural disasters. Organizations in areas prone to blackouts or brownouts should keep redundant power sources in mind, including backup generators and battery systems.
Designing a Plan
Once you have a clear understanding of your RTO and RPO requirements, you can begin to design and implement a disaster recovery plan that meets those needs. This may involve using a variety of techniques and technologies, such as backup and restore procedures, replication and failover systems, and cloud-based disaster recovery solutions.
Designing and implementing a disaster recovery (DR) plan that meets your organization’s RTO and RPO requirements is just the first step. It is also critical to regularly test and validate the plan to ensure that it is effective and able to meet your organization’s needs.
How a Disaster Recovery Plan Is Used
After defining RTOs and RPOs and developing a disaster recovery plan that includes a data backup and recovery plan, you still must test and validate it on an ongoing basis. A plan is only useful if it is current and realistic, which requires it to grow and change with the organization as systems are added and removed.
After the plan is first created, testing may reveal that it doesn’t work in real-life scenarios the way it does on paper and that it needs to be adjusted to account for other factors. How you test your plan may vary, but it is important to keep testing it so that it remains useful.
Testing the Plan
There are a variety of ways to test and validate your disaster recovery plan, including conducting periodic drills or simulations in which the disaster recovery plan is put to the test. These drills or simulations can help to identify any weaknesses or gaps in your plan and can give you the opportunity to make any necessary improvements or adjustments.
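One way to turn a drill into concrete feedback is to record how long each recovery took and how much data was lost, then compare those measurements with the defined objectives. The following Python sketch shows the idea. The `DrillResult` structure, the system names, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    system: str
    recovery_time: timedelta  # measured time to restore service
    data_loss: timedelta      # measured window of lost data


def find_gaps(results, targets):
    """List every drill measurement that exceeded its RTO or RPO."""
    gaps = []
    for result in results:
        rto, rpo = targets[result.system]
        if result.recovery_time > rto:
            gaps.append(
                f"{result.system}: recovery took {result.recovery_time}, "
                f"RTO is {rto}"
            )
        if result.data_loss > rpo:
            gaps.append(
                f"{result.system}: lost {result.data_loss} of data, "
                f"RPO is {rpo}"
            )
    return gaps


# Hypothetical objectives: system -> (RTO, RPO).
targets = {
    "billing": (timedelta(hours=1), timedelta(minutes=15)),
    "email": (timedelta(hours=4), timedelta(hours=1)),
}

# Hypothetical measurements from one drill.
results = [
    DrillResult("billing", timedelta(hours=2), timedelta(minutes=10)),
    DrillResult("email", timedelta(hours=3), timedelta(minutes=30)),
]

gaps = find_gaps(results, targets)
for gap in gaps:
    print(gap)
```

In this made-up drill, only the billing system misses its RTO, which points to where the plan needs adjusting before the next test.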
These simulations or drills can be run completely in-house using tools available online, such as CISA’s Tabletop Exercise Packages, or by hiring organizations that offer simulation services. Drills can come in all shapes and sizes, but it is useful to involve different departments of the business to best simulate a real-life scenario. Involving only IT in the tabletop exercises severely limits the feedback that can be gathered from them.