RTO and RPO are key numbers a business determines as part of business continuity planning.
Recovery Time Objective, or RTO, is the maximum amount of time a service outage can last and still be considered acceptable by the business. It’s an ‘objective’, in the sense that the organization’s goal is to get the service back up and running within the RTO window. Once you know your RTO, you can plan accordingly to get the service back online within that window.
For instance, a web server might have an RTO of one hour. If that server goes down, the company’s goal is to get it back online in under an hour – and the IT team can plan ahead of time to make sure that happens.
Recovery Point Objective, or RPO, refers to how much data loss is acceptable. If a cache of data has an RPO of one week, that means the business can accept losing up to one week of data. Once a company knows its RPO, it can set its backup cycle accordingly. If a one-week RPO is acceptable, the company only has to back up the data once a week to stay within its RPO.
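To make the link between RPO and backup cycle concrete, here is a minimal Python sketch. It assumes simple periodic backups that complete instantly, so the worst case is a failure just before the next backup runs. The function names are made up for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next run
    loses up to one full interval of data (simplifying assumption)."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


weekly = timedelta(weeks=1)
print(meets_rpo(weekly, rpo=timedelta(weeks=1)))  # True
print(meets_rpo(weekly, rpo=timedelta(days=1)))   # False
```

In a real setup the backup itself takes time to run, so the interval usually needs to be somewhat shorter than the RPO.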
In some cases, any loss of data or service for more than a second or two is unacceptable. If that’s the case, the service has to rely on failover systems to maintain service and data integrity at all times. That level of uptime can be costly to maintain – but as cloud infrastructure has developed, it has become cheaper and more viable, especially for common use cases such as hosting a website.
As you can see, RTO and RPO have some key similarities. They’re both important numbers to know as part of a business continuity plan, and they’re both measured in terms of time. But where RTO applies to service outages, RPO applies to data. From there, several differences arise.
How RTO & RPO Fit Into Business Continuity Planning
A business continuity plan outlines what a business will do in the event of a disaster that takes all or part of the business offline. It’s concerned chiefly with two things: getting essential business functions up and running, and ensuring the business survives the disaster intact.
Some business functions are more essential than others. For a bank, allowing customers to deposit and withdraw money is fundamental. Lending might be an important part of the bank’s business, but its customers won’t riot if the bank puts new loans on pause for a couple of weeks while it recovers from a disaster.
When putting together a business continuity plan, the bank might determine that cash transactions have a six-hour RTO, while issuing new loans has an RTO of two weeks. Once the bank understands that, it can work on preparations so that cash transactions will be back online within six hours of a disaster.
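One way to picture this is a simple table of RTOs per business function. The Python sketch below uses the two RTOs from the bank example. The dictionary layout and helper names are illustrative assumptions, not a standard format.

```python
from datetime import timedelta

# RTOs from the bank example; the structure itself is hypothetical.
rto_by_function = {
    "new loans": timedelta(weeks=2),
    "cash transactions": timedelta(hours=6),
}


def recovery_order(rtos: dict) -> list:
    """Return function names with the tightest RTO first."""
    return sorted(rtos, key=rtos.get)


def breached(rtos: dict, outage_duration: timedelta) -> list:
    """Return the functions whose RTO the outage has already exceeded."""
    return [name for name, rto in rtos.items() if outage_duration > rto]


print(recovery_order(rto_by_function))
# ['cash transactions', 'new loans']
print(breached(rto_by_function, timedelta(hours=8)))
# ['cash transactions']
```

Sorting by RTO gives a rough recovery priority list, which is often the first thing a recovery team needs to know.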
Some data is more essential than other data, as well. You wouldn’t want to lose ten minutes of transaction data, for instance, which is why it makes sense to back up data as it comes in. In this case, the data might have an RPO of less than one second.
Losing two hours of work, on the other hand, is painful – but the work isn’t irreplaceable. Even then, losing work has become increasingly uncommon with the proliferation of software and storage services that simultaneously save files to a cloud server and to the user’s local computer. Twenty years ago, who knew that instantaneous backup would become so common? Not that it ever hurts to email yourself important documents, just in case. 😉
The right time to come up with a business continuity plan is before a disaster strikes – if you’re coming up with it on the fly, it’s not really a plan, is it? But by figuring it out ahead of time, you can set up the failsafes and map out the procedures that will restore essential services as quickly as possible, while minimizing data loss.
RTO & RPO in Practice: Business Impact Analysis
To come up with the right RTO or RPO, you’ll need to ask yourself a few key questions.
- What recovery period would be necessary to keep the business running?
- What recovery period would be ideal?
- What recovery period is possible, given our resources and the potential circumstances?
- What will it take to get a service up and running again in a given timeframe?
- How much will it cost per year to ensure that objective?
- How much money will be lost while the service is down?
- What other impacts might the outage or data loss have?
This type of thinking falls in the realm of business impact analysis, which assesses the potential impact of outages, disruptions, and other adverse events upon a business. Much of it comes down to key questions of what could go wrong and what it would take to mitigate these situations.
A small ecommerce site that goes down might lose $1,000 per hour on average. That can be a lot for a small business, and the business owner might indeed be very motivated to get their site back online as quickly as possible. But it might not make sense to spend hundreds of thousands of dollars a year on a backup site to ensure an RTO of less than an hour.
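A quick back-of-the-envelope calculation shows why. The Python sketch below takes the $1,000 per hour figure from the example and assumes, purely for illustration, that the backup site costs $200,000 a year. It then works out how many hours of downtime the backup site would have to prevent each year just to pay for itself.

```python
def break_even_outage_hours(annual_cost: float, loss_per_hour: float) -> float:
    """Hours of downtime per year the backup site must prevent
    before it pays for itself (lost revenue only)."""
    return annual_cost / loss_per_hour


# $200,000 is an assumed annual cost; $1,000/hour comes from the example.
print(break_even_outage_hours(annual_cost=200_000, loss_per_hour=1_000))
# 200.0
```

Under those assumptions the site would need to prevent 200 hours of downtime a year, which is more than eight full days. This only counts lost revenue, so it leaves out harder-to-price costs such as lost customer trust.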
These cost considerations become very different when you consider a major retailer, such as Amazon, that might lose millions of dollars an hour in the event of an outage – and maybe even worse consequences if Amazon Web Services goes down, too.
Prolonged outages can also result in a loss of customer trust or even bad headlines and stock market dips. The stakes might seem low for a software startup that’s still building its revenue base – but if they lose the trust of key product evangelists, even a six-hour outage could seriously dent their growth trajectory.
Time of day makes a big difference in the event of an outage. A business that makes $50,000 an hour during the day might only pull $5,000 an hour after dark. Is it still worth waking up key personnel in the dead of night to get it back up and running? That’s up to you – and them.
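To put rough numbers on that, here is a small Python sketch that estimates lost revenue for an outage based on when it starts. The hourly rates come from the example above. The 8:00 to 20:00 "daytime" window and the whole-hour granularity are assumptions made for illustration.

```python
def outage_loss(start_hour: int, duration_hours: int,
                day_rate: int = 50_000, night_rate: int = 5_000,
                day_start: int = 8, day_end: int = 20) -> int:
    """Sum lost revenue hour by hour, using the day or night rate
    depending on the clock hour (assumed daytime window: 8:00-20:00)."""
    total = 0
    for offset in range(duration_hours):
        hour = (start_hour + offset) % 24
        total += day_rate if day_start <= hour < day_end else night_rate
    return total


print(outage_loss(start_hour=3, duration_hours=2))   # 10000
print(outage_loss(start_hour=13, duration_hours=2))  # 100000
```

The same two-hour outage costs ten times as much in the afternoon as it does at 3:00 am, which is the kind of gap that can shape an on-call policy.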
Although RTO and RPO do not account for time of day, the business continuity plan they shape can still take timing into account. If key data comes in when markets close at 4:00 pm, for instance, it might make sense to schedule your daily backup for 5:00 pm. Because it’s still a daily backup, your RPO technically is just one day. But if you lose data overnight, your backup will still have a full account of the previous day.
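Here is a short Python sketch of that schedule. It assumes the backup runs at 5:00 pm every day and completes instantly. The dates are arbitrary and the helper name is made up for illustration.

```python
from datetime import datetime, time, timedelta


def last_backup_before(failure: datetime, backup_time: time) -> datetime:
    """Return the most recent daily backup at or before the failure."""
    candidate = datetime.combine(failure.date(), backup_time)
    if candidate > failure:
        candidate -= timedelta(days=1)
    return candidate


failure = datetime(2024, 3, 6, 3, 0)       # outage at 3:00 am (arbitrary date)
backup = last_backup_before(failure, time(17, 0))
market_close = datetime(2024, 3, 5, 16, 0)

print(backup)                  # 2024-03-05 17:00:00
print(failure - backup)        # 10:00:00
print(market_close <= backup)  # True
```

The overnight failure puts ten hours of changes at risk, but the data that arrived at market close is already safely in the 5:00 pm backup.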
RTO, RPO, and Failover Technology
As cloud infrastructure has developed, failover systems have become much more common. By using secondary servers and backup systems, failover systems can keep a service running smoothly in the event the primary servers go down.
Failover systems are simply a part of the business for many web services providers. Many businesses benefit from these systems without having to know much about them.
As a website owner, I don’t worry much about keeping this site online. Instead, I pay my web hosting provider to do that for me. They ensure 99.99% uptime, and I’m sure they use failover systems to do so. Because they manage hosting for thousands of websites, it’s easier and more cost-effective for them to worry about my site’s RTO than for me to come up with my own business continuity plan and set up my own failover systems.
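For a sense of scale, an uptime percentage can be turned into a downtime budget. The Python sketch below assumes uptime is measured over a 365-day year. A hosting provider may measure it over a different period, such as a month.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 365) -> float:
    """Minutes of downtime permitted over the period at a given uptime level."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


print(round(allowed_downtime_minutes(99.99), 2))  # 52.56
```

In other words, 99.99% uptime leaves a little under an hour of downtime across the whole year.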
What Happens if a Company Misses its RTO or RPO
If a company fails to meet its RTO or RPO, the results can sometimes be minimal – and sometimes disastrous. By nature, RTO and RPO are both objectives, or goals. Once you set an RTO, you come up with a plan to achieve that goal.
But disasters can be unruly: as the saying goes, no plan ever survives contact with the enemy. There are disasters you see coming, like a hurricane, and those that can strike at any time, like an earthquake. You might aim for an RTO of one hour, for instance. But if disaster strikes at 3:00 am, your tight one-hour plan might be delayed while your head of IT wakes up, gets dressed, makes coffee, and then comes into work.
RTO and RPO only become ironclad if the outage or data loss could bring down the business. Otherwise, they’re primarily goals that structure business continuity planning. Hopefully, when disaster strikes you can simply execute the plan as written to bring everything back together within the RTO and RPO you set out to achieve.