RPO, RTO, WRT, MTD …WTH?!

10 December, 2013 BCDR, Business Continuity/Disaster Recovery, Site Recovery Manager, VMware

Some time ago, I was discussing with one of our customers the possibility of implementing VMware Site Recovery Manager in their datacenter. The discussion turned technical pretty quickly, and when I asked what their RPO and RTO requirements were, they could not answer straight away, simply because they didn’t know what the terms meant. When I mentioned WRT and MTD, they were even more stunned. So, to clarify things a little, I started drawing and explaining the following along the way.

Consider the following scenario.

Stage 1: Business as usual

At this stage all systems are running in production and working correctly.

Stage 2: Disaster occurs

At a given point in time, a disaster occurs and systems need to be recovered. At this point the Recovery Point Objective (RPO) determines the maximum acceptable amount of data loss, measured in time and looking backward from the moment of the disaster. For example, an RPO of 15 minutes means that at most the last 15 minutes of data before the disaster may be lost.
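
To make the RPO check concrete, here is a minimal Python sketch. The function names, the timestamps and the 15-minute value are illustrative assumptions and not part of any VMware or Site Recovery Manager API. The idea is simply that the data-loss window is measured backward from the disaster to the last usable recovery point.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Time between the last usable recovery point and the disaster."""
    if last_recovery_point > disaster_time:
        raise ValueError("recovery point cannot be later than the disaster")
    return disaster_time - last_recovery_point


def meets_rpo(last_recovery_point: datetime, disaster_time: datetime, rpo: timedelta) -> bool:
    """True when the data that would be lost fits within the RPO."""
    return data_loss_window(last_recovery_point, disaster_time) <= rpo


# Hypothetical values: a 15-minute RPO and a replica taken 10 minutes before the disaster.
rpo = timedelta(minutes=15)
disaster = datetime(2013, 12, 10, 14, 0)
last_replica = datetime(2013, 12, 10, 13, 50)

print(meets_rpo(last_replica, disaster, rpo))  # True
```

If the last replica had been taken 20 minutes before the disaster, the same check would return False, which is the signal that the replication or backup interval is too long for the stated RPO.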

Stage 3: Recovery

At this stage the systems are recovered and back online but not yet ready for production. The Recovery Time Objective (RTO) determines the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restoring data from backup or fixing a failure. In most cases this part is carried out by system, network and storage administrators.

Stage 4: Resume Production

At this stage all systems are recovered, the integrity of the systems and data is verified, and all critical systems can resume normal operations. The Work Recovery Time (WRT) determines the maximum tolerable amount of time needed to verify system and/or data integrity. This could be, for example, checking the databases and logs and making sure the applications or services are running and available. In most cases these tasks are performed by application administrators, database administrators etc. When all systems affected by the disaster are verified and/or recovered, the environment is ready to resume production.

The sum of RTO and WRT is defined as the Maximum Tolerable Downtime (MTD), which is the total amount of time that a business process can be disrupted without causing unacceptable consequences. This value should be defined by the business management team or by someone like the CTO, CIO or IT manager.
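
The relationship can also be sketched in a few lines of Python. The durations used here (60 minutes of RTO, 30 minutes of WRT) and the helper names are assumptions chosen purely for illustration; your own values have to come from the business.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD as defined in this post: RTO + WRT."""
    return rto + wrt


def within_mtd(recovery_took: timedelta, verification_took: timedelta, mtd: timedelta) -> bool:
    """True when the actual recovery plus verification time fits within the MTD."""
    return recovery_took + verification_took <= mtd


# Hypothetical objectives, not taken from any real customer.
rto = timedelta(minutes=60)
wrt = timedelta(minutes=30)
mtd = maximum_tolerable_downtime(rto, wrt)

print(mtd)  # 1:30:00
print(within_mtd(timedelta(minutes=50), timedelta(minutes=45), mtd))  # False
```

In the second call the systems came back online within the RTO, but the verification took longer than the WRT, so the total of 95 minutes still exceeds the 90-minute MTD.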

This is of course a simple example of a Business Continuity/Disaster Recovery plan and should be included in your Business Impact Analysis (BIA).

I hope this short explanation gives you some starting points when discussing a Business Continuity/Disaster Recovery implementation with your customer.

Cheers!

– Marek.Z

Obaid says:

11 December, 2013 at 10:44

It seems there is a typo in the third-last paragraph: “The sum of RTO and PRO”. I think you meant “The sum of RTO and WRT”?
Thanks for the nice graphical explanation.

- Marek.Z says:
  
  11 December, 2013 at 11:57
  
  Hi Obaid,
  
  Yes, indeed. Thank you for pointing it out.
  
  Cheers!
  
  - John says:
    
    11 May, 2019 at 18:53
    
    Thank you Marek. Great explanation and easy to understand.
    
Apollokre1d says:

11 October, 2014 at 15:42

Excellent illustration, thank you!

kblnrz says:

18 October, 2014 at 21:33

Great post!

I just wonder why you use ‘Business as usual’ instead of ‘Backup as usual or Last Valid Backup’ ? IMHO it will be more accurate.

KB

- Marek says:
  
  20 October, 2014 at 10:55
  
  Thanks
  I assume your backup strategy is already in place and you are able to recover from backup based on your company requirements. Besides, backup is out of scope in this blog post. I just wanted to explain what the acronyms mean.
  
  Cheers!
  
Kevin Carter says:

13 April, 2015 at 03:30

Great writeup!

Peeyush says:

22 May, 2015 at 12:06

So MTD is the same as MTO? Is it just different terminology, or is there a difference?

- Marek says:
  
  25 May, 2015 at 11:04
  
  Just as depicted, MTD = RTO + WRT.
  
  Don’t know about MTO abbreviation though…
  
Mohammed Arif says:

24 May, 2015 at 14:28

Nice article.. But I still have a doubt about relation between BIA, RTO and MTD.

Will MTD be an input to the BIA, or vice versa? Kindly clarify.

Thanks
Mohammed Arif

- Marek says:
  
  25 May, 2015 at 11:06
  
  So, basically, you have your BIA for your entire corporation, not just your IT infrastructure, and this article was written based on my consulting work for a few customers.
  
  Hope this helps.
  
Mitu says:

17 July, 2015 at 08:26

Thanks a lot Marek.
Now:) i understood the entire concept .

- Marek says:
  
  17 July, 2015 at 09:08
  
  Cheers! 🙂
  
Robert says:

29 August, 2015 at 18:40

Studying for my CompTIA Advanced Security Professional (CASP) exam – this really helped, thank you

- Marek says:
  
  31 August, 2015 at 09:22
  
  Thanks! Good luck on the exam!
  
  Cheers!
  
Leon Funnell says:

24 September, 2015 at 11:32

What about the “think time” to decide if you want to invoke your recovery or attempt to repair the application without invoking? I don’t see this represented here. In the case where the invocation of recovery could result in a loss of data or risk of further issues, you want to be 100% sure you need to invoke before you do it. This increases the MTD.

- Marek says:
  
  24 September, 2015 at 11:41
  
  Hi,
  
  I assumed that the decision to recover has already been made, and this blog simply explains the steps that are usually performed during the recovery. This can of course be tweaked to suit your environment.
  
  Cheers!
  
Alka DeSouza says:

1 December, 2015 at 08:48

Thanks for the explanation. The graphics helped clarify it better.

Jason Lohr says:

9 February, 2016 at 19:06

It’s a wonder why the textbook I’m reading doesn’t employ a graphical explanation such as this one. Thank you!

Chadwick Taylor says:

27 May, 2016 at 18:50

First, let me say that this is a great explanation of the terms.
Regarding Leon Funnell’s comment above, I personally have found that the clock for RTO does not start ticking until the decision to declare a disaster has been made. I always include that as a step in my process. That makes it clear to management that the decision to declare is a very important step; and the longer it takes to decide to declare, the longer critical systems remain down.

- Marek says:
  
  30 May, 2016 at 11:32
  
  Good point.
  
  Imo this really comes down to your approach to BC/DR for your organisation.
  
  Cheers!
  
- Yacine says:
  
  15 September, 2020 at 08:48
  
  Very pertinent and critical point, Taylor. Declaring a crisis remains the C-suite’s responsibility.
  
Leon Funnell says:

31 May, 2016 at 10:55

In our organisation MDT is distinct and separate from RTO. MDT is RTO + think time (nominally 1 hour). This is because unless you have automated and loss-less DR, the decision point must be taken as recovery to DR is not your only option. Reboots, reconfiguration, troubleshooting are usually first steps.

fara says:

19 July, 2016 at 09:22

Hi sir, from your point of view, in order to determine the RTO for a certain key process (note that this is a business process, not an RTO for IT), do I need to consider WRT as part of RTO?

- Marek says:
  
  19 July, 2016 at 11:50
  
  Hi Fara,
  
  Well, it depends. Is the process operational? Or do you need to perform additional steps to get it operational again?
  
  Cheers!
  
Sebastian says:

30 July, 2016 at 07:57

Simple, effective representation and explanation of the acronyms, which makes it very easy to understand and grasp the concepts. As Jason said, many a textbook lacks this sort of representation. Thanks for the effort and explanation, Marek. It helped me get clarity.

Regards,

Sebastian

Tavonga says:

17 August, 2016 at 15:27

Thanx so much, this article is great. I am studying for my CISSP and it clarified a lot of things in a very simple manner… However, are Maximum Allowable Downtime (MAD) and Maximum Tolerable Downtime (MTD) the same thing?

- Marek says:
  
  18 August, 2016 at 10:05
  
  Hi,
  
  I am not sure. I haven’t seen MAD before, but judging from the name, I think this could be the same thing.
  
- Milan Niznansky says:
  
  30 November, 2016 at 12:29
  
  Normally they are not.
  MAD can be e.g. a legally mandated requirement while MTD can be an internal company target. When you have defined both, MTD would be shorter or same as MAD.
  In the IAAS world, to make it worse, the customer would rarely share true MAD as it can be of a highly confidential nature and ask the vendor to treat MTD as “MAD”. This can create further complexity by obfuscation for the guy asked to solution it. But hey, life is complex.
  
  - Marek says:
    
    30 November, 2016 at 16:35
    
    Alright, cool. Thanks for sharing.
    
Umair Ahmad says:

29 August, 2016 at 07:59

Hi Marek, what an article indeed..
I have a small query.. RTO is normally provided by the business, if I am not wrong. So, from an IT point of view, we will need WRT, that is, the “time to be taken by IT”. For example, Supply Chain wants their applications back up in 1 hour; IT will say that we need 30 more minutes (WRT) to verify the integrity and blah blah.. Hence MTD will become 90 min. The business will start shouting that we gave a 60 min RTO and you are taking 90 min… Kindly guide..

- Marek says:
  
  30 August, 2016 at 19:56
  
  Hi Umair,
  
  Well, in most cases, yes. RTO is usually stated by the business as they have to decide what is tolerable. It is a business decision.
  
  From an IT point of view, during the RTO period you restore your systems to the up-and-running state. After that you need to verify the integrity of the data (WRT); thus, as you describe, the MTD becomes your 90 minutes. So, in fact, IT needs 90 minutes to get back to the production state. But it really depends on your organisation’s procedures and requirements, and also on the IT staff involved in the recovery.
  
  Hope this helps.
  
  Cheers!
  
Sameer Jan says:

25 November, 2016 at 18:25

Thanks for the write up … Nicely documented.

Faris Mubarek says:

28 November, 2016 at 13:10

It is an excellent explanation of the key words. Thank you.

Santha Kumar N. says:

2 December, 2016 at 04:09

Happy Morning Marek,

simple and easy to understand with the pictorial representation

Santha Kumar N. says:

13 February, 2017 at 10:05

Happy Morning Marek,

Below text is from CISM Review Manual where AIW (Acceptable interruption window) was mentioned. Is AIW same as MTD? Need your help to clarify!

Regards,
Santha Kumar N.

4.10.6 BASIS FOR RECOVERY SITE SELECTIONS
The type of site selected for a response and recovery strategy should be based on the following considerations:
•AIW—The total time that the organization can wait from the point of failure to the restoration of critical services/applications. After this time, the cumulative losses caused by the interruption may threaten the existence of the organization.

- Marek says:
  
  14 February, 2017 at 13:33
  
  Hi Santha,
  
  Well, judging from the text you posted I would assume that it is indeed the same.
  
  Cheers!
  
Syed says:

4 March, 2017 at 13:45

Thanks for this.. Where does AIW fit into the picture ? Is it the same as MTD ??

- Marek says:
  
  6 March, 2017 at 13:48
  
  Hi,
  
  Well, as already mentioned by the poster above, I would say yes, but I am not 100% sure.
  
  Cheers!
  
Dhamo says:

23 July, 2017 at 00:27

Nice explanation. I was struggling to understand the difference between these. Thank you

Gulia says:

29 August, 2017 at 13:35

Hi,

Thank you, nice explanation, simple and easy

- Marek says:
  
  30 August, 2017 at 09:45
  
  Cheers! 🙂
  
Brandon says:

23 October, 2017 at 19:26

Amazing simplified graphic, really clarified things. Acronym overload makes me go cross eyed so thank you!

- Marek says:
  
  23 October, 2017 at 21:57
  
  Cheers!
  
Mark says:

9 February, 2018 at 15:19

Fabulous discussion, but a couple of nuanced corrections: First, RPO is more or less correct, but it should be stressed that this is a metric that looks backward in time from the disaster. That said, the arrow should just start at the disaster line (Stage 2) and point to the left. Stressing the notion of “backward-looking” and the arrow pointing to the left help communicate this point. Second, your definition of MTD is wrong. Period. Full stop. Whereas all of these definitions are often confused and therefore there is no definitive reference point, the most widely accepted definition of MTD is that it is the recovery metric for a business process, not just for IT assets. Look up the top Google hits on this term and you will see. The definition in the drawing above does not take into account all other factors required to get a business process recovered, including relocating staff, activating alternate workspace, and all the internal and external third party dependencies that are required: vendors, suppliers, business partners, regulators, etc. Some of this stuff may happen before, after, or during RTO, but it is all cumulatively part of MTD. Personally, I’d include all the stuff you put into your MTD definition as part of the RTO metric…and so would you, judging by your own definition of MTD: “When all SYSTEMS affected by the disaster ARE verified and/or RECOVERED.” Caps added by me for emphasis. The tasks that you mention in MTD are part of the recovery of systems and should be part of RTO. Peace.

- Marek says:
  
  9 February, 2018 at 16:02
  
  Hi,
  
  First of all, thanks for taking the time to contribute to this discussion.
  
  Regarding RPO, good point. I’ll update the blog on the backward in time arrow. Btw, the arrows are just to indicate the boundaries not a direction of time.
  
  Regarding MTD, as described below, this post is intended for an IT department. I understand that MTD is not just for IT assets, but I don’t agree with you that the definition is incorrect.
  
  On the other hand, this article is almost 5 years old. I should probably review and update it.
  
  Cheers!
  
Troy M. says:

2 May, 2018 at 15:28

All of these are metrics used in SLAs/SLOs, which are typically measured over periods, often a month. Is there a widely used acronym that describes the total allowable downtime within a period of time? For instance, if you have repeated outages within a month and each outage is fully recovered within the RTO, RPO and MTD, but the total amount of downtime is unacceptable, what acronym is used to describe the maximum tolerable downtime collectively in a period, or the maximum number of allowable outages in a period of time?

- Marek says:
  
  2 May, 2018 at 18:21
  
  To be honest, Troy, I have no idea. Maybe one of the other readers knows…
  
- Albert M. says:
  
  11 December, 2019 at 13:37
  
  I do not believe there is an acronym for cumulative downtime over a certain number of days/months/years. I can only think of SLA, which is expressed in % over a one-year period.
  
Abdullah says:

28 June, 2018 at 21:37

Very nice explanation, Thanks

Jinu says:

18 December, 2018 at 06:20

Thanks Marek.. Explained well with pictorial representation. It helped me to understand the concept easily.

Sasa says:

11 March, 2020 at 08:33

Even in 2020 this post is still used….thanks for your work, great job 🙂

Chris N says:

15 April, 2020 at 01:29

well done. Lovely explanation. Refreshing discussions where I have gleaned a lot of information . Thanks to all of you once again especially the author of the post

Jeff Lowder says:

22 May, 2020 at 08:44

Hi Marek, Great article. Would you be willing to grant me permission to reproduce your diagram in a book I’m writing?

James Anderson says:

21 December, 2020 at 18:59

Very much appreciated this. Thank-you.

Op says:

8 July, 2021 at 12:03

Would RTO be higher than RPO, or what should be the right answer? I am a bit confused. Also, a request: please share any more articles on BIA and BCP metrics with samples.


Default Reasoning

Construction of sensible guesses when some useful information is lacking and no contradictory evidence is present…
