RPO, RTO, WRT, MTD…WTH?!

Some time ago, I was in a discussion with one of our customers about the possibility of implementing VMware Site Recovery Manager in their datacenter. The discussion turned technical pretty quickly, and when I asked what their RPO or RTO requirements were, they could not answer straight away simply because they didn’t know what the terms meant. When I mentioned WRT and MTD, they were even more puzzled. So, to clarify things a little, I started drawing and explaining the following along the way.

Consider the following scenario.

Stage 1: Business as usual

At this stage all systems are running in production and working correctly.

Stage 2: Disaster occurs

BCDR-02

At a given point in time, a disaster occurs and systems need to be recovered. At this point the Recovery Point Objective (RPO) determines the maximum acceptable amount of data loss, measured in time. For example, an RPO of 15 minutes means that at most the last 15 minutes of data before the disaster may be lost.
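
To make the RPO idea concrete, here is a minimal Python sketch. It assumes you know the timestamp of the last recoverable copy of your data (backup, snapshot or replica) and the time of the disaster. The function names and the sample timestamps are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Return the time between the last recoverable copy and the disaster."""
    if last_recovery_point > disaster_time:
        raise ValueError("the recovery point cannot be later than the disaster")
    return disaster_time - last_recovery_point


def meets_rpo(last_recovery_point: datetime, disaster_time: datetime, rpo: timedelta) -> bool:
    """True if the data lost in the disaster stays within the RPO."""
    return data_loss_window(last_recovery_point, disaster_time) <= rpo


# Hypothetical values: RPO of 15 minutes, last replica taken 10 minutes before the disaster.
rpo = timedelta(minutes=15)
disaster = datetime(2014, 1, 1, 12, 0)
last_replica = datetime(2014, 1, 1, 11, 50)

print(data_loss_window(last_replica, disaster))  # 0:10:00
print(meets_rpo(last_replica, disaster, rpo))    # True
```

In practice the interval between your backups or replications is the worst-case data loss window, so it should not be longer than the RPO.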

Stage 3: Recovery

BCDR-03

At this stage the systems are recovered and back online but not ready for production yet. The Recovery Time Objective (RTO) determines the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restoring data from backup or fixing a failure. In most cases this part is carried out by a system administrator, network administrator, storage administrator, etc.

Stage 4: Resume Production

BCDR-04

At this stage all systems are recovered, the integrity of the systems and data is verified, and all critical systems can resume normal operations. The Work Recovery Time (WRT) determines the maximum tolerable amount of time needed to verify the system and/or data integrity. This could be, for example, checking the databases and logs and making sure the applications or services are running and available. In most cases these tasks are performed by an application administrator, database administrator, etc. When all systems affected by the disaster are verified and/or recovered, the environment is ready to resume production.

BCDR-05

The sum of RTO and WRT is defined as the Maximum Tolerable Downtime (MTD), which is the total amount of time that a business process can be disrupted without causing any unacceptable consequences. This value should be defined by the business management team or someone like the CTO, CIO or IT manager.
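
The relationship MTD = RTO + WRT can be sketched in a few lines of Python. This is only an illustration under assumed values: the `RecoveryObjectives` class, the `within_mtd` helper and the 60/30 minute figures are hypothetical and not part of any product or standard.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum time to bring critical systems back online
    wrt: timedelta  # maximum time to verify system and data integrity

    @property
    def mtd(self) -> timedelta:
        """Maximum Tolerable Downtime as the sum of RTO and WRT."""
        return self.rto + self.wrt


def within_mtd(objectives: RecoveryObjectives,
               actual_recovery: timedelta,
               actual_verification: timedelta) -> bool:
    """True if the measured total downtime does not exceed the MTD."""
    return actual_recovery + actual_verification <= objectives.mtd


# Hypothetical objectives: 60 minutes RTO plus 30 minutes WRT.
objectives = RecoveryObjectives(rto=timedelta(minutes=60), wrt=timedelta(minutes=30))
print(objectives.mtd)  # 1:30:00

# A recovery test that took 50 minutes to restore and 35 minutes to verify.
print(within_mtd(objectives, timedelta(minutes=50), timedelta(minutes=35)))  # True
```

Note that the check compares the total downtime against the MTD, so a slightly long verification can be compensated by a quick restore. Whether you also want to enforce RTO and WRT individually depends on your own requirements.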

This is of course a simple example of a Business Continuity/Disaster Recovery plan and should be included in your Business Impact Analysis (BIA).

I hope this short explanation gives you some starting points when discussing a Business Continuity/Disaster Recovery implementation with your customer.

Cheers!

– Marek.Z

About Marek (288 Articles)
Marek is an IT professional with 15+ years of experience in the IT industry and is currently working as PSO Senior Consultant SDDC at VMware for the NEMEA region.

35 Comments on RPO, RTO, WRT, MTD…WTH?!

  1. It seems there is a typo in the third-last paragraph: “The sum of RTO and PRO”. I think you meant “The sum of RTO and WRT”?
    Thanks for the nice graphical explanation.

  2. Apollokre1d // 11 October, 2014 at 15:42

    Excellent illustration, thank you!

  3. Great post!

    I just wonder why you use ‘Business as usual’ instead of ‘Backup as usual’ or ‘Last Valid Backup’? IMHO it would be more accurate.

    KB

    • Thanks
      I assume your backup strategy is already in place and you are able to recover from backup based on your company requirements. Besides, backup is out of scope in this blog post. I just wanted to explain what the acronyms mean.

      Cheers!

  4. So is MTD the same as MTO? Is it just different terminology, or is there a difference?

  5. Mohammed Arif // 24 May, 2015 at 14:28

    Nice article. But I still have a doubt about the relation between BIA, RTO and MTD.

    Will MTD be an input to the BIA, or vice versa? Kindly clarify.

    Thanks
    Mohammed Arif

    • So, basically, you have a BIA for your entire corporation, not just for your IT infrastructure. This article was written based on my consulting work for a few customers.

      Hope this helps.

  6. Thanks a lot Marek.
    Now :) I understand the entire concept.

  7. Studying for my CompTIA Advanced Security Professional (CASP) exam – this really helped, thank you

  8. Leon Funnell // 24 September, 2015 at 11:32

    What about the “think time” to decide if you want to invoke your recovery or attempt to repair the application without invoking? I don’t see this represented here. In the case where the invocation of recovery could result in a loss of data or risk of further issues, you want to be 100% sure you need to invoke before you do it. This increases the MTD.

    • Hi,

      I assumed that the decision to recover has already been made, and this blog post simply explains the steps that are usually performed during the recovery. This can of course be tweaked to suit your environment.

      Cheers!

  9. Alka DeSouza // 1 December, 2015 at 08:48

    Thanks for the explanation. The graphics helped clarify it better.

  10. Jason Lohr // 9 February, 2016 at 19:06

    It’s a wonder why the textbook I’m reading doesn’t employ a graphical explanation such as this one. Thank you!

  11. Chadwick Taylor // 27 May, 2016 at 18:50

    First, let me say that this is a great explanation of the terms.
    Regarding Leon Funnell’s comment above, I personally have found that the clock for RTO does not start ticking until the decision to declare a disaster has been made. I always include that as a step in my process. That makes it clear to management that the decision to declare is a very important step; and the longer it takes to decide to declare, the longer critical systems remain down.

    • Good point.

      Imo this really comes down to your approach to BC/DR for your organisation.

      Cheers!

  12. In our organisation MDT is distinct and separate from RTO. MDT is RTO + think time (nominally 1 hour). This is because unless you have automated and loss-less DR, the decision point must be taken as recovery to DR is not your only option. Reboots, reconfiguration, troubleshooting are usually first steps.

  13. Hi sir, from your point of view, in order to determine the RTO for a certain key process (note that this is a business process, not the RTO for IT), do I need to consider WRT as part of the RTO?

    • Hi Fara,

      Well, it depends. Is the process operational? Or do you need to perform additional steps to get it operational again?

      Cheers!

  14. Sebastian // 30 July, 2016 at 07:57

    A simple, effective representation and explanation of the acronyms, which makes it very easy to understand and grasp the concepts. As Jason said, many textbooks lack this sort of representation. Thanks for the effort and explanation, Marek. It helped me get clarity.

    Regards,

    Sebastian

  15. Thanks so much, this article is great. I am studying for my CISSP and it clarified a lot of things in a very simple manner. However, are Maximum Allowable Downtime (MAD) and Maximum Tolerable Downtime (MTD) the same thing?

    • Hi,

      I am not sure. I haven’t seen MAD before, but judging from the name, I think it could be the same thing.

    • Milan Niznansky // 30 November, 2016 at 12:29

      Normally they are not.
      MAD can be, e.g., a legally mandated requirement, while MTD can be an internal company target. When you have defined both, MTD would be shorter than or the same as MAD.
      In the IaaS world, to make it worse, the customer would rarely share the true MAD, as it can be of a highly confidential nature, and would ask the vendor to treat MTD as “MAD”. This can create further complexity by obfuscation for the guy asked to solution it. But hey, life is complex.

  16. Umair Ahmad // 29 August, 2016 at 07:59

    Hi Marek, what an article indeed.
    I have a small query. RTO is normally provided by the business, if I am not wrong. So, from the IT point of view, we will need WRT, that is, the “time to be taken by IT”. For example, Supply Chain wants their applications back up in 1 hour; IT will say that we need 30 more minutes (WRT) to verify the integrity and blah blah. Hence MTD will become 90 min. The business will start shouting that they gave a 60 min RTO and we are taking 90 min. Kindly guide.

    • Hi Umair,

      Well, in most cases, yes. RTO is usually stated by the business as they have to decide what is tolerable. It is a business decision.

      From the IT point of view, during the RTO period you restore your systems to the up-and-running state. After that you need to verify the integrity of the data (WRT); thus, as you describe, the MTD becomes 90 minutes. So, in fact, IT needs 90 minutes to get back to the production state. It really depends on your organisation’s procedures and requirements, but also on the IT staff involved with the recovery.

      Hope this helps.

      Cheers!

  17. Thanks for the write up … Nicely documented.

  18. It is an excellent explanation of the key words. Thank you.

  19. Santha Kumar N. // 2 December, 2016 at 04:09

    Happy Morning Marek,

    simple and easy to understand with the pictorial representation

  20. Santha Kumar N. // 13 February, 2017 at 10:05

    Happy Morning Marek,

    The text below is from the CISM Review Manual, where AIW (acceptable interruption window) is mentioned. Is AIW the same as MTD? I need your help to clarify!

    Regards,
    Santha Kumar N.

    4.10.6 BASIS FOR RECOVERY SITE SELECTIONS
    The type of site selected for a response and recovery strategy should be based on the following considerations:
    •AIW—The total time that the organization can wait from the point of failure to the restoration of critical services/applications. After this time, the cumulative losses caused by the interruption may threaten the existence of the organization.

