RPO, RTO, WRT, MTD…WTH?!

Some time ago, I was engaged in a discussion with one of our customers to investigate the possibility of VMware Site Recovery Manager implementation in their datacenter. The discussion turned technical pretty soon and when I asked what their RPO or RTO requirements were, they could not answer it straight away simply because they didn’t know what it was or what it meant. And when I mentioned WRT and MTD, they were stunned even more. So to clarify it a little bit for them I started drawing and explaining the following along the way.

Consider the following scenario.

Stage 1: Business as usual

At this stage all systems are running production and working correctly.

Stage 2: Disaster occurs

BCDR-02

On a given point in time, disaster occurs and systems needs to be recovered. At this point the Recovery Point Objective (RPO) determines the maximum acceptable amount of data loss measured in time. For example, the maximum tolerable data loss is 15 minutes.

Stage 3: Recovery

BCDR-03

At this stage the system are recovered and back online but not ready for production yet. The Recovery Time Objective (RTO) determines the maximum tolerable amount of time needed to bring all critical systems back online. This covers, for example, restore data from back-up or fix of a failure. In most cases this part is carried out by system administrator, network administrator, storage administrator etc.

Stage 4: Resume Production

BCDR-04

At this stage all systems are recovered, integrity of the system or data is verified and all critical systems can resume normal operations. The Work Recovery Time (WRT) determines the maximum tolerable amount of time that is needed to verify the system and/or data integrity. This could be, for example, checking the databases and logs, making sure the applications or services are running and are available. In most cases those tasks are performed by application administrator, database administrator etc. When all systems affected by the disaster are verified and/or recovered, the environment is ready to resume the production again.

BCDR-05

The sum of RTO and WRT is defined as the Maximum Tolerable Downtime (MTD) which defines the total amount of time that a business process can be disrupted without causing any unacceptable consequences. This value should be defined by the business management team or someone like CTO, CIO or IT manager.

This is of course a simple example of a Business Continuity/Disaster Recovery plan and should be included in your Business Impact Analysis (BIA).

I hope this short explanation gives you some starting points when discussing a Business Continuity/Disaster Recovery implementation with your customer.

Cheers!

– Marek.Z

40 Comments

  1. It seems there is a typo in third last paragraph,”The sum of RTO and PRO” I think you meant “The sum of RTO and WRT”?
    Thanks for the nice graphical explanation.

  2. Great post!

    I just wonder why you use ‘Business as usual’ instead of ‘Backup as usual or Last Valid Backup’ ? IMHO it will be more accurate.

    KB

    • Thanks
      I assume your backup strategy is already in place and you are able to recover from backup based on your company requirements. Besides, backup is out of scope in this blog post. I just wanted to explain what the acronyms mean.

      Cheers!

  3. Nice article.. But I still have a doubt about relation between BIA, RTO and MTD.

    MTD will be input to BIA or vice versa.Kindly clarify

    Thanks
    Mohammed Arif

    • So, basically, you got your BIA for your entire corporation. Not just your IT infra and this article was written based on my consulting job for a few customers.

      Hope this helps.

  4. What about the “think time” to decide if you want to invoke your recovery or attempt to repair the application without invoking? I don’t see this represented here. In the case where the invocation of recovery could result in a loss of data or risk of further issues, you want to be 100% sure you need to invoke before you do it. This increases the MTD.

    • Hi,

      I assumed that the decision for recovery already has bee made and this blogs simply explains the steps that usually are performed during the recover. This of course can be tweaked to suite you environment.

      Cheers!

  5. It’s a wonder why the textbook I’m reading doesn’t employ a graphical explanation such as this one. Thank you!

  6. First, let me say that this is a great explanation of the terms.
    Regarding Leon Funnell’s comment above, I personally have found that the clock for RTO does not start ticking until the decision to declare a disaster has been made. I always include that as a step in my process. That makes it clear to management that the decision to declare is a very important step; and the longer it takes to decide to declare, the longer critical systems remain down.

  7. In our organisation MDT is distinct and separate from RTO. MDT is RTO + think time (nominally 1 hour). This is because unless you have automated and loss-less DR, the decision point must be taken as recovery to DR is not your only option. Reboots, reconfiguration, troubleshooting are usually first steps.

  8. Hi sir, from your point of view, in order to determine RTO for certain process key (note that this is business process, not RTO for IT), do I need to consider WRT as part or RTO?

    • Hi Fara,

      Well, it depends. Is the process operational? Or do you need to perform additional steps to get it operational again.

      Cheers!

  9. Simple,effective representation and explanation of the acronyms, which makes it very to understand and grasp the concepts. As Jason said,many a text book lack this sort of representation. Thanks for the effort and explanation Marek. Helped me in getting clarity.

    Regards,

    Sebastian

  10. Thanx so much this article is great,am studying for my CISSP it clarified a lot of things in a very simple manner…however is Maximum Allowable Downtime(MAD) and Maximum Tolerable Downtime(MTD) the same thing

    • Hi,

      I am not sure. I didn’t see the MAD before but judging from the name, I think this could be the same thing.

    • Normally they are not.
      MAD can be e.g. a legally mandated requirement while MTD can be an internal company target. When you have defined both, MTD would be shorter or same as MAD.
      In the IAAS world, to make it worse, the customer would rarely share true MAD as it can be of a highly confidential nature and ask the vendor to treat MTD as “MAD”. This can create further complexity by obfuscation for the guy asked to solution it. But hey, life is complex.

  11. Hi Marek, what an article indeed..
    I have a small query.. RTO is normally provided by the business if I am not wrong. So, from IT point of view, we will need WRT that is the “time to be taken by IT”. For wxample, Supply chain wants their applications back up in 1 hour, IT will say that we will need 30 Min more time (WRT) to verify the integrity and blah blah.. Hence MTD will become 90 min. The business will start shouting that we gave 60 min RTO and you are taking 90 min… Kindly guide..

    • Hi Umair,

      Well, in most cases, yes. RTO is usually stated by the business as they have to decide what is tolerable. It is a business decision.

      From IT point of view, during the RTO period, you restore your systems to the up-and-running state. After that you need to verify the integrity of the data (WRT) thus, as you describe, the MTD will become your 90 minutes. So, in fact, the IT needs 90 minutes to go back to the production state. But it really depends on your organisations’ procedures and requirements but also on IT staff that is involved with the recovery.

      Hope this helps.

      Cheers!

  12. Happy Morning Marek,

    Below text is from CISM Review Manual where AIW (Acceptable interruption window) was mentioned. Is AIW same as MTD? Need your help to clarify!

    Regards,
    Santha Kumar N.

    4.10.6 BASIS FOR RECOVERY SITE SELECTIONS
    The type of site selected for a response and recovery strategy should be based on the following considerations:
    •AIW—The total time that the organization can wait from the point of failure to the restoration of critical services/applications. After this time, the cumulative losses caused by the interruption may threaten the existence of the organization.

4 Trackbacks / Pingbacks

  1. Business Continuity Wrt | Guide to Continuity Planning
  2. VCAP-DCD Exam Experience | vmroyale.com
  3. Memilih Disaster Recovery Center untuk Bisnis Anda
  4. How to backup NSX?

Leave a reply...