Question 1

What's the difference between MTTR and MTTF?

Accepted Answer

MTTR (Mean Time to Recovery) measures how long it takes, on average, to restore service after a failure. MTTF (Mean Time to Failure) measures how long a system runs, on average, before it fails. DORA uses MTTR because it focuses on response capability rather than prevention. In practice, teams should optimize both: lengthen the time between failures (raise MTTF) through testing, and shorten recovery time (lower MTTR) through automation and observability.
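
As a rough illustration of the two metrics, the sketch below computes both from the same incident log. The log format and the function names are assumptions made for this example, and MTTF is taken here as the average uptime between one restoration and the next failure.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure_started, service_restored) pairs.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 3, 10, 30), datetime(2024, 1, 3, 12, 0)),
    (datetime(2024, 1, 8, 12, 0), datetime(2024, 1, 8, 13, 0)),
]

def mean_time_to_recovery(incidents):
    """Average downtime per incident (restored minus started)."""
    downtimes = [restored - started for started, restored in incidents]
    return sum(downtimes, timedelta()) / len(downtimes)

def mean_time_to_failure(incidents):
    """Average uptime between one restoration and the next failure."""
    ordered = sorted(incidents)
    uptimes = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(uptimes, timedelta()) / len(uptimes)

print("MTTR:", mean_time_to_recovery(incidents))
print("MTTF:", mean_time_to_failure(incidents))
```

With this sample data the downtimes are 30, 90, and 60 minutes, and the uptimes between incidents are 48 and 120 hours, so a lower MTTR and a higher MTTF are both improvements.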

Question 2

What is a good MTTR?

Accepted Answer

DORA benchmarks: Elite teams recover in under 1 hour, High performers under 1 day, Medium under 1 week, Low over 1 week. For most SaaS products, getting under 4 hours is a strong first target. Under 1 hour requires excellent monitoring, automated rollback capabilities, and well-practiced incident response.
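
The tiers quoted above can be turned into a small lookup, sketched here under the assumption that an MTTR of exactly one week counts as Low. The function name `performance_tier` is hypothetical.

```python
from datetime import timedelta

def performance_tier(mttr: timedelta) -> str:
    """Map an MTTR value to the benchmark tiers quoted in the answer above."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr < timedelta(weeks=1):
        return "Medium"
    # Assumption: exactly one week is treated as Low.
    return "Low"

# A team that has reached the suggested 4-hour first target.
print(performance_tier(timedelta(hours=4)))
```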

Question 3

How do you reduce MTTR?

Accepted Answer

Focus on four areas: (1) Detection: set up proactive monitoring with tight alert thresholds so you catch issues before users report them; (2) Diagnosis: invest in observability (structured logs, distributed tracing, metrics dashboards) and maintain runbooks for common failure modes; (3) Fix: automate deployments so you can ship a hotfix in minutes, not hours; (4) Process: run regular incident retrospectives and practice incident response.
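
To make the detection step more concrete, here is a minimal sketch of a threshold-based alert over a sliding window of request outcomes. The class name `ErrorRateAlert`, the window size, and the threshold are illustrative assumptions, not the behavior of any particular monitoring tool.

```python
from collections import deque

class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.results = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.results.append(failed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        return sum(self.results) / len(self.results) > self.threshold

# Example: a 10-request window that alerts above a 20% error rate.
alert = ErrorRateAlert(window=10, threshold=0.2)
for _ in range(10):
    alert.record(False)            # healthy traffic fills the window
outcomes = [alert.record(True) for _ in range(3)]
print(outcomes)
```

A tighter threshold or a shorter window detects problems sooner at the cost of more false alarms, which is the trade-off to tune when aiming for earlier detection.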

Question 4

Should MTTR include detection time?

Accepted Answer

Yes, per the DORA definition. MTTR covers the entire window from incident occurrence to service restoration. This incentivizes investment in monitoring and alerting, not just fast fixes. Some teams also track 'time to detect' separately to isolate monitoring improvements from fix implementation improvements.
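
The split described above can be sketched as follows: MTTR is measured from occurrence to restoration, while time to detect and the remaining fix time are tracked as separate numbers. The field names `occurred`, `detected`, and `restored` and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with three timestamps each.
incidents = [
    {"occurred": datetime(2024, 3, 1, 9, 0),
     "detected": datetime(2024, 3, 1, 9, 20),
     "restored": datetime(2024, 3, 1, 10, 0)},
    {"occurred": datetime(2024, 3, 5, 14, 0),
     "detected": datetime(2024, 3, 5, 14, 10),
     "restored": datetime(2024, 3, 5, 14, 40)},
]

def mean_minutes(incidents, start_key, end_key):
    """Mean duration in minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )

mttr = mean_minutes(incidents, "occurred", "restored")            # full window
time_to_detect = mean_minutes(incidents, "occurred", "detected")  # monitoring
time_to_fix = mean_minutes(incidents, "detected", "restored")     # response

print(f"MTTR: {mttr} min, detect: {time_to_detect} min, fix: {time_to_fix} min")
```

Measuring only from detection would report the fix time alone, hiding the minutes lost before anyone noticed the incident.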

What Is Mean Time to Recovery?

What it means

Why Mean Time to Recovery matters

How to measure

Real-world example

Related terms

Common questions
