What Is Mean Time to Recovery?
A DORA metric that measures the average time it takes to restore service after a production incident. Elite teams recover in under one hour.
What it means
Mean time to recovery (MTTR), also called mean time to restore, measures the elapsed time from the start of a production incident to the point where service is fully restored. It includes detection time, diagnosis time, fix implementation, and deployment of the fix. MTTR is the fourth DORA metric and represents the stability side of delivery performance: it answers 'when things go wrong, how quickly can we fix them?' The metric is calculated as the average (or median) recovery time across all incidents in a time period. DORA benchmarks: Elite (under 1 hour), High (under 1 day), Medium (under 1 week), Low (over 1 week). MTTR is closely related to deployment frequency, because teams that can deploy quickly can also ship fixes quickly. A team restricted to a weekly deployment window may need up to a week to ship a fix, regardless of how fast they diagnose the issue.
Why Mean Time to Recovery Matters
MTTR directly impacts customer trust, SLA compliance, and revenue. For a SaaS product, every hour of downtime is lost revenue and damaged reputation. For engineering leaders, MTTR reveals the team's incident response maturity: fast detection (monitoring and alerting), efficient diagnosis (observability and runbooks), quick fix deployment (CI/CD automation), and clear communication (incident management process). Teams with high MTTR often have poor observability, no runbooks, manual deployment processes, or unclear incident ownership. Improving MTTR is one of the fastest ways to improve both reliability and developer on-call experience.
How to measure
Track two timestamps per incident: (1) when the incident started, or when it was detected (alert fired or user report received) if the start time is unknown, and (2) when service was fully restored (confirmed by monitoring). MTTR = sum of all recovery times / number of incidents. Use your incident management tool (PagerDuty, Opsgenie, Incident.io) for automatic timestamps. Calculate MTTR monthly to see meaningful trends. Use the median instead of the mean if outlier incidents skew the average. Only include production incidents that affected users; internal staging issues don't count.
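As a minimal sketch of the calculation described above, the following Python snippet computes both the mean and the median from a list of incident timestamps. The incident data is invented for illustration, and the `recovery_minutes` helper is a hypothetical name, not part of any incident management tool's API. In practice the timestamps would come from an export of your own tool.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident records: (detected_at, restored_at) pairs.
# These values are made up for illustration only.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 40)),    # 40 min
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),  # 90 min
    (datetime(2024, 3, 20, 22, 0), datetime(2024, 3, 21, 6, 0)),  # 480 min (outlier)
]


def recovery_minutes(records):
    """Return the recovery time of each incident in minutes."""
    return [
        (restored - detected).total_seconds() / 60
        for detected, restored in records
    ]


durations = recovery_minutes(incidents)
mttr_mean = mean(durations)      # sum of recovery times / number of incidents
mttr_median = median(durations)  # less sensitive to the overnight outlier

print(f"mean MTTR: {mttr_mean:.0f} min, median MTTR: {mttr_median:.0f} min")
```

With this sample data the single overnight incident pulls the mean to roughly 203 minutes while the median stays at 90 minutes, which is why the median can be the more representative figure when outliers are present.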
Real-world example
An e-commerce team has an average MTTR of 4 hours. Breaking it down across their last 10 incidents: detection takes 45 minutes (their alerting thresholds are too lenient), diagnosis takes 1.5 hours (engineers grep through logs manually), fix implementation takes 30 minutes, and deployment takes 1.25 hours (manual release process with approval gates). They make three changes: tighten alert thresholds (detection drops to 5 minutes), add structured logging with a log aggregator (diagnosis drops to 20 minutes), and automate deployment for hotfixes (deployment drops to 10 minutes). Fix implementation still takes 30 minutes, so their MTTR falls to about 65 minutes (5 + 20 + 30 + 10), just over 1 hour and down from 4 hours.
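The arithmetic in this example can be checked with a short Python sketch. The phase durations are taken from the example itself; the dictionary names are only illustrative. Note that fix implementation is not changed by any of the three improvements, so it stays at 30 minutes in the new total.

```python
# Phase durations in minutes, taken from the example above.
before = {"detection": 45, "diagnosis": 90, "fix": 30, "deployment": 75}

# Apply the three changes; fix implementation is left untouched.
after = {**before, "detection": 5, "diagnosis": 20, "deployment": 10}

total_before = sum(before.values())
total_after = sum(after.values())

for phase in before:
    print(f"{phase:<11}{before[phase]:>4} -> {after[phase]:>3} min")
print(f"{'total':<11}{total_before:>4} -> {total_after:>3} min")
```

The totals come to 240 minutes (4 hours) before and 65 minutes after. Once detection, diagnosis, and deployment are fast, the 30 minutes of fix implementation is the largest remaining phase.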
Common questions
What's the difference between MTTR and MTTF?
MTTR (Mean Time to Recovery) measures how quickly you fix failures. MTTF (Mean Time to Failure) measures how long systems run before failing. DORA uses MTTR because it focuses on response capability rather than prevention. In practice, teams should optimize both: reduce failure frequency (which raises MTTF) through testing, and reduce recovery time (which lowers MTTR) through automation and observability.
What is a good MTTR?
DORA benchmarks: Elite teams recover in under 1 hour, High performers under 1 day, Medium under 1 week, Low over 1 week. For most SaaS products, getting under 4 hours is a strong first target. Under 1 hour requires excellent monitoring, automated rollback capabilities, and well-practiced incident response.
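The benchmark tiers above can be expressed as a small lookup. This is a sketch rather than an official DORA tool: the `dora_tier` function name is hypothetical, and because the benchmarks do not say which tier an MTTR of exactly one week belongs to, the code assumes that one week or more counts as Low.

```python
from datetime import timedelta


def dora_tier(mttr: timedelta) -> str:
    """Map an MTTR value to the benchmark tier quoted above.

    Assumption: an MTTR of exactly one week or more is treated as "Low".
    """
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr < timedelta(weeks=1):
        return "Medium"
    return "Low"


for sample in (timedelta(minutes=45), timedelta(hours=4), timedelta(days=3)):
    print(sample, "->", dora_tier(sample))
```

A team that hits the 4-hour first target mentioned above would land in the High tier under this mapping.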
How do you reduce MTTR?
Focus on four areas: (1) Detection: set up proactive monitoring with tight alert thresholds so you catch issues before users report them. (2) Diagnosis: invest in observability (structured logs, distributed tracing, metrics dashboards) and maintain runbooks for common failure modes. (3) Fix: automate deployments so you can ship a hotfix in minutes, not hours. (4) Process: run regular incident retrospectives and practice incident response.
Should MTTR include detection time?
Yes, per the DORA definition. MTTR covers the entire window from incident occurrence to service restoration. This incentivizes investment in monitoring and alerting, not just fast fixes. Some teams also track 'time to detect' separately to isolate monitoring improvements from fix implementation improvements.
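One way to track 'time to detect' separately is to record three timestamps per incident instead of two. The sketch below assumes a hypothetical `Incident` record with `started_at`, `detected_at`, and `restored_at` fields. The field names and sample data are illustrative and do not correspond to any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started_at: datetime   # when user impact began (assumed to be known)
    detected_at: datetime  # alert fired or user report received
    restored_at: datetime  # service confirmed healthy


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


# Invented sample data for illustration.
incidents = [
    Incident(datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 30),
             datetime(2024, 5, 2, 12, 0)),
    Incident(datetime(2024, 5, 9, 16, 0), datetime(2024, 5, 9, 16, 10),
             datetime(2024, 5, 9, 17, 0)),
]

time_to_detect = mean([minutes_between(i.started_at, i.detected_at) for i in incidents])
detect_to_restore = mean([minutes_between(i.detected_at, i.restored_at) for i in incidents])
mttr = mean([minutes_between(i.started_at, i.restored_at) for i in incidents])

print(f"time to detect: {time_to_detect:.0f} min")
print(f"detect to restore: {detect_to_restore:.0f} min")
print(f"MTTR: {mttr:.0f} min")
```

Because the two components add up to the full MTTR, a change in the total can be attributed to either monitoring or fix-side improvements.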
Track Mean Time to Recovery Automatically
Gitmore turns your git activity into automated reports with real metrics — delivered to Slack and email.