Mean Time to Recovery (MTTR)
Definition: Mean Time to Recovery measures the average time it takes to restore service after a production failure occurs. It tracks the duration from the moment a failure is detected until it is resolved and the system is fully operational again.
Why It Matters
MTTR is a critical DORA metric that measures the resilience and stability of a system. While Change Failure Rate measures how often failures happen, MTTR measures how quickly the team can respond when they do.
Measures Resilience: A low MTTR indicates that a team has robust monitoring, effective incident response processes, and the ability to diagnose and resolve problems quickly.
Minimizes Customer Impact: The faster you can recover from a failure, the less impact it has on your customers. A low MTTR is essential for maintaining user trust and business continuity.
Encourages Fearless Deployment: When teams know they can recover from failures quickly, they are more confident in deploying changes frequently. This supports a high-velocity development culture.
How to Measure It
MTTR is calculated as the average time taken to resolve failures over a specific period.
MTTR = Total Downtime from Failures / Number of Failures
"Downtime" begins when the incident is first detected (either by monitoring alerts or user reports) and ends when the fix is deployed and the service is stable.
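The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `mean_time_to_recovery` and the incident log are hypothetical, and it assumes each incident is recorded as a (detected, resolved) pair of timestamps.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return total downtime divided by the number of failures, as a timedelta.

    `incidents` is an assumed format: a list of (detected_at, resolved_at) pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Downtime for one incident runs from detection to stable resolution.
    total_downtime = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total_downtime / len(incidents)


# Hypothetical incident log for one reporting period.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),      # 30 minutes
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 16, 0)),     # 120 minutes
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 23, 0)),  # 45 minutes
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:05:00
```

Here the total downtime is 195 minutes across 3 failures, so the MTTR is 65 minutes. Because the result is a mean, a single long outage can dominate it; teams that care about this often track the median alongside it.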
Interpretation & Benchmarks
Goal: Keep MTTR as low as possible.
Focus on Process: Improving MTTR involves optimizing the entire incident response lifecycle, including alerting, on-call procedures, diagnostic tooling, and deployment pipelines.
Industry Benchmarks (DORA):
Elite: Less than one hour.
High: Less than one day.
Medium: Between one day and one week.
Low: More than one week.
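As a rough sketch, the benchmark tiers listed above can be mapped to a measured MTTR as follows. The function name `dora_mttr_tier` is hypothetical, and the handling of exact boundary values (exactly one day or exactly one week counts as Medium) is an assumption, since the tier descriptions do not specify it.

```python
from datetime import timedelta


def dora_mttr_tier(mttr):
    """Map an MTTR (timedelta) to the benchmark tiers listed above.

    Boundary handling is an assumption: exactly one day and exactly
    one week are both treated as Medium.
    """
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    return "Low"


print(dora_mttr_tier(timedelta(minutes=65)))  # High
print(dora_mttr_tier(timedelta(days=3)))      # Medium
```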