How to Reduce MTTR by 60%: Lessons from 10,000 Incidents

MTTR is a composite metric with four distinct phases, each requiring a different intervention. Here's what 10,000 incidents analyzed by ObservabilityOS revealed about where time is actually lost — and how to recover it.

ObservabilityOS Team

Core Engineering & DevRel

July 15, 2026 · 8 min read

MTTR Is Not One Number — It's Four

Mean Time To Repair is the total time from incident start to service restoration. That total is a composite of four distinct phases, each with its own causes of delay and its own interventions. Teams that report MTTR as a single number and try to improve it as a single number tend to stall, because work aimed at one phase does little for the others.

The four phases: TTD (Time to Detect) is how long from actual incident start to first alert. TTI (Time to Investigate) is how long from first alert to identifying the root cause. TTR (Time to Resolve) is how long from root cause identification to service restoration. TTC (Time to Communicate) is the time spent updating stakeholders throughout the incident. Teams optimizing for their longest phase achieve the most improvement per unit of effort.
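The phase definitions above can be sketched as a small data structure. This is a minimal illustration, not ObservabilityOS's data model: the `IncidentTimeline` class and its field names are hypothetical, and TTC is assumed to be tracked as a logged duration because stakeholder updates overlap the other phases.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the key timestamps in one incident."""
    started_at: datetime            # when the fault actually began
    first_alert_at: datetime        # when the first alert fired
    root_cause_at: datetime         # when the root cause was identified
    restored_at: datetime           # when service was restored
    communication_time: timedelta   # assumed: time logged on stakeholder updates

    def phases(self) -> dict:
        """Return the duration of each of the four phases."""
        return {
            "TTD": self.first_alert_at - self.started_at,
            "TTI": self.root_cause_at - self.first_alert_at,
            "TTR": self.restored_at - self.root_cause_at,
            "TTC": self.communication_time,
        }

    def longest_phase(self) -> str:
        """Name of the phase that consumed the most time."""
        phases = self.phases()
        return max(phases, key=phases.get)
```

Calling `longest_phase()` on each incident, then counting the results, gives a quick view of which phase to target first.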

Where Time Is Actually Lost: The Data

Across 10,000 incidents analyzed by ObservabilityOS, the time distribution breaks down as follows: on average, TTI accounts for 58% of total MTTR. This is the investigation phase — the time spent correlating evidence and identifying the root cause. It is where AI has the highest leverage, and it is consistently the most underinvested phase.

TTD accounts for 22% of average MTTR. Static threshold alerting with poor signal-to-noise ratio means many incidents are caught late — engineers investigate alerts in order of arrival, and a real incident may sit in the queue behind noise alerts. Dynamic thresholds and AI triage compress this phase significantly.

TTR (the actual fix) accounts for only 12% of MTTR for most incidents. The fix is usually fast once you know what it is: roll back a deploy, restart a service, scale up a resource. That 12% can be compressed with runbook automation, but the 58% spent on investigation is where the real leverage is.

  • TTD — Time to Detect: 22% of MTTR. Fix: dynamic thresholds, AI anomaly detection, reduce alert queue noise.
  • TTI — Time to Investigate: 58% of MTTR. Fix: AI root cause analysis, structured observability data, incident runbook access.
  • TTR — Time to Resolve: 12% of MTTR. Fix: runbook automation, one-click rollbacks, pre-approved remediation scripts.
  • TTC — Time to Communicate: 8% of MTTR. Fix: AI status page drafts, templated communication, stakeholder alert routing.
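To make the leverage argument concrete, the shares listed above can be turned into simple arithmetic. The sketch below assumes the four shares add up to total MTTR and that cutting one phase leaves the others unchanged; the function name `mttr_reduction` is illustrative.

```python
# Average share of MTTR per phase, taken from the breakdown above.
PHASE_SHARE = {"TTD": 0.22, "TTI": 0.58, "TTR": 0.12, "TTC": 0.08}


def mttr_reduction(phase_cuts: dict) -> float:
    """Fraction of total MTTR removed, given a fractional cut per phase.

    phase_cuts maps a phase name to the fraction of that phase eliminated,
    e.g. {"TTI": 0.5} means investigation time is halved.
    """
    total = 0.0
    for phase, cut in phase_cuts.items():
        if not 0.0 <= cut <= 1.0:
            raise ValueError(f"cut for {phase} must be between 0 and 1")
        total += PHASE_SHARE[phase] * cut
    return total


if __name__ == "__main__":
    print(f"Halving TTI: {mttr_reduction({'TTI': 0.5}):.0%} of MTTR removed")
    print(f"Halving TTR: {mttr_reduction({'TTR': 0.5}):.0%} of MTTR removed")
```

Under these assumptions, halving TTI removes 29% of MTTR while halving TTR removes 6%, which is why the longest phase is the better first target.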

Compressing TTD: Catch Anomalies 5x Faster

The median TTD for incidents using static threshold alerting is 11 minutes. The median TTD for incidents using dynamic Z-score thresholds is 2 minutes. This 5x improvement comes from eliminating the 'alert queue delay' problem: with fewer false positives, engineers process every alert immediately rather than triaging a backlog.
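As an illustration of what a dynamic Z-score threshold can look like, here is a minimal rolling-window sketch. It is not ObservabilityOS's detector: the class name, the window size, the threshold of 3 standard deviations, and the warm-up count are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


class ZScoreDetector:
    """Flags a value as anomalous when it is far from the recent mean."""

    def __init__(self, window: int = 60, threshold: float = 3.0,
                 min_samples: int = 10):
        self.values = deque(maxlen=window)  # rolling window of recent samples
        self.threshold = threshold          # z-score above which we alert
        self.min_samples = min_samples      # warm-up before any alerting

    def observe(self, value: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat baseline: any change is a deviation.
                anomalous = value != mu
        # The sample joins the window either way, so the baseline adapts
        # to sustained level shifts instead of alerting forever.
        self.values.append(value)
        return anomalous
```

Unlike a static threshold, the alert boundary here moves with the metric's recent behavior, so a value that is normal at peak traffic does not page anyone, while the same deviation at 3 AM does.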

A second contributor to long TTD: monitoring gaps. Services that are new, recently refactored, or recently migrated often have incomplete monitoring coverage. An incident in a service with no custom alerts relies on generic infrastructure signals — CPU, memory — which are lagging indicators that don't fire until a service is severely degraded. Regular monitoring coverage audits (ObservabilityOS flags services with no custom alerts) prevent these gaps.
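A coverage audit of this kind can be approximated with a few lines of code. The sketch below is hypothetical: it assumes you can export a list of service names and a list of alert rules, each rule being a dict with a `service` key and a `custom` flag. Neither shape comes from a real ObservabilityOS API.

```python
def services_without_custom_alerts(services, alert_rules):
    """Return the services that have no custom alert rule attached.

    services:    iterable of service names
    alert_rules: iterable of dicts like {"service": "checkout", "custom": True}
                 (hypothetical export format)
    """
    covered = {rule["service"] for rule in alert_rules if rule.get("custom")}
    return sorted(name for name in services if name not in covered)
```

Running a check like this on a schedule, and after every migration or large refactor, surfaces the gaps before an incident does.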

Compressing TTI: From 47 Minutes to About 3

The single most effective intervention for reducing TTI is AI root cause analysis. In ObservabilityOS's dataset, incidents with AI RCA available had a median TTI of 3.2 minutes versus 47 minutes for incidents investigated manually. The AI does not replace the engineer's investigation — it compresses the data gathering and initial hypothesis phases from 20–40 minutes to under 60 seconds.

The second most effective TTI intervention is trace-level observability. Incidents in services with distributed tracing enabled had TTIs 34% shorter than untraced services, because the engineer could immediately see the request path, identify the slow or failing span, and target their investigation. Without traces, investigation requires reconstructing the request path from logs — a much slower process.
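The "identify the slow or failing span" step can be sketched in code. This is a simplified illustration with a hypothetical `Span` structure, not the data model of any particular tracing library. It assumes child spans run sequentially, so a span's own time is its duration minus its children's durations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical, minimal span record."""
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return max(span.duration_ms - children, 0.0)


def suspect_span(spans: List[Span]) -> Span:
    """Pick the span to look at first.

    Prefer a failing span with no failing child (where the error likely
    originated); otherwise fall back to the span with the largest self time.
    """
    parents_of_failing = {s.parent_id for s in spans if s.error}
    origins = [s for s in spans
               if s.error and s.span_id not in parents_of_failing]
    candidates = origins if origins else spans
    return max(candidates, key=lambda s: self_time_ms(s, spans))
```

Without trace data, the same answer has to be reconstructed by correlating timestamps across each service's logs, which is where much of the untraced investigation time goes.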

Training and documentation have a measurable impact as well. Teams with documented incident runbooks available in their observability platform had 28% lower TTI than teams without. Runbooks don't replace investigation, but they prevent engineers from starting from zero on failure modes the team has seen before.

Compressing TTR: Runbooks and Automation

Once you know what's wrong, the fix is usually fast, provided the right tools and permissions are in place. The most common TTR delays are waiting for someone with the right permissions to execute the fix, not knowing the exact command to run, and hesitating to make changes during an active incident without a second pair of eyes.

Pre-approved remediation scripts — stored in your runbook library and executable with a single click from the incident dashboard — eliminate the first two problems. An engineer does not need to remember the exact Kubernetes command to scale a deployment if there's a 'Scale Up Checkout Service' button in the runbook that executes it with audit logging. These scripts should be tested in staging and pre-approved by a senior engineer, removing the caution problem as well.
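A pre-approved action with audit logging might look like the sketch below. Everything here is illustrative: the `RemediationAction` class, the in-memory `AUDIT_LOG`, and the helper names are assumptions, and the only real external interface used is the standard `kubectl scale` command. The action defaults to a dry run that returns the command without executing it.

```python
import shlex
import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class RemediationAction:
    """Hypothetical pre-approved runbook action."""
    name: str
    command: List[str]
    approved_by: str
    tested_in_staging: bool


AUDIT_LOG: List[dict] = []  # stand-in for a durable audit store


def scale_deployment(name, deployment, namespace, replicas, approved_by,
                     tested_in_staging=True) -> RemediationAction:
    """Build a 'scale up' action around the kubectl scale command."""
    command = ["kubectl", "scale", f"deployment/{deployment}",
               f"--replicas={replicas}", "--namespace", namespace]
    return RemediationAction(name, command, approved_by, tested_in_staging)


def run_action(action: RemediationAction, engineer: str,
               execute: bool = False) -> str:
    """Audit-log the action and, if execute is True, actually run it."""
    if not action.tested_in_staging:
        raise PermissionError(f"{action.name} has not been tested in staging")
    AUDIT_LOG.append({
        "action": action.name,
        "engineer": engineer,
        "approved_by": action.approved_by,
        "executed": execute,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if execute:
        subprocess.run(action.command, check=True)
    return shlex.join(action.command)
```

The on-call engineer triggers the action by name and never types the command; the approval and staging checks happened before the incident, not during it.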

Measuring MTTR Improvement: A Baseline First

You cannot improve what you do not measure. Before implementing any of the interventions in this guide, establish a baseline MTTR for each severity tier of incidents — measured separately, not as an average across all severities. A P1 checkout outage and a P4 internal tooling degradation have very different TTDs, TTIs, and TTRs, and improving one type does not necessarily improve the other.
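A per-severity baseline can be computed from exported incident records. The sketch below assumes each incident is a dict with a `severity` label and per-phase durations in minutes under the keys `ttd`, `tti`, `ttr`, and `ttc`. That record shape is hypothetical, and the median is used here so that one unusually long incident does not dominate the baseline.

```python
from collections import defaultdict
from statistics import median

PHASES = ("ttd", "tti", "ttr", "ttc")


def baseline_by_severity(incidents):
    """Median minutes per phase, grouped by severity tier.

    incidents: iterable of dicts such as
        {"severity": "P1", "ttd": 11, "tti": 47, "ttr": 7, "ttc": 5}
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for incident in incidents:
        for phase in PHASES:
            grouped[incident["severity"]][phase].append(incident[phase])
    return {
        severity: {phase: median(values) for phase, values in phases.items()}
        for severity, phases in grouped.items()
    }
```

Keeping the tiers separate means a run of quick P4 fixes cannot mask a P1 investigation phase that is getting slower.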

Set a 90-day improvement target for each phase. A realistic target for teams implementing AI RCA and dynamic thresholds: TTD from 11 minutes to 3 minutes, TTI from 47 minutes to 12 minutes, overall MTTR from 65 minutes to 25 minutes. That is a reduction of roughly 60% in MTTR, achievable within 90 days with tooling and process changes alone, without major architectural changes.
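The targets above translate into percentage reductions with simple arithmetic. The snippet below only restates the article's numbers; the `percent_reduction` helper is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which a duration shrinks going from before to after."""
    return (before - after) / before * 100


# (before, after) in minutes, taken from the 90-day targets above.
TARGETS = {"TTD": (11, 3), "TTI": (47, 12), "MTTR": (65, 25)}

if __name__ == "__main__":
    for name, (before, after) in TARGETS.items():
        print(f"{name}: {before} -> {after} min "
              f"({percent_reduction(before, after):.1f}% reduction)")
    # TTD: 72.7%, TTI: 74.5%, MTTR: 61.5%
```

Tracking the same calculation against your own baseline each month shows whether you are on pace for the target before the 90 days are up.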

About the Author

The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.