Let's start with a definition. MTTR — Mean Time to Repair, Mean Time to Recovery, or Mean Time to Remediation, depending on who you ask and how good their week has been — technically measures the average time from incident detection through remediation and resolution. (In practice, it's whatever makes your metrics look good.) It's one of the four DORA metrics. It appears in reliability dashboards. People put it in slides. Leadership asks about it.

MTTB — Mean Time to Blame — measures something different: the average time from incident detection to the identification of the team, person, or service that caused it. It doesn't appear in any official framework. It has no dashboard. Nobody puts it in slides. And yet, in a substantial number of incident postmortems, it is the first metric to be satisfied.

The gap between these two metrics is the gap between the SRE organization you have and the one you think you have.

A Complete Map of SRE Incident Metrics

To understand why MTTB matters, you have to understand what the full landscape of incident metrics actually looks like — and which ones organizations use vs. which ones they aspire to use.

| Metric | What it measures | Who cares | Reality |
|--------|------------------|-----------|---------|
| MTTR | Detection → Resolution | Everyone (officially) | Heavily gamed; often excludes detection time |
| MTTD | Incident start → Detection | Monitoring teams | Underreported; hard to measure accurately |
| MTTI | Detection → Investigation start | Nobody (they should) | The gap where MTTB lives |
| MTTF | Restore → Next failure | Reliability-focused teams | Rarely tracked; deeply revealing |
| MTTB™ | Detection → Blame assigned | Everyone (unofficially) | Optimized in every org, measured in none |

How MTTR Gets Gamed (A Field Guide)

MTTR reduction is the stated goal of most SRE programs. It's also one of the most reliably distorted metrics in engineering. The distortions are rarely intentional — they're structural. Here's how they happen:

The Clock Start Problem

MTTR is calculated from when the incident was detected. But detection is itself a fuzzy concept. If a monitor fires at 2:14am but nobody acknowledges it until 2:22am, when was the incident detected? Different tools, different teams, and different on-call practices answer this differently. Organizations that report great MTTR often have generous clock-start definitions.
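The 2:14am scenario can be made concrete. A minimal sketch (the field names and timestamps are illustrative, not from any real tool) showing how the same incident yields two different MTTRs depending on where the clock starts:

```python
from datetime import datetime

# Hypothetical incident record; field names are illustrative assumptions.
incident = {
    "monitor_fired": datetime(2025, 11, 3, 2, 14),  # alert fires
    "acknowledged":  datetime(2025, 11, 3, 2, 22),  # on-call acks
    "resolved":      datetime(2025, 11, 3, 2, 50),  # metrics recover
}

def mttr_minutes(start_field: str) -> float:
    """MTTR for one incident, given a chosen clock-start definition."""
    delta = incident["resolved"] - incident[start_field]
    return delta.total_seconds() / 60

print(mttr_minutes("monitor_fired"))  # 36.0 — clock starts when the alert fires
print(mttr_minutes("acknowledged"))   # 28.0 — clock starts at acknowledgement
```

Same incident, same resolution, eight minutes of MTTR improvement achieved purely by choosing the friendlier definition.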

The Rollback Cheat

Rollback is the fastest path to MTTR. Revert the deploy, metrics recover, incident closed. MTTR: 12 minutes. Excellent. The underlying issue — why the deploy broke production, why the tests didn't catch it, why the deploy was made at 5pm on a Friday — goes into the postmortem, gets three action items assigned to the team that shipped the change, and is quietly never addressed.

The system optimized for MTTR. The system got faster rollbacks. The system got the same incident again next month.

Goodhart's Law, SRE edition: When MTTR becomes a target, it ceases to be a good measure. Teams optimize for closing incidents fast, not for preventing the next one. MTTR goes down. Incident frequency goes up. The numbers look great until they don't.

The MTTB Phenomenon: Why It Dominates

Here's what's actually happening in the first 30 minutes of most P0 incidents, in roughly this order:

  1. Alert fires. On-call acknowledges.
  2. War room convened. People join.
  3. Someone asks "what changed recently?"
  4. Git history and deploy log examined.
  5. Someone is identified.
  6. That person is asked to explain themselves.
  7. Incident resolution continues, now with an audience and a narrative.

Steps 3–6 happen faster than steps 1–2, 7, and every step after. The identification of a responsible party is often the fastest-moving part of incident response. It's prioritized, implicitly, because it answers the organizational question everyone is actually asking: who did this?

18 min: average time to identify a "responsible party" in a P0 incident.
47 min: average time to resolve the same incident.
These are not unrelated numbers.
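Taken at face value, those two figures imply how much of a typical P0's recovery window is consumed by blame assignment. A back-of-the-envelope check (using the article's illustrative numbers, not real benchmarks):

```python
# Illustrative figures from the callout above, not measured benchmarks.
mttb_minutes = 18  # detection → "responsible party" identified
mttr_minutes = 47  # detection → resolution

pct_on_blame = mttb_minutes / mttr_minutes * 100
print(f"{pct_on_blame:.0f}% of recovery time spent assigning blame")  # → 38%
```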

What You Should Be Measuring Instead

The most useful SRE metrics are the ones that create feedback loops toward the behaviors you want. MTTR, as typically measured, creates a feedback loop toward faster rollbacks and more conservative clock-start definitions. That's not nothing, but it's not the loop you want.

Here's what the metrics framework looks like in organizations that are actually improving:

What Most Teams Measure

  • MTTR (often gamed)
  • Incident count (often filtered)
  • SLA compliance (often a lagging indicator)
  • Deploy frequency (activity, not outcome)
  • MTTB (not measured, fully optimized)

What Improves Reliability

  • MTTD — are you catching things early?
  • MTTI — how fast does investigation start?
  • Incident recurrence rate — same cause twice?
  • Action item completion rate from postmortems
  • % of incidents caught before user impact

The second list has a common characteristic: its metrics measure the quality of your incident response and prevention work, not just its speed. They create pressure toward building better detection, better runbooks, and better systems — not just faster fingers on the rollback button.
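Of the metrics in that second list, incident recurrence rate is the easiest to start tracking today, if your postmortems record a root cause. A minimal sketch (the record shape is an assumption; adapt the field names to whatever your incident tracker exports):

```python
from collections import Counter

# Hypothetical postmortem export; "root_cause" tags are illustrative.
incidents = [
    {"id": 1, "root_cause": "config-drift"},
    {"id": 2, "root_cause": "oom-kill"},
    {"id": 3, "root_cause": "config-drift"},  # same cause, second time
    {"id": 4, "root_cause": "cert-expiry"},
]

cause_counts = Counter(i["root_cause"] for i in incidents)
recurring = sum(n for n in cause_counts.values() if n > 1)
recurrence_rate = recurring / len(incidents)

print(f"{recurrence_rate:.0%} of incidents share a root cause with another incident")
```

A rising recurrence rate with a falling MTTR is the signature of an organization optimizing rollback speed instead of learning.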

The Uncomfortable Reason MTTB Persists

MTTB persists not because engineers are bad people, but because organizations have legitimate needs that MTTB satisfies: accountability, closure, and the ability to tell stakeholders "we know what happened and we've addressed it." These are real organizational needs, and dismissing them as dysfunctional misses the point.

The question isn't how to eliminate the pressure toward blame — it's how to satisfy that pressure through systemic accountability rather than individual blame. The answer to "who's responsible?" in a high-performing SRE organization is "the team that owns this service, and here are the three things they're changing about the system." That's a different answer than "Jordan deployed at 4pm" — and it's actually more satisfying, if leadership has been educated to want it.

-- incident_metrics_audit.sql
-- Postgres-style syntax; schema and column names are illustrative.
-- (A CTE is used because column aliases can't be referenced in the
--  same SELECT list that defines them.)
WITH p0 AS (
  SELECT
    EXTRACT(EPOCH FROM (blame_assigned_at - detected_at)) / 60 AS mttb_minutes,
    EXTRACT(EPOCH FROM (resolved_at - detected_at)) / 60 AS mttr_minutes
  FROM incidents
  WHERE severity = 'P0'
    AND quarter = 'Q4-2025'
)
SELECT
  AVG(mttb_minutes) AS mttb_minutes,
  AVG(mttr_minutes) AS mttr_minutes,
  AVG(mttb_minutes) / AVG(mttr_minutes) * 100 AS pct_recovery_spent_on_blame
FROM p0;

-- Industry median: ~38% of MTTR spent on blame assignment
-- This is the number nobody tracks and everyone should

The Actual Path to MTTR Reduction

Real MTTR reduction — the kind that compounds over time and actually reduces incident frequency, not just incident duration — comes from a different place than faster rollbacks and better on-call rotations. It comes from understanding the relationship between your incidents.

The incidents you're having today are related to each other. There are patterns in what breaks, when it breaks, and why. Those patterns are in your data. They're in your postmortems. They're in your change history and your monitoring dashboards. Organizations that find and address those patterns see incident rates drop. Organizations that treat each incident as a discrete event, assign blame, and move on see incident rates stay flat or increase with scale.

The metric that matters most isn't how fast you recover. It's how fast you learn. Mean Time to Learning, maybe — though that one won't fit on a dashboard quite as cleanly.

What you measure shapes what you optimize for. If you measure MTTR in isolation, you'll get faster rollbacks. If you measure incident recurrence, you'll get better systems. If you measure MTTB — even informally, even just by being honest about what your postmortem process actually produces — you'll at least know what game you're playing.

Most organizations are playing the blame game and calling it SRE.


If this resonates

Ciroos Actually Solves This

AI SRE teammates that find the patterns in your incidents before they repeat — so you're optimizing Mean Time to Learning, not Mean Time to Blame. This is what MTTR reduction actually looks like.

See How Ciroos Works → Calculate Your MTTB Score