Incident Response Engineering Done Right

Severity definitions, runbooks that get read at 3 a.m., on-call rotation health, and blameless post-mortems that actually change behavior.

Datadog fired at 10:14 a.m. Pacific. By the time I joined the Slack thread, the Aurora reader replicas at the creator economy platform I worked at were 14 minutes behind the writer and still drifting. Community feeds were degraded for millions of customers. I wasn’t on-call that week. I was tagged because I owned the Aurora layer for the Community product and the on-call engineer wanted a second pair of eyes.

That whole incident took about 22 minutes from page to recovery. Most of what made it 22 minutes and not 2 hours was process, not cleverness. Severity defined ahead of time. A runbook that started with the right command. A war-room channel that auto-opened with templated roles. The actual fix, killing a runaway ANALYZE on a hot table, took roughly 30 seconds once we’d diagnosed it. Everything else was the system around the fix.

That’s the thing I’d put up front. Incident response is mostly writing and process. The team that practices severity, runbooks, and post-mortems outperforms the team with smarter engineers, almost every time.

Severity definitions that survive contact

Most severity definitions I’ve inherited at past gigs read like marketing copy. “Critical issue impacting business operations” doesn’t help the engineer at 3 a.m. who has 90 seconds to decide whether to wake up half the staff.

Severity should be defined by user impact and reversibility, not by which service is on fire. We had four levels at the creator platform:

SEV1: writes are failing, or reads are degraded for the majority of customers, or money is moving wrong.
SEV2: a major surface is degraded but most users are fine. Or writes are slow but completing.
SEV3: one customer or one segment affected. No money implications.
SEV4: known issue, tracked, not paging.

The Aurora replica lag thing was a SEV1 inside of 90 seconds. Writes still worked, but read p99 on /communities/:id/posts had gone from around 120 ms to over 8 seconds, and that’s the entire experience for the Community product. Money wasn’t moving wrong, but the customer impact was the majority of the surface area.

I like having the severity decision encoded somewhere both the on-call and the SRE bot agree on. A small helper that you can call from a paging integration:

# app/incidents/severity_classifier.rb
module Incidents
  class SeverityClassifier
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"

    def self.classify(signal)
      return SEV1 if signal.writes_failing?
      return SEV1 if signal.money_path_affected?
      return SEV1 if signal.read_error_rate >= 0.10 && signal.surface_is_major?

      return SEV2 if signal.read_error_rate >= 0.02
      return SEV2 if signal.p99_latency_ms >= 2_000 && signal.surface_is_major?

      SEV3
    end
  end
end

It’s not a replacement for judgment, it’s a tiebreaker. When the on-call is sleep-deprived and the page is ambiguous, the helper picks. PagerDuty’s routing then picks who gets woken up.

# pagerduty-routes.yaml
routing_rules:
  - if: "severity == 'SEV1'"
    escalation_policy: sev1-everyone-up
    notify:
      - on_call_engineer
      - incident_commander_pool
      - eng_manager_on_call
  - if: "severity == 'SEV2'"
    escalation_policy: sev2-primary
    notify:
      - on_call_engineer

The runbook that gets read at 3 a.m.

The best runbook I ever wrote starts with the literal sentence: “Before touching reader scaling, check pg_stat_activity on the writer.” The reason it starts that way is the Aurora incident. The on-call’s first instinct that morning was reasonable: bump reader instance class up two tiers. The readers were fine. They were starved of WAL because a long-running ANALYZE on community_posts was holding write-side locks. I’m the reason that sentence is in the runbook.

Most runbooks I’ve seen open with paragraphs of architecture context. That’s a docs page, not a runbook. A runbook is for the engineer who’s already paged, already in the war room, and needs the first command to type, right now. Lead with the action. Background can live three scrolls down.

A real excerpt, lightly anonymized:

# Runbook: Aurora reader replica lag > 60s

## First command (do this before anything else)

    psql "$WRITER_URL" -c "
      SELECT pid, now() - query_start AS duration, state, query
      FROM pg_stat_activity
      WHERE state != 'idle'
        AND now() - query_start > interval '30 seconds'
      ORDER BY duration DESC;
    "


If you see a long-running ANALYZE, VACUUM, or anything maintenance-flavored:
  - kill it: `SELECT pg_terminate_backend(<pid>);`
  - drain time is usually 5-10 minutes. Do not bump reader instance class.

## Only if pg_stat_activity is clean
- Check WAL emission rate on the writer (link to Datadog).
- Check replica IO throughput per reader (link to Datadog).
- Last resort: bump reader instance class one tier. This is a 20+ minute operation.

## Rollback
- If you bumped the instance class and lag did not move, revert. Do not stack changes.

Imperatives. Links to dashboards inline. Always a rollback section, because the first thing you try will sometimes be wrong and you need a path back.
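
The triage step in that first command can also be expressed as a small filter. This is a sketch only: it assumes the pg_stat_activity rows have already been fetched into hashes with a duration in seconds, and the helper name, the hash keys, and the REINDEX and CLUSTER entries in the pattern are illustrative assumptions rather than part of the original runbook.

```ruby
# Hypothetical triage helper: given rows already fetched from
# pg_stat_activity (as hashes), return the non-idle maintenance commands
# that have been running longer than the runbook's 30-second cutoff.
MAINTENANCE_PATTERN = /\A\s*(ANALYZE|VACUUM|REINDEX|CLUSTER)\b/i
MIN_DURATION_SECONDS = 30

def maintenance_suspects(rows)
  rows
    .select { |r| r[:state] != "idle" }
    .select { |r| r[:duration_seconds] > MIN_DURATION_SECONDS }
    .select { |r| r[:query].match?(MAINTENANCE_PATTERN) }
    .sort_by { |r| -r[:duration_seconds] } # longest-running first
end

# Made-up rows for illustration.
rows = [
  { pid: 101, state: "active", duration_seconds: 840, query: "ANALYZE community_posts;" },
  { pid: 102, state: "active", duration_seconds: 45,  query: "SELECT * FROM posts WHERE id = 1" },
  { pid: 103, state: "idle",   duration_seconds: 900, query: "VACUUM users;" },
  { pid: 104, state: "active", duration_seconds: 5,   query: "vacuum analyze tags;" },
]

maintenance_suspects(rows).each do |r|
  puts "pid #{r[:pid]} running #{r[:duration_seconds]}s: #{r[:query]}"
end
```

Only pid 101 survives the filter here: 102 is not maintenance, 103 is idle, and 104 has not run long enough to matter.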

On-call rotation health

If you only measure incident response by MTTR, you’ll miss the slow-motion failure: rotation burnout. The signal I trust is page volume per engineer per week. Anything north of three real pages a week and you’ve got a noisy system, not a heroic team.

At the federation platform I CTO’d in London, we ran a small script every Monday that pulled PagerDuty page counts for each rotation, broken down into actionable pages and noise. The script doesn’t fix anything, but it puts the bad rotation in front of the whole eng leadership every week. You can’t fix what you don’t surface.

// scripts/oncall-health.ts
import { PagerDutyClient } from "./pagerduty-client";
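import { startOfWeek, endOfWeek, subWeeks } from "date-fns";

const THRESHOLD_PAGES_PER_WEEK = 3;

type RotationHealth = {
  rotation: string;
  engineer: string;
  pages: number;
  actionable: number;
  noise: number;
};

async function weeklyHealth(): Promise<RotationHealth[]> {
  const pd = new PagerDutyClient(process.env.PD_API_TOKEN!);
  // The script runs on Monday, so report on the previous Monday-to-Sunday
  // week rather than the week that has only just started.
  const lastWeek = subWeeks(new Date(), 1);
  const since = startOfWeek(lastWeek, { weekStartsOn: 1 }).toISOString();
  const until = endOfWeek(lastWeek, { weekStartsOn: 1 }).toISOString();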

  const incidents = await pd.listIncidents({ since, until });

  const buckets = new Map<string, RotationHealth>();
  for (const inc of incidents) {
    const key = `${inc.rotation}::${inc.assigneeEmail}`;
    const row = buckets.get(key) ?? {
      rotation: inc.rotation,
      engineer: inc.assigneeEmail,
      pages: 0,
      actionable: 0,
      noise: 0,
    };
    row.pages += 1;
    if (inc.labels.includes("noise")) row.noise += 1;
    else row.actionable += 1;
    buckets.set(key, row);
  }

  return [...buckets.values()].filter((r) => r.pages > THRESHOLD_PAGES_PER_WEEK);
}

The pages-with-label workflow only works if the on-call labels every incident after the fact. We made that part of the handoff template. If you didn’t label your pages from last week, you don’t get to hand off the pager.
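
The handoff rule is simple enough to check mechanically. A minimal sketch, assuming incidents are available as hashes with an assignee and a list of labels: the "noise" label comes from the workflow above, while the "actionable" label, the hash keys, and the method names are assumptions made for the example.

```ruby
# Hypothetical handoff gate. An incident counts as triaged once it
# carries at least one of the triage labels.
TRIAGE_LABELS = %w[actionable noise].freeze

# Incidents assigned to the engineer that nobody has labeled yet.
def unlabeled_incidents(incidents, engineer)
  incidents.select do |inc|
    inc[:assignee] == engineer && (inc[:labels] & TRIAGE_LABELS).empty?
  end
end

# The pager can change hands only when nothing is left unlabeled.
def handoff_allowed?(incidents, engineer)
  unlabeled_incidents(incidents, engineer).empty?
end

# Made-up data for illustration.
last_week = [
  { id: "INC-1", assignee: "ana@example.com", labels: ["noise"] },
  { id: "INC-2", assignee: "ana@example.com", labels: [] },
  { id: "INC-3", assignee: "ben@example.com", labels: ["actionable", "db"] },
]

puts handoff_allowed?(last_week, "ana@example.com") # INC-2 still needs a label
puts handoff_allowed?(last_week, "ben@example.com")
```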

The incident commander role

Single biggest mistake I made in a war room, ever, was running comms and debugging at the same time. Real-time trading platform, market-open WebSocket reconnect storm, I was on-call and I made myself incident commander out of habit. About 8 minutes in I was simultaneously typing into nginx config, watching a Datadog graph, and trying to write a status update for the support channel. The status update was 4 minutes late and read like a confused diary entry.

One person runs comms. A different person debugs. The IC is not a fancy title, it’s a hat. The IC takes timestamped notes in the war room channel, decides when to escalate, decides when to declare resolved, talks to leadership. They do not type commands at the database.

A small Slack workflow we eventually built auto-opened a war-room channel when anyone reacted to a page with the siren emoji:

# slack-workflows/sev1-warroom.yaml
trigger:
  type: reaction_added
  reaction: rotating_light
  channel: pagerduty-feeds
action:
  create_channel:
    name: "inc-{{date.YYYYMMDD}}-{{incident.id}}"
    invite:
      - "{{incident.assignee}}"
      - "@incident-commanders"
      - "@eng-managers"
    post_message: |
      *Incident*: {{incident.title}}
      *Severity*: {{incident.severity}}
      *IC*: react with :crown: to claim
      *Debugger*: react with :wrench: to claim
      *Comms*: react with :speech_balloon: to claim

      Reminder: the IC does not debug. Debugger does not write comms.

The reminder line at the bottom matters more than the channel automation. Half the people in the channel forget the rule under stress.
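
Because people forget the rule under stress, a bot can enforce it as well as state it. The sketch below assumes role claims are collected into a hash of role to person; the role keys and the role_conflicts helper are hypothetical, and the two forbidden pairs are taken directly from the reminder line in the template.

```ruby
# Pairs of roles that the reminder says one person must not hold together:
# the IC does not debug, and the debugger does not write comms.
FORBIDDEN_PAIRS = [[:ic, :debugger], [:debugger, :comms]].freeze

# Returns the forbidden pairs currently claimed by the same person.
def role_conflicts(claims)
  FORBIDDEN_PAIRS.select do |a, b|
    claims[a] && claims[a] == claims[b]
  end
end

# Made-up claims for illustration.
bad  = { ic: "ana", debugger: "ana", comms: "ben" }
good = { ic: "ana", debugger: "ben", comms: "cal" }

puts role_conflicts(bad).inspect
puts role_conflicts(good).inspect
```

An unclaimed role is not treated as a conflict, so the check stays quiet while people are still reacting to the message.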

Blameless post-mortems that change behavior

Blameless does not mean “no one made a mistake”. It means the post-mortem looks for the systemic reason a mistake was possible. People will always make mistakes. The system either caught it or it didn’t.

Late one evening I’d approved a Rails migration on a hot Postgres table. Non-null column with a default, table with hundreds of millions of rows. Used the supposedly-safe helper from the strong_migrations gem. Ran the deploy past midnight UTC because that was “off-peak”. It still acquired ACCESS EXCLUSIVE on the table for about 87 seconds during the backfill. Login error rate was 100% for 85 seconds at Pacific peak hours.

The blameless part wasn’t “I approved a bad migration, oh well”. The blameless part was: why didn’t CI catch a non-null-default on a giant table? Why did “strong_migrations” feel safe enough to skip a manual review of the migration file? Why did the deploy window logic call 6 p.m. Pacific “off-peak”? Three systemic gaps, three concrete action items, each with an owner and a date. The action I owned shipped a strong_migrations rule that blocks any add_column with a non-null default against tables with more than 10 million rows. It went into CI within a week. That rule is still there.
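
The decision behind that rule fits in a few lines. What follows is a sketch of the predicate only, not the rule as it shipped: the method name and the way the row estimate is obtained are assumptions, and the wiring into strong_migrations (for example through the custom-check hook the gem's README describes) is left out.

```ruby
ROW_LIMIT = 10_000_000

# options is the hash a migration passes to add_column, for example
# { null: false, default: 0 }. estimated_rows is assumed to come from a
# cheap estimate such as pg_class.reltuples, not from COUNT(*).
def blocks_add_column?(options, estimated_rows)
  non_null_with_default = options[:null] == false && options.key?(:default)
  non_null_with_default && estimated_rows > ROW_LIMIT
end

# Made-up cases for illustration.
puts blocks_add_column?({ null: false, default: 0 }, 300_000_000) # big table, risky shape
puts blocks_add_column?({ null: false, default: 0 }, 50_000)      # small table
puts blocks_add_column?({ default: 0 }, 300_000_000)              # nullable column
```

Keeping the predicate free of database calls makes it easy to unit test, which matters for a rule whose whole job is to be trusted at review time.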

A post-mortem without owned, dated action items is theater. We had a rule: nobody could close out a post-mortem doc until every action item had a name and a date next to it.

MTTR is a vanity metric most of the time

MTTR gets quoted as if it’s the headline. It isn’t. Time-to-detect and time-to-mitigate matter more.

The Aurora replica lag incident had a TTD of about 2 minutes (Datadog fired fast) and a TTM of about 6 minutes once we’d diagnosed it. Full TTR was several days if you count the runbook update and the db_safe_maintenance.rb helper that now refuses to run heavy maintenance commands during peak hours. TTR is dominated by paperwork. TTD and TTM are the numbers I track.
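
A guard like db_safe_maintenance.rb can be very small. The sketch below is not the original helper. It assumes the caller passes in the current hour in Pacific time, and the peak window, the list of heavy commands, and all names inside the module are illustrative assumptions.

```ruby
module DbSafeMaintenance
  # Commands treated as heavy maintenance (assumed list).
  HEAVY_COMMANDS = /\A\s*(ANALYZE|VACUUM|REINDEX|CLUSTER)\b/i
  # Assumed peak window, in Pacific hours: 07:00 up to but excluding 22:00.
  PEAK_HOURS_PACIFIC = (7...22)

  class PeakHoursError < StandardError; end

  def self.heavy?(sql)
    sql.match?(HEAVY_COMMANDS)
  end

  # Raises if the statement is heavy maintenance and the hour falls in the
  # peak window; returns true otherwise so callers can proceed.
  def self.check!(sql, pacific_hour)
    if heavy?(sql) && PEAK_HOURS_PACIFIC.cover?(pacific_hour)
      raise PeakHoursError, "refusing to run maintenance during peak hours: #{sql}"
    end
    true
  end
end

DbSafeMaintenance.check!("ANALYZE community_posts", 3)  # allowed, off-peak
DbSafeMaintenance.check!("SELECT 1", 10)                # allowed, not maintenance

begin
  DbSafeMaintenance.check!("ANALYZE community_posts", 10)
rescue DbSafeMaintenance::PeakHoursError => e
  puts e.message
end
```

Taking the hour as an argument keeps time-zone handling out of the guard and makes the peak window trivial to test.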

I prefer SLO-based monitors over raw threshold alerts for the same reason. A 30-minute error budget burn at 5x the budgeted rate tells you something real about user pain. A static error_rate > 5% threshold does not.

# datadog/monitors/community-reads-slo.yaml
name: "community reads SLO burn (5x for 30m)"
type: slo alert
query: |
  burn_rate("slo:community-reads-availability").last("30m") > 5
message: |
  Community reads SLO is burning at 5x budget over 30 minutes.
  Suggested severity: SEV2. Escalate to SEV1 if surface is degraded for the majority.
  Runbook: https://internal/runbooks/community-reads
tags:
  - product:community
  - team:platform
options:
  notify_audit: true
  include_tags: true

Burn-rate alerts are quieter and more meaningful. The on-call rotation thanks you.
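
The arithmetic behind a burn-rate alert is worth seeing once. In the sketch below, burn rate is the observed error rate divided by the error budget, where the budget is one minus the SLO target. The 99.9% target is an assumed figure for illustration, not the actual community reads SLO.

```ruby
# Burn rate: how many times faster than "exactly on budget" errors arrive.
def burn_rate(error_rate, slo_target)
  budget = 1.0 - slo_target
  error_rate / budget
end

# Inverse: the error rate that corresponds to a given burn rate.
def error_rate_at_burn(burn, slo_target)
  burn * (1.0 - slo_target)
end

slo_target = 0.999 # assumed 99.9% availability target

# A 0.5% error rate is a 5x burn against a 99.9% target.
puts burn_rate(0.005, slo_target).round(2)

# The error rate at which a 5x burn alert would start firing.
puts error_rate_at_burn(5, slo_target).round(4)
```

Under that assumed target, the burn-rate alert fires at a 0.5% error rate, an order of magnitude before a static 5% threshold would notice anything.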

Takeaways

One incident commander. Separate comms from debugging. Always.
Runbooks lead with the first command. Background goes three scrolls down.
Page volume per engineer per week is the rotation-health metric. Three pages a week is a yellow card.
Severity is defined by user impact and reversibility, not by which service is on fire.
Blameless means systemic root cause. Action items have owners and dates, or it didn’t happen.
TTD and TTM matter more than MTTR. SLO burn rate beats raw threshold.

Thanks for reading. If you’ve got thoughts, send them my way.