A new search feature shipped clean in staging, then sequential-scanned a half-billion-row table on the busiest day of the year. The lessons that became hard CI gates.
It was the Friday after Thanksgiving at the creator economy platform I spent the last few years at. Around 11:40 a.m. Pacific, the war room lit up. A search feature we’d shipped two days earlier was sequential-scanning a roughly half-billion-row table on the writer. p99 on the relevant API route had gone from ~180 ms to over 12 seconds, Aurora CPU pinned at 94%. I was floating across a couple of squads that quarter, including the one that owned the feature, so I got pulled in within minutes.
Short version: a LIKE 'prefix%' query without the right index, against a table the size we don’t sequential-scan, on the highest-traffic day of the year.
The feature was a search box on a creator-facing dashboard. The query the new endpoint generated boiled down to this:
SELECT id, title, status, created_at
FROM community_posts
WHERE creator_id = $1
AND status = 'published'
AND lower(title) LIKE lower($2) || '%'
ORDER BY created_at DESC
LIMIT 50;
In staging, that ran in 38 ms. CI hit a seeded fixture of a few hundred rows and was happy. In code review, two of us waved it through, because community_posts had an index on (creator_id, status, created_at DESC) and we figured the planner would narrow it down to one creator’s posts and filter the LIKE in memory. For most creators, that would have been fine.
For the long tail of large creators, that index didn’t narrow enough. Some had hundreds of thousands of published posts. For a handful of huge ones the planner picked a different plan and the lower(title) LIKE 'something%' ran against a much wider slice. Some plans went sequential. On a half-billion-row table. On Black Friday.
First instinct: add a Redis cache in front of the endpoint. Twenty minutes in, someone had a PR up. Would have helped long-term, but not right then, because any cold path still hit Postgres. I pushed back on shipping the cache mid-incident and asked someone to pull EXPLAIN ANALYZE against the writer for one of the affected creator IDs.
The plan came back with Seq Scan on community_posts and an estimated row count off by three orders of magnitude. The planner was doing its job. We hadn’t given it the right tool.
We didn’t fix it with a new index. Not in the moment. You don’t add an index concurrently to a half-billion-row table when the writer is already at 94% CPU. That migration would have made things worse before it made them better.
What we did, in order:
(creator_id, status, created_at DESC) index and used an in-Ruby filter for the title prefix, capped at 500 candidate rows.pg_stat_activity open and killed two long-running queries that had piled up behind the slow ones.Writer CPU dropped from 94% to roughly 40% within four minutes. p99 recovered to under 400 ms within ten.
The real fix went in the following Wednesday, after the holiday weekend. A trigram GIN index on lower(title), scoped with a partial predicate on status = 'published':
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_community_posts_title_trgm
ON community_posts
USING gin (lower(title) gin_trgm_ops)
WHERE status = 'published';
That index took just over six hours to build concurrently. We watched it with pg_stat_progress_create_index and kept reads on the bounded prefix path until it was done. Once it was in, the original query came back to single-digit milliseconds on the same creator IDs that had pinned CPU four days earlier.
The single change I cared most about wasn’t the index. It was the gate that meant we couldn’t do this to ourselves again. We added a CI step that runs EXPLAIN ANALYZE against every new query touching a flagged hot table, on a staging DB loaded with production-shaped data. Sequential scan on a hot table fails the build.
The script that drives it lives in bin/check_query_plans.rb and looks roughly like this:
require "active_record"
require "json"
HOT_TABLES = {
"community_posts" => 100_000_000,
"users" => 10_000_000,
"subscriptions" => 10_000_000,
}.freeze
def explain(sql, params)
conn = ActiveRecord::Base.connection
result = conn.exec_query("EXPLAIN (ANALYZE, FORMAT JSON, BUFFERS) #{sql}", "EXPLAIN", params)
JSON.parse(result.rows.first.first).first.fetch("Plan")
end
def scan_for_bad_plans(plan, offenders)
if plan["Node Type"] == "Seq Scan" && HOT_TABLES.key?(plan["Relation Name"])
offenders << "Seq Scan on #{plan['Relation Name']} (rows: #{plan['Actual Rows']})"
end
(plan["Plans"] || []).each { |child| scan_for_bad_plans(child, offenders) }
offenders
end
queries = YAML.load_file("config/explain_targets.yml")
failures = queries.flat_map do |q|
scan_for_bad_plans(explain(q["sql"], q["params"]), [])
end
if failures.any?
warn "Bad query plans on hot tables:\n #{failures.join("\n ")}"
exit 1
end
It’s not magic. It catches the dumb case, which is the case we keep shipping. Engineers can override with a PR comment (# explain-allow: seq_scan_on_subscriptions), but the override gets reviewed by someone on the data platform team. We see maybe one override a month. Most of the time the PR author just adds the right index.
The staging DB the step runs against is the other half of the lesson. Tiny fixtures lie. The job boots a Postgres restored from a sanitized snapshot of the production writer, then applies the branch’s migrations before running explain checks. Slow, a few minutes per PR, and it’s caught a handful of would-have-been-incidents since.
The Black Friday one isn’t the only time a single Aurora hot table has bit me. About a year before, on the same platform, we shipped a schema change to add a non-null column to users late one evening Pacific. The migration acquired an ACCESS EXCLUSIVE lock and ran the default backfill in one shot. About 87 seconds of blocked writes, which translated to roughly 85 seconds of total login outage at peak hours.
First instinct in the war room: roll back. But add_column_with_default doesn’t have a clean mid-flight rollback at that row count, and killing it risked an inconsistent metadata state. The real fix was to let it finish (those 87 seconds felt like an hour), then split the migration the next morning into the three correct steps: add the column nullable, backfill in batches in a separate job, flip the not-null once everything was caught up. The CI rule that came out of it blocks any add_column with a non-null default on tables we’ve flagged as large. Same shape of lesson.
Two incidents, one root cause: we’d been treating “looks safe in staging” as the same thing as “safe on a half-billion-row table in production.” It isn’t. The only thing that makes them the same is having production-shaped data in the loop.
Last piece, which I should have led with: we don’t ship anything non-trivial to the writer in the week before a peak event. No new query paths on hot tables, no schema changes, no new indexes (even concurrent ones, because the IO is real). Bug fixes only.
A boring policy. Engineers grumble about it. The reason it exists is the Friday above. Nobody pages out of bed for a feature that didn’t ship this week.
EXPLAIN ANALYZE against production-shaped data before the change merges, not after the incident.LIKE 'prefix%' on a hot table needs pg_trgm + GIN, or a dedicated search column, or it’s a sequential scan waiting for the wrong day.Thanks for reading. If you’ve got thoughts, send them my way.