The Day Our Database Hit 100% CPU

A PostgreSQL post-mortem on autovacuum contention, a missing index, and an ORM-generated query that hid in plain sight. Plus why pg_stat_statements is not optional.

It was a Tuesday morning at the creator economy platform I worked at, and the writer CPU on our Aurora cluster was pinned at 100%. Datadog had been screaming for about 90 seconds before I joined the war room. The Community feed was the loudest victim, but really every read path was bleeding.

The on-call’s first guess was traffic. Reasonable guess. Wrong guess. Traffic was up maybe 8% over the same hour last week, nothing the cluster hadn’t shrugged off a hundred times before.

The three things that went wrong that morning are the three I’ve seen go wrong at every job since. Autovacuum starving for I/O. A missing index. And an ORM-generated query nobody on the team had ever read. None of these are rare. All of them compound.

Start by turning on pg_stat_statements

Before anything else: if your production PostgreSQL doesn’t have pg_stat_statements enabled, stop reading and go enable it. I mean it. This is the single non-negotiable I bring to every team I join, and I’ve never regretted it.

-- in postgresql.conf or parameter group
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = 'top'
pg_stat_statements.track_utility = off

-- then, once per database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Without it, when CPU goes red you’re guessing. With it, you can do this:

SELECT
  substring(query, 1, 120) AS q,
  calls,
  round(total_exec_time::numeric, 0) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct
FROM pg_stat_statements
WHERE query NOT ILIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 20;

That’s the query that ends the guessing. Twenty rows sorted by total time burned. The one at the top is almost always your fire. The overhead is real but measured in low single-digit percent on a healthy cluster. The cost of not having it is finding out at 10 a.m. on a Tuesday that you can’t see your own database.

The autovacuum was starving

The first thing the top of pg_stat_statements told us was that our hot query, the Community feed read, had a mean execution time roughly 14x what it had been the day before. Same query. Same plan, mostly. Just slow.

Then we looked at pg_stat_activity:

SELECT pid, now() - xact_start AS xact_age, state, wait_event_type, wait_event,
       substring(query, 1, 100) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_age DESC NULLS LAST
LIMIT 30;

Three autovacuum workers, all waiting on IO:DataFileRead, all on the same hot partition of community_posts. The table had crossed some threshold overnight and autovacuum had finally come for it, hard. The I/O subsystem was already saturated, so it queued. Dead tuples piled up. Every query against that table got slower from scanning bloat.

Default autovacuum settings are tuned for the table you had three years ago, not the one you have now.

# postgresql.conf (or parameter group), per-table tuning is even better
autovacuum_vacuum_cost_limit = 4000        # default 200, way too low at scale
autovacuum_vacuum_cost_delay = 2ms         # default 20ms, lets vacuum breathe
autovacuum_max_workers = 6                 # default 3
autovacuum_naptime = 15s                   # default 60s

For hot tables we set per-table overrides:

ALTER TABLE community_posts SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_cost_limit = 8000
);

These numbers aren’t universally right, but the defaults are almost certainly wrong for your hot tables. Start with pg_stat_user_tables.n_dead_tup and last_autovacuum.

The Aurora replica lag story

Same cluster, slightly different incident a few months earlier. Started around 10:14 a.m. Pacific. Datadog fired the AuroraReplicaLagMaximum > 60s for 2m alert. Feed read p99 went from about 120ms to over 8 seconds in roughly 4 minutes. When I got pulled into the channel, lag was sitting at 14 minutes and climbing.

The on-call’s first move was the obvious one. Bump reader instance class up two tiers, on the theory that the readers were CPU-bound. Lag didn’t budge. The readers weren’t bottlenecked at all. They were starved of WAL.

The real cause was on the writer. A scheduled ANALYZE on community_posts, one of the hottest tables on the platform, was running and holding write-side locks. WAL emission slowed to a trickle. Readers had nothing to apply.

SELECT pid, now() - query_start AS run_time, query
FROM pg_stat_activity
WHERE query ILIKE 'ANALYZE%' OR query ILIKE 'VACUUM%'
ORDER BY run_time DESC;

There it was. A maintenance cron job a teammate had scheduled the previous quarter, no awareness of peak hours. We killed the analyze. Replica lag bled off in about 6 minutes.

The lesson cost us roughly 22 minutes of degraded reads for the millions of people on the platform. Nothing lost on disk. Our runbook now opens with one sentence in bold: check pg_stat_activity on the writer before you touch reader scaling. I’m the reason that sentence is there.

That week we shipped a thin wrapper, db_safe_maintenance, that every maintenance command on hot tables now routes through. It refuses to run between 06:00 and 22:00 UTC unless you pass an override flag and a ticket ID.

The missing index hiding behind ActiveRecord

Back to the Tuesday morning fire. With autovacuum unstuck, CPU dropped from 100% to about 70%. Still too high. The top query in pg_stat_statements had moved, but the new number-one was a feed query I genuinely did not recognize.

It was an ActiveRecord-generated monster with five joins, a DISTINCT, and an ORDER BY on a column with no index. It had been fine when the joined table was small. The joined table was not small anymore.

# what someone wrote
Post.joins(:author, :community, :reactions)
    .where(communities: { archived: false })
    .where("posts.created_at > ?", 7.days.ago)
    .distinct
    .order(:popularity_score)
    .limit(50)

EXPLAIN ANALYZE showed a sequential scan on a table with tens of millions of rows, sorting in memory, then deduping. Cost was real. Plan was clear once we looked, but nobody had ever looked, because the ORM hid it.

The fix was three lines.

CREATE INDEX CONCURRENTLY idx_posts_active_recent_pop
  ON posts (community_id, created_at DESC, popularity_score)
  WHERE deleted_at IS NULL;

CONCURRENTLY is non-negotiable on a hot table. The partial predicate on deleted_at matters because most rows aren’t deleted and excluding the soft-deleted ones keeps the index small.

Mean execution time dropped from about 380ms to 4ms. CPU on the writer came back to something normal within ten minutes of the index finishing its build.

The schema migration that locked everything

The other war story is from a late-evening deploy a couple of quarters earlier. We were adding a non-null column to the users table. Hundreds of millions of rows. I’d reviewed the migration that morning and signed off. It used the “safer” add_column_with_default helper from the strong_migrations gem.

It still grabbed an ACCESS EXCLUSIVE lock on the table while backfilling the default. On Aurora, given the table size, that was about 90 seconds of blocked writes. Every endpoint touching user creation went red, login and signup included. Error rate on login was at 100% for roughly 85 seconds. PagerDuty woke up half the senior engineers in California.

First instinct was rollback. Problem: Rails has no clean rollback path for a half-applied add_column_with_default. Killing it mid-flight could have left table metadata inconsistent. So we let it finish. 87 seconds. Login came back within 15 seconds of the lock releasing, because retries on the dependent service were tight.

The postmortem fix was the obvious three-step dance on a hot table:

# step 1, add the column, nullable
class AddOnboardedFlagToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :onboarded, :boolean
  end
end

# step 2, backfill in batches in a separate job, off the deploy path
User.in_batches(of: 5_000) { |b| b.update_all(onboarded: false) }

# step 3, set NOT NULL once 100% backfilled
class SetOnboardedNotNullOnUsers < ActiveRecord::Migration[7.1]
  def change
    change_column_null :users, :onboarded, false
  end
end

We added a strong_migrations CI rule that blocks any add_column with a non-null default against hot tables. The gem is safer than raw ActiveRecord. It is not safe. The defaults are guard rails, not a substitute for reading the migration.

Takeaways

Turn on pg_stat_statements before you need it. The day you need it, it’s too late to enable.
The default autovacuum settings are wrong for your hot tables. Set per-table overrides on the busy ones.
Before scaling reader replicas, check pg_stat_activity on the writer. Long-running maintenance starves WAL.
ORMs hide bad plans. Read the generated SQL on your top ten queries by total time, at least once a quarter.
Schema changes against hot tables are a three-step dance: nullable add, batched backfill, then NOT NULL.
strong_migrations is a guard rail, not a free pass.

Thanks for reading. If you’ve got thoughts, send them my way.