How I actually use pg_stat_statements to find and fix slow queries on a large Aurora cluster, with snapshots for trend tracking and Prometheus/Grafana wiring.
It was a Tuesday morning at the creator economy platform I worked at. Aurora reader lag had just crossed 14 minutes, the Community feed was timing out for millions of customers, and the on-call was 90 seconds away from bumping reader instance class for the second time that quarter. I joined the Slack thread. The first thing I asked was not “what does Datadog say”. It was “what does pg_stat_statements look like on the writer right now”. That’s where the truth lives.
I’ll just say it. If you run PostgreSQL at any serious scale and you don’t have pg_stat_statements turned on, you’re flying blind. APMs are fine. Slow query logs are fine. Neither one tells you which query, normalized across all its parameter values, is actually eating your database. That’s the gap this extension fills.
On Aurora and vanilla RDS the extension is preinstalled, but it’s not loaded by default. You need it in shared_preload_libraries, which means a parameter group change and a restart. Do this once, before you need it. Trying to set up observability mid-incident is a thoughts-and-prayers strategy.
-- run as superuser, once per database that you actually query
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- parameter group settings (Aurora cluster parameter group)
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.max = 10000
-- pg_stat_statements.track = top
-- pg_stat_statements.track_utility = off
-- pg_stat_statements.save = on
-- track_io_timing = on
Two things I’ve learned the hard way. track = top is what you want, not all, otherwise nested function calls double-count and the top of the list turns into noise. track_io_timing = on is the one most people skip, and then they wonder why blk_read_time is always zero and they can’t tell if a query is CPU-bound or cache-miss-bound. Turn it on. Overhead is small. Without it the data is half a picture.
The workflow I run, on every PostgreSQL cluster I own, is four steps. Identify, EXPLAIN ANALYZE, fix, verify. Nothing clever. The trick is doing it every week, not once a quarter when something is on fire.
-- step 1: identify. the top queries by total time spent.
-- total_exec_time is what matters, not mean. a 50ms query that runs
-- 200 times per second is your problem, not a 4s analytics query
-- that runs at 3am.
SELECT
substring(query for 200) AS query,
calls,
round(total_exec_time::numeric, 1) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time /
sum(total_exec_time) OVER ())::numeric, 1) AS pct_of_total,
round(blk_read_time::numeric, 1) AS read_ms,
round(blk_write_time::numeric, 1) AS write_ms,
rows
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
AND query NOT LIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 25;
That pct_of_total column is the one I read first. If a single normalized query is 18% of all time spent in your database, you know exactly what to look at next. Mean latency is a trap on its own. A query that’s slow on average might be fine in aggregate. A fast query that’s hammered into the dirt is the one that quietly costs you Aurora compute.
Step two is grabbing one specific parameterization and running EXPLAIN (ANALYZE, BUFFERS). Not EXPLAIN alone. BUFFERS tells you whether you’re reading from shared buffers or from disk, and that distinction is usually the difference between adding an index and accepting that the query is doing real work. Step three is the fix, which is almost always a missing index, a bad join order from stale planner stats, or pagination done wrong. Step four is the part teams skip. You verify. After the fix has been in production ~30 minutes, you check the same query in pg_stat_statements and confirm the totals dropped. If they didn’t, you guessed wrong and you start over.
Here’s the thing about pg_stat_statements. The numbers are cumulative since the last pg_stat_statements_reset(). So the top-by-total-time view drifts. A query that did terribly six weeks ago and has been fine ever since will still sit at the top of the list. Live ranking is good for incident response. Trend tracking is what tells you whether last week’s deploy made things better or worse.
The pattern I use is dumb and it works. Snapshot the table every 10 minutes into a history table, then diff between snapshots to get deltas.
CREATE TABLE pgss_snapshots (
captured_at timestamptz NOT NULL DEFAULT now(),
queryid bigint NOT NULL,
query_text text,
calls bigint,
total_exec_time double precision,
mean_exec_time double precision,
rows bigint,
shared_blks_hit bigint,
shared_blks_read bigint,
blk_read_time double precision,
PRIMARY KEY (captured_at, queryid)
);
CREATE INDEX pgss_snapshots_qid_time
ON pgss_snapshots (queryid, captured_at DESC);
-- run from cron every 10 minutes
INSERT INTO pgss_snapshots
(queryid, query_text, calls, total_exec_time, mean_exec_time,
rows, shared_blks_hit, shared_blks_read, blk_read_time)
SELECT queryid, substring(query for 500), calls, total_exec_time,
mean_exec_time, rows, shared_blks_hit, shared_blks_read,
blk_read_time
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database());
Now you can ask the question that matters: what changed between yesterday and today. A LAG window over total_exec_time per queryid gives you the delta. If something doubled overnight, it’s almost certainly tied to a deploy or a data-shape change. I’ve caught regressions this way that the team’s APM dashboard missed entirely, because the absolute latency didn’t move. Only the call volume did.
Keep snapshots for 30 days. Beyond that you’re paying storage for data no one will read. For longer trends, roll up to daily aggregates and drop the raw rows.
Postgres metrics belong on the same Grafana board as everything else. Cleanest way is postgres_exporter with a custom queries file that pulls from pg_stat_statements and exposes the top-N as Prometheus metrics. You don’t want every query as a metric, that explodes cardinality. Top 20 by total time, refreshed every 30 seconds, is plenty.
# queries.yaml, mounted into postgres_exporter
pg_stat_statements_top:
query: |
SELECT
queryid::text AS queryid,
substring(query for 80) AS query,
calls,
total_exec_time AS total_exec_time_ms,
mean_exec_time AS mean_exec_time_ms,
rows
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY total_exec_time DESC
LIMIT 20
metrics:
- queryid:
usage: "LABEL"
- query:
usage: "LABEL"
- calls:
usage: "COUNTER"
- total_exec_time_ms:
usage: "COUNTER"
- mean_exec_time_ms:
usage: "GAUGE"
- rows:
usage: "COUNTER"
And the alert I actually run, in Prometheus rule form:
groups:
- name: postgres-slow-queries
rules:
- alert: PgSlowQueryHotspot
expr: |
topk(3,
rate(pg_stat_statements_top_total_exec_time_ms_total[5m])
) > 0.4
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "A single query is consuming >40% of one core on the writer"
runbook: "https://runbooks.internal/postgres/slow-query-hotspot"
That 0.4 means one normalized query is burning more than 40% of a single CPU core, sustained for 10 minutes. It catches regressions worth waking someone up for, and the time window keeps it from false-firing on healthy hot paths.
Two production scars worth flagging, because they shaped how I use this tool.
The first one I already opened with. Reader replica lag, Community feed degraded, writer pinned. The on-call’s first move was to bump reader instance class up two tiers. That did nothing, because the readers weren’t CPU-bound, they were starved of WAL. pg_stat_activity on the writer showed a long-running ANALYZE on one of the hottest tables, holding write-side locks and choking WAL emission. Killed the analyze, lag drained in roughly 6 minutes, about 22 minutes of degraded read latency on the Community surface. The fix I shipped that same week was a maintenance gate that refuses to run any analyze, vacuum, or reindex between 06:00 and 22:00 UTC. Runbook now leads with “before touching reader scaling, check the writer”. I’m the reason that sentence is in there.
The second one. Late-evening deploy on the Rails monolith, adding a non-null column with a default to a table with hundreds of millions of rows. We used the strong_migrations helper that’s supposed to be safe. It wasn’t. The migration grabbed an ACCESS EXCLUSIVE lock and held it for 87 seconds. Login error rate hit 100% for about 85 seconds, half the senior engineers in California woke up, and the war room channel filled with the FIRE emoji. The reason I’m telling this story in a pg_stat_statements post is what we did the next morning. Pulled snapshots from the last 30 days, found three other migrations that had spent over 10 seconds on hot tables and that nobody had noticed because they finished before anyone looked. Added a CI rule that fails any add_column with a non-null default against a table over 10M rows. The data was sitting in pg_stat_statements the whole time. We just hadn’t been reading it.
pg_stat_statements on with track = top and track_io_timing = on. Do it before you need it.total_exec_time and pct_of_total, not by mean. The cheap query running 200/s is the problem.EXPLAIN (ANALYZE, BUFFERS). Always BUFFERS. CPU-bound and cache-miss-bound look identical without it.postgres_exporter. Alert on sustained per-query CPU, not on single slow runs.Thanks for reading. If you’ve got thoughts, send them my way.