How I planned and executed a major-version PostgreSQL upgrade on a multi-terabyte Aurora cluster using logical replication, with the sequence sync, DDL freeze, cutover, and rollback details that actually matter.
It was a Thursday afternoon, two days before our scheduled cutover, when I realized the “zero-downtime” PostgreSQL upgrade plan I’d reviewed earlier that week had a sequence-sync hole big enough to charge a customer twice. The cluster was a multi-terabyte Aurora writer behind the creator-economy platform I worked at. Two reader replicas, a Sidekiq fleet hammering inserts on creator_subscriptions, billing webhooks from Apple and Google. Engine going from 15 to 16. The plan was logical replication, and the plan was sound. The detail I’d missed was that logical replication doesn’t replicate sequence values, and our creator_subscriptions.id sequence was the spine of half our idempotency logic.
Every “we did zero-downtime upgrades” post I’ve read skips the parts that almost broke us. This one is about the gaps. Sequences, DDL, cutover discipline, the rollback you rehearse and pray you never use.
The boring answer first. At our row count and our IOPS profile, in-place pg_upgrade meant a real outage window. Not “a few seconds while we flip a connection string”, but minutes of writes refusing on the primary while the cluster restarted under a new binary. Aurora makes some of this easier than vanilla Postgres, but not enough to call it free. We had a Sidekiq fleet that, on a normal Tuesday, was already producing more queued retry pressure than I felt good about.
So logical replication. Set up a target on 16, replicate from the 15 source in the background, validate, cut over. The headline is simple. The middle parts are where it lives or dies.
First thing on the source: parameter group changes. wal_level to logical, max_replication_slots and max_wal_senders bumped to fit the slot count we needed. On Aurora these are cluster parameter group settings, which means a writer reboot to apply. We did it two weeks ahead so the reboot wouldn’t surprise anyone on the day.
Then a dedicated replication role and a publication. I prefer an explicit table list over FOR ALL TABLES because it forces us to think about which tables we’re carrying forward.
-- Source (older major version)
CREATE ROLE upgrade_repl WITH REPLICATION LOGIN PASSWORD :'pw';
GRANT USAGE ON SCHEMA public TO upgrade_repl;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO upgrade_repl;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO upgrade_repl;
CREATE PUBLICATION upgrade_pub FOR TABLE
users,
creators,
creator_subscriptions,
community_posts,
community_memberships,
-- ... the explicit list. About 140 tables for us.
billing_events
WITH (publish = 'insert, update, delete, truncate');
Then on the target (16), the subscription. The trick not flagged loudly enough in the docs: copy_data = true locks and COPYs every row in the initial sync, which on a hot table is exactly what you don’t want. We did an out-of-band initial COPY via aws_s3.export_query + aws_s3.table_import_from_s3, then created the subscription with copy_data = false and a captured LSN.
-- Target (newer major version)
CREATE SUBSCRIPTION upgrade_sub
CONNECTION 'host=src.cluster.aws port=5432 dbname=app user=upgrade_repl password=...'
PUBLICATION upgrade_pub
WITH (
copy_data = false,
create_slot = true,
slot_name = 'upgrade_slot_v1',
enabled = true,
synchronous_commit = 'off'
);
synchronous_commit = off because we don’t need replica-grade durability while catching up. Burn it off at cutover.
OK here’s the part nobody flags loudly enough. Logical replication carries row changes. It does not carry sequence advancement. If creator_subscriptions_id_seq is at 814,332,109 on the source and you create a fresh sequence on the target, the target starts at 1. The day after cutover, your first insert collides with an id that already exists. Or worse, idempotency keys derived from id produce duplicates against external systems that already saw the old keys.
The fix is a script that runs as one of the last cutover steps. It walks pg_sequences on the source, reads last_value, and sets the target to last_value + buffer. Buffer matters because there’s a window between when you read the source and when you flip writes.
-- Run on the target during cutover, after writes are paused on the source.
DO $$
DECLARE
r RECORD;
src_last_value BIGINT;
buffer BIGINT := 10000;
BEGIN
FOR r IN
SELECT schemaname, sequencename
FROM pg_sequences
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
LOOP
EXECUTE format(
$f$ SELECT last_value FROM dblink(
'host=src.cluster.aws dbname=app user=upgrade_repl',
'SELECT last_value FROM %I.%I'
) AS t(last_value bigint) $f$,
r.schemaname, r.sequencename
) INTO src_last_value;
IF src_last_value IS NOT NULL THEN
EXECUTE format(
'SELECT setval(%L, %s, true)',
r.schemaname || '.' || r.sequencename,
src_last_value + buffer
);
END IF;
END LOOP;
END $$;
Near-miss on the rehearsal run. An engineer on my squad had built the dry-run cluster from a manual pg_dump --schema-only, and the dump reset a couple of sequences to 1. Caught it in validation, not in production. That rehearsal is what made me trust the script.
Logical replication doesn’t replicate DDL either. Any ALTER TABLE on the source during the replication window drifts the target’s schema, and your next write against that column errors in a way that’s hard to diagnose at 2 a.m.
So a hard DDL freeze for the five days bracketing cutover. Engineering loved that. I added a CI guard so we couldn’t accidentally cheat on it.
# .github/workflows/db-freeze.yml
name: db-freeze
on:
pull_request:
branches: [main]
jobs:
block-migrations:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Check DDL freeze window
run: |
FREEZE_START="${DDL_FREEZE_START}"
FREEZE_END="${DDL_FREEZE_END}"
TODAY=$(date -u +%Y-%m-%d)
if [[ "$TODAY" < "$FREEZE_START" || "$TODAY" > "$FREEZE_END" ]]; then
echo "Outside freeze window, skipping."
exit 0
fi
CHANGED=$(git diff --name-only origin/main...HEAD -- 'db/migrate/' || true)
if [[ -n "$CHANGED" ]]; then
echo "DDL freeze active. New migration files are blocked:"
echo "$CHANGED"
exit 1
fi
Yes, it’s blunt. Yes, it caught a migration on day three of the freeze that a teammate had ack’d on autopilot. That’s the point.
Cutover ran on a Saturday, late-evening Pacific, low-traffic window for our customer base. The shape:
pg_replication_slots.confirmed_flush_lsn on the source to match pg_current_wal_lsn.The validation step was a shell script I’d written and rehearsed. It compared a sampled checksum, not a full one, because a full one on hundreds of millions of rows would take longer than the cutover.
#!/usr/bin/env bash
set -euo pipefail
SRC_URL="${SRC_DATABASE_URL}"
DST_URL="${DST_DATABASE_URL}"
TABLES=(users creators creator_subscriptions community_posts billing_events)
for t in "${TABLES[@]}"; do
src_count=$(psql "$SRC_URL" -tAc "SELECT count(*) FROM $t")
dst_count=$(psql "$DST_URL" -tAc "SELECT count(*) FROM $t")
if [[ "$src_count" != "$dst_count" ]]; then
echo "ROW COUNT MISMATCH on $t: src=$src_count dst=$dst_count"
exit 1
fi
# Sampled MD5 of the first 100k rows ordered by id. Not perfect,
# but enough to catch corruption that row-count misses.
src_hash=$(psql "$SRC_URL" -tAc \
"SELECT md5(string_agg(t::text, '')) FROM (SELECT * FROM $t ORDER BY id LIMIT 100000) t")
dst_hash=$(psql "$DST_URL" -tAc \
"SELECT md5(string_agg(t::text, '')) FROM (SELECT * FROM $t ORDER BY id LIMIT 100000) t")
if [[ "$src_hash" != "$dst_hash" ]]; then
echo "CHECKSUM MISMATCH on $t"
exit 1
fi
echo "OK $t count=$src_count"
done
Total write-pause window from step 1 to step 7 was 4 minutes 18 seconds. End users didn’t see a hard outage. A small slice saw a delayed billing webhook retry by a minute or two.
A thing I want to plant here from a different scar. A colleague and I once shipped a Rails migration that added a non-null column with a default to our users table. Hundreds of millions of rows. The migration grabbed an ACCESS EXCLUSIVE lock and held it for 87 seconds. Login error rate hit 100% for about 85 of those seconds. First instinct in the war room was to kill the migration. We didn’t, because aborting mid-flight would have left the table metadata worse than letting it finish. Login recovered within 15 seconds of the lock release.
That experience changed how I run cutovers. Once you’ve started the flip, you do not panic-revert in the first 90 seconds. Watch the metrics, let the system settle, trust the rehearsal. The rollback exists, but the moment of maximum temptation to use it is also the moment when using it is most expensive.
We set up a reverse subscription before cutover. Target as publisher, source as subscriber. Turned it on the moment we flipped writes, so any post-cutover write on 16 was shipped back to 15. That gave us a 6-hour window where we could flip back by reversing the connection string. After 6 hours the divergence in sequences made rollback too painful to reconcile cleanly.
We didn’t use it. Knowing it was there changed the temperature in the war room.
pg_upgrade is not zero-downtime at this scale.Thanks for reading. If you’ve got thoughts, send them my way.