PostgreSQL Logical Replication for Zero-Downtime Upgrades

How I planned and executed a major-version PostgreSQL upgrade on a multi-terabyte Aurora cluster using logical replication, with the sequence sync, DDL freeze, cutover, and rollback details that actually matter.

It was a Thursday afternoon, two days before our scheduled cutover, when I realized the “zero-downtime” PostgreSQL upgrade plan I’d reviewed earlier that week had a sequence-sync hole big enough to charge a customer twice. The cluster was a multi-terabyte Aurora writer behind the creator-economy platform I worked at. Two reader replicas, a Sidekiq fleet hammering inserts on creator_subscriptions, billing webhooks from Apple and Google. Engine going from 15 to 16. The plan was logical replication, and the plan was sound. The detail I’d missed was that logical replication doesn’t replicate sequence values, and our creator_subscriptions.id sequence was the spine of half our idempotency logic.

Every “we did zero-downtime upgrades” post I’ve read skips the parts that almost broke us. This one is about the gaps. Sequences, DDL, cutover discipline, the rollback you rehearse and pray you never use.

Why pg_upgrade was off the table

The boring answer first. At our row count and our IOPS profile, in-place pg_upgrade meant a real outage window. Not “a few seconds while we flip a connection string”, but minutes of writes refusing on the primary while the cluster restarted under a new binary. Aurora makes some of this easier than vanilla Postgres, but not enough to call it free. We had a Sidekiq fleet that, on a normal Tuesday, was already producing more queued retry pressure than I felt good about.

So logical replication. Set up a target on 16, replicate from the 15 source in the background, validate, cut over. The headline is simple. The middle parts are where it lives or dies.

The publication and subscription setup

First thing on the source: parameter group changes. wal_level to logical, max_replication_slots and max_wal_senders bumped to fit the slot count we needed. On Aurora these are cluster parameter group settings, which means a writer reboot to apply. We did it two weeks ahead so the reboot wouldn’t surprise anyone on the day.

Then a dedicated replication role and a publication. I prefer an explicit table list over FOR ALL TABLES because it forces us to think about which tables we’re carrying forward.

-- Source (older major version)
CREATE ROLE upgrade_repl WITH REPLICATION LOGIN PASSWORD :'pw';
GRANT USAGE ON SCHEMA public TO upgrade_repl;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO upgrade_repl;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO upgrade_repl;

CREATE PUBLICATION upgrade_pub FOR TABLE
  users,
  creators,
  creator_subscriptions,
  community_posts,
  community_memberships,
  -- ... the explicit list. About 140 tables for us.
  billing_events
  WITH (publish = 'insert, update, delete, truncate');

Then on the target (16), the subscription. The trick not flagged loudly enough in the docs: copy_data = true locks and COPYs every row in the initial sync, which on a hot table is exactly what you don’t want. We did an out-of-band initial COPY via aws_s3.export_query + aws_s3.table_import_from_s3, then created the subscription with copy_data = false and a captured LSN.

-- Target (newer major version)
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=src.cluster.aws port=5432 dbname=app user=upgrade_repl password=...'
  PUBLICATION upgrade_pub
  WITH (
    copy_data = false,
    create_slot = true,
    slot_name = 'upgrade_slot_v1',
    enabled = true,
    synchronous_commit = 'off'
  );

synchronous_commit = off because we don’t need replica-grade durability while catching up. Burn it off at cutover.

Sequences are where it falls apart

OK here’s the part nobody flags loudly enough. Logical replication carries row changes. It does not carry sequence advancement. If creator_subscriptions_id_seq is at 814,332,109 on the source and you create a fresh sequence on the target, the target starts at 1. The day after cutover, your first insert collides with an id that already exists. Or worse, idempotency keys derived from id produce duplicates against external systems that already saw the old keys.

The fix is a script that runs as one of the last cutover steps. It walks pg_sequences on the source, reads last_value, and sets the target to last_value + buffer. Buffer matters because there’s a window between when you read the source and when you flip writes.

-- Run on the target during cutover, after writes are paused on the source.
DO $$
DECLARE
  r RECORD;
  src_last_value BIGINT;
  buffer BIGINT := 10000;
BEGIN
  FOR r IN
    SELECT schemaname, sequencename
    FROM pg_sequences
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  LOOP
    EXECUTE format(
      $f$ SELECT last_value FROM dblink(
            'host=src.cluster.aws dbname=app user=upgrade_repl',
            'SELECT last_value FROM %I.%I'
          ) AS t(last_value bigint) $f$,
      r.schemaname, r.sequencename
    ) INTO src_last_value;

    IF src_last_value IS NOT NULL THEN
      EXECUTE format(
        'SELECT setval(%L, %s, true)',
        r.schemaname || '.' || r.sequencename,
        src_last_value + buffer
      );
    END IF;
  END LOOP;
END $$;

Near-miss on the rehearsal run. An engineer on my squad had built the dry-run cluster from a manual pg_dump --schema-only, and the dump reset a couple of sequences to 1. Caught it in validation, not in production. That rehearsal is what made me trust the script.

The DDL freeze nobody likes

Logical replication doesn’t replicate DDL either. Any ALTER TABLE on the source during the replication window drifts the target’s schema, and your next write against that column errors in a way that’s hard to diagnose at 2 a.m.

So a hard DDL freeze for the five days bracketing cutover. Engineering loved that. I added a CI guard so we couldn’t accidentally cheat on it.

# .github/workflows/db-freeze.yml
name: db-freeze
on:
  pull_request:
    branches: [main]

jobs:
  block-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Check DDL freeze window
        run: |
          FREEZE_START="${DDL_FREEZE_START}"
          FREEZE_END="${DDL_FREEZE_END}"
          TODAY=$(date -u +%Y-%m-%d)
          if [[ "$TODAY" < "$FREEZE_START" || "$TODAY" > "$FREEZE_END" ]]; then
            echo "Outside freeze window, skipping."
            exit 0
          fi
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'db/migrate/' || true)
          if [[ -n "$CHANGED" ]]; then
            echo "DDL freeze active. New migration files are blocked:"
            echo "$CHANGED"
            exit 1
          fi

Yes, it’s blunt. Yes, it caught a migration on day three of the freeze that a teammate had ack’d on autopilot. That’s the point.

The cutover, slowly

Cutover ran on a Saturday, late-evening Pacific, low-traffic window for our customer base. The shape:

Pause Sidekiq enqueueing for billing-adjacent queues.
Set the source to read-only at the application layer (a feature flag we’d added two weeks earlier, just for this).
Wait for pg_replication_slots.confirmed_flush_lsn on the source to match pg_current_wal_lsn.
Run the sequence-sync script on the target.
Run validation: row counts on the top 30 tables, sampled checksums on 10 of them.
Flip the writer connection string in our connection-pool config.
Unpause Sidekiq.

The validation step was a shell script I’d written and rehearsed. It compared a sampled checksum, not a full one, because a full one on hundreds of millions of rows would take longer than the cutover.

#!/usr/bin/env bash
set -euo pipefail

SRC_URL="${SRC_DATABASE_URL}"
DST_URL="${DST_DATABASE_URL}"
TABLES=(users creators creator_subscriptions community_posts billing_events)

for t in "${TABLES[@]}"; do
  src_count=$(psql "$SRC_URL" -tAc "SELECT count(*) FROM $t")
  dst_count=$(psql "$DST_URL" -tAc "SELECT count(*) FROM $t")

  if [[ "$src_count" != "$dst_count" ]]; then
    echo "ROW COUNT MISMATCH on $t: src=$src_count dst=$dst_count"
    exit 1
  fi

  # Sampled MD5 of the first 100k rows ordered by id. Not perfect,
  # but enough to catch corruption that row-count misses.
  src_hash=$(psql "$SRC_URL" -tAc \
    "SELECT md5(string_agg(t::text, '')) FROM (SELECT * FROM $t ORDER BY id LIMIT 100000) t")
  dst_hash=$(psql "$DST_URL" -tAc \
    "SELECT md5(string_agg(t::text, '')) FROM (SELECT * FROM $t ORDER BY id LIMIT 100000) t")

  if [[ "$src_hash" != "$dst_hash" ]]; then
    echo "CHECKSUM MISMATCH on $t"
    exit 1
  fi

  echo "OK $t count=$src_count"
done

Total write-pause window from step 1 to step 7 was 4 minutes 18 seconds. End users didn’t see a hard outage. A small slice saw a delayed billing webhook retry by a minute or two.

The patience rule, learned the hard way

A thing I want to plant here from a different scar. A colleague and I once shipped a Rails migration that added a non-null column with a default to our users table. Hundreds of millions of rows. The migration grabbed an ACCESS EXCLUSIVE lock and held it for 87 seconds. Login error rate hit 100% for about 85 of those seconds. First instinct in the war room was to kill the migration. We didn’t, because aborting mid-flight would have left the table metadata worse than letting it finish. Login recovered within 15 seconds of the lock release.

That experience changed how I run cutovers. Once you’ve started the flip, you do not panic-revert in the first 90 seconds. Watch the metrics, let the system settle, trust the rehearsal. The rollback exists, but the moment of maximum temptation to use it is also the moment when using it is most expensive.

The rollback we rehearsed

We set up a reverse subscription before cutover. Target as publisher, source as subscriber. Turned it on the moment we flipped writes, so any post-cutover write on 16 was shipped back to 15. That gave us a 6-hour window where we could flip back by reversing the connection string. After 6 hours the divergence in sequences made rollback too painful to reconcile cleanly.

We didn’t use it. Knowing it was there changed the temperature in the war room.

Takeaways

Use logical replication for major-version upgrades on a multi-terabyte writer. pg_upgrade is not zero-downtime at this scale.
Sequence advancement is not replicated. Script the sync, run it at cutover, and rehearse it.
DDL is not replicated either. Freeze schema for the duration, with a CI guard so the freeze is real.
Validate with row counts and sampled checksums. Full checksums on big tables are too slow for a cutover.
Set up the reverse subscription before you cut over. Time-box the rollback window.
Once you’ve started the cutover, do not panic-revert in the first 90 seconds. Trust the rehearsal.

Thanks for reading. If you’ve got thoughts, send them my way.