PostgreSQL Replication in Practice

Streaming, logical, and failover replication on Aurora and self-hosted PostgreSQL. What actually breaks, how to monitor lag honestly, and where Patroni earns its keep.

A Tuesday morning at the creator economy platform I worked at. The Community product ran on a large Aurora writer with three reader replicas behind a custom routing layer. At 10:14 a.m. Pacific, Datadog fired AuroraReplicaLagMaximum > 60s for 2m. Within four minutes, p99 read latency on /communities/:id/posts went from 120 ms to over 8 seconds. By the time I joined the thread, replica lag was at 14 minutes and climbing.

Most teams talk about replication wrong. They argue topology, streaming vs. logical, sync vs. async. The thing that bites you in production is almost never topology. It is something blocking WAL on the writer, a slot quietly filling disk, or a failover script no one tested in 18 months.

Opinion up front: streaming is the default for HA, logical is the right tool for migrations and selective sharing, and failover should go through Patroni or pg_auto_failover at any real scale. Manual failover is a runbook nobody runs correctly at 3 a.m.

Streaming, the default that works

Physical streaming ships WAL byte-for-byte to a standby. Same Postgres version, same binary layout, exact replica. It is what Aurora hides behind its reader endpoints, what RDS Multi-AZ uses under the hood, and what most self-hosted setups should reach for first.

The basics on a self-hosted primary are not complicated. The trap is the defaults.

# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 2GB
hot_standby = on
hot_standby_feedback = on
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby_a, standby_b)'

# archive WAL somewhere durable so you can rebuild a standby without re-cloning
archive_mode = on
archive_command = 'aws s3 cp %p s3://pg-wal-archive/%f --only-show-errors'

The synchronous_standby_names = 'ANY 1 (...)' quorum is the boring middle ground. Sync to at least one of two standbys, async beyond. If you set it to a single standby and that standby goes down, your primary will block writes. That is how Friday-night payments paths die.

On the standby, the connection lives in a single config file these days (no more recovery.conf):

# postgresql.auto.conf on the standby
primary_conninfo = 'host=pg-primary.internal port=5432 user=replicator application_name=standby_a sslmode=require'
primary_slot_name = 'standby_a_slot'
hot_standby_feedback = on

Note the slot. Without one, the primary will recycle WAL on its own schedule and you will rebuild the standby from a base backup the moment it falls a few hours behind. With a slot, the primary keeps WAL around until the standby has consumed it. Which is fine until it isn’t, see below.

Logical replication for migrations

Streaming gives you a binary copy. Logical gives you row-level changes, per table, per database, across major versions. That is what you want when migrating a bounded context to its own cluster, splitting a schema, or feeding a search index without a custom CDC stack.

-- on the source
CREATE PUBLICATION mobile_release_pub
FOR TABLE
  mobile_release.branded_app_builds,
  mobile_release.releases,
  mobile_release.submissions;

ALTER TABLE mobile_release.branded_app_builds REPLICA IDENTITY FULL;
ALTER TABLE mobile_release.releases REPLICA IDENTITY FULL;
ALTER TABLE mobile_release.submissions REPLICA IDENTITY FULL;

-- on the target (a fresh cluster)
CREATE SUBSCRIPTION mobile_release_sub
CONNECTION 'host=source.internal port=5432 dbname=app user=logical_repl password=...'
PUBLICATION mobile_release_pub
WITH (
  copy_data = true,
  create_slot = true,
  slot_name = 'mobile_release_sub_slot',
  streaming = on,
  synchronous_commit = 'off'
);

REPLICA IDENTITY FULL is heavy, every update logs the old row, but if your table lacks a stable primary key it is the price of correctness. For well-modeled tables, the default identity (primary key) is fine.

streaming = on is the one a lot of teams miss. It applies in-progress transactions on the subscriber instead of waiting for COMMIT, which on a hot table is the difference between catching up in 20 minutes and never catching up at all.

The slot that ate the disk

Replication slots are a foot-gun. The primary holds WAL until every active slot has consumed it. If a subscriber dies and nobody notices, WAL piles up, disk fills, writes stop.

A boring query worth alerting on:

SELECT
  slot_name,
  slot_type,
  active,
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  ) AS retained_wal,
  EXTRACT(EPOCH FROM (now() - COALESCE(
    pg_last_committed_xact(),
    now()
  ))) AS staleness_seconds
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

Page on retained_wal > 50GB or active = false for > 5m, whichever you hit first. Have a runbook that explicitly authorizes the on-call to drop an abandoned slot. Slots nobody owns are how clusters die.

Monitoring lag without lying to yourself

pg_stat_replication is the canonical view on the primary. The one that matters most for app behavior is replay_lag, not write_lag or flush_lag. Replay is what your readers actually see.

SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn,   replay_lsn)) AS replay_bytes_behind,
  EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication
ORDER BY replay_lag_seconds DESC NULLS LAST;

On Aurora you do not run this, but AuroraReplicaLagMaximum in CloudWatch is the equivalent. Back to the morning I opened with. The on-call’s first move was reasonable on its face: bump reader instance class up two tiers, r6g.4xlarge to r6g.16xlarge. Reasoning was “we’re CPU-bound on the readers”. Wrong root cause. Lag did not move because the readers were not bottlenecked, they were starved of WAL.

Pulled pg_stat_activity on the writer. A long-running ANALYZE on community_posts (one of the hottest tables on the platform) was holding write-side locks and starving WAL emission. A partition-stats refresh someone had scheduled in a maintenance cron that did not respect peak hours. Killed the analyze. Replica lag drained in about six minutes. Roughly 22 minutes of degraded reads for millions of users. No data loss.

That same week I shipped a small Ruby gem wired into the app, db_safe_maintenance. Every maintenance command (analyze, vacuum, reindex, partition rotation) routes through it and refuses to run between 06:00 and 22:00 UTC. The replica-lag runbook now leads with one bold sentence: “Before touching reader scaling, check pg_stat_activity on the writer.” I’m the reason that sentence is in there.

Failover with Patroni or pg_auto_failover

Do not write your own failover. I have watched teams try. They write a shell script, test it once in staging, and never run it again until the primary dies on a Saturday and the script has rotted past the point of trust.

Patroni gives you a real leader election (etcd, Consul, or ZooKeeper underneath), automatic standby promotion, and a stable virtual endpoint via HAProxy or a Kubernetes service. The DCS config is what actually matters:

# patroni.yml on each node
scope: pg-app-primary
namespace: /service/
name: pg-node-a

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.20.1.11:8008

etcd:
  hosts: etcd-a:2379,etcd-b:2379,etcd-c:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: on

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.20.1.11:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: ${PG_REPLICATOR_PASSWORD}

tags:
  nofailover: false
  noloadbalance: false

Two knobs earn their keep. maximum_lag_on_failover disqualifies any standby further behind than that byte count from promotion. You do not want to fail over to a 10-minute-stale replica. synchronous_mode: true with synchronous_mode_strict: false lets the primary keep accepting writes when no sync standby is available, the right tradeoff for most app workloads. Strict is for systems where data loss is worse than downtime, know which one you are.

pg_auto_failover is the lighter option if you do not want a separate DCS. Easier for two or three node clusters. Past four nodes, Patroni wins.

Takeaways

Physical streaming replication is the default. Logical is for migrations and selective sharing. Do not mix the two for HA.
Replication slots will save your standby and kill your primary’s disk. Alert on retained WAL and inactive slots, both.
replay_lag is the lag that matters. Write and flush lag are diagnostic, not user-facing.
When replica lag spikes, check the writer first. A blocked or starved WAL is almost always the cause, not the reader hardware.
Do not write your own failover. Patroni or pg_auto_failover. Test failover quarterly, not annually.
synchronous_standby_names quorum (ANY 1 (a, b)) beats a single sync target. A single sync target is a single point of write failure.

Thanks for reading. If you’ve got thoughts, send them my way.