Migrating 2 Billion Rows Without Downtime

How I ran a time-partitioned table migration on Aurora at 50K writes per second, with dual-write, throttled backfill, sequence sync, and a rollback plan that actually held.

It was a Thursday at the creator economy platform I worked at, and the events table on the Community product had crossed two billion rows. Around 50K writes per second on a normal afternoon, more during a launch. The table was a flat heap, single primary key, three btree indexes, sitting on a multi-terabyte Aurora writer. Inserts were fine. Vacuums had started taking longer than the window between deploys. Reads were getting slower in a way that no index was going to save.

The decision was to move it to a time-partitioned layout, monthly partitions, same column shape, new table, new sequence. The constraint was the obvious one. No maintenance window. No paused writes. Not even a slow one.

The shape of the migration

I’m going to skip the part where you debate pg_partman versus hand-rolled DDL. We went hand-rolled because the rollback story was clearer and the operator already knew how to read it. The high-level dance was the boring one. Create the new partitioned table next to the old one. Dual-write at the app layer. Backfill the historical rows in batches, throttled. Sync the sequence. Flip reads. Stop writing the old table. Drop it later.

CREATE TABLE events_partitioned (
  id          BIGINT NOT NULL,
  account_id  BIGINT NOT NULL,
  kind        TEXT   NOT NULL,
  payload     JSONB  NOT NULL,
  occurred_at TIMESTAMPTZ NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_current PARTITION OF events_partitioned
  FOR VALUES FROM (date_trunc('month', now()))
  TO (date_trunc('month', now()) + interval '1 month');

CREATE INDEX CONCURRENTLY events_current_account_idx
  ON events_current (account_id, occurred_at DESC);

Notice the primary key includes occurred_at. PostgreSQL forces this for range-partitioned tables, the partition key has to be in the PK or in a unique index. That was the first thing that broke an assumption in our ORM layer, because the old PK was just id.

Dual-write at the app, not via triggers

Triggers were tempting. They are also a great way to find out at 3 a.m. that your DBA’s WAL emission has stalled because every insert is now firing a stored procedure under a BEFORE INSERT trigger. We did the writes in the app, behind a feature flag, with a write-wrapper that fanned out to both tables inside a single transaction.

class EventWriter
  def self.create!(attrs)
    ActiveRecord::Base.transaction do
      legacy = LegacyEvent.create!(attrs)

      if Flipper.enabled?(:dual_write_events)
        PartitionedEvent.create!(attrs.merge(id: legacy.id))
      end

      legacy
    end
  rescue ActiveRecord::RecordNotUnique => e
    raise unless e.message.include?("events_partitioned")
    Rails.logger.warn(msg: "partitioned dup, skipping", id: attrs[:id])
    LegacyEvent.find(attrs[:id])
  end
end

The flag let us turn off the partitioned write inside a few seconds if Aurora started complaining. The RecordNotUnique rescue is there because the backfill and the dual-write race on rows near the cutover. We chose to accept the dup error rather than serialize the two paths, the dup is cheap, the serialization is not.

Backfill, throttled by replica lag

The backfill was a maintenance task copying rows in batches of 5,000, ordered by id, with a sleep that adapted to replica lag. The throttle is the part most teams get wrong. A fixed sleep is fine until traffic shifts. A lag-aware sleep is fine until your lag metric goes stale and you cook the writer.

class BackfillEvents < MaintenanceTasks::Task
  BATCH_SIZE = 5_000
  LAG_CEILING_SECONDS = 10

  def collection
    LegacyEvent.where("id > ?", cursor).order(:id).limit(BATCH_SIZE)
  end

  def process(batch)
    raise Paused if Flipper.enabled?(:backfill_events_paused)
    wait_until_replicas_caught_up

    rows = batch.pluck(:id, :account_id, :kind, :payload, :occurred_at, :created_at)
    PartitionedEvent.insert_all!(
      rows.map { |r| Hash[%i[id account_id kind payload occurred_at created_at].zip(r)] },
      returning: false
    )
  end

  def wait_until_replicas_caught_up
    loop do
      lag = AuroraMetrics.replica_lag_seconds
      break if lag < LAG_CEILING_SECONDS
      sleep [lag, 30].min
    end
  end
end

The job ran for about six days. Most of that time it was pacing itself, not pushing. We deliberately let it be slow.

A war story on what I assumed was safe

A while back, on the same Aurora writer, I shipped what looked like a tiny schema change to a hot table. Late evening deploy, past midnight UTC, around 6 p.m. Pacific. Not the same table, but the same writer. The migration added a non-null column with a default to a table with hundreds of millions of rows, behind the “safer than raw ActiveRecord” helper. I’d reviewed it that morning and acked it as safe.

It took an ACCESS EXCLUSIVE lock while it backfilled the default. Login error rate hit 100 percent for about 85 seconds. PagerDuty woke half the senior engineers in California. The first instinct in the war room was to roll back the migration, but Rails does not roll back a partially-applied default cleanly, and killing it mid-flight risked leaving the table metadata in a bad state. We let it finish. 87 seconds in, locks released, login recovered.

The thing that stuck from that night is the rule we baked into CI: no add_column with a non-null default on tables over 10M rows, ever. Every schema change against a hot table is a three-step dance. Add nullable. Backfill in batches outside the migration. Set not-null once the backfill is verified. The partition migration above is the same dance, just bigger.

A war story on replica lag

About a month before the partition cutover, a Tuesday morning, Datadog’s AuroraReplicaLagMaximum > 60s for 2m alert fired around 10:14 Pacific. Community feed p99 read latency went from about 120 ms to over 8 seconds inside four minutes. Replica lag was at 14 minutes and climbing.

The on-call’s first move was reasonable, bump reader instance class up two tiers. Lag did not move. The readers were not CPU bound, they were starved of WAL. Pulled pg_stat_activity on the writer and there it was, an ANALYZE on one of the hottest tables, scheduled inside a maintenance cron that did not respect peak hours. Killed it. Lag drained in about six minutes.

That same week we shipped db_safe_maintenance.rb. Every maintenance command, including the partition backfill above, now routes through it and refuses to run between 06:00 and 22:00 UTC. The lag-aware sleep in the backfill task is the second line of defense. The runbook for the Aurora layer now leads with a literal sentence: before touching reader scaling, check pg_stat_activity on the writer. I’m the reason that sentence is in there.

Sequence sync and the cutover

The last gotcha. The legacy table had its own bigint sequence on id. The partitioned table needed to reuse it. After the backfill caught up, we attached the same sequence to the new table and double-checked nextval would not collide.

ALTER TABLE events_partitioned
  ALTER COLUMN id SET DEFAULT nextval('legacy_events_id_seq');

SELECT setval(
  'legacy_events_id_seq',
  GREATEST(
    (SELECT COALESCE(MAX(id), 0) FROM legacy_events),
    (SELECT COALESCE(MAX(id), 0) FROM events_partitioned)
  )
);

Cutover itself was three flag flips over a quiet 90 minutes on a Saturday. Flip reads to the partitioned table. Watch p99 on both tables for 20 minutes. Stop writing the legacy table. We kept the legacy table around for two weeks, cold, in case we needed to read from it. We didn’t. Then we dropped it.

Takeaways

Dual-write in the app, not in triggers. Triggers hide cost.
Backfill throttles must read replica lag, not a fixed sleep.
Every schema change on a hot table is a three-step dance. The “safer” helper is not a substitute for reading the migration.
Keep the rollback path alive longer than feels necessary. Cold legacy tables are cheap insurance.
Sequence sync is the boring step that breaks loudest if you skip it.

Thanks for reading. If you’ve got thoughts, send them my way.