Zero-Downtime Database Migrations

How I run expand-contract migrations on Aurora and MySQL at scale, with gh-ost, pg-osc, CREATE INDEX CONCURRENTLY, batched backfills, and CI validation against production-like data.

It was a late evening deploy at the creator economy platform I worked at. Past midnight UTC, around 6 p.m. Pacific, which is the worst possible time for a US-heavy product. We were adding a non-null column to users, a table with hundreds of millions of rows on a multi-terabyte Aurora writer. The migration was written using the “safer than raw ActiveRecord” helper. I’d reviewed it that morning and acked it as safe. 87 seconds later, login error rate was at 100 percent and PagerDuty was waking half the senior engineers in California.

I’m the reason that runbook now has a line that says “no add_column with a non-null default on tables over 10 million rows. Ever.”

My position on this is pretty plain. There is no such thing as a small schema change against a hot table at scale. Every change is a three-step dance: expand, backfill, contract. If you try to do it in one shot, you either get lucky for a long time and then you don’t, or you get unlucky on day one. The tools that promise to abstract this for you, strong_migrations, Rails add_column_with_default, the pt-osc and gh-ost schema changers, are great. They are not a substitute for reading the migration file.

Expand and contract, in order

The pattern is old and boring and it works. You never modify a column in place. You add a new column or table that is backward compatible, you teach the app to write to both (or to the new one) while still reading from the old, you backfill the historical data in the background, you flip reads to the new column, you stop writing the old column, and only then do you drop it. Every step is its own deploy. Every step is reversible. No step is allowed to block writes.

Here’s the first step for adding a locale column to users. Notice what’s missing.

class AddLocaleToUsers < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_column :users, :locale, :string, null: true, default: nil
    add_index :users, :locale, algorithm: :concurrently, if_not_exists: true
  end
end

No default value. No null: false. No backfill. algorithm: :concurrently so PostgreSQL doesn’t take an ACCESS EXCLUSIVE lock to build the index. disable_ddl_transaction! because CREATE INDEX CONCURRENTLY cannot run inside a transaction. That’s the whole schema migration. If it fails halfway, you re-run it. If you ship it twice, if_not_exists saves you. It runs in well under a second on Aurora.

The constraint flip, the NOT NULL and the default, lives in a separate migration that ships only after the backfill is verified at 100 percent. The drop of the old column lives in a third migration that ships after reads have been flipped and the app has been running clean for a week. Three deploys. No exceptions.

Backfill in batches, with a kill switch

The backfill is not a schema migration. It runs as a maintenance task or a Sidekiq job, with a batch size, a sleep between batches, and a kill switch you can toggle without a deploy.

module Maintenance
  class BackfillUserLocaleTask < MaintenanceTasks::Task
    BATCH_SIZE = 1_000
    THROTTLE_SECONDS = 0.2

    def collection
      User.where(locale: nil).in_batches(of: BATCH_SIZE)
    end

    def process(batch)
      raise Paused if Rails.cache.read("backfill:user_locale:paused")

      User.where(id: batch.pluck(:id)).update_all(
        locale: "en",
        updated_at: Time.current
      )

      sleep THROTTLE_SECONDS
    rescue ActiveRecord::StatementInvalid => e
      raise unless e.message.include?("deadlock detected")
      sleep 5
      retry
    end
  end
end

Two things matter here. One, the batch is resumable. If the task crashes at row four million out of twelve million, the next run continues from row four million and one because the where(locale: nil) filter naturally skips already-processed rows. Two, there is a cache-key kill switch. When Aurora replica lag spikes, on-call flips the key and the job pauses without anyone having to run kubectl rollout restart or open a PR.

A war story on why the kill switch exists. A Tuesday morning at the same platform. Around 10:14 Pacific, Datadog’s AuroraReplicaLagMaximum > 60s for 2m alert fired. Community feed p99 read latency went from about 120 ms to over 8 seconds inside four minutes. Replica lag was at 14 minutes and climbing. The on-call’s first instinct was to scale reader instance class up two tiers. Sensible, wrong. The readers weren’t CPU bound, they were starved of WAL. Someone had scheduled an ANALYZE on community_posts, one of the hottest tables on the platform, inside a maintenance cron that didn’t check business hours. Killing the analyze drained replica lag in about six minutes. That same week we shipped db_safe_maintenance.rb, and every maintenance command, including backfills, now routes through it and refuses to run between 06:00 and 22:00 UTC. The kill switch in the task above is the second line of defense.

gh-ost and pg-osc for the hard cases

CREATE INDEX CONCURRENTLY is great. It does not help you when you need to change a column type, shrink a text to a varchar(255), or split one column into two. For those, Aurora MySQL gets gh-ost, Aurora PostgreSQL gets pg-osc (the Shopify port of pt-osc for Postgres). Both work the same way underneath. They create a shadow copy of the table, trickle-copy rows over in batches, capture writes via triggers (or in gh-ost’s case, binlog), and atomically swap the tables when the copy catches up. The original table stays online the whole time.

A pg-osc invocation looks like this. You’d wire it into a deploy hook or a runbook step, not into rails db:migrate.

pg-online-schema-change perform \
  --alter "ALTER TABLE users ALTER COLUMN bio TYPE text USING bio::text" \
  --dbname production \
  --host writer.cluster.aurora.local \
  --copy-statement "INSERT INTO %{shadow_table} SELECT * FROM users" \
  --batch-size 2000 \
  --pull-batch-count 1000 \
  --max-replication-lag 30 \
  --no-drop-old-table-after-copy

--max-replication-lag 30 tells pg-osc to pause the copy when reader lag exceeds 30 seconds. --no-drop-old-table-after-copy is the one flag I always set. After the swap, the old table is renamed but kept. If anything looks weird in the first 24 hours, you rename it back and you’re whole. Dropping the old table is its own deploy, ideally a week later, ideally on a Tuesday.

gh-ost on Aurora MySQL has equivalent flags: --max-lag-millis, --throttle-control-replicas, --cut-over=atomic. The shape of the runbook is identical.

CI validation against production-like data

The migration that took down login error rate at the creator platform passed CI. It passed local. It passed staging. None of those environments had a users table with hundreds of millions of rows. The lock that took 87 seconds against production took 200 ms against the staging copy. You cannot catch this kind of bug without testing against production-like data shape.

What I’ve shipped, more than once, is a nightly job that restores a sanitized snapshot of the production writer into a “migration sandbox” cluster, runs the pending migrations against it, and posts the timings to a Slack channel. The Aurora snapshot restore is the slow part, but it runs unattended overnight, and the actual migration timing is the answer to the question “what will this do in prod tomorrow.”

name: migration-sandbox

on:
  schedule:
    - cron: "0 6 * * 1-5"
  workflow_dispatch:

jobs:
  validate:
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.MIGRATION_SANDBOX_ROLE }}
          aws-region: us-east-1
      - name: restore sanitized snapshot to sandbox cluster
        run: ./bin/restore-sandbox-cluster.sh
      - name: run pending migrations with timing
        run: |
          bundle exec rake db:migrate:status
          /usr/bin/time -v bundle exec rake db:migrate 2>&1 | tee migrate.log
      - name: post timings to slack
        if: always()
        run: ./bin/post-migration-timings.sh migrate.log

The threshold I use is dumb on purpose. Any migration that takes more than five seconds on the sandbox is rejected at the PR level until it’s been rewritten as expand-contract. Anything that does an add_column with a non-null default is rejected at the linter level by strong_migrations. The two checks together catch every category of mistake I’ve personally shipped, which is most of them.

Takeaways

No schema change on a hot table is one step. Expand, backfill, contract.
add_column with a non-null default is a lock. algorithm: :concurrently on indexes. Always.
Backfills live outside migrations, batch-sized, throttled, with a cache-key kill switch.
For column type changes, use gh-ost (MySQL) or pg-osc (Postgres) and keep the old table for a week.
Validate migrations against a snapshot of production. CI on an empty schema tells you nothing.
The “safer” helpers are safer than raw, not safe. Read the migration file.

Thanks for reading. If you’ve got thoughts, send them my way.