How I run expand-contract migrations on Aurora and MySQL at scale, with gh-ost, pg-osc, CREATE INDEX CONCURRENTLY, batched backfills, and CI validation against production-like data.
It was a late evening deploy at the creator economy platform I worked at. Past midnight UTC, around 6 p.m. Pacific, which is the worst possible time for a US-heavy product. We were adding a non-null column to users, a table with hundreds of millions of rows on a multi-terabyte Aurora writer. The migration was written using the “safer than raw ActiveRecord” helper. I’d reviewed it that morning and acked it as safe. 87 seconds later, login error rate was at 100 percent and PagerDuty was waking half the senior engineers in California.
I’m the reason that runbook now has a line that says “no add_column with a non-null default on tables over 10 million rows. Ever.”
My position on this is pretty plain. There is no such thing as a small schema change against a hot table at scale. Every change is a three-step dance: expand, backfill, contract. If you try to do it in one shot, you either get lucky for a long time and then you don’t, or you get unlucky on day one. The tools that promise to abstract this for you, strong_migrations, Rails add_column_with_default, the pt-osc and gh-ost schema changers, are great. They are not a substitute for reading the migration file.
The pattern is old and boring and it works. You never modify a column in place. You add a new column or table that is backward compatible, you teach the app to write to both (or to the new one) while still reading from the old, you backfill the historical data in the background, you flip reads to the new column, you stop writing the old column, and only then do you drop it. Every step is its own deploy. Every step is reversible. No step is allowed to block writes.
Here’s the first step for adding a locale column to users. Notice what’s missing.
class AddLocaleToUsers < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def change
add_column :users, :locale, :string, null: true, default: nil
add_index :users, :locale, algorithm: :concurrently, if_not_exists: true
end
end
No default value. No null: false. No backfill. algorithm: :concurrently so PostgreSQL doesn’t take an ACCESS EXCLUSIVE lock to build the index. disable_ddl_transaction! because CREATE INDEX CONCURRENTLY cannot run inside a transaction. That’s the whole schema migration. If it fails halfway, you re-run it. If you ship it twice, if_not_exists saves you. It runs in well under a second on Aurora.
The constraint flip, the NOT NULL and the default, lives in a separate migration that ships only after the backfill is verified at 100 percent. The drop of the old column lives in a third migration that ships after reads have been flipped and the app has been running clean for a week. Three deploys. No exceptions.
The backfill is not a schema migration. It runs as a maintenance task or a Sidekiq job, with a batch size, a sleep between batches, and a kill switch you can toggle without a deploy.
module Maintenance
class BackfillUserLocaleTask < MaintenanceTasks::Task
BATCH_SIZE = 1_000
THROTTLE_SECONDS = 0.2
def collection
User.where(locale: nil).in_batches(of: BATCH_SIZE)
end
def process(batch)
raise Paused if Rails.cache.read("backfill:user_locale:paused")
User.where(id: batch.pluck(:id)).update_all(
locale: "en",
updated_at: Time.current
)
sleep THROTTLE_SECONDS
rescue ActiveRecord::StatementInvalid => e
raise unless e.message.include?("deadlock detected")
sleep 5
retry
end
end
end
Two things matter here. One, the batch is resumable. If the task crashes at row four million out of twelve million, the next run continues from row four million and one because the where(locale: nil) filter naturally skips already-processed rows. Two, there is a cache-key kill switch. When Aurora replica lag spikes, on-call flips the key and the job pauses without anyone having to run kubectl rollout restart or open a PR.
A war story on why the kill switch exists. A Tuesday morning at the same platform. Around 10:14 Pacific, Datadog’s AuroraReplicaLagMaximum > 60s for 2m alert fired. Community feed p99 read latency went from about 120 ms to over 8 seconds inside four minutes. Replica lag was at 14 minutes and climbing. The on-call’s first instinct was to scale reader instance class up two tiers. Sensible, wrong. The readers weren’t CPU bound, they were starved of WAL. Someone had scheduled an ANALYZE on community_posts, one of the hottest tables on the platform, inside a maintenance cron that didn’t check business hours. Killing the analyze drained replica lag in about six minutes. That same week we shipped db_safe_maintenance.rb, and every maintenance command, including backfills, now routes through it and refuses to run between 06:00 and 22:00 UTC. The kill switch in the task above is the second line of defense.
CREATE INDEX CONCURRENTLY is great. It does not help you when you need to change a column type, shrink a text to a varchar(255), or split one column into two. For those, Aurora MySQL gets gh-ost, Aurora PostgreSQL gets pg-osc (the Shopify port of pt-osc for Postgres). Both work the same way underneath. They create a shadow copy of the table, trickle-copy rows over in batches, capture writes via triggers (or in gh-ost’s case, binlog), and atomically swap the tables when the copy catches up. The original table stays online the whole time.
A pg-osc invocation looks like this. You’d wire it into a deploy hook or a runbook step, not into rails db:migrate.
pg-online-schema-change perform \
--alter "ALTER TABLE users ALTER COLUMN bio TYPE text USING bio::text" \
--dbname production \
--host writer.cluster.aurora.local \
--copy-statement "INSERT INTO %{shadow_table} SELECT * FROM users" \
--batch-size 2000 \
--pull-batch-count 1000 \
--max-replication-lag 30 \
--no-drop-old-table-after-copy
--max-replication-lag 30 tells pg-osc to pause the copy when reader lag exceeds 30 seconds. --no-drop-old-table-after-copy is the one flag I always set. After the swap, the old table is renamed but kept. If anything looks weird in the first 24 hours, you rename it back and you’re whole. Dropping the old table is its own deploy, ideally a week later, ideally on a Tuesday.
gh-ost on Aurora MySQL has equivalent flags: --max-lag-millis, --throttle-control-replicas, --cut-over=atomic. The shape of the runbook is identical.
The migration that took down login error rate at the creator platform passed CI. It passed local. It passed staging. None of those environments had a users table with hundreds of millions of rows. The lock that took 87 seconds against production took 200 ms against the staging copy. You cannot catch this kind of bug without testing against production-like data shape.
What I’ve shipped, more than once, is a nightly job that restores a sanitized snapshot of the production writer into a “migration sandbox” cluster, runs the pending migrations against it, and posts the timings to a Slack channel. The Aurora snapshot restore is the slow part, but it runs unattended overnight, and the actual migration timing is the answer to the question “what will this do in prod tomorrow.”
name: migration-sandbox
on:
schedule:
- cron: "0 6 * * 1-5"
workflow_dispatch:
jobs:
validate:
runs-on: ubuntu-latest
timeout-minutes: 120
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.MIGRATION_SANDBOX_ROLE }}
aws-region: us-east-1
- name: restore sanitized snapshot to sandbox cluster
run: ./bin/restore-sandbox-cluster.sh
- name: run pending migrations with timing
run: |
bundle exec rake db:migrate:status
/usr/bin/time -v bundle exec rake db:migrate 2>&1 | tee migrate.log
- name: post timings to slack
if: always()
run: ./bin/post-migration-timings.sh migrate.log
The threshold I use is dumb on purpose. Any migration that takes more than five seconds on the sandbox is rejected at the PR level until it’s been rewritten as expand-contract. Anything that does an add_column with a non-null default is rejected at the linter level by strong_migrations. The two checks together catch every category of mistake I’ve personally shipped, which is most of them.
add_column with a non-null default is a lock. algorithm: :concurrently on indexes. Always.gh-ost (MySQL) or pg-osc (Postgres) and keep the old table for a week.Thanks for reading. If you’ve got thoughts, send them my way.