Data Migration Patterns in Rails

How I separate data migrations from schema migrations in Rails, with resumable batched backfills, dual-write windows, and zero-downtime cutovers on Aurora.

The backfill had been running for three days. Tuesday afternoon at the creator-economy platform I worked at, Aurora writer pinned at 60 percent CPU, and our users table was halfway through a column transformation that I’d kicked off Friday evening before logging off. Then a teammate deployed an unrelated migration. The Rails app booted, ran db:migrate, hit my in-progress migration’s change block, and the whole thing started over from row zero.

I’d put the backfill inside a schema migration. I know.

That mistake is the cleanest argument I can give for splitting data migrations from schema migrations in Rails. They run on different timelines, fail in different ways, need different recovery stories, and live in different parts of your repo. Treating them as the same thing is how you end up with a 90-second ACCESS EXCLUSIVE lock on a hot table, or a backfill that gets re-run on every deploy until somebody notices.

My position is plain. Schema migrations should be small, fast, reversible, and shipped with the deploy. Data migrations should be separate, idempotent, resumable, and run by an operator (or a scheduled job), not by db:migrate. The maintenance_tasks gem from Shopify gives you most of this out of the box, but the discipline matters more than the tool.

The three-step dance

On Aurora at scale, every schema change against a hot table is a three-step dance, not a one-liner. Add the column nullable. Backfill. Flip the constraint.

# db/migrate/0001_add_locale_to_users.rb
class AddLocaleToUsers < ActiveRecord::Migration[7.1]
  def change
    add_column :users, :locale, :string, null: true, default: nil
    add_index :users, :locale, algorithm: :concurrently
  end

  disable_ddl_transaction!
end

That’s it. No backfill in the migration. No default value that triggers a table rewrite. The data migration lives somewhere else. The constraint flip lives in a third migration that runs only after the backfill is 100 percent done.

I learned this the hard way. The Rails monolith at the creator platform, past midnight UTC but only around 6 p.m. Pacific. I’d ack’d a migration that morning. It used strong_migrations’ add_column_with_default helper, which I’d treated as the safer path. On Aurora at our row count, the helper acquired an ACCESS EXCLUSIVE lock on users while applying the default backfill. Roughly 87 seconds of blocked writes. Login error rate hit 100 percent for ~85 seconds. PagerDuty woke half the senior engineers in California. I was eight minutes into a Slack DM with a colleague when the war room channel hit the FIRE emoji wall.

First instinct was to roll back. Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. By the time the lock would release, we’d already be 60 seconds into the cascade. We let the migration finish. It took 87 seconds. Locks released. Login recovered within 15 seconds because the dependent service had a tight retry loop. Postmortem the next day split the migration into the three steps above. strong_migrations got a new rule blocking any add_column with a non-null default against tables with more than 10 million rows in CI.

The gem gives you safer defaults, not safe ones. You still have to read the migration file. I had not.

Resumable backfills

The backfill itself belongs in a maintenance_tasks task or a Sidekiq job, not a migration. The non-negotiable property is resumability. If the task crashes at row 4,000,000 out of 12,000,000, the next run starts at 4,000,001, not zero.

# app/tasks/maintenance/backfill_user_locale_task.rb
module Maintenance
  class BackfillUserLocaleTask < MaintenanceTasks::Task
    BATCH_SIZE = 1_000
    DEFAULT_LOCALE = "en"

    def collection
      User.where(locale: nil).in_batches(of: BATCH_SIZE)
    end

    def process(batch)
      ids = batch.pluck(:id)
      User.where(id: ids).update_all(
        locale: DEFAULT_LOCALE,
        updated_at: Time.current
      )
    rescue ActiveRecord::StatementInvalid => e
      logger.error("batch failed: #{ids.first}..#{ids.last} -- #{e.message}")
      raise
    end

    def count
      User.where(locale: nil).count
    end
  end
end

A few things worth flagging here. in_batches keyset-paginates by primary key, which is the only safe way on a multi-terabyte Aurora writer. update_all skips ActiveRecord callbacks and validations, which is what you want for a backfill. The collection re-queries on each batch, so if the task is paused and resumed, it picks up only the rows that still need work. And count returns the remaining rows, not the original total, which makes the maintenance_tasks UI honest about progress.

The Gemfile entry is one line.

# Gemfile
gem "maintenance_tasks", "~> 2.10"

I’ve seen teams reach for find_each with a saved offset in a Redis key. It works, but it’s a worse version of the same idea. Use the gem.

Dual-write windows

The harder case is a column transformation, not just a backfill. Say you’re splitting users.full_name into first_name and last_name. You can’t flip in one shot. You need a window where both old and new columns exist, both are written, and reads can pick either source. Then you backfill. Then you cut reads over. Then you stop writing the old column. Then you drop it.

# app/services/users/name_writer.rb
module Users
  class NameWriter
    def initialize(user)
      @user = user
    end

    def write(full_name:)
      first, *rest = full_name.to_s.strip.split(" ")
      last = rest.join(" ").presence

      @user.assign_attributes(
        full_name: full_name,
        first_name: first,
        last_name: last
      )
      @user.save!
    end
  end
end

Every write path goes through this for the duration of the window. No direct update(full_name: ...) anywhere. CI greps for it. Once the backfill finishes and reads have been moved over (feature flag, no deploys required), you can simplify the writer to only set the new columns, then drop the old one in a follow-up schema migration.

The dual-write window is the part everyone wants to skip. Don’t. It’s the only reason the cutover is boring.

Cutover

The cutover itself is small. The migration drops the old column and tightens the constraint, but only after the data migration log says 100 percent.

# db/migrate/0002_finalize_user_locale.rb
class FinalizeUserLocale < ActiveRecord::Migration[7.1]
  def up
    remaining = User.where(locale: nil).count
    raise "backfill not complete: #{remaining} rows left" if remaining > 0

    change_column_null :users, :locale, false
  end

  def down
    change_column_null :users, :locale, true
  end
end

The raise is intentional. A schema migration that depends on a data migration should refuse to run if the data isn’t ready. I’ve watched too many cutovers run on the wrong environment to trust myself.

Takeaways

Schema migrations are fast and reversible. Data migrations are slow and resumable. Don’t mix them.
Backfill columns nullable first. Flip the constraint in a separate migration that checks completeness.
Use maintenance_tasks for backfills. Operator-driven, idempotent, paused and resumed without losing state.
Dual-write through a service object during column transformations. CI greps for direct writes.
On Aurora at scale, treat every schema change against a hot table as a three-step dance.

Thanks for reading. If you’ve got thoughts, send them my way.