Testing Rails Applications at Scale

How I keep a large Rails test suite under twelve minutes on CI: factory_bot tuning, parallel execution, fixture trade-offs, and how to quarantine flaky system tests before they block every deploy.

It was a Wednesday at the creator economy platform I worked at and the deploy train had been red for three days. Not because the app was broken. Because a handful of system tests kept failing randomly in CI, and the merge queue refused to let anything through until they went green. I was tagged in the war room channel around 10 a.m. Pacific because I’d been doing joker mode that quarter, and the Community squad’s tests were among the worst offenders.

I ran the suite locally. All green. Pushed an empty commit. Two tests failed that weren’t even in the diff. Pushed again. A different two failed. The CI cost meter kept climbing and nobody had shipped to production since Monday.

This is the part of “testing at scale” nobody writes blog posts about. The suite isn’t slow. The suite is dishonest. I’ve seen Rails test suites with thousands of files, BDD culture baked in, mandatory coverage thresholds, and still the deploys stop because system tests pick fights with Capybara timing.

My position is opinionated. At scale, you stop treating “all tests pass” as a single boolean. You split your suite into tiers, you spend real engineering time on factories, you parallelize aggressively, and you build a quarantine pipeline for flakes the way you build a circuit breaker for an external API. The “ten green ticks on every PR” mindset breaks down past about 2,000 examples.

Fixtures versus factories

I’ve fought this argument more times than I can count. The Rails default is fixtures, and DHH still defends them. Factories won the community vote a decade ago via factory_bot. Both are right, in narrow ways.

Fixtures load once per test database, share state across the entire suite, and are essentially free at runtime. They’re brilliant for reference data. They’re terrible for anything with non-trivial relations because a fixture file ends up coupled to half your model graph and nobody dares delete a row.

Factories are explicit, isolated, and slow. The slowness is not because factory_bot is bad. It’s because a senior engineer wrote create(:user) and the User factory pulled in a workspace, a subscription, a Stripe customer, and three default permissions, none of which the test needed. Then it ran 14 callbacks. On Aurora.

What I do in practice at Rails scale is mix them. Reference data, lookup tables, plan tiers, country codes, anything that never changes, lives in fixtures and loads once. Anything specific to the scenario gets a factory. And the factories are aggressively minimal.

FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "user#{n}@example.test" }
    password { "correct-horse-battery-staple" }

    trait :with_workspace do
      after(:create) do |user|
        create(:workspace, owner: user)
      end
    end

    trait :with_subscription do
      after(:create) do |user|
        create(:subscription, user: user, plan: Plan.find_by!(slug: "pro"))
      end
    end
  end
end

The base :user factory creates exactly one row. If the test needs more, it asks. The traits are opt-in. A new engineer reading the test sees create(:user, :with_workspace) and knows immediately what’s in scope.

The other thing I push everywhere: build_stubbed whenever you can. If a test doesn’t write to the database, don’t write to the database.

RSpec.describe BillingPresenter do
  it "formats the next charge date" do
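    # The :with_subscription trait hooks after(:create), which build_stubbed never fires,
    # so stub the subscription directly (assumes User has_one :subscription).
    subscription = build_stubbed(:subscription)
    user = build_stubbed(:user, subscription: subscription)
    presenter = described_class.new(user)
    expect(presenter.next_charge_on).to eq(subscription.current_period_end.to_date)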
  end
end

That test used to take 180 ms with create. With build_stubbed it’s about 6 ms. Multiply by a few thousand examples and you’ve cut ten minutes off the suite.
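To sanity-check that claim, here is the arithmetic as a tiny Ruby sketch. The 180 ms and 6 ms figures come from the measurement above; the example count of 3,500 is my assumption for "a few thousand", and minutes_saved is a hypothetical helper name.

```ruby
# Hypothetical helper: total wall-clock minutes saved when each example
# drops from before_ms to after_ms.
def minutes_saved(before_ms:, after_ms:, examples:)
  (before_ms - after_ms) * examples / 1000.0 / 60.0
end

# 3_500 is an assumed example count, not a measured one.
saved = minutes_saved(before_ms: 180, after_ms: 6, examples: 3_500)
puts format("about %.0f minutes saved", saved)
```

At 3,500 converted examples that is roughly ten minutes of serial runtime. The saving scales linearly, so 1,000 examples gets you closer to three.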

Parallel execution and CI splitting

Rails has had a parallel test runner built in since 6.0, but it only drives the default Minitest runner; an RSpec suite like this one needs something like the parallel_tests gem. Most teams I’ve joined either don’t parallelize or do it wrong. Auto-detecting the worker count, as in parallelize(workers: :number_of_processors), works fine on a laptop. On CI it fights the runner because GitHub Actions hosted runners have noisy neighbors and the auto-detected core count lies to you.

What works at scale is splitting the suite across multiple CI jobs, not just multiple processes inside one job. We do it by file, weighted by historical runtime.

jobs:
  rspec:
    strategy:
      fail-fast: false
      matrix:
        ci_node_total: [8]
        ci_node_index: [0, 1, 2, 3, 4, 5, 6, 7]
    runs-on: ubuntu-latest-8-cores
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true
      - name: Run RSpec
        env:
          CI_NODE_TOTAL: ${{ matrix.ci_node_total }}
          CI_NODE_INDEX: ${{ matrix.ci_node_index }}
        run: |
          bundle exec rspec \
            $(bin/split-tests --total $CI_NODE_TOTAL --index $CI_NODE_INDEX) \
            --format progress --format RspecJunitFormatter --out tmp/rspec.xml

The bin/split-tests script reads the previous run’s JUnit output and assigns files to nodes to balance total runtime. Round-robin or alphabetical splitting will leave one node doing the slow controller specs while seven others sit idle. Balanced splitting takes more setup. It also pays for itself within a week.
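The balancing idea fits in a few lines. This is a minimal sketch, not the real bin/split-tests: it assumes you already have a hash of file path to seconds from a previous run (parsing the JUnit XML is left out), and split_by_runtime, runtimes, and default_seconds are names I made up for illustration. It uses a greedy longest-first assignment: sort files slowest first, then hand each one to whichever node currently has the least work.

```ruby
# Hypothetical helper: greedy longest-first split of spec files across CI nodes.
# runtimes maps file path => seconds from a previous run.
# Files missing from runtimes (new specs) get default_seconds so they still run.
def split_by_runtime(files, runtimes, total:, default_seconds: 1.0)
  buckets = Array.new(total) { { files: [], seconds: 0.0 } }

  # Slowest first; tie-break on path so every node computes the same split.
  sorted = files.sort_by { |file| [-runtimes.fetch(file, default_seconds), file] }

  sorted.each do |file|
    lightest = buckets.min_by { |bucket| bucket[:seconds] }
    lightest[:files] << file
    lightest[:seconds] += runtimes.fetch(file, default_seconds)
  end

  buckets.map { |bucket| bucket[:files] }
end
```

Each CI node would call this with the same inputs and take the bucket at its own CI_NODE_INDEX. The deterministic tie-break matters: if two nodes disagree about the split, some files run twice and others never run.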

Inside each node we still parallelize across processes, but conservatively. Three workers per 8-core runner, not eight. The extra cores absorb fork overhead and database connection churn.

Quarantine, do not delete

This is the part most teams get wrong. A system test fails twice in a week. Somebody adds a retry annotation. Six months later that file has eleven retries and is testing nothing.

The right move is a quarantine list. A failing flaky test gets pulled out of the main suite, runs in a separate CI job that doesn’t block merges, and goes on an explicit fix-or-delete deadline.

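RSpec.configure do |config|
  # YAML.load_file returns nil or false for an empty file, so fall back to an empty hash.
  quarantine_list = YAML.load_file(Rails.root.join("spec/quarantine.yml")) || {}
  quarantined = quarantine_list.fetch("examples", [])

  # Tag listed examples when they are defined. Setting metadata inside a
  # before(:each) hook is too late for filter_run_excluding to see it.
  config.define_derived_metadata do |metadata|
    metadata[:quarantine] = true if quarantined.include?(metadata[:full_description])
  end

  # RUN_QUARANTINE is a placeholder name: the non-blocking job sets it to run
  # only the quarantined examples; every other run excludes them.
  if ENV["RUN_QUARANTINE"]
    config.filter_run_including(quarantine: true)
  else
    config.filter_run_excluding(quarantine: true)
  end
end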

A separate workflow runs the quarantine list nightly, posts results to a Slack channel, and pings the owning team. If a quarantined test goes green for seven nights in a row, it earns its way back. If it sits in quarantine for over 30 days, it gets deleted. The codebase forgets nothing.
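The seven-night and 30-day rules are simple enough to encode, which keeps the decision out of anyone's hands. Here is a minimal sketch under my own assumptions: QuarantineEntry, quarantine_verdict, and the shape of nightly_results are hypothetical, and I have chosen to check the green streak before the deadline so a test that has earned its way back is not deleted on the same night.

```ruby
require "date"

# Hypothetical record for one quarantined example. nightly_results is ordered
# oldest to newest; true means that night's quarantine run passed.
QuarantineEntry = Struct.new(:description, :quarantined_on, :nightly_results, keyword_init: true)

GREEN_STREAK_TO_RESTORE = 7 # consecutive green nights needed to rejoin the main suite
MAX_DAYS_IN_QUARANTINE = 30 # fix-or-delete deadline

# Returns :restore, :delete, or :keep for a single entry.
def quarantine_verdict(entry, today: Date.today)
  # Count passes backwards from the most recent night until the first failure.
  streak = entry.nightly_results.reverse.take_while { |passed| passed }.size
  return :restore if streak >= GREEN_STREAK_TO_RESTORE

  days_in_quarantine = (today - entry.quarantined_on).to_i
  return :delete if days_in_quarantine > MAX_DAYS_IN_QUARANTINE

  :keep
end
```

The nightly job can run this over every entry and put the verdicts in the same Slack message as the results.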

Aurora schema migration, redux

The Rails monolith I worked on was hitting Aurora PostgreSQL. We shipped a migration adding a non-null column to users, a table with hundreds of millions of rows. I’d reviewed the migration that morning and ack’d it as safe. It used an add_column_with_default helper, which we all thought of as the safer path. On Aurora at our row count the migration acquired an ACCESS EXCLUSIVE lock on users and held it for 87 seconds. Login error rate hit 100 percent for about 85 seconds. PagerDuty woke half the senior engineers in California.

First instinct was rollback. We couldn’t. Rails can’t cleanly roll back a partially applied add_column_with_default, and by the time the lock would have released we’d have been deeper into the cascade.

The real fix landed the next day, split into three migrations: add the column as nullable, backfill in batches from a separate Sidekiq job, then call change_column_null once the backfill is 100 percent complete. And then the part most postmortems skip. We wrote a test. A strong_migrations rule that blocks any add_column combining NOT NULL with a default on tables with more than 10M rows, enforced in CI. The test runs against the migration files, not the app. It would have caught the original PR. It’s caught two more since.
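To show what "runs against the migration files" can look like, here is a plain-Ruby stand-in. It is not the strong_migrations rule itself: it scans migration source text with a regex, only understands single-line calls, and LARGE_TABLES, ADD_COLUMN_CALL, and unsafe_add_columns are names I made up. The hard-coded table list is an assumption standing in for a real row-count lookup.

```ruby
# Hypothetical list of tables too large for a blocking ALTER TABLE.
LARGE_TABLES = %w[users].freeze

# Matches single-line add_column / add_column_with_default calls and captures
# the table name plus the remaining arguments.
ADD_COLUMN_CALL = /\badd_column(?:_with_default)?[\s(]+:(?<table>\w+)\s*,(?<rest>.*)/

# Returns the 1-based line numbers in one migration's source that add a
# NOT NULL column (or use the default-setting helper) on a large table.
def unsafe_add_columns(source, large_tables: LARGE_TABLES)
  source.each_line.with_index(1).filter_map do |line, number|
    next if line.strip.start_with?("#") # ignore commented-out code

    call = ADD_COLUMN_CALL.match(line)
    next unless call && large_tables.include?(call[:table])

    uses_helper = line.include?("add_column_with_default")
    sets_not_null = call[:rest].match?(/\bnull:\s*false\b/)
    number if uses_helper || sets_not_null
  end
end
```

A spec can loop over the migration directory, call this on each file's contents, and fail with the offending line numbers. A regex scan is crude, and a multi-line call will slip past it, so treat it as a tripwire rather than a proof of safety.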

Cost: 85 seconds of total login outage. Lesson: at scale, your test suite tests your migrations too. Not just your code.

The Kafka rebalance equivalent

The other story I always reach for is from the combat-sports tournament platform I CTO’d in London. Different stack, same lesson. We had hundreds of microservices on Kafka. A consumer group started rebalancing every 30 seconds during a live tournament. The standings page froze at 14:32 local time on a Saturday broadcast.

First fix was a kubectl rollout restart. Made it worse. The real cause was one pod out of six running a stale image because someone had pushed a config fix without bumping the image tag, and the deployment had pulled :latest. The pod’s max.poll.interval.ms was 60 seconds where the others were 300. After that we pinned images by SHA, never by tag.

The test we added wasn’t a unit test. It was a CI gate. A linter step that failed the deploy if any Kafka-touching deployment manifest referenced :latest. Same energy as the strong_migrations rule. Same idea. The test isn’t about your business logic. It’s about the class of mistake your team keeps making.

# spec/system/deploy_safety_spec.rb
require "rails_helper"
require "yaml"

RSpec.describe "deploy manifests" do
  Dir[Rails.root.join("ops/k8s/**/*.yaml")].each do |manifest|
    it "#{File.basename(manifest)} pins image SHAs, not tags" do
      doc = YAML.load_file(manifest, aliases: true)
      images = doc.dig("spec", "template", "spec", "containers")&.map { |c| c["image"] } || []
      images.each do |image|
        expect(image).to match(/@sha256:[0-9a-f]{64}\z/),
          "#{image} uses a tag, not a SHA. See runbook docs/runbooks/deploys.md"
      end
    end
  end
end

That’s a test that runs in 40 ms and would have saved 12 minutes of stale standings during a live broadcast.

Takeaways

Stop loading the world in your :user factory. Traits are opt-in, base factories are minimal, and build_stubbed beats create whenever the test doesn’t hit the database.
Mix fixtures and factories on purpose. Reference data goes in fixtures. Scenario data goes in factories. Don’t pick a side, pick the right tool per row.
Split CI by file with historical runtime weighting, not round-robin or alphabetically. Three processes per 8-core runner, not eight. Forks cost more than people think.
Quarantine flakes, do not retry them. Separate CI job, nightly run, Slack ping to the owning team, 30-day fix-or-delete deadline.
Write tests against the class of mistake your team keeps making. Schema migrations, deploy manifests, Sidekiq retry budgets. Your spec/system directory is allowed to test more than your controllers.
The goal isn’t 100 percent green on every push. The goal is honest signal.

Thanks for reading. If you’ve got thoughts, send them my way.