How I keep a large Rails test suite under twelve minutes on CI: factory_bot tuning, parallel execution, fixture trade-offs, and how to quarantine flaky system tests before they block every deploy.
It was a Wednesday at the creator economy platform I worked at and the deploy train had been red for three days. Not because the app was broken. Because a handful of system tests kept failing randomly in CI, and the merge queue refused to let anything through until they went green. I was tagged in the war room channel around 10 a.m. Pacific because I’d been doing joker mode that quarter, and the Community squad’s tests were one of the worst offenders.
I ran the suite locally. All green. Pushed an empty commit. Two tests failed that weren’t even in the diff. Pushed again. A different two failed. The CI cost meter kept climbing and nobody had shipped to production since Monday.
This is the part of “testing at scale” nobody writes blog posts about. The suite isn’t slow. The suite is dishonest. I’ve seen Rails test suites with thousands of files, BDD culture baked in, mandatory coverage thresholds, and still the deploys stop because system tests pick fights with Capybara timing.
My position is opinionated. At scale, you stop treating “all tests pass” as a single boolean. You split your suite into tiers, you spend real engineering time on factories, you parallelize aggressively, and you build a quarantine pipeline for flakes the way you build a circuit breaker for an external API. The “ten green ticks on every PR” mindset breaks down past about 2,000 examples.
I’ve fought the fixtures-versus-factories argument more times than I can count. The Rails default is fixtures, and DHH still defends them. Factories won the community vote a decade ago via factory_bot. Both are right, in narrow ways.
Fixtures load once per test database, share state across the entire suite, and are essentially free at runtime. They’re brilliant for reference data. They’re terrible for anything with non-trivial relations because a fixture file ends up coupled to half your model graph and nobody dares delete a row.
Factories are explicit, isolated, and slow. The slowness is not because factory_bot is bad. It’s because a senior engineer wrote create(:user) and the User factory pulled in a workspace, a subscription, a Stripe customer, and three default permissions, none of which the test needed. Then it ran 14 callbacks. On Aurora.
What I do in practice at Rails scale is mix them. Reference data, lookup tables, plan tiers, country codes, anything that never changes, lives in fixtures and loads once. Anything specific to the scenario gets a factory. And the factories are aggressively minimal.
FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "user#{n}@example.test" }
    password { "correct-horse-battery-staple" }

    # Opt-in: only tests that ask for a workspace pay for one.
    trait :with_workspace do
      after(:create) do |user|
        create(:workspace, owner: user)
      end
    end

    # Opt-in: the plan row comes from fixtures, so it is looked up, not created.
    trait :with_subscription do
      after(:create) do |user|
        create(:subscription, user: user, plan: Plan.find_by!(slug: "pro"))
      end
    end
  end
end
The base :user factory creates exactly one row. If the test needs more, it asks. The traits are opt-in. A new engineer reading the test sees create(:user, :with_workspace) and knows immediately what’s in scope.
The other thing I push everywhere: build_stubbed whenever you can. If a test doesn’t write to the database, don’t write to the database.
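RSpec.describe BillingPresenter do
  it "formats the next charge date" do
    # The :with_subscription trait hooks after(:create), which build_stubbed
    # never fires, so the subscription is stubbed and attached explicitly.
    subscription = build_stubbed(:subscription)
    user = build_stubbed(:user, subscription: subscription)
    presenter = described_class.new(user)

    expect(presenter.next_charge_on).to eq(user.subscription.current_period_end.to_date)
  end
end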
That test used to take 180 ms with create. With build_stubbed it’s about 6 ms. Multiply by a few thousand examples and you’ve cut ten minutes off the suite.
Rails has had a parallel test runner built in since 6.0. It targets the default Minitest setup; an RSpec suite needs a separate tool such as the parallel_tests gem. Most teams I’ve joined either don’t use parallel runs or use them wrong. The default parallelize(workers: :number_of_processors) works fine on a laptop. On CI it fights the runner because GitHub Actions hosted runners have noisy neighbors and the auto-detected core count lies to you.
What works at scale is splitting the suite across multiple CI jobs, not just multiple processes inside one job. We do it by file, weighted by historical runtime.
jobs:
  rspec:
    strategy:
      fail-fast: false
      matrix:
        ci_node_total: [8]
        ci_node_index: [0, 1, 2, 3, 4, 5, 6, 7]
    runs-on: ubuntu-latest-8-cores
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true
      - name: Run RSpec
        env:
          CI_NODE_TOTAL: ${{ matrix.ci_node_total }}
          CI_NODE_INDEX: ${{ matrix.ci_node_index }}
        run: |
          bundle exec rspec \
            $(bin/split-tests --total $CI_NODE_TOTAL --index $CI_NODE_INDEX) \
            --format progress --format RspecJunitFormatter --out tmp/rspec.xml
The bin/split-tests script reads the previous run’s JUnit output and assigns files to nodes to balance total runtime. Round-robin or alphabetical splitting will leave one node doing the slow controller specs while seven others sit idle. Balanced splitting takes more setup. It also pays for itself within a week.
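A minimal sketch of the balancing step, assuming the timings have already been parsed out of the JUnit XML into a hash of file path to seconds. The method name `split_by_runtime`, the `default_seconds` weight for specs with no history, and the greedy "slowest file goes to the lightest node" strategy are my illustration here, not the actual contents of bin/split-tests.

```ruby
# Greedy "longest processing time first" split. Hypothetical sketch only.
# `timings` maps spec file path => seconds from a previous run.
def split_by_runtime(timings, total:, default_seconds: 1.0, all_files: timings.keys)
  nodes = Array.new(total) { { files: [], seconds: 0.0 } }

  # Specs with no recorded timing (for example, brand-new files) get a default weight.
  weighted = all_files.map { |file| [file, timings.fetch(file, default_seconds)] }

  # Slowest first; ties broken by file name so every CI node computes the same plan.
  weighted.sort_by { |file, seconds| [-seconds, file] }.each do |file, seconds|
    lightest = nodes.min_by { |node| node[:seconds] }
    lightest[:files] << file
    lightest[:seconds] += seconds
  end

  nodes
end
```

Each CI job would run the same computation and pick `nodes[CI_NODE_INDEX][:files]`. Because the sort is deterministic, the eight jobs agree on the plan without talking to each other.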
Inside each node we still parallelize across processes, but conservatively. Three workers per 8-core runner, not eight. The extra cores absorb fork overhead and database connection churn.
Flaky tests are the part most teams get wrong. A system test fails twice in a week. Somebody adds a retry annotation. Six months later that file has eleven retries and is testing nothing.
The right move is a quarantine list. A failing flaky test gets pulled out of the main suite, runs in a separate CI job that doesn’t block merges, and goes on an explicit fix-or-delete deadline.
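require "yaml"

RSpec.configure do |config|
  # Full descriptions of quarantined examples; tolerate an empty file.
  quarantined = (YAML.load_file(Rails.root.join("spec/quarantine.yml")) || {}).fetch("examples", [])

  # Examples tagged `quarantine: true` by hand never run in the main suite.
  config.filter_run_excluding(quarantine: true)

  # Examples named in the YAML list are skipped at runtime and reported as pending.
  config.before(:each) do |example|
    if quarantined.include?(example.metadata[:full_description])
      example.metadata[:quarantine] = true
      skip("Quarantined. See spec/quarantine.yml")
    end
  end
end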
A separate workflow runs the quarantine list nightly, posts results to a Slack channel, and pings the owning team. If a quarantined test goes green for seven nights in a row, it earns its way back. If it sits in quarantine for over 30 days, it gets deleted. The codebase forgets nothing.
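The restore-or-delete rule is small enough to sketch as a pure function. The `QuarantineEntry` struct, its fields, and the `quarantine_verdict` name are hypothetical, and I’m assuming a seven-night green streak wins over the 30-day limit when both apply.

```ruby
require "date"

# Hypothetical entry shape: when the example was quarantined, plus its
# nightly results, oldest first (true = passed).
QuarantineEntry = Struct.new(:description, :quarantined_on, :nightly_results, keyword_init: true)

GREEN_STREAK_TO_RESTORE = 7
MAX_DAYS_IN_QUARANTINE = 30

# Returns :restore, :delete, or :keep for a single quarantined example.
def quarantine_verdict(entry, today: Date.today)
  recent = entry.nightly_results.last(GREEN_STREAK_TO_RESTORE)
  return :restore if recent.size == GREEN_STREAK_TO_RESTORE && recent.all?

  days_in_quarantine = (today - entry.quarantined_on).to_i
  return :delete if days_in_quarantine > MAX_DAYS_IN_QUARANTINE

  :keep
end
```

The nightly workflow could call this per entry and put the verdict in the Slack message, so the deadline is enforced by a script rather than by somebody’s memory.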
The Rails monolith I worked on was hitting Aurora PostgreSQL. We shipped a migration adding a non-null column to users, a table with hundreds of millions of rows. I’d reviewed the migration that morning and ack’d it as safe. It used add_column_with_default from strong_migrations, which we all thought of as the safer path. On Aurora at our row count the migration acquired an ACCESS EXCLUSIVE lock on users and held it for 87 seconds. Login error rate hit 100 percent for about 85 seconds. PagerDuty woke half the senior engineers in California.
First instinct was rollback. We couldn’t. Rails can’t cleanly roll back a partially applied add_column_with_default, and by the time the lock would have released we’d have been deeper into the cascade.
The real fix landed the next day, split into three migrations: add the column nullable, backfill in batches in a separate Sidekiq job, then change_column_null once 100 percent backfilled. And then the part most postmortems skip. We wrote a test. A strong_migrations rule that blocks any add_column with a non-null default on tables with more than 10M rows, enforced in CI. The test runs against the migration files, not the app. It would have caught the original PR. It’s caught two more since.
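To show the shape of a check that reads migration files rather than running the app, here is a plain-Ruby sketch. It is not the strong_migrations rule we actually wrote: the `LARGE_TABLES` list, the `risky_add_columns` name, and the single-line regex are all assumptions, and it is deliberately conservative in flagging any `add_column` on a listed table that carries `null: false` or a `default:`.

```ruby
# Hypothetical list of tables above the row-count threshold.
LARGE_TABLES = %w[users].freeze

# Matches single-line calls such as `add_column :users, :timezone, :string, ...`.
ADD_COLUMN = /add_column\s*\(?\s*:(?<table>\w+)\s*,(?<rest>.*)/

# Returns one message per risky add_column call found in the migration source.
def risky_add_columns(source, large_tables: LARGE_TABLES)
  source.each_line.with_index(1).filter_map do |line, number|
    match = ADD_COLUMN.match(line)
    next unless match && large_tables.include?(match[:table])
    next unless match[:rest].match?(/null:\s*false/) || match[:rest].match?(/default:/)

    "line #{number}: add_column on large table #{match[:table]}"
  end
end
```

A spec could loop over the migration directory, call this on each file’s contents, and fail with the returned messages. Multi-line calls and helper methods would slip past a regex like this, which is one reason to prefer a check that hooks into the migration DSL itself.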
Cost: 85 seconds of total login outage. Lesson: at scale, your test suite tests your migrations too. Not just your code.
The other story I always reach for is from the combat-sports tournament platform I CTO’d in London. Different stack, same lesson. We had hundreds of microservices on Kafka. A consumer group started rebalancing every 30 seconds during a live tournament. The standings page froze at 14:32 local time on a Saturday broadcast.
First fix was a kubectl rollout restart. Made it worse. The real cause was one pod out of six running a stale image because someone had pushed a config fix without bumping the image tag, and the deployment had pulled :latest. The pod’s max.poll.interval.ms was 60 seconds where the others were 300. After that we pinned images by SHA, never by tag.
The test we added wasn’t a unit test. It was a CI gate. A linter step that failed the deploy if any Kafka-touching deployment manifest referenced :latest. Same energy as the strong_migrations rule. Same idea. The test isn’t about your business logic. It’s about the class of mistake your team keeps making.
# spec/system/deploy_safety_spec.rb
require "rails_helper"
require "yaml"

RSpec.describe "deploy manifests" do
  Dir[Rails.root.join("ops/k8s/**/*.yaml")].each do |manifest|
    it "#{File.basename(manifest)} pins image SHAs, not tags" do
      doc = YAML.load_file(manifest, aliases: true)
      images = doc.dig("spec", "template", "spec", "containers")&.map { |c| c["image"] } || []
      images.each do |image|
        expect(image).to match(/@sha256:[0-9a-f]{64}\z/),
          "#{image} uses a tag, not a SHA. See runbook docs/runbooks/deploys.md"
      end
    end
  end
end
That’s a test that runs in 40 ms and would have saved 12 minutes of stale standings during a live broadcast.
:user factory. Traits are opt-in, base factories are minimal, and build_stubbed beats create whenever the test doesn’t hit the database.
spec/system directory is allowed to test more than your controllers.
Thanks for reading. If you’ve got thoughts, send them my way.