How I run Rails 7 against Aurora writers, reader replicas, and sharded clusters in production. Role switching, replica-lag stickiness, and migrations that don't take down login.
OK so picture a Tuesday morning, 10:14 a.m. PT, at the creator-economy platform I spent the last few years at. Datadog’s “AuroraReplicaLagMaximum > 60s for 2m” alert went off. The Community feeds were the loudest victim. /communities/:id/posts p99 went from about 120 ms to over 8 seconds inside four minutes. By the time I joined the war room, the lag was already 14 minutes and still climbing. I wasn’t on-call that week. I was floating across a couple of squads, but the Community team pulled me in because I owned the Aurora layer.
Reads were served from three reader replicas behind a custom routing layer. Writes hit the Aurora writer, which is a multi-terabyte beast holding the Community product’s hottest tables. The on-call’s first move was reasonable. Bump the readers two instance classes. It did nothing, because the readers weren’t bottlenecked. They were starved of WAL. A long-running ANALYZE on community_posts was holding write-side locks and choking replication. We killed it. Lag drained in about six minutes.
That afternoon is the reason I have strong opinions about Rails 7 multi-database. Teams adopt it because the Rails guides make it look easy. Then replica lag burns a feature and they decide the pattern is “complex”. It isn’t. It’s a contract you honor at the request boundary, not the query boundary.
Multi-DB buys you three things, in order of how often they matter.
Read scaling. Replicas absorb the cheap, high-volume reads so the writer can keep up with writes plus the expensive reads it can’t avoid. This is the 90% case. Adopt multi-DB for any other reason first and you’re probably doing it wrong.
Failure isolation. A noisy neighbor (a slow analytics query, a backup, a long migration) stays in its lane. Reader degradation doesn’t take down writes.
Horizontal sharding. Different tenants, regions, or workloads on different physical clusters. This is the heavy stuff. You don’t reach for it until your hottest table is misbehaving on a single writer, and even then you want to be sure partitioning isn’t enough.
Notice what’s missing. Multi-DB doesn’t fix slow queries. It doesn’t replace caching. Not a substitute for an archival strategy either. Wire it up expecting a silver bullet and you’ll wonder why your p99 didn’t move.
The Rails 7 shape is simple. One writer, one reader, declared as roles under the same logical database.
# config/database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 25) %>
  prepared_statements: false
  advisory_locks: false
production:
  primary:
    <<: *default
    url: <%= ENV.fetch("DATABASE_URL") %>
  primary_replica:
    <<: *default
    url: <%= ENV.fetch("DATABASE_REPLICA_URL") %>
    replica: true
  community:
    <<: *default
    url: <%= ENV.fetch("COMMUNITY_DATABASE_URL") %>
    migrations_paths: db/community_migrate
  community_replica:
    <<: *default
    url: <%= ENV.fetch("COMMUNITY_DATABASE_REPLICA_URL") %>
    replica: true
replica: true tells Rails not to run migrations against that connection. Obvious in hindsight, easy to forget when copying a config block.
Two ApplicationRecords, one per logical DB. The Community models inherit from a different base so they pin themselves to the right cluster.
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  primary_abstract_class
  connects_to database: { writing: :primary, reading: :primary_replica }
end
# app/models/community_record.rb
class CommunityRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :community, reading: :community_replica }
end
class CommunityPost < CommunityRecord
  belongs_to :community
end
That’s the skeleton. Everything else is about routing reads to the right role at the right moment.
Rails ships an automatic role switcher. It works like this. Any request that isn’t a GET or HEAD runs entirely against the primary, reads included. A GET or HEAD goes to the replica, unless the same session wrote within the delay window, in which case it stays on the primary too. The default resolver uses the request method and a session timestamp.
# config/application.rb
config.active_record.database_selector = { delay: 5.seconds }
config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session
I run this in production. With one caveat that has bitten me twice. The default delay: 2.seconds is a lie for Aurora at our scale. Set it higher than your p99 replica lag, not your p50. I run 5 seconds on our Aurora cluster because under healthy load p99 sits around 1.5 to 2 seconds, and any background work that pushes it briefly past 2 turns into “I just created a post and it disappeared” tickets.
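To make the delay argument concrete, here is a plain-Ruby model of the decision. It is a sketch, not the Rails source: `role_for` and `stale_read?` are names I made up, and it assumes replica lag is constant, so a write becomes visible on the replica exactly `replica_lag` seconds after it lands.

```ruby
# Non-GET/HEAD requests always use the writer. GET/HEAD requests use the
# replica only once `delay` seconds have passed since the session's last write.
def role_for(http_method, last_write_at:, now:, delay:)
  return :writing unless %w[GET HEAD].include?(http_method)
  (now - last_write_at) >= delay ? :reading : :writing
end

# True when a GET served `elapsed` seconds after a write would be routed to a
# replica that has not received that write yet.
def stale_read?(elapsed:, delay:, replica_lag:)
  role = role_for("GET", last_write_at: 0, now: elapsed, delay: delay)
  role == :reading && replica_lag > elapsed
end

# Replica is 3 seconds behind and the user reloads 2.5 seconds after posting.
puts stale_read?(elapsed: 2.5, delay: 2, replica_lag: 3.0) # delay below the lag
puts stale_read?(elapsed: 2.5, delay: 5, replica_lag: 3.0) # delay above the lag
```

With a 2 second delay the reload lands on a replica that is still behind. With 5 seconds it stays on the writer.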
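The other thing the default resolver does badly is handle async work. Background jobs never pass through the middleware. They don’t have a session. They don’t know there was a recent write. Left alone, Rails runs them on the default writing role, so the first thing anyone does to get that read load off the writer is point them at the replica. There they happily read stale data, then write back. I’ve seen exactly that bug ship to production at multiple companies. The fix is to be explicit in jobs.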
# app/jobs/community/digest_email_job.rb
class Community::DigestEmailJob < ApplicationJob
  queue_as :community_low
  def perform(community_id, since:)
    CommunityRecord.connected_to(role: :reading) do
      community = Community.find(community_id)
      # Load the rows here: deliver_later cannot serialize a lazy relation.
      posts = community.posts.where("created_at >= ?", since).limit(50).to_a
      DigestEmailMailer.weekly(community, posts).deliver_later
    end
  end
end
The block makes the intent obvious. Anyone reviewing this PR can see “this job is read-only on the replica”. When you need read-after-write inside a job, switch to :writing for the offending query and switch back.
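The switch-and-switch-back shape is just a nested block. The sketch below uses a stand-in `CommunityRecord` module so it runs outside Rails. In the app, `connected_to` comes from ActiveRecord, and `current_role` here is only a stand-in for inspecting the active role. `roles_used_by_job` is a hypothetical job body.

```ruby
# Stand-in so the sketch runs outside Rails. It only tracks which role is
# active. The real CommunityRecord.connected_to switches the connection.
module CommunityRecord
  @roles = [:writing] # the default role outside any block

  def self.current_role
    @roles.last
  end

  def self.connected_to(role:)
    @roles.push(role)
    yield
  ensure
    @roles.pop
  end
end

# Hypothetical job body: bulk reads on the replica, one read-after-write
# check on the writer, then back to the replica.
def roles_used_by_job
  seen = []
  CommunityRecord.connected_to(role: :reading) do
    seen << CommunityRecord.current_role   # bulk reads
    CommunityRecord.connected_to(role: :writing) do
      seen << CommunityRecord.current_role # the offending query
    end
    seen << CommunityRecord.current_role   # inner block ended, replica again
  end
  seen
end

p roles_used_by_job
```

The inner block ends and the outer role comes back on its own, so there is no manual switching back to forget.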
The default delay window is too coarse for some flows. Onboarding is the classic case. User signs up, we write three rows across two databases, then the next page reads from all three. If any of those reads land on a replica that hasn’t caught up, the user sees a half-rendered profile.
The pattern I land on every time is a per-flow override that pins the next N requests to the writer. Not the whole session. Just the next handful, scoped to the user.
# app/middleware/sticky_writer.rb
class StickyWriter
  COOKIE = "_sticky_writer_until"
  def initialize(app)
    @app = app
  end
  def call(env)
    request = ActionDispatch::Request.new(env)
    deadline = request.cookies[COOKIE]&.to_i
    if deadline && Time.now.to_i < deadline
      ActiveRecord::Base.connected_to(role: :writing) do
        @app.call(env)
      end
    else
      @app.call(env)
    end
  end
end
# In a signup controller:
def create
  user = User.create!(signup_params)
  CommunityProfile.create!(user_id: user.id)
  cookies[StickyWriter::COOKIE] = {
    value: (Time.now.to_i + 15).to_s,
    expires: 15.seconds.from_now, # let the browser drop the cookie when the window closes
    httponly: true,
    secure: true,
  }
  redirect_to onboarding_path(user)
end
15 seconds is enough for our worst-case replication catch-up plus a generous margin. The cookie is short-lived, the override is bounded, and the writer doesn’t get hammered by every request from every user “just in case”.
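One way to keep the override bounded even if someone edits the cookie is to refuse deadlines further out than the app would ever set. This is a sketch of that check as a pure function, not what our middleware shipped with: `pin_to_writer?` and `MAX_STICKY_SECONDS` are hypothetical names.

```ruby
MAX_STICKY_SECONDS = 15 # assumption: the longest window any flow sets

# Returns true when the request should be pinned to the writer.
# cookie_value is the raw cookie string (or nil when the cookie is absent).
def pin_to_writer?(cookie_value, now: Time.now.to_i, max_window: MAX_STICKY_SECONDS)
  # Reject missing or non-numeric values outright.
  return false unless cookie_value.is_a?(String) && cookie_value.match?(/\A\d+\z/)
  deadline = cookie_value.to_i
  # A deadline further out than the app ever sets was not written by the app.
  return false if deadline - now > max_window
  now < deadline
end

now = 1_000
p pin_to_writer?("1010", now: now)   # inside the window
p pin_to_writer?("990", now: now)    # window already closed
p pin_to_writer?("999999", now: now) # tampered far-future deadline
```

Only the first call pins the request. An expired or implausible deadline falls through to normal routing.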
This is the pattern that fixed the Community feed problem after the lag incident. Creating a post set a sticky cookie that pinned that user to the writer for 10 seconds. The reader path was untouched for everything else.
Sharding deserves its own article, but the Rails 7 piece is short. connects_to shards: plus a connected_to(shard: ...) block at the request boundary. The trick is the boundary. You set the shard once, from a tenant id or region, and you never touch it inside a query.
class CommunityRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to shards: {
    default: { writing: :community_us, reading: :community_us_replica },
    eu: { writing: :community_eu, reading: :community_eu_replica },
    apac: { writing: :community_apac, reading: :community_apac_replica },
  }
end
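# app/middleware/shard_by_region.rb
class ShardByRegion
  # Only shards declared on CommunityRecord. Anything else falls back to :default.
  SHARDS = %w[default eu apac].freeze
  def initialize(app); @app = app; end
  def call(env)
    region = env["HTTP_X_REGION"].to_s.downcase
    shard = SHARDS.include?(region) ? region.to_sym : :default
    CommunityRecord.connected_to(shard: shard) do
      @app.call(env)
    end
  end
end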
If you find yourself toggling shards mid-controller, you’ve already lost. The boundary is the contract. Inside the boundary, code should look like a normal Rails app.
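The same boundary works when the key is a tenant id instead of a region. A minimal sketch, assuming a small tenant directory: `TENANT_SHARDS`, its ids, and `shard_for` are all hypothetical, and the shard names match the ones declared on CommunityRecord. The middleware would call `shard_for` once and pass the result to `connected_to(shard: ...)`.

```ruby
KNOWN_SHARDS = %i[default eu apac].freeze

# Hypothetical tenant directory. A real one would live in a lookup table
# or a cache, not a constant.
TENANT_SHARDS = { 101 => :eu, 202 => :apac }.freeze

# Resolve the shard once per request. Unknown tenants land on :default,
# and a directory entry naming an undeclared shard fails loudly.
def shard_for(tenant_id, directory: TENANT_SHARDS)
  shard = directory.fetch(tenant_id, :default)
  raise ArgumentError, "unknown shard #{shard.inspect}" unless KNOWN_SHARDS.include?(shard)
  shard
end

p shard_for(101)
p shard_for(999)
```

Failing loudly on a bad directory entry is a deliberate choice here: silently routing a tenant to the wrong cluster is worse than a 500.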
Late-evening deploy, same platform. We shipped a non-null column to users. The migration used add_column ... null: false, default: false through the strong_migrations gem’s “safer” helper. On Aurora at our row count, it acquired an ACCESS EXCLUSIVE lock and held it for 87 seconds. Login error rate hit 100% for about 85 of those seconds. PagerDuty woke half of California. Rollback wasn’t safe by the time we noticed, so we let it finish. Postmortem rule went in the next day: any add_column with a non-null default against a table with more than ten million rows is blocked by CI.
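A CI rule like that can start out as a text scan over the migration file. This is a sketch of the idea, not the check we actually ran: the regex only understands single-line `add_column` calls, the row counts are made-up sample numbers, and `blocked_add_columns` is a hypothetical name.

```ruby
ROW_LIMIT = 10_000_000

# Made-up row counts for illustration. A real check would read them from a
# stats snapshot of production.
ROW_COUNTS = { "users" => 48_000_000, "feature_flags" => 120 }.freeze

# Captures the table, the column, and the rest of a single-line add_column call.
ADD_COLUMN = /add_column\s+:(\w+)\s*,\s*:(\w+)\s*,(.*)$/

# Returns "table.column" for every add_column that sets null: false with a
# default on a table above the row limit.
def blocked_add_columns(source, row_counts: ROW_COUNTS, limit: ROW_LIMIT)
  source.each_line.filter_map do |line|
    m = ADD_COLUMN.match(line)
    next unless m
    table, column, opts = m[1], m[2], m[3]
    risky = opts.match?(/null:\s*false/) && opts.match?(/default:/)
    "#{table}.#{column}" if risky && row_counts.fetch(table, 0) > limit
  end
end

migration = <<~RUBY
  add_column :users, :verified, :boolean, null: false, default: false
  add_column :users, :nickname, :string
  add_column :feature_flags, :enabled, :boolean, null: false, default: false
RUBY

p blocked_add_columns(migration)
```

Only the change to the large table is flagged. The small table and the nullable column pass.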
Multi-DB makes this rule more important, not less. You now have multiple migration paths (db/migrate, db/community_migrate), multiple writers, and different latency budgets per cluster. The three-step dance for any hot-table change isn’t optional.
# db/community_migrate/0001_add_pinned_to_community_posts.rb
class AddPinnedToCommunityPosts < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!
  def up
    safety_assured do
      add_column :community_posts, :pinned, :boolean, null: true
    end
  end
  def down
    remove_column :community_posts, :pinned
  end
end
# db/community_migrate/0002_backfill_pinned.rb
class BackfillPinned < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!
  def up
    CommunityPost.in_batches(of: 5_000) do |batch|
      batch.where(pinned: nil).update_all(pinned: false)
      sleep 0.1 # pause between batches to pace the write load
    end
  end
  def down
    # Nothing to undo: rolling back 0001 drops the column.
  end
end
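# db/community_migrate/0003_set_pinned_not_null.rb
class SetPinnedNotNull < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!
  def up
    # A bare change_column_null scans the whole table under ACCESS EXCLUSIVE.
    # On PostgreSQL 12+, a validated CHECK lets SET NOT NULL skip that scan.
    add_check_constraint :community_posts, "pinned IS NOT NULL", name: "pinned_not_null", validate: false
    validate_check_constraint :community_posts, name: "pinned_not_null"
    change_column_null :community_posts, :pinned, false
    remove_check_constraint :community_posts, name: "pinned_not_null"
  end
end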
Three migrations, three deploys. Yes, that’s slower. It’s also why login still works.
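How much slower is easy to estimate for the backfill step. This is back-of-envelope arithmetic, not a measurement: it counts only the pauses between batches, and the 10 million row figure is an illustrative assumption borrowed from the CI threshold, not the real size of community_posts.

```ruby
# Lower bound on backfill wall-clock time: the pauses alone, ignoring the
# time each UPDATE takes.
def backfill_floor_seconds(rows:, batch_size: 5_000, pause: 0.1)
  batches = (rows.to_f / batch_size).ceil
  (batches * pause).round(1)
end

# 10 million rows at 5,000 per batch is 2,000 batches, so 200 seconds of pauses.
puts backfill_floor_seconds(rows: 10_000_000)
```

A few minutes of pacing is the price of never holding the writer long enough for replication to fall behind.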
To run migrations against a specific database in Rails 7:
bin/rails db:migrate:community
bin/rails db:rollback:community STEP=1
That’s it. The same task runner with a per-DB suffix. The friction is small. The expensive mistake is letting any migration touch a hot table without the dance.
Size the database_selector delay against your p99 replica lag, not your p50.
Be explicit about the role in background jobs with connected_to.
strong_migrations defaults are safer than raw ActiveRecord, not safe.
Thanks for reading. If you’ve got thoughts, send them my way.