Multi-Database Architecture in Rails 7

How I run Rails 7 against Aurora writers, reader replicas, and sharded clusters in production. Role switching, replica-lag stickiness, and migrations that don't take down login.

OK so picture a Tuesday morning, 10:14 a.m. PT, at the creator-economy platform I spent the last few years at. Datadog’s AuroraReplicaLagMaximum > 60s for 2m alert went off. The Community feeds were the loudest victim. /communities/:id/posts p99 went from about 120 ms to over 8 seconds inside four minutes. By the time I joined the war room, the lag was already 14 minutes and still climbing. I wasn’t on-call that week, I was floating across a couple of squads, but the Community team pulled me in because I owned the Aurora layer.

Reads were served from three reader replicas behind a custom routing layer. Writes hit the Aurora writer, which is a multi-terabyte beast holding the Community product’s hottest tables. The on-call’s first move was reasonable. Bump the readers two instance classes. It did nothing, because the readers weren’t bottlenecked. They were starved of WAL. A long-running ANALYZE on community_posts was holding write-side locks and choking replication. We killed it. Lag drained in about six minutes.

That afternoon is the reason I have strong opinions about Rails 7 multi-database. Teams adopt it because the Rails guides make it look easy. Then replica lag burns a feature and they decide the pattern is “complex”. It isn’t. It’s a contract you honor at the request boundary, not the query boundary.

What multi-DB actually buys you

Three things, in order of how often they matter.

Read scaling. Replicas absorb the cheap, high-volume reads so the writer can keep up with writes plus the expensive reads it can’t avoid. This is the 90% case. Adopt multi-DB for any other reason first and you’re probably doing it wrong.

Failure isolation. A noisy neighbor (a slow analytics query, a backup, a long migration) stays in its lane. Reader degradation doesn’t take down writes.

Horizontal sharding. Different tenants, regions, or workloads on different physical clusters. This is the heavy stuff. You don’t reach for it until your hottest table is misbehaving on a single writer, and even then you want to be sure partitioning isn’t enough.

Notice what’s missing. Multi-DB doesn’t fix slow queries. It doesn’t replace caching. Not a substitute for an archival strategy either. Wire it up expecting a silver bullet and you’ll wonder why your p99 didn’t move.

Two databases in database.yml

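The Rails 7 shape is simple. One writer and one reader per logical database, declared as separate entries with the reader marked as a replica. The config below does that twice, once for primary and once for community.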

# config/database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 25) %>
  prepared_statements: false
  advisory_locks: false

production:
  primary:
    <<: *default
    url: <%= ENV.fetch("DATABASE_URL") %>
  primary_replica:
    <<: *default
    url: <%= ENV.fetch("DATABASE_REPLICA_URL") %>
    replica: true
  community:
    <<: *default
    url: <%= ENV.fetch("COMMUNITY_DATABASE_URL") %>
    migrations_paths: db/community_migrate
  community_replica:
    <<: *default
    url: <%= ENV.fetch("COMMUNITY_DATABASE_REPLICA_URL") %>
    replica: true

replica: true tells Rails not to run migrations against that connection. Obvious in hindsight, easy to forget when copying a config block.

Two ApplicationRecords, one per logical DB. The Community models inherit from a different base so they pin themselves to the right cluster.

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  primary_abstract_class

  connects_to database: { writing: :primary, reading: :primary_replica }
end

# app/models/community_record.rb
class CommunityRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: { writing: :community, reading: :community_replica }
end

class CommunityPost < CommunityRecord
  belongs_to :community
end

That’s the skeleton. Everything else is about routing reads to the right role at the right moment.

Automatic role switching, with caveats

Rails ships an automatic role switcher. It works like this. GET and HEAD requests read from the replica, unless the same session wrote within the delay window, in which case they stay on the primary. Every other request goes to the primary and stamps the session with the time of the write. The default resolver uses the request method and a session timestamp.

# config/application.rb
config.active_record.database_selector = { delay: 5.seconds }
config.active_record.database_resolver = ActiveRecord::Middleware::DatabaseSelector::Resolver
config.active_record.database_resolver_context = ActiveRecord::Middleware::DatabaseSelector::Resolver::Session
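
To make the routing rule concrete, here is a minimal plain-Ruby model of the decision. It is a simplification and not the Rails source: role_for, READ_METHODS, and the parameter names are made up for this sketch, and the real middleware keeps the last-write timestamp in the session.

```ruby
# Simplified model of the role decision, NOT the Rails implementation.
# role_for, READ_METHODS and the parameter names are illustrative assumptions.
READ_METHODS = %w[GET HEAD].freeze

def role_for(http_method:, last_write_at:, now:, delay:)
  # Anything that is not a read request goes to the writer.
  return :writing unless READ_METHODS.include?(http_method)

  # A read that follows a recent write stays on the writer until the
  # delay window has elapsed, so users can read their own writes.
  return :writing if last_write_at && (now - last_write_at) < delay

  :reading
end

now = Time.at(1_000)
role_for(http_method: "POST", last_write_at: nil, now: now, delay: 5)     # => :writing
role_for(http_method: "GET", last_write_at: now - 2, now: now, delay: 5)  # => :writing
role_for(http_method: "GET", last_write_at: now - 9, now: now, delay: 5)  # => :reading
```

The only tunable is delay, which is why it has to be sized against real replica lag.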

I run this in production. With one caveat that has bitten me twice. The default delay: 2.seconds is a lie for Aurora at our scale. Set it higher than your p99 replica lag, not your p50. I run 5 seconds on our Aurora cluster because under healthy load p99 sits around 1.5 to 2 seconds, and any background work that pushes it briefly past 2 turns into “I just created a post and it disappeared” tickets.
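
As a rough illustration of sizing the delay from p99 rather than p50, the sketch below computes nearest-rank percentiles over a set of lag samples. The samples are made up, and the safety factor and floor are hypothetical policy knobs, not a Rails feature or a recommendation from the Rails guides.

```ruby
# Nearest-rank percentile over replica-lag samples, in seconds.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical policy: p99 lag times a safety factor, never below the
# 2-second Rails default, rounded up to whole seconds.
def selector_delay(samples, safety_factor: 2.5, floor: 2.0)
  [percentile(samples, 99) * safety_factor, floor].max.ceil
end

# Made-up samples: mostly healthy, with a slow tail.
lag_samples = Array.new(95, 0.2) + [1.6, 1.8, 1.9, 2.0, 2.4]

percentile(lag_samples, 50) # => 0.2
percentile(lag_samples, 99) # => 2.0
selector_delay(lag_samples) # => 5
```

A delay sized from the p50 of these samples would look fine on a dashboard and still fail every request in the tail.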

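The other thing the default resolver does badly is handle async work. Background jobs never pass through the middleware, so they have no session and no idea there was a recent write. Left alone, a job runs in the default writing role and every read it makes lands on the writer. Wrap it in a blanket reading block to take that load off and it happily hits the replica, reads stale data, then writes back. I’ve seen exactly that bug ship to production at multiple companies. The fix is to be explicit in jobs.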

# app/jobs/community/digest_email_job.rb
class Community::DigestEmailJob < ApplicationJob
  queue_as :community_low

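  def perform(community_id, since:)
    CommunityRecord.connected_to(role: :reading) do
      community = Community.find(community_id)
      # to_a runs the query inside the reading block. A lazy relation would
      # load later, outside it, and Active Job cannot serialize a relation.
      posts = community.posts.where("created_at >= ?", since).limit(50).to_a

      DigestEmailMailer.weekly(community, posts).deliver_later
    end
  end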
end

The block makes the intent obvious. Anyone reviewing this PR can see “this job is read-only on the replica”. When you need read-after-write inside a job, switch to :writing for the offending query and switch back.

Sticky sessions for read-after-write

The default delay window is too coarse for some flows. Onboarding is the classic case. User signs up, we write three rows across two databases, then the next page reads all three rows back. If any of those reads lands on a replica that hasn’t caught up, the user sees a half-rendered profile.

The pattern I land on every time is a per-flow override that pins the user’s next few requests to the writer. Not the whole session. Just the handful that land inside a short time window, scoped to the user.

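# app/middleware/sticky_writer.rb
# Register it inside the automatic selector so this block is the innermost
# role switch and wins:
#   config.middleware.insert_after ActiveRecord::Middleware::DatabaseSelector, StickyWriter
class StickyWriter
  COOKIE = "_sticky_writer_until"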

  def initialize(app)
    @app = app
  end

  def call(env)
    request = ActionDispatch::Request.new(env)
    deadline = request.cookies[COOKIE]&.to_i

    if deadline && Time.now.to_i < deadline
      ActiveRecord::Base.connected_to(role: :writing) do
        @app.call(env)
      end
    else
      @app.call(env)
    end
  end
end

# In a signup controller:
def create
  user = User.create!(signup_params)
  CommunityProfile.create!(user_id: user.id)

  cookies[StickyWriter::COOKIE] = {
    value: (Time.now.to_i + 15).to_s,
    expires: 15.seconds.from_now, # let the browser drop it with the window
    httponly: true,
    secure: true,
  }

  redirect_to onboarding_path(user)
end

15 seconds is enough for our worst-case replication catch-up plus a generous margin. The cookie is short-lived, the override is bounded, and the writer doesn’t get hammered by every request from every user “just in case”.
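
One thing worth guarding against: the cookie is plain, so a client can edit it. The sketch below is a hypothetical hardening of the deadline check, not part of the middleware shown above. sticky_active? and MAX_WINDOW are names invented for the example. It ignores garbage values and any deadline further out than the app could have issued.

```ruby
# Hypothetical helper: decide whether a sticky-writer cookie should be honored.
# MAX_WINDOW and sticky_active? are illustrative names, not part of Rails.
MAX_WINDOW = 15 # seconds; the longest pin the app itself ever issues

def sticky_active?(raw_cookie, now: Time.now.to_i, max_window: MAX_WINDOW)
  # Integer(..., exception: false) returns nil for garbage instead of 0.
  deadline = Integer(raw_cookie.to_s, 10, exception: false)
  return false unless deadline

  # Honor only deadlines in the future and no further out than the app
  # could have set, so an edited value cannot pin the writer indefinitely.
  deadline > now && deadline <= now + max_window
end

sticky_active?("1010", now: 1_000)      # => true, inside the window
sticky_active?("990", now: 1_000)       # => false, already expired
sticky_active?("9999999", now: 1_000)   # => false, beyond MAX_WINDOW
sticky_active?("not-a-time", now: 1_000) # => false, unparseable
```

A signed cookie would achieve the same protection. The bounded check has the advantage of working in Rack middleware without touching the Rails cookie jar.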

This is the pattern that fixed the Community feed problem after the lag incident. Creating a post set a sticky cookie that pinned the author’s requests to the writer for 10 seconds. The reader path was untouched for everything else.

Horizontal sharding without the regret

Sharding deserves its own article, but the Rails 7 piece is short. connects_to shards: plus a connected_to(shard: ...) block at the request boundary. The trick is the boundary. You set the shard once, from a tenant id or region, and you never touch it inside a query.

class CommunityRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to shards: {
    default:   { writing: :community_us, reading: :community_us_replica },
    eu:        { writing: :community_eu, reading: :community_eu_replica },
    apac:      { writing: :community_apac, reading: :community_apac_replica },
  }
end

# app/middleware/shard_by_region.rb
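class ShardByRegion
  # Only shards declared in CommunityRecord.connects_to are valid targets.
  SHARDS = %i[default eu apac].freeze

  def initialize(app); @app = app; end

  def call(env)
    # The header is client-controlled: never symbolize it blindly, and fall
    # back to the default shard for anything unrecognized.
    requested = env["HTTP_X_REGION"].to_s.downcase
    shard = SHARDS.find { |name| name.to_s == requested } || :default
    CommunityRecord.connected_to(shard: shard) do
      @app.call(env)
    end
  end
end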

If you find yourself toggling shards mid-controller, you’ve already lost. The boundary is the contract. Inside the boundary, code should look like a normal Rails app.

Migrations that don’t take down login

Late-evening deploy, same platform. We shipped a non-null column to users. The migration used add_column ... null: false, default: false through the strong_migrations gem’s “safer” helper. On Aurora at our row count, it acquired an ACCESS EXCLUSIVE lock and held it for 87 seconds. Login error rate hit 100% for about 85 of those seconds. PagerDuty woke half of California. Rollback wasn’t safe by the time we noticed, so we let it finish. Postmortem rule went in the next day: any add_column with a non-null default against a table with more than ten million rows is blocked by CI.
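
A rough sketch of how such a CI rule could look, in plain Ruby. This is an illustration under stated assumptions, not the check that actually went into CI: LARGE_TABLES is a hand-maintained list standing in for real row counts, and the regex only understands single-line add_column calls.

```ruby
# Hypothetical CI guard; the table list and regex are assumptions for this sketch.
LARGE_TABLES = %w[users community_posts].freeze # tables assumed to exceed ~10M rows

ADD_COLUMN = /add_column\s*\(?\s*:(?<table>\w+)\s*,(?<rest>[^\n]*)/

# Returns the offending lines from one migration file's source text.
def risky_add_columns(migration_source)
  migration_source.each_line.filter_map do |line|
    match = ADD_COLUMN.match(line)
    next unless match && LARGE_TABLES.include?(match[:table])

    rest = match[:rest]
    # Flag only the combination the rule bans: NOT NULL together with a default.
    line.strip if rest.match?(/null:\s*false/) && rest.match?(/default:/)
  end
end

source = <<~RUBY
  add_column :users, :verified, :boolean, null: false, default: false
  add_column :community_posts, :pinned, :boolean, null: true
RUBY

risky_add_columns(source)
# => ["add_column :users, :verified, :boolean, null: false, default: false"]
```

In a pipeline this would run over every changed file under each migrations path and fail the build when the result is non-empty.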

Multi-DB makes this rule more important, not less. You now have multiple migration paths (db/migrate, db/community_migrate), multiple writers, and different latency budgets per cluster. The three-step dance for any hot-table change isn’t optional.

# db/community_migrate/0001_add_pinned_to_community_posts.rb
class AddPinnedToCommunityPosts < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

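  def up
    safety_assured do
      # Nullable column first: a metadata-only change, no table rewrite.
      add_column :community_posts, :pinned, :boolean, null: true
      # The default applies to new rows only, so nothing inserted after
      # this deploy needs backfilling.
      change_column_default :community_posts, :pinned, from: nil, to: false
    end
  end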

  def down
    remove_column :community_posts, :pinned
  end
end

# db/community_migrate/0002_backfill_pinned.rb
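class BackfillPinned < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    CommunityPost.in_batches(of: 5_000) do |batch|
      batch.where(pinned: nil).update_all(pinned: false)
      sleep 0.1 # give the replicas room to catch up between batches
    end
  end

  def down
    # Nothing to undo: the backfilled values go away with the column.
  end
end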

# db/community_migrate/0003_set_pinned_not_null.rb
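class SetPinnedNotNull < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    # A bare change_column_null scans the whole table under ACCESS EXCLUSIVE.
    # Assuming PostgreSQL 12 or newer, a validated CHECK constraint lets it
    # skip that scan.
    add_check_constraint :community_posts, "pinned IS NOT NULL", name: "community_posts_pinned_not_null", validate: false
    validate_check_constraint :community_posts, name: "community_posts_pinned_not_null"
    safety_assured { change_column_null :community_posts, :pinned, false }
    remove_check_constraint :community_posts, name: "community_posts_pinned_not_null"
  end

  def down
    change_column_null :community_posts, :pinned, true
  end
end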

Three migrations, three deploys. Yes, that’s slower. It’s also why login still works.
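
To put a number on “slower”, here is a back-of-the-envelope sketch of the backfill’s pacing. The batch size and pause come from the migration above. The row count and per-batch UPDATE time are made-up inputs, and backfill_floor_seconds is a name invented for the example.

```ruby
# Lower bound on backfill wall-clock time from batch pacing alone.
# per_batch is a guess at how long one UPDATE takes; measure your own.
def backfill_floor_seconds(rows:, batch_size: 5_000, pause: 0.1, per_batch: 0.0)
  batches = (rows.to_f / batch_size).ceil
  batches * (pause + per_batch)
end

# 10 million rows means 2,000 batches:
backfill_floor_seconds(rows: 10_000_000)                 # roughly 200 s of sleep alone
backfill_floor_seconds(rows: 10_000_000, per_batch: 0.4) # roughly 1,000 s with a guessed 0.4 s per UPDATE
```

Minutes of throttled background work is the price. The alternative is measured in seconds of failed logins.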

To run migrations against a specific database in Rails 7:

bin/rails db:migrate:community
bin/rails db:rollback:community STEP=1

That’s it. The same task runner with a per-DB suffix. The friction is small. The expensive mistake is letting any migration touch a hot table without the dance.

Takeaways

Multi-DB is read scaling first, isolation second, sharding distant third. Don’t reach for it for the wrong reason.
The role switch contract lives at the request boundary. Set it once. Don’t toggle inside queries.
Set the database_selector delay against your p99 replica lag, not your p50.
Background jobs don’t get the automatic switcher. Be explicit with connected_to.
Use a sticky-writer cookie for read-after-write flows. Bound it in seconds, not minutes.
Every hot-table migration is a three-step dance. strong_migrations defaults are safer than raw ActiveRecord, not safe.