Multi-Tenancy in a Rails Monolith

Schema-per-tenant versus row-based isolation in Rails, default_scope traps that leak data, and the tenant-aware cache keys that survived a real SaaS.

The day I lost faith in schema-per-tenant was a Thursday at the London agency where I was a founding engineer, building the agency’s flagship SaaS end to end. We had a few hundred client tenants, one Postgres schema each, and a migration to add a single boolean column to a hot table. The script set search_path per tenant, ran the migration, moved on. Halfway through, one tenant’s schema came back with a different column type than the ones before it. We’d shipped a patch the night before that nudged the migration text, redeployed, and the orchestrator picked up the new file mid-run. Half the tenants got the old schema. Half got the new one. Same table name. Same column name. Different types.

That’s the part of multi-tenancy nobody warns you about. A schema-per-tenant model isn’t N copies of one database. It’s N databases that drift.

I’ve since defaulted to row-based isolation in every Rails monolith I’ve shipped. There are good reasons to pick schema-per-tenant. Most apps don’t have them.

The two real choices

Schema-per-tenant gives you strong isolation at the cost of operational pain. Every migration runs N times. Every connection has to pin search_path when it leaves the pool. Backups are per-tenant. Reporting across tenants needs a UNION ALL across schemas. The bigger you get, the more this hurts.
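When every migration runs N times, drift is cheap to detect if you go looking for it. A minimal sketch, assuming you can pull one row per tenant schema from information_schema.columns for the table and column you care about. DRIFT_QUERY and drifted_schemas are hypothetical names for illustration, and the sample rows are made up.

```ruby
# Assumed query: one row per tenant schema for a given table and column.
DRIFT_QUERY = <<~SQL
  SELECT table_schema, data_type
  FROM information_schema.columns
  WHERE table_name = $1 AND column_name = $2
SQL

# Groups schemas by the column type they report.
# Returns {} when every schema agrees, otherwise { type => [schemas] }.
def drifted_schemas(rows)
  by_type = rows.group_by { |row| row[:data_type] }
  return {} if by_type.size <= 1

  by_type.transform_values { |group| group.map { |row| row[:table_schema] }.sort }
end

# Illustrative rows, shaped like the query result.
rows = [
  { table_schema: "tenant_a", data_type: "boolean" },
  { table_schema: "tenant_b", data_type: "boolean" },
  { table_schema: "tenant_c", data_type: "integer" }
]

report = drifted_schemas(rows)
# report maps each type to the schemas that have it
```

Run something like this after every multi-schema migration and fail the deploy when the report is not empty.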

Row-based tenancy stores tenant_id on every tenant-scoped table. One schema, one connection pool, one migration. The trade is that isolation lives in your code, not in the database. Forget a scope, leak data. This is the part everyone gets wrong.

There’s a third option, database-per-tenant, but unless you have regulatory pressure to physically separate customer data, it’s the same tradeoff as schema-per-tenant with worse ergonomics. Skip it.

My default: row-based with Postgres Row-Level Security as a backstop. Code does the filtering. RLS catches the bugs.

Resolve the tenant once

The mistake I see most is resolving the tenant in several different places. Subdomain in the controller. JWT claim in the Sidekiq job. ENV var in the rake task. Sooner or later they disagree on an edge case.

Resolve it once, at the request edge, and stash it in ActiveSupport::CurrentAttributes. Then every model, every job, every cache key reads from that one place.

# app/models/current.rb
class Current < ActiveSupport::CurrentAttributes
  attribute :tenant, :user, :request_id

  def tenant_id
    tenant&.id
  end
end

# app/controllers/concerns/tenant_resolution.rb
module TenantResolution
  extend ActiveSupport::Concern

  included do
    before_action :resolve_tenant
  end

  private

  def resolve_tenant
    subdomain = request.subdomain.presence
    raise TenantNotFound, "missing subdomain" if subdomain.nil?

    Current.tenant = Tenant.find_by!(slug: subdomain)
    Current.request_id = request.request_id
  rescue ActiveRecord::RecordNotFound
    raise TenantNotFound, "unknown tenant: #{subdomain}"
  end
end

For Sidekiq, use the same pattern: a middleware reads the tenant id from the job payload and hydrates Current before perform runs. Don’t trust the job to remember.

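# config/initializers/sidekiq.rb

# Server side: hydrate Current from the job payload before perform runs.
class TenantSidekiqMiddleware
  def call(_worker, job, _queue)
    Current.tenant = Tenant.find(job["tenant_id"]) if job["tenant_id"]
    Current.request_id = job["request_id"]
    yield
  ensure
    Current.reset_all
  end
end

# Client side: stamp the enqueuing context onto the job payload.
class TenantSidekiqClientMiddleware
  def call(_worker, job, _queue, _redis_pool)
    job["tenant_id"] ||= Current.tenant_id
    job["request_id"] ||= Current.request_id
    yield
  end
end

# Web and console processes only enqueue, so they need the client middleware too.
Sidekiq.configure_client do |config|
  config.client_middleware { |chain| chain.add(TenantSidekiqClientMiddleware) }
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(TenantSidekiqMiddleware) }
  # Jobs that enqueue other jobs run the client chain inside the server process.
  config.client_middleware { |chain| chain.add(TenantSidekiqClientMiddleware) }
end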

The ensure is the part that bites you if you skip it. Forget to reset and the next job on that thread runs as the previous tenant. Found that one the slow way.
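Rake tasks and console sessions have no middleware to do the reset for them, so a block helper is the easiest way to get the same guarantee. In Rails, Current.set(tenant: tenant) { ... } already behaves this way. The sketch below is a plain-Ruby stand-in so it runs without Rails; TenantContext and with_tenant are hypothetical names, not part of the app above.

```ruby
# Plain-Ruby stand-in for Current, for illustration only.
module TenantContext
  KEY = :tenant_context_tenant_id

  def self.tenant_id
    Thread.current[KEY]
  end

  # Runs the block as the given tenant, then restores whatever was set before,
  # even if the block raises.
  def self.with_tenant(tenant_id)
    previous = Thread.current[KEY]
    Thread.current[KEY] = tenant_id
    yield
  ensure
    Thread.current[KEY] = previous
  end
end

inside = TenantContext.with_tenant(42) { TenantContext.tenant_id }
after = TenantContext.tenant_id
# inside is 42, after is nil: nothing leaks past the block
```

The ensure here does the same job as the one in the Sidekiq middleware: the next piece of work on the thread starts with no tenant instead of somebody else’s.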

default_scope is a footgun

The first time I row-tenanted a Rails app, I reached for default_scope. It seemed obvious. Every read filtered by tenant_id, automatically, forever. What could go wrong.

What went wrong: Post.unscoped.update_all(status: "archived"). Run in a rake task, by someone who didn’t know what default scopes were doing under the hood. Archived every post for every tenant. Found it in production. Took a backup restore on the affected rows to claw it back.

default_scope also leaks into associations. tenant.posts.where(status: "draft") will silently double-apply the tenant filter, which is fine until someone joins through tenant and the optimizer gets confused. Inserts get the default values from the scope, which is genuinely weird behavior to debug at 2 a.m.

Use an explicit scope and call it from a base class. Make it noisy. Make it impossible to forget.

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.tenanted!
    before_validation :assign_tenant!
    validates :tenant_id, presence: true
    scope :for_current_tenant, -> {
      raise "Current.tenant missing" if Current.tenant_id.nil?
      where(tenant_id: Current.tenant_id)
    }
  end

  private

  def assign_tenant!
    self.tenant_id ||= Current.tenant_id
  end
end

# app/models/post.rb
class Post < ApplicationRecord
  tenanted!
  belongs_to :tenant
end

for_current_tenant is verbose on purpose. If you read a model in a controller and don’t see .for_current_tenant in the chain, that’s a code review flag. We had a RuboCop cop on it. Caught two bugs in the first month.

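The backstop is RLS. Postgres enforces it at the row level no matter what ActiveRecord does. Even unscoped can’t escape it, because the policy filters rows inside the database before ActiveRecord ever sees them. One caveat: the app has to connect as a role that is not a superuser and does not have BYPASSRLS, since both skip policies entirely.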

ALTER TABLE posts ENABLE ROW LEVEL SECURITY;
ALTER TABLE posts FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON posts
  USING (tenant_id = NULLIF(current_setting('app.tenant_id', true), '')::bigint);

Set app.tenant_id in the same place you set Current.tenant. The second argument to current_setting makes a missing setting return NULL instead of raising an error, so if you forget, queries return zero rows instead of every row. Failing closed is the right default.
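How you set the variable matters with pooled connections: a plain SET sticks to the connection after the request ends, so the next checkout inherits the previous tenant. A minimal sketch of the transaction-scoped alternative, assuming a connection object that behaves like PG::Connection (transaction and exec_params). with_rls_tenant is a hypothetical helper name; set_config is the standard Postgres function, and its third argument makes the value local to the current transaction.

```ruby
# Hypothetical helper: run a block with app.tenant_id set for one transaction.
# `conn` is assumed to respond to transaction and exec_params like PG::Connection.
def with_rls_tenant(conn, tenant_id)
  raise ArgumentError, "tenant_id required" if tenant_id.nil?

  conn.transaction do |tx|
    # Third argument true = transaction-local, the same effect as SET LOCAL.
    tx.exec_params("SELECT set_config('app.tenant_id', $1, true)", [tenant_id.to_s])
    yield tx
  end
end

# Usage, assuming an open connection:
#   with_rls_tenant(conn, 42) { |tx| tx.exec_params("SELECT count(*) FROM posts", []) }
```

Because the setting disappears at commit or rollback, a forgotten reset cannot carry one tenant’s id into the next request on the same connection.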

Cache keys are tenant scope too

Cache keys are part of your public API. Treat any change to them like a schema migration. In Rails, this means every Russian-doll cache key includes the tenant.

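# app/models/concerns/tenanted_cacheable.rb
module TenantedCacheable
  extend ActiveSupport::Concern

  # Override cache_key rather than cache_key_with_version: with cache
  # versioning on, the cache store builds keys from cache_key and
  # cache_version, and cache_key_with_version calls cache_key internally.
  def cache_key
    "tenants/#{tenant_id}/#{super}"
  end
end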

class Post < ApplicationRecord
  tenanted!
  include TenantedCacheable
end

Fragment caches in views get the same prefix. Page caches too. If you skip this, a request for tenant A can serve a fragment generated for tenant B. The leak is silent because the data looks right, just for the wrong tenant.
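The keys most likely to leak are the ones not built from a record at all, like a dashboard summary keyed by a plain string. A minimal sketch of a helper for those, assuming you pass the tenant id in explicitly; tenant_cache_key is a hypothetical name. It raises rather than falling back to a shared key when there is no tenant.

```ruby
# Hypothetical helper for cache keys that are not ActiveRecord objects.
# Refuses to build a key without a tenant, so a missing tenant fails loudly.
def tenant_cache_key(tenant_id, *parts)
  raise ArgumentError, "tenant_id required for cache key" if tenant_id.nil?
  raise ArgumentError, "cache key parts required" if parts.empty?

  ["tenants", tenant_id, *parts].join("/")
end

key = tenant_cache_key(42, "dashboard", "stats")
# => "tenants/42/dashboard/stats"
```

In the app this would wrap calls such as Rails.cache.fetch(tenant_cache_key(Current.tenant_id, "dashboard", "stats")) { ... }, so the tenant prefix is never something a caller has to remember.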

Migrations at tenant scale

I keep saying row-based migrations are easier than schema-per-tenant, and they are, but easier is not safe. Late one evening at the creator economy platform I worked at, I was on a deploy that added a non-null column to a hot table with hundreds of millions of rows. I’d reviewed the migration that morning and called it safe. We used strong_migrations and the add_column_with_default helper, which is the “safer than raw ActiveRecord” path.

The migration grabbed an ACCESS EXCLUSIVE lock on the table while backfilling. On Aurora at our row count, that was 87 seconds of blocked writes. Login error rate hit 100% for 85 seconds. PagerDuty woke half the senior engineers in California. First instinct was rollback, but Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. We let it finish. Login recovered fifteen seconds after the lock released.

The fix in the postmortem was the three-step dance everyone who’s been bitten by Aurora knows: add the column nullable, backfill in batches, then flip nullability. We also added a strong_migrations rule blocking any add_column with a non-null default against tables with more than ten million rows. CI fails the migration before it ever reaches a database.

In a row-tenant app this dance gets worse because every backfill is implicitly cross-tenant. You can batch by tenant_id to make the lock blast radius smaller, and you should, but the migration still has to be safe at the table level.

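class AddArchivedAtToPosts < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    # Step 1: nullable column, a metadata-only change.
    add_column :posts, :archived_at, :timestamptz, null: true
    Post.reset_column_information

    # Step 2: backfill per tenant in small batches, pausing so replicas keep up.
    Tenant.find_each do |tenant|
      Post.where(tenant_id: tenant.id, archived_at: nil)
          .in_batches(of: 1_000) do |batch|
        batch.update_all(archived_at: Time.current)
        sleep 0.1
      end
    end

    # Step 3: flip nullability. A bare change_column_null scans the whole table
    # under ACCESS EXCLUSIVE; a validated check constraint lets Postgres 12+
    # skip that scan.
    add_check_constraint :posts, "archived_at IS NOT NULL",
                         name: "posts_archived_at_not_null", validate: false
    validate_check_constraint :posts, name: "posts_archived_at_not_null"
    change_column_null :posts, :archived_at, false
    remove_check_constraint :posts, name: "posts_archived_at_not_null"
  end

  def down
    remove_column :posts, :archived_at
  end
end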

The sleep 0.1 looks dumb. It’s not. On Aurora with reader replicas, hammering the writer in a tight loop generates WAL faster than the readers can apply it. Replica lag climbs. The hottest read path on the platform starts serving stale data. I’ve watched that exact pattern open a Datadog alert on a Tuesday morning, with Community feeds going from 120 ms p99 to over 8 seconds in four minutes. The on-call’s first move was to bump reader instance class up two tiers. Lag didn’t move. The readers weren’t bottlenecked, they were starved of WAL. Real fix was finding a long-running ANALYZE on a hot table holding write-side locks, killing it, and shipping a maintenance-window helper that refuses to run during peak hours. Now I sleep in my backfills. The readers thank me.
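A fixed sleep is the blunt version of this. A slightly smarter loop keeps pausing while the replicas are behind. The sketch below assumes you have some way to measure replica lag in seconds and pass it in as lag_probe. How you measure it is up to your setup, and each_throttled_batch and every parameter name are hypothetical.

```ruby
# Hypothetical throttle: yields each batch of ids, pauses, then keeps pausing
# (up to max_waits times) while the caller-supplied probe reports too much lag.
def each_throttled_batch(ids, batch_size:, pause:, max_lag:, lag_probe:,
                         max_waits: 50, sleeper: ->(seconds) { sleep(seconds) })
  ids.each_slice(batch_size) do |batch|
    yield batch
    sleeper.call(pause)

    waits = 0
    while lag_probe.call > max_lag && waits < max_waits
      sleeper.call(pause)
      waits += 1
    end
  end
end

# Usage inside a backfill, assuming replica_lag_seconds is defined elsewhere:
#   each_throttled_batch(post_ids, batch_size: 1_000, pause: 0.1, max_lag: 1.0,
#                        lag_probe: -> { replica_lag_seconds }) do |batch|
#     Post.where(id: batch).update_all(archived_at: Time.current)
#   end
```

The max_waits cap is a deliberate choice: if lag never recovers, the backfill slows to a crawl instead of hanging forever, and you find out from the migration’s runtime rather than from a stuck deploy.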

Takeaways

Default to row-based tenancy with RLS as a backstop. Schema-per-tenant only when regulatory isolation demands it.
Resolve the tenant once at the edge, store it in CurrentAttributes, hydrate it from Sidekiq middleware.
Never default_scope for tenancy. Use an explicit named scope. Make forgetting it a code review failure.
Tenant id is part of every cache key. Russian-doll, fragment, page, all of them.
Migrations on row-tenant tables are still schema migrations. Three-step dances, batched backfills, sleeps in the loop.
The database is the last line of defense. RLS catches the bug your code review missed.

Thanks for reading. If you’ve got thoughts, send them my way.