Schema-per-tenant versus row-based isolation in Rails, default_scope traps that leak data, and the tenant-aware cache keys that survived a real SaaS.
The day I lost faith in schema-per-tenant was a Thursday at the London agency I was a founding engineer at. I was building the agency’s flagship SaaS end to end. We had a few hundred client tenants on Postgres, one Postgres schema each, and a migration to add a single boolean column to a hot table. The script set search_path per tenant, ran the migration, moved on. Halfway through, one tenant’s schema came back with a different column type than the rest. We’d shipped a patch the night before that nudged the migration text, redeployed, and the orchestrator picked up the new file mid-run. Half the tenants got the old schema. Half got the new one. Same table name. Same column name. Different types.
That’s the part of multi-tenancy nobody warns you about. A schema-per-tenant model isn’t N copies of one database. It’s N databases that drift.
I’ve since defaulted to row-based isolation in every Rails monolith I’ve shipped. There are good reasons to pick schema-per-tenant. Most apps don’t have them.
Schema-per-tenant gives you bulletproof isolation at the cost of operational pain. Every migration runs N times. Every connection pool has to pin search_path. Backups are per-tenant. Reporting across tenants needs a UNION ALL across schemas. The bigger you get, the more this hurts.
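Drift like that is at least detectable. Here's a minimal sketch, assuming you've already pulled each tenant schema's column types for one table into a hash (for example from information_schema.columns). The drifted_schemas helper and the sample data are made up for illustration, not something we ran at the agency.

```ruby
# Hypothetical input: { schema_name => { column_name => column_type } }.
# Returns the schemas whose layout differs from the most common one.
def drifted_schemas(columns_by_schema)
  # Treat the most frequent layout as the reference.
  reference, _count = columns_by_schema.values.tally.max_by { |_layout, count| count }
  columns_by_schema.reject { |_schema, layout| layout == reference }.keys
end

layouts = {
  "tenant_a" => { "id" => "bigint", "archived" => "boolean" },
  "tenant_b" => { "id" => "bigint", "archived" => "boolean" },
  "tenant_c" => { "id" => "bigint", "archived" => "text" }
}

puts drifted_schemas(layouts).inspect
```

Run something like this after every migration pass and a half-applied rollout shows up as a list of schema names instead of a Thursday.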
Row-based tenancy stores tenant_id on every tenant-scoped table. One schema, one connection pool, one migration. The trade is that isolation lives in your code, not in the database. Forget a scope, leak data. This is the part everyone gets wrong.
There’s a third option, database-per-tenant, but unless you have regulatory pressure to physically separate customer data, it’s the same tradeoff as schema-per-tenant with worse ergonomics. Skip it.
My default: row-based with Postgres Row-Level Security as a backstop. Code does the filtering. RLS catches the bugs.
The mistake I see most is resolving the tenant in five different places. Subdomain in the controller. JWT claim in the Sidekiq job. ENV var in the rake task. Three of them disagree on the edge case.
Resolve it once, at the request edge, and stash it in ActiveSupport::CurrentAttributes. Then every model, every job, every cache key reads from that one place.
# app/models/current.rb
class Current < ActiveSupport::CurrentAttributes
  attribute :tenant, :user, :request_id

  def tenant_id
    tenant&.id
  end
end
# app/controllers/concerns/tenant_resolution.rb
module TenantResolution
  extend ActiveSupport::Concern

  included do
    before_action :resolve_tenant
  end

  private

  # TenantNotFound is an app-defined error class (not shown here).
  def resolve_tenant
    subdomain = request.subdomain.presence
    raise TenantNotFound, "missing subdomain" if subdomain.nil?

    Current.tenant = Tenant.find_by!(slug: subdomain)
    Current.request_id = request.request_id
  rescue ActiveRecord::RecordNotFound
    raise TenantNotFound, "unknown tenant: #{subdomain}"
  end
end
For Sidekiq, use the same pattern: a server middleware reads the tenant id from the job payload and hydrates Current before perform runs. Don’t trust the job to remember.
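# config/initializers/sidekiq.rb
# Server side: hydrate Current from the job payload before perform runs.
class TenantSidekiqMiddleware
  def call(_worker, job, _queue)
    Current.tenant = Tenant.find(job["tenant_id"]) if job["tenant_id"]
    Current.request_id = job["request_id"]
    yield
  ensure
    # Always clear per-thread state so the next job starts clean.
    Current.reset_all
  end
end

# Client side: stamp the tenant and request ids onto every enqueued job.
class TenantClientMiddleware
  def call(_worker, job, _queue, _redis_pool)
    job["tenant_id"] ||= Current.tenant_id
    job["request_id"] ||= Current.request_id
    yield
  end
end

# Web processes enqueue through the client config, so register it there too.
Sidekiq.configure_client do |config|
  config.client_middleware { |chain| chain.add(TenantClientMiddleware) }
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(TenantSidekiqMiddleware) }
  # Jobs that enqueue other jobs go through the server's client chain.
  config.client_middleware { |chain| chain.add(TenantClientMiddleware) }
end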
The ensure is the part that bites you if you skip it. Forget to reset and the next job on that thread runs as the previous tenant. Found that one the slow way.
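If the failure mode sounds abstract, here's a toy reproduction in plain Ruby. FakeCurrent is a stand-in I made up for this sketch; it keeps per-thread state the way CurrentAttributes does, but it is not the real class, and run_jobs only imitates a worker thread processing jobs back to back.

```ruby
# Stand-in for per-thread request state (not ActiveSupport::CurrentAttributes).
module FakeCurrent
  def self.tenant_id
    Thread.current[:fake_tenant_id]
  end

  def self.tenant_id=(value)
    Thread.current[:fake_tenant_id] = value
  end

  def self.reset
    Thread.current[:fake_tenant_id] = nil
  end
end

# Runs jobs one after another on the same thread and returns the tenant id
# each job body would have seen.
def run_jobs(jobs, reset_after_each:)
  jobs.map do |job|
    begin
      FakeCurrent.tenant_id = job[:tenant_id] if job[:tenant_id]
      FakeCurrent.tenant_id
    ensure
      FakeCurrent.reset if reset_after_each
    end
  end
end

jobs = [{ tenant_id: 1 }, {}] # the second job carries no tenant id

leaky = run_jobs(jobs, reset_after_each: false)
FakeCurrent.reset
safe = run_jobs(jobs, reset_after_each: true)

puts "without reset: #{leaky.inspect}"
puts "with reset:    #{safe.inspect}"
```

Without the reset, the second job inherits tenant 1. With it, the second job sees nil, which is the state you want a tenantless job to fail on.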
The first time I row-tenanted a Rails app, I reached for default_scope. It seemed obvious. Every read filtered by tenant_id, automatically, forever. What could go wrong.
What went wrong: Post.unscoped.update_all(status: "archived"). Run in a rake task, by someone who didn’t know what default scopes were doing under the hood. Archived every post for every tenant. Found it in production. Took a backup restore on the affected rows to claw it back.
default_scope also leaks into associations. tenant.posts.where(status: "draft") silently applies the tenant filter twice, which is harmless until someone joins through tenant and the optimizer gets confused. New records also inherit attribute values from the scope’s where conditions, which is genuinely weird behavior to debug at 2 a.m.
Use an explicit scope and call it from a base class. Make it noisy. Make it impossible to forget.
# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.tenanted!
    before_validation :assign_tenant!
    validates :tenant_id, presence: true

    scope :for_current_tenant, -> {
      raise "Current.tenant missing" if Current.tenant_id.nil?
      where(tenant_id: Current.tenant_id)
    }
  end

  private

  def assign_tenant!
    self.tenant_id ||= Current.tenant_id
  end
end

# app/models/post.rb
class Post < ApplicationRecord
  tenanted!
  belongs_to :tenant
end
for_current_tenant is verbose on purpose. If you read a model in a controller and don’t see .for_current_tenant in the chain, that’s a code review flag. We had a RuboCop cop on it. Caught two bugs in the first month.
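The backstop is RLS. Postgres enforces it at the row level no matter what ActiveRecord does. Even unscoped can’t escape it, because the session role doesn’t have permission to see other tenants’ rows. That only holds if the app connects as a role that is neither a superuser nor marked BYPASSRLS. FORCE is what extends the policy to the table owner.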
ALTER TABLE posts ENABLE ROW LEVEL SECURITY;
ALTER TABLE posts FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON posts
  USING (tenant_id = NULLIF(current_setting('app.tenant_id', true), '')::bigint);
Set app.tenant_id in the same place you set Current.tenant. The second argument to current_setting makes a missing setting come back as NULL instead of raising, and NULLIF does the same for an empty one. So if you forget, queries return zero rows instead of every row. Failing closed is the right default.
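Setting it can be as small as one wrapper. This is a sketch, not the code we shipped: with_tenant_setting and RecordingConnection are names I'm inventing here, and the only assumption about the connection is that it responds to execute(sql), the way an ActiveRecord connection does. set_config and RESET are standard Postgres.

```ruby
# Sets app.tenant_id for the duration of the block, then clears it so the
# pooled connection does not carry the tenant into the next checkout.
def with_tenant_setting(connection, tenant_id)
  # Integer() raises on nil or junk, so nothing unchecked is interpolated.
  id = Integer(tenant_id)
  begin
    connection.execute("SELECT set_config('app.tenant_id', '#{id}', false)")
    yield
  ensure
    connection.execute("RESET app.tenant_id")
  end
end

# Tiny fake so the sketch runs without a database.
class RecordingConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def execute(sql)
    @statements << sql
  end
end

conn = RecordingConnection.new
with_tenant_setting(conn, 42) { conn.execute("SELECT * FROM posts") }
puts conn.statements
```

The false in set_config makes the setting session-wide, which is why the ensure matters. If you sit behind a transaction-pooling proxy, passing true inside a transaction is probably the safer variant.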
Cache keys are part of your public API. Treat any change to them like a schema migration. In Rails, this means every Russian-doll cache key includes the tenant.
# app/models/concerns/tenanted_cacheable.rb
module TenantedCacheable
  extend ActiveSupport::Concern

  def cache_key_with_version
    "tenants/#{tenant_id}/#{super}"
  end
end

class Post < ApplicationRecord
  tenanted!
  include TenantedCacheable
end
Fragment caches in views get the same prefix. Page caches too. If you skip this, a request for tenant A can serve a fragment generated for tenant B. The leak is silent because the data looks right, just for the wrong tenant.
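For keys that aren't tied to a single record, the same rule fits in a few lines. This is a sketch in plain Ruby; tenant_cache_key is a hypothetical helper name, and in an app the tenant id would come from Current.

```ruby
# Builds a cache key that always starts with the tenant.
# Raises instead of silently producing a key shared across tenants.
def tenant_cache_key(tenant_id, *parts)
  raise ArgumentError, "tenant_id missing for cache key" if tenant_id.nil?

  ["tenants", tenant_id, *parts].join("/")
end

puts tenant_cache_key(7, "posts", "sidebar")
```

Raising on a nil tenant is the point. A key without a tenant prefix is exactly the key that serves tenant B's fragment to tenant A.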
I keep saying row-based migrations are easier than schema-per-tenant, and they are, but easier is not safe. Late one evening at the creator economy platform I worked at, I was on a deploy that added a non-null column to a hot table with hundreds of millions of rows. I’d reviewed the migration that morning and called it safe. We used strong_migrations and the add_column_with_default helper, which is the “safer than raw ActiveRecord” path.
The migration grabbed an ACCESS EXCLUSIVE lock on the table while backfilling. On Aurora at our row count, that was 87 seconds of blocked writes. Login error rate hit 100% for 85 seconds. PagerDuty woke half the senior engineers in California. First instinct was rollback, but Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. We let it finish. Login recovered fifteen seconds after the lock released.
The fix in the postmortem was the three-step dance everyone who’s been bitten by Aurora knows: add the column nullable, backfill in batches, then flip nullability. We also added a strong_migrations rule blocking any add_column with a non-null default against tables with more than ten million rows. CI fails the migration before it ever reaches a database.
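The rule itself is a small predicate. A sketch of the shape, with assumptions flagged: LARGE_TABLES is a hypothetical hand-maintained list (that's what lets the check run without a database), and risky_add_column? is a name I made up. strong_migrations has a custom check hook, StrongMigrations.add_check, which is where I'd expect something like this to be wired in; that part is left as a comment because it needs the gem.

```ruby
# Hypothetical list of tables past the row-count threshold, kept by hand.
LARGE_TABLES = %w[posts events].freeze

# Mirrors the arguments of add_column(table, column, type, **options).
# True when the call adds a NOT NULL column with a default to a large table.
def risky_add_column?(table, options = {})
  LARGE_TABLES.include?(table.to_s) &&
    options[:null] == false &&
    options.key?(:default)
end

# Possible wiring (assumes the strong_migrations gem is loaded):
# StrongMigrations.add_check do |method, args|
#   options = args.last.is_a?(Hash) ? args.last : {}
#   if method == :add_column && risky_add_column?(args[0], options)
#     stop! "Add the column nullable, backfill in batches, then flip nullability."
#   end
# end

puts risky_add_column?(:posts, null: false, default: false)
puts risky_add_column?(:posts, null: true)
```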
In a row-tenant app this dance gets worse because every backfill is implicitly cross-tenant. You can batch by tenant_id to make the lock blast radius smaller, and you should, but the migration still has to be safe at the table level.
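class AddArchivedAtToPosts < ActiveRecord::Migration[7.1]
  # Run outside one big transaction so each batch commits on its own.
  disable_ddl_transaction!

  def up
    # Step 1: add the column nullable.
    add_column :posts, :archived_at, :timestamptz, null: true

    # Step 2: backfill tenant by tenant, in small batches.
    # With FORCE ROW LEVEL SECURITY on posts, the migration role must be
    # allowed to see every tenant's rows, or the backfill cannot touch them.
    Tenant.find_each do |tenant|
      Post.where(tenant_id: tenant.id, archived_at: nil)
          .in_batches(of: 1_000) do |batch|
        batch.update_all(archived_at: Time.current)
        sleep 0.1
      end
    end

    # Step 3: flip nullability once every row has a value.
    change_column_null :posts, :archived_at, false
  end

  def down
    remove_column :posts, :archived_at
  end
end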
The sleep 0.1 looks dumb. It’s not. On Aurora with reader replicas, hammering the writer in a tight loop generates WAL faster than the readers can apply it. Replica lag climbs. The hottest read path on the platform starts serving stale data. I’ve watched that exact pattern open a Datadog alert on a Tuesday morning, with Community feeds going from 120 ms p99 to over 8 seconds in four minutes. The on-call’s first move was to bump reader instance class up two tiers. Lag didn’t move. The readers weren’t bottlenecked, they were starved of WAL. Real fix was finding a long-running ANALYZE on a hot table holding write-side locks, killing it, and shipping a maintenance-window helper that refuses to run between peak hours. Now I sleep in my backfills. The readers thank me.
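Stripped of ActiveRecord, the throttle is just a loop with a pause in it. Here's a sketch in plain Ruby. throttled_batches is a hypothetical helper, and lag_seconds is a callable you'd have to supply yourself, since how you measure replica lag depends on your database and monitoring.

```ruby
# Yields ids in batches, pausing after each one. While the reported replica
# lag stays above max_lag, it keeps pausing before starting the next batch.
def throttled_batches(ids, batch_size:, pause:, max_lag: 1.0, lag_seconds: -> { 0.0 })
  ids.each_slice(batch_size) do |batch|
    yield batch
    sleep pause
    # Back off further while the replicas are behind.
    sleep pause while lag_seconds.call > max_lag
  end
end

throttled_batches((1..5).to_a, batch_size: 2, pause: 0.01) do |batch|
  puts "backfilling #{batch.inspect}"
end
```

The fixed pause is the floor. The lag check is what keeps a backfill from making a bad morning worse.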
Resolve the tenant once into CurrentAttributes, hydrate it from Sidekiq middleware.
Skip default_scope for tenancy. Use an explicit named scope. Make forgetting it a code review failure.
Thanks for reading. If you’ve got thoughts, send them my way.