PostgreSQL Row-Level Security for Multi-Tenancy

How I treat Postgres RLS as a backstop behind tenant-aware queries: session-var context, policy types, performance, and isolation tests in CI.

It was a Wednesday hackathon at the creator economy platform I worked at. I had a tenant-id missing from exactly one WHERE clause in a prototype community feed, and the query happily returned posts from a different tenant. The prototype never went to production. But that’s all it takes, one missing predicate, one junior dev refactor, one hot-fix that forgets to thread tenant_id through. And the database doesn’t care, it gives you what you ask for.

RLS is the answer to that exact failure mode. It’s not the answer to “how do I do multi-tenancy”. The answer to that is your ORM, your service boundary, your code review process. RLS sits behind all of that and catches the bug none of those things caught.

Why RLS belongs in defense in depth

I own the Aurora layer for the community product at the creator economy platform I spent the last few years at. Hundreds of millions of rows on a multi-terabyte Aurora writer. Every read path is tenant-scoped. Every write path is tenant-scoped. And I still turn RLS on, because I’ve seen what happens when you trust application code to remember things forever.

Here’s the smallest version of turning it on:

ALTER TABLE community_posts ENABLE ROW LEVEL SECURITY;
ALTER TABLE community_posts FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON community_posts
  FOR ALL
  TO application_role
  USING (tenant_id = current_setting('app.tenant_id', true)::uuid)
  WITH CHECK (tenant_id = current_setting('app.tenant_id', true)::uuid);

That FORCE matters. Without it the table owner bypasses RLS, and the table owner is usually whoever runs your migrations. You want the rule to apply even to your own service role. Otherwise you’ve got a backstop that only catches mistakes made by other people.

Setting tenant context with session vars

The policy reads current_setting('app.tenant_id'). Something has to set that. The pattern that’s held up for me is: bind it once at the start of every request, on the same connection that’s going to do the work, inside a transaction.

import { Pool, PoolClient } from "pg";
import { Request, Response, NextFunction } from "express";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function withTenantContext(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const tenantId = req.auth?.tenantId;
  if (!tenantId) {
    return res.status(401).json({ error: "missing tenant" });
  }

  const client: PoolClient = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SET LOCAL app.tenant_id = $1", [tenantId]);
    (req as any).db = client;
    res.on("finish", async () => {
      try {
        await client.query("COMMIT");
      } catch {
        await client.query("ROLLBACK").catch(() => {});
      } finally {
        client.release();
      }
    });
    next();
  } catch (err) {
    await client.query("ROLLBACK").catch(() => {});
    client.release();
    next(err);
  }
}

A few things I learned the hard way here. SET LOCAL is scoped to the transaction, so if your query runs outside that transaction the tenant context is gone and the policy will return nothing, which looks like a bug but is actually the policy doing its job. Don’t use a pooler in transaction-pooling mode in front of this without thinking about it. PgBouncer with pool_mode = transaction works fine because the session vars live with the transaction. pool_mode = session is also fine. pool_mode = statement will quietly break you.

On Rails I do the same thing with an around_action that calls connection.execute("SET LOCAL app.tenant_id = #{quoted}") inside transaction do. Same shape, different syntax.

Policy types in practice

The two knobs that bit me most are PERMISSIVE vs RESTRICTIVE, and USING vs WITH CHECK.

Default policies are PERMISSIVE, which means multiple policies on the same table get OR’d together. If you have a permissive policy for tenants and another permissive policy for admins, a row is visible if either passes. That’s usually what you want for reads. Sometimes it isn’t.

RESTRICTIVE policies get AND’d in. If you need an extra gate that must always hold, even when other policies would allow access, write it as restrictive:

CREATE POLICY tenant_isolation ON community_posts
  AS PERMISSIVE
  FOR ALL
  TO application_role
  USING (tenant_id = current_setting('app.tenant_id', true)::uuid)
  WITH CHECK (tenant_id = current_setting('app.tenant_id', true)::uuid);

CREATE POLICY no_deleted_tenants ON community_posts
  AS RESTRICTIVE
  FOR ALL
  TO application_role
  USING (
    NOT EXISTS (
      SELECT 1 FROM tenants t
      WHERE t.id = community_posts.tenant_id
        AND t.deleted_at IS NOT NULL
    )
  );

USING filters what rows the policy returns on reads, and what rows existing-row operations (UPDATE, DELETE) can touch. WITH CHECK validates what rows you’re inserting or updating into the table. If you only write USING, an UPDATE that flips tenant_id to a different tenant will succeed quietly, because the post-update row is never checked. Always write both. Always. I’ve seen people get this wrong in code review and I’ve been the person getting it wrong.

Performance impact and how I measure it

RLS adds a predicate to every plan, and the planner has to fold it into the rest of the query. Most of the time it’s cheap. The cases where it bites are joins, partial indexes that don’t match the tenant predicate, and partitioned tables where partition pruning depends on the planner seeing the literal early enough.

The first thing I do on any new policy is run EXPLAIN ANALYZE against the real query, with and without the policy applied, on a real-sized table:

SET app.tenant_id = '01J3...';
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, body, created_at
FROM community_posts
WHERE created_at > now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 50;

If the plan changed from an index scan to a sequential scan, the tenant predicate is fighting your index. The fix is almost always a composite index that leads with tenant_id:

CREATE INDEX CONCURRENTLY idx_posts_tenant_created
  ON community_posts (tenant_id, created_at DESC);

Production lesson. One Tuesday morning, Aurora reader replica lag fired on the community read path, with p99 climbing from around 120 ms to over 8 seconds. First instinct on the on-call thread was to bump reader instance class. Didn’t help. The actual cause was a long-running ANALYZE on the hottest table holding write-side locks, starving WAL emission to the readers. Killed the analyze, replica lag drained inside six minutes. The lesson translates straight across to RLS work: when a policy change looks like it broke performance, check the writer side first, not the readers. pg_stat_activity on the writer is the first thing you open.

Isolation testing in CI

If you don’t have a test that says “tenant A cannot read tenant B”, you don’t have multi-tenancy. You have hopes.

I run a Postgres in CI on every PR and exercise the policy directly. The docker-compose excerpt I keep around:

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ci
      POSTGRES_DB: app_test
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 2s
      timeout: 2s
      retries: 30
    tmpfs:
      - /var/lib/postgresql/data

And the test that actually catches leaks:

import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function asTenant<T>(tenantId: string, fn: (c: any) => Promise<T>) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SET LOCAL ROLE application_role");
    await client.query("SET LOCAL app.tenant_id = $1", [tenantId]);
    return await fn(client);
  } finally {
    await client.query("ROLLBACK").catch(() => {});
    client.release();
  }
}

test("tenant cannot see another tenant's posts", async () => {
  const a = "11111111-1111-1111-1111-111111111111";
  const b = "22222222-2222-2222-2222-222222222222";

  await asTenant(a, (c) =>
    c.query("INSERT INTO community_posts (id, tenant_id, body) VALUES (gen_random_uuid(), $1, 'hello A')", [a])
  );

  const rows = await asTenant(b, async (c) => {
    const r = await c.query("SELECT id FROM community_posts WHERE body = 'hello A'");
    return r.rows;
  });

  expect(rows).toHaveLength(0);
});

I’d run a few of these. Read leak, write leak (try INSERTing with a wrong tenant_id and assert it fails), update leak (try to UPDATE tenant_id and assert it fails or returns zero rows). The whole suite runs in seconds and it’s the single highest-value test file in the repo.

War story, the policy migration that locked the table

Late-evening deploy at the creator economy platform, past midnight UTC but only around 6 p.m. Pacific. The change was supposed to be safe. Enable RLS on a hot user-adjacent table, add the tenant policy. I’d reviewed it that morning and ack’d it. Strong migrations gem in the loop. We’d done it ten times on quieter tables.

Around 18:04 Pacific the migration kicked off. Login error rate hit 100% for about 85 seconds. PagerDuty woke half the senior engineers in California. The migration was taking an ACCESS EXCLUSIVE lock on a table with hundreds of millions of rows while it set up the policy and validated it. Aurora at our size means that lock blocks writes for as long as it holds.

First instinct, of course, was to roll back. But the operation was past the point where a clean rollback was an option. We let it finish. It took 87 seconds. Locks released. Login recovered within fifteen seconds because the dependent service had a tight retry loop.

The postmortem fix wasn’t fancy. Split the change into pieces that take cheaper locks. Enable the row level security flag in one tiny migration. Create the policy as NOT VALID-style in a second migration (actually for RLS the pattern is: create the policy in one txn, validate against existing rows in a separate batched job, then enable FORCE last). Run the heavy validation as a background task, not in the deploy window. Add a strong_migrations check that flags any RLS migration against tables over a configured row count threshold and forces a comment from the author explaining the lock plan.

The lesson is the same lesson Aurora teaches you every time. On a hot table at scale, every DDL is a three-step dance. RLS migrations are DDL.

When RLS is the wrong tool

Cross-tenant analytics. Admin dashboards. Anything that genuinely needs to read across tenants. Don’t fight it. Make a separate role:

CREATE ROLE analytics_reader WITH LOGIN;
ALTER ROLE analytics_reader BYPASSRLS;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_reader;

That role goes to the BI tool, the ETL job, the support console. Production application traffic never uses it. Auditing on this role is critical, so log every connection.

The other place RLS gets in your way is migrations themselves. Most migration runners connect as a privileged role that owns the schema. Make sure that role either has BYPASSRLS or that your policies don’t apply to it. Otherwise your migration tries to backfill a column and gets zero rows back, and you spend an hour wondering why.

Takeaways

RLS is your second line. Keep tenant-aware queries in your ORM as the first.
Bind tenant context in middleware inside a transaction. Don’t trust app code to remember.
Always write both USING and WITH CHECK. UPDATE without WITH CHECK is how tenants migrate sideways.
Run EXPLAIN ANALYZE on policy-affected queries before you ship; lead your indexes with tenant_id.
Write cross-tenant leak tests in CI. If you don’t, you’ll learn about a leak from a customer.
Treat policy enablement on a hot table as a multi-step migration, not a one-liner.

Thanks for reading. If you’ve got thoughts, send them my way.