Multi-Tenant SaaS with NestJS

Tenant detection, AsyncLocalStorage context, and schema vs row isolation in NestJS. The cross-tenant leak that taught me to test isolation as a first-class feature.

The first time I saw a cross-tenant data leak in production, it wasn’t even our fault. Not really. The code looked correct. The query had a where. The user was authenticated. The role was right. But the cache key in front of it didn’t know about tenants, and for about fifteen minutes one customer’s dashboard was rendering snippets from another customer’s data. Different company. Different industry. Both real.

This was on a logistics-sector hiring platform I was CTO on the side for. NestJS on the backend, Postgres, AWS. Multi-tenant by design. I had assumed isolation. That’s the part that still bothers me.

This is how I build multi-tenant SaaS on NestJS now. Where the tenant comes from, how it travels through a request, and the choice between schema-per-tenant and row-per-tenant that I keep getting asked about.

Where the tenant comes from

Two sources, in this order. Subdomain first. JWT second. Never the request body, never a query param, never a header the client controls without verification.

A subdomain is a strong signal because DNS routed the request before your app saw it. A JWT is a strong signal because you signed it. Anything else is a guess dressed up as a constraint.

import { Injectable, NestMiddleware, UnauthorizedException } from '@nestjs/common';
import { Request, Response, NextFunction } from 'express';
import { JwtService } from '@nestjs/jwt';
import { TenantRegistry } from './tenant.registry';

@Injectable()
export class TenantResolverMiddleware implements NestMiddleware {
  constructor(
    private readonly jwt: JwtService,
    private readonly tenants: TenantRegistry,
  ) {}

  async use(req: Request, _res: Response, next: NextFunction): Promise<void> {
    const host = req.hostname;
    const sub = host.split('.')[0];
    const fromSubdomain = await this.tenants.findBySlug(sub);

    const auth = req.headers.authorization?.replace(/^Bearer\s+/i, '');
    const fromToken = auth ? await this.tryVerify(auth) : null;

    if (fromSubdomain && fromToken && fromSubdomain.id !== fromToken.tenantId) {
      // someone is on acme.app.com with a token for globex. refuse.
      throw new UnauthorizedException('tenant_mismatch');
    }

    const tenant = fromSubdomain ?? (fromToken ? await this.tenants.findById(fromToken.tenantId) : null);
    if (!tenant) throw new UnauthorizedException('tenant_required');

    (req as any).tenant = tenant;
    next();
  }

  private async tryVerify(token: string) {
    try { return await this.jwt.verifyAsync<{ tenantId: string }>(token); }
    catch { return null; }
  }
}

The mismatch check earns its keep. I’ve seen real bugs where a session cookie outlived a subdomain change after a tenant rename. The middleware refuses the request instead of guessing. Refusing is cheap. Guessing is the bug.
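One soft spot in the middleware is host.split('.')[0]. It treats the first label as a slug no matter what follows, so the apex domain and www both turn into tenant lookups. A stricter extractor might look like the sketch below. The base domain, the reserved list, and the function name are my assumptions for illustration, not part of the setup described here.

```typescript
// Hypothetical values for illustration; adjust to your own domain layout.
const BASE_DOMAIN = 'app.com';
const RESERVED = new Set(['www', 'api', 'admin']);

// Returns the tenant slug for "<slug>.app.com", or null for anything else.
function tenantSlugFromHost(hostname: string): string | null {
  const host = hostname.toLowerCase().replace(/\.$/, ''); // drop a trailing dot
  const suffix = `.${BASE_DOMAIN}`;
  if (!host.endsWith(suffix)) return null; // apex domain or a foreign host

  const label = host.slice(0, -suffix.length);
  if (label.includes('.')) return null; // nested subdomain: refuse, don't guess
  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(label)) return null;
  return RESERVED.has(label) ? null : label;
}
```

A null here means "no subdomain signal", and the middleware falls through to the JWT exactly as before.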

Threading tenant through the request

The middleware sets req.tenant and that’s fine for controllers. It’s not fine for everything else. Repositories, listeners, queue workers, scheduled jobs, NestJS lifecycle hooks. The moment your code path leaves the controller, req is gone.

AsyncLocalStorage is the right primitive. NestJS doesn’t ship with a wrapper but it’s a few lines.

import { AsyncLocalStorage } from 'node:async_hooks';
import { Injectable } from '@nestjs/common';

type TenantStore = { tenantId: string; tenantSlug: string };

@Injectable()
export class TenantContext {
  private readonly als = new AsyncLocalStorage<TenantStore>();

  run<T>(store: TenantStore, fn: () => Promise<T>): Promise<T> {
    return this.als.run(store, fn);
  }

  current(): TenantStore {
    const s = this.als.getStore();
    if (!s) throw new Error('tenant_context_missing');
    return s;
  }
}

I deliberately throw when the store is missing. Returning null and trusting downstream code to handle it is how you end up with cross-tenant queries. Hard failure in development, hard failure in tests, hard failure on a worker. You want it loud.
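To see why this works, here is a small standalone sketch with no NestJS involved. Two overlapping "requests" each await some fake I/O and still read their own store afterwards. The function names are made up for the demo.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

type TenantStore = { tenantId: string; tenantSlug: string };
const als = new AsyncLocalStorage<TenantStore>();

function currentTenant(): TenantStore {
  const store = als.getStore();
  if (!store) throw new Error('tenant_context_missing');
  return store;
}

// Stand-in for a repository method several awaits away from the controller.
async function loadInvoices(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulate I/O
  return `invoices for ${currentTenant().tenantSlug}`;
}

async function demo(): Promise<string[]> {
  // Two overlapping "requests"; each one only ever sees its own store.
  return Promise.all([
    als.run({ tenantId: 't1', tenantSlug: 'acme' }, loadInvoices),
    als.run({ tenantId: 't2', tenantSlug: 'globex' }, loadInvoices),
  ]);
}
```

Calling currentTenant() outside of any run(...) throws, which is the loud failure I want.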

Request-scoped providers do this too, technically. I avoid them. The scope bubbles up the injection chain, so every provider that depends on one gets re-instantiated per request, and the overhead is measurable once you’re past a few thousand RPS. ALS is close to free by comparison.

The interceptor that wraps every request:

import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common';
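import { Observable, from, lastValueFrom } from 'rxjs';
import { TenantContext } from './tenant.context';

@Injectable()
export class TenantContextInterceptor implements NestInterceptor {
  constructor(private readonly ctx: TenantContext) {}

  intercept(ec: ExecutionContext, next: CallHandler): Observable<unknown> {
    const req = ec.switchToHttp().getRequest();
    const tenant = req.tenant;
    // Subscribe to the handler inside ctx.run so everything it awaits sees the store.
    // lastValueFrom replaces the deprecated toPromise(); defaultValue covers handlers that emit nothing.
    // Assumes a single-value response; a streaming handler (SSE) needs a different wrapper.
    return from(
      this.ctx.run({ tenantId: tenant.id, tenantSlug: tenant.slug }, () =>
        lastValueFrom(next.handle(), { defaultValue: undefined }),
      ),
    );
  }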
}

For BullMQ workers and @Cron jobs, the producer attaches tenantId to the job payload, the consumer wraps the handler in ctx.run(...) before doing anything else. Same primitive, different entry point.
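A sketch of the consumer side, under a few assumptions. The job type only models the data field, which is what a BullMQ processor function receives. withTenant, TenantJob, and TenantLookup are hypothetical names, and the module-level AsyncLocalStorage stands in for the injected TenantContext to keep the example self-contained.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

type TenantStore = { tenantId: string; tenantSlug: string };
const als = new AsyncLocalStorage<TenantStore>();

// Hypothetical job shape: only `data` matters here.
type TenantJob<P> = { data: P & { tenantId: string } };
// Hypothetical lookup; in this article's terms it would be TenantRegistry.findById.
type TenantLookup = (id: string) => Promise<{ id: string; slug: string } | null>;

function withTenant<P, R>(
  lookup: TenantLookup,
  handler: (job: TenantJob<P>) => Promise<R>,
): (job: TenantJob<P>) => Promise<R> {
  return async (job) => {
    const tenant = await lookup(job.data.tenantId);
    // Same rule as the HTTP path: refuse rather than guess.
    if (!tenant) throw new Error('tenant_required');
    return als.run({ tenantId: tenant.id, tenantSlug: tenant.slug }, () => handler(job));
  };
}
```

The wrapped function is what you would hand to the worker as its processor. A @Cron job has no producer, so it would loop over tenants and call the same run(...) once per tenant.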

Schema or row

The honest answer is “row, until you can’t.” Schema-per-tenant sounds clean. It is not free.

Row-based means one schema, every tenant table has a tenant_id column, every query filters by it. Pros: one migration, one connection pool, one set of indexes to tune. Cons: nothing stops a buggy query from forgetting the filter.

Schema-based means CREATE SCHEMA tenant_acme and a new copy of every table inside it. Pros: blast radius on a bad query stops at one tenant. Cons: migrations are now N parallel jobs, connection pooling fragments, pg_class bloats, and onboarding latency goes from “INSERT” to “wait for DDL plus seed.”

I run row-based by default. I add a layer that makes forgetting the tenant filter impossible. With TypeORM that’s a global subscriber or a query builder wrapper. With Prisma it’s the client extension API.

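import { PrismaClient } from '@prisma/client';
import { TenantContext } from './tenant.context';

// Every operation that takes a `where`. Writes by unique key (update, delete) need the filter as much as reads do.
// create and upsert carry tenantId in `data` instead and have to be handled separately.
const SCOPED = new Set([
  'findMany', 'findFirst', 'findFirstOrThrow', 'findUnique', 'findUniqueOrThrow',
  'count', 'aggregate', 'groupBy', 'update', 'updateMany', 'delete', 'deleteMany',
]);

export function tenantScopedClient(base: PrismaClient, ctx: TenantContext) {
  return base.$extends({
    query: {
      $allModels: {
        async $allOperations({ args, query, operation }) {
          if (!SCOPED.has(operation)) return query(args);

          const { tenantId } = ctx.current();
          const scopedArgs = args as { where?: Record<string, unknown> };
          // Spread instead of wrapping in AND: unique-key operations need the unique field
          // at the top level of `where`. Extra filters next to it require Prisma 5+.
          scopedArgs.where = { ...scopedArgs.where, tenantId };
          return query(args);
        },
      },
    },
  });
}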

That extension is the only way the rest of the codebase talks to Postgres. Application code calls prisma.user.findMany({ where: { active: true } }), the extension silently ANDs in tenantId. There is no escape hatch in normal code paths. The one place you can break out is in a clearly named admin client that takes a tenant ID as a function argument. That client lives in one folder. Code review treats every change in that folder like a security review, because it is.
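The merge itself is worth pulling out into a pure function, because then you can unit-test it without a database. A minimal sketch follows; scopeWhere is a hypothetical name. It spreads the caller's filter and writes tenantId last, which is one way to do the merge. Top-level keys in a Prisma where are ANDed together, so the effect is the same as an explicit AND, with the added property that a caller-supplied tenantId cannot win.

```typescript
type Where = Record<string, unknown>;

// Force the tenant filter onto a caller-supplied `where`.
// tenantId is written last, so a value passed in by the caller is overwritten.
function scopeWhere(where: Where | undefined, tenantId: string): Where {
  return { ...(where ?? {}), tenantId };
}

// What application code passes in versus what reaches the database.
const fromCaller = { active: true, tenantId: 'someone-else' };
const sentToDb = scopeWhere(fromCaller, 'acme-id');
```

Here sentToDb.tenantId is 'acme-id' and the active filter is preserved.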

Testing isolation as a feature

This is the part most teams skip. They write tests for “user can see their own data.” They don’t write tests for “user cannot see anyone else’s data.”

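import { INestApplication } from '@nestjs/common';
import { Test } from '@nestjs/testing';
import { AppModule } from '../src/app.module';
import * as request from 'supertest';

describe('tenant isolation', () => {
  let app: INestApplication;
  beforeAll(async () => {
    const mod = await Test.createTestingModule({ imports: [AppModule] }).compile();
    app = mod.createNestApplication();
    await app.init();
  });

  // Close the app so open handles don't keep the test runner alive.
  afterAll(async () => {
    await app.close();
  });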

  it("does not return another tenant's users", async () => {
    // signFor is a test helper that mints a JWT for the given tenant slug and user.
    const acmeToken = await signFor('acme', 'admin@acme.example');
    const globexToken = await signFor('globex', 'admin@globex.example');

    await request(app.getHttpServer())
      .post('/users').set('Authorization', `Bearer ${globexToken}`)
      .send({ email: 'new.hire@globex.example' }).expect(201);

    const res = await request(app.getHttpServer())
      .get('/users').set('Authorization', `Bearer ${acmeToken}`).expect(200);

    expect(res.body.find((u: { email: string }) => u.email === 'new.hire@globex.example')).toBeUndefined();
  });
});

We run this kind of test on every endpoint that returns or mutates tenant-scoped data. It catches the regression that nothing else catches. The kind where someone refactors a repository method and removes the implicit filter because they’re using the raw client by mistake.
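Looking for one known email proves that one row stayed put. A broader assertion is that every row in the response belongs to the caller. A small helper makes that cheap to add to each endpoint test. This sketch assumes the response rows expose a tenantId field, which may not hold for your API, and foreignRows is a hypothetical name.

```typescript
type TenantRow = { tenantId: string };

// Returns every row that does not belong to the expected tenant.
// An empty result is the pass condition.
function foreignRows<T extends TenantRow>(rows: T[], expectedTenantId: string): T[] {
  return rows.filter((row) => row.tenantId !== expectedTenantId);
}

// Example response body with one leaked row.
const body = [
  { id: 1, tenantId: 'acme-id' },
  { id: 2, tenantId: 'globex-id' },
];
const leaked = foreignRows(body, 'acme-id');
```

In a Jest test the check becomes expect(foreignRows(res.body, acmeTenantId)).toEqual([]), and a failure prints the offending rows.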

The schema migration that almost cost us

At a creator-economy platform, this one on Rails, same family of pain. We were adding a non-null column to users on a multi-terabyte Aurora writer. The migration was reviewed and used the strong_migrations helper. Felt safe.

It wasn’t. The default backfill took an ACCESS EXCLUSIVE lock and held it for eighty-seven seconds. Login error rate hit 100% during peak Pacific hours. First instinct was rollback. Rails doesn’t have a clean rollback for a partially-applied add_column_with_default. Letting it finish was less bad than fighting it.

Lesson I carried into NestJS multi-tenant work. On a hot tenant table, schema changes are a three-step dance regardless of framework. Add the column nullable. Backfill in batches. Add the not-null constraint. This is the strongest reason I still prefer row-based isolation. One schema means one of these dances per change, not a thousand.
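The three steps, sketched as a function that takes an injected executor so it isn't tied to any ORM. The Exec type, the locale column, the 'en' default, and the batch size are all made up for illustration. The SQL is Postgres-flavored.

```typescript
// Hypothetical executor: runs one SQL statement and resolves to the affected row count.
type Exec = (sql: string, params?: unknown[]) => Promise<number>;

// Column name and default value are invented for this example.
async function addNotNullColumnSafely(exec: Exec, batchSize = 5000): Promise<number> {
  // Step 1: nullable column with no default, so there is no table rewrite.
  await exec('ALTER TABLE users ADD COLUMN locale text');

  // Step 2: backfill in small batches so each UPDATE holds its row locks briefly.
  let total = 0;
  for (;;) {
    const updated = await exec(
      `UPDATE users SET locale = 'en'
       WHERE id IN (SELECT id FROM users WHERE locale IS NULL LIMIT $1)`,
      [batchSize],
    );
    if (updated === 0) break; // nothing left to backfill
    total += updated;
  }

  // Step 3: enforce the constraint once no NULLs remain.
  await exec('ALTER TABLE users ALTER COLUMN locale SET NOT NULL');
  return total;
}
```

One caveat on step 3. On a large table, SET NOT NULL still scans the whole table while holding its lock. On Postgres 12 and later, adding a CHECK (locale IS NOT NULL) constraint as NOT VALID and validating it first lets SET NOT NULL skip that scan.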

Takeaways

Subdomain first, JWT second. Anything else is a guess.
AsyncLocalStorage over request-scoped providers once you’re past a few thousand RPS.
Throw loud when the tenant context is missing. Never return null.
Default to row-based isolation. Schema-per-tenant is not free.
The Prisma extension or TypeORM subscriber is your only safe path to the DB. No raw client in app code.
Write tests that prove cross-tenant reads return nothing. Not just that own-tenant reads return something.

Thanks for reading. If you’ve got thoughts, send them my way.