Bulkhead Pattern and Fault Isolation

Why I partition thread pools, semaphores, and connection pools per dependency, and how that pairs with circuit breakers, timeout budgets, and load shedding in production.

09:31:14 on a Tuesday morning at the real-time trading platform I architected. 74 seconds after market open. Connections started dropping en masse, clients reconnected immediately, got dropped again. Within 90 seconds every Socket.io gateway pod was pinned at 100 percent CPU and our p99 tick fan-out latency climbed from around 80 ms to roughly 3 seconds. The thing that turned a bad client bug into a full outage wasn’t the reconnect storm itself. It was that one shared worker pool inside the gateway tier was handling auth lookups, tick fan-out, and the slow path that talked to the user-profile service. When the slow path got slow, everything backed up behind it.

Yeah. That’s a bulkhead failure, by absence.

Here’s the position. If a service depends on more than one downstream, the default should be one isolated resource pool per downstream, not one shared pool with optimistic sizing. I’d rather pay the small cost of separate thread pools, semaphores, and connection pools than pay the tax of a cascading failure during peak traffic. Bulkheads alone don’t save you. They have to live next to circuit breakers, timeout budgets, and load shedding. Treat any one of those in isolation as a partial fix.

What a bulkhead actually buys you

A bulkhead is a hard partition. One slow dependency can consume its own slice of the resource pool and nothing else. The rest of the service keeps serving. In Node.js land we usually reach for semaphores, separate HTTP agents, and separate database connection pools. Caps you cannot exceed, queues that drop early, and observability per partition so you can see which slice is starving.

The trap is sizing them by gut feel. I’ve watched teams set every pool to 100 because “that’s what the old monolith did”. Real traffic is rarely that even: two downstreams account for most of it, two more are slow by nature (S3, Stripe), and the rest hover near idle. Sized flat, the slow ones starve the fast ones the moment a downstream blip happens.

Semaphore bulkheads in NestJS

The simplest bulkhead in a Node service is a semaphore around outbound calls. Cap concurrent in-flight requests per downstream, reject (or queue with a tight ceiling) past the cap.

import { Injectable, ServiceUnavailableException } from '@nestjs/common';

type Permit = () => void;

export class Semaphore {
  private inUse = 0;
  private readonly waiters: Array<(p: Permit) => void> = [];

  constructor(
    private readonly max: number,
    private readonly maxQueue: number,
    private readonly name: string,
  ) {}

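  async acquire(): Promise<Permit> {
    if (this.inUse < this.max) {
      this.inUse++;
      return this.newPermit();
    }
    if (this.waiters.length >= this.maxQueue) {
      // shed load early, do not let the wait queue grow without bound
      throw new ServiceUnavailableException(`bulkhead_full:${this.name}`);
    }
    // park the caller until release() hands it a permit
    return new Promise<Permit>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // each permit releases at most once, so a double call cannot free two slots
  private newPermit(): Permit {
    let released = false;
    return () => {
      if (released) return;
      released = true;
      this.release();
    };
  }

  private release() {
    const next = this.waiters.shift();
    if (next) {
      // hand the slot straight to the oldest waiter; inUse stays unchanged
      next(this.newPermit());
      return;
    }
    this.inUse--;
  }
}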

@Injectable()
export class Bulkheads {
  readonly stripe = new Semaphore(20, 40, 'stripe');
  readonly profile = new Semaphore(80, 200, 'profile');
  readonly search = new Semaphore(40, 100, 'search');
}

Sizing comes from real numbers. Look at p95 downstream latency and target RPS and pick a cap that lets the queue drain inside your timeout budget. If your timeout for Stripe is 8 seconds and p95 is 600 ms, twenty in-flight calls per pod is plenty. If you set it to 200 you’re not bulkheading, you’re delaying the inevitable.
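A rough way to sanity-check a cap against those numbers is Little's law (in-flight calls = arrival rate × latency). The sketch below is only an approximation: the function names are mine, and it treats every call as taking exactly p95 and the queue as draining in waves of `maxConcurrent`.

```typescript
// Hypothetical helpers for checking a bulkhead config against a timeout budget.
export interface BulkheadConfig {
  maxConcurrent: number; // semaphore cap
  maxQueue: number;      // waiters allowed beyond the cap
}

// Little's law rearranged: throughput a cap can sustain at a given latency.
export function sustainableRps(cfg: BulkheadConfig, p95LatencyMs: number): number {
  return cfg.maxConcurrent / (p95LatencyMs / 1000);
}

// Rough worst case for the last queued caller: it waits for the waves ahead
// of it to finish, then spends one more p95 on its own call.
export function worstCaseLatencyMs(cfg: BulkheadConfig, p95LatencyMs: number): number {
  const wavesAhead = Math.ceil(cfg.maxQueue / cfg.maxConcurrent);
  return (wavesAhead + 1) * p95LatencyMs;
}

// True when even the last queued caller finishes inside the timeout budget.
export function fitsBudget(cfg: BulkheadConfig, p95LatencyMs: number, timeoutMs: number): boolean {
  return worstCaseLatencyMs(cfg, p95LatencyMs) <= timeoutMs;
}
```

For the Stripe bulkhead above (cap 20, queue 40, p95 of 600 ms), this gives roughly 33 requests per second per pod and a worst case of about 1.8 seconds, well inside the 8 second timeout. Push the queue to 400 and the last caller would wait past the timeout before it ever got a permit.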

Separate connection pools, not shared ones

PostgreSQL connection pools are the bulkhead people forget. Long-running reports, transactional writes, and read-mostly hot paths should not share a pool. They have different latency profiles and different failure modes. At the creator economy platform I worked at, the Aurora writer powering the Community product carried a multi-terabyte working set, and we’d seen what happens when an analytics job starves the OLTP path. After that, every service got at least two pools.

import { Pool } from 'pg';

export const writePool = new Pool({
  host: process.env.PG_WRITER_HOST,
  max: 30,                       // tx writes
  idleTimeoutMillis: 10_000,
  connectionTimeoutMillis: 2_000,
  application_name: 'orders-writer',
});

export const readPool = new Pool({
  host: process.env.PG_READER_HOST,
  max: 60,                       // hot read path
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 1_500,
  application_name: 'orders-reader',
});

export const reportPool = new Pool({
  host: process.env.PG_READER_HOST,
  max: 8,                        // long, intentionally capped low
  idleTimeoutMillis: 60_000,
  connectionTimeoutMillis: 5_000,
  application_name: 'orders-reports',
  statement_timeout: 60_000,
});

The statement_timeout on the report pool is half the bulkhead. A pool cap without a per-statement timeout is a leaky bulkhead. The other half is rejecting at the edge when the queue is full, not letting requests pile up forever.
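One way to get that edge rejection with node-postgres is to look at the pool's `waitingCount` before issuing a query. This is a minimal sketch, not something from the services described here: `PoolLike`, `PoolSaturatedError`, `assertPoolHasRoom`, and `guardedQuery` are hypothetical names, and the `maxWaiting` threshold is something you would tune per pool.

```typescript
// Structural subset of pg.Pool used by this sketch. `waitingCount` is the
// number of callers currently queued for a connection.
interface PoolLike {
  readonly waitingCount: number;
  query(text: string, values?: unknown[]): Promise<unknown>;
}

export class PoolSaturatedError extends Error {
  readonly poolName: string;
  readonly waiting: number;

  constructor(poolName: string, waiting: number) {
    super(`pool_saturated:${poolName}`);
    this.name = 'PoolSaturatedError';
    this.poolName = poolName;
    this.waiting = waiting;
  }
}

// Throws immediately when too many callers are already waiting on the pool.
export function assertPoolHasRoom(pool: PoolLike, poolName: string, maxWaiting: number): void {
  if (pool.waitingCount >= maxWaiting) {
    throw new PoolSaturatedError(poolName, pool.waitingCount);
  }
}

// Runs the query only if the pool's wait queue is below the threshold.
export async function guardedQuery(
  pool: PoolLike,
  poolName: string,
  maxWaiting: number,
  text: string,
  values?: unknown[],
): Promise<unknown> {
  assertPoolHasRoom(pool, poolName, maxWaiting);
  return pool.query(text, values);
}
```

The caller maps `PoolSaturatedError` to a 503 the same way the semaphore does. `connectionTimeoutMillis` still bounds how long an admitted caller waits for a connection, so the two limits work together.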

Bulkheads alone do not save you

This is where I see the most teams trip. They add bulkheads and call it done. Then a downstream goes slow but not dead, every permit gets held for the full timeout, the pool stays full, and new requests get rejected. From the client’s view that’s still an outage, just a quieter one. You want a circuit breaker behind the bulkhead that opens when failures accumulate, sheds load fast, and recovers automatically.

import CircuitBreaker from 'opossum';
import axios from 'axios';
import { Bulkheads } from './bulkheads'; // assumed path to the Bulkheads provider shown earlier

const profileClient = axios.create({
  baseURL: process.env.PROFILE_BASE_URL,
  timeout: 800, // tighter than the bulkhead's outer budget
});

const breaker = new CircuitBreaker(
  async (userId: string) => {
    const { data } = await profileClient.get(`/users/${userId}`);
    return data;
  },
  {
    timeout: 800,
    errorThresholdPercentage: 50,
    rollingCountTimeout: 10_000,
    rollingCountBuckets: 10,
    resetTimeout: 5_000,
    volumeThreshold: 20,
  },
);

breaker.fallback(() => null);

export async function fetchProfile(userId: string, bulkheads: Bulkheads) {
  const release = await bulkheads.profile.acquire();
  try {
    return await breaker.fire(userId);
  } finally {
    release();
  }
}

The bulkhead caps how much of your service the profile dependency can occupy. The breaker short-circuits further calls when the dependency is unhealthy. The fallback gives you a graceful degraded mode (null profile, render an anonymous avatar) instead of a 5xx. The combination is what survives. Any one of them alone is a half-measure.

Timeout budgets and load shedding

A request budget says: this whole request gets 1500 ms end to end. The downstream calls don’t get to ask for more than their share. If auth gets 200 ms, profile gets 400 ms, search gets 600 ms, the remaining 300 ms is your local processing. Whatever a call spends comes out of the total, so the next call sees a smaller remaining deadline. Pair the budget with a priority header that drives load shedding at the edge, so when CPU goes over a threshold, the gateway returns 503 to non-critical requests and lets critical ones through.
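Both ideas fit in a few lines. This is a sketch under assumptions, not the gateway code from the incident: `Deadline`, `Priority`, and `shedStatus` are hypothetical names, and the priority value is assumed to arrive in whatever header your edge uses for it.

```typescript
// Tracks one request's end-to-end budget and hands out per-call timeouts.
export class Deadline {
  private readonly expiresAt: number;
  private readonly now: () => number;

  // `now` is injectable so the class can be driven by a fake clock in tests.
  constructor(totalMs: number, now: () => number = Date.now) {
    this.now = now;
    this.expiresAt = now() + totalMs;
  }

  // Milliseconds left in the whole-request budget, never negative.
  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  // Timeout for the next downstream call: its nominal share, clipped so that
  // `reserveMs` of the remaining budget stays free for local processing.
  timeoutFor(shareMs: number, reserveMs = 0): number {
    return Math.max(0, Math.min(shareMs, this.remainingMs() - reserveMs));
  }
}

export type Priority = 'critical' | 'non-critical';

// Returns 503 when the request should be shed, or null to let it through.
// `cpuUtilization` and `threshold` are fractions between 0 and 1.
export function shedStatus(
  cpuUtilization: number,
  threshold: number,
  priority: Priority,
): 503 | null {
  return cpuUtilization > threshold && priority !== 'critical' ? 503 : null;
}
```

With a 1500 ms budget and 300 ms reserved, `timeoutFor(200, 300)` returns the full 200 ms for auth. If 1100 ms have elapsed by the time search runs, `timeoutFor(600, 300)` returns 100 ms. Treat a result of 0 as "skip the call" and do not pass it through as a timeout, because several HTTP clients, axios included, read a timeout of 0 as no timeout at all.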

Takeaways

One isolated pool per external dependency. Not one big pool optimistic about everything.
Size bulkheads from p95 latency and timeout budget, not “round number that looks safe”.
Pair every bulkhead with a circuit breaker and a tight per-call timeout. Alone they leak.
Reject early when the queue is full. Slow rejection is a worse failure than fast rejection.
Put separate connection pools on read, write, and long-running paths. Always set statement_timeout.
Shed load early when a pool is saturated. Autoscaling into a saturated downstream only adds callers and amplifies a self-reinforcing failure.

Thanks for reading. If you’ve got thoughts, send them my way.