Circuit Breaker Pattern in Microservices

A working circuit breaker in TypeScript, the state machine that actually matters, and the two production incidents that made me stop trusting retries.

A creator on a SaaS platform I worked at opened a ticket on a Wednesday afternoon saying every one of her customers had been charged twice. We pulled the logs and traced it to Apple’s renewal notification endpoint quietly retrying us because our handler had crept past its 30-second deadline. We returned a 200 OK after Apple had already given up and queued another attempt. No idempotency check on our side. New row inserted on every retry. The thing that should have prevented the cascade, the boring piece of infrastructure that almost nobody on the team owned, was a circuit breaker. We didn’t have one in front of that handler. We do now.

OK so let me back up. Circuit breakers are one of those patterns that sound trivial when you read the wiki page and then you put one in production and immediately discover the state machine has teeth. I’ve shipped them in Node.js gateways for a real-time trading platform I architected, in a Rails monolith hitting Aurora, and in the Kafka consumers behind a federation rankings page. Every time, the half-open transition is what bit me. Not the closed state. Not the open state. The thing in between.

What the breaker actually models

Three states. Closed means traffic flows. Open means traffic is rejected immediately, no network call, fail fast. Half-open is the recovery probe, where you let a trickle through to see if the downstream came back. That’s it. The interesting question is when each transition happens and what the breaker is counting.

The naive version counts errors. Five in a row, trip. That’s wrong almost every time I’ve seen it deployed. You want a rolling window with a percentage threshold, because steady-state error rates are noisy and bursty traffic will trip a counter-based breaker on a Tuesday morning when nothing’s actually broken. So roll a window of recent calls, say 30 seconds, require some minimum sample size so a single early failure doesn’t trip the thing, and trip on failure ratio above a threshold like 50 percent. The minimum sample size matters more than people think.
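To make the minimum-sample point concrete, here is a small standalone sketch of the ratio check. The names Sample, WindowOptions, and wouldTrip are illustrative, and the thresholds are just the example values from the paragraph above, not a recommendation.

```typescript
interface Sample {
  ok: boolean;
  ts: number; // epoch millis
}

interface WindowOptions {
  failureRateThreshold: number; // e.g. 0.5
  minimumSamples: number;       // e.g. 20
  rollingWindowMs: number;      // e.g. 30_000
}

// True only when the failure ratio inside the rolling window meets the
// threshold AND there are enough samples in the window to trust it.
function wouldTrip(samples: Sample[], opts: WindowOptions, now: number): boolean {
  const cutoff = now - opts.rollingWindowMs;
  const recent = samples.filter(s => s.ts >= cutoff);
  if (recent.length < opts.minimumSamples) return false;
  const failures = recent.filter(s => !s.ok).length;
  return failures / recent.length >= opts.failureRateThreshold;
}

const windowOpts: WindowOptions = {
  failureRateThreshold: 0.5,
  minimumSamples: 20,
  rollingWindowMs: 30_000,
};
const nowMs = 1_000_000;

// One failure right after a deploy: a 100% failure ratio, but only one sample.
const coldStart: Sample[] = [{ ok: false, ts: nowMs - 100 }];

// 12 failures out of 20 recent calls: 60% with enough samples to mean something.
const degraded: Sample[] = Array.from({ length: 20 }, (_, i) => ({
  ok: i >= 12,
  ts: nowMs - i * 1000,
}));

// The same 20 calls shifted outside the window: they no longer count at all.
const expired: Sample[] = degraded.map(s => ({ ...s, ts: s.ts - 60_000 }));
```

Under these assumptions, only the degraded case trips. The cold-start failure is ignored until there is enough traffic to judge, which is exactly what a consecutive-error counter cannot do.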

type State = 'closed' | 'open' | 'half_open';

interface BreakerOptions {
  failureRateThreshold: number;     // e.g. 0.5
  minimumSamples: number;            // e.g. 20
  rollingWindowMs: number;           // e.g. 30_000
  openStateDurationMs: number;       // e.g. 15_000
  halfOpenMaxConcurrent: number;     // e.g. 3
  isFailure?: (err: unknown) => boolean;
}

interface Outcome {
  ok: boolean;
  ts: number;
}

The isFailure predicate is the part most off-the-shelf libraries get wrong. A 404 from a downstream is not a circuit-breaker-tripping event. A 500 is. An ETIMEDOUT definitely is. A 429 is debatable and depends on whether you respect Retry-After. You want to be explicit. Don’t let the default “any thrown error counts” rule send your breaker into open just because somebody queried a missing record.
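A minimal sketch of such a predicate, under stated assumptions: the error shape (a numeric status and a string code) is hypothetical and depends on your HTTP client, the list of network codes is a starting point rather than a complete set, and the countRateLimits switch is my own way of making the 429 decision explicit.

```typescript
// Hypothetical error shape: many HTTP clients attach a numeric `status`,
// and Node system errors carry a string `code` such as 'ETIMEDOUT'.
interface DownstreamError {
  status?: number;
  code?: string;
}

const NETWORK_CODES = new Set(['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED']);

function isBreakerFailure(err: unknown, countRateLimits = false): boolean {
  // Unknown shape: be conservative and count it.
  if (typeof err !== 'object' || err === null) return true;
  const e = err as DownstreamError;
  // Timeouts and connection errors always count.
  if (e.code !== undefined && NETWORK_CODES.has(e.code)) return true;
  // No HTTP status at all: treat as a failure.
  if (e.status === undefined) return true;
  // 429 is debatable, so the caller decides.
  if (e.status === 429) return countRateLimits;
  // 5xx counts; other 4xx responses (including 404) do not.
  return e.status >= 500;
}
```

Passed to the breaker as isFailure: (err) => isBreakerFailure(err), this keeps a missing record from counting against the downstream while timeouts and 5xx responses still do.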

A working breaker in TypeScript

Here’s the shape I keep landing on, simplified but production-ish. I used something close to this in a NestJS gateway calling out to Apple’s App Store Connect API after the renewal incident. Resilience4j and Polly both implement this state machine well, and the Node ecosystem has opossum, but rolling your own for a specific call path is sometimes the cleanest option when your failure semantics are weird.

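import { performance } from 'node:perf_hooks';

// Thrown when the breaker rejects a call without touching the downstream.
export class CircuitOpenError extends Error {
  constructor(target: string) {
    super(`Circuit open for ${target}`);
    this.name = 'CircuitOpenError';
  }
}

// Assumed: `metrics` is your existing StatsD/Datadog-style client exposing
// incr() and observeLatency(). Swap in whatever you already use.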

export class CircuitBreaker<TArgs extends unknown[], TReturn> {
  private state: State = 'closed';
  private outcomes: Outcome[] = [];
  private openedAt = 0;
  private inflightProbes = 0;

  constructor(
    private readonly fn: (...args: TArgs) => Promise<TReturn>,
    private readonly opts: BreakerOptions,
  ) {}

  async exec(...args: TArgs): Promise<TReturn> {
    this.transitionIfNeeded();

    if (this.state === 'open') {
      throw new CircuitOpenError(this.fn.name);
    }

    // Remember whether this call is a half-open probe. The state can change
    // before the call settles, so the check cannot be repeated later.
    const isProbe = this.state === 'half_open';
    if (isProbe) {
      if (this.inflightProbes >= this.opts.halfOpenMaxConcurrent) {
        throw new CircuitOpenError(this.fn.name);
      }
      this.inflightProbes++;
    }

    const start = performance.now();
    try {
      const result = await this.fn(...args);
      this.record({ ok: true, ts: Date.now() });
      // Only a probe may close the breaker, and only if it is still half-open.
      if (isProbe && this.state === 'half_open') this.closeFromHalfOpen();
      return result;
    } catch (err) {
      const isFail = this.opts.isFailure?.(err) ?? true;
      this.record({ ok: !isFail, ts: Date.now() });
      if (isFail && isProbe && this.state === 'half_open') this.tripOpen();
      throw err;
    } finally {
      // Release the probe slot even if this probe just changed the state.
      if (isProbe) this.inflightProbes--;
      metrics.observeLatency(this.fn.name, performance.now() - start);
    }
  }

  private transitionIfNeeded() {
    if (this.state === 'open' && Date.now() - this.openedAt >= this.opts.openStateDurationMs) {
      // No counter reset here: every probe releases its own slot in exec().
      this.state = 'half_open';
    }
    if (this.state === 'closed' && this.shouldTrip()) this.tripOpen();
  }

  private shouldTrip(): boolean {
    const cutoff = Date.now() - this.opts.rollingWindowMs;
    this.outcomes = this.outcomes.filter(o => o.ts >= cutoff);
    if (this.outcomes.length < this.opts.minimumSamples) return false;
    const failures = this.outcomes.filter(o => !o.ok).length;
    return failures / this.outcomes.length >= this.opts.failureRateThreshold;
  }

  private tripOpen() {
    this.state = 'open';
    this.openedAt = Date.now();
    metrics.incr('circuit.opened', { name: this.fn.name });
  }

  private closeFromHalfOpen() {
    this.state = 'closed';
    this.outcomes = [];
    metrics.incr('circuit.closed', { name: this.fn.name });
  }

  private record(o: Outcome) { this.outcomes.push(o); }
}

A few things to call out. The half-open path lets a small number of concurrent probes through, not just one. If you only allow one probe you serialize recovery and waste minutes on a service that came back five seconds ago. But you also can’t let the floodgates open. Three works in most of my deployments. The other thing is that the breaker emits metrics on every state transition. Without those metrics, you have no dashboard, and without a dashboard you find out the breaker is wedged open from a customer support ticket. Which leads me to a story.

The rankings page that wedged open

At the combat-sports tournament platform I CTO’d in London, the rankings page went stale for eight hours on a Saturday night. A federation tournament had just finished, the new champion needed to show up at the top, and instead the page kept showing the old number one. The athlete tweeted a screenshot of our broken page at the federation. Tagged them. Great.

We had an Elasticsearch indexer reading off Kafka and projecting into the rankings index. Backed by PostgreSQL as the system of record. The indexer used a circuit breaker around the ES bulk-write client. The breaker had tripped open the night before during a transient ES blip and stayed open. The half-open transition was never attempted because somebody had configured the breaker to require a manual reset. I’d reviewed that config months ago. I’d approved it.

First fix was the operational reflex. SSH into the pod, restart the indexer. The container came up clean. The breaker reset to closed. It started projecting new events. The problem was we never backfilled. The old wrong rankings were still in the index. So we triggered a full reindex from PostgreSQL into a new ES index and atomically swapped the read alias over to it, and the page caught up in about 25 minutes.
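For anyone who hasn’t done an alias swap before, here is a rough sketch of the request body for Elasticsearch’s POST /_aliases endpoint, which applies all listed actions atomically. The helper name buildAliasSwap and the index and alias names are made up for illustration, not the ones from the incident.

```typescript
// One action in the body of Elasticsearch's `POST /_aliases` request.
interface AliasAction {
  add?: { index: string; alias: string };
  remove?: { index: string; alias: string };
}

// Both actions go in a single request, so readers never hit a moment
// where the alias points at nothing or at both indices.
function buildAliasSwap(
  alias: string,
  oldIndex: string,
  newIndex: string,
): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: oldIndex, alias } },
      { add: { index: newIndex, alias } },
    ],
  };
}

// Hypothetical names for illustration only.
const swapBody = buildAliasSwap('rankings', 'rankings_v1', 'rankings_v2');
```

The read path queries the alias and never an index name directly, which is what makes the cutover invisible to the page.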

Real fix was structural. Made the breaker attempt half-open every 60 seconds automatically, with three concurrent probes, instead of staying open forever waiting on a human. Added a freshness metric to the indexer separate from “is the consumer consuming” because the consumer was happily consuming Kafka the whole time. It just wasn’t writing anywhere. Twelve hours of perfectly green throughput dashboards while the index drifted into nonsense.
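A freshness metric along those lines can be sketched like this. FreshnessTracker and its method names are hypothetical. The one idea that matters is that the timestamp only advances after the downstream write is acknowledged, never when the event is merely consumed.

```typescript
// Tracks how far the index lags behind the events it is supposed to reflect.
// Consumer throughput is deliberately not an input.
class FreshnessTracker {
  private lastWrittenEventTs: number | null = null;

  // Call only after the downstream write has been acknowledged.
  recordSuccessfulWrite(eventTs: number): void {
    if (this.lastWrittenEventTs === null || eventTs > this.lastWrittenEventTs) {
      this.lastWrittenEventTs = eventTs;
    }
  }

  // Milliseconds between the newest event seen on the topic and the newest
  // event actually written. Infinity means nothing has been written yet.
  lagMs(newestEventTs: number): number {
    if (this.lastWrittenEventTs === null) return Infinity;
    return Math.max(0, newestEventTs - this.lastWrittenEventTs);
  }
}

const tracker = new FreshnessTracker();
tracker.recordSuccessfulWrite(1_000);
// Breaker opens here: events keep arriving, but writes stop.
const lagWhileOpen = tracker.lagMs(61_000); // 60_000 ms behind
```

Alerting on this lag catches the case where the consumer looks perfectly healthy and nothing is landing in the index.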

Cost was eight hours of stale rankings during a publicly visible window, one angry athlete on Twitter, a Monday call from the federation. The lesson, scribbled into the runbook that week, was that a circuit breaker without an automatic half-open path is a foot-gun. If your downstream is going to recover, the breaker has to know how to find out.

Fallbacks and dashboards

When the breaker is open, you have to return something. The wrong answer is to surface the error. The consumer of your API doesn’t care that you have a circuit breaker; they care that your endpoint works. Common fallbacks I’ve shipped: a cached last-known-good value from Redis with a freshness TTL, an empty-but-valid response shape so the client renders a graceful state, a degraded mode that disables the slow feature but keeps the rest of the surface up.

async function getCreatorProfile(id: string) {
  try {
    return await profileBreaker.exec(id);
  } catch (err) {
    if (err instanceof CircuitOpenError) {
      const stale = await redis.get(`profile:${id}`);
      if (stale) {
        return { ...JSON.parse(stale), stale: true };
      }
    }
    throw err;
  }
}

The stale: true flag is doing real work there. Downstream caches and frontends can decide to render the cached version with a small “live data unavailable” badge instead of treating it as fresh. Honesty is a feature.
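The read side above only works if something keeps the cache warm. Here is a minimal sketch of the write side, with an in-memory map standing in for Redis so it runs anywhere. The Cache interface, MemoryCache, and withLastKnownGood are all hypothetical names, and unlike the snippet above this version falls back on any failure rather than only on CircuitOpenError.

```typescript
// Minimal cache interface so the sketch runs without a Redis server.
// A real deployment would back this with Redis and a TTL on each key.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

class MemoryCache implements Cache {
  private store = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    this.store.set(key, value);
  }
}

// Refresh the cache on success, serve the cached copy (flagged as stale)
// on failure, and rethrow only when nothing has ever been cached.
async function withLastKnownGood<T extends object>(
  cache: Cache,
  key: string,
  fetchFresh: () => Promise<T>,
): Promise<T & { stale: boolean }> {
  try {
    const fresh = await fetchFresh();
    await cache.set(key, JSON.stringify(fresh));
    return { ...fresh, stale: false };
  } catch (err) {
    const cached = await cache.get(key);
    if (cached !== null) {
      return { ...(JSON.parse(cached) as T), stale: true };
    }
    throw err;
  }
}
```

In practice fetchFresh would be a call through the breaker, something like () => profileBreaker.exec(id), so an open circuit and a failed call both land in the same fallback path.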

On dashboards, the things I want visible on every breaker: current state, time in state, failure ratio in the rolling window, count of open transitions per hour, half-open probe success rate, p50/p99 latency of the wrapped call. I want an alert on “breaker has been open for more than 5 minutes” because that’s the line between “downstream blip” and “we have a problem that needs a human.” A Datadog monitor expression for that looks roughly like this, assuming circuit.state is a gauge that reports 1 while the breaker is open (min rather than avg, so a brief blip inside the window doesn’t page anyone):

min(last_5m):sum:circuit.state{state:open} by {service,target} > 0

Pair that with a runbook entry. The runbook should not say “investigate.” It should say “check the wrapped target’s health, then run the probe command, then decide between manual reset and rollback.” I’ve been on-call for breakers without runbooks. It’s not fun at 2 a.m.
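The breaker class earlier only emits counters on transitions, so a state gauge has to come from somewhere. Here is one way it could be sketched: a periodic job that turns a snapshot of the breaker into gauge points. Apart from circuit.state, the metric names and the BreakerSnapshot and Gauge shapes are hypothetical, and how the points get shipped is left to your metrics client.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

interface BreakerSnapshot {
  state: BreakerState;
  stateEnteredAt: number; // epoch millis of the last transition
  windowFailures: number; // failures inside the rolling window
  windowTotal: number;    // all calls inside the rolling window
}

interface Gauge {
  metric: string;
  value: number;
  tags: Record<string, string>;
}

// `circuit.state` is 1 for the current state and 0 for the others, so
// "open for the whole window" becomes a simple min() query on state:open.
function toGauges(target: string, snap: BreakerSnapshot, now: number): Gauge[] {
  const states: BreakerState[] = ['closed', 'open', 'half_open'];
  const stateGauges = states.map((s): Gauge => ({
    metric: 'circuit.state',
    value: s === snap.state ? 1 : 0,
    tags: { target, state: s },
  }));
  // An empty window reports 0 rather than NaN.
  const ratio = snap.windowTotal === 0 ? 0 : snap.windowFailures / snap.windowTotal;
  return [
    ...stateGauges,
    { metric: 'circuit.time_in_state_ms', value: now - snap.stateEnteredAt, tags: { target } },
    { metric: 'circuit.failure_ratio', value: ratio, tags: { target } },
  ];
}

// Example: a breaker around a hypothetical 'es-bulk' target, open for 5 minutes.
const gaugePoints = toGauges(
  'es-bulk',
  { state: 'open', stateEnteredAt: 0, windowFailures: 15, windowTotal: 20 },
  300_000,
);
```

Emitting a zero for the inactive states matters more than it looks: a gauge that simply stops reporting is indistinguishable from a dead process.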

Takeaways

Circuit breakers count failures over a rolling window with a minimum sample size, never raw consecutive failures.
The half-open state must transition automatically with bounded concurrency, never wait for a human.
Be explicit about which errors count as failures. 404 usually doesn’t, 500 and timeouts always do.
Pair every breaker with a fallback. Stale-with-flag beats error-page almost always.
Dashboard the breaker like you dashboard the wrapped call. State, time in state, failure ratio, half-open success rate.
A breaker can shed load on your own service, not just protect against downstream failures.

Thanks for reading. If you’ve got thoughts, send them my way.