The Cascading Failure That Took Down 47 Services

Ninety seconds. That's how long it took for one slow downstream call to drag a hundreds-of-services topology to its knees. What I learned about circuit breakers, bulkheads, timeout budgets, and the difference between a health check that's honest and one that lies.

Ninety seconds. That’s the entire window from the first PagerDuty page to the moment a federation tech lead pinged me directly because the public standings page had frozen mid-broadcast. By the time I had a terminal open, the dependency graph looked like a Christmas tree on Datadog. Forty-seven services degraded. One root cause. And, honestly, the most embarrassing part is that I’d reviewed the change that caused it.

This was at the combat-sports tournament platform I CTO’d in London. Hundreds of microservices, Kafka as the async backbone, a Saturday afternoon live broadcast. I want to use the story here for the thing it actually taught me. Which is that cascading failure is rarely about the failing service. It’s about everything around it that wasn’t ready to refuse work.

How one slow consumer poisons the pool

The trigger was small. Someone had shipped a config-touching change without bumping the image tag. The deployment pulled :latest. One pod, out of six, came up with a different max.poll.interval.ms. The handler did a slow call out to a federation-rules service that occasionally took roughly 70 seconds. On five pods the poll interval was 300s and they shrugged it off. On the sixth, the interval was 60s, so any 70-second call overran it. That pod kept getting kicked out of the consumer group, which triggered a rebalance, which paused the other five, which made the lag grow, which extended the slow downstream call further. Every 30 seconds the dance started over.

First wrong fix, which I’m not proud of: kubectl rollout restart deployment/standings-projector. I told myself the group needed a clean re-join. What it actually got was the same six pods doing the same self-amplifying dance with a slightly different starting offset. I was feeding the fire.

The real fix took maybe five minutes once I pulled pod logs side by side. Isolated the bad pod. Storm drained in about ninety seconds. Over the weekend we pinned image SHAs on every Kafka-touching deployment, committed offsets more frequently with smaller batches, and split the slow downstream call out of the hot consumer loop. CI now refuses to deploy any consumer manifest referencing :latest. I’m the reason that check exists.
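A check like that can be quite small. What follows is a sketch, not our actual pipeline code: the name findUnpinnedImages is hypothetical, and the rule that only digest-pinned references (@sha256:) pass is my assumption here, stricter than just banning :latest. It scans manifest text line by line instead of parsing YAML, which is crude but needs no dependencies.

```typescript
// Hypothetical CI helper: flag container image references not pinned by digest.
// Assumption: manifests declare images on lines of the form `image: <ref>`.
export interface UnpinnedImage {
  line: number; // 1-based line number in the manifest text
  ref: string; // the offending image reference
  reason: 'latest_tag' | 'no_tag' | 'tag_without_digest';
}

export function findUnpinnedImages(manifest: string): UnpinnedImage[] {
  const findings: UnpinnedImage[] = [];
  manifest.split('\n').forEach((text, index) => {
    const match = /^\s*(?:-\s*)?image:\s*["']?([^"'\s#]+)/.exec(text);
    if (!match) return;
    const ref = match[1];
    if (ref.includes('@sha256:')) return; // pinned by digest, so immutable
    // Only the part after the last slash can carry a tag; a registry host may contain a port.
    const lastSegment = ref.slice(ref.lastIndexOf('/') + 1);
    const tag = lastSegment.includes(':') ? lastSegment.split(':')[1] : undefined;
    // An untagged reference resolves to :latest, so it is flagged as well.
    const reason: UnpinnedImage['reason'] =
      tag === undefined ? 'no_tag' : tag === 'latest' ? 'latest_tag' : 'tag_without_digest';
    findings.push({ line: index + 1, ref, reason });
  });
  return findings;
}
```

In a pipeline you would fail the job whenever the returned array is non-empty and print the line numbers, so the person who shipped the change sees exactly which reference to pin.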

But the bigger lesson wasn’t really about Kafka. It was the count: 47 services touched. None of them had any business breaking because the standings projector was sick. They broke because the surrounding patterns weren’t there.

Timeout budgets, not timeouts

Here’s the thing about timeouts. Most teams set them per-call. The HTTP client gets 30s, the database client gets 10s, the message handler gets some other number. Each one feels reasonable in isolation. Then you add them up across a request path and the budget is two minutes.

What I want is the opposite. Set a budget at the edge, and propagate it down. Every hop gets a slice. If the slice runs out, you fail fast and let the caller decide.

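import { context, propagation } from '@opentelemetry/api';

// Baggage key carrying the absolute deadline in epoch milliseconds.
// Assumes hosts along the path have reasonably synchronized clocks.
const BUDGET_KEY = 'x-deadline-ms';
// Fallback when no deadline was propagated, or the value is unparseable.
const DEFAULT_BUDGET_MS = 30_000;

export function withBudget<T>(totalBudgetMs: number, fn: () => Promise<T>) {
  const deadline = Date.now() + totalBudgetMs;
  // Add the entry to any existing baggage instead of replacing it wholesale.
  const baggage = (
    propagation.getBaggage(context.active()) ?? propagation.createBaggage()
  ).setEntry(BUDGET_KEY, { value: String(deadline) });
  const ctx = propagation.setBaggage(context.active(), baggage);
  return context.with(ctx, fn);
}

export function remainingBudgetMs(): number {
  const baggage = propagation.getBaggage(context.active());
  const raw = baggage?.getEntry(BUDGET_KEY)?.value;
  const deadline = Number(raw);
  // A malformed value would otherwise become NaN and abort the call almost instantly.
  if (!raw || !Number.isFinite(deadline)) return DEFAULT_BUDGET_MS;
  return Math.max(0, deadline - Date.now());
}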

export async function callDownstream(url: string): Promise<Response> {
  const budget = remainingBudgetMs();
  if (budget < 100) {
    throw new Error('budget_exhausted');
  }
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budget - 50);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

That budget - 50 is deliberate. You always want a little headroom so the caller can serialize an error response, not just hang on the wire. Small detail. Saves you when retries fan in.
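The same arithmetic tells you whether a retry is even worth starting. Here is a dependency-free sketch of it. Both function names are hypothetical, and the 50 ms defaults are chosen only to match the headroom used above.

```typescript
// Hypothetical helper mirroring the headroom rule: given an absolute deadline,
// return the timeout this hop may spend on its next call, or null to fail fast.
export function attemptTimeoutMs(
  deadlineEpochMs: number,
  nowEpochMs: number,
  headroomMs = 50,
  minUsefulMs = 50,
): number | null {
  const usable = deadlineEpochMs - nowEpochMs - headroomMs;
  return usable >= minUsefulMs ? usable : null;
}

// Hypothetical planner: how many sequential attempts of a given worst-case
// duration, with a pause between them, fit in what is left of the budget?
export function attemptsThatFit(
  deadlineEpochMs: number,
  nowEpochMs: number,
  attemptMs: number,
  pauseMs: number,
  headroomMs = 50,
): number {
  if (attemptMs <= 0 || pauseMs < 0) {
    throw new RangeError('attemptMs must be positive and pauseMs non-negative');
  }
  const usable = deadlineEpochMs - nowEpochMs - headroomMs;
  if (usable < attemptMs) return 0;
  // n attempts need n * attemptMs + (n - 1) * pauseMs <= usable.
  return Math.floor((usable + pauseMs) / (attemptMs + pauseMs));
}
```

With two seconds left, 600 ms attempts, and a 100 ms pause, two attempts fit and a third does not. A retry policy that asks for three is spending time the request doesn’t have.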

Bulkheads keep the bad pool small

Bulkheads are the unsexy pattern that pays back every time. The idea is dumb: don’t share a single connection pool, thread pool, or queue across unrelated callers. If the slow downstream poisons one pool, the rest of the system still has work to do.

In Node.js I usually do it with separate undici agents per critical dependency. Each one gets its own connection ceiling and queue depth. When the federation-rules service goes sour, only the agent talking to it backs up. The rest of the world stays cold and fast.

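import { Agent, setGlobalDispatcher } from 'undici';

// One agent per critical dependency, each with its own connection pool.
export const federationRulesAgent = new Agent({
  connections: 32, // connection ceiling per origin
  pipelining: 1, // one in-flight request per connection
  keepAliveTimeout: 5_000, // close idle sockets after 5s
  bodyTimeout: 2_000, // max idle time between body chunks
  headersTimeout: 1_500, // max wait for response headers
});

export const paymentsAgent = new Agent({
  connections: 64,
  pipelining: 1,
  bodyTimeout: 4_000,
  headersTimeout: 2_000,
});

// Everything without a dedicated agent shares this small default pool.
setGlobalDispatcher(
  new Agent({ connections: 16, bodyTimeout: 3_000, headersTimeout: 1_500 }),
);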

Pair that with a circuit breaker that trips on rolling error rate, not on a single failure. The breaker should also reject calls outright when the remaining budget is low, exactly as if it were open. There’s no point opening a connection you can’t possibly afford to wait on.
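To make that concrete, here is a minimal breaker sketch. It is not the one we ran in production. The class name, option names, and the simplified half-open behavior are all my assumptions for illustration. The clock is passed in as an argument so the logic stays easy to test.

```typescript
// Hypothetical breaker sketch: trips on error rate over a rolling time window
// and refuses calls that arrive with too little remaining budget.
interface Outcome {
  atMs: number;
  ok: boolean;
}

export interface BreakerOptions {
  windowMs: number; // how far back outcomes are counted
  minSamples: number; // never trip on a handful of calls
  errorRateToOpen: number; // 0.5 means open at 50% failures
  cooldownMs: number; // how long to stay open before letting calls probe
  minBudgetMs: number; // reject calls with less remaining budget than this
}

export class RollingBreaker {
  private readonly opts: BreakerOptions;
  private outcomes: Outcome[] = [];
  private openedAtMs: number | null = null;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
  }

  // Returns true if the call may proceed.
  allow(nowMs: number, remainingBudgetMs: number): boolean {
    if (remainingBudgetMs < this.opts.minBudgetMs) return false;
    if (this.openedAtMs === null) return true;
    // Simplified half-open: once the cooldown has passed, calls flow again.
    return nowMs - this.openedAtMs >= this.opts.cooldownMs;
  }

  // Record the result of a call that was allowed through.
  record(nowMs: number, ok: boolean): void {
    this.outcomes.push({ atMs: nowMs, ok });
    this.outcomes = this.outcomes.filter((o) => nowMs - o.atMs <= this.opts.windowMs);
    if (this.openedAtMs !== null) {
      if (nowMs - this.openedAtMs >= this.opts.cooldownMs) {
        if (ok) {
          this.openedAtMs = null; // probe succeeded: close and start fresh
          this.outcomes = [];
        } else {
          this.openedAtMs = nowMs; // probe failed: restart the cooldown
        }
      }
      return;
    }
    const failures = this.outcomes.filter((o) => !o.ok).length;
    const enoughSamples = this.outcomes.length >= this.opts.minSamples;
    if (enoughSamples && failures / this.outcomes.length >= this.opts.errorRateToOpen) {
      this.openedAtMs = nowMs;
    }
  }
}
```

A real breaker would cap how many probe calls run during half-open. This one lets everything through after the cooldown until the first result comes back, which is fine for a sketch and too generous for a payment path.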

Shallow versus deep health checks

This was the second pattern that took 47 services down. Almost all of them had a /health endpoint that returned 200 if the process was up. Useless. The kubelet was happily declaring “ready” on pods that couldn’t talk to Kafka, couldn’t reach Postgres, and had their thread pool fully saturated.

Shallow checks belong on the liveness probe. Deep checks belong on readiness. They’re not the same thing and they shouldn’t return the same answer.

livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
  timeoutSeconds: 2
startupProbe:
  httpGet:
    path: /livez
    port: 8080
  failureThreshold: 30
  periodSeconds: 2

Liveness asks “is the process alive enough to be worth restarting.” Readiness asks “do you have working dependencies and spare capacity right now.” A pod that’s overloaded but healthy should fail readiness, not liveness. You don’t want it killed. You want it temporarily off the load balancer. Big difference under cascade.
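On the application side the two handlers can share almost nothing. The sketch below keeps the decision logic pure so it is easy to reason about. The dependency names, the in-flight counters, and the response shape are hypothetical, and how you actually ping Kafka or Postgres is left out.

```typescript
// Hypothetical probe logic. Dependency names and thresholds are illustrative.
export interface ReadinessInputs {
  dependencies: Record<string, boolean>; // latest ping result per dependency
  inFlight: number; // requests currently being handled
  maxInFlight: number; // capacity ceiling for this pod
}

export interface ProbeResult {
  status: 200 | 503;
  body: { failing: string[] };
}

// Shallow: if this code runs at all, the process is alive enough to keep.
export function livez(): ProbeResult {
  return { status: 200, body: { failing: [] } };
}

// Deep: every dependency reachable and spare capacity available right now.
export function readyz(input: ReadinessInputs): ProbeResult {
  const failing = Object.keys(input.dependencies).filter(
    (name) => !input.dependencies[name],
  );
  if (input.inFlight >= input.maxInFlight) failing.push('capacity');
  return { status: failing.length === 0 ? 200 : 503, body: { failing } };
}
```

One design note, offered as an assumption, not a rule: I’d have the dependency pings run on a background timer and let readyz read the cached results, so the probe itself always answers well inside the 2-second timeoutSeconds.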

When clients amplify the failure

I’ll close with a different story that lives in my head right next to this one. A real-time trading and charting platform I architected a couple of years earlier. Market open on a post-holiday Tuesday. At about 74 seconds after the open, gateway pods started dropping connections en masse. Clients reconnected immediately. Got dropped again. Reconnected again. Within about 90 seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out climbed from roughly 80 ms to about 3 seconds.

I scaled pods three times via the autoscaler’s manual override. Hit the reconnect storm head-on. The new pods went CPU-bound in about twenty seconds. I was feeding a different fire.

The real fix was a client-side config push through a remote-config channel we’d built for exactly this kind of moment: jittered exponential backoff on reconnects, plus a per-IP rate limit at the nginx layer. About eight minutes later the pool stabilized.

The lesson there matches the one from the consumer rebalance. Autoscale is not a fix for a self-amplifying client-side bug. Backoff lives on the client. Server-side scaling is a multiplier. If you multiply zero by anything, you still get zero.
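For reference, the client-side half of that fix is only a few lines. This is a sketch of one common variant, exponential backoff with full jitter, and not the exact code we pushed. The function name and the default base and cap values are assumptions.

```typescript
// Hypothetical reconnect delay: exponential backoff with "full jitter".
// `random` is injectable for tests and must return a value in [0, 1).
export function reconnectDelayMs(
  attempt: number, // 0 for the first reconnect attempt
  baseMs = 500,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  // The ceiling doubles per attempt and is clamped to the cap.
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, Math.max(0, attempt)));
  // Pick uniformly in [0, ceiling) so clients spread out instead of retrying in lockstep.
  return Math.floor(random() * ceiling);
}
```

The jitter matters more than the exponent. Without it, every client that was dropped at the same instant comes back at the same instant, just later.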

Chaos engineering, but the cheap version

I’m not going to sell you on a full chaos platform. At the scale most teams operate, you don’t need one. What you do need is a weekly fault-injection ritual you actually do. Kill a pod during business hours. Add 500ms of latency to a downstream dependency. Drop 5% of packets between two services. See what breaks. Then write the runbook.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: federation-rules-latency
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - tournaments
    labelSelectors:
      app: federation-rules
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'

We ran a version of this once a fortnight. The first three runs found three things that would have paged us in production. After about six months the runs got boring. That’s when you know it’s working.

Takeaways

Cascading failure is almost always about the receivers, not the sender.
Set a deadline at the edge and propagate the remaining budget down. Per-call timeouts add up to a system without a clock.
Bulkhead by critical dependency. A poisoned pool should be one pool, not all of them.
Shallow checks on liveness, deep checks on readiness. They are not the same probe.
Client-side backoff with jitter, every time. The server cannot fix a thundering herd it didn’t create.
Run cheap chaos drills until they get boring. That’s the goal.

Thanks for reading. If you’ve got thoughts, send them my way.