The Retry Storm That DDoS'd Us

We took down our own platform with naive retry logic. Here's the 50x amplification math, the false starts, and the retry budgets and circuit breakers that actually fixed it.

09:31:14 local time. Market open plus 74 seconds. I was on call for a real-time trading and charting platform I architected a few years back, watching the Grafana board, and the connection count on our Socket.io gateway just… folded. Clients dropped, reconnected in the same breath, dropped again. Within ninety seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out went from around 80ms to three seconds. Charts froze mid-candle. Which, on a trading product, is roughly the worst failure mode there is.

Nobody was attacking us. We were attacking us. Our own clients, retrying with no backoff, hammering the same gateway pool that was already on fire. First proper retry storm I lived through. Not the last.

The amplification math nobody draws on the whiteboard

OK so here’s the thing about retries. People reason about them per call. “I’ll retry three times with one-second delays, what’s the worst that could happen.” The worst that could happen is the call graph.

Picture a user action that fans out across twenty-three internal services. Each service tries a failed downstream call up to three times. Reasonable, per-service. Now the deepest one starts failing. The service one hop up makes its three attempts. The one above that makes three of its own for each of those. Three more layers and you’re at 3^5 = 243 calls landing on the dying service for every one real user request. We measured roughly 50x amplification in practice, not 243, because some retries timed out before completing and a few circuit breakers actually did their job. 50x is still enough to take you down.
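For anyone who wants the whiteboard version, here is the arithmetic as a small sketch. The function name is mine, and it models the pure worst case: three attempts per layer, every attempt fails, nothing times out early.

```typescript
// Worst-case calls reaching the deepest service for one user request,
// assuming every layer exhausts its attempts and nothing short-circuits.
function worstCaseCalls(attemptsPerLayer: number, retryingLayers: number): number {
  return Math.pow(attemptsPerLayer, retryingLayers);
}

// Three attempts per layer, five retrying layers, as in the example above.
const theoretical = worstCaseCalls(3, 5); // 243

// Each extra layer multiplies the load by another factor of three.
const byDepth = [1, 2, 3, 4, 5].map((d) => worstCaseCalls(3, d)); // [3, 9, 27, 81, 243]
```

The measured 50x sits well below the theoretical 243, but the shape is the same: depth is the exponent.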

Same pattern, three different companies, three different stacks. Same shape every time.

The wrong fix I keep seeing

The instinct, when retries take you down, is to retry less. Or worse, retry on the client only. Both are bad. Less retry means more user-visible failures on the real transient blips. Client-only retry just moves the problem one layer out.

What you actually need: a retry budget, idempotency classification, jitter, and a circuit breaker that opens early and recovers gradually. Roughly in that order.

A retry budget you can carry across hops

The thing that finally clicked was treating retry as a budget that travels with the request, not a per-call knob. The simplest version, and the one I’ve shipped twice now, is a header. X-Retry-Budget: 2. Every retry decrements it before propagating. When it hits zero, you stop, no matter how many polite three-retry rules each service has configured locally.

import { context, propagation, trace } from "@opentelemetry/api";
import type { AxiosRequestConfig, AxiosResponse } from "axios";
import axios from "axios";

const RETRY_BUDGET_HEADER = "x-retry-budget";
const DEFAULT_BUDGET = 2;

type RetryableRequest = AxiosRequestConfig & {
  retryBudget?: number;
  idempotencyKey?: string;
};

export async function callDownstream<T>(
  url: string,
  config: RetryableRequest = {},
): Promise<AxiosResponse<T>> {
  const incomingBudget = readBudgetFromContext();
  const budget = Math.min(
    config.retryBudget ?? DEFAULT_BUDGET,
    incomingBudget ?? DEFAULT_BUDGET,
  );

  let lastErr: unknown;
  for (let attempt = 0; attempt <= budget; attempt++) {
    try {
      return await axios.request<T>({
        ...config,
        url,
        headers: {
          ...config.headers,
          [RETRY_BUDGET_HEADER]: String(budget - attempt),
          ...(config.idempotencyKey
            ? { "idempotency-key": config.idempotencyKey }
            : {}),
        },
        timeout: config.timeout ?? 2000,
      });
    } catch (err) {
      lastErr = err;
      if (!isRetriable(err) || attempt === budget) break;
      await sleep(backoffMs(attempt));
    }
  }
  trace.getActiveSpan()?.recordException(lastErr as Error);
  throw lastErr;
}

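// Exponential backoff with plus or minus 50% jitter, capped at 3s.
function backoffMs(attempt: number): number {
  const base = 200 * Math.pow(2, attempt);
  const jitter = base * (Math.random() - 0.5);
  return Math.min(3000, base + jitter);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Assumption: network errors, timeouts, and 5xx responses are worth retrying;
// 4xx responses are not.
function isRetriable(err: unknown): boolean {
  if (!axios.isAxiosError(err)) return false;
  const status = err.response?.status;
  return status === undefined || status >= 500;
}

// Assumption: inbound middleware (not shown) has copied the x-retry-budget
// header into OpenTelemetry baggage under the same key.
function readBudgetFromContext(): number | undefined {
  const raw = propagation
    .getBaggage(context.active())
    ?.getEntry(RETRY_BUDGET_HEADER)?.value;
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isNaN(parsed) ? undefined : Math.max(0, parsed);
}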

Two things to notice. The budget a service forwards shrinks with every retry it makes, and downstream services inherit that smaller number, so deep services can’t kick off their own retry universe. And the backoff is jittered, not fixed. Fixed backoff is a thundering herd ninety seconds after every brownout.
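To get a feel for how much the header buys you, here is a rough model of the scheme in callDownstream. It assumes every call fails, every service uses the default budget of 2, and attempt k forwards a header of budget minus k. The function names are mine and this is a counting exercise, not a measurement.

```typescript
// Counts worst-case calls reaching the deepest service when every call fails.
// `budget` is the x-retry-budget value a service received (or its default),
// `hops` is how many retrying layers sit above the deepest service.
function callsWithBudget(budget: number, hops: number): number {
  if (hops === 0) return 1; // the call that lands on the deepest service
  let total = 0;
  // Attempt k forwards a header of budget - k, mirroring callDownstream.
  for (let attempt = 0; attempt <= budget; attempt++) {
    total += callsWithBudget(budget - attempt, hops - 1);
  }
  return total;
}

// Baseline: every layer makes a fixed number of attempts, no shared budget.
function callsWithoutBudget(attempts: number, hops: number): number {
  return Math.pow(attempts, hops);
}

const capped = callsWithBudget(2, 5); // 21
const uncapped = callsWithoutBudget(3, 5); // 243
```

Under those assumptions the header does not eliminate amplification, since first attempts still forward the full budget. It does turn exponential growth into something much flatter: 21 calls instead of 243 across five layers.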

Jitter on the client, not just the server

Back to that trading-platform morning. My first wrong fix was operational: scale the gateway pool 3x, straight to nine pods, via kubectl scale. New pods came online, hit the reconnect storm head-on, and went CPU-bound within twenty seconds. I was feeding the fire. Worse, more pods meant more partial-success reconnects. Clients would briefly land on a healthy pod, get a connection-established event, then drop again when that pod saturated. That brief connection counted as a “success” in their retry logic and reset their backoff.

The real fix happened in two places in parallel. First, an emergency client-side push through a remote-config channel we’d built for exactly this kind of moment. Jittered exponential backoff on reconnects: min 200ms, max 30s, factor 2, plus or minus 50% jitter. Second, a per-IP connection rate limit at nginx, set tight at three new connections per second per IP. Within roughly eight minutes the connection pool stabilized and tick fan-out came back under 200ms.
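The reconnect policy above is small enough to sketch. This is not the code we pushed, just the same numbers (min 200ms, max 30s, factor 2, plus or minus 50% jitter) as a pure function. The interface and function names are illustrative, and the random source is injectable only so the function can be tested.

```typescript
interface ReconnectBackoff {
  minMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // 0.5 means plus or minus 50%
}

const RECONNECT: ReconnectBackoff = { minMs: 200, maxMs: 30_000, factor: 2, jitter: 0.5 };

// attempt is zero-based: 0 is the first reconnect after a drop.
function reconnectDelayMs(
  attempt: number,
  cfg: ReconnectBackoff = RECONNECT,
  random: () => number = Math.random,
): number {
  const base = Math.min(cfg.maxMs, cfg.minMs * Math.pow(cfg.factor, attempt));
  // Map random() in [0, 1) to a multiplier in [1 - jitter, 1 + jitter).
  const multiplier = 1 + cfg.jitter * (2 * random() - 1);
  // Cap after jitter so no delay ever exceeds maxMs.
  return Math.round(Math.min(cfg.maxMs, base * multiplier));
}
```

If you are on socket.io-client, its reconnectionDelay, reconnectionDelayMax, and randomizationFactor options cover roughly the same ground. Whichever way you wire it, consider resetting the attempt counter only after a connection has stayed up for a while, so a connect-then-drop does not count as a success.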

Around fourteen minutes of degraded tick delivery during one of the most-watched windows of the trading week. Hardened the remote-config channel afterwards and put a kill-switch on aggressive reconnection into the client release pipeline. One-sentence lesson: autoscale is not a fix for a self-amplifying client bug.

Classify your calls before you retry them

You can’t retry safely if you don’t know what’s idempotent. Most teams have not actually classified their RPCs that way.

Three buckets, when I land on a new codebase:

Safe to retry: GETs, anything functionally pure.
Safe to retry with an idempotency key: POSTs that take a client-generated key and dedupe server-side.
Not safe to retry at all: writes against systems you don’t control (Apple, Google, Stripe in some flows). Retry only after read-after-write against the upstream’s source of truth.
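The three buckets translate into a small lookup that a retry wrapper can consult before it tries again. A minimal sketch: the type names and the two boolean flags are my own shorthand for illustration, not fields from any real framework.

```typescript
type RetryClass = "safe" | "safe-with-key" | "never";

interface CallSpec {
  method: "GET" | "POST";
  acceptsIdempotencyKey: boolean; // server dedupes on a client-generated key
  externalWrite: boolean; // write against a system you don't control
}

function classify(call: CallSpec): RetryClass {
  // External writes are never blindly retried, whatever the verb.
  if (call.externalWrite) return "never";
  if (call.method === "GET") return "safe";
  if (call.acceptsIdempotencyKey) return "safe-with-key";
  // A POST that cannot dedupe does not get retried.
  return "never";
}
```

The default matters more than the happy path: anything unclassified falls through to "never".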

That last bucket is where I got bitten again. Branded-mobile-app pipeline at the creator platform I worked at. Auto-retry on 5xx for Apple’s submission endpoint, reasonable. Then we extended it to retry on “stuck” state, where the build still showed Waiting for Review on App Store Connect after a threshold. That extension treated 200 OK as truth. Apple was silently throttling, returning 200s with normal-looking bodies but dropping the submission. Our retry then submitted again. A bunch of customer apps ended up with two competing review records and conflicting metadata.

Real fix: pulled the auto-retry on stuck state. Added a circuit breaker around the submit step that verifies state via a separate GET against the App Store Connect resource, not via the response of the POST. Wrote a reconciliation job with an idempotency key derived from app_id + version + git_sha.
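The shape of that fix, stripped of anything Apple-specific, looks roughly like this. The readState and submit functions are hypothetical stand-ins that you would back with your own upstream client, and the key format (colon-joined, in that order) is an assumption.

```typescript
// Idempotency key derived from app_id + version + git_sha.
// The separator and field order are assumptions.
function submissionKey(appId: string, version: string, gitSha: string): string {
  return [appId, version, gitSha].join(":");
}

type RemoteState = "absent" | "present";

interface SubmitDeps {
  // Hypothetical wrappers around the upstream API; not real client calls.
  readState: (key: string) => Promise<RemoteState>; // separate GET
  submit: (key: string) => Promise<void>; // the POST
}

// Returns true only when a read confirms the submission exists upstream.
async function submitVerified(key: string, deps: SubmitDeps): Promise<boolean> {
  // Read first: never resubmit something the upstream already has.
  if ((await deps.readState(key)) === "present") return true;
  await deps.submit(key);
  // Ignore the POST response; trust only the follow-up read.
  return (await deps.readState(key)) === "present";
}
```

A false result here means "not confirmed", not "failed". What to do next is a decision for the reconciliation job, not for an automatic retry loop.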

When the upstream is human-moderated, never trust the response of a write. Read-after-write against their source of truth. Sticky note. Still there.

Circuit breaker config that actually opens early

Most circuit-breaker tutorials show you the API and skip the config. The config is the entire game. Roughly the shape I run now:

breakers:
  - name: payments-create-charge
    failure_threshold: 0.5
    minimum_requests: 20
    rolling_window_seconds: 10
    open_duration_seconds: 5
    half_open_max_requests: 3
    half_open_required_successes: 2
    timeouts:
      total_ms: 1500
      per_attempt_ms: 800
    classify_as_failure:
      - http_5xx
      - timeout
      - connection_reset
    do_not_count_as_failure:
      - http_409_conflict
      - http_422_unprocessable

The half-open phase is the one people get wrong. Flip straight from open back to closed on the first success and you’ll oscillate. Require multiple successes in a half-open window before closing. And do not classify 4xx as failure unless you mean it. A 422 is the upstream telling you your payload is bad. Retrying that is just self-DDoSing politely.
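To make the half-open behavior concrete, here is a minimal breaker driven by the same knobs as the config above. It is a sketch, not the breaker we run: the class and field names are illustrative, timeouts are left out, and the clock is injectable only so it can be tested.

```typescript
type State = "closed" | "open" | "half-open";

interface BreakerConfig {
  failureThreshold: number; // failure rate in [0, 1]
  minimumRequests: number;
  rollingWindowMs: number;
  openDurationMs: number;
  halfOpenMaxRequests: number;
  halfOpenRequiredSuccesses: number;
}

class CircuitBreaker {
  private state: State = "closed";
  private samples: { at: number; ok: boolean }[] = [];
  private openedAt = 0;
  private halfOpenAdmitted = 0;
  private halfOpenSuccesses = 0;

  constructor(private cfg: BreakerConfig, private now: () => number = Date.now) {}

  // Call before each request; false means fail fast without calling downstream.
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cfg.openDurationMs) {
      this.state = "half-open";
      this.halfOpenAdmitted = 0;
      this.halfOpenSuccesses = 0;
    }
    if (this.state === "open") return false;
    if (this.state === "half-open") {
      if (this.halfOpenAdmitted >= this.cfg.halfOpenMaxRequests) return false;
      this.halfOpenAdmitted++;
    }
    return true;
  }

  // Call after each request. Pass ok=false only for 5xx, timeouts, and resets.
  record(ok: boolean): void {
    if (this.state === "half-open") {
      if (!ok) {
        this.trip(); // any failure while probing re-opens the breaker
        return;
      }
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.cfg.halfOpenRequiredSuccesses) {
        this.state = "closed";
        this.samples = [];
      }
      return;
    }
    if (this.state === "open") return; // late result from before the trip
    const cutoff = this.now() - this.cfg.rollingWindowMs;
    this.samples = this.samples.filter((s) => s.at > cutoff);
    this.samples.push({ at: this.now(), ok });
    const failures = this.samples.filter((s) => !s.ok).length;
    if (
      this.samples.length >= this.cfg.minimumRequests &&
      failures / this.samples.length >= this.cfg.failureThreshold
    ) {
      this.trip();
    }
  }

  current(): State {
    return this.state;
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
  }
}
```

The caller decides what counts as a failure before calling record, which is where the 409 and 422 exclusions live. A single success in half-open does not close the breaker; it takes the configured number of successes, and any failure during probing sends it straight back to open.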

What we put in place after that morning

Standing rules I now bring to every new codebase:

Retry budget is a header that travels with the request. Per-service caps still exist, but the budget caps the whole call graph.
Every retryable POST takes an idempotency key. If a service can’t accept one, it doesn’t get retried.
Backoff is jittered, exponential, capped. Plus or minus 50% is a good default.
Circuit breakers open on failure rate over a rolling window, not a fixed count. Half-open requires multiple successes.
Client reconnection has a kill-switch behind remote config. You’ll need it eventually.

Takeaways

Retries compose multiplicatively across hops. Reason about the graph, not the call.
A retry budget header is the cheapest way to cap blast radius without rewriting anyone’s service.
Jitter belongs on every backoff, especially client reconnection. Fixed backoff is a thundering herd waiting for a trigger.
Idempotency keys are a precondition for safe retry, not a nice-to-have.
Open circuit breakers early, recover gradually. Half-open needs multiple successes.
When upstream is human-moderated, read-after-write against their source of truth. Never trust the response of a write.

Thanks for reading. If you’ve got thoughts, send them my way.