We took down our own platform with naive retry logic. Here's the 50x amplification math, the false starts, and the retry budgets and circuit breakers that actually fixed it.
09:31:14 local time. Market open plus 74 seconds. I was on call for a real-time trading and charting platform I architected a few years back, watching the Grafana board, and the connection count on our Socket.io gateway just… folded. Clients dropped, reconnected in the same breath, dropped again. Within ninety seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out went from around 80ms to three seconds. Charts froze mid-candle. Which, on a trading product, is roughly the worst failure mode there is.
Nobody was attacking us. We were attacking us. Our own clients, retrying with no backoff, hammering the same gateway pool that was already on fire. First proper retry storm I lived through. Not the last.
OK so here’s the thing about retries. People reason about them per call. “I’ll retry three times with one-second delays, what’s the worst that could happen.” The worst that could happen is the call graph.
Picture a user action that fans out across twenty-three internal services. Each service gives a failed downstream call three attempts. Reasonable, per-service. Now the deepest one starts failing. The service one hop up makes its three attempts. The one above that makes three of its own for each of those. Three more layers and you’re at 3^5 = 243 calls landing on the dying service for every one real user request. (Count it as one original call plus three retries per layer and the ceiling is 4^5 = 1,024.) We measured roughly 50x amplification in practice, not 243, because some retries timed out before completing and a few circuit breakers actually did their job. 50x is still enough to take you down.
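The multiplication is easy to sanity-check. A minimal sketch, assuming every retrying layer makes the same number of attempts and every attempt fails; the function name and parameters are illustrative, not something from our codebase:

```typescript
// Worst-case number of calls that reach the failing service for one user
// request, assuming each retrying layer makes `attemptsPerLayer` attempts
// and every one of them fails.
function worstCaseAmplification(
  attemptsPerLayer: number,
  retryingLayers: number,
): number {
  if (attemptsPerLayer < 1 || retryingLayers < 0) {
    throw new RangeError("attemptsPerLayer must be >= 1, retryingLayers >= 0");
  }
  return attemptsPerLayer ** retryingLayers;
}

const threeAttempts = worstCaseAmplification(3, 5); // 243
const oneCallPlusThreeRetries = worstCaseAmplification(4, 5); // 1024
```

Either way you count it, the exponent is the depth of the call graph, which is why the per-call view hides the problem.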
Same pattern, three different companies, three different stacks. Same shape every time.
The instinct, when retries take you down, is to retry less. Or worse, retry on the client only. Both are bad. Less retry means more user-visible failures on the real transient blips. Client-only retry just moves the problem one layer out.
What you actually need: a retry budget, idempotency classification, jitter, and a circuit breaker that opens early and recovers gradually. Roughly in that order.
The thing that finally clicked was treating retry as a budget that travels with the request, not a per-call knob. The simplest version, and the one I’ve shipped twice now, is a header. X-Retry-Budget: 2. Every retry decrements it before propagating. When it hits zero, you stop, no matter how many polite three-retry rules each service has configured locally.
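    import { context, propagation, trace } from "@opentelemetry/api";
    import type { AxiosRequestConfig, AxiosResponse } from "axios";
    import axios from "axios";

    const RETRY_BUDGET_HEADER = "x-retry-budget";
    const DEFAULT_BUDGET = 2;

    type RetryableRequest = AxiosRequestConfig & {
      retryBudget?: number;
      idempotencyKey?: string;
    };

    export async function callDownstream<T>(
      url: string,
      config: RetryableRequest = {},
    ): Promise<AxiosResponse<T>> {
      // Never exceed the budget the caller handed us, whatever the local config says.
      const incomingBudget = readBudgetFromContext();
      const budget = Math.min(
        config.retryBudget ?? DEFAULT_BUDGET,
        incomingBudget ?? DEFAULT_BUDGET,
      );
      let lastErr: unknown;
      for (let attempt = 0; attempt <= budget; attempt++) {
        try {
          return await axios.request<T>({
            ...config,
            url,
            headers: {
              ...config.headers,
              // Forward what is left, so every retry shrinks the downstream budget.
              [RETRY_BUDGET_HEADER]: String(budget - attempt),
              ...(config.idempotencyKey
                ? { "idempotency-key": config.idempotencyKey }
                : {}),
            },
            timeout: config.timeout ?? 2000,
          });
        } catch (err) {
          lastErr = err;
          if (!isRetriable(err) || attempt === budget) break;
          await sleep(backoffMs(attempt));
        }
      }
      trace.getActiveSpan()?.recordException(lastErr as Error);
      throw lastErr;
    }

    // Exponential backoff from 200ms with plus or minus 50% jitter, capped at 3s.
    function backoffMs(attempt: number): number {
      const base = 200 * Math.pow(2, attempt);
      const jitter = base * (Math.random() - 0.5);
      return Math.min(3000, base + jitter);
    }

    // The helpers below are assumptions added so the snippet compiles; adapt them.
    // Assumes server middleware copied the incoming x-retry-budget header into
    // OpenTelemetry baggage under the same key.
    function readBudgetFromContext(): number | undefined {
      const raw = propagation
        .getBaggage(context.active())
        ?.getEntry(RETRY_BUDGET_HEADER)?.value;
      if (raw === undefined) return undefined;
      const parsed = Number.parseInt(raw, 10);
      return Number.isNaN(parsed) ? undefined : Math.max(0, parsed);
    }

    // Retry only failures that look transient: no response at all, or a 5xx.
    function isRetriable(err: unknown): boolean {
      if (!axios.isAxiosError(err)) return false;
      return err.response === undefined || err.response.status >= 500;
    }

    function sleep(ms: number): Promise<void> {
      return new Promise((resolve) => setTimeout(resolve, ms));
    }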
Two things to notice. The budget shrinks as the request crosses service boundaries, so deep services can’t kick off their own retry universe. And the backoff is jittered, not fixed. Fixed backoff keeps every client on the same schedule, which gets you a thundering herd ninety seconds after every brownout.
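To put a number on what the travelling budget buys you, here is a back-of-the-envelope sketch. It assumes the header semantics from the code above (attempt k forwards `budget - k`), that every attempt fails, and that every layer retries until its budget is spent. The function name is made up for illustration.

```typescript
// Worst-case calls that reach the service `hops` levels down when each
// attempt forwards the remaining budget to the next hop.
function worstCaseCallsWithBudget(hops: number, budget: number): number {
  if (hops === 0) return 1; // we are the failing service: one call received
  let total = 0;
  // Attempt k goes out carrying a remaining budget of (budget - k).
  for (let attempt = 0; attempt <= budget; attempt++) {
    total += worstCaseCallsWithBudget(hops - 1, budget - attempt);
  }
  return total;
}

const fiveHopsBudgetTwo = worstCaseCallsWithBudget(5, 2); // 21
```

Five hops with a budget of 2 tops out at 21 calls on the dying service, against 243 when every layer independently gets three attempts. It is not zero amplification, but it grows polynomially with depth instead of exponentially.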
Back to that trading-platform morning. My first wrong fix was operational: scale the gateway pool 3x via kubectl scale straight to nine pods. New pods came online, hit the reconnect storm head-on, went CPU-bound within twenty seconds. I was feeding the fire. Worse, more pods meant more partial-success reconnects. Clients would briefly land on a healthy pod, get a connection established event, then drop again when that pod saturated, which counted as a “success” in their retry logic and reset their backoff.
The real fix happened in two places in parallel. First, an emergency client-side push through a remote-config channel we’d built for exactly this kind of moment. Jittered exponential backoff on reconnects: min 200ms, max 30s, factor 2, plus or minus 50% jitter. Second, a per-IP connection rate limit at nginx, set tight at three new connections per second per IP. Within roughly eight minutes the connection pool stabilized and tick fan-out came back under 200ms.
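For reference, a rough sketch of that reconnect policy, using the numbers above (min 200ms, max 30s, factor 2, plus or minus 50% jitter). The names are illustrative, and the "only reset after the connection has stayed up for a while" rule with its 10-second default is my assumption about how to avoid the reset-on-brief-success trap, not the exact code we pushed.

```typescript
interface BackoffOptions {
  minMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // 0.5 means plus or minus 50%
}

const RECONNECT_BACKOFF: BackoffOptions = {
  minMs: 200,
  maxMs: 30000,
  factor: 2,
  jitter: 0.5,
};

// Delay before reconnect attempt number `attempt` (0-based).
function reconnectDelayMs(
  attempt: number,
  opts: BackoffOptions = RECONNECT_BACKOFF,
  random: () => number = Math.random,
): number {
  const base = Math.min(opts.maxMs, opts.minMs * Math.pow(opts.factor, attempt));
  const spread = base * opts.jitter * (2 * random() - 1);
  return Math.min(opts.maxMs, Math.max(0, Math.round(base + spread)));
}

// Tracks attempts and only resets the backoff once a connection has proven
// stable, so a connect-then-drop on a saturated pod does not count as success.
class ReconnectBackoff {
  private attempt = 0;
  private connectedAt: number | null = null;
  private readonly stableAfterMs: number;
  private readonly now: () => number;
  private readonly random: () => number;

  constructor(
    stableAfterMs: number = 10000, // hypothetical threshold
    now: () => number = Date.now,
    random: () => number = Math.random,
  ) {
    this.stableAfterMs = stableAfterMs;
    this.now = now;
    this.random = random;
  }

  onConnected(): void {
    this.connectedAt = this.now();
  }

  // Call on every drop or failed connect; returns how long to wait.
  nextDelayMs(): number {
    const wasStable =
      this.connectedAt !== null &&
      this.now() - this.connectedAt >= this.stableAfterMs;
    this.connectedAt = null;
    if (wasStable) this.attempt = 0;
    const delay = reconnectDelayMs(this.attempt, RECONNECT_BACKOFF, this.random);
    this.attempt++;
    return delay;
  }
}
```

If you are on Socket.io, the client's `reconnectionDelay`, `reconnectionDelayMax`, and `randomizationFactor` options cover the delay part; the stability check is the piece you would still have to add yourself.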
Around fourteen minutes of degraded tick delivery during one of the most-watched windows of the trading week. Hardened the remote-config channel afterwards and put a kill-switch on aggressive reconnection into the client release pipeline. One-sentence lesson: autoscale is not a fix for a self-amplifying client bug.
You can’t retry safely if you don’t know what’s idempotent. Most teams have not actually classified their RPCs that way.
Three buckets, when I land on a new codebase:
That last bucket is where I got bitten again. Branded-mobile-app pipeline at the creator platform I worked at. Auto-retry on 5xx for Apple’s submission endpoint, reasonable. Then we extended it to retry on “stuck” state, where the build still showed Waiting for Review on App Store Connect after a threshold. That extension treated 200 OK as truth. Apple was silently throttling, returning 200s with normal-looking bodies but dropping the submission. Our retry then submitted again. A bunch of customer apps ended up with two competing review records and conflicting metadata.
Real fix: pulled the auto-retry on stuck state. Added a circuit breaker around the submit step that verifies state via a separate GET against the App Store Connect resource, not via the response of the POST. Wrote a reconciliation job with an idempotency key derived from app_id + version + git_sha.
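A sketch of the two pieces, under assumptions: the key is a SHA-256 over the three fields (the hashing choice is mine; any stable derivation works), and `submit` and `readState` are hypothetical callbacks standing in for the POST and the separate GET. Nothing here is App Store Connect's actual API.

```typescript
import { createHash } from "node:crypto";

interface SubmissionIdentity {
  appId: string;
  version: string;
  gitSha: string;
}

// Same inputs always give the same key, so a re-run of the reconciliation job
// can recognise a submission it has already handled.
function submissionIdempotencyKey(id: SubmissionIdentity): string {
  // JSON array encoding keeps field boundaries unambiguous.
  const canonical = JSON.stringify([id.appId, id.version, id.gitSha]);
  return createHash("sha256").update(canonical).digest("hex");
}

type SubmitOutcome = "already_submitted" | "submitted" | "unverified";

// Read before and after the write; never treat the write's response as truth.
// `readState` returns null when the upstream has no record of the submission.
async function submitWithVerification(
  submit: () => Promise<void>,
  readState: () => Promise<string | null>,
): Promise<SubmitOutcome> {
  if ((await readState()) !== null) return "already_submitted";
  await submit();
  return (await readState()) !== null ? "submitted" : "unverified";
}
```

An `unverified` outcome goes to reconciliation (or a human), not back into an automatic resubmit. In practice the post-write read probably needs a delay or a few polls, since the upstream may take a while to reflect the write.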
When the upstream is human-moderated, never trust the response of a write. Read-after-write against their source of truth. Sticky note. Still there.
Most circuit-breaker tutorials show you the API and skip the config. The config is the entire game. Roughly the shape I run now:
    breakers:
      - name: payments-create-charge
        failure_threshold: 0.5
        minimum_requests: 20
        rolling_window_seconds: 10
        open_duration_seconds: 5
        half_open_max_requests: 3
        half_open_required_successes: 2
        timeouts:
          total_ms: 1500
          per_attempt_ms: 800
        classify_as_failure:
          - http_5xx
          - timeout
          - connection_reset
        do_not_count_as_failure:
          - http_409_conflict
          - http_422_unprocessable
The half-open phase is the one people get wrong. Flip straight from open back to closed on the first success and you’ll oscillate. Require multiple successes in a half-open window before closing. And do not classify 4xx as failure unless you mean it. A 422 is the upstream telling you your payload is bad. Retrying that is just self-DDoSing politely.
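Here is a minimal sketch of a breaker driven by those config fields, mostly to show the half-open bookkeeping. It is an illustration, not the library we run: the class, its method names, and the choice to trip at "greater than or equal to" the threshold are all assumptions. Callers are expected to pass `ok = false` only for outcomes classified as failures (5xx, timeout, connection reset), never for a 409 or 422.

```typescript
type BreakerState = "closed" | "open" | "half_open";

interface BreakerConfig {
  failureThreshold: number; // fraction of failures in the window, e.g. 0.5
  minimumRequests: number;
  rollingWindowMs: number;
  openDurationMs: number;
  halfOpenMaxRequests: number;
  halfOpenRequiredSuccesses: number;
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private samples: { at: number; ok: boolean }[] = [];
  private openedAt = 0;
  private halfOpenAdmitted = 0;
  private halfOpenSuccesses = 0;
  private readonly cfg: BreakerConfig;
  private readonly now: () => number;

  constructor(cfg: BreakerConfig, now: () => number = Date.now) {
    this.cfg = cfg;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  // True if the caller may send a request right now.
  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cfg.openDurationMs) return false;
      this.state = "half_open"; // open duration elapsed: start probing
      this.halfOpenAdmitted = 0;
      this.halfOpenSuccesses = 0;
    }
    if (this.state === "half_open") {
      if (this.halfOpenAdmitted >= this.cfg.halfOpenMaxRequests) return false;
      this.halfOpenAdmitted++;
    }
    return true;
  }

  // Report the outcome of a request that allowRequest() admitted.
  record(ok: boolean): void {
    if (this.state === "open") return; // late result from before the trip
    if (this.state === "half_open") {
      if (!ok) {
        this.trip(); // any probe failure reopens the breaker
      } else if (++this.halfOpenSuccesses >= this.cfg.halfOpenRequiredSuccesses) {
        this.state = "closed"; // close only after several successes
        this.samples = [];
      }
      return;
    }
    const t = this.now();
    this.samples.push({ at: t, ok });
    this.samples = this.samples.filter((s) => t - s.at < this.cfg.rollingWindowMs);
    const failures = this.samples.filter((s) => !s.ok).length;
    if (
      this.samples.length >= this.cfg.minimumRequests &&
      failures / this.samples.length >= this.cfg.failureThreshold
    ) {
      this.trip();
    }
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.samples = [];
  }
}
```

One gap worth knowing about: if a half-open probe never reports back, this version stays half-open, so a real implementation should time probes out and count that as a failure.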
Standing rules I now bring to every new codebase:
Thanks for reading. If you’ve got thoughts, send them my way.