How I think about capacity planning for production web apps: load modeling, bottleneck hunting, headroom buffers, cost curves, and review cadence.
09:31:14 local. 74 seconds after the London market opened. Every gateway pod on the real-time trading platform I’d architected was pinned at 100% CPU, p99 tick fan-out was over 3 seconds, and clients were watching stale prices on charts they’d paid for. We’d sized the cluster for 10x peak. The slide said we were fine. We were not fine.
That morning changed how I do capacity planning. The headroom number was right. The model behind it wasn’t. We’d forecast RPS and ignored shape, and “10x peak” against the wrong shape is a number, not a plan.
So here’s how I think about it now. Capacity planning is a load-modeling problem, not a percentage-of-peak heuristic. Autoscaling is a useful tactic. It isn’t a capacity plan. If your plan is “set HPA to 70% CPU and trust the cloud,” what you’ve actually said is “I don’t understand my load.”
A request count is the thinnest possible model of your traffic. Two systems can do the same RPS and have entirely different capacity profiles.
What I care about, in order:
Here’s a small TypeScript helper I use to sanity-check pod sizing against a request mix, instead of just rps_per_pod:
type RequestKind = "read_light" | "read_heavy" | "write" | "fanout";

interface MixModel {
  rps: number;
  mix: Record<RequestKind, number>;    // shares, sum to 1
  costMs: Record<RequestKind, number>; // p99 CPU-ms per request
  targetUtil: number;                  // 0.6 for stateless services
}

export function podsNeeded(m: MixModel, cpuMsPerPodPerSec = 1000): number {
  // Average CPU-ms per request, weighted by each kind's share of traffic.
  const weightedCost = (Object.keys(m.mix) as RequestKind[]).reduce(
    (acc, k) => acc + m.mix[k] * m.costMs[k],
    0,
  );
  const totalCpuMs = m.rps * weightedCost;
  // Only a fraction of each pod's CPU budget is usable at the target utilization.
  const usable = cpuMsPerPodPerSec * m.targetUtil;
  return Math.ceil(totalCpuMs / usable);
}
That’s not the whole answer, but it forces you to write down costMs per request type. When you can’t fill in read_heavy, that’s a real finding. Nobody on the team knows what the expensive request actually costs, which means you can’t size for it.
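To make the "same RPS, different capacity profile" point concrete, here's a minimal sketch. The cost figures and both mixes are made-up numbers for illustration, not measurements from a real system, and `cpuMsPerSec` is a hypothetical helper name.

```typescript
type Kind = "read_light" | "read_heavy" | "write" | "fanout";
type Shares = Record<Kind, number>;

// Assumed p99 CPU-ms per request for each kind (illustrative values only).
const assumedCostMs: Record<Kind, number> = {
  read_light: 2,
  read_heavy: 40,
  write: 15,
  fanout: 120,
};

// CPU-ms consumed per wall-clock second at a given RPS and request mix.
export function cpuMsPerSec(rps: number, mix: Shares): number {
  const kinds = Object.keys(mix) as Kind[];
  const total = kinds.reduce((acc, k) => acc + mix[k], 0);
  if (Math.abs(total - 1) > 1e-6) {
    throw new Error(`mix shares sum to ${total}, expected 1`);
  }
  return kinds.reduce((acc, k) => acc + rps * mix[k] * assumedCostMs[k], 0);
}

// Two hypothetical services doing identical RPS with different shapes.
export const mostlyReads: Shares = { read_light: 0.9, read_heavy: 0.05, write: 0.05, fanout: 0 };
export const fanoutHeavy: Shares = { read_light: 0.6, read_heavy: 0.1, write: 0.1, fanout: 0.2 };

console.log(cpuMsPerSec(1000, mostlyReads));
console.log(cpuMsPerSec(1000, fanoutHeavy));
```

With these assumed costs, 1,000 RPS works out to about 4,550 CPU-ms per second for the first mix and about 30,700 for the second. Same RPS, roughly 6.7x the CPU.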
CPU is almost always a symptom. The real bottleneck on a web app is one of: DB connection exhaustion, IOPS ceiling, Kafka consumer lag, a downstream rate limit, GC pauses, socket file descriptors, or one ridiculous serializer that nobody profiled.
That is the whole reason I distrust “scale up to fix it.” You can scale a layer that wasn’t the problem and feel good for ten minutes while the actual cause keeps burning.
The discipline is: instrument every layer before you scale anything. Connections, IOPS, replica lag, queue depth, downstream timeouts, GC pauses. When you run a load test, watch all of them. The bottleneck is whichever one saturates first. Fix that, then re-run.
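"Whichever one saturates first" can be checked mechanically once every layer exports a value and a ceiling. This is a rough sketch under my own assumptions: the `LayerSample` shape, the layer names, and the 90% threshold are all hypothetical, and real data would come from whatever metrics store you use.

```typescript
// One sample per layer per scrape interval during a load test.
interface LayerSample {
  tSec: number;   // seconds since the test started
  layer: string;  // e.g. "db_connections", "iops", "queue_depth" (hypothetical names)
  value: number;  // observed value at that time
  limit: number;  // the ceiling for that layer
}

interface Saturation {
  layer: string;
  tSec: number;
  utilization: number;
}

// Returns the layer that crossed `threshold` of its limit earliest, or null
// if nothing saturated during the run.
export function firstSaturated(
  samples: LayerSample[],
  threshold = 0.9,
): Saturation | null {
  let first: Saturation | null = null;
  for (const s of samples) {
    if (s.limit <= 0) continue; // skip layers without a meaningful ceiling
    const utilization = s.value / s.limit;
    if (utilization >= threshold && (first === null || s.tSec < first.tSec)) {
      first = { layer: s.layer, tSec: s.tSec, utilization };
    }
  }
  return first;
}
```

If this reports `db_connections` at 90 seconds and CPU only at 120, adding pods would have scaled the layer that wasn't the problem.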
Here’s a k6 script I use as a starting point. It ramps, holds, and tags downstream calls with custom metrics so you can see which layer breaks first, not just that something broke:
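import http from "k6/http";
import { Trend, Counter } from "k6/metrics";
import { check } from "k6";

// Server-reported DB time per request; `true` marks the trend as a time metric.
const dbLatency = new Trend("downstream_db_ms", true);
// Counts 5xx responses seen by the load generator.
const upstreamErrors = new Counter("upstream_errors");

export const options = {
  scenarios: {
    ramp_and_hold: {
      executor: "ramping-arrival-rate",
      startRate: 200,
      timeUnit: "1s",
      preAllocatedVUs: 200,
      maxVUs: 4000,
      stages: [
        { duration: "2m", target: 2000 },  // ramp
        { duration: "10m", target: 2000 }, // hold
        { duration: "2m", target: 5000 },  // ramp again
        { duration: "10m", target: 5000 }, // hold at the higher rate
      ],
    },
  },
  thresholds: {
    http_req_failed: ["rate<0.01"],
    "http_req_duration{kind:read}": ["p(99)<400"],
    // This starting script sends no write traffic; add a request tagged
    // kind:write before relying on this threshold.
    "http_req_duration{kind:write}": ["p(99)<800"],
  },
};

export default function () {
  const res = http.get("https://api.internal/feed?cursor=latest", {
    tags: { kind: "read" },
  });
  // Header lookup in k6 is case-sensitive and, as far as I know, keys arrive
  // in Go's canonical casing ("X-Db-Ms"), so try both spellings.
  const dbMs = Number(res.headers["X-Db-Ms"] ?? res.headers["X-DB-Ms"] ?? 0);
  if (dbMs > 0) dbLatency.add(dbMs);
  if (res.status >= 500) upstreamErrors.add(1);
  check(res, { "status 2xx": (r) => r.status >= 200 && r.status < 300 });
  // No sleep() here: arrival-rate executors pace iterations themselves, and
  // sleeping only ties up VUs.
}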
The custom downstream_db_ms trend is the bit that matters. You don’t just see http p99 climb. You see which layer drove it.
Stateless and stateful services need different headroom. Stateless web pods, I want 2x headroom over forecast peak. Stateful tiers (DB writers, brokers, real-time gateways) get 3 to 5x, sized against the worst plausible failure mode rather than the steady state.
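Those multipliers translate directly into a capacity target and an implied utilization at forecast peak. A small sketch, where `capacityTarget` and `impliedUtilization` are hypothetical helper names and the default of 3x for stateful tiers is just the low end of the range above, not a recommendation:

```typescript
type Tier = "stateless" | "stateful";

// Capacity to provision for a forecast peak, in the same unit as the forecast
// (RPS, connections, messages per second, and so on).
export function capacityTarget(
  forecastPeak: number,
  tier: Tier,
  statefulFactor = 3, // anywhere in the 3 to 5x range is a judgment call
): number {
  if (forecastPeak < 0) throw new RangeError("forecastPeak must be non-negative");
  if (tier === "stateless") return forecastPeak * 2;
  if (statefulFactor < 3 || statefulFactor > 5) {
    throw new RangeError("statefulFactor is expected to be between 3 and 5");
  }
  return forecastPeak * statefulFactor;
}

// Utilization at forecast peak implied by a headroom multiplier:
// 2x means 50%, 5x means 20%.
export function impliedUtilization(multiplier: number): number {
  if (multiplier <= 0) throw new RangeError("multiplier must be positive");
  return 1 / multiplier;
}
```

Seeing 5x written as "20% utilized at peak" is usually what starts the cost conversation, which is the right conversation to have before the incident.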
Three failure modes I size for explicitly:
Capacity isn’t linear in cost. Different layers have different curves.
Reserved Instances change the slope under you entirely. At the creator economy platform I worked at, a hackathon I joined kicked off an audit of AWS RDS Aurora usage. Several instances were on-demand that had been steady-state for over a year. Planning the Reserved Instance commitments and rolling them out moved the cost curve significantly downward on that line. The technical decision wasn’t clever. The discipline was that someone actually sat down and modeled commitment against forecast.
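"Modeled commitment against forecast" is mostly arithmetic. Here's a simplified sketch of the break-even check. The prices are placeholders, not real AWS rates, the field names are my own, and it ignores upfront payment options and partial-term changes.

```typescript
interface CommitmentInput {
  onDemandHourly: number;    // placeholder on-demand price per instance-hour
  reservedHourly: number;    // placeholder effective reserved price per instance-hour
  hoursInTerm: number;       // e.g. 8760 for a one-year term
  expectedUsedHours: number; // forecast hours the instance will actually run
}

// Positive result: the commitment saves money over the term.
// Negative result: you would pay for reserved hours you never use.
export function commitmentSavings(i: CommitmentInput): number {
  const onDemandCost = i.onDemandHourly * i.expectedUsedHours;
  const reservedCost = i.reservedHourly * i.hoursInTerm; // paid whether used or not
  return onDemandCost - reservedCost;
}

// Fraction of the term the instance must run for the commitment to break even.
export function breakEvenUtilization(onDemandHourly: number, reservedHourly: number): number {
  if (onDemandHourly <= 0) throw new RangeError("onDemandHourly must be positive");
  return reservedHourly / onDemandHourly;
}
```

An instance that has been steady-state for a year sits well above any plausible break-even point, which is why that audit was an easy call once someone ran the numbers.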
Here’s the relevant Terraform piece I tend to ship, with autoscaling targets that are honest about headroom:
variable "reader_baseline" {
  type        = number
  description = "Steady-state reader count; autoscaling never goes below this."
}

variable "reader_class" {
  type        = string
  description = "Instance class used for every reader."
}

resource "aws_rds_cluster" "primary" {
  cluster_identifier      = "app-primary"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = "app"
  backup_retention_period = 14
  storage_encrypted       = true

  # Placeholder admin user (an assumption, adjust to your setup). RDS
  # generates the password and keeps it in Secrets Manager.
  master_username             = "app_admin"
  manage_master_user_password = true
}

resource "aws_rds_cluster_instance" "reader" {
  count              = var.reader_baseline
  identifier         = "app-reader-${count.index}"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = var.reader_class
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version
}

resource "aws_appautoscaling_target" "reader_count" {
  max_capacity       = var.reader_baseline * 3 # 3x burst headroom
  min_capacity       = var.reader_baseline
  resource_id        = "cluster:${aws_rds_cluster.primary.id}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  service_namespace  = "rds"
}

resource "aws_appautoscaling_policy" "reader_cpu" {
  name               = "reader-cpu-target"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.reader_count.resource_id
  scalable_dimension = aws_appautoscaling_target.reader_count.scalable_dimension
  service_namespace  = aws_appautoscaling_target.reader_count.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 50

    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }

    scale_in_cooldown  = 600 # slow to shrink
    scale_out_cooldown = 60  # fast to grow
  }
}
50% reader CPU target, not 70. Scale-out cooldown 60s, scale-in 600s. Slow to shrink, fast to grow. That asymmetry is the whole point.
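To see why the asymmetry matters, here's a toy model of a cooldown-gated scaler. It is a deliberate simplification and not how AWS Application Auto Scaling is implemented: the proportional `desired` formula, the `step` function, and the state shape are all my own assumptions.

```typescript
interface Cooldowns {
  scaleOutSec: number;
  scaleInSec: number;
}

interface ScalerState {
  replicas: number;
  lastScaleOutAt: number; // seconds; NEGATIVE_INFINITY if never
  lastScaleInAt: number;  // seconds; NEGATIVE_INFINITY if never
}

// One evaluation tick: scale out as soon as the scale-out cooldown allows,
// scale in only after a full scale-in cooldown since any scaling activity.
export function step(
  state: ScalerState,
  nowSec: number,
  cpuPct: number,
  targetPct: number,
  min: number,
  max: number,
  cd: Cooldowns,
): ScalerState {
  const raw = Math.ceil((state.replicas * cpuPct) / targetPct);
  const desired = Math.min(max, Math.max(min, raw));
  if (desired > state.replicas && nowSec - state.lastScaleOutAt >= cd.scaleOutSec) {
    return { ...state, replicas: desired, lastScaleOutAt: nowSec };
  }
  const quietFor = nowSec - Math.max(state.lastScaleOutAt, state.lastScaleInAt);
  if (desired < state.replicas && quietFor >= cd.scaleInSec) {
    return { ...state, replicas: desired, lastScaleInAt: nowSec };
  }
  return state; // inside a cooldown, or already at the desired size
}
```

In this model a spike at t=0 doubles the readers immediately, and a dip two minutes later does nothing until the 600-second window has passed. A short lull doesn't cost you capacity right before the next burst.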
A flat-rate load test will pass and your real traffic will still tip you over. Production traffic has hot keys, cache hit rates, retry storms, geographic skew. A synthetic test that pretends traffic is uniform is testing a system you don’t run.
What I do instead: pull a slice of prod logs (anonymized), bucket by endpoint and request shape, and replay that distribution. Honestly, this is the part most teams skip. They run k6 with a uniform script against a happy path, call it capacity verification, and then prod surprises them.
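The bucketing step can be sketched in a few lines. This assumes the anonymized logs have already been parsed into records; the `LogRecord` fields and the `endpoint|shape` key format are hypothetical, and a real replay would also preserve timing and payload sizes.

```typescript
// One anonymized request from a production log slice (hypothetical fields).
interface LogRecord {
  endpoint: string; // e.g. "GET /feed"
  shape: string;    // e.g. "cache_hit", "cache_miss", "large_page"
}

// Share of traffic per (endpoint, shape) bucket.
export function buildDistribution(records: LogRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    const key = `${r.endpoint}|${r.shape}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const dist = new Map<string, number>();
  counts.forEach((count, key) => dist.set(key, count / records.length));
  return dist;
}

// Picks a bucket with probability equal to its share. `rand` is injectable
// so a replay can be made deterministic.
export function sampleBucket(
  dist: Map<string, number>,
  rand: () => number = Math.random,
): string {
  const entries = Array.from(dist.entries());
  if (entries.length === 0) throw new Error("empty distribution");
  const r = rand();
  let cumulative = 0;
  for (const [key, share] of entries) {
    cumulative += share;
    if (r < cumulative) return key;
  }
  // Floating-point shortfall: fall back to the last bucket.
  return entries[entries.length - 1][0];
}
```

The load script then calls `sampleBucket` per iteration and issues the matching request, so the test sees the same skew production does instead of a uniform happy path.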
Capacity isn’t a one-shot exercise. Quarterly review minimum, monthly during high-growth phases. Ad-hoc reviews trigger on three things: a new feature with unknown fan-out, a vendor change (DB major version, broker upgrade, region change), or a pricing model change from a cloud or SaaS provider.
The capacity plan for a service belongs to the service owner, not “ops” or “the platform team.” Platform provides tooling, headroom defaults, and a quarterly forum. The squad owning the service makes the call. If platform owns the plan, nobody who actually understands the load is in the room.
A simple Datadog monitor I attach to every service owner’s dashboard:
name: "Capacity headroom below 30% - {{service.name}}"
type: query alert
# Check units before trusting this ratio: as far as I know, the Datadog
# kubelet integration reports kubernetes.cpu.usage.total in nanocores and
# kubernetes.cpu.limits in cores, so usage may need dividing by 1e9.
query: "avg(last_15m):avg:kubernetes.cpu.usage.total{service:$service} / avg:kubernetes.cpu.limits{service:$service} > 0.7"
message: |
  Service {{service.name}} has been above 70% CPU utilization for 15m,
  which means headroom is below 30%. This is the cue to review the
  capacity plan for this service.
  Runbook: https://runbooks.internal/capacity/$service
  Owner: {{service.owner}}
options:
  thresholds:
    critical: 0.7
    warning: 0.6
  notify_audit: true
  no_data_timeframe: 30
  notify_no_data: true
You want the review to start before you’re in the incident, not during.
Thanks for reading. If you’ve got thoughts, send them my way.