Capacity Planning for Web Applications

How I think about capacity planning for production web apps: load modeling, bottleneck hunting, headroom buffers, cost curves, and review cadence.

09:31:14 local. 74 seconds after the London market opened. Every gateway pod on the real-time trading platform I’d architected was pinned at 100% CPU, p99 tick fan-out latency was over 3 seconds, and clients were watching stale prices on charts they’d paid for. We’d sized the cluster for 10x peak. The slide said we were fine. We were not fine.

That morning changed how I do capacity planning. The headroom number was right. The model behind it wasn’t. We’d forecast RPS and ignored shape, and “10x peak” against the wrong shape is a number, not a plan.

So here’s how I think about it now. Capacity planning is a load-modeling problem, not a percentage-of-peak heuristic. Autoscaling is a useful tactic. It isn’t a capacity plan. If your plan is “set HPA to 70% CPU and trust the cloud,” what you’ve actually said is “I don’t understand my load.”

Forecast request shape, not just RPS

A request count is the thinnest possible model of your traffic. Two systems can do the same RPS and have entirely different capacity profiles.

What I care about, in order:

Read vs write ratio (different DB cost, different cache hit profile).
Burst vs sustained pattern (market open, Black Friday, viral push).
Fan-out factor per request (one read that hits seven downstreams isn’t one request).
Hot key concentration (90% of writes against 5% of rows is a different planet from uniform writes).
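
As a rough sketch, those four dimensions can be pulled out of a sample of request logs before any sizing happens. The record fields (tsMs, isWrite, downstreamCalls, key) and the 5% hot-key cutoff are assumptions for illustration, not a real schema. For simplicity this measures concentration over all requests; filter to writes first if write hot spots are the concern.

```typescript
// Hypothetical log record. Field names are assumptions, not a real schema.
interface RequestRecord {
  tsMs: number; // arrival time, epoch milliseconds
  isWrite: boolean;
  downstreamCalls: number; // calls this request made to other services or stores
  key: string; // row or entity the request touched
}

interface ShapeSummary {
  writeShare: number; // writes / all requests
  peakToMean: number; // busiest one-second bucket / mean requests per second
  meanFanOut: number; // downstream calls per request
  hotKeyShare: number; // share of requests hitting the hottest keys
}

function summarizeShape(records: RequestRecord[], hotKeyFraction = 0.05): ShapeSummary {
  if (records.length === 0) throw new Error("need at least one record");

  let writes = 0;
  let downstream = 0;
  let peak = 0;
  let minSec = Infinity;
  let maxSec = -Infinity;
  const perSecond: Record<number, number> = {};
  const perKey: Record<string, number> = {};

  for (const r of records) {
    if (r.isWrite) writes += 1;
    downstream += r.downstreamCalls;

    const sec = Math.floor(r.tsMs / 1000);
    perSecond[sec] = (perSecond[sec] ?? 0) + 1;
    peak = Math.max(peak, perSecond[sec]);
    minSec = Math.min(minSec, sec);
    maxSec = Math.max(maxSec, sec);

    perKey[r.key] = (perKey[r.key] ?? 0) + 1;
  }

  // Mean rate over the whole window, counting seconds with no traffic.
  const meanPerSecond = records.length / (maxSec - minSec + 1);

  // Sort keys by hit count and sum the top hotKeyFraction of them.
  const counts = Object.keys(perKey)
    .map((k) => perKey[k])
    .sort((a, b) => b - a);
  const hotKeys = Math.max(1, Math.ceil(counts.length * hotKeyFraction));
  let hotHits = 0;
  for (let i = 0; i < hotKeys; i++) hotHits += counts[i];

  return {
    writeShare: writes / records.length,
    peakToMean: peak / meanPerSecond,
    meanFanOut: downstream / records.length,
    hotKeyShare: hotHits / records.length,
  };
}
```

Two services with identical RPS can produce very different summaries here, which is the point: the summary, not the request count, is what feeds the sizing.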

Here’s a small TypeScript helper I use to sanity-check pod sizing against a request mix, instead of just rps_per_pod:

type RequestKind = "read_light" | "read_heavy" | "write" | "fanout";

interface MixModel {
  rps: number;
  mix: Record<RequestKind, number>; // shares, sum to 1
  costMs: Record<RequestKind, number>; // p99 CPU-ms per request
  targetUtil: number; // 0.6 for stateless services
}

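// Pods needed to serve m.rps at the given mix without exceeding targetUtil.
// cpuMsPerPodPerSec = 1000 assumes one full core per pod. Because costMs is
// a p99 figure, the result is a conservative estimate.
export function podsNeeded(m: MixModel, cpuMsPerPodPerSec = 1000): number {
  // CPU-ms consumed by the average request, weighted by traffic share.
  const weightedCost = (Object.keys(m.mix) as RequestKind[]).reduce(
    (acc, k) => acc + m.mix[k] * m.costMs[k],
    0,
  );
  const totalCpuMs = m.rps * weightedCost; // CPU-ms demanded per second
  const usable = cpuMsPerPodPerSec * m.targetUtil; // CPU-ms per pod per second we plan to use
  return Math.ceil(totalCpuMs / usable);
}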

That’s not the whole answer, but it forces you to write down costMs per request type. When you can’t fill in read_heavy, that’s a real finding. Nobody on the team knows what the expensive request actually costs, which means you can’t size for it.

Find the real bottleneck first

CPU is almost always a symptom. The real bottleneck on a web app is one of: DB connection exhaustion, IOPS ceiling, Kafka consumer lag, a downstream rate limit, GC pauses, socket file descriptors, or one ridiculous serializer that nobody profiled.

That is the whole reason I distrust “scale up to fix it.” You can scale a layer that wasn’t the problem and feel good for ten minutes while the actual cause keeps burning.

The discipline is: instrument every layer before you scale anything. Connections, IOPS, replica lag, queue depth, downstream timeouts, GC pauses. When you run a load test, watch all of them. The bottleneck is whichever one saturates first. Fix that, then re-run.
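
One way to make "whichever one saturates first" mechanical is to record each layer's usage against its limit at every load step, then pick the layer that crosses a saturation threshold at the lowest offered load. This is a minimal sketch: the layer names, the sample structure, and the 90% threshold are all assumptions.

```typescript
// One observation per layer per load step. Layer names are illustrative.
interface LayerSample {
  layer: string; // e.g. "db_connections", "iops", "gc_pause_ms"
  rps: number; // offered load when the sample was taken
  used: number; // observed value for the resource
  limit: number; // ceiling for that resource, in the same unit as used
}

interface Saturation {
  layer: string;
  rps: number;
  utilization: number;
}

// Returns the layer that reached the threshold at the lowest offered load,
// or undefined if nothing saturated during the test.
function firstToSaturate(samples: LayerSample[], threshold = 0.9): Saturation | undefined {
  let first: Saturation | undefined;
  for (const s of samples) {
    const utilization = s.used / s.limit;
    if (utilization < threshold) continue;
    const earlier = first === undefined || s.rps < first.rps;
    // At the same load step, prefer the layer that is further past its limit.
    const worseAtSameStep =
      first !== undefined && s.rps === first.rps && utilization > first.utilization;
    if (earlier || worseAtSameStep) {
      first = { layer: s.layer, rps: s.rps, utilization };
    }
  }
  return first;
}
```

If the answer is db_connections at 2,000 RPS while CPU only crosses the line at 5,000, adding pods would have scaled the wrong layer.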

Here’s a k6 script I use as a starting point. It ramps, holds, and tags downstream calls with custom metrics so you can see which layer breaks first, not just that something broke:

import http from "k6/http";
import { Trend, Counter } from "k6/metrics";
import { check } from "k6";

// Custom metrics, so the summary shows which layer moved, not just http p99.
const dbLatency = new Trend("downstream_db_ms", true); // true = values are times
const upstreamErrors = new Counter("upstream_errors");

export const options = {
  scenarios: {
    ramp_and_hold: {
      executor: "ramping-arrival-rate",
      startRate: 200,
      timeUnit: "1s",
      preAllocatedVUs: 200,
      maxVUs: 4000,
      stages: [
        { duration: "2m", target: 2000 },
        { duration: "10m", target: 2000 },
        { duration: "2m", target: 5000 },
        { duration: "10m", target: 5000 },
      ],
    },
  },
  thresholds: {
    http_req_failed: ["rate<0.01"],
    "http_req_duration{kind:read}": ["p(99)<400"],
    // Only has data once you add requests tagged kind: "write".
    "http_req_duration{kind:write}": ["p(99)<800"],
  },
};

export default function () {
  const res = http.get("https://api.internal/feed?cursor=latest", {
    tags: { kind: "read" },
  });
  // k6 exposes response header names in canonical form, so a header the
  // server sends as X-DB-Ms is read back as X-Db-Ms.
  const dbMs = Number(res.headers["X-Db-Ms"] ?? 0);
  if (dbMs > 0) dbLatency.add(dbMs);
  if (res.status >= 500) upstreamErrors.add(1);
  check(res, { "status 2xx": (r) => r.status >= 200 && r.status < 300 });
  // No sleep() here: with an arrival-rate executor the scenario sets the
  // iteration rate, and sleeping only ties up VUs.
}

The custom downstream_db_ms trend is the bit that matters. You don’t just see http p99 climb. You see which layer drove it.

Headroom buffers that survive incidents

Stateless and stateful services need different headroom. For stateless web pods, I want 2x headroom over forecast peak. Stateful tiers (DB writers, brokers, real-time gateways) get 3 to 5x, sized against the worst plausible failure mode rather than the steady state.

Three failure modes I size for explicitly:

AZ loss (one third of capacity gone in a three-AZ layout).
Deploy in progress (rolling, so transient capacity dip).
Self-amplifying client behavior (reconnect storm, retry storm, cache stampede).
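
The three failure modes above compose into a single sizing number. The sketch below deliberately takes the pessimistic view that all three can overlap (an AZ drops mid-deploy and clients start retrying); set a factor to its neutral value to leave it out. The field names and the assumption that pods are spread evenly across AZs are mine, not something the platform guarantees.

```typescript
interface FailureBudget {
  steadyStatePods: number; // pods needed at forecast peak with everything healthy
  azCount: number; // pods assumed to be spread evenly across this many AZs
  deployUnavailableShare: number; // share of pods out during a rolling deploy, e.g. 0.25
  amplification: number; // load multiplier during a retry or reconnect storm, e.g. 2
}

// Pods to provision so that demand is still covered with one AZ gone,
// a deploy in flight, and clients amplifying load, all at the same time.
function podsForWorstCase(b: FailureBudget): number {
  if (b.azCount < 2) throw new Error("surviving an AZ loss needs at least two AZs");
  if (b.deployUnavailableShare < 0 || b.deployUnavailableShare >= 1) {
    throw new Error("deployUnavailableShare must be in [0, 1)");
  }
  const demand = b.steadyStatePods * b.amplification;
  const survivingAzs = b.azCount - 1;
  const upDuringDeploy = 1 - b.deployUnavailableShare;
  // Solve provisioned * (survivingAzs / azCount) * upDuringDeploy >= demand.
  return Math.ceil((demand * b.azCount) / (survivingAzs * upDuringDeploy));
}
```

With 10 steady-state pods, three AZs, 25% unavailable during deploys, and a 2x retry storm, this asks for 40 pods: 4x, which lands inside the 3 to 5x range for tiers exposed to all three failure modes.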

Cost curves and the boring math

Capacity isn’t linear in cost. Different layers have different curves.

Stateless workers: roughly linear. Twice the pods, roughly twice the bill.
Aurora writers: step function. You scale by instance class, not by core.
Kafka brokers: move in threes (to match a replication factor of 3).
Reader replicas: linear in count but with diminishing returns past 3-4.
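
To make the difference between those curves concrete, here is a small sketch of the three shapes: linear, step, and grouped. The class names, capacities, and prices are placeholders invented for the example, not AWS instance classes or rates.

```typescript
// Linear: stateless workers. Cost scales directly with pod count.
function linearCost(units: number, pricePerUnit: number): number {
  return units * pricePerUnit;
}

// Step: a writer tier where you pick an instance class, not a core count.
interface InstanceClass {
  name: string; // placeholder label, not a real instance class
  capacity: number; // load the class can carry, in whatever unit you forecast
  hourly: number; // placeholder price
}

// Returns the smallest class that covers demand. Cost stays flat until
// demand crosses a class boundary, then jumps.
function smallestClassFor(demand: number, classes: InstanceClass[]): InstanceClass {
  const sorted = classes.slice().sort((a, b) => a.capacity - b.capacity);
  for (const c of sorted) {
    if (c.capacity >= demand) return c;
  }
  throw new Error("demand exceeds the largest class; time to shard or redesign");
}

// Grouped: tiers that grow in fixed increments, such as brokers added in threes.
function roundUpToGroup(units: number, groupSize: number): number {
  return Math.ceil(units / groupSize) * groupSize;
}
```

The step function is where forecasts get expensive: one unit of demand past a class boundary costs the same as filling the next class entirely.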

Reserved Instances change the slope under you entirely. At the creator economy platform I worked at, a hackathon I joined kicked off an audit of AWS RDS Aurora usage. Several instances that had been steady-state for over a year were still on-demand. Planning the Reserved Instance commitments and rolling them out moved the cost curve significantly downward on that line item. The technical decision wasn’t clever. The discipline was that someone actually sat down and modeled commitment against forecast.
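
Modeling commitment against forecast can be as small as the comparison below. It is a sketch under simplifying assumptions: a single instance class, a flat effective hourly rate for committed capacity that is paid whether or not it is used, and prices that are placeholders rather than actual AWS rates.

```typescript
interface CommitmentInput {
  onDemandHourly: number; // placeholder on-demand price per instance-hour
  reservedHourly: number; // placeholder effective price per committed instance-hour
  committed: number; // instances covered by the commitment
  forecast: number[]; // instances needed in each period, e.g. monthly averages
  hoursPerPeriod: number;
}

interface CommitmentResult {
  onDemandOnly: number; // total cost with no commitment
  withCommitment: number; // committed capacity plus on-demand overflow
  breakEvenUtilization: number; // usage share above which a committed instance pays off
}

function compareCommitment(i: CommitmentInput): CommitmentResult {
  let onDemandOnly = 0;
  let withCommitment = 0;
  for (const needed of i.forecast) {
    onDemandOnly += needed * i.onDemandHourly * i.hoursPerPeriod;
    // Committed instances are paid for even when idle; anything above the
    // commitment runs on-demand.
    const overflow = Math.max(0, needed - i.committed);
    withCommitment +=
      (i.committed * i.reservedHourly + overflow * i.onDemandHourly) * i.hoursPerPeriod;
  }
  return {
    onDemandOnly,
    withCommitment,
    breakEvenUtilization: i.reservedHourly / i.onDemandHourly,
  };
}
```

Instances that have been steady-state for a year sit at close to 100% utilization, far above any plausible break-even, which is why leaving them on-demand is the expensive choice.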

Here’s the relevant Terraform piece I tend to ship, with autoscaling targets that are honest about headroom:

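resource "aws_rds_cluster" "primary" {
  cluster_identifier      = "app-primary"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = "app"
  backup_retention_period = 14
  storage_encrypted       = true

  # Assumption: a new cluster needs a master user, which this excerpt
  # didn't show. var.db_master_username is a placeholder variable.
  master_username             = var.db_master_username
  manage_master_user_password = true
}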

resource "aws_rds_cluster_instance" "reader" {
  count              = var.reader_baseline
  identifier         = "app-reader-${count.index}"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = var.reader_class
  engine             = aws_rds_cluster.primary.engine
}

resource "aws_appautoscaling_target" "reader_count" {
  max_capacity       = var.reader_baseline * 3 # 3x burst headroom
  min_capacity       = var.reader_baseline
  resource_id        = "cluster:${aws_rds_cluster.primary.id}"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  service_namespace  = "rds"
}

resource "aws_appautoscaling_policy" "reader_cpu" {
  name               = "reader-cpu-target"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.reader_count.resource_id
  scalable_dimension = aws_appautoscaling_target.reader_count.scalable_dimension
  service_namespace  = aws_appautoscaling_target.reader_count.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 50
    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
    scale_in_cooldown  = 600
    scale_out_cooldown = 60
  }
}

50% reader CPU target, not 70. Scale-out cooldown 60s, scale-in 600s. Slow to shrink, fast to grow. That asymmetry is the whole point.

Load tests that mirror production

A flat-rate load test will pass and your real traffic will still tip you over. Production traffic has hot keys, cache hit rates, retry storms, geographic skew. A synthetic test that pretends traffic is uniform is testing a system you don’t run.

What I do instead: pull a slice of prod logs (anonymized), bucket by endpoint and request shape, and replay that distribution. Honestly, this is the part most teams skip. They run k6 with a uniform script against a happy path, call it capacity verification, and then prod surprises them.
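
The bucketing step might look like the sketch below: count log lines per (endpoint, shape) pair, turn the counts into weights, and draw requests from that distribution during the test. The LogLine fields and shape labels are assumptions, and the random draw is passed in as a parameter so the sampler stays deterministic and easy to test.

```typescript
// Hypothetical anonymized log line. Shape labels are illustrative.
interface LogLine {
  endpoint: string; // e.g. "/feed"
  shape: string; // e.g. "read_light", "write", "fanout"
}

interface Bucket {
  endpoint: string;
  shape: string;
  weight: number; // share of observed traffic, all weights sum to 1
}

// Bucket log lines by endpoint and shape, heaviest bucket first.
function buildDistribution(lines: LogLine[]): Bucket[] {
  if (lines.length === 0) throw new Error("need at least one log line");
  const counts: Record<string, Bucket> = {};
  for (const l of lines) {
    const id = l.endpoint + " " + l.shape;
    if (counts[id] === undefined) {
      counts[id] = { endpoint: l.endpoint, shape: l.shape, weight: 0 };
    }
    counts[id].weight += 1;
  }
  return Object.keys(counts)
    .map((id) => ({
      endpoint: counts[id].endpoint,
      shape: counts[id].shape,
      weight: counts[id].weight / lines.length,
    }))
    .sort((a, b) => b.weight - a.weight);
}

// Pick a bucket for a draw u in [0, 1) by walking the cumulative weights.
function pickBucket(dist: Bucket[], u: number): Bucket {
  if (dist.length === 0) throw new Error("empty distribution");
  let cumulative = 0;
  for (const b of dist) {
    cumulative += b.weight;
    if (u < cumulative) return b;
  }
  return dist[dist.length - 1]; // guards against rounding at the top end
}
```

In a load script, each iteration would call pickBucket(dist, Math.random()) and issue the matching request, so the test's endpoint mix tracks production rather than a single happy path. This only reproduces the endpoint and shape mix; hot keys and geographic skew need the same treatment on their own dimensions.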

Review cadence and ownership

Capacity isn’t a one-shot exercise. Quarterly review minimum, monthly during high-growth phases. Ad-hoc reviews trigger on three things: a new feature with unknown fan-out, a vendor change (DB major version, broker upgrade, region change), or a pricing model change from a cloud or SaaS provider.

The capacity plan for a service belongs to the service owner, not “ops” or “the platform team.” Platform provides tooling, headroom defaults, and a quarterly forum. The squad owning the service makes the call. If platform owns the plan, nobody who actually understands the load is in the room.

A simple Datadog monitor I attach to every service owner’s dashboard:

name: "Capacity headroom below 30% - {{service.name}}"
type: query alert
# Grouped by service so {{service.name}} resolves per alert.
# kubernetes.cpu.usage.total is reported in nanocores and
# kubernetes.cpu.limits in cores, hence the 1e9 factor. Check the
# units in your own account before relying on it.
query: |
  avg(last_15m):
    avg:kubernetes.cpu.usage.total{*} by {service} /
    (avg:kubernetes.cpu.limits{*} by {service} * 1000000000) > 0.7
message: |
  Service {{service.name}} has averaged above 70% of its CPU limit over
  the last 15m, which means headroom is below 30%. This is the cue to
  review the capacity plan for this service.

  Runbook: https://runbooks.internal/capacity/{{service.name}}
  Owner: {{service.owner}}
options:
  thresholds:
    critical: 0.7
    warning: 0.6
  notify_audit: true
  no_data_timeframe: 30
  notify_no_data: true

You want the review to start before you’re in the incident, not during.

Takeaways

Forecast request shape (read/write, burst, fan-out, hot keys), not just RPS.
Bottlenecks are almost never CPU. Instrument every layer before you scale.
Stateless services want 2x headroom, stateful tiers want 3-5x, sized against the worst failure mode.
Cost curves are step functions for stateful tiers. Plan compute and commitments together.
Load tests must replay traffic shape, not a flat rate.
Capacity plans live with service owners. Quarterly review, monthly during growth.

Thanks for reading. If you’ve got thoughts, send them my way.