Load Testing Production Web Systems

How I think about k6, Gatling, and Locust for HTTP load testing, why percentile-based SLOs beat average-based ones, and how to load-test production without taking it down.

09:31 on a Tuesday. Market open at the real-time trading platform I architected a few years back. Within 74 seconds every gateway pod was pinned at 100% CPU, p99 tick fan-out climbed from ~80 ms to ~3 s, and clients were seeing stale prices on their charts. The load test we’d run two weeks before said we could handle roughly three times that traffic.

The load test wasn’t wrong about the volume. It was wrong about the shape. We’d modeled fresh connections climbing into market open. What we got was a reconnect storm, and the test had no idea reconnects even existed.

That’s the whole article in a paragraph. Load testing isn’t “send a lot of requests and watch the numbers”. It’s modeling the traffic shapes you’ll actually see at the worst moment, against percentiles you can defend, with tools that fit your team and your CI.

Why most load tests lie

The default load test I see people write: ramp 0 to N virtual users over 5 minutes, hold for 10, tear down, look at the average response time, ship the dashboard screenshot. Done.

This catches almost none of what actually breaks production. It misses reconnect storms, retry storms, cache stampedes, slow file-descriptor leaks, and replica lag on Aurora readers. It reports averages, so p99 can be on fire while the number on the dashboard looks calm. Honestly I’d rather have no load test than a fresh-traffic-ramp-to-peak test that gives the team false confidence.

k6 vs Gatling vs Locust

I’ve run all three. k6 is my default for HTTP and API: scripts live in the app repo, the threshold model is the cleanest I’ve used, the CI story is good. Locust is the pick when I need actual Python logic in the test, like custom auth flows or weird state machines where you’re really writing a fake client. Gatling is fine when a JVM team already owns the load rig; picking it because you read a benchmark blog when nobody on the team writes Scala is how you end up with a load suite nobody touches.

Here’s a k6 script of the shape I’d put in a repo.

import http from 'k6/http'
import { check } from 'k6'
import { Trend, Rate } from 'k6/metrics'

// Custom metrics. The second Trend argument marks the values as durations.
const apiLatency = new Trend('api_latency_ms', true)
const apiErrors = new Rate('api_errors')

export const options = {
  scenarios: {
    // Short fixed-rate check: 200 requests per second for 2 minutes.
    smoke: {
      executor: 'constant-arrival-rate',
      rate: 200,
      timeUnit: '1s',
      duration: '2m',
      preAllocatedVUs: 100,
      maxVUs: 400,
    },
    // Starts after the smoke has finished: ramp to 800 rps, hold, ramp down.
    soak: {
      executor: 'ramping-arrival-rate',
      startRate: 200,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 1500,
      stages: [
        { target: 800, duration: '5m' },
        { target: 800, duration: '30m' },
        { target: 0, duration: '2m' },
      ],
      startTime: '3m',
    },
  },
  // Any breached threshold makes k6 exit non-zero, which fails the CI job.
  thresholds: {
    'api_latency_ms{endpoint:feed}': ['p(95)<400', 'p(99)<1200'],
    'api_errors': ['rate<0.005'],
    'http_req_failed': ['rate<0.01'],
  },
}

const BASE = __ENV.BASE_URL
const TOKEN = __ENV.LOAD_TEST_TOKEN

export default function () {
  const res = http.get(`${BASE}/communities/abc/posts?limit=20`, {
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      'x-synthetic-tenant': 'true',
    },
    tags: { endpoint: 'feed' },
  })

  apiLatency.add(res.timings.duration, { endpoint: 'feed' })
  apiErrors.add(res.status >= 500)

  check(res, {
    'status 2xx': (r) => r.status >= 200 && r.status < 300,
  })

  // No sleep() here: arrival-rate executors pace iterations themselves,
  // and sleeping would only tie up VUs.
}

Two things. The thresholds are p95 and p99, not averages. And the request carries an x-synthetic-tenant header. We’ll come back to that one.

Design tests against percentiles

Report averages and the slow tail hides behind the fast majority. A p50 of 90 ms looks great; a p99 of 4.2 s on the same endpoint two weeks later gets you paged on a Saturday. Design SLOs on percentiles and let your load test enforce them as CI gates. If p99 of the feed endpoint goes above 1200 ms, the test fails, the PR fails with it, and the test stops being a thing engineers run manually and ignore.
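To put numbers on how a tail hides, here’s a small sketch. The latencies are made up for illustration (970 fast requests and 30 slow ones), and the nearest-rank percentile helper is my own, not what k6 uses internally.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 970 fast requests, 30 stuck behind something slow.
latencies_ms = [90] * 970 + [4200] * 30

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean comes out around 213 ms and p50 stays at 90 ms, while p99 is 4200 ms. A dashboard showing only the first two would look healthy.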

Also: track http_req_failed separately, and tag by endpoint. A regression that makes 5% of requests return 500 in 5 ms will absolutely improve your average latency, and p99 of “everything” is meaningless when /health and /feed share the same dataset.
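The fast-failure effect is easy to check with arithmetic. A minimal sketch, assuming a hypothetical endpoint that normally answers in 120 ms (all sample values here are invented):

```python
def summarize(samples):
    """samples is a list of (status_code, duration_ms) pairs."""
    durations = [ms for _, ms in samples]
    errors = sum(1 for status, _ in samples if status >= 500)
    return {
        "mean_ms": sum(durations) / len(durations),
        "error_rate": errors / len(samples),
    }

# Hypothetical baseline: 1,000 requests, all succeed in 120 ms.
healthy = [(200, 120.0)] * 1000

# Hypothetical regression: 5% of requests now fail fast with a 500 in 5 ms.
regressed = [(200, 120.0)] * 950 + [(500, 5.0)] * 50

before = summarize(healthy)
after = summarize(regressed)
print(before)
print(after)
```

Mean latency drops from 120 ms to about 114 ms while the error rate goes from 0 to 5%. Latency alone would call that an improvement, which is why the error rate needs its own threshold.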

Model realistic traffic shapes

Pick the shape that matches the failure mode you’re worried about. Smoke runs on every PR and fails the PR on regression. Spike steps from baseline to 10x in 30 seconds and finds the autoscaler-too-slow and connection-pool-too-cold class of bugs. Soak holds moderate traffic for hours and finds the leaks; I run soaks nightly. Stress ramps until something breaks. It’s useful for finding the cliff, but please don’t run it against production.

A spike-shaped Locust runner looks something like this.

from locust import HttpUser, task, between, LoadTestShape
import os

class FeedUser(HttpUser):
    wait_time = between(0.1, 0.4)

    def on_start(self):
        token = os.environ["LOAD_TEST_TOKEN"]
        self.client.headers.update({
            "Authorization": f"Bearer {token}",
            "x-synthetic-tenant": "true",
        })

    @task(8)
    def feed(self):
        with self.client.get(
            "/communities/abc/posts?limit=20",
            name="/communities/:id/posts",
            catch_response=True,
        ) as r:
            if r.status_code >= 500:
                r.failure(f"server error {r.status_code}")

    @task(1)
    def post(self):
        self.client.post(
            "/communities/abc/posts",
            name="/communities/:id/posts:create",
            json={"body": "synthetic"},
        )


class MarketOpenSpike(LoadTestShape):
    stages = [
        {"duration": 60,  "users": 200,  "spawn_rate": 50},
        {"duration": 30,  "users": 2000, "spawn_rate": 800},
        {"duration": 600, "users": 2000, "spawn_rate": 200},
        {"duration": 60,  "users": 100,  "spawn_rate": 50},
    ]

    def tick(self):
        run_time = self.get_run_time()
        elapsed = 0
        for stage in self.stages:
            elapsed += stage["duration"]
            if run_time < elapsed:
                return (stage["users"], stage["spawn_rate"])
        return None

And from the trading incident: if your product has reconnect behavior, your load test must model it. Force ~30% of your virtual users to drop and reconnect mid-test with the same backoff config your production client ships. Otherwise you’re not testing your production traffic.
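One way to plan that is to compute the reconnect schedule up front and feed it to whatever tool drives the test. The sketch below is tool-agnostic and built on assumptions: full-jitter exponential backoff with a 0.5 s base and 30 s cap stands in for your real client config, and it models the worst case where every user uses all of its attempts.

```python
import random
from collections import Counter

def backoff_delays(attempts, rng, base_s=0.5, cap_s=30.0):
    """Full-jitter exponential backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]

def reconnect_schedule(user_ids, drop_fraction, drop_at_s, attempts, seed=42):
    """Return {user_id: [absolute time of each reconnect attempt]} for the dropped users."""
    rng = random.Random(seed)  # seeded so two runs produce the same storm
    dropped = rng.sample(user_ids, round(len(user_ids) * drop_fraction))
    schedule = {}
    for uid in dropped:
        t = drop_at_s
        times = []
        for delay in backoff_delays(attempts, rng):
            t += delay
            times.append(t)
        schedule[uid] = times
    return schedule

users = list(range(2000))
schedule = reconnect_schedule(users, drop_fraction=0.30, drop_at_s=120.0, attempts=3)

# Bucket attempts per second to see the shape the gateway would face.
per_second = Counter(int(t) for times in schedule.values() for t in times)
for second in sorted(per_second):
    print(second, per_second[second])
```

With these parameters, 600 users make 1,800 attempts inside roughly three and a half seconds after the drop. Swap in the backoff your production client actually ships, because that is the shape you want to reproduce.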

Safe production load testing

You cannot fully simulate production in staging. Data shape, cache cardinality, cold code paths, upstream rate limits, noisy neighbors on your Aurora cluster: none of it replicates cleanly. If you only load test staging, you’re testing the wrong system.

You can load test production safely with a few disciplines. Synthetic tenants gated by a header, so your write paths no-op or route to a sandbox table and your billing and email systems ignore them. A global per-second rate cap well below what you think production can take. A kill switch on a feature flag. A runbook: announce the window, dashboards open, on-call number in front of you. The test is a deploy. Treat it like one.
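Here’s a rough sketch of three of those guards. Everything in it is hypothetical: LoadGuard and route_write are names I made up, the rate cap is a simple fixed one-second window inside a single process, and the kill switch is any callable you wire to your feature-flag system. A cap that is truly global across several load generators would need shared state, which this does not attempt.

```python
import time

class LoadGuard:
    """Load-generator side: caps requests per second and stops on a kill switch."""

    def __init__(self, max_rps, kill_switch, clock=time.monotonic):
        self.max_rps = max_rps
        self.kill_switch = kill_switch  # callable returning True when the test must stop
        self.clock = clock
        self._window_start = clock()
        self._sent_in_window = 0

    def allow(self):
        """Return True if one more request may be sent right now."""
        if self.kill_switch():
            return False
        now = self.clock()
        if now - self._window_start >= 1.0:
            # A new one-second window starts; reset the counter.
            self._window_start = now
            self._sent_in_window = 0
        if self._sent_in_window >= self.max_rps:
            return False
        self._sent_in_window += 1
        return True


def route_write(headers, payload, real_store, sandbox_store):
    """Server side: synthetic-tenant writes go to a sandbox, never the real table."""
    if headers.get("x-synthetic-tenant") == "true":
        sandbox_store.append(payload)
        return "sandbox"
    real_store.append(payload)
    return "real"


# Demo with a hand-driven clock so the behaviour is deterministic.
now = [0.0]
killed = [False]
guard = LoadGuard(max_rps=3, kill_switch=lambda: killed[0], clock=lambda: now[0])

first_second = [guard.allow() for _ in range(5)]  # cap is 3, so the last two are refused
now[0] += 1.0
next_second = guard.allow()                       # new window, allowed again
killed[0] = True
after_kill = guard.allow()                        # kill switch overrides the rate cap

real, sandbox = [], []
route_write({"x-synthetic-tenant": "true"}, {"body": "synthetic"}, real, sandbox)
print(first_second, next_second, after_kill, len(real), len(sandbox))
```

The kill switch is checked before anything else on purpose: when someone flips the flag mid-incident, no request should get through just because the window had room left.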

Wiring it into CI

Load testing isn’t a quarterly event. It’s a CI gate. A short smoke runs on every PR against a staging environment, with tight thresholds. A nightly soak runs against staging too, longer, broader scope.

name: load-test

on:
  pull_request:
    paths:
      - 'app/**'
      - 'config/**'
      - 'loadtests/**'
  schedule:
    - cron: '0 4 * * *'

jobs:
  k6-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 10
    env:
      BASE_URL: ${{ secrets.STAGING_BASE_URL }}
      LOAD_TEST_TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: install k6
        run: |
          sudo gpg -k
          sudo gpg --no-default-keyring \
            --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
            --keyserver hkp://keyserver.ubuntu.com:80 \
            --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
          echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
            | sudo tee /etc/apt/sources.list.d/k6.list
          sudo apt-get update
          sudo apt-get install -y k6
      - name: run smoke
        run: |
          k6 run \
            --tag git_sha=${GITHUB_SHA::8} \
            --summary-export=summary.json \
            loadtests/feed-smoke.js
      - name: upload summary
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: k6-summary
          path: summary.json

  k6-soak:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    timeout-minutes: 60
    env:
      BASE_URL: ${{ secrets.STAGING_BASE_URL }}
      LOAD_TEST_TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: install k6
        run: |
          curl -L https://github.com/grafana/k6/releases/download/v0.51.0/k6-v0.51.0-linux-amd64.tar.gz \
            | tar xz
          sudo mv k6-v0.51.0-linux-amd64/k6 /usr/local/bin/
      - name: run soak
        run: |
          k6 run \
            --tag run_type=nightly \
            loadtests/feed-soak.js
      - name: notify slack on regression
        if: failure()
        env:
          SLACK_LOAD_WEBHOOK: ${{ secrets.SLACK_LOAD_WEBHOOK }}
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"Nightly soak regressed on ${GITHUB_SHA::8}\"}" \
            "$SLACK_LOAD_WEBHOOK"

Fail PRs on regressions against a baseline, not only on absolute thresholds. A 30% p99 regression matters even if the absolute number is still under the SLO. And keep the smoke under a couple of minutes: engineers will quietly disable a CI step that costs them 12 minutes per push.
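A baseline comparison can be a small script that runs after k6. This sketch assumes both summaries look like what --summary-export writes, with a top-level "metrics" object mapping metric names to stats such as "p(95)". It also assumes p(99) was included via --summary-trend-stats, since it is not in the default set. The function name, the 30% cutoff, and the inline sample numbers are all illustrative.

```python
import json

def find_regressions(baseline, current, metric, stats=("p(95)", "p(99)"), max_increase=0.30):
    """Return [(stat, baseline_value, current_value)] for stats that grew by more than max_increase."""
    regressions = []
    base_stats = baseline["metrics"].get(metric, {})
    cur_stats = current["metrics"].get(metric, {})
    for stat in stats:
        if stat not in base_stats or stat not in cur_stats:
            continue  # stat missing from one run; nothing to compare
        before, after = base_stats[stat], cur_stats[stat]
        if before > 0 and (after - before) / before > max_increase:
            regressions.append((stat, before, after))
    return regressions

# Inline stand-ins for two summary files; in CI these would come from json.load().
baseline = json.loads('{"metrics": {"http_req_duration": {"p(95)": 310.0, "p(99)": 700.0}}}')
current = json.loads('{"metrics": {"http_req_duration": {"p(95)": 330.0, "p(99)": 950.0}}}')

regressions = find_regressions(baseline, current, "http_req_duration")
for stat, before, after in regressions:
    print(f"{stat}: {before:.0f} ms -> {after:.0f} ms (+{(after - before) / before:.0%})")
```

Here p99 goes from 700 ms to 950 ms, about 36% worse and still under a 1200 ms SLO, so only the baseline check catches it. In a real pipeline you’d exit non-zero when the list is non-empty, and probably treat a missing stat as an error rather than skipping it.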

Takeaways

Design SLOs and thresholds on p95 / p99, never on averages. Track error rate separately.
The traffic shape matters more than the tool. Model reconnects, retries, and spikes, not just fresh ramps.
k6 is my default for HTTP load tests in CI. Locust when you need real Python logic. Gatling only if a JVM team already owns the rig.
Load testing production is fine, with synthetic tenants, rate caps, a kill switch, and a runbook everyone knows.
Wire smoke into PR CI with tight thresholds. Run soak nightly. Compare against a baseline, not just an absolute number.

Thanks for reading. If you’ve got thoughts, send them my way.