How I think about k6, Gatling, and Locust for HTTP load testing, why percentile design beats average-based SLOs, and how to load-test production without taking it down.
09:31 on a Tuesday. Market open at the real-time trading platform I architected a few years back. Within 74 seconds every gateway pod was pinned at 100% CPU, p99 tick fan-out climbed from ~80 ms to ~3 s, and clients were seeing stale prices on their charts. The load test we’d run two weeks before said we could handle roughly three times that traffic.
The load test wasn’t wrong about the volume. It was wrong about the shape. We’d modeled fresh connections climbing into market open. What we got was a reconnect storm, and the test had no idea reconnects even existed.
That’s the whole article in a paragraph. Load testing isn’t “send a lot of requests and watch the numbers”. It’s modeling the traffic shapes you’ll actually see at the worst moment, against percentiles you can defend, with tools that fit your team and your CI.
The default load test I see people write: ramp 0 to N virtual users over 5 minutes, hold for 10, tear down, look at the average response time, ship the dashboard screenshot. Done.
This catches almost none of what actually breaks production. It misses reconnect storms, retry storms, cache stampedes, slow file-descriptor leaks, and replication lag on Aurora readers. It reports averages, so p99 can be on fire while p50 looks calm. Honestly I’d rather have no load test than a fresh-traffic-ramp-to-peak test that gives the team false confidence.
I’ve run all three tools. k6 is my default for HTTP and API: scripts live in the app repo, the threshold model is the cleanest I’ve used, the CI story is good. Locust is the pick when I need actual Python logic in the test, like custom auth flows or weird state machines where you’re really writing a fake client. Gatling is fine when a JVM team already owns the load rig; picking it because you read a benchmark blog when nobody on the team works on the JVM is how you end up with a load suite nobody touches.
Here’s a k6 script of the shape I’d put in a repo.
import http from 'k6/http'
import { check } from 'k6'
import { Trend, Rate } from 'k6/metrics'

// Custom metrics so thresholds can target one endpoint instead of all traffic.
const apiLatency = new Trend('api_latency_ms', true) // true = values are durations
const apiErrors = new Rate('api_errors')

export const options = {
  scenarios: {
    // Constant 200 iterations per second for 2 minutes.
    smoke: {
      executor: 'constant-arrival-rate',
      rate: 200,
      timeUnit: '1s',
      duration: '2m',
      preAllocatedVUs: 100,
      maxVUs: 400,
    },
    // Starts after the smoke finishes: ramp to 800/s, hold, ramp down.
    soak: {
      executor: 'ramping-arrival-rate',
      startRate: 200,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 1500,
      stages: [
        { target: 800, duration: '5m' },
        { target: 800, duration: '30m' },
        { target: 0, duration: '2m' },
      ],
      startTime: '3m',
    },
  },
  // A breached threshold makes k6 exit non-zero, which fails the CI step.
  thresholds: {
    'api_latency_ms{endpoint:feed}': ['p(95)<400', 'p(99)<1200'],
    'api_errors': ['rate<0.005'],
    'http_req_failed': ['rate<0.01'],
  },
}

// Both values come from the environment, e.g. `k6 run -e BASE_URL=...` or CI secrets.
const BASE = __ENV.BASE_URL
const TOKEN = __ENV.LOAD_TEST_TOKEN

export default function () {
  const res = http.get(`${BASE}/communities/abc/posts?limit=20`, {
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      'x-synthetic-tenant': 'true',
    },
    tags: { endpoint: 'feed' },
  })

  // Record latency and server errors under the same endpoint tag the thresholds use.
  apiLatency.add(res.timings.duration, { endpoint: 'feed' })
  apiErrors.add(res.status >= 500)

  check(res, {
    'status 2xx': (r) => r.status >= 200 && r.status < 300,
  })
  // No sleep() here: arrival-rate executors already pace iterations via rate and timeUnit.
}
Two things. The thresholds are p95 and p99, not averages. And the request carries an x-synthetic-tenant header. We’ll come back to that one.
Report averages and the slow tail hides behind the fast majority. A p50 of 90 ms looks great; a p99 of 4.2 s on the same endpoint two weeks later gets you paged on a Saturday. Design SLOs on percentiles and let your load test enforce them as CI gates. If p99 of the feed endpoint goes above 1200 ms, the test fails, the PR fails with it, and the test stops being a thing engineers run manually and ignore.
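To make the gap between the average and the tail concrete, here is a minimal sketch in plain Python. The latency samples are made up for illustration, and `percentile` is a simple nearest-rank helper of my own, not something taken from k6.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 very slow ones.
latencies_ms = [90.0] * 97 + [4200.0] * 3

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

With these numbers the mean lands a little above 200 ms and the median at 90 ms, while the p99 is the full 4200 ms. A gate on the mean or the median would pass; a gate on p99 would not.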
Also: track http_req_failed separately, and tag by endpoint. A regression that makes 5% of requests return 500 in 5 ms will absolutely improve your average latency, and p99 of “everything” is meaningless when /health and /feed share the same dataset.
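A small sketch of both effects, with invented samples: fast failures pulling the average down, and a chatty cheap endpoint hiding a slow one when everything lands in one bucket. The endpoint names, latencies, and request counts are assumptions for illustration only.

```python
import statistics
from collections import defaultdict

# Hypothetical samples as (endpoint, status, latency_ms).
healthy = [("/feed", 200, 120.0)] * 100
# Same traffic after a regression: 5% of requests fail fast with a 500 in 5 ms.
regressed = [("/feed", 200, 120.0)] * 95 + [("/feed", 500, 5.0)] * 5


def mean_latency(samples):
    return statistics.mean(latency for _, _, latency in samples)


def failure_rate(samples):
    return sum(1 for _, status, _ in samples if status >= 500) / len(samples)


def latencies_by_endpoint(samples):
    """Group latencies per endpoint, the equivalent of tagging requests."""
    grouped = defaultdict(list)
    for endpoint, _, latency in samples:
        grouped[endpoint].append(latency)
    return dict(grouped)


# The regression lowers the mean, so only the failure rate reveals it.
print(mean_latency(healthy), failure_rate(healthy))
print(mean_latency(regressed), failure_rate(regressed))

# Mixing a cheap health check with the feed drags the combined mean down.
mixed = [("/health", 200, 2.0)] * 900 + [("/feed", 200, 1500.0)] * 100
per_endpoint_mean = {
    endpoint: statistics.mean(values)
    for endpoint, values in latencies_by_endpoint(mixed).items()
}
print(mean_latency(mixed), per_endpoint_mean)
```

The regressed run has a lower mean than the healthy one, which is exactly why the failure rate needs its own threshold. In the mixed dataset the combined mean sits far below what the feed endpoint alone is doing.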
Pick the shape that matches the failure mode you’re worried about. Smoke runs on every PR and fails the PR on regression. Spike steps from baseline to 10x in 30 seconds and finds the autoscaler-too-slow and connection-pool-too-cold class of bugs. Soak holds moderate traffic for hours and finds the leaks; I run soaks nightly. Stress ramps until something breaks, which is useful for finding the cliff. Please don’t run it against production.
A spike-shaped Locust runner looks something like this.
from locust import HttpUser, task, between, LoadTestShape
import os


class FeedUser(HttpUser):
    # Each simulated user pauses 100-400 ms between tasks.
    wait_time = between(0.1, 0.4)

    def on_start(self):
        # Attach auth and the synthetic-tenant marker to every request.
        token = os.environ["LOAD_TEST_TOKEN"]
        self.client.headers.update({
            "Authorization": f"Bearer {token}",
            "x-synthetic-tenant": "true",
        })

    @task(8)
    def feed(self):
        # name= groups all community IDs under one stats entry.
        with self.client.get(
            "/communities/abc/posts?limit=20",
            name="/communities/:id/posts",
            catch_response=True,
        ) as r:
            if r.status_code >= 500:
                r.failure(f"server error {r.status_code}")

    @task(1)
    def post(self):
        self.client.post(
            "/communities/abc/posts",
            name="/communities/:id/posts:create",
            json={"body": "synthetic"},
        )


class MarketOpenSpike(LoadTestShape):
    # Each stage lasts "duration" seconds; durations are summed in tick().
    stages = [
        {"duration": 60, "users": 200, "spawn_rate": 50},     # baseline
        {"duration": 30, "users": 2000, "spawn_rate": 800},   # spike to 10x
        {"duration": 600, "users": 2000, "spawn_rate": 200},  # hold
        {"duration": 60, "users": 100, "spawn_rate": 50},     # recover
    ]

    def tick(self):
        run_time = self.get_run_time()
        elapsed = 0
        for stage in self.stages:
            elapsed += stage["duration"]
            if run_time < elapsed:
                return (stage["users"], stage["spawn_rate"])
        # Returning None tells Locust to stop the test.
        return None
And from the trading incident: if your product has reconnect behavior, your load test must model it. Force ~30% of your virtual users to drop and reconnect mid-test with the same backoff config your production client ships. Otherwise you’re not testing your production traffic.
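Before wiring reconnects into a real runner, it can help to look at the shape they produce. This is a rough sketch in plain Python, not tied to Locust or k6. The backoff parameters (`base_s`, `cap_s`), the number of attempts, and the function names are all placeholders I made up; the real values should be whatever your production client ships.

```python
import random
from collections import Counter


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def reconnect_schedule(num_users, drop_fraction, drop_at_s, attempts_needed, seed=0):
    """Return sorted timestamps (seconds) of every reconnect attempt after a mass drop."""
    rng = random.Random(seed)
    dropped = rng.sample(range(num_users), round(num_users * drop_fraction))
    attempts = []
    for _user in dropped:
        t = drop_at_s
        # Assumption: each dropped user needs this many attempts before it gets back in.
        for attempt in range(attempts_needed):
            t += backoff_delay(attempt, rng=rng)
            attempts.append(t)
    return attempts if not attempts else sorted(attempts)


# 30% of 2000 users drop at t=60 s and each needs 3 attempts to reconnect.
schedule = reconnect_schedule(2000, 0.30, drop_at_s=60.0, attempts_needed=3)
per_second = Counter(int(t) for t in schedule)
for second in sorted(per_second):
    print(second, per_second[second])
```

With these placeholder settings, 600 users generate 1800 connection attempts squeezed into a few seconds after the drop. That burst, not the steady-state user count, is what the gateway has to survive.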
You cannot fully simulate production in staging. Data shape, cache cardinality, cold code paths, upstream rate limits, noisy neighbors on your Aurora cluster: none of it replicates cleanly. If you only load test staging, you’re testing the wrong system.
You can load test production safely with a few disciplines. Synthetic tenants gated by a header, so your write paths no-op or route to a sandbox table and your billing and email systems ignore them. A global per-second rate cap well below what you think production can take. A kill switch on a feature flag. A runbook: announce the window, dashboards open, on-call number in front of you. The test is a deploy. Treat it like one.
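Here is a minimal sketch of three of those disciplines in plain Python: header-gated synthetic writes on the server side, plus a per-second cap and kill switch on the load-generator side. Everything except the `x-synthetic-tenant` header is hypothetical: the table names, the `outbox` stand-in for email and billing hooks, and the `LoadBudget` class are illustrations, not a real framework API.

```python
SYNTHETIC_HEADER = "x-synthetic-tenant"


def is_synthetic(headers):
    """True when the request carries the synthetic-tenant header (keys assumed lowercased)."""
    return headers.get(SYNTHETIC_HEADER, "").lower() == "true"


def create_post(headers, body, tables, outbox):
    """Write to a sandbox table and skip side effects for synthetic traffic."""
    synthetic = is_synthetic(headers)
    table = "posts_sandbox" if synthetic else "posts"
    tables.setdefault(table, []).append(body)
    if not synthetic:
        outbox.append(("notify_followers", body))  # stand-in for email/billing hooks
    return table


class LoadBudget:
    """Fixed-window per-second cap with a kill switch, checked before each request."""

    def __init__(self, max_per_second, kill_switch):
        self.max_per_second = max_per_second
        self.kill_switch = kill_switch  # callable returning True when the test must stop
        self._window = None
        self._count = 0

    def allow(self, now_s):
        if self.kill_switch():
            return False
        window = int(now_s)
        if window != self._window:
            # New one-second window: reset the counter.
            self._window, self._count = window, 0
        if self._count >= self.max_per_second:
            return False
        self._count += 1
        return True


tables, outbox = {}, []
create_post({"x-synthetic-tenant": "true"}, {"body": "synthetic"}, tables, outbox)
create_post({}, {"body": "real"}, tables, outbox)

flag = {"killed": False}  # in practice this would read a feature flag
budget = LoadBudget(max_per_second=2, kill_switch=lambda: flag["killed"])
decisions = [budget.allow(10.0), budget.allow(10.2), budget.allow(10.9), budget.allow(11.0)]
flag["killed"] = True
after_kill = budget.allow(12.0)
print(tables, outbox, decisions, after_kill)
```

The synthetic write ends up only in the sandbox table and triggers no side effects. The third request inside the same second is refused by the cap, and once the flag flips nothing gets through, however much budget is left.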
Load testing isn’t a quarterly event. It’s a CI gate. A short smoke runs on every PR against a staging environment, with tight thresholds. A nightly soak runs against staging too, for longer and with a broader scope.
name: load-test

on:
  pull_request:
    paths:
      - 'app/**'
      - 'config/**'
      - 'loadtests/**'
  schedule:
    - cron: '0 4 * * *'

jobs:
  # Runs on every PR that touches app, config, or load-test code.
  k6-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 10
    env:
      BASE_URL: ${{ secrets.STAGING_BASE_URL }}
      LOAD_TEST_TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: install k6
        run: |
          sudo gpg -k
          sudo gpg --no-default-keyring \
            --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
            --keyserver hkp://keyserver.ubuntu.com:80 \
            --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
          echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
            | sudo tee /etc/apt/sources.list.d/k6.list
          sudo apt-get update
          sudo apt-get install -y k6
      - name: run smoke
        run: |
          k6 run \
            --tag git_sha=${GITHUB_SHA::8} \
            --summary-export=summary.json \
            loadtests/feed-smoke.js
      - name: upload summary
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: k6-summary
          path: summary.json

  # Runs nightly from the cron schedule above.
  k6-soak:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    timeout-minutes: 60
    env:
      BASE_URL: ${{ secrets.STAGING_BASE_URL }}
      LOAD_TEST_TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: install k6
        run: |
          curl -L https://github.com/grafana/k6/releases/download/v0.51.0/k6-v0.51.0-linux-amd64.tar.gz \
            | tar xz
          sudo mv k6-v0.51.0-linux-amd64/k6 /usr/local/bin/
      - name: run soak
        run: |
          k6 run \
            --tag run_type=nightly \
            loadtests/feed-soak.js
      - name: notify slack on regression
        if: failure()
        env:
          SLACK_LOAD_WEBHOOK: ${{ secrets.SLACK_LOAD_WEBHOOK }}
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Nightly soak regressed on '${GITHUB_SHA::8}'"}' \
            "$SLACK_LOAD_WEBHOOK"
Fail PRs on regressions against a baseline too, not only on absolute numbers. A 30% p99 regression matters even if the absolute number is still under the SLO. And keep the smoke under a couple of minutes. Engineers will quietly disable a CI step that costs them 12 minutes per push.
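The baseline comparison can be a few lines of script run after the load test. This sketch assumes you have already pulled per-endpoint p99 values out of the current run and a stored baseline into plain dicts. How you extract them depends on your k6 summary configuration, so that part is left out. The function name, metric names, and numbers are hypothetical.

```python
def find_regressions(current, baseline, max_regression=0.30):
    """Return {metric: relative_change} for metrics that regressed beyond the tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        # Skip metrics missing from this run or with an unusable baseline.
        if name not in current or base_value <= 0:
            continue
        change = (current[name] - base_value) / base_value
        if change > max_regression:
            regressions[name] = change
    return regressions


# Hypothetical p99 latencies in ms. Both runs are under a 1200 ms SLO.
baseline_p99_ms = {"feed": 600.0, "post_create": 400.0}
current_p99_ms = {"feed": 840.0, "post_create": 410.0}

regressed = find_regressions(current_p99_ms, baseline_p99_ms)
for name, change in regressed.items():
    print(f"{name}: p99 up {change:.0%} vs baseline")

# In CI, a non-empty result would translate into a non-zero exit code.
exit_code = 1 if regressed else 0
```

Here the feed endpoint is still well inside the absolute threshold but is 40% slower than the baseline, so the gate trips. The create endpoint moved by a couple of percent and passes.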
Thanks for reading. If you’ve got thoughts, send them my way.