Hypothesis-driven production debugging with flame graphs, continuous profiling, bpftrace, and core dumps, drawn from real incidents on Aurora and WebSocket gateways.
It was 09:31 on a Tuesday, the first market open after a bank holiday weekend, and every Socket.io gateway pod on the real-time trading platform I’d architected was pinned at 100% CPU. I’d been staring at Datadog for six minutes and learned nothing. The dashboards said “CPU high, latency high, error rate climbing”. Yeah. I could see that. What I needed was the call stack that was actually eating the cores, not another red square.
I ended up reaching for perf and a flame graph. That’s the moment I stopped trusting dashboards as a debugging tool and started treating them as an alerting tool. Dashboards tell you something is wrong. Profilers tell you what.
A flame graph isn’t art. Width is time spent on a stack, height is call depth, and the only thing you really care about is the widest plateau under a hot path. That’s it. Ignore the candle flicker on the left and right edges. Find the fat block.
When a Node.js gateway is CPU-bound, the flame graph for that process almost always lands on one of three plateaus: JSON serialization, regex matching, or some pure-JS code that should have been a buffer copy. On the trading gateway, it was tick payload re-serialization on every fan-out. We were stringifying the same object hundreds of times per second per client.
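To make that plateau concrete, here is a minimal sketch of the before and after. The `Tick` and `Client` shapes and both function names are hypothetical, not the platform’s real code, and it assumes a transport that accepts a pre-serialized string. The point is only that the `JSON.stringify` call moves out of the per-client loop.

```typescript
// Hypothetical shapes for illustration; not the platform's real types.
interface Tick {
  symbol: string;
  price: number;
  ts: number;
}

interface Client {
  send(payload: string): void;
}

// Before: one JSON.stringify per client, so serialization cost scales
// with the number of connected clients for every single tick.
function fanOutNaive(tick: Tick, clients: Iterable<Client>): number {
  let sent = 0;
  for (const client of clients) {
    client.send(JSON.stringify(tick));
    sent++;
  }
  return sent;
}

// After: serialize once per tick and hand every client the same string.
function fanOutSerializedOnce(tick: Tick, clients: Iterable<Client>): number {
  const payload = JSON.stringify(tick);
  let sent = 0;
  for (const client of clients) {
    client.send(payload);
    sent++;
  }
  return sent;
}
```

Both functions deliver identical bytes to every client. The second one simply does the expensive part once per tick instead of once per client.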
Here’s the wrapper I keep around for grabbing a flame graph off a live pod without rebuilding the image:
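#!/usr/bin/env bash
# flamegraph-pod.sh - grab a 30s CPU flame graph off a running pod
# Assumes perf is installed in the image. For JS frames to resolve to names,
# node needs to be running with --perf-basic-prof.
set -euo pipefail
POD=${1:?usage: flamegraph-pod.sh POD [NAMESPACE] [OUT]}
NAMESPACE=${2:-default}
OUT=${3:-flame-$(date +%s).svg}
# Sample the first node process at 99 Hz with call graphs for 30 seconds.
# Write perf.data to /tmp in case the working directory is read-only.
kubectl exec -n "$NAMESPACE" "$POD" -- bash -c '
  PID=$(pgrep -f "node " | head -1)
  perf record -F 99 -p "$PID" -g -o /tmp/perf.data -- sleep 30
  perf script -i /tmp/perf.data > /tmp/perf.out
'
kubectl cp "$NAMESPACE/$POD:/tmp/perf.out" /tmp/perf.out
# FlameGraph repo: brendangregg/FlameGraph
/opt/FlameGraph/stackcollapse-perf.pl /tmp/perf.out \
  | /opt/FlameGraph/flamegraph.pl --title "$POD CPU" \
  > "$OUT"
echo "wrote $OUT"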
That script has saved me hours. Run it on the pod that’s misbehaving, get an SVG, open it in a browser, look for the widest plateau. Done in under a minute.
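If you’d rather have a ranked list than squint at an SVG, the folded output of stackcollapse-perf.pl (one `frame;frame;frame count` line per unique stack) is easy to summarize. The sketch below is an illustration, not part of the FlameGraph repo: `topSelfFrames` is a hypothetical helper that ranks leaf frames by sample count, which is roughly what a plateau at the top of a flame graph shows.

```typescript
// Rank frames by self time from folded stacks ("a;b;c 42" per line).
// The leaf frame of each stack is where the CPU was running when the
// sample was taken, so a large leaf count corresponds to a wide plateau.
function topSelfFrames(folded: string, limit = 5): Array<[string, number]> {
  const self = new Map<string, number>();
  for (const line of folded.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    // The sample count is the last space-separated field on the line.
    const split = trimmed.lastIndexOf(" ");
    if (split < 0) continue;
    const count = Number(trimmed.slice(split + 1));
    if (!Number.isFinite(count)) continue;
    const frames = trimmed.slice(0, split).split(";");
    const leaf = frames[frames.length - 1];
    self.set(leaf, (self.get(leaf) ?? 0) + count);
  }
  // Highest sample count first.
  return [...self.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}
```

Feed it the output of stackcollapse-perf.pl and the first entry is the frame to go look at. It only counts self time, so a wide frame that spends all its time in children will not show up here; the SVG is still the better view for that.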
The problem with on-demand profiling is that by the time you SSH into the pod, the incident is half over and the bad stack is already cold. At the creator-economy platform I worked at the last few years, our EKS fleet runs thousands of pods. Half the time the misbehaving pod has been restarted by the time I’m in there.
That’s why I’ve moved every serious production system I own to continuous profiling. Pyroscope or Parca, sampled CPU plus heap, scraped continuously and kept for 14 days. Sample rate is low enough that overhead is negligible. When something goes sideways, I pick the time window, filter by service and pod, and the right flame graph is sitting there waiting.
A minimal DaemonSet for a Parca-style agent on EKS:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: parca-agent
  namespace: observability
spec:
  selector:
    matchLabels: { app: parca-agent }
  template:
    metadata:
      labels: { app: parca-agent }
    spec:
      hostPID: true
      containers:
        - name: parca-agent
          image: ghcr.io/parca-dev/parca-agent:v0.30.0
          args:
            - /bin/parca-agent
            - --node=$(NODE_NAME)
            - --remote-store-address=parca.observability.svc.cluster.local:7070
            - --profiling-cpu-sampling-frequency=19
            - --metadata-external-labels=cluster=prod-us-east-1
          env:
            - name: NODE_NAME
              valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
          securityContext:
            privileged: true
          volumeMounts:
            - { name: sys, mountPath: /sys, readOnly: true }
            - { name: cgroup, mountPath: /sys/fs/cgroup, readOnly: true }
      volumes:
        - { name: sys, hostPath: { path: /sys } }
        - { name: cgroup, hostPath: { path: /sys/fs/cgroup } }
19 Hz sampling, eBPF-based stack walking, no app instrumentation required. Once this is in place, “give me a CPU flame graph for the community service between 14:30 and 14:45 yesterday” is a query, not an archaeological dig.
bpftrace is the one tool I wish I’d learned five years earlier. The one-liners cover 80% of the questions you have during an incident. A few I genuinely keep in shell history:
# 1. Per-syscall latency histogram for a PID, helps spot slow read/write
# Key the start time by tid only: the enter and exit probe names differ
# (sys_enter_read vs sys_exit_read), so a [tid, probe] key never matches on exit.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_* /pid == $1/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_* /pid == $1 && @start[tid]/ {
  @lat[probe] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}' 12345
# 2. TCP state transitions, catches reconnection storms
sudo bpftrace -e '
tracepoint:sock:inet_sock_set_state {
  printf("%-6d %-16s %d -> %d\n", pid, comm, args->oldstate, args->newstate);
}'
# 3. Off-CPU time per stack, finds threads blocked on locks or IO
# On some kernels the symbol is finish_task_switch.isra.0; check with bpftrace -l.
sudo bpftrace -e '
kprobe:finish_task_switch { @start[arg0] = nsecs; }
kretprobe:schedule /@start[curtask]/ {
  @off[kstack] = sum(nsecs - @start[curtask]);
  delete(@start[curtask]);
}'
# 4. Slow PostgreSQL queries from userland uprobe (libpq)
sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libpq.so.5:PQexec { @s[tid] = nsecs; }
uretprobe:/usr/lib/x86_64-linux-gnu/libpq.so.5:PQexec /@s[tid]/ {
  $d = (nsecs - @s[tid]) / 1000000;
  if ($d > 100) { printf("slow query %d ms pid=%d\n", $d, pid); }
  delete(@s[tid]);
}'
The second one is the one I ran during a war story I’ll get to in a second. The fourth one is the one that saved my Tuesday morning at the creator-economy platform a year and a half ago, when a long-running ANALYZE on a hot community table was starving WAL on the writer and pushing Aurora reader replica lag past 14 minutes. I’d written code that looked clean. The slow path was downstream of a maintenance cron nobody wanted to claim ownership of. bpftrace caught it in 30 seconds where Datadog query traces didn’t.
When a pod is stuck, not crashed, the worst move is to kill it. You lose the only copy of the state that would tell you why it’s stuck. gcore lets you grab a core dump from a running process without killing it, so you can poke at it offline.
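#!/usr/bin/env bash
# stuck-pod-core.sh - dump a hung process to a core file and copy it out
set -euo pipefail
POD=${1:?usage: stuck-pod-core.sh POD [NAMESPACE]}
NAMESPACE=${2:-default}
kubectl exec -n "$NAMESPACE" "$POD" -- bash -c '
  # gcore ships with gdb; assumes a Debian-based image with root and network access
  apt-get update >/dev/null && apt-get install -y gdb >/dev/null
  PID=$(pgrep -f "node " | head -1)
  gcore -o /tmp/stuck "$PID"            # writes /tmp/stuck.<PID>
  mv "/tmp/stuck.$PID" /tmp/stuck.core  # fixed name: PID is only known inside the pod
  ls -lh /tmp/stuck.core
'
kubectl cp "$NAMESPACE/$POD:/tmp/stuck.core" ./stuck.core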
# offline:
# lldb -c stuck.core -- $(which node)
# (lldb) thread backtrace all
For Node, you can do better with process._debugProcess(pid) on the live pod and attach an inspector, but gcore is the bigger hammer when the event loop is fully stuck. I’ve used this pattern to diagnose hung Sidekiq workers and a Node consumer that was wedged in a tight JSON.parse loop on a malformed payload. In both cases the live pod kept serving traffic at reduced capacity while I worked the core offline.
Now the war story I promised. The reconnection storm.
09:31 on a Tuesday, 74 seconds after market open. The trading platform’s Socket.io gateways were designed to handle around ten million concurrent connections at peak. Clients started dropping en masse, reconnecting immediately, dropping again. Within 90 seconds, every gateway pod was pinned at 100% CPU and p99 tick fan-out latency went from 80 ms to 3 seconds. Charts on the client showed stale prices. For a trading product, that’s the worst failure mode there is.
First thing I did was wrong. I scaled the deployment from 3 to 9 pods with a manual kubectl scale. New pods came online, hit the reconnect storm, and went CPU-bound within 20 seconds of joining the pool. I was feeding the fire. Worse, the bigger pool meant more partial-success reconnects: clients got “connection established”, then dropped again the moment the pod saturated.
The real fix took two things in parallel. First, an emergency client-side config push through a remote-config channel we’d built for exactly this kind of moment: jittered exponential backoff with min: 200ms, max: 30s, factor: 2, jitter: +/-50%. Second, a tight per-IP connection-rate limit at the nginx layer:
# /etc/nginx/conf.d/ws-gateway.conf
limit_conn_zone $binary_remote_addr zone=ws_per_ip:10m;
limit_req_zone $binary_remote_addr zone=ws_new:10m rate=3r/s;

upstream ws_gateway {
    least_conn;
    server gw-1:8080 max_fails=2 fail_timeout=5s;
    server gw-2:8080 max_fails=2 fail_timeout=5s;
    server gw-3:8080 max_fails=2 fail_timeout=5s;
    keepalive 256;
}

server {
    listen 443 ssl http2;
    server_name ws.example.com;

    location /socket.io/ {
        limit_conn ws_per_ip 50;
        limit_req zone=ws_new burst=10 nodelay;
        proxy_pass http://ws_gateway;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;
    }
}
Within 8 minutes the pool stabilized and tick latency was back under 200 ms. Around 14 minutes of degraded tick delivery during market open, which is the worst possible 14 minutes of the trading week. Plenty of angry tickets, no actual losses, but you remember those numbers.
The lesson stuck. Autoscale is not a fix for a self-amplifying client-side bug. Backoff lives on the client, not on the server. And if I’d had bpftrace’s inet_sock_set_state running on a sample node from the first minute, I’d have seen the TIME_WAIT explosion before I touched the autoscaler.
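For reference, the backoff parameters from that config push (min 200 ms, max 30 s, factor 2, jitter +/-50%) can be sketched roughly like this. It is an illustration, not the client code we shipped: `BackoffConfig` and `reconnectDelayMs` are hypothetical names, and applying the jitter after the cap (so the longest possible delay is 45 s, not 30 s) is an assumption on my part.

```typescript
// Hypothetical config shape; the values mirror the ones quoted above.
interface BackoffConfig {
  minMs: number;   // delay before the first retry
  maxMs: number;   // cap on the un-jittered delay
  factor: number;  // multiplier applied per attempt
  jitter: number;  // 0.5 means +/-50% around the base delay
}

const DEFAULT_BACKOFF: BackoffConfig = {
  minMs: 200,
  maxMs: 30000,
  factor: 2,
  jitter: 0.5,
};

// attempt is zero-based: 0 for the first reconnect, 1 for the second, ...
// random is injectable so the function can be exercised deterministically.
function reconnectDelayMs(
  attempt: number,
  cfg: BackoffConfig = DEFAULT_BACKOFF,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped at maxMs.
  const base = Math.min(cfg.maxMs, cfg.minMs * Math.pow(cfg.factor, attempt));
  // Map random() in [0, 1) to a spread in [-jitter, +jitter).
  const spread = (random() * 2 - 1) * cfg.jitter;
  return Math.round(base * (1 + spread));
}
```

The jitter is what actually breaks up the herd: without it, every client that dropped in the same second retries in the same second. If you’re on the stock Socket.io client, its `reconnectionDelay`, `reconnectionDelayMax`, and `randomizationFactor` options cover similar ground.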
You want a sentence in your head before you run the command. “I think a hot path is doing too much JSON.parse” gives you a flame graph filter. “I think we’re blocked on syscalls” gives you a bpftrace target. “I think the consumer is stuck in a tight loop” gives you a core dump and a backtrace. Without that sentence you’re just collecting data, and at 100% CPU your tooling becomes part of the problem. The reflex to avoid is the one I had at 09:31 that morning. More pods, more memory, more retries. Those make a profile question harder, never easier.
bpftrace one-liners worth keeping in shell history: inet_sock_set_state, off-CPU stacks, syscall latency.
gcore lets you debug a stuck pod without killing it. Take the core, restart later.
Thanks for reading. If you’ve got thoughts, send them my way.