From 200ms to 15ms

How a gateway tier ate 185ms of p99 latency without anyone noticing, and the boring fixes that gave it back. JWKS caching, DNS, keep-alive, and async logs.

It was a Thursday afternoon at the creator-economy platform I worked at. I’d pulled a Datadog trace for a routine API call. p99 latency on /me sat at 212ms. The actual handler? 9ms. Everything else, the missing 200-ish ms, was the gateway tier. Network. Auth. Logging. Stuff nobody on the squad was looking at. I had a free afternoon. So I started pulling threads.

A week later we were at 15ms p99 on the same endpoint. Same handler. Same Aurora. No new boxes. The gateway had been quietly eating most of the budget for months because everyone was looking at handler timings, not edge timings. Here’s how that went.

The shape of where time was going

When you flamegraph a single request through a typical Node gateway, the slow stuff is rarely the code you wrote. It’s the stuff between the code you wrote. I broke the 200ms down into rough buckets by adding span markers around each hop.

import { trace } from '@opentelemetry/api'

const tracer = trace.getTracer('api-gateway')

export async function handle(req: GatewayRequest): Promise<Response> {
  return tracer.startActiveSpan('gateway.handle', async (span) => {
    const t0 = process.hrtime.bigint()

    const decoded = await tracer.startActiveSpan('jwt.verify', (s) => {
      try { return verifyToken(req.token) } finally { s.end() }
    })

    const upstream = await tracer.startActiveSpan('http.upstream', async (s) => {
      try { return await callUpstream(decoded.sub, req) }
      finally { s.end() }
    })

    span.setAttribute('gateway.total_ms', Number(process.hrtime.bigint() - t0) / 1e6)
    span.end()
    return upstream
  })
}

What the spans told me: roughly 70ms went to JWT verification, 40ms to DNS, 35ms to TCP/TLS handshakes against upstreams, 30ms to a synchronous access log write, and the rest scattered across small things. Each one was, on its own, defensible. Together they were the whole budget.

JWT verification was doing a network call

This was the one that made me wince. The gateway used JWKS for token verification, which is fine. But the JWKS fetch was happening on every request. No cache. The auth team had shipped this six months ago with a TODO that said “add cache later.” Later had not arrived.

import { createRemoteJWKSet, jwtVerify } from 'jose'

const JWKS = createRemoteJWKSet(
  new URL(process.env.OIDC_JWKS_URI as string),
  {
    cacheMaxAge: 10 * 60 * 1000,
    cooldownDuration: 30 * 1000,
    timeoutDuration: 1500,
  },
)

export async function verifyToken(token: string) {
  const { payload } = await jwtVerify(token, JWKS, {
    issuer: process.env.OIDC_ISSUER,
    audience: process.env.OIDC_AUDIENCE,
    clockTolerance: 5,
  })
  return payload
}

70ms gone. Cold start still pays for one fetch, every subsequent request hits memory. If the upstream rotates keys aggressively you need a short cooldown, but for an OIDC provider that rotates monthly this is the right shape.

DNS was being resolved per request

I missed this for two days. I’d assumed Node was caching DNS. It is not. Node asks the OS resolver on every connection, and what the OS does depends on your container’s nsswitch, your VPC resolver, and whether you’ve got CoreDNS in front. We were going through CoreDNS in EKS with a 5s cache, which sounds fine until you realize that on a hot endpoint at high RPS, even 5s means a lot of DNS work.

I dropped in cacheable-lookup and pointed the HTTP agent at it.

import http from 'node:http'
import https from 'node:https'
import CacheableLookup from 'cacheable-lookup'

const lookup = new CacheableLookup({
  maxTtl: 60,
  errorTtl: 5,
  fallbackDuration: 3600,
})

lookup.install(http.globalAgent)
lookup.install(https.globalAgent)

40ms gone. Not glamorous. The kind of fix that doesn’t make it into a postmortem because nothing broke, it was just slow.

Keep-alive pools, properly sized

Even with DNS cached, we were still paying TCP and TLS handshake cost per upstream call, because the default Node HTTP agent’s keepAlive is off. I’ve shipped Node gateways at three different employers and every single time the default agent was wrong for the workload.

import { Agent as HttpAgent } from 'node:http'
import { Agent as HttpsAgent } from 'node:https'

export const upstreamAgent = new HttpsAgent({
  keepAlive: true,
  keepAliveMsecs: 30_000,
  maxSockets: 256,
  maxFreeSockets: 64,
  timeout: 5_000,
  scheduling: 'lifo',
})

export const internalAgent = new HttpAgent({
  keepAlive: true,
  keepAliveMsecs: 30_000,
  maxSockets: 512,
  maxFreeSockets: 128,
})

scheduling: 'lifo' matters more than you’d think. With LIFO the hottest sockets stay hot and idle sockets age out cleanly when traffic drops. FIFO keeps every socket marginally alive, which leads to TLS resumption thrashing upstream. 35ms gone.

Access logs were blocking the response

The gateway was writing an access log line synchronously, before the response flushed. JSON, stdout, fluent-bit as the driver. Normal so far. The problem: the line included a JSON-stringified body diff for audit, which on bigger responses was 20-30ms of CPU per request, inline with the response.

Fix is to move logging off the hot path. Flush the response first, enqueue the log on a microtask. If the worker dies between response and log you lose the line, and I’m okay with that trade for a gateway tier. Need stronger guarantees? Batch into a ring buffer.

import { setImmediate } from 'node:timers'

export function logAccess(entry: AccessLog) {
  setImmediate(() => {
    try {
      process.stdout.write(JSON.stringify(entry) + '\n')
    } catch {
      // log driver is the SRE team's problem, not the request path's
    }
  })
}

30ms gone. Tail latency improved more than the average because the stringify cost scaled with payload size.

Takeaways

Flamegraph the gateway before you flamegraph the handler. The waste is usually in the middle.
JWKS without a cache is a network call per request. Cache it with cacheMaxAge and a short cooldown.
Node does not cache DNS by default. Install cacheable-lookup on both HTTP and HTTPS agents.
Keep-alive with scheduling: 'lifo' and a real maxSockets is one of the highest-leverage tuning knobs in Node.
Move access logging off the response path. setImmediate is fine for most gateways.
Scale is the wrong reflex when latency is the symptom. Cheaper-per-request first, then more boxes.

Thanks for reading. If you’ve got thoughts, send them my way.