Microservice Security and Zero Trust

How I run mTLS, JWT propagation, machine-to-machine OAuth2, API key rotation, and Kubernetes NetworkPolicies across a hundreds-of-services topology, with two war stories where trust was the bug.

Saturday afternoon. A live combat-sports tournament was being broadcast publicly. I was acting CTO at the federation platform I’d built in London, hundreds of microservices behind a Kafka backbone, the standings page open on commentators’ screens. Around the third bout the standings-projector consumer group started rebalancing every 30 seconds, the leaderboard froze at 14:32, and PagerDuty lit up. The fix was unglamorous. One pod out of six was running a stale image with a different max.poll.interval.ms, an old downstream call past its deadline, no SHA pin on the consumer deployment. That outage wasn’t a security incident, but it taught me the lesson zero trust is really about. The cluster trusted that pod. It shouldn’t have.

OK so when people say “zero trust microservices” they usually mean five separate things stacked on top of each other. mTLS between services. Signed user context flowing through requests. Machine-to-machine auth for the stuff that doesn’t have a user. API keys that actually rotate. And a network layer that says no by default. I’ve run all five together, across thousands of pods, at the creator economy platform I worked at, and I’ll tell you which knobs I actually care about.

mTLS between services, with rotation

Most teams stop at “we have a service mesh, we’re good.” Then I ask when their CA root last rotated and the room goes quiet. mTLS is only useful if certificates expire and rotate without humans touching anything.

I run cert-manager with an internal CA for service certs. Short lifetimes. Sidecar reload on rotation. The hard part isn’t issuing certificates, it’s getting the consumer side to keep working when a cert flips at 03:14 UTC.

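import fs from 'node:fs';
import { Agent, fetch } from 'undici';

const certPath = '/var/run/secrets/spiffe/tls.crt';
const keyPath  = '/var/run/secrets/spiffe/tls.key';
const caPath   = '/var/run/secrets/spiffe/ca.crt';

let agent = buildAgent();
let reload: NodeJS.Timeout | undefined;

fs.watch('/var/run/secrets/spiffe', { persistent: false }, () => {
  // The mounted files are replaced via an atomic symlink swap, which fires
  // several watch events in a row. Debounce, then rebuild the agent so new
  // connections pick up the new keypair.
  clearTimeout(reload);
  reload = setTimeout(() => {
    try {
      const prev = agent;
      agent = buildAgent();
      // Give requests on the old agent 30s to finish, then close it.
      setTimeout(() => prev.close().catch(() => {}), 30_000).unref();
    } catch (err) {
      // Unreadable files mid-swap: keep serving with the current agent.
      console.error('cert reload failed', err);
    }
  }, 500);
});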

function buildAgent(): Agent {
  return new Agent({
    connect: {
      cert: fs.readFileSync(certPath),
      key:  fs.readFileSync(keyPath),
      ca:   fs.readFileSync(caPath),
      rejectUnauthorized: true,
      minVersion: 'TLSv1.3',
    },
    keepAliveTimeout: 10_000,
    keepAliveMaxTimeout: 30_000,
  });
}

export async function callOrders(path: string) {
  const res = await fetch(`https://orders.internal${path}`, { dispatcher: agent });
  if (!res.ok) throw new Error(`orders ${res.status}`);
  return res.json();
}

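Two non-obvious things. Pooled keep-alive sockets keep using the old cert until they close, so I keep the previous agent around for 30 seconds before closing it. And rejectUnauthorized: true is the line that turns mTLS from theater into actual auth. On the client it verifies the server’s certificate against the internal CA; the server needs the same setting plus requestCert: true to verify the caller. If you ever see it set to false in code review, that’s the bug.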
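For the receiving side, here is a minimal sketch of what a server might enforce, assuming a plain Node https server rather than a mesh sidecar. The helper names (mtlsServerOptions, peerIsAllowed) and the SPIFFE-style ID in the usage note are hypothetical. requestCert, rejectUnauthorized, and minVersion are standard Node TLS options, and the peer string is assumed to have the shape Node exposes as getPeerCertificate().subjectaltname.

```typescript
// Hypothetical helper: options to pass to https.createServer().
function mtlsServerOptions(pem: { cert: string; key: string; ca: string }) {
  return {
    cert: pem.cert,
    key: pem.key,
    ca: pem.ca,
    requestCert: true,        // ask every client for a certificate
    rejectUnauthorized: true, // fail the handshake if it does not chain to `ca`
    minVersion: 'TLSv1.3' as const,
  };
}

// Chain validation only proves the cert came from our CA. This hypothetical
// check decides whether this particular caller may talk to this service.
// Assumed input shape: "DNS:gateway.edge.svc, URI:spiffe://cluster.local/ns/edge/sa/gateway"
function peerIsAllowed(
  subjectAltName: string | undefined,
  allowed: ReadonlySet<string>,
): boolean {
  if (!subjectAltName) return false;
  return subjectAltName
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.startsWith('URI:'))
    .some((entry) => allowed.has(entry.slice('URI:'.length)));
}
```

Usage would look roughly like https.createServer(mtlsServerOptions(pem), handler), with peerIsAllowed(req.socket.getPeerCertificate().subjectaltname, allowed) checked per request.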

User context flows as a signed JWT

Once mTLS is in place, the network knows who the calling service is. It does not know who the user is. That’s the job of a short-lived JWT minted by the edge and propagated as a header.

The rule I’ve kept across every microservice topology I’ve shipped: services never re-authenticate the user, they verify the token signature and use the claims. Re-authentication at every hop is how you get a login service doing a million reads per second for no reason.

import { Injectable, CanActivate, ExecutionContext, UnauthorizedException } from '@nestjs/common';
import { jwtVerify, createRemoteJWKSet } from 'jose';

const jwks = createRemoteJWKSet(new URL(process.env.JWKS_URL!));

@Injectable()
export class UserContextGuard implements CanActivate {
  async canActivate(ctx: ExecutionContext): Promise<boolean> {
    const req = ctx.switchToHttp().getRequest();
    const raw = req.headers['x-user-context'];
    if (!raw) throw new UnauthorizedException('missing user context');

    try {
      const { payload } = await jwtVerify(raw, jwks, {
        issuer: 'edge.internal',
        audience: 'orders.internal',
        // 30s clock skew is the most I'll tolerate. More than that and
        // someone's NTP is broken and I'd rather know.
        clockTolerance: 30,
      });
      req.user = { id: payload.sub, scopes: payload.scopes ?? [] };
      return true;
    } catch {
      throw new UnauthorizedException('bad user context');
    }
  }
}

The token lifetime is 5 minutes. Refresh happens at the edge, not in the downstream services. Audience claim is per-service, so a token minted for orders.internal cannot be replayed against billing.internal. That last part is the one I see teams skip the most.
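To make the audience rule concrete, here is a small sketch of the claim checks that sit behind a verifier like jwtVerify. It deliberately skips signature verification, so treat it as an illustration of the logic and not a substitute for the guard above. checkClaims, Claims, and Verdict are made-up names for this example.

```typescript
interface Claims {
  iss: string;
  aud: string | string[];
  exp: number; // seconds since epoch
  sub: string;
}

type Verdict = 'ok' | 'wrong-issuer' | 'wrong-audience' | 'expired';

// Illustration only: checks issuer, audience, and expiry with a clock
// tolerance. It does NOT verify the signature.
function checkClaims(
  claims: Claims,
  expected: { issuer: string; audience: string },
  nowSec: number,
  toleranceSec = 30,
): Verdict {
  if (claims.iss !== expected.issuer) return 'wrong-issuer';
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (!audiences.includes(expected.audience)) return 'wrong-audience';
  if (claims.exp + toleranceSec <= nowSec) return 'expired';
  return 'ok';
}

// A token minted for orders.internal, 5 minute lifetime, issued at t=1000.
const ordersToken: Claims = {
  iss: 'edge.internal',
  aud: 'orders.internal',
  exp: 1000 + 300,
  sub: 'user-123',
};

const atOrders = checkClaims(ordersToken, { issuer: 'edge.internal', audience: 'orders.internal' }, 1000);
const atBilling = checkClaims(ordersToken, { issuer: 'edge.internal', audience: 'billing.internal' }, 1000);
```

With these inputs atOrders is 'ok' and atBilling is 'wrong-audience', which is the replay the per-service audience is there to stop.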

Machine-to-machine is OAuth2 client credentials

For service-to-service calls that aren’t on behalf of a user, JWT-as-user-context is the wrong tool. The right tool is OAuth2 client credentials, with the service identity baked into the token. I run a small in-cluster authorization server for this. Tokens are 15 minutes, scoped per caller, audience-bound.

import { Agent, fetch } from 'undici';

interface CachedToken { token: string; exp: number }
const cache = new Map<string, CachedToken>();

export async function m2mToken(audience: string, agent: Agent): Promise<string> {
  const hit = cache.get(audience);
  if (hit && hit.exp - 60 > Date.now() / 1000) return hit.token;

  const body = new URLSearchParams({
    grant_type: 'client_credentials',
    audience,
    scope: 'read:invoices write:invoices',
  });

  const res = await fetch('https://auth.internal/oauth/token', {
    method: 'POST',
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    body,
    dispatcher: agent,
  });
  if (!res.ok) throw new Error(`m2m token ${res.status}`);
  const json = await res.json() as { access_token: string; expires_in: number };

  cache.set(audience, {
    token: json.access_token,
    exp: Math.floor(Date.now() / 1000) + json.expires_in,
  });
  return json.access_token;
}

Caching matters. Every team I’ve joined has at some point shipped a bug where each request to a downstream service also did a token-exchange round trip. Latency goes through the roof and you’ve added a single point of failure named “the auth service.” Cache the token, refresh 60 seconds before expiry, never store it on disk.
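One gap a plain cache leaves open: on a cold start or right at refresh time, many concurrent requests can all miss the cache and hit the auth server at once. A hedged sketch of one way to close it, by sharing the in-flight request per audience. createTokenSource and the injected fetchToken and nowSec parameters are hypothetical names for this example, not part of the code above.

```typescript
type TokenResponse = { access_token: string; expires_in: number };
type Fetcher = (audience: string) => Promise<TokenResponse>;

// Hypothetical wrapper: caches tokens per audience and makes sure only one
// token request per audience is in flight at a time.
function createTokenSource(
  fetchToken: Fetcher,
  nowSec: () => number = () => Date.now() / 1000,
) {
  const cache = new Map<string, { token: string; exp: number }>();
  const inflight = new Map<string, Promise<string>>();

  return function getToken(audience: string): Promise<string> {
    // Same rule as before: treat the token as stale 60s before expiry.
    const hit = cache.get(audience);
    if (hit && hit.exp - 60 > nowSec()) return Promise.resolve(hit.token);

    // Someone is already fetching this audience: share their promise.
    const pending = inflight.get(audience);
    if (pending) return pending;

    const request = fetchToken(audience)
      .then((json) => {
        cache.set(audience, {
          token: json.access_token,
          exp: Math.floor(nowSec()) + json.expires_in,
        });
        return json.access_token;
      })
      .finally(() => inflight.delete(audience));

    inflight.set(audience, request);
    return request;
  };
}
```

A failed fetch clears the in-flight entry, so the next caller retries instead of inheriting a rejected promise forever.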

API keys with a real lifecycle

For third parties and webhooks, API keys still rule. The mistake is treating them as forever-strings. They aren’t. I use a key table with version, expires_at, last_used_at, and a parent client_id. Rotation means two keys live at once, and the old one expires once its traffic drops off.

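import { createHash, timingSafeEqual } from 'node:crypto';

// `db` and `Client` come from the application's own data layer.
const sha256 = (s: string): Buffer => createHash('sha256').update(s).digest();

async function verifyKey(presented: string): Promise<Client | null> {
  const prefix = presented.slice(0, 8);
  const candidates = await db.apiKeys.findByPrefix(prefix);
  const presentedHash = sha256(presented); // hash once, not per candidate

  for (const k of candidates) {
    if (k.expiresAt && k.expiresAt < new Date()) continue;
    // Assumption: hashedKey is stored as a hex-encoded SHA-256 digest.
    const stored = Buffer.from(k.hashedKey, 'hex');
    // timingSafeEqual throws on length mismatch, so check length first.
    if (stored.length === presentedHash.length && timingSafeEqual(stored, presentedHash)) {
      await db.apiKeys.touchLastUsed(k.id);
      return k.client;
    }
  }
  return null;
}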

Hash the keys at rest. Compare with a timing-safe function. Store a short prefix in cleartext only so the lookup doesn’t scan the whole table. And expire them. A key that hasn’t been used in 90 days is probably sitting on a laptop someone lost. Revoke it.
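The 90-day rule and the two-keys-during-rotation rule are both easy to turn into a scheduled check. A minimal sketch, assuming rows shaped like the key table described above. ApiKeyRow, keysToRevoke, clientsOverRotationLimit, and the createdAt column are hypothetical, introduced only for this example.

```typescript
interface ApiKeyRow {
  id: string;
  clientId: string;
  version: number;
  expiresAt: Date | null;
  lastUsedAt: Date | null;
  createdAt: Date; // assumed column, used when a key has never been seen
}

const IDLE_LIMIT_MS = 90 * 24 * 60 * 60 * 1000;

const isLive = (row: ApiKeyRow, now: Date): boolean =>
  row.expiresAt === null || row.expiresAt.getTime() > now.getTime();

// Live keys with no traffic for more than 90 days.
function keysToRevoke(rows: ApiKeyRow[], now: Date): ApiKeyRow[] {
  return rows.filter((row) => {
    if (!isLive(row, now)) return false; // already expired, nothing to do
    const lastSeen = row.lastUsedAt ?? row.createdAt;
    return now.getTime() - lastSeen.getTime() > IDLE_LIMIT_MS;
  });
}

// Clients holding more than two live keys, i.e. a rotation that never finished.
function clientsOverRotationLimit(rows: ApiKeyRow[], now: Date): string[] {
  const live = new Map<string, number>();
  for (const row of rows) {
    if (!isLive(row, now)) continue;
    live.set(row.clientId, (live.get(row.clientId) ?? 0) + 1);
  }
  return Array.from(live)
    .filter(([, count]) => count > 2)
    .map(([clientId]) => clientId);
}
```

Both functions are pure, so the job that calls them can log what it would revoke before it is allowed to revoke anything.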

NetworkPolicies say no by default

This is the part most clusters skip. The default Kubernetes pod-to-pod policy is “everything can talk to everything.” That’s the opposite of zero trust. I land a deny-all baseline first, then allow specific paths.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: orders
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-allow-from-edge-and-billing
  namespace: orders
spec:
  podSelector:
    matchLabels: { app: orders }
  policyTypes: [Ingress]
  ingress:
    - from:
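        # kubernetes.io/metadata.name is set automatically on every namespace;
        # a bare `name` label only matches if you applied it yourself.
        - namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": edge } }
          podSelector: { matchLabels: { app: gateway } }
        - namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": billing } }
          podSelector: { matchLabels: { app: invoicer } }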
      ports:
        - port: 8443
          protocol: TCP

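You want this in place before you have hundreds of services, not after. Retrofitting deny-all on a live topology is a 6-month project I’ve watched twice. One warning: the baseline above also denies egress, so every pod in the namespace loses DNS and all outbound calls until you add explicit egress allows. Land those in the same change.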
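If you are retrofitting, it helps to know which namespaces still lack the baseline. A rough sketch of a check you could run in CI over policies you have already listed from the cluster. The NetworkPolicy interface here is a trimmed, hand-written shape, not the official client type, and isDefaultDeny and namespacesMissingDefaultDeny are hypothetical names.

```typescript
// Trimmed shape of a NetworkPolicy object, just the fields this check reads.
interface NetworkPolicy {
  metadata: { name: string; namespace: string };
  spec: {
    podSelector: { matchLabels?: Record<string, string>; matchExpressions?: unknown[] };
    policyTypes?: string[];
    ingress?: unknown[];
    egress?: unknown[];
  };
}

// A policy counts as a deny-all baseline if it selects every pod, covers both
// directions, and carries no allow rules of its own.
function isDefaultDeny(policy: NetworkPolicy): boolean {
  const selector = policy.spec.podSelector;
  const selectsAllPods =
    Object.keys(selector.matchLabels ?? {}).length === 0 &&
    (selector.matchExpressions ?? []).length === 0;
  const types = policy.spec.policyTypes ?? [];
  const hasNoRules =
    (policy.spec.ingress ?? []).length === 0 &&
    (policy.spec.egress ?? []).length === 0;
  return selectsAllPods && types.includes('Ingress') && types.includes('Egress') && hasNoRules;
}

function namespacesMissingDefaultDeny(
  namespaces: string[],
  policies: NetworkPolicy[],
): string[] {
  const covered = new Set(policies.filter(isDefaultDeny).map((p) => p.metadata.namespace));
  return namespaces.filter((ns) => !covered.has(ns));
}
```

Fed the two policies above plus a namespace list of orders and billing, it would report billing as uncovered.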

When the upstream lies to you

One more war story, because it taught me a different shape of trust. At the creator economy platform I worked at, our branded mobile app pipeline was submitting native iOS builds for thousands of creator apps. The submission queue started backing up. By lunch, hundreds of customer apps were stuck in “Waiting for Review” on App Store Connect, but our pipeline thought they were submitted. Apple’s API was returning 200 OK with a normal-looking body, then silently dropping the submission. Our auto-retry on 5xx got extended to retry on “stuck”, which made Apple see duplicate submissions, which made some customers end up with two competing review records. I’m not proud of that one. The fix was a circuit breaker that verified state via a separate GET against the App Store Connect resource, never via the response of the POST. Same rule applies to OAuth and API keys. The body of a 200 OK is not proof that anything happened on the other side. Read-after-write against the source of truth, or you’re just hoping.
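The shape of that fix, as a hedged sketch. None of this is Apple's API: Upstream, SubmissionState, and submitAndVerify are hypothetical stand-ins, and checking state before submitting is my own addition so that a retry cannot create a duplicate.

```typescript
type SubmissionState = 'missing' | 'pending' | 'in_review';

// Hypothetical upstream client: one write call, one independent read call.
interface Upstream {
  submit(buildId: string): Promise<{ status: number }>;
  getState(buildId: string): Promise<SubmissionState>;
}

async function submitAndVerify(
  upstream: Upstream,
  buildId: string,
  opts: { attempts: number; delayMs: number; sleep: (ms: number) => Promise<void> },
): Promise<'confirmed' | 'unconfirmed'> {
  // Look before writing: if the upstream already has it, do not submit again.
  if ((await upstream.getState(buildId)) !== 'missing') return 'confirmed';

  const res = await upstream.submit(buildId);
  // An error status is still a real signal. A 2xx is not proof of anything.
  if (res.status >= 400) throw new Error(`submit failed with ${res.status}`);

  // Read-after-write against the source of truth.
  for (let i = 0; i < opts.attempts; i++) {
    if ((await upstream.getState(buildId)) !== 'missing') return 'confirmed';
    await opts.sleep(opts.delayMs);
  }
  // Still not visible: report it and let a human or a breaker decide.
  // Never re-submit blindly from here.
  return 'unconfirmed';
}
```

The important property is that 'unconfirmed' is a distinct outcome the caller has to handle, instead of being folded into success or into an automatic retry.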

Takeaways

mTLS only counts if certificates rotate without you. Watch the secrets dir, rebuild the agent, drain old keep-alive sockets.
JWTs carry user context, OAuth2 client credentials carry service identity. Don’t confuse them.
Audience-bind every token. A token for one service must not be replayable against another.
Hash API keys at rest. Expire them. Two live versions during rotation, never three.
NetworkPolicies start with deny-all in every namespace. Allow specific paths from there.
A 200 OK is not auth. Verify state against the source of truth.

Thanks for reading. If you’ve got thoughts, send them my way.