Kubernetes for Backend Engineers

What a backend engineer actually has to get right about Kubernetes before shipping a service to production. Pods, probes, HPA, and resource sizing from a real production angle.

A Tuesday morning at the creator-economy platform I worked at the last few years. A backend service was crashlooping. Liveness returning 200, readiness flapping, pod torn down every 90 seconds. We spent an hour before figuring out the probe path was hitting a health endpoint that had silently started touching the database.

That’s the Kubernetes I want to talk about. Not the architecture diagram, the bits a backend engineer has to get right before shipping.

The platform I work at runs thousands of pods on EKS. I do not own the cluster, I own the manifests. The parts of Kubernetes that matter from that seat are smaller than the docs make it look. Pods, Deployments, Services, probes, HPA, resource sizing. Get those right and you outsource 80% of the pain to the platform team. Get them wrong and you’ll be the reason a feature freeze hits Slack.

Pods are not the unit you ship

A Pod is the smallest deployable thing in Kubernetes. You will never write one by hand in production. You write a Deployment, which writes a ReplicaSet, which writes Pods. Almost every “weird Pod behavior” question is actually a Deployment question.

A minimum-viable Deployment I’d ship and not be embarrassed about:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  labels:
    app: orders-api
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: app
          image: ghcr.io/example/orders-api@sha256:9f1b...e2
          ports:
            - containerPort: 3000
          envFrom:
            - secretRef:
                name: orders-api-secrets
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 4
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

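A few non-obvious choices. maxUnavailable: 0 means a rollout cannot drop ready capacity below the desired replica count: with maxSurge: 1, a new pod has to come up and pass readiness before an old one is taken away. The preStop sleep gives the Service’s endpoints time to drop the pod before it receives SIGTERM, and terminationGracePeriodSeconds: 45 leaves room for that sleep plus draining in-flight requests. The image is pinned to a SHA, not a tag. I’ll come back to that one because it once cost us 12 minutes of stale data on a public leaderboard.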
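The surge and unavailability arithmetic is easy to check by hand, but here is a minimal sketch anyway. The rolloutBounds name is made up for illustration, and it only handles absolute values, not the percentage form Kubernetes also accepts.

```typescript
// Hypothetical helper: pod-count bounds during a RollingUpdate.
// Handles absolute maxSurge / maxUnavailable values only (percentages are not covered).
interface RolloutBounds {
  minAvailable: number // fewest available pods the rollout may leave in place
  maxTotal: number     // most pods (old + new) that may exist at once
}

function rolloutBounds(replicas: number, maxSurge: number, maxUnavailable: number): RolloutBounds {
  return {
    minAvailable: replicas - maxUnavailable,
    maxTotal: replicas + maxSurge,
  }
}

// The Deployment above: replicas 3, maxSurge 1, maxUnavailable 0.
const bounds = rolloutBounds(3, 1, 0)
console.log(bounds) // { minAvailable: 3, maxTotal: 4 }
```

So the rollout briefly runs four pods and never fewer than three available ones, which means the node pool needs headroom for one extra pod.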

Probes lie if you let them

Most common backend mistake I see: wiring livenessProbe and readinessProbe to the same endpoint. Readiness controls traffic. Liveness controls restarts. If your readiness probe hits the DB and the DB is slow, the pod gets pulled out of the Service’s endpoints. Fine. If liveness does the same, the kubelet restarts the container during a slow-DB blip, every restarted pod reconnects its pool at once, and latency spikes for everyone. You just made the incident worse.

The pattern I use, every time:

import type { FastifyInstance } from 'fastify'
import { db } from './db'
import { redis } from './redis'

export async function registerHealthRoutes(app: FastifyInstance) {
  app.get('/healthz/live', async () => {
    return { status: 'ok' }
  })

  app.get('/healthz/ready', async (req, reply) => {
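    const names = ['db', 'redis']
    const checks = await Promise.allSettled([
      db.query('select 1'),
      redis.ping(),
    ])

    // Filter on the original index so each result stays paired with its check name.
    // (Filtering first and mapping afterwards would mislabel a redis-only failure as 'db'.)
    const failed = names.filter((_, i) => checks[i].status === 'rejected')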

    if (failed.length > 0) {
      req.log.warn({ failed }, 'readiness check failed')
      return reply.code(503).send({ status: 'not_ready', failed })
    }

    return { status: 'ready' }
  })
}

Liveness is “the process is alive and the event loop is responsive.” Readiness is “I can take traffic right now.” Two different questions, two different endpoints.
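One thing the readiness handler above leaves open is how long a dependency check may hang. The kubelet probe has its own timeout (timeoutSeconds, 1 second by default), so a check that never settles shows up as a failed probe with nothing in your logs. A minimal sketch of bounding each check yourself; withTimeout and failedChecks are hypothetical helpers, not part of Fastify or any client library.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  // Whichever settles first wins; always clear the timer so it cannot keep the process alive.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer)
  })
}

// Named checks in, names of the failed checks out. Every check is bounded by `ms`.
async function failedChecks(
  checks: Record<string, () => Promise<unknown>>,
  ms: number,
): Promise<string[]> {
  const names = Object.keys(checks)
  const results = await Promise.allSettled(
    // Promise.resolve().then(...) turns a synchronous throw into a rejection.
    names.map((name) => withTimeout(Promise.resolve().then(() => checks[name]()), ms, name)),
  )
  return names.filter((_, i) => results[i].status === 'rejected')
}

// Inside the readiness handler this could look like (800ms is an illustrative budget):
//   const failed = await failedChecks(
//     { db: () => db.query('select 1'), redis: () => redis.ping() },
//     800,
//   )
```

Keep the budget below the probe's own timeout so the 503 and the log line come from your code, not from the kubelet giving up.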

Resource requests are a budget, not a wish

Trips up every backend engineer I’ve onboarded. Requests are what the scheduler reserves for your pod. Limits are the ceiling: go past the memory limit and the kernel OOM-kills the container, go past the CPU limit and you get throttled. Set requests: cpu: 2 and your pod cannot land on a node with less than 2 CPUs of unreserved allocatable capacity, even if it uses 50m on a calm afternoon.

Teams set high requests “to be safe” and then wonder why HPA doesn’t scale. An HPA CPU Utilization target is a percentage of the pod’s CPU request, not of the node and not of the limit. Over-request, utilization stays low, HPA never scales, and you’ve reserved capacity nobody else can use.
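The arithmetic is worth seeing once. The Kubernetes docs give the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and for a Utilization target the current metric is usage divided by request. The sketch below is a simplification: it ignores the tolerance band, stabilization windows, and min/max clamping, and the usage numbers are invented.

```typescript
// Simplified HPA math for a CPU Utilization target.
// Ignores tolerance, stabilization windows, unready pods, and min/max replica clamping.
function desiredReplicas(
  currentReplicas: number,
  avgUsageMillicores: number,   // average CPU actually used per pod
  requestMillicores: number,    // resources.requests.cpu per pod
  targetUtilizationPct: number, // averageUtilization in the HPA spec
): number {
  const utilizationPct = (avgUsageMillicores / requestMillicores) * 100
  return Math.ceil(currentReplicas * (utilizationPct / targetUtilizationPct))
}

// Same load (400m used per pod), same 65% target, 3 replicas. Only the request differs.
console.log(desiredReplicas(3, 400, 250, 65))  // request 250m: 160% utilization, wants 8 replicas
console.log(desiredReplicas(3, 400, 2000, 65)) // request 2000m: 20% utilization, wants 1 replica
```

With the inflated request the HPA would happily shrink the Deployment if minReplicas let it, while the pods burn exactly as much CPU as in the first case.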

What I do. Run the service in staging under realistic load. Look at p95 CPU and working-set memory over 24h. Requests at p95 CPU and 1.2x working-set memory. Memory limit at 1.5x to 2x working set. Leave CPU limit unset unless you have a noisy-neighbor reason. Throttling is worse than contention.
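That rule of thumb, written down as a sketch. sizeFromUsage and its field names are hypothetical, and the sample numbers are made up rather than taken from a real service.

```typescript
// Hypothetical sizing helper following the rule of thumb above.
interface ObservedUsage {
  p95CpuMillicores: number // p95 CPU over the observation window
  workingSetMiB: number    // working-set memory over the same window
}

interface Sizing {
  requests: { cpu: string; memory: string }
  limits: { memory: string } // no CPU limit on purpose
}

function sizeFromUsage(usage: ObservedUsage, limitFactor = 2): Sizing {
  if (limitFactor < 1.5 || limitFactor > 2) {
    throw new RangeError('limitFactor is expected to be between 1.5 and 2')
  }
  return {
    requests: {
      cpu: `${Math.ceil(usage.p95CpuMillicores)}m`,
      // 1.2x working set; multiply before dividing so whole-number inputs stay exact.
      memory: `${Math.ceil((usage.workingSetMiB * 12) / 10)}Mi`,
    },
    limits: { memory: `${Math.ceil(usage.workingSetMiB * limitFactor)}Mi` },
  }
}

console.log(sizeFromUsage({ p95CpuMillicores: 250, workingSetMiB: 400 }))
// requests: cpu 250m, memory 480Mi; limits: memory 800Mi
```

Rerun it whenever the traffic shape changes. Numbers measured six months ago are gut feeling with extra steps.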

Services and what they are not

A Service is a stable virtual IP plus a label selector. Traffic gets load-balanced across matching pods. That’s it. Not a circuit breaker, not a retry policy, and unless you ask for type: LoadBalancer, not a cloud load balancer either. For retries and circuit breaking, use a service mesh like Istio or Linkerd, or per-client logic.

Subtle gotcha. Kubernetes routes to a pod as soon as readiness is true. If your pod accepts the probe but hasn’t warmed its connection pool, the first ~30 requests will be slow. Delay readiness until pool warmup finishes, not just until the HTTP listener is up.
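One way to wire that up is a small gate the readiness endpoint consults. A minimal sketch; ReadinessGate is a hypothetical class, and whatever you pass as the warm-up task (opening pooled connections, priming a cache) is up to your service.

```typescript
// Hypothetical readiness gate: report ready only after warm-up has finished.
class ReadinessGate {
  private warmedUp = false

  // Run the warm-up task and flip the flag only if it succeeds.
  async warmUp(task: () => Promise<void>): Promise<void> {
    await task()
    this.warmedUp = true
  }

  // HTTP status the readiness endpoint should answer with.
  statusCode(): 200 | 503 {
    return this.warmedUp ? 200 : 503
  }
}

// Sketch of the wiring (warmPool is a placeholder for your own warm-up function):
//   const gate = new ReadinessGate()
//   void gate.warmUp(warmPool)
//   // in the /healthz/ready handler: if (gate.statusCode() === 503) reply with 503
```

The HTTP listener can start immediately so liveness answers, while readiness stays at 503 until the warm-up promise resolves.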

HPA, and the war story I owe you

HPA scales replicas based on a metric. CPU by default, custom metrics if you wire the adapter. The trap is that HPA is a slow control loop. Not a fix for a self-amplifying client-side bug. The shape I usually start from:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60

Aggressive scale-up, conservative scale-down. Short stabilization on the way up so you react to a real spike. Long on the way down so you don’t yo-yo on a 20-second CPU dip.
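Even “aggressive” has a ceiling, and it helps to see it. A simplified sketch of what the scaleUp policy above permits: it assumes the metric demands maxReplicas the whole time and ignores the stabilization window, metric lag, and pod startup. scaleUpSteps is a made-up name.

```typescript
// Replica counts a single Percent scale-up policy allows, one entry per scaling step.
// Simplified: assumes the metric is pegged and ignores stabilization, metric lag, and pod startup.
function scaleUpSteps(start: number, maxReplicas: number, percent: number): number[] {
  if (start < 1 || percent <= 0) {
    throw new RangeError('start must be at least 1 and percent must be positive')
  }
  const steps = [start]
  let replicas = start
  while (replicas < maxReplicas) {
    // Growth per period is capped relative to the replica count at the start of that period.
    replicas = Math.min(maxReplicas, Math.ceil(replicas * (1 + percent / 100)))
    steps.push(replicas)
  }
  return steps
}

console.log(scaleUpSteps(3, 30, 100)) // [ 3, 6, 12, 24, 30 ]
```

Four steps to get from 3 to 30, each roughly one periodSeconds after the previous one. With a 30-second period that is on the order of a minute and a half after the first scaling decision, before counting metric lag and pod startup.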

At a real-time trading platform I architected years ago, we had a Socket.io tier behind nginx, designed for around 10M concurrent connections at peak. The Tuesday after a bank-holiday weekend, market opened at 09:30 London time, and 74 seconds in, the connection pool started churning. Clients dropped, reconnected, dropped again. p99 tick fan-out went from ~80ms to ~3s.

My first move, manually scale the gateway pods 3x through the HPA override. New pods came online, hit the same reconnect storm, went CPU-bound in 20 seconds. I was feeding the fire. More pods meant more partial-success reconnects, clients getting a handshake then dropping again.

The fix landed in two places. A remote-config push for jittered exponential backoff on the client (min: 200ms, max: 30s, factor: 2, jitter: +/-50%). And a per-IP connection-rate limit at nginx set tight, 3 new connections per second per IP. About 8 minutes later the pool stabilized, tick fan-out back under 200ms. Around 14 minutes of degraded delivery during the most-watched 15-minute window of the trading week.
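For reference, a minimal sketch of jittered exponential backoff with those parameters. The option names and the injectable random function are illustrative and not any particular client library's API; applying jitter after the cap is my choice here, so the longest delay can reach 45s.

```typescript
// Jittered exponential backoff using the parameters quoted above.
// Option names are illustrative, not a specific library's configuration.
interface BackoffOptions {
  minMs: number
  maxMs: number
  factor: number
  jitter: number // 0.5 means +/-50%
}

const reconnectBackoff: BackoffOptions = { minMs: 200, maxMs: 30000, factor: 2, jitter: 0.5 }

function backoffDelayMs(
  attempt: number, // 0 for the first retry
  opts: BackoffOptions = reconnectBackoff,
  random: () => number = Math.random, // injectable so the function can be tested deterministically
): number {
  // Exponential growth, capped at maxMs.
  const base = Math.min(opts.maxMs, opts.minMs * Math.pow(opts.factor, attempt))
  // Spread uniformly across [base * (1 - jitter), base * (1 + jitter)].
  const spread = 1 + opts.jitter * (2 * random() - 1)
  return Math.round(base * spread)
}

// Un-jittered bases for attempts 0..4 are 200, 400, 800, 1600, 3200 ms.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, backoffDelayMs(attempt))
}
```

The jitter is the part that matters in a reconnect storm. Without it every client that dropped at the same moment retries at the same moment, and the backoff only spaces out the stampedes.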

Autoscaling solves “the workload grew.” Not “the clients are screaming.” Backoff lives on the client. HPA cannot save you from a feedback loop.

A Kafka consumer SHA pin, also a war story

At the federation platform I CTO’d in London, a live combat-sports tournament was being broadcast on a Saturday. The standings-projector consumer group started rebalancing every 30 seconds. Lag on match-events kept growing, standings stopped updating, and the public leaderboard froze at 14:32 local.

First instinct, operational. kubectl rollout restart deployment/standings-projector. Consumers re-joined cleanly. Then triggered another rebalance 40 seconds later. Same dance the group was already doing on its own.

Real fix took longer. Pulled pod logs side by side. One pod out of six had a different max.poll.interval.ms: 300s on five, 60s on the sixth. That pod was running a stale image. Someone had pushed a config-touching fix without bumping the tag, and the manifest pulled :latest. Its handler did a slow downstream call that sometimes took ~70s, past that pod’s 60s max.poll.interval.ms, so the group kept kicking it out and rebalancing every time it left or rejoined. Isolated the bad pod, and the storm drained in 90 seconds. The follow-ups: SHA pins on every Kafka-touching deployment, smaller poll batches, and the slow call moved out of the hot loop.

Standing rule from that day. Never reference :latest in a production manifest. CI fails the deploy if it does.
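A check like that does not need much. A minimal sketch, assuming manifests are available to CI as rendered YAML text; findUnpinnedImages is a hypothetical function, and it is line-based on purpose where a real pipeline would more likely parse the YAML.

```typescript
// Hypothetical CI check: list `image:` references in manifest text that are not pinned to a digest.
function findUnpinnedImages(manifest: string): string[] {
  const unpinned: string[] = []
  for (const line of manifest.split('\n')) {
    // Matches `image: ref`, `- image: ref`, with or without quotes around the reference.
    const match = /^\s*(?:-\s*)?image:\s*["']?([^\s"'#]+)/.exec(line)
    if (!match) continue
    const ref = match[1]
    // Pinned means the reference ends in an @sha256:<64 hex chars> digest.
    if (!/@sha256:[a-f0-9]{64}$/.test(ref)) unpinned.push(ref)
  }
  return unpinned
}

// In CI: fail the job when the returned list is non-empty.
```

This flags :latest, any other mutable tag, and untagged references alike, which is stricter than the rule as stated but matches the intent.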

Takeaways

Pods are not the unit you ship. Write Deployments. Set maxUnavailable: 0.
Liveness and readiness are different endpoints. Liveness is “the process is alive,” readiness is “I can take traffic right now.”
Requests are a scheduler budget. Set them from p95 CPU and working-set memory, not from gut feeling.
Avoid CPU limits unless you have a real noisy-neighbor problem. Throttling is worse than contention.
HPA scales workloads, not feedback loops. Backoff and rate limits live closer to the client and the edge.
Pin images to SHAs, especially on Kafka consumers. Tags are not immutable.

Thanks for reading. If you’ve got thoughts, send them my way.