How I tune liveness, readiness, and startup probes across NestJS services so rolling updates don't cascade and a slow downstream doesn't kill the whole fleet.
It was a Tuesday morning at the combat-sports tournament platform I CTO’d in London. The federation had a public broadcast running, and our checkout service started getting killed every 30 seconds. Six pods, all green a minute ago, all CrashLooping now. Liveness probe was failing because the pod did one slow HTTP call to a federation-rules service on boot, and that call sometimes took 70 seconds. We had a 10-second liveness timeout. Kubernetes did exactly what we told it to do.
That’s the bit nobody warns you about. The defaults the Helm chart generator gives you, the snippets you copy from a blog, none of them are tuned. They look reasonable. They are not safe. The probe config is part of your reliability story.
Here’s how I think about it after running probes across hundreds of microservices and thousands of pods.
Liveness, readiness, startup. People conflate them constantly. They are not interchangeable, they don’t even share a failure mode.
Liveness asks: is this process so broken that the only fix is to restart it. If liveness fails, the kubelet kills the container and restarts it. That’s it. There’s no gentler recovery path. So liveness should be the cheapest, dumbest check you can write. “Is the event loop spinning. Are my locks not deadlocked.” Nothing that depends on a database, a cache, a broker, or any peer service.
Readiness asks: is this pod ok to receive traffic right now. If readiness fails, the pod gets pulled from the service endpoints. No restart. Traffic just stops landing on it. This is where you can be opinionated. Database connection pool exhausted, downstream circuit open, Kafka consumer lagging past a threshold, all of these are legit readiness fails.
Startup is the one most teams skip. It runs only during boot, and until it succeeds, liveness and readiness checks are held off. If your service takes 40 seconds to load a model into memory or warm a connection pool, a startup probe with a generous failureThreshold is the right answer. Not a 60-second initialDelaySeconds on liveness, which a lot of people reach for and which I’ve fallen into more than once.
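One way a /startup endpoint can decide what to report is a simple in-process gate. This is a minimal sketch, not the handler from my services: the `StartupGate` class and its method names are hypothetical, and it assumes the warm-up code calls `markStarted()` exactly once when it finishes.

```typescript
// Hypothetical startup gate: the /startup handler reports 503 until
// warm-up (model load, pool warming, etc.) has finished.
class StartupGate {
  private started = false;

  // Call once, after the slow boot work has completed successfully.
  markStarted(): void {
    this.started = true;
  }

  // HTTP status a /startup handler would return for the startup probe.
  statusCode(): number {
    return this.started ? 200 : 503;
  }
}

const gate = new StartupGate();
const beforeWarmUp = gate.statusCode(); // 503: probe fails, kubelet keeps waiting
gate.markStarted();
const afterWarmUp = gate.statusCode(); // 200: startup probe passes
```

If warm-up throws, the gate simply never opens, the startup probe exhausts its failureThreshold, and the container is restarted, which is usually what you want for a boot that cannot complete.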
Here’s the shape I land on for a NestJS service. Three separate endpoints, three separate probes, deliberately different.
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 2
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
  successThreshold: 1
startupProbe:
  httpGet:
    path: /startup
    port: 3000
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 24
failureThreshold: 24 with periodSeconds: 5 gives the pod up to two minutes to boot before Kubernetes decides it’s gone. That’s deliberately wider than I want most pods to need. Cold-start outliers are real, especially when a node is busy or when the image cache misses and you’re pulling a fresh layer.
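The arithmetic behind that two-minute figure is worth keeping next to the manifest. Here is a minimal sketch using a simplified model: a probe gives up after roughly periodSeconds × failureThreshold, ignoring initialDelaySeconds and per-attempt timeouts. The `ProbeTiming` type and `failureWindowSeconds` helper are hypothetical names for illustration.

```typescript
// Hypothetical helper: approximate how long a probe tolerates
// consecutive failures before Kubernetes acts on it.
interface ProbeTiming {
  periodSeconds: number;
  failureThreshold: number;
}

function failureWindowSeconds(probe: ProbeTiming): number {
  return probe.periodSeconds * probe.failureThreshold;
}

// Values taken from the probe config above.
const startupTiming: ProbeTiming = { periodSeconds: 5, failureThreshold: 24 };
const livenessTiming: ProbeTiming = { periodSeconds: 10, failureThreshold: 3 };
const readinessTiming: ProbeTiming = { periodSeconds: 5, failureThreshold: 2 };

const startupWindow = failureWindowSeconds(startupTiming);     // 120 seconds to boot
const livenessWindow = failureWindowSeconds(livenessTiming);   // 30 seconds before a restart
const readinessWindow = failureWindowSeconds(readinessTiming); // 10 seconds before traffic is pulled
```

Reading the three numbers side by side makes the intent visible: shedding traffic is fast, restarting is slower, and boot gets the widest margin.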
Timeouts matter too. Liveness timeoutSeconds: 2 is short on purpose. If /healthz can’t return in 2 seconds, the process is in trouble. Readiness gets 3 because the readiness check actually touches a couple of cheap dependencies and 3 seconds is a more realistic ceiling.
The handlers themselves do different things. Liveness is the cheapest:
import { Controller, Get } from '@nestjs/common';

@Controller()
export class LivenessController {
  @Get('/healthz')
  liveness() {
    return { status: 'ok' };
  }
}
That’s it. No DI of a service. No DB ping. Nothing that can hang. The only thing that can fail this is the event loop being blocked, and if that’s blocked, restart is what you want.
Readiness is allowed to be smarter, but you have to be careful what it depends on.
import { Controller, Get, HttpException, HttpStatus } from '@nestjs/common';
import { HealthCheckService, TypeOrmHealthIndicator } from '@nestjs/terminus';

@Controller()
export class ReadinessController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
  ) {}

  @Get('/ready')
  async ready() {
    try {
      return await this.health.check([
        () => this.db.pingCheck('postgres', { timeout: 1500 }),
      ]);
    } catch (err) {
      throw new HttpException(
        { status: 'down', reason: (err as Error).message },
        HttpStatus.SERVICE_UNAVAILABLE,
      );
    }
  }
}
Note what’s not in there. No check against a peer service. No “is the upstream auth service up.” That’s how you build a cascading failure trap. If service A’s readiness depends on service B, and B has a hiccup, A’s pods all go un-ready, traffic shifts, B gets hammered worse, A gets restarted, everything domino-falls. I’ve watched this happen during a market open at the real-time trading platform I architected. The reconnect storm was bad enough on its own, but the readiness probe on the gateway tier was also pinging the price service, which was pinned. So the gateways went un-ready in lockstep. The fix, in the moment, was to rip the cross-service check out of /ready. Liveness was already fine. Readiness should not have been asking that question.
So what does belong in readiness? In my services it’s usually three things.
const isReady =
  dbPoolIsHealthy() &&
  consumerLagBelowThreshold('orders', 5000) &&
  !circuitBreaker.state('payments').isOpen();
DB pool because if the pool is exhausted, no request will succeed anyway. Consumer lag because for a Kafka-consuming pod, “ready to take traffic” includes “I’m keeping up.” And circuit breaker state because if I’ve already decided the downstream is dead, my pod returning 503 is more honest than my pod returning 500.
I keep peer-service status out. Always. The rule I write on whiteboards: a pod’s readiness is allowed to depend on resources it owns. It is not allowed to depend on the health of services it calls.
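To make the whiteboard rule concrete, here is a minimal sketch of a readiness decision computed only from state the pod already holds in memory. Everything in it is hypothetical: the `OwnedState` shape, the field names, and the `evaluateReadiness` function are stand-ins for whatever your pool, consumer, and circuit-breaker libraries actually expose. Nothing in it makes a network call to a peer.

```typescript
// Hypothetical snapshot of resources this pod owns.
// Reading it never triggers a call to another service.
interface OwnedState {
  dbPoolInUse: number;
  dbPoolMax: number;
  consumerLag: number;      // messages behind on the consumed topic
  consumerLagLimit: number; // e.g. 5000 for the 'orders' topic
  openCircuits: string[];   // names of circuit breakers currently open
}

interface ReadinessResult {
  ready: boolean;
  failing: string[];
}

function evaluateReadiness(state: OwnedState): ReadinessResult {
  const failing: string[] = [];
  // Pool exhausted: no request would succeed anyway.
  if (state.dbPoolInUse >= state.dbPoolMax) failing.push('db-pool');
  // Lag must stay below the limit to count as "keeping up".
  if (state.consumerLag >= state.consumerLagLimit) failing.push('consumer-lag');
  // A breaker this pod already opened is local state, not a live peer check.
  if (state.openCircuits.length > 0) failing.push('circuit-open');
  return { ready: failing.length === 0, failing };
}
```

Returning the list of failing checks alongside the boolean means the 503 body can say why the pod is un-ready, which saves time during an incident.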
The trap shows up in two flavors. First flavor is the one I just described, readiness checking a peer. Second flavor is more subtle: liveness doing something expensive that occasionally times out.
I lived through this one at the schema-migration incident on the creator economy platform’s Rails monolith. Not Kubernetes, but the same shape of mistake. Login was failing for about 85 seconds during an Aurora lock. If our liveness probe had been touching the DB, every pod in the fleet would have been killed and restarted during the lock, instead of just queueing requests and recovering on its own. We got lucky because liveness was DB-free. Readiness went red, traffic stopped, the migration finished, traffic resumed. No pods restarted. The blast radius was the lock duration, not a fleet-wide cold start on top of the lock duration.
That distinction is the whole point of separating the probes. Liveness restarts. Readiness sheds. You want as much as possible in the “shed” bucket and almost nothing in the “restart” bucket.
Once probes are separated, you get a useful new tool: turning readiness off on purpose. We did this during a maintenance window on the creator economy platform, draining one zone at a time. The deploy pipeline flipped a feature flag that made /ready return 503 on every pod in that zone. Kubernetes pulled them out of rotation. We did the maintenance. We flipped the flag back. The pods went green and started accepting traffic.
Same trick works for in-flight shutdown. On SIGTERM, set a flag that flips /ready to 503 immediately, finish the current requests, then exit. Kubernetes stops routing new traffic once readiness goes red, which with the readiness settings above takes two failed checks, so about 10 seconds. Liveness keeps passing the whole time, so the pod isn’t killed mid-request. The terminationGracePeriodSeconds on the deployment should be longer than your worst-case in-flight request.
let shuttingDown = false;

process.on('SIGTERM', () => {
  shuttingDown = true;
  setTimeout(() => process.exit(0), 25_000);
});

// inside the readiness handler
if (shuttingDown) {
  throw new HttpException({ status: 'draining' }, HttpStatus.SERVICE_UNAVAILABLE);
}
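Sounds simple. The trick is terminationGracePeriodSeconds: 30 in the deployment, with the 25-second setTimeout above. The 25 seconds of draining gives route propagation time to finish, and the five-second buffer means the process exits on its own before the kubelet sends SIGKILL at the end of the grace period. If you add a preStop hook, its runtime counts against the same 30 seconds, so budget for it.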
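The shutdown numbers are easy to get wrong when one of them changes, so a small sanity check can help. This is a sketch under a simplified model: the `ShutdownTiming` type and `checkShutdownTiming` function are hypothetical, and the model ignores preStop hooks and the time it takes endpoint changes to reach every proxy.

```typescript
// Hypothetical sanity check for graceful-shutdown timing.
interface ShutdownTiming {
  terminationGracePeriodSeconds: number; // from the pod spec
  exitDelaySeconds: number;              // the setTimeout before process.exit
  readinessPeriodSeconds: number;
  readinessFailureThreshold: number;
}

function checkShutdownTiming(t: ShutdownTiming) {
  // Time left between the process exiting and the kubelet's SIGKILL.
  const bufferSeconds = t.terminationGracePeriodSeconds - t.exitDelaySeconds;
  // Roughly how long readiness takes to be marked failed after the flag flips.
  const unreadyAfterSeconds = t.readinessPeriodSeconds * t.readinessFailureThreshold;
  return {
    bufferSeconds,
    unreadyAfterSeconds,
    exitsBeforeSigkill: bufferSeconds > 0,
    drainsBeforeExit: unreadyAfterSeconds < t.exitDelaySeconds,
  };
}

// Values from this post: 30s grace period, 25s exit delay, readiness every 5s with threshold 2.
const timing = checkShutdownTiming({
  terminationGracePeriodSeconds: 30,
  exitDelaySeconds: 25,
  readinessPeriodSeconds: 5,
  readinessFailureThreshold: 2,
});
// timing.bufferSeconds is 5, timing.unreadyAfterSeconds is 10, and both flags are true.
```

A check like this could run in CI against the rendered manifest, so that someone shortening the grace period finds out before the next deploy does.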
Tune timeoutSeconds and failureThreshold per probe, never copy them between services without thinking.
On SIGTERM, flip readiness before you start tearing things down.
Thanks for reading. If you’ve got thoughts, send them my way.