NestJS Health Checks and Graceful Shutdown

How I wire Terminus indicators, readiness vs liveness probes, and shutdown hook ordering in NestJS so Kubernetes rolling updates actually stay zero-downtime.

It was a Saturday afternoon at the combat-sports tournament platform I CTO’d in London. A live federation broadcast was on, and our standings page froze mid-bout. The standings-projector consumer was stuck in a rebalance loop. The pods were green in Kubernetes. Readiness was returning 200. Liveness was returning 200. The deployment dashboard looked fine. The product was not fine.

That’s the bit nobody tells you about health checks. The defaults lie. They say “the process is up” when what you actually want them to say is “this pod is ready to take traffic and won’t drop a Kafka partition mid-flight if you kill it.” Those are very different statements, and getting them right in NestJS is mostly about being deliberate with Terminus, readiness vs liveness, and the order shutdown hooks fire in.

Here’s how I wire it now.

Liveness is not readiness

The single most common mistake I see, including from me back then, is one endpoint at /health that returns 200 if the process is breathing. That endpoint gets pointed at both the liveness probe and the readiness probe in Deployment.yaml. It’s wrong for both jobs at the same time.

Liveness asks: is this process so broken that Kubernetes should kill it? The answer is almost always no, the process is fine. If it runs out of memory, the kernel and the runtime will usually take care of it, and if the event loop is truly wedged, even a trivial endpoint stops answering and the probe times out on its own. So liveness should be cheap, in-process, and basically never fail. If liveness fails, the pod gets killed mid-request.

Readiness is the load-bearing one. Readiness asks: should this pod receive traffic right now. That answer depends on the DB being reachable, Redis being reachable, the broker connection being alive, in-flight migrations being done, and the warmup being complete. If readiness fails, Kubernetes pulls the pod out of the Service endpoints. No traffic. No restart. That’s exactly what you want during a hiccup.

I split them like this:

import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
} from '@nestjs/terminus';
import { RedisHealthIndicator } from './health/redis.indicator';
import { KafkaHealthIndicator } from './health/kafka.indicator';

@Controller()
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
    private readonly redis: RedisHealthIndicator,
    private readonly kafka: KafkaHealthIndicator,
  ) {}

  @Get('/healthz')
  @HealthCheck()
  liveness() {
    return this.health.check([]);
  }

  @Get('/readyz')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('postgres', { timeout: 1500 }),
      () => this.redis.isHealthy('redis'),
      () => this.kafka.isHealthy('kafka'),
    ]);
  }
}

/healthz returns 200 unconditionally. /readyz actually checks the dependencies. Terminus’ built-in TypeOrmHealthIndicator runs a SELECT 1 with a timeout. Redis and Kafka I do as custom indicators because I want them to share the same client instances the app already uses, not open a parallel connection per probe. A separate connection per probe is one of those quietly-expensive habits I’ve seen tank a Redis cluster at peak.

A custom Redis indicator looks like this:

import { Injectable } from '@nestjs/common';
import {
  HealthIndicator,
  HealthIndicatorResult,
  HealthCheckError,
} from '@nestjs/terminus';
import { InjectRedis } from '@nestjs-modules/ioredis';
import Redis from 'ioredis';

@Injectable()
export class RedisHealthIndicator extends HealthIndicator {
  constructor(@InjectRedis() private readonly client: Redis) {
    super();
  }

  async isHealthy(key: string): Promise<HealthIndicatorResult> {
    // ping over the app's shared client; a rejected ping counts as "no reply"
    const reply = await this.client.ping().catch(() => null);

    if (reply !== 'PONG') {
      throw new HealthCheckError(
        `${key} not responding`,
        this.getStatus(key, false),
      );
    }
    return this.getStatus(key, true);
  }
}

Same shape for Kafka, but I check the producer’s metadata cache, not a ping. Producers don’t ping. If producer.connect() has resolved and metadata is fresh, you’re good. If you’re a consumer, the indicator checks that the group is in Stable state, not Rebalancing. That last bit is what I should have had years ago.
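For the consumer side, here is a minimal sketch of how that Stable-versus-Rebalancing state could be tracked without opening another connection. It assumes kafkajs instrumentation events (GROUP_JOIN, REBALANCING, CRASH) are available on the consumer you already run. The class name ConsumerGroupStateTracker and its methods are hypothetical, not part of kafkajs or Terminus, and this is not necessarily how my production indicator is written.

```typescript
// Hypothetical helper: remembers the last group state reported by the consumer.
type GroupState = 'starting' | 'stable' | 'rebalancing' | 'crashed';

export class ConsumerGroupStateTracker {
  private state: GroupState = 'starting';
  private rebalances = 0;

  // Call when the consumer reports it has (re)joined the group.
  onGroupJoin(): void {
    this.state = 'stable';
  }

  // Call when the consumer reports a rebalance has started.
  onRebalancing(): void {
    this.state = 'rebalancing';
    this.rebalances++;
  }

  // Call when the consumer reports a crash.
  onCrash(): void {
    this.state = 'crashed';
  }

  // Healthy only while the group is stable; the counter is extra detail
  // a readiness indicator can expose to make rebalance loops visible.
  snapshot(): { healthy: boolean; state: GroupState; rebalances: number } {
    return {
      healthy: this.state === 'stable',
      state: this.state,
      rebalances: this.rebalances,
    };
  }
}

// Assumed wiring with kafkajs (not executed here):
//   consumer.on(consumer.events.GROUP_JOIN, () => tracker.onGroupJoin());
//   consumer.on(consumer.events.REBALANCING, () => tracker.onRebalancing());
//   consumer.on(consumer.events.CRASH, () => tracker.onCrash());
```

The Kafka indicator can then read snapshot() and report the pod as not ready whenever healthy is false, using the same HealthCheckError pattern as the Redis indicator.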

The shutdown order is the whole game

NestJS exposes shutdown hooks. You have to opt in.

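import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module'; // path assumes the default Nest layout

async function bootstrap() {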
  const app = await NestFactory.create(AppModule, {
    bufferLogs: true,
  });

  app.enableShutdownHooks();

  await app.listen(3000);
}
bootstrap();

Without enableShutdownHooks(), your OnModuleDestroy and OnApplicationShutdown handlers do nothing on SIGTERM. The process just dies. That’s fine for hello-world. Not fine for a Kafka consumer holding 12 partitions.

The order matters, and it’s the opposite of what most people guess. When Kubernetes wants to roll a pod, it does roughly this:

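1. Sends SIGTERM (after the preStop hook, if one is defined).
2. Removes the pod from the Service endpoints (asynchronously, takes a few hundred ms).
3. Waits out terminationGracePeriodSeconds (default 30), counted from the start of termination, preStop included.
4. Sends SIGKILL.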

The gap between step 1 and step 2 is the one that hurts. SIGTERM arrives. Your app starts shutting down. But the Service still routes new requests to this pod for another second or two. If you start tearing down the DB pool the instant SIGTERM hits, those last few requests get Connection terminated errors and bubble back to the client as 5xxs.

The fix is a deliberate ordering with a small pre-stop delay. I do it like this:

import {
  BeforeApplicationShutdown,
  Injectable,
  Logger,
  OnApplicationShutdown,
} from '@nestjs/common';
import { Consumer } from 'kafkajs';

@Injectable()
export class StandingsProjector
  implements BeforeApplicationShutdown, OnApplicationShutdown
{
  private readonly logger = new Logger(StandingsProjector.name);
  // assigned wherever the consumer is created and subscribed (not shown)
  private consumer!: Consumer;
  private draining = false;

  // Nest runs onModuleDestroy first, then beforeApplicationShutdown, then
  // closes the HTTP server, then onApplicationShutdown. Draining belongs in
  // the early hook so the pools are still alive while handlers finish.
  async beforeApplicationShutdown(signal?: string) {
    this.logger.log(`shutdown signal received: ${signal}`);
    this.draining = true;

    // 1. stop pulling new work from Kafka, finish in-flight handlers
    await this.consumer.stop();

    // 2. wait for any in-flight DB writes the handlers kicked off
    await this.waitForInflight({ timeoutMs: 8000 });

    // 3. commit the last offsets, then disconnect cleanly
    await this.consumer.disconnect();
  }

  async onApplicationShutdown() {
    // pools, redis, etc., torn down after the consumer is out of the group
  }

  isReady(): boolean {
    return !this.draining;
  }

  private async waitForInflight({ timeoutMs }: { timeoutMs: number }) {
    // implementation specific - track a counter of pending handlers
  }
}
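The waitForInflight stub above is left implementation specific. One way to fill it in, as a sketch only: a small counter that every handler runs through, plus a wait that gives up after a timeout. InflightTracker, track, and waitForIdle are hypothetical names I am using for illustration. They are not NestJS or kafkajs APIs.

```typescript
// Hypothetical helper: counts handlers that have started but not settled.
export class InflightTracker {
  private pending = 0;
  private waiters: Array<() => void> = [];

  // Run a handler through the tracker so it is counted until it settles,
  // whether it resolves or rejects.
  async track<T>(work: () => Promise<T>): Promise<T> {
    this.pending++;
    try {
      return await work();
    } finally {
      this.pending--;
      if (this.pending === 0) {
        // wake everything that is waiting for the drain to finish
        this.waiters.splice(0).forEach((wake) => wake());
      }
    }
  }

  get count(): number {
    return this.pending;
  }

  // Resolves true once nothing is pending, or false if the timeout wins.
  async waitForIdle(timeoutMs: number): Promise<boolean> {
    if (this.pending === 0) return true;

    let timer: ReturnType<typeof setTimeout> | undefined;
    const idle = new Promise<boolean>((resolve) => {
      this.waiters.push(() => resolve(true));
    });
    const timedOut = new Promise<boolean>((resolve) => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });

    try {
      return await Promise.race([idle, timedOut]);
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
}
```

Under these assumptions, each message handler body would be wrapped in tracker.track(...), and waitForInflight would call tracker.waitForIdle(timeoutMs) and log a warning when it returns false, since that means work was abandoned at the deadline.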

The draining flag also flips readiness to false. Once readiness flips, Kubernetes pulls the pod out of the Service. The pre-stop hook in the manifest adds a 5-second sleep to cover the endpoint propagation gap:
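To make the link between the draining flag and /readyz concrete, here is a dependency-free sketch of a check that reads isReady(). The names drainCheck, Drainable, and NotReadyError are hypothetical. NotReadyError is only a stand-in so the sketch runs on its own; inside a real Terminus indicator you would throw HealthCheckError with this.getStatus(key, false), as in the Redis indicator.

```typescript
// Terminus-style result shape: { [key]: { status, ...details } }
type IndicatorResult = Record<
  string,
  { status: 'up' | 'down'; [detail: string]: unknown }
>;

// Anything that exposes the isReady() flag, such as the projector.
interface Drainable {
  isReady(): boolean;
}

// Stand-in for HealthCheckError so this sketch has no dependencies.
export class NotReadyError extends Error {
  readonly causes: IndicatorResult;

  constructor(message: string, causes: IndicatorResult) {
    super(message);
    this.causes = causes;
  }
}

// Reports "up" while the target accepts work and throws once it drains.
export function drainCheck(key: string, target: Drainable): IndicatorResult {
  if (!target.isReady()) {
    throw new NotReadyError(`${key} is draining`, {
      [key]: { status: 'down', draining: true },
    });
  }
  return { [key]: { status: 'up' } };
}
```

Registered as one more entry in the readiness check list, a check like this makes /readyz fail as soon as the shutdown hook sets draining, before any dependency has actually gone away.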

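# pod spec (spec.template.spec in a Deployment)
terminationGracePeriodSeconds: 45
containers:
  - name: standings-projector   # example container name
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]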

5 seconds for endpoints to converge, then NestJS shutdown hooks fire, then up to 40 seconds for in-flight work to drain. If your handlers are slow, bump the grace period. Don’t lower it to feel fast.
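The arithmetic is easy to get wrong when someone changes one of the numbers later, so a small guard at boot can help. This is a sketch with hypothetical names (ShutdownBudget, shutdownHeadroomMs). The values are the ones used above: a 45-second grace period, a 5-second preStop sleep, and the 8-second drain timeout from the projector.

```typescript
// Hypothetical description of the shutdown timing knobs.
interface ShutdownBudget {
  gracePeriodSeconds: number; // terminationGracePeriodSeconds in the manifest
  preStopSeconds: number; // the preStop sleep, which counts against the grace period
  drainTimeoutMs: number; // the longest the app waits for in-flight work
}

// Time left between the end of the drain timeout and SIGKILL, in ms.
export function shutdownHeadroomMs(budget: ShutdownBudget): number {
  const afterPreStopMs =
    (budget.gracePeriodSeconds - budget.preStopSeconds) * 1000;
  return afterPreStopMs - budget.drainTimeoutMs;
}

// Throws when the drain could still be running at SIGKILL time.
export function assertShutdownBudget(budget: ShutdownBudget): void {
  const headroom = shutdownHeadroomMs(budget);
  if (headroom <= 0) {
    throw new Error(`shutdown budget has no headroom: ${headroom}ms`);
  }
}

const headroom = shutdownHeadroomMs({
  gracePeriodSeconds: 45,
  preStopSeconds: 5,
  drainTimeoutMs: 8000,
});
console.log(headroom); // 32000
```

With these numbers the drain timeout leaves 32 seconds of slack before SIGKILL, which is where the consumer disconnect and pool teardown have to fit.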

The Aurora lag morning

Different job, different scar. A Tuesday morning at the creator economy platform I worked at. Aurora reader replica lag alarm fired around 10:14 a.m. PT. Reader replicas were behind by 14 minutes and climbing. The Community feed reads were timing out. p99 on /communities/:id/posts went from 120 ms to over 8 seconds.

What was wrong with our health setup that day: the readiness probe for the read-path service was checking SELECT 1 against the local pool, which routed to a reader. The reader was up. The reader was just stale. The probe passed. Traffic kept landing on a pod that was serving 14-minute-old data.

What we tried first: bumped the reader instance class up two tiers, on the theory that the readers were CPU-bound. Lag didn’t move. The readers weren’t bottlenecked, they were starved of WAL because a long-running ANALYZE on a hot table was holding write-side locks on the writer. Killed the analyze. Lag drained in about 6 minutes.

What actually fixed the broader thing: I added a freshness indicator to readiness for any service that reads from a replica.

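import { Injectable } from '@nestjs/common';
import {
  HealthCheckError,
  HealthIndicator,
  HealthIndicatorResult,
} from '@nestjs/terminus';
import { InjectDataSource } from '@nestjs/typeorm';
import { DataSource } from 'typeorm';

@Injectable()
export class AuroraReplicaFreshnessIndicator extends HealthIndicator {
  constructor(
    @InjectDataSource('reader') private readonly reader: DataSource,
  ) {
    super();
  }

  async check(key: string, maxLagMs = 5000): Promise<HealthIndicatorResult> {
    // lag_ms is NULL when no replay timestamp is available, and the pg
    // driver can hand numeric values back as strings
    const rows = await this.reader.query<{ lag_ms: string | number | null }[]>(
      `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000 AS lag_ms`,
    );
    const raw = rows?.[0]?.lag_ms;
    // a lag that cannot be measured is reported as unhealthy, not as 0
    const lag = raw == null ? null : Number(raw);
    if (lag === null || !(lag < maxLagMs)) {
      throw new HealthCheckError(
        lag === null
          ? `${key} replica lag unknown`
          : `${key} replica lag ${lag}ms exceeds ${maxLagMs}ms`,
        this.getStatus(key, false, { lag }),
      );
    }
    return this.getStatus(key, true, { lag });
  }
}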

Cost of that morning: about 22 minutes of degraded Community read latency. No data loss. The runbook now leads with a literal sentence: “Before touching reader scaling, check pg_stat_activity on the writer.” I’m the reason that sentence is in there. The freshness indicator means readiness flips off when a pod’s reader has fallen behind, and Kubernetes routes around it instead of cooking the user experience.

Takeaways

Liveness and readiness are not the same probe. Don’t reuse one endpoint for both.
Probes should reuse the app’s existing clients, not open new ones.
Call app.enableShutdownHooks() or your handlers do nothing.
Order: sleep for endpoint propagation, flip readiness off, drain in-flight work, then disconnect.
For replica reads, check replica freshness in readiness, not just SELECT 1.
A green probe that means nothing is worse than no probe at all.

Thanks for reading. If you’ve got thoughts, send them my way.