NestJS Logging and Observability

How I wire Pino, AsyncLocalStorage correlation IDs, OpenTelemetry, and PII redaction into NestJS, plus what shipping to CloudWatch and Datadog actually looks like under load.

Tuesday morning, 10:14 a.m. PT. Datadog fired the reader replica lag alert and the community feed p99 was already climbing past 8 seconds. I wasn’t on-call that week. I got tagged on the thread anyway because I owned the Aurora layer at the creator economy platform I worked at. The first thing I did, before scaling readers or pulling pg_stat_activity, was grep our log aggregator for the trace ID on a single slow request. That’s the whole reason I’m writing this. If your logs can’t tell you which downstream call is on fire from one request ID, you’re not going to debug your way out of a real incident.

The stack I actually run in production: Pino as the logger, AsyncLocalStorage for correlation IDs, OpenTelemetry for traces, Pino redaction for PII, CloudWatch or Datadog as the sink. Opinion up front: structured JSON, one trace ID per request, redact at the logger not at the controller, ship from a sidecar not from your app process.

Why the default logger isn’t enough

NestJS ships with Logger and it’s fine for hello world. In production it’s not. Text-formatted by default, no request context, no structured fields. The minute you put a thousand pods behind an ALB you cannot grep your way to anything useful.

Pino is the right call. Fastest Node logger I’ve benchmarked, structured JSON by default, and pino-http hooks into the NestJS middleware chain cleanly. I’ve used nestjs-pino on every NestJS service I’ve shipped in the last two years and don’t see a reason to look elsewhere.

import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';
import { randomUUID } from 'crypto';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.LOG_LEVEL ?? 'info',
        genReqId: (req) =>
          (req.headers['x-request-id'] as string) ?? randomUUID(),
        redact: {
          paths: [
            'req.headers.authorization',
            'req.headers.cookie',
            'req.body.password',
            'req.body.token',
            'res.headers["set-cookie"]',
            '*.email',
            '*.creditCard',
          ],
          censor: '[REDACTED]',
        },
        serializers: {
          req: (req) => ({
            id: req.id,
            method: req.method,
            url: req.url,
            traceId: req.headers['x-trace-id'],
          }),
        },
        customLogLevel: (req, res, err) => {
          if (err || res.statusCode >= 500) return 'error';
          if (res.statusCode >= 400) return 'warn';
          return 'info';
        },
      },
    }),
  ],
})
export class AppModule {}

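genReqId lets a request ID flow through if your gateway already set one, and falls back to a UUID if not. The redact paths are non-negotiable. A junior on my squad once logged the full request body during a payment debug session and authorization headers ended up in our hot path for a few hours. Retention policy saved us. Now I redact at the logger before anything ships. Two caveats on the config above. Pino’s wildcard matches exactly one level, so '*.email' catches user.email but not a top-level email or anything nested deeper. And redaction runs on the serialized output, so with the trimmed req serializer the req.headers.* and req.body.* paths are a backstop for the day someone widens that serializer, not something doing work today.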
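Here is a sketch of the genReqId fallback pulled out as a plain function, so it can be unit tested. resolveRequestId is a hypothetical helper name, not part of nestjs-pino, and the 128-character cap is my assumption. It also covers one edge the inline version misses: `??` only falls through on null or undefined, so a blank x-request-id header would otherwise be kept as the ID.

```typescript
import { randomUUID } from 'node:crypto';
import type { IncomingHttpHeaders } from 'node:http';

// Hypothetical helper: reuse an upstream request ID when it is present and sane,
// otherwise mint a new UUID.
export function resolveRequestId(headers: IncomingHttpHeaders): string {
  const raw = headers['x-request-id'];
  // The header type allows string[]; take the first value if that is what we got.
  const candidate = (Array.isArray(raw) ? raw[0] : raw)?.trim();
  // Assumption: cap the length so a hostile client cannot bloat every log line.
  if (candidate && candidate.length <= 128) return candidate;
  return randomUUID();
}
```

In the module config this would be wired as genReqId: (req) => resolveRequestId(req.headers).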

Correlation IDs across async boundaries

Request IDs are easy when you’re in a controller. They get hard the moment you cross into a Bull queue, a Kafka consumer, or a setTimeout. Node’s AsyncLocalStorage is the only sane way to carry context across async hops, and NestJS doesn’t ship a great built-in for it, so I roll my own interceptor.

import { AsyncLocalStorage } from 'async_hooks';
import {
  CallHandler,
  ExecutionContext,
  Injectable,
  NestInterceptor,
} from '@nestjs/common';
import { Observable } from 'rxjs';

export interface RequestContext {
  requestId: string;
  traceId?: string;
  userId?: string;
  tenantId?: string;
}

export const requestContext = new AsyncLocalStorage<RequestContext>();

@Injectable()
export class CorrelationInterceptor implements NestInterceptor {
  intercept(ctx: ExecutionContext, next: CallHandler): Observable<unknown> {
    const req = ctx.switchToHttp().getRequest();
    const store: RequestContext = {
      requestId: req.id,
      traceId: req.headers['x-trace-id'],
      userId: req.user?.id,
      tenantId: req.headers['x-tenant-id'],
    };

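    return new Observable((subscriber) => {
      // run() scopes the store to everything the handler awaits downstream.
      const subscription = requestContext.run(store, () =>
        next.handle().subscribe({
          next: (v) => subscriber.next(v),
          error: (e) => subscriber.error(e),
          complete: () => subscriber.complete(),
        }),
      );
      // Tear down the inner subscription if the outer one is cancelled.
      return () => subscription.unsubscribe();
    });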
  }
}

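Anywhere downstream in the same request, you read requestContext.getStore() and attach the fields to whatever line you’re about to emit. Queue consumers and background workers are the exception. A job picked up later, or in another process, does not inherit the store, so the producer has to put the IDs in the job payload or message headers and the consumer has to call requestContext.run() again before it does any work. The reason I lean on AsyncLocalStorage over cls-hooked is performance. ALS is in core and on Node 20 it’s roughly free.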
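A minimal sketch of carrying the context across a queue boundary. The in-memory array stands in for a real broker (Bull, Kafka), and Job, enqueue, and processNext are hypothetical names for illustration. With a real broker the context would be serialized into the job data or message headers rather than passed by reference.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface RequestContext {
  requestId: string;
  traceId?: string;
  tenantId?: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Hypothetical job envelope: the payload plus the context captured at enqueue time.
interface Job<T> {
  payload: T;
  context?: RequestContext;
}

// Stand-in for a real broker; only the envelope shape matters here.
const queue: Job<string>[] = [];

// Producer side: copy the current store into the job before it leaves the request.
function enqueue(payload: string): void {
  queue.push({ payload, context: requestContext.getStore() });
}

// Consumer side: re-enter the store before running the handler.
function processNext(handler: (payload: string) => void): void {
  const job = queue.shift();
  if (!job) return;
  const ctx = job.context ?? { requestId: 'unknown' };
  requestContext.run(ctx, () => handler(job.payload));
}

// Usage: the handler runs outside the original request but still sees its ID.
const seen: string[] = [];
requestContext.run({ requestId: 'req-1' }, () => enqueue('resize-avatar'));
processNext(() => seen.push(requestContext.getStore()?.requestId ?? 'none'));
```

The 'unknown' fallback is a design choice: a job enqueued outside any request (a cron, a replay) still gets a store, so downstream code never has to branch on undefined.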

OpenTelemetry without the auto-instrumentation tax

OpenTelemetry is the right primitive. The auto-instrumentation packages are mostly fine but they will pin you to specific versions of pg, ioredis, and kafkajs and they’ll add latency you can’t easily measure. I run OTel with a curated set of instrumentations and manual spans for the hot paths. The setup file runs before NestJS bootstraps.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { NestInstrumentation } from '@opentelemetry/instrumentation-nestjs-core';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

export const otel = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME,
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.GIT_SHA,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_COLLECTOR_URL,
  }),
  instrumentations: [
    new HttpInstrumentation({
      ignoreIncomingRequestHook: (req) => req.url === '/health',
    }),
    new NestInstrumentation(),
    new PgInstrumentation({ enhancedDatabaseReporting: false }),
  ],
});

otel.start();

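// Registering a SIGTERM listener replaces Node's default exit-on-signal,
// so something else (for example Nest's shutdown hooks) still has to close the server.
process.on('SIGTERM', async () => {
  // Flush any buffered spans before the pod goes away.
  await otel.shutdown();
});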

Two details that matter. SERVICE_VERSION is the git SHA, not a semver, because every deploy gets a new SHA and you’ll want to slice traces by it during a regression. enhancedDatabaseReporting is explicitly off because turning it on records query parameter values on every pg span, and that’s a redaction problem waiting to happen on a tenant table.

The Pino bridge is what stitches traces to logs. Every Pino line should include traceId and spanId so when you click a slow trace in Datadog, the linked log tab shows the actual stack of structured logs.

import pino from 'pino';
import { trace, context } from '@opentelemetry/api';

export const logger = pino({
  mixin: () => {
    const span = trace.getSpan(context.active());
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});
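The same mixin hook can also carry the request context from AsyncLocalStorage, so every line gets requestId and tenantId next to traceId and spanId. A sketch, assuming the RequestContext store from the interceptor above; correlationFields is a hypothetical helper name.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface RequestContext {
  requestId: string;
  userId?: string;
  tenantId?: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Hypothetical helper: returns the fields to merge into each log line.
// Outside a request (startup, cron) there is no store, so it returns {}.
function correlationFields(): Record<string, string> {
  const store = requestContext.getStore();
  if (!store) return {};
  const fields: Record<string, string> = {};
  for (const [key, value] of Object.entries(store)) {
    // Skip unset optionals so lines do not carry empty keys.
    if (typeof value === 'string') fields[key] = value;
  }
  return fields;
}
```

Inside the mixin you would spread it next to the span fields, for example return { ...correlationFields(), traceId, spanId }. With nestjs-pino I would expect the mixin to go under pinoHttp alongside the other logger options, but treat that placement as an assumption to verify against your version.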

Shipping logs without melting your app

Do not ship logs from inside your app process. Pino writes to stdout. A Fluent Bit agent, either a sidecar or a node-level DaemonSet tailing the container log files as in the excerpt below, picks those lines up, batches them, and ships to CloudWatch Logs or Datadog. If your app is responsible for the network call to your aggregator, your app’s tail latency is now bound to your aggregator’s availability, and you will learn this the hard way during an aggregator outage.

# fluent-bit configmap excerpt
[INPUT]
    Name              tail
    # Tag must match the kube.* pattern the kubernetes filter listens on
    Tag               kube.*
    Path              /var/log/containers/*nestjs*.log
    Parser            cri
    Refresh_Interval  5
    Mem_Buf_Limit     50MB

[FILTER]
    Name        kubernetes
    Match       kube.*
    Merge_Log   On
    K8S-Logging.Parser  On

[OUTPUT]
    Name              datadog
    Match             *
    Host              http-intake.logs.datadoghq.com
    TLS               on
    apikey            ${DD_API_KEY}
    dd_service        ${SERVICE_NAME}
    dd_source         nestjs
    dd_message_key    msg

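This setup means your NestJS process never waits on the aggregator. If the aggregator goes down, Fluent Bit buffers up to Mem_Buf_Limit, then pauses the tail input, and anything that rotates out of the log files before it catches up is lost. Your app keeps serving. That’s the whole trade-off.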

Two production stories

The reader replica incident I opened with. The on-call’s first move was to bump the reader instance class up two tiers. Reasoning was “we’re CPU-bound on the readers.” Wrong root cause. Lag didn’t move. The readers weren’t bottlenecked, they were starved of WAL. The real fix came from pulling pg_stat_activity on the writer. A long-running ANALYZE on the hottest community table was holding write-side locks and starving WAL emission. The job was a partition-stats refresh scheduled in a maintenance cron that didn’t respect peak hours. Killed the analyze. Replica lag drained in about 6 minutes. About 22 minutes of degraded reads on a community surface used by millions of customers. The reason we caught the root cause in under 10 minutes was that every NestJS service emitting reads was attaching its trace ID, the upstream Aurora endpoint, and the route to every log line. We could correlate from one slow request to the whole writer-side stall in three Datadog clicks. The runbook now leads with one literal sentence: before touching reader scaling, check pg_stat_activity on the writer. I’m the reason that sentence is there.

Different story, different scale. Years earlier at the combat sports tournament platform I CTO’d, a live federation broadcast on a Saturday afternoon. The standings-projector Kafka consumer group started rebalancing every 30 seconds or so. Standings page froze at 14:32 local time. First move was operational, kubectl rollout restart deployment/standings-projector. Pods re-joined cleanly. Then triggered another rebalance about 40 seconds later. We were doing the same dance the group was already doing on its own. Real fix came from pulling pod logs side by side. One pod out of six had a stale image with max.poll.interval.ms at 60s instead of 300s. Someone had pushed a config-touching fix without bumping the image tag and the deployment had pulled :latest. Side-by-side log diff is only possible if every line carries the pod name, image tag, and consumer config. About 12 minutes of stale standings on air. We pinned SHAs everywhere after that and the CI check fails the deploy if a manifest references :latest on any consumer pod.
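The :latest check is small enough to sketch. This is not the CI script we ran, just one way to write it, assuming manifests are plain YAML text with image: lines. findUnpinnedImages is a hypothetical name, and flagging untagged images (which resolve to latest) is my addition.

```typescript
// Hypothetical CI helper: flags container image references that are not pinned.
// Assumption: manifests are plain YAML text with `image: <ref>` lines.
export function findUnpinnedImages(manifest: string): string[] {
  const unpinned: string[] = [];
  for (const line of manifest.split('\n')) {
    const match = line.match(/^\s*-?\s*image:\s*["']?([^\s"'#]+)/);
    if (!match) continue;
    const ref = match[1];
    // A digest reference (@sha256:...) is always pinned.
    if (ref.includes('@sha256:')) continue;
    // Look for the tag only after the last '/', so a registry port
    // such as host:5000/app is not mistaken for a tag.
    const lastSegment = ref.slice(ref.lastIndexOf('/') + 1);
    const tag = lastSegment.includes(':') ? lastSegment.split(':')[1] : undefined;
    // No tag resolves to latest, so treat both cases as unpinned.
    if (tag === undefined || tag === 'latest') unpinned.push(ref);
  }
  return unpinned;
}
```

A CI step would read each manifest, call this, and exit non-zero if the returned list is not empty.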

Takeaways

Pino with nestjs-pino. Default to structured JSON. The built-in Logger is fine for hello world, not for production.
One trace ID per request, propagated through AsyncLocalStorage. Read it in every queue consumer, every background job, every error handler.
OpenTelemetry with a curated instrumentation list. Auto-instrumentation everywhere is a recipe for surprise latency.
Redact at the logger, not at the controller. PII in logs is a one-line fix that takes a year to clean up if you skip it.
Ship from a Fluent Bit sidecar. Your app should never block on the aggregator being healthy.
Stitch traces to logs with a Pino mixin. The minute you can click a slow span and see the structured log lines under it, your incident MTTR drops.

Thanks for reading. If you’ve got thoughts, send them my way.