Building a Microservice Chassis

An opinionated take on shared chassis libraries for logging, metrics, tracing, health checks, and config across a fleet of microservices.

The first time I ran grep -r 'pino' . across the creator economy platform I worked at, I got back about seventeen variants of bootstrapping the same logger. Different log shapes, different correlation-id header names, three different ways of redacting PII. Thousands of pods, every service born in a different month with a different opinion about what “structured logging” meant. That’s the day I started caring about a chassis.

If you’ve worked in a microservices org for more than a year, you’ve felt this. The worst part isn’t writing boilerplate. It’s debugging at 2 a.m. and realizing one service emits trace_id, another emits traceId, and a third emits nothing because someone disabled tracing “to reduce noise” six months ago.

What a chassis actually is

A microservice chassis is a shared library, or a tight family of them, that every new service pulls in on day one. It standardizes the cross-cutting stuff: logging, metrics, tracing, health checks, config loading, error shaping, broker glue. Product code stays in the service. Plumbing lives in the chassis.

Strong opinion here. Most teams shouldn’t write a chassis. Most teams should adopt OpenTelemetry plus a NestJS module they thinly wrap. The teams that should write one are the ones with hundreds of services who already feel the drift. If you’re at fifteen services and arguing about a chassis, you’re arguing about the wrong thing.

The shape I keep landing on

Here’s a sketch of the bootstrap that ends up at the top of every service I’ve shipped in the last few years. NestJS, TypeScript, AWS, Datadog as the observability backend.

import { NestFactory } from '@nestjs/core';
import { Chassis } from '@org/chassis';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule, {
    bufferLogs: true,
  });

  await Chassis.install(app, {
    service: 'orders-api',
    version: process.env.GIT_SHA ?? 'dev',
    env: process.env.NODE_ENV ?? 'development',
    redact: ['req.headers.authorization', 'req.body.password'],
  });

  await app.listen(parseInt(process.env.PORT ?? '3000', 10));
}

void bootstrap();

That’s it. The service author doesn’t touch the logger config, the OTel SDK, the health-check route, or the Prometheus exporter. They get the same ones every other service has.

Under the hood, Chassis.install is unglamorous. It wires a pino logger with the team’s redaction list, attaches the OTel Node SDK pointed at the collector, exposes the Prometheus exporter, registers /healthz and /readyz through Nest’s TerminusModule, and binds a global exception filter that maps thrown errors to a stable JSON shape. Nothing novel. The value is that it’s the same in every service.
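To make “a stable JSON shape” concrete, here is a minimal sketch of the mapping such an exception filter might delegate to. The ErrorBody fields, the internal_error code, and the toErrorBody name are my assumptions for illustration, not the actual chassis contract, and the sketch is deliberately free of NestJS so it runs on its own.

```typescript
// Hypothetical stable error body. Field names are illustrative assumptions.
interface ErrorBody {
  status: number;
  code: string;
  message: string;
  traceId?: string;
}

// Duck-typed check for errors that carry an HTTP-style status and a code.
function hasStatusAndCode(
  err: unknown,
): err is { status: number; code: string; message?: unknown } {
  if (typeof err !== 'object' || err === null) return false;
  const candidate = err as Record<string, unknown>;
  return typeof candidate.status === 'number' && typeof candidate.code === 'string';
}

// Client errors (4xx) pass through with their own code and message.
// Everything else collapses to a generic 500 so internals never leak.
function toErrorBody(err: unknown, traceId?: string): ErrorBody {
  const body: ErrorBody =
    hasStatusAndCode(err) && err.status >= 400 && err.status < 500
      ? {
          status: err.status,
          code: err.code,
          message: typeof err.message === 'string' ? err.message : 'Request failed',
        }
      : { status: 500, code: 'internal_error', message: 'Internal server error' };
  if (traceId !== undefined) body.traceId = traceId;
  return body;
}
```

In a real chassis this function would sit behind a NestJS exception filter. Keeping the mapping pure makes it easy to unit test and keeps the filter itself to a few lines.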

Config loading without surprises

Config is the part most chassis libraries get wrong, in my experience. They either ship a thin wrapper over process.env (useless) or a giant DSL with secret-manager integration (overkill for a service that needs three keys). What works for me is a schema with sensible defaults.

import { z } from 'zod';
import { loadConfig } from '@org/chassis/config';

const schema = z.object({
  PORT: z.coerce.number().int().min(1).default(3000),
  DATABASE_URL: z.string().url(),
  KAFKA_BROKERS: z.string().transform(s => s.split(',')),
  FEATURE_FLAGS_TTL_MS: z.coerce.number().int().default(60_000),
});

export const config = loadConfig(schema, {
  source: 'env',
  onError: 'crash',
});

loadConfig fails boot if a required key is missing, dumps a redacted summary at startup, and refuses to log anything it suspects is a secret. onError: 'crash' is the point. A service with bad config should crash immediately, not emit a confusing 500 thirty minutes later when something finally references the missing key.
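For a sense of how little code that behaviour needs, here is a rough sketch. It is not the real @org/chassis/config implementation: loadConfigSketch, redactSummary, SchemaLike, and the SECRET_HINT pattern are hypothetical names of mine, and SchemaLike is only a structural stand-in shaped like the result of a zod safeParse call so the sketch has no dependencies.

```typescript
// Structural stand-in for a schema whose safeParse reports success or failure.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { message: string } };

interface SchemaLike<T> {
  safeParse(input: unknown): ParseResult<T>;
}

// Assumed heuristic: key names that hint at credentials get masked.
const SECRET_HINT = /(secret|token|password|passwd|key|url|dsn)/i;

// Produce a loggable summary where suspicious values are never printed.
function redactSummary(values: Record<string, unknown>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(values)) {
    out[name] = SECRET_HINT.test(name) ? '[redacted]' : String(value);
  }
  return out;
}

// Validate once at boot; throw so the process dies before serving traffic.
function loadConfigSketch<T extends Record<string, unknown>>(
  schema: SchemaLike<T>,
  env: Record<string, string | undefined>,
): T {
  const result = schema.safeParse(env);
  if (!result.success) {
    throw new Error(`invalid config: ${result.error.message}`);
  }
  console.log('config loaded', redactSummary(result.data));
  return result.data;
}
```

Masking by key name is crude and will over-redact. For a startup log line, I think that is the right direction to be wrong in.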

Health checks aren’t free

A lone /healthz that returns 200 OK just because the process is alive is worse than no health check at all. It teaches your platform team to trust a lie. The chassis I keep building has two routes:

import { Controller, Get } from '@nestjs/common';
import {
  HealthCheck,
  HealthCheckService,
  TypeOrmHealthIndicator,
  MemoryHealthIndicator,
} from '@nestjs/terminus';

@Controller()
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly db: TypeOrmHealthIndicator,
    private readonly mem: MemoryHealthIndicator,
  ) {}

  @Get('healthz')
  @HealthCheck()
  liveness() {
    return this.health.check([
      () => this.mem.checkHeap('heap', 512 * 1024 * 1024),
    ]);
  }

  @Get('readyz')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('postgres', { timeout: 1500 }),
    ]);
  }
}

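healthz answers “is this process wedged badly enough that it should be restarted.” readyz answers “should the load balancer send me traffic right now.” Different question, different consequence: a failed liveness probe gets the container restarted, while a failed readiness probe only takes the pod out of rotation. The chassis registers both and the platform team configures Kubernetes probes against them by convention.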

Scaffolding generators and governance

This is where teams get nervous. If you hand every team a scaffold generator, are you taking away their autonomy? No, you’re returning their time. The chassis is the floor, not the ceiling.

pnpm dlx @org/create-service orders-api \
  --transport http \
  --db postgres \
  --broker kafka

The generator emits a service with the chassis installed, a sample controller, a Dockerfile, a Helm values file, and a CI workflow pinned to the org’s reusable workflows. Teams can delete any of it. What they cannot do is ship to production without the chassis, because the deploy workflow checks for it.
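The deploy-time check can be very small. Below is a sketch of what it might look like, assuming the workflow has already parsed the service’s package.json. The checkChassis name, the minMajor policy, and the simplified version-range parsing are my assumptions, not a description of an existing workflow.

```typescript
// Only the parts of package.json this check cares about.
interface PackageManifest {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

interface CheckResult {
  ok: boolean;
  reason: string;
}

const CHASSIS_PACKAGE = '@org/chassis';

// Pass only when the chassis is a runtime dependency at or above minMajor.
// Simplification: understands "1.2.3", "^1.2.3", and "~1.2.3", nothing fancier.
function checkChassis(manifest: PackageManifest, minMajor: number): CheckResult {
  const range = manifest.dependencies?.[CHASSIS_PACKAGE];
  if (range === undefined) {
    return { ok: false, reason: `${CHASSIS_PACKAGE} is missing from dependencies` };
  }
  const match = /^[\^~]?(\d+)\.\d+\.\d+/.exec(range);
  if (match === null) {
    return { ok: false, reason: `cannot read a major version from "${range}"` };
  }
  const major = Number(match[1]);
  if (major < minMajor) {
    return { ok: false, reason: `chassis major ${major} is below required ${minMajor}` };
  }
  return { ok: true, reason: `chassis ${range} accepted` };
}
```

A devDependency deliberately does not count, since the chassis has to be present at runtime. A production version of this would more likely inspect the lockfile than the declared range.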

The governance call I keep making: the chassis owns cross-cutting plumbing. Teams own everything else. If a team wants a different logging shape, they propose it to the chassis rather than vendoring a fork. Forks die in two months. A proposal that lands in the chassis improves every service.

Versioning and deprecation windows

Honestly, the boring problem is versioning. Semver the chassis, and treat a major bump as a coordinated migration across the fleet rather than a routine upgrade.

A major version doesn’t get to assume every consumer has migrated. If the chassis renames a config key or changes the log shape, the old form stays parsable behind a flag for at least one minor release. Without that window, “we’ll migrate everyone next quarter” turns into eighteen months.
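As an illustration of keeping a renamed config key parsable behind a flag, here is a small sketch. The applyRenames helper, the acceptLegacy flag, and the example key names in the usage note are hypothetical; the real mechanics depend on how the chassis loads config.

```typescript
// One entry per renamed key, with the version that finally drops the old name.
interface Rename {
  from: string;
  to: string;
  removeIn: string;
}

type Env = Record<string, string | undefined>;

// Copy legacy keys onto their new names while the deprecation window is open.
// The new key always wins if both are set; every legacy use emits a warning.
function applyRenames(
  env: Env,
  renames: Rename[],
  acceptLegacy: boolean,
  warn: (message: string) => void = console.warn,
): Env {
  const out: Env = { ...env };
  if (!acceptLegacy) return out;
  for (const { from, to, removeIn } of renames) {
    const legacy = out[from];
    if (legacy === undefined) continue;
    if (out[to] === undefined) out[to] = legacy;
    warn(`${from} is deprecated; use ${to} (removed in ${removeIn})`);
  }
  return out;
}
```

Called with something like applyRenames(process.env, [{ from: 'KAFKA_HOSTS', to: 'KAFKA_BROKERS', removeIn: '5.0.0' }], true) before schema validation, the schema only ever needs to know the new name. The warning is what gives the migration a visible deadline.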

Takeaways

A chassis is for orgs feeling drift, not orgs that don’t have it yet.
Standardize logging, metrics, tracing, health checks, and config in one library. Anything more is scope creep.
Liveness and readiness are different questions. Two routes, not one.
Pin deployments that touch the chassis to a commit SHA, never to a mutable tag.
Generators give the floor. Governance says you can’t ship without the chassis.
Plan the deprecation when you ship a major bump, not after.

Thanks for reading. If you’ve got thoughts, send them my way.