Microservice Configuration Management

Centralizing config across hundreds of services with Consul KV, AWS Parameter Store, and Vault, with hot reloads, secret rotation, and dev/prod parity that actually holds.

The first time I really cared about config management was a Tuesday afternoon at the combat-sports tournament platform I CTO’d in London. We had hundreds of microservices and a .env file convention that everyone was quietly tired of. A new engineer rotated the staging Postgres password, pushed the change to one service’s .env, deployed, and the standings projector lost its connection. Cool. The fix took ten minutes. The audit took two days. That was the week I admitted the env-file thing had to go.

A lot of teams skip this work for years because the pain is gradual. You add one more service, paste the same twelve env vars, change one of them under pressure, and forget which pod still has the old one. That is what configuration drift looks like in real life. It is not dramatic, it just slowly makes incidents harder to debug.

So here is the shape I have landed on after running this pattern at two different companies. Consul KV for service config you want to reload without a redeploy. AWS Parameter Store for the boring stuff: static, non-secret values a service reads once at boot. HashiCorp Vault for anything that should never live on disk. Three layers, each with one job, and a thin client library every service uses to read from them.

Why three stores, not one

Honestly I went back and forth on this for a long time. You can absolutely do everything in Vault, or everything in Parameter Store. I have seen both work. The reason I keep coming back to a layered setup is that the access patterns are different. Plain config gets read on boot and occasionally hot reloaded. Secrets get rotated on a schedule and need short-lived leases. Feature flags want to flip in seconds across thousands of pods. One store is rarely good at all three.

The other reason is blast radius. If your single config store goes down at 3 a.m. and your services cannot boot, you have a very bad night. Splitting by access pattern lets you cache aggressively in the layer that matters most.

The client library is the whole game

Before I show config, the most important piece is the tiny client every service uses. If each team rolls their own loader, you will end up with twelve subtly different timeout, retry, and fallback behaviors, and your incident reviews will all start with “well, my service does it slightly differently.”

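// packages/config/src/loader.ts
import { z } from 'zod';
import Consul from 'consul';
import { SSMClient, GetParametersByPathCommand } from '@aws-sdk/client-ssm';
import vault from 'node-vault';

type Layer = 'consul' | 'ssm' | 'vault';

export interface ConfigSource {
  layer: Layer;
  prefix: string;
  // When true, a failed read of this layer aborts boot.
  required?: boolean;
}

// readLayer is not shown in this excerpt; it is expected to return a flat
// key/value map for everything under src.prefix in the given layer.

export async function loadConfig<T>(schema: z.ZodSchema<T>, sources: ConfigSource[]): Promise<T> {
  const raw: Record<string, string> = {};

  // Sources are merged in order, so later layers override earlier ones.
  for (const src of sources) {
    try {
      const values = await readLayer(src);
      Object.assign(raw, values);
    } catch (err) {
      if (src.required) throw err;
      // Optional layer: log and carry on. The cache fallback itself is not shown here.
      console.warn(`config: layer ${src.layer} unavailable, continuing with cache`, { err });
    }
  }

  // Throws if a required field is missing or malformed, so the pod refuses to start.
  return schema.parse(raw);
}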

Two things matter here. First, the schema. Every service validates its config against a Zod schema at boot. If a required field is missing, the pod refuses to start. That sounds harsh, but it is way better than booting a service that silently reads from a stale cache and pretends everything is fine. Second, the required flag per layer. Vault being down should hard-fail a payments service. Consul being down should fall back to the last good cache for a notifications service. The library encodes that, not the calling team.
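The loader excerpt above logs "continuing with cache" but does not show the cache itself. Here is a minimal sketch of how the per-layer policy could be written, with the I/O separated from the decision. Everything named here (LayerResult, mergeLayers, the Map-based cache) is hypothetical and not the library's real API. It also assumes an in-memory cache, which only helps on reloads; a fallback that works at boot would need the snapshot persisted somewhere.

```typescript
// Hypothetical names throughout: LayerResult, mergeLayers and the cache shape
// are illustrative and not part of the loader shown above.
type Layer = 'consul' | 'ssm' | 'vault';

interface ConfigSource {
  layer: Layer;
  prefix: string;
  required?: boolean;
}

type Values = Record<string, string>;

// Outcome of one attempted read, so the policy below stays free of I/O.
type LayerResult =
  | { source: ConfigSource; ok: true; values: Values }
  | { source: ConfigSource; ok: false; error: Error };

// Cache key for the last successful read of a source.
const cacheKey = (s: ConfigSource): string => `${s.layer}:${s.prefix}`;

function mergeLayers(results: LayerResult[], lastGood: Map<string, Values>): Values {
  const merged: Values = {};
  for (const r of results) {
    const key = cacheKey(r.source);
    if (r.ok) {
      lastGood.set(key, r.values); // remember this snapshot for the next outage
      Object.assign(merged, r.values);
      continue;
    }
    // Required layers (for example Vault for a payments service) fail outright.
    if (r.source.required) throw r.error;
    // Optional layers fall back to the last good snapshot, if there is one.
    Object.assign(merged, lastGood.get(key) ?? {});
  }
  return merged;
}
```

The merged map would still go through the schema afterwards, so an optional layer with no cached snapshot fails validation instead of silently booting with holes.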

Hot reload that does not stampede

Hot reload is the feature people get excited about and then get burned by. The naive version is: subscribe to Consul, fire a callback when a key changes, services pick up the new value. The thing nobody mentions is that when you have thousands of pods watching the same key, a change can trigger a thundering herd of reconnections to whatever the new value points at. We did this at the creator economy platform I worked at and learned the hard way.

The pattern that actually holds:

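// packages/config/src/watcher.ts
import Consul from 'consul';
import { jitter } from './jitter';

export function watchKey(
  consul: Consul.Consul,
  key: string,
  onChange: (value: string) => Promise<void>,
) {
  // Long-poll the key; on failures the watch retries with backoff, capped at 30 s.
  const watcher = consul.watch({
    method: consul.kv.get,
    options: { key },
    backoffFactor: 200,
    backoffMax: 30_000,
  });

  watcher.on('change', async (data) => {
    // data is undefined when the key does not exist (or was deleted).
    if (!data) return;
    // Spread reloads over 0-5 s so the fleet does not react in lockstep.
    await jitter(0, 5_000);
    try {
      await onChange(data.Value);
    } catch (err) {
      console.error('config reload failed, keeping previous value', { key, err });
    }
  });

  // Without a listener, an 'error' event would be thrown as an uncaught exception.
  watcher.on('error', (err) => {
    console.error('config watch error, will retry', { key, err });
  });

  // Call the returned function to stop watching.
  return () => watcher.end();
}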

The two important things are the jitter and the try/catch. Jitter spreads the reload across the fleet so the downstream that just got a new endpoint does not eat a connection spike. The try/catch means a reload that throws never poisons the running config. The service keeps the last value that applied cleanly, which may be the one it booted with. Cautious, but I have learned to be cautious.
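The watcher imports jitter from ./jitter, which is not shown. A minimal sketch of what a compatible helper might look like follows. The split into jitterDelay and jitter, and the injectable random source, are my assumptions, chosen so the delay calculation can be tested without timers.

```typescript
// Assumed shape of the ./jitter helper imported by the watcher. The original
// module is not shown, so this is a guess at a compatible implementation.

// Pick a delay in [minMs, maxMs). `rand` is injectable so tests are deterministic.
function jitterDelay(
  minMs: number,
  maxMs: number,
  rand: () => number = Math.random,
): number {
  if (maxMs < minMs) throw new RangeError('maxMs must be >= minMs');
  return minMs + Math.floor(rand() * (maxMs - minMs));
}

// Resolve after a random delay, matching the `await jitter(0, 5_000)` call site.
function jitter(minMs: number, maxMs: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, jitterDelay(minMs, maxMs)));
}
```

Uniform jitter is the simplest choice. With a five-second window and thousands of pods, the downstream sees the reconnections spread roughly evenly across that window instead of all at once.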

Secret rotation, the part that is actually hard

Rotation is the part most teams quietly skip, and then they end up with a five-year-old database password in production. The trick is to make the application transparent to the rotation, not the other way around.

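// packages/config/src/secrets.ts
import vault from 'node-vault';

// The token is read from VAULT_TOKEN when none is passed explicitly.
const client = vault({ endpoint: process.env.VAULT_ADDR });

export async function getDbCreds(role: string) {
  // Each read of database/creds/<role> issues a fresh, short-lived credential pair.
  const lease = await client.read(`database/creds/${role}`);
  const { username, password } = lease.data;
  // lease_duration is in seconds.
  const ttlMs = lease.lease_duration * 1000;
  // Refresh at 70% of the TTL so new credentials are in place before the old ones expire.
  const refreshAt = Date.now() + ttlMs * 0.7;
  return { username, password, refreshAt, leaseId: lease.lease_id };
}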

Vault issues short-lived credentials. The pool refreshes them at 70 percent of TTL, opens a new connection with the new creds, drains the old one. The application code never sees the rotation. We ran this for the read replicas on a multi-terabyte Aurora cluster and the only place rotation showed up was in Datadog as a tiny pool-cycle blip every hour.
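The pool side of that refresh is not shown above, so here is a rough sketch of how it could work, under stated assumptions. The Pool interface, rotateForever, and the injected fetchCreds, openPool, onSwap, and sleep functions are all hypothetical. Retry and error handling are left out to keep the shape visible.

```typescript
// Hypothetical pool-side rotation loop. All names here are illustrative.
interface DbCreds {
  username: string;
  password: string;
  refreshAt: number; // epoch ms, as returned by getDbCreds
  leaseId: string;
}

interface Pool {
  drain(): Promise<void>; // resolves once in-flight queries have finished
}

// Milliseconds to wait before fetching fresh credentials (never negative).
function msUntilRefresh(creds: Pick<DbCreds, 'refreshAt'>, now: number = Date.now()): number {
  return Math.max(0, creds.refreshAt - now);
}

// Fetch creds, open a pool, swap it in, drain the old pool, then sleep until refreshAt.
async function rotateForever(
  fetchCreds: () => Promise<DbCreds>,
  openPool: (creds: DbCreds) => Promise<Pool>,
  onSwap: (pool: Pool) => void,
  sleep: (ms: number) => Promise<void>,
  shouldStop: () => boolean,
): Promise<void> {
  let current: Pool | undefined;
  while (!shouldStop()) {
    const creds = await fetchCreds();
    const next = await openPool(creds);
    onSwap(next); // new queries go to the new pool from here on
    if (current) await current.drain(); // old pool finishes its in-flight work
    current = next;
    await sleep(msUntilRefresh(creds));
  }
}
```

Opening the new pool before draining the old one is what keeps the rotation invisible to application code: there is always a pool with valid credentials to hand out.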

A second war story, briefly

Late one weekend at the creator economy platform I worked at, a teammate pushed a cache-key change to a worker that read its feature flags from Consul. They had it right in staging. In production, the prefix was different by one path segment. The worker booted, read zero flags, defaulted everything to off, and quietly stopped doing background image processing for about forty minutes before alerts caught it. Real fix took an hour. The lesson was the missing schema. Once we made the loader hard-fail on flags.image_processing_enabled === undefined, that whole class of bug disappeared.
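To make that hard-fail concrete, here is a dependency-free sketch of the check. In the real loader this would presumably be a required boolean field in the Zod schema. parseRequiredFlag and the "true"/"false" string encoding are my assumptions, based on Consul values arriving as strings.

```typescript
// Hypothetical helper: parseRequiredFlag is illustrative, not the loader's real API.
type RawConfig = Record<string, string | undefined>;

function parseRequiredFlag(raw: RawConfig, key: string): boolean {
  const value = raw[key];
  // A missing key is a config error, not an implicit "off".
  if (value === undefined) {
    throw new Error(`config: required flag ${key} is missing`);
  }
  if (value === 'true') return true;
  if (value === 'false') return false;
  // Anything else is rejected rather than coerced.
  throw new Error(`config: flag ${key} must be "true" or "false", got "${value}"`);
}
```

With the wrong prefix, the worker would have read an empty map, hit the first throw, and refused to boot, which is a much louder failure than forty minutes of skipped image processing.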

Dev and prod parity

The shortest version of this rule is: dev reads from the same client library as prod. We run a local Consul and a fake SSM in docker-compose, seeded from a checked-in fixtures file. Vault runs in dev mode with a fixed dev-only token and issues 24-hour leases. Engineers do not learn one mental model for local and a different one for prod, which is the whole reason parity matters. The Consul and Vault services look like this:

# docker-compose.yml
services:
  consul:
    image: hashicorp/consul:1.18
    command: agent -dev -client=0.0.0.0
    ports: ['8500:8500']
  vault:
    image: hashicorp/vault:1.16
    environment:
      VAULT_DEV_ROOT_TOKEN_ID: dev-only-token
    cap_add: ['IPC_LOCK']
    ports: ['8200:8200']

Takeaways

Split config by access pattern, not by service. Plain config, secrets, and feature flags want different stores.
The client library is the contract. Centralize loading, validation, retries, and fallbacks once.
Validate config with a schema at boot. A pod that cannot read its config should refuse to start, not guess.
Hot reload needs jitter and a try/catch. Otherwise it is just a faster way to take down a downstream.
Rotate secrets through short-lived leases. Make the application transparent to rotation, not the other way around.
Run the same client library locally as in prod. Parity is not glamorous, it is just how you avoid two-hour debugging sessions.

Thanks for reading. If you’ve got thoughts, send them my way.