Secrets Management in Production

Real-world trade-offs between AWS Secrets Manager, Parameter Store, and Vault, plus rotation, Kubernetes wiring, and a compromise response playbook.

It was a Wednesday at the creator economy platform I worked at, and the Apple shared secret had been sitting in a Kubernetes ConfigMap for thirteen months. Nobody flagged it during onboarding. We found out the boring way. An SRE pasted a kubectl describe configmap into the war room channel while we were debugging a different incident, and there it was, plain text, third line from the top.

Nothing leaked. Nothing was rotated either, which is its own kind of leak waiting to happen. That morning ended with a runbook, a calendar reminder, and a small piece of automation nobody wanted to write.

I’ve shipped secrets on three different stacks. A creator-economy SaaS on AWS with EKS. A live-video creator platform on AWS plus Cloudflare Workers. A combat-sports federation platform I CTO’d in London running hundreds of microservices on Kubernetes. Same three tools come up every time. AWS Secrets Manager, AWS Parameter Store (SSM), HashiCorp Vault.

The three options I have shipped with

My default on AWS is Secrets Manager for anything that should rotate, Parameter Store for the boring stuff, and Vault only if you’ve already paid the operational tax for a reason. That ordering is not popular in some Hacker News threads. I don’t care. It’s the one that survives 2 a.m.

Secrets Manager gets you native rotation hooks (the Lambda integration for RDS, Redshift, DocumentDB is genuinely good), cross-region replication, fine-grained IAM, and audit trails wired into CloudTrail. It costs a bit per secret per month plus a per-API-call charge, which is fine for credentials. Don’t dump 200 feature flags in there. That’s not what it’s for.
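To put a rough number on that, here is a back-of-the-envelope sketch. The prices are assumptions based on the commonly quoted list price (about $0.40 per secret per month and $0.05 per 10,000 API calls), so check the current pricing page for your region before relying on them. The function name is made up for illustration.

```typescript
// Assumed list prices in US cents. Verify against current AWS pricing.
const CENTS_PER_SECRET_PER_MONTH = 40;
const CENTS_PER_10K_API_CALLS = 5;

// Rough monthly Secrets Manager bill, in cents, for a given footprint.
function estimateMonthlyCostCents(
  secretCount: number,
  apiCallsPerMonth: number,
): number {
  const storage = secretCount * CENTS_PER_SECRET_PER_MONTH;
  const calls = (apiCallsPerMonth / 10_000) * CENTS_PER_10K_API_CALLS;
  return storage + calls;
}

// Four real credentials with 10,000 reads a month: 165 cents.
const credentialsOnlyCents = estimateMonthlyCostCents(4, 10_000);

// 200 feature flags stored as secrets: 8000 cents before a single API call.
const flagsAsSecretsCents = estimateMonthlyCostCents(200, 0);
```

The absolute numbers are small either way. The point is that the per-secret charge scales with how many things you store, which is why config belongs somewhere else.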

Parameter Store is free up to a quota, and SecureString parameters are KMS-encrypted. I use it for non-secret config that I still want versioned and centrally managed. Feature flag defaults, third-party API endpoints, log levels, sometimes API keys for read-only services. It is not a secrets store. It is a config store with optional encryption. Mixing the two roles is one of the most common mistakes I see in early-stage teams.

Vault is the one I push back on the most. I love what it can do. Dynamic database credentials, PKI as a service, transit encryption, fine-grained policy. It is the right answer if you’re multi-cloud, if you need short-lived dynamic creds across many systems, or if you have a dedicated SRE team that can keep it highly available. On a single-cloud AWS shop with a small platform team, Vault is a load-bearing dependency you don’t want, and the failure mode of “Vault is down so nothing can boot” is the kind of thing that costs you sleep.

Wiring secrets into the app

I prefer External Secrets Operator on EKS, pulling from Secrets Manager and syncing into a native Kubernetes Secret that is injected as env vars or mounted as files. The app itself stays dumb. It reads env vars. It does not need to know where the secrets came from.

The fail-fast bit matters. If a required secret is missing, the pod should die at startup, loudly. Not retry quietly. Not fall back to a default. Die.

import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
import {
  SecretsManagerClient,
  GetSecretValueCommand,
} from '@aws-sdk/client-secrets-manager';

type LoadedSecrets = {
  DATABASE_URL: string;
  STRIPE_SECRET_KEY: string;
  APPLE_SHARED_SECRET: string;
  JWT_SIGNING_KEY: string;
};

const REQUIRED_KEYS: (keyof LoadedSecrets)[] = [
  'DATABASE_URL',
  'STRIPE_SECRET_KEY',
  'APPLE_SHARED_SECRET',
  'JWT_SIGNING_KEY',
];

@Injectable()
export class SecretsService implements OnModuleInit {
  private readonly logger = new Logger(SecretsService.name);
  private readonly client = new SecretsManagerClient({});
  private cache: LoadedSecrets | null = null;
  private loadedAt = 0;

  async onModuleInit() {
    await this.load();
  }

  async load(): Promise<LoadedSecrets> {
    const secretId = process.env.APP_SECRETS_ARN;
    if (!secretId) {
      throw new Error('APP_SECRETS_ARN is not set, refusing to start');
    }

    const res = await this.client.send(
      new GetSecretValueCommand({ SecretId: secretId }),
    );

    if (!res.SecretString) {
      throw new Error(`Secret ${secretId} has no SecretString payload`);
    }

    const parsed = JSON.parse(res.SecretString) as Partial<LoadedSecrets>;

    // refuse to boot on any missing required key, no silent defaults
    for (const k of REQUIRED_KEYS) {
      if (!parsed[k]) {
        throw new Error(`Missing required secret: ${k}`);
      }
    }

    this.cache = parsed as LoadedSecrets;
    this.loadedAt = Date.now();
    this.logger.log(`Loaded ${REQUIRED_KEYS.length} secrets from ${secretId}`);
    return this.cache;
  }

  get<K extends keyof LoadedSecrets>(key: K): LoadedSecrets[K] {
    if (!this.cache) {
      throw new Error('SecretsService not initialized');
    }
    return this.cache[key];
  }

  // call this from a SIGHUP handler or a /admin/reload endpoint after rotation
  async reload(): Promise<void> {
    await this.load();
  }
}

Two things to point out. There’s no fallback to process.env. If the secret is missing, the pod dies. That is the feature. And there’s a deliberate reload() path because rotation is real, and you don’t want to bounce every pod just because a password changed at 03:00 UTC.
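If you take the External Secrets Operator route instead and the values arrive as env vars, the same fail-fast rule applies. This is a minimal sketch of that variant, not the service above: loadFromEnv and REQUIRED_ENV are hypothetical names, and the key list simply mirrors the four secrets used in this post.

```typescript
type RequiredEnv =
  | "DATABASE_URL"
  | "STRIPE_SECRET_KEY"
  | "APPLE_SHARED_SECRET"
  | "JWT_SIGNING_KEY";

const REQUIRED_ENV: RequiredEnv[] = [
  "DATABASE_URL",
  "STRIPE_SECRET_KEY",
  "APPLE_SHARED_SECRET",
  "JWT_SIGNING_KEY",
];

// Reads required values from an env-like object and throws if any are
// missing or empty. No defaults, no retries.
function loadFromEnv(
  env: Record<string, string | undefined>,
): Record<RequiredEnv, string> {
  const missing = REQUIRED_ENV.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  const loaded = {} as Record<RequiredEnv, string>;
  for (const key of REQUIRED_ENV) {
    loaded[key] = env[key] as string;
  }
  return loaded;
}

// At bootstrap, before the HTTP server starts listening:
//   const secrets = loadFromEnv(process.env);
```

Taking the env object as a parameter rather than reading process.env directly keeps the check trivial to unit test.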

Wiring it through Kubernetes

On EKS, External Secrets Operator handles the sync. You declare an ExternalSecret, it watches Secrets Manager, and it materializes a native Secret you can mount or env-inject. IRSA gives the operator a workload identity, so no static IAM keys live anywhere.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-app-secrets
  namespace: api
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: api-app-secrets
    creationPolicy: Owner
    template:
      type: Opaque
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/app
        property: DATABASE_URL
    - secretKey: STRIPE_SECRET_KEY
      remoteRef:
        key: prod/api/app
        property: STRIPE_SECRET_KEY
    - secretKey: APPLE_SHARED_SECRET
      remoteRef:
        key: prod/api/app
        property: APPLE_SHARED_SECRET
    - secretKey: JWT_SIGNING_KEY
      remoteRef:
        key: prod/api/app
        property: JWT_SIGNING_KEY

refreshInterval: 1h is a sane default. Pair it with a deployment trigger so pods actually pick up the new value: env vars are read once at container start, so an updated Secret changes nothing until the pod restarts. I’ve seen teams rotate a secret in Secrets Manager, then wonder six hours later why nothing changed in the pods. The operator had updated the K8s Secret. The pods just never re-read it.
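One common way to build that trigger is a checksum annotation on the pod template: hash the Secret’s contents, write the digest into the Deployment’s spec.template.metadata.annotations, and any change to the digest rolls the pods. Here is a minimal sketch of the hashing half. The function names and the annotation key are hypothetical, and how you patch the Deployment (CI step, controller, Helm template) is left out.

```typescript
import { createHash } from "node:crypto";

// Stable SHA-256 over a Secret's key/value pairs. Entries are sorted by key
// so the digest changes only when content changes, not when ordering does.
function secretChecksum(data: Record<string, string>): string {
  const hash = createHash("sha256");
  const entries = Object.entries(data).sort(([x], [y]) =>
    x < y ? -1 : x > y ? 1 : 0,
  );
  for (const [key, value] of entries) {
    // NUL separators keep ("ab", "c") distinct from ("a", "bc").
    hash.update(key);
    hash.update("\0");
    hash.update(value);
    hash.update("\0");
  }
  return hash.digest("hex");
}

// Hypothetical annotation key. A changed value in the pod template's
// annotations is what causes Kubernetes to start a rolling update.
function restartAnnotation(
  data: Record<string, string>,
): Record<string, string> {
  return { "checksum/api-app-secrets": secretChecksum(data) };
}
```

The digest is derived from secret values, so treat it as sensitive-adjacent and keep it out of anything wider than the cluster’s own metadata.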

Bonus: stop using long-lived AWS access keys in GitHub Actions. OIDC has been GA for years.

name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gh-actions-deploy
          aws-region: us-east-1
      - run: ./scripts/deploy.sh
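The workflow only works if the role it assumes trusts GitHub’s OIDC provider, and that half tends to get skipped in write-ups. Below is a sketch of the trust policy document, built as a plain object so it can be serialized for whatever IaC tool you use. The helper name and the example-org/example-repo repository are placeholders; the account ID is the dummy one from the workflow above. It assumes the OIDC provider for token.actions.githubusercontent.com already exists in the account.

```typescript
// Builds an IAM trust policy that lets one repo and branch assume a role
// through GitHub's OIDC provider. Names and arguments are illustrative.
function githubOidcTrustPolicy(
  accountId: string,
  repo: string,
  branch: string,
) {
  const provider = "token.actions.githubusercontent.com";
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: {
          Federated: `arn:aws:iam::${accountId}:oidc-provider/${provider}`,
        },
        Action: "sts:AssumeRoleWithWebIdentity",
        Condition: {
          StringEquals: {
            // Audience set by aws-actions/configure-aws-credentials.
            [`${provider}:aud`]: "sts.amazonaws.com",
            // Pin to a single repo and branch; without this, any repo on
            // GitHub could try to assume the role.
            [`${provider}:sub`]: `repo:${repo}:ref:refs/heads/${branch}`,
          },
        },
      },
    ],
  };
}

const trustPolicyJson = JSON.stringify(
  githubOidcTrustPolicy("123456789012", "example-org/example-repo", "main"),
  null,
  2,
);
```

The sub condition is the part that matters. Keep it as narrow as the deploy flow allows.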

Rotation is not optional

Every secret in my systems has a rotation policy. Database creds rotate on a 30-day schedule via Secrets Manager plus a rotation Lambda. Third-party API keys that don’t support programmatic rotation get a calendar entry on the platform team’s shared calendar and a runbook entry. The Apple shared secret from the opener now lives in Secrets Manager with a quarterly reminder, because Apple’s developer portal does not expose a rotation API.
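A policy only helps if something checks it. Here is a small sketch of the kind of check that can sit behind the calendar reminder: given an inventory of secrets with a maximum age and a last-rotated timestamp, list the ones that are overdue. The type, the function, and the Apple secret’s name are hypothetical, and 90 days stands in for "quarterly". For secrets that live in Secrets Manager, the timestamp could plausibly come from the metadata DescribeSecret returns.

```typescript
type RotationPolicy = {
  name: string;
  maxAgeDays: number;
  lastRotated: Date;
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Names of secrets whose last rotation is older than their policy allows.
function overdueSecrets(policies: RotationPolicy[], now: Date): string[] {
  return policies
    .filter(
      (p) => now.getTime() - p.lastRotated.getTime() > p.maxAgeDays * MS_PER_DAY,
    )
    .map((p) => p.name);
}

const rotationInventory: RotationPolicy[] = [
  {
    name: "prod/api/db-master",
    maxAgeDays: 30,
    lastRotated: new Date("2024-01-01T00:00:00Z"),
  },
  {
    name: "prod/api/apple-shared-secret", // hypothetical secret name
    maxAgeDays: 90,
    lastRotated: new Date("2024-01-01T00:00:00Z"),
  },
];

// 45 days on, the database credential is overdue and the quarterly one is not.
const overdueNow = overdueSecrets(
  rotationInventory,
  new Date("2024-02-15T00:00:00Z"),
);
```

Run something like this on a schedule and post the result to the platform channel, and the calendar entry stops being the only line of defense.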

I know the duplicate-IAP story is on me. Different incident, same theme. Native billing on the branded mobile apps platform I worked on had a window where Apple’s server-to-server renewal notification got retried after our endpoint returned a slow 200 OK. No idempotency check, so retries created duplicate subscription rows. Structural fix was a Sidekiq job plus a unique constraint on (apple_original_transaction_id, notification_uuid), with the endpoint returning 200 OK within 5 seconds so Apple’s retries became idempotent at the queue level. The secrets-management lesson rode shotgun. We audited every Apple-related credential the same week. The shared secret was in a ConfigMap. Moved it. Added the rotation reminder.
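The shape of that fix is easier to see in code. This is a TypeScript illustration of the idea, not the original Sidekiq job: an in-memory Set stands in for the database unique constraint, and the type and function names are made up.

```typescript
type RenewalNotification = {
  originalTransactionId: string;
  notificationUuid: string;
};

// In-memory stand-in for the unique constraint on
// (apple_original_transaction_id, notification_uuid).
class NotificationLedger {
  private readonly seen = new Set<string>();

  // Returns true the first time a notification is recorded, false on a retry.
  recordOnce(n: RenewalNotification): boolean {
    const key = JSON.stringify([n.originalTransactionId, n.notificationUuid]);
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}

// Acknowledge fast, and only enqueue work for notifications not seen before.
function handleNotification(
  ledger: NotificationLedger,
  n: RenewalNotification,
  enqueue: (n: RenewalNotification) => void,
): number {
  if (ledger.recordOnce(n)) {
    enqueue(n);
  }
  return 200; // always acknowledge, so a retry is harmless rather than duplicated
}
```

In production the Set has to be the database constraint itself, because two pods handling the same retry concurrently will not share memory.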

The compromise response playbook

When a secret leaks, the first 30 minutes are the only thing that matters. Here’s what runs, in order, every time.

1. Revoke. Whatever the credential is, kill it at the source. RDS user disabled, Stripe key rolled, Apple secret regenerated (for what it’s worth).
2. Rotate. Issue a new credential, push it through Secrets Manager, let External Secrets Operator propagate, restart the pods that need it.
3. Audit. Pull CloudTrail, the Stripe audit log, and App Store Connect records for any usage of the old credential after the suspected leak time.
4. Communicate. Internal first (incident channel, leadership), then customer-facing if the audit shows external impact.

Step 1 is where teams freeze. Don’t. The half-life of a leaked AWS access key on GitHub is measured in minutes, not hours. Treat revocation as reflex.

#!/usr/bin/env bash
# emergency-rotate-rds.sh: emergency rotation of an RDS master password.
# usage: ./emergency-rotate-rds.sh prod/api/db-master
set -euo pipefail

SECRET_ID="${1:?usage: $0 <secrets-manager-secret-id>}"
REGION="${AWS_REGION:-us-east-1}"

NEW_PASSWORD="$(aws secretsmanager get-random-password \
  --exclude-characters '"@/\' \
  --password-length 40 \
  --require-each-included-type \
  --region "${REGION}" \
  --query 'RandomPassword' --output text)"

DB_ID="$(aws secretsmanager describe-secret \
  --secret-id "${SECRET_ID}" \
  --region "${REGION}" \
  --query 'Tags[?Key==`DBInstanceIdentifier`].Value | [0]' --output text)"

if [[ "${DB_ID}" == "None" || -z "${DB_ID}" ]]; then
  echo "could not resolve DBInstanceIdentifier tag on ${SECRET_ID}" >&2
  exit 1
fi

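# Build the updated payload before touching the database, so a missing or
# non-JSON secret fails here and not after the password has already changed.
# Merging keeps any other fields (username, host, port) in the secret intact.
NEW_SECRET="$(aws secretsmanager get-secret-value \
  --secret-id "${SECRET_ID}" \
  --region "${REGION}" \
  --query 'SecretString' --output text \
  | jq -c --arg pw "${NEW_PASSWORD}" '.password = $pw')"

echo "rotating master password for ${DB_ID}"
aws rds modify-db-instance \
  --db-instance-identifier "${DB_ID}" \
  --master-user-password "${NEW_PASSWORD}" \
  --apply-immediately \
  --region "${REGION}" >/dev/null

aws secretsmanager put-secret-value \
  --secret-id "${SECRET_ID}" \
  --secret-string "${NEW_SECRET}" \
  --region "${REGION}" >/dev/null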

kubectl rollout restart deployment -n api -l app.kubernetes.io/component=api
echo "done. confirm pod health and check CloudTrail for the old credential."

It’s not pretty. It’s not meant to be. It is meant to be runnable at 03:00 UTC by whoever is on-call, without thinking.

By the time you hear about a leak from outside, you are already behind. With secrets, “outside” tends to be GitHub’s secret scanner emailing your security alias, or someone pasting a URL from a screenshot in the wrong channel. Revoke first, post-mortem later.
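Once revocation is done, the audit step is mostly filtering. Here is a sketch of that filter, assuming you have already exported events into a simple shape. AuditEvent and its fields are hypothetical and deliberately not CloudTrail’s actual schema; map your own export onto it.

```typescript
type AuditEvent = {
  eventTime: Date;
  credentialId: string;
  sourceIp: string;
  action: string;
};

// Events that used the leaked credential at or after the suspected leak
// time, oldest first. The input array is left untouched.
function usageAfterLeak(
  events: AuditEvent[],
  credentialId: string,
  leakedAt: Date,
): AuditEvent[] {
  return events
    .filter(
      (e) =>
        e.credentialId === credentialId &&
        e.eventTime.getTime() >= leakedAt.getTime(),
    )
    .sort((a, b) => a.eventTime.getTime() - b.eventTime.getTime());
}
```

Set leakedAt earlier than you think you need to. The suspected leak time is usually when someone noticed, not when it happened.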

Where I draw the line on Vault

If you’re on AWS and you don’t need dynamic database credentials, you don’t need Vault. There. I said it.

Vault is wonderful if you can run it well. Dynamic DB credentials alone are a real upgrade over long-lived passwords. PKI as a service is nice. But running Vault HA, with Raft or Consul as a backend, proper backup and unseal procedures, audit logs going somewhere durable, is a job. On a small platform team, that job competes with everything else, and the failure mode is your entire fleet failing to start because Vault is unreachable.

If you’re multi-cloud, or if you have one of the genuine use cases (PKI for an internal mesh, dynamic creds across many DBs, transit encryption), run Vault. Otherwise, Secrets Manager and Parameter Store plus External Secrets Operator will get you most of the value with a fraction of the operational footprint.

Takeaways

Default to AWS Secrets Manager for credentials, Parameter Store for config. Don’t mix the two roles.
Wire secrets in via External Secrets Operator. The app should just read env vars and fail fast.
Every secret has a rotation policy. If the upstream doesn’t support programmatic rotation, the policy is a calendar reminder plus a runbook.
Stop using long-lived IAM access keys anywhere CI/CD touches. Use OIDC.
Compromise response is reflex: revoke, rotate, audit, communicate. Don’t freeze on step 1.
Use Vault only when multi-cloud or dynamic secrets earn the operational tax.

Thanks for reading. If you’ve got thoughts, send them my way.