NestJS Caching Strategies

How I layer in-memory cache, Redis, custom keys, event-driven invalidation, and HTTP ETag headers in NestJS to take real load off Postgres.

A Tuesday morning at the creator-economy platform I worked at. The Community feed was on Postgres on Aurora, a multi-terabyte writer with three reader replicas behind a custom routing layer. Around 10:14 a.m. PT, Datadog lit up: AuroraReplicaLagMaximum > 60s for 2m. The /communities/:id/posts p99 walked from ~120 ms to over 8 s in four minutes. I wasn’t on-call that week, but I was tagged in the Slack thread within minutes because I owned the Aurora layer.

We eventually traced the spike to a maintenance ANALYZE on a hot table starving WAL emission. The replicas weren’t bottlenecked; they were starved. But the deeper lesson, the one that stuck and changed how I build NestJS services, is that the cache layer was the only reason the page was even partially loading during those 22 minutes. The reads that hit the in-memory tier did not care about the writer. The reads that fell through to Redis only cared about Redis. The reads that fell through to Aurora carried the outage.

If your cache strategy is “stick CacheInterceptor on a controller and forget”, that hot Tuesday is going to find you.

Two tiers, not one

I run NestJS caches in two tiers by default. In-process LRU for the per-pod hot path. Redis for cross-pod consistency. The interceptor on the controller is the surface, not the strategy.

// src/cache/cache.module.ts
// Assumes @nestjs/cache-manager v3+ (cache-manager v6), where every entry in
// `stores` is a Keyv instance. Older versions take a single `store` instead.
import { Module } from '@nestjs/common'
import { CacheModule } from '@nestjs/cache-manager'
import { ConfigModule, ConfigService } from '@nestjs/config'
import KeyvRedis from '@keyv/redis'
import { CacheableMemory } from 'cacheable'
import { Keyv } from 'keyv'

@Module({
  imports: [
    CacheModule.registerAsync({
      isGlobal: true,
      imports: [ConfigModule],
      inject: [ConfigService],
      useFactory: (config: ConfigService) => ({
        stores: [
          // L1: in-process LRU, tiny TTL, absorbs the thundering herd on a single pod
          new Keyv({ store: new CacheableMemory({ ttl: 5_000, lruSize: 5_000 }) }),
          // L2: shared Redis, the tier every pod agrees on
          new Keyv({
            store: new KeyvRedis({
              url: config.getOrThrow<string>('REDIS_URL'),
              // reconnect with linear backoff, capped at 5 s between attempts
              socket: { reconnectStrategy: (n) => Math.min(n * 200, 5_000) },
            }),
            ttl: 60_000,
          }),
        ],
      }),
    }),
  ],
})
export class AppCacheModule {}

A note worth saying out loud: the L1 TTL has to be small. Five seconds is a sane default in a busy service. The L1 exists to absorb the herd, not to be authoritative.
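To make the division of labor concrete, here is a minimal sketch of the read path the two tiers imply: check L1, fall through to L2, fall through to the origin, and backfill on the way back. This is not how cache-manager implements it internally. `Tier`, `MapTier`, and `readThrough` are hypothetical names, and plain `Map`s stand in for the in-process LRU and for Redis.

```typescript
// Hypothetical sketch of a two-tier read-through. Maps stand in for the
// in-process LRU (L1) and Redis (L2); none of these names come from cache-manager.
interface Tier {
  get(key: string): Promise<string | undefined>
  set(key: string, value: string, ttlMs: number): Promise<void>
}

class MapTier implements Tier {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>()
  private readonly now: () => number

  // the clock is injectable so expiry can be exercised without sleeping
  constructor(now: () => number = Date.now) {
    this.now = now
  }

  async get(key: string): Promise<string | undefined> {
    const hit = this.entries.get(key)
    if (!hit) return undefined
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key) // expired: treat as a miss
      return undefined
    }
    return hit.value
  }

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs })
  }
}

async function readThrough(
  key: string,
  l1: Tier,
  l2: Tier,
  origin: () => Promise<string>,
  ttl = { l1Ms: 5_000, l2Ms: 60_000 },
): Promise<{ value: string; servedBy: 'l1' | 'l2' | 'origin' }> {
  const fromL1 = await l1.get(key)
  if (fromL1 !== undefined) return { value: fromL1, servedBy: 'l1' }

  const fromL2 = await l2.get(key)
  if (fromL2 !== undefined) {
    await l1.set(key, fromL2, ttl.l1Ms) // backfill L1 so the next read stays in-process
    return { value: fromL2, servedBy: 'l2' }
  }

  const value = await origin() // only this branch touches the database
  await l2.set(key, value, ttl.l2Ms)
  await l1.set(key, value, ttl.l1Ms)
  return { value, servedBy: 'origin' }
}
```

The `servedBy` field is only there to make the tiers observable. In a real service that same signal is what you would feed into per-tier metrics.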

Custom cache keys

CacheInterceptor builds keys from the URL by default. That works for anonymous GETs and breaks the moment auth, locale, or feature flags enter the picture. On the creator platform’s Community feed the response varied per user, per locale, per A/B bucket. The default key would have served the wrong feed to the wrong user inside a minute.

// src/cache/feed-cache.interceptor.ts
import { CacheInterceptor, CacheTTL } from '@nestjs/cache-manager'
import { Controller, ExecutionContext, Injectable, UseInterceptors } from '@nestjs/common'

@Injectable()
export class FeedCacheInterceptor extends CacheInterceptor {
  protected trackBy(ctx: ExecutionContext): string | undefined {
    const req = ctx.switchToHttp().getRequest()
    // returning undefined skips the cache entirely, so writes are never cached
    if (req.method !== 'GET') return undefined
    const userId = req.user?.id ?? 'anon'
    // first language tag only, e.g. "de" from "de,en;q=0.8"
    const locale = req.headers['accept-language']?.split(/[,;]/)[0].trim() || 'en'
    const flag = req.headers['x-ff-feed-v2'] === '1' ? 'v2' : 'v1'
    // :id is the community id from the /communities/:id/posts route
    return `feed:${req.params.id}:${userId}:${locale}:${flag}`
  }
}

@Controller('communities')
@UseInterceptors(FeedCacheInterceptor)
@CacheTTL(30_000)
export class CommunityFeedController {
  /* ... */
}

Two things I’d flag here. One, the interceptor returns undefined for any non-GET request, which means the cache does not touch writes. Two, the key includes every dimension that changes the response body.

I learned that the hard way on a live-video creator platform I led engineering at, where I shipped a worker refactor that supposedly “tightened cache key composition” and instead dropped the locale segment. EU users started seeing US users’ Open Graph previews on shared links. We learned about it from a German creator who tweeted a screenshot of someone else’s profile photo appearing on his own profile preview. Hit the global purge button. The cache repopulated in ~3 minutes with the same bad key. Rolled the worker back. Redeployed with locale explicitly in the key. Wrote a deploy-time check that diffs cache-key composition against the previous version and refuses to deploy when the key changes without a --migrate-cache-key flag.

Forty minutes of mis-shared previews. Several public screenshots. Cache keys are part of your public API. Treat any change like a schema migration.
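The deploy-time check is the part worth copying, so here is a rough sketch of the idea rather than the tool we actually ran. It assumes each build emits a manifest listing the ordered segments of every key family. The manifest shape and the names `KeyManifest`, `diffKeyManifests`, and `checkDeploy` are all hypothetical.

```typescript
// Hypothetical sketch of a deploy-time guard. The manifest shape (an ordered
// list of segment names per key family) is an assumption, not the original tool.
type KeyManifest = Record<string, string[]>

interface KeyDiff {
  family: string
  previous: string[]
  next: string[]
}

// Reports every key family that exists in both versions but is composed differently.
function diffKeyManifests(previous: KeyManifest, next: KeyManifest): KeyDiff[] {
  const diffs: KeyDiff[] = []
  for (const family of Object.keys(previous)) {
    const before = previous[family]
    const after = next[family]
    if (after === undefined) continue // family removed: nothing can be mis-served
    // order matters: the same segments in a different order produce different keys
    const same =
      before.length === after.length && before.every((segment, i) => segment === after[i])
    if (!same) diffs.push({ family, previous: before, next: after })
  }
  return diffs
}

// A changed key is only allowed when the deployer explicitly opts in.
function checkDeploy(
  previous: KeyManifest,
  next: KeyManifest,
  migrateCacheKey: boolean,
): { allowed: boolean; diffs: KeyDiff[] } {
  const diffs = diffKeyManifests(previous, next)
  return { allowed: diffs.length === 0 || migrateCacheKey, diffs }
}
```

With a previous manifest of `{ feed: ['communityId', 'userId', 'locale', 'flag'] }` and a new one that drops `locale`, `checkDeploy` returns `allowed: false` unless the migrate flag is passed, which is exactly the regression described above.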

Event-driven invalidation

TTL-only invalidation is fine until you need a high hit ratio and fresh data at the same time: long TTLs buy the first and cost you the second. On the Community feed we needed both. The fix was event-driven invalidation on top of TTL, not instead of it.

// src/feed/feed-invalidator.service.ts
import { CACHE_MANAGER } from '@nestjs/cache-manager'
import { Inject, Injectable, Logger } from '@nestjs/common'
import { OnEvent } from '@nestjs/event-emitter'
import { Cache } from 'cache-manager'

@Injectable()
export class FeedInvalidator {
  private readonly log = new Logger(FeedInvalidator.name)

  constructor(@Inject(CACHE_MANAGER) private readonly cache: Cache) {}

  @OnEvent('community.post.created', { async: true })
  async onPostCreated(e: { communityId: string }) {
    // we don't know the user/locale/flag fanout, so we tag-invalidate
    await this.cache.del(`feed-tag:${e.communityId}`)
    this.log.debug(`invalidated feed-tag:${e.communityId}`)
  }
}

The interceptor writes a tag alongside the entry (that write is not shown in the trackBy snippet earlier). The invalidator deletes the tag. Reads that find the tag missing or changed fall through to the origin and rebuild. The pattern is borrowed straight from Cloudflare’s tag-based purge, just running on Redis. With this in place on the Community feed I saw the read-path hit ratio settle above the level where Aurora cared about replica lag.
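One way to wire up the read side is to stamp every entry with the tag version it was built under and compare on read. The sketch below is an assumption about how that can look, not the production interceptor. A `Map` stands in for Redis, and `TaggedFeedCache` and its methods are hypothetical names.

```typescript
// Hypothetical sketch of tag-checked reads. A Map stands in for Redis, and
// the entry/tag layout is an assumption about one way to implement the pattern.
interface TaggedEntry {
  tagVersion: string
  body: string
}

class TaggedFeedCache {
  private readonly store = new Map<string, string>()
  // per-instance counter for the demo only; across pods the tag value would
  // need to be globally unique (for example a random id)
  private nextVersion = 0

  private tagKey(communityId: string): string {
    return `feed-tag:${communityId}`
  }

  // returns the cached body, or undefined when the entry is missing or its tag is stale
  read(communityId: string, entryKey: string): string | undefined {
    const currentTag = this.store.get(this.tagKey(communityId))
    const raw = this.store.get(entryKey)
    if (currentTag === undefined || raw === undefined) return undefined
    const entry = JSON.parse(raw) as TaggedEntry
    return entry.tagVersion === currentTag ? entry.body : undefined
  }

  // writes the entry together with the tag version it was built under
  write(communityId: string, entryKey: string, body: string): void {
    const key = this.tagKey(communityId)
    let tag = this.store.get(key)
    if (tag === undefined) {
      tag = `v${++this.nextVersion}` // a fresh version after every invalidation
      this.store.set(key, tag)
    }
    const entry: TaggedEntry = { tagVersion: tag, body }
    this.store.set(entryKey, JSON.stringify(entry))
  }

  // what the invalidator's cache.del(`feed-tag:${communityId}`) amounts to
  invalidate(communityId: string): void {
    this.store.delete(this.tagKey(communityId))
  }
}
```

The point of the indirection is that one delete invalidates every user, locale, and flag variant for a community without enumerating them. The orphaned entries are never served again and simply age out through their TTL.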

The HTTP layer is also a cache

The cache that costs you nothing is the one the client respects. NestJS makes this easy with an interceptor.

// src/cache/etag.interceptor.ts
import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common'
import { createHash } from 'crypto'
import { Observable } from 'rxjs'
import { map } from 'rxjs/operators'

@Injectable()
export class EtagInterceptor implements NestInterceptor {
  intercept(ctx: ExecutionContext, next: CallHandler): Observable<unknown> {
    const res = ctx.switchToHttp().getResponse()
    const req = ctx.switchToHttp().getRequest()
    return next.handle().pipe(
      map((body) => {
        const etag = `"${createHash('sha1').update(JSON.stringify(body)).digest('base64')}"`
        res.setHeader('ETag', etag)
        res.setHeader('Cache-Control', 'private, max-age=15, stale-while-revalidate=60')
        if (req.headers['if-none-match'] === etag) {
          res.status(304)
          return undefined
        }
        return body
      }),
    )
  }
}
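
One caveat on the strict `===` comparison in the interceptor: clients and intermediaries may send a comma-separated list of validators, a `W/` weak prefix, or `*` in If-None-Match. A more forgiving matcher might look like the sketch below. `matchesIfNoneMatch` is a hypothetical helper, and it assumes the ETag values contain no commas, which holds for the base64 digests used here.

```typescript
// Hypothetical helper: weak comparison of an If-None-Match header against a
// freshly computed ETag. Assumes tags never contain commas (true for base64).
function stripWeakPrefix(tag: string): string {
  return tag.startsWith('W/') ? tag.slice(2) : tag
}

function matchesIfNoneMatch(header: string | undefined, etag: string): boolean {
  if (!header) return false
  const trimmed = header.trim()
  if (trimmed === '*') return true // "*" matches any current representation
  const current = stripWeakPrefix(etag)
  return trimmed
    .split(',')
    .map((candidate) => stripWeakPrefix(candidate.trim()))
    .some((candidate) => candidate === current)
}
```

In the interceptor this would replace the equality check with `matchesIfNoneMatch(req.headers['if-none-match'], etag)`.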

stale-while-revalidate is the underrated directive. Once the 15-second max-age runs out, the client can keep serving the stale copy for up to 60 more seconds while a background fetch refreshes it. On a feed page that’s the difference between visible latency and invisible latency.
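As a worked example of that timeline, the sketch below classifies a cached response by its age under `max-age=15, stale-while-revalidate=60`. It models the rule, not any particular browser's internals, and `classifyAge` is a hypothetical name.

```typescript
// Hypothetical sketch of how a client cache treats a stored response under
// "max-age=15, stale-while-revalidate=60". Not tied to any browser's internals.
type Freshness = 'fresh' | 'serve-stale-and-revalidate' | 'revalidate-before-use'

function classifyAge(ageSeconds: number, maxAge = 15, staleWhileRevalidate = 60): Freshness {
  if (ageSeconds < maxAge) return 'fresh' // served from cache, no network at all
  if (ageSeconds < maxAge + staleWhileRevalidate) {
    return 'serve-stale-and-revalidate' // served instantly, refreshed in the background
  }
  return 'revalidate-before-use' // too old: the user waits on the network round trip
}
```

So a copy that is 10 seconds old costs nothing, one that is 40 seconds old still renders instantly, and only past 75 seconds does the user wait on a request, which with a matching ETag can still come back as a 304.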

When to skip CacheInterceptor

CacheInterceptor is a fine starting point. It’s wrong for anything that varies by user. It’s wrong for anything where invalidation matters more than TTL. It’s wrong on writes. The interceptor I actually ship is a thin subclass that builds the right key, defers writes to a tag service, and falls back to the underlying cache-manager store. The decorator stays on the controller. The strategy lives in code.

Takeaways

Two tiers. L1 in-process with tiny TTLs. L2 Redis with longer TTLs. The L1 absorbs the herd.
Custom trackBy always. The default URL-based key is a per-user data leak waiting to happen.
Event-driven invalidation on top of TTL, not instead of it. Tag the entry, delete the tag on the write-side event.
ETag plus stale-while-revalidate is a nearly free latency win. Ship it on every cacheable GET.
Cache keys are a schema. Diff them at deploy time. Treat any change like a migration.
Measure hit ratio per layer separately. A 95% L1 ratio and a 60% L2 ratio tell very different stories.
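
On that last point, a minimal sketch of per-tier counters, assuming the read path reports each lookup as it happens. `TierStats` is a hypothetical class. In practice these numbers would go to whatever metrics client you already run.

```typescript
// Hypothetical per-tier counters; the tier names and the class are illustrative.
type TierName = 'l1' | 'l2'

class TierStats {
  private readonly counts: Record<TierName, { hits: number; misses: number }> = {
    l1: { hits: 0, misses: 0 },
    l2: { hits: 0, misses: 0 },
  }

  record(tier: TierName, hit: boolean): void {
    if (hit) this.counts[tier].hits++
    else this.counts[tier].misses++
  }

  // hit ratio over the lookups that actually reached this tier
  ratio(tier: TierName): number {
    const { hits, misses } = this.counts[tier]
    const total = hits + misses
    return total === 0 ? 0 : hits / total
  }
}
```

The L2 ratio is computed only over requests that missed L1, which is why the two numbers have to be read separately: at 95% on L1 and 60% on L2, only 2 in 100 requests reach the database.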

Thanks for reading. If you’ve got thoughts, send them my way.