Our Distributed Cache Invalidation Nightmare

Six hours of stale prices on millions of product pages at a creator-economy platform, and the event-driven invalidation, two-level coordination, and freshness monitoring that finally killed the drift.

It was a Monday morning at the creator-economy platform I worked at. Pricing on a big creator’s course catalog had been wrong on the public pages for six hours and nobody had been paged. Checkout was fine: the real price appeared the moment a customer hit the buy button. But every product card, every catalog row, every shared social preview was showing yesterday’s number. Millions of pageviews into the wrong figure. A creator noticed before we did and posted a screenshot.

OK so the setup. Catalog reads were served by a CDN (Cloudflare) in front of a Rails app, with Redis as a second-level cache for the price model. Writes went through a pricing service that emitted price.updated events on Kafka. The read path was supposed to listen for those events and bust both layers. The read path had been listening. It just hadn’t been busting both layers correctly. Pricing was verified at checkout against Postgres, so nobody lost money. The trust hit was the cost.

How the invalidation actually fell apart

A creator changed the price on a few hundred courses through the bulk-edit tool. The pricing service wrote the new rows, emitted events, returned 200. The Redis layer cleared on those keys within seconds. The CDN did not. The Worker that handled cache key composition was reading a price_version header off the origin response, and the origin response was being served from a Rails fragment cache that had its own TTL on a different clock. The version header lagged. The CDN happily served the stale HTML with the stale embedded price for hours.

I’ve seen this exact shape before. A live-video creator platform I led engineering at had a Cloudflare Workers cache key that quietly dropped the locale segment after a refactor. EU users started seeing US users’ Open Graph previews on shared links. We learned about it from a creator tweeting a screenshot. First move there was the global purge button. The cache repopulated in three minutes with the same bad key and the same wrong results. Purging treats the symptom. The wrong key composition was still in production.

Same shape this time too. Wrong version source, not wrong cache.

The first fix we shipped that did nothing

First instinct, of course, was to purge. Cleared the price-related URLs from Cloudflare in a tight loop. Pages went blank for a couple of minutes while the edge re-warmed. The new HTML came back from origin, the origin was still serving a stale Rails fragment, and the new HTML had the same stale price baked in. We’d just made the bug refresh faster.

Second instinct: bump the Rails fragment TTL down so the origin would drift less. That helped the next pricing change. It did nothing for the fragment that was already wrong. And shorter TTLs lower the hit rate and push more reads back to the origin for every customer, which is a cure worse than the disease.

What actually worked in the end

Three pieces. Cache versioning at the data layer, event-driven invalidation that didn’t trust TTLs, and freshness monitoring on the derived state.

The version moved from “a header off the origin response” to “a column on the row, stamped by the writer”. Every price row carries a monotonic version and a version_updated_at. The cache key is composed from the product ID and that version. A stale read can’t masquerade as fresh because the version itself is part of the key.

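// apps/catalog/src/cache/keys.ts
// `db` (Kysely) and `redis` (ioredis-style client) are module-level clients
// set up elsewhere in the app.

// The version is part of the key, so a new version is a new key and an old
// entry can never be read back as the current price.
export function priceCacheKey(productId: string, version: number) {
  return `price:v2:${productId}:${version}`;
}

export async function readPrice(productId: string) {
  // The version always comes from the row the writer stamped.
  // This lookup hits the database on every call; only the payload is cached.
  const row = await db
    .selectFrom("prices")
    .select(["product_id", "amount_cents", "currency", "version"])
    .where("product_id", "=", productId)
    .executeTakeFirst();

  if (!row) return null;

  const key = priceCacheKey(row.product_id, row.version);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Miss: cache the payload under the versioned key for 10 minutes.
  // Entries for older versions are never deleted, they just expire.
  const payload = { amount: row.amount_cents, currency: row.currency };
  await redis.set(key, JSON.stringify(payload), "EX", 600);
  return payload;
}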

The event consumer’s job got smaller too. It used to delete a list of keys. Now it just updates the index row that the read path uses to look up the current version, and lets old keys age out on their own. Less surface area for a missed delete to cause drift.

// apps/catalog/src/events/price-updated.ts
export class PriceUpdatedConsumer {
  constructor(private readonly redis: Redis, private readonly db: Kysely<DB>) {}

  async handle(event: PriceUpdatedEvent) {
    const { productId, version, updatedAt } = event;

    await this.db
      .insertInto("price_index")
      .values({ product_id: productId, version, version_updated_at: updatedAt })
      .onConflict((oc) =>
        oc.column("product_id").doUpdateSet({
          version,
          version_updated_at: updatedAt,
        }),
      )
      .execute();

    await this.redis.set(
      `price:idx:${productId}`,
      JSON.stringify({ version, updatedAt }),
      "EX",
      900,
    );
  }
}
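
One thing the consumer above leaves implicit is what happens if an older event is replayed or arrives after a newer one. Since the version is meant to be monotonic, a small guard could decide which entry wins before anything is written. The following is a minimal sketch, not the production code: the IndexEntry type and the nextIndexEntry helper are hypothetical names introduced here for illustration.

```typescript
// Hypothetical shape of an index entry. The field names mirror the consumer
// above, but this type is an assumption made for the sketch.
interface IndexEntry {
  version: number;
  updatedAt: string;
}

// Returns the entry that should be stored after seeing `incoming`.
// An event only wins if its version is strictly newer than the stored one,
// so replays and out-of-order deliveries cannot move the index backwards.
function nextIndexEntry(
  current: IndexEntry | null,
  incoming: IndexEntry,
): IndexEntry {
  if (current === null) return incoming;
  return incoming.version > current.version ? incoming : current;
}

// Usage: events arrive as versions 3, 5, then a late 4.
const arrivals: IndexEntry[] = [
  { version: 3, updatedAt: "2024-01-01T00:00:03Z" },
  { version: 5, updatedAt: "2024-01-01T00:00:05Z" },
  { version: 4, updatedAt: "2024-01-01T00:00:04Z" },
];
const settled = arrivals.reduce<IndexEntry | null>(
  (current, event) => nextIndexEntry(current, event),
  null,
);
```

In this run the late version 4 is ignored and the index settles on version 5. In the real consumer the same comparison would have to guard both the database upsert and the Redis write, otherwise the two could disagree.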

The CDN layer got the same treatment. The Worker reads a small price:idx:${productId} index from KV before composing the cache key for the catalog HTML. Origin responses now carry an explicit Cache-Tag: price:${productId}:${version} header so we can purge by tag if we ever need to nuke a specific product without touching everything else.

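// workers/catalog/src/handler.ts
export default {
  async fetch(req: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(req.url);
    const productId = url.pathname.split("/").pop();
    // The Cache API only stores GET responses, so pass everything else through.
    if (!productId || req.method !== "GET") return env.ORIGIN.fetch(req);

    const idx = (await env.KV_PRICE_IDX.get(`price:idx:${productId}`, "json")) as
      | { version: number }
      | null;

    // The Cache API keys entries by request URL, so the product and version go
    // into a synthetic URL instead of a bare "catalog:..." string.
    const version = idx?.version ?? "boot";
    const cacheKey = new Request(`${url.origin}/__cache/catalog/${productId}/v${version}`);
    const cache = caches.default;

    const hit = await cache.match(cacheKey);
    if (hit) return hit;

    const res = await env.ORIGIN.fetch(req);
    if (res.ok) {
      // Store a copy in the background so the client is not kept waiting.
      ctx.waitUntil(cache.put(cacheKey, res.clone()));
    }
    return res;
  },
};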
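
For the purge-by-tag escape hatch mentioned above, the request could be built along these lines. This is a sketch under assumptions: it targets Cloudflare’s zone-level purge_cache endpoint with a tags array in the body, the zone ID and API token are placeholders, and priceCacheTag and buildPurgeByTagRequest are hypothetical helper names, not part of the original codebase.

```typescript
// Plain description of an HTTP call, kept separate from fetch() so it can be
// inspected or logged before anything is sent.
interface PurgeRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Same tag format as the Cache-Tag header the origin sets.
function priceCacheTag(productId: string, version: number): string {
  return `price:${productId}:${version}`;
}

// Builds (but does not send) a purge-by-tag request for one zone.
// zoneId and apiToken are placeholders supplied by the caller.
function buildPurgeByTagRequest(
  zoneId: string,
  apiToken: string,
  tags: string[],
): PurgeRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ tags }),
    },
  };
}

// Usage: purge a single product at a single version.
const purge = buildPurgeByTagRequest("ZONE_ID", "API_TOKEN", [
  priceCacheTag("course-123", 7),
]);
// To send it: await fetch(purge.url, purge.init);
```

Keeping the request construction apart from the network call makes it easy to dry-run a purge during an incident and see exactly which tags would be hit.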

Last piece: freshness monitoring. The lesson comes from another incident I keep coming back to. A combat-sports tournament platform I was acting CTO at had a rankings page fed by a consumer that read from Kafka and projected into Elasticsearch. The consumer silently stopped writing one Saturday but kept consuming. Eight hours later, the new champion of a publicly visible tournament still showed up as not-the-champion. We learned about it from the athlete tweeting a screenshot at the federation. “The consumer is consuming” is not a health metric. The right metric is freshness: the difference between the source-of-truth updated_at and the derived store’s version_updated_at.

We added a Datadog check that joins prices.updated_at against price_index.version_updated_at every minute and alerts if the p99 gap goes over 60 seconds. We added the same shape at the CDN edge with a small meta tag in the HTML carrying the version_updated_at, scraped by an external freshness probe. If either signal drifts, somebody is paged.
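
The arithmetic behind that check can be sketched as follows. This is an illustration rather than the actual Datadog monitor: the FreshnessRow shape, the helper names, and the nearest-rank percentile are assumptions made here, and the rows are taken to be the result of joining prices against price_index.

```typescript
// One joined row per product. Both timestamps are ISO 8601 strings.
interface FreshnessRow {
  productId: string;
  sourceUpdatedAt: string; // prices.updated_at
  derivedUpdatedAt: string; // price_index.version_updated_at
}

// Lag of the derived store behind the source, in seconds.
// Clamped at zero so a derived timestamp that is ahead is not negative lag.
function gapSeconds(row: FreshnessRow): number {
  const ms = Date.parse(row.sourceUpdatedAt) - Date.parse(row.derivedUpdatedAt);
  return Math.max(0, ms / 1000);
}

// Nearest-rank percentile: the smallest value such that at least a fraction
// p of the values are less than or equal to it. Returns 0 for an empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// True when the p99 gap exceeds the threshold (60 seconds by default).
function shouldPage(rows: FreshnessRow[], thresholdSeconds = 60): boolean {
  return percentile(rows.map(gapSeconds), 0.99) > thresholdSeconds;
}

// Usage: one product in sync, one product two minutes behind.
const sample: FreshnessRow[] = [
  { productId: "a", sourceUpdatedAt: "2024-01-01T00:00:00Z", derivedUpdatedAt: "2024-01-01T00:00:00Z" },
  { productId: "b", sourceUpdatedAt: "2024-01-01T00:02:00Z", derivedUpdatedAt: "2024-01-01T00:00:00Z" },
];
const page = shouldPage(sample);
```

With only two rows the p99 is simply the larger gap, so the lagging product trips the alert. The measurement looks at timestamps on both sides rather than consumer throughput, which is the point of the whole check.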

Takeaways from six bad hours

Cache keys are part of your public API. A version source in the key beats any TTL.
Bust the index, not the leaf. Smaller surface area, fewer missed deletes.
Cache-Tag headers at the CDN are how you get out of trouble without a global purge.
Verify pricing at checkout against Postgres no matter how good your cache looks.
Measure freshness, not throughput. A consumer that consumes is not a consumer that writes.
Two-level caches need one version source, not two clocks.

Thanks for reading. If you’ve got thoughts, send them my way.