Our Redis Cluster Split-Brain Incident

A short story about an ElastiCache partition that oversold inventory and corrupted sessions, and the rule it left behind: the cache is never the source of truth.

It was a Wednesday afternoon at the creator economy platform I worked at. I was deep in a different squad’s PR when the war room channel started moving. Two creators had hit “buy” on the same coaching slot at the same minute and both got a confirmation. A third creator’s session had silently flipped to a logged-out state for half her browser tabs. By the time I joined, our ElastiCache Redis cluster was showing two primaries on the dashboard. Not the kind of thing you want to see at 3:47 p.m. Pacific.

We use Redis for the usual stuff. Hot session blobs. A small inventory counter cache for coaching slots. Rate limiter counters. None of it the system of record. Postgres on Aurora was the source of truth. The cache was the fast lane. That afternoon the fast lane lied.

What actually happened in the cluster

A network partition between two AZs lasted longer than the failover timer. ElastiCache promoted a replica in the healthy AZ. The old primary, isolated but still up, kept accepting writes from application pods on its side. For about 90 seconds we had two primaries each accepting writes. When the partition healed, the cluster reconciled by picking one side and dropping the other. Whatever was written to the loser was gone.

Inventory counters were the loud failure. The “slot remaining” counter sat in Redis as a decrement-on-purchase value with a Postgres backstop. Two pods on opposite sides of the partition each saw remaining: 1, each ran a DECR, and each told checkout “you got it”. Both purchases went through. Refunds and a public apology followed.
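
To make the race concrete, here is a toy model, not our production code. Two in-memory maps stand in for the two primaries, and `decrAndCheck` is a hypothetical helper that mimics a "DECR, then sell if the result is not negative" purchase path. The slot key is made up for illustration.

```typescript
// Toy model of the race. Each Map stands in for one primary during the
// partition; nothing here talks to a real Redis.
type Primary = Map<string, number>;

// Hypothetical helper: decrement, then report whether the sale may proceed.
function decrAndCheck(primary: Primary, counterKey: string): boolean {
  const next = (primary.get(counterKey) ?? 0) - 1;
  primary.set(counterKey, next);
  return next >= 0;
}

const slotKey = "inv:slot-1";

// Both sides start from the same replicated value.
const oldPrimary: Primary = new Map<string, number>([[slotKey, 1]]);
const promotedReplica: Primary = new Map<string, number>([[slotKey, 1]]);

// One purchase lands on each side of the partition.
const soldOnOldSide = decrAndCheck(oldPrimary, slotKey);
const soldOnNewSide = decrAndCheck(promotedReplica, slotKey);

// When the partition heals, one side wins wholesale and the other side's
// writes are dropped. Here the promoted replica wins.
const counterAfterHeal = promotedReplica.get(slotKey);

console.log({ soldOnOldSide, soldOnNewSide, counterAfterHeal });
```

Both calls report a sale, and the surviving counter reads 0, as if only one purchase had happened. Nothing in the cache records that the second sale ever existed.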

Sessions were the quiet failure. We were storing the session blob in Redis with a TTL and treating Redis as the truth. After the partition collapsed, sessions on the losing primary just vanished. A chunk of users got logged out mid-flow. A smaller chunk got a stale session from a key that had been touched on both sides and reconciled to the older value, which presented as “I’m logged in as someone else’s account” for about 40 seconds before our auth middleware caught it. Not a data leak. Looked exactly like one, which is bad enough.

The first wrong fix we tried

First instinct was operational. Force a failover back to the original primary and call it a day. We did. It cleared the dual-primary state but did nothing about the data we’d already lost on the losing side. Inventory counters in Redis were now confidently wrong, and there was no way to know which keys had been touched on the wrong side without going back to Postgres.

Second instinct was to extend TTLs and dump the rate limiter counters so they’d rebuild. That made the inventory problem worse. Longer TTLs meant the wrong counts stuck around longer.

If the cache is wrong and you trust the cache, you’re going to be wrong faster.

The real fix that stuck

Two things in parallel.

First, we drained the partitioned half. We failed over cleanly, then ran a one-shot job that walked every inventory key and rebuilt it from Postgres. The job is boring, but it has to exist before you need it.

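import { Redis } from "ioredis";
import { db } from "./db";

const redis = new Redis(process.env.REDIS_URL!);
const INVENTORY_TTL_SECONDS = 300;

// Overwrites each cached inventory counter with the value from Postgres.
async function rebuildInventoryFromTruth(productIds: string[]) {
  for (const id of productIds) {
    const row = await db
      .selectFrom("inventory")
      .select(["product_id", "remaining", "updated_at"])
      .where("product_id", "=", id)
      .executeTakeFirst();

    // No row in Postgres: nothing to rebuild from, so skip.
    if (!row) continue;

    const key = `inv:${row.product_id}`;
    const pipe = redis.pipeline();
    // Give both keys the same TTL so the timestamp never outlives its counter.
    pipe.set(key, row.remaining, "EX", INVENTORY_TTL_SECONDS);
    pipe.set(`${key}:src_updated_at`, row.updated_at.toISOString(), "EX", INVENTORY_TTL_SECONDS);

    // exec() reports per-command errors in its result array instead of throwing.
    const results = await pipe.exec();
    const failed = results?.find(([err]) => err);
    if (failed) throw failed[0];
  }
}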

Second, the structural fix. The inventory path stopped treating Redis as authoritative. Every purchase now takes a Postgres advisory lock keyed on the product and does the decrement inside a transaction. Redis still holds the hot read for the listing page, but the write path doesn’t trust it.

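import { sql } from "kysely";

// Minimal stand-in so the snippet is self-contained; the real class may differ.
class SoldOutError extends Error {
  constructor(productId: string) {
    super(`product ${productId} is sold out`);
  }
}

// Reuses `db` and `redis` from the rebuild job above.
export async function reserveSlot(productId: string, userId: string) {
  const remaining = await db.transaction().execute(async (trx) => {
    // Serialize purchases per product; the lock is released at commit or rollback.
    await sql`SELECT pg_advisory_xact_lock(hashtext(${productId}))`.execute(trx);

    const row = await trx
      .selectFrom("inventory")
      .select(["remaining"])
      .where("product_id", "=", productId)
      .forUpdate()
      .executeTakeFirst();

    if (!row || row.remaining <= 0) {
      throw new SoldOutError(productId);
    }

    await trx
      .updateTable("inventory")
      .set({ remaining: row.remaining - 1 })
      .where("product_id", "=", productId)
      .execute();

    await trx
      .insertInto("reservations")
      .values({ product_id: productId, user_id: userId })
      .execute();

    return row.remaining - 1;
  });

  // Refresh the cached read only after the commit, and never fail the purchase over it.
  await redis
    .set(`inv:${productId}`, remaining, "EX", 300)
    .catch((err) => console.warn("inventory cache refresh failed", err));
}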

Sessions moved off Redis-as-truth too. The session record lives in Postgres now, and Redis caches the lookup with a short TTL and a version field. When the cached version doesn’t match the database version, we evict and re-read. Slower than pure Redis. Not by enough to care.
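
A minimal sketch of that version check, under loud assumptions: the two maps are in-memory stand-ins for Postgres and Redis, the calls are synchronous for brevity where real clients would be async, the TTL is left out, and every name here (`SessionRecord`, `getSession`, the field names) is hypothetical rather than lifted from our codebase. In a real system the database read on the hot path would fetch only the version, which is much cheaper than the full blob.

```typescript
// Sketch only: in-memory stand-ins, synchronous for brevity.
interface SessionRecord {
  sessionId: string;
  userId: string;
  version: number;
}

const sessionTable = new Map<string, SessionRecord>(); // stand-in for Postgres
const sessionCache = new Map<string, SessionRecord>(); // stand-in for Redis

function getSession(sessionId: string): SessionRecord | undefined {
  // In practice this would be a cheap "SELECT version" rather than the full row.
  const truth = sessionTable.get(sessionId);
  if (!truth) {
    // No row in the database means no session, whatever the cache says.
    sessionCache.delete(sessionId);
    return undefined;
  }

  const cached = sessionCache.get(sessionId);
  if (cached && cached.version === truth.version) {
    return cached; // cache agrees with the database
  }

  // Missing or stale: evict, re-read from truth, and recache a copy.
  sessionCache.delete(sessionId);
  sessionCache.set(sessionId, { ...truth });
  return truth;
}

// A stale entry left behind by the losing side of a partition.
sessionTable.set("s1", { sessionId: "s1", userId: "alice", version: 2 });
sessionCache.set("s1", { sessionId: "s1", userId: "mallory", version: 1 });

const resolved = getSession("s1");
console.log(resolved?.userId, sessionCache.get("s1")?.version);
```

The stale entry never reaches the caller. The lookup returns the database's record and the cache is overwritten with the current version.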

Last piece, the cluster config. We changed write behavior so a primary that loses contact with its replicas stops accepting writes. The default ElastiCache config is friendlier than that, and friendly is the wrong tradeoff for a write path that pays out money.

resource "aws_elasticache_replication_group" "sessions" {
  replication_group_id       = "sessions"
  description                = "session and hot read cache"
  engine                     = "redis"
  engine_version             = "7.1"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 3
  replicas_per_node_group    = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true
  parameter_group_name       = aws_elasticache_parameter_group.sessions.name
}

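resource "aws_elasticache_parameter_group" "sessions" {
  name   = "sessions-params"
  family = "redis7"

  # Required in a custom parameter group when num_node_groups is set (cluster mode).
  parameter {
    name  = "cluster-enabled"
    value = "yes"
  }

  # Stop serving requests if any hash slot is left without a reachable owner.
  parameter {
    name  = "cluster-require-full-coverage"
    value = "yes"
  }

  # Refuse writes on a primary that has no healthy replica attached.
  parameter {
    name  = "min-replicas-to-write"
    value = "1"
  }
}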

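min-replicas-to-write 1 means a primary refuses writes unless at least one replica is connected and has acknowledged it within min-replicas-max-lag seconds (10 by default in Redis). That narrows the window in which an isolated primary keeps taking writes rather than closing it entirely, which is one more reason the money path goes through Postgres. We’d rather a checkout fail loudly than two checkouts succeed quietly.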
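
On the application side, "fail loudly" could look something like the sketch below. When the setting is not satisfied, Redis rejects the write with an error that begins with NOREPLICAS. The sketch assumes the client surfaces that server text in the error's `message`; `CacheWriteRefusedError` and `writeOrFailLoudly` are hypothetical names, and the wrapper is synchronous only to keep the example short.

```typescript
// Hypothetical error type for the checkout path; not from our codebase.
class CacheWriteRefusedError extends Error {
  constructor(cause: string) {
    super(`cache primary refused the write: ${cause}`);
    this.name = "CacheWriteRefusedError";
  }
}

// Redis replies with an error starting "NOREPLICAS" when
// min-replicas-to-write is not satisfied.
function isNoReplicasError(err: unknown): boolean {
  return err instanceof Error && err.message.startsWith("NOREPLICAS");
}

// Wrap a cache write so a refused write is surfaced instead of swallowed.
function writeOrFailLoudly(write: () => void): void {
  try {
    write();
  } catch (err) {
    if (isNoReplicasError(err)) {
      throw new CacheWriteRefusedError((err as Error).message);
    }
    throw err; // anything else is not ours to reinterpret
  }
}

// Simulate an isolated primary rejecting a write.
let caughtErrorName = "";
try {
  writeOrFailLoudly(() => {
    throw new Error("NOREPLICAS Not enough good replicas to write.");
  });
} catch (err) {
  caughtErrorName = (err as Error).name;
}
console.log(caughtErrorName);
```

The point of a dedicated error type is that alerting and the checkout handler can tell "the cache is protecting itself" apart from an ordinary timeout.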

What it cost and what stuck

About 22 minutes of mixed-state weirdness before we got the failover clean. A handful of double-purchased slots that needed manual refunds. A few hundred users logged out for a window. No data leak. A public note on the status page. The runbook now opens with one literal sentence: if two primaries show up on the cluster dashboard, drain the partitioned half before you touch anything else.

I think about another incident from a combat-sports tournament platform I was acting CTO at a few years before. The rankings page. A consumer projecting events from Kafka into Elasticsearch silently stopped writing, but kept consuming. Eight hours later the new champion of a tournament still showed up as not-the-champion on a publicly visible page. We learned about it from the athlete tweeting a screenshot at the federation. Different stack. Same lesson. A derived store, whether it’s Redis, Elasticsearch, or a materialized view, is not the truth. The truth is the row in Postgres. Everything else is a view on it, and views drift.

Takeaways from the incident

The cache is never the source of truth. If money or auth depends on it, that path goes to Postgres.
Failover defaults are tuned for availability, not for correctness. Set min-replicas-to-write if writes pay out anything.
Rebuild-from-truth jobs are not optional. Write them before you need them.
Trust upstreams by reading them back, not by trusting the write response.
Measure freshness on every derived store. “The consumer is consuming” is not a health metric.
When the cache looks wrong, do not purge. Find the wrong code first.
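
For the freshness takeaway, a rough sketch of what "measure it" can mean. It assumes the derived store keeps the source row's timestamp next to each value, the way the rebuild job writes a src_updated_at key. `measureFreshness`, its parameters, and the one-minute budget are illustrative assumptions, not our actual monitoring code.

```typescript
// Sketch: freshness of a derived value, measured against the source row.
interface Freshness {
  lagMs: number; // how far the derived copy trails the source row
  stale: boolean; // true when the lag exceeds the allowed budget
}

function measureFreshness(
  sourceUpdatedAt: Date,
  derivedSrcUpdatedAt: Date | null,
  maxLagMs: number,
): Freshness {
  if (derivedSrcUpdatedAt === null) {
    // Nothing in the derived store yet: treat it as infinitely stale.
    return { lagMs: Number.POSITIVE_INFINITY, stale: true };
  }
  const lagMs = Math.max(
    0,
    sourceUpdatedAt.getTime() - derivedSrcUpdatedAt.getTime(),
  );
  return { lagMs, stale: lagMs > maxLagMs };
}

// The source row changed ten minutes after the derived copy was last written.
const freshness = measureFreshness(
  new Date("2024-01-01T00:10:00Z"),
  new Date("2024-01-01T00:00:00Z"),
  60_000, // illustrative budget: one minute
);
console.log(freshness);
```

A check like this would have flagged the rankings projection hours before the screenshot did, because it looks at what the derived store contains rather than at whether the consumer is still running.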

Thanks for reading. If you’ve got thoughts, send them my way.