Idempotency in Distributed Systems

Why idempotency keys, DB-level uniqueness, and consumer dedup tables are the actual contract for safe retries on payments, emails, and webhook handlers.

A creator opened a support ticket on a Tuesday. “All my customers got charged twice this month and the app shows them as having two active subscriptions.” That was the opening line. The product was native in-app billing on the creator economy platform I worked at, the renewal flow was Apple-driven, and the handler did not have an idempotency check on the server-to-server notification. Every retry from Apple created a brand new subscription row. A few thousand customers across dozens of branded apps got double-billed before anyone noticed.

That is the kind of bug that teaches you what idempotency actually is, and what it isn’t.

The one rule I keep coming back to

If a network can deliver your write more than once, you do not get to pretend it can’t. Idempotency is not a “nice to have for payments”. It’s the contract every async edge of your system signs whether you write it down or not. Stripe figured this out a decade ago, Apple and Google retry server-to-server notifications aggressively, every message broker I’ve shipped on (Kafka, RabbitMQ, SQS) is at-least-once by default. If you’re treating receipt of a message as proof the write happened once, you’re going to lose money.

The fix isn’t one thing. It’s three things layered: a key the caller controls, a database constraint that refuses duplicates, and a consumer dedup table for async work. Skip any of the three and you’ll find out which one you skipped at 2 a.m. Pacific.

The Stripe-style header on the edge

The simplest place to start is the HTTP boundary. Caller sends Idempotency-Key, you cache the first response under that key for some window, every retry with the same key returns the same response. No re-execution.

import { BadRequestException, Body, ConflictException, Controller, Headers, Post } from '@nestjs/common';
import { Redis } from 'ioredis';
import { createHash, randomUUID } from 'crypto';
import { PaymentsService } from './payments.service'; // assumed path to this app's own provider

interface ChargeRequest {
  customerId: string;
  amountCents: number;
  currency: string;
}

@Controller('charges')
export class ChargesController {
  constructor(
    private readonly redis: Redis,
    private readonly payments: PaymentsService,
  ) {}

  @Post()
  async create(
    @Headers('idempotency-key') key: string | undefined,
    @Body() body: ChargeRequest,
  ) {
    if (!key) {
      // A missing header is a malformed request (400), not a state conflict (409).
      throw new BadRequestException('Idempotency-Key header is required');
    }

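    const bodyHash = createHash('sha256').update(JSON.stringify(body)).digest('hex');
    const lockKey = `idem:charge:${key}`;
    // SET ... NX claims the key atomically; only the first request gets 'OK'.
    const claim = await this.redis.set(lockKey, `pending:${bodyHash}`, 'EX', 86400, 'NX');

    if (claim !== 'OK') {
      const stored = await this.redis.get(lockKey);
      if (!stored) throw new ConflictException('idem state lost, retry later');
      // Stored value is `pending:<hash>` or `done:<hash><json>`. In both states the hash
      // is the 64 hex chars right after the first colon, so compare that slice. An
      // endsWith() check would reject every retry of a completed request.
      const hashStart = stored.indexOf(':') + 1;
      if (stored.slice(hashStart, hashStart + 64) !== bodyHash) {
        // same key, different body. caller bug. do not silently accept.
        throw new ConflictException('Idempotency-Key reuse with different payload');
      }
      if (stored.startsWith('pending:')) {
        throw new ConflictException('original request still in flight');
      }
      return JSON.parse(stored.slice('done:'.length + 64));
    }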

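    // Derive the request id from the caller's key instead of minting a random one, so a
    // retry maps to the same charge_attempts.request_id even if the Redis entry is gone.
    // Formatting the first 32 hex chars of a SHA-256 as 8-4-4-4-12 is an assumption made
    // here so the value fits a uuid column.
    const h = createHash('sha256').update(key).digest('hex');
    const requestId = [h.slice(0, 8), h.slice(8, 12), h.slice(12, 16), h.slice(16, 20), h.slice(20, 32)].join('-');

    const result = await this.payments.charge({ ...body, requestId });
    await this.redis.set(lockKey, `done:${bodyHash}${JSON.stringify(result)}`, 'EX', 86400);
    return result;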
  }
}

Two details that get missed. One, you have to hash the request body and compare it against the stored hash. Same key with a different body is almost always a caller bug, and silently returning the old response is worse than failing loud. Two, the cache is not a substitute for the next layer, it’s the first layer.
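One caveat on that body hash: JSON.stringify is sensitive to property order, so two clients that serialize the same payload in a different order would produce different hashes and trip the “different payload” check. Here is a minimal sketch of an order-insensitive hash. The names canonicalize and stableBodyHash are hypothetical helpers, not part of the controller above.

```typescript
import { createHash } from 'crypto';

// Recursively rebuild objects with sorted keys so that logically equal
// payloads serialize to the same string. Arrays keep their order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const k of Object.keys(obj).sort()) {
      sorted[k] = canonicalize(obj[k]);
    }
    return sorted;
  }
  return value;
}

// Hypothetical helper: SHA-256 over the canonical JSON form of the body.
function stableBodyHash(body: unknown): string {
  return createHash('sha256')
    .update(JSON.stringify(canonicalize(body)))
    .digest('hex');
}
```

Whether this is worth it depends on who your callers are. If every client goes through one SDK that serializes consistently, the plain JSON.stringify hash may be enough.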

The database is the only honest layer

Redis can drop your key. The pod handling the request can die between writing to Redis and writing to the DB. The actual source of truth is a unique constraint on something that maps one-to-one to the operation.

For payments and subscriptions, that something is usually a tuple. For Apple notifications it ended up being (apple_original_transaction_id, notification_uuid). For Stripe webhooks it’s event.id. For internal commands I tend to add a request_id uuid column and slap a unique index on it.
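As a rough illustration of choosing the key per source, the sketch below maps each inbound event to a single string. The event shapes, the dedupKeyFor name, and the stripe: and internal: prefixes are assumptions for the example. Only the apple: format mirrors the renewal handler later in this post.

```typescript
// Hypothetical event shapes: only the fields needed to build a dedup key.
type InboundEvent =
  | { source: 'apple'; originalTransactionId: string; notificationUUID: string }
  | { source: 'stripe'; eventId: string }
  | { source: 'internal'; requestId: string };

// One natural key per operation. The source prefix keeps ids from different
// upstreams from colliding if they share a table.
function dedupKeyFor(e: InboundEvent): string {
  switch (e.source) {
    case 'apple':
      return `apple:${e.originalTransactionId}:${e.notificationUUID}`;
    case 'stripe':
      return `stripe:${e.eventId}`;
    case 'internal':
      return `internal:${e.requestId}`;
  }
}
```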

CREATE TABLE charge_attempts (
  id              bigserial PRIMARY KEY,
  request_id      uuid NOT NULL,
  customer_id     bigint NOT NULL,
  amount_cents    integer NOT NULL,
  currency        text NOT NULL,
  status          text NOT NULL,
  provider_ref    text,
  created_at      timestamptz NOT NULL DEFAULT now(),
  CONSTRAINT charge_attempts_request_id_uniq UNIQUE (request_id)
);

CREATE INDEX charge_attempts_customer_created_idx
  ON charge_attempts (customer_id, created_at DESC);

Then the actual insert uses ON CONFLICT DO NOTHING and reads back what’s there. The point is the DB tells you “I already have this”, not the application guessing.

async function recordChargeAttempt(
  db: DataSource,
  attempt: ChargeAttempt,
): Promise<{ created: boolean; row: ChargeAttempt }> {
  const result = await db.query(
    `INSERT INTO charge_attempts
       (request_id, customer_id, amount_cents, currency, status)
     VALUES ($1, $2, $3, $4, 'pending')
     ON CONFLICT (request_id) DO NOTHING
     RETURNING *`,
    [attempt.requestId, attempt.customerId, attempt.amountCents, attempt.currency],
  );

  if (result.length === 1) {
    return { created: true, row: result[0] };
  }

  const existing = await db.query(
    `SELECT * FROM charge_attempts WHERE request_id = $1`,
    [attempt.requestId],
  );
  return { created: false, row: existing[0] };
}

When created comes back false, you do not call Stripe again. You return whatever provider_ref is on the row, or you tell the caller it’s still pending. The DB ate the duplicate.
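That decision can be written down as a small pure function. This is a sketch under assumptions: AttemptRow only mirrors the two columns that matter here, and nextStep and the action names are made up for illustration.

```typescript
// Hypothetical slice of a charge_attempts row.
interface AttemptRow {
  status: string;
  provider_ref: string | null;
}

type NextStep =
  | { action: 'call_provider' }
  | { action: 'return_existing'; providerRef: string }
  | { action: 'report_pending' };

// Only the request that actually inserted the row is allowed to call the provider.
function nextStep(created: boolean, row: AttemptRow): NextStep {
  if (created) return { action: 'call_provider' };
  if (row.provider_ref !== null) {
    // A previous attempt already reached the provider: hand back its reference.
    return { action: 'return_existing', providerRef: row.provider_ref };
  }
  // Row exists but no provider reference yet: the first attempt is still running.
  return { action: 'report_pending' };
}
```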

Consumer dedup for the async edges

HTTP is easy mode. The hard part is the consumer side. Kafka, SQS, SNS, every webhook source you don’t control. Apple retries. Google retries. Postmark retries when your endpoint takes too long to respond. The pattern that survived production for me is a processed_messages table per consumer group, with a unique index on the natural dedup key, written inside the same transaction as the side effect.

import { Injectable } from '@nestjs/common';
import { DataSource } from 'typeorm';

interface AppleRenewalNotification {
  notificationUUID: string;
  originalTransactionId: string;
  productId: string;
  expiresDateMs: number;
}

@Injectable()
export class AppleRenewalHandler {
  constructor(private readonly db: DataSource) {}

  async handle(n: AppleRenewalNotification): Promise<void> {
    await this.db.transaction(async (tx) => {
      const dedupKey = `apple:${n.originalTransactionId}:${n.notificationUUID}`;

      const claim = await tx.query(
        `INSERT INTO processed_messages (dedup_key, consumer, received_at)
         VALUES ($1, 'apple_renewals', now())
         ON CONFLICT (dedup_key) DO NOTHING
         RETURNING id`,
        [dedupKey],
      );

      if (claim.length === 0) {
        // Apple resent it. We already wrote the side effect on the first delivery.
        return;
      }

      await tx.query(
        `INSERT INTO creator_subscriptions
           (apple_original_transaction_id, product_id, expires_at, status)
         VALUES ($1, $2, to_timestamp($3 / 1000.0), 'active')
         ON CONFLICT (apple_original_transaction_id) DO UPDATE
           SET expires_at = EXCLUDED.expires_at,
               product_id = EXCLUDED.product_id,
               status = 'active',
               updated_at = now()`,
        [n.originalTransactionId, n.productId, n.expiresDateMs],
      );
    });
  }
}

Two things make this work and nothing else does. The dedup row and the side effect are in the same transaction, so a crash between them rolls both back and the redelivery starts from a clean slate. And the upsert on creator_subscriptions is itself idempotent on the natural key, so even if you mess up the dedup table once, the subscription row is still safe.
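To see why the shared transaction matters, here is a toy in-memory model. FakeDb is a stand-in invented for this sketch, not TypeORM or Postgres. It copies state, runs the work, and publishes the copy only if nothing threw. A simulated crash after the dedup claim leaves no trace, so the redelivery is processed normally.

```typescript
// In-memory stand-in for a database with all-or-nothing transactions.
class FakeDb {
  processed = new Set<string>();
  subscriptions = new Map<string, number>(); // originalTransactionId -> expiresDateMs

  transaction<T>(work: (tx: FakeDb) => T): T {
    const draft = new FakeDb();
    draft.processed = new Set(this.processed);
    draft.subscriptions = new Map(this.subscriptions);
    const result = work(draft); // if this throws, nothing below runs
    this.processed = draft.processed;
    this.subscriptions = draft.subscriptions;
    return result;
  }
}

interface Renewal {
  notificationUUID: string;
  originalTransactionId: string;
  expiresDateMs: number;
}

type Outcome = 'applied' | 'duplicate';

function handleRenewal(db: FakeDb, n: Renewal, crashAfterClaim = false): Outcome {
  return db.transaction<Outcome>((tx) => {
    const dedupKey = `apple:${n.originalTransactionId}:${n.notificationUUID}`;
    if (tx.processed.has(dedupKey)) return 'duplicate'; // redelivery, nothing to do
    tx.processed.add(dedupKey);
    if (crashAfterClaim) throw new Error('simulated crash between the two writes');
    tx.subscriptions.set(n.originalTransactionId, n.expiresDateMs);
    return 'applied';
  });
}
```

If the claim were committed on its own before the side effect, the crashed delivery would leave a dedup row behind and every retry would be skipped as a duplicate, with no subscription ever written.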

This is also where I’ll repeat the lesson from the war story up top. When the upstream is a third-party retry (Apple, Google, Stripe, Twilio), never trust the response of your own write as evidence the work happened. Read back from the upstream’s source of truth, or write your dedup row first.

A second war story, different shape

About a year before the duplicate-subscription bug, our branded-mobile-app submission pipeline at the same creator platform hit a related class of problem. Hundreds of native iOS releases per week, automated through Fastlane plus Rails plus GitHub Actions. Apple’s App Store Connect API started silently throttling our submission endpoint. It returned 200 OK with a normal-looking body. The submission was being dropped on their side.

Our pipeline already auto-retried on 5xx, so our first move was to extend the retry to submissions sitting in a “stuck” state. That made it worse. Apple started seeing what looked like duplicate submissions, customers ended up with two competing review records, and metadata diverged. The retry was treating 200 OK as truth.

The real fix was the same shape as the one above. Pulled the auto-retry. Added a circuit breaker that verified submission state with a separate GET against the App Store Connect resource, not via the response of the POST. Wrote a one-shot reconciliation job using an idempotency key derived from app_id + version + git_sha to dedupe pending reviews against Apple’s source of truth and merge metadata where it had diverged. Three days of slipped releases for a chunk of creators. Lesson, again, was that the response of a write to a human-moderated upstream is never the source of truth. The dedup key is.
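The grouping step of a reconciliation job like that can be sketched as follows. This is not the original job. PendingReview is a hypothetical record shape, the key is a SHA-256 over app_id, version, and git_sha, and keeping the earliest submission as the canonical one is an assumption made for the example.

```typescript
import { createHash } from 'crypto';

// Hypothetical shape of a pending review record read back from the upstream.
interface PendingReview {
  reviewId: string;
  appId: string;
  version: string;
  gitSha: string;
  submittedAt: number; // epoch milliseconds
}

// Same app, version, and commit means the same logical submission.
function submissionKey(r: Pick<PendingReview, 'appId' | 'version' | 'gitSha'>): string {
  return createHash('sha256')
    .update([r.appId, r.version, r.gitSha].join('\n'))
    .digest('hex');
}

interface ReviewGroup {
  keep: PendingReview;
  duplicates: PendingReview[];
}

// Group reviews by key, oldest first, so the first one seen becomes `keep`.
function groupDuplicates(reviews: PendingReview[]): Map<string, ReviewGroup> {
  const groups = new Map<string, ReviewGroup>();
  const oldestFirst = [...reviews].sort((a, b) => a.submittedAt - b.submittedAt);
  for (const r of oldestFirst) {
    const key = submissionKey(r);
    const group = groups.get(key);
    if (group) {
      group.duplicates.push(r);
    } else {
      groups.set(key, { keep: r, duplicates: [] });
    }
  }
  return groups;
}
```

Because the key is derived only from the inputs, running the job twice over the same upstream state yields the same groups, which is what makes a one-shot reconciliation safe to rerun.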

Where I see teams get this wrong

Three patterns. First, putting the idempotency check in application code without a DB unique constraint underneath. Race conditions between two pods reading the same not-yet-written key will eat you alive. Second, using the request body as the key. Two legitimate “send the same email twice” requests should both succeed, that’s a feature, not a duplicate. The caller picks the key, you don’t. Third, treating “we already responded 200” as proof of work. You haven’t proved anything until the side effect is committed and the dedup row is committed in the same transaction.
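The first pattern is easy to reproduce on paper. The sketch below uses a hypothetical in-memory Store and replays the two pods’ steps in the order a race would interleave them: both check before either writes. The check-then-insert version runs the side effect twice, while the atomic claim, modeling INSERT ... ON CONFLICT DO NOTHING, runs it once.

```typescript
// In-memory stand-in for a table, with and without a unique constraint.
class Store {
  rows: string[] = [];

  exists(key: string): boolean {
    return this.rows.includes(key);
  }

  // No constraint: duplicates are accepted silently.
  insert(key: string): void {
    this.rows.push(key);
  }

  // Models INSERT ... ON CONFLICT DO NOTHING: check and write are one step.
  insertIfAbsent(key: string): boolean {
    if (this.rows.includes(key)) return false;
    this.rows.push(key);
    return true;
  }
}

// Application-level check: pod B checks before pod A has written.
function racyCheckThenInsert(store: Store, key: string): number {
  let sideEffects = 0;
  const podASawKey = store.exists(key);
  const podBSawKey = store.exists(key);
  if (!podASawKey) { store.insert(key); sideEffects++; }
  if (!podBSawKey) { store.insert(key); sideEffects++; }
  return sideEffects;
}

// Constraint-level claim: whichever pod inserts first wins, the other gets false.
function atomicClaim(store: Store, key: string): number {
  let sideEffects = 0;
  if (store.insertIfAbsent(key)) sideEffects++; // pod A
  if (store.insertIfAbsent(key)) sideEffects++; // pod B
  return sideEffects;
}
```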

Takeaways

Idempotency is a three-layer thing: caller-supplied key, DB unique constraint, consumer dedup table. Skip a layer and you’ll regret it.
The caller picks the key. Hashing the body to “generate” one silently merges separate, legitimate requests that happen to carry the same payload.
Same key + different body is a caller bug. Reject loud.
Dedup row and side effect live in the same transaction. Otherwise you’re guessing.
For third-party retries (Apple, Google, Stripe, Twilio), never trust your own response as proof. Read upstream state.
Upserts on natural keys (ON CONFLICT ... DO UPDATE) are your friend even when your dedup layer is correct, because someday it won’t be.

Thanks for reading. If you’ve got thoughts, send them my way.