The Idempotency Key We Forgot

A duplicate-charge incident, the two-phase payment fix, and why idempotency keys belong in the contract, not in the optimization backlog.

A creator opened a support ticket on a Tuesday morning at the creator economy platform I worked at. The message was short. “All my customers got charged twice this month and the app shows them as having two active subscriptions.” I was on a different squad that week, but I’d built half the native billing pipeline a year earlier, so the thread landed in my DMs within ten minutes.

That ticket turned into the one production lesson I keep telling people about. Idempotency keys are not an optimization. They are part of the contract you sign with any third party that retries. Skip them and you don’t have a payment system, you have a duplicate-charge generator with good branding.

What actually broke that morning

The setup was simple enough. Native in-app purchase receipts validated server-side, subscription state mirrored in a creator_subscriptions table, a Rails endpoint receiving server-to-server renewal notifications from the store. Pretty standard for any payments integration, Stripe or Apple or otherwise. The endpoint did the work inline, returned 200 OK, moved on.

Then a renewal notification came in. Our handler took 31 seconds, mostly waiting on a slow downstream call. The store’s webhook deadline was 30 seconds. It retried. Our endpoint ran again, hit the same code path, inserted a second creator_subscriptions row, returned 200 OK. The retry was, from the store’s point of view, the same event delivered a second time. From ours, it looked like a brand-new one, because we had no key, no dedupe, no contract. A few thousand customers across dozens of branded apps got billed twice that cycle. The cards had already been charged before we even knew about the duplicate row.
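To make the failure mode concrete, here is a minimal Ruby sketch of a store that redelivers a notification when the handler misses the deadline. Everything in it (NaiveHandler, KeyedHandler, deliver_with_retry, the in-memory rows array) is hypothetical and stands in for the real endpoint and table. It is not our production code.

```ruby
# Hypothetical in-memory stand-ins for the endpoint and the creator_subscriptions table.
Notification = Struct.new(:notification_uuid, :customer_id)

# Mirrors the original bug: every delivery inserts a row, retries included.
class NaiveHandler
  attr_reader :rows

  def initialize
    @rows = []
  end

  def call(notification)
    @rows << { customer_id: notification.customer_id }
    :ok
  end
end

# Same handler, but it remembers which notifications it has already applied.
class KeyedHandler
  attr_reader :rows

  def initialize
    @rows = []
    @seen = {}
  end

  def call(notification)
    return :ok if @seen[notification.notification_uuid] # retry of a known event: no-op

    @seen[notification.notification_uuid] = true
    @rows << { customer_id: notification.customer_id }
    :ok
  end
end

# The "store": if our response took longer than the deadline, it sends the same
# notification again. Elapsed time is passed in rather than measured, to keep
# the sketch deterministic.
def deliver_with_retry(handler, notification, elapsed_seconds:, deadline_seconds: 30)
  handler.call(notification)
  handler.call(notification) if elapsed_seconds > deadline_seconds
end

renewal = Notification.new("uuid-1", "cust-42")

naive = NaiveHandler.new
deliver_with_retry(naive, renewal, elapsed_seconds: 31)
puts naive.rows.size # prints 2: the retry became a second subscription row

keyed = KeyedHandler.new
deliver_with_retry(keyed, renewal, elapsed_seconds: 31)
puts keyed.rows.size # prints 1: the retry was recognized and skipped
```

The in-memory @seen hash only shows the shape of the check. In a real deployment with more than one process, that check has to be enforced by the database.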

The wrong fix that shipped first

Within an hour someone pushed a frontend “fix”. Show only the latest subscription row per customer. Hide the dupe. Done.

Yeah. The card had still been charged. The store wasn’t refunding anything just because we hid the row in our UI. The creator escalated to legal that afternoon. I’m not proud of how long we sat on the visible-only patch before pulling it.

The actual fix, in two phases

The real fix had two parts and one principle. The parts: make the endpoint cheap by pushing the work onto a queue, and make that queued work idempotent. The principle: don’t trust your own write; read back from the source of truth.

First phase: the endpoint stopped doing work inline. It validated the payload, enqueued a Sidekiq job, and returned 200 OK within five seconds, well inside the store’s 30-second deadline. The store’s retry policy stopped firing because we were never slow anymore.

class StoreNotificationsController < ApplicationController
  skip_before_action :verify_authenticity_token

  def renewal
    payload = StoreNotification.parse!(request.body.read)
    idempotency_key = "#{payload.original_transaction_id}:#{payload.notification_uuid}"

    ProcessRenewalJob.perform_async(payload.to_h, idempotency_key)
    head :ok
  rescue StoreNotification::InvalidSignature
    head :unauthorized
  end
end
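Stripped of Rails and Sidekiq, the shape of that first phase looks roughly like the sketch below. It uses Ruby’s built-in Queue as a stand-in for the job queue. JOBS, receive_notification, drain, and the string-keyed hash payload are names and shapes invented for illustration.

```ruby
# A stand-in for the Sidekiq queue: Ruby's thread-safe Queue from the core library.
JOBS = Queue.new

# Hypothetical endpoint body: validate cheaply, enqueue, acknowledge.
# Nothing slow happens here, so the caller's deadline is never at risk.
def receive_notification(payload)
  txn_id = payload["original_transaction_id"]
  uuid = payload["notification_uuid"]
  return :bad_request unless txn_id && uuid

  JOBS << { "payload" => payload, "idempotency_key" => "#{txn_id}:#{uuid}" }
  :ok
end

# Hypothetical worker loop: pops everything currently queued and hands each job
# to the block. The slow downstream call would live inside that block.
def drain(queue)
  processed = []
  until queue.empty?
    job = queue.pop
    yield job if block_given?
    processed << job["idempotency_key"]
  end
  processed
end
```

If the store does retry, receive_notification simply enqueues a second job carrying the same key. That is fine: the endpoint’s job is to acknowledge quickly, and the job is where duplicates get thrown away.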

Second phase: the job. The dedupe lives in the database, not in application code, because application-level checks lose to two pods racing on the same notification.

class ProcessRenewalJob
  include Sidekiq::Job
  sidekiq_options queue: :billing, retry: 5

  def perform(payload, idempotency_key)
    StoreNotification.transaction do
      # Insert-or-find on the unique idempotency_key index. Only the job that
      # actually created the row goes on to touch the subscription.
      record = StoreNotification.create_with(payload: payload)
                                .find_or_create_by!(idempotency_key: idempotency_key)

      # No early `return` here: exiting a transaction block non-locally has
      # behaved differently across Rails versions, so branch instead.
      CreatorSubscription.upsert_from_notification!(payload) if record.previously_new_record?
    end
  end
end

The migration is the boring half of the fix and the half that actually saves you.

class AddIdempotencyToStoreNotifications < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
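    # Added as nullable: on PostgreSQL, NOT NULL with no default fails on a
    # table that already has rows. Backfill existing rows, then tighten with
    # change_column_null in a follow-up migration.
    add_column :store_notifications, :idempotency_key, :string
    add_index :store_notifications, :idempotency_key,
              unique: true, algorithm: :concurrently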
  end
end

That’s the contract. Unique constraint at the database level. Two pods can race all day. One wins the insert. The other either finds the existing row and no-ops, or raises RecordNotUnique, gets retried by Sidekiq, and no-ops on the next pass. Either way, only one notification row, only one subscription row.
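The race is easy to simulate without a database. In the sketch below a Mutex-guarded hash plays the role of the unique index, and eight threads play the pods. FakeNotificationsTable, UniqueViolation, and process_renewal are hypothetical names, and the mutex only approximates what a real unique index guarantees.

```ruby
# Hypothetical stand-in for a table with a unique index on idempotency_key.
class UniqueViolation < StandardError; end

class FakeNotificationsTable
  def initialize
    @rows = {}
    @lock = Mutex.new # plays the role of the database enforcing the index atomically
  end

  # Insert succeeds once per key; every later insert raises, like RecordNotUnique.
  def insert!(idempotency_key, payload)
    @lock.synchronize do
      raise UniqueViolation, idempotency_key if @rows.key?(idempotency_key)

      @rows[idempotency_key] = payload
    end
  end

  def count
    @lock.synchronize { @rows.size }
  end
end

# Each "pod" tries to claim the notification; only the winner applies the renewal.
def process_renewal(table, applied, idempotency_key, payload)
  table.insert!(idempotency_key, payload)
  applied << idempotency_key # the side effect we must not repeat
  :applied
rescue UniqueViolation
  :duplicate
end

table = FakeNotificationsTable.new
applied = Queue.new # thread-safe collector for side effects

threads = Array.new(8) do
  Thread.new { process_renewal(table, applied, "txn-1:uuid-1", { "plan" => "monthly" }) }
end
results = threads.map(&:value)

puts results.count(:applied) # prints 1: exactly one pod won the insert
puts applied.size            # prints 1: the renewal was applied once
```

Swap the mutex-guarded insert for a plain "check, then write" and the count can climb above one under load, which is the application-level check losing the race.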

Read after write, against the upstream

The second war story is from the same pipeline, slightly earlier. The native app submission pipeline, Rails plus Python plus Fastlane plus GitHub Actions, had been quietly running for around six months. Hundreds of branded app submissions a week. Boring, in the good way.

Then on a Wednesday, our pending_apple_review Sidekiq queue started backing up. By 2 p.m. Pacific the mobile CX team had eighty-something tickets in. The store’s submission API had silently started throttling us. It was returning 200 OK with a normal-looking body, but the submission was being dropped on their side. The pipeline thought everything was fine.

Our first move was to extend auto-retry beyond 5xx responses so it also fired on submissions stuck in a pending state. That made it dramatically worse. The store started seeing what looked like duplicate submissions, and a chunk of customer apps ended up with two competing review records and conflicting metadata. We were treating 200 OK as truth, when 200 OK was lying to us.

The fix went in within a week. Pull the auto-retry. Add a circuit breaker around the submit step that confirms submission state via a separate GET against the store’s resource, not via the body of the POST. Then a one-shot reconciliation job, again keyed by something stable (app_id + version + git_sha), to dedupe pending reviews against the store’s source of truth and merge metadata where it had diverged. Three days of slipped releases. A lot of unhappy creators. The rule that stuck on the team:

// Never trust your own write to a human-moderated upstream.
// POST returns 200, then read-after-write against the same resource.
async function submitForReview(appId: string, version: string, gitSha: string) {
  const key = `${appId}:${version}:${gitSha}`;

  const post = await store.submissions.create({ appId, version, idempotencyKey: key });
  if (post.status !== 200) throw new SubmitError(post.status);

  // The truth is on the upstream, not in the POST body.
  const truth = await store.submissions.get({ appId, version });
  if (truth.state !== "in_review" && truth.state !== "approved") {
    throw new SubmitNotPersistedError(truth.state);
  }
  return truth;
}
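The one-shot reconciliation job is harder to picture than the read-after-write, so here is a rough Ruby sketch of its core. PendingReview, reconcile, and the shape of the upstream hash are all assumptions made for illustration, as is the choice to let the store’s metadata win on conflict. The real job talked to the store’s API and had more cases than this.

```ruby
# Hypothetical shape of a local pending review, keyed by app_id + version + git_sha.
PendingReview = Struct.new(:app_id, :version, :git_sha, :submitted_at, :metadata) do
  def key
    "#{app_id}:#{version}:#{git_sha}"
  end
end

# Collapse local duplicates down to one review per stable key.
# `upstream` is a hash of key => metadata as reported by the store (assumed shape).
# Where the store has a record, its metadata wins over ours (an assumption).
# Keys the store has never heard of are reported separately so they can be
# resubmitted exactly once.
def reconcile(pending_reviews, upstream)
  kept = []
  missing_upstream = []

  pending_reviews.group_by(&:key).each do |key, dupes|
    survivor = dupes.min_by(&:submitted_at).dup # keep the earliest local attempt
    if upstream.key?(key)
      survivor.metadata = survivor.metadata.merge(upstream[key])
      kept << survivor
    else
      missing_upstream << survivor
    end
  end

  { kept: kept, missing_upstream: missing_upstream }
end
```

Given two local reviews with the same key and one upstream record for it, reconcile returns a single kept review carrying the store’s metadata. Anything listed under missing_upstream is a submission the store dropped.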

Where the idempotency key actually lives

People tend to argue about which layer should own idempotency. I have an opinion. It lives in three places at once and pretending otherwise is how you ship the bug we shipped.

It lives on the client, as a key generated before the first request and reused on every retry of the same logical operation. It lives in the application, threaded through the job payload so a worker handling the retry can recognize it. And it lives in the database, as a unique constraint that ends the argument.

The client and application layers are helpful. The database layer is the contract. If you only have the first two, you’ll find out the day a deploy rolls and two pods both pick up the same retry.
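The client-side piece is the one people most often get subtly wrong, usually by generating a fresh key per attempt. A minimal sketch of the right shape follows, with a hypothetical FakeServer that loses its first few responses after doing the work. charge_with_retries, TransientError, and the charge signature are invented for this example and do not match any particular payment API.

```ruby
require "securerandom"

class TransientError < StandardError; end

# Hypothetical server: applies each key at most once and replays the stored result.
class FakeServer
  attr_reader :charges

  def initialize(fail_first: 0)
    @results = {}
    @charges = 0
    @failures_left = fail_first
  end

  def charge(idempotency_key:, amount_cents:)
    result = (@results[idempotency_key] ||= begin
      @charges += 1
      { charge_id: "ch_#{@charges}", amount_cents: amount_cents }
    end)

    # Simulate the response getting lost after the work was already done.
    if @failures_left > 0
      @failures_left -= 1
      raise TransientError, "response lost"
    end
    result
  end
end

# The key is generated once, before the first attempt, and reused on every retry.
def charge_with_retries(server, amount_cents:, max_attempts: 3)
  key = SecureRandom.uuid
  attempts = 0
  begin
    attempts += 1
    server.charge(idempotency_key: key, amount_cents: amount_cents)
  rescue TransientError
    retry if attempts < max_attempts
    raise
  end
end
```

With FakeServer.new(fail_first: 2), the third attempt succeeds and the server’s charges counter stays at one. Move the SecureRandom.uuid call inside the begin block and the same run produces three charges.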

Takeaways

Idempotency keys are not an optimization. They’re the only thing that makes “at-least-once” upstreams safe.
Return 200 fast. Do the work in a background job keyed by something the upstream will reuse on retry.
Put the unique constraint in the database. Application-level checks lose to races.
After any write to a human-moderated or eventually-consistent upstream, read back against that upstream. The response body is not the truth.
A visible-only fix on a billing bug is worse than no fix. The card has already been charged.

Thanks for reading. If you’ve got thoughts, send them my way.