The Race Condition We Couldn't Reproduce

A three-week hunt for a once-per-million duplicate order bug, a Sidekiq super_fetch surprise, a Redis blip, and the database forensics that finally cracked it.

A creator opened a support ticket on a Tuesday morning saying every customer in their branded mobile app got charged twice that month, and our own data backed them up. The numbers were small at first. A handful of subscriptions doubled on a quiet weekend. Then a few hundred. Then somebody escalated, and the spreadsheet landed on my desk.

This was at the creator-economy platform I worked at, on the native billing surface for branded apps. Apple In-App Purchase, Google Play Billing, server-to-server renewal notifications, the whole stack. It had been quietly shipping for about six months and felt boring in the good way. Until it wasn’t.

The thing that made it a nightmare was simple. We could not reproduce it. Not in staging, not in a unit test, not in a scripted production-like load. A bug that fired roughly once per million events, sometimes none in a day, sometimes a small cluster in an hour. And it was bleeding real money out of customer cards.

Three weeks chasing a ghost

The first week was denial. The on-call set up a synthetic load test that hammered our renewal endpoint with Apple-shaped payloads. Nothing duplicated. We added a staging webhook that replayed real Apple traffic on a delay. Still nothing.

By week two we’d settled into a comfortable bad theory. “It’s Apple.” Their notification retry policy was aggressive, we all knew that. We even drafted a paragraph for the support ticket template explaining how Apple’s renewal pipeline occasionally double-fires and how creators should sit tight.

I’m not proud of that draft. We were three engineers, smart enough to know better, and we were one click away from telling paying customers it was someone else’s fault.

Why staging never saw it

Apple’s subscriptionRenewal server-to-server notification has a 30-second deadline. If your endpoint doesn’t return 200 OK inside that window, Apple retries. Retries land as separate HTTP requests, often a few seconds apart. So if your handler is slow and your processing isn’t idempotent, you’ll process the same renewal twice. Three times. Sometimes more.

Our handler was synchronous. It looked something like this.

class Webhooks::AppleRenewalsController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    payload = AppleSignedPayload.new(request.body.read).verify!
    transaction_id = payload.original_transaction_id

    receipt = AppleReceiptValidator.call(payload.latest_receipt)
    creator_app = BrandedApp.find_by!(bundle_id: payload.bundle_id)

    subscription = CreatorSubscription.create!(
      creator_app: creator_app,
      apple_original_transaction_id: transaction_id,
      expires_at: receipt.expires_at,
      auto_renew: receipt.auto_renew_status,
      raw_notification: payload.raw
    )

    SubscriptionMailer.with(subscription: subscription).renewal_confirmed.deliver_later
    render json: { ok: true }
  end
end

Read it again. The receipt validator does an HTTPS call to Apple. The mailer enqueues a job. The BrandedApp lookup hits a table that on a Tuesday morning might be cold in the buffer cache. Pick a bad combination of any two and you’re past Apple’s 30-second budget. The 200 OK arrives late, Apple retries, and we do the whole thing again with no idempotency check.

Staging never reproduced it because we never had a slow-enough downstream and an Apple-shaped retry at the same time. Production had both, all day, every day. Once per million.

Reading the database, not the code

Around the start of week three I gave up trying to reproduce and went to the database instead.

This is a habit I picked up the hard way at the same job on an unrelated incident a year earlier. Our Community product’s Aurora reader replicas had fallen behind by minutes, the on-call had bumped reader instance class twice, and lag wasn’t moving. The fix was a single query on the writer’s pg_stat_activity showing a long-running ANALYZE starving WAL emission. After that, I started leading every incident with “what does the database know that I don’t”.

Same trick here.

WITH duplicates AS (
  SELECT
    apple_original_transaction_id,
    creator_app_id,
    COUNT(*) AS row_count,
    MIN(created_at) AS first_seen,
    MAX(created_at) AS last_seen,
    EXTRACT(EPOCH FROM (MAX(created_at) - MIN(created_at))) AS delta_seconds,
    ARRAY_AGG(id ORDER BY created_at) AS ids,
    ARRAY_AGG(raw_notification->>'notification_uuid' ORDER BY created_at) AS notification_uuids
  FROM creator_subscriptions
  WHERE created_at >= NOW() - INTERVAL '30 days'
  GROUP BY apple_original_transaction_id, creator_app_id
  HAVING COUNT(*) > 1
)
SELECT *
FROM duplicates
ORDER BY delta_seconds ASC
LIMIT 200;

Two patterns fell out immediately. The time delta between duplicates was clustered around 30 to 90 seconds, which screamed Apple-style retries. And a smaller set of duplicates had the same notification_uuid. That should have been impossible. Apple’s UUID is per-notification. If we saw the same UUID twice in our table, the duplicate wasn’t coming from Apple at all. It was coming from us.

That was the moment the investigation flipped. We weren’t being double-fired. We were double-processing.

The Sidekiq super_fetch wrinkle

The same-UUID duplicates traced back to our background queue. We were on Sidekiq Pro with super_fetch enabled, which recovers jobs from crashed or evicted workers. It’s the right default for most teams. It saved us from losing thousands of jobs the previous quarter when a node got OOM-killed mid-deploy.

But super_fetch has a behavior you have to think about. When a worker stops sending heartbeats to Redis, super_fetch assumes that worker died and re-enqueues anything in flight. If the worker actually finished the job and then lost its Redis connection for, say, fifteen seconds during a Redis failover, the job has already run. Super_fetch doesn’t know that. It re-enqueues. The next worker runs it again.

Two months earlier we’d had a quiet incident where ElastiCache for Redis was promoted to a new primary during a routine maintenance window. It took about eighteen seconds for clients to settle. Nothing paged. Sidekiq logs showed a few “fetched jobs from orphaned queue” lines and the on-call moved on. That was the Redis blip, and we’d treated it as cosmetic.

So picture the full sequence.

Apple sends a subscriptionRenewal notification.
Our slow synchronous controller takes 28 seconds. It writes the row. It enqueues SubscriptionMailer. It returns 200 OK.
Apple, on the edge of its 30-second budget, retries. The retry creates a second row with a different notification_uuid but the same original_transaction_id. First class of duplicates.
A few hours later, a Redis blip causes super_fetch to re-enqueue the mailer job, and the second pass walks the same handler logic. Second class of duplicates, the same-UUID ones.
Some unlucky customers got both.

Three small things stacked. No single one of them was a bug on its own. The slow synchronous controller was “slow but working”. Super_fetch was “the right default”. The missing unique constraint was “an oversight”. Stack them, run a few million renewals a month, and you get a once-per-million ghost.

# config/sidekiq.yml
:concurrency: 25
:queues:
  - [critical, 4]
  - [default, 2]
  - [low, 1]

production:
  :super_fetch: true
  :reliable_push: true
  :reliable_scheduler: true

And the renewal worker, post-rewrite, with the idempotency key derived from inputs we control.

class AppleRenewalProcessor
  include Sidekiq::Worker
  sidekiq_options queue: :critical, retry: 5

  def perform(notification_uuid, original_transaction_id, raw_payload_b64)
    idempotency_key = "apple:renewal:#{original_transaction_id}:#{notification_uuid}"

    ProcessedNotification.transaction do
      ProcessedNotification.create!(key: idempotency_key)

      payload = AppleSignedPayload.from_base64(raw_payload_b64).verify!
      receipt = AppleReceiptValidator.call(payload.latest_receipt)

      CreatorSubscription.upsert(
        {
          apple_original_transaction_id: original_transaction_id,
          notification_uuid: notification_uuid,
          creator_app_id: BrandedApp.find_by!(bundle_id: payload.bundle_id).id,
          expires_at: receipt.expires_at,
          auto_renew: receipt.auto_renew_status,
          raw_notification: payload.raw
        },
        unique_by: %i[apple_original_transaction_id notification_uuid]
      )
    end
  rescue ActiveRecord::RecordNotUnique
    Rails.logger.info(event: "renewal.duplicate_dropped", key: idempotency_key)
  end
end

The ProcessedNotification insert is the gate. It throws on a duplicate key before any side effects can run. The upsert is belt-and-braces in case anything ever bypasses the gate.

The fix that actually held

class AddUniqueIndexToCreatorSubscriptions < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :creator_subscriptions,
      %i[apple_original_transaction_id notification_uuid],
      unique: true,
      where: "notification_uuid IS NOT NULL",
      algorithm: :concurrently,
      name: "idx_creator_subs_unique_apple_renewal"
  end
end

algorithm: :concurrently matters. We were running this against a hot table on Aurora during business hours, and we’d lived through a schema migration once that took an ACCESS EXCLUSIVE lock and held login traffic for over a minute. Scar tissue.

The controller got cut to almost nothing.

class Webhooks::AppleRenewalsController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    raw = request.body.read
    payload = AppleSignedPayload.new(raw).verify!

    AppleRenewalProcessor.perform_async(
      payload.notification_uuid,
      payload.original_transaction_id,
      Base64.strict_encode64(raw)
    )

    head :ok
  end
end

Verify the signature. Enqueue. Acknowledge. The actual work lives in the worker, behind the idempotency key, behind the unique index.

Three weeks of investigation, twenty lines of code in the fix. That ratio is the punchline of most production bugs I’ve shipped.

What three weeks taught me

Most “race conditions you can’t reproduce” aren’t single bugs. They’re three small things that each looked fine on their own. A slow handler. A reasonable retry policy. A missing constraint. Staging never assembles all three at once.

The day I stopped trying to recreate it locally and started reading the production database was the day the bug cracked. Audit timestamps, time deltas, the same-UUID rows that should not have existed. Those rows told me more in twenty minutes than two weeks of synthetic load tests had told me total.

Takeaways

Race conditions that refuse to reproduce are usually three small things stacked, not one big bug.
Read the database before you read the code. pg_stat_activity, pg_locks, audit timestamps with time deltas.
Idempotency keys are the contract. Database unique constraints are how you enforce them.
Acknowledge external webhooks fast. Do the slow work in a background job, behind the key.
Trust the upstream’s source of truth. Don’t trust your own write response.
Once-per-million still hurts when “million” is your weekly throughput.

Thanks for reading. If you’ve got thoughts, send them my way.