A three-week hunt for a once-per-million duplicate order bug, a Sidekiq super_fetch surprise, a Redis blip, and the database forensics that finally cracked it.
A creator opened a support ticket on a Tuesday morning saying every customer in their branded mobile app got charged twice that month, and our own data backed them up. The numbers were small at first. A handful of subscriptions doubled on a quiet weekend. Then a few hundred. Then somebody escalated, and the spreadsheet landed on my desk.
This was at the creator-economy platform I worked at, on the native billing surface for branded apps. Apple In-App Purchase, Google Play Billing, server-to-server renewal notifications, the whole stack. It had been quietly shipping for about six months and felt boring in the good way. Until it wasn’t.
The thing that made it a nightmare was simple. We could not reproduce it. Not in staging, not in a unit test, not in a scripted production-like load. A bug that fired roughly once per million events, sometimes none in a day, sometimes a small cluster in an hour. And it was bleeding real money out of customer cards.
The first week was denial. The on-call set up a synthetic load test that hammered our renewal endpoint with Apple-shaped payloads. Nothing duplicated. We added a staging webhook that replayed real Apple traffic on a delay. Still nothing.
By week two we’d settled into a comfortable bad theory. “It’s Apple.” Their notification retry policy was aggressive, we all knew that. We even drafted a paragraph for the support ticket template explaining how Apple’s renewal pipeline occasionally double-fires and how creators should sit tight.
I’m not proud of that draft. We were three engineers, smart enough to know better, and we were one click away from telling paying customers it was someone else’s fault.
Apple’s subscriptionRenewal server-to-server notification has a 30-second deadline. If your endpoint doesn’t return 200 OK inside that window, Apple retries. Retries land as separate HTTP requests, often a few seconds apart. So if your handler is slow and your processing isn’t idempotent, you’ll process the same renewal twice. Three times. Sometimes more.
Our handler was synchronous. It looked something like this.
class Webhooks::AppleRenewalsController < ApplicationController
skip_before_action :verify_authenticity_token
def create
payload = AppleSignedPayload.new(request.body.read).verify!
transaction_id = payload.original_transaction_id
receipt = AppleReceiptValidator.call(payload.latest_receipt)
creator_app = BrandedApp.find_by!(bundle_id: payload.bundle_id)
subscription = CreatorSubscription.create!(
creator_app: creator_app,
apple_original_transaction_id: transaction_id,
expires_at: receipt.expires_at,
auto_renew: receipt.auto_renew_status,
raw_notification: payload.raw
)
SubscriptionMailer.with(subscription: subscription).renewal_confirmed.deliver_later
render json: { ok: true }
end
end
Read it again. The receipt validator does an HTTPS call to Apple. The mailer enqueues a job. The BrandedApp lookup hits a table that on a Tuesday morning might be cold in the buffer cache. Pick a bad combination of any two and you’re past Apple’s 30-second budget. The 200 OK arrives late, Apple retries, and we do the whole thing again with no idempotency check.
Staging never reproduced it because we never had a slow-enough downstream and an Apple-shaped retry at the same time. Production had both, all day, every day. Once per million.
Around the start of week three I gave up trying to reproduce and went to the database instead.
This is a habit I picked up the hard way at the same job on an unrelated incident a year earlier. Our Community product’s Aurora reader replicas had fallen behind by minutes, the on-call had bumped reader instance class twice, and lag wasn’t moving. The fix was a single query on the writer’s pg_stat_activity showing a long-running ANALYZE starving WAL emission. After that, I started leading every incident with “what does the database know that I don’t”.
Same trick here.
WITH duplicates AS (
SELECT
apple_original_transaction_id,
creator_app_id,
COUNT(*) AS row_count,
MIN(created_at) AS first_seen,
MAX(created_at) AS last_seen,
EXTRACT(EPOCH FROM (MAX(created_at) - MIN(created_at))) AS delta_seconds,
ARRAY_AGG(id ORDER BY created_at) AS ids,
ARRAY_AGG(raw_notification->>'notification_uuid' ORDER BY created_at) AS notification_uuids
FROM creator_subscriptions
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY apple_original_transaction_id, creator_app_id
HAVING COUNT(*) > 1
)
SELECT *
FROM duplicates
ORDER BY delta_seconds ASC
LIMIT 200;
Two patterns fell out immediately. The time delta between duplicates was clustered around 30 to 90 seconds, which screamed Apple-style retries. And a smaller set of duplicates had the same notification_uuid. That should have been impossible. Apple’s UUID is per-notification. If we saw the same UUID twice in our table, the duplicate wasn’t coming from Apple at all. It was coming from us.
That was the moment the investigation flipped. We weren’t being double-fired. We were double-processing.
The same-UUID duplicates traced back to our background queue. We were on Sidekiq Pro with super_fetch enabled, which recovers jobs from crashed or evicted workers. It’s the right default for most teams. It saved us from losing thousands of jobs the previous quarter when a node got OOM-killed mid-deploy.
But super_fetch has a behavior you have to think about. When a worker stops sending heartbeats to Redis, super_fetch assumes that worker died and re-enqueues anything in flight. If the worker actually finished the job and then lost its Redis connection for, say, fifteen seconds during a Redis failover, the job has already run. Super_fetch doesn’t know that. It re-enqueues. The next worker runs it again.
Two months earlier we’d had a quiet incident where ElastiCache for Redis was promoted to a new primary during a routine maintenance window. It took about eighteen seconds for clients to settle. Nothing paged. Sidekiq logs showed a few “fetched jobs from orphaned queue” lines and the on-call moved on. That was the Redis blip, and we’d treated it as cosmetic.
So picture the full sequence.
subscriptionRenewal notification.SubscriptionMailer. It returns 200 OK.notification_uuid but the same original_transaction_id. First class of duplicates.Three small things stacked. No single one of them was a bug on its own. The slow synchronous controller was “slow but working”. Super_fetch was “the right default”. The missing unique constraint was “an oversight”. Stack them, run a few million renewals a month, and you get a once-per-million ghost.
# config/sidekiq.yml
:concurrency: 25
:queues:
- [critical, 4]
- [default, 2]
- [low, 1]
production:
:super_fetch: true
:reliable_push: true
:reliable_scheduler: true
And the renewal worker, post-rewrite, with the idempotency key derived from inputs we control.
class AppleRenewalProcessor
include Sidekiq::Worker
sidekiq_options queue: :critical, retry: 5
def perform(notification_uuid, original_transaction_id, raw_payload_b64)
idempotency_key = "apple:renewal:#{original_transaction_id}:#{notification_uuid}"
ProcessedNotification.transaction do
ProcessedNotification.create!(key: idempotency_key)
payload = AppleSignedPayload.from_base64(raw_payload_b64).verify!
receipt = AppleReceiptValidator.call(payload.latest_receipt)
CreatorSubscription.upsert(
{
apple_original_transaction_id: original_transaction_id,
notification_uuid: notification_uuid,
creator_app_id: BrandedApp.find_by!(bundle_id: payload.bundle_id).id,
expires_at: receipt.expires_at,
auto_renew: receipt.auto_renew_status,
raw_notification: payload.raw
},
unique_by: %i[apple_original_transaction_id notification_uuid]
)
end
rescue ActiveRecord::RecordNotUnique
Rails.logger.info(event: "renewal.duplicate_dropped", key: idempotency_key)
end
end
The ProcessedNotification insert is the gate. It throws on a duplicate key before any side effects can run. The upsert is belt-and-braces in case anything ever bypasses the gate.
class AddUniqueIndexToCreatorSubscriptions < ActiveRecord::Migration[7.1]
disable_ddl_transaction!
def change
add_index :creator_subscriptions,
%i[apple_original_transaction_id notification_uuid],
unique: true,
where: "notification_uuid IS NOT NULL",
algorithm: :concurrently,
name: "idx_creator_subs_unique_apple_renewal"
end
end
algorithm: :concurrently matters. We were running this against a hot table on Aurora during business hours, and we’d lived through a schema migration once that took an ACCESS EXCLUSIVE lock and held login traffic for over a minute. Scar tissue.
The controller got cut to almost nothing.
class Webhooks::AppleRenewalsController < ApplicationController
skip_before_action :verify_authenticity_token
def create
raw = request.body.read
payload = AppleSignedPayload.new(raw).verify!
AppleRenewalProcessor.perform_async(
payload.notification_uuid,
payload.original_transaction_id,
Base64.strict_encode64(raw)
)
head :ok
end
end
Verify the signature. Enqueue. Acknowledge. The actual work lives in the worker, behind the idempotency key, behind the unique index.
Three weeks of investigation, twenty lines of code in the fix. That ratio is the punchline of most production bugs I’ve shipped.
Most “race conditions you can’t reproduce” aren’t single bugs. They’re three small things that each looked fine on their own. A slow handler. A reasonable retry policy. A missing constraint. Staging never assembles all three at once.
The day I stopped trying to recreate it locally and started reading the production database was the day the bug cracked. Audit timestamps, time deltas, the same-UUID rows that should not have existed. Those rows told me more in twenty minutes than two weeks of synthetic load tests had told me total.
pg_stat_activity, pg_locks, audit timestamps with time deltas.Thanks for reading. If you’ve got thoughts, send them my way.