How I build outgoing and incoming webhooks in Rails: HMAC signing, retries with backoff, idempotency keys, and per-endpoint observability that actually catches partner failures.
A partner support ticket came in on a Wednesday at the creator-economy platform I worked at. “We’re missing about 30% of your subscription events for the last two days.” We looked. Our outgoing webhooks dashboard showed 100% delivery success. Their ingest logs showed about 70% of what we’d sent. Both sides were right, in a way. Our retries fired on 5xx only, the partner returned 200 OK and dropped events when their queue was full, and nothing on our side disagreed with their lie.
That ticket is most of the reason I have opinions about webhook infrastructure now. The rest comes from the inverse problem. Apple’s App Store Connect once retried our SubscriptionRenewal notification because our handler answered slightly past their 30 second deadline, and our consumer didn’t dedupe. A bunch of customers got billed twice. The Rails monolith made it look fine for a week.
So here’s the position I’ll defend in this post. Webhooks are not HTTP calls. They’re a durable, signed, idempotent, retried, observable contract between two systems that don’t trust each other. Treat them like that, and the integrations stop catching fire. Treat them like a POST, and you get the two stories above.
Every outgoing webhook in our system goes through one path. Producer code never calls Net::HTTP directly. It enqueues an OutboundWebhookJob with the event payload, the target endpoint, and an event id. The job signs, sends, classifies the response, and reports.
class OutboundWebhookJob
  include Sidekiq::Job
  sidekiq_options queue: :webhooks, retry: false

  SIGNING_VERSION = "v1"

  def perform(endpoint_id, event_id)
    endpoint = WebhookEndpoint.find(endpoint_id)
    event = WebhookEvent.find(event_id)
    return if event.delivered?

    body = event.payload.to_json
    ts = Time.now.to_i.to_s
    sig = sign(endpoint.secret, ts, body)

    res = Faraday.post(endpoint.url, body) do |req|
      req.headers["Content-Type"] = "application/json"
      req.headers["X-Webhook-Id"] = event.public_id
      req.headers["X-Webhook-Timestamp"] = ts
      req.headers["X-Webhook-Signature"] = "#{SIGNING_VERSION}=#{sig}"
      req.options.timeout = 10
      req.options.open_timeout = 3
    end

    Delivery.record!(event:, endpoint:, response: res)
  rescue Faraday::Error => e
    Delivery.record_failure!(event:, endpoint:, error: e)
    raise
  end

  def sign(secret, ts, body)
    OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}")
  end
end
A few things I won’t bend on. The signature covers the timestamp, not just the body, so a captured payload can’t be replayed an hour later. The signing version is in the header, so we can rotate the algorithm without breaking partners. Retries are off at the Sidekiq level. I want my own retry policy, not Sidekiq’s default of 25 attempts with a power curve I didn’t choose.
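To make the rotation point concrete, here is a minimal sketch of version-dispatched signing in plain Ruby. The `SIGNERS` table, the helper names, and the `"v2"` entry (HMAC-SHA512) are assumptions invented for illustration; only `"v1"` matches the job above.

```ruby
require "openssl"

# Hypothetical registry. "v1" mirrors the job above; "v2" is an invented
# example of a later algorithm, included only to show the rotation path.
SIGNERS = {
  "v1" => ->(secret, ts, body) { OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}") },
  "v2" => ->(secret, ts, body) { OpenSSL::HMAC.hexdigest("SHA512", secret, "#{ts}.#{body}") }
}.freeze

# Builds the header value in the "<version>=<hex digest>" shape.
def signature_header(version, secret, ts, body)
  signer = SIGNERS.fetch(version)
  "#{version}=#{signer.call(secret, ts, body)}"
end

# True only when the header names a known version and the digest matches.
def valid_signature?(header, secret, ts, body)
  version, hex = header.to_s.split("=", 2)
  signer = SIGNERS[version]
  return false if signer.nil? || hex.nil?

  # Constant-time comparison from the openssl gem (2.2 or newer).
  OpenSSL.secure_compare(signer.call(secret, ts, body), hex)
end
```

A receiver built this way keeps accepting `v1` while a sender starts emitting `v2`, and because the timestamp is inside the signed string, the same header fails verification once the timestamp changes.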
Verification on the receiving side mirrors this. We give partners a code snippet and ship a verified gem internally for other Rails apps in our portfolio.
def verify_signature!(request)
  ts = request.headers["X-Webhook-Timestamp"]
  # to_s keeps a missing header from raising NoMethodError (a 500) below.
  signature = request.headers["X-Webhook-Signature"].to_s
  body = request.raw_post

  # A missing timestamp becomes 0 via to_i and is rejected as stale.
  raise SignatureError, "stale" if (Time.now.to_i - ts.to_i).abs > 300

  version, hex = signature.split("=", 2)
  raise SignatureError, "bad version" unless version == "v1"

  expected = OpenSSL::HMAC.hexdigest("SHA256", ENV.fetch("WEBHOOK_SECRET"), "#{ts}.#{body}")
  # hex is nil when the header is just "v1" with no digest; compare against "" instead.
  raise SignatureError, "mismatch" unless ActiveSupport::SecurityUtils.secure_compare(expected, hex.to_s)
end
secure_compare matters here. A naive == can return as soon as it hits the first differing byte, and that leaks timing. I’ve watched a junior on my squad reach for == during code review more than once, which is how I know it isn’t obvious.
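The same checks can be exercised outside Rails. The sketch below is an assumption-laden stand-in, not the gem mentioned above: `check_webhook`, its keyword arguments, and the injectable `now` are hypothetical names, and `OpenSSL.secure_compare` (openssl gem 2.2 or newer) replaces the ActiveSupport helper.

```ruby
require "openssl"

TOLERANCE_SECONDS = 300 # the same five-minute window as the snippet above

# Returns :ok, :stale, or :mismatch. `now` is injectable so the replay
# window can be tested without stubbing the clock.
def check_webhook(secret:, ts:, body:, hex:, now: Time.now.to_i)
  return :stale if (now - ts.to_i).abs > TOLERANCE_SECONDS

  expected = OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}")
  # Constant-time comparison; hex.to_s tolerates a missing digest.
  OpenSSL.secure_compare(expected, hex.to_s) ? :ok : :mismatch
end
```

Replaying a captured request an hour later comes back `:stale`. Rewriting the timestamp header to look fresh comes back `:mismatch`, because the timestamp is part of the signed string.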
Retries belong in your domain, not in your transport. Having Faraday retry on 5xx won’t help when the partner returns 200 OK after dropping the event. So we classify responses ourselves and let the orchestrator decide.
class Delivery
  RETRY_DELAYS = [10, 30, 120, 600, 1_800, 7_200, 21_600].freeze

  def self.record!(event:, endpoint:, response:)
    ok = response.status.between?(200, 299) && acked?(response)
    attempt = event.delivery_attempts.create!(
      endpoint:,
      status_code: response.status,
      response_body_excerpt: response.body.to_s.byteslice(0, 2_000),
      succeeded: ok
    )
    if ok
      event.update!(delivered_at: Time.current)
    else
      schedule_retry(event, attempt)
    end
  end

  def self.schedule_retry(event, attempt)
    # The failed attempt is already persisted, so the first failure has
    # count == 1 and should use RETRY_DELAYS[0], not RETRY_DELAYS[1].
    index = event.delivery_attempts.count - 1
    return event.update!(dead_lettered_at: Time.current) if index >= RETRY_DELAYS.length

    base = RETRY_DELAYS[index]
    delay = base + rand(0..(base / 4)) # up to 25% jitter on top of the base delay
    OutboundWebhookJob.perform_in(delay, attempt.endpoint_id, event.id)
  end

  def self.acked?(response)
    return true if response.body.blank?

    parsed = JSON.parse(response.body)
    # Only a JSON object carrying "received": false is an explicit refusal.
    # The Hash check avoids a TypeError when the body is a JSON array or number.
    !(parsed.is_a?(Hash) && parsed["received"] == false)
  rescue JSON::ParserError
    true
  end
end
Exponential delays with jitter. The jitter is the thing partners don’t talk about, and it’s the thing that saves you when their gateway falls over and a thousand of your retries arrive in the same second. The RETRY_DELAYS array goes out to six hours because partner outages routinely run longer than your patience does. After that, it’s a dead letter, not a 5xx storm.
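The schedule is easier to reason about as a pure function. This is a sketch under one stated assumption: the first failure uses the first entry in the array. `next_retry_delay` and its `rng:` argument are hypothetical names, not part of the Delivery class.

```ruby
RETRY_DELAYS = [10, 30, 120, 600, 1_800, 7_200, 21_600].freeze

# failed_attempts: how many deliveries have failed so far (1 after the first).
# Returns the seconds to wait before the next attempt, or nil once the
# schedule is exhausted and the event should be dead-lettered.
def next_retry_delay(failed_attempts, rng: Random.new)
  return nil if failed_attempts < 1

  base = RETRY_DELAYS[failed_attempts - 1]
  return nil if base.nil?

  # Up to 25% of random jitter so a burst of failures does not retry in lockstep.
  base + rng.rand(0..(base / 4))
end
```

Passing a seeded `Random` makes the jitter reproducible in tests, while production callers can rely on the default.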
When we’re the receiver, the rule is the one I learned the hard way with Apple. Server to server retries are normal. They can be aggressive. The endpoint must return 200 OK in under five seconds, push the real work onto a queue, and sit on top of a database constraint that makes duplicate processing impossible.
class WebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    verify_signature!(request)
    idem_key = request.headers["X-Webhook-Id"]
    return head :ok if ProcessedWebhook.exists?(provider: "stripe", external_id: idem_key)

    raw = request.raw_post
    record = ProcessedWebhook.create!(provider: "stripe", external_id: idem_key, payload: raw)
    StripeWebhookJob.perform_async(record.id)
    head :ok
  rescue ActiveRecord::RecordNotUnique
    head :ok
  rescue SignatureError
    head :unauthorized
  end
end
The matching migration is the part people forget:
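class AddUniquenessToProcessedWebhooks < ActiveRecord::Migration[7.1]
  # Postgres cannot build an index concurrently inside a transaction,
  # so the migration has to opt out of the default DDL transaction.
  disable_ddl_transaction!

  def change
    add_index :processed_webhooks,
              [:provider, :external_id],
              unique: true,
              algorithm: :concurrently
  end
end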
Two things to notice. The unique index is the real idempotency check. The controller’s exists? lookup is just a fast path. If two retries land in the same millisecond and both pass exists?, the index catches the second one and we still answer 200 OK. The other thing is algorithm: :concurrently. Webhook tables are hot, and building an index without it blocks writes to the table until the build finishes. On Postgres a concurrent build can’t run inside a transaction, so the migration also has to call disable_ddl_transaction!.
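The race is easy to reproduce without a database. In this sketch, `ProcessedStore` is a hypothetical in-memory stand-in for the table and its unique index, and `receive` is an invented helper that mirrors the controller's control flow. It illustrates the pattern and makes no claim about how Postgres behaves internally.

```ruby
require "set"

class DuplicateKey < StandardError; end

# In-memory stand-in for a table with a unique index on [provider, external_id].
class ProcessedStore
  def initialize
    @keys = Set.new
    @lock = Mutex.new
  end

  # Fast path only: the answer can be stale by the time the caller acts on it.
  def exists?(provider, external_id)
    @lock.synchronize { @keys.include?([provider, external_id]) }
  end

  # Atomic check-and-insert, playing the role of an INSERT against the unique index.
  def insert!(provider, external_id)
    @lock.synchronize do
      raise DuplicateKey unless @keys.add?([provider, external_id])
    end
  end
end

# Returns :enqueued for the first delivery and :duplicate for every retry,
# including retries that slipped past the exists? check.
def receive(store, provider, external_id)
  return :duplicate if store.exists?(provider, external_id)

  store.insert!(provider, external_id)
  :enqueued
rescue DuplicateKey
  :duplicate
end
```

Running `receive` from many threads with the same id yields exactly one `:enqueued`, whichever thread wins. Remove the rescue and the losing threads raise instead, which is the in-memory analogue of answering a retry with a 500.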
The partner ticket from the opening would have stayed open for weeks without two specific metrics: delivery success rate per endpoint and end-to-end latency per endpoint. Aggregate dashboards hide partner failures because one good partner drowns out one broken one.
Every delivery emits an event we ship to Datadog. The metric tags are endpoint_id, event_type, status_class. The Datadog monitor that paged us was a simple one. If any endpoint’s delivery success rate dropped below 90% for ten minutes, page the on call. We added another after the Apple incident. If any endpoint’s median time to ack from the partner grew past five seconds for fifteen minutes, page. That second one would have caught Apple’s silent throttling within an hour.
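A small sketch of why the per-endpoint cut matters, using made-up numbers. The `Attempt` struct and both helpers are hypothetical; a real setup would compute this in the monitoring system from the tagged metric rather than in application code.

```ruby
# Hypothetical delivery record: which endpoint, and whether it succeeded.
Attempt = Struct.new(:endpoint_id, :succeeded)

# Returns { endpoint_id => success rate } with rates between 0.0 and 1.0.
def success_rate_by_endpoint(attempts)
  attempts.group_by(&:endpoint_id).transform_values do |rows|
    rows.count(&:succeeded).to_f / rows.size
  end
end

# Endpoint ids whose success rate is below the paging threshold.
def failing_endpoints(attempts, threshold: 0.9)
  success_rate_by_endpoint(attempts).select { |_id, rate| rate < threshold }.keys
end

# Made-up traffic: one large healthy endpoint, one small broken one.
attempts =
  Array.new(970) { Attempt.new("big", true) } +
  Array.new(15)  { Attempt.new("small", true) } +
  Array.new(15)  { Attempt.new("small", false) }

aggregate = attempts.count(&:succeeded).to_f / attempts.size
puts "aggregate: #{aggregate}"
puts "failing: #{failing_endpoints(attempts).inspect}"
```

With these numbers the aggregate sits at 98.5% while the small endpoint is at 50%, so only the per-endpoint view crosses the 90% threshold.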
The dead letter queue has its own dashboard. We run a recurring job that surfaces the oldest entries per endpoint, with a one click “retry from DLQ” link in our internal admin. That link is the reason the partner ticket from the opening got resolved in an afternoon. We replayed the two day window from our DLQ and reconciled against their ingest. They acknowledged the bug on their side a week later. The DLQ saved the integration. It saved my Friday too.
Return 200 OK fast. Queue the real work. Never trust a 200 as proof of receipt without a separate confirmation.
Thanks for reading. If you’ve got thoughts, send them my way.