Building Webhooks Infrastructure in Rails

How I build outgoing and incoming webhooks in Rails: HMAC signing, retries with backoff, idempotency keys, and per-endpoint observability that actually catches partner failures.

A partner support ticket came in on a Wednesday at the creator-economy platform I worked at. “We’re missing about 30% of your subscription events for the last two days.” We looked. Our outgoing webhooks dashboard showed 100% delivery success. Their ingest logs showed roughly 70% of what we’d sent. Both sides were right, in a way. Our retries fired on 5xx only, the partner returned 200 OK and dropped events when their queue was full, and nothing on our side disagreed with their lie.

That ticket is most of the reason I have opinions about webhook infrastructure now. The rest comes from the inverse problem. Apple’s App Store Connect once retried our SubscriptionRenewal notification because our handler answered slightly past their 30 second deadline, and our consumer didn’t dedupe. A bunch of customers got billed twice. The Rails monolith made it look fine for a week.

So here’s the position I’ll defend in this post. Webhooks are not HTTP calls. They’re a durable, signed, idempotent, retried, observable contract between two systems that don’t trust each other. Treat them like that, and the integrations stop catching fire. Treat them like a POST, and you get the two stories above.

Outgoing delivery

Every outgoing webhook in our system goes through one path. Producer code never calls Net::HTTP directly. It enqueues an OutboundWebhookJob with the event payload, the target endpoint, and an event id. The job signs, sends, classifies the response, and reports.

class OutboundWebhookJob
  include Sidekiq::Job
  sidekiq_options queue: :webhooks, retry: false

  SIGNING_VERSION = "v1"

  def perform(endpoint_id, event_id)
    endpoint = WebhookEndpoint.find(endpoint_id)
    event    = WebhookEvent.find(event_id)
    return if event.delivered?

    body = event.payload.to_json
    ts   = Time.now.to_i.to_s
    sig  = sign(endpoint.secret, ts, body)

    res = Faraday.post(endpoint.url, body) do |req|
      req.headers["Content-Type"]            = "application/json"
      req.headers["X-Webhook-Id"]            = event.public_id
      req.headers["X-Webhook-Timestamp"]     = ts
      req.headers["X-Webhook-Signature"]     = "#{SIGNING_VERSION}=#{sig}"
      req.options.timeout      = 10
      req.options.open_timeout = 3
    end

    Delivery.record!(event:, endpoint:, response: res)
  rescue Faraday::Error => e
    Delivery.record_failure!(event:, endpoint:, error: e)
    raise
  end

  def sign(secret, ts, body)
    OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}")
  end
end

A few things I won’t bend on. The signature covers the timestamp, not just the body, so a captured payload can’t be replayed an hour later: the receiver rejects stale timestamps, and changing the timestamp invalidates the signature. The signing version is in the header, so we can rotate the algorithm without breaking partners. Retries are off at the Sidekiq level. I want my own retry policy, not Sidekiq’s default of 25 retries on a backoff curve I didn’t choose.
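To make the version prefix concrete, here is a minimal sketch of how a sender might keep one signing routine per version. The SIGNERS table, the sign_header helper, and the whole "v2" entry are hypothetical names for illustration; only "v1" (HMAC-SHA256 over "timestamp.body") comes from the job above.

```ruby
require "openssl"

# Hypothetical registry: one signing routine per header version.
# Only "v1" mirrors the job above; "v2" is an illustrative placeholder
# for a future algorithm change.
SIGNERS = {
  "v1" => ->(secret, ts, body) { OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}") },
  "v2" => ->(secret, ts, body) { OpenSSL::HMAC.hexdigest("SHA512", secret, "#{ts}.#{body}") }
}.freeze

# Builds the X-Webhook-Signature value, e.g. "v1=<hex digest>".
def sign_header(version, secret, ts, body)
  signer = SIGNERS.fetch(version) { raise ArgumentError, "unknown signing version: #{version}" }
  "#{version}=#{signer.call(secret, ts, body)}"
end

body  = '{"type":"subscription.renewed"}'
v1    = sign_header("v1", "s3cret", "1700000000", body)
v2    = sign_header("v2", "s3cret", "1700000000", body)
later = sign_header("v1", "s3cret", "1700003600", body) # same body, different timestamp
```

Because the timestamp is part of the signed string, v1 and later differ even though the body is identical. A receiver that dispatches on the prefix can accept both versions during a rotation window and drop the old one afterwards.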

Verification on the receiving side mirrors this. We give partners a code snippet and ship a verifier gem internally for other Rails apps in our portfolio.

def verify_signature!(request)
  ts        = request.headers["X-Webhook-Timestamp"].to_s
  signature = request.headers["X-Webhook-Signature"].to_s
  body      = request.raw_post

  # Missing headers are a signature failure, not a NoMethodError and a 500.
  raise SignatureError, "missing headers" if ts.empty? || signature.empty?
  raise SignatureError, "stale" if (Time.now.to_i - ts.to_i).abs > 300

  version, hex = signature.split("=", 2)
  raise SignatureError, "bad version" unless version == "v1"

  expected = OpenSSL::HMAC.hexdigest("SHA256", ENV.fetch("WEBHOOK_SECRET"), "#{ts}.#{body}")
  # hex is nil when the header has no "=", so coerce before comparing.
  raise SignatureError, "mismatch" unless ActiveSupport::SecurityUtils.secure_compare(expected, hex.to_s)
end

secure_compare matters here. A naive == can return as soon as it hits the first differing byte, which leaks timing. I’ve watched a junior on my squad reach for == during code review more than once, which is how I know it isn’t obvious.
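For partners who aren’t on Rails, the same check can be sketched with nothing but Ruby’s bundled openssl. This assumes openssl 2.2 or newer for OpenSSL.secure_compare; the verify helper, its now: parameter, and the symbol return values are hypothetical names I’m using for illustration, not what our gem exposes.

```ruby
require "openssl"

TOLERANCE_SECONDS = 300 # same five-minute window as verify_signature! above

# Returns :ok, or a symbol naming why the request was rejected.
# `now` is injectable so the replay window can be exercised without sleeping.
def verify(secret:, timestamp:, signature:, body:, now: Time.now.to_i)
  return :missing if timestamp.to_s.empty? || signature.to_s.empty?
  return :stale if (now - timestamp.to_i).abs > TOLERANCE_SECONDS

  version, hex = signature.split("=", 2)
  return :bad_version unless version == "v1"

  expected = OpenSSL::HMAC.hexdigest("SHA256", secret, "#{timestamp}.#{body}")
  # Constant-time comparison instead of ==.
  OpenSSL.secure_compare(expected, hex.to_s) ? :ok : :mismatch
end

secret = "s3cret"
body   = '{"id":"evt_1"}'
ts     = "1700000000"
sig    = "v1=" + OpenSSL::HMAC.hexdigest("SHA256", secret, "#{ts}.#{body}")

fresh    = verify(secret: secret, timestamp: ts, signature: sig, body: body, now: 1_700_000_010)
replayed = verify(secret: secret, timestamp: ts, signature: sig, body: body, now: 1_700_003_600)
# An attacker who rewrites the timestamp header to look fresh breaks the HMAC.
forged   = verify(secret: secret, timestamp: "1700003600", signature: sig, body: body, now: 1_700_003_600)
tampered = verify(secret: secret, timestamp: ts, signature: sig, body: '{"id":"evt_2"}', now: 1_700_000_010)
```

The replayed and forged cases are the reason the timestamp is inside the signed string: an old capture is either stale or no longer matches its signature.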

Retries that don’t lie

Retries belong in your domain, not in your transport. Transport-level retries on 5xx, Faraday’s retry middleware for example, won’t help when the partner returns 200 OK after dropping the event. So we classify responses ourselves and let the orchestrator decide.

class Delivery
  RETRY_DELAYS = [10, 30, 120, 600, 1_800, 7_200, 21_600].freeze

  def self.record!(event:, endpoint:, response:)
    ok = response.status.between?(200, 299) && acked?(response)
    attempt = event.delivery_attempts.create!(
      endpoint:,
      status_code: response.status,
      response_body_excerpt: response.body.to_s.byteslice(0, 2_000),
      succeeded: ok
    )
    if ok
      event.update!(delivered_at: Time.current)
    else
      schedule_retry(event, attempt)
    end
  end

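  def self.schedule_retry(event, attempt)
    # n counts attempts already made, so the first failure (n == 1)
    # uses RETRY_DELAYS[0] instead of skipping it.
    n = event.delivery_attempts.count
    return event.update!(dead_lettered_at: Time.current) if n > RETRY_DELAYS.length

    base  = RETRY_DELAYS[n - 1]
    delay = base + rand(0..(base / 4)) # up to 25% jitter
    OutboundWebhookJob.perform_in(delay, attempt.endpoint_id, event.id)
  end

  def self.acked?(response)
    return true if response.body.blank?
    parsed = JSON.parse(response.body)
    # Only an explicit {"received": false} counts as a refusal;
    # arrays, strings, and null bodies are treated as acknowledged.
    !(parsed.is_a?(Hash) && parsed["received"] == false)
  rescue JSON::ParserError
    true
  end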
end

Exponential delays with jitter. The jitter is the thing partners don’t talk about, and it’s the thing that saves you when their gateway falls over and a thousand of your retries arrive in the same second. The RETRY_DELAYS array goes out to six hours because partner outages routinely run longer than your patience does. After that, it’s a dead letter, not a 5xx storm.
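To see what that schedule looks like in isolation, here is a small sketch that computes the delay for each attempt. The next_retry_delay helper and its rng: parameter are hypothetical; in this sketch the first failed attempt maps to the first entry of the array, and nil means the event should be dead-lettered.

```ruby
RETRY_DELAYS = [10, 30, 120, 600, 1_800, 7_200, 21_600].freeze

# Returns the delay in seconds before the next attempt, or nil when the
# schedule is exhausted and the event should be dead-lettered.
# attempts_so_far counts deliveries already tried (1 after the first failure).
def next_retry_delay(attempts_so_far, rng: Random.new)
  return nil unless attempts_so_far.between?(1, RETRY_DELAYS.length)

  base = RETRY_DELAYS[attempts_so_far - 1]
  base + rng.rand(0..(base / 4)) # up to 25% jitter on top of the base delay
end

rng      = Random.new(42) # seeded only to make the example repeatable
schedule = (1..8).map { |n| next_retry_delay(n, rng: rng) }

schedule.each_with_index do |delay, i|
  puts delay ? "after attempt #{i + 1}: retry in #{delay}s" : "after attempt #{i + 1}: dead letter"
end
```

The base delays in this sketch sum to 31,360 seconds, a bit under nine hours from first failure to dead letter before jitter is added. Because each job draws its own jitter, a thousand events that failed in the same second come back spread over a window instead of as one spike.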

Idempotency on consumption

When we’re the receiver, the rule is the one I learned the hard way with Apple. Server to server retries are normal. They can be aggressive. The endpoint must return 200 OK in under five seconds, with the real work queued behind it and a database constraint that makes duplicate processing impossible.

class WebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    verify_signature!(request)

    idem_key = request.headers["X-Webhook-Id"]
    return head :ok if ProcessedWebhook.exists?(provider: "stripe", external_id: idem_key)

    raw = request.raw_post
    record = ProcessedWebhook.create!(provider: "stripe", external_id: idem_key, payload: raw)
    StripeWebhookJob.perform_async(record.id)

    head :ok
  rescue ActiveRecord::RecordNotUnique
    head :ok
  rescue SignatureError
    head :unauthorized
  end
end

The matching migration is the part people forget:

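class AddUniquenessToProcessedWebhooks < ActiveRecord::Migration[7.1]
  # PostgreSQL can't run CREATE INDEX CONCURRENTLY inside a transaction,
  # so opt this migration out of the default DDL transaction.
  disable_ddl_transaction!

  def change
    add_index :processed_webhooks,
              [:provider, :external_id],
              unique: true,
              algorithm: :concurrently
  end
end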

Two things to notice. The unique index is the real idempotency check. The controller’s exists? lookup is just a fast path. If two retries land in the same millisecond and both pass exists?, the index rejects the second insert with RecordNotUnique and we still answer 200 OK. The other thing is algorithm: :concurrently. Webhook tables are hot, and on PostgreSQL a plain index build blocks writes to the table until it finishes. A concurrent build can’t run inside a transaction, so the migration has to call disable_ddl_transaction!.
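The race is easier to believe once you see it. This is a database-free sketch, not our production code: FakeProcessedWebhooks is a hypothetical in-memory stand-in for the table, with a Mutex playing the role of the unique index, and receive mirrors the controller’s check, insert, enqueue sequence.

```ruby
require "set"

# In-memory stand-in for the processed_webhooks table. The Mutex plays the
# role of the database's unique index: insert! is the only atomic step.
class FakeProcessedWebhooks
  class NotUnique < StandardError; end

  def initialize
    @keys = Set.new
    @lock = Mutex.new
  end

  # Fast path only; the answer can be stale by the time the caller acts on it.
  def exists?(key)
    @keys.include?(key)
  end

  # Atomic insert-if-absent, like an INSERT against a unique index.
  def insert!(key)
    @lock.synchronize do
      raise NotUnique, key unless @keys.add?(key)
    end
  end
end

# Mirrors the controller: check, insert, enqueue, and treat a uniqueness
# violation as "already handled".
def receive(store, key, enqueued)
  return :ok if store.exists?(key)

  store.insert!(key)
  enqueued << key
  :ok
rescue FakeProcessedWebhooks::NotUnique
  :ok
end

store    = FakeProcessedWebhooks.new
enqueued = Queue.new
results  = 20.times.map { Thread.new { receive(store, "evt_123", enqueued) } }.map(&:value)
```

Every caller gets :ok, but only one job is enqueued no matter how the threads interleave. Take the atomic insert away and keep only exists?, and that guarantee is gone.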

Per-endpoint observability

The partner ticket from the opening would have stayed open for weeks without two specific metrics: delivery success rate per endpoint and end to end latency per endpoint. Aggregate dashboards hide partner failures because one good partner drowns out one broken one.

Every delivery emits an event we ship to Datadog. The metric tags are endpoint_id, event_type, status_class. The Datadog monitor that paged us was a simple one. If any endpoint’s delivery success rate dropped below 90% for ten minutes, page the on call. We added another after the partner incident. If any endpoint’s median time to ack from the partner grew past five seconds for fifteen minutes, page. That second one would have caught the partner’s silent throttling within an hour.
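Here is the drowning-out effect with made-up numbers. The Sample struct, the sample counts, and the success_rate helper are all hypothetical; the 90% threshold is the one from the monitor above.

```ruby
# Hypothetical delivery samples: endpoint 1 is healthy and high-volume,
# endpoint 2 is failing half the time at low volume.
Sample = Struct.new(:endpoint_id, :succeeded)

samples =
  Array.new(990) { Sample.new(1, true) } +
  Array.new(10)  { Sample.new(1, false) } +
  Array.new(5)   { Sample.new(2, true) } +
  Array.new(5)   { Sample.new(2, false) }

# Fraction of rows whose delivery succeeded.
def success_rate(rows)
  rows.count(&:succeeded).fdiv(rows.size)
end

aggregate    = success_rate(samples)
per_endpoint = samples.group_by(&:endpoint_id).transform_values { |rows| success_rate(rows) }
paging       = per_endpoint.select { |_id, rate| rate < 0.90 }.keys

puts format("aggregate: %.1f%%", aggregate * 100)
per_endpoint.each { |id, rate| puts format("endpoint %d: %.1f%%", id, rate * 100) }
```

The aggregate sits at 995 out of 1,010, comfortably above any sensible alert threshold, while endpoint 2 is at 50%. Only the per-endpoint grouping puts it in the paging list.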

The dead letter queue has its own dashboard. We run a recurring job that surfaces the oldest entries per endpoint, with a one click “retry from DLQ” link in our internal admin. That link is the reason the partner ticket from the opening got resolved in an afternoon. We replayed the two day window from our DLQ and reconciled against their ingest. They acknowledged the bug on their side a week later. The DLQ saved the integration. It saved my Friday too.

Takeaways

Webhooks are a contract between distrustful systems. Sign them, version the signature, time bound it.
Retries belong in your domain. Sidekiq’s default policy is not a webhook policy.
Jittered exponential backoff with a six hour tail. Partner outages outlast your patience.
Idempotency is a database unique index, not a controller if check.
Return 200 OK fast. Queue the real work. Never trust a 200 as proof of receipt without a separate confirmation.
Per endpoint metrics or you’ll find out from a support ticket.
A DLQ with a replay tool is not optional. It’s the recovery path.

Thanks for reading. If you’ve got thoughts, send them my way.