Structured JSON logging and ActiveSupport::Notifications in a Rails monolith. The cleanup that turned an incident from a needle hunt into a single Datadog query.
A Tuesday morning at the creator-economy platform I spent the last few years at. The Community squad was triaging a slow read incident on Aurora. Datadog’s “AuroraReplicaLagMaximum > 60s for 2m” alert had fired around 10:14 a.m. PT. Replica lag was at 14 minutes and climbing. I got pulled onto the thread within minutes because I knew the Aurora layer well, even though I wasn’t on-call that week.
I’ve told the actual incident story before. The part I want to talk about today is what wasted the most time before anyone even touched the database.
Yeah. We could not find the bad query in our logs.
The Rails default logger had been left on. Every request was a four-line block of Started GET, the params hash, a hundred SELECT lines, then Completed 200 OK in 8123ms. The 8-second p99 latency was in there somewhere, but you had to scroll. There was no request id you could grep on, no db_runtime you could filter by, no JSON shape Datadog could parse without a regex parser tax. Half the team had to ssh into pods and tail logs by hand to find a single slow endpoint. That part of the morning took longer than the actual fix.
That afternoon I started cleaning it up. This is what I landed on.
Lograge is the smallest change you can make that pays for itself the next time you have an incident. One line of Rails logging becomes one structured event.
# Gemfile
gem "lograge", "~> 0.14"
gem "lograge-sql", "~> 2.0"
# config/initializers/lograge.rb
Rails.application.configure do
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.base_controller_class = ["ActionController::API", "ActionController::Base"]
config.lograge.custom_options = lambda do |event|
{
request_id: event.payload[:request_id],
user_id: event.payload[:user_id],
account_id: event.payload[:account_id],
params: event.payload[:params].except("controller", "action", "format", "id"),
exception: event.payload[:exception]&.first,
exception_message: event.payload[:exception]&.last,
time: Time.now.utc.iso8601(3),
}
end
end
The custom_options lambda is where you earn your money. Pull request_id off the payload and you can join a request across the Rails log, Sidekiq log, and Datadog APM trace with one click. Pull user_id and account_id and you can scope a noisy incident down to one tenant without writing a query against the database.
Then teach the controller how to populate that payload.
# app/controllers/application_controller.rb
class ApplicationController < ActionController::API
before_action :append_lograge_context
private
def append_lograge_context
request.env["lograge.custom_options"] = {
user_id: current_user&.id,
account_id: current_account&.id,
}
end
end
That’s the boring half. The interesting half is ActiveSupport::Notifications.
Rails has had a real instrumentation API since 4.0 and most teams I’ve seen don’t use it. They reach for Rails.logger.info inside the service object and call it observability. It isn’t.
ActiveSupport::Notifications is a pub/sub bus that ships with Rails. The framework already publishes events on it: sql.active_record, process_action.action_controller, perform.active_job, cache_read.active_support, and a dozen others. You can subscribe to any of them. You can also publish your own.
That’s the move. Stop logging from inside your service objects. Publish a notification. Then have one subscriber in config/initializers/instrumentation.rb that decides what to do with it.
# app/services/branded_app_submission_service.rb
class BrandedAppSubmissionService
def call(branded_app:, build:)
ActiveSupport::Notifications.instrument(
"bma.submission",
branded_app_id: branded_app.id,
build_id: build.id,
store: :apple,
) do |payload|
result = AppStoreConnect::SubmitBuild.new(build).call
payload[:submission_id] = result.id
payload[:status] = result.status
result
end
end
end
The instrument block wraps the work, measures duration, swallows nothing, and lets the subscriber pick the event up with a duration and a payload already attached. If the block raises, the subscriber still gets the event, with an :exception key.
# config/initializers/instrumentation.rb
ActiveSupport::Notifications.subscribe(/^bma\./) do |name, started_at, ended_at, _id, payload|
duration_ms = ((ended_at - started_at) * 1000).round(2)
StructuredLogger.info(
event: name,
duration_ms: duration_ms,
branded_app_id: payload[:branded_app_id],
build_id: payload[:build_id],
store: payload[:store],
submission_id: payload[:submission_id],
status: payload[:status],
error: payload[:exception]&.first,
)
Datadog::Tracing.trace(name) do |span|
span.set_tag("branded_app_id", payload[:branded_app_id])
span.set_tag("store", payload[:store])
end
end
StructuredLogger is a thin wrapper that emits a single JSON line to STDOUT and lets the platform layer ship it to Datadog or CloudWatch. The point isn’t the wrapper, it’s the shape: every event the application emits is one line of JSON with a request_id, a duration, and the domain ids you care about. No multi-line ASCII art.
This pattern earned its rent the first time we had a real incident.
On a Wednesday morning, the pending_apple_review Sidekiq queue started backing up. By lunch, a few hundred customer app builds were stuck in “Waiting for Review” on App Store Connect, but our internal pipeline thought they were submitted successfully. Apple’s Connect API was silently throttling our submission endpoint and returning 200 OK with a body that looked normal. The submission was being dropped on their side. Customer support had 80-plus tickets by 2 p.m. Pacific.
First fix was wrong. The pipeline already had auto-retry on 5xx, and we extended it to retry on “stuck” state too. Apple started seeing what looked like duplicate submissions. A chunk of customers ended up with two competing review records.
The real fix was a circuit breaker around the submit step that verifies submission state via a separate GET against the App Store Connect resource. Plus a one-shot reconciliation job to dedupe pending reviews against Apple’s source of truth. We were able to write that job in an afternoon because the bma.submission events were already in Datadog with submission_id attached. One query, sorted by branded_app_id, gave us the exact list of duplicates without writing a single SQL statement against the production database.
That’s what good instrumentation buys you. The data you need at 2 a.m. is already there.
Two quick filters worth setting once and forgetting.
# config/initializers/filter_parameter_logging.rb
Rails.application.config.filter_parameters += [
:password, :password_confirmation, :token, :access_token, :refresh_token,
:api_key, :secret, :authorization, :cookie, :credit_card, :ssn,
]
And a hard rule for the team. The service object emits notifications. The subscriber emits logs. The service object does not call Rails.logger directly. If a request needs more context, you add a key to the notification payload, not a new log line.
ActiveSupport::Notifications as the boundary between business code and observability code. Publish events, subscribe once.request_id, a duration, and the domain ids that scope the noise.Thanks for reading. If you’ve got thoughts, send them my way.