A hidden serializer N+1 quietly turned every request into dozens of Aurora reads. Here's how we caught it, what the bill looked like, and the CI guardrails we shipped so it can't happen again.
It was a Wednesday afternoon at the creator economy platform I worked at, and the Aurora cost graph had a hump on it that nobody could explain. Not a spike. A hump. Multi-terabyte writer, healthy CPU, p99 fine. But total query volume on one of our reader replicas was sitting at roughly 2.3 million per hour for a single Rails endpoint that, on paper, should have been doing about thirty thousand.
I love a good hump. They’re always lying about something.
The endpoint was a community feed. A Rails controller, a Postgres-backed ActiveRecord model, a JSON serializer. Nothing exotic. The team that owned it had already done the obvious passes, includes(:author, :reactions, :tags), query plans looked fine in the test console, Datadog didn’t show any slow queries individually. Each query was 1 to 3 ms. The killer was that there were so many of them.
The actual culprit was a serializer method that, for each post, walked a polymorphic association to fetch the author’s “current badge”. And current_badge wasn’t defined as an association. It was a method.
class PostSerializer < ApplicationSerializer
attributes :id, :body, :created_at, :author
def author
{
id: object.author.id,
name: object.author.name,
avatar: object.author.avatar_url,
current_badge: object.author.current_badge&.label,
}
end
end
class User < ApplicationRecord
def current_badge
badges.where(active: true).order(awarded_at: :desc).first
end
end
includes doesn’t help you here. includes doesn’t know current_badge is a thing it should preload, because current_badge is a Ruby method that opens a fresh query every time it runs. Fifty posts per page, two badge lookups per post on average (the method ran twice in a couple of conditional branches I’d missed), three reader replicas spreading the load so latency looked fine. Multiply that out across the platform and you get a hump.
I’m not proud of how long it took me to find. I went looking in the slow query log first. There was nothing slow. There was just a lot.
My first instinct was the lazy one. Cache the badge per user. Throw it in Redis with a five-minute TTL. Ship.
I caught myself before I sent the PR. Caching a symptom of an N+1 is one of those decisions that buys you a quiet week and a louder month. The badges were going to change. The TTL would either be too long (wrong badge shown when a user earned a new one) or too short (cache miss storm at peak hours, same problem in a new costume). Caching is a real tool, but it’s not a fix for “we’re doing fifty queries when we should be doing one”.
The real fix was a join, a select that grabbed the relevant badge alongside the post, and a small refactor so the serializer never had to ask the model a second time.
class PostsController < ApplicationController
def index
posts = Post
.where(community_id: params[:community_id])
.joins("LEFT JOIN LATERAL (
SELECT b.label
FROM badges b
WHERE b.user_id = posts.author_id AND b.active = true
ORDER BY b.awarded_at DESC
LIMIT 1
) AS current_badge ON true")
.select("posts.*, current_badge.label AS author_current_badge")
.includes(:author, :reactions, :tags)
.order(created_at: :desc)
.limit(50)
render json: posts, each_serializer: PostSerializer
end
end
class PostSerializer < ApplicationSerializer
attributes :id, :body, :created_at, :author
def author
{
id: object.author.id,
name: object.author.name,
avatar: object.author.avatar_url,
current_badge: object.author_current_badge,
}
end
end
Query count for the endpoint went from “roughly fifty per request” to four. Reader CPU dropped immediately on the next deploy. The Aurora line item on the next month’s bill stopped being a hump and went back to being a flat line. I’ll leave the exact savings out of this because they’re embarrassing to admit to in either direction, but they paid for a meaningful chunk of the team for the year.
Here’s the thing nobody owns up to. The Rails ecosystem has had bullet since basically forever. We had it installed. It logged things to a file in development. Nobody read the file.
So the second fix was structural. I wired bullet to fail tests in CI when an N+1 fires, instead of just nagging in dev.
# config/environments/test.rb
Rails.application.configure do
config.after_initialize do
Bullet.enable = true
Bullet.bullet_logger = true
Bullet.raise = true
Bullet.add_safelist type: :n_plus_one_query,
class_name: "User",
association: :preferences
end
end
# spec/support/bullet.rb
RSpec.configure do |config|
config.before(:each) { Bullet.start_request }
config.after(:each) { Bullet.end_request }
end
A failing test in CI is annoying. A query bill that quietly doubles over a quarter is worse. The safelist exists because there’s always one or two legitimately-justified N+1s in any large codebase, and you want the team to have to defend them in a PR review instead of pretending they don’t exist.
strict_loading was the other half. We turned it on by default for new models, and progressively backfilled the old ones.
class Post < ApplicationRecord
self.strict_loading_by_default = true
belongs_to :author, class_name: "User"
has_many :reactions
end
If a developer touches post.reactions without preloading it, Rails raises ActiveRecord::StrictLoadingViolationError in test and in development. In production we ran it in :log mode for a couple of weeks before flipping it to raise, because we needed to find the surprises without taking the site down.
The last piece, and honestly the one I’m most proud of, was query budgets. Every endpoint that mattered got a declared budget. If a request blew past it, we logged a structured warning and, in CI, failed the request spec.
# app/middleware/query_budget.rb
class QueryBudget
def initialize(app); @app = app; end
def call(env)
counter = QueryCounter.new
ActiveSupport::Notifications.subscribed(counter, "sql.active_record") do
@app.call(env)
end
ensure
budget = env["action_dispatch.request.path_parameters"].then do |p|
Rails.application.config.x.query_budgets["#{p[:controller]}##{p[:action]}"]
end
if budget && counter.count > budget
Rails.logger.warn(
msg: "query_budget_exceeded",
endpoint: "#{env['REQUEST_METHOD']} #{env['PATH_INFO']}",
actual: counter.count,
budget: budget,
)
end
end
end
# config/query_budgets.yml
posts#index: 8
posts#show: 6
communities#feed: 12
users#me: 4
Six months later, that file is the single most-edited config in the repo, and that’s a feature. Every time someone bumps a number, there’s a PR conversation about why. Most of the time the answer is “we genuinely need it”. Sometimes it’s “oh, that’s the badge thing again”. Either way, the bump is intentional, not invisible.
Quick second story, because this pattern shows up everywhere. Years before that, at the combat-sports tournament platform I CTO’d in London, we had a rankings page powered off Elasticsearch with PostgreSQL as the source of truth. Different stack, same shape of bug.
A federation tournament finished on a Saturday night. The new champion’s ranking didn’t update for eight hours. The athlete in question noticed before we did, tweeted a screenshot of our broken rankings page tagging the federation. Quiet logs. The rankings-indexer consumer was still reading off Kafka, just silently not writing to Elasticsearch because its bulk-write client had entered a circuit-open state after a transient cluster blip and never came back. We triggered a full reindex from Postgres, took about twenty-five minutes to catch up, then patched the breaker to attempt half-open every sixty seconds instead of staying open until restart.
Different bug, same lesson. Volume hides things. “The pipeline is consuming” was not the same as “the pipeline is doing its job”. The rankings page now has a freshness metric. The query budget file is the same idea, one stack lower.
strict_loading_by_default on new models, then progressively on old ones. Run in log mode first.Thanks for reading. If you’ve got thoughts, send them my way.