The N+1 That Cost a Fortune

A hidden serializer N+1 quietly turned every request into dozens of Aurora reads. Here's how we caught it, what the bill looked like, and the CI guardrails we shipped so it can't happen again.

It was a Wednesday afternoon at the creator economy platform I worked at, and the Aurora cost graph had a hump on it that nobody could explain. Not a spike. A hump. Multi-terabyte writer, healthy CPU, p99 fine. But total query volume on one of our reader replicas was sitting at roughly 2.3 million per hour for a single Rails endpoint that, on paper, should have been doing about thirty thousand.

I love a good hump. They’re always lying about something.

Where the queries were hiding

The endpoint was a community feed. A Rails controller, a Postgres-backed ActiveRecord model, a JSON serializer. Nothing exotic. The team that owned it had already done the obvious passes, includes(:author, :reactions, :tags), query plans looked fine in the test console, Datadog didn’t show any slow queries individually. Each query was 1 to 3 ms. The killer was that there were so many of them.

The actual culprit was a serializer method that, for each post, walked a polymorphic association to fetch the author’s “current badge”. And current_badge wasn’t defined as an association. It was a method.

class PostSerializer < ApplicationSerializer
  attributes :id, :body, :created_at, :author

  def author
    {
      id: object.author.id,
      name: object.author.name,
      avatar: object.author.avatar_url,
      current_badge: object.author.current_badge&.label,
    }
  end
end

class User < ApplicationRecord
  def current_badge
    badges.where(active: true).order(awarded_at: :desc).first
  end
end

includes doesn’t help you here. includes doesn’t know current_badge is a thing it should preload, because current_badge is a Ruby method that opens a fresh query every time it runs. Fifty posts per page, two badge lookups per post on average (the method ran twice in a couple of conditional branches I’d missed), three reader replicas spreading the load so latency looked fine. Multiply that out across the platform and you get a hump.

I’m not proud of how long it took me to find. I went looking in the slow query log first. There was nothing slow. There was just a lot.

The wrong fix I tried first

My first instinct was the lazy one. Cache the badge per user. Throw it in Redis with a five-minute TTL. Ship.

I caught myself before I sent the PR. Caching a symptom of an N+1 is one of those decisions that buys you a quiet week and a louder month. The badges were going to change. The TTL would either be too long (wrong badge shown when a user earned a new one) or too short (cache miss storm at peak hours, same problem in a new costume). Caching is a real tool, but it’s not a fix for “we’re doing fifty queries when we should be doing one”.

The real fix was a join, a select that grabbed the relevant badge alongside the post, and a small refactor so the serializer never had to ask the model a second time.

class PostsController < ApplicationController
  def index
    posts = Post
      .where(community_id: params[:community_id])
      .joins("LEFT JOIN LATERAL (
         SELECT b.label
         FROM badges b
         WHERE b.user_id = posts.author_id AND b.active = true
         ORDER BY b.awarded_at DESC
         LIMIT 1
       ) AS current_badge ON true")
      .select("posts.*, current_badge.label AS author_current_badge")
      .includes(:author, :reactions, :tags)
      .order(created_at: :desc)
      .limit(50)

    render json: posts, each_serializer: PostSerializer
  end
end

class PostSerializer < ApplicationSerializer
  attributes :id, :body, :created_at, :author

  def author
    {
      id: object.author.id,
      name: object.author.name,
      avatar: object.author.avatar_url,
      current_badge: object.author_current_badge,
    }
  end
end

Query count for the endpoint went from “roughly fifty per request” to four. Reader CPU dropped immediately on the next deploy. The Aurora line item on the next month’s bill stopped being a hump and went back to being a flat line. I’ll leave the exact savings out of this because they’re embarrassing to admit to in either direction, but they paid for a meaningful chunk of the team for the year.

Bullet in CI, not just in dev

Here’s the thing nobody owns up to. The Rails ecosystem has had bullet since basically forever. We had it installed. It logged things to a file in development. Nobody read the file.

So the second fix was structural. I wired bullet to fail tests in CI when an N+1 fires, instead of just nagging in dev.

# config/environments/test.rb
Rails.application.configure do
  config.after_initialize do
    Bullet.enable        = true
    Bullet.bullet_logger = true
    Bullet.raise         = true
    Bullet.add_safelist type: :n_plus_one_query,
                        class_name: "User",
                        association: :preferences
  end
end

# spec/support/bullet.rb
RSpec.configure do |config|
  config.before(:each) { Bullet.start_request }
  config.after(:each)  { Bullet.end_request }
end

A failing test in CI is annoying. A query bill that quietly doubles over a quarter is worse. The safelist exists because there’s always one or two legitimately-justified N+1s in any large codebase, and you want the team to have to defend them in a PR review instead of pretending they don’t exist.

strict_loading was the other half. We turned it on by default for new models, and progressively backfilled the old ones.

class Post < ApplicationRecord
  self.strict_loading_by_default = true
  belongs_to :author, class_name: "User"
  has_many :reactions
end

If a developer touches post.reactions without preloading it, Rails raises ActiveRecord::StrictLoadingViolationError in test and in development. In production we ran it in :log mode for a couple of weeks before flipping it to raise, because we needed to find the surprises without taking the site down.

Query budgets per endpoint

The last piece, and honestly the one I’m most proud of, was query budgets. Every endpoint that mattered got a declared budget. If a request blew past it, we logged a structured warning and, in CI, failed the request spec.

# app/middleware/query_budget.rb
class QueryBudget
  def initialize(app); @app = app; end

  def call(env)
    counter = QueryCounter.new
    ActiveSupport::Notifications.subscribed(counter, "sql.active_record") do
      @app.call(env)
    end
  ensure
    budget = env["action_dispatch.request.path_parameters"].then do |p|
      Rails.application.config.x.query_budgets["#{p[:controller]}##{p[:action]}"]
    end
    if budget && counter.count > budget
      Rails.logger.warn(
        msg: "query_budget_exceeded",
        endpoint: "#{env['REQUEST_METHOD']} #{env['PATH_INFO']}",
        actual: counter.count,
        budget: budget,
      )
    end
  end
end

# config/query_budgets.yml
posts#index: 8
posts#show: 6
communities#feed: 12
users#me: 4

Six months later, that file is the single most-edited config in the repo, and that’s a feature. Every time someone bumps a number, there’s a PR conversation about why. Most of the time the answer is “we genuinely need it”. Sometimes it’s “oh, that’s the badge thing again”. Either way, the bump is intentional, not invisible.

A second time this bit me

Quick second story, because this pattern shows up everywhere. Years before that, at the combat-sports tournament platform I CTO’d in London, we had a rankings page powered off Elasticsearch with PostgreSQL as the source of truth. Different stack, same shape of bug.

A federation tournament finished on a Saturday night. The new champion’s ranking didn’t update for eight hours. The athlete in question noticed before we did, tweeted a screenshot of our broken rankings page tagging the federation. Quiet logs. The rankings-indexer consumer was still reading off Kafka, just silently not writing to Elasticsearch because its bulk-write client had entered a circuit-open state after a transient cluster blip and never came back. We triggered a full reindex from Postgres, took about twenty-five minutes to catch up, then patched the breaker to attempt half-open every sixty seconds instead of staying open until restart.

Different bug, same lesson. Volume hides things. “The pipeline is consuming” was not the same as “the pipeline is doing its job”. The rankings page now has a freshness metric. The query budget file is the same idea, one stack lower.

Takeaways

N+1s in serializers don’t show up as slow queries. They show up as too many fast ones.
Caching a symptom of an N+1 buys you a quiet week and a louder month.
Bullet is worth almost nothing in dev and a lot in CI. Wire it to fail tests.
strict_loading_by_default on new models, then progressively on old ones. Run in log mode first.
Per-endpoint query budgets turn “the bill went up” into a PR conversation.
Measure freshness, not throughput. Consuming is not the same as doing the job.

Thanks for reading. If you’ve got thoughts, send them my way.