Rails Performance Profiling

How I actually profile a slow Rails endpoint at scale, with rack-mini-profiler, Stackprof, memory_profiler, and derailed_benchmarks.

It was a Tuesday morning at the creator economy platform I worked at, and one of the Community feed endpoints had quietly turned into a 2.4 second p99. Not on fire. Not paging anyone. Just the slow kind of bad that piles up over a quarter. A teammate flagged it in Slack, said “feed feels heavy on mobile, can you peek”. I picked it up between two squad meetings.

I want to walk through how I actually profile something like that. Not the textbook order. The order I run the tools in real life.

My position up front. Most Rails performance work I see online jumps straight to “add Redis” or “rewrite in Go”. Both wrong. The Rails profiling stack, used in order, will tell you within an hour where your time is actually going. After that the fix is usually small. Skip the measurement, rewrite the architecture, lose a quarter.

Start with rack-mini-profiler

Step one is always rack-mini-profiler. Not because it’s fancy, it isn’t. Because it puts a little badge in the top-left of the page that tells me the wall-clock breakdown of the request, the SQL fired, and which queries are duplicates. I want that in my face before I touch anything.

# Gemfile
group :development, :staging do
  gem "rack-mini-profiler", "~> 3.3"
  gem "memory_profiler", "~> 1.0"
  gem "stackprof", "~> 0.2.26"
  gem "flamegraph", "~> 0.9"
end

# config/initializers/mini_profiler.rb
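if defined?(Rack::MiniProfiler)
  Rack::MiniProfiler.config.position = "top-left"
  Rack::MiniProfiler.config.start_hidden = false
  # Outside development the badge only renders for requests that call
  # Rack::MiniProfiler.authorize_request (for example in a before_action).
  Rack::MiniProfiler.config.authorization_mode = :allow_authorized

  Rack::MiniProfiler.config.skip_paths = ["/healthz", "/assets"]

  # No manual middleware insert here: when Bundler auto-requires the gem,
  # its Railtie already adds Rack::MiniProfiler to the stack, and inserting
  # it again would put the middleware in the stack twice.
end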

On the Community feed endpoint, mini-profiler told me the story in about twenty seconds. Controller spent 180 ms in app code and 2.1 s in SQL. Of that SQL time, one query repeated 47 times. Classic N+1, hiding behind a has_many :through and a serializer that touched post.author.creator_profile per row. Most common slow-endpoint cause I run into in Rails, full stop.

I didn’t fix it yet. I was still measuring. If I’d shipped an includes(author: :creator_profile) right then, I’d have missed the second-worst thing, an allocation problem I only saw later.
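To make the duplicate detection concrete, here is a minimal sketch of the idea in plain Ruby: collapse each SQL string to its shape, then count the shapes. This is not mini-profiler's own code. QueryShapes, the sample queries, and the author_id column are all hypothetical. In a real app the strings could come from an ActiveSupport::Notifications subscriber on "sql.active_record".

```ruby
# Hypothetical helper: group captured SQL strings by shape to spot N+1 patterns.
module QueryShapes
  # Replace literal values so queries that differ only by id collapse together.
  def self.normalize(sql)
    sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?").squeeze(" ").strip
  end

  # Returns { shape => count } for shapes seen more than `threshold` times.
  def self.repeated(sqls, threshold: 1)
    sqls.map { |sql| normalize(sql) }.tally.select { |_, count| count > threshold }
  end
end

# One feed query, then one profile lookup per row (made-up SQL for illustration).
queries = ['SELECT "posts".* FROM "posts" WHERE "posts"."community_id" = 42 LIMIT 50']
queries += (1..47).map do |id|
  %(SELECT "creator_profiles".* FROM "creator_profiles" WHERE "creator_profiles"."author_id" = #{id} LIMIT 1)
end

QueryShapes.repeated(queries).each { |shape, count| puts "#{count}x #{shape}" }
```

Any shape with a count in the double digits on a single request is a strong hint that a preload is missing.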

Stackprof for CPU

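When the SQL story doesn’t fully add up, or the controller time itself looks too high, Stackprof comes next. Stackprof is a sampling profiler. It samples the call stack on a timer, gives you a profile you can convert to a flamegraph, and tells you where the time actually goes. In :cpu mode it only counts time spent on-CPU. In :wall mode, which the middleware here uses, it also counts time spent blocked on I/O, and that is usually what you want for a slow request.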
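If "samples the call stack on a timer" sounds abstract, this toy wall-clock sampler shows the principle in plain Ruby. It is an illustration only, not how Stackprof works internally (Stackprof is a C extension driven by timer signals). The names sample_wall, slow_io, and render_feed are made up for the example.

```ruby
# Toy wall-clock sampler: a background thread snapshots the caller's stack
# every `interval` seconds and counts which methods were on it.
def sample_wall(interval: 0.005)
  target = Thread.current
  counts = Hash.new(0)
  running = true

  sampler = Thread.new do
    while running
      # Count each method on the stack once per sample (inclusive time).
      frames = target.backtrace_locations || []
      frames.map(&:base_label).uniq.each { |label| counts[label] += 1 }
      sleep interval
    end
  end

  yield
  counts
ensure
  running = false
  sampler&.join
end

def slow_io
  sleep 0.1 # stands in for a blocking SQL round trip
end

def render_feed
  slow_io
end

counts = sample_wall { render_feed }
counts.sort_by { |_, n| -n }.first(5).each { |label, n| puts "#{n.to_s.rjust(4)} #{label}" }
```

Because the sampler counts wall-clock time, the sleeping slow_io call shows up in the counts. A CPU-only profile would mostly miss it, which is the reason to prefer wall mode when the request is waiting on the database.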

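# config/initializers/stackprof.rb
if Rails.env.staging? && ENV["STACKPROF"] == "1"
  require "stackprof"

  # Rack middleware that profiles only the Community feed paths.
  class StackprofFeedMiddleware
    def initialize(app)
      @app = app
    end

    def call(env)
      return @app.call(env) unless env["PATH_INFO"].start_with?("/communities/")

      response = nil
      # StackProf.run returns the profile, not the block's value,
      # so capture the Rack response inside the block.
      profile = StackProf.run(mode: :wall, raw: true, interval: 1_000) do
        response = @app.call(env)
      end

      path = Rails.root.join("tmp", "stackprof-#{Process.pid}-#{Time.now.to_i}.dump")
      File.binwrite(path, Marshal.dump(profile))
      Rails.logger.info("stackprof_dump=#{path}")
      response
    end
  end

  # middleware.use expects a class that takes the next app in its initializer.
  Rails.application.config.middleware.use(StackprofFeedMiddleware)
end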

Two notes. I gate it on STACKPROF=1 so I can flip it on for a single pod with an env change and a restart, no redeploy. And I dump the raw profile to disk so I can pull it locally and run stackprof --d3-flamegraph tmp/stackprof-*.dump > flame.html (the plain --flamegraph flag emits data for stackprof’s separate viewer, not a standalone HTML page). Looking at SQL time in mini-profiler is one thing. Watching the flamegraph show 31 percent of wall time in ActiveModel::Serializer#attributes is another.

The serializer was doing the right work but the wrong way, dup-ing inside as_json and rebuilding for every nested association. Switching to Oj.dump on a plain Hash, with explicit attribute lists, cut serialization time by roughly 60 percent. Same data, fewer allocations.
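Roughly, the shape of that change looks like the sketch below. It is not the production serializer: the Struct models, the attribute list, and the internal_score field are hypothetical stand-ins. It also uses the standard library's JSON.generate so it runs without extra gems. With the oj gem installed, the last step would be Oj.dump(payload, mode: :compat) instead.

```ruby
require "json"

# Hypothetical stand-ins for the ActiveRecord models in this post.
CreatorProfile = Struct.new(:display_name, :avatar_url, keyword_init: true)
Author = Struct.new(:id, :creator_profile, keyword_init: true)
Post = Struct.new(:id, :title, :body, :author, :internal_score, keyword_init: true)

# Explicit attribute list: only these fields are ever read or emitted.
POST_ATTRIBUTES = %i[id title body].freeze

# Build one plain Hash per post, with no dup-ing and no nested serializer objects.
def post_to_hash(post)
  hash = POST_ATTRIBUTES.to_h { |attr| [attr, post.public_send(attr)] }
  profile = post.author.creator_profile
  hash[:author] = { id: post.author.id, display_name: profile.display_name }
  hash
end

def serialize(posts)
  payload = posts.map { |post| post_to_hash(post) }
  JSON.generate(payload) # swap for Oj.dump(payload, mode: :compat) when using oj
end

example = Post.new(
  id: 1, title: "Hello", body: "First post", internal_score: 0.9,
  author: Author.new(id: 7, creator_profile: CreatorProfile.new(display_name: "Ada", avatar_url: "a.png"))
)
puts serialize([example])
```

The explicit list also keeps fields like internal_score out of the response unless someone adds them on purpose.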

memory_profiler for the alloc storm

Speaking of allocations. CPU is one budget. Memory is the other, and a Rails request that allocates a million objects pays for it later when GC runs in the middle of someone else’s request. memory_profiler is the tool I reach for when Stackprof shows time inside its (garbage collection) frames or when p99 is jittery while p50 looks fine.

class Communities::PostsController < ApplicationController
  def index
    if params[:profile_memory] == "1" && Rails.env.staging?
      # Profile only the load + serialize path. serialize is this app's
      # own helper (not shown here).
      report = MemoryProfiler.report do
        @posts = load_posts.load # force the query to run inside the report
        @json = serialize(@posts)
      end
      report.pretty_print(to_file: "tmp/memprof-#{Time.now.to_i}.txt")
      render plain: "memory profile written"
    else
      @posts = load_posts
      render json: serialize(@posts)
    end
  end

  private

  def load_posts
    Post
      .joins(:author)
      .includes(author: :creator_profile)
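      # Assumes nested routes (/communities/:community_id/posts), where the
      # community id arrives as params[:community_id] rather than params[:id].
      .where(community_id: params[:community_id])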
      .order(created_at: :desc)
      .limit(50)
  end
end

Before the fix, the report showed that endpoint allocating about 1.1 million objects per request. After cutting the N+1 and switching the serializer, it dropped to roughly 38 thousand. GC pauses in p99 fell off the chart that week, which mattered more for the actual user experience than the SQL fix did.
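For a quick cross-check that needs no gem, Ruby's own GC counters can give a rough allocation count for a block. This is a much coarser tool than memory_profiler (no per-file or per-class breakdown), and the allocations helper and sample rows are hypothetical, loosely modeled on the dup-heavy serializer path described earlier.

```ruby
# Count objects allocated while the block runs, using the VM's running total.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

rows = Array.new(1_000) { |i| { id: i, title: "post #{i}" } }

# Copies every row twice (dup, then merge) before using it.
wasteful = allocations { rows.map { |row| row.dup.merge(kind: "post") } }

# Reads only the field it needs and allocates almost nothing per row.
lean = allocations { rows.map { |row| row[:id] } }

puts "dup + merge: #{wasteful} objects, explicit fields: #{lean} objects"
```

The absolute numbers vary by Ruby version. The ratio between the two paths is the useful signal.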

derailed_benchmarks for boot and request shape

The fourth tool is the one I see used least but recommend most. derailed_benchmarks gives you boot time, gem-by-gem load cost, and a “hit one endpoint a thousand times” harness that’s much closer to how production actually feels.

# Gemfile
group :development do
  gem "derailed_benchmarks"
end

# .env (used by derailed)
USE_SERVER=puma
PATH_TO_HIT=/communities/42/posts
TEST_COUNT=1000

# commands
# bundle exec derailed bundle:mem
# bundle exec derailed exec perf:ips
# bundle exec derailed exec perf:mem_over_time

derailed bundle:mem is how I caught a gem we’d added six months earlier loading 22 MB of YAML at boot, on every pod, on every cold start. Pulling it saved that much resident memory per pod across thousands of pods. The boot-time win mattered more than the RAM: deploys rolled out faster and the mixed-version window shrank.
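The “hit one endpoint a thousand times” idea is simple enough to sketch by hand, which helps when reading p50 and p99 numbers from any harness. The following is not derailed's implementation. time_calls and percentile are hypothetical helpers, and the block is a placeholder for a real request.

```ruby
# Run the block `count` times and return each duration in milliseconds.
def time_calls(count)
  Array.new(count) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  end
end

# Nearest-rank percentile: the smallest sample that covers pct percent of runs.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Placeholder workload; in practice this would issue the HTTP request.
durations = time_calls(1_000) { Array.new(100) { |i| i.to_s }.join }

puts format("p50 %.3f ms, p99 %.3f ms", percentile(durations, 50), percentile(durations, 99))
```

A p99 that sits far above p50 on an otherwise identical workload is the jitter pattern that pointed at GC earlier in this post.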

Takeaways

rack-mini-profiler first. SQL count and duplicate detection beat every other first move.
Stackprof when controller time looks high or the SQL story doesn’t add up. Wall-mode, flamegraph, read it.
memory_profiler when p99 is jittery, when GC time shows up in flames, or when you want to know what serializers really cost.
derailed_benchmarks for boot, gem load, and load-style request shape. Catches the things the others don’t.
The profile is the fix’s foundation. Skip it and you’ll scale up readers when the writer is the problem.

Thanks for reading. If you’ve got thoughts, send them my way.