How I actually profile a slow Rails endpoint at scale, with rack-mini-profiler, Stackprof, memory_profiler, and derailed_benchmarks.
It was a Tuesday morning at the creator economy platform I worked at, and one of the Community feed endpoints had quietly turned into a 2.4 second p99. Not on fire. Not paging anyone. Just the slow kind of bad that piles up over a quarter. A teammate flagged it in Slack, said “feed feels heavy on mobile, can you peek”. I picked it up between two squad meetings.
I want to walk through how I actually profile something like that. Not the textbook order. The order I run the tools in real life.
My position up front. Most Rails performance work I see online jumps straight to “add Redis” or “rewrite in Go”. Both wrong. The Rails profiling stack, used in order, will tell you within an hour where your time is actually going. After that the fix is usually small. Skip the measurement, rewrite the architecture, lose a quarter.
Step one is always rack-mini-profiler. Not because it’s fancy, it isn’t. Because it puts a little badge in the top-left of the page that tells me the wall-clock breakdown of the request, the SQL fired, and which queries are duplicates. I want that in my face before I touch anything.
# Gemfile
group :development, :staging do
  gem "rack-mini-profiler", "~> 3.3"
  gem "memory_profiler", "~> 1.0"
  gem "stackprof", "~> 0.2.26"
  gem "flamegraph", "~> 0.9"
end
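# config/initializers/mini_profiler.rb
if defined?(Rack::MiniProfiler)
  Rack::MiniProfiler.config.position = "top-left"
  Rack::MiniProfiler.config.start_hidden = false
  # With :allow_authorized, the badge only renders for requests where
  # Rack::MiniProfiler.authorize_request was called (e.g. in a before_action).
  Rack::MiniProfiler.config.authorization_mode = :allow_authorized
  Rack::MiniProfiler.config.skip_paths = ["/healthz", "/assets"]
  # No manual middleware insert here: the gem's Railtie already adds
  # Rack::MiniProfiler to the stack, and inserting it again would run it twice.
end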
On the Community feed endpoint, mini-profiler told me the story in about twenty seconds. Controller spent 180 ms in app code and 2.1 s in SQL. Of that SQL time, one query repeated 47 times. Classic N+1, hiding behind a has_many :through and a serializer that touched post.author.creator_profile per row. Most common slow-endpoint cause I run into in Rails, full stop.
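To make the "one query repeated 47 times" signal concrete, here is a rough sketch of the grouping idea in plain Ruby. It is not how rack-mini-profiler implements duplicate detection; normalize_sql and duplicate_queries are hypothetical helpers, and the SQL strings are made up to resemble the feed endpoint.

```ruby
# Hypothetical helper: replace literal values with "?" so that queries
# differing only in their bound values collapse into one group.
def normalize_sql(sql)
  sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?").squeeze(" ").strip
end

# Returns { normalized_sql => count } for every query shape seen more
# than once, most frequent first.
def duplicate_queries(sql_log)
  sql_log
    .map { |sql| normalize_sql(sql) }
    .tally
    .select { |_sql, count| count > 1 }
    .sort_by { |_sql, count| -count }
    .to_h
end

# Made-up log: one posts query, then one profile lookup per author.
log = ['SELECT "posts".* FROM "posts" WHERE "posts"."community_id" = 42 LIMIT 50']
log += (1..47).map do |author_id|
  %(SELECT "creator_profiles".* FROM "creator_profiles" WHERE "creator_profiles"."author_id" = #{author_id} LIMIT 1)
end

duplicate_queries(log).each { |sql, count| puts "#{count}x #{sql}" }
```

One query shape with a count that tracks the page size is the N+1 fingerprint. The posts query appears once, so it never shows up in the duplicates.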
I didn’t fix it yet. I was still measuring. If I’d shipped an includes(author: :creator_profile) there and called it done, I’d have missed the second-worst thing, an allocation problem I only saw later.
When the SQL story doesn’t fully add up, or the controller time itself looks too high, Stackprof comes next. Stackprof is a sampling profiler. It samples the call stack on a timer, gives you a profile you can convert to a flamegraph, and tells you where the time actually goes: CPU time in :cpu mode, wall-clock time in :wall mode.
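If "samples the call stack" sounds abstract, this is roughly how a pile of samples turns into the percentages you read off a profile. It is a toy sketch, not Stackprof's internals; the stacks and method names below are invented for illustration.

```ruby
# Builds a flat profile from sampled stacks (outermost frame first).
# For every frame: in how many samples it appeared anywhere on the
# stack (total) and how often it was the innermost frame (self).
def flat_profile(samples)
  profile = Hash.new { |hash, frame| hash[frame] = { total: 0, self: 0 } }
  samples.each do |stack|
    next if stack.empty?
    stack.uniq.each { |frame| profile[frame][:total] += 1 }
    profile[stack.last][:self] += 1
  end
  profile
end

def print_profile(samples)
  flat_profile(samples)
    .sort_by { |_frame, counts| -counts[:total] }
    .each do |frame, counts|
      pct = (100.0 * counts[:total] / samples.size).round
      puts format("%-28s total=%3d%% self=%d", frame, pct, counts[:self])
    end
end

# Four invented samples, as if the profiler had fired four times.
samples = [
  %w[PostsController#index Serializer#attributes Author#creator_profile],
  %w[PostsController#index Serializer#attributes Author#creator_profile],
  %w[PostsController#index Serializer#attributes],
  %w[PostsController#index Post.load]
]

print_profile(samples)
```

A frame with a high total and a low self count is a caller of something expensive. A frame with a high self count is where the time is actually being burned.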
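# config/initializers/stackprof.rb
if Rails.env.staging? && ENV["STACKPROF"] == "1"
  require "stackprof"

  # Profiles requests under /communities/ and writes one dump per request.
  class StackprofDumpMiddleware
    def initialize(app)
      @app = app
    end

    def call(env)
      return @app.call(env) unless env["PATH_INFO"].start_with?("/communities/")

      response = nil
      # StackProf.run returns the profile, not the block's value,
      # so the Rack response has to be captured separately.
      profile = StackProf.run(mode: :wall, raw: true, interval: 1_000) do
        response = @app.call(env)
      end
      path = "tmp/stackprof-#{Process.pid}-#{Time.now.to_i}.dump"
      File.binwrite(path, Marshal.dump(profile))
      Rails.logger.info(stackprof_dump: path)
      response
    end
  end

  # middleware.use expects a middleware class, not a Rack::Builder instance.
  Rails.application.config.middleware.use(StackprofDumpMiddleware)
end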
Two notes. I gate it on STACKPROF=1 so I can flip it on per-pod without redeploying. And I dump the raw profile to disk so I can pull it locally and run stackprof --d3-flamegraph tmp/stackprof-*.dump > flame.html (plain --flamegraph emits data for the separate --flamegraph-viewer step, not a standalone HTML page). Looking at SQL time in mini-profiler is one thing. Watching the flamegraph show 31 percent of wall time in ActiveModel::Serializer#attributes is another.
The serializer was doing the right work but the wrong way, dup-ing inside as_json and rebuilding for every nested association. Switching to Oj.dump on a plain Hash, with explicit attribute lists, cut serialization time by roughly 60 percent. Same data, fewer allocations.
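A minimal sketch of the "plain Hash with explicit attribute lists" shape. Two loud assumptions: the structs and field names (title, display_name, avatar_url) are hypothetical stand-ins for the real models, and I use the standard library's JSON.generate here so the snippet runs without extra gems. In the real endpoint that final call was Oj.dump.

```ruby
require "json"
require "time"

# Hypothetical stand-ins for the ActiveRecord models.
CreatorProfile = Struct.new(:display_name, :avatar_url, keyword_init: true)
Author = Struct.new(:id, :creator_profile, keyword_init: true)
Post = Struct.new(:id, :title, :created_at, :author, keyword_init: true)

# Builds exactly the fields the feed needs. No as_json, no dup,
# no walking every association the model happens to have.
def post_payload(post)
  profile = post.author.creator_profile
  {
    id: post.id,
    title: post.title,
    created_at: post.created_at.utc.iso8601,
    author: {
      id: post.author.id,
      display_name: profile.display_name,
      avatar_url: profile.avatar_url
    }
  }
end

# One Hash per post, one encoder call for the whole page.
def feed_json(posts)
  JSON.generate(posts.map { |post| post_payload(post) })
end

profile = CreatorProfile.new(display_name: "Ana", avatar_url: "/avatars/7.png")
author = Author.new(id: 7, creator_profile: profile)
posts = [Post.new(id: 1, title: "Hello", created_at: Time.utc(2024, 1, 2, 3, 4, 5), author: author)]

puts feed_json(posts)
```

The point of the explicit list is that adding a column to the table never silently adds work to the endpoint.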
Speaking of allocations. CPU is one budget. Memory is the other, and a Rails request that allocates a million objects pays for it later when GC runs in the middle of someone else’s request. memory_profiler is the tool I reach for when Stackprof shows a fat (garbage collection) frame or when p99 is jittery while p50 looks fine.
class Communities::PostsController < ApplicationController
  def index
    if params[:profile_memory] == "1" && Rails.env.staging?
      # Load and serialize inside the block so both show up in the report.
      report = MemoryProfiler.report do
        @posts = load_posts
        @json = serialize(@posts)
      end
      report.pretty_print(to_file: "tmp/memprof-#{Time.now.to_i}.txt")
      render plain: "memory profile written"
    else
      @posts = load_posts
      render json: serialize(@posts)
    end
  end

  private

  # serialize(posts) is a private helper that is not shown here.

  def load_posts
    Post
      .joins(:author)
      .includes(author: :creator_profile)
      .where(community_id: params[:id])
      .order(created_at: :desc)
      .limit(50)
  end
end
The report on that endpoint, before the fix, allocated about 1.1 million objects per request. After cutting the N+1 and switching the serializer, it dropped to roughly 38 thousand. GC pauses in p99 fell off the chart that week, which mattered more for the actual user experience than the SQL fix did.
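When I only want the headline number and not the full memory_profiler report, a cruder check is often enough. This sketch assumes CRuby, where GC.stat exposes a cumulative total_allocated_objects counter. allocations_during is a hypothetical helper and the rows are made up.

```ruby
# Returns how many objects were allocated while the block ran.
# The counter is cumulative, so a GC run in the middle does not reset it.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

rows = Array.new(1_000) { |i| { id: i, name: "author-#{i}" } }

# Wasteful: a defensive dup per row, then a merged copy on top of it.
wasteful = allocations_during do
  rows.map(&:dup).map { |row| row.merge(label: row[:name].upcase) }
end

# Leaner: one pass, one new Hash per row.
lean = allocations_during do
  rows.map { |row| { id: row[:id], label: row[:name].upcase } }
end

puts "wasteful: #{wasteful} objects, lean: #{lean} objects"
```

It won't tell you which line allocated what. It will tell you in a second whether a change moved the number, which is usually the question I'm asking.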
The fourth tool is the one I see used least but recommend most. derailed_benchmarks gives you boot time, gem-by-gem load cost, and a “hit one endpoint a thousand times” harness that’s much closer to how production actually feels.
# Gemfile
group :development do
  gem "derailed_benchmarks"
end
# .env (used by derailed)
USE_SERVER=puma
PATH_TO_HIT=/communities/42/posts
TEST_COUNT=1000
# commands
# bundle exec derailed bundle:mem
# bundle exec derailed exec perf:ips
# bundle exec derailed exec perf:mem_over_time
derailed bundle:mem is how I caught a gem we’d added six months earlier loading 22 MB of YAML at boot, on every pod, on every cold start. Pulling it saved that much resident memory per pod across thousands of pods. The boot-time win mattered more than the RAM: deploys rolled out faster and the mixed-version window shrank.
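The "hit one endpoint a thousand times" idea is simple enough to sketch in plain Ruby. This is not derailed's implementation, just the shape of it: time_repeated and percentile are hypothetical helpers, and the block stands in for a real request.

```ruby
# Nearest-rank percentile over a list of timings. pct is 1..100.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Runs the block `count` times and returns each run's duration in ms,
# using the monotonic clock so wall-clock adjustments don't skew it.
def time_repeated(count)
  Array.new(count) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  end
end

# Stand-in workload; in real life this would be a request to the endpoint.
timings = time_repeated(1_000) { (1..200).map(&:to_s).join(",") }

puts format("p50=%.3fms p99=%.3fms", percentile(timings, 50), percentile(timings, 99))
```

A single timed request tells you almost nothing about p99. A thousand of them, sorted, tells you whether the tail is GC, a cold cache, or just noise.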
Thanks for reading. If you’ve got thoughts, send them my way.