Russian doll fragment caching, counter caches, and surrogate-key CDN invalidation on a read-heavy creator page that wouldn't sit still under load.
Thursday afternoon at the creator economy platform I worked at. A creator’s course landing page was the most-hit route on the platform that week, and our p99 had drifted from ~140 ms to over 1.8 s in three days. No deploy on the path. No DB regression. Just a slow, ugly creep that the dashboards had started flagging in red.
The view pulled the course, the creator, their pricing, the lesson list, social proof, and an “X students enrolled” counter. Every chunk was its own DB call. Every render was rebuilding HTML that hadn’t changed since the creator last edited the page three weeks earlier.
Five caching layers, in the order I reach for them in a Rails app. Counter caches at the base, Russian doll fragment caching above, low-level cache for hot lookups, HTTP edge caching on Cloudflare, and surrogate-key invalidation tying it together. Skip any one and your read path eventually buckles.
The cheapest caching layer in Rails isn’t Rails.cache.fetch. It’s counter_cache: true on the association. If your hot view counts children of a parent record, you’re paying for that count in DB time on every request unless you store it.
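# app/models/course.rb
class Course < ApplicationRecord
  has_many :enrollments, dependent: :destroy
  has_many :lessons, dependent: :destroy
  # enrollments_count and lessons_count are denormalized columns
  # on the courses table. Active Record keeps them honest, as long as
  # the belongs_to side declares counter_cache: true.
end

# app/models/enrollment.rb
class Enrollment < ApplicationRecord
  # Counter updates on their own do not bump courses.updated_at.
  # Add touch: true here if course-keyed fragments must refresh on enrollment.
  belongs_to :course, counter_cache: true
  belongs_to :user
end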
The migration:
class AddCountersToCourses < ActiveRecord::Migration[7.1]
  def change
    add_column :courses, :enrollments_count, :integer, default: 0, null: false
    add_column :courses, :lessons_count, :integer, default: 0, null: false

    reversible do |dir|
      dir.up do
        execute <<~SQL
          UPDATE courses c SET
            enrollments_count = (SELECT COUNT(*) FROM enrollments WHERE course_id = c.id),
            lessons_count = (SELECT COUNT(*) FROM lessons WHERE course_id = c.id)
        SQL
      end
    end
  end
end
That backfill matters. I’ve seen counter caches added without it, sitting at zero for three weeks before someone noticed the homepage said “0 students enrolled” on every course. We added a nightly Sidekiq reconciler that picks 1% of rows at random and re-counts. Quiet, boring, catches the bugs.
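Here is roughly what that reconciler does, as a minimal sketch in plain Ruby. The `CounterReconciler` class, the `rows` hash (standing in for the courses table) and the `recount` callable (standing in for a `COUNT(*)` query) are all assumptions for illustration, not the job we actually ran.

```ruby
# Sampling reconciler sketch. `rows` maps course id => cached enrollments_count,
# `recount` returns the true count for an id. Both are hypothetical stand-ins
# so the example runs without a database.
class CounterReconciler
  def initialize(sample_rate: 0.01, rng: Random.new)
    @sample_rate = sample_rate
    @rng = rng
  end

  # Corrects drifted counters in place and returns the ids that were wrong.
  def run(rows, recount)
    sample_size = [(rows.size * @sample_rate).ceil, 1].max
    sampled_ids = rows.keys.sample(sample_size, random: @rng)

    sampled_ids.select do |id|
      actual = recount.call(id)
      next false if rows[id] == actual

      rows[id] = actual # overwrite the cached value with the real count
      true
    end
  end
end
```

In a real Rails app the correction step would most likely be `Course.reset_counters(id, :enrollments)`, and the returned ids are worth logging, because a non-empty list means something is writing around the callbacks.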
Counter caches kill the cheap reads. Fragment caching kills the expensive renders. Russian doll means the fragments are nested, and each fragment’s cache key includes the updated_at of the record it wraps. When the course changes, the course fragment busts, but its lessons (nested fragments) stay warm. When a single lesson changes, that lesson’s fragment busts along with the fragments wrapped around it, and the rebuild is cheap because every sibling lesson is still in the cache.
<%# app/views/courses/show.html.erb %>
<% cache(@course, expires_in: 1.day) do %>
  <article class="course">
    <header>
      <h1><%= @course.title %></h1>
      <p><%= @course.enrollments_count %> students enrolled</p>
    </header>
    <% cache(["lesson-list", @course]) do %>
      <ol class="lessons">
        <% @course.lessons.includes(:section).each do |lesson| %>
          <% cache(lesson) do %>
            <li>
              <%= lesson.title %>
              <span class="duration"><%= lesson.duration_minutes %> min</span>
            </li>
          <% end %>
        <% end %>
      </ol>
    <% end %>
  </article>
<% end %>
You need touch: true on the child associations so that updating a lesson bumps the course’s updated_at and the outer fragment knows to invalidate.
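class Lesson < ApplicationRecord
  # touch: true bumps the course's updated_at when a lesson is saved.
  # counter_cache: true is what keeps courses.lessons_count in sync.
  belongs_to :course, touch: true, counter_cache: true
end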
This single change took rendering on that landing page from ~600 ms to ~40 ms on a cold-cache request, under 10 ms on a warm one. Russian doll wins when the page has structure. It doesn’t save you if the data underneath is shaped wrong, which is the whole point of the counter cache layer underneath it.
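To make the key mechanics concrete, here is a small model in plain Ruby. `Record` is a hypothetical stand-in for an Active Record model, `version` plays the role of updated_at, and `touch` mimics what `touch: true` does on save. It is a sketch of the idea, not how Rails builds keys internally.

```ruby
# Plain-Ruby model of Russian doll cache keys.
Record = Struct.new(:name, :id, :version, :parent) do
  # The key changes whenever the record's version changes.
  def cache_key
    "#{name}/#{id}-#{version}"
  end

  # Bump this record, then propagate upward like belongs_to ..., touch: true.
  def touch
    self.version += 1
    parent&.touch
  end
end

course   = Record.new("courses", 1, 1, nil)
lesson_a = Record.new("lessons", 10, 1, course)
lesson_b = Record.new("lessons", 11, 1, course)

lesson_a.touch

puts course.cache_key   # outer fragment key changed, so it re-renders
puts lesson_a.cache_key # edited lesson key changed, so it re-renders
puts lesson_b.cache_key # sibling key unchanged, so it is served from cache
```

The outer fragment re-renders, but the only inner work is the one lesson that changed. Everything else is a cache read.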
Some lookups are too small for fragment caching and too hot for the DB. Pricing tier definitions, feature flags scoped to a creator, the “is this creator allowed to use this beta module” check. Rails gives you Rails.cache.fetch for this and you should reach for it freely, with one rule. Always set an expiry.
# app/models/pricing_tier.rb
class PricingTier < ApplicationRecord
  def self.for_creator(creator_id)
    Rails.cache.fetch(["pricing_tier", creator_id], expires_in: 15.minutes) do
      joins(:creator_pricing_tiers)
        .where(creator_pricing_tiers: { creator_id: creator_id })
        .first
    end
  end
end
We had an incident a while back where a Rails.cache.fetch without an expiry, on a key that wasn’t being explicitly busted on update, kept serving a stale pricing tier for three days after a creator changed their plan. The fix isn’t “remember to bust the key everywhere.” It’s “set a TTL, always.” Defense in depth. Even if your invalidation is right, an expiry caps the blast radius.
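One way to enforce the rule is to make the TTL impossible to forget. The sketch below is a toy in-memory cache where `expires_in` is a required keyword. `TtlCache` and its injectable clock are made up for illustration; in a real app this would be a thin wrapper around `Rails.cache.fetch`.

```ruby
# Minimal in-memory cache with a mandatory TTL (hypothetical, for illustration).
class TtlCache
  Entry = Struct.new(:value, :expires_at)

  # The clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @store = {}
  end

  # expires_in (seconds) is a required keyword: omitting it raises ArgumentError.
  def fetch(key, expires_in:)
    now = @clock.call
    entry = @store[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @store[key] = Entry.new(value, now + expires_in)
    value
  end
end
```

With a 15-minute TTL, a missed invalidation serves the old pricing tier for at most 15 minutes instead of three days.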
Fragment caching saves the render. Low-level cache saves the lookup. Neither saves the network round trip from the user’s browser to the Rails app. For a read-heavy public page that gets hit by thousands of unique visitors per minute, you want the response cached at the edge.
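I built the Cloudflare Workers edge layer for creator profile pages at a live-video creator platform I led engineering at, a couple of years before this gig. Same shape applies. The Rails app sets a Cache-Control header plus a Surrogate-Key header listing every resource the response depends on. Cloudflare reads the cache directive, stores the response, and indexes it by surrogate keys. One caveat on naming: Surrogate-Key is the header Fastly uses. Cloudflare’s built-in purge-by-tag reads a comma-separated Cache-Tag header instead, so on Cloudflare you either emit Cache-Tag from Rails or have a Worker translate one into the other.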
# app/controllers/courses_controller.rb
class CoursesController < ApplicationController
  def show
    @course = Course.find_by!(slug: params[:slug])
    response.headers["Cache-Control"] = "public, max-age=60, s-maxage=3600"
    response.headers["Surrogate-Key"] = surrogate_keys_for(@course).join(" ")
  end

  private

  def surrogate_keys_for(course)
    [
      "course/#{course.id}",
      "creator/#{course.creator_id}",
      *course.lessons.pluck(:id).map { |id| "lesson/#{id}" }
    ]
  end
end
The s-maxage is the long lease the CDN holds. The shorter max-age keeps browsers from holding stale content for too long without a revalidation. The split matters. You almost never want browsers caching as aggressively as your CDN.
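If it helps to picture what the edge does with those two headers, here is a toy model in plain Ruby. `EdgeCache` is entirely hypothetical and is not a Cloudflare API. It only shows the bookkeeping: store a response for s-maxage seconds, and remember which URLs carry which surrogate key.

```ruby
require "set"

# Toy edge cache: honors s-maxage and keeps a surrogate-key => URLs index.
class EdgeCache
  def initialize
    @responses = {}                                   # url => { body:, expires_at: }
    @urls_by_key = Hash.new { |h, k| h[k] = Set.new } # key => set of urls
  end

  # Stores the response only if the origin sent an s-maxage directive.
  def store(url, body, headers, now:)
    ttl = headers.fetch("Cache-Control", "")[/s-maxage=(\d+)/, 1]
    return unless ttl

    @responses[url] = { body: body, expires_at: now + ttl.to_i }
    headers.fetch("Surrogate-Key", "").split.each { |key| @urls_by_key[key] << url }
  end

  # Returns the cached body, or nil when missing or expired.
  def lookup(url, now:)
    entry = @responses[url]
    entry[:body] if entry && entry[:expires_at] > now
  end

  # Drops every response tagged with the key and returns how many were dropped.
  def purge(key)
    urls = @urls_by_key.delete(key) || Set.new
    urls.each { |url| @responses.delete(url) }
    urls.size
  end
end
```

The index is the part that matters. Without it, the only ways to get a stale page out of the edge are to wait out the TTL or to know every URL it was served under.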
The Cache-Control header is the contract going out. The invalidation hook is the contract coming back. When a creator edits a course, you don’t purge by URL. You purge by surrogate key, and Cloudflare drops every response anywhere that ever had that key on it.
# app/models/course.rb
class Course < ApplicationRecord
  after_commit :purge_edge_cache, on: [:update, :destroy]

  private

  def purge_edge_cache
    EdgeCachePurgeJob.perform_later(
      keys: ["course/#{id}", "creator/#{creator_id}"]
    )
  end
end

# app/jobs/edge_cache_purge_job.rb
class EdgeCachePurgeJob < ApplicationJob
  queue_as :critical

  def perform(keys:)
    Cloudflare::Client.new.purge_by_tag(keys)
  rescue Cloudflare::TimeoutError => e
    Rails.logger.warn(event: "cdn_purge_timeout", keys: keys, error: e.message)
    retry_job wait: 5.seconds, queue: :critical
  end
end
The async job matters. The HTTP request that triggered the save shouldn’t block on a CDN round trip. The retry matters more. If the purge fails silently, you’ll be looking at stale pages for an hour while the s-maxage ticks down.
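One thing worth adding to a retry like this is a ceiling, so a long CDN outage turns into a loud failure instead of a job that re-enqueues itself forever. A minimal plain-Ruby sketch of capped retries with linear backoff follows. `PurgeTimeout`, `purge_with_retry`, and the injected `purge` and `sleeper` callables are hypothetical names for illustration.

```ruby
# Stand-in for whatever timeout error the CDN client raises (hypothetical).
class PurgeTimeout < StandardError; end

# Calls purge.call(keys), retrying on timeout up to `attempts` times in total.
# Returns the number of tries used, or re-raises once the cap is reached.
def purge_with_retry(keys, purge:, attempts: 5, base_wait: 5, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    tries += 1
    purge.call(keys)
    tries
  rescue PurgeTimeout
    raise if tries >= attempts # give up loudly so alerting can pick it up

    sleeper.call(base_wait * tries) # linear backoff: 5s, 10s, 15s, ...
    retry
  end
end
```

In Active Job the declarative equivalent would be something like `retry_on Cloudflare::TimeoutError, wait: 5.seconds, attempts: 5` on the job class, which also leaves the final failure visible to whatever watches your dead jobs.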
Creator profile pages at the video startup were served from Cloudflare Workers at the edge. A worker version shipped on a Wednesday that I’d reviewed and approved. Small refactor that supposedly tightened cache key composition. What actually happened: the new key included the path but dropped the locale segment. The cache stored the first response it saw per path, regardless of locale. EU users started seeing US users’ Open Graph previews on shared links and vice versa. We learned about it from a German creator who tweeted a screenshot.
First wrong fix was the global purge button. Wiped everything. The cache repopulated within three minutes with the same bad key. Right symptom, wrong root.
Real fix was a rollback to the previous worker (Cloudflare’s rollback is instant, which saved us), then a redeploy with locale back in the key. Then a deploy-time diff check that fails CI when cache key composition changes without an explicit flag.
About 40 minutes of mis-shared previews. Cache keys are part of your public API. Treat any change to them like a schema migration.
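The diff check is simple in principle. Below is a sketch of the idea in plain Ruby, not the check we shipped: the component list, the fingerprint scheme, and the `allow_change` flag are all assumptions for illustration.

```ruby
require "digest"

# Hypothetical: the ordered parts that make up an edge cache key.
CACHE_KEY_COMPONENTS = %i[path locale].freeze

# Builds the cache key for a request hash from the given components.
def cache_key(request, components = CACHE_KEY_COMPONENTS)
  components.map { |c| request.fetch(c) }.join("|")
end

# Stable fingerprint of the key composition, committed alongside the code.
def fingerprint(components)
  Digest::SHA256.hexdigest(components.join(","))
end

# Raises (failing the build) when the composition drifts from the approved
# fingerprint, unless the change was explicitly flagged.
def check_cache_key!(components, approved_fingerprint, allow_change: false)
  return true if allow_change || fingerprint(components) == approved_fingerprint

  raise "cache key composition changed: #{components.inspect}"
end
```

Dropping `:locale` from the list makes a German and an English request for the same path collapse into one key, which is exactly the incident above. The check turns that from a production surprise into a red build that someone has to override on purpose.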
touch: true is not optional.
Cache-Control and Surrogate-Key headers.
after_commit, async, with retry.

Thanks for reading. If you’ve got thoughts, send them my way.