Caching Strategies in Rails

Russian doll fragment caching, counter caches, and surrogate-key CDN invalidation on a read-heavy creator page that wouldn't sit still under load.

Thursday afternoon at the creator economy platform I worked at. A creator’s course landing page was the most-hit route on the platform that week, and our p99 had drifted from ~140 ms to over 1.8 s in three days. No deploy on the path. No DB regression. Just a slow, ugly creep that the dashboards had started flagging in red.

The view pulled the course, the creator, their pricing, the lesson list, social proof, and an “X students enrolled” counter. Every chunk was its own DB call. Every render was rebuilding HTML that hadn’t changed since the creator last edited the page three weeks earlier.

Five caching layers, in the order I reach for them in a Rails app. Counter caches at the base, Russian doll fragment caching above, low-level cache for hot lookups, HTTP edge caching on Cloudflare, and surrogate-key invalidation tying it together. Skip any one and your read path eventually buckles.

Start with counter caches

The cheapest caching layer in Rails isn’t Rails.cache.fetch. It’s counter_cache: true on the association. If your hot view counts children of a parent record, you’re paying for that count in DB time on every request unless you store it.

# app/models/course.rb
class Course < ApplicationRecord
  has_many :enrollments, dependent: :destroy
  has_many :lessons, dependent: :destroy

  # enrollments_count and lessons_count are denormalized columns
  # on the courses table. Active Record keeps them honest.
end

# app/models/enrollment.rb
class Enrollment < ApplicationRecord
  belongs_to :course, counter_cache: true
  belongs_to :user
end

The migration:

class AddCountersToCourses < ActiveRecord::Migration[7.1]
  def change
    add_column :courses, :enrollments_count, :integer, default: 0, null: false
    add_column :courses, :lessons_count, :integer, default: 0, null: false

    reversible do |dir|
      dir.up do
        execute <<~SQL
          UPDATE courses c SET
            enrollments_count = (SELECT COUNT(*) FROM enrollments WHERE course_id = c.id),
            lessons_count = (SELECT COUNT(*) FROM lessons WHERE course_id = c.id)
        SQL
      end
    end
  end
end

That backfill matters. I’ve seen counter caches added without it, sitting at zero for three weeks before someone noticed the homepage said “0 students enrolled” on every course. We added a nightly Sidekiq reconciler that picks 1% of rows at random and re-counts. Quiet, boring, catches the bugs.
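
A minimal sketch of that reconciler’s core logic, written in plain Ruby so it runs without Rails. CourseRow, reconcile_sample, and the sample_percent parameter are illustrative names, not the production job. In a real app the sample would come from the database, and the repair would typically go through Course.reset_counters(id, :enrollments).

```ruby
# Hypothetical in-memory stand-in for a courses row; not the real model.
CourseRow = Struct.new(:id, :enrollments_count)

# Re-counts a random sample of rows and repairs any drifted counter.
# actual_counts maps course id => true number of enrollments.
# Returns one hash per repaired row so the caller can log or alert on it.
def reconcile_sample(rows, actual_counts, sample_percent: 1, rng: Random.new)
  sample_size = [(rows.size * sample_percent / 100.0).ceil, 1].max
  rows.sample(sample_size, random: rng).filter_map do |row|
    actual = actual_counts.fetch(row.id, 0)
    next if row.enrollments_count == actual

    drift = { id: row.id, stored: row.enrollments_count, actual: actual }
    row.enrollments_count = actual
    drift
  end
end

rows = (1..200).map { |id| CourseRow.new(id, 0) } # every counter stuck at zero
actual = rows.to_h { |row| [row.id, 5] }          # the real count is 5 everywhere
repaired = reconcile_sample(rows, actual, rng: Random.new(42))
puts "repaired #{repaired.size} of #{rows.size} rows"
```

Returning the drift records rather than fixing silently is the useful part: a non-empty result on a nightly run is the signal that something is writing rows without going through the callbacks.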

Russian doll fragment caching

Counter caches kill the cheap reads. Fragment caching kills the expensive renders. Russian doll means fragments nest, and each fragment’s cache key is derived from its own record’s updated_at. When a single lesson changes, that lesson’s fragment busts and, because the lesson touches its course, so do the fragments that wrap it. Every sibling lesson fragment stays warm, so the re-render stitches the page back together mostly from what’s already in the cache.

<%# app/views/courses/show.html.erb %>
<% cache(@course, expires_in: 1.day) do %>
  <article class="course">
    <header>
      <h1><%= @course.title %></h1>
      <p><%= @course.enrollments_count %> students enrolled</p>
    </header>

    <% cache(["lesson-list", @course]) do %>
      <ol class="lessons">
        <% @course.lessons.includes(:section).each do |lesson| %>
          <% cache(lesson) do %>
            <li>
              <%= lesson.title %>
              <span class="duration"><%= lesson.duration_minutes %> min</span>
            </li>
          <% end %>
        <% end %>
      </ol>
    <% end %>
  </article>
<% end %>

You need touch: true on the child associations so that updating a lesson bumps the course’s updated_at and the outer fragment knows to invalidate.

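class Lesson < ApplicationRecord
  # touch busts the parent's fragments; counter_cache maintains courses.lessons_count.
  belongs_to :course, touch: true, counter_cache: true
end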

This single change took rendering on that landing page from ~600 ms to ~40 ms on a cold-cache request, under 10 ms on a warm one. Russian doll wins when the page has structure. It doesn’t save you if the data is shaped wrong, which is the whole point of putting counter caches underneath it.
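
To make the invalidation path concrete, here is a toy model in plain Ruby, no Rails required. Row and FragmentStore are hypothetical stand-ins for an Active Record model and the fragment cache. The only behavior they copy is a key derived from updated_at and a touch that propagates to the parent.

```ruby
# Toy stand-in for an Active Record row: the cache key embeds updated_at,
# and touch propagates upward like belongs_to ..., touch: true.
Row = Struct.new(:model, :id, :updated_at, :parent) do
  def cache_key
    "#{model}/#{id}-#{updated_at}"
  end

  def touch(now)
    self.updated_at = now
    parent&.touch(now)
  end
end

# Minimal fragment store that records which keys had to be rendered.
class FragmentStore
  attr_reader :rendered

  def initialize
    @fragments = {}
    @rendered = []
  end

  def fetch(key)
    return @fragments[key] if @fragments.key?(key)

    @rendered << key
    @fragments[key] = yield
  end
end

# Outer course fragment wrapping one inner fragment per lesson.
def render_course(store, course, lessons)
  store.fetch(course.cache_key) do
    items = lessons.map { |lesson| store.fetch(lesson.cache_key) { "<li>#{lesson.id}</li>" } }
    "<ol>#{items.join}</ol>"
  end
end

course  = Row.new("courses", 1, 100, nil)
lessons = (1..3).map { |id| Row.new("lessons", id, 100, course) }
store   = FragmentStore.new

render_course(store, course, lessons) # cold: renders the course and all 3 lessons
lessons[1].touch(200)                 # edit one lesson; the course is touched too
store.rendered.clear
render_course(store, course, lessons)
p store.rendered # => ["courses/1-200", "lessons/2-200"]
```

After one lesson edit, the outer fragment and the edited lesson re-render. The two sibling lessons come straight from the cache.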

Low-level cache for hot lookups

Some lookups are too small for fragment caching and too hot for the DB. Pricing tier definitions, feature flags scoped to a creator, the “is this creator allowed to use this beta module” check. Rails gives you Rails.cache.fetch for this and you should reach for it freely, with one rule. Always set an expiry.

# app/models/pricing_tier.rb
class PricingTier < ApplicationRecord
  def self.for_creator(creator_id)
    Rails.cache.fetch(["pricing_tier", creator_id], expires_in: 15.minutes) do
      joins(:creator_pricing_tiers)
        .where(creator_pricing_tiers: { creator_id: creator_id })
        .first
    end
  end
end

We had an incident a while back where a Rails.cache.fetch without an expiry, on a key that wasn’t being explicitly busted on update, kept serving a stale pricing tier for three days after a creator changed their plan. The fix isn’t “remember to bust the key everywhere.” It’s “set a TTL, always.” Defense in depth. Even if your invalidation is right, an expiry caps the blast radius.
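
The blast-radius argument can be sketched with a tiny TTL cache and an injectable clock. TtlCache is a hypothetical stand-in for Rails.cache, and the creator id 42 and plan names are made up for illustration. The point is only that a forgotten invalidation costs at most one TTL.

```ruby
# Hypothetical stand-in for Rails.cache with a controllable clock.
class TtlCache
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @entries = {}
  end

  # Returns the cached value while it is fresh; otherwise runs the block
  # and stores the result with a new expiry.
  def fetch(key, expires_in:)
    entry = @entries[key]
    return entry[:value] if entry && @clock.call < entry[:expires_at]

    value = yield
    @entries[key] = { value: value, expires_at: @clock.call + expires_in }
    value
  end
end

now = 0
cache = TtlCache.new(clock: -> { now })
plan = "basic"
lookup = -> { cache.fetch(["pricing_tier", 42], expires_in: 900) { plan } }

puts lookup.call # first call caches the current plan
plan = "pro"     # creator changes plan; nobody busts the key
now = 899
puts lookup.call # still inside the 15-minute TTL, so the old plan is served
now = 900
puts lookup.call # TTL elapsed, the block runs again and picks up the change
```

Without expires_in, the second lookup’s answer would be the answer forever, which is the three-day incident in miniature.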

Edge caching with Cloudflare

Fragment caching saves the render. Low-level cache saves the lookup. Neither saves the network round trip from the user’s browser to the Rails app. For a read-heavy public page that gets hit by thousands of unique visitors per minute, you want the response cached at the edge.

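I built the Cloudflare Workers edge layer for creator profile pages at a live-video creator platform where I led engineering, a couple of years before this gig. The same shape applies. The Rails app sets a Cache-Control header plus a Surrogate-Key header listing every resource the response depends on. The edge reads the cache directive, stores the response, and indexes it by those keys. One naming caveat: Surrogate-Key is the Fastly-style header, and Cloudflare’s own purge-by-tag reads Cache-Tag with comma-separated values, so on Cloudflare you either emit Cache-Tag directly or have a Worker translate.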

# app/controllers/courses_controller.rb
class CoursesController < ApplicationController
  def show
    @course = Course.find_by!(slug: params[:slug])

    response.headers["Cache-Control"] = "public, max-age=60, s-maxage=3600"
    response.headers["Surrogate-Key"] = surrogate_keys_for(@course).join(" ")
  end

  private

  def surrogate_keys_for(course)
    [
      "course/#{course.id}",
      "creator/#{course.creator_id}",
      *course.lessons.pluck(:id).map { |id| "lesson/#{id}" }
    ]
  end
end

The s-maxage is the long lease the CDN holds. The shorter max-age keeps browsers from holding stale content for too long without a revalidation. The split matters. You almost never want browsers caching as aggressively as your CDN.
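
A small sketch of how the two directives are read by different caches. freshness_lifetime is a hypothetical helper, not part of Rails or any CDN SDK. It applies the standard HTTP caching rule that s-maxage overrides max-age for shared caches such as a CDN, while browsers ignore s-maxage.

```ruby
# Returns the freshness lifetime in seconds that a cache would apply,
# or nil when the header carries no max-age style directive.
# shared: true models a CDN, shared: false models a browser.
def freshness_lifetime(cache_control, shared:)
  directives = cache_control.split(",").to_h do |directive|
    name, value = directive.strip.split("=", 2)
    [name.downcase, value]
  end
  seconds = (shared && directives["s-maxage"]) || directives["max-age"]
  seconds&.to_i
end

header = "public, max-age=60, s-maxage=3600"
puts freshness_lifetime(header, shared: true)  # CDN lease
puts freshness_lifetime(header, shared: false) # browser lease
```

With the header from the controller, the CDN holds the response for 3600 seconds and the browser for 60, which is the split described above.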

Surrogate-key invalidation

The Cache-Control header is the contract going out. The invalidation hook is the contract coming back. When a creator edits a course, you don’t purge by URL. You purge by surrogate key, and Cloudflare drops every response anywhere that ever had that key on it.

# app/models/course.rb
class Course < ApplicationRecord
  after_commit :purge_edge_cache, on: [:update, :destroy]

  private

  def purge_edge_cache
    EdgeCachePurgeJob.perform_later(
      keys: ["course/#{id}", "creator/#{creator_id}"]
    )
  end
end

# app/jobs/edge_cache_purge_job.rb
class EdgeCachePurgeJob < ApplicationJob
  queue_as :critical

  def perform(keys:)
    Cloudflare::Client.new.purge_by_tag(keys)
  rescue Cloudflare::TimeoutError => e
    Rails.logger.warn(event: "cdn_purge_timeout", keys: keys, error: e.message)
    retry_job wait: 5.seconds, queue: :critical
  end
end

The async job matters. The HTTP request that triggered the save shouldn’t block on a CDN round trip. The retry matters more. If the purge fails silently, you’ll be looking at stale pages for an hour while the s-maxage ticks down.
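
What purge-by-key buys you on the CDN side can be pictured with a small in-memory model. EdgeCache is a hypothetical illustration, not Cloudflare’s implementation, and the URLs are made up. It only shows why one key can drop many responses without the app knowing their URLs.

```ruby
require "set"

# Toy edge cache: responses are stored by URL and indexed by surrogate key.
class EdgeCache
  def initialize
    @responses = {}
    @urls_by_key = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def store(url, body, surrogate_keys:)
    @responses[url] = body
    surrogate_keys.each { |key| @urls_by_key[key] << url }
  end

  def get(url)
    @responses[url]
  end

  # Drops every cached response tagged with the key.
  # Returns how many responses were actually removed.
  def purge_by_key(key)
    urls = @urls_by_key.delete(key) || Set.new
    urls.count { |url| @responses.delete(url) }
  end
end

edge = EdgeCache.new
edge.store("/courses/ruby",  "<html>ruby</html>",  surrogate_keys: ["course/1", "creator/7"])
edge.store("/courses/rails", "<html>rails</html>", surrogate_keys: ["course/2", "creator/7"])
edge.store("/courses/go",    "<html>go</html>",    surrogate_keys: ["course/3", "creator/9"])

puts edge.purge_by_key("creator/7") # both of creator 7's pages are dropped
puts edge.get("/courses/go")        # creator 9's page is untouched
```

Purging by URL would need the app to enumerate every page a creator appears on. Purging by key pushes that bookkeeping to the moment the response is stored.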

Cache key as public API

Creator profile pages at the video startup were served from Cloudflare Workers at the edge. A worker version that I’d reviewed and approved shipped on a Wednesday. Small refactor that supposedly tightened cache key composition. What actually happened was that the new key included the path but dropped the locale segment, so the cache stored the first response it saw per path, regardless of locale. EU users started seeing US users’ Open Graph previews on shared links, and vice versa. We learned about it from a German creator who tweeted a screenshot.

First wrong fix was the global purge button. Wiped everything. The cache repopulated within three minutes with the same bad key. Right symptom, wrong root.

Real fix was a rollback to the previous worker (Cloudflare’s rollback is instant, which saved us), then a redeploy with locale back in the key. Then a deploy-time diff check that fails CI when cache key composition changes without an explicit flag.

About 40 minutes of mis-shared previews. Cache keys are part of your public API. Treat any change to them like a schema migration.
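
A sketch of both halves of that lesson in plain Ruby: the key collision itself, and a deploy-time guard. CACHE_KEY_PARTS, edge_cache_key, and assert_key_composition_unchanged! are hypothetical names. The real worker and CI check are not shown in this post, so treat this as one possible shape rather than what we shipped.

```ruby
# Hypothetical list of request attributes that make up the edge cache key.
CACHE_KEY_PARTS = %i[host path locale].freeze

def edge_cache_key(request, parts: CACHE_KEY_PARTS)
  parts.map { |part| request.fetch(part) }.join("|")
end

# Deploy-time guard: raises when the key composition differs from what is
# deployed, unless the change was explicitly acknowledged.
def assert_key_composition_unchanged!(deployed_parts, candidate_parts, acknowledged: false)
  return if deployed_parts == candidate_parts || acknowledged

  removed = deployed_parts - candidate_parts
  added = candidate_parts - deployed_parts
  raise ArgumentError, "cache key composition changed (removed: #{removed}, added: #{added})"
end

de = { host: "example.test", path: "/creators/ana", locale: "de" }
us = { host: "example.test", path: "/creators/ana", locale: "en-US" }

puts edge_cache_key(de) == edge_cache_key(us)                      # keys differ with locale in place
puts edge_cache_key(de, parts: %i[host path]) ==
     edge_cache_key(us, parts: %i[host path])                      # keys collide once locale is dropped

begin
  assert_key_composition_unchanged!(CACHE_KEY_PARTS, %i[host path])
rescue ArgumentError => e
  puts e.message
end
```

Keeping the key parts in one named list is what makes the diff check cheap: CI compares two arrays instead of trying to reason about string concatenation scattered through the worker.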

Takeaways

  • Counter caches first. The cheapest layer is the one Active Record already gives you.
  • Russian doll fragment caching when the page has structure. touch: true is not optional.
  • Low-level cache freely, but always set a TTL. Defense in depth.
  • Edge caching at the CDN with explicit Cache-Control and Surrogate-Key headers.
  • Purge by surrogate key from after_commit, async, with retry.
  • Cache key composition is a public API. Changes are migrations.

Thanks for reading. If you’ve got thoughts, send them my way.

© 2026 Akin Gundogdu. All Rights Reserved.