CDN and Edge Computing With CloudFront

Cache key design, CloudFront Functions vs Lambda at Edge, origin shield, and multi-origin failover, written up from a few production incidents I'd rather not repeat.

On a Wednesday afternoon, at a live-video creator platform where I led engineering, a worker version I’d approved dropped the locale segment from the cache key. Within ninety minutes a German creator tweeted a screenshot of someone else’s profile photo showing up on his own OG preview. The thread had 200+ retweets in an hour. I was the one who approved the PR. I was also the one who built the edge caching layer the quarter before. So, yeah, I got to learn this lesson on my own deploy.

Most of what people call “edge computing” is, in practice, cache key design plus a small amount of viewer-side rewriting. Get the key right and the rest is plumbing. Get it wrong and you find out from Twitter.

This is what I’d tell a teammate doing CloudFront work for the first time, with the bias of someone who’s broken it a few times.

Why cache key design ruins your day first

The cache key is the public API of your edge. Anything you forget to put in the key, you’ll see show up wrong in production. Two different responses, one cache slot, first one in wins. Then it serves wrong for ten minutes, or two hours, or the rest of the day if your TTLs are generous.

Useful questions before you ship anything cacheable. What does the response actually vary on? Locale. Auth tier. Device class. Country. A/B bucket. Currency. If the response varies on it and the key doesn’t include it, you have a bug, you just haven’t seen it yet.
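
A toy model makes the collision concrete. This is not how CloudFront builds keys internally; cacheKey, serve, the dims list, and the /creator/anna path are all hypothetical names for illustration. Dropping locale from the dimension list maps a German request and a US request to the same slot, so whoever arrives first fills it for both.

```javascript
// Toy model of an edge cache. Not CloudFront's real key format;
// cacheKey, dims, and the request shape are assumptions for illustration.
function cacheKey(req, dims) {
  var parts = [req.path];
  dims.forEach(function (d) {
    parts.push(d + '=' + (req[d] || ''));
  });
  return parts.join('|');
}

function serve(cache, req, dims, origin) {
  var key = cacheKey(req, dims);
  if (!Object.prototype.hasOwnProperty.call(cache, key)) {
    cache[key] = origin(req); // first one in wins
  }
  return cache[key];
}

// Hypothetical origin: the response really does vary on locale.
function origin(req) {
  return 'profile page in ' + req.locale;
}

var us = { path: '/creator/anna', locale: 'en' };
var de = { path: '/creator/anna', locale: 'de' };

// Locale in the key: the German request gets its own slot.
var good = {};
serve(good, us, ['locale'], origin);
var goodDe = serve(good, de, ['locale'], origin);

// Locale missing from the key: the German request gets the US response.
var bad = {};
serve(bad, us, [], origin);
var badDe = serve(bad, de, [], origin);
```

The second cache is the bug in miniature: nothing errors, the origin is healthy, and the wrong page is served until the entry expires.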

CloudFront’s cache policies let you spell this out. Here’s a sane default for an authenticated app with locale and device fanout:

resource "aws_cloudfront_cache_policy" "app_default" {
  name        = "app-default-v3"
  default_ttl = 60
  max_ttl     = 300
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "whitelist"
      headers {
        items = [
          "x-locale-bucket",
          "x-device-class",
          "cloudfront-viewer-country",
        ]
      }
    }

    query_strings_config {
      query_string_behavior = "whitelist"
      query_strings {
        items = ["ab", "currency"]
      }
    }

    cookies_config {
      cookie_behavior = "whitelist"
      cookies {
        items = ["session_tier"]
      }
    }
  }
}

The x-locale-bucket and x-device-class headers aren’t request headers a browser sends. They’re written by a CloudFront Function on viewer-request, which is the next section. Compression is in the key. Brotli and gzip get different cache entries, and that’s correct.

One trap. Cookies and query strings have a habit of breeding without you noticing. Whitelist what you actually use. Anything else widens the key and drops your hit ratio for no reason.

CloudFront Functions versus Lambda at Edge

CloudFront Functions run on viewer-request and viewer-response. JavaScript only, sub-millisecond, no network calls, very tight memory budget. They’re shockingly cheap.

Lambda at Edge (Lambda@Edge in the AWS docs) runs on viewer-request, origin-request, origin-response, and viewer-response. Full Node.js runtime (Python is also supported), can make outbound HTTP calls, can read from S3 or DynamoDB, has cold starts.

My position. About ninety percent of what people reach for Lambda at Edge for is actually a CloudFront Function plus a smarter cache key. If you can express what you want in a header rewrite, a redirect, a normalization, or a bucket assignment, you don’t need a Node runtime at the edge. You’re paying cold start latency for something a viewer-request Function would do in 0.4 ms.
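
To make the redirect and normalization case concrete, here is a minimal viewer-request sketch that lowercases the path and strips a trailing slash with a 301. The rule itself is an assumption for illustration, not something this platform ran. The event and response shapes are the ones CloudFront Functions use.

```javascript
// Viewer-request CloudFront Function sketch. The normalization rule
// (lowercase path, no trailing slash) is an assumption for illustration.
// Note: this sketch drops the query string on redirect.
function handler(event) {
  var req = event.request;
  var uri = req.uri;

  var normalized = uri.toLowerCase();
  if (normalized.length > 1 && normalized.charAt(normalized.length - 1) === '/') {
    normalized = normalized.slice(0, -1);
  }

  if (normalized !== uri) {
    // Returning a response object answers the viewer directly from the edge.
    return {
      statusCode: 301,
      statusDescription: 'Moved Permanently',
      headers: { location: { value: normalized } }
    };
  }

  return req; // already canonical, continue to the cache lookup
}
```

Besides saving an origin round trip, this keeps /Creator/Anna/ and /creator/anna from occupying two cache slots for one page.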

Here’s a real one. Accept-Language is awful for cache keys because it’s high-cardinality. en-US,en;q=0.9,tr;q=0.8 and en-US,en;q=0.9 are different strings, but you want both in the same cache slot. So normalize it to a small bucket first.

function handler(event) {
  var req = event.request;
  var headers = req.headers;

  var supported = ['en', 'tr', 'de', 'es', 'fr', 'pt'];
  var fallback = 'en';

  // Take the first listed language, then drop any ";q=" weight and region subtag.
  var raw = (headers['accept-language'] && headers['accept-language'].value) || '';
  var primary = raw.split(',')[0].split(';')[0].split('-')[0].trim().toLowerCase();

  // Anything outside the supported set collapses into the fallback bucket.
  var bucket = supported.indexOf(primary) >= 0 ? primary : fallback;

  headers['x-locale-bucket'] = { value: bucket };

  // Coarse device class from User-Agent: mobile, tablet, or desktop.
  var ua = (headers['user-agent'] && headers['user-agent'].value) || '';
  var device = 'desktop';
  if (/iPhone|Android.*Mobile|iPod/.test(ua)) device = 'mobile';
  else if (/iPad|Android(?!.*Mobile)/.test(ua)) device = 'tablet';
  headers['x-device-class'] = { value: device };

  return req;
}

Six locale buckets, three device classes. Eighteen combinations from those two headers. Without bucketing, Accept-Language alone gave us hundreds of variants for the exact same response.
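
The arithmetic generalizes: worst-case variants per URL is the product of the distinct values of every key dimension. A small sketch follows. The counts for ab and currency are made-up placeholders, not numbers from this platform.

```javascript
// Worst-case cache variants per URL = product of each dimension's cardinality.
function variantCount(dimensions) {
  return Object.keys(dimensions).reduce(function (total, name) {
    return total * dimensions[name];
  }, 1);
}

var bucketed = variantCount({ locale: 6, device: 3 }); // 18

// Placeholder counts: every extra dimension multiplies the total.
var withExtras = variantCount({ locale: 6, device: 3, ab: 2, currency: 4 }); // 144
```

This is why every whitelisted header, cookie, and query string should earn its place in the key.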

When do you actually need Lambda at Edge? When you have to call a backend on origin-request, like fetching an object from S3 via a signed URL, or reading a feature flag from DynamoDB, or generating dynamic OG metadata that depends on a database row. Don’t reach for it before then.
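
A minimal origin-request sketch of the feature flag case. The event shape (event.Records[0].cf.request, headers as arrays of key/value pairs) is the Lambda at Edge one. Everything else is hypothetical: lookupFlag, the new-og-renderer flag name, and the x-feature-new-og header are invented here, and the DynamoDB read is stubbed as an injected function so the handler logic stands alone.

```javascript
// Origin-request handler sketch for Lambda at Edge.
// lookupFlag is a hypothetical async function (e.g. wrapping a DynamoDB read).
function makeHandler(lookupFlag) {
  return async function handler(event) {
    const request = event.Records[0].cf.request;

    let enabled = false;
    try {
      enabled = await lookupFlag('new-og-renderer');
    } catch (err) {
      // Fail closed: a flag lookup error should not take the page down.
      enabled = false;
    }

    // Header keys are lowercase and map to arrays of { key, value }.
    request.headers['x-feature-new-og'] = [
      { key: 'X-Feature-New-Og', value: enabled ? 'on' : 'off' }
    ];

    return request; // forward the modified request to origin
  };
}

// In a deployed function this would be wired up roughly as:
//   exports.handler = makeHandler(realLookup);
```

Origin-request runs after the cache lookup, so whatever this header changes in the response is stored under the key that already exists. If the flag changes what viewers should see, it belongs in the key too.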

Origin shield and multi origin failover

Origin shield puts a single regional cache between every edge POP and your origin. Without it, every regional edge cache that misses goes to origin on its own. With it, those misses collapse onto the shield, and the shield is the only thing your origin sees.

On a workload spread across regions, this is the difference between a comfortable hit ratio and a panicked Slack thread at 3 a.m. Pacific. Measure your hit ratio before and after. If it doesn’t move, you didn’t need shield. Mine moved enough every time I’ve used it that I now turn it on by default for any origin that isn’t on the same continent as the bulk of traffic.
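
A toy simulation of that collapse, with invented POP names and no claim about CloudFront’s real topology. Each layer is a cache, and the sketch counts how many requests reach origin when the same cold object is requested through several edges.

```javascript
// Toy model: count origin fetches for one cold object requested via N edges.
// POP names and the two-tier layout are simplifications for illustration.
function makeCache(upstream) {
  const store = new Map();
  return function get(key) {
    if (!store.has(key)) store.set(key, upstream(key)); // miss: go upstream
    return store.get(key);
  };
}

function originFetches(edgeNames, useShield) {
  let fetches = 0;
  const origin = (key) => { fetches += 1; return 'body:' + key; };

  // With shield, every edge shares one upstream cache in front of origin.
  const upstream = useShield ? makeCache(origin) : origin;
  const edges = edgeNames.map(() => makeCache(upstream));

  edges.forEach((edge) => edge('/creator/anna'));
  return fetches;
}

const pops = ['FRA', 'IAD', 'NRT', 'GRU'];
const withoutShield = originFetches(pops, false); // 4
const withShield = originFetches(pops, true);     // 1
```

The more places your traffic enters from, the bigger the gap between those two numbers, which is why the benefit shows up on multi-region workloads first.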

Multi-origin failover sits next to it. You define an origin group with a primary and a secondary, and CloudFront will fail over on the status codes you list. Stick to a tight set. 500, 502, 503, 504. Don’t include 404 in failover criteria unless you really mean it.

resource "aws_cloudfront_distribution" "app" {
  enabled = true

  origin {
    domain_name = aws_lb.primary.dns_name
    origin_id   = "primary-alb"

    origin_shield {
      enabled              = true
      origin_shield_region = "us-east-1"
    }

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  origin {
    domain_name = aws_s3_bucket.lkg_snapshot.bucket_regional_domain_name
    origin_id   = "last-known-good"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.lkg.cloudfront_access_identity_path
    }
  }

  origin_group {
    origin_id = "app-group"

    failover_criteria {
      status_codes = [500, 502, 503, 504]
    }

    member { origin_id = "primary-alb" }
    member { origin_id = "last-known-good" }
  }

  default_cache_behavior {
    target_origin_id       = "app-group"
    cache_policy_id        = aws_cloudfront_cache_policy.app_default.id
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD"]
    compress               = true
  }

  viewer_certificate { cloudfront_default_certificate = true }
  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }
}

The secondary origin here is an S3 bucket holding a last-known-good snapshot of the public read paths. When the ALB starts returning 5xx, CloudFront serves the snapshot. Stale, but online.
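
The failover rule itself is simple enough to model. This is a sketch of the status-code decision only, with hypothetical primary and secondary functions. It ignores connection failures and timeouts, which CloudFront also treats as reasons to fail over.

```javascript
// Sketch of origin-group failover by status code. primary and secondary are
// hypothetical async functions returning { status, body }.
const FAILOVER_CODES = [500, 502, 503, 504];

async function fetchWithFailover(primary, secondary, req) {
  const first = await primary(req);
  if (FAILOVER_CODES.indexOf(first.status) === -1) {
    return first; // includes 404: a missing page is an answer, not an outage
  }
  return secondary(req); // primary is degraded, serve last-known-good
}
```

With 404 in the list, every page that is really missing would be retried against the snapshot, which adds load and can bring back content you meant to remove.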

A cache key that actually survives

OK so this is the war story I opened with, the locale-drop one, but with the technical detail.

It was at a live-video creator platform where I led engineering. Creator profile pages, /creator/:slug, served from edge cache with dynamic OG metadata. Locales varied the copy and sometimes the image. I’d built the layer that quarter. On a Wednesday I approved a small refactor PR from a teammate that “tightened cache key composition.” The new key kept the path. It dropped the Accept-Language-derived locale segment.

What happened next. The cache stored the first response it saw per path, regardless of locale. EU users started getting US users’ OG previews on shared links. Most of the bugs would’ve stayed silent. But on social platforms, the first viral creator who tweets a screenshot of someone else’s profile photo on their own preview ends a quiet day.

First wrong fix. Global cache purge. Nuked everything. Three minutes later, the cache repopulated, with the same broken key, producing the same wrong results. The wrong code was still in production. Purging was treating the symptom.

Real fix. Rolled the worker back to the previous version. Workers rollback is instant, that part saved us. Then redeployed with locale put back in the key. Then I wrote a deploy-time check that diffs cache key composition against the previously deployed version, and refuses to deploy unless an explicit migrate flag is set.

- name: Diff cache key composition
  run: node scripts/diff-cache-key.js
  env:
    PREV_KEY_REF: ${{ secrets.PREV_CACHE_KEY_REF }}

- name: Block deploy on key change
  if: env.CACHE_KEY_CHANGED == 'true' && !contains(github.event.head_commit.message, '[migrate-cache-key]')
  run: |
    echo "Cache key composition changed. Commit message must include [migrate-cache-key] to proceed."
    exit 1

The Node script is small. Render the cache key composition from the Terraform plan, hash it, compare against the hash stored in SSM Parameter Store from the previous deploy, write the new hash on success. Boring. Effective.
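
A sketch of what the hashing and comparison could look like, not the original script. It assumes the key composition has already been pulled out of the Terraform plan into a plain object, and it leaves the SSM read and write out entirely. canonicalize, keyHash, and keyChanged are names invented here.

```javascript
const crypto = require('crypto');

// Sort arrays and object keys so that ordering differences
// (e.g. ["ab", "currency"] vs ["currency", "ab"]) do not count as a change.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return value.map(canonicalize).sort(function (a, b) {
      const x = JSON.stringify(a);
      const y = JSON.stringify(b);
      return x < y ? -1 : x > y ? 1 : 0;
    });
  }
  if (value && typeof value === 'object') {
    const out = {};
    Object.keys(value).sort().forEach(function (k) {
      out[k] = canonicalize(value[k]);
    });
    return out;
  }
  return value;
}

// SHA-256 over the canonical JSON form of the key composition.
function keyHash(composition) {
  const json = JSON.stringify(canonicalize(composition));
  return crypto.createHash('sha256').update(json).digest('hex');
}

// True when the composition no longer matches the previously stored hash.
function keyChanged(previousHash, composition) {
  return keyHash(composition) !== previousHash;
}
```

For the workflow step to see the result, the script would also need to append a CACHE_KEY_CHANGED=true line to the file named by the GITHUB_ENV variable, since that is how a step hands environment values to later steps.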

About forty minutes of mis-shared previews, a handful of public screenshots, no data leak. Lesson I keep coming back to: cache keys are part of your public API, treat any change like a schema migration, diff them in CI.

Cost shape of a CloudFront edge

The thing that moves a CloudFront bill is not compute. It’s cache hit ratio. A 78% hit ratio versus 92% is two different invoices for the same traffic, and the gap is mostly origin-side: egress where your origin charges for it, plus load-balancer capacity and the compute behind it. Chase hit ratio first.
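
The arithmetic behind that, as a quick sketch. The request volume is a made-up round number; only the two hit ratios come from the text.

```javascript
// Origin requests = total requests * miss rate.
function originRequests(totalRequests, hitRatio) {
  return Math.round(totalRequests * (1 - hitRatio));
}

var total = 100000000; // hypothetical monthly request count
var at78 = originRequests(total, 0.78); // 22000000
var at92 = originRequests(total, 0.92); // 8000000
var multiple = at78 / at92;             // 2.75
```

Fourteen points of hit ratio is 2.75 times the load on everything behind the edge.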

Origin shield costs money. On multi-region traffic where most edges were missing through to origin, it paid for itself for me every time. On single-region traffic with a small POP set, sometimes it didn’t. Measure before, measure after.

Brotli and gzip at the edge are free money, enable both and keep them in the cache key. CloudFront Functions invocations are cheap enough I don’t think about cost when adding one. Lambda at Edge invocations stack up faster, especially on origin-request for every cache miss. Be deliberate.

Takeaways

Cache keys are a schema. Any change is a migration. Diff them in CI.
CloudFront Functions cover most viewer-side logic. Don’t reach for Lambda at Edge unless you need outbound calls.
Origin shield is worth its cost when traffic spans regions. Measure your hit ratio before and after.
Multi-origin failover should serve last-known-good when origin is degraded, not 5xx.
Hit ratio moves the bill more than any other lever at the edge.
Treat global cache purges as a last resort. They don’t fix bad keys.

Thanks for reading. If you’ve got thoughts, send them my way.