AWS CloudWatch Custom Metrics and Alarms

How I actually use CloudWatch in production: EMF, metric filters, composite alarms, and the cost gotchas I wish someone had warned me about.

It was a Tuesday morning at the creator economy platform I worked at. Aurora reader lag was climbing past 14 minutes, the Community feed was crawling, and the alert that should have fired hadn’t. We had Datadog. We had CloudWatch. We had a Slack channel literally named #aurora-health. None of it pointed at the actual cause, which was a maintenance ANALYZE on a hot table starving WAL emission on the writer.

The signal existed. We just hadn’t built it.

This post is what I actually do with CloudWatch in production now. EMF for custom metrics, composite alarms instead of single-metric ones, Logs Insights queries I keep pinned, the cost gotchas. And the slightly unfashionable take that for an AWS-native stack, you can get a long way before Datadog is worth the bill.

When CloudWatch is enough

OK so the honest answer first. If your stack is AWS-only and your services are EC2, EKS, Lambda, RDS, SQS, the usual list, CloudWatch covers a surprising amount of ground. Free metrics on every AWS resource, alarms with SNS routing, dashboards, Logs Insights for ad-hoc queries.

Datadog earns its money when you’re cross-cloud, doing deep distributed tracing across many services, or correlating logs and metrics and traces in a single pane. The trading platform I architected a few years back was CloudWatch-first and we lived. The creator platform ran on Datadog and that was the right call for that team. What doesn’t make sense is reaching for Datadog on day one by default, because an untuned default setup buries the actual signal under noise.

Embedded Metric Format the right way

The single biggest unlock in CloudWatch in the last few years is EMF. You emit structured JSON to a normal log stream, and CloudWatch auto-extracts metrics from it. No PutMetricData API calls, no PutMetricData rate limits, no per-call cost on the publish side. You pay for log ingestion instead.

Here’s roughly what I ship from a Lambda:

import { Context } from "aws-lambda";

interface EmfPayload {
  _aws: {
    Timestamp: number;
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][];
      Metrics: Array<{ Name: string; Unit: string }>;
    }>;
  };
  service: string;
  route: string;
  status_code: string; // dimension values must be strings in EMF
  latency_ms: number;
  request_count: number; // must be present, or the declared metric is never extracted
  user_id?: string; // searchable context only, never a dimension
}

export function emitEmf(payload: Omit<EmfPayload, "_aws">): void {
  const emf: EmfPayload = {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: "Platform/Api",
          Dimensions: [["service", "route", "status_code"]],
          Metrics: [
            { Name: "latency_ms", Unit: "Milliseconds" },
            { Name: "request_count", Unit: "Count" },
          ],
        },
      ],
    },
    ...payload,
  };

  // stdout in Lambda goes straight to CloudWatch Logs.
  // The _aws block triggers metric extraction at the log layer.
  console.log(JSON.stringify(emf));
}

export async function handler(event: any, ctx: Context) {
  const started = Date.now();
  try {
    // processRequest is the business logic, defined elsewhere.
    const result = await processRequest(event);
    emitEmf({
      service: "checkout-api",
      route: event.routeKey,
      status_code: "200",
      latency_ms: Date.now() - started,
      request_count: 1,
      user_id: event.requestContext?.authorizer?.userId,
    });
    return result;
  } catch (err) {
    emitEmf({
      service: "checkout-api",
      route: event.routeKey,
      status_code: "500",
      latency_ms: Date.now() - started,
      request_count: 1,
    });
    throw err;
  }
}

Two things worth noting. First, Dimensions is an array of arrays. Each inner array is one dimension set, and every unique combination of values in a set becomes its own custom metric, so cardinality matters. Putting user_id in a dimension set is a great way to get a five-figure CloudWatch bill, ask me how I know. Second, fields outside the _aws block are still searchable in Logs Insights, so you get the high-cardinality context for the price of the log line. Only the declared dimensions multiply your custom-metric count.
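To make the cardinality point concrete, here is a minimal sketch that counts how many distinct series a dimension set would produce from a sample of events. The countSeries helper and the sample data are hypothetical, and the result only approximates what CloudWatch would create, since each metric name gets its own series per combination.

```typescript
// Hypothetical helper: counts unique dimension-value combinations in a sample.
type SampleEvent = Record<string, string | number>;

function countSeries(events: SampleEvent[], dimensionSet: string[]): number {
  const seen = new Set<string>();
  for (const event of events) {
    // One series per unique combination of values for the chosen dimensions.
    seen.add(dimensionSet.map((d) => String(event[d])).join("|"));
  }
  return seen.size;
}

// Made-up sample: 1,000 requests from 1,000 distinct users across 2 routes.
const sample: SampleEvent[] = Array.from({ length: 1000 }, (_, i) => ({
  service: "checkout-api",
  route: i % 2 === 0 ? "POST /cart" : "POST /pay",
  status_code: "200",
  user_id: `user-${i}`,
}));

const bounded = countSeries(sample, ["service", "route", "status_code"]);
const unbounded = countSeries(sample, ["service", "route", "status_code", "user_id"]);
console.log({ bounded, unbounded });
```

Running something like this over a day of real logs before shipping a new dimension is a cheap way to see the blow-up coming.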

Metric filters from existing logs

Sometimes you’re not the team that owns the deploy. The legacy Rails monolith at the creator platform had its own deploy cadence, and bolting EMF into a hot path wasn’t always realistic. Metric filters let you mine metrics from logs you’re already paying to store.

{
  "filterName": "rails-5xx-by-route",
  "filterPattern": "{ $.severity = \"ERROR\" && $.status >= 500 }",
  "logGroupName": "/aws/eks/web/rails",
  "metricTransformations": [
    {
      "metricNamespace": "Platform/Rails",
      "metricName": "ServerErrors",
      "metricValue": "1",
      "dimensions": {
        "controller": "$.controller",
        "action": "$.action"
      }
    }
  ]
}

The pattern syntax is finicky and the docs are scattered, so a working example is worth more than the reference. The dimensions block extracts JSON fields from the matched log event. Note there is no defaultValue here: CloudWatch does not allow a default value on a metric filter that assigns dimensions, so handle missing data on the alarm side instead. Same caveat as with EMF, do not put unbounded user IDs in there.
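One way to pin down what the pattern is supposed to match, before deploying it, is a small local check against sample log lines. The sketch below is not CloudWatch's matcher. matchesServerError is a hypothetical stand-in that mirrors the intent of the pattern above, and the sample lines are invented.

```typescript
// Hypothetical local stand-in for { $.severity = "ERROR" && $.status >= 500 }.
// It mirrors the intent of the pattern; it is not CloudWatch's matcher.
interface RailsLogEvent {
  severity?: string;
  status?: number;
  controller?: string;
  action?: string;
}

function matchesServerError(line: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return false; // non-JSON lines cannot match a JSON pattern
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const event = parsed as RailsLogEvent;
  return (
    event.severity === "ERROR" &&
    typeof event.status === "number" &&
    event.status >= 500
  );
}

// Invented sample lines covering the cases worth checking.
const lines = [
  '{"severity":"ERROR","status":503,"controller":"feeds","action":"index"}',
  '{"severity":"ERROR","status":404,"controller":"feeds","action":"show"}',
  '{"severity":"INFO","status":500,"controller":"posts","action":"create"}',
  "plain text line",
];

const matched = lines.filter(matchesServerError);
console.log(matched.length);
```

Only the first line satisfies both conditions. The real test is still running the pattern against the log group, since this only encodes what you expect it to do.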

War story: the BMA Sidekiq pile-up

This is the kind of moment metric filters earn their keep. The release pipeline that shipped native iOS and Android apps for creators on the platform was a Rails plus Python plus Fastlane plus GitHub Actions stack. One Wednesday morning the pending_apple_review Sidekiq queue started backing up. By lunch a couple hundred customer builds were stuck in “Waiting for Review” on App Store Connect, but our pipeline thought they were submitted successfully. Customer support was drowning by 2 p.m. Pacific.

What we tried first was wrong. The pipeline already had auto-retry on 5xx. We extended it to retry on the “stuck” state too. That made it worse, Apple started seeing duplicate submissions and a chunk of customers ended up with two competing review records.

The real fix was a circuit breaker that verified submission state with a separate GET against App Store Connect, never trusting the POST response. The CloudWatch part is that we had no metric for “Sidekiq job that returned success but produced no downstream state change.” After the incident, a metric filter on the queue’s structured logs pulled out job_duration_ms per queue_name and a separate filter counted submissions vs follow-up state confirmations. Divergence between the two became an alarm. Should have existed from day one.
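The divergence check itself is simple arithmetic, sketched below. In the setup described above the comparison lived in a CloudWatch alarm, not in application code, so this only illustrates the logic. The function name, the window shape, the tolerance of 5, and the three-window requirement are all assumptions.

```typescript
// One evaluation window of the two metric-filter counts (hypothetical shape).
interface SubmissionWindow {
  submitted: number; // jobs that reported a successful submission
  confirmed: number; // follow-up checks that saw the expected state change
}

// True when unconfirmed submissions exceed `tolerance` in each of the
// last `windowsRequired` windows.
function submissionsDiverged(
  windows: SubmissionWindow[],
  tolerance = 5,
  windowsRequired = 3,
): boolean {
  if (windows.length < windowsRequired) return false;
  return windows
    .slice(-windowsRequired)
    .every((w) => w.submitted - w.confirmed > tolerance);
}

// Invented numbers: a healthy morning versus a stuck queue.
const healthy: SubmissionWindow[] = [
  { submitted: 40, confirmed: 39 },
  { submitted: 42, confirmed: 42 },
  { submitted: 38, confirmed: 36 },
];
const stuck: SubmissionWindow[] = [
  { submitted: 40, confirmed: 12 },
  { submitted: 45, confirmed: 3 },
  { submitted: 50, confirmed: 0 },
];

console.log(submissionsDiverged(healthy), submissionsDiverged(stuck));
```

Requiring several consecutive windows is a design choice to ride out the normal delay between a submission and its confirmation.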

Composite alarms over noisy primitives

This is the bit that took me longest to internalize. Single-metric alarms fire when their metric crosses a threshold. That’s it. The problem is that almost every interesting failure mode in production is the intersection of two or three conditions, not one. Replica lag spikes happen all the time, most are harmless. Replica lag spike plus elevated WAL write IOPS on the writer is a different story. That’s when you actually want to wake someone up.

CDK example:

import * as cdk from "aws-cdk-lib";
import * as cw from "aws-cdk-lib/aws-cloudwatch";
import * as actions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import { Construct } from "constructs";

export class AuroraHealthAlarms extends Construct {
  constructor(scope: Construct, id: string, props: { topicArn: string }) {
    super(scope, id);

    const lagHigh = new cw.Alarm(this, "ReaderLagHigh", {
      metric: new cw.Metric({
        namespace: "AWS/RDS",
        metricName: "AuroraReplicaLagMaximum",
        statistic: "Maximum",
        period: cdk.Duration.seconds(60),
        dimensionsMap: { DBClusterIdentifier: "prod-community" },
      }),
      // The metric is reported in milliseconds, so this is 60 seconds of lag.
      threshold: 60_000,
      evaluationPeriods: 2,
      treatMissingData: cw.TreatMissingData.NOT_BREACHING,
    });

    const walPressure = new cw.Alarm(this, "WriterWALPressure", {
      metric: new cw.Metric({
        namespace: "AWS/RDS",
        metricName: "WriteIOPS",
        statistic: "Average",
        period: cdk.Duration.seconds(60),
        dimensionsMap: { DBInstanceIdentifier: "prod-community-writer-1" },
      }),
      threshold: 8_000,
      evaluationPeriods: 3,
      treatMissingData: cw.TreatMissingData.NOT_BREACHING,
    });

    // Pages only when both child alarms are in ALARM at the same time.
    const page = new cw.CompositeAlarm(this, "AuroraReadPathPage", {
      compositeAlarmName: "aurora-read-path-degraded",
      alarmRule: cw.AlarmRule.allOf(
        cw.AlarmRule.fromAlarm(lagHigh, cw.AlarmState.ALARM),
        cw.AlarmRule.fromAlarm(walPressure, cw.AlarmState.ALARM),
      ),
      actionsEnabled: true,
    });

    // Topic lives in aws-sns, not aws-cloudwatch.
    page.addAlarmAction(new actions.SnsAction(
      sns.Topic.fromTopicArn(this, "PagerDuty", props.topicArn),
    ));
  }
}

The individual alarms still exist, they just don’t page anyone. They route to a Slack channel for visibility. The composite is what hits PagerDuty.
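The split between visibility and paging can be modelled in a few lines. This is a sketch of the routing logic only, with hypothetical names (allInAlarm, routeNotification). In the real setup CloudWatch evaluates the composite rule and SNS does the routing.

```typescript
type ChildAlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// Models an allOf rule: true only when every child alarm is in ALARM.
function allInAlarm(states: ChildAlarmState[]): boolean {
  return states.length > 0 && states.every((s) => s === "ALARM");
}

// Hypothetical routing: the composite pages, a lone child only posts to Slack.
function routeNotification(states: ChildAlarmState[]): "page" | "slack" | "none" {
  if (allInAlarm(states)) return "page";
  return states.some((s) => s === "ALARM") ? "slack" : "none";
}

// Order of the states here: [reader lag, writer WAL pressure].
const lagOnly = routeNotification(["ALARM", "OK"]);
const lagAndWal = routeNotification(["ALARM", "ALARM"]);
const quiet = routeNotification(["OK", "INSUFFICIENT_DATA"]);
console.log(lagOnly, lagAndWal, quiet);
```

A harmless lag spike lands in Slack, and only the intersection wakes someone up.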

War story: the Aurora lag morning

Back to that Tuesday morning. Around 10:14 a.m. PT, Datadog’s reader-lag alert fired. The on-call’s first move was reasonable, bump reader instance class up two tiers. The reasoning was “we’re CPU-bound on the readers.” Wrong root cause. The readers weren’t bottlenecked, they were starved of WAL. The composite signal, lag-high AND writer-WAL-pressure, would have pointed straight at the writer. Took us about 22 minutes of degraded read latency for millions of customers to figure out what we already had the data to know.

What I shipped that week was a small thing. A runbook that leads with the literal sentence “Before touching reader scaling, check pg_stat_activity on the writer.” And the composite alarm above, ported to our IaC. I’m the reason that sentence is in there.

Logs Insights queries I keep around

Three queries live in a Notion page I pin during incidents.

p99 latency from EMF, per route:

fields @timestamp, route, latency_ms
| filter ispresent(latency_ms) and service = "checkout-api"
| stats pct(latency_ms, 99) as p99 by route, bin(1m)
| sort p99 desc
| limit 200

Error rate per controller, mined from Rails JSON logs:

fields @timestamp, controller, action, status
| filter status >= 500
| stats count(*) as errors by controller, action, bin(5m)
| sort errors desc
| limit 100

Slow database queries, mined from Rails verbose log output:

fields @timestamp, sql, duration_ms
| filter duration_ms > 1000
| sort duration_ms desc
| limit 50

These are not impressive. They’re the queries I run during the first ten minutes of an incident. Boring is the point.
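If the pinned queries end up in tooling and not just a notes page, a small builder keeps the service name and bin size adjustable. This is a sketch with a hypothetical helper, buildP99Query. It sorts by p99 so the worst buckets come first, and it rejects service names with unexpected characters so they cannot break out of the quoted string.

```typescript
// Hypothetical builder for the p99-per-route query; the names are illustrative.
function buildP99Query(service: string, binMinutes = 1, limit = 200): string {
  // Keep the interpolated value boring so it cannot alter the query.
  if (!/^[A-Za-z0-9_.-]+$/.test(service)) {
    throw new Error(`unexpected characters in service name: ${service}`);
  }
  return [
    "fields @timestamp, route, latency_ms",
    `| filter ispresent(latency_ms) and service = "${service}"`,
    `| stats pct(latency_ms, 99) as p99 by route, bin(${binMinutes}m)`,
    "| sort p99 desc",
    `| limit ${limit}`,
  ].join("\n");
}

const query = buildP99Query("checkout-api", 5);
console.log(query);
```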

Cost control on CloudWatch

CloudWatch defaults are not your friend. Log groups retain forever. Logs Insights scans cost real money on big log volumes. High-cardinality custom metrics multiply faster than you’d think.

A few rules I follow:

Every log group gets an explicit retention. 7 days for chatty access logs, 30 for app logs, 90 for audit. Set it at creation in IaC, do not rely on someone remembering.
Prefer EMF over PutMetricData. EMF rides on log volume, which is cheaper per data point and rate-limit-free.
Cardinality is the silent killer. user_id as a dimension is almost never what you want. Service, route, status code, region. That’s usually it.
For long-term storage, export to S3 with a lifecycle policy. CloudWatch is for the last 90 days, S3 plus Athena is for the rest.
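The retention rule is easy to check mechanically. The sketch below assumes you already have a list of log groups in roughly the shape the DescribeLogGroups API returns, where retentionInDays is absent for groups that never expire. The name prefixes and helper names are hypothetical. The 7, 30, and 90 day values come from the rules above.

```typescript
// Retention classes from the rules above; the name prefixes are hypothetical.
const RETENTION_RULES: Array<{ prefix: string; days: number }> = [
  { prefix: "/audit/", days: 90 },
  { prefix: "/access/", days: 7 },
];
const DEFAULT_APP_RETENTION_DAYS = 30;

// Picks the intended retention for a log group by name prefix.
function retentionDaysFor(logGroupName: string): number {
  const rule = RETENTION_RULES.find((r) => logGroupName.startsWith(r.prefix));
  return rule ? rule.days : DEFAULT_APP_RETENTION_DAYS;
}

interface LogGroupSummary {
  logGroupName: string;
  retentionInDays?: number; // absent means the group never expires
}

// Returns the names of groups that would keep logs forever.
function groupsMissingRetention(groups: LogGroupSummary[]): string[] {
  return groups
    .filter((g) => g.retentionInDays === undefined)
    .map((g) => g.logGroupName);
}

// Invented inventory for illustration.
const inventory: LogGroupSummary[] = [
  { logGroupName: "/aws/eks/web/rails", retentionInDays: 30 },
  { logGroupName: "/audit/payments" },
];

console.log(groupsMissingRetention(inventory), retentionDaysFor("/audit/payments"));
```

Run as a CI check or a scheduled job, something like this catches the log group someone created by hand.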

Quick CLI:

aws logs put-retention-policy \
  --log-group-name /aws/eks/web/rails \
  --retention-in-days 30

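# Firehose destinations need a role that CloudWatch Logs can assume.
# The role name below is a placeholder.
aws logs put-subscription-filter \
  --log-group-name /aws/eks/web/rails \
  --filter-name to-s3-archive \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:us-east-1:111111111111:deliverystream/logs-archive \
  --role-arn arn:aws:iam::111111111111:role/cwl-to-firehose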

I’d rather see one well-tuned alarm fire and mean something than a wall of false positives and a five-figure monthly bill.

Takeaways

Use EMF for custom metrics from your code. It is cheaper, faster, and gets you high-cardinality context for free.
Use metric filters when you can’t instrument the app directly. They are the underrated 80% solution.
Composite alarms, not single-metric alarms, for anything that pages a human. Intersection of signals beats threshold of one.
Set log retention day one. Forever is not a strategy.
CloudWatch is enough for AWS-only infra. Reach for Datadog when you actually need cross-cloud or deep APM, not because it’s the default.

Thanks for reading. If you’ve got thoughts, send them my way.