When Auto-Scaling Wrecks Your Bill

How a quiet memory leak quadrupled an AWS bill, why NAT Gateway costs catch teams off guard, and the cost engineering program we built to stop the bleeding.

The first time I really stared at an AWS bill, monthly spend at the creator economy platform I worked at had gone from comfortable to “we need to talk on Monday” inside one quarter. Roughly 4x. No launch, no traffic surge to explain it. A slow climb in EKS node hours, a sharper climb in NAT Gateway data processing, and a NAT spike on a Tuesday I remember because I was on a flight when the finance Slack thread lit up.

Auto-scaling had done exactly what we told it to. We just hadn’t told it the right thing. Auto-scaling is not a cost strategy. It’s a reliability strategy. Treat it as the cost strategy and the bill eats you.

The leak that pretended to be load

Here’s how it starts. A Node.js service ships a feature that holds a request-scoped buffer slightly longer than it should. RSS climbs about 12 MB per hour per pod. Liveness probe is happy. Readiness probe is happy. The HPA is watching CPU.

Memory pressure turns into GC pauses, slower handlers, and rising CPU. Rising CPU is the only part the HPA can see, and it reads as load. So the HPA does the only thing it can. It adds pods. The new pods inherit the leak. Within days you have thousands of pods on EKS, each opening a fresh Aurora connection on boot. The bill is just a faithful printout of that misunderstanding.

# what we had
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: community-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: community-api
  minReplicas: 12
  maxReplicas: 400
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

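No memory signal in the scaling decision. No pod-level memory budget as a hard limit. No meaningful upper guardrail either: a maxReplicas of 400 against a floor of 12 is a ceiling in name only. The cluster will happily quadruple itself in front of a leak and call that a healthy day.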

The fix wasn’t cleverer scaling. It was a hard resources.limits.memory per pod, a CI check that fails any deploy without one, and a Datadog alert on process_resident_memory_bytes that fires when RSS climbs more than 10% per hour at steady RPS. That signal would have caught the original leak in about a day, not a quarter.
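A minimal sketch of that RSS-growth check, in TypeScript to match the other script in this post. It assumes you already have per-pod samples of RSS and request rate from whatever scrapes process_resident_memory_bytes. The Sample shape, the function name, and the 10% band used to define "steady RPS" are illustrative assumptions, not the production alert.

```typescript
// One observation per scrape interval. Field names are illustrative.
interface Sample {
  timestampMs: number; // epoch milliseconds
  rssBytes: number;    // resident set size of the pod's process
  rps: number;         // requests per second served by the pod
}

// Returns true when RSS grew faster than maxGrowthPerHour (0.10 = 10%)
// across the window while RPS stayed within rpsTolerance of its first value.
function isLeakSuspect(
  samples: Sample[],
  maxGrowthPerHour = 0.10,
  rpsTolerance = 0.10, // assumed definition of "steady RPS"
): boolean {
  if (samples.length < 2) return false;

  const sorted = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const hours = (last.timestampMs - first.timestampMs) / 3_600_000;
  if (hours <= 0 || first.rssBytes <= 0 || first.rps <= 0) return false;

  // Steady traffic: every sample's RPS stays inside the tolerance band.
  const steady = sorted.every(
    (s) => Math.abs(s.rps - first.rps) / first.rps <= rpsTolerance,
  );
  if (!steady) return false;

  // Average relative growth per hour over the whole window.
  const growthPerHour =
    (last.rssBytes - first.rssBytes) / first.rssBytes / hours;
  return growthPerHour > maxGrowthPerHour;
}
```

The traffic check matters as much as the growth check. Memory that grows with load is normal; memory that grows while load is flat is the leak signature.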

The NAT Gateway surprise

NAT Gateway data processing is the line item nobody looks at until it’s too late. We were paying the per-GB processing charge on almost every call from EKS to S3, Aurora, Secrets Manager, and a few internal APIs. Traffic shape was fine. Routing was wrong.

The fix is VPC endpoints. Gateway endpoints for S3 and DynamoDB (those are free), and Interface endpoints for the chatty services eating the most NAT bytes: Secrets Manager, ECR, STS, SSM.

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
  tags = {
    Name          = "vpce-s3-gw"
    "cost-center" = "platform"
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpce.id]
  private_dns_enabled = true
}

Two things to call out. Interface endpoints aren’t free: each one bills per AZ per hour plus its own per-GB charge, so add them only where the NAT processing they replace costs more than the endpoint does. And private_dns_enabled = true is the line that actually moves traffic. Without it your SDK still resolves the public endpoint and routes through NAT. Learned that one on a Friday when the bill didn’t move after a deploy.
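The break-even for an Interface endpoint is simple arithmetic, sketched here under stated assumptions. The rates in examplePrices are placeholders for illustration, not quoted prices, so substitute the current numbers for your region. The function names and the 730-hour month are also my own choices.

```typescript
interface Prices {
  natPerGb: number;          // NAT Gateway data processing, USD per GB
  endpointPerAzHour: number; // Interface endpoint, USD per AZ per hour
  endpointPerGb: number;     // Interface endpoint data processing, USD per GB
}

// Placeholder rates (assumption). Check the pricing page for your region.
const examplePrices: Prices = {
  natPerGb: 0.045,
  endpointPerAzHour: 0.01,
  endpointPerGb: 0.01,
};

// Monthly savings from moving gbPerMonth of traffic off NAT and onto one
// Interface endpoint deployed in azCount AZs. Negative means the endpoint loses.
function monthlySavings(
  gbPerMonth: number,
  azCount: number,
  prices: Prices,
  hoursPerMonth = 730,
): number {
  const natCost = gbPerMonth * prices.natPerGb;
  const endpointCost =
    azCount * hoursPerMonth * prices.endpointPerAzHour +
    gbPerMonth * prices.endpointPerGb;
  return natCost - endpointCost;
}

// GB per month above which the endpoint is cheaper than NAT processing.
function breakEvenGb(
  azCount: number,
  prices: Prices,
  hoursPerMonth = 730,
): number {
  const perGbDelta = prices.natPerGb - prices.endpointPerGb;
  if (perGbDelta <= 0) return Infinity; // endpoint never pays for itself
  return (azCount * hoursPerMonth * prices.endpointPerAzHour) / perGbDelta;
}
```

With the placeholder rates, one endpoint across three AZs breaks even at roughly 626 GB a month. The model only covers the per-GB side: the NAT Gateway's own hourly charge stays on the bill for as long as the gateway exists.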

When autoscale fed the fire

This one is older. A real-time trading and charting platform I architected, designed for very high concurrent WebSocket fan-out. London market open on a Tuesday after a bank holiday. At 09:31:14, 74 seconds after open, connections started dropping. Clients reconnected, dropped, reconnected. Within 90 seconds every gateway pod was pinned at 100% CPU. p99 tick fan-out went from about 80 ms to 3 seconds. Stale prices on charts are the worst failure mode in a trading product.

I was on-call. First instinct, scale gateway pods 3x, kubectl scale straight to 9 pods. New pods came online, hit the reconnect storm, went CPU-bound within about 20 seconds. I was feeding the fire. Worse, the higher pod count meant more partial-success reconnects. Clients landed on a healthy pod briefly, got the “connection established” signal, then dropped when that pod saturated.

Real fix, two things in parallel. An emergency client-side config push, through a remote-config channel we’d built for moments like this, that switched reconnects to jittered exponential backoff (min 200 ms, max 30 s, factor 2, jitter ±50%). And a per-IP connection-rate limiter at nginx, tight at 3 new connections per second per IP. Pool stabilized in about 8 minutes. Tick fan-out came back under 200 ms.
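For reference, a minimal sketch of jittered exponential backoff with the parameters above. This is not the client code we shipped; the names are illustrative. One detail the numbers alone don't settle is whether the 30 s cap applies before or after jitter. This sketch caps the base delay and then applies jitter, so an individual delay can land anywhere between 15 s and 45 s once the cap is reached.

```typescript
interface BackoffConfig {
  minMs: number;  // delay before the first retry
  maxMs: number;  // cap on the un-jittered delay
  factor: number; // multiplier per attempt
  jitter: number; // 0.5 means plus or minus 50%
}

const reconnectBackoff: BackoffConfig = {
  minMs: 200,
  maxMs: 30_000,
  factor: 2,
  jitter: 0.5,
};

// Delay before reconnect attempt `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  cfg: BackoffConfig = reconnectBackoff,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped at maxMs.
  const base = Math.min(cfg.maxMs, cfg.minMs * Math.pow(cfg.factor, attempt));
  // Multiply by a factor drawn uniformly from [1 - jitter, 1 + jitter).
  const spread = 1 + cfg.jitter * (2 * random() - 1);
  return Math.round(base * spread);
}
```

The jitter is what breaks the storm. Without it, every client that dropped at the same instant retries at the same instant, and the synchronized wave simply arrives later.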

Roughly 14 minutes of degraded ticks during one of the most-watched windows of the trading week. The lesson I repeat in every architecture review is short. Autoscale is not a fix for a self-amplifying client-side bug. Backoff lives on the client.

Right-sizing past the obvious

Years later, on the creator-tools platform, the Aurora layer was a multi-terabyte writer with three reader replicas, all running r6g.4xlarge. Instance class had been set once, eighteen months earlier, and never revisited.

The wrong move I want to call out wasn’t even ours. Reader replica lag fired at 10:14 a.m. Pacific. The on-call’s first reflex was to bump the instance class several sizes, r6g.4xlarge to r6g.16xlarge. Lag didn’t move. The readers weren’t bottlenecked, they were starved of WAL because a long ANALYZE on a hot table was holding write-side locks. We killed the analyze. Lag drained in about 6 minutes. The cost lesson was secondary. Vertical scale-up is the most expensive guess in your toolbox. Almost nobody undoes it.

That kicked off the Reserved Instance work out of one of our hackathons. We audited Aurora usage from the actual CloudWatch workload graph, modeled RI coverage at various commitment levels, and pulled the trigger on a multi-year strategy matched to the real baseline. The audit is the work. Buying the RIs is the easy part.
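Modeling coverage at different commitment levels can be as small as the sketch below. It is a simplified illustration, not the model we ran: it assumes usage is expressed as instances in use per hour for a single instance class, that a reservation bills every hour whether used or not, and that both hourly rates are inputs you supply. All names are hypothetical.

```typescript
// Total cost over the window when `reserved` instances are committed.
// hourlyUsage[i] is the number of instances in use during hour i.
function costAtCommitment(
  hourlyUsage: number[],
  reserved: number,
  onDemandRate: number, // USD per instance-hour, on demand
  reservedRate: number, // effective USD per instance-hour under the RI
): number {
  // Reserved capacity is paid for every hour, used or idle.
  const reservedCost = reserved * hourlyUsage.length * reservedRate;
  // Anything above the commitment spills over to on-demand pricing.
  const onDemandCost = hourlyUsage.reduce(
    (sum, used) => sum + Math.max(0, used - reserved) * onDemandRate,
    0,
  );
  return reservedCost + onDemandCost;
}

// Tries every whole-instance commitment from 0 up to peak usage and
// returns the cheapest one.
function cheapestCommitment(
  hourlyUsage: number[],
  onDemandRate: number,
  reservedRate: number,
): { reserved: number; cost: number } {
  const peak = hourlyUsage.reduce((max, used) => Math.max(max, used), 0);
  let best = {
    reserved: 0,
    cost: costAtCommitment(hourlyUsage, 0, onDemandRate, reservedRate),
  };
  for (let r = 1; r <= Math.ceil(peak); r++) {
    const cost = costAtCommitment(hourlyUsage, r, onDemandRate, reservedRate);
    if (cost < best.cost) best = { reserved: r, cost };
  }
  return best;
}
```

The answer tends to sit at the baseline, not the peak, which is why the audit has to come from the real workload graph.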

import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });

async function readerBaselineCpu(dbInstanceId: string, days = 30) {
  const end = new Date();
  const start = new Date(end.getTime() - days * 86_400_000);

  const res = await cw.send(new GetMetricStatisticsCommand({
    Namespace: "AWS/RDS",
    MetricName: "CPUUtilization",
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: dbInstanceId }],
    StartTime: start,
    EndTime: end,
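    Period: 3600,
    // Percentiles go in ExtendedStatistics. Statistics only accepts
    // SampleCount, Average, Sum, Minimum, and Maximum, and a single call
    // takes one of the two fields, so p100 stands in for Maximum.
    ExtendedStatistics: ["p50", "p95", "p100"],
  }));

  // Datapoints come back unordered; sort oldest first.
  return (res.Datapoints ?? []).sort((a, b) => +a.Timestamp! - +b.Timestamp!);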
}

That’s the script I write first in any cost review. Before vendor pitches or FinOps tools, you need a flat baseline view of what the workload actually does.

The cost engineering program we built after

After the dust settled, we ran a small program inside one of the platform squads. Cost as engineering discipline, not a finance ticket. Three habits.

Weekly cost diff in Slack, by service and AWS line item, with anomaly detection on top of Cost Explorer. Owners get pinged when a service moves more than 15% week over week at flat traffic.
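The core of that weekly diff is a small filter, sketched here. It assumes the cost and request numbers have already been pulled and joined per service and line item. The row shape, the function names, and the 5% band that defines "flat traffic" are my assumptions; only the 15% threshold comes from our rule.

```typescript
interface WeeklyCost {
  service: string;
  lineItem: string; // e.g. "NAT Gateway data processing"
  lastWeekUsd: number;
  thisWeekUsd: number;
  lastWeekRequests: number;
  thisWeekRequests: number;
}

// Relative change from `before` to `after`; 0.15 means +15%.
function relativeChange(before: number, after: number): number {
  if (before === 0) return after === 0 ? 0 : Infinity;
  return (after - before) / before;
}

// Rows whose cost moved more than costThreshold in either direction
// while traffic stayed within trafficTolerance.
function flagCostAnomalies(
  rows: WeeklyCost[],
  costThreshold = 0.15,
  trafficTolerance = 0.05, // assumed definition of "flat traffic"
): WeeklyCost[] {
  return rows.filter((row) => {
    const costMove = Math.abs(
      relativeChange(row.lastWeekUsd, row.thisWeekUsd),
    );
    const trafficMove = Math.abs(
      relativeChange(row.lastWeekRequests, row.thisWeekRequests),
    );
    return costMove > costThreshold && trafficMove <= trafficTolerance;
  });
}
```

Drops get flagged as well as spikes. A line item that falls 20% at flat traffic is usually either good news worth sharing or something that quietly stopped running.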

A pre-deploy check that requires resource limits on every Kubernetes manifest and refuses any deploy pushing maxReplicas above a ceiling without a justification annotation. Auto-scaling stays. It just has guardrails now.
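A sketch of what such a pre-deploy check can look like, assuming the manifests have already been parsed from YAML into plain objects. The ceiling of 50, the annotation key, and the decision to require only a memory limit are hypothetical choices for illustration, not the values or rules we actually ran.

```typescript
interface Container {
  name?: string;
  resources?: { limits?: { memory?: string; cpu?: string } };
}

interface Manifest {
  kind: string;
  metadata?: { name?: string; annotations?: Record<string, string> };
  spec?: {
    maxReplicas?: number;
    template?: { spec?: { containers?: Container[] } };
  };
}

// Hypothetical policy values. Pick your own.
const MAX_REPLICAS_CEILING = 50;
const JUSTIFICATION_ANNOTATION = "cost.example.com/max-replicas-justification";

// Returns a list of policy violations; an empty list means the manifest passes.
function checkManifest(m: Manifest): string[] {
  const errors: string[] = [];
  const name = m.metadata?.name ?? "(unnamed)";

  if (m.kind === "Deployment") {
    const containers = m.spec?.template?.spec?.containers ?? [];
    for (const c of containers) {
      if (!c.resources?.limits?.memory) {
        errors.push(
          `${name}/${c.name ?? "(unnamed)"}: missing resources.limits.memory`,
        );
      }
    }
  }

  if (m.kind === "HorizontalPodAutoscaler") {
    const max = m.spec?.maxReplicas;
    const justification =
      m.metadata?.annotations?.[JUSTIFICATION_ANNOTATION]?.trim();
    if (typeof max === "number" && max > MAX_REPLICAS_CEILING && !justification) {
      errors.push(
        `${name}: maxReplicas ${max} exceeds ${MAX_REPLICAS_CEILING} without a justification annotation`,
      );
    }
  }

  return errors;
}
```

In CI, a non-empty result fails the pipeline and prints the messages. The annotation is a deliberate escape hatch: going past the ceiling stays possible, it just has to be written down.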

A monthly architecture review where cost lives inside the reliability review, not separately. Cost sits next to p99 and error rate. Same dashboard. Same owner.

Auto-scaling didn’t betray us. We just never told it where to stop.

Takeaways

Treat auto-scaling as a reliability tool, not a cost tool. Cap it. Justify the cap.
Pod-level memory limits and a per-pod RSS-growth alert catch leaks faster than any HPA tuning.
NAT Gateway processing is one of the biggest hidden line items. VPC endpoints for the chatty AWS services pay for themselves in weeks.
Right-size before you scale up. The bigger instance almost never gets undone.
Reserved Instances are an engineering decision, not a procurement one. Do the workload audit yourself.
Put cost on the same dashboard as latency and error rate. Same review. Same owner.

Thanks for reading. If you’ve got thoughts, send them my way.