When AWS Lambda Earns Its Keep

An honest look at AWS Lambda for backend workloads: cold starts, RDS Proxy, provisioned concurrency, and the specific shapes of work where Lambda is the wrong call.

The first Lambda I ever paged on woke me up at 3 a.m. Pacific, and it wasn’t even the function’s fault. A community feature at the creator-economy platform I worked at was running on Node Lambdas behind API Gateway, hitting Aurora through the standard pg client. Traffic was fine. Cold starts were fine. What wasn’t fine was the writer’s connection count, which had been climbing toward max_connections all night because every invocation was opening a fresh Postgres connection, holding it for 90 ms, and walking away without closing it cleanly. By the time the alert fired we had thousands of half-dead connections on the writer. Aurora started refusing new ones. The on-call rotated me in because I owned the Aurora layer that quarter.

That incident is the reason I have opinions about Lambda. Strong ones. So when someone on a new team asks me “should we put this on Lambda”, I usually answer with another question. Tell me the shape of the work first.

Where Lambda actually earns its keep

Lambda is great at three shapes of work. Event reactors with bursty, unpredictable concurrency. Glue between AWS services where you’d otherwise stand up a small Node container nobody wants to maintain. And scheduled jobs that run on a cron and would be embarrassing to host on a permanent EC2 instance.

The Community product I touched at the creator platform leaned on Lambdas exactly here. S3 object-created hooks. SQS consumers for fan-out work. EventBridge scheduled jobs for nightly aggregations. The hot read/write path stayed on container workloads on EKS, with a few thousand pods. The async edges went serverless. That split has held up well.

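import type { SQSEvent, SQSBatchResponse } from 'aws-lambda';
import { DynamoDB } from '@aws-sdk/client-dynamodb';

// Created once per execution environment and reused across warm invocations.
const ddb = new DynamoDB({ region: process.env.AWS_REGION });

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const failures: { itemIdentifier: string }[] = [];

  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const body = JSON.parse(record.body);
        await ddb.putItem({
          TableName: process.env.EVENTS_TABLE!,
          Item: {
            pk: { S: `evt#${body.id}` },
            occurred_at: { S: body.occurredAt },
            payload: { S: record.body },
          },
          // Reject the write if this event id is already stored.
          ConditionExpression: 'attribute_not_exists(pk)',
        });
      } catch (err) {
        // A failed condition means a redelivered duplicate that is already
        // stored. Treat it as done so SQS does not redrive it forever.
        if ((err as Error).name === 'ConditionalCheckFailedException') return;
        // Partial-batch failure: SQS redrives only the bad ids. This needs
        // ReportBatchItemFailures enabled on the event source mapping.
        failures.push({ itemIdentifier: record.messageId });
      }
    }),
  );

  return { batchItemFailures: failures };
};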

That’s the shape I happily put on Lambda. Idempotent, bounded, no fan-out to Aurora, partial-batch responses wired so SQS only redrives bad records. The function does one thing and gets out of the way.

The cold start tax is real

Cold starts are not as scary as the internet thinks, but they’re also not free. On Node 20 with a slim bundle I see init times somewhere between 280 ms and 700 ms. With provisioned concurrency on, the init happens ahead of the request and that drops to roughly 10 ms. The cost is that you’re paying for warm capacity around the clock, which kills one of Lambda’s main selling points.

The honest call: if your p99 budget for the endpoint is 200 ms and your cold start is 600 ms, you have a problem on every scale-out event. Bursty consumer traffic at a creator-economy platform meant we hit this a lot. We turned on provisioned concurrency for the user-facing Lambdas, kept it off for the async ones, and watched the bill closely. Provisioned concurrency is fine. Just don’t pretend it’s still serverless pricing.
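Before paying for provisioned concurrency, it helps to know how often cold starts actually land on a request. One common way to find out is a module-scope flag, since module scope runs once per execution environment. The sketch below is a minimal illustration of that idea, not the wiring we ran in production; `withColdStartLog` and the `InvocationLog` shape are hypothetical names.

```typescript
// Module scope runs once per execution environment, i.e. once per init.
const initStartedAt = Date.now();
let isColdStart = true;

// Hypothetical shape for the log line; adapt it to your own logging setup.
export interface InvocationLog {
  coldStart: boolean;   // true only for the first invocation in this environment
  msSinceInit: number;  // gap between module init and the start of this invocation
  handlerMs: number;    // time spent inside the wrapped handler
}

// Wraps any async handler and tags each invocation as cold or warm.
export function withColdStartLog<E, R>(
  handler: (event: E) => Promise<R>,
  log: (entry: InvocationLog) => void = (entry) => console.log(JSON.stringify(entry)),
): (event: E) => Promise<R> {
  return async (event: E): Promise<R> => {
    const coldStart = isColdStart;
    isColdStart = false; // every later invocation in this environment is warm
    const startedAt = Date.now();
    try {
      return await handler(event);
    } finally {
      log({
        coldStart,
        msSinceInit: startedAt - initStartedAt,
        handlerMs: Date.now() - startedAt,
      });
    }
  };
}
```

With provisioned concurrency the init runs before any request arrives, so the first invocation still reports coldStart as true but with a large msSinceInit. Reading the two fields together is what tells you whether a caller actually waited on the init.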

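resource "aws_lambda_function" "community_search" {
  function_name = "community-search"
  role          = aws_iam_role.lambda_exec.arn
  package_type  = "Image"
  image_uri     = "${var.ecr_repo}:${var.image_tag}"
  memory_size   = 1024
  timeout       = 6
  publish       = true # provisioned concurrency needs a published version, not $LATEST

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda_sg.id]
  }

  environment {
    variables = {
      # Hostname of the RDS Proxy endpoint. The handler also reads DB_NAME and DB_USER.
      DB_PROXY_HOST = var.rds_proxy_endpoint
      LOG_LEVEL     = "info"
    }
  }
}

# Alias that the provisioned concurrency config below attaches to.
resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.community_search.function_name
  function_version = aws_lambda_function.community_search.version
}

resource "aws_lambda_provisioned_concurrency_config" "community_search" {
  function_name                     = aws_lambda_function.community_search.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 20
}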

RDS Proxy is not optional

Back to the 3 a.m. page. The fix that night was operational: kill the long-lived stuck connections from the writer side, scale down the Lambda concurrency, and let things settle. The real fix went in over the next two days. We put RDS Proxy in front of Aurora for every Lambda that needed to talk to Postgres, and we changed the function’s client wiring to assume the connection is shared, not owned.

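import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { Pool } from 'pg';

// Module scope: the pool is created once per execution environment and reused while warm.
// No password is set here; pg falls back to the PGPASSWORD environment variable when it is omitted.
const pool = new Pool({
  host: process.env.DB_PROXY_HOST,
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  ssl: { rejectUnauthorized: true },
  max: 1,                  // each Lambda env gets one logical connection
  idleTimeoutMillis: 0,    // 0 disables client-side idle reaping; the proxy owns it
  connectionTimeoutMillis: 800,
});

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  // pathParameters is null when the route carries no path parameters.
  const id = event.pathParameters?.id;
  if (!id) {
    return { statusCode: 400, body: JSON.stringify({ error: 'missing id' }) };
  }

  const client = await pool.connect();
  try {
    const { rows } = await client.query(
      'select id, slug from community_posts where id = $1 limit 1',
      [id],
    );
    return { statusCode: 200, body: JSON.stringify(rows[0] ?? null) };
  } finally {
    client.release(); // hand the connection back to the pool; do not close it
  }
};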

The proxy pools connections on the Aurora side and multiplexes many client connections onto them. Your Lambdas get cheap, reused connections, the database-side pool outlives any single execution environment, and max_connections stops being something a traffic burst can blow through. If you’re putting a Lambda anywhere near a relational DB, the proxy is the price of admission. I won’t sign off on a design that skips it.
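The arithmetic behind that rule is worth writing down once. The sketch below compares worst-case database connections with and without a proxy. Every number in it is hypothetical and picked only for illustration, and the function names are mine. The percentage cap is modelled on RDS Proxy’s MaxConnectionsPercent setting, on the assumption that it is the only limit in play.

```typescript
// Worst case without a proxy: every execution environment holds its own connections.
export function directConnections(concurrentEnvs: number, poolMaxPerEnv: number): number {
  return concurrentEnvs * poolMaxPerEnv;
}

// With a proxy in front, database-side connections are capped at a share of
// max_connections, no matter how many Lambda environments are running.
export function proxiedConnections(
  concurrentEnvs: number,
  poolMaxPerEnv: number,
  maxConnections: number,
  maxConnectionsPercent: number,
): number {
  const cap = Math.floor((maxConnections * maxConnectionsPercent) / 100);
  return Math.min(directConnections(concurrentEnvs, poolMaxPerEnv), cap);
}

// Hypothetical numbers, not taken from the incident.
const maxConnections = 500;   // writer's max_connections
const concurrentEnvs = 1000;  // Lambda environments alive during a burst
const poolMaxPerEnv = 1;      // matches `max: 1` in the pool config

console.log(directConnections(concurrentEnvs, poolMaxPerEnv)); // 1000
console.log(proxiedConnections(concurrentEnvs, poolMaxPerEnv, maxConnections, 75)); // 375
```

Without the proxy, the burst asks for twice what the writer will accept, and that is before counting any leaked connections. With it, the database never sees more than the cap and the extra clients wait at the proxy instead.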

A Saturday night that taught me where Lambda doesn’t fit

Different story, different platform. At the combat-sports tournament platform I CTO’d in London, we had a leaderboard surface fed by a consumer group writing into Elasticsearch off Kafka. Someone proposed migrating the indexer to a Lambda triggered by an SQS bridge from MSK. The argument was reasonable on paper. Lambdas scale. Indexers are bursty. Why pay for idle pods.

We didn’t ship it. The thing that killed the design was Kafka consumer-group semantics. A Lambda invocation per message destroys batching. It also makes offset commits almost impossible to reason about when a downstream call to the rules service occasionally takes around 70 seconds, far longer than the few-second invocations Lambda handles well. We had a real production incident on that consumer once. Stale rankings during a live federation tournament for eight hours, traced back to the bulk-write client silently entering a circuit-open state. The fix involved coordinated reindexing into a new ES index and aliasing. That kind of operation is a long-running, stateful, batch-aware job. Lambda is the wrong tool for it. We kept it on a container, sized it, monitored freshness, and moved on.

The rule I’ve come back to: if the work needs batching, long-lived connections, or careful offset handling, run it on a container. If the work is event-shaped, idempotent, bursty, and fits in a few seconds, Lambda is great.
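That rule is simple enough to write as a function, which is a handy way to force a design review to answer each question explicitly. This is only a sketch of the rule as stated above. The field names are hypothetical, and the five-second default is my own stand-in for “a few seconds”. The function returns “discuss” for anything the rule doesn’t cover.

```typescript
// Hypothetical description of a workload; every field name is illustrative.
export interface WorkShape {
  needsBatching: boolean;
  needsLongLivedConnections: boolean;
  needsOffsetControl: boolean;
  eventShaped: boolean;
  idempotent: boolean;
  bursty: boolean;
  typicalSeconds: number; // typical duration of one unit of work
}

export type Verdict = 'container' | 'lambda' | 'discuss';

// fewSeconds is an assumed threshold, not a Lambda limit.
export function pickRuntime(shape: WorkShape, fewSeconds = 5): Verdict {
  // Any one of these is enough to rule Lambda out.
  if (shape.needsBatching || shape.needsLongLivedConnections || shape.needsOffsetControl) {
    return 'container';
  }
  // Lambda needs all of these to hold at once.
  if (shape.eventShaped && shape.idempotent && shape.bursty && shape.typicalSeconds <= fewSeconds) {
    return 'lambda';
  }
  return 'discuss';
}

const kafkaIndexer: WorkShape = {
  needsBatching: true,
  needsLongLivedConnections: true,
  needsOffsetControl: true,
  eventShaped: true,
  idempotent: false,
  bursty: true,
  typicalSeconds: 70,
};

const sqsEventWriter: WorkShape = {
  needsBatching: false,
  needsLongLivedConnections: false,
  needsOffsetControl: false,
  eventShaped: true,
  idempotent: true,
  bursty: true,
  typicalSeconds: 1,
};

console.log(pickRuntime(kafkaIndexer));   // container
console.log(pickRuntime(sqsEventWriter)); // lambda
```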

Takeaways

Lambda earns its keep for event reactors, AWS glue, and scheduled jobs.
For anything talking to Postgres or Aurora, RDS Proxy is mandatory. The function should treat the connection as borrowed.
Cold starts on Node are real but tractable. Use provisioned concurrency when your latency budget demands it, and accept the bill.
Don’t put Kafka consumers, long-running batch jobs, or anything with careful offset semantics on Lambda. Containers win that fight.
Always wire SQS partial-batch responses. Otherwise one bad message redelivers the whole batch.

Thanks for reading. If you’ve got thoughts, send them my way.