An honest look at AWS Lambda for backend workloads: cold starts, RDS Proxy, provisioned concurrency, and the specific shapes of work where Lambda is the wrong call.
The first Lambda I ever paged on woke me up at 3 a.m. Pacific, and it wasn’t even the function’s fault. A community feature at the creator-economy platform I worked at was running on Node Lambdas behind API Gateway, hitting Aurora through the standard pg client. Traffic was fine. Cold starts were fine. What wasn’t fine was the writer’s max_connections counter, which had been climbing all night because every invocation was opening a fresh Postgres connection, holding it for 90 ms, and walking away without closing it cleanly. By the time the alert fired we had thousands of half-dead connections on the writer. Aurora started refusing new ones. The on-call rotated me in because I owned the Aurora layer that quarter.
That incident is the reason I have opinions about Lambda. Strong ones. So when someone on a new team asks me “should we put this on Lambda?”, I usually answer with another question. Tell me the shape of the work first.
Lambda is great at three shapes of work. Event reactors with bursty, unpredictable concurrency. Glue between AWS services where you’d otherwise stand up a small Node container nobody wants to maintain. And scheduled jobs that run on a cron and would be embarrassing to host on a permanent EC2 instance.
The Community product I touched at the creator platform leaned on Lambdas exactly here. S3 object created hooks. SQS consumers for fan-out work. EventBridge scheduled jobs for nightly aggregations. The hot read/write path stayed on container workloads behind EKS, with a few thousand pods. The async edges went serverless. That split has held up well.
import type { SQSEvent, SQSBatchResponse } from 'aws-lambda';
import { ConditionalCheckFailedException, DynamoDB } from '@aws-sdk/client-dynamodb';

// Created once per execution environment and reused across warm invocations.
const ddb = new DynamoDB({ region: process.env.AWS_REGION });

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const failures: { itemIdentifier: string }[] = [];
  await Promise.allSettled(
    event.Records.map(async (record) => {
      try {
        const body = JSON.parse(record.body);
        await ddb.putItem({
          TableName: process.env.EVENTS_TABLE!,
          Item: {
            pk: { S: `evt#${body.id}` },
            occurred_at: { S: body.occurredAt },
            payload: { S: record.body },
          },
          // Idempotency guard: reject the write if this event is already stored.
          ConditionExpression: 'attribute_not_exists(pk)',
        });
      } catch (err) {
        // A redelivered event that was already written counts as done, not failed.
        if (err instanceof ConditionalCheckFailedException) return;
        // Partial-batch failure: SQS redrives only these ids, provided
        // ReportBatchItemFailures is enabled on the event source mapping.
        failures.push({ itemIdentifier: record.messageId });
      }
    }),
  );
  return { batchItemFailures: failures };
};
That’s the shape I happily put on Lambda. Idempotent, bounded, no fan-out to Aurora, partial-batch responses wired so SQS only redrives bad records. The function does one thing and gets out of the way.
Cold starts are not as scary as the internet thinks, but they’re also not free. On Node 20 with a slim bundle I see somewhere between 280 ms and 700 ms for an init. With provisioned concurrency on, that drops to roughly 10 ms. The cost is you’re paying for warm capacity around the clock, which kills one of Lambda’s main selling points.
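If you want numbers like these for your own function, one common trick is a module-scope flag: module code runs once per execution environment, so the first invocation that sees the flag set is the one that paid for init. The sketch below is a minimal version of that idea. The names `markInvocation` and `handler` are hypothetical, and it only tags invocations; the init time itself shows up as `Init Duration` in the function's REPORT log line.

```typescript
// Module scope runs once per execution environment, i.e. once per cold start.
const initStartedAt = Date.now();
let isColdStart = true;

interface InvocationInfo {
  coldStart: boolean;   // true only for the first invocation in this environment
  msSinceInit: number;  // time since module init began
}

// Hypothetical helper: tags the current invocation and flips the flag.
function markInvocation(now: number = Date.now()): InvocationInfo {
  const info: InvocationInfo = {
    coldStart: isColdStart,
    msSinceInit: now - initStartedAt,
  };
  isColdStart = false;
  return info;
}

// Hypothetical handler showing where the helper would be called.
async function handler(_event: unknown): Promise<{ statusCode: number; body: string }> {
  const info = markInvocation();
  // Structured log line, so cold invocations can be filtered in a log query.
  console.log(JSON.stringify({ msg: 'invocation', ...info }));
  return { statusCode: 200, body: JSON.stringify(info) };
}
```

With provisioned concurrency the environment is initialized ahead of traffic, so the first request still sees `coldStart: true` here but with a large `msSinceInit`. That gap is a reasonable way to tell a pre-warmed first request from a real on-demand cold start.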
The honest call: if your p99 budget for the endpoint is 200 ms and your cold start is 600 ms, you have a problem on every scale-out event. Bursty consumer traffic at a creator-economy platform meant we hit this a lot. We turned on provisioned concurrency for the user-facing Lambdas, kept it off for the async ones, and watched the bill closely. Provisioned concurrency is fine. Just don’t pretend it’s still serverless pricing.
resource "aws_lambda_function" "community_search" {
  function_name = "community-search"
  role          = aws_iam_role.lambda_exec.arn
  package_type  = "Image"
  image_uri     = "${var.ecr_repo}:${var.image_tag}"
  memory_size   = 1024
  timeout       = 6
  # Provisioned concurrency attaches to a published version or alias, never $LATEST.
  publish       = true

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda_sg.id]
  }

  environment {
    variables = {
      DATABASE_URL = var.rds_proxy_endpoint
      LOG_LEVEL    = "info"
    }
  }
}

# aws_lambda_alias.live is assumed to be defined elsewhere and to point at a published version.
resource "aws_lambda_provisioned_concurrency_config" "community_search" {
  function_name                     = aws_lambda_function.community_search.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 20
}
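A rough way to size a number like provisioned_concurrent_executions is the standard concurrency estimate: peak requests per second times average duration in seconds, plus some headroom. The sketch below is that arithmetic and nothing more. The function name and the traffic figures in the usage line are made up for illustration; they are not the numbers behind the config above.

```typescript
// Estimate of concurrent executions needed at peak:
//   concurrency ~= requests per second * average duration in seconds
// Inputs are integers (ms, percent) to keep the arithmetic exact.
function estimateProvisionedConcurrency(
  peakRps: number,
  avgDurationMs: number,
  headroomPercent: number = 0,
): number {
  if (peakRps < 0 || avgDurationMs < 0 || headroomPercent < 0) {
    throw new RangeError('inputs must be non-negative');
  }
  // (rps * ms / 1000) * (100 + headroom) / 100, rounded up to whole environments.
  return Math.ceil((peakRps * avgDurationMs * (100 + headroomPercent)) / 100000);
}

// Illustrative inputs only: 200 req/s at 120 ms average with 25% headroom.
const suggested = estimateProvisionedConcurrency(200, 120, 25); // 30
```

Treat the result as a starting point. Anything above the provisioned number still runs, it just falls back to on-demand environments and their cold starts.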
Back to the 3 a.m. page. The fix that night was operational: kill the long-lived stuck connections from the writer side, scale down the Lambda concurrency, and let things settle. The real fix went in over the next two days. We put RDS Proxy in front of Aurora for every Lambda that needed to talk to Postgres, and we changed the function’s client wiring to assume the connection is shared, not owned.
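import { Pool } from 'pg';
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

// Module scope: the pool is created once per execution environment and reused
// by every warm invocation in that environment.
const pool = new Pool({
  host: process.env.DB_PROXY_HOST,
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  // No password set: pg falls back to PGPASSWORD if present.
  ssl: { rejectUnauthorized: true },
  max: 1, // each Lambda env gets one logical connection
  idleTimeoutMillis: 0, // never reap idle clients locally; proxy owns idle reaping
  connectionTimeoutMillis: 800,
});

// Assumes the API Gateway REST proxy event shape.
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const id = event.pathParameters?.id;
  if (!id) {
    return { statusCode: 400, body: JSON.stringify({ error: 'missing id' }) };
  }
  const client = await pool.connect();
  try {
    const { rows } = await client.query(
      'select id, slug from community_posts where id = $1 limit 1',
      [id],
    );
    return { statusCode: 200, body: JSON.stringify(rows[0] ?? null) };
  } finally {
    // Hand the connection back to the pool; do not close it.
    client.release();
  }
};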
The proxy pools connections on the Aurora side. Your Lambdas get cheap connections that are reused across warm invocations, and the database-side connections outlive execution environments coming and going instead of blowing up max_connections. If you’re putting a Lambda anywhere near a relational DB, the proxy is the price of admission. I won’t sign off on a design that skips it.
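The arithmetic behind that rule is worth writing down. With a module-scope pool, each warm execution environment holds at most `max` connections, so the worst case is concurrent environments times pool size. The sketch below is a back-of-the-envelope check under that assumption. `connectionBudget` and every number passed to it are hypothetical, not figures from the incident.

```typescript
interface ConnectionBudget {
  clientConnections: number; // worst-case connections opened by Lambda environments
  available: number;         // connection limit minus what other clients need
  fits: boolean;             // true when the worst case stays within the limit
}

// Worst case: every concurrent execution environment fills its pool.
function connectionBudget(
  concurrentEnvs: number,
  poolMaxPerEnv: number,
  connectionLimit: number,
  reservedForOthers: number = 0,
): ConnectionBudget {
  const clientConnections = concurrentEnvs * poolMaxPerEnv;
  const available = connectionLimit - reservedForOthers;
  return { clientConnections, available, fits: clientConnections <= available };
}

// Illustrative inputs only: 500 environments against a limit of 1000, 100 reserved.
const onePerEnv = connectionBudget(500, 1, 1000, 100);  // fits: 500 <= 900
const fivePerEnv = connectionBudget(500, 5, 1000, 100); // does not fit: 2500 > 900
```

Without a proxy, `clientConnections` lands directly on the writer and is the number to compare against max_connections. With RDS Proxy in between, it is the count on the proxy's client side, which the proxy can multiplex onto a smaller set of database connections as long as sessions are not pinned.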
Different story, different platform. At the combat-sports tournament platform I CTO’d in London, we had a leaderboard surface fed by a consumer group writing into Elasticsearch off Kafka. Someone proposed migrating the indexer to a Lambda triggered by an SQS bridge from MSK. The argument was reasonable on paper. Lambdas scale. Indexers are bursty. Why pay for idle pods?
We didn’t ship it. The thing that killed the design was Kafka consumer-group semantics. A Lambda invocation per message destroys batching. It also made it almost impossible to reason about offset commits when a downstream call to the rules service occasionally took around 70 seconds, far longer than a per-message invocation pattern tolerates. We had a real production incident on that consumer once. Stale rankings during a live federation tournament for eight hours, traced back to the bulk-write client silently entering a circuit-open state. The fix involved coordinated reindexing into a new ES index and aliasing. That kind of operation is a long-running, stateful, batch-aware job. Lambda is the wrong tool for it. We kept it on a container, sized it, monitored freshness, and moved on.
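To make “careful offset handling” concrete: when records in a partition finish out of order, say because one downstream call takes 70 seconds while its neighbors take 70 ms, the only offset that is safe to commit is the end of the contiguous run of finished records. The sketch below shows that rule in isolation. It is an illustration of the general technique, not the indexer’s code, and `nextOffsetToCommit` is a hypothetical name. It follows the Kafka convention that the committed offset is the next offset to read.

```typescript
// Given the currently committed offset (the next offset to read) and the set of
// offsets whose processing has finished, return the new offset that is safe to commit.
// Anything past the first unfinished record must stay uncommitted, or a crash
// would skip that record on restart.
function nextOffsetToCommit(committed: number, completed: ReadonlySet<number>): number {
  let next = committed;
  while (completed.has(next)) {
    next += 1;
  }
  return next;
}

// Offsets 10 and 11 are done, 12 is still waiting on a slow call, 13 is done.
const safe = nextOffsetToCommit(10, new Set([10, 11, 13])); // 12

// Offset 10 itself is still in flight, so nothing can move yet.
const stuck = nextOffsetToCommit(10, new Set([11, 12])); // 10
```

A long-running consumer can keep the `completed` set in memory for each partition and advance the commit as the slow records land. Short-lived invocations have no process to hold that state, which is a large part of why the per-message design was so hard to reason about.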
The rule I’ve come back to: if the work needs batching, long-lived connections, or careful offset handling, run it on a container. If the work is event-shaped, idempotent, bursty, and fits in a few seconds, Lambda is great.
Thanks for reading. If you’ve got thoughts, send them my way.