AWS SQS and SNS Production Patterns

Fan-out via SNS to SQS, FIFO ordering, DLQ design, idempotent consumers, and visibility-timeout tuning, drawn from async pipelines I actually ran in production.

It was a Tuesday afternoon at the creator-economy platform where I spent the last few years. The pending_apple_review SQS queue was sitting at ~270 stuck messages, and our pipeline thought every one of them had been submitted to App Store Connect cleanly. They hadn’t. Apple was silently throttling and returning 200 OKs with empty bodies. Our consumer was happily deleting each message and moving on. By 2 p.m. Pacific the support queue had 80-something tickets in it.

That incident is the reason I will not ship an SQS consumer without an idempotency key and a read-after-write check against the upstream. Ever.

Anyway. SQS and SNS are deceptively simple. You can run a couple of pipelines on them and feel like you’ve got it. Then you grow into fan-out, FIFO, partial batch failures, visibility timeouts, and DLQ replay, and the picture gets a lot less clean. This is the shape I keep landing on after years of running async pipelines on AWS.

Fan-out with SNS to SQS

The default mistake teams make is publishing to a single SQS queue from the producer. Works fine until a second consumer wants the same events. Now the producer has to know about every downstream. Don’t do that.

Publish to an SNS topic. Subscribe one SQS queue per consumer. The producer doesn’t know who’s listening. Add a third consumer next quarter, you just subscribe a new queue. No producer changes, no coordination meeting.

import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

// Assumed minimal event shape; only the fields used below.
interface OrderEvent {
  type: string;     // e.g. 'order.paid'
  tenantId: string;
}

const sns = new SNSClient({ region: process.env.AWS_REGION });

export async function publishOrderEvent(event: OrderEvent) {
  await sns.send(new PublishCommand({
    TopicArn: process.env.ORDER_EVENTS_TOPIC_ARN,
    Message: JSON.stringify(event),
    // Attributes (not the body) are what the subscription filter policies match on.
    MessageAttributes: {
      event_type: { DataType: 'String', StringValue: event.type },
      tenant_id:  { DataType: 'String', StringValue: event.tenantId },
    },
  }));
}

The MessageAttributes matter more than they look. Subscribe each queue with a filter policy so consumers only see what they care about. A receipt-emailer doesn’t need to wake up for every cart-updated event.

resource "aws_sns_topic_subscription" "billing_events_to_invoicer" {
  topic_arn = aws_sns_topic.order_events.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.invoicer.arn

  filter_policy = jsonencode({
    event_type = ["order.paid", "order.refunded"]
  })

  filter_policy_scope  = "MessageAttributes"
  raw_message_delivery = true
}
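
To sanity-check a policy like this before deploying it, a tiny local matcher can help. This is a simplified model, not SNS itself: it assumes exact string matching only (no prefix, numeric, or anything-but operators), and the names FilterPolicy, Attributes, and matchesPolicy are made up for illustration.

```typescript
// Simplified model of SNS attribute filtering: exact string matches only.
type FilterPolicy = Record<string, string[]>;
type Attributes = Record<string, string>;

function matchesPolicy(policy: FilterPolicy, attrs: Attributes): boolean {
  // Every key in the policy must be present on the message (AND across keys),
  // and its value must equal one of the listed strings (OR within a key).
  return Object.entries(policy).every(([key, allowed]) => {
    const value = attrs[key];
    return value !== undefined && allowed.includes(value);
  });
}

// Same policy as the invoicer subscription.
const invoicerPolicy: FilterPolicy = {
  event_type: ['order.paid', 'order.refunded'],
};

// Delivered: event_type is in the allowed list; extra attributes are ignored.
console.log(matchesPolicy(invoicerPolicy, { event_type: 'order.paid', tenant_id: 't1' }));
// Dropped: event_type is not in the allowed list.
console.log(matchesPolicy(invoicerPolicy, { event_type: 'cart.updated', tenant_id: 't1' }));
```

A message missing the event_type attribute entirely is also dropped, which is the usual surprise when a producer forgets to set attributes.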

raw_message_delivery = true is the one I forget the most. Without it, SNS wraps your payload in its own envelope and every consumer has to JSON.parse twice. Turn it on at the subscription level and your queue receives the original body.
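
During a migration you may have some subscriptions with raw delivery on and some without. A minimal sketch of a parser that tolerates both, assuming JSON payloads; SnsEnvelope, isSnsEnvelope, and parseBody are illustrative names, and the envelope type lists only the two fields the check needs.

```typescript
// Subset of the envelope SNS adds when raw message delivery is off.
interface SnsEnvelope {
  Type: string;
  Message: string; // the producer's payload, serialized as a string
}

function isSnsEnvelope(value: unknown): value is SnsEnvelope {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return v.Type === 'Notification' && typeof v.Message === 'string';
}

// Returns the producer's payload whether or not the body is wrapped.
function parseBody<T>(sqsBody: string): T {
  const outer: unknown = JSON.parse(sqsBody);
  // Wrapped: the real payload needs a second JSON.parse.
  // Raw: the first parse already produced it.
  return isSnsEnvelope(outer) ? (JSON.parse(outer.Message) as T) : (outer as T);
}
```

The detection is a heuristic: a payload that itself has Type set to 'Notification' and a string Message field would be misread as an envelope, so once every subscription is raw it is cleaner to drop the check.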

FIFO is for ordering, not throughput

FIFO queues are tempting. “We want ordered processing” sounds reasonable. But by default a FIFO queue caps you at 300 API calls per second per action, which works out to 3,000 messages per second with batches of 10, unless you turn on high-throughput mode. It also costs more per request. I’ve seen teams reach for FIFO when what they actually wanted was ordering within a single entity, not across the whole platform.

The trick is MessageGroupId. It’s the unit of ordering. Same group, ordered. Different groups, parallel. Pick the group to be the entity whose order you actually care about, usually a customer or a tenant or an aggregate root.

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: process.env.AWS_REGION });

export async function enqueueUserActivity(userId: string, activity: Activity) {
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.USER_ACTIVITY_FIFO_URL,
    MessageBody: JSON.stringify(activity),
    MessageGroupId: userId,
    MessageDeduplicationId: `${userId}:${activity.id}`,
  }));
}

That MessageDeduplicationId is your friend. FIFO queues dedupe on it within a 5-minute window automatically. If your producer retries because of a network blip, the second send is accepted but never delivered. Costs nothing, saves you from the producer-retry duplicate bug.
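
The window is finite, which is worth seeing concretely. The toy model below is not SQS code; it only mimics the 5-minute rule described above so the retry behavior can be reasoned about in a test. DedupWindowModel and its accept method are hypothetical names.

```typescript
const DEDUP_WINDOW_MS = 5 * 60 * 1000;

// In-memory stand-in for the FIFO deduplication window.
class DedupWindowModel {
  // dedup ID -> time (ms) the ID was first accepted for delivery
  private readonly seen = new Map<string, number>();

  // Returns true if the message would be delivered, false if it is a duplicate.
  accept(dedupId: string, nowMs: number): boolean {
    const firstSeen = this.seen.get(dedupId);
    if (firstSeen !== undefined && nowMs - firstSeen < DEDUP_WINDOW_MS) {
      return false; // same ID inside the window: swallowed
    }
    this.seen.set(dedupId, nowMs);
    return true;
  }
}

const model = new DedupWindowModel();
console.log(model.accept('user-1:act-9', 0));             // first send
console.log(model.accept('user-1:act-9', 1_000));         // retry one second later
console.log(model.accept('user-1:act-9', 6 * 60 * 1000)); // retry six minutes later
```

The third call is delivered again in this model, which is the reason producer-side deduplication does not replace an idempotent consumer.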

Idempotent consumers, no exceptions

This is the lesson from the Apple review incident I opened with. Server-to-server notifications from Apple and Google retry aggressively. So does SQS. So does any sensible producer. Anything downstream of an SQS message has to assume the same message will arrive twice. Probably more.

The cheapest place to enforce that is the database. A unique constraint on (source, event_id) and an INSERT ... ON CONFLICT DO NOTHING. If the insert took, you process. If it didn’t, you ack the message and move on.

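async function handleRenewalNotification(msg: RenewalMessage) {
  // Claim the event first: the unique constraint on (source, event_id)
  // turns this insert into the idempotency check.
  const inserted = await db.query(
    `INSERT INTO processed_events (source, event_id, received_at)
     VALUES ($1, $2, NOW())
     ON CONFLICT (source, event_id) DO NOTHING
     RETURNING id`,
    ['apple_iap', msg.notificationUuid],
  );

  if (inserted.rowCount === 0) {
    return; // already processed, ack and move on
  }

  // Caveat: run the insert and applyRenewal in one transaction (or delete the
  // row on failure). Otherwise a failed applyRenewal is skipped on redelivery.
  await applyRenewal(msg);
}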
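
One thing the insert-then-process shape has to handle is the work failing after the claim succeeded. A minimal sketch of a claim-and-release wrapper, using an in-memory store so it runs on its own; ClaimStore, InMemoryClaimStore, and processOnce are hypothetical names, and in Postgres the same effect comes from wrapping the insert and the work in a single transaction.

```typescript
interface ClaimStore {
  claim(key: string): Promise<boolean>; // true only for the first caller
  release(key: string): Promise<void>;  // undo a claim after a failure
}

// Stand-in for the processed_events table and its unique constraint.
class InMemoryClaimStore implements ClaimStore {
  private readonly keys = new Set<string>();

  async claim(key: string): Promise<boolean> {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }

  async release(key: string): Promise<void> {
    this.keys.delete(key);
  }
}

async function processOnce(
  store: ClaimStore,
  key: string,
  work: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (!(await store.claim(key))) return 'duplicate';
  try {
    await work();
    return 'processed';
  } catch (err) {
    // Give the claim back so the redelivered message is not treated as done.
    await store.release(key);
    throw err;
  }
}
```

Release-on-failure still leaves a gap if the process dies between the claim and the release, which is why a real transaction is the safer choice when the work touches the same database.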

Short story from a creator platform I worked at. Native billing was wired up for its branded mobile apps. One bad afternoon our endpoint took 31 seconds to return, past Apple’s 30-second window, so Apple retried its SubscriptionRenewal notifications. No idempotency key on the handler. Every retry created a fresh subscription row. A few thousand customers got double-charged across dozens of branded apps before anyone noticed. The visible fix went out within an hour and was useless. Hiding rows in the UI doesn’t refund anyone. The real fix: a Sidekiq job that returned within 5 seconds, a unique constraint on (apple_original_transaction_id, notification_uuid), and refund coordination with Apple’s developer support API that took 4 days. Idempotency keys aren’t a nice-to-have, they’re the contract.

Visibility timeout, the most tuned setting

Visibility timeout is the window where a message is invisible to other consumers after one picks it up. Default is 30 seconds. That default has bitten me more times than I want to admit.

Rule of thumb: set visibility timeout to roughly 6 times your p99 processing latency. p99 of 4s, set to 30s. p99 of 40s, set to 240s. Too short and slow messages get redelivered mid-processing, you double-execute, your DLQ fills with garbage. Too long and a crashed consumer leaves a message stuck for minutes. Better: call ChangeMessageVisibility and heartbeat the timeout forward while you process long jobs.
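
Both halves of that advice fit in a few lines. This is a sketch under assumptions: visibilityTimeoutFor and withHeartbeat are illustrative names, and the extend callback is injected so the block runs without AWS credentials. In a real consumer it would wrap a ChangeMessageVisibilityCommand from @aws-sdk/client-sqs with the message’s QueueUrl and ReceiptHandle.

```typescript
const SQS_MAX_VISIBILITY_SECONDS = 12 * 60 * 60; // SQS caps visibility at 12 hours

// Rule of thumb from above: roughly 6x the p99 processing latency.
function visibilityTimeoutFor(p99Seconds: number, multiplier = 6): number {
  return Math.min(Math.ceil(p99Seconds * multiplier), SQS_MAX_VISIBILITY_SECONDS);
}

// Runs `work` while periodically pushing the visibility timeout forward.
// `extend` is expected to call ChangeMessageVisibility with the given value,
// e.g. (s) => sqs.send(new ChangeMessageVisibilityCommand({
//   QueueUrl, ReceiptHandle, VisibilityTimeout: s }))
async function withHeartbeat<T>(
  extend: (timeoutSeconds: number) => Promise<unknown>,
  timeoutSeconds: number,
  work: () => Promise<T>,
): Promise<T> {
  // Beat at half the timeout so one missed beat still leaves headroom.
  const timer = setInterval(() => {
    extend(timeoutSeconds).catch((err) => console.error('heartbeat failed', err));
  }, (timeoutSeconds * 1000) / 2);
  try {
    return await work();
  } finally {
    clearInterval(timer); // stop extending once the job finishes or throws
  }
}

console.log(visibilityTimeoutFor(40)); // 240
```

Each ChangeMessageVisibility call resets the timeout counting from the moment of the call, so the heartbeat keeps a fixed-size window sliding forward instead of accumulating time.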

DLQ design, replay-able by default

A DLQ is not a graveyard. Treat it like one and you have no DLQ, you have a slow leak. Every DLQ I run in production has:

maxReceiveCount of 5 on the source queue’s redrive policy. Three is too aggressive, ten is too patient.
A separate watcher on DLQ depth, in practice a CloudWatch alarm on the queue’s visible-message count, that pages PagerDuty if it grows.
A replay command. One shell script, one approval, messages flow back to the source.

resource "aws_sqs_queue" "order_events" {
  name                       = "order-events"
  visibility_timeout_seconds = 180
  message_retention_seconds  = 1209600 # 14 days
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_events_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "order_events_dlq" {
  name                      = "order-events-dlq"
  message_retention_seconds = 1209600
}

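The retention is the bit nobody changes. 4 days is the default, 14 days is the max. Always go max on the DLQ. You will not figure out why a message failed in 4 days when the engineer who wrote that consumer is on vacation. One catch on standard queues: a message’s retention clock keeps running from its original enqueue time when it moves to the DLQ, so 14 days is a ceiling on total age, not a fresh 14 days.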
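
The replay command from the list above can be as small as a loop that moves messages back. A minimal sketch, written against a made-up QueueOps interface so it can be exercised with a fake; QueueMessage, QueueOps, and replayDlq are all hypothetical, and a real version would back them with ReceiveMessage, SendMessage, and DeleteMessage calls. SQS also offers a managed DLQ redrive (the StartMessageMoveTask API), which may make a hand-rolled loop unnecessary.

```typescript
interface QueueMessage {
  id: string;
  body: string;
  receiptHandle: string;
}

// Thin abstraction over the three queue calls the replay needs.
interface QueueOps {
  receive(queue: string, max: number): Promise<QueueMessage[]>;
  send(queue: string, body: string): Promise<void>;
  remove(queue: string, receiptHandle: string): Promise<void>;
}

// Moves up to `limit` messages from the DLQ back to the source queue.
// Returns how many were moved.
async function replayDlq(
  ops: QueueOps,
  dlq: string,
  source: string,
  limit: number,
): Promise<number> {
  let moved = 0;
  while (moved < limit) {
    // SQS returns at most 10 messages per receive call.
    const batch = await ops.receive(dlq, Math.min(10, limit - moved));
    if (batch.length === 0) break; // drained, or nothing currently visible
    for (const msg of batch) {
      // Send first, delete second: a crash in between duplicates a message
      // instead of losing it, which idempotent consumers tolerate.
      await ops.send(source, msg.body);
      await ops.remove(dlq, msg.receiptHandle);
      moved += 1;
    }
  }
  return moved;
}
```

The limit parameter is deliberate: replaying a capped number of messages first shows whether the underlying bug is actually fixed before the whole backlog flows back. This sketch copies only the body, so message attributes would need to be carried over separately.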

Takeaways

Fan out via SNS to SQS, never producer-to-queue directly. Subscribe with filter policies.
FIFO is for ordering inside a MessageGroupId, not for replacing standard queues. Throughput caps will bite you.
Idempotency lives in the consumer, enforced by a unique constraint. Treat retries as the default case, not the edge case.
Tune visibility timeout to roughly 6 times p99 processing latency. Extend dynamically for long jobs.
DLQs need retention at the 14-day max, a depth alarm, and a one-command replay path.

Thanks for reading. If you’ve got thoughts, send them my way.
