Flow producers, worker concurrency, retry backoff, Redis Cluster connections, and Bull Board in NestJS. The patterns I keep reaching for when the queue has to survive real on-call.
A Wednesday at the creator economy platform I worked at. The branded-mobile-app pipeline was happily chewing through native iOS submissions, until it wasn’t. Apple’s Connect API had started silently dropping submits on a 200 OK. Our pipeline thought everything was fine. By 2 p.m. Pacific dozens of support tickets were in and a chunk of builds were stuck in “Waiting for Review” with no record on our side. The orchestration was Ruby, but the lesson is the same one I’ve written into every NestJS BullMQ service since. The queue is not the source of truth. Apple is.
That’s the frame. BullMQ is great. It’s also a foot-gun if you treat it like fire-and-forget.
I default to BullMQ in NestJS for anything that needs ordered retries, parent-child workflows, or observable failures. Plain Bull is unmaintained. SQS is fine for fan-out but you lose flow control, priorities, and the cheap local dev that a docker-compose Redis gives you. If you already run Redis in prod, you already paid the operational cost.
@nestjs/bullmq is what I use. Not @nestjs/bull. The two look identical in the docs and they aren’t.
// src/queue/queue.module.ts
import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bullmq';
import { ConfigService } from '@nestjs/config';
@Module({
imports: [
BullModule.forRootAsync({
useFactory: (config: ConfigService) => ({
connection: {
host: config.getOrThrow('REDIS_HOST'),
port: config.getOrThrow<number>('REDIS_PORT'),
// BullMQ requires this. Workers block on BRPOPLPUSH; without it,
// ioredis won't reconnect cleanly on a failover.
maxRetriesPerRequest: null,
enableReadyCheck: false,
},
defaultJobOptions: {
attempts: 5,
backoff: { type: 'exponential', delay: 1_000 },
removeOnComplete: { age: 3_600, count: 10_000 },
removeOnFail: { age: 24 * 3_600 },
},
}),
inject: [ConfigService],
}),
BullModule.registerQueue({ name: 'app-submissions' }),
],
})
export class QueueModule {}
The removeOnComplete line is the one I get asked about. If you don’t bound it, Redis memory grows forever and you’ll find out on a Sunday night.
A real workflow rarely fits in one job. A native app release needs an iOS build, an Android build, store metadata sync, then a single “promote to review” step that only fires when its three children succeed. That shape is what FlowProducer exists for, and it’s the BullMQ feature I leaned on hardest in production.
// src/releases/release.flow.ts
import { Injectable } from '@nestjs/common';
import { FlowProducer } from 'bullmq';
import { InjectFlowProducer } from '@nestjs/bullmq';
@Injectable()
export class ReleaseFlow {
constructor(
@InjectFlowProducer('releases') private readonly flow: FlowProducer,
) {}
async submitRelease(appId: string, version: string) {
return this.flow.add({
name: 'promote-to-review',
queueName: 'releases',
data: { appId, version },
opts: {
jobId: `promote:${appId}:${version}`,
attempts: 3,
backoff: { type: 'exponential', delay: 5_000 },
},
children: [
{
name: 'build-ios',
queueName: 'releases-ios',
data: { appId, version },
opts: { jobId: `ios:${appId}:${version}`, attempts: 4 },
},
{
name: 'build-android',
queueName: 'releases-android',
data: { appId, version },
opts: { jobId: `android:${appId}:${version}`, attempts: 4 },
},
{
name: 'sync-store-metadata',
queueName: 'releases-metadata',
data: { appId, version },
opts: { jobId: `meta:${appId}:${version}`, attempts: 6 },
},
],
});
}
}
Deterministic jobId is what makes this idempotent. If a producer crashes mid-call and the API client retries, BullMQ refuses to enqueue the duplicate. The parent only runs after every child reports success. Failures bubble. You don’t need a saga library for this shape. You probably do for the next shape up.
Default concurrency in BullMQ is 1 per worker process. That number is a trap. Real concurrency is concurrency * pod_count, and the downstream you’re hitting has its own ceiling. I size workers around the slowest downstream call in the handler, not around how fast Node can spin event loops.
// src/releases/ios-build.processor.ts
import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';
import { Logger } from '@nestjs/common';
@Processor('releases-ios', {
concurrency: 4,
limiter: { max: 8, duration: 1_000 },
lockDuration: 90_000,
stalledInterval: 30_000,
maxStalledCount: 1,
})
export class IosBuildProcessor extends WorkerHost {
private readonly logger = new Logger(IosBuildProcessor.name);
async process(job: Job<{ appId: string; version: string }>) {
const { appId, version } = job.data;
await job.updateProgress(10);
const archive = await this.runFastlane(appId, version);
await job.updateProgress(70);
const uploaded = await this.uploadToAppStoreConnect(archive);
await job.updateProgress(95);
// Read-after-write. Don't trust the POST. Apple lies.
const confirmed = await this.verifySubmissionOnAppStore(uploaded.id);
if (!confirmed) {
throw new Error(`submission not confirmed for ${appId}@${version}`);
}
return { submissionId: uploaded.id };
}
private async runFastlane(_appId: string, _version: string) { return { path: '' }; }
private async uploadToAppStoreConnect(_a: { path: string }) { return { id: '' }; }
private async verifySubmissionOnAppStore(_id: string) { return true; }
}
The limiter is doing the real work. The downstream had a published rate ceiling. We respected it at the queue layer so the handler didn’t have to. The lockDuration is set above the p99 of the slowest call, which is what most stalled-job nightmares come down to. If your handler can take 90 seconds, your lock has to last longer than 90 seconds, or BullMQ thinks the worker died and hands the job to a peer. Now you have two workers racing.
Exponential backoff with jitter is the default I want everywhere. The wrapper I use in NestJS lets the handler pick a backoff strategy by name without leaking config into the processor.
// src/queue/backoff.ts
import { WorkerOptions } from 'bullmq';
export const backoffStrategies: WorkerOptions['settings'] = {
backoffStrategy: (attemptsMade, _type, err) => {
if (err && (err as { name?: string }).name === 'NonRetryableError') {
return -1; // bail out, don't retry
}
const base = 1_000 * Math.pow(2, attemptsMade);
const cap = 60_000;
const jitter = Math.floor(Math.random() * 500);
return Math.min(cap, base) + jitter;
},
};
A NonRetryableError is a class your domain code throws when the failure is permanent: validation, 4xx from the upstream, “the receipt is invalid and will always be invalid.” Returning -1 from the strategy is BullMQ’s signal to stop retrying. You will get this wrong in PR review the first three times. The reason exponential without jitter is dangerous is the same reason a reconnect storm at market open is dangerous. Every retry lands at the same millisecond. Add jitter or you’ll thunder.
ioredis underlies BullMQ. In Cluster mode you have to pin BullMQ keys to a single slot, which is what the {} hash-tag syntax in the prefix does.
BullModule.forRootAsync({
useFactory: () => ({
connection: {
// Sentinel
sentinels: [
{ host: 'sentinel-1.internal', port: 26379 },
{ host: 'sentinel-2.internal', port: 26379 },
{ host: 'sentinel-3.internal', port: 26379 },
],
name: 'mymaster',
maxRetriesPerRequest: null,
enableReadyCheck: false,
},
prefix: '{bullmq}', // forces all keys onto the same Cluster slot
}),
});
If you skip the {bullmq} hash tag in Cluster mode, Lua scripts that BullMQ uses internally will fail with CROSSSLOT Keys in request don't hash to the same slot. You won’t see it in dev. You’ll see it on Black Friday.
Don’t ship a queue without a UI for it. I bolt Bull Board onto every Nest service that has more than two queues. Five minutes of setup, hours of saved on-call time the first incident that has a stuck job.
// src/observability/bull-board.module.ts
import { Module } from '@nestjs/common';
import { BullBoardModule } from '@bull-board/nestjs';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { ExpressAdapter } from '@bull-board/express';
import { BullModule } from '@nestjs/bullmq';
@Module({
imports: [
BullModule.registerQueue({ name: 'releases' }),
BullModule.registerQueue({ name: 'releases-ios' }),
BullBoardModule.forRoot({
route: '/internal/queues',
adapter: ExpressAdapter,
}),
BullBoardModule.forFeature(
{ name: 'releases', adapter: BullMQAdapter },
{ name: 'releases-ios', adapter: BullMQAdapter },
),
],
})
export class QueueObservabilityModule {}
Put it behind your internal auth layer. Bull Board can replay failed jobs from the UI and that is exactly the kind of button you don’t want exposed publicly.
Different stack, same lesson. Years before I worked with BullMQ, I was acting CTO at the combat-sports tournament platform I CTO’d in London, running hundreds of microservices on Kafka. On a Saturday afternoon during a live federation broadcast, the standings projector started rebalancing every ~30 seconds. The page froze at 14:32 local. Three pages came in within two minutes.
First instinct was to roll the deployment. We did. It joined cleanly, then triggered another rebalance forty seconds later. We were doing the same dance the group was already doing.
Side by side pod logs showed one of six pods had a max.poll.interval.ms of 60s while the rest had 300s. Someone had pushed a config fix without bumping the image tag, the deploy had pulled :latest, and the odd pod out kept getting kicked from the group. Cordoned the bad pod, drained the storm in about 90 seconds. The lasting fix was a CI check that fails any deploy with :latest on a consumer pod.
The BullMQ shape of this is the lockDuration and stalledInterval story above. Mismatched per-pod settings are how you get a queue that “works” except when it doesn’t.
The other one is the duplicate subscription incident on native billing for branded apps at the same creator platform. Apple’s SubscriptionRenewal server-to-server notification retried because our endpoint returned 200 OK slightly past Apple’s 30 second deadline. We had no idempotency check. Every retry created a new subscription row. A few thousand customers ended up double-billed across dozens of branded apps.
The visible-only fix went out in an hour and didn’t refund anyone. The real fix was BullMQ-shaped even though the original stack was Sidekiq. Return 200 OK within five seconds, enqueue the work with jobId derived from apple_original_transaction_id plus notification_uuid, let the queue dedupe. A database unique constraint on the same tuple as belt and suspenders. Apple’s retries became free.
I keep that pattern in every BullMQ service now. jobId is the idempotency key. If you can derive it from the upstream’s identifiers, do it.
@nestjs/bullmq, not @nestjs/bull. The old one is a dead branch.jobId is the cheapest idempotency you can buy.concurrency and limiter around the slowest downstream, not the CPU.lockDuration above your p99 handler time, or you’ll get duplicate processing.{...} prefix.Thanks for reading. If you’ve got thoughts, send them my way.