Scheduled Tasks in NestJS

How I run @nestjs/schedule across multiple pods without double-firing, using Redis locks or Postgres advisory locks, plus the dynamic scheduling and monitoring patterns I trust in production.

A Tuesday morning at the creator economy platform I worked at. Replica lag on the Aurora cluster started climbing past 60 seconds, Datadog lit up, and the on-call started chasing reader instance sizes. The real cause was a maintenance cron, a partition-stats refresh someone had wired into a NestJS scheduler. It was holding write-side locks on a hot table the whole platform queried constantly, and starving WAL emission on the writer. Not at 03:00 UTC like it should have been. Right at peak Pacific traffic. Because the pod that ran it didn’t know it wasn’t supposed to.

That’s the part of @nestjs/schedule people don’t talk about until it bites them. The library is a thin, lovely wrapper. It does exactly what it says. What it doesn’t do is stop you from running the same @Cron handler on every pod at the same instant. If you’ve got more than one replica, you’ve already got a problem and you might not know it.

I’d rather get this right once than chase the symptoms forever. So here’s how I run scheduled jobs in NestJS now.

What @nestjs/schedule actually gives you

A ScheduleModule.forRoot(), a @Cron('0 */15 * * * *') decorator, an @Interval, a @Timeout, and a SchedulerRegistry you can use to add, remove, and inspect jobs at runtime. That’s roughly it. There’s no leader election, no idempotency, no central log of “did this run actually happen”. The library trusts the deployment to be a single process. Most production deployments aren’t.

If you’re shipping a single instance behind nginx and you’re sure of it, you can stop reading. If you’ve got two or more pods, or a HorizontalPodAutoscaler that wakes up at 09:00 UTC and bumps you from one replica to four, your cron handlers are firing in parallel.

Distributed locks with Redis

The simplest mental model is a global mutex. Before the handler runs, it tries to grab a lock with a TTL. If it can’t, it walks away. If it can, it runs, refreshes the TTL while it works, and releases the lock when it’s done. Redis is good enough for almost every scheduled-task case I’ve shipped.

import { Injectable, Logger } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { InjectRedis } from '@nestjs-modules/ioredis';
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

@Injectable()
export class ReceiptReconciliationJob {
  private readonly logger = new Logger(ReceiptReconciliationJob.name);

  constructor(@InjectRedis() private readonly redis: Redis) {}

  @Cron('0 */5 * * * *', { name: 'receipt-reconciliation' })
  async run(): Promise<void> {
    const key = 'locks:receipt-reconciliation';
    const token = randomUUID();
    const ttlSeconds = 240;

    const acquired = await this.redis.set(key, token, 'EX', ttlSeconds, 'NX');
    if (acquired !== 'OK') {
      this.logger.debug('another pod is running this, skipping');
      return;
    }

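    // Renew well inside the TTL. Errors are caught so a Redis blip cannot surface
    // as an unhandled rejection; if renewals keep failing, the TTL frees the lock.
    const renew = setInterval(() => {
      this.redis
        .eval(
          `if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('expire', KEYS[1], ARGV[2]) else return 0 end`,
          1, key, token, ttlSeconds,
        )
        .catch((err) => this.logger.warn(`lock renewal failed: ${err}`));
    }, 60_000);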

    try {
      await this.reconcile();
    } finally {
      clearInterval(renew);
      await this.redis.eval(
        `if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end`,
        1, key, token,
      );
    }
  }

  private async reconcile(): Promise<void> {
    // pull pending receipts from the last window, hit the upstream,
    // write idempotently keyed on (original_transaction_id, notification_uuid).
  }
}

The two things that catch people: forgetting the TTL (a pod crashes mid-run and the lock lives forever), and releasing the lock without checking the token (you release someone else’s lock and now two pods think they’re alone). The EX on the SET handles the first. The token-checking Lua eval handles the second. Use both. Don’t write your own check-then-delete.
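To make those two failure modes concrete, here is a toy in-memory model of the lock semantics. It is only a sketch: ToyLockStore, acquire, and release are hypothetical names, the clock is passed in by hand so the run is deterministic, and nothing here talks to Redis. It mirrors SET NX EX on acquire and the compare-then-delete script on release.

```typescript
// Toy model of Redis `SET key token NX EX ttl` plus a token-checked release.
// All names are hypothetical; timestamps are plain milliseconds supplied by the caller.
interface LockEntry {
  token: string;
  expiresAt: number;
}

class ToyLockStore {
  private readonly entries = new Map<string, LockEntry>();

  // Mirrors SET NX EX: succeeds only if no live entry exists for the key.
  acquire(key: string, token: string, ttlMs: number, now: number): boolean {
    const current = this.entries.get(key);
    if (current && current.expiresAt > now) return false;
    this.entries.set(key, { token, expiresAt: now + ttlMs });
    return true;
  }

  // Mirrors the Lua script: delete only if the stored token is still ours.
  release(key: string, token: string, now: number): boolean {
    const current = this.entries.get(key);
    if (!current || current.expiresAt <= now || current.token !== token) return false;
    this.entries.delete(key);
    return true;
  }
}

const lockStore = new ToyLockStore();
const toyLockKey = 'locks:receipt-reconciliation';

// Pod A takes the lock at t=0 with a 240s TTL, then stalls without renewing.
const aAcquired = lockStore.acquire(toyLockKey, 'pod-a', 240_000, 0);
// Pod B is refused while the TTL is still live.
const bFirstTry = lockStore.acquire(toyLockKey, 'pod-b', 240_000, 100_000);
// Once the TTL lapses, pod B takes over, so a dead pod cannot wedge the job.
const bSecondTry = lockStore.acquire(toyLockKey, 'pod-b', 240_000, 250_000);
// Pod A wakes up late and tries to release. The token check makes this a no-op.
const aReleased = lockStore.release(toyLockKey, 'pod-a', 260_000);
// Pod B still holds the lock and can release it normally.
const bReleased = lockStore.release(toyLockKey, 'pod-b', 270_000);

console.log({ aAcquired, bFirstTry, bSecondTry, aReleased, bReleased });
```

Without the token comparison in release, pod A’s late release would have deleted pod B’s lock, and a third pod could have started the job alongside B.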

Postgres advisory locks, when you already have Postgres

If you don’t want another moving part, advisory locks are a clean alternative. pg_try_advisory_lock(bigint) either gets you the lock or doesn’t, the lock dies with the session, and you don’t need a separate cache cluster to run a cron.

import { Injectable, Logger } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';
import { DataSource, QueryRunner } from 'typeorm';

@Injectable()
export class PartitionMaintenanceJob {
  private readonly logger = new Logger(PartitionMaintenanceJob.name);
  private readonly lockKey = 9_220_117_001n;

  constructor(private readonly dataSource: DataSource) {}

  @Cron('0 15 4 * * *', { timeZone: 'UTC', name: 'partition-maintenance' })
  async run(): Promise<void> {
    // Advisory locks are session-scoped, so lock and unlock must share one connection.
    const runner = this.dataSource.createQueryRunner();
    await runner.connect();
    let locked = false;
    try {
      const [row] = await runner.query(
        'SELECT pg_try_advisory_lock($1::bigint) AS locked',
        [this.lockKey.toString()],
      );
      locked = row.locked === true;
      if (!locked) {
        this.logger.debug('partition maintenance already running on another pod');
        return;
      }

      await this.rotatePartitions(runner);
    } finally {
      try {
        // Only unlock what we hold. Unlocking a lock we never took just raises a warning.
        if (locked) {
          await runner.query('SELECT pg_advisory_unlock($1::bigint)', [this.lockKey.toString()]);
        }
      } finally {
        // Always hand the connection back, even if the unlock query fails.
        await runner.release();
      }
    }
  }

  private async rotatePartitions(runner: QueryRunner): Promise<void> {
    // strictly read-side or off-hours work only. never run during peak windows.
  }
}
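A hand-picked bigint works, but it is easy for two jobs to end up with the same number. One option, sketched here as an assumption rather than anything the library or Postgres prescribes, is to derive the key from the job name. advisoryLockKey is a hypothetical helper; it hashes the name and reads the first eight bytes as a signed 64-bit integer, which is the range the single-argument pg_try_advisory_lock accepts.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: derive a stable signed 64-bit advisory-lock key from a job name.
function advisoryLockKey(jobName: string): bigint {
  const digest = createHash('sha256').update(jobName).digest();
  // Advisory locks take a signed bigint, so read the first 8 bytes as an int64.
  return digest.readBigInt64BE(0);
}

const partitionKey = advisoryLockKey('partition-maintenance');
const receiptKey = advisoryLockKey('receipt-reconciliation');

// Pass the key to the driver as a string, as in the job above.
console.log(partitionKey.toString(), receiptKey.toString());
```

Collisions are still possible in principle with any 64-bit key, and advisory lock keys share one namespace per database, so anything else in that database using advisory locks is drawing from the same pool.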

One thing the trading platform I architected hammered home: pick an explicit time zone on every @Cron. UTC by default, and never run maintenance crons during a market window or a peak-traffic window. The partition-stats incident I opened with happened because the cron was wired without timeZone and silently ran on the pod’s local clock. Set it explicitly. Always.
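A quick way to see why the zone matters: the same instant lands on a different wall-clock hour depending on where you read it. This is a minimal sketch using only the built-in Intl API. hourInZone is a hypothetical helper and the date is illustrative, not the date of the incident.

```typescript
// Hypothetical helper: the wall-clock hour (0-23) that an instant falls on in an IANA zone.
function hourInZone(instant: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat('en-US', {
    hour: '2-digit',
    hour12: false,
    timeZone,
  }).formatToParts(instant);
  const hour = parts.find((p) => p.type === 'hour');
  if (!hour) throw new Error(`no hour part for ${timeZone}`);
  // Some engines render midnight as "24" with hour12: false, so normalise it.
  return Number(hour.value) % 24;
}

// 03:00 UTC on a Tuesday in January.
const maintenanceInstant = new Date('2024-01-09T03:00:00Z');

console.log(hourInZone(maintenanceInstant, 'UTC'));                 // 3
console.log(hourInZone(maintenanceInstant, 'America/Los_Angeles')); // 19, Monday evening Pacific
```

A cron expression without timeZone is matched against the process’s local wall clock, so the hour field means whatever the pod’s TZ says it means.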

Dynamic scheduling without losing your mind

Static @Cron decorators cover maybe 80% of what you need. The rest is dynamic: a creator on the platform schedules a campaign send for next Tuesday at 09:00 in their own time zone, and your service has to actually fire at that moment. SchedulerRegistry is the answer, but it’s a footgun if you treat it as the system of record.

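import { Injectable, Logger, OnApplicationBootstrap } from '@nestjs/common';
import { SchedulerRegistry } from '@nestjs/schedule';
import { CronJob } from 'cron';
import { CampaignRepository } from './campaign.repository';

@Injectable()
export class CampaignScheduler implements OnApplicationBootstrap {
  private readonly logger = new Logger(CampaignScheduler.name);

  constructor(
    private readonly registry: SchedulerRegistry,
    private readonly campaigns: CampaignRepository,
  ) {}

  // The registry starts empty on every boot, so rebuild it from the database.
  async onApplicationBootstrap(): Promise<void> {
    await this.rehydrateFromDb();
  }

  async schedule(campaignId: string, runAt: Date, tz: string): Promise<void> {
    // Persist first: the database row, not the registry, is the source of truth.
    await this.campaigns.markScheduled(campaignId, runAt, tz);
    const name = `campaign:${campaignId}`;

    if (this.registry.doesExist('cron', name)) {
      this.registry.deleteCronJob(name);
    }

    // Passing a Date instead of a cron expression makes this a one-shot job.
    const job = new CronJob(runAt, async () => {
      try {
        await this.campaigns.fire(campaignId);
      } catch (err) {
        this.logger.error(`campaign ${campaignId} failed to fire: ${err}`);
      } finally {
        // Drop the spent job whether or not fire() succeeded.
        if (this.registry.doesExist('cron', name)) {
          this.registry.deleteCronJob(name);
        }
      }
    }, null, false, tz);

    this.registry.addCronJob(name, job);
    job.start();
  }

  async rehydrateFromDb(): Promise<void> {
    const pending = await this.campaigns.findPendingScheduled();
    for (const c of pending) {
      await this.schedule(c.id, c.runAt, c.tz);
    }
  }
}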

The SchedulerRegistry lives in process memory. If the pod dies, the jobs die with it. So the database is the source of truth, not the registry. On OnApplicationBootstrap, you rehydrate. And the same lock pattern applies inside campaigns.fire(id) so two pods that both rehydrated don’t both send the campaign. I’ve watched a team learn this the hard way. The same customer got the same launch email twice and the support thread was unpleasant.
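One gap worth closing on rehydrate is the campaign whose runAt passed while no pod was alive to fire it. As far as I know, handing a past Date to CronJob either throws or quietly never fires, depending on the cron version, so those rows need their own path. The sketch below is a pure function that splits pending rows into the two groups. PendingCampaign, planRehydration, and graceMs are hypothetical names, and the field names just follow the scheduler above.

```typescript
// Hypothetical shape of a pending row; field names follow the scheduler above.
interface PendingCampaign {
  id: string;
  runAt: Date;
  tz: string;
}

interface RehydrationPlan {
  fireNow: PendingCampaign[];       // run time already passed, or is about to
  scheduleLater: PendingCampaign[]; // safe to register as a one-shot job
}

// Split pending campaigns by whether their run time is still comfortably in the future.
// graceMs treats "due within the next moment" as overdue, so a Date cannot slip into
// the past between this check and the job being registered.
function planRehydration(
  pending: PendingCampaign[],
  now: Date,
  graceMs = 1_000,
): RehydrationPlan {
  const cutoff = now.getTime() + graceMs;
  const plan: RehydrationPlan = { fireNow: [], scheduleLater: [] };
  for (const c of pending) {
    (c.runAt.getTime() <= cutoff ? plan.fireNow : plan.scheduleLater).push(c);
  }
  return plan;
}

// Illustrative data: the pod boots at 09:00 UTC and finds one missed send.
const bootTime = new Date('2024-03-05T09:00:00Z');
const rehydrationPlan = planRehydration(
  [
    { id: 'missed', runAt: new Date('2024-03-05T08:30:00Z'), tz: 'Europe/London' },
    { id: 'upcoming', runAt: new Date('2024-03-12T09:00:00Z'), tz: 'America/Los_Angeles' },
  ],
  bootTime,
);

console.log(
  rehydrationPlan.fireNow.map((c) => c.id),
  rehydrationPlan.scheduleLater.map((c) => c.id),
);
```

Whether an overdue campaign should fire immediately or be flagged for a human is a product decision. Either way, the fireNow path still has to go through the same lock, because every pod that boots will compute the same list.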

A war story about silent crons

The federation platform I CTO’d in London ran on hundreds of microservices with Kafka as the async backbone. A standings-projector consumer kept rebalancing every 30 seconds during a live broadcast on a Saturday afternoon. The page froze at 14:32 local. I was pinged within two minutes.

First instinct, kubectl rollout restart deployment/standings-projector. Same dance the group was already doing. We just made it faster.

Real cause: one of six pods had a different max.poll.interval.ms. Five of them at 300s, the sixth at 60s. The sixth was running a stale image because the deploy had pulled :latest instead of a pinned SHA. A slow call to a downstream rules service sometimes took 70 seconds, past that pod’s max.poll.interval.ms, so the broker kept evicting it from the group and triggering a rebalance for everyone.

I tell that story because crons fail the exact same way. One pod runs a slightly different version of the handler, with a slightly different schedule, and you only learn about it from a metric you didn’t think to add. Pin your images. Pin your schedules. And monitor your jobs as if they were endpoints.

Monitoring jobs that should have run

The thing I always add now is a heartbeat. Every run, successful or not, writes a row to a cron_runs table with (job_name, started_at, finished_at, pod_name, status). A separate small NestJS controller exposes /health/scheduled, which fails if any job hasn’t completed within its expected window plus a buffer. The service behind it reads cron_run_summary, a per-job rollup of cron_runs. Datadog hits the endpoint. The endpoint lies to no one.
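The write side can be a thin wrapper around each handler. This is a minimal sketch with an in-memory sink standing in for the INSERT. withHeartbeat, CronRunSink, and the injected clock are hypothetical names, and recording failures as well as successes is a choice of the sketch so that the status column carries information.

```typescript
type RunStatus = 'success' | 'failure';

interface CronRun {
  jobName: string;
  startedAt: Date;
  finishedAt: Date;
  podName: string;
  status: RunStatus;
}

// Hypothetical sink. In a real service this would INSERT into cron_runs.
interface CronRunSink {
  record(run: CronRun): Promise<void>;
}

// Run `work`, then record one row whether it succeeded or threw.
// The original error is rethrown so the caller still sees the failure.
async function withHeartbeat(
  jobName: string,
  podName: string,
  sink: CronRunSink,
  work: () => Promise<void>,
  clock: () => Date = () => new Date(),
): Promise<void> {
  const startedAt = clock();
  let status: RunStatus = 'success';
  try {
    await work();
  } catch (err) {
    status = 'failure';
    throw err;
  } finally {
    await sink.record({ jobName, startedAt, finishedAt: clock(), podName, status });
  }
}

// In-memory sink for the demonstration.
const recordedRuns: CronRun[] = [];
const memorySink: CronRunSink = {
  record: async (run) => {
    recordedRuns.push(run);
  },
};

async function heartbeatDemo(): Promise<void> {
  await withHeartbeat('receipt-reconciliation', 'pod-a', memorySink, async () => {});
  await withHeartbeat('receipt-reconciliation', 'pod-a', memorySink, async () => {
    throw new Error('upstream timeout');
  }).catch(() => undefined); // swallowed here only so the demo keeps going
  console.log(recordedRuns.map((r) => r.status));
}

const heartbeatDemoFinished: Promise<void> = heartbeatDemo();
```

Inside a @Cron handler, the call would sit after the lock is acquired, so a pod that skipped the run does not write a row.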

import { Injectable } from '@nestjs/common';
import { DataSource } from 'typeorm';

@Injectable()
export class CronHealthService {
  constructor(private readonly dataSource: DataSource) {}

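  async overdueJobs(now: Date = new Date()): Promise<Array<{ name: string; lateBy: number }>> {
    // cron_run_summary is assumed to be a per-job rollup (view or table) over cron_runs.
    // late_by is seconds past the allowed interval, not seconds since the last run.
    // Jobs whose last_finished_at is NULL (never completed) are not returned by this query.
    const rows = await this.dataSource.query(`
      SELECT name,
             EXTRACT(EPOCH FROM ($1::timestamptz - last_finished_at)) - max_interval_seconds AS late_by
      FROM cron_run_summary
      WHERE last_finished_at + (max_interval_seconds || ' seconds')::interval < $1
    `, [now]);
    return rows.map((r: any) => ({ name: r.name, lateBy: Number(r.late_by) }));
  }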
}
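The overdue rule is easier to unit-test as a pure function than as SQL. The sketch below restates it in TypeScript under a few assumptions of its own: JobSummary and findOverdue are hypothetical names, lateBy means seconds beyond the allowed interval, and a job that has never finished is reported as overdue with lateBy set to null.

```typescript
// Hypothetical in-memory shape of one cron_run_summary row.
interface JobSummary {
  name: string;
  lastFinishedAt: Date | null; // null means the job has never completed
  maxIntervalSeconds: number;
}

interface OverdueJob {
  name: string;
  lateBy: number | null; // seconds past the allowed interval, null if never run
}

function findOverdue(summaries: JobSummary[], now: Date): OverdueJob[] {
  const overdue: OverdueJob[] = [];
  for (const s of summaries) {
    if (s.lastFinishedAt === null) {
      // A job that never ran is the most overdue job there is.
      overdue.push({ name: s.name, lateBy: null });
      continue;
    }
    const elapsedSeconds = (now.getTime() - s.lastFinishedAt.getTime()) / 1000;
    const lateBy = elapsedSeconds - s.maxIntervalSeconds;
    if (lateBy > 0) overdue.push({ name: s.name, lateBy });
  }
  return overdue;
}

// Illustrative data, checked at 12:00 UTC.
const checkTime = new Date('2024-03-05T12:00:00Z');
const overdueNow = findOverdue(
  [
    // Finished 2 minutes ago with a 6-minute allowance: healthy.
    { name: 'receipt-reconciliation', lastFinishedAt: new Date('2024-03-05T11:58:00Z'), maxIntervalSeconds: 360 },
    // Finished 26 hours ago with a 25-hour allowance: one hour late.
    { name: 'partition-maintenance', lastFinishedAt: new Date('2024-03-04T10:00:00Z'), maxIntervalSeconds: 90_000 },
    // Never finished at all.
    { name: 'campaign-sweeper', lastFinishedAt: null, maxIntervalSeconds: 600 },
  ],
  checkTime,
);

console.log(overdueNow);
```

The never-run case is the one that matters most after a bad deploy, because a handler that fails to register produces no rows at all.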

The reason I prefer this over a Prometheus exporter or a Sentry cron monitor is mostly that I want one source of truth I can query from a runbook at 02:00. “Show me the last 10 runs of receipt-reconciliation” should be a SQL query, not a tab in a vendor dashboard. Use the vendor dashboard too. But own the data.

Takeaways

Treat @nestjs/schedule as the trigger, not the system of record. Locks, idempotency, and a DB row are what stop double-fires.
Redis SET NX EX with a token is the cleanest distributed lock. Renew the TTL while you work, release with a Lua check.
Postgres advisory locks are great when you already have Postgres and don’t want another dependency.
Pin timeZone on every @Cron. Never assume the pod’s local clock.
Rehydrate dynamic jobs from the database on boot. The registry is in-memory and comes back empty after a pod restart.
Heartbeat every job into a table you can query from a runbook. Vendor dashboards are a bonus, not the contract.

Thanks for reading. If you’ve got thoughts, send them my way.