Workflow Engines With Temporal and Step Functions

An opinionated take on Temporal vs AWS Step Functions for sagas, human approval steps, and observability, through a backend architect's lens.

The first time I actually needed a workflow engine was a Wednesday afternoon at the creator economy platform I worked at. Our branded mobile app pipeline had a backlog of around 270 stuck submissions on Apple’s side. Rails, Python, Fastlane, GitHub Actions, glued together with Sidekiq and a homegrown state machine in PostgreSQL. The state machine was 600 lines of Ruby I’d written and was no longer proud of. Every retry, every “wait for Apple”, every “human in the loop” step was a special case in a case/when block. I remember staring at it and thinking, OK, this is the moment we admit we needed Temporal or Step Functions six months ago.

Where I land on this

Default to Step Functions when you’re already on AWS and the workflow is mostly orchestration of AWS-native things. Reach for Temporal when the logic is rich enough that you’d rather write it as code, when you need long human approval steps, or when you want to test the whole thing locally without mocking a cloud. Don’t roll your own state machine in a database. You’ll regret it.

What an engine actually buys you

You already know what they do. Durable execution, retries, timers, the ability to pause for days waiting on a human. The thing nobody puts on the marketing page: they buy you the ability to crash and resume without rewriting the world. Your saga, your approval flow, your three-step booking process survives a pod restart because the engine holds the state, not your process.
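
To make "the engine holds the state" concrete, here is a minimal sketch of the idea. It assumes an in-memory Map standing in for the engine's durable store, and the booking steps (holdSeat, takePayment, confirm) are made-up names. Real engines persist far more than this, so read it as an illustration rather than how either product works internally.

```typescript
// Illustrative only: an in-memory journal standing in for an engine's durable store.
type Journal = Map<string, unknown>;

// Runs a step once. On a later run, returns the recorded result instead of re-executing.
function runStep<T>(journal: Journal, name: string, fn: () => T): T {
  if (journal.has(name)) {
    return journal.get(name) as T;
  }
  const result = fn();
  journal.set(name, result);
  return result;
}

// Hypothetical three-step booking flow. `crashAfterHold` simulates a pod dying mid-run.
function bookingFlow(journal: Journal, sideEffects: string[], crashAfterHold: boolean): string {
  const holdId = runStep(journal, 'holdSeat', () => {
    sideEffects.push('holdSeat');
    return 'hold-1';
  });
  if (crashAfterHold) {
    throw new Error('simulated crash');
  }
  const paymentId = runStep(journal, 'takePayment', () => {
    sideEffects.push('takePayment');
    return 'pay-1';
  });
  return runStep(journal, 'confirm', () => {
    sideEffects.push('confirm');
    return `${holdId}/${paymentId}`;
  });
}

const bookingJournal: Journal = new Map();
const bookingSideEffects: string[] = [];

try {
  bookingFlow(bookingJournal, bookingSideEffects, true); // first run dies after holding the seat
} catch {
  // The process is gone, but the journal survives.
}
const bookingOutcome = bookingFlow(bookingJournal, bookingSideEffects, false); // resume from the journal
console.log(bookingOutcome, bookingSideEffects);
```

The second run skips holdSeat because its result is already journaled, so the seat is held exactly once even though the flow ran twice.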

Step Functions for AWS-native sagas

Step Functions is what I reach for when the workflow is mostly “do thing A in Lambda, do thing B in SQS, wait for a callback, branch on the result”. The state language is verbose, but readable once you let CDK generate it instead of writing it by hand.

import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';

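// Named props instead of a positional array, so call sites can't swap functions by accident.
export interface OrderSagaProps extends StackProps {
  reserveInventory: lambda.IFunction;
  chargeCard: lambda.IFunction;
  allocateShipping: lambda.IFunction;
  refund: lambda.IFunction;
}

export class OrderSagaStack extends Stack {
  constructor(scope: Construct, id: string, props: OrderSagaProps) {
    super(scope, id, props);
    const { reserveInventory, chargeCard, allocateShipping, refund } = props;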

    const reserve = new tasks.LambdaInvoke(this, 'ReserveInventory', {
      lambdaFunction: reserveInventory,
      resultPath: '$.reservation',
      retryOnServiceExceptions: true,
    });

    const charge = new tasks.LambdaInvoke(this, 'ChargeCard', {
      lambdaFunction: chargeCard,
      resultPath: '$.charge',
    }).addRetry({
      errors: ['StripeTransientError'],
      maxAttempts: 5,
      interval: Duration.seconds(2),
      backoffRate: 2,
    });

    const compensate = new tasks.LambdaInvoke(this, 'RefundAndReleaseInventory', {
      lambdaFunction: refund,
      resultPath: '$.refund',
    });

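    // A failed charge (after retries) routes to the compensating step.
    charge.addCatch(compensate, { errors: ['States.ALL'], resultPath: '$.error' });

    const ship = new tasks.LambdaInvoke(this, 'AllocateShipping', {
      lambdaFunction: allocateShipping,
      resultPath: '$.shipping',
    });
    // A shipping failure happens after the charge, so it needs compensation too.
    ship.addCatch(compensate, { errors: ['States.ALL'], resultPath: '$.error' });

    const flow = reserve.next(charge).next(ship);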

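    new sfn.StateMachine(this, 'OrderSaga', {
      // `definition` is deprecated in current CDK v2 releases; `definitionBody` replaces it.
      definitionBody: sfn.DefinitionBody.fromChainable(flow),
      // Caps the whole execution, including any time spent waiting.
      timeout: Duration.hours(2),
    });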
  }
}

What I like: addCatch to a compensating step is the saga pattern in one line. Retry config is declarative. The execution history in the console is the best free distributed trace I’ve ever gotten. What I don’t love: try expressing a “wait 72 hours for a human to click approve” step purely in ASL. You can do it with waitForTaskToken, but the moment your approval logic has any nesting, the JSON becomes the kind of thing nobody wants to review in a PR.
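
For reference, this is roughly what a single approval step looks like once it is written out as ASL. It's a sketch built as a plain TypeScript object so the shape is easy to see. The function name, the $.orderId input field, and the Next targets are assumptions chosen to match the stack above, not something you can paste in as-is.

```typescript
// Shape of the ASL fields used here, not the full Task state schema.
interface ApprovalTaskState {
  Type: 'Task';
  Resource: string;
  Parameters: {
    FunctionName: string;
    Payload: Record<string, string>;
  };
  TimeoutSeconds: number;
  Catch: { ErrorEquals: string[]; ResultPath: string; Next: string }[];
  Next: string;
}

// Builds one "notify a reviewer, then wait for the callback" state.
function approvalState(notifyFunctionName: string, waitHours: number): ApprovalTaskState {
  return {
    Type: 'Task',
    // The .waitForTaskToken suffix pauses the execution until SendTaskSuccess
    // or SendTaskFailure is called with the token.
    Resource: 'arn:aws:states:::lambda:invoke.waitForTaskToken',
    Parameters: {
      FunctionName: notifyFunctionName, // hypothetical Lambda that messages the reviewer
      Payload: {
        'orderId.$': '$.orderId', // assumed input field
        'taskToken.$': '$$.Task.Token', // the token is read from the context object
      },
    },
    TimeoutSeconds: waitHours * 60 * 60,
    // No answer in time surfaces as States.Timeout, which routes to compensation.
    Catch: [{ ErrorEquals: ['States.Timeout'], ResultPath: '$.error', Next: 'RefundAndReleaseInventory' }],
    Next: 'ChargeCard',
  };
}

const approval = approvalState('notify-reviewer', 72);
console.log(JSON.stringify(approval, null, 2));
```

That is one flat approval. Note that the state machine's own timeout has to be longer than the wait, so the two-hour cap in the stack above would need raising before a 72-hour step could ever complete.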

Temporal when logic is rich

Temporal flips the model. Your workflow is real code. The engine carries your function across pod restarts by replaying its event history. You write sleep('72 hours') and the engine handles persistence.

import { proxyActivities, defineSignal, setHandler, condition, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { reserveInventory, chargeCard, allocateShipping, refund, notifyReviewer } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '2 minutes',
    retry: { maximumAttempts: 5, initialInterval: '2s', backoffCoefficient: 2 },
  });

export const approveSignal = defineSignal<[{ approved: boolean; reviewerId: string }]>('approve');

export async function highValueOrderSaga(input: { orderId: string; cents: number }) {
  const reservation = await reserveInventory(input.orderId);

  // The signal handler records the reviewer's decision whenever it arrives.
  let decision: { approved: boolean; reviewerId: string } | undefined;
  setHandler(approveSignal, (d) => { decision = d; });

  // Orders above 500.00 (50,000 cents) need a human decision before we charge.
  if (input.cents > 500_00) {
    await notifyReviewer(input.orderId);
    // condition() resolves to false if the 72 hours elapse with no decision.
    const got = await condition(() => decision !== undefined, '72 hours');
    if (!got || decision?.approved === false) {
      // Timeout or rejection: release the reservation via the compensating activity.
      await refund({ reservationId: reservation.id, reason: 'reviewer_rejected' });
      return { status: 'rejected' as const };
    }
  }

  try {
    const charge = await chargeCard(input.orderId);
    await allocateShipping({ orderId: input.orderId, chargeId: charge.id });
    return { status: 'fulfilled' as const };
  } catch (err) {
    // Runs once activity retries are exhausted, for either the charge or the shipping step.
    await refund({ reservationId: reservation.id, reason: 'charge_failed' });
    throw err;
  }
}

The 72-hour wait is a single line. A human clicks a button in admin, your API sends a signal, the workflow wakes up where it left off. No callback URLs to manage, no DynamoDB row to poll, no token to lose.
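
One thing to watch when porting retry settings between the two samples: the "5" does not mean the same thing. Step Functions' MaxAttempts counts retries, while Temporal's maximumAttempts counts total attempts including the first. The sketch below works out the nominal waits for the settings used above (2 second initial interval, backoff of 2). It ignores jitter and maximum-interval caps, and the helper name is mine.

```typescript
// Nominal delay before each retry under exponential backoff: initial * rate^(n-1).
function retryDelaysSeconds(retries: number, initialSeconds: number, backoff: number): number[] {
  return Array.from({ length: retries }, (_, i) => initialSeconds * backoff ** i);
}

// Step Functions: MaxAttempts counts retries, so 5 means up to 6 invocations.
const stepFunctionsDelays = retryDelaysSeconds(5, 2, 2);

// Temporal: maximumAttempts counts total attempts, so 5 means up to 4 retries.
const temporalDelays = retryDelaysSeconds(5 - 1, 2, 2);

console.log(stepFunctionsDelays); // [2, 4, 8, 16, 32]
console.log(temporalDelays); // [2, 4, 8, 16]
```

So the Step Functions charge step can spend about a minute retrying, and the Temporal one about half that, even though both configs say 5.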

The part that sold me though was unit testing. You can run the full workflow against a time-skipping test env, fast-forward the 72-hour wait, and assert on the resulting state. With Step Functions I end up mocking the AWS SDK and pretending.
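
If time skipping sounds like magic, the core idea fits in a few lines. This is a toy virtual clock, not Temporal's test environment or its API. VirtualClock and startApprovalWait are names I made up to show why a 72-hour timeout can be asserted on in milliseconds once the clock is something the test controls.

```typescript
// Illustration of the idea behind time skipping, not Temporal's implementation.
class VirtualClock {
  private nowMs = 0;
  private timers: { dueMs: number; fire: () => void }[] = [];

  // Register a callback to run once `delayMs` of virtual time has passed.
  schedule(delayMs: number, fire: () => void): void {
    this.timers.push({ dueMs: this.nowMs + delayMs, fire });
  }

  // Jump forward and fire every timer that is now due, earliest first.
  advance(byMs: number): void {
    this.nowMs += byMs;
    const due = this.timers
      .filter((t) => t.dueMs <= this.nowMs)
      .sort((a, b) => a.dueMs - b.dueMs);
    this.timers = this.timers.filter((t) => t.dueMs > this.nowMs);
    for (const t of due) t.fire();
  }
}

const HOUR_MS = 60 * 60 * 1000;

// Hypothetical approval wait: ends as 'timed_out' unless approve() lands first.
function startApprovalWait(clock: VirtualClock, timeoutMs: number) {
  let status: 'waiting' | 'approved' | 'timed_out' = 'waiting';
  clock.schedule(timeoutMs, () => {
    if (status === 'waiting') status = 'timed_out';
  });
  return {
    approve: () => {
      if (status === 'waiting') status = 'approved';
    },
    status: () => status,
  };
}

const demoClock = new VirtualClock();
const demoWait = startApprovalWait(demoClock, 72 * HOUR_MS);
demoClock.advance(71 * HOUR_MS);
console.log(demoWait.status()); // still waiting
demoClock.advance(1 * HOUR_MS);
console.log(demoWait.status()); // timed_out
```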

A saga story I keep telling

Different shop, same lesson. At the live-video creator startup I led at, we had a checkout flow. Charge the card, create a Stripe customer, provision the storefront, send a welcome email, kick off a Cloudflare Workers cache warm. Five steps, all with failure modes. I shipped the first version as a chain of Sidekiq jobs with hand-rolled compensation. It mostly worked. Until the Stripe step succeeded and the provisioning step’s pod got OOM-killed mid-run. Customer billed, no storefront, no email, no record of the half-finished work anywhere reasonable.

First wrong fix: a “reconciliation cron” sweeping half-finished checkouts every five minutes. Felt clever. Then a creator hit it during the four-minute window before the cron ran, opened a support ticket, and that was the day I realized reconciliation crons are the past trying to fix the present.

Real fix: lifted the flow into an explicit saga with compensating activities and durable state in PostgreSQL using an outbox table. We didn’t take a full Temporal dependency yet, just modeled the orchestration that way. The next OOM kill was a non-event. If we’d been on a real workflow engine from day one, this whole arc would have been an afternoon.
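
The orchestration shape, stripped to its core, looks something like the sketch below. It is not the code we shipped. Steps are synchronous and nothing is persisted here, so treat it purely as the control flow: run steps in order, and on failure undo the completed ones newest-first. The step names are illustrative, and a real version would journal each transition durably and handle a compensation that itself fails.

```typescript
interface SagaStep {
  name: string;
  run: () => void; // forward action, throws on failure
  compensate: () => void; // undoes a completed `run`
}

// Runs steps in order. If one throws, compensates completed steps newest-first.
function runSaga(steps: SagaStep[]): { ok: boolean; failedAt?: string } {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        done.compensate();
      }
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}

// Hypothetical checkout where provisioning fails after the charge succeeded.
const checkoutEvents: string[] = [];
const checkoutResult = runSaga([
  {
    name: 'chargeCard',
    run: () => { checkoutEvents.push('charged'); },
    compensate: () => { checkoutEvents.push('refunded'); },
  },
  {
    name: 'createCustomer',
    run: () => { checkoutEvents.push('customer_created'); },
    compensate: () => { checkoutEvents.push('customer_deleted'); },
  },
  {
    name: 'provisionStorefront',
    run: () => { throw new Error('simulated OOM kill'); },
    compensate: () => { checkoutEvents.push('storefront_removed'); },
  },
]);

console.log(checkoutResult, checkoutEvents);
```

The failed step is never compensated because it never completed. Only the two steps before it are unwound, in reverse order.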

Human approval is the unlock

Back to the mobile pipeline. About 270 builds stuck on Apple’s side, our internal state thinking they were submitted, a Sidekiq retry making things worse. We tried bumping the retry count first. Apple started seeing duplicate submissions. Around 30 customers ended up with two competing review records and conflicting metadata. Cleanup took a week.

Real fix: an idempotency key derived from app_id + version + git_sha, a read-after-write check against App Store Connect, and replacing the Sidekiq retry chain with a proper orchestration that knew how to wait for a human-moderated upstream. Sidekiq’s retry treats a response as truth. A workflow engine treats the upstream’s source of truth as truth, and waits.
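
Here is a rough sketch of the key plus read-back shape. The Upstream interface is hypothetical, since a real one would wrap App Store Connect calls whose details I am not reproducing here, and FlakyUpstream is an in-memory fake that simulates a write landing while the response gets lost.

```typescript
interface Build {
  appId: string;
  version: string;
  gitSha: string;
}

// The same build always yields the same key, so a retry can't look like a new submission.
function idempotencyKey(b: Build): string {
  return `${b.appId}:${b.version}:${b.gitSha}`;
}

// Hypothetical view of the upstream, for illustration only.
interface Upstream {
  findSubmission(key: string): string | undefined; // submission id, if one exists
  submit(key: string, build: Build): string;
}

// Ask the upstream what it already has before submitting, and again after a failed response.
function submitOnce(upstream: Upstream, build: Build): { submissionId: string; created: boolean } {
  const key = idempotencyKey(build);
  const existing = upstream.findSubmission(key);
  if (existing !== undefined) {
    return { submissionId: existing, created: false };
  }
  try {
    return { submissionId: upstream.submit(key, build), created: true };
  } catch (err) {
    // The response failed, but the write may have landed. Read it back before retrying.
    const landed = upstream.findSubmission(key);
    if (landed !== undefined) {
      return { submissionId: landed, created: true };
    }
    throw err;
  }
}

// In-memory fake: stores the submission, then fails the response.
class FlakyUpstream implements Upstream {
  readonly stored = new Map<string, string>();
  findSubmission(key: string): string | undefined {
    return this.stored.get(key);
  }
  submit(key: string, _build: Build): string {
    this.stored.set(key, `sub-${this.stored.size + 1}`);
    throw new Error('gateway timeout'); // the write landed, the response did not
  }
}

const demoUpstream = new FlakyUpstream();
const demoBuild: Build = { appId: 'app-42', version: '1.4.0', gitSha: 'abc1234' };
const firstTry = submitOnce(demoUpstream, demoBuild);
const secondTry = submitOnce(demoUpstream, demoBuild);
console.log(firstTry, secondTry, demoUpstream.stored.size);
```

A retry-on-response approach would have submitted twice here. Reading back means the timeout is treated as "unknown", and the upstream gets to settle it.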

For that shape of work I’d reach for Temporal today. The signal plus condition-with-a-72-hour-timeout pattern fits Apple review flows in a way Step Functions’ waitForTaskToken never quite did.

Observability, where it gets real

Step Functions hands you a free execution graph in the console. That’s the killer feature if you want operations to stay cheap: open the URL and see every step. Trade-off: integrating with Datadog or your own tracing means plumbing the X-Ray trace ID through every Lambda payload and back. Doable, not free.

Temporal needs you to think a little harder. The Web UI shows event history, great for engineers, useless for non-engineering stakeholders. You’ll want structured logs from activities with the workflow ID as a correlation field.

import { Context } from '@temporalio/activity';
import { logger } from './logger';
import { stripe } from './stripe'; // assumed local module exporting a configured Stripe client

export async function chargeCard(orderId: string) {
  const ctx = Context.current();
  const log = logger.child({
    workflowId: ctx.info.workflowExecution.workflowId,
    runId: ctx.info.workflowExecution.runId,
    activity: 'chargeCard',
    orderId,
  });
  log.info('charging');
  try {
    const charge = await stripe.charges.create({ /* ... */ });
    log.info({ chargeId: charge.id }, 'charged');
    return charge;
  } catch (err) {
    log.warn({ err }, 'charge failed, will retry per workflow policy');
    throw err;
  }
}

That child logger is the most useful pattern I’ve adopted on Temporal. Every log line has the workflow ID, so when you find one bad event in Datadog you can pull the full workflow trace with one query.

How I’d pick today

Already deep in AWS, workflow is mostly Lambda and SQS plumbing, low logic density: Step Functions. Long human approval, rich branching, want unit tests, polyglot workers: Temporal. Neither, if the work is three steps and you can express it cleanly with an outbox table and idempotent handlers. Workflow engines aren’t free: they’re a runtime, an SDK, a UI, and an on-call rotation. Worth it when you need them. Wasted when you don’t.

Takeaways

Don’t build your own state machine in a database. You’ll write 600 lines of Ruby and hate yourself.
Step Functions wins on AWS-native orchestration and the free execution graph in the console.
Temporal wins when your workflow is real code, especially with long human-approval waits.
Saga compensation is a few lines in both: an addCatch in Step Functions, a try/catch in Temporal. The mistake is rolling it yourself in Sidekiq.
Treat the upstream’s source of truth as truth. Read-after-write beats retry-on-response.
Ship the workflow ID into every structured log. It is the cheapest correlation field you’ll add.

Thanks for reading. If you’ve got thoughts, send them my way.