Microservice CI/CD Pipelines

An opinionated take on independently deployable microservices: mono- vs multi-pipeline, affected-only builds, Pact contract verification, and per-service canary releases.

At the combat-sports tournament platform I CTO’d in London, we had hundreds of microservices and one CI pipeline that built all of them on every push. A typical green build was 42 minutes. On a Tuesday before a federation broadcast we shipped a one-line fix to a rankings service. CI was busy rebuilding everything else. By the time the pipeline finished, the broadcast had started without the fix. That was the day I stopped pretending we had independent deployability.

Independent deployability is the whole point of microservices. If the pipeline forces every service to ship together, you have a distributed monolith with extra YAML. The pipeline is not a side concern. It’s the thing.

My one-paragraph opinion

One repo, one pipeline graph, affected-only builds. Per-service deploy units. Contract verification on every PR via Pact. Canary or blue-green per service with feature flags wrapping anything actually risky. Multi-repo with one pipeline per repo is fine if you already live there, but I’d never start there today.

Mono pipeline vs multi pipeline

The argument I keep hearing is “one repo means coupling, many repos means independence.” That’s backwards. Coupling lives in your runtime contracts, not your folder layout.

In a monorepo with an affected-only pipeline (nx affected, turbo run --filter, Bazel’s query rdeps), pushing a one-line change to a single service builds and ships that service. Nothing else. Everything else short-circuits because nothing it depends on changed.

In multi-repo land, you pay for the same outcome with extra ceremony: a shared CI template you keep in sync across dozens of repos, version pinning between internal libraries, the eternal “who updates the shared eslint config first” debate. I’ve lived both. Monorepo with affected detection is cheaper.

Here’s the affected-only entrypoint we use in the GitHub Actions setup for a community-and-talent product I CTO on the side. NestJS services, pnpm workspaces, Turborepo:

name: ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  affected:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.detect.outputs.services }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: pnpm }
      - run: pnpm install --frozen-lockfile
      - id: detect
        run: |
          BASE="${{ github.event.pull_request.base.sha || github.event.before }}"
          AFFECTED=$(pnpm turbo run build --filter="...[${BASE}]" --dry=json \
            | jq -c '[.tasks[].package] | unique')
          echo "services=${AFFECTED}" >> "$GITHUB_OUTPUT"

  build:
    needs: affected
    if: ${{ needs.affected.outputs.services != '[]' }}
    strategy:
      fail-fast: false
      matrix:
        service: ${{ fromJSON(needs.affected.outputs.services) }}
    runs-on: ubuntu-latest
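    steps:
      - uses: actions/checkout@v4
      # Each job starts on a fresh runner, so pnpm and Node have to be set up again here.
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: pnpm }
      - run: pnpm install --frozen-lockfile
      - run: pnpm turbo run lint test build --filter=${{ matrix.service }}
      # Assumes every affected package is an unscoped app that lives under apps/<name>.
      - run: docker buildx build -t ${{ matrix.service }}:${{ github.sha }} apps/${{ matrix.service }}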

The --dry=json step is the load-bearing piece. Turbo diffs against the base commit, walks the dependency graph, and tells you exactly which packages need work. A one-file change in a leaf service means one matrix job.
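The graph walk is less magic than it sounds. Here is a minimal sketch of the idea, not Turborepo’s implementation: the DepGraph shape, the affectedPackages function, and the example workspace are all made up for illustration. It inverts the dependency graph and walks outward from whatever changed.

```typescript
// Package name -> names of the workspace packages it depends on.
type DepGraph = Record<string, string[]>;

// Returns the changed packages plus everything that transitively depends on them.
function affectedPackages(graph: DepGraph, changed: string[]): string[] {
  // Invert the graph so we can walk from a package to its dependents.
  const dependents = new Map<string, string[]>();
  for (const pkg of Object.keys(graph)) {
    for (const dep of graph[pkg]) {
      const list = dependents.get(dep) ?? [];
      list.push(pkg);
      dependents.set(dep, list);
    }
  }

  // Breadth-first walk from the changed packages, ignoring unknown names.
  const affected = new Set<string>();
  const queue = changed.filter((pkg) => pkg in graph);
  while (queue.length > 0) {
    const pkg = queue.shift() as string;
    if (affected.has(pkg)) continue;
    affected.add(pkg);
    queue.push(...(dependents.get(pkg) ?? []));
  }
  return Array.from(affected).sort();
}

// Hypothetical workspace: two services share one library, one service stands alone.
const exampleGraph: DepGraph = {
  'rankings-api': ['shared-types'],
  'price-feed': ['shared-types'],
  'chat-api': [],
  'shared-types': [],
};
```

With that example graph, a change to chat-api affects only chat-api, while a change to shared-types pulls in both services that depend on it.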

Contract verification in CI

Affected detection saves time. It does not save you from breaking contracts. If service A changes a response field and service B reads it, your affected-only build is green and production is broken.

The fix is consumer-driven contract testing. Pact is the boring, working answer. Consumers write Pacts describing what they expect. Providers verify those Pacts in CI before they ship.

I introduced this at a real-time trading platform I architected after we shipped a price-fan-out service that “improved” a JSON field by changing its type from string to number. The chart renderer didn’t crash. It rendered every price as NaN for about eleven minutes while I figured out what was happening.

Consumer side, in the chart renderer’s test suite:

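import { PactV3, MatchersV3 } from '@pact-foundation/pact';
// Assumed path: the snippet uses createPriceClient without showing where it lives.
import { createPriceClient } from '../src/price-client';
const { like, eachLike, decimal } = MatchersV3;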

const provider = new PactV3({
  consumer: 'chart-renderer',
  provider: 'price-feed',
  dir: './pacts',
});

describe('price-feed contract', () => {
  it('returns tick events with numeric price', async () => {
    await provider
      .given('an active symbol AAPL')
      .uponReceiving('a request for latest tick')
      .withRequest({ method: 'GET', path: '/v1/ticks/AAPL' })
      .willRespondWith({
        status: 200,
        body: like({
          symbol: 'AAPL',
          price: decimal(184.21),
          // Fixed example value so the generated pact file stays identical between runs.
          ts: like('2024-01-01T00:00:00.000Z'),
          venues: eachLike({ id: 'NASDAQ', size: 100 }),
        }),
      })
      .executeTest(async (mock) => {
        const client = createPriceClient(mock.url);
        const tick = await client.latest('AAPL');
        expect(typeof tick.price).toBe('number');
      });
  });
});

The Pact file gets published to a Pact Broker on green. The provider’s CI runs verification against the latest Pacts before any build is allowed to deploy:

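verify-contracts:
  runs-on: ubuntu-latest
  # Job-level env so both the verification script and can-i-deploy can reach the broker.
  env:
    PACT_BROKER_BASE_URL: ${{ secrets.PACT_BROKER_URL }}
    PACT_BROKER_TOKEN: ${{ secrets.PACT_BROKER_TOKEN }}
  steps:
    - uses: actions/checkout@v4
    - uses: pnpm/action-setup@v3
    - uses: actions/setup-node@v4
      with: { node-version: 20, cache: pnpm }
    - run: pnpm install --frozen-lockfile
    - name: Verify pacts
      run: pnpm tsx scripts/verify-pacts.ts --provider=price-feed --tag=${{ github.sha }}
    - name: Can I deploy
      # Assumes the pact-broker CLI is already on the runner's PATH.
      run: |
        pact-broker can-i-deploy \
          --pacticipant price-feed \
          --version ${{ github.sha }} \
          --to-environment production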

can-i-deploy is the actual gate. If any consumer in production has a Pact this build doesn’t satisfy, the deploy is blocked. First time I turned this on for a side product I CTO, it caught a header rename on day two.
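To make the gate concrete, here is a simplified model of the decision it makes. This is a sketch under my own assumptions, not the Pact Broker’s code: VerificationRow, DeployDecision, and canDeploy are hypothetical names, and the real broker works from its verification matrix rather than a flat list. The rule is the same, though: every consumer version in the target environment needs a passing verification against this provider version.

```typescript
// One row per consumer version currently recorded in the target environment.
interface VerificationRow {
  consumer: string;
  consumerVersion: string;
  // true = this provider version verified the pact, false = verification failed,
  // null = no verification result has been published yet.
  verified: boolean | null;
}

interface DeployDecision {
  deployable: boolean;
  blockers: string[];
}

// Deployable only if every row has a passing verification.
function canDeploy(rows: VerificationRow[]): DeployDecision {
  const blockers = rows
    .filter((row) => row.verified !== true)
    .map((row) => {
      const reason = row.verified === false ? 'verification failed' : 'no verification result';
      return `${row.consumer}@${row.consumerVersion}: ${reason}`;
    });
  return { deployable: blockers.length === 0, blockers };
}
```

Note that a missing result blocks the deploy just like a failed one. An unverified contract is an unknown, and the gate only works if unknowns fail closed.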

Canary and blue-green per service

Independently deployable means each service rolls out on its own clock. Argo Rollouts is the cleanest way I’ve shipped this on EKS. Service-level canary, traffic splits, automatic rollback on a metric breach.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rankings-api
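spec:
  replicas: 12
  # selector and pod template omitted for brevity
  strategy:
    canary:
      # Host-level Istio routing needs separate Services for canary and stable pods.
      # These two Service names are assumptions.
      canaryService: rankings-api-canary
      stableService: rankings-api-stable
      maxSurge: 25%
      maxUnavailable: 0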
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - analysis:
            templates:
              - templateName: error-rate-and-p99
            args:
              - { name: service, value: rankings-api }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      trafficRouting:
        istio:
          virtualService: { name: rankings-api, routes: [primary] }

The AnalysisTemplate watches a Datadog query for 5xx rate and p99 against the canary’s pods only. If error rate climbs past 0.5% or p99 doubles, Argo aborts and shifts traffic back. We pair this with LaunchDarkly flags wrapping anything genuinely risky inside the service, so a “deploy” and a “release” are separate events. Deploy quietly. Flip the flag when product is ready.
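The abort rule itself is simple enough to write down. The sketch below is not how Argo evaluates an AnalysisTemplate (that lives in YAML and a metrics provider query); it is just the decision expressed as a function. WindowMetrics, Verdict, and judgeCanary are hypothetical names, and I am assuming “p99 doubles” means relative to the stable pods over the same window.

```typescript
interface WindowMetrics {
  requests: number;
  serverErrors: number; // count of 5xx responses in the window
  p99Ms: number;
}

type Verdict = { abort: false } | { abort: true; reason: string };

const MAX_ERROR_RATE = 0.005; // 0.5%
const MAX_P99_RATIO = 2; // canary p99 at or above 2x the baseline p99

function judgeCanary(canary: WindowMetrics, baseline: WindowMetrics): Verdict {
  // No canary traffic yet: nothing to judge, so do not abort on an empty window.
  if (canary.requests === 0) return { abort: false };

  const errorRate = canary.serverErrors / canary.requests;
  if (errorRate > MAX_ERROR_RATE) {
    const seen = (errorRate * 100).toFixed(2);
    const limit = (MAX_ERROR_RATE * 100).toFixed(2);
    return { abort: true, reason: `5xx rate ${seen}% exceeds ${limit}%` };
  }
  if (baseline.p99Ms > 0 && canary.p99Ms >= baseline.p99Ms * MAX_P99_RATIO) {
    return {
      abort: true,
      reason: `p99 ${canary.p99Ms}ms is at least ${MAX_P99_RATIO}x baseline ${baseline.p99Ms}ms`,
    };
  }
  return { abort: false };
}
```

Comparing against the stable pods rather than a fixed number matters on a noisy day: if everything is slow, the canary should not be blamed for it.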

A war story about consumer rebalance

A live federation tournament. Saturday afternoon. The rankings consumer group started rebalancing every 30 seconds and the leaderboard froze at 14:32 local time. First instinct was the lazy one. Restart the deployment, hope it settles. It did not. Rebalance kicked off again 40 seconds later.

Pulled pod logs side by side. One pod out of six had a different max.poll.interval.ms than the others. The manifest pulled :latest instead of a pinned SHA, and a config-touching change had landed without a tag bump. The bad pod’s handler ran a downstream call that occasionally took 70 seconds, longer than its 60-second poll interval. So it got kicked out of the group, triggered a rebalance, knocked the rest off, repeat. 12 minutes of stale standings during a live broadcast.
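Both halves of that failure are checkable before a broadcast rather than during one. A minimal sketch, with hypothetical names (ConsumerPod, podsAtRisk, distinctPollIntervals) and made-up pod data: one function flags pods whose max.poll.interval.ms is shorter than the slowest handler call you expect, the other surfaces config drift across the group.

```typescript
interface ConsumerPod {
  pod: string;
  maxPollIntervalMs: number;
}

// Pods that would miss their poll deadline if a handler call took worstCaseHandlerMs.
function podsAtRisk(pods: ConsumerPod[], worstCaseHandlerMs: number): string[] {
  return pods
    .filter((p) => worstCaseHandlerMs >= p.maxPollIntervalMs)
    .map((p) => p.pod);
}

// Distinct max.poll.interval.ms values in the group; more than one means drift.
function distinctPollIntervals(pods: ConsumerPod[]): number[] {
  return Array.from(new Set(pods.map((p) => p.maxPollIntervalMs))).sort((a, b) => a - b);
}

// Hypothetical group of six: five pods on 300 s (an assumed value), one on 60 s.
const examplePods: ConsumerPod[] = [
  { pod: 'rankings-consumer-0', maxPollIntervalMs: 300000 },
  { pod: 'rankings-consumer-1', maxPollIntervalMs: 300000 },
  { pod: 'rankings-consumer-2', maxPollIntervalMs: 300000 },
  { pod: 'rankings-consumer-3', maxPollIntervalMs: 60000 },
  { pod: 'rankings-consumer-4', maxPollIntervalMs: 300000 },
  { pod: 'rankings-consumer-5', maxPollIntervalMs: 300000 },
];
```

Against a 70-second worst-case handler call, only the 60-second pod is at risk, and the drift check returns two distinct values where there should be one.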

The deploy rule that came out of it lives in our CI today, in pre-flight:

check-image-pinning:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Reject :latest in consumer manifests
      run: |
        offenders=$(grep -rEn 'image:\s+\S+:(latest|main|master)' infra/consumers || true)
        if [ -n "$offenders" ]; then
          echo "Found floating image tags in consumer manifests:"
          echo "$offenders"
          exit 1
        fi

CI fails the deploy if any manifest touching a Kafka consumer references a floating tag. Cheap to write. Saved us from doing the same thing twice.
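The grep has blind spots: an image with no tag at all is implicitly :latest and slips through. Here is a sketch of a stricter check, assuming plain (non-templated) manifests with one image reference per line. findFloatingImages and the Offender shape are hypothetical, not what runs in our pipeline.

```typescript
interface Offender {
  line: number; // 1-based line number within the manifest text
  image: string;
  reason: string;
}

const FLOATING_TAGS = new Set(['latest', 'main', 'master']);

function findFloatingImages(manifest: string): Offender[] {
  const offenders: Offender[] = [];
  manifest.split('\n').forEach((text, index) => {
    // Match `image: <ref>`, allowing a leading list dash and optional quotes.
    const match = /^\s*-?\s*image:\s*["']?([^"'\s#]+)/.exec(text);
    if (!match) return;
    const image = match[1];
    if (image.indexOf('@sha256:') !== -1) return; // digest-pinned, always fine

    // Look for the tag only after the last '/', so a registry port such as
    // registry:5000/app is not mistaken for a tag.
    const lastSegment = image.slice(image.lastIndexOf('/') + 1);
    const colon = lastSegment.lastIndexOf(':');
    const tag = colon === -1 ? undefined : lastSegment.slice(colon + 1);

    if (tag === undefined) {
      offenders.push({ line: index + 1, image, reason: 'no tag (implicitly :latest)' });
    } else if (FLOATING_TAGS.has(tag)) {
      offenders.push({ line: index + 1, image, reason: `floating tag :${tag}` });
    }
  });
  return offenders;
}
```

A CI step would read each manifest under infra/consumers, run this over the text, and fail if the returned list is non-empty.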

A war story about schema migrations

Late-evening deploy at the creator-economy platform I worked at. The migration was an add_column with null: false, default: false on a hot users table. Reviewed that morning. Looked safe. The Rails strong_migrations gem gave it a pass via add_column_with_default.

It acquired ACCESS EXCLUSIVE on a table with hundreds of millions of rows. Login error rate hit 100% for 85 seconds. PagerDuty woke half of California. First instinct was rollback, which would have left the table half-written. We let it finish. Lock released. Dependent services retried and reconnected. Login recovered.

The postmortem fix went into CI as a strong_migrations rule plus a deploy-time gate: any add_column with a non-null default on a table over a configured row threshold fails the build. The CI step that wraps it:

db-safety:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: bundle install
    - name: Reject unsafe migrations
      env:
        PG_URL: ${{ secrets.PG_SHADOW_URL }}
      run: bin/rails db:migrate:safety_check

The check connects to a shadow DB, applies pending migrations, and refuses to greenlight anything strong_migrations flags. On Aurora at scale, every schema change against a hot table is a three-step dance, typically: add the column as nullable, backfill in batches, then add the default and the NOT NULL constraint. The CI gate is what makes you actually do the dance.
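The row-threshold rule reduces to a small predicate. The real gate is a Rails task; this is only a sketch of the rule, in TypeScript to match the other examples here. AddColumnOp, unsafeAddColumns, and the row counts are hypothetical, and treating an unknown table as large is my own choice so the check fails closed.

```typescript
interface AddColumnOp {
  table: string;
  column: string;
  // null means the migration declares no default (or an explicit NULL default).
  defaultValue: string | null;
}

// Returns one message per add_column that sets a non-null default on a table
// whose row count is above the threshold (or unknown).
function unsafeAddColumns(
  ops: AddColumnOp[],
  rowCounts: Record<string, number>,
  rowThreshold: number,
): string[] {
  return ops
    .filter((op) => op.defaultValue !== null)
    .filter((op) => {
      const rows = rowCounts[op.table];
      // Unknown tables fail closed: without a row count we cannot prove the table is small.
      return rows === undefined || rows > rowThreshold;
    })
    .map((op) => `${op.table}.${op.column}: add_column with default ${op.defaultValue} on a large table`);
}
```

The row counts would come from the shadow DB or a stats snapshot. The threshold is a judgment call per team, which is why it is a parameter and not a constant.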

Takeaways

One pipeline graph, affected-only builds. Multi-repo only if you’re already there.
Pact contracts verified per provider, gated by can-i-deploy. Skip this and your green builds will lie to you.
Canary or blue-green per service with metric-based abort. Same pipeline, different rollout clocks.
Pin image SHAs on anything that touches a consumer group. Fail CI if a manifest doesn’t.
Schema migrations are a CI concern, not a runtime concern. Block the unsafe shapes before they ship.
Deploy and release are separate events. Feature flags do the second one.

Thanks for reading. If you’ve got thoughts, send them my way.