Terraform at Scale

How I structure Terraform for multi-environment AWS infra, with remote state on S3 and DynamoDB locks, plan-on-PR in CI, drift detection, and tfsec plus Infracost in the loop.

A Tuesday morning at the creator economy platform I worked at. I’d just opened a PR that bumped a module version on the Aurora readers, nothing dramatic, a parameter group tweak. CI ran terraform plan and the diff was a wall of 187 changes. None of them were mine. Somebody had clicked through the AWS console the night before to “just check something” on a security group, then forgot to put it back. AWS knew. Our state file and our HCL didn’t. The plan was going to silently revert all of it on apply.

Here’s where I land. At scale, Terraform is a discipline tool, not a magic tool. The defaults are fine for a hackathon and dangerous for a real org. You need remote state with locking on day one, modules versioned like libraries, separate state per environment plus per blast radius, plan-on-PR with a human in the loop, drift detection on a schedule, and tfsec plus Infracost in the same CI gate. Skip any one of those, you’re going to lose a weekend.

Remote state on S3 with locking

Local state is fine right up until two engineers run apply at the same time. The S3 backend with a DynamoDB lock table is the standard answer and it’s the right answer. The thing nobody tells you is that the bootstrap is circular: the backend needs a bucket and a lock table that Terraform hasn’t created yet. So you create both out-of-band first, then import them.

terraform {
  required_version = ">= 1.6.0"

  backend "s3" {
    bucket         = "acme-tf-state-prod"
    key            = "platform/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "acme-tf-locks"
    encrypt        = true
    kms_key_id     = "alias/tf-state"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}

resource "aws_s3_bucket" "tf_state" {
  bucket = "acme-tf-state-prod"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "acme-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }
}

Three details that matter. prevent_destroy on the state bucket, because somebody will type terraform destroy at 11pm. Versioning, because state file corruption is real and rolling back one version is the cheapest recovery. KMS with a customer-managed key, because state files contain RDS passwords and IAM ARNs you don’t want sitting plaintext in S3.
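For the "create out-of-band, then import" step, one option is declarative import blocks. This is a minimal sketch, assuming Terraform 1.5 or later (the config above already requires 1.6) and that the bucket and table were created by hand under the names used in this section:

```hcl
# Temporary import blocks: adopt the hand-created backend resources into state.
# Assumes the bucket and lock table already exist under these exact names.

import {
  to = aws_s3_bucket.tf_state
  id = "acme-tf-state-prod" # S3 buckets import by bucket name
}

import {
  to = aws_s3_bucket_versioning.tf_state
  id = "acme-tf-state-prod" # versioning config also imports by bucket name
}

import {
  to = aws_dynamodb_table.tf_locks
  id = "acme-tf-locks" # DynamoDB tables import by table name
}
```

The next plan should report three resources to import. If it also shows changes on them, what was created by hand doesn't match the HCL, and that's worth fixing before the first apply. Once the import has been applied, the blocks can be deleted.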

Splitting state by blast radius

Teams start with one big state file for “prod” and three years later that state has a thousand resources, a plan takes 12 minutes, and one bad apply can take down the network and the databases and the Kubernetes cluster in a single run.

The rule, roughly. Split state along boundaries you wouldn’t want to recreate together. Network gets its own. Data plane (RDS, ElastiCache, S3) gets its own. EKS and cluster addons get their own. Application stuff (per-service IAM, queues, target groups) gets a state per service. Cross-state references through SSM Parameter Store, not deep terraform_remote_state chains, so you can rotate a state file without breaking everyone downstream.
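A rough sketch of the SSM handoff between two states. The parameter path, the aws_subnet.private resource (assumed to use count), and the local name are all hypothetical, not taken from a real setup:

```hcl
# --- In the network state: publish what downstream states need. ---
resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/platform/network/private_subnet_ids" # hypothetical path
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id) # assumes aws_subnet.private uses count
}

# --- In a consuming state (for example a per-service state): read it back. ---
data "aws_ssm_parameter" "private_subnet_ids" {
  name = "/platform/network/private_subnet_ids"
}

locals {
  # The data source marks value as sensitive. Subnet IDs are not secret,
  # so unwrap the value to keep plan output readable.
  private_subnet_ids = split(",", nonsensitive(data.aws_ssm_parameter.private_subnet_ids.value))
}
```

The consumer depends only on a parameter path, not on the producer's backend bucket, key, or output names, so the network state can be moved or split without touching downstream code.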

Modules treated as versioned libraries

Copy-pasted modules across environments are how drift starts. The fix is to treat modules like internal Go packages. Source from a Git tag, never a branch, never main.

module "aurora_postgres" {
  source = "git::ssh://[email protected]/acme/tf-modules.git//aurora-postgres?ref=v3.4.1"

  cluster_identifier  = "community-prod"
  engine_version      = "15.4"
  instance_class      = "db.r6g.4xlarge"
  reader_count        = 3
  backup_retention    = 14
  deletion_protection = true
  apply_immediately   = false

  parameter_group_family = "aurora-postgresql15"
  custom_parameters = {
    log_min_duration_statement = "500"
    statement_timeout          = "120000"
  }

  monitoring_interval          = 30
  performance_insights_enabled = true

  tags = local.common_tags
}

ref=v3.4.1 is the point. When I bump the module, I bump the ref in one environment, plan, review, apply, then promote the ref through staging to prod. The module repo has its own CI that runs terraform validate, tflint, and a contract test on every PR before a tag is cut.

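Learned this one the hard way somewhere else, where modules were sourced as source = "../../modules/rds". A module change shipped to every environment the moment it merged, with no promotion step, and what actually got applied depended on whichever checkout the apply ran from. Staging and prod silently diverged for a week because of a stale local checkout. Pin the ref.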
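The contract test in the module repo could look something like the sketch below, using the native terraform test framework (Terraform 1.6+). The file path, the aws_rds_cluster.this resource address, and the assumption that these inputs are the module's full required set are all hypothetical. The point is only to show the shape of a test that pins the module's input-to-resource contract before a tag is cut:

```hcl
# tests/contract.tftest.hcl (hypothetical path inside the aurora-postgres module)
# Run with `terraform test` from the module directory. `command = plan` still
# calls the AWS provider, so credentials and a region must be available.

variables {
  cluster_identifier     = "contract-test"
  engine_version         = "15.4"
  instance_class         = "db.r6g.large"
  reader_count           = 1
  backup_retention       = 7
  deletion_protection    = true
  apply_immediately      = false
  parameter_group_family = "aurora-postgresql15"
}

run "deletion_protection_reaches_the_cluster" {
  command = plan

  assert {
    # aws_rds_cluster.this is an assumed internal resource address.
    condition     = aws_rds_cluster.this.deletion_protection == true
    error_message = "deletion_protection input was not applied to the cluster."
  }
}

run "backup_retention_reaches_the_cluster" {
  command = plan

  assert {
    condition     = aws_rds_cluster.this.backup_retention_period == var.backup_retention
    error_message = "backup_retention input was not applied to the cluster."
  }
}
```

Plan-only runs create nothing, which keeps the test cheap enough to run on every PR. They can only assert on values known at plan time.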

Plan-on-PR and human in the loop

CI should run terraform plan on every PR and post the output as a comment. No exceptions. The human reviewing the PR should be reading the plan, not the HCL diff. The plan is the truth.

name: terraform
on:
  pull_request:
    paths: ["terraform/**"]

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [staging, prod]
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111111111111:role/tf-ci-plan
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.6

      - name: tfsec
        uses: aquasecurity/[email protected]
        with:
          working_directory: terraform/${{ matrix.env }}
          soft_fail: false

      - name: terraform init
        working-directory: terraform/${{ matrix.env }}
        run: terraform init -input=false

      - name: terraform plan
        working-directory: terraform/${{ matrix.env }}
        run: |
          # Without pipefail, tee's exit code would mask a failed plan.
          set -o pipefail
          terraform plan -input=false -no-color -out=tfplan \
            | tee plan.txt

      - name: infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: infracost diff
        working-directory: terraform/${{ matrix.env }}
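        run: |
          # diff emits JSON; output renders it as a GitHub comment.
          infracost diff --path=. --format=json --out-file=infracost.json
          infracost output --path=infracost.json --format=github-comment \
            --out-file=infracost-comment.md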

      - name: post plan
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/${{ matrix.env }}/plan.txt', 'utf8');
            const cost = fs.readFileSync('terraform/${{ matrix.env }}/infracost-comment.md', 'utf8');
            await github.rest.issues.createComment({
              ...context.repo,
              issue_number: context.issue.number,
              body: '## ${{ matrix.env }} plan\n\n```\n' + plan.slice(0, 60000) + '\n```\n\n' + cost
            });

OIDC for AWS auth, not long-lived keys. The plan role is read-only on infrastructure and scoped to what plan needs, which still includes reading state and taking the lock. Apply runs from a separate job on main with a different role that has write permissions. tfsec runs before plan so a misconfigured S3 bucket fails the PR fast. Infracost runs after, so the PR comment tells you the cost delta before anyone approves.
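On the AWS side, the plan role's trust policy is what ties it to pull requests. A minimal sketch, assuming the GitHub OIDC provider is already registered in the account and that the repository is acme/infra (a hypothetical name):

```hcl
# Assumes the GitHub OIDC provider already exists in this AWS account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "plan_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows triggered by pull_request events in this repo may assume
    # the role. "acme/infra" is a hypothetical repository name.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:acme/infra:pull_request"]
    }
  }
}

resource "aws_iam_role" "tf_ci_plan" {
  name               = "tf-ci-plan"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}

# AWS-managed read-only policy as a starting point for plan.
resource "aws_iam_role_policy_attachment" "plan_read_only" {
  role       = aws_iam_role.tf_ci_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

ReadOnlyAccess alone is unlikely to be enough. The role also needs a small extra policy for writes to the DynamoDB lock table and for decrypting state with the KMS key. The apply role would get its own trust policy with the sub condition pinned to the main branch instead of pull requests.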

Drift detection on a schedule

The story at the top is a drift story. Console clicks happen. Auto-remediation in incidents happens. Some other system writes to a tag your Terraform owns. State goes out of sync with reality and you don’t find out until your next plan.

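Run terraform plan against every environment on a cron, every hour. With -detailed-exitcode, plan exits 2 when there are changes, so the job can tell drift apart from a clean run or an error. On a non-empty plan, post to a #platform-drift channel. It’s not a P1, it’s a “go fix the source before someone does an apply and reverts a hand-edit somebody made at 2am”.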

tfsec and Infracost as the same gate

I keep these two together because they answer two questions about the same plan. tfsec answers “is this safe”. Infracost answers “is this affordable”. Both have to pass for the PR to merge.

A .tfsec/config.yml to ignore the noise that doesn’t apply to your stack, but never to ignore a real finding:

exclude:
  - aws-s3-enable-bucket-logging  # we log via CloudTrail data events instead
severity_overrides:
  aws-rds-encrypt-cluster-storage-data: CRITICAL

Infracost wants a baseline so the diff is meaningful. On the first run for a repo, you generate infracost-base.json from the current main and check it in; infracost diff takes it through --compare-to, and the PR comment shows you the delta, not the absolute. Without that you’ll see a giant headline number on every PR and stop reading the comments by week two.

Takeaways

Remote state on S3 with DynamoDB locks, KMS-encrypted, versioned, prevent_destroy on the bucket. Day one. Not later.
Split state by blast radius. Network, data, cluster, per-service. Cross-state via SSM, not deep terraform_remote_state chains.
Pin module sources to Git tags. Never main, never relative paths to a shared checkout.
terraform plan on every PR, posted as a comment, with tfsec and Infracost in the same gate. OIDC for AWS auth.
Run plan on a cron against every environment. Non-empty plan means drift. Find the source, fold it back into HCL.
Apply runs from a separate, more privileged job on main. Plan and apply use different IAM roles.

Thanks for reading. If you’ve got thoughts, send them my way.