Database Backup and Recovery Strategies

Logical vs physical backups, point-in-time recovery on Aurora, cross-region DR, and why recovery drills matter more than backup configs.

It was a Tuesday morning at the creator economy platform I worked at. Aurora was fine on the writer. The three reader replicas, less fine. Lag had drifted to about 14 minutes and the Community feed p99 was over 8 seconds. We sorted it out within the hour. But while we were waiting for replica lag to drain, someone in the war room dropped a sentence into Slack that stuck.

“If this had been a corruption event instead of a lag event, what would our recovery actually look like?”

Nobody had a clean answer. We had snapshots. PITR enabled. A runbook. None of us had restored anything in the last quarter. That is backup theater, and most teams are doing it.

Backups are not a strategy

Here’s the deal. A backup you have not restored is not a backup, it is a Schrodinger’s snapshot. Until you’ve pulled it back into a live system and verified the rows, all you know is some bytes exist in S3. What you care about isn’t whether the snapshot exists, it’s whether you can hit RTO and RPO under pressure. Only way to know is to do it.

I’ve owned the PostgreSQL Aurora layer on a multi-terabyte writer with hundreds of millions of rows on the hot tables. The number of times the snapshot config mattered is small. The number of times a teammate asked “can you get me yesterday’s state of the orders table into a sandbox” is large. Operational recovery, not catastrophic recovery, is what eats your week.

Logical vs physical, when to pick which

Short version: use both, on purpose.

Physical backups (Aurora native snapshots, base backups) are fast, restore the whole cluster, and are locked to the engine version. They are what you reach for in a disaster. Logical backups (pg_dump, mysqldump) are slow, portable, and let you cherry-pick. They are what you reach for when one engineer dropped a column they shouldn’t have and you need just that table from 4 hours ago.

A logical extract I run weekly into a sandbox:

#!/usr/bin/env bash
set -euo pipefail

SRC_HOST="$1"        # primary writer endpoint
TABLE="$2"           # e.g. community_posts
SANDBOX_DB="${SANDBOX_DB:-sandbox_restore}"
DUMP_FILE="/tmp/${TABLE}.$(date +%Y%m%d-%H%M).dump"

PGPASSWORD="$SRC_PASSWORD" pg_dump \
  --host="$SRC_HOST" \
  --username="$SRC_USER" \
  --dbname="$SRC_DB" \
  --format=custom \
  --no-owner --no-acl \
  --table="public.${TABLE}" \
  --file="$DUMP_FILE"

PGPASSWORD="$SANDBOX_PASSWORD" pg_restore \
  --host="$SANDBOX_HOST" \
  --username="$SANDBOX_USER" \
  --dbname="$SANDBOX_DB" \
  --clean --if-exists \
  --no-owner --no-acl \
  "$DUMP_FILE"

psql "host=$SANDBOX_HOST user=$SANDBOX_USER dbname=$SANDBOX_DB" \
  -c "SELECT count(*) FROM $TABLE;" \
  -c "SELECT max(updated_at) FROM $TABLE;"

Not glamorous. Reads to a sandbox, prints two sanity checks, exits non-zero on failure. The script doesn’t have to be clever. It has to actually run, and someone has to look at the output.

Point-in-time recovery on Aurora

PITR is the answer to “we shipped a bad migration at 2:14 a.m., undo.” Aurora keeps continuous backups across the retention window you configure, and restore-db-cluster-to-point-in-time spins up a new cluster at any second inside that window. New cluster, not in-place. That detail catches people the first time they try it.

War story. Late evening deploy at the creator economy platform, past midnight UTC but only 6 p.m. Pacific. We shipped a schema change to add a non-null column to a users table with hundreds of millions of rows. The migration used the “safer” add_column_with_default helper from strong_migrations. I’d ack’d it as safe that morning.

It wasn’t. The migration acquired an ACCESS EXCLUSIVE lock while applying the default backfill. On Aurora at our row count that meant about 90 seconds of blocked writes. Login error rate hit 100% for roughly 85 seconds. Login, sign-up, password reset, every webhook tied to user creation, all dead. PagerDuty woke half the senior engineers in California.

First instinct was to roll back. Bad idea. Rails doesn’t have a clean rollback for a partially-applied add_column_with_default, and killing it mid-flight risks leaving the table in an inconsistent metadata state. We let it finish. 87 seconds. Locks released. Login came back within 15 seconds.

The postmortem question was the one that stuck. If that had corrupted the table instead of locking it, would we have PITR’d back? We thought yes. We weren’t sure. We hadn’t drilled it. So we drilled it. The command is unremarkable:

aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier prod-aurora-main \
  --db-cluster-identifier prod-aurora-pitr-test \
  --restore-to-time "$RESTORE_TIME" \
  --use-latest-restorable-time false \
  --db-subnet-group-name prod-private \
  --vpc-security-group-ids sg-0a1b2c3d4e

That spins up a fresh cluster. You attach readers, run validation queries, and in a real recovery cut over by switching the writer endpoint. Minimum drill: PITR a cluster, count rows on three tables, drop the cluster. Couple of hours of Aurora time. Worth it.

Backup retention I keep in Terraform so it’s reviewable like any other config:

resource "aws_rds_cluster" "main" {
  cluster_identifier      = "prod-aurora-main"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = "app"

  backup_retention_period      = 35
  preferred_backup_window      = "07:00-08:00"  # UTC, off-peak
  copy_tags_to_snapshot        = true
  deletion_protection          = true
  storage_encrypted            = true
  kms_key_id                   = aws_kms_key.rds.arn

  enabled_cloudwatch_logs_exports = ["postgresql"]
}

resource "aws_db_cluster_snapshot_copy" "weekly_cross_region" {
  provider                       = aws.dr
  source_db_cluster_snapshot_arn = data.aws_db_cluster_snapshot.weekly.db_cluster_snapshot_arn
  target_db_cluster_snapshot_identifier = "prod-aurora-weekly-dr"
  kms_key_id                     = aws_kms_key.rds_dr.arn
}

35 days of retention is generous on cost but cheap on regret.

Cross-region DR, what it actually costs

Cross-region DR is where I see the most theater. Teams put Aurora Global Database in their architecture doc, never drill failover, and assume because the box on the diagram is green, they’re covered. They are not.

Three flavors, lightest to heaviest:

Cross-region snapshot copy. Cheap. RPO in hours, RTO in hours-to-restore. Good enough for most products.
Aurora Global Database (warm standby reader in another region). RPO seconds, RTO minutes if you’ve drilled failover. Expensive.
Active-active across regions. Rare, hard, almost always more complexity than the product needs.

Start with the snapshot-copy version and only move up if a real RTO requirement forces you. The trading platform I architected handled around 10M concurrent live market-data requests at peak, and even there we treated cross-region as warm standby. Split-brain at 3 a.m. is not a problem you want to discover you needed to solve.

Drills, not backup theater

If there’s one line I’d put on a sticky note above your desk, it’s this. Backups are the byproduct. Drills are the product.

I run scheduled restore drills as a CI job. Same way we run integration tests:

name: weekly-restore-drill

on:
  schedule:
    - cron: "0 8 * * 1"   # Mondays 08:00 UTC
  workflow_dispatch:

jobs:
  restore-and-verify:
    runs-on: ubuntu-latest
    timeout-minutes: 90
    env:
      AWS_REGION: us-east-1
      SOURCE_CLUSTER: prod-aurora-main
      TARGET_CLUSTER: drill-aurora-${{ github.run_id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DRILL_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: PITR restore (15 minutes ago)
        run: |
          RESTORE_TIME=$(date -u -d "15 minutes ago" +"%Y-%m-%dT%H:%M:%SZ")
          aws rds restore-db-cluster-to-point-in-time \
            --source-db-cluster-identifier "$SOURCE_CLUSTER" \
            --db-cluster-identifier "$TARGET_CLUSTER" \
            --restore-to-time "$RESTORE_TIME" \
            --db-subnet-group-name drill-private \
            --vpc-security-group-ids "${{ secrets.DRILL_SG }}"

      - name: Wait for cluster available
        run: |
          aws rds wait db-cluster-available \
            --db-cluster-identifier "$TARGET_CLUSTER"

      - name: Verify row counts
        run: ./scripts/verify_restore.sh "$TARGET_CLUSTER"

      - name: Notify Slack
        if: always()
        run: ./scripts/notify_slack.sh "$TARGET_CLUSTER" "${{ job.status }}"

      - name: Teardown
        if: always()
        run: |
          aws rds delete-db-cluster \
            --db-cluster-identifier "$TARGET_CLUSTER" \
            --skip-final-snapshot

verify_restore.sh runs three queries against three known-large tables and checks they’re within tolerance of production row counts at that timestamp. If it fails, Slack gets pinged.

The reason this works is the same reason a reconciliation pattern I shipped at the creator platform worked. Briefly. Our pending_apple_review Sidekiq queue backed up one Wednesday because Apple’s Connect API was silently throttling submissions, returning 200 OK but dropping the work server-side. The pipeline trusted the response. We chased it with a retry first, which made it worse (duplicate submissions on Apple’s side). The actual fix was a reconciliation job that read state from Apple directly and treated Apple as the source of truth. Same shape as a backup drill. You don’t trust your snapshot. You restore it and compare against truth.

Add a CloudWatch alarm for backup age so the monitor fires when the snapshot pipeline stops working, not three days later when you need a restore:

resource "aws_cloudwatch_metric_alarm" "snapshot_age" {
  alarm_name          = "rds-snapshot-age-over-26h"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "SnapshotAgeHours"
  namespace           = "Custom/RDS"
  period              = 3600
  statistic           = "Maximum"
  threshold           = 26
  alarm_description   = "Most recent automated snapshot is older than 26h"
  alarm_actions       = [aws_sns_topic.pager.arn]
  treat_missing_data  = "breaching"
}

SnapshotAgeHours is a custom metric emitted from a tiny Lambda that calls describe-db-cluster-snapshots and reports the age of the freshest one. Cheap, boring, caught two snapshot-job outages in production nobody would have noticed otherwise.

Runbooks and the human side

On-call should be able to do a PITR restore without paging the database lead. If they can’t, the runbook is the bug. The first line of a backup runbook shouldn’t be “open the AWS console”, it should be “is the latest snapshot younger than 26 hours?” If the answer is no, the snapshot pipeline is your incident, not the database. Runbooks that need 12 paragraphs before the first action are unread runbooks.

Takeaways

An untested backup is not a backup. Drill quarterly, minimum.
Run logical and physical together. Logical for surgical restores, physical for disaster.
Measure RTO and RPO from a real drill, not from the design doc.
Cross-region DR is expensive. Snapshot-copy is enough for most products.
Runbooks lead with the first thing you check, not the first thing you read.
Alarm on backup age, not just backup existence.

Thanks for reading. If you’ve got thoughts, send them my way.