
Database backups that actually restore — testing 3-2-1 in practice

Most database backups exist but have never been tested. A scary fraction won't actually restore when the database burns down. The 3-2-1 strategy plus quarterly restore drills is the minimum for any production database.

Jun 12, 2026

Industry surveys consistently find that 30-40% of production database backups don't actually restore when tested. The backups exist. The pipeline ran. The files are sitting in S3. They just don't work: the archive is corrupted, a dependency is missing, the format is wrong, or the encryption keys are lost.

The only way to know your backups work is to restore from them. Regularly. The 3-2-1 rule plus quarterly restore drills is the production minimum.

The 3-2-1 rule

3 copies of data — production + 2 backups.
2 different storage media — disk + cloud, NAS + tape, etc.
1 copy offsite — different region, different provider, different physical location.

Translated to a typical cloud setup:

Production database (primary).
Snapshot to S3 in the same region.
Replicated S3 backup in a different region.
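
The rule is mechanical enough to check in code. Below is a minimal sketch, assuming you keep an inventory of where each copy lives; the Copy record, its field names, and the region labels are hypothetical and only illustrate the three conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str       # human-readable label
    medium: str     # e.g. "block-disk", "object-storage", "tape"
    location: str   # e.g. a region or site identifier
    is_primary: bool = False


def check_3_2_1(copies: list[Copy]) -> list[str]:
    """Return a list of rule violations; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append(f"need 3 copies, found {len(copies)}")
    if len({c.medium for c in copies}) < 2:
        problems.append("need at least 2 different storage media")
    primary_locations = {c.location for c in copies if c.is_primary}
    offsite = [
        c for c in copies
        if not c.is_primary and c.location not in primary_locations
    ]
    if not offsite:
        problems.append("need at least 1 backup copy outside the primary location")
    return problems


# The cloud layout described above (labels are illustrative).
layout = [
    Copy("production", "block-disk", "region-a", is_primary=True),
    Copy("s3-snapshot", "object-storage", "region-a"),
    Copy("s3-replica", "object-storage", "region-b"),
]
```

Calling check_3_2_1(layout) returns an empty list for this layout. Dropping the cross-region replica produces two violations, one for the copy count and one for the missing offsite copy.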

Postgres-specific backup strategy

Continuous (pg_basebackup + WAL archiving)

For point-in-time recovery:

pg_basebackup creates a base image weekly.
WAL files archive continuously to S3 (every few minutes).
Tool: WAL-G or pgBackRest handles both.

Result: restore to any minute in the past N days.
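
To make the restore side concrete, here is a rough sketch of the two decisions a point-in-time restore involves: which base backup to start from, and which recovery settings to hand to PostgreSQL 12 or later. The helper names are hypothetical, and the restore_command assumes WAL-G is the archiving tool; with pgBackRest the command would differ.

```python
from datetime import datetime, timezone
from typing import Optional


def pick_base_backup(base_backups: list[datetime], target: datetime) -> Optional[datetime]:
    """Return the newest base backup completed at or before `target`, or None."""
    candidates = [b for b in base_backups if b <= target]
    return max(candidates) if candidates else None


def recovery_settings(target: datetime) -> str:
    """Render recovery settings for a time-based target.

    `target` must be timezone-aware. The restore_command is an assumption
    (WAL-G); substitute whatever your archiving tool documents.
    """
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return (
        "restore_command = 'wal-g wal-fetch %f %p'\n"
        f"recovery_target_time = '{stamp}'\n"
        "recovery_target_action = 'promote'\n"
    )
```

After restoring the chosen base backup into the data directory, these settings go into postgresql.conf and an empty recovery.signal file tells the server to replay WAL up to the target.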

Logical (pg_dump)

For schema-level backup, migrations, dev environments:

Nightly pg_dump --format=custom.
Smaller files, slower restore than physical.
Easier to selectively restore tables.
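
As an illustration, the nightly dump and a single-table restore could be wrapped like this. It is a sketch only: the function names, database names, and paths are placeholders, and connection options are left to the usual PG* environment variables.

```python
def dump_command(dbname: str, outfile: str) -> list[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", f"--file={outfile}", dbname]


def restore_table_command(dumpfile: str, target_db: str, table: str) -> list[str]:
    """Build a pg_restore invocation that restores one table from the archive."""
    return ["pg_restore", f"--dbname={target_db}", f"--table={table}", dumpfile]
```

Either list can be passed to subprocess.run(..., check=True) so a non-zero exit code fails the job loudly. Note that restoring with --table does not pull in other objects the table depends on, so a selective restore usually targets a database where the rest of the schema already exists.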

Snapshots (cloud-managed)

RDS, Aurora, Cloud SQL, Yandex Cloud Postgres handle snapshots natively:

Daily automatic snapshots.
Configurable retention (typically 7-35 days).
Point-in-time recovery within retention window.

Even with managed snapshots, you still want logical backups exported to your own S3 for the rare scenario where the entire cloud account is compromised.

Retention

A reasonable retention schedule:

Last 14 days: point-in-time granularity (WAL archive), hourly at minimum.
Last 90 days: daily granularity.
Last 12 months: weekly granularity.
Last 5-7 years: monthly granularity (compliance).
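
A schedule like this can be expressed as a small pruning rule. The sketch below assumes one backup per day and makes two arbitrary choices that are not in the schedule itself: Sunday backups stand in for "weekly" and first-of-month backups for "monthly".

```python
from datetime import date


def keep_backup(taken: date, today: date) -> bool:
    """Decide whether a daily backup taken on `taken` is still retained."""
    age = (today - taken).days
    if age < 0:
        raise ValueError("backup is dated in the future")
    if age <= 90:
        return True                   # daily tier (WAL covers the finest-grained window)
    if age <= 365:
        return taken.weekday() == 6   # weekly tier: keep Sundays (assumption)
    if age <= 7 * 365:
        return taken.day == 1         # monthly tier: keep the 1st (assumption)
    return False                      # past the compliance window
```

A pruning job would list backups, call keep_backup for each, and delete only those that return False. Running it in report-only mode first is a cheap way to catch an off-by-one before anything is deleted.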

Encryption

Backup files encrypted at rest (S3 server-side encryption).
Encryption keys stored separately from data.
Keys versioned and rotated annually.
Restore tested with the key version each backup was written with, not only the current one: older backups may need older keys.

Losing the encryption key means losing all backups. Recovery plan must include key recovery.
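
One way to keep this honest is to record which key version encrypted each backup and periodically check that every recorded version can still be retrieved. A minimal sketch, assuming such a manifest exists; the backup names and key IDs are made up.

```python
def backups_missing_keys(
    backup_key_ids: dict[str, str],
    available_key_ids: set[str],
) -> list[str]:
    """Return the backups whose encryption key version is no longer retrievable."""
    return sorted(
        name
        for name, key_id in backup_key_ids.items()
        if key_id not in available_key_ids
    )


# Hypothetical manifest: backup name -> key version used to encrypt it.
manifest = {
    "2025-06-01.dump": "key-v1",
    "2026-06-01.dump": "key-v2",
}
```

With only key-v2 available, the 2025 backup is reported as unrecoverable, which is exactly the situation a rotation without key retention creates.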

Ransomware protection

Attackers in 2026 specifically target backup infrastructure. Modern ransomware:

Encrypts production.
Finds and encrypts backups too.
Leaves you with no recovery option.

Protection patterns:

S3 Object Lock (WORM): backups cannot be deleted or modified for X days.
Separate backup account: different AWS account, different credentials, and no way for production credentials to delete or overwrite what is stored there.
Air-gapped offsite copy: physical tape, periodically refreshed.
Backup verification: detect tampering immediately.
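
For the Object Lock pattern, the retention period is set per object at upload time. The sketch below only builds the arguments for boto3's put_object call so it can be read and tested without AWS access; the helper name, bucket, and key are placeholders, and it assumes the bucket was created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def locked_put_kwargs(
    bucket: str,
    key: str,
    body: bytes,
    retain_days: int,
    now: Optional[datetime] = None,
) -> dict:
    """Build put_object arguments that lock the object for `retain_days`."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # COMPLIANCE mode cannot be shortened or removed, even by the root user.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }
```

Usage would look like boto3.client("s3").put_object(**locked_put_kwargs("backup-bucket", "db/2026-06-12.dump", data, 30)). GOVERNANCE mode is the softer alternative, since users with a specific permission can still lift the lock.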

Restore testing — quarterly drill

At minimum, every 90 days:

Provision a fresh test database.
Restore from yesterday's backup.
Restore to a point-in-time within the last hour.
Verify expected row counts on key tables.
Run sample queries that should return known data.
Document time-to-restore.
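
The row-count step is easy to script against any DB-API connection. Here is a sketch, with an in-memory SQLite database standing in for the restored Postgres instance so the example runs anywhere; the table names and thresholds are illustrative.

```python
import sqlite3


def verify_row_counts(conn, minimums: dict[str, int]) -> dict[str, str]:
    """Compare count(*) per table against an expected minimum; return failures."""
    failures = {}
    cur = conn.cursor()
    for table, minimum in minimums.items():
        # Identifiers cannot be bound as parameters, so accept only plain names.
        if not table.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {table!r}")
        try:
            cur.execute(f"SELECT count(*) FROM {table}")
            count = cur.fetchone()[0]
        except Exception as exc:  # missing table, permission error, ...
            conn.rollback()       # Postgres aborts the transaction on error
            failures[table] = f"query failed: {exc}"
            continue
        if count < minimum:
            failures[table] = f"expected at least {minimum} rows, found {count}"
    return failures


# Stand-in for the restored database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
demo.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])
demo.commit()
```

An empty result means every table met its minimum. A missing table shows up as a "query failed" entry, which is how a backup that silently skipped a schema gets caught.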

Without this drill, you are accepting the 30-40% failure rate cited above on the day you actually need the backups.

Automate the drill

Best practice: run an automated restore into a test environment every week, as a scheduled CI pipeline job, and verify that basic queries succeed.

Catch backup issues before disasters.
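
A scheduled job of this kind needs little more than a runner that executes the drill steps in order, stops at the first failure, and records how long each step took. The sketch below is one possible shape; the step names and the callables behind them are placeholders for your own provisioning, restore, and verification code.

```python
import time
from typing import Callable


def run_drill(steps: list[tuple[str, Callable[[], None]]]) -> dict:
    """Run named steps in order, stop at the first failure, record durations."""
    report = {"ok": True, "failed_step": None, "seconds": {}}
    for name, step in steps:
        started = time.monotonic()
        try:
            step()
        except Exception as exc:
            report["ok"] = False
            report["failed_step"] = f"{name}: {exc}"
        report["seconds"][name] = time.monotonic() - started
        if not report["ok"]:
            break
    return report
```

The CI job can exit non-zero when report["ok"] is False, and the per-step timings double as the time-to-restore record that the quarterly drill asks for.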

Recovery time and point objectives

RTO (Recovery Time Objective): how long can production be down? 4 hours? 1 hour? 15 minutes?
RPO (Recovery Point Objective): how much data can be lost? 24 hours? 1 hour? Zero?

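WAL archiving gives an RPO of roughly 5 minutes, bounded by how often WAL segments are shipped. Read replicas give near-zero RPO for hardware failure, but they replay destructive writes too, so they complement backups rather than replace them. Managed database snapshots typically restore in 30-90 minutes for moderate sizes.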
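
Once the objectives are written down, comparing them with measured numbers is simple arithmetic. A small sketch, assuming worst-case data loss is approximated by the WAL archive interval and that the restore time comes from your last drill; the function and parameter names are hypothetical.

```python
from datetime import timedelta


def meets_objectives(
    archive_interval: timedelta,
    measured_restore: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict[str, bool]:
    """Check the archive interval against RPO and the drill time against RTO."""
    return {
        "rpo": archive_interval <= rpo,
        "rto": measured_restore <= rto,
    }
```

For example, a 5-minute archive interval satisfies a 15-minute RPO, but a 70-minute measured restore misses a 1-hour RTO, which signals that the restore path needs to get faster or the objective needs renegotiating.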

Common backup failures

Backup runs but doesn't include a critical schema. Caught only on restore.
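Backup format differs between Postgres versions. A dump written by a newer pg_dump may be unreadable by an older pg_restore, and physical base backups only restore onto the same major version.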
Compression corruption. File completes but is truncated.
Encryption key rotation without re-encrypting old backups or retaining the old keys.
S3 lifecycle policy archived backups to Glacier — restore takes hours.
Quota exhaustion stopped backups months ago, nobody noticed.

Monitor backup success metrics. Alert on failures immediately.
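
Two of the failures above, stalled backups and truncated files, can be caught with a check on nothing more than the backup listing. The sketch below assumes you can obtain a list of (finished_at, size_bytes) pairs, oldest first; the 26-hour and 50% thresholds are arbitrary starting points, not recommendations from any tool.

```python
from datetime import datetime, timedelta


def backup_alerts(
    backups: list[tuple[datetime, int]],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),
    min_size_ratio: float = 0.5,
) -> list[str]:
    """Return alert messages for a stale or suspiciously small latest backup."""
    if not backups:
        return ["no backups found"]
    alerts = []
    finished_at, size = backups[-1]
    if now - finished_at > max_age:
        alerts.append(f"latest backup is stale: finished {finished_at.isoformat()}")
    if len(backups) >= 2:
        previous_size = backups[-2][1]
        # A sharp drop in size often means a truncated or partial file.
        if previous_size > 0 and size < previous_size * min_size_ratio:
            alerts.append("latest backup is much smaller than the previous one")
    return alerts
```

Running this on a schedule that is independent of the backup job itself matters: a check that lives inside the backup script stops running when the script does.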

Verdict

Backups that haven't been restored are unreliable backups. The minimum: the 3-2-1 rule, WAL archiving for point-in-time recovery, S3 Object Lock against ransomware, and quarterly restore drills. Document RTO and RPO. Automate verification. Without these, your disaster recovery plan is fiction.
