Backup verification — how to know your backups will actually restore

The most common backup failure mode isn't "we forgot to run backups" — it's "we ran backups for two years and never tried to restore them." Bit rot, corrupted manifests, missing encryption keys, silent permission errors that left half your data out — all are invisible until the day you need to actually use the backup, which is the worst possible time to discover them. This article covers the structured drill that catches those failures before they bite.

The three things that go wrong

  1. Manifest corruption. The backup tool's index is broken; data chunks may still be present but unreadable. restic check and borg check exist precisely for this.
  2. Coverage gaps. Your include/exclude rules drift over time. A directory added six months ago isn't in any backup; nobody noticed until you needed it.
  3. Restore-path failure. The backup is fine, but restoring it onto a clean system reveals missing pieces: a database dump that's incomplete, a permissions issue that breaks the running service, a config file you forgot to back up that lives somewhere unexpected.

The drill: monthly, scheduled, calendar-blocked

Run a real restore once a month. Not a spot check of a single file, but a full restore onto a throwaway VPS, followed by verifying that the restored services come up and behave. If you can't do monthly, do quarterly. Any less often than quarterly and the drill becomes painful enough that people skip it.

Step 1: tool-level integrity

Run the backup tool's own integrity check first. For restic:

restic -r <repo> check --read-data-subset=10%

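The --read-data-subset=10% option reads and verifies a random 10% of the repository's data packs (a full --read-data pass can take hours on a large repo). For borg, --verify-data is the thorough option and reads every chunk, so budget time accordingly: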

borg check --verify-data <repo>

If either reports errors, stop and investigate before continuing — your backup is already compromised.
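The "stop on errors" rule is easier to follow when the drill script enforces it. A minimal sketch, assuming a POSIX-style shell: verify_repo is a hypothetical helper name, not part of restic or borg, and it takes the check command as its arguments so the same gate works for either tool.

```shell
# Sketch: run an integrity check and refuse to continue the drill on failure.
# verify_repo is a hypothetical helper, not a restic or borg command.
verify_repo() {
    # Run whatever check command was passed in; keep its exit status.
    if "$@"; then
        echo "integrity check passed"
    else
        local status=$?
        echo "integrity check FAILED (exit $status): stop the drill" >&2
        return "$status"
    fi
}

# Example usage (<repo> is a placeholder, as in the commands above):
#   verify_repo restic -r <repo> check --read-data-subset=10% || exit 1
#   verify_repo borg check --verify-data <repo> || exit 1
```

Because the helper returns the check's own exit status, the calling script can log or alert on it before exiting.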

Step 2: restore to a throwaway VPS

Order a small LYLIX VPS (or use a snapshot of an existing one), install the backup tool, and pull the latest backup down. Don't restore onto your production box — you want a clean environment so any missing pieces are obvious.

restic -r <repo> restore latest --target /restore

Time the restore. If it's surprisingly slow or fails partway, that's a real signal — your DR timeline is whatever this exercise reveals, not whatever you assumed.
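One way to capture that number consistently is to wrap the restore in a small timer. This is a sketch under assumptions: timed_run is a hypothetical helper name, and elapsed time is measured in whole seconds with date +%s.

```shell
# Sketch: time any command and report elapsed seconds plus exit status.
# timed_run is a hypothetical helper name.
timed_run() {
    local start end status=0
    start=$(date +%s)
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $(( end - start ))s, exit status: $status"
    return "$status"
}

# Example usage (<repo> is a placeholder):
#   timed_run restic -r <repo> restore latest --target /restore
```

A non-zero exit status here means the restore failed partway, which is a drill finding in its own right.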

Step 3: bring services up against the restored data

This is the step everyone skips and the one that catches the real failures.

  • Databases: point your database service at the restored data directory (or import the dump) and verify it starts cleanly. Run SELECT COUNT(*) against your largest tables. Do the counts match what production held when the backup was taken?
  • Web application: rsync the restored web root into place, start the app, hit it with curl. Does it serve actual content or throw 500s about missing config?
  • Mail server: restore the mail spool and Dovecot/Postfix config, start the services, deliver a test message locally, fetch it via IMAP.
  • PBX: copy the FreePBX® backup module's tarball onto the box, run fwconsole backup --restore <tarball>, sign in to the GUI, and verify extensions and trunks are present.
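The database and web checks above can be reduced to pass/fail helpers so the drill produces a clear result. A rough sketch: both function names are hypothetical, the usage lines assume PostgreSQL's psql client, and the database name, table name, and prod_count variable are placeholders.

```shell
# Sketch: pass/fail helpers for the service checks. All names are hypothetical.

# Compare a row count recorded on production with the count on the restore.
counts_match() {
    # $1 = label, $2 = production count, $3 = restored count,
    # $4 = allowed difference (optional, default 0)
    local label="$1" prod="$2" restored="$3" tolerance="${4:-0}" diff
    diff=$(( prod - restored ))
    [ "$diff" -lt 0 ] && diff=$(( 0 - diff ))
    if [ "$diff" -le "$tolerance" ]; then
        echo "OK   $label: production=$prod restored=$restored"
    else
        echo "FAIL $label: production=$prod restored=$restored"
        return 1
    fi
}

# Succeeds only if the URL answers without an HTTP error
# (curl -f exits non-zero on status 400 and above).
serves_content() {
    curl -fsS -m 10 -o /dev/null "$1"
}

# Example usage (appdb, orders, and prod_count are placeholders):
#   restored=$(psql -At -d appdb -c 'SELECT COUNT(*) FROM orders')
#   counts_match orders "$prod_count" "$restored" 500
#   serves_content http://127.0.0.1/ || echo "web app is not serving"
```

The tolerance argument exists because a live database keeps changing after the snapshot is taken, so an exact match against current production is rarely realistic.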

Step 4: write down what was missing

Inevitably, the first restore reveals something that isn't in the backup: a custom systemd unit, a per-server SSH host key, a TLS certificate's private key in an unusual path, a cron job. Add those paths to your backup config, take a fresh backup, and re-run the drill.
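Once you have that list, it can become a regression check for later drills. The sketch below assumes a plain text file with one absolute path per line (blank lines and # comments allowed) and a restore tree laid out the way restic's --target produces it, with original paths recreated under the target. The function name and file format are illustrative assumptions.

```shell
# Sketch: report must-have paths that are absent from the restore tree.
# missing_from_restore and the list-file format are assumptions.
missing_from_restore() {
    # $1 = restore root (e.g. /restore), $2 = file listing required paths
    local root="$1" list="$2" path status=0
    while IFS= read -r path; do
        # Skip blank lines and comment lines
        case "$path" in ''|'#'*) continue ;; esac
        if [ ! -e "$root$path" ]; then
            echo "MISSING $path"
            status=1
        fi
    done < "$list"
    return "$status"
}

# Example usage (required-paths.txt is a hypothetical file you maintain):
#   missing_from_restore /restore required-paths.txt || echo "backup has gaps"
```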

Coverage audit, separate from restore drill

Once a quarter, compare what's actually on your live system with what the latest snapshot contains:

# What the latest snapshot actually contains
restic -r <repo> ls latest | sort > /tmp/backed-up.txt

# What's on disk
find / -xdev -type f 2>/dev/null | sort > /tmp/on-disk.txt

# Set difference shows what's on disk but NOT in the backup
comm -23 /tmp/on-disk.txt /tmp/backed-up.txt | head -100

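You'll see /tmp, /var/cache, and similar paths; that's fine, you intentionally excluded those. Because of -xdev, find stays on the root filesystem, so /proc and /sys never appear, and neither does anything on a separately mounted filesystem such as a dedicated /home or /var; run find again for each mount you care about. What you're hunting for is the unexpected: a new application directory, a database dump dropped somewhere unconventional, a custom /opt/ install.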
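To keep the intentional excludes from burying the interesting lines, the set difference can be filtered through a list of known-ignorable patterns. A minimal sketch: unbacked_paths is a hypothetical helper, and the ignore file (one extended regular expression per line) is a file you would maintain yourself.

```shell
# Sketch: the same set difference, minus paths you exclude on purpose.
# unbacked_paths and the ignore-file format are assumptions.
unbacked_paths() {
    # $1 = sorted on-disk list, $2 = sorted backed-up list,
    # $3 = file of extended regexes (one per line) for intentional excludes
    # grep exits 1 when nothing is left to print; treat that as success.
    comm -23 "$1" "$2" | grep -Ev -f "$3" || true
}

# Example usage (ignore.txt might contain lines like ^/tmp/ and ^/var/cache/):
#   unbacked_paths /tmp/on-disk.txt /tmp/backed-up.txt ignore.txt | head -100
```

Both input lists must be sorted under the same locale, as they are when produced by the sort commands above in a single session.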

The monitoring layer

The drill catches gaps. Monitoring catches failures between drills. Whatever you back up with should ping a healthcheck endpoint on success — see the related article on self-hosted healthchecks, or use Healthchecks.io. The pattern:

restic backup ... && curl -fsS -m 10 https://hc.example.com/<uuid>

The healthcheck side alerts when it doesn't hear from the backup job within the expected window. Silent backup failures (cron job started but errored out before sending) become loud as soon as that window lapses.
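The one-liner above only reports success. If your endpoint follows the Healthchecks.io ping API, which accepts /start and /fail suffixes on the ping URL, a wrapper can also signal failures right away instead of waiting for the window to lapse. This is a sketch under that assumption; both function names are hypothetical.

```shell
# Sketch: report start, success, and failure instead of success only.
# Assumes Healthchecks.io-style /start and /fail ping URLs.
# ping_hc and run_with_ping are hypothetical helper names.
ping_hc() {
    # A monitoring outage should never fail the backup itself.
    curl -fsS -m 10 --retry 3 -o /dev/null "$1" || true
}

run_with_ping() {
    # $1 = base ping URL, remaining arguments = the backup command
    local url="$1" status=0
    shift
    ping_hc "$url/start"
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        ping_hc "$url"
    else
        ping_hc "$url/fail"
    fi
    return "$status"
}

# Example usage (URL and backup arguments are placeholders):
#   run_with_ping "https://hc.example.com/<uuid>" restic backup ...
```

Sending the start ping also lets the healthcheck service measure how long each backup run takes.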

What "verified" means going forward

After the first successful drill, document what you restored and how long it took. The next time you make a major change to your VPS — adding a new service, migrating a database, upgrading the OS — re-run the drill before you trust the backup. Backups that worked six months ago for a different system layout don't automatically still work today.
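The record does not need to be elaborate; an append-only log is enough to show how restore time and findings change between drills. A minimal sketch, with a hypothetical function name and a tab-separated format chosen purely for illustration:

```shell
# Sketch: append one tab-separated record per drill.
# record_drill and the column layout are assumptions.
record_drill() {
    # $1 = log file, $2 = snapshot ID, $3 = restore seconds, $4 = notes
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" >> "$1"
}

# Example usage (all values are placeholders):
#   record_drill /root/restore-drills.tsv 4f3c1a2b 1260 "missing /etc/cron.d/app"
```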
