# RPO vs RTO

RPO and RTO turn backup talk into measurable targets. RPO is how much data you can lose. RTO is how long you can be down. On a WordPress VPS, these two targets determine backup frequency, offsite strategy, and how much automation you need.
- RPO (Recovery Point Objective): maximum acceptable data loss measured in time.
- RTO (Recovery Time Objective): maximum acceptable downtime measured in time.
- RPO drives backup frequency and replication.
- RTO drives restore speed, local copies, and automation.
## Definitions (simple and strict)
| Metric | Question it answers | Measured in | Controlled by |
|---|---|---|---|
| RPO | "How much data can I lose?" | minutes/hours | backup interval, replication lag, offsite delay |
| RTO | "How long can I be offline?" | minutes/hours | restore speed, automation, availability design |
Examples:
- RPO = 1 hour means you accept losing up to 1 hour of content/orders.
- RTO = 2 hours means you must be back online within 2 hours.
## WordPress components and typical targets
Different parts of WordPress have different "pain" when lost.
| Component | Why it matters | Typical RPO | Typical RTO |
|---|---|---|---|
| Database | posts, users, orders, settings | tight | medium-to-tight |
| Uploads | user media and assets | medium | medium |
| Plugins/themes | code drift from updates | loose-to-medium | medium |
| Secrets/config | DB creds, salts, API keys | loose (but critical) | tight |
Notes:
- WooCommerce sites usually need tighter DB RPO than brochure sites.
- Config/secrets do not change often, but losing them blocks restores.
## How RPO maps to backup frequency
If your DB dump runs every 6 hours, the best RPO you can honestly promise is roughly 6 hours.
Reality is slightly worse because of:
- job runtime
- upload lag (offsite copies)
- failures you did not notice
Practical rule:
- RPO target should be comfortably larger than your backup interval.
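The rule above can be made concrete with a back-of-the-envelope calculation. Every number below is an illustrative assumption, not a measurement:

```bash
# Worst-case RPO is roughly: backup interval + job runtime + offsite upload lag.
# All values are illustrative; substitute your own measurements.
interval_min=360      # DB dump every 6 hours
job_runtime_min=10    # how long the dump itself takes
upload_lag_min=20     # delay until the copy lands offsite
worst_case_rpo_min=$(( interval_min + job_runtime_min + upload_lag_min ))
echo "worst-case RPO: ${worst_case_rpo_min} minutes (~$(( worst_case_rpo_min / 60 )) hours)"
```

With these assumptions, a 6-hour interval yields a worst-case RPO of 390 minutes, which is why the target should sit comfortably above the interval.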
## How RTO is built (what you actually spend time on)
RTO is not just "extract and import". It is the sum of:
- detection time (someone notices)
- decision time (choose restore point)
- provisioning time (new VPS, packages)
- data transfer time (download from offsite)
- restore time (import DB, extract files)
- validation time (health checks)
- cutover time (DNS/LB)
If you want a low RTO, you must reduce these components.
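Because the components are additive, budgeting them is simple arithmetic. A minimal sketch with assumed per-step times:

```bash
# Sum the RTO components (minutes). Every value here is an assumption;
# replace them with timings from your own drills.
detect_min=15; decide_min=10; provision_min=30; transfer_min=45
restore_min=40; validate_min=15; cutover_min=10
rto_min=$(( detect_min + decide_min + provision_min + transfer_min + restore_min + validate_min + cutover_min ))
echo "estimated RTO: ${rto_min} minutes"
```

Even with optimistic per-step numbers the total lands close to three hours, which is why shaving a single component rarely rescues a blown RTO on its own.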
## Measure your real restore time (a restore drill)

Run timed drills. Your theoretical RTO is not your actual RTO.

### Time the download

```bash
time rclone copy remote:wp-backups/site-a /tmp/restore-input --include 'wp-files-2026-03-01.tar.zst'
```

### Time the file restore

```bash
sudo rm -rf /tmp/restore-test
sudo mkdir -p /tmp/restore-test
time sudo tar --use-compress-program=zstd -xf /tmp/restore-input/wp-files-2026-03-01.tar.zst -C /tmp/restore-test
```

### Time the DB restore

```bash
time zstd -dc /tmp/restore-input/wp-db-2026-03-01.sql.zst | mysql wordpress_restore
```
Capture these timings in a log and use them when you set RTO.
## Target tiers (a practical way to think)
These tiers are examples to calibrate expectations.
| Tier | Example RPO | Example RTO | Typical approach |
|---|---|---|---|
| Basic | 24h | 12h | daily backups, manual restore |
| Standard | 4-6h | 2-4h | frequent DB dumps + local copies + tested runbooks |
| High availability | minutes | minutes | replication + automated failover + continuous monitoring |
## How backup design affects RPO and RTO

### Local + offsite
Local copies reduce restore time (RTO) because you do not need to download.
Offsite copies reduce the chance of total loss (helps meet RPO in a disaster where local is gone).
See:

- opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/local-vs-remote-backup.mdx
- opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/offsite--local-redundancy.mdx
### Encryption
Encryption protects data, but increases restore steps:
- decrypt
- then decompress
- then restore
This can slightly increase RTO. Measure it.
```bash
time gpg --decrypt /backups/wp-db-2026-03-01.sql.zst.gpg | zstd -dc | mysql wordpress_restore
```
### Retention
Retention interacts with RPO:
- if you keep only 7 days of daily dumps, you cannot restore to "last month"
- if pruning deletes baselines, incremental restores may fail
See:

- opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/rotation--retention-policies.mdx
## Example: two-VPS redundant architecture (impact on RPO/RTO)
This is a common design when you want near-continuous service.
Assumptions:
- VPS-A is production.
- VPS-B is standby.
- files are synced (rsync) on a schedule.
- database changes are replicated.
- a load balancer can switch traffic.
**VPS-A (primary)**

- WordPress files
- MySQL primary

**VPS-B (standby)**

- rsync copies of files
- MySQL replica

**Traffic**

- load balancer health checks
- failover to VPS-B
### Expected RPO and RTO
| Metric | What drives it | Example outcome |
|---|---|---|
| RPO | replication lag + rsync interval | minutes (if replication is healthy) |
| RTO | health check + failover + warm services | minutes (if automation is correct) |
### Benefits
- Very low downtime during a single-node failure.
- Smaller restore operations (often failover instead of full restore).
- Can be positioned as a higher tier offering to clients.
### Limitations and risks
- More moving parts: replication health, file sync health, LB health checks.
- Split-brain risk during failback if you are not disciplined.
- Higher costs (two servers, monitoring, paid LB features).
- Higher security surface area (two hosts to harden).
### Setup difficulty (engineering estimate)
This is intentionally conservative:
| Component | Difficulty | Why |
|---|---|---|
| rsync file sync | medium | excludes, permissions, delete semantics, verification |
| DB replication | high | binlogs, lag monitoring, promotion/failback |
| automated failover | medium-to-high | health checks, TLS, caching, cutover behavior |
| operations | high | monitoring, drills, incident handling |
This design improves RPO/RTO, but it is closer to availability engineering than "backups".
## Client communication (how to describe targets)
Keep language concrete:
- "We take DB dumps every 6 hours" (backup frequency)
- "We test restores monthly" (validation)
- "We can restore from offsite within ~X hours" (measured RTO)
Avoid promising numbers you have not measured.
## An RPO/RTO worksheet
Use this worksheet to define targets and constraints.
### Inputs
| Input | Example | Notes |
|---|---|---|
| Files archive size | 20 GB | affects download/extract time |
| DB dump size (compressed) | 800 MB | affects download/import time |
| Offsite bandwidth | 80 Mbps | affects download time |
| Restore operator | on-call | affects detection/decision |
| DNS/LB cutover | LB | affects traffic switch |
### Measured times (fill in from drills)
| Step | Your measured time |
|---|---|
| download files archive | |
| extract files | |
| download DB dump | |
| import DB | |
| validation and smoke tests | |
### Targets
| Metric | Target |
|---|---|
| RPO | |
| RTO | |
If your measured times exceed your target RTO, you must change the design (local copy, faster format, automation, or HA).
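Transfer time, in particular, can be estimated directly from the worksheet inputs. A sketch using the example values above (20 GB archive, 80 Mbps offsite bandwidth):

```bash
# Rough download-time estimate: size in GB -> megabits, divided by bandwidth.
# Ignores protocol overhead and throttling, so treat it as a lower bound.
size_gb=20
bandwidth_mbps=80
seconds=$(( size_gb * 8 * 1000 / bandwidth_mbps ))
echo "download estimate: ${seconds}s (~$(( seconds / 60 )) minutes)"
```

At 80 Mbps, the 20 GB files archive alone costs about 33 minutes of your RTO budget before extraction even starts.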
## A restore drill script you can reuse
This creates a repeatable baseline for measuring RTO components.
```bash
#!/usr/bin/env bash
set -euo pipefail

FILES_ARCHIVE="/backups/wp-files-2026-03-01.tar.zst"
DB_DUMP="/backups/wp-db-2026-03-01.sql.zst"

echo "[$(date -Is)] start restore drill"

echo "[$(date -Is)] extract files"
sudo rm -rf /tmp/restore-test
sudo mkdir -p /tmp/restore-test
time sudo tar --use-compress-program=zstd -xf "$FILES_ARCHIVE" -C /tmp/restore-test

echo "[$(date -Is)] restore db"
time zstd -dc "$DB_DUMP" | mysql wordpress_restore

echo "[$(date -Is)] verify layout"
sudo find /tmp/restore-test -maxdepth 3 -type d -name wp-content -print
sudo find /tmp/restore-test -maxdepth 3 -type f -name wp-config.php -print

echo "[$(date -Is)] done"
```

This imports into `wordpress_restore`. Do not point restore drills at production databases.
## Common mistakes
- Setting RPO/RTO without measuring restores.
- Ignoring offsite copy lag.
- Storing backups offsite without encryption.
- Assuming "two servers" automatically means low RTO (failover must be tested).
## RPO math (practical examples)
RPO is primarily limited by how often you capture data and how reliably you move it offsite.
### Example: DB dump cadence
| DB dump interval | Best-case RPO | Realistic RPO (with failures) |
|---|---|---|
| every 24h | ~24h | 24h+ (if one dump fails) |
| every 6h | ~6h | 6h+ |
| every 1h | ~1h | 1h+ |
If your dumps run every 6 hours but your offsite upload runs once per day, disaster RPO can still be close to 24 hours (because the VPS may die before the offsite copy completes).
### Separate cadences for DB and files
This is common on WordPress:
- database dumps: frequent
- file snapshots: less frequent
Example:

- DB dumps: every 2 hours
- File snapshots: daily
- Full baseline archive: weekly
- Offsite upload: after every backup run
This keeps DB RPO tighter than file RPO, which is often acceptable.
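One way to implement these cadences is plain cron. The script paths below are hypothetical placeholders, and each script is assumed to upload its artifact offsite as its final step:

```
# Illustrative crontab (paths are placeholders, not a real layout)
0 */2 * * * /usr/local/bin/wp-db-dump.sh          # DB dump every 2 hours
30 3 * * *  /usr/local/bin/wp-files-snapshot.sh   # file snapshot daily at 03:30
15 2 * * 0  /usr/local/bin/wp-full-baseline.sh    # full baseline weekly (Sunday)
```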
## How to reduce RTO (without changing business scope)
If your measured RTO is too high, you usually need to reduce one of these components:
### Provisioning time
- Keep an infrastructure checklist (packages, versions, configs).
- Use scripts or automation to install your stack.
- Keep your web server and PHP-FPM config under version control (not secrets).
### Data transfer time

- Keep at least one local copy (fast restores).
- Use faster formats (`zstd`) for operational backups.
- If offsite is required, ensure you have enough bandwidth and keep artifacts reasonably sized.
### Restore execution time
- Practice restores so your steps are deterministic.
- Keep artifacts in predictable locations.
- Avoid manual decision making under pressure.
## High availability is not a backup
HA reduces downtime, but it does not automatically protect you from:
- accidental deletion (deletion replicates)
- corruption (corruption replicates)
- malware (malware replicates)
Use HA to improve RTO, and backups to improve both RPO and recoverability.
- Backups: restore to a previous point in time.
- HA: keep service available across node failures.
You usually need both for critical sites.
## Two-VPS design: operational runbook notes
If you implement primary/standby, document failover and failback.
### Signals you need to monitor
- file sync freshness (last rsync success time)
- replication health (replica running, lag)
- LB health checks
- backup job success (still needed)
Example checks:
```bash
ssh backup@backup-host 'ls -lah /srv/wp-backups/site-a | sed -n "1,40p"'
```
Replication health checks vary by MySQL/MariaDB version, but you must track:
- replica IO thread state
- replica SQL thread state
- seconds behind source (lag)
```bash
mysql -e "SHOW REPLICA STATUS\G" | sed -n '1,120p'
```
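A small wrapper makes the lag check scriptable. This is a sketch that assumes the MySQL 8 field name `Seconds_Behind_Source` (older servers report `Seconds_Behind_Master`) and an example 60-second threshold:

```bash
# check_lag reads `SHOW REPLICA STATUS\G` output on stdin and reports state.
# Threshold and field name are assumptions; adjust for your server version.
check_lag() {
  max_lag=60
  lag=$(awk '/Seconds_Behind_Source/ {print $2}')
  if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
    echo "CRITICAL: replication not running"
  elif [ "$lag" -gt "$max_lag" ]; then
    echo "WARNING: replica is ${lag}s behind"
  else
    echo "OK: lag ${lag}s"
  fi
}

# Usage (against a live replica):
#   mysql -e "SHOW REPLICA STATUS\G" | check_lag
```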
### Failover checklist
- Confirm primary is unhealthy (not just slow).
- Confirm standby has recent files and an acceptable replication lag.
- Promote standby (DB + app) using your documented steps.
- Switch traffic.
- Announce and log the incident.
### Failback checklist
Failback is where split-brain happens if you are not careful.
- Decide which node is authoritative.
- Freeze writes on the old primary.
- Re-sync files in the correct direction.
- Rebuild DB replication in the correct direction.
- Only then allow the old primary to serve traffic.
If both nodes accept writes at the same time, you can create diverging databases and lose data. Document your promotion/failback steps and test them.
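"Freeze writes" has a concrete form. A sketch for MySQL; note that `super_read_only` exists on MySQL 5.7+ but not on MariaDB, where `read_only` alone has to do:

```bash
# Run on the OLD primary before re-syncing anything.
mysql -e "SET GLOBAL read_only = ON"
# MySQL 5.7+ only: also block accounts with SUPER privileges.
mysql -e "SET GLOBAL super_read_only = ON"
```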
## Point-in-time recovery (advanced)
If you need a very tight RPO for the database, full dumps may not be enough.
Options include:
- MySQL/MariaDB replication (standby)
- binary logs (replay changes between dumps)
This is a deeper operational topic, but the high-level model is:

1. Take a full dump at T0.
2. Binary logs capture all changes after T0.
3. To restore, import the dump, then replay binlogs up to the target time.
PITR improves RPO but increases complexity and requires careful testing.
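The restore path can be sketched with standard tooling. The binlog file names and the cutoff timestamp below are illustrative; `mysqlbinlog --stop-datetime` replays events only up to the given time:

```bash
# 1) Import the last full dump taken before the incident.
zstd -dc /backups/wp-db-2026-03-01.sql.zst | mysql wordpress_restore

# 2) Replay binlog events recorded after the dump, stopping just before
#    the bad change. Paths and timestamp are examples.
mysqlbinlog --stop-datetime="2026-03-01 13:45:00" \
  /var/lib/mysql/binlog.000042 /var/lib/mysql/binlog.000043 \
  | mysql wordpress_restore
```

In practice the dump must also record its binlog position (for example with `mysqldump --source-data`) so you know which events to replay; test this end to end before relying on it.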
## Pricing and client value (how to think, not a quote)
If you sell managed recovery targets to clients, price the operational reality:
- number of sites
- data size and change rate
- retention period
- offsite storage costs
- frequency of restore drills
- on-call expectations
It is reasonable to offer tiers aligned to RPO/RTO targets. The important part is that the targets are measurable and tested.
## Post-restore smoke checks (reduce surprise downtime)
After you restore, validate the things that commonly break:
- HTTP returns expected status codes
- database connectivity works
- wp-admin login works
- uploads load
- cron is running (or intentionally disabled)
```bash
curl -fsS -o /dev/null -w 'status=%{http_code} time_total=%{time_total}\n' http://127.0.0.1/
curl -fsS -o /dev/null -w 'status=%{http_code} time_total=%{time_total}\n' http://127.0.0.1/wp-login.php
mysql -e "SELECT 1" >/dev/null
```
## Reference tier examples (inputs -> targets)
| Site profile | Suggested RPO | Suggested RTO | Notes |
|---|---|---|---|
| Personal blog | 24h | 12h | low change rate |
| Small business | 6h | 4h | basic operations |
| WooCommerce | 1h | 1-2h | transactions matter |
| Membership/news | 1h | 1h | frequent updates |
Treat these as starting points. Measure restores and adjust.
## Next steps

- Backup types and schedules: opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/full-vs-incremental-vs-differential.mdx
- 3-2-1 design: opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/321-backup-strategy-for-wp.mdx
- Disaster recovery workflow: opt/docker-data/apps/docusaurus/site/docs/server/linux-server/10-backup-disaster-recovery/disaster-recovery-workflow.mdx