Case Study

This case study demonstrates a realistic "site is slow" incident on a WordPress VPS and how to approach it with a repeatable workflow.

Quick Summary
  • Capture a snapshot first.
  • Identify whether the server is CPU-bound, memory-bound, disk I/O-bound, or out of disk space.
  • Fix the cause, then confirm improvement.

Scenario

  • Users report the site hangs intermittently.
  • SSH is responsive, but pages take 10-30 seconds to load.
  • The server runs Nginx, PHP-FPM, and MySQL.

Step 1: take a snapshot

case-study-snapshot.sh
date -Is
uptime
free -h
df -hT /
ps -eo pid,user,pcpu,pmem,etime,cmd --sort=-pcpu | head -n 15
vmstat 1 5

Interpretation example:

  • Load is high.
  • CPU utilization is not extremely high.
  • vmstat shows high wa (I/O wait).

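High load combined with modest CPU use and high wa means processes are queued waiting on disk rather than computing. This points to storage pressure.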
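To make that reading less subjective, the wa column can be averaged across the samples. The following is a minimal sketch, not part of the original snapshot script: the helper name avg_iowait is hypothetical, and it assumes the procps vmstat layout in which a header row starting with r names each column, including wa.

```shell
# Hypothetical helper: average the "wa" (I/O wait) column of vmstat output read from stdin.
# Assumes the procps layout: a header row starting with "r" that names each column.
avg_iowait() {
  awk '
    # Header row: remember which field holds "wa".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
    # Data rows: skip the first sample, which reports averages since boot.
    col && $1 ~ /^[0-9]+$/ { if (++n > 1) { sum += $col; cnt++ } }
    END { if (cnt) printf "%.1f\n", sum / cnt; else print "n/a" }
  '
}

# Usage: vmstat 1 5 | avg_iowait
```

What counts as "high" depends on the workload; a sustained double-digit average while CPU use stays low is a reasonable cue to move on to the disk checks.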

Step 2: confirm disk saturation

case-study-iostat.sh
iostat -x 1 10

Interpretation example:

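  • %util for the device is near 100%.
  • await (r_await and w_await on newer sysstat versions) is high, so requests spend a long time queued before completing.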
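A small filter can pull the saturated devices out of that output. This is a sketch under assumptions: the function name busy_devices and the 90% default threshold are illustrative choices, and it relies on the sysstat layout where each device table begins with a header row starting with Device and containing a %util column.

```shell
# Hypothetical helper: print devices whose %util meets a threshold (default 90).
# Reads `iostat -x` output on stdin and locates the %util column by its header name.
busy_devices() {
  threshold="${1:-90}"
  awk -v t="$threshold" '
    # Header row of a device table: find the %util column.
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # Device rows: rows with fewer fields (such as the avg-cpu lines) are ignored.
    col && NF >= col && $col + 0 >= t { print $1, $col }
  '
}

# Usage: iostat -x 1 10 | busy_devices 90
```

The first iostat report covers the period since boot, so a device that only shows up in later intervals is the more telling sign of a current problem.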

Step 3: find the process doing I/O

case-study-iotop.sh
sudo iotop -o -P -a

Interpretation example:

  • A backup job (tar + compression) is writing heavily.
  • It started at the same time complaints began.
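To check the timing rather than rely on memory, the start time of the suspect process can be read from ps. A minimal sketch, assuming a procps-style ps that supports the lstart and args output fields; proc_started is a hypothetical helper name.

```shell
# Hypothetical helper: print when a process started, followed by its full command line.
# lstart is the absolute start time; compare it with the time of the first complaint.
proc_started() {
  ps -o lstart= -o args= -p "$1"
}

# Usage: proc_started 12345
```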

Step 4: reduce impact safely

Instead of killing the job immediately, lower its CPU and I/O priority:

case-study-reduce-backup-impact.sh
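# Replace with the PID of the backup process reported by iotop.
PID=12345
# Lower CPU priority (higher nice value means lower priority).
sudo renice -n 10 -p "$PID"
# Best-effort I/O class (2) at the lowest priority level (7).
sudo ionice -c2 -n7 -p "$PID"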

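If the box recovers quickly, you have validated the primary culprit. If nothing changes, keep in mind that ionice only has an effect under I/O schedulers that honor priorities, such as BFQ or the older CFQ.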
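The same two commands can be wrapped with a sanity check and a confirmation step, so that a typo in the PID is caught before anything is changed. This is a sketch only: deprioritize is a hypothetical name, the existence check relies on the Linux /proc filesystem, and the priority values mirror the ones used above.

```shell
# Hypothetical wrapper: lower CPU and I/O priority of one PID, then show the result.
deprioritize() {
  pid="$1"
  # Reject anything that is not a plain number.
  case "$pid" in
    ''|*[!0-9]*) echo "usage: deprioritize PID" >&2; return 2 ;;
  esac
  # Linux-specific existence check via /proc.
  if [ ! -d "/proc/$pid" ]; then
    echo "no such process: $pid" >&2
    return 1
  fi
  sudo renice -n 10 -p "$pid" || return 1
  sudo ionice -c2 -n7 -p "$pid" || return 1
  # Confirm: NI should now read 10; ionice -p prints the I/O class and level.
  ps -o pid,ni,args -p "$pid"
  ionice -p "$pid"
}

# Usage: deprioritize 12345
```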

Step 5: fix the scheduling

Typical fixes:

  • Move backups off-peak.
  • Reduce compression level.
  • Exclude caches and nested archives.
  • Use a faster compressor such as zstd (tar.zst) for operational backups.

Example: change a backup command to be less disruptive:

case-study-less-disruptive-backup.sh
nice -n 10 ionice -c2 -n7 tar -czf "/backups/site-$(date +%F).tar.gz" /var/www/html
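Several of the typical fixes can be combined in one command. The sketch below is illustrative rather than a recommended production script: backup_site is a hypothetical helper, the wp-content/cache path is an assumption about where a cache plugin writes its files, and gzip -1 stands in for "reduce compression level" because it is available almost everywhere.

```shell
# Hypothetical helper: archive SRC into DEST with low priority, cheap compression,
# and excludes for cache files and nested archives.
backup_site() {
  src="$1"
  dest="$2"
  # Always lower CPU priority; lower I/O priority only where ionice is installed.
  prefix="nice -n 10"
  if command -v ionice >/dev/null 2>&1; then
    prefix="$prefix ionice -c2 -n7"
  fi
  # Excludes come before the paths so they apply to everything archived.
  # wp-content/cache is an assumed cache location; adjust it to the actual site.
  $prefix tar --exclude='wp-content/cache' --exclude='*.tar.gz' -cf - -C "$src" . \
    | $prefix gzip -1 > "$dest"
}

# Usage: backup_site /var/www/html "/backups/site-$(date +%F).tar.gz"
```

Keeping DEST outside SRC avoids the archive trying to include itself. Moving the job off-peak is then a matter of changing when the scheduler calls it.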

Step 6: document and verify

Re-run the load and I/O wait checks from the original snapshot and compare them with the first capture:

case-study-verify-improvement.sh
uptime
vmstat 1 5

If wa drops and page load times return to normal, the incident is resolved. Record the before and after numbers together with the change you made.
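Comparing is easier when both captures are saved to files. A minimal sketch, assuming the current directory is writable; save_snapshot and the file naming scheme are hypothetical, and the commands are the same ones used in the snapshot step.

```shell
# Hypothetical helper: write a labelled snapshot to a file and print the file name.
# Second argument is the number of vmstat samples (default 5).
save_snapshot() {
  label="$1"
  samples="${2:-5}"
  out="snapshot-${label}-$(date +%Y%m%dT%H%M%S).txt"
  {
    date -Is
    uptime
    vmstat 1 "$samples"
  } > "$out" 2>&1 || echo "warning: a snapshot command failed, see $out" >&2
  echo "$out"
}

# Usage:
#   before=$(save_snapshot before)   # during the incident
#   after=$(save_snapshot after)     # once the fix is in place
#   diff "$before" "$after"
```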

What this case teaches

  • High load does not always mean "CPU is the problem".
  • Backups and compression are common I/O culprits.
  • Deprioritizing work can be safer than killing it.
  • Capturing snapshots before and after makes fixes defensible.