Monitoring Health
Health monitoring answers a simple question: "Is the server still serving useful work?" This page focuses on lightweight checks you can run on any VPS (even without a full metrics stack) and how to interpret them safely.
- Monitor: uptime/load, memory pressure, disk space/inodes, disk I/O wait, and critical services.
- Alert on trends, not single spikes.
- Verify from the outside (HTTP) and the inside (systemd + logs).
Pick a small set of signals
If you monitor too many things, alerts turn into noise and get ignored. Start with these:
- Load average and CPU usage (overall pressure).
- Memory and swap activity (risk of OOM or thrash).
- Disk space and inode usage (risk of services failing).
- Disk I/O wait (a common hidden bottleneck).
- Service health (nginx/apache2/openlitespeed, php-fpm, mysql/mariadb).
- External HTTP check to a known endpoint.
Define "healthy" for your server
Healthy thresholds depend on your VPS size and traffic shape. Use these as starting points:
- Disk usage: alert at 80%, page at 90%.
- Inodes: alert at 80%, page at 90%.
- Swap: alert if swap-in/swap-out is sustained for several minutes.
- Load: investigate when load is persistently above CPU core count (example: 8+ on an 8 vCPU box).
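The starting thresholds above can be turned into a small check. The following is a minimal sketch, not a finished monitor: the function names (classify_pct, check_mounts, load_over_cores) are made up for illustration, and the usage lines assume GNU df with its --output option. Passing the 5-minute load average rather than the 1-minute value is one rough way to approximate "persistently".

```shell
#!/usr/bin/env bash
# Sketch only: classify usage percentages against the starting thresholds above.
# classify_pct, check_mounts and load_over_cores are illustrative names.

# Print ok / alert / page for an integer percentage (alert at 80, page at 90).
classify_pct() {
  if [ "$1" -ge 90 ]; then echo page
  elif [ "$1" -ge 80 ]; then echo alert
  else echo ok
  fi
}

# Read "percent mountpoint" pairs on stdin and print one verdict per mount.
# Non-numeric values are skipped (df prints "-" where inodes are not reported).
check_mounts() {
  local label pct mount
  label=$1
  while read -r pct mount; do
    pct=${pct%\%}
    case $pct in ''|*[!0-9]*) continue ;; esac
    echo "$label $mount ${pct}% $(classify_pct "$pct")"
  done
}

# Exit 0 when the given load value is above the core count, 1 otherwise.
load_over_cores() {
  awk -v l="$1" -v c="$2" 'BEGIN { exit !(l > c) }'
}

# Usage (GNU df; tail drops the header row; field 2 of loadavg is the 5-minute load):
#   df --output=pcent,target | tail -n +2 | check_mounts disk
#   df --output=ipcent,target | tail -n +2 | check_mounts inodes
#   load_over_cores "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)" && echo "load above core count"
```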
A simple health snapshot command
Run this manually during an incident, or as part of an automated check.
set -eu
echo "==== time ===="
date -Is
echo "==== uptime / load ===="
uptime
echo "==== memory ===="
free -h
echo "==== swap ===="
swapon --show || true
echo "==== disk space / inodes ===="
df -hT
df -ih
echo "==== vmstat (pressure hints) ===="
vmstat 1 5
echo "==== top cpu ===="
ps -eo pid,user,pcpu,pmem,etime,cmd --sort=-pcpu | head -n 15
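The vmstat part of the snapshot is the hardest to read at a glance. As a rough aid, the sketch below averages the si (swap-in), so (swap-out) and wa (I/O wait) columns. summarize_vmstat is a hypothetical helper name; the column names are looked up in the header row, since their positions can differ between vmstat versions. Sustained non-zero si/so values point at swapping, and a high wa value alongside high load points at disk I/O.

```shell
#!/usr/bin/env bash
# Sketch only: average the si, so and wa columns of vmstat output.
# summarize_vmstat is an illustrative name. The first sample row is skipped
# because vmstat reports averages since boot on that row.
summarize_vmstat() {
  awk '
    $1 == "r" {                          # header row: map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("wa" in col) && $1 ~ /^[0-9]+$/ {   # numeric sample rows after the header
      if (++rows == 1) next              # skip the since-boot row
      si += $(col["si"]); so += $(col["so"]); wa += $(col["wa"]); n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "si=%.1f so=%.1f wa=%.1f samples=%d\n", si / n, so / n, wa / n, n
    }'
}

# Usage:
#   vmstat 1 5 | summarize_vmstat
```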
External HTTP health check
The fastest way to catch a total outage is to request a page over HTTP from outside the server, along the same network path your users take.
curl -fsS -o /dev/null -w 'status=%{http_code} time_total=%{time_total}\n' https://example.com/
Tips:
- Prefer a dedicated endpoint that avoids heavy database work.
- If you have a load balancer or CDN, ensure your health check path is not cached.
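For an automated check, it helps to bound how long the request may take and to reduce the result to a single verdict. The sketch below is one possible shape, not a recommended standard: probe_url and classify_probe are hypothetical names, and the 10-second timeout and 2-second "slow" threshold are assumptions to tune for your site.

```shell
#!/usr/bin/env bash
# Sketch only: add a timeout and a simple verdict to the curl check.
# probe_url and classify_probe are illustrative names. The 10 s timeout and the
# 2 s "slow" threshold are assumptions to tune for your site.

# Print "<http_code> <time_total>". curl reports code 000 when no response arrived.
probe_url() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code} %{time_total}\n' "$1" || true
}

# Turn a status code and a duration in seconds into ok / slow / down.
classify_probe() {
  local code secs slow_after
  code=${1:-000}; secs=${2:-0}; slow_after=${3:-2}
  case $code in
    [23][0-9][0-9]) ;;                 # 2xx and 3xx count as reachable
    *) echo down; return 0 ;;
  esac
  if awk -v t="$secs" -v s="$slow_after" 'BEGIN { exit !(t > s) }'; then
    echo slow
  else
    echo ok
  fi
}

# Usage (left unquoted on purpose so the two fields become two arguments):
#   classify_probe $(probe_url https://example.com/)
```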
Internal service checks (systemd)
Use systemctl to determine whether a service is healthy and whether it is restarting.
systemctl is-active --quiet nginx && echo "nginx=active" || echo "nginx=NOT_ACTIVE"
systemctl is-active --quiet php8.2-fpm && echo "php-fpm=active" || echo "php-fpm=NOT_ACTIVE"
systemctl is-active --quiet mysql && echo "mysql=active" || echo "mysql=NOT_ACTIVE"
systemctl --no-pager --full status nginx || true
If service names differ on your host, list them:
systemctl list-units --type=service --all | grep -E 'nginx|apache|php|mysql|maria|openlitespeed'
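is-active alone does not reveal a service that keeps crashing and coming back. One way to spot that is the NRestarts property from systemctl show, which counts automatic restarts triggered by the unit's Restart= setting. The sketch below assumes a systemd version recent enough to expose NRestarts (believed to be 235 or later); summarize_unit is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: report state and automatic restart count for one unit.
# summarize_unit is an illustrative name. On systemd versions without the
# NRestarts property, the restart count is printed as "unknown".

# Read KEY=VALUE lines as printed by `systemctl show` and print one summary line.
summarize_unit() {
  local unit key value active sub restarts
  unit=$1; active=unknown; sub=unknown; restarts=unknown
  while IFS='=' read -r key value; do
    case $key in
      ActiveState) active=$value ;;
      SubState)    sub=$value ;;
      NRestarts)   restarts=$value ;;
    esac
  done
  echo "$unit state=$active/$sub restarts=$restarts"
}

# Usage:
#   systemctl show nginx -p ActiveState -p SubState -p NRestarts | summarize_unit nginx
```

A restart count that grows between two checks suggests the service is flapping even though it reports as active.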
Turn checks into a recurring report
If you are not running Prometheus/Grafana yet, a systemd timer + journal logs can still give you a timeline.
Create a health script
sudo tee /usr/local/bin/healthcheck-snapshot >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
echo "==== time ===="
date -Is
uptime
free -h
df -hT /
df -ih /
vmstat 1 3
EOF
sudo chmod +x /usr/local/bin/healthcheck-snapshot
Create a systemd service and timer
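Save the service unit as /etc/systemd/system/healthcheck-snapshot.service:
[Unit]
Description=Periodic health snapshot
[Service]
Type=oneshot
ExecStart=/usr/local/bin/healthcheck-snapshot
Save the timer unit as /etc/systemd/system/healthcheck-snapshot.timer:
[Unit]
Description=Run health snapshot every 5 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
[Install]
WantedBy=timers.target
Reload systemd, then enable and start the timer: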
sudo systemctl daemon-reload
sudo systemctl enable --now healthcheck-snapshot.timer
systemctl list-timers --all | grep healthcheck-snapshot
View collected snapshots:
journalctl -u healthcheck-snapshot.service --since '2 hours ago' --no-pager
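To turn the collected snapshots into a compact timeline, the journal output can be reduced to one value per run. The sketch below keeps only the timestamp and the 1-minute load average. load_timeline is a hypothetical helper name, and it assumes journalctl's short-iso output format (timestamp as the first field) and the English uptime format.

```shell
#!/usr/bin/env bash
# Sketch only: reduce journal lines to "<timestamp> <1-minute load>" pairs.
# load_timeline is an illustrative name. It assumes `journalctl -o short-iso`
# (timestamp is the first field) and the English `uptime` output format.
load_timeline() {
  sed -n -E 's/^([^ ]+) .*load average: ([0-9.]+),.*$/\1 \2/p'
}

# Usage:
#   journalctl -u healthcheck-snapshot.service --since '2 hours ago' \
#     --no-pager -o short-iso | load_timeline
```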
Every timer run adds background work. Keep the script lightweight and avoid expensive commands (a full du, large log scans) at short intervals.
Next steps
- If the server is up but slow: see [Process control](./process-control).
- If load is high with low CPU utilization: see [Disk I/O troubleshooting](./disk-io-troubleshooting).
- If you need a timeline: see [Historical performance stats](./historical-performance-stats).