Monitoring Health

Health monitoring answers a simple question: "Is the server still serving useful work?" This page focuses on lightweight checks you can run on any VPS (even without a full metrics stack) and how to interpret them safely.

Quick Summary
  • Monitor: uptime/load, memory pressure, disk space/inodes, disk I/O wait, and critical services.
  • Alert on trends, not single spikes.
  • Verify from the outside (HTTP) and the inside (systemd + logs).

Pick a small set of signals

If you monitor too many things, you will start ignoring alerts. Start with these:

  • Load average and CPU usage (overall pressure).
  • Memory and swap activity (risk of OOM or thrash).
  • Disk space and inode usage (risk of services failing).
  • Disk I/O wait (a common hidden bottleneck).
  • Service health (nginx/apache2/openlitespeed, php-fpm, mysql/mariadb).
  • External HTTP check to a known endpoint.

Define "healthy" for your server

Healthy thresholds depend on your VPS size and traffic shape. Use these as starting points:

  • Disk usage: alert at 80%, page at 90%.
  • Inodes: alert at 80%, page at 90%.
  • Swap: alert if swap-in/swap-out is sustained for several minutes.
  • Load: investigate when load is persistently above CPU core count (example: 8+ on an 8 vCPU box).
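
The thresholds above can be sketched as two small shell helpers. This is an illustrative sketch, not a finished alerting tool: the function names `classify_pct` and `sustained_above` are hypothetical, the 80/90 defaults simply mirror the starting points listed above, and samples are assumed to be integers.

```shell
# Sketch only: function names and defaults are illustrative assumptions.

# classify_pct PCT [ALERT] [PAGE] -> prints ok, alert, or page.
# PCT may carry a trailing "%" as printed by df (for example "85%").
classify_pct() {
  local pct alert page
  pct=${1%\%}
  alert=${2:-80}
  page=${3:-90}
  if [ "$pct" -ge "$page" ]; then
    echo page
  elif [ "$pct" -ge "$alert" ]; then
    echo alert
  else
    echo ok
  fi
}

# sustained_above LIMIT MIN_RUN SAMPLE... -> exit status 0 if at least
# MIN_RUN consecutive integer samples are strictly above LIMIT.
# This is one way to alert on a trend rather than a single spike.
sustained_above() {
  local limit min_run run s
  limit=$1
  min_run=$2
  run=0
  shift 2
  for s in "$@"; do
    if [ "$s" -gt "$limit" ]; then
      run=$((run + 1))
      if [ "$run" -ge "$min_run" ]; then
        return 0
      fi
    else
      run=0
    fi
  done
  return 1
}

# Possible usage (assumes GNU df, which supports --output):
#   df --output=pcent,target -x tmpfs -x devtmpfs | tail -n +2 |
#     while read -r pcent target; do
#       echo "$target $(classify_pct "$pcent")"
#     done
#   sustained_above "$(nproc)" 3 9 10 11 && echo "load persistently high"
```

Requiring several consecutive breaches keeps a single spike from triggering an alert. How many samples count as "sustained" depends on your sampling interval.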

A simple health snapshot command

Run this manually during an incident, or as part of an automated check.

health-snapshot.sh
set -eu

echo "==== time ===="
date -Is

echo "==== uptime / load ===="
uptime

echo "==== memory ===="
free -h

echo "==== swap ===="
swapon --show || true

echo "==== disk space / inodes ===="
df -hT
df -ih

echo "==== vmstat (pressure hints) ===="
vmstat 1 5

echo "==== top cpu ===="
ps -eo pid,user,pcpu,pmem,etime,cmd --sort=-pcpu | head -n 15
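
The snapshot prints raw `vmstat` output, which still has to be read by eye. As a rough sketch, a single column such as `wa` (I/O wait) or `si`/`so` (swap-in/swap-out) can be averaged with `awk`. The helper name `avg_column` is hypothetical. It finds the column by its header name rather than by position, because the column layout can differ between `vmstat` versions.

```shell
# Sketch only: avg_column is an illustrative helper, not a standard tool.

# avg_column NAME: read `vmstat INTERVAL COUNT` output on stdin and print
# the truncated integer average of the named column. The first sample is
# skipped because vmstat reports averages since boot on that line.
avg_column() {
  awk -v name="$1" '
    NR == 2 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    NR > 3 && col { sum += $col; n++ }
    END { if (n) printf "%d\n", sum / n; else exit 1 }
  '
}

# Possible usage:
#   vmstat 1 5 | avg_column wa    # average I/O wait over 4 samples
#   vmstat 1 5 | avg_column so    # average swap-out activity
```

The function exits non-zero when the column name is not found or when there are fewer than two samples, so a caller can tell "no data" apart from an average of zero.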

External HTTP health check

The fastest way to catch a total outage is to check HTTP from outside the server, ideally from a network location similar to the one your users connect from.

http-health-check.sh
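# --max-time keeps the check from hanging if the server accepts the connection but never responds
curl -fsS --max-time 10 -o /dev/null -w 'status=%{http_code} time_total=%{time_total}\n' https://example.com/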

Tips:

  • Prefer a dedicated endpoint that avoids heavy database work.
  • If you have a load balancer or CDN, ensure your health check path is not cached.
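
A status code alone misses the "up but slow" case. The sketch below turns a status code and response time into a single verdict. The helper name `http_verdict` and the 2-second default are assumptions chosen for illustration. 3xx responses are treated as healthy, which you may want to change if redirects indicate a problem on your site.

```shell
# Sketch only: http_verdict and its default threshold are illustrative.

# http_verdict STATUS SECONDS [MAX_SECONDS] -> prints ok, slow, or down.
http_verdict() {
  local status seconds max
  status=$1
  seconds=$2
  max=${3:-2}
  case "$status" in
    2??|3??) ;;                      # success or redirect: check latency next
    *) echo down; return 0 ;;        # 4xx, 5xx, or 000 (no response at all)
  esac
  # awk handles the fractional comparison that sh integer tests cannot.
  if awk -v t="$seconds" -v m="$max" 'BEGIN { exit !(t > m) }'; then
    echo slow
  else
    echo ok
  fi
}

# Possible usage (the URL is a placeholder):
#   out=$(curl -sS --max-time 10 -o /dev/null \
#           -w '%{http_code} %{time_total}' https://example.com/) || true
#   set -- $out
#   http_verdict "$1" "$2"
```

When curl cannot connect, it still prints `000` for `%{http_code}`, which is why the usage example ignores curl's exit status and lets the function decide.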

Internal service checks (systemd)

Use systemctl to determine whether a service is healthy and whether it is restarting.

systemd-health-checks.sh
systemctl is-active --quiet nginx && echo "nginx=active" || echo "nginx=NOT_ACTIVE"
systemctl is-active --quiet php8.2-fpm && echo "php-fpm=active" || echo "php-fpm=NOT_ACTIVE"
systemctl is-active --quiet mysql && echo "mysql=active" || echo "mysql=NOT_ACTIVE"

systemctl --no-pager --full status nginx || true
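
`is-active` can report a service as active even when it is crashing and being restarted repeatedly. One way to spot that, assuming a systemd version that exposes the `NRestarts` property (to my knowledge 235 or later), is to compare the restart counter between checks. The helper names below are hypothetical, and the counter is expected to reflect only automatic restarts triggered by the unit's `Restart=` setting.

```shell
# Sketch only: helper names are illustrative assumptions.

# nrestarts_from_show: read `systemctl show -p NRestarts UNIT` output on
# stdin and print just the number (0 if the property is missing).
nrestarts_from_show() {
  awk -F= '$1 == "NRestarts" { n = $2 } END { print n + 0 }'
}

# restarted_since PREVIOUS CURRENT -> exit status 0 if the counter grew.
restarted_since() {
  [ "$2" -gt "$1" ]
}

# Possible usage:
#   now=$(systemctl show -p NRestarts nginx | nrestarts_from_show)
#   restarted_since "$previous" "$now" && echo "nginx restarted since last check"
```

Storing the previous value (for example in a small state file) is left out here. The journal for the unit is the place to look for the reason behind each restart.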

If service names differ on your host, list them:

list-services-by-name.sh
# grep -E is used because ripgrep (rg) is often not installed on a fresh VPS
systemctl list-units --type=service --all | grep -E 'nginx|apache|php|mysql|maria|openlitespeed'

Turn checks into a recurring report

If you are not running Prometheus/Grafana yet, a systemd timer + journal logs can still give you a timeline.

Create a health script

create-healthcheck-script.sh
sudo tee /usr/local/bin/healthcheck-snapshot >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

echo "==== time ===="
date -Is
uptime
free -h
df -hT /
df -ih /
vmstat 1 3
EOF

sudo chmod +x /usr/local/bin/healthcheck-snapshot

Create a systemd service and timer

/etc/systemd/system/healthcheck-snapshot.service
[Unit]
Description=Periodic health snapshot

[Service]
Type=oneshot
ExecStart=/usr/local/bin/healthcheck-snapshot

/etc/systemd/system/healthcheck-snapshot.timer
[Unit]
Description=Run health snapshot every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target

enable-healthcheck-timer.sh
sudo systemctl daemon-reload
sudo systemctl enable --now healthcheck-snapshot.timer
systemctl list-timers --all | grep 'healthcheck-snapshot'

View collected snapshots:

view-healthcheck-history.sh
journalctl -u healthcheck-snapshot.service --since '2 hours ago' --no-pager

Warning: Timers increase background work. Keep scripts lightweight and avoid expensive commands (full du, large log scans) on a short interval.

Next steps

  • If the server is up but slow: see [Process control](./process-control).
  • If load is high with low CPU utilization: see [Disk I/O troubleshooting](./disk-io-troubleshooting).
  • If you need a timeline: see [Historical performance stats](./historical-performance-stats).