Production Patterns

Learning Focus

By the end of this lesson you will be able to deploy timer units with security sandboxing, overlap prevention, missed-run catch-up, fleet jitter, structured logging, error handling, and monitoring integration.

Hardened Production Template

Use this as a baseline for any production timer job:

/etc/systemd/system/safe-timer-job.timer
[Unit]
Description=Safe timer-driven job schedule

[Timer]
OnCalendar=02:15
Persistent=true
RandomizedDelaySec=5m
FixedRandomDelay=true
AccuracySec=1m

[Install]
WantedBy=timers.target

/etc/systemd/system/safe-timer-job.service
[Unit]
Description=Safe timer-driven job
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=www-data
Group=www-data
WorkingDirectory=/var/www/html
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/usr/bin/flock -n /var/lock/myjob.lock /usr/local/bin/myjob.sh
TimeoutStartSec=1h
StandardOutput=append:/var/log/myjob.log
StandardError=append:/var/log/myjob.log

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/log /var/lock /mnt/backups

Overlap Prevention with flock

The Problem

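If a job runs longer than the timer interval, the timer elapses while the previous run is still in progress. systemd does not start a second instance of the same service unit: a unit that is still running when its timer elapses is simply left running. That guarantee only covers starts made through that one unit. A manual run from a shell, a leftover cron entry, or a second unit that touches the same data can still overlap, so the safer pattern is explicit file locking.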

The Solution: flock

flock-in-service.service
[Service]
Type=oneshot
ExecStart=/usr/bin/flock -n /var/lock/backup.lock /usr/local/bin/backup.sh

| Flag | Behavior |
| --- | --- |
| -n | Non-blocking: exit immediately if the lock is held |
| -w 30 | Wait up to 30 seconds for the lock |
| -x | Exclusive lock (default) |
| -s | Shared lock (read lock) |

How It Works
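
The behavior of flock -n can be observed directly in a shell. The following is a minimal sketch, assuming the util-linux flock binary is on PATH; the temporary lock file and the sleep/echo commands are stand-ins for a real lock path and job.

```shell
#!/usr/bin/env bash
# Demonstrates flock -n: a second caller is rejected while the first holds the lock.
lock="$(mktemp)"          # hypothetical lock file, used only for this demo

# First "job": hold the lock for 2 seconds in the background.
flock -n "$lock" sleep 2 &
sleep 0.5                 # give the background job time to acquire the lock

# Second "job": the lock is taken, so flock exits non-zero without running echo.
if flock -n "$lock" echo "second run started"; then
    second_status=0
else
    second_status=$?
fi
echo "second run exit status: $second_status"

wait                      # the first job finishes and releases the lock

# Third "job": the lock is free again, so the command runs.
flock -n "$lock" echo "third run started"
third_status=$?
rm -f "$lock"
```

Because the rejected run exits with a non-zero status, systemd marks that activation as failed. If skipped runs should not count as failures, flock's -E option sets a distinct exit code for the lock-held case, which could then be listed in SuccessExitStatus= in the service unit.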

When To Use flock

| Scenario | Use flock? | Why |
| --- | --- | --- |
| Nightly backup (30 min job, 24h interval) | Yes | Safety net if backup is slow |
| Health check (2s job, 60s interval) | No | Job is always fast |
| Database export (variable duration) | Yes | Could exceed interval on large DBs |
| CDN sync (variable duration) | Yes | Network delays are unpredictable |

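Combining with TimeoutStartSec

flock-plus-timeout.service
[Service]
Type=oneshot
ExecStart=/usr/bin/flock -n /var/lock/backup.lock /usr/local/bin/backup.sh
# Kill the job if it runs longer than 2 hours
TimeoutStartSec=2h

TimeoutStartSec is the kill switch. flock is the overlap preventer. Use both together. For Type=oneshot services, use TimeoutStartSec= rather than RuntimeMaxSec=: RuntimeMaxSec= has no effect on oneshot units, which count as activating for as long as ExecStart= runs.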


Persistent Catch-Up

How It Works

[Timer]
OnCalendar=02:15
Persistent=true
  1. systemd stores the last trigger time on disk (a stamp file under /var/lib/systemd/timers/).
  2. When the timer is activated (at boot or on timer restart), systemd checks: "Would any run have fired between the stored time and now?"
  3. If yes, systemd triggers the service once when the timer is activated, no matter how many runs were missed.
  4. The catch-up run still honors RandomizedDelaySec=.
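
The decision in step 2 can be sketched in a few lines of shell. This is only an illustration of the logic for a daily 02:15 schedule evaluated in UTC, not systemd's implementation; the function name missed_run is hypothetical and GNU date (for -d) is assumed.

```shell
#!/usr/bin/env bash
# Sketch of the Persistent= decision for a daily 02:15 UTC schedule.
# Usage: missed_run LAST_TRIGGER_EPOCH NOW_EPOCH
# Succeeds (exit 0) if a scheduled run fell after LAST_TRIGGER and at or before NOW.
missed_run() {
    local last=$1 now=$2 day next
    day=$(date -u -d "@$last" +%F)            # calendar day of the last trigger
    next=$(date -u -d "$day 02:15:00" +%s)    # 02:15 on that same day
    if [ "$next" -le "$last" ]; then
        next=$(( next + 86400 ))              # that day's run already happened: check the next day
    fi
    [ "$next" -le "$now" ]
}

last=$(date -u -d "2024-01-01 02:15:00" +%s)

# One hour after the last run: nothing was missed.
if missed_run "$last" "$(( last + 3600 ))"; then echo "catch-up"; else echo "nothing missed"; fi

# Twenty-five hours after the last run: the next 02:15 slot passed while the host was down.
if missed_run "$last" "$(( last + 90000 ))"; then echo "catch-up"; else echo "nothing missed"; fi
```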

When To Use

| Job Type | Use Persistent? | Why |
| --- | --- | --- |
| Nightly backup | Yes | Must not be silently skipped |
| Database export | Yes | Data integrity depends on regular exports |
| Health check | No | A missed check is fine; the next one will fire |
| Cache warming | No | Stale cache is temporary |
| Log rotation | Yes | Logs must be rotated regularly |
| Retention purge | Yes | Must enforce storage limits |

State Management

persistent-state-commands.sh
# Check when the timer last fired
systemctl show wp-backup.timer -p LastTriggerUSec

# Stop the timer first: systemctl clean only operates on inactive units
sudo systemctl stop wp-backup.timer

# Reset stored state (force fresh start)
sudo systemctl clean --what=state wp-backup.timer

# Start the timer again so it recalculates the next run
sudo systemctl start wp-backup.timer

Fleet Jitter

The Thundering Herd Problem

10 servers, all scheduled at 02:15:

  • Without jitter: all 10 hit the backup target at 02:15:00.
  • With jitter: spread across 02:15:00–02:20:00.

The Solution

jitter-config.timer
[Timer]
OnCalendar=02:15
# Add 0–5 minutes of random delay
RandomizedDelaySec=5m
# Reuse the same offset for every run (stable per machine and unit)
FixedRandomDelay=true
# Coalescing window
AccuracySec=1m

| Directive | Purpose |
| --- | --- |
| RandomizedDelaySec=5m | Each run gets 0–5 minutes of random delay |
| FixedRandomDelay=true | The delay is derived from the machine ID, user ID, and unit name, so it differs between hosts but stays stable across runs and reboots |
| AccuracySec=1m | systemd may coalesce with other timers within this window |

Example: 5 Servers

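| Server | Base Schedule | FixedRandomDelay Offset | Actual Fire Time |
| --- | --- | --- | --- |
| server-01 | 02:15 | +47s | 02:15:47 |
| server-02 | 02:15 | +2m12s | 02:17:12 |
| server-03 | 02:15 | +3m55s | 02:18:55 |
| server-04 | 02:15 | +1m30s | 02:16:30 |
| server-05 | 02:15 | +4m08s | 02:19:08 |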
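
To see why a per-host input yields a different but repeatable offset on each server, the idea can be imitated with an ordinary hash. This is an illustration only: systemd uses its own internal hash, and the function stable_offset and the machine IDs below are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: systemd computes its offset internally, not with sha256sum.
# Usage: stable_offset MACHINE_ID UNIT_NAME MAX_SECONDS
stable_offset() {
    local machine_id=$1 unit=$2 max=$3 hex
    # Hash the per-host and per-unit inputs, keep the first 32 bits as hex.
    hex=$(printf '%s:%s' "$machine_id" "$unit" | sha256sum | cut -c1-8)
    # Map the hash onto the range 0 .. max-1 seconds.
    echo $(( 16#$hex % max ))
}

# Hypothetical machine IDs; real ones live in /etc/machine-id.
for id in 0123456789abcdef0123456789abcdef fedcba9876543210fedcba9876543210; do
    echo "$id -> +$(stable_offset "$id" safe-timer-job.timer 300)s"
done
```

Running the loop twice prints the same offsets both times, which is the property FixedRandomDelay=true provides: each host keeps its own slot inside the RandomizedDelaySec= window.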

Security Hardening

Graduated Hardening Levels

Level 1 — Basic (Every Production Service)

[Service]
NoNewPrivileges=true
PrivateTmp=true

Level 2 — Standard (Adds Filesystem Protection)

[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/log /var/lock /mnt/backups

Level 3 — Strict (Security-Sensitive Workloads)

[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/var/log /var/lock /mnt/backups
PrivateDevices=true
ProtectKernelTunables=true
ProtectControlGroups=true
MemoryDenyWriteExecute=true
RestrictRealtime=true

Security Audit

security-audit.sh
systemd-analyze security safe-timer-job.service

Aim for a score below 3.0:

example-output.txt
→ Overall exposure level for safe-timer-job.service: 2.1 OK
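
The score target can be turned into an automated check, for example in a deployment pipeline. The following is a minimal sketch that reads the summary line from standard input; the function name exposure_below is hypothetical, and it assumes the summary line keeps the format shown above.

```shell
#!/usr/bin/env bash
# Usage: systemd-analyze security UNIT | exposure_below THRESHOLD
# Succeeds if the "Overall exposure level" score read from stdin is below THRESHOLD.
exposure_below() {
    awk -v limit="$1" '
        /Overall exposure level/ {
            # The score is the first purely numeric field on the summary line.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+(\.[0-9]+)?$/) { score = $i; found = 1; break }
        }
        END { exit !(found && score + 0 < limit + 0) }
    '
}

# Fed with the sample line from above instead of a live systemd-analyze call.
echo "→ Overall exposure level for safe-timer-job.service: 2.1 OK" | exposure_below 3.0 \
    && echo "within target"
```

If no summary line is found at all, the function fails rather than passing silently, so a typo in the unit name does not look like a good score.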

Structured Logging

Log to Both journald and File

logging-service.service
[Service]
StandardOutput=append:/var/log/myjob.log
StandardError=append:/var/log/myjob.log

With append:, the job's stdout and stderr go to the file only. The journal still records systemd's own messages for the unit (start, exit status, failure); view them with journalctl -u myjob.service.

Structured Log Format in Scripts

/usr/local/bin/backup-with-logging.sh
#!/usr/bin/env bash
set -euo pipefail

# The timestamp is evaluated on every call, so each line carries its own time
log() { echo "[$(date -Is)] [backup] [$1] ${*:2}"; }

log_info()  { log INFO "$@"; }
log_error() { log ERROR "$@" >&2; }
log_warn()  { log WARN "$@"; }

log_info "Backup started"
if /usr/local/bin/wp db export /mnt/backups/db.sql --path=/var/www/html; then
    log_info "Database exported successfully"
else
    log_error "Database export failed"
    exit 1
fi
log_info "Backup complete"
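
A fixed "[timestamp] [job] [LEVEL] message" layout makes the log easy to query with standard tools. As a small sketch, the hypothetical helper below counts lines of a given level; the sample lines and their timestamps are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Usage: count_level LEVEL < LOGFILE
# Prints how many lines carry the given [LEVEL] tag.
count_level() {
    # -F matches "[LEVEL]" literally; grep -c exits 1 on zero matches but still prints 0.
    grep -c -F "[$1]" || true
}

# Sample lines in the format produced by the log_* helpers (timestamps are made up).
sample='[2024-01-01T02:15:47+00:00] [backup] [INFO] Backup started
[2024-01-01T02:16:02+00:00] [backup] [ERROR] Database export failed'

printf '%s\n' "$sample" | count_level ERROR
```

Against a real log the call would look like count_level ERROR < /var/log/myjob.log.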

Log Rotation

/etc/logrotate.d/timer-jobs
/var/log/wp-backup.log
/var/log/wp-db-backup.log
/var/log/health-check.log
/var/log/media-sync.log
{
    daily
    rotate 14
    compress
    missingok
    notifempty
    create 0640 www-data www-data
}

Error Handling

OnFailure Notification

myjob.service
[Unit]
Description=My scheduled job
OnFailure=alert-failure@%n.service

/etc/systemd/system/alert-failure@.service
[Unit]
Description=Send alert for %i failure

[Service]
Type=oneshot
ExecStart=/usr/local/bin/alert-failure.sh %i

/usr/local/bin/alert-failure.sh
#!/usr/bin/env bash
UNIT="$1"
MSG="[$(date -Is)] ALERT: $UNIT failed on $(hostname)"
echo "$MSG"
# Send to Slack, PagerDuty, etc.
# curl -s -X POST "https://hooks.slack.com/..." -d "{\"text\": \"$MSG\"}"
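
When the message is interpolated into a JSON body as in the commented curl line, a double quote or backslash inside it would produce invalid JSON. One possible way to guard against that in pure bash is sketched below; json_escape is a hypothetical helper and covers only the most common characters, not the full JSON specification.

```shell
#!/usr/bin/env bash
# Minimal JSON string escaping: backslash, double quote, newline.
# Not a full JSON encoder (other control characters are left as-is).
json_escape() {
    local s=$1
    s=${s//\\/\\\\}        # backslash    -> two backslashes
    s=${s//\"/\\\"}        # double quote -> backslash + quote
    s=${s//$'\n'/\\n}      # newline      -> backslash + n
    printf '%s' "$s"
}

msg='ALERT: "myjob.service" failed'
payload="{\"text\": \"$(json_escape "$msg")\"}"
echo "$payload"
```

The resulting $payload can be passed to curl with -d "$payload". If jq is installed, building the body with it is a more complete alternative.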

Script-Level Error Handling

robust-script.sh
#!/usr/bin/env bash
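# -E makes the ERR trap fire inside functions and subshells as well
set -Eeuo pipefail

# $? and $LINENO are expanded when the trap fires, so they describe the failing command
on_error() {
    echo "[$(date -Is)] [ERROR] Script failed at line $2 (exit status $1)" >&2
}
trap 'on_error "$?" "$LINENO"' ERR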

# Your logic here...

Environment Configuration

Environment Files

Keep environment-specific settings outside the unit file:

myjob.service
[Service]
EnvironmentFile=/etc/default/myjob
ExecStart=/usr/local/bin/myjob.sh

/etc/default/myjob
DB_HOST=localhost
DB_NAME=wordpress
BACKUP_DIR=/mnt/backups
S3_BUCKET=my-bucket
LOG_LEVEL=INFO
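
On the script side, it helps to fail early when a required setting is missing from the environment file. The following is a minimal sketch of how the top of myjob.sh might consume these variables; the function load_config and the choice of which settings are required are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the top of a job script that consumes /etc/default/myjob.
load_config() {
    # Required settings: abort with a clear message if missing or empty.
    : "${DB_NAME:?DB_NAME must be set (see EnvironmentFile=)}"
    : "${BACKUP_DIR:?BACKUP_DIR must be set (see EnvironmentFile=)}"
    # Optional settings: fall back to defaults.
    DB_HOST=${DB_HOST:-localhost}
    LOG_LEVEL=${LOG_LEVEL:-INFO}
}

# Simulate what systemd would export from the environment file.
DB_NAME=wordpress BACKUP_DIR=/mnt/backups
load_config
echo "db=$DB_NAME@$DB_HOST backups=$BACKUP_DIR level=$LOG_LEVEL"
```

With this in place, a misconfigured environment file makes the job exit non-zero on its first line of real work, which the service's failure handling can then report.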

Per-Environment Overrides

create-staging-override.sh
sudo mkdir -p /etc/systemd/system/myjob.service.d/
sudo tee /etc/systemd/system/myjob.service.d/staging.conf > /dev/null <<'EOF'
[Service]
EnvironmentFile=
EnvironmentFile=/etc/default/myjob-staging
TimeoutStartSec=2h
EOF
sudo systemctl daemon-reload

Monitoring

Comprehensive Health Check

/usr/local/bin/check-timer-health.sh
#!/usr/bin/env bash
set -euo pipefail

TIMERS=(wp-backup wp-db-backup backup-prune disk-check wp-cron-runner)
EXIT_CODE=0

for name in "${TIMERS[@]}"; do
    # is-active prints the state itself and exits non-zero unless it is "active";
    # "|| true" keeps set -e from aborting without appending extra output.
    timer_active=$(systemctl is-active "${name}.timer" 2>/dev/null || true)
    if [ "${timer_active:-unknown}" != "active" ]; then
        echo "CRITICAL: ${name}.timer is ${timer_active:-unknown}"
        EXIT_CODE=2
    else
        next=$(systemctl show "${name}.timer" -p NextElapseUSecRealtime --value 2>/dev/null)
        last=$(systemctl show "${name}.timer" -p LastTriggerUSec --value 2>/dev/null)
        echo "OK: ${name}.timer (last: $last, next: $next)"
    fi
done

exit $EXIT_CODE
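
An active timer can still be one that has not fired for too long, for example when its schedule was changed by mistake. A staleness check could complement the script above; the sketch below works on epoch seconds, and the function name is_stale and the 26-hour limit are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Usage: is_stale LAST_TRIGGER_EPOCH NOW_EPOCH MAX_AGE_SECONDS
# Succeeds if the last trigger is older than MAX_AGE_SECONDS, or has never happened.
is_stale() {
    local last=$1 now=$2 max_age=$3
    [ -z "$last" ] && return 0          # empty value: the timer has never triggered
    [ $(( now - last )) -gt "$max_age" ]
}

now=$(date +%s)
# A daily timer that last fired 30 hours ago, checked against a 26-hour limit.
if is_stale "$(( now - 30 * 3600 ))" "$now" "$(( 26 * 3600 ))"; then
    echo "WARNING: timer has not fired in over 26 hours"
fi
```

The epoch value for a real timer could be obtained with something like date -d "$(systemctl show wp-backup.timer -p LastTriggerUSec --value)" +%s, assuming GNU date accepts the timestamp format systemd prints on the host in question.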

WordPress VPS Timer Reference

| Task | Schedule | Key Directives |
| --- | --- | --- |
| WordPress cron events | *:0/15 | User=www-data, wp cron event run --due-now |
| Nightly full backup | 02:15 | flock, TimeoutStartSec=2h, Persistent=true |
| Database export | 02:30 | wp db export, flock, RandomizedDelaySec=5m |
| Backup retention purge | 04:00 | find -mtime +14 -delete |
| Object cache flush | 00/6:00:00 | wp cache flush, User=www-data |
| Media sync to S3 | 0/2:00:00 | rclone sync, User=www-data |
| SSL certificate renewal | 03:00 | certbot renew, FixedRandomDelay=true |
| PHP-FPM weekly restart | Sun 05:00 | systemctl restart php8.2-fpm |
| Weekly WP optimization | Mon 03:00 | wp db optimize |
| Peak-hours cache warm | 08..20:0/10 | Persistent=false |

Key Takeaways

  • Use flock -n + TimeoutStartSec= together for overlap prevention + kill switch (RuntimeMaxSec= has no effect on Type=oneshot services).
  • Use Persistent=true for any job that must not be silently skipped.
  • Use RandomizedDelaySec= + FixedRandomDelay=true for fleet safety.
  • Apply at least Level 1 security hardening on every production service.
  • Use OnFailure= for alerting on job failures.
  • Use EnvironmentFile= and drop-in overrides for multi-environment setups.

What's Next

  • Study Cases — real-world scenarios where systemd timers solve complex automation problems.