Alert Rules

Forge ships with 38 built-in alert rules covering OS health, storage, networking, hardware, ZFS, security, and service health. Each rule is evaluated on every metrics push (default: every 300 seconds). When a threshold is crossed, Forge sends a notification to all configured channels.

You can override default thresholds per server in /etc/glassmkr/collector.yaml or globally in the Forge dashboard under Settings > Alert Defaults.
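A per-server override in collector.yaml uses the same alerts: structure shown in the per-rule Configuration sections below; for example (the values here are illustrative, not recommendations):

```yaml
# /etc/glassmkr/collector.yaml -- per-server threshold overrides (illustrative)
alerts:
  ram_high:
    threshold: 85            # fire earlier than the 90% default
  disk_space_high:
    threshold: 80
    critical_threshold: 90
```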

Rule categories

Category         Count  Rules
OS               9      ram_high, cpu_high, cpu_iowait_high, oom_kills, load_high, clock_drift, swap_high, ntp_not_synced, unexpected_reboot
Storage          8      disk_space_high, smart_failing, nvme_wear_high, disk_latency_high, disk_io_errors, filesystem_readonly, inode_high, raid_degraded
Network          5      interface_errors, link_speed_mismatch, interface_saturation, conntrack_exhaustion, bond_slave_down
Hardware / IPMI  5      cpu_temperature_high, ecc_errors, psu_redundancy_loss, ipmi_fan_failure, ipmi_sel_critical
ZFS              2      zfs_pool_unhealthy, zfs_scrub_errors
Security         6      ssh_root_password, no_firewall, pending_security_updates, kernel_vulnerabilities, kernel_needs_reboot, unattended_upgrades_disabled
Service Health   3      systemd_service_failed, fd_exhaustion, server_unreachable

Alert priorities (P1-P4)

Every alert is assigned a priority level based on its severity and urgency. Priority badges appear on alert cards in the dashboard and in notification messages.

  P1  Critical, immediate action required. Data loss or service outage is imminent or occurring.
      Examples: raid_degraded, smart_failing, oom_kills, ecc_errors (uncorrectable)
  P2  High, action needed soon. Significant degradation or risk.
      Examples: disk_space_high (critical threshold), cpu_temperature_high (critical), psu_redundancy_loss
  P3  Medium, investigate when convenient. Performance impact or early warning.
      Examples: ram_high, cpu_high, disk_latency_high, inode_high
  P4  Low, informational. Proactive recommendations.
      Examples: pending_security_updates, unattended_upgrades_disabled, nvme_wear_high

Alert cards in the dashboard show the priority badge (P1-P4), a one-line summary, evidence links to relevant charts, and copy-pasteable fix commands you can run on the server.

Alert muting

You can mute specific alert rules on a per-server basis. Muted rules stop firing and stop sending notifications for that server. This is useful during maintenance windows or when a known condition is expected.

To mute a rule, go to the server detail page, open the Alerts tab, and click the mute icon next to the rule. You can also mute rules via the API or in the configuration file:

muted_rules:
  - disk_space_high    # mute during disk migration
  - cpu_iowait_high    # mute during RAID rebuild

Muted rules are re-evaluated on the next ingest cycle after unmuting. They do not fire retroactively for conditions that occurred while muted.

Alert tabs

The server detail page provides three alert tabs for filtering:

  • Active: alerts currently firing. These need attention.
  • Acknowledged: alerts that have been acknowledged but not yet resolved. Notifications are silenced.
  • All: complete alert history including resolved alerts, filterable by date range and rule.

OS rules (9)

1. ram_high

Category: OS | Severity: Warning | Default threshold: 90%

What it means

The server's physical RAM usage has exceeded the configured threshold. This is calculated as (total - available) / total * 100, where "available" includes buffers and cache that the kernel can reclaim under pressure.
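As a sketch, the formula can be reproduced with awk over the relevant /proc/meminfo fields (the input lines below are sample values, not real output):

```shell
# ram_high usage = (total - available) / total * 100
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%.1f%%\n", (t-a)/t*100}' <<'EOF'
MemTotal:       16384000 kB
MemAvailable:    4096000 kB
EOF
# prints: 75.0%
```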

Why it matters

Sustained high memory usage leaves little headroom for traffic spikes or new processes. If RAM fills completely, the Linux OOM killer will start terminating processes, potentially taking down critical services.

What to do

  • Identify the top memory consumers: ps aux --sort=-%mem | head -20
  • Check for memory leaks in long-running processes by comparing RSS over time.
  • Consider adding swap as a safety net (though swap is not a substitute for adequate RAM).
  • If usage is consistently high, upgrade the server's memory or redistribute workloads.

Configuration

alerts:
  ram_high:
    enabled: true
    threshold: 90
    duration: 300  # seconds the condition must persist before firing

2. cpu_high

Category: OS | Severity: Warning | Default threshold: 90%

What it means

The aggregate CPU utilization (user + system + iowait) has exceeded the threshold for the configured duration. On servers with per-core monitoring enabled (Crucible 0.3.0+), the alert also reports which cores are saturated.

Why it matters

Sustained high CPU usage means the server is at capacity. New requests queue, response times increase, and background tasks (cron jobs, log rotation) may not complete on time. If steal time is also high, the hypervisor is overcommitting CPU resources.

What to do

  • Identify CPU-heavy processes: top -bn1 | head -20
  • Check per-core usage in the Forge dashboard to see if the load is evenly distributed or pinned to specific cores.
  • Look for runaway processes or infinite loops.
  • Consider scaling horizontally or upgrading CPU resources.

Configuration

alerts:
  cpu_high:
    enabled: true
    threshold: 90
    duration: 300

3. cpu_iowait_high

Category: OS | Severity: Warning | Default threshold: 20%

What it means

The percentage of CPU time spent waiting for I/O operations to complete has exceeded the threshold. High iowait indicates that the CPU is idle because it is waiting for disk or network I/O.

Why it matters

Elevated iowait is a strong signal that storage is the bottleneck. Applications that depend on disk reads or writes will experience increased latency. This often correlates with slow database queries, sluggish log processing, or degraded RAID rebuilds.

What to do

  • Identify processes generating I/O: iotop -oP
  • Check disk latency with iostat -x 1 5 and look at the await column.
  • If a RAID array is rebuilding, iowait is expected and will resolve on its own.
  • Consider moving heavy I/O workloads to faster storage (NVMe).
  • Tune the I/O scheduler or increase the filesystem's commit interval for write-heavy workloads.

Configuration

alerts:
  cpu_iowait_high:
    enabled: true
    threshold: 20
    duration: 180

4. oom_kills

Category: OS | Severity: Critical | Default threshold: 1 (any OOM kill)

What it means

The Linux kernel's Out-of-Memory killer has terminated one or more processes since the last check. Crucible reads this from /proc/vmstat (the oom_kill counter) and from kernel log messages.

Why it matters

OOM kills mean the server ran out of memory and the kernel had to sacrifice processes to keep the system alive. The killed process may be your database, web server, or another critical service. OOM events frequently cause cascading failures.

What to do

  • Check which process was killed: dmesg | grep -i "oom-killer"
  • Review memory usage trends in the Forge dashboard to identify the growth pattern.
  • Set memory limits on containers or systemd services using MemoryMax= to prevent a single process from consuming all RAM.
  • Add or increase swap as a safety buffer.
  • If OOM kills recur, the server needs more RAM or the workload needs to be reduced.

Configuration

alerts:
  oom_kills:
    enabled: true
    threshold: 1  # number of new OOM kills to trigger

5. load_high

Category: OS | Severity: Warning | Default threshold: 2x CPU core count

What it means

The system's 5-minute load average has exceeded the threshold, which defaults to twice the number of CPU cores. A load average above the core count means processes are waiting for CPU time.
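A minimal sketch of the auto-threshold logic, with assumed sample values (the real check uses the collector's snapshot data):

```shell
cores=8          # from nproc
load=17.3        # 5-minute load average, from /proc/loadavg
threshold=$((cores * 2))
awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }' \
  && echo "load_high fires: $load > $threshold"
# prints: load_high fires: 17.3 > 16
```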

Why it matters

High load averages cause increased latency for all processes. Unlike CPU percentage, load average counts processes waiting for both CPU and I/O, so it captures bottlenecks that pure CPU metrics miss.

What to do

  • Check current load and CPU count: uptime and nproc
  • Identify processes in D state (uninterruptible sleep, usually I/O): ps aux | awk '$8 ~ /D/'
  • If load is high but CPU usage is low, the bottleneck is likely disk I/O. Check with iostat -x 1 5.
  • If load is high and CPU is also high, the server is CPU-bound. Reduce workload or add capacity.

Configuration

alerts:
  load_high:
    enabled: true
    threshold: 0  # 0 = auto (2x core count). Set a fixed number to override.
    duration: 300

6. clock_drift

Category: OS | Severity: Warning | Default threshold: 500 ms

What it means

The system clock has drifted more than the configured threshold from the expected time. Crucible compares the local clock against NTP reference data from timedatectl or chronyc.

Why it matters

Clock drift breaks TLS certificate validation, causes log timestamps to be unreliable, desynchronizes distributed systems (databases, consensus protocols), and can cause authentication failures with time-sensitive tokens (TOTP, Kerberos). Even small drifts compound over time if NTP is misconfigured.

What to do

  • Check current drift: timedatectl status or chronyc tracking
  • Verify NTP is running: systemctl status chronyd or systemctl status systemd-timesyncd
  • Force a sync: chronyc makestep or timedatectl set-ntp true
  • Check that NTP servers are reachable from the server's network.

Configuration

alerts:
  clock_drift:
    enabled: true
    threshold: 500  # milliseconds

7. swap_high

Category: OS | Severity: Warning | Default threshold: 80%

What it means

Swap space usage has exceeded the configured threshold. Crucible reads swap usage from /proc/meminfo. High swap usage means the system is actively paging memory to disk.

Why it matters

Swap exists as a safety net, not as a primary memory source. When a server is actively swapping, performance degrades significantly because disk I/O is orders of magnitude slower than RAM access. Database queries slow down, application response times spike, and the system can enter a thrashing state where it spends more time swapping than doing useful work.

What to do

  • Check swap usage: free -h and swapon --show
  • Identify processes using swap: for f in /proc/*/status; do awk '/VmSwap/{swap=$2} /Name/{name=$2} END{if(swap>0) print swap,name}' "$f" 2>/dev/null; done | sort -rn | head -20
  • Check if RAM is the bottleneck: review memory usage trends in the Forge dashboard.
  • If swap usage is sustained, the server likely needs more RAM or the workload needs to be reduced.

Configuration

alerts:
  swap_high:
    enabled: true
    threshold: 80  # percentage of total swap

8. ntp_not_synced

Category: OS | Severity: Warning | Default: NTP synchronization not active

What it means

The system's NTP synchronization is not active. Crucible checks timedatectl for "NTP synchronized: yes" and verifies that an NTP daemon (chrony, ntpd, or systemd-timesyncd) is running.

Why it matters

Without active NTP synchronization, the system clock will drift over time. Hardware clocks are imprecise and can drift seconds per day. This leads to the same issues as clock_drift but is a more fundamental problem: the server has no mechanism to correct its time at all.

What to do

  • Check NTP status: timedatectl status
  • Enable time sync: sudo timedatectl set-ntp true
  • If using chrony: sudo systemctl enable --now chronyd
  • If using systemd-timesyncd: sudo systemctl enable --now systemd-timesyncd
  • Verify NTP servers are configured in /etc/chrony.conf or /etc/systemd/timesyncd.conf.

Configuration

alerts:
  ntp_not_synced:
    enabled: true

9. unexpected_reboot

Category: OS | Severity: Warning | Default: uptime decreased between snapshots

What it means

The server's uptime has decreased since the last snapshot, indicating a reboot occurred between collection intervals. Crucible detects this by comparing the current uptime against the previous snapshot's uptime value.

Why it matters

Unexpected reboots can indicate hardware instability (kernel panics, power loss, watchdog timer expiry), firmware issues, or someone rebooting the server without coordination. Even planned reboots should be tracked for audit purposes. Repeated unexpected reboots are a strong signal of a failing component.

What to do

  • Check the reboot cause: last reboot and journalctl --boot=-1 -e
  • Check for kernel panics: dmesg | grep -i panic
  • Check IPMI SEL for power events: ipmitool sel list
  • If reboots recur, investigate hardware (PSU, memory, thermal shutdown) and check for watchdog timer kills.

Configuration

alerts:
  unexpected_reboot:
    enabled: true
    # Triggers when uptime decreases between consecutive snapshots

Storage rules (8)

10. disk_space_high

Category: Storage | Severity: Warning (90%), Critical (95%) | Default threshold: 90%

What it means

A mounted filesystem has exceeded the configured disk usage threshold. Forge monitors all mounted filesystems except tmpfs, devtmpfs, and other virtual mounts.

Why it matters

When a filesystem fills to 100%, writes fail. This can crash databases, corrupt logs, prevent SSH logins (if /var or /tmp are full), and make the server difficult to recover remotely. The reserved blocks for root (typically 5% on ext4) provide a small buffer but are not a long-term solution.

What to do

  • Find large files: du -h --max-depth=2 /var | sort -hr | head -20
  • Clean up old logs: journalctl --vacuum-time=7d
  • Remove old package caches: apt clean or dnf clean all
  • Check for core dumps or stale temporary files in /tmp and /var/tmp.
  • If the filesystem is consistently near capacity, expand the volume or move data to a larger disk.
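A rough local approximation of the check for a single mount point (GNU df; the 90 mirrors the default warning threshold):

```shell
# Integer percentage used on the root filesystem
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge 90 ]; then
  echo "disk_space_high would fire for / at ${usage}%"
fi
```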

Configuration

alerts:
  disk_space_high:
    enabled: true
    threshold: 90
    critical_threshold: 95
    exclude_mounts:
      - /mnt/backup  # ignore specific mount points

11. smart_failing

Category: Storage | Severity: Critical | Default threshold: any SMART failure

What it means

A disk's SMART self-assessment has reported a failing status, or one or more critical SMART attributes (Reallocated Sector Count, Current Pending Sector, Offline Uncorrectable) have crossed their vendor-defined thresholds. Crucible uses smartctl to read these values. The dashboard displays the drive model name, power-on days, reallocated sector count, and temperature.

Why it matters

SMART failures are a strong predictor of imminent disk failure. A disk reporting "FAILING" can die within hours or weeks. Data loss is a real risk, especially if no RAID or backup is in place.

What to do

  • Check the SMART report: smartctl -a /dev/sdX
  • Back up the disk immediately if backups are not current.
  • If the disk is part of a RAID array, replace it as soon as possible and let the array rebuild.
  • Order a replacement drive. Do not wait for the disk to fail completely.
  • If you are in a data center, open a hardware ticket with your provider.

Configuration

alerts:
  smart_failing:
    enabled: true
    # No threshold - any SMART failure triggers this alert
    ignore_disks:
      - /dev/sda  # optionally ignore specific disks

12. nvme_wear_high

Category: Storage | Severity: Warning | Default threshold: 80% (percentage used)

What it means

An NVMe drive's "Percentage Used" indicator (from the NVMe health log) has exceeded the threshold. This value estimates how much of the drive's rated write endurance has been consumed. A value of 100% means the drive has reached its rated endurance, though many drives continue operating beyond this point.

Why it matters

NVMe flash cells have a finite number of program/erase cycles. As wear increases, the drive's internal spare cells are consumed. Eventually the drive will transition to read-only mode or fail entirely. Planning a replacement before 100% wear avoids unexpected downtime.

What to do

  • Check current wear: smartctl -a /dev/nvme0 | grep "Percentage Used"
  • Review Data Units Written to estimate remaining lifespan based on your write rate.
  • If wear is above 90%, order a replacement drive and schedule a migration.
  • Reduce unnecessary writes (disable access time updates with noatime, move logs to a different drive).
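For the lifespan estimate, a back-of-the-envelope calculation assuming wear grows roughly linearly with time (the numbers are illustrative):

```shell
# 80% endurance used after 700 power-on days -> time left at the same write rate
awk -v used=80 -v days=700 'BEGIN { printf "%.0f days\n", days * (100 - used) / used }'
# prints: 175 days
```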

Configuration

alerts:
  nvme_wear_high:
    enabled: true
    threshold: 80  # percentage used

13. disk_latency_high

Category: Storage | Severity: Warning | Default threshold: 50 ms (average)

What it means

The average I/O latency for a block device has exceeded the threshold. Crucible measures this from /sys/block/*/stat by computing the average time per completed I/O operation over the collection interval.
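The calculation amounts to milliseconds spent on I/O divided by completed operations, using deltas between two samples of the stat counters. With assumed sample deltas:

```shell
# dr/dw: reads/writes completed in the interval; rt/wt: ms spent on them
awk -v dr=100 -v dw=300 -v rt=2000 -v wt=6000 \
  'BEGIN { printf "avg latency: %.1f ms\n", (rt + wt) / (dr + dw) }'
# prints: avg latency: 20.0 ms
```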

Why it matters

High disk latency directly impacts application performance. Database queries slow down, file operations block, and services become unresponsive. For NVMe drives, latency should typically be under 1 ms. For SATA SSDs, under 5 ms. For spinning disks, under 20 ms. Anything above 50 ms is a clear sign of trouble.

What to do

  • Check per-device latency: iostat -x 1 5 (look at await).
  • Identify I/O-heavy processes: iotop -oP
  • If the disk is healthy, latency may be caused by I/O saturation. Reduce concurrent I/O or upgrade to faster storage.
  • Check if a RAID rebuild or filesystem check is running in the background.
  • If latency is intermittent, check SMART data for signs of failing hardware.

Configuration

alerts:
  disk_latency_high:
    enabled: true
    threshold: 50  # milliseconds
    duration: 120
    exclude_devices:
      - loop0
      - loop1

14. disk_io_errors

Category: Storage | Severity: Critical | Default threshold: any kernel I/O errors

What it means

Kernel-level I/O errors have been reported in dmesg or syslog. These indicate hardware-level read/write failures that the drive's firmware could not recover from.

Why it matters

Kernel I/O errors are a strong signal of imminent drive failure. Unlike SMART warnings which are predictive, I/O errors mean data operations are already failing. Applications may experience silent corruption.

What to do

  • Check dmesg | grep -i "i/o error" for the affected device.
  • Run smartctl -a /dev/sdX for the device mentioned in the errors.
  • Back up data from the affected device immediately.
  • Schedule drive replacement.

Configuration

alerts:
  disk_io_errors:
    enabled: true
    # Triggers on any kernel I/O error in the collection interval

15. filesystem_readonly

Category: Storage | Severity: Critical | Default threshold: any read-only remount

What it means

A filesystem that should be read-write has been remounted as read-only by the kernel. This typically happens when the kernel detects filesystem corruption or I/O errors and remounts the filesystem to prevent further damage.

Why it matters

A read-only filesystem means all write operations fail. Applications crash, logs stop writing, and databases become unavailable. This is usually a sign of underlying hardware failure or filesystem corruption.

What to do

  • Check mount options: mount | grep "ro,"
  • Check kernel logs for the cause: dmesg | grep -i "remount\|error\|readonly"
  • If caused by disk errors, check SMART data and plan a replacement.
  • If the filesystem is corrupted, run fsck from a rescue environment.

Configuration

alerts:
  filesystem_readonly:
    enabled: true
    exclude_mounts:
      - /mnt/cdrom  # ignore intentionally read-only mounts

16. inode_high

Category: Storage | Severity: Warning | Default threshold: 90%

What it means

A filesystem's inode usage has exceeded the threshold. Inodes track file metadata; when they run out, no new files can be created even if free space remains.

Why it matters

Inode exhaustion is a subtle failure mode. Disk usage may show plenty of free space, but the server cannot create new files. This breaks log rotation, temp file creation, and application writes. It is common on filesystems with many small files (mail spools, cache directories, container layers).

What to do

  • Check inode usage: df -i
  • Find directories with many small files: find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20
  • Clean up unnecessary small files (session files, cache entries, old mail).
  • If the filesystem was created with too few inodes, it must be reformatted with a higher inode ratio.

Configuration

alerts:
  inode_high:
    enabled: true
    threshold: 90
    exclude_mounts: []

17. raid_degraded

Category: Storage | Severity: Critical | Default threshold: any degradation

What it means

A software RAID array (mdadm) or hardware RAID controller has reported a degraded state. This means one or more member disks have failed or been removed from the array. Crucible reads /proc/mdstat for software RAID and uses vendor tools (MegaCLI, storcli) for hardware RAID when available.

Why it matters

A degraded array has lost its redundancy. If another disk fails before the array is rebuilt, data loss is likely (or certain, depending on the RAID level). RAID 1 with one failed disk has zero redundancy. RAID 5 with one failed disk cannot survive another failure. RAID 6 with one failed disk is reduced to RAID 5 levels of protection.

What to do

  • Identify the failed disk: cat /proc/mdstat or mdadm --detail /dev/md0
  • Replace the failed disk as soon as possible.
  • Add the replacement to the array: mdadm --add /dev/md0 /dev/sdX
  • Monitor the rebuild progress: watch cat /proc/mdstat
  • Avoid heavy I/O during the rebuild to speed up reconstruction.

Configuration

alerts:
  raid_degraded:
    enabled: true
    # No threshold - any degradation triggers this alert
    arrays:
      - /dev/md0
      - /dev/md1

Network rules (5)

18. interface_errors

Category: Network | Severity: Warning | Default threshold: 10 errors/minute

What it means

A network interface is reporting errors (RX errors, TX errors, drops, or overruns) above the threshold rate. Crucible reads these counters from /sys/class/net/*/statistics/.
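The per-minute rate is a delta between two consecutive counter reads (e.g. /sys/class/net/eth0/statistics/rx_errors) divided by the collection interval; with assumed sample values:

```shell
# prev/cur: error counter at the last two collections; secs: interval length
awk -v prev=120 -v cur=180 -v secs=300 \
  'BEGIN { printf "%.1f errors/min\n", (cur - prev) / secs * 60 }'
# prints: 12.0 errors/min
```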

Why it matters

Network errors cause packet retransmissions, increased latency, and reduced throughput. Persistent errors often indicate a hardware problem: a bad cable, a failing NIC, or a misconfigured switch port. Drops can also be caused by receive buffer exhaustion under high traffic.

What to do

  • Check error counters: ip -s link show eth0
  • Inspect the cable and SFP modules. Reseat connections.
  • Check switch port counters and logs for CRC errors or alignment errors.
  • Increase ring buffer sizes: ethtool -G eth0 rx 4096 tx 4096
  • If the NIC is faulty, replace it.

Configuration

alerts:
  interface_errors:
    enabled: true
    threshold: 10  # errors per minute
    exclude_interfaces:
      - lo
      - docker0

20. interface_saturation

Category: Network | Severity: Warning | Default threshold: 80%

What it means

A network interface's throughput has exceeded the configured percentage of its link speed. Crucible measures bytes transmitted and received over the collection interval and compares the rate to the interface's reported link speed.
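In other words: bytes moved over the interval, converted to bits per second, against the link speed (which /sys/class/net/<iface>/speed reports in Mbit/s). With assumed sample values:

```shell
# 3 GB moved in a 300 s interval on a 1000 Mbit/s link
awk -v bytes=3000000000 -v secs=300 -v mbit=1000 \
  'BEGIN { printf "%.0f%% of link speed\n", (bytes * 8 / secs) / (mbit * 1000000) * 100 }'
# prints: 8% of link speed
```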

Why it matters

A saturated network link causes packet queuing, increased latency, and dropped packets. Services that depend on network throughput (file servers, databases with replication, backup jobs) will degrade. The warning fires at 80% because queuing delay and packet loss climb sharply as utilization approaches line rate, so effective throughput degrades well before the link hits 100%.

What to do

  • Identify traffic sources: iftop -i eth0 or nload eth0
  • Check if a backup job or large transfer is running.
  • Implement traffic shaping or QoS to prioritize critical traffic.
  • Consider bonding multiple interfaces or upgrading to a faster link.
  • Move bulk transfers to off-peak hours.

Configuration

alerts:
  interface_saturation:
    enabled: true
    threshold: 80  # percentage of link speed
    duration: 60
    exclude_interfaces:
      - lo

21. conntrack_exhaustion

Category: Network | Severity: Warning (80%), Critical (95%) | Default threshold: 80%

What it means

The kernel's connection tracking (conntrack) table is approaching capacity. Crucible reads /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to calculate the usage percentage.
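The percentage is simply count over max; a tiny helper makes that concrete (the sample arguments stand in for the two proc values):

```shell
# conntrack usage percentage from nf_conntrack_count / nf_conntrack_max
pct() { awk -v c="$1" -v m="$2" 'BEGIN { printf "%.1f\n", c / m * 100 }'; }
pct 209715 262144
# prints: 80.0
```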

Why it matters

When the conntrack table fills up, the kernel drops new connections silently. This affects all stateful firewall rules (iptables, nftables) and NAT. Services appear unreachable, but the server looks healthy otherwise. This is a common failure mode on busy NAT gateways, load balancers, and servers with many short-lived connections.

What to do

  • Check current usage: cat /proc/sys/net/netfilter/nf_conntrack_count and cat /proc/sys/net/netfilter/nf_conntrack_max
  • Increase the limit temporarily: sysctl -w net.netfilter.nf_conntrack_max=262144
  • Make it permanent in /etc/sysctl.d/99-conntrack.conf
  • Reduce timeouts for idle connections: sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
  • If the server does not need connection tracking, consider using stateless firewall rules.

Configuration

alerts:
  conntrack_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95

22. bond_slave_down

Category: Network | Severity: Critical | Priority: P1

What it means

A network interface that is part of a bond (e.g. bond0) has gone down. Crucible reads /sys/class/net/{iface}/operstate and detects bond membership from /proc/net/bonding/*. Requires Crucible 0.6.5 or newer.

Why it matters

Bond interfaces provide network redundancy. When one slave goes down, the bond continues working but with reduced capacity and no redundancy. A second failure would cause a full network outage. This is often caused by a failed cable, SFP transceiver, or switch port.

What to do

  • Check bond status: cat /proc/net/bonding/bond0
  • Check the slave interface: ip link show enp1s0f0 and ethtool enp1s0f0
  • Try bringing it back up: sudo ip link set enp1s0f0 up
  • If the interface won't stay up, check the physical connection (cable, SFP, switch port).
  • Check kernel messages: dmesg -T | grep -i "enp1s0f0" | tail -10

Hardware / IPMI rules (5)

23. cpu_temperature_high

Category: Hardware | Severity: Warning (85 °C), Critical (95 °C) | Default threshold: 85 °C

What it means

The CPU package temperature has exceeded the threshold. Crucible reads temperatures from hwmon sensors via /sys/class/hwmon/ or from IPMI if available. Temperatures are reported in degrees Celsius (e.g., 85 °C).

Why it matters

CPUs throttle their clock speed when they get too hot, which reduces performance. At extreme temperatures (above Tjunction max, typically 100-105 °C), the CPU will shut down to protect itself, causing an unclean server restart. Sustained high temperatures also reduce the CPU's lifespan.

What to do

  • Check current temperatures: sensors (from lm-sensors package).
  • Verify that fans are running: ipmitool sdr type Fan
  • Clean dust from heatsinks and fans.
  • Check that the thermal paste between the CPU and heatsink is not dried out.
  • If in a data center, check the room temperature and airflow. Verify hot/cold aisle separation.
  • Reduce CPU load temporarily if temperatures are critical.

Configuration

alerts:
  cpu_temperature_high:
    enabled: true
    threshold: 85       # warning threshold in Celsius
    critical_threshold: 95
    sensor: coretemp    # sensor driver name (auto-detected if omitted)

24. ecc_errors (correctable)

Category: Hardware | Severity: Warning | Default threshold: 1 (any new error)

What it means

The server's ECC memory has reported new correctable single-bit errors. These are silently fixed by the ECC hardware but logged for monitoring. Crucible reads these from edac-util or /sys/devices/system/edac/mc/.

Why it matters

Occasional correctable ECC errors are normal over long periods. A sudden increase in correctable errors on a single DIMM often predicts imminent failure. Tracking the rate helps you plan proactive DIMM replacements.

What to do

  • Check error counts: edac-util -s or edac-util -v
  • Identify which DIMM is affected from the EDAC output (mc/csrow/channel).
  • Monitor the rate. If errors are increasing, schedule a DIMM replacement.
  • Run a memory test (memtest86+) during the next maintenance window.

Configuration

alerts:
  ecc_errors:
    enabled: true
    threshold: 1  # new correctable errors to trigger warning

25. ecc_errors (uncorrectable)

Category: Hardware | Severity: Critical | Default threshold: 1 (any uncorrectable error)

What it means

The server's ECC memory has reported uncorrectable multi-bit errors. These cannot be repaired by ECC and may cause data corruption or application crashes.

Why it matters

Uncorrectable errors are serious. Corrupted data was delivered to the CPU, which can cause application crashes, data corruption, or silent data damage. This DIMM should be replaced immediately.

What to do

  • Identify the affected DIMM: edac-util -v
  • Replace the DIMM immediately.
  • Check application data integrity, especially database checksums.
  • Run memtest86+ to confirm the diagnosis.

Configuration

alerts:
  ecc_errors:
    critical_on_uncorrectable: true

26. psu_redundancy_loss

Category: Hardware | Severity: Critical | Default threshold: any PSU failure

What it means

A redundant power supply unit has failed or been disconnected. Crucible detects this via IPMI sensors or by reading /sys/class/hwmon/ entries for power supply status. In a typical 1+1 redundant configuration, the server continues running on the remaining PSU, but it has lost its power redundancy.

Why it matters

Servers with redundant PSUs are designed to survive a single PSU failure. Once one PSU is down, you are running without a safety net. If the remaining PSU fails, the server goes down immediately with no graceful shutdown.

What to do

  • Check PSU status: ipmitool sdr type "Power Supply"
  • Verify that the failed PSU is receiving power (check the outlet and PDU).
  • If the PSU has a fault LED, note the error pattern.
  • Replace the failed PSU. Most servers support hot-swap PSU replacement.
  • If in a data center, open a hardware ticket immediately.

Configuration

alerts:
  psu_redundancy_loss:
    enabled: true
    # No threshold - any PSU failure triggers this alert
    source: ipmi   # ipmi or hwmon (auto-detected if omitted)

27. ipmi_fan_failure

Category: Hardware | Severity: Critical | Default threshold: any fan failure or RPM below minimum

What it means

An IPMI-monitored fan has stopped spinning or dropped below the minimum RPM threshold. Crucible reads fan speeds from IPMI SDR records and reports them in RPM.

Why it matters

Fan failure leads to rising temperatures, which cause CPU throttling, component damage, and eventually thermal shutdown. In servers with redundant fans, a single failure reduces cooling capacity and puts stress on the remaining fans.

What to do

  • Check fan status: ipmitool sdr type Fan
  • Inspect the fan for physical damage or cable disconnection.
  • If the server is in a data center, open a hardware ticket for fan replacement.
  • Monitor CPU temperatures closely until the fan is replaced.

Configuration

alerts:
  ipmi_fan_failure:
    enabled: true
    min_rpm: 500  # fans below this RPM are considered failed

28. ipmi_sel_critical

Category: Hardware | Severity: Critical | Default threshold: any critical SEL event

What it means

A critical event has been logged in the IPMI System Event Log (SEL). This includes events like machine check exceptions, PCI-E fatal errors, and power unit failures. Crucible reads the SEL via ipmitool sel list.

Why it matters

Critical SEL events indicate hardware-level problems that may not be visible through OS-level monitoring. These events are logged by the BMC independently of the operating system and can indicate problems that the OS cannot detect on its own.

What to do

  • Read the full SEL: ipmitool sel list
  • Look up the specific event type in your server vendor's documentation.
  • If the event indicates a component failure, schedule replacement.
  • Clear the SEL after investigation: ipmitool sel clear

Configuration

alerts:
  ipmi_sel_critical:
    enabled: true
    # Triggers on any critical-severity SEL event since last check

ZFS rules (2)

28. zfs_pool_unhealthy

Category: ZFS Severity: Critical Default threshold: pool state != ONLINE

What it means

A ZFS pool's health status is something other than ONLINE. This includes DEGRADED (redundancy lost), FAULTED (data loss possible), and UNAVAIL (pool cannot be accessed).

Why it matters

A non-ONLINE ZFS pool means either redundancy is lost (DEGRADED) or data may already be inaccessible (FAULTED/UNAVAIL). Immediate action is required to prevent data loss.

What to do

  • Check pool status: zpool status
  • If DEGRADED: identify the failed vdev and replace the drive with zpool replace.
  • If FAULTED: attempt zpool clear, then investigate the cause.
  • Never reboot a FAULTED pool without understanding the failure first.

Configuration

alerts:
  zfs_pool_unhealthy:
    enabled: true
    # Triggers when any zpool reports non-ONLINE state
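
Extracting the pool state boils down to reading the state: line of `zpool status`. A minimal sketch over sample output (the sample text and the `pool_state` helper are illustrative):

```shell
# Sample `zpool status` output for a degraded pool (illustrative).
sample_status='  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline.'

pool_state() {
  # Print the value of the "state:" line.
  echo "$1" | awk -F': *' '$1 ~ /state$/ {print $2}'
}

state=$(pool_state "$sample_status")
if [ "$state" != "ONLINE" ]; then
  echo "ALERT: pool state is $state"
fi
```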

29. zfs_scrub_errors

Category: ZFS Severity: Warning Default threshold: any scrub errors

What it means

Checksum or data errors were found during ZFS scrub operations. ZFS scrubs verify every block of data against its checksum to detect silent data corruption (bit rot).

Why it matters

Scrub errors mean data on disk does not match its checksum. On redundant pools, ZFS auto-repairs from good copies. On non-redundant pools, this is data corruption. Either way, it signals failing hardware.

What to do

  • Check scrub results: zpool status -v
  • If on a mirror/raidz: ZFS auto-repaired. Identify the drive with errors and plan replacement.
  • If on a single vdev: data corruption occurred. Restore affected files from backup.
  • Run smartctl -a on the underlying device to check for hardware issues.

Configuration

alerts:
  zfs_scrub_errors:
    enabled: true
    # Triggers when zpool scrub reports any errors
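
The error count lives in the scan: line of `zpool status`. A sketch of extracting it, with illustrative sample text and a hypothetical `scrub_error_count` helper:

```shell
# Sample scan line from `zpool status` (illustrative values).
sample_scan='  scan: scrub repaired 8K in 01:02:03 with 2 errors on Sun May 26 03:04:05 2024'

scrub_error_count() {
  # Pull the number out of "with N errors".
  echo "$1" | sed -n 's/.*with \([0-9][0-9]*\) errors.*/\1/p'
}

errors=$(scrub_error_count "$sample_scan")
if [ "${errors:-0}" -gt 0 ]; then
  echo "ALERT: scrub found $errors errors"
fi
```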

Security rules (6)

30. ssh_root_password

Category: Security Severity: Warning Default: detects PermitRootLogin with password

What it means

The SSH daemon is configured to allow root login with a password. Crucible checks /etc/ssh/sshd_config and fires when PermitRootLogin yes is set, or when PermitRootLogin prohibit-password is not set.

Why it matters

Root login via password is a common attack vector. Brute-force SSH attacks target root constantly. Key-based authentication is much more secure.

What to do

  • Set PermitRootLogin prohibit-password in /etc/ssh/sshd_config
  • Ensure you have SSH key access before disabling password login.
  • Restart SSH: sudo systemctl restart sshd

Configuration

alerts:
  ssh_root_password:
    enabled: true
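
The check can be sketched as a scan over the config text. The sample snippet and the `root_password_allowed` helper below are illustrative (a real check should also cover sshd_config.d drop-ins):

```shell
# Sample sshd_config snippet (illustrative).
sample_config='Port 22
PermitRootLogin yes
PasswordAuthentication yes'

root_password_allowed() {
  # In sshd_config the first occurrence of a keyword wins; this sketch
  # treats an unset PermitRootLogin as risky, matching the rule above.
  setting=$(echo "$1" | awk 'tolower($1)=="permitrootlogin" && !seen {v=$2; seen=1} END {print v}')
  case "$setting" in
    no|prohibit-password|without-password|forced-commands-only) return 1 ;;
    *) return 0 ;;
  esac
}

if root_password_allowed "$sample_config"; then
  echo "ALERT: root password login permitted"
fi
```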

31. no_firewall

Category: Security Severity: Warning Default: detects no active firewall

What it means

No active firewall was detected. Crucible checks for iptables rules, nftables, ufw, and firewalld. If all are empty or inactive, this alert fires.

Why it matters

A server without a firewall exposes all listening services to the internet. Even services bound to localhost can be exposed if a misconfiguration changes the bind address.

What to do

  • Enable ufw: sudo ufw default deny incoming && sudo ufw allow ssh && sudo ufw enable
  • Or configure iptables/nftables with appropriate rules for your services.
  • If you use an external firewall (cloud security group), you can disable this rule.

Configuration

alerts:
  no_firewall:
    enabled: true
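
For the iptables leg of this check, "no firewall" means no rules plus ACCEPT default policies. A sketch over sample `iptables-save` output (sample text and the `active_rule_count` helper are illustrative; the real check also covers nftables, ufw, and firewalld):

```shell
# Sample `iptables-save` output for a host with no rules (illustrative).
sample_rules='# Generated by iptables-save
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT'

active_rule_count() {
  # Count appended rules ("-A ..." lines).
  echo "$1" | grep -c '^-A'
}

if [ "$(active_rule_count "$sample_rules")" -eq 0 ]; then
  echo "ALERT: no iptables rules detected"
fi
```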

32. pending_security_updates

Category: Security Severity: Warning Default: any pending security update

What it means

The package manager has pending security updates that have not been installed. Crucible checks apt (Debian/Ubuntu) or dnf (RHEL/Rocky/Alma) for available security patches.

Why it matters

Unpatched security vulnerabilities are one of the most common attack vectors. Security updates should be applied promptly, especially for internet-facing services.

What to do

  • Review pending updates: apt list --upgradable or dnf check-update --security
  • Apply security updates: sudo apt upgrade or sudo dnf update --security
  • Consider enabling automatic security updates (see unattended_upgrades_disabled below).

Configuration

alerts:
  pending_security_updates:
    enabled: true
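
On Debian/Ubuntu, security fixes arrive via the -security pocket, which shows up in the package's origin in `apt list --upgradable`. A sketch of counting them (sample output and the `security_update_count` helper are illustrative):

```shell
# Sample `apt list --upgradable` output (illustrative package versions).
sample_upgradable='Listing...
openssl/jammy-security 3.0.2-0ubuntu1.15 amd64 [upgradable from: 3.0.2-0ubuntu1.14]
vim/jammy-updates 2:8.2.3995-1ubuntu2.16 amd64 [upgradable from: 2:8.2.3995-1ubuntu2.15]'

security_update_count() {
  # Count packages whose upgrade comes from a -security pocket.
  echo "$1" | grep -c -- '-security'
}

if [ "$(security_update_count "$sample_upgradable")" -gt 0 ]; then
  echo "ALERT: pending security updates"
fi
```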

33. kernel_vulnerabilities

Category: Security Severity: Warning Default: any known kernel vulnerability

What it means

The running kernel has known vulnerabilities that can be mitigated or patched. Crucible checks /sys/devices/system/cpu/vulnerabilities/ for Spectre, Meltdown, and other CPU/kernel vulnerabilities.

Why it matters

Kernel vulnerabilities can allow privilege escalation, container escapes, or data leaks between processes. While some mitigations are applied automatically, others require a kernel update and reboot.

What to do

  • Check vulnerability status: grep . /sys/devices/system/cpu/vulnerabilities/*
  • Update the kernel: sudo apt upgrade linux-image-generic
  • Reboot to load the new kernel.

Configuration

alerts:
  kernel_vulnerabilities:
    enabled: true
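
Each file under /sys/devices/system/cpu/vulnerabilities/ reports "Not affected", a "Mitigation: ..." line, or "Vulnerable". A sketch of counting the unmitigated ones, over sample output (sample lines and the `unmitigated_count` helper are illustrative):

```shell
# Sample output of `grep . /sys/devices/system/cpu/vulnerabilities/*`
# (illustrative; real hosts list more entries).
sample_vulns='/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable
/sys/devices/system/cpu/vulnerabilities/mds:Not affected'

unmitigated_count() {
  # "Vulnerable" (optionally followed by details) means no mitigation.
  echo "$1" | grep -c ':Vulnerable'
}

if [ "$(unmitigated_count "$sample_vulns")" -gt 0 ]; then
  echo "ALERT: unmitigated kernel vulnerabilities"
fi
```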

34. kernel_needs_reboot

Category: Security Severity: Warning Default: reboot required after kernel update

What it means

A kernel update has been installed but the server is still running the old kernel. Crucible detects this by comparing the running kernel version against the installed version and by checking for /var/run/reboot-required.

Why it matters

Security patches in the new kernel are not active until the server reboots. The server remains vulnerable to patched exploits until the reboot occurs.

What to do

  • Schedule a maintenance window and reboot the server.
  • Verify the new kernel is running after reboot: uname -r

Configuration

alerts:
  kernel_needs_reboot:
    enabled: true
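
The version-comparison half of this detection can be sketched as follows. The kernel versions and the `needs_reboot` helper are sample values for illustration; on a real host the first comes from `uname -r` and the second from the package manager:

```shell
# Sample kernel versions (illustrative).
running='5.15.0-105-generic'
installed='5.15.0-112-generic'

needs_reboot() {
  # Fires when the running kernel differs from the newest installed one.
  [ "$1" != "$2" ]
}

if needs_reboot "$running" "$installed"; then
  echo "ALERT: running $running but $installed is installed"
fi
```

On Debian/Ubuntu the presence of /var/run/reboot-required is an additional, simpler signal.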

35. unattended_upgrades_disabled

Category: Security Severity: Warning Default: detects disabled automatic security updates

What it means

Automatic security updates are not configured. On Debian/Ubuntu, Crucible checks whether the unattended-upgrades package is installed and enabled. On RHEL-based systems, it checks for dnf-automatic.

Why it matters

Without automatic security updates, critical patches sit uninstalled until someone manually runs the update. For servers that are not actively maintained, this can leave known vulnerabilities open for weeks or months.

What to do

  • Install and enable automatic updates:
    sudo apt install unattended-upgrades
    sudo dpkg-reconfigure -plow unattended-upgrades
  • Or on RHEL: sudo dnf install dnf-automatic && sudo systemctl enable --now dnf-automatic.timer
  • If you prefer manual updates, you can disable this rule.

Configuration

alerts:
  unattended_upgrades_disabled:
    enabled: true

Service Health rules (3)

36. systemd_service_failed

Category: Service Health Severity: Warning Default: any failed systemd service

What it means

One or more systemd services have entered the "failed" state. Crucible runs systemctl list-units --state=failed on each collection cycle and reports any units in that state.

Why it matters

Failed services may include databases, web servers, monitoring agents, or critical system daemons. A service in the failed state is not running and will not restart automatically unless configured to do so. Operators often do not notice failed services until users report problems.

What to do

  • List failed services: systemctl list-units --state=failed
  • Check the service logs: journalctl -u service-name -e --no-pager
  • Attempt a restart: sudo systemctl restart service-name
  • If the service fails repeatedly, check its configuration and dependencies.
  • For services you intentionally disabled, add them to the ignore list.

Configuration

alerts:
  systemd_service_failed:
    enabled: true
    ignore_services:
      - bluetooth.service   # ignore services that are not relevant
      - ModemManager.service
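
The ignore list acts as a filter over the failed units. A minimal sketch (the sample failed units and the `alertable_units` helper are illustrative):

```shell
# Sample failed units (illustrative) and an ignore list like the one above.
sample_failed='nginx.service
bluetooth.service
postgresql.service'
ignore='bluetooth.service ModemManager.service'

alertable_units() {
  # Emit every failed unit not present in the ignore list.
  echo "$1" | while read -r unit; do
    case " $ignore " in
      *" $unit "*) ;;          # ignored: skip
      *) echo "$unit" ;;
    esac
  done
}

alertable_units "$sample_failed"
```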

37. fd_exhaustion

Category: Service Health Severity: Warning (80%), Critical (95%) Default threshold: 80%

What it means

The system's file descriptor usage has exceeded the configured percentage of the maximum allowed. Crucible reads /proc/sys/fs/file-nr to get the current allocation and the system-wide limit.

Why it matters

File descriptors are used for open files, sockets, pipes, and other I/O handles. When the system runs out of file descriptors, processes cannot open new files or establish new network connections. This causes cascading failures: databases refuse connections, web servers return errors, and logging stops working.

What to do

  • Check current usage: cat /proc/sys/fs/file-nr (allocated, unused, max)
  • Find processes with many open FDs: for pid in /proc/[0-9]*; do echo "$(ls "$pid/fd" 2>/dev/null | wc -l) $(cat "$pid/comm" 2>/dev/null)"; done | sort -rn | head -20
  • Increase the system limit temporarily: sysctl -w fs.file-max=1048576
  • Make it permanent in /etc/sysctl.d/99-file-max.conf
  • Check per-process limits with cat /proc/PID/limits and adjust with systemd LimitNOFILE=.
  • Investigate if a process is leaking file descriptors (opening without closing).

Configuration

alerts:
  fd_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95
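
The percentage is computed from the first and third fields of /proc/sys/fs/file-nr. A sketch with sample contents (the values and the `fd_used_percent` helper are illustrative; real maxima are much larger):

```shell
# Sample /proc/sys/fs/file-nr contents: allocated, unused, max.
sample_file_nr='85000 0 100000'

fd_used_percent() {
  # allocated / max, as an integer percentage.
  echo "$1" | awk '{printf "%d\n", $1 * 100 / $3}'
}

pct=$(fd_used_percent "$sample_file_nr")
if [ "$pct" -ge 80 ]; then
  echo "ALERT: file descriptor usage at ${pct}%"
fi
```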

38. server_unreachable

Category: Service Health Severity: Critical Priority: P1 Urgent

What it means

The server has stopped sending snapshots to Forge. Crucible is an agent-based collector; if the server goes down, the agent goes down with it and Forge stops receiving data. This rule runs server-side on a schedule (every 2 minutes), not as part of the snapshot evaluation.

Why it matters

A server that stops reporting may be down, rebooting, or have a crashed Crucible service. Without this rule, the only signal would be the "Last seen X minutes ago" label on the dashboard, which is easy to miss.

How it works

  • Threshold: 2x the server's collection interval (default 300s, so 10 minutes).
  • Scales with custom intervals: if a server pushes every 600s, the threshold is 20 minutes.
  • Onboarding grace: servers younger than 10 minutes never fire this alert.
  • Servers that have never sent a snapshot are not alerted on.
  • Auto-resolves when the server sends its next snapshot.
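
The threshold and grace logic above can be sketched as follows. The variable names, the `unreachable` helper, and the sample values are illustrative; the real check runs server-side in Forge:

```shell
# Sample inputs (illustrative).
interval=300        # server's collection interval in seconds
last_seen_age=720   # seconds since the last snapshot arrived
server_age=86400    # seconds since the server was first registered

unreachable() {
  # Threshold is 2x the collection interval.
  threshold=$((2 * $1))
  # Onboarding grace: servers younger than 10 minutes never fire.
  [ "$3" -ge 600 ] && [ "$2" -gt "$threshold" ]
}

if unreachable "$interval" "$last_seen_age" "$server_age"; then
  echo "ALERT: no snapshot for ${last_seen_age}s"
fi
```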

What to do

  • Check if the server is reachable: ping {server_ip}
  • If reachable, check Crucible: ssh {server} sudo systemctl status glassmkr-crucible
  • Check logs: ssh {server} sudo journalctl -u glassmkr-crucible -n 20 --no-pager
  • If not reachable, check your hosting panel for IPMI or KVM access.

Global alert settings

These settings apply to all alert rules and can be set in the configuration file or the dashboard:

alerts:
  global:
    cooldown: 3600          # seconds between repeated notifications for the same alert
    resolve_notify: true    # send a notification when an alert resolves
    channels:
      - telegram
      - email
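
The cooldown setting gates repeat notifications for an alert that stays firing. A sketch of that gate (the `should_notify` helper and its argument are illustrative, not Forge's scheduler):

```shell
cooldown=3600   # seconds, as in the global config above

should_notify() {
  # arg: seconds since the last notification for this alert
  [ "$1" -ge "$cooldown" ]
}

if should_notify 5400; then
  echo "notify again"
fi
```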