Alert Rules
Forge ships with 38 built-in alert rules covering OS health, storage, networking, hardware, ZFS, security, and service health. Each rule is evaluated on every metrics push (default: every 300 seconds / 5 minutes). When a threshold is crossed, Forge fires a notification to all configured channels.
You can override default thresholds per server in /etc/glassmkr/collector.yaml or globally in the Forge dashboard under Settings > Alert Defaults.
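For instance, a per-server override in /etc/glassmkr/collector.yaml could look like the following sketch (the rule names and keys mirror the per-rule Configuration examples later on this page; the values are illustrative, not recommendations):

```yaml
# Illustrative per-server threshold overrides.
alerts:
  ram_high:
    threshold: 95        # this server runs a memory-hungry cache on purpose
  disk_space_high:
    threshold: 85
    critical_threshold: 92
```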
Table of contents
- Rule categories
- Alert priorities (P1-P4)
- Alert muting
- Alert tabs
- OS rules (9)
- Storage rules (8)
- Network rules (5)
- Hardware / IPMI rules (5)
- ZFS rules (2)
- Security rules (6)
- Service Health rules (3)
- Global alert settings
Rule categories
| Category | Count | Rules |
|---|---|---|
| OS | 9 | ram_high, cpu_high, load_high, cpu_iowait_high, oom_kills, clock_drift, swap_high, ntp_not_synced, unexpected_reboot |
| Storage | 8 | disk_space_high, smart_failing, nvme_wear_high, raid_degraded, disk_latency_high, filesystem_readonly, inode_high, disk_io_errors |
| Network | 5 | interface_errors, link_speed_mismatch, interface_saturation, conntrack_exhaustion, bond_slave_down |
| Hardware / IPMI | 5 | cpu_temperature_high, ecc_errors, psu_redundancy_loss, ipmi_sel_critical, ipmi_fan_failure |
| ZFS | 2 | zfs_pool_unhealthy, zfs_scrub_errors |
| Security | 6 | ssh_root_password, no_firewall, pending_security_updates, kernel_vulnerabilities, kernel_needs_reboot, unattended_upgrades_disabled |
| Service Health | 3 | systemd_service_failed, fd_exhaustion, server_unreachable |
Alert priorities (P1-P4)
Every alert is assigned a priority level based on its severity and urgency. Priority badges appear on alert cards in the dashboard and in notification messages.
| Priority | Meaning | Examples |
|---|---|---|
| P1 | Critical, immediate action required. Data loss or service outage is imminent or occurring. | raid_degraded, smart_failing, oom_kills, ecc_errors (uncorrectable) |
| P2 | High, action needed soon. Significant degradation or risk. | disk_space_high (critical threshold), cpu_temperature_high (critical), psu_redundancy_loss |
| P3 | Medium, investigate when convenient. Performance impact or early warning. | ram_high, cpu_high, disk_latency_high, inode_high |
| P4 | Low, informational. Proactive recommendations. | pending_security_updates, unattended_upgrades_disabled, nvme_wear_high |
Alert cards in the dashboard show the priority badge (P1-P4), a one-line summary, evidence links to relevant charts, and copy-pasteable fix commands you can run on the server.
Alert muting
You can mute specific alert rules on a per-server basis. Muted rules stop firing and stop sending notifications for that server. This is useful during maintenance windows or when a known condition is expected.
To mute a rule, go to the server detail page, open the Alerts tab, and click the mute icon next to the rule. You can also mute rules via the API or in the configuration file:
```yaml
muted_rules:
  - disk_space_high   # mute during disk migration
  - cpu_iowait_high   # mute during RAID rebuild
```
Muted rules are re-evaluated on the next ingest cycle after unmuting. They do not fire retroactively for conditions that occurred while muted.
Alert tabs
The server detail page provides three alert tabs for filtering:
- Active: alerts currently firing. These need attention.
- Acknowledged: alerts that have been acknowledged but not yet resolved. Notifications are silenced.
- All: complete alert history including resolved alerts, filterable by date range and rule.
OS rules (9)
1. ram_high
What it means
The server's physical RAM usage has exceeded the configured threshold. This is calculated as (total - available) / total * 100, where "available" includes buffers and cache that the kernel can reclaim under pressure.
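The calculation can be sketched with awk against sample values (illustrative numbers; on a live server you would read /proc/meminfo itself):

```shell
# RAM usage as the rule computes it: (total - available) / total * 100.
# The heredoc stands in for a real /proc/meminfo (values in kB).
awk '
  /^MemTotal:/     { total = $2 }
  /^MemAvailable:/ { avail = $2 }
  END              { printf "%.1f\n", (total - avail) / total * 100 }
' <<'EOF'
MemTotal:       16384000 kB
MemAvailable:    1638400 kB
EOF
```

With these sample values the script prints `90.0`, which would sit right at the default `threshold: 90`.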
Why it matters
Sustained high memory usage leaves little headroom for traffic spikes or new processes. If RAM fills completely, the Linux OOM killer will start terminating processes, potentially taking down critical services.
What to do
- Identify the top memory consumers: `ps aux --sort=-%mem | head -20`
- Check for memory leaks in long-running processes by comparing RSS over time.
- Consider adding swap as a safety net (though swap is not a substitute for adequate RAM).
- If usage is consistently high, upgrade the server's memory or redistribute workloads.
Configuration
```yaml
alerts:
  ram_high:
    enabled: true
    threshold: 90
    duration: 300  # seconds the condition must persist before firing
```

2. cpu_high
What it means
The aggregate CPU utilization (user + system + iowait) has exceeded the threshold for the configured duration. On servers with per-core monitoring enabled (Crucible 0.3.0+), the alert also reports which cores are saturated.
Why it matters
Sustained high CPU usage means the server is at capacity. New requests queue, response times increase, and background tasks (cron jobs, log rotation) may not complete on time. If steal time is also high, the hypervisor is overcommitting CPU resources.
What to do
- Identify CPU-heavy processes: `top -bn1 | head -20`
- Check per-core usage in the Forge dashboard to see if the load is evenly distributed or pinned to specific cores.
- Look for runaway processes or infinite loops.
- Consider scaling horizontally or upgrading CPU resources.
Configuration
```yaml
alerts:
  cpu_high:
    enabled: true
    threshold: 90
    duration: 300
```

3. cpu_iowait_high
What it means
The percentage of CPU time spent waiting for I/O operations to complete has exceeded the threshold. High iowait indicates that the CPU is idle because it is waiting for disk or network I/O.
Why it matters
Elevated iowait is a strong signal that storage is the bottleneck. Applications that depend on disk reads or writes will experience increased latency. This often correlates with slow database queries, sluggish log processing, or degraded RAID rebuilds.
What to do
- Identify processes generating I/O: `iotop -oP`
- Check disk latency with `iostat -x 1 5` and look at the `await` column.
- If a RAID array is rebuilding, iowait is expected and will resolve on its own.
- Consider moving heavy I/O workloads to faster storage (NVMe).
- Tune the I/O scheduler or increase the filesystem's commit interval for write-heavy workloads.
Configuration
```yaml
alerts:
  cpu_iowait_high:
    enabled: true
    threshold: 20
    duration: 180
```

4. oom_kills
What it means
The Linux kernel's Out-of-Memory killer has terminated one or more processes since the last check. Crucible reads this from /proc/vmstat (the oom_kill counter) and from kernel log messages.
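A minimal sketch of reading that counter (a sample /proc/vmstat excerpt stands in for the real file; the alert fires when the value increases between snapshots):

```shell
# Extract the cumulative oom_kill counter from (sample) /proc/vmstat output.
awk '/^oom_kill /{ print $2 }' <<'EOF'
nr_free_pages 123456
oom_kill 3
EOF
```

This prints `3`; Crucible compares the value against the previous snapshot and alerts on any increase.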
Why it matters
OOM kills mean the server ran out of memory and the kernel had to sacrifice processes to keep the system alive. The killed process may be your database, web server, or another critical service. OOM events frequently cause cascading failures.
What to do
- Check which process was killed: `dmesg | grep -i "oom-killer"`
- Review memory usage trends in the Forge dashboard to identify the growth pattern.
- Set memory limits on containers or systemd services using `MemoryMax=` to prevent a single process from consuming all RAM.
- Add or increase swap as a safety buffer.
- If OOM kills recur, the server needs more RAM or the workload needs to be reduced.
Configuration
```yaml
alerts:
  oom_kills:
    enabled: true
    threshold: 1  # number of new OOM kills to trigger
```

5. load_high
What it means
The system's 5-minute load average has exceeded the threshold, which defaults to twice the number of CPU cores. A load average above the core count means processes are waiting for CPU time.
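With the default auto threshold, the effective limit works out like this (a sketch; `cores=8` stands in for `$(nproc)` so the numbers are deterministic):

```shell
# Auto-threshold sketch: a configured value of 0 means "2x core count",
# any other value is used as-is.
configured=0
cores=8            # stand-in for $(nproc)
if [ "$configured" -eq 0 ]; then
  effective=$((2 * cores))
else
  effective=$configured
fi
echo "$effective"  # prints 16
```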
Why it matters
High load averages cause increased latency for all processes. Unlike CPU percentage, load average counts processes waiting for both CPU and I/O, so it captures bottlenecks that pure CPU metrics miss.
What to do
- Check current load and CPU count: `uptime` and `nproc`
- Identify processes in D state (uninterruptible sleep, usually I/O): `ps aux | awk '$8 ~ /D/'`
- If load is high but CPU usage is low, the bottleneck is likely disk I/O. Check with `iostat -x 1 5`.
- If load is high and CPU is also high, the server is CPU-bound. Reduce workload or add capacity.
Configuration
```yaml
alerts:
  load_high:
    enabled: true
    threshold: 0  # 0 = auto (2x core count). Set a fixed number to override.
    duration: 300
```

6. clock_drift
What it means
The system clock has drifted more than the configured threshold from the expected time. Crucible compares the local clock against NTP reference data from timedatectl or chronyc.
Why it matters
Clock drift breaks TLS certificate validation, causes log timestamps to be unreliable, desynchronizes distributed systems (databases, consensus protocols), and can cause authentication failures with time-sensitive tokens (TOTP, Kerberos). Even small drifts compound over time if NTP is misconfigured.
What to do
- Check current drift: `timedatectl status` or `chronyc tracking`
- Verify NTP is running: `systemctl status chronyd` or `systemctl status systemd-timesyncd`
- Force a sync: `chronyc makestep` or `timedatectl set-ntp true`
- Check that NTP servers are reachable from the server's network.
Configuration
```yaml
alerts:
  clock_drift:
    enabled: true
    threshold: 500  # milliseconds
```

7. swap_high
What it means
Swap space usage has exceeded the configured threshold. Crucible reads swap usage from /proc/meminfo. High swap usage means the system is actively paging memory to disk.
Why it matters
Swap exists as a safety net, not as a primary memory source. When a server is actively swapping, performance degrades significantly because disk I/O is orders of magnitude slower than RAM access. Database queries slow down, application response times spike, and the system can enter a thrashing state where it spends more time swapping than doing useful work.
What to do
- Check swap usage: `free -h` and `swapon --show`
- Identify processes using swap: `for f in /proc/*/status; do awk '/VmSwap/{swap=$2} /Name/{name=$2} END{if(swap>0) print swap,name}' "$f" 2>/dev/null; done | sort -rn | head -20`
- Check if RAM is the bottleneck: review memory usage trends in the Forge dashboard.
- If swap usage is sustained, the server likely needs more RAM or the workload needs to be reduced.
Configuration
```yaml
alerts:
  swap_high:
    enabled: true
    threshold: 80  # percentage of total swap
```

8. ntp_not_synced
What it means
The system's NTP synchronization is not active. Crucible checks timedatectl for "NTP synchronized: yes" and verifies that an NTP daemon (chrony, ntpd, or systemd-timesyncd) is running.
Why it matters
Without active NTP synchronization, the system clock will drift over time. Hardware clocks are imprecise and can drift seconds per day. This leads to the same issues as clock_drift but is a more fundamental problem: the server has no mechanism to correct its time at all.
What to do
- Check NTP status: `timedatectl status`
- Enable time sync: `sudo timedatectl set-ntp true`
- If using chrony: `sudo systemctl enable --now chronyd`
- If using systemd-timesyncd: `sudo systemctl enable --now systemd-timesyncd`
- Verify NTP servers are configured in `/etc/chrony.conf` or `/etc/systemd/timesyncd.conf`.
Configuration
```yaml
alerts:
  ntp_not_synced:
    enabled: true
```

9. unexpected_reboot
What it means
The server's uptime has decreased since the last snapshot, indicating a reboot occurred between collection intervals. Crucible detects this by comparing the current uptime against the previous snapshot's uptime value.
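The detection logic amounts to a single comparison between snapshots (illustrative values):

```shell
# Reboot detection sketch: current uptime lower than the previous
# snapshot's uptime implies a reboot happened in between.
prev_uptime=864000   # previous snapshot: 10 days, in seconds
curr_uptime=420      # current snapshot: 7 minutes
if [ "$curr_uptime" -lt "$prev_uptime" ]; then
  echo "reboot detected"
fi
```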
Why it matters
Unexpected reboots can indicate hardware instability (kernel panics, power loss, watchdog timer expiry), firmware issues, or someone rebooting the server without coordination. Even planned reboots should be tracked for audit purposes. Repeated unexpected reboots are a strong signal of a failing component.
What to do
- Check the reboot cause: `last reboot` and `journalctl --boot=-1 -e`
- Check for kernel panics: `dmesg | grep -i panic`
- Check IPMI SEL for power events: `ipmitool sel list`
- If reboots recur, investigate hardware (PSU, memory, thermal shutdown) and check for watchdog timer kills.
Configuration
```yaml
alerts:
  unexpected_reboot:
    enabled: true
    # Triggers when uptime decreases between consecutive snapshots
```

Storage rules (8)
10. disk_space_high
What it means
A mounted filesystem has exceeded the configured disk usage threshold. Forge monitors all mounted filesystems except tmpfs, devtmpfs, and other virtual mounts.
Why it matters
When a filesystem fills to 100%, writes fail. This can crash databases, corrupt logs, prevent SSH logins (if /var or /tmp are full), and make the server difficult to recover remotely. The reserved blocks for root (typically 5% on ext4) provide a small buffer but are not a long-term solution.
What to do
- Find large files: `du -h --max-depth=2 /var | sort -hr | head -20`
- Clean up old logs: `journalctl --vacuum-time=7d`
- Remove old package caches: `apt clean` or `dnf clean all`
- Check for core dumps or stale temporary files in /tmp and /var/tmp.
- If the filesystem is consistently near capacity, expand the volume or move data to a larger disk.
Configuration
```yaml
alerts:
  disk_space_high:
    enabled: true
    threshold: 90
    critical_threshold: 95
    exclude_mounts:
      - /mnt/backup  # ignore specific mount points
```

11. smart_failing
What it means
A disk's SMART self-assessment has reported a failing status, or one or more critical SMART attributes (Reallocated Sector Count, Current Pending Sector, Offline Uncorrectable) have crossed their vendor-defined thresholds. Crucible uses smartctl to read these values. The dashboard displays the drive model name, power-on days, reallocated sector count, and temperature.
Why it matters
SMART failures are a strong predictor of imminent disk failure. A disk reporting "FAILING" can die within hours or weeks. Data loss is a real risk, especially if no RAID or backup is in place.
What to do
- Check the SMART report: `smartctl -a /dev/sdX`
- Back up the disk immediately if backups are not current.
- If the disk is part of a RAID array, replace it as soon as possible and let the array rebuild.
- Order a replacement drive. Do not wait for the disk to fail completely.
- If you are in a data center, open a hardware ticket with your provider.
Configuration
```yaml
alerts:
  smart_failing:
    enabled: true
    # No threshold - any SMART failure triggers this alert
    ignore_disks:
      - /dev/sda  # optionally ignore specific disks
```

12. nvme_wear_high
What it means
An NVMe drive's "Percentage Used" indicator (from the NVMe health log) has exceeded the threshold. This value estimates how much of the drive's rated write endurance has been consumed. A value of 100% means the drive has reached its rated endurance, though many drives continue operating beyond this point.
Why it matters
NVMe flash cells have a finite number of program/erase cycles. As wear increases, the drive's internal spare cells are consumed. Eventually the drive will transition to read-only mode or fail entirely. Planning a replacement before 100% wear avoids unexpected downtime.
What to do
- Check current wear: `smartctl -a /dev/nvme0 | grep "Percentage Used"`
- Review Data Units Written to estimate remaining lifespan based on your write rate.
- If wear is above 90%, order a replacement drive and schedule a migration.
- Reduce unnecessary writes (disable access time updates with `noatime`, move logs to a different drive).
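The lifespan estimate mentioned above can be roughed out from two wear readings taken some days apart (illustrative numbers; this back-of-the-envelope calculation is not something Forge performs for you):

```shell
# Rough remaining-lifespan estimate from two "Percentage Used" readings:
# 78% -> 80% over 30 days = 0.0667 %/day, so ~300 days until 100%.
awk -v w1=78 -v w2=80 -v days=30 'BEGIN {
  rate = (w2 - w1) / days            # percent of endurance used per day
  printf "%.0f\n", (100 - w2) / rate # days until rated endurance is reached
}'
```

This prints `300`. Remember that many drives keep working past 100%, but warranty and reliability guarantees generally end there.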
Configuration
```yaml
alerts:
  nvme_wear_high:
    enabled: true
    threshold: 80  # percentage used
```

13. disk_latency_high
What it means
The average I/O latency for a block device has exceeded the threshold. Crucible measures this from /sys/block/*/stat by computing the average time per completed I/O operation over the collection interval.
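The measurement can be sketched from deltas of the completion and tick counters in /sys/block/\<dev\>/stat (tick fields are in milliseconds per the kernel's block stat documentation; the deltas below are sample values, not live readings):

```shell
# Average ms per completed I/O over one interval:
# (delta read ticks + delta write ticks) / (delta reads + delta writes)
awk -v d_reads=1000 -v d_writes=500 \
    -v d_read_ms=12000 -v d_write_ms=18000 'BEGIN {
  printf "%.1f\n", (d_read_ms + d_write_ms) / (d_reads + d_writes)
}'
```

With these sample deltas the average is `20.0` ms, which would be fine for a spinning disk but alarming for an NVMe drive.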
Why it matters
High disk latency directly impacts application performance. Database queries slow down, file operations block, and services become unresponsive. For NVMe drives, latency should typically be under 1 ms. For SATA SSDs, under 5 ms. For spinning disks, under 20 ms. Anything above 50 ms is a clear sign of trouble.
What to do
- Check per-device latency: `iostat -x 1 5` (look at `await`).
- Identify I/O-heavy processes: `iotop -oP`
- If the disk is healthy, latency may be caused by I/O saturation. Reduce concurrent I/O or upgrade to faster storage.
- Check if a RAID rebuild or filesystem check is running in the background.
- If latency is intermittent, check SMART data for signs of failing hardware.
Configuration
```yaml
alerts:
  disk_latency_high:
    enabled: true
    threshold: 50  # milliseconds
    duration: 120
    exclude_devices:
      - loop0
      - loop1
```

14. disk_io_errors
What it means
Kernel-level I/O errors have been reported in dmesg or syslog. These indicate hardware-level read/write failures that the drive's firmware could not recover from.
Why it matters
Kernel I/O errors are a strong signal of imminent drive failure. Unlike SMART warnings which are predictive, I/O errors mean data operations are already failing. Applications may experience silent corruption.
What to do
- Check `dmesg | grep -i "i/o error"` for the affected device.
- Run `smartctl -a /dev/sdX` for the device mentioned in the errors.
- Back up data from the affected device immediately.
- Schedule drive replacement.
Configuration
```yaml
alerts:
  disk_io_errors:
    enabled: true
    # Triggers on any kernel I/O error in the collection interval
```

15. filesystem_readonly
What it means
A filesystem that should be read-write has been remounted as read-only by the kernel. This typically happens when the kernel detects filesystem corruption or I/O errors and remounts the filesystem to prevent further damage.
Why it matters
A read-only filesystem means all write operations fail. Applications crash, logs stop writing, and databases become unavailable. This is usually a sign of underlying hardware failure or filesystem corruption.
What to do
- Check mount options: `mount | grep "ro,"`
- Check kernel logs for the cause: `dmesg | grep -i "remount\|error\|readonly"`
- If caused by disk errors, check SMART data and plan a replacement.
- If the filesystem is corrupted, run `fsck` from a rescue environment.
Configuration
```yaml
alerts:
  filesystem_readonly:
    enabled: true
    exclude_mounts:
      - /mnt/cdrom  # ignore intentionally read-only mounts
```

16. inode_high
What it means
A filesystem's inode usage has exceeded the threshold. Inodes track file metadata; when they run out, no new files can be created even if free space remains.
Why it matters
Inode exhaustion is a subtle failure mode. Disk usage may show plenty of free space, but the server cannot create new files. This breaks log rotation, temp file creation, and application writes. It is common on filesystems with many small files (mail spools, cache directories, container layers).
What to do
- Check inode usage: `df -i`
- Find directories with many small files: `find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20`
- Clean up unnecessary small files (session files, cache entries, old mail).
- If the filesystem was created with too few inodes, it must be reformatted with a higher inode ratio.
Configuration
```yaml
alerts:
  inode_high:
    enabled: true
    threshold: 90
    exclude_mounts: []
```

17. raid_degraded
What it means
A software RAID array (mdadm) or hardware RAID controller has reported a degraded state. This means one or more member disks have failed or been removed from the array. Crucible reads /proc/mdstat for software RAID and uses vendor tools (MegaCLI, storcli) for hardware RAID when available.
Why it matters
A degraded array has lost its redundancy. If another disk fails before the array is rebuilt, data loss is likely (or certain, depending on the RAID level). RAID 1 with one failed disk has zero redundancy. RAID 5 with one failed disk cannot survive another failure. RAID 6 with one failed disk is reduced to RAID 5 levels of protection.
What to do
- Identify the failed disk: `cat /proc/mdstat` or `mdadm --detail /dev/md0`
- Replace the failed disk as soon as possible.
- Add the replacement to the array: `mdadm --add /dev/md0 /dev/sdX`
- Monitor the rebuild progress: `watch cat /proc/mdstat`
- Avoid heavy I/O during the rebuild to speed up reconstruction.
Configuration
```yaml
alerts:
  raid_degraded:
    enabled: true
    # No threshold - any degradation triggers this alert
    arrays:
      - /dev/md0
      - /dev/md1
```

Network rules (5)
18. interface_errors
What it means
A network interface is reporting errors (RX errors, TX errors, drops, or overruns) above the threshold rate. Crucible reads these counters from /sys/class/net/*/statistics/.
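Converting two counter samples into the errors-per-minute rate that the threshold uses can be sketched as follows (sample values shown; on a live server the counters come from files such as /sys/class/net/eth0/statistics/rx_errors):

```shell
# Errors per minute from two counter samples taken one interval apart.
prev=1200      # counter at previous snapshot
curr=1260      # counter now
interval=300   # seconds between samples (the default push interval)
awk -v p="$prev" -v c="$curr" -v s="$interval" \
  'BEGIN { printf "%.0f\n", (c - p) / (s / 60) }'
```

This prints `12` errors/minute, which would exceed the default threshold of 10.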
Why it matters
Network errors cause packet retransmissions, increased latency, and reduced throughput. Persistent errors often indicate a hardware problem: a bad cable, a failing NIC, or a misconfigured switch port. Drops can also be caused by receive buffer exhaustion under high traffic.
What to do
- Check error counters: `ip -s link show eth0`
- Inspect the cable and SFP modules. Reseat connections.
- Check switch port counters and logs for CRC errors or alignment errors.
- Increase ring buffer sizes: `ethtool -G eth0 rx 4096 tx 4096`
- If the NIC is faulty, replace it.
Configuration
```yaml
alerts:
  interface_errors:
    enabled: true
    threshold: 10  # errors per minute
    exclude_interfaces:
      - lo
      - docker0
```

19. link_speed_mismatch
What it means
A network interface is operating at a lower speed than expected. For example, a 10 Gbps NIC negotiating at 1 Gbps. Crucible reads the link speed from /sys/class/net/*/speed and compares it against the configured expected speed.
Why it matters
A link speed mismatch means you are getting a fraction of the bandwidth you are paying for or that your network design requires. This is usually caused by a bad cable, a damaged SFP module, or a switch port that auto-negotiated to a lower speed.
What to do
- Check current link speed: `ethtool eth0 | grep Speed`
- Reseat the cable and SFP module.
- Try a different cable. Cat5e is not rated for 10 Gbps; use Cat6a or fiber.
- Check the switch port configuration. Force the expected speed if auto-negotiation is failing.
- If the NIC supports multiple speeds, verify the firmware is up to date.
Configuration
```yaml
alerts:
  link_speed_mismatch:
    enabled: true
    interfaces:
      eth0:
        expected_speed: 10000  # Mbps
      eth1:
        expected_speed: 1000
```

20. interface_saturation
What it means
A network interface's throughput has exceeded the configured percentage of its link speed. Crucible measures bytes transmitted and received over the collection interval and compares the rate to the interface's reported link speed.
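The utilization calculation can be sketched like this (illustrative numbers: a 1 Gbit/s link moving about 31.25 GB in one 300-second interval):

```shell
# Link utilization: bytes moved over the interval vs. link capacity.
awk -v bytes=31250000000 -v seconds=300 -v speed_mbps=1000 'BEGIN {
  bits_per_sec = bytes * 8 / seconds
  printf "%.0f\n", bits_per_sec / (speed_mbps * 1000000) * 100
}'
```

This prints `83`, i.e. about 83% utilization, which would trip the default 80% threshold.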
Why it matters
A saturated network link causes packet queuing, increased latency, and dropped packets. Services that depend on network throughput (file servers, databases with replication, backup jobs) will degrade. Saturation at 80% is a warning because TCP throughput collapses well before reaching 100% utilization due to protocol overhead and buffering.
What to do
- Identify traffic sources: `iftop -i eth0` or `nload eth0`
- Check if a backup job or large transfer is running.
- Implement traffic shaping or QoS to prioritize critical traffic.
- Consider bonding multiple interfaces or upgrading to a faster link.
- Move bulk transfers to off-peak hours.
Configuration
```yaml
alerts:
  interface_saturation:
    enabled: true
    threshold: 80  # percentage of link speed
    duration: 60
    exclude_interfaces:
      - lo
```

21. conntrack_exhaustion
What it means
The kernel's connection tracking (conntrack) table is approaching capacity. Crucible reads /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max to calculate the usage percentage.
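The usage percentage is a straightforward ratio (sample values shown; on a live server you would read the two /proc files directly):

```shell
# Conntrack table usage, with sample values standing in for
# nf_conntrack_count and nf_conntrack_max.
count=209715
max=262144
awk -v c="$count" -v m="$max" 'BEGIN { printf "%.0f\n", c / m * 100 }'
```

This prints `80`, right at the default warning threshold.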
Why it matters
When the conntrack table fills up, the kernel drops new connections silently. This affects all stateful firewall rules (iptables, nftables) and NAT. Services appear unreachable, but the server looks healthy otherwise. This is a common failure mode on busy NAT gateways, load balancers, and servers with many short-lived connections.
What to do
- Check current usage: `cat /proc/sys/net/netfilter/nf_conntrack_count` and `cat /proc/sys/net/netfilter/nf_conntrack_max`
- Increase the limit temporarily: `sysctl -w net.netfilter.nf_conntrack_max=262144`
- Make it permanent in `/etc/sysctl.d/99-conntrack.conf`
- Reduce timeouts for idle connections: `sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30`
- If the server does not need connection tracking, consider using stateless firewall rules.
Configuration
```yaml
alerts:
  conntrack_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95
```

22. bond_slave_down
What it means
A network interface that is part of a bond (e.g. bond0) has gone down. Crucible reads /sys/class/net/{iface}/operstate and detects bond membership from /proc/net/bonding/*. Requires Crucible 0.6.5 or newer.
Why it matters
Bond interfaces provide network redundancy. When one slave goes down, the bond continues working but with reduced capacity and no redundancy. A second failure would cause a full network outage. This is often caused by a failed cable, SFP transceiver, or switch port.
What to do
- Check bond status: `cat /proc/net/bonding/bond0`
- Check the slave interface: `ip link show enp1s0f0` and `ethtool enp1s0f0`
- Try bringing it back up: `sudo ip link set enp1s0f0 up`
- If the interface won't stay up, check the physical connection (cable, SFP, switch port).
- Check kernel messages: `dmesg -T | grep -i "enp1s0f0" | tail -10`
Hardware / IPMI rules (5)
23. cpu_temperature_high
What it means
The CPU package temperature has exceeded the threshold. Crucible reads temperatures from hwmon sensors via /sys/class/hwmon/ or from IPMI if available. Temperatures are displayed in degrees Celsius (e.g., 85 °C).
Why it matters
CPUs throttle their clock speed when they get too hot, which reduces performance. At extreme temperatures (above Tjunction max, typically 100-105 °C), the CPU will shut down to protect itself, causing an unclean server restart. Sustained high temperatures also reduce the CPU's lifespan.
What to do
- Check current temperatures: `sensors` (from the lm-sensors package).
- Verify that fans are running: `ipmitool sdr type Fan`
- Clean dust from heatsinks and fans.
- Check that the thermal paste between the CPU and heatsink is not dried out.
- If in a data center, check the room temperature and airflow. Verify hot/cold aisle separation.
- Reduce CPU load temporarily if temperatures are critical.
Configuration
```yaml
alerts:
  cpu_temperature_high:
    enabled: true
    threshold: 85  # warning threshold in Celsius
    critical_threshold: 95
    sensor: coretemp  # sensor driver name (auto-detected if omitted)
```

24. ecc_errors (correctable)
What it means
The server's ECC memory has reported new correctable single-bit errors. These are silently fixed by the ECC hardware but logged for monitoring. Crucible reads these from edac-util or /sys/devices/system/edac/mc/.
Why it matters
Occasional correctable ECC errors are normal over long periods. A sudden increase in correctable errors on a single DIMM often predicts imminent failure. Tracking the rate helps you plan proactive DIMM replacements.
What to do
- Check error counts: `edac-util -s` or `edac-util -v`
- Identify which DIMM is affected from the EDAC output (mc/csrow/channel).
- Monitor the rate. If errors are increasing, schedule a DIMM replacement.
- Run a memory test (`memtest86+`) during the next maintenance window.
Configuration
```yaml
alerts:
  ecc_errors:
    enabled: true
    threshold: 1  # new correctable errors to trigger warning
```

25. ecc_errors (uncorrectable)
What it means
The server's ECC memory has reported uncorrectable multi-bit errors. These cannot be repaired by ECC and may cause data corruption or application crashes.
Why it matters
Uncorrectable errors are serious. Corrupted data was delivered to the CPU, which can cause application crashes, data corruption, or silent data damage. This DIMM should be replaced immediately.
What to do
- Identify the affected DIMM: `edac-util -v`
- Replace the DIMM immediately.
- Check application data integrity, especially database checksums.
- Run `memtest86+` to confirm the diagnosis.
Configuration
```yaml
alerts:
  ecc_errors:
    critical_on_uncorrectable: true
```

26. psu_redundancy_loss
What it means
A redundant power supply unit has failed or been disconnected. Crucible detects this via IPMI sensors or by reading /sys/class/hwmon/ entries for power supply status. In a typical 1+1 redundant configuration, the server continues running on the remaining PSU, but it has lost its power redundancy.
Why it matters
Servers with redundant PSUs are designed to survive a single PSU failure. Once one PSU is down, you are running without a safety net. If the remaining PSU fails, the server goes down immediately with no graceful shutdown.
What to do
- Check PSU status: `ipmitool sdr type "Power Supply"`
- Verify that the failed PSU is receiving power (check the outlet and PDU).
- If the PSU has a fault LED, note the error pattern.
- Replace the failed PSU. Most servers support hot-swap PSU replacement.
- If in a data center, open a hardware ticket immediately.
Configuration
```yaml
alerts:
  psu_redundancy_loss:
    enabled: true
    # No threshold - any PSU failure triggers this alert
    source: ipmi  # ipmi or hwmon (auto-detected if omitted)
```

27. ipmi_fan_failure
What it means
An IPMI-monitored fan has stopped spinning or dropped below the minimum RPM threshold. Crucible reads fan RPM values from IPMI SDR records, and the dashboard displays fan speeds in RPM.
Why it matters
Fan failure leads to rising temperatures, which cause CPU throttling, component damage, and eventually thermal shutdown. In servers with redundant fans, a single failure reduces cooling capacity and puts stress on the remaining fans.
What to do
- Check fan status: `ipmitool sdr type Fan`
- Inspect the fan for physical damage or cable disconnection.
- If the server is in a data center, open a hardware ticket for fan replacement.
- Monitor CPU temperatures closely until the fan is replaced.
Configuration
```yaml
alerts:
  ipmi_fan_failure:
    enabled: true
    min_rpm: 500  # fans below this RPM are considered failed
```

28. ipmi_sel_critical
What it means
A critical event has been logged in the IPMI System Event Log (SEL). This includes events like machine check exceptions, PCI-E fatal errors, and power unit failures. Crucible reads the SEL via ipmitool sel list.
Why it matters
Critical SEL events indicate hardware-level problems that may not be visible through OS-level monitoring. Because the BMC logs these events independently of the operating system, they can surface failures the OS has no way to observe.
What to do
- Read the full SEL: `ipmitool sel list`
- Look up the specific event type in your server vendor's documentation.
- If the event indicates a component failure, schedule replacement.
- Clear the SEL after investigation: `ipmitool sel clear`
Configuration
```yaml
alerts:
  ipmi_sel_critical:
    enabled: true
    # Triggers on any critical-severity SEL event since last check
```

ZFS rules (2)
29. zfs_pool_unhealthy
What it means
A ZFS pool's health status is something other than ONLINE. This includes DEGRADED (redundancy lost), FAULTED (data loss possible), and UNAVAIL (pool cannot be accessed).
Why it matters
A non-ONLINE ZFS pool means either redundancy is lost (DEGRADED) or data may already be inaccessible (FAULTED/UNAVAIL). Immediate action is required to prevent data loss.
What to do
- Check pool status: `zpool status`
- If DEGRADED: identify the failed vdev and replace the drive with `zpool replace`
- If FAULTED: attempt `zpool clear`, then investigate the cause.
- Never reboot a FAULTED pool without understanding the failure first.
Configuration
```yaml
alerts:
  zfs_pool_unhealthy:
    enabled: true
    # Triggers when any zpool reports non-ONLINE state
```

30. zfs_scrub_errors
What it means
Checksum or data errors were found during ZFS scrub operations. ZFS scrubs verify every block of data against its checksum to detect silent data corruption (bit rot).
Why it matters
Scrub errors mean data on disk does not match its checksum. On redundant pools, ZFS auto-repairs from good copies. On non-redundant pools, this is data corruption. Either way, it signals failing hardware.
What to do
- Check scrub results: `zpool status -v`
- If on a mirror/raidz: ZFS auto-repaired. Identify the drive with errors and plan replacement.
- If on a single vdev: data corruption occurred. Restore affected files from backup.
- Run `smartctl -a` on the underlying device to check for hardware issues.
Configuration
```yaml
alerts:
  zfs_scrub_errors:
    enabled: true
    # Triggers when zpool scrub reports any errors
```

Security rules (6)
30. ssh_root_password
What it means
The SSH daemon is configured to allow root login with a password. Crucible checks /etc/ssh/sshd_config and fires when `PermitRootLogin yes` is set, or when `PermitRootLogin prohibit-password` is not set.
Why it matters
Root login via password is a common attack vector. Brute-force SSH attacks target root constantly. Key-based authentication is much more secure.
What to do
- Set `PermitRootLogin prohibit-password` in `/etc/ssh/sshd_config`.
- Ensure you have SSH key access before disabling password login.
- Restart SSH: `sudo systemctl restart sshd`
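The config check itself can be sketched in a few lines. This is a simplified illustration using a temp file in place of `/etc/ssh/sshd_config`; for an authoritative answer, `sshd -T` resolves Include files and defaults, so prefer it on a real host.

```shell
# Sketch: decide whether sshd allows root password login.
# A temp file stands in for /etc/ssh/sshd_config.
cfg=$(mktemp)
printf 'Port 22\nPermitRootLogin yes\n' > "$cfg"

# sshd honours the first occurrence; the upstream default is prohibit-password.
setting=$(awk 'tolower($1) == "permitrootlogin" { print $2; exit }' "$cfg")
if [ "${setting:-prohibit-password}" = "yes" ]; then
  echo "root password login ALLOWED"
else
  echo "root password login disabled"
fi
rm -f "$cfg"
```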
Configuration
```yaml
alerts:
  ssh_root_password:
    enabled: true
```
31. no_firewall
What it means
No active firewall was detected. Crucible checks for iptables rules, nftables, ufw, and firewalld. If all are empty or inactive, this alert fires.
Why it matters
A server without a firewall exposes all listening services to the internet. Even services bound to localhost can be exposed if a misconfiguration changes the bind address.
What to do
- Enable ufw: `sudo ufw default deny incoming && sudo ufw allow ssh && sudo ufw enable`
- Or configure iptables/nftables with appropriate rules for your services.
- If you use an external firewall (cloud security group), you can disable this rule.
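A detection pass like Crucible's can be approximated with a few probes. This is a simplified sketch and may differ from Crucible's actual logic; each probe degrades gracefully when the tool is not installed.

```shell
# Sketch: simplified firewall detection (ufw, firewalld, nftables).
# Assumption: this approximates, and may differ from, Crucible's real checks.
active=no
command -v ufw >/dev/null 2>&1 && ufw status 2>/dev/null | grep -q '^Status: active' && active=ufw
command -v firewall-cmd >/dev/null 2>&1 && firewall-cmd --state 2>/dev/null | grep -q '^running' && active=firewalld
[ "$active" = no ] && command -v nft >/dev/null 2>&1 && [ -n "$(nft list ruleset 2>/dev/null)" ] && active=nftables
echo "active firewall: $active"
```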
Configuration
```yaml
alerts:
  no_firewall:
    enabled: true
```
32. pending_security_updates
What it means
The package manager has pending security updates that have not been installed. Crucible checks apt (Debian/Ubuntu) or dnf (RHEL/Rocky/Alma) for available security patches.
Why it matters
Unpatched security vulnerabilities are one of the most common attack vectors. Security updates should be applied promptly, especially for internet-facing services.
What to do
- Review pending updates: `apt list --upgradable` or `dnf check-update --security`
- Apply security updates: `sudo apt upgrade` or `sudo dnf update --security`
- Consider enabling automatic security updates (see `unattended_upgrades_disabled` below).
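On Debian/Ubuntu, security updates can be counted by looking for the `-security` pocket in the upgradable list. A sketch, where the sample lines stand in for real `apt list --upgradable` output:

```shell
# Sketch: count security updates in `apt list --upgradable` output by matching
# the -security pocket. The sample lines are illustrative, not real output.
upgradable='openssl/jammy-security 3.0.2-0ubuntu1.10 amd64 [upgradable from: 3.0.2-0ubuntu1.9]
vim/jammy-updates 2:8.2.3995-1ubuntu2.9 amd64 [upgradable from: 2:8.2.3995-1ubuntu2.7]
libc6/jammy-security 2.35-0ubuntu3.4 amd64 [upgradable from: 2.35-0ubuntu3.1]'

# grep -c counts matching lines; -- guards the pattern's leading dash.
security_count=$(printf '%s\n' "$upgradable" | grep -c -- '-security')
echo "pending security updates: $security_count"
```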
Configuration
```yaml
alerts:
  pending_security_updates:
    enabled: true
```
33. kernel_vulnerabilities
What it means
The running kernel has known vulnerabilities that are mitigatable or patchable. Crucible checks /sys/devices/system/cpu/vulnerabilities/ for Spectre, Meltdown, and other CPU/kernel vulnerabilities.
Why it matters
Kernel vulnerabilities can allow privilege escalation, container escapes, or data leaks between processes. While some mitigations are applied automatically, others require a kernel update and reboot.
What to do
- Check vulnerability status: `grep . /sys/devices/system/cpu/vulnerabilities/*`
- Update the kernel: `sudo apt upgrade linux-image-generic` (Debian/Ubuntu) or `sudo dnf update kernel` (RHEL-based)
- Reboot to load the new kernel.
Configuration
```yaml
alerts:
  kernel_vulnerabilities:
    enabled: true
```
34. kernel_needs_reboot
What it means
A kernel update has been installed but the server is still running the old kernel. Crucible detects this by comparing the running kernel version against the installed version and by checking for /var/run/reboot-required.
Why it matters
Security patches in the new kernel are not active until the server reboots. The server remains vulnerable to patched exploits until the reboot occurs.
What to do
- Schedule a maintenance window and reboot the server.
- Verify the new kernel is running after reboot: `uname -r`
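The flag-file half of the detection is trivial to reproduce. A sketch where a temp file stands in for `/var/run/reboot-required`:

```shell
# Sketch: check the reboot-required flag, one of the two signals Crucible uses.
# A temp file stands in for /var/run/reboot-required.
flag=$(mktemp)
echo '*** System restart required ***' > "$flag"

if [ -f "$flag" ]; then
  msg="reboot pending"
else
  msg="no reboot pending"
fi
echo "$msg (running kernel: $(uname -r))"
rm -f "$flag"
```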
Configuration
```yaml
alerts:
  kernel_needs_reboot:
    enabled: true
```
35. unattended_upgrades_disabled
What it means
Automatic security updates are not configured. On Debian/Ubuntu, Crucible checks whether the unattended-upgrades package is installed and enabled. On RHEL-based systems, it checks for dnf-automatic.
Why it matters
Without automatic security updates, critical patches sit uninstalled until someone manually runs the update. For servers that are not actively maintained, this can leave known vulnerabilities open for weeks or months.
What to do
- Install and enable automatic updates: `sudo apt install unattended-upgrades && sudo dpkg-reconfigure -plow unattended-upgrades`
- Or on RHEL: `sudo dnf install dnf-automatic && sudo systemctl enable --now dnf-automatic.timer`
- If you prefer manual updates, you can disable this rule.
Configuration
```yaml
alerts:
  unattended_upgrades_disabled:
    enabled: true
```
Service Health rules (3)
36. systemd_service_failed
What it means
One or more systemd services have entered the "failed" state. Crucible runs systemctl list-units --state=failed on each collection cycle and reports any units that are not running as expected.
Why it matters
Failed services may include databases, web servers, monitoring agents, or critical system daemons. A service in the failed state is not running and will not restart automatically unless configured to do so. Operators often do not notice failed services until users report problems.
What to do
- List failed services: `systemctl list-units --state=failed`
- Check the service logs: `journalctl -u service-name -e --no-pager`
- Attempt a restart: `sudo systemctl restart service-name`
- If the service fails repeatedly, check its configuration and dependencies.
- For services you intentionally disabled, add them to the ignore list.
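Filtering failed units against an ignore list, as the `ignore_services` option does, can be sketched like this. The sample text stands in for `systemctl list-units --state=failed --plain --no-legend` output and is illustrative only.

```shell
# Sketch: extract failed unit names while honouring an ignore list.
# The sample stands in for `systemctl list-units --state=failed --plain --no-legend`.
failed_units='nginx.service loaded failed failed A high performance web server
bluetooth.service loaded failed failed Bluetooth service'
ignore='bluetooth.service ModemManager.service'

# Column 1 is the unit name; drop any unit on the ignore list.
alerts=$(printf '%s\n' "$failed_units" | awk -v ignore="$ignore" '
  BEGIN { split(ignore, skip, " ") }
  { keep = 1; for (i in skip) if ($1 == skip[i]) keep = 0; if (keep) print $1 }')
printf '%s\n' "$alerts"
```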
Configuration
```yaml
alerts:
  systemd_service_failed:
    enabled: true
    ignore_services:
      - bluetooth.service # ignore services that are not relevant
      - ModemManager.service
```
37. fd_exhaustion
What it means
The system's file descriptor usage has exceeded the configured percentage of the maximum allowed. Crucible reads /proc/sys/fs/file-nr to get the current allocation and the system-wide limit.
Why it matters
File descriptors are used for open files, sockets, pipes, and other I/O handles. When the system runs out of file descriptors, processes cannot open new files or establish new network connections. This causes cascading failures: databases refuse connections, web servers return errors, and logging stops working.
What to do
- Check current usage: `cat /proc/sys/fs/file-nr` (allocated, unused, max)
- Find processes with many open FDs: `for pid in /proc/[0-9]*; do echo "$(ls "$pid/fd" 2>/dev/null | wc -l) $(cat "$pid/comm" 2>/dev/null)"; done | sort -rn | head -20`
- Increase the system limit temporarily: `sysctl -w fs.file-max=1048576`
- Make it permanent in `/etc/sysctl.d/99-file-max.conf`.
- Check per-process limits with `cat /proc/PID/limits` and adjust with systemd `LimitNOFILE=`.
- Investigate whether a process is leaking file descriptors (opening without closing).
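The percentage the rule evaluates can be derived directly from the three-field format of `/proc/sys/fs/file-nr`. A sketch using a sample value in place of the live file:

```shell
# Sketch: compute file-descriptor usage from the "allocated unused max"
# format of /proc/sys/fs/file-nr. The sample value is illustrative;
# on a live host use: file_nr=$(cat /proc/sys/fs/file-nr)
file_nr='8544 0 1048576'

allocated=$(echo "$file_nr" | awk '{ print $1 }')
maximum=$(echo "$file_nr" | awk '{ print $3 }')
pct=$(( allocated * 100 / maximum ))

# Compare against the rule's thresholds (80 warn, 95 critical).
if [ "$pct" -ge 95 ]; then level=CRITICAL
elif [ "$pct" -ge 80 ]; then level=WARN
else level=OK
fi
echo "fd usage: ${pct}% (${allocated}/${maximum}) -> $level"
```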
Configuration
```yaml
alerts:
  fd_exhaustion:
    enabled: true
    threshold: 80
    critical_threshold: 95
```
38. server_unreachable
What it means
The server has stopped sending snapshots to Forge. Crucible is an agent-based collector; if the server goes down, the agent goes down with it and Forge stops receiving data. This rule runs server-side on a schedule (every 2 minutes), not as part of the snapshot evaluation.
Why it matters
A server that stops reporting may be down, rebooting, or have a crashed Crucible service. Without this rule, the only signal would be the "Last seen X minutes ago" label on the dashboard, which is easy to miss.
How it works
- Threshold: 2x the server's collection interval (default 300s, so 10 minutes).
- Scales with custom intervals: if a server pushes every 600s, the threshold is 20 minutes.
- Onboarding grace: servers younger than 10 minutes never fire this alert.
- Servers that have never sent a snapshot are not alerted on.
- Auto-resolves when the server sends its next snapshot.
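The threshold scaling described above is simple arithmetic. A sketch, where `interval_s` stands for the server's configured collection interval (300s by default):

```shell
# Sketch: the server-side staleness threshold is 2x the collection interval.
# interval_s would come from the server's config; 300s is the default.
interval_s=300
threshold_s=$(( interval_s * 2 ))
echo "alert if no snapshot for ${threshold_s}s ($(( threshold_s / 60 )) minutes)"
```

With a custom 600s interval, the same arithmetic yields the 20-minute threshold described above.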
What to do
- Check if the server is reachable: `ping {server_ip}`
- If reachable, check Crucible: `ssh {server} sudo systemctl status glassmkr-crucible`
- Check logs: `ssh {server} sudo journalctl -u glassmkr-crucible -n 20 --no-pager`
- If not reachable, check your hosting panel for IPMI or KVM access.
Global alert settings
These settings apply to all alert rules and can be set in the configuration file or the dashboard:
```yaml
alerts:
  global:
    cooldown: 3600        # seconds between repeated notifications for the same alert
    resolve_notify: true  # send a notification when an alert resolves
    channels:
      - telegram
      - email
```