Troubleshooting

This page covers common issues with the Crucible agent and Forge dashboard, along with step-by-step solutions.

Crucible service fails to start

Symptom: systemctl status crucible shows failed or inactive (dead).

Steps:

  1. Check the service logs:
    journalctl -u crucible --no-pager -n 50
  2. If you see config: parse error, validate the YAML:
    crucible config check
    Common YAML mistakes include using tabs instead of spaces, missing quotes around strings with special characters, and incorrect indentation.
  3. If you see permission denied, ensure the configuration file is readable:
    ls -la /etc/glassmkr/collector.yaml
    The file should be owned by root with mode 0600 or 0640.
  4. If you see bind: address already in use, another instance of Crucible may be running:
    pgrep -a crucible
    Stop the stale process, then start the service again.
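
The log messages in steps 2 to 4 can be matched mechanically. The following is a minimal sketch, not part of Crucible: the function names are made up for illustration, and the patterns are just the three messages quoted above.

```shell
# Map one log line to the troubleshooting step that addresses it.
# The patterns are the three messages quoted in the steps above.
classify_start_error() {
  case "$1" in
    *"config: parse error"*)          echo "config" ;;
    *"permission denied"*)            echo "permissions" ;;
    *"bind: address already in use"*) echo "port" ;;
    *)                                echo "unknown" ;;
  esac
}

# Read log lines from stdin and print the first recognized cause,
# or "unknown" if none of the lines match.
first_start_error() {
  while IFS= read -r line; do
    cause=$(classify_start_error "$line")
    if [ "$cause" != "unknown" ]; then
      echo "$cause"
      return 0
    fi
  done
  echo "unknown"
}
```

To use it, pipe the output of the journalctl command from step 1 into first_start_error. The result points to step 2 (config), step 3 (permissions), or step 4 (port).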

Server shows "offline" in the dashboard

Symptom: The server card in Forge shows a gray status indicator and "last seen" is more than 5 minutes ago.

Steps:

  1. Check that Crucible is running:
    systemctl status crucible
  2. Check network connectivity to the API:
    curl -s -o /dev/null -w "%{http_code}" https://forge.glassmkr.com/api/v1/health
    You should get 200. If not, check DNS resolution, firewall rules, and proxy settings.
  3. Check if the token is valid:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
    If you see auth error: 401, generate a new token in the Forge dashboard and update /etc/glassmkr/collector.yaml.
  4. Check for network-level blocks. Some firewalls or security groups block outbound HTTPS. Verify that port 443 to forge.glassmkr.com is open:
    nc -zv forge.glassmkr.com 443
  5. If you are behind a proxy, configure it in collector.yaml:
    proxy:
      https: http://proxy.internal:3128
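
The status code printed by the curl command in step 2 already narrows down the cause. As a rough sketch (the function name is hypothetical, and the messages are only suggestions), it could be interpreted like this. curl prints 000 when it receives no HTTP response at all, and 407 is the standard "Proxy Authentication Required" status.

```shell
# Interpret the status code printed by the health check in step 2:
#   curl -s -o /dev/null -w "%{http_code}" https://forge.glassmkr.com/api/v1/health
interpret_health_code() {
  case "$1" in
    200) echo "API reachable" ;;
    000) echo "no HTTP response: check DNS, firewall, and proxy settings" ;;
    407) echo "proxy requires authentication" ;;
    *)   echo "unexpected status $1: check proxy and firewall rules" ;;
  esac
}
```

Pass the output of the curl command as the only argument. A result of "no HTTP response" usually means the request never left the network, which is the case covered by steps 4 and 5.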

Metrics are delayed or missing

Symptom: The dashboard shows gaps in charts or data arrives minutes late.

Steps:

  1. Check the agent's push timing:
    sudo journalctl -u glassmkr-crucible --since "5 min ago" --no-pager
    The "Last push" value should be close to the configured interval (default: 300 seconds).
  2. If pushes are slow, check the agent log for timeout errors:
    grep -i "timeout\|retry" /var/log/glassmkr/crucible.log | tail -20
  3. If the server's clock is significantly off, metrics may be dropped. Verify NTP is working:
    timedatectl status
    The system clock should be synchronized. If not, enable NTP:
    sudo timedatectl set-ntp true
  4. If specific collectors are slow (e.g., SMART queries on many disks), they can delay the entire push. Check collector timing:
    sudo journalctl -u glassmkr-crucible -f
    Consider increasing the collection interval or disabling slow collectors.
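
To see whether timeouts are occasional or constant, it can help to count them rather than read them. A small sketch built on the same pattern as step 2 (the function name is made up for illustration):

```shell
# Count log lines that mention a timeout or a retry (case-insensitive).
# Usage: count_push_problems /var/log/glassmkr/crucible.log
count_push_problems() {
  # grep -c exits non-zero when the count is 0, so mask the status.
  grep -ciE 'timeout|retry' "$1" || true
}
```

A count that keeps growing between two runs a few minutes apart suggests an ongoing network or collector problem rather than a one-off delay.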

SMART data is not appearing

Symptom: The Disk tab in the dashboard shows no SMART information.

Steps:

  1. Ensure smartmontools is installed:
    # Debian/Ubuntu
    sudo apt install smartmontools
    
    # RHEL/Rocky/Alma
    sudo dnf install smartmontools
  2. Verify that smartctl can read your drives:
    sudo smartctl -a /dev/sda
    If this fails with a permission error, Crucible needs to run as root (which is the default for the systemd service).
  3. For hardware RAID controllers, drives behind the controller are not visible to smartctl without the -d flag. Check if your controller is supported:
    sudo smartctl -a /dev/sda -d megaraid,0
  4. Verify the SMART collector is enabled in collector.yaml:
    collectors:
      smart:
        enabled: true
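
For step 3, each drive behind a MegaRAID controller is addressed by its own number after "megaraid,". The sketch below prints the probe commands for a range of numbers so they can be reviewed before running. It assumes the drive numbers start at 0 and are contiguous, which is not true on every controller, and the function name is hypothetical.

```shell
# Print one smartctl command per MegaRAID drive number, 0..count-1.
# The commands are printed, not executed, so they can be reviewed first.
megaraid_probe_commands() {
  device=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "sudo smartctl -a $device -d megaraid,$i"
    i=$((i + 1))
  done
}
```

For example, megaraid_probe_commands /dev/sda 4 prints four commands, for drive numbers 0 through 3.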

Telegram notifications are not arriving

Symptom: Alerts fire in the dashboard but no Telegram messages are received.

Steps:

  1. Test the channel from the dashboard or API:
    curl -X POST https://forge.glassmkr.com/api/v1/channels/CHANNEL_ID/test \
      -H "Authorization: Bearer YOUR_TOKEN"
  2. If the test fails with 401 Unauthorized, the bot token is invalid. Create a new bot with BotFather or regenerate the token.
  3. If the test fails with 400 Bad Request: chat not found, the chat ID is wrong. Common mistakes:
    • Missing the -100 prefix for supergroups.
    • The bot was removed from the group after setup.
    • The bot has not received any messages in the chat yet (send a message to the bot first).
  4. If the test succeeds but real alerts do not arrive, check the channel routing. Go to Settings > Alert Defaults and verify that your Telegram channel is listed.
  5. Check the alert cooldown. By default, Forge sends only one notification per alert per hour. If you acknowledged the alert, or a notification for it was sent within the last hour, further notifications are suppressed.
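
The chat ID mistakes in step 3 are easy to spot from the shape of the ID. The sketch below is based on the usual Telegram convention (supergroup and channel IDs start with -100, basic group IDs are negative, private chat IDs are positive). It only inspects the format and does not contact Telegram, and the function name is made up for illustration.

```shell
# Classify a Telegram chat ID by its format only.
classify_chat_id() {
  case "$1" in
    # Empty, a lone "-", any character that is not a digit or "-",
    # or a "-" anywhere other than the first position.
    ''|-|*[!0-9-]*|?*-*) echo "invalid" ;;
    -100?*)              echo "supergroup or channel" ;;
    -*)                  echo "group" ;;
    *)                   echo "private chat" ;;
  esac
}
```

If the chat is a supergroup but the ID is classified as "group" or "private chat", the -100 prefix is probably missing.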

Email notifications go to spam

Symptom: Test emails arrive in the spam folder.

Steps:

  1. Check the spam folder and mark messages as "not spam" to train your mail provider.
  2. Add [email protected] to your contacts or safe senders list.
  3. If you control the recipient domain, add an SPF record allowing Glassmkr's mail servers. Contact support for the current IP ranges.
  4. For better deliverability, use a custom SMTP server with your own domain. See the Channels page for setup instructions.

Temperature or IPMI data is missing

Symptom: The Hardware tab shows no temperature, fan, or PSU data.

Steps:

  1. Install lm-sensors for hwmon data:
    # Debian/Ubuntu
    sudo apt install lm-sensors
    sudo sensors-detect --auto
  2. For IPMI data, install ipmitool:
    sudo apt install ipmitool
    Verify it works:
    sudo ipmitool sdr list
  3. If IPMI is not available (common on consumer hardware and many cloud VMs), Crucible falls back to hwmon. Virtual machines typically have no thermal sensors at all.
  4. Check that the thermal collector is not disabled:
    collectors:
      thermal:
        enabled: true
        source: auto
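
To confirm that the kernel exposes any hwmon temperature sensors at all, you can read them directly from sysfs. This is a sketch of a manual check, not a description of how Crucible reads sensors. The function name is hypothetical. The kernel's hwmon interface reports temperatures in millidegrees Celsius.

```shell
# Print "<file> <degrees C>" for every temp*_input file under a hwmon root.
# The root defaults to /sys/class/hwmon and can be overridden for testing.
list_hwmon_temps() {
  root=${1:-/sys/class/hwmon}
  for f in "$root"/hwmon*/temp*_input; do
    # Skip the unexpanded pattern (no sensors) and unreadable files.
    [ -r "$f" ] || continue
    milli=$(cat "$f" 2>/dev/null) || continue
    [ -n "$milli" ] || continue
    # Values are in millidegrees Celsius.
    echo "$f $((milli / 1000))"
  done
}
```

If list_hwmon_temps prints nothing, the machine has no hwmon temperature sensors, which is typical for virtual machines.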

High CPU usage by Crucible

Symptom: The Crucible process uses more than 1-2% CPU consistently.

Steps:

  1. Check which collectors are running:
    sudo journalctl -u glassmkr-crucible -f
  2. SMART queries on many disks can be expensive. If you have more than 20 disks, increase the interval or limit which disks are scanned:
    collectors:
      smart:
        devices:
          - /dev/sda
          - /dev/sdb
  3. Per-core CPU metrics on machines with 64+ cores generate a lot of data. Disable per-core reporting if you do not need it:
    collectors:
      cpu:
        per_core: false
  4. If the collection interval is set very low (e.g., 10 seconds), increase it to reduce overhead:
    collectors:
      interval: 300
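
The effect of the interval in step 4 is simple arithmetic: the number of collection cycles per day is 86400 divided by the interval in seconds. A small sketch (the function name is made up for illustration):

```shell
# Print how many collection cycles run per day for a given interval in seconds.
pushes_per_day() {
  interval=$1
  if [ "$interval" -gt 0 ]; then
    echo $((86400 / interval))
  else
    echo "interval must be a positive number of seconds" >&2
    return 1
  fi
}
```

pushes_per_day 10 prints 8640 and pushes_per_day 300 prints 288, so moving from a 10-second interval to the default cuts the number of cycles by a factor of 30.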

Registration fails with "server limit reached"

Symptom: Adding a server in the Forge dashboard ("+ Add Server") returns an error about the server limit.

Steps:

  1. Check your current plan limits in the Forge dashboard under Settings > Account.
  2. The Free plan allows up to 3 servers. The Pro and Enterprise plans have no server limit.
  3. If you have decommissioned servers that are still registered, delete them from the dashboard to free up slots.
  4. To upgrade your plan, go to Settings > Account > Billing.

Configuration changes are not taking effect

Symptom: You edited collector.yaml but Crucible still uses the old settings.

Steps:

  1. Restart the service after any configuration change:
    sudo systemctl restart crucible
  2. Verify the configuration was parsed correctly:
    crucible config check
  3. Check that you edited the correct file. If the CRUCIBLE_CONFIG environment variable is set, it may point to a different location:
    systemctl show crucible -p Environment
  4. Environment variables override the config file. Check if any CRUCIBLE_* variables are set in the systemd unit or the shell environment.
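
Steps 3 and 4 can be combined into one filter that lists every CRUCIBLE_* assignment. This is a rough sketch with a hypothetical function name. It splits on spaces, so a value that itself contains spaces will be cut off.

```shell
# Read "Environment=..." output (or plain VAR=value lines) from stdin
# and print only the CRUCIBLE_* assignments, one per line.
crucible_env_overrides() {
  # grep exits non-zero when nothing matches, so mask the status.
  sed 's/^Environment=//' | tr ' ' '\n' | grep '^CRUCIBLE_' || true
}
```

Pipe the output of the systemctl show command from step 3 into crucible_env_overrides to inspect the unit, or pipe the output of env into it to inspect the current shell. Empty output means no overrides are set in that environment.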

Per-core CPU data is not showing

Symptom: The per-core CPU chart does not appear in the expanded CPU view, or per-core data is missing from AI analysis.

Steps:

  1. Per-core monitoring requires Crucible 0.3.0 or later. Check your version:
    crucible --version
  2. Ensure per-core monitoring is enabled in the configuration:
    collectors:
      cpu:
        per_core: true
  3. Restart Crucible after changing the configuration:
    sudo systemctl restart crucible
  4. Wait for the next collection interval (default: 5 minutes) for data to appear.
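
The version requirement in step 1 can be checked in a script. The exact output format of crucible --version is not shown here, so this sketch takes bare version strings as arguments. It relies on sort -V (version sort) from GNU coreutils, and the function name is made up for illustration.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# sort -V orders version strings numerically per component,
# so 0.10.0 sorts after 0.3.0.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_ge 0.3.1 0.3.0 succeeds and version_ge 0.2.9 0.3.0 fails, so the check can gate a script that enables per_core.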

Muted rules are still firing

Symptom: You muted a rule but it continues to fire alerts or send notifications.

Steps:

  1. Muting takes effect on the next ingest cycle. Wait for at least one full collection interval (default: 5 minutes) after muting.
  2. If you muted via the configuration file, restart Crucible for the change to take effect:
    sudo systemctl restart crucible
  3. If you muted via the dashboard, no restart is needed, but the change applies on the next push from that server.
  4. Verify the rule is muted in the dashboard under the server's Alerts tab. Muted rules show a mute icon.

Getting help

If your issue is not covered here:

  • Run crucible debug to generate a diagnostic bundle. This collects logs, configuration (with tokens redacted), system info, and recent metrics. Attach it when contacting support.
  • Email [email protected] with your server ID and a description of the issue.