TL;DR

After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics.

The Monitoring Architecture

┌──────────────────────────────────────────────────┐
│                  Grafana                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Metrics  │  │   Logs   │  │  Alerts  │       │
│  │ Explorer │  │ Explorer │  │  Rules   │       │
│  └──────┬───┘  └──────┬───┘  └──────┬───┘       │
└─────────┼──────────────┼─────────────┼───────────┘
          │              │             │
    ┌─────┴─────┐  ┌─────┴─────┐      │
    │Prometheus │  │   Loki    │      │
    │ (metrics) │  │  (logs)   │      │
    └─────┬─────┘  └─────┬─────┘      │
          │              │       ┌─────┴──────┐
   ┌──────┴──────┐ ┌─────┴────┐ │AlertManager│
   │ Exporters   │ │Promtail  │ │  → Slack   │
   │ node        │ │(log      │ └────────────┘
   │ kube-state  │ │ shipper) │
   │ cAdvisor    │ └──────────┘
   │ custom      │
   └─────────────┘

kube-prometheus-stack

The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs:

  • Prometheus: Time-series metrics collection and storage
  • Grafana: Visualization and dashboarding
  • AlertManager: Alert routing and notification
  • node-exporter: Host-level metrics (CPU, memory, disk, network)
  • kube-state-metrics: Kubernetes object state (pod status, deployment replicas, etc.)
  • Prometheus Operator: Manages the Prometheus and AlertManager instances from Kubernetes CRDs (ServiceMonitors, PrometheusRules)
  • Recording rules: Pre-computed metrics for common queries

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

I keep 30 days of metrics on a 50Gi Longhorn volume. This is more than enough for trend analysis and capacity planning.
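
A quick sanity check after the install confirms everything came up and the 50Gi claim actually bound to a Longhorn volume:

kubectl -n monitoring get pods
kubectl -n monitoring get pvc   # the Prometheus PVC should show STATUS Bound, CAPACITY 50Gi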

Custom Exporters

Beyond the standard exporters, I run several custom ones:

GitHub Exporter

Exports repository metrics: stars, forks, open issues, workflow run counts and durations. The Grafana dashboard shows CI/CD activity patterns across all repositories.

Proxmox Hardware Exporter

Each Proxmox host runs lm-sensors, with the readings fed into node-exporter’s textfile collector. This provides:

  • CPU package and core temperatures
  • Fan speeds (when the fan spins up)
  • Chipset temperatures
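
The wiring is a small cron script that dumps sensor readings into the textfile collector directory. A minimal sketch (the directory path, metric name, and awk filter are assumptions, not the exact script running on the hosts):

#!/bin/sh
# Write lm-sensors temperatures in Prometheus exposition format for the
# node-exporter textfile collector (directory path is an assumption).
OUT=/var/lib/node_exporter/textfile_collector/sensors.prom
TMP="${OUT}.tmp"
{
  echo '# HELP lm_sensors_temp_celsius Temperature reported by lm-sensors'
  echo '# TYPE lm_sensors_temp_celsius gauge'
  sensors -u 2>/dev/null | awk '
    /^[^ ]+-(isa|pci|acpi|virtual)-/ { chip = $1 }      # chip header line, e.g. coretemp-isa-0000
    /^ *temp[0-9]+_input:/ {
      split($1, f, ":")
      printf "lm_sensors_temp_celsius{chip=\"%s\",sensor=\"%s\"} %s\n", chip, f[1], $2
    }'
} > "$TMP" && mv "$TMP" "$OUT"   # write-then-rename so node-exporter never reads a half-written file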

Seedbox Exporter

Custom Python exporter that scrapes seedbox statistics:

  • Active transfer count
  • Upload/download throughput
  • Disk usage
  • Ratio statistics

NAS Exporter

Exposes TrueNAS metrics:

  • Pool health status
  • Disk SMART data
  • ZFS ARC hit rate
  • Network throughput per interface

Exportarr (Radarr/Sonarr)

Sidecar containers that expose *arr application metrics:

  • Queue length
  • Download completion rate
  • Library size
  • Calendar upcoming entries
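
The sidecar itself is just one extra container in each *arr Deployment. A trimmed sketch (image tag, port, env variable names, and the Secret name are assumptions to be checked against the exportarr docs):

# Appended to the Radarr Deployment's container list
- name: exportarr
  image: ghcr.io/onedr0p/exportarr:latest   # image location is an assumption
  args: ["radarr"]
  env:
    - name: URL
      value: "http://localhost:7878"        # Radarr in the same pod
    - name: APIKEY
      valueFrom:
        secretKeyRef:
          name: radarr-secrets              # hypothetical Secret
          key: api-key
    - name: PORT
      value: "9707"
  ports:
    - name: metrics
      containerPort: 9707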

ServiceMonitors

Every application in the cluster that exposes a /metrics endpoint gets a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cardboard
  namespace: cardboard
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: cardboard
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

The release: kube-prometheus-stack label is critical — the Prometheus operator uses label selectors to discover ServiceMonitors, and by default it only picks up monitors matching the Helm release name.
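
If carrying that label on every monitor gets tedious, the chart exposes serviceMonitorSelectorNilUsesHelmValues; setting it to false makes the operator pick up every ServiceMonitor in the watched namespaces (the trade-off being that everything with a ServiceMonitor gets scraped, wanted or not):

helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false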

Loki Log Aggregation

Prometheus handles metrics. For logs, I use Loki + Promtail:

  • Promtail runs as a DaemonSet on every node, tailing container log files from the node's filesystem and attaching pod labels discovered via the Kubernetes API
  • Loki indexes log labels (namespace, pod, container) and stores log chunks on the Longhorn volume
  • Grafana provides a unified view — click from a metric anomaly directly to the relevant pod logs

This is particularly useful for debugging CronJob failures. Instead of running kubectl logs on ephemeral Job pods (which may have already been cleaned up), I query Loki for the logs persisted during the job run.
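
As a concrete example, this is roughly the Grafana Explore query I reach for after a failed sync job (the label values are hypothetical; substitute your own namespace and pod name pattern):

{namespace="media", pod=~"rclone-sync-.*"} |= "error"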

Grafana Dashboards

I've built Grafana dashboards for every layer of the stack:

Cluster Overview

  • Node count, pod count, namespace count
  • CPU/memory/disk utilization per node
  • Pod restart counts (top 10)
  • Network I/O per node
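
The restart panel, for instance, is a single topk over kube-state-metrics data; a query along these lines drives it (the dashboard's exact expression may differ):

topk(10, increase(kube_pod_container_status_restarts_total[24h]))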

Proxmox Watchdog

  • Power cycle events (timeline)
  • Proxmox host availability
  • API response latency per host
  • Temperature trends per host (4-across layout for 4 nodes)

Application Dashboards

Each application has its own dashboard:

  • Cardboard: Price scrape success rate, database size, query latency
  • Trade Bot: Trade execution count, Bedrock token usage, portfolio value
  • Media Stack: Library growth, transcode queue, disk usage trending
  • Alert Responder: Alerts analyzed, remediation suggestions, response latency

CI/CD Dashboard

  • Workflow run duration by repository
  • Runner pod utilization
  • Build success/failure rate
  • Queue wait time

Alert Tuning

The default kube-prometheus-stack alerts are noisy. I spent time today tuning them:

Removed: PveNodeHighMemory

As documented in the production failures post, Proxmox hosts report high memory usage because the ZFS ARC caches aggressively in otherwise idle RAM. This is normal and not actionable, so I replaced the alert with PveNodeHighSwap:

- alert: PveNodeHighSwap
  expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High swap usage on {{ $labels.instance }}"

Added: Media Pipeline Alerts

- alert: RcloneSyncFailed
  expr: time() - rclone_last_sync_success_timestamp > 7200
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "rclone sync has not succeeded in 2+ hours"

- alert: RadarrQueueStuck
  expr: radarr_queue_total > 10
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Radarr queue has 10+ items for 30+ minutes"

Added: Backup Monitoring

- alert: LonghornBackupFailed
  expr: longhorn_backup_last_completed_timestamp < (time() - 86400)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn backup has not completed in 24+ hours"

Performance Benchmarks

Today I also ran comprehensive benchmarks to establish baseline performance numbers:

CPU (sysbench)

Multi-thread (12 threads): 10,864 events/sec
Single-thread:              1,293 events/sec

Memory (sysbench)

Sequential read:  16,114 MiB/sec
Sequential write: 15,892 MiB/sec

Disk I/O (fio on Longhorn volume)

Random Write IOPS: 783K
Random Read IOPS:  414K
Sequential Write:  1.2 GB/s
Sequential Read:   890 MB/s

Network (iperf3 between nodes)

1GbE throughput: 938 Mbps

These numbers serve as baselines. If future benchmarks show degradation, it indicates a configuration change or hardware issue.
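
To keep future runs comparable, the benchmarks need to be repeated with the same parameters. The shape of each run looks roughly like this (block sizes, queue depths, and runtimes are illustrative, not necessarily the exact flags behind the numbers above):

# CPU: events/sec, multi-thread vs. single-thread
sysbench cpu --threads=12 run
sysbench cpu --threads=1 run

# Memory: sequential read and write bandwidth
sysbench memory --memory-oper=read run
sysbench memory --memory-oper=write run

# Disk: 4k random IOPS and 1M sequential bandwidth against a Longhorn-backed mount (path is hypothetical)
fio --name=randread --directory=/mnt/longhorn-test --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=4 --size=2G --runtime=60 --time_based --group_reporting
fio --name=seqwrite --directory=/mnt/longhorn-test --rw=write --bs=1M \
    --ioengine=libaio --iodepth=16 --numjobs=1 --size=4G --runtime=60 --time_based --group_reporting

# Network: node-to-node throughput, with an iperf3 server listening on the other node
iperf3 -c <other-node-ip>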

CI/CD Fixes

While reviewing the monitoring data, I noticed some CI/CD issues:

Recreate Strategy

Trade Bot and Cardboard were using the RollingUpdate deployment strategy, which by design runs old and new pods side by side during a rollout. For applications holding external session state (like Robinhood API sessions), that overlap caused conflicts. I switched both to Recreate:

spec:
  strategy:
    type: Recreate

Rollout Timeouts

The default rollout timeout of 120 seconds was not enough for the Alert Responder and Media Controller, which have slow startup (fetching ML models and media metadata). Increased to 300 seconds.
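
Where the timeout lives depends on what enforces it; assuming the deploy step gates on kubectl rollout status (the same idea applies if the limit sits in Flux, Argo CD, or the Deployment's progressDeadlineSeconds), the change is just the --timeout flag:

# Namespace and deployment names are placeholders for the real manifests
kubectl -n alert-responder rollout status deployment/alert-responder --timeout=300s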

Lessons Learned

  1. Default Prometheus alerts need tuning for a homelab. The stock thresholds are enterprise-oriented and generate too much noise at this scale. Tune aggressively.
  2. Loki is essential for CronJob debugging. Ephemeral workloads like Jobs and CronJobs may be cleaned up before you can read their logs. Loki preserves them.
  3. ServiceMonitor labels must match the Prometheus operator’s selector. The release: kube-prometheus-stack label is easy to forget.
  4. Benchmark early, benchmark often. Establishing baseline performance numbers lets you detect degradation before users notice.
  5. Recreate strategy is safer for applications with external session state. Rolling updates can cause session conflicts.

The monitoring stack now covers every layer: hardware, hypervisor, Kubernetes, and applications. When something goes wrong, I can see it in Grafana before the alert fires.