Monitoring

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics. The Monitoring Architecture ┌──────────────────────────────────────────────────┐ │ Grafana │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Metrics │ │ Logs │ │ Alerts │ │ │ │ Explorer │ │ Explorer │ │ Rules │ │ │ └──────┬───┘ └──────┬───┘ └──────┬───┘ │ └─────────┼──────────────┼─────────────┼───────────┘ │ │ │ ┌─────┴─────┐ ┌─────┴─────┐ │ │Prometheus │ │ Loki │ │ │ (metrics) │ │ (logs) │ │ └─────┬─────┘ └─────┬─────┘ │ │ │ ┌─────┴──────┐ ┌──────┴──────┐ ┌─────┴────┐ │AlertManager│ │ Exporters │ │Promtail │ │ → Slack │ │ node │ │(log │ └────────────┘ │ kube-state │ │ shipper) │ │ cAdvisor │ └──────────┘ │ custom │ └─────────────┘ kube-prometheus-stack The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...

Building an AI-Powered Alert System with AWS Bedrock

TL;DR Today I deployed two significant additions to the cluster: an AI-powered Alert Responder that uses AWS Bedrock (Amazon Nova Micro) to analyze Prometheus alerts and post remediation suggestions to Slack, and a multi-user dev workspace with per-user environments. I also hardened the cluster by constraining all workloads to the correct architecture nodes and fixing arm64 scheduling issues. The Alert Responder Running 13+ applications on a homelab cluster means alerts fire regularly. Most are straightforward — high memory, restart loops, certificate expiry warnings — but analyzing each one, determining root cause, and knowing the right remediation command gets tedious, especially at 2 AM. ...

Home Assistant on Kubernetes and Building a Proxmox Watchdog

TL;DR Home Assistant runs on k3s using hostNetwork: true for mDNS/SSDP device discovery. I implemented split DNS routing so it is accessible both externally via Traefik and internally via its host IP. Then I built a Proxmox Watchdog — a custom service that monitors all Proxmox hosts via their API and automatically power-cycles unresponsive nodes using TP-Link Kasa HS300 smart power strips. Home Assistant on Kubernetes Home Assistant is one of those applications that does not play well with Kubernetes out of the box. It needs to discover devices on the local network via mDNS, SSDP, and other broadcast protocols. Put it in a regular Kubernetes pod with cluster networking and it cannot see any of your smart home devices. ...

Deploying First Applications: From Zero to Production in 24 Hours

TL;DR Day two of the cluster was a marathon. I deployed two full-stack applications (Cardboard TCG tracker and Trade Bot), set up PostgreSQL with Longhorn persistent storage, created a cluster dashboard, configured Prometheus service monitors, built a dev workspace for remote SSH, and scaled the ARC runners. By the end, the cluster was running real workloads and I had a proper development workflow. The Deployment Pattern Before diving into the applications, I established a consistent deployment pattern that every service follows: ...