Operations

Upgrading k3s Across Five Minor Versions: v1.29 to v1.34 on a Homelab Cluster

TL;DR Upgraded a production k3s cluster from v1.29.0+k3s1 to v1.34.4+k3s1 across 8 nodes — 3 control plane servers, 4 amd64 worker agents, and 1 arm64 Lima VM agent. The upgrade stepped through every minor version (v1.29 → v1.30 → v1.31 → v1.32 → v1.33 → v1.34) with etcd snapshots between each step. Longhorn was upgraded from v1.6.0 to v1.8.2 in two stages (v1.7.3 as an intermediate step). SSH was broken to all cluster nodes, so the entire upgrade was done via Proxmox QEMU Guest Agent (qm guest exec) and Lima CLI (limactl shell). Discovered that k3s intentionally pins Traefik to v2.11.24 even when bundling Helm chart v27 — Traefik v3 migration is a separate effort. ...

AI-Assisted Infrastructure: Claude, Copilot, and the Memory Protocol

TL;DR Two weeks of building a production Kubernetes cluster with AI pair programming. Claude Opus 4.6 handles complex multi-step infrastructure work via the CLI. GitHub Copilot provides inline code completion in VS Code. AWS Bedrock (Nova Micro, Claude Sonnet 4.5) powers runtime AI services inside the cluster. The key discovery: AI tools without persistent memory are dangerous. Every session starts from zero. The same bugs get recreated, the same anti-patterns get suggested, the same cluster-specific constraints get forgotten. The solution is the “Memory Protocol” – a set of documentation files the AI reads before every session and updates after every discovery. ...

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics. The Monitoring Architecture ┌──────────────────────────────────────────────────┐ │ Grafana │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Metrics │ │ Logs │ │ Alerts │ │ │ │ Explorer │ │ Explorer │ │ Rules │ │ │ └──────┬───┘ └──────┬───┘ └──────┬───┘ │ └─────────┼──────────────┼─────────────┼───────────┘ │ │ │ ┌─────┴─────┐ ┌─────┴─────┐ │ │Prometheus │ │ Loki │ │ │ (metrics) │ │ (logs) │ │ └─────┬─────┘ └─────┬─────┘ │ │ │ ┌─────┴──────┐ ┌──────┴──────┐ ┌─────┴────┐ │AlertManager│ │ Exporters │ │Promtail │ │ → Slack │ │ node │ │(log │ └────────────┘ │ kube-state │ │ shipper) │ │ cAdvisor │ └──────────┘ │ custom │ └─────────────┘ kube-prometheus-stack The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...

Top 10 Production Failures and What I Learned

TL;DR After one week of operating this cluster with real workloads, I have accumulated a healthy list of production failures. Each one taught me something about Kubernetes, infrastructure, or my own assumptions. Here are the top 10, ranked by how much time they cost to investigate and fix. 1. The Longhorn S3 Backup Credential Rotation Impact: All Longhorn backups silently failed for 12 hours. What happened: I rotated the IAM credentials used for S3 backups and updated the Kubernetes secret. But Longhorn caches credentials at startup — it does not re-read the secret dynamically. All backup jobs continued using the stale credentials and failing silently. ...