Debugging

AI-driven Kubernetes incident response — seven alerts resolved

Seven Alerts, Three Bugs, One AI Debug Session: A Kubernetes Incident Report

TL;DR A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale manually-created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge. ...

Monitoring goes blind — Longhorn storage corruption incident report

When Monitoring Goes Blind: A Longhorn Storage Corruption Incident

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics — its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly-added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting mid-write files: a Prometheus WAL segment and a Loki TSDB index file. Both required offline surgery via a busybox pod to delete the corrupted files before the services could recover. ...

Top 10 Production Failures and What I Learned

TL;DR After one week of operating this cluster with real workloads, I have accumulated a healthy list of production failures. Each one taught me something about Kubernetes, infrastructure, or my own assumptions. Here are the top 10, ranked by how much time they cost to investigate and fix. 1. The Longhorn S3 Backup Credential Rotation Impact: All Longhorn backups silently failed for 12 hours. What happened: I rotated the IAM credentials used for S3 backups and updated the Kubernetes secret. But Longhorn caches credentials at startup — it does not re-read the secret dynamically. All backup jobs continued using the stale credentials and failing silently. ...