Production failures and lessons

Top 10 Production Failures and What I Learned

TL;DR After one week of operating this cluster with real workloads, I have accumulated a healthy list of production failures. Each one taught me something about Kubernetes, infrastructure, or my own assumptions. Here are the top 10, ranked by how much time they cost to investigate and fix. 1. The Longhorn S3 Backup Credential Rotation Impact: All Longhorn backups silently failed for 12 hours. What happened: I rotated the IAM credentials used for S3 backups and updated the Kubernetes secret. But Longhorn caches credentials at startup — it does not re-read the secret dynamically. All backup jobs continued using the stale credentials and failing silently. ...

February 15, 2026 · 6 min · zolty