
One Month Retrospective: From Bare Metal to Production Platform

TL;DR One month ago, I had three empty Lenovo ThinkCentre M920q mini PCs and a Proxmox installer USB. Today, the cluster runs 8 Kubernetes nodes, 15+ applications, full observability with Prometheus and Grafana, AI-powered alert analysis, self-hosted CI/CD, 10GbE networking, and a 3D printer fabricating custom hardware. Total hardware cost: under $800. This post traces the entire journey, day by day, including the things that went wrong. ...

March 21, 2026 · 10 min · zolty

Jellyfin HA on Kubernetes: Redis-Backed Transcode Session Failover

TL;DR Jellyfin dies mid-stream when a Kubernetes pod restarts because all transcode state is in-memory. I forked it, added a Redis-backed ITranscodeSessionStore, and wired in atomic lease-based pod takeover. The fork is at github.com/ZoltyMat/jellyfin-ha, and I also published a repo-level diff document at docs/FORK-DIFF.md showing exactly what changed versus upstream Jellyfin. Single-instance deployments need zero config changes because it falls back to a no-op store transparently. The Problem Jellyfin is great. It’s also built with the assumption that exactly one server instance is running at a time. Transcode state — which pods are running FFmpeg, what segments have been written, who owns a given play session — lives entirely in memory. When the process dies, that state is gone. ...
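The lease-based takeover idea can be sketched in a few lines. This is not code from the fork (which is C# and talks to Redis); it is a minimal Python illustration that assumes a lease behaves like a Redis `SET key value NX PX ttl` entry. The names `InMemoryLeaseStore`, `try_acquire`, `owner_of`, and `FakeClock` are hypothetical, and an in-process dictionary stands in for Redis so the example runs on its own.

```python
import threading
import time


class InMemoryLeaseStore:
    """Hypothetical stand-in for Redis; one lease per play session."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()  # makes check-and-set atomic in-process
        self._leases = {}  # session_id -> (owner, expires_at)

    def try_acquire(self, session_id, owner, ttl_seconds):
        """Take the lease if it is free, expired, or already held by owner."""
        with self._lock:
            now = self._clock()
            current = self._leases.get(session_id)
            if current is not None:
                holder, expires_at = current
                if holder != owner and expires_at > now:
                    return False  # another live pod still owns the session
            # Acquire, renew, or take over an expired lease.
            self._leases[session_id] = (owner, now + ttl_seconds)
            return True

    def owner_of(self, session_id):
        """Return the current holder, or None if the lease has lapsed."""
        with self._lock:
            current = self._leases.get(session_id)
            if current is None or current[1] <= self._clock():
                return None
            return current[0]


class FakeClock:
    """Manually advanced clock so expiry can be shown without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds
```

In this sketch a pod that keeps calling `try_acquire` renews its own lease; if it dies and stops renewing, the lease expires and the next pod to call `try_acquire` takes the session over.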

March 14, 2026 · 7 min · zolty

Seven Alerts, Three Bugs, One AI Debug Session: A Kubernetes Incident Report

TL;DR A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale manually-created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge. ...

March 4, 2026 · 8 min · zolty

Reference: k3s Homelab — AI Lessons Learned

Context: This is the live docs/ai-lessons.md from the home k3s cluster repository, referenced extensively across posts on this blog — starting with AI Memory System and GitHub Copilot Setup Guide. Every entry exists because its absence caused a production incident. Personal identifiers and internal domains have been replaced with generic placeholders. Updated: 2026-03-03 Rules discovered through production breakage. Each entry prevents recurrence of a specific failure. Update this file whenever a new non-obvious failure pattern is discovered. ...

March 2, 2026 · 81 min · zolty

When Monitoring Goes Blind: A Longhorn Storage Corruption Incident

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics — its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly-added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting mid-write files: a Prometheus WAL segment and a Loki TSDB index file. Both required offline surgery via a busybox pod to delete the corrupted files before the services could recover. ...

February 25, 2026 · 8 min · zolty

Upgrading k3s Across Five Minor Versions: v1.29 to v1.34 on a Homelab Cluster

TL;DR Upgraded a production k3s cluster from v1.29.0+k3s1 to v1.34.4+k3s1 across 8 nodes: 3 control plane servers, 4 amd64 worker agents, and 1 arm64 Lima VM agent. The upgrade stepped through every minor version (v1.29 → v1.30 → v1.31 → v1.32 → v1.33 → v1.34) with an etcd snapshot between each step. Longhorn was upgraded from v1.6.0 to v1.8.2 in two stages (v1.7.3 as an intermediate step). SSH to all cluster nodes was broken, so the entire upgrade was done via Proxmox QEMU Guest Agent (qm guest exec) and Lima CLI (limactl shell). Discovered that k3s intentionally pins Traefik to v2.11.24 even when bundling Helm chart v27; the Traefik v3 migration is a separate effort. ...
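The one-minor-version-at-a-time rule can be expressed as a small planning helper. This is a sketch only: `minor_upgrade_path` is a hypothetical name, it performs no upgrade itself, and it assumes versions look like `v1.29` or `v1.29.0+k3s1` and stay within one major version.

```python
def minor_upgrade_path(start, target):
    """List each minor version to step through, excluding the start."""

    def parse(version):
        # "v1.29.0+k3s1" -> (1, 29); patch and build suffix are ignored.
        major, minor = version.lstrip("v").split(".")[:2]
        return int(major), int(minor)

    start_major, start_minor = parse(start)
    target_major, target_minor = parse(target)
    if start_major != target_major:
        raise ValueError("this sketch only handles a single major version")
    if target_minor < start_minor:
        raise ValueError("target is older than start")
    return [
        f"v{start_major}.{minor}"
        for minor in range(start_minor + 1, target_minor + 1)
    ]


# Plan for the upgrade described above; snapshot etcd before each hop.
for step in minor_upgrade_path("v1.29.0+k3s1", "v1.34.4+k3s1"):
    print(f"snapshot etcd, then upgrade to {step}")
```

The helper returns only minor versions; picking the exact patch release for each hop (for example v1.34.4+k3s1 at the end) is left to the operator.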

February 22, 2026 · 10 min · zolty

Deploying First Applications: From Zero to Production in 24 Hours

TL;DR Day two of the cluster was a marathon. I deployed two full-stack applications (Cardboard TCG tracker and Trade Bot), set up PostgreSQL with Longhorn persistent storage, created a cluster dashboard, configured Prometheus service monitors, built a dev workspace for remote SSH, and scaled the ARC runners. By the end, the cluster was running real workloads and I had a proper development workflow. The Deployment Pattern Before diving into the applications, I established a consistent deployment pattern that every service follows: ...

February 9, 2026 · 6 min · zolty

Day One: Bootstrapping a k3s Cluster with Terraform and Ansible

TL;DR Today was cluster genesis. Starting from 3 bare Proxmox hosts, I built the entire infrastructure-as-code pipeline: Terraform to provision VMs from cloud-init templates, Ansible to configure and bootstrap k3s, and a full GitOps deployment model with SOPS-encrypted secrets and S3-backed Terraform state. By end of day: 3 server nodes, 3 agent nodes, cert-manager with Route53 DNS-01 validation, and self-hosted GitHub Actions runners on the cluster itself. The Architecture The design goal was simple: everything as code, nothing manual, everything reproducible. ...

February 8, 2026 · 6 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.