AI-driven Kubernetes incident response — seven alerts resolved AI-driven Kubernetes incident response — seven alerts resolved

Seven Alerts, Three Bugs, One AI Debug Session: A Kubernetes Incident Report

TL;DR A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale manually-created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge. ...

March 4, 2026 · 8 min · zolty
Self-hosted AI chat deployment Self-hosted AI chat deployment

Self-Hosted AI Chat: Open WebUI, LiteLLM, and AWS Bedrock on k3s

TL;DR I deployed a private, self-hosted ChatGPT alternative on the homelab k3s cluster. Open WebUI provides a polished chat interface. LiteLLM acts as a proxy that translates the OpenAI API into AWS Bedrock’s Converse API. Four models are available: Claude Sonnet 4, Claude Haiku 4.5, Amazon Nova Micro, and Amazon Nova Lite. Authentication is handled by the existing OAuth2 Proxy – no additional SSO configuration needed. The whole stack runs in three pods consuming under 500MB of RAM, and the only ongoing cost is per-request Bedrock pricing. No API keys from OpenAI or Anthropic required. ...

March 4, 2026 · 8 min · zolty
AI Dungeon Master platform architecture diagram AI Dungeon Master platform architecture diagram

Building an AI Dungeon Master: Full-Stack D&D Platform on k3s

TL;DR I’m building a multiplayer D&D platform where an AI powered by AWS Bedrock Claude runs the game. Players connect via a Next.js web app or Discord. A 5-tier lore context system gives the AI persistent memory across sessions. A background world simulation engine tracks NPC positions, inventory, faction standings, and in-game time so the AI can focus on storytelling instead of bookkeeping. The foundation is fully deployed on my home k3s cluster. The current work is turning a working tech demo into a game people actually want to sit down and play. ...

March 2, 2026 · 14 min · zolty

Reference: k3s Homelab — AI Lessons Learned

Context: This is the live docs/ai-lessons.md from the home k3s cluster repository, referenced extensively across posts on this blog — starting with AI Memory System and GitHub Copilot Setup Guide. Every entry exists because its absence caused a production incident. Personal identifiers and internal domains have been replaced with generic placeholders. Updated: 2026-03-03 Rules discovered through production breakage. Each entry prevents recurrence of a specific failure. Update this file whenever a new non-obvious failure pattern is discovered. ...

March 2, 2026 · 81 min · zolty
Monitoring goes blind — Longhorn storage corruption incident report Monitoring goes blind — Longhorn storage corruption incident report

When Monitoring Goes Blind: A Longhorn Storage Corruption Incident

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics — its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly-added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting mid-write files: a Prometheus WAL segment and a Loki TSDB index file. Both required offline surgery via a busybox pod to delete the corrupted files before the services could recover. ...

February 25, 2026 · 8 min · zolty
Wiki.js self-hosted knowledge base Wiki.js self-hosted knowledge base

The Cluster That Documents Itself: Self-Hosted Wiki.js as Living Infrastructure Knowledge

TL;DR I run Wiki.js on k3s as the cluster’s internal knowledge base. It is not a place I write documentation — it is a place the AI writes documentation after completing work. When Claude finishes deploying a service, debugging an incident, or refactoring infrastructure, it commits the results to the wiki with architecture diagrams, decision rationale, and operational notes. I am the primary reader. When I want to understand how something works, or why a specific decision was made three weeks ago, I go to the wiki instead of digging through git history or re-reading code. ...

February 24, 2026 · 5 min · zolty
k3s cluster upgrade from v1.29 to v1.34 k3s cluster upgrade from v1.29 to v1.34

Upgrading k3s Across Five Minor Versions: v1.29 to v1.34 on a Homelab Cluster

TL;DR Upgraded a production k3s cluster from v1.29.0+k3s1 to v1.34.4+k3s1 across 8 nodes — 3 control plane servers, 4 amd64 worker agents, and 1 arm64 Lima VM agent. The upgrade stepped through every minor version (v1.29 → v1.30 → v1.31 → v1.32 → v1.33 → v1.34) with etcd snapshots between each step. Longhorn was upgraded from v1.6.0 to v1.8.2 in two stages (v1.7.3 as an intermediate step). SSH was broken to all cluster nodes, so the entire upgrade was done via Proxmox QEMU Guest Agent (qm guest exec) and Lima CLI (limactl shell). Discovered that k3s intentionally pins Traefik to v2.11.24 even when bundling Helm chart v27 — Traefik v3 migration is a separate effort. ...

February 22, 2026 · 10 min · zolty
Natural language media requests via Jellyseerr Natural language media requests via Jellyseerr

I Am Zolty: Building a Natural Language Media Request System

TL;DR Jellyseerr already knows what I have. Radarr and Sonarr already know how to find things. The missing piece was a front door that understood intent instead of requiring me to search for specific titles. I wired Jellyseerr’s REST API to Claude and gave it a system prompt that knows my taste profile. Now I can say “download 100GB of family-friendly anime I might like” and get a queue of requests back. A Kubernetes CronJob runs the same prompt on a schedule so the library grows without me thinking about it. ...

February 21, 2026 · 5 min · zolty
Monitoring stack Monitoring stack

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics. The Monitoring Architecture ┌──────────────────────────────────────────────────┐ │ Grafana │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Metrics │ │ Logs │ │ Alerts │ │ │ │ Explorer │ │ Explorer │ │ Rules │ │ │ └──────┬───┘ └──────┬───┘ └──────┬───┘ │ └─────────┼──────────────┼─────────────┼───────────┘ │ │ │ ┌─────┴─────┐ ┌─────┴─────┐ │ │Prometheus │ │ Loki │ │ │ (metrics) │ │ (logs) │ │ └─────┬─────┘ └─────┬─────┘ │ │ │ ┌─────┴──────┐ ┌──────┴──────┐ ┌─────┴────┐ │AlertManager│ │ Exporters │ │Promtail │ │ → Slack │ │ node │ │(log │ └────────────┘ │ kube-state │ │ shipper) │ │ cAdvisor │ └──────────┘ │ custom │ └─────────────┘ kube-prometheus-stack The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...

February 19, 2026 · 6 min · zolty
Media stack on Kubernetes Media stack on Kubernetes

Building a Complete Media Stack with Kubernetes

TL;DR The media stack is now fully automated: content gets sourced, synced from a remote seedbox to the local NAS via an rclone CronJob, organized by Radarr/Sonarr, and served by Jellyfin with Intel iGPU hardware transcoding. I also deployed a Media Controller for lifecycle management and a Media Profiler for content analysis. This post covers the full pipeline from acquisition to playback. The Media Pipeline Jellyseerr (request) │ ▼ Radarr/Sonarr (search + organize) │ ▼ Prowlarr (indexer management) │ ▼ Seedbox (remote acquisition) │ ▼ rclone-sync CronJob (seedbox → NAS) │ ▼ NAS (TrueNAS, VLAN 30) │ ▼ Jellyfin (playback + GPU transcode) Each component runs as a Kubernetes deployment in the media namespace. ...

February 17, 2026 · 5 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.