Homelab

Reference: k3s Homelab — AI Lessons Learned

Context: This is the live docs/ai-lessons.md from the home k3s cluster repository, referenced extensively across posts on this blog — starting with AI Memory System and GitHub Copilot Setup Guide. Every entry exists because its absence caused a production incident. Personal identifiers and internal domains have been replaced with generic placeholders. Updated: 2026-03-03 Rules discovered through production breakage. Each entry prevents recurrence of a specific failure. Update this file whenever a new non-obvious failure pattern is discovered. ...

GitHub Copilot setup guide with AI skills and memory

Getting Started with GitHub Copilot: What Actually Works

TL;DR A $20/month GitHub Copilot subscription gives you Claude Sonnet 4.6, GPT-4o, and Gemini inside VS Code. Out of the box it’s useful. With a proper instruction setup — a copilot-instructions.md file, path-scoped rules, and skill documents — it becomes something you actually rely on. Most of the posts on this blog were built with this toolchain, mostly in the context of my k3s cluster, but the patterns apply anywhere. This is how I have it set up. ...

Building an AI Memory System: From Blank Slate to 482 Lines of Hard-Won Knowledge

TL;DR The .github/copilot-instructions.md file started as 10 lines of project description and grew into a 99-line “operating system” for AI assistants. Then it split: failure patterns moved into docs/ai-lessons.md (now 482 lines across 20+ categories), and file-type-specific rules moved into .github/instructions/ with applyTo glob patterns. The same template structure was standardized across 5 repositories. This post traces the three generations of AI instruction architecture and shows how every production incident permanently improves AI reliability. ...

Monitoring goes blind — Longhorn storage corruption incident report

When Monitoring Goes Blind: A Longhorn Storage Corruption Incident

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics — its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly-added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting mid-write files: a Prometheus WAL segment and a Loki TSDB index file. Both required offline surgery via a busybox pod to delete the corrupted files before the services could recover. ...

The Cluster That Documents Itself: Self-Hosted Wiki.js as Living Infrastructure Knowledge

TL;DR I run Wiki.js on k3s as the cluster’s internal knowledge base. It is not a place I write documentation — it is a place the AI writes documentation after completing work. When Claude finishes deploying a service, debugging an incident, or refactoring infrastructure, it commits the results to the wiki with architecture diagrams, decision rationale, and operational notes. I am the primary reader. When I want to understand how something works, or why a specific decision was made three weeks ago, I go to the wiki instead of digging through git history or re-reading code. ...

When Your AI Memory System Eats Its Own Context Window

TL;DR The AI memory system I built three weeks ago started causing the problem it was designed to solve: context window exhaustion. Five generic Claude skills — duplicated identically across all 5 repositories in my workspace — consumed 401KB (~100K tokens) of potential context. The gh-cli skill alone was 40KB per copy, accounting for 42% of all skill content. I ran a full audit, deleted 25 duplicate files, and documented the anti-pattern to prevent recurrence. ...

Upgrading k3s Across Five Minor Versions: v1.29 to v1.34 on a Homelab Cluster

TL;DR Upgraded a production k3s cluster from v1.29.0+k3s1 to v1.34.4+k3s1 across 8 nodes — 3 control plane servers, 4 amd64 worker agents, and 1 arm64 Lima VM agent. The upgrade stepped through every minor version (v1.29 → v1.30 → v1.31 → v1.32 → v1.33 → v1.34) with etcd snapshots between each step. Longhorn was upgraded from v1.6.0 to v1.8.2 in two stages (v1.7.3 as an intermediate step). SSH was broken to all cluster nodes, so the entire upgrade was done via Proxmox QEMU Guest Agent (qm guest exec) and Lima CLI (limactl shell). Discovered that k3s intentionally pins Traefik to v2.11.24 even when bundling Helm chart v27 — Traefik v3 migration is a separate effort. ...

AI-Assisted Infrastructure: Claude, Copilot, and the Memory Protocol

TL;DR Two weeks of building a production Kubernetes cluster with AI pair programming. Claude Opus 4.6 handles complex multi-step infrastructure work via the CLI. GitHub Copilot provides inline code completion in VS Code. AWS Bedrock (Nova Micro, Claude Sonnet 4.5) powers runtime AI services inside the cluster. The key discovery: AI tools without persistent memory are dangerous. Every session starts from zero. The same bugs get recreated, the same anti-patterns get suggested, the same cluster-specific constraints get forgotten. The solution is the “Memory Protocol” – a set of documentation files the AI reads before every session and updates after every discovery. ...

Benchmarking every subsystem on four Lenovo M920q Proxmox hosts — NVMe, CPU, memory, and 10GbE network

Benchmarking Every Subsystem: NVMe, CPU, Memory, and 10GbE on Four Proxmox Hosts

TL;DR Prometheus and Grafana both crashed with I/O errors on the same node. Before assuming software, I ran a full hardware audit across all four Proxmox hosts — SMART health, NVMe disk benchmarks (fio), CPU benchmarks (sysbench), memory bandwidth tests, and 10GbE network throughput (iperf3). The result: all hardware is healthy. The I/O errors were Longhorn CSI virtual block device corruption, not physical disk failure. Along the way, I established baseline performance numbers for every subsystem and discovered that custom cooling makes a dramatic difference in thermal performance. ...

Natural language media requests via Jellyseerr

I Am Zolty: Building a Natural Language Media Request System

TL;DR Jellyseerr already knows what I have. Radarr and Sonarr already know how to find things. The missing piece was a front door that understood intent instead of requiring me to search for specific titles. I wired Jellyseerr’s REST API to Claude and gave it a system prompt that knows my taste profile. Now I can say “download 100GB of family-friendly anime I might like” and get a queue of requests back. A Kubernetes CronJob runs the same prompt on a schedule so the library grows without me thinking about it. ...