AI failure patterns and guardrails AI failure patterns and guardrails

When the AI Breaks Production: Failure Patterns, Guardrails, and Measuring What Works

TL;DR AI tools have caused multiple production incidents in this cluster. The AI alert responder agent alone generated 14 documented failure patterns before it became reliable. A security scanner deployed by AI applied restricted PodSecurity labels to every namespace, silently blocking pod creation for half the applications in the cluster. The service selector trap – where AI routes 50% of requests to PostgreSQL instead of the application – appeared in 4 separate incidents before guardrails stopped it. This post catalogs the failure patterns, the five-layer guardrail architecture built to prevent them, and an honest assessment of what still goes wrong. ...

March 2, 2026 · 14 min · zolty
Two AIs managing a GitHub repository via issues and pull requests Two AIs managing a GitHub repository via issues and pull requests

Two AIs, One Codebase: Using Local Copilot to Direct GitHub Copilot via Issues and PRs

TL;DR A 109-day project plan. One day of actual work. Eight hours of active pipeline time. The key was treating planning and implementation as two separate AI-driven phases: spend an evening getting the plan right by routing it through multiple models, then let Claude Sonnet 4.6 implement it autonomously overnight via GitHub Copilot’s cloud agent while you sleep. This is the full playbook — planning phase included. The Project This came out of building dnd-multi, a full-stack AI Dungeon Master platform: FastAPI backend, Next.js 15 frontend, a Discord bot, LiveKit voice, and AWS Bedrock integration. Seven feature phases, a plan projected to take until June 19. ...

March 2, 2026 · 11 min · zolty
AI Dungeon Master platform architecture diagram AI Dungeon Master platform architecture diagram

Building an AI Dungeon Master: Full-Stack D&D Platform on k3s

TL;DR I’m building a multiplayer D&D platform where an AI powered by AWS Bedrock Claude runs the game. Players connect via a Next.js web app or Discord. A 5-tier lore context system gives the AI persistent memory across sessions. A background world simulation engine tracks NPC positions, inventory, faction standings, and in-game time so the AI can focus on storytelling instead of bookkeeping. The foundation is fully deployed on my home k3s cluster. The current work is turning a working tech demo into a game people actually want to sit down and play. ...

March 2, 2026 · 14 min · zolty
A terminal prompt with tool version numbers lined up neatly A terminal prompt with tool version numbers lined up neatly

Environment manifests for AI assistants across every repo

TL;DR I added a standardized environment.instructions.md file to every repository in my workspace. It’s a simple Markdown table of tool versions plus a few workflow snippets. AI assistants pick it up automatically, and they’ve stopped suggesting commands for tools or versions I don’t have. The whole thing took less than an hour to write out and propagate. The problem I run GitHub Copilot heavily across several projects — homelab infrastructure, a blog, a few Python services, and some supporting tooling. The AI context setup (those copilot-instructions.md files I wrote about here) covers what each project does and what its conventions are. What it doesn’t cover is what I’m actually running when I type commands in a terminal. ...

March 1, 2026 · 8 min · zolty
GitHub Copilot setup guide with AI skills and memory GitHub Copilot setup guide with AI skills and memory

Getting Started with GitHub Copilot: What Actually Works

TL;DR A $20/month GitHub Copilot subscription gives you Claude Sonnet 4.6, GPT-4o, and Gemini inside VS Code. Out of the box it’s useful. With a proper instruction setup — a copilot-instructions.md file, path-scoped rules, and skill documents — it becomes something you actually rely on. Most of the posts on this blog were built with this toolchain, mostly in the context of my k3s cluster, but the patterns apply anywhere. This is how I have it set up. ...

March 1, 2026 · 12 min · zolty
AI memory system architecture AI memory system architecture

Building an AI Memory System: From Blank Slate to 482 Lines of Hard-Won Knowledge

TL;DR The .github/copilot-instructions.md file started as 10 lines of project description and grew into a 99-line “operating system” for AI assistants. Then it split: failure patterns moved into docs/ai-lessons.md (now 482 lines across 20+ categories), and file-type-specific rules moved into .github/instructions/ with applyTo glob patterns. The same template structure was standardized across 5 repositories. This post traces the three generations of AI instruction architecture and shows how every production incident permanently improves AI reliability. ...

February 26, 2026 · 11 min · zolty
Monitoring goes blind — Longhorn storage corruption incident report Monitoring goes blind — Longhorn storage corruption incident report

When Monitoring Goes Blind: A Longhorn Storage Corruption Incident

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics — its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly-added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting mid-write files: a Prometheus WAL segment and a Loki TSDB index file. Both required offline surgery via a busybox pod to delete the corrupted files before the services could recover. ...

February 25, 2026 · 8 min · zolty
Wiki.js self-hosted knowledge base Wiki.js self-hosted knowledge base

The Cluster That Documents Itself: Self-Hosted Wiki.js as Living Infrastructure Knowledge

TL;DR I run Wiki.js on k3s as the cluster’s internal knowledge base. It is not a place I write documentation — it is a place the AI writes documentation after completing work. When Claude finishes deploying a service, debugging an incident, or refactoring infrastructure, it commits the results to the wiki with architecture diagrams, decision rationale, and operational notes. I am the primary reader. When I want to understand how something works, or why a specific decision was made three weeks ago, I go to the wiki instead of digging through git history or re-reading code. ...

February 24, 2026 · 5 min · zolty
AI context window audit AI context window audit

When Your AI Memory System Eats Its Own Context Window

TL;DR The AI memory system I built three weeks ago started causing the problem it was designed to solve: context window exhaustion. Five generic Claude skills — duplicated identically across all 5 repositories in my workspace — consumed 401KB (~100K tokens) of potential context. The gh-cli skill alone was 40KB per copy, accounting for 42% of all skill content. I ran a full audit, deleted 25 duplicate files, and documented the anti-pattern to prevent recurrence. ...

February 23, 2026 · 6 min · zolty
k3s cluster upgrade from v1.29 to v1.34 k3s cluster upgrade from v1.29 to v1.34

Upgrading k3s Across Five Minor Versions: v1.29 to v1.34 on a Homelab Cluster

TL;DR Upgraded a production k3s cluster from v1.29.0+k3s1 to v1.34.4+k3s1 across 8 nodes — 3 control plane servers, 4 amd64 worker agents, and 1 arm64 Lima VM agent. The upgrade stepped through every minor version (v1.29 → v1.30 → v1.31 → v1.32 → v1.33 → v1.34) with etcd snapshots between each step. Longhorn was upgraded from v1.6.0 to v1.8.2 in two stages (v1.7.3 as an intermediate step). SSH was broken to all cluster nodes, so the entire upgrade was done via Proxmox QEMU Guest Agent (qm guest exec) and Lima CLI (limactl shell). Discovered that k3s intentionally pins Traefik to v2.11.24 even when bundling Helm chart v27 — Traefik v3 migration is a separate effort. ...

February 22, 2026 · 10 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.