Kubernetes

Auto-documenting homelab architecture diagrams

Auto-documenting a homelab: the quest for free architecture diagrams

TL;DR I spent a full day trying to automatically generate professional architecture diagrams for a 7-node k3s homelab. Figma’s MCP integration was perfect but requires a paid subscription. I tried Excalidraw (JSON generation + Kroki rendering), Mermaid, and finally landed on raw SVG generation in Python. The result is 27 diagrams with tech icons, drop shadows, and curved arrows — but the process is more manual than I’d like. I’m curious if anyone else has found a truly automated, free solution. ...

ComfyUI on Mac Studio: MPS-Accelerated Image Generation Behind k3s Ingress

TL;DR I deployed ComfyUI natively on my Mac Studio M3 Ultra using Apple’s MPS GPU backend, proxied it through k3s Traefik ingress with Authentik SSO, wired it into Open WebUI as the image generation backend (replacing $0.04/image Bedrock calls), and built an MCP server so AI agents can generate images programmatically. The whole pipeline is Ansible-managed and generates images for free on local hardware. Why native instead of containerized ComfyUI needs GPU access. On Linux, that’s straightforward — pass through the GPU via device plugins. On macOS, there’s no container runtime that exposes MPS (Metal Performance Shaders) to containers. Docker Desktop on Mac runs a Linux VM — no Metal, no MPS. ...

Hardening a Self-Hosted AI Agent: Multi-Stage Builds, NetworkPolicies, and Automated CVE Triage

TL;DR OpenClaw, my self-hosted AI trading agent, was running in a fat container with 46 Critical CVEs, no network restrictions, and no automated vulnerability scanning. I fixed all three: multi-stage Dockerfile dropped the CVE count to single digits, default-deny NetworkPolicies locked down traffic, and a daily CronJob triages Trivy scan results via local LLM and posts a digest to Slack. Total cost of the automated triage: $0/day. The problem with AI agent containers AI agent containers are uniquely bad from a security perspective. They need: ...

Dream Workers: Letting an AI Agent Improve Your Cluster While You Sleep

TL;DR I built an “Ops Dream Worker” — a Kubernetes CronJob that runs at 3 AM, inspects the cluster, identifies improvements, and files GitHub issues with specific fixes. It runs entirely on local models (Mac Studio M3 Ultra), costs $0 per run, and went through 240 A/B test iterations to optimize the prompts. The anti-hallucination patterns were harder to get right than the analysis itself. The idea I have a k3s cluster with ~40 deployed services. I maintain it solo. There’s always something that could be better — a deployment missing resource limits, a CronJob that’s been failing silently, an ingress without SSO protection, a container image with known CVEs. These improvements pile up because I’m usually focused on building features, not auditing infrastructure. ...

Voice AI on k3s: Whisper, Piper, and openWakeWord in Kubernetes

TL;DR I deployed Whisper (speech-to-text), Piper (text-to-speech), and openWakeWord (wake word detection) as Kubernetes workloads on my k3s cluster. Home Assistant connects to them over the Wyoming protocol for fully local voice pipelines. Total resource cost: ~1 CPU core and 1.75GB RAM. Total cloud cost: $0. Why run voice services in Kubernetes Home Assistant’s voice pipeline needs three things: something to listen for a wake word, something to transcribe speech, and something to speak back. The usual approach is running these on the same box as HA, or on a dedicated Pi. Both work fine until you want the services to survive node failures, be independently upgradeable, or share resources with other workloads. ...

Running AWS Lens as a Self-Hosted Web App on k3s

TL;DR AWS Lens is an open-source Electron desktop app for managing AWS resources — EC2, S3, Lambda, IAM, Cost Explorer, and more. I wanted it accessible from my browser without running a desktop app. I adapted it to run as a containerized Express server on k3s, fixed a class of runtime crashes from the Electron-to-web adapter, hardened it against three security issues, and deployed it behind Traefik and Let’s Encrypt. The changes are open-source in BoraKostem/AWS-Lens#21. ...

Week of March 23: Security Patches, AI Tooling, and Defending the Homelab on Reddit

TL;DR Busy week. Three CVE patches shipped on the same day. OpenClaw stabilized with OpenRouter support and a cost exporter. The Wiki.js fork with Mermaid 11 went live after clearing a Trivy scan. PiKey — a Raspberry Pi that pretends to be a Bluetooth keyboard — shipped as a side project. A self-hosted GitHub Actions cache server cut CI restore times from minutes to seconds. And a Reddit comment defending “I use Claude to manage my infrastructure” turned into five new blog posts and a documentation sprint. ...

Forking Wiki.js to Get Mermaid 11: When Upstream Won't Move

TL;DR Wiki.js 2.x ships Mermaid 8.8.2, released in 2020. Mermaid 11 — the current stable release — adds timeline diagrams, improved gitGraph, better theming, and fixes years of rendering bugs. The upstream project defers this upgrade to Wiki.js v3, which has no release date. The PR queue has sat idle for over a year. I forked requarks/wiki at tag v2.5.312, upgraded Mermaid in-place, patched 8 CVEs including one Critical SAML authentication bypass, reduced the vulnerability count from 8 Critical / 48 High to 3 Critical / 42 High, and deployed it to the cluster. The fork stays close to upstream — Vue 2 and Webpack 4 are untouched. Only the Mermaid surface and security dependencies are modified. ...

Self-Hosting a GitHub Actions Cache Server on NAS Storage

TL;DR If you run self-hosted GitHub Actions runners, every actions/cache step is round-tripping to GitHub’s cloud storage. For a homelab cluster with local runners, that means cache restores travel from GitHub’s CDN to your runner, through your ISP, and back – even though the runner is 10 feet from your NAS. I deployed falcondev-oss/github-actions-cache-server as a Kubernetes deployment, pointed it at NFS storage on my NAS, set one environment variable on my runners, and flushed all the GitHub-hosted caches. Zero workflow changes required. ...

Two Months of K3s Stability Improvements

TL;DR Over the past two months, I have made a series of stability improvements to my k3s homelab cluster. The biggest wins: migrating from AWS ECR to self-hosted Harbor (eliminating 12-hour token expiry), fixing recurring Grafana crashes caused by SQLite corruption on Longhorn, recovering pve4 after a failed LXC experiment, hardening NetworkPolicies to close gaps in pod-to-host traffic rules, and patching multiple CVEs across the media stack. The cluster now runs 7/7 nodes on k3s v1.34.4, all services monitored, all images pulled from Harbor with static credentials that never expire. ...