Operations

Power meter and heat-flow diagram for a homelab rack

Watts, BTUs, and the real cost of running a homelab 24/7

TL;DR A homelab feels free until you read the meter. After a year of running seven k3s nodes plus a pair of Mac Studios under whatever workload I felt like throwing at them, I sat down with a Kill-a-Watt and worked out what the cluster actually costs to keep on. Idle is genuinely cheap. Sustained LLM inference is not. The honest break-even against cloud inference is workload-shaped, and for my workloads, on-prem wins — but only because I run them often enough to amortize the wattage. The numbers below are mine; substitute your electricity rate to get yours. ...
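The arithmetic behind "substitute your electricity rate" can be sketched in a few lines. This is a minimal illustration, not the post's own calculation: the function names are hypothetical, and the 100 W draw and $0.15/kWh rate in the usage lines are placeholders rather than measured figures.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours, ignoring leap years
BTU_PER_WATT_HOUR = 3.412   # 1 W of continuous draw is roughly 3.412 BTU/h of heat


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a load that draws `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000.0


def annual_cost(watts: float, rate_per_kwh: float) -> float:
    """Yearly electricity cost for a continuous load at a flat rate."""
    return annual_kwh(watts) * rate_per_kwh


def heat_btu_per_hour(watts: float) -> float:
    """Heat the room has to shed, since nearly all drawn power ends up as heat."""
    return watts * BTU_PER_WATT_HOUR


if __name__ == "__main__":
    # Placeholder inputs: swap in your own meter reading and utility rate.
    draw_watts = 100.0
    rate = 0.15
    print(f"{annual_kwh(draw_watts):.0f} kWh/year")
    print(f"${annual_cost(draw_watts, rate):.2f}/year")
    print(f"{heat_btu_per_hour(draw_watts):.1f} BTU/h")
```

A flat rate is the simplest assumption; tiered or time-of-use pricing would need the draw split by period before multiplying.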

Four-rung ladder showing supervised, monitored, trusted, full autonomy stages

The agent autonomy trust ladder: supervised → monitored → trusted → full

TL;DR I run a growing fleet of autonomous agents — homelab ops, trading research, content generation. Most blow up the first few times they try anything new. I needed a way to decide what an agent is allowed to do without asking me, and what still requires a human checkpoint. The answer is a four-rung trust ladder — supervised, monitored, trusted, full autonomy. Agents earn rungs through track record, not promises. Demotions are possible and routine. The framework took the question “should this agent be allowed to do X” out of my head every single time and turned it into a policy I can apply consistently. ...
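One way to turn the four rungs into a policy check is an ordered enum plus a comparison. This is a sketch under assumptions: only the rung names come from the summary above, while `needs_human_checkpoint`, `promote`, and `demote` are hypothetical helpers, and the rule for when to promote or demote is left to the caller.

```python
from enum import IntEnum


class Rung(IntEnum):
    """The four rungs, ordered from least to most autonomy."""
    SUPERVISED = 0
    MONITORED = 1
    TRUSTED = 2
    FULL = 3


def needs_human_checkpoint(agent_rung: Rung, required_rung: Rung) -> bool:
    """True when the agent sits below the rung an action demands."""
    return agent_rung < required_rung


def promote(rung: Rung) -> Rung:
    """Move up one rung, capped at FULL."""
    return Rung(min(rung + 1, Rung.FULL))


def demote(rung: Rung) -> Rung:
    """Move down one rung, floored at SUPERVISED."""
    return Rung(max(rung - 1, Rung.SUPERVISED))


if __name__ == "__main__":
    agent = Rung.MONITORED
    # Hypothetical action that demands a TRUSTED agent.
    print(needs_human_checkpoint(agent, Rung.TRUSTED))
    agent = promote(agent)
    print(agent.name, needs_human_checkpoint(agent, Rung.TRUSTED))
```

Keeping the rungs as integers makes "is this agent allowed to do X" a single comparison, which is what lets the decision live in policy instead of in someone's head.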

LLM evaluator with masked headlines and dates

Blind Oracle: stripping dates, headlines, and tickers before trusting an LLM trading evaluator

TL;DR I run an LLM-driven trading hypothesis engine. For a while, every result that came back looked too good — Sharpe ratios above 5, win rates above 70%, all on out-of-sample windows. They were lies. The model was reading dates, headlines, and tickers in the prompt and pattern-matching against its training data, which extends well past my “out-of-sample” cutoff. The fix was a masking layer I now call Blind Oracle: strip every leak before evaluation, run the trigger before the eval, gate promotion on out-of-sample Sharpe with the masking enforced. After it shipped, the inflated numbers collapsed back to honest reality. Some hypotheses survived; most didn’t. That’s exactly what I needed to know. ...
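A masking pass of the kind described could look roughly like the following. It is a simplified sketch, not the actual Blind Oracle code: `mask_prompt`, the placeholder tokens, and the ISO-only date pattern are all assumptions, and a real layer would need to cover more date formats and company names as well as tickers.

```python
import re

# Assumption: dates appear in ISO form (YYYY-MM-DD). Other formats need more patterns.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def mask_prompt(text: str, tickers: list[str], headlines: list[str]) -> str:
    """Replace known headlines, tickers, and ISO dates with neutral tokens."""
    # Headlines first, so tickers or dates inside them vanish along with them.
    for headline in headlines:
        text = text.replace(headline, "[HEADLINE]")
    # Longest tickers first, so a short symbol never clips a longer one.
    for ticker in sorted(tickers, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(ticker)}\b", "[TICKER]", text)
    return DATE_RE.sub("[DATE]", text)


if __name__ == "__main__":
    raw = "On 2021-03-15 AAPL rose after 'Apple beats earnings'."
    print(mask_prompt(raw, ["AAPL"], ["Apple beats earnings"]))
```

The masked text is what would be handed to the evaluator; the unmasked original stays outside the model's view.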

Harbor proxy cache fronting upstream registries

Harbor as a proxy cache for every upstream registry — killing rate limits in a homelab

TL;DR Every node in my k3s cluster used to pull images directly from docker.io, ghcr.io, lscr.io, and quay.io. That meant Docker Hub rate limits, occasional 5xx storms from ghcr, and a hard outage when quay.io went sideways for a few hours. I put Harbor in front of all of them as a proxy cache, pointed containerd at Harbor, and the registry-related noise in my cluster effectively went to zero. Image pulls also got faster — 10GbE LAN beats every public CDN I’ve measured against. ...

Migrating from GitHub to self-hosted GitLab CE — and rebuilding it from S3

TL;DR I moved every private homelab repo off GitHub onto a self-hosted GitLab CE 18.10 instance running on my k3s cluster. GitHub stays as a read-only mirror plus the break-glass k3s_bootstrap repo. Two weeks later I accidentally blkdiscard’d the GitLab volume and rebuilt the entire instance from an S3 backup. It worked, but the boring parts (runner re-registration, group tokens, container-registry pull secrets) were the real cost. Why bother? GitHub was fine. GitHub Actions was fine. The thing that pushed me over was billing math plus blast radius: ...

Auto-documenting homelab architecture diagrams

Auto-documenting a homelab: the quest for free architecture diagrams

TL;DR I spent a full day trying to automatically generate professional architecture diagrams for a 7-node k3s homelab. Figma’s MCP integration was perfect but requires a paid subscription. I tried Excalidraw (JSON generation + Kroki rendering), Mermaid, and finally landed on raw SVG generation in Python. The result is 27 diagrams with tech icons, drop shadows, and curved arrows — but the process is more manual than I’d like. I’m curious if anyone else has found a truly automated, free solution. ...

ComfyUI on Mac Studio: MPS-Accelerated Image Generation Behind k3s Ingress

TL;DR I deployed ComfyUI natively on my Mac Studio M3 Ultra using Apple’s MPS GPU backend, proxied it through k3s Traefik ingress with Authentik SSO, wired it into Open WebUI as the image generation backend (replacing $0.04/image Bedrock calls), and built an MCP server so AI agents can generate images programmatically. The whole pipeline is Ansible-managed and generates images for free on local hardware. Why native instead of containerized: ComfyUI needs GPU access. On Linux, that’s straightforward: pass through the GPU via device plugins. On macOS, there’s no container runtime that exposes MPS (Metal Performance Shaders) to containers. Docker Desktop on Mac runs a Linux VM, so there is no Metal and no MPS. ...

Monitoring a Mac Studio as a First-Class Cluster Citizen: Prometheus, Loki, and Custom Ollama Exporters

TL;DR My Mac Studio M3 Ultra runs Ollama with 70B+ models but isn’t a k3s node. I needed it to show up in Grafana next to the cluster workloads. The solution: node_exporter for system metrics, a Go reverse proxy for per-model inference metrics, a custom Python exporter for model inventory and VRAM tracking, and Grafana Alloy for shipping logs to Loki. All four services managed by Ansible, all metrics scraped by the cluster’s Prometheus. ...

Hardening a Self-Hosted AI Agent: Multi-Stage Builds, NetworkPolicies, and Automated CVE Triage

TL;DR OpenClaw, my self-hosted AI trading agent, was running in a fat container with 46 Critical CVEs, no network restrictions, and no automated vulnerability scanning. I fixed all three: multi-stage Dockerfile dropped the CVE count to single digits, default-deny NetworkPolicies locked down traffic, and a daily CronJob triages Trivy scan results via local LLM and posts a digest to Slack. Total cost of the automated triage: $0/day. The problem with AI agent containers: they are uniquely bad from a security perspective. They need: ...

Dream Workers: Letting an AI Agent Improve Your Cluster While You Sleep

TL;DR I built an “Ops Dream Worker”: a Kubernetes CronJob that runs at 3 AM, inspects the cluster, identifies improvements, and files GitHub issues with specific fixes. It runs entirely on local models (Mac Studio M3 Ultra), costs $0 per run, and went through 240 A/B test iterations to optimize the prompts. The anti-hallucination patterns were harder to get right than the analysis itself. The idea: I have a k3s cluster with ~40 deployed services. I maintain it solo. There’s always something that could be better: a deployment missing resource limits, a CronJob that’s been failing silently, an ingress without SSO protection, a container image with known CVEs. These improvements pile up because I’m usually focused on building features, not auditing infrastructure. ...