zolty.systems

Parallel agents sweeping repos for improvements under a token budget

Token-budgeted self-improvement: pointing parallel agents at my own repos

TL;DR I have $X in monthly Claude tokens I don’t always use. Instead of letting the unused credit evaporate, I built a parallel agent sweep that fans out autonomous scouts to scan for dependency upgrades, CVEs, CI waste, and quick wins across my repos. Each discovery agent returns a scored candidate list. The orchestrator triages and ranks them, then spins up isolated worktree agents to implement the safe ones — all under a hard token cap and with human gates between phases. The output is a pile of merge requests, not silent commits. Noise is real and review burden is the limiting factor, but when it lands right, an hour of agent work + human review beats a weekend of manual maintenance. ...
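
A minimal sketch of the triage step described above: rank scored candidates and keep the safe ones that fit under a hard token cap. The `Candidate` fields, the score scale, and the greedy strategy are all assumptions for illustration; the excerpt does not say how the orchestrator actually ranks or budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    repo: str
    title: str
    score: float      # higher is better (hypothetical scale)
    est_tokens: int   # estimated token cost to implement (hypothetical)
    safe: bool        # marked safe to automate during triage


def pick_under_budget(candidates, token_cap):
    """Greedily take safe candidates by descending score while they fit the cap."""
    chosen, spent = [], 0
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        if not c.safe:
            continue  # unsafe items are left for a human
        if spent + c.est_tokens > token_cap:
            continue  # would exceed the hard cap, skip it
        chosen.append(c)
        spent += c.est_tokens
    return chosen, spent
```

Greedy selection is not optimal in general, but it is easy to reason about, and the cap can never be exceeded.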

A GitLab CI pipeline using an LLM to review and fix merge requests

LLM-powered GitLab CI: auto-reviewing and auto-fixing merge requests

TL;DR I’ve wired LLMs into my GitLab CI pipeline to auto-review merge requests, post findings as comments, and (on command) generate patches and commit fixes. The key insight: deterministic gates run first. Before the LLM ever sees a diff, regex-enforced checks block deleted tests, committed secrets, and destructive commands. Regex is certain; LLM judgment is probabilistic. Gate first, judge second. The bot reviews silently unless it finds something, posts to the MR with confidence levels, and can be leveled up from read-only observer to trusted committer as it proves itself — hence the “autonomy ladder” (Rungs 0–4) that gates who decides what. Infrastructure repos cap at Rung 2 (never auto-merge). ...

A stack of Dell OptiPlex small-form-factor desktops wired as a k3s cluster

Build a 3-node K3s cluster from $150 surplus Dell OptiPlex desktops

TL;DR My production homelab runs on Lenovo M920q tinies, and I still think those are the sweet spot. But if I were starting over today with a tight budget, I’d buy a stack of government-surplus Dell OptiPlex 7060 and 7070 desktops instead. They go for around $150 each refurbished — 6-core 8th/9th-gen Intel, an SSD, and Windows 11 already on them — and they make excellent Kubernetes nodes with exactly two cheap upgrades: a bit more RAM and a second network card. ...

Langfuse tracing and cost dashboards for autonomous LLM agents

Tracing and budgeting LLM agents with Langfuse

TL;DR I run unattended LLM agents on my homelab — they write code, open MRs, generate content, rotate secrets. The problem: they fail silently and bill silently. Langfuse (a tracing platform) logs every LLM call with input/output tokens, latency, and cost. On top of those traces, I built three background monitors that run weekly: a goal-drift detector that compares an agent’s stated objective to what its commits actually did (via embedding similarity), a cost-spike alert that fires at 80% and 100% of a daily budget cap, and an action audit that exports traces and flags sessions where the tool-call sequence diverged from the plan. Together, these let me sleep while autonomous agents handle repetitive work. ...
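
As a rough sketch of the cost-spike alert, the check below reports which budget thresholds the current spend has newly crossed. The function name and the `already_fired` bookkeeping are assumptions; in practice the spend figure would come from the exported traces, which is not shown here.

```python
def budget_alerts(spend_usd, cap_usd, already_fired=frozenset(), thresholds=(0.8, 1.0)):
    """Return the thresholds crossed by spend that have not fired yet.

    The caller is expected to persist the fired set so each alert is sent once
    per budget period (hypothetical bookkeeping, not from the post).
    """
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spend_usd / cap_usd
    return [t for t in thresholds if ratio >= t and t not in already_fired]
```

For example, with a cap of 10 and spend of 8.5, only the 80% threshold is returned; once spend passes 10, the 100% threshold follows.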

A LiteLLM gateway routing many model providers behind one OpenAI-compatible endpoint

A LiteLLM gateway for the homelab: one endpoint, many models, hard cost caps

TL;DR I put a LiteLLM proxy gateway in front of every LLM I use — local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts make runaway costs impossible. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? ~50ms latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust. ...

A crowded Ultima Online street where every NPC has something to say

The peasant has friends now: rumors, routines, and a 3,200-strong crowd

TL;DR Last time I wrote about giving my Ultima Online shard’s NPCs a voice, a memory, and a small autonomous life. That post ended with “the peasant talks back now.” In the eight days since, the project grew six new systems: NPCs keep daily routines anchored to real places, every town runs a rumor board that traveling NPCs physically carry between cities, townsfolk gossip about players (your katana, your karma, your reputation), the GM avatar got actual powers governed by a genie rule, villagers hand out delivery quests, and a population director keeps every city stocked with 200 ambient “denizens” who hail you in the street. That’s ~3,200 new NPCs and maybe a dozen new LLM call sites, still running entirely on a local gemma-class model — the trick is that the model never gained a single new permission. Every new capability is deterministic code; the LLM still only ever produces words and picks verbs off allowlists. Also: I found out my RAG pipeline had been silently dead for days, and the lesson there is worth the price of admission. ...
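
The "words and verbs off allowlists" constraint can be sketched as a strict validator sitting between the model and the game code. The verb names and the JSON shape here are hypothetical; the point is only that anything outside the allowlist is dropped before deterministic code acts on it.

```python
import json

# Hypothetical verb set; the real shard's allowlists are not shown in the excerpt.
ALLOWED_VERBS = frozenset({"say", "walk_to", "wave", "give_quest"})


def parse_npc_action(raw, allowed=ALLOWED_VERBS, max_text=200):
    """Return a validated action dict, or None if the model output is not acceptable."""
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict):
        return None
    verb = data.get("verb")
    text = data.get("text", "")
    if not isinstance(verb, str) or verb not in allowed:
        return None  # the model asked for something it is not permitted to do
    if not isinstance(text, str):
        return None
    return {"verb": verb, "text": text[:max_text]}
```

Returning `None` instead of raising lets the caller fall back to a canned line when the model misbehaves.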

An MCP server wrapping a local homelab API for AI agents

Writing MCP servers for your homelab: five tools, 200 lines, and your agents get hands

TL;DR Model Context Protocol (MCP) is an open protocol that lets Claude and other LLM agents call local tools with typed signatures and structured responses. Any HTTP API running on your homelab (ComfyUI, a wiki, a dashboard, a custom service) can become a set of agent-callable tools by wrapping it in a FastMCP server. A typical server takes 150–250 lines of Python, exposes 3–5 tools via @mcp.tool() decorators, and runs as a stdio process. The pattern scales from single-purpose (image generation) to multi-tool (queue status, model listing, system stats) without a complexity explosion. This post shows the anatomy by dissecting the ComfyUI MCP server: how to build workflows, poll for completion, parse results, and return structured JSON that agents actually use. ...
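
The "poll for completion, return structured JSON" step can be sketched independently of any MCP library. Here `fetch_status` is a hypothetical callable standing in for whatever HTTP request the real server makes, and the `done`/`outputs` keys are assumed names rather than ComfyUI's actual response format.

```python
import time


def wait_for_result(fetch_status, timeout_s=60.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it reports done or the timeout passes.

    Always returns a small JSON-serializable dict so the calling agent gets a
    structured answer on both success and failure.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("done"):
            return {"ok": True, "outputs": status.get("outputs", [])}
        if clock() >= deadline:
            return {"ok": False, "error": "timeout"}
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the function testable without real waiting; a tool function would call it and return the dict unchanged.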

Traefik forward-auth middleware fronting homelab services with Authentik SSO

Every homelab service behind one login: Traefik forward-auth with Authentik

TL;DR Every service I run — ComfyUI, Grafana, Vault, even the ancient app on a Mac across the network — lives behind a Traefik forward-auth middleware that hands off to Authentik. No per-service login page. One Authentik login shared across everything. The magic is a two-route IngressRoute pattern: a protected route with the middleware + an unprotected callback route for the OAuth flow itself. Adding a new service to the cluster takes five lines of YAML. Wiring a non-Kubernetes backend — like the Mac that runs ComfyUI and Ollama — takes a service-with-manual-endpoints proxy. ...

Tiered model storage across local SSD and MinIO object storage

Tiered model storage with MinIO and rclone: keep the SSD hot, archive the rest

TL;DR Stable Diffusion 3.5 Large is 15 GB. RealVisXL is 6.5 GB. Throw in a few LoRAs and a VAE, and your SSD hits the wall fast. I run a MinIO bucket as the long-tail model store, sync it to a local overflow directory on a 30-minute schedule via rclone, and register both the hot (SSD) and cold (synced overflow) paths in ComfyUI’s extra_model_paths.yaml. Models appear transparently; the loader searches both tiers. A fresh model lands in MinIO, appears locally within 30 minutes, and ComfyUI finds it without any manual shuffling. ...
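
The two-tier lookup can be illustrated with a short sketch: check the hot SSD directory first, then the synced overflow directory. This is not ComfyUI's loader, only an assumed model of the search order, and the function name is hypothetical.

```python
from pathlib import Path


def find_model(name, tiers):
    """Return the first existing path for name across tiers, searched in order."""
    for tier in tiers:
        candidate = Path(tier) / name
        if candidate.is_file():
            return candidate
    return None  # not present in any tier yet (for example, not synced so far)
```

Listing the SSD path first means a model present in both tiers is always served from the faster one.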

Mac Studio M3 Ultra as a GPU appliance proxied into a k3s cluster

The Mac Studio as a GPU appliance: serving Ollama and ComfyUI to a k3s cluster

TL;DR A Mac Studio M3 Ultra costs the same as a single 4090 but comes with 256 GB of unified memory and a 60-core GPU, all running at 100–200 W under inference. I stopped trying to pass MPS into containers and instead run Ollama and ComfyUI natively on macOS, then proxy them back into k3s as simple Kubernetes Services with manual Endpoints. Two Mac Studios connected via Thunderbolt 5 split the load: one handles hot-path LLM inference and embeddings, the other runs the heavy forge for diffusion and long-horizon reasoning. Both are cheaper to run than a single-socket A100 and require no special driver stacks. ...