Posts

A LiteLLM gateway routing many model providers behind one OpenAI-compatible endpoint

A LiteLLM gateway for the homelab: one endpoint, many models, hard cost caps

TL;DR I put a LiteLLM proxy gateway in front of every LLM I use — local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts make runaway costs impossible. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? ~50ms latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust. ...
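
The routing-plus-budget idea in that summary can be sketched in a few lines of plain Python. This is only an illustration of the concept, not LiteLLM's code or its YAML schema: the `ROUTES` table, the backend strings, the `ApiKey` class, and the `route` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical alias table: the alias names come from the post, the backend
# strings are made up for illustration.
ROUTES = {
    "solar-expert": "ollama/small-local-model",
    "claude-opus-4-8": "openrouter/claude-opus-4-8",
}


@dataclass
class ApiKey:
    """A caller identity with a hard spend cap (assumed shape, not LiteLLM's)."""
    name: str
    budget_usd: float
    spent_usd: float = 0.0


class BudgetExceeded(Exception):
    """Raised when a call would push a key past its budget."""


def route(key: ApiKey, model: str, est_cost_usd: float) -> str:
    """Resolve a model alias to a backend, refusing calls that exceed the key's budget."""
    if model not in ROUTES:
        raise KeyError(f"unknown model alias: {model}")
    if key.spent_usd + est_cost_usd > key.budget_usd:
        raise BudgetExceeded(f"{key.name} would exceed ${key.budget_usd:.2f}")
    # Only record spend once the call is accepted.
    key.spent_usd += est_cost_usd
    return ROUTES[model]
```

Callers only ever name an alias; which backend serves it, and whether the key can still afford it, is decided in one place.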

A crowded Ultima Online street where every NPC has something to say

The peasant has friends now: rumors, routines, and a 3,200-strong crowd

TL;DR Last time I wrote about giving my Ultima Online shard’s NPCs a voice, a memory, and a small autonomous life. That post ended with “the peasant talks back now.” In the eight days since, the project grew six new systems: NPCs keep daily routines anchored to real places, every town runs a rumor board that traveling NPCs physically carry between cities, townsfolk gossip about players (your katana, your karma, your reputation), the GM avatar got actual powers governed by a genie rule, villagers hand out delivery quests, and a population director keeps every city stocked with 200 ambient “denizens” who hail you in the street. That’s ~3,200 new NPCs and maybe a dozen new LLM call sites, still running entirely on a local gemma-class model — the trick is that the model never gained a single new permission. Every new capability is deterministic code; the LLM still only ever produces words and picks verbs off allowlists. Also: I found out my RAG pipeline had been silently dead for days, and the lesson there is worth the price of admission. ...

An MCP server wrapping a local homelab API for AI agents

Writing MCP servers for your homelab: five tools, 200 lines, and your agents get hands

TL;DR Model Context Protocol (MCP) is an open protocol that lets Claude and other LLM agents call local tools with typed signatures and structured responses. Any HTTP API running on your homelab (ComfyUI, a wiki, a dashboard, a custom service) can become a set of agent-callable tools by wrapping it in a FastMCP server. A typical server takes 150–250 lines of Python, exposes 3–5 tools via @mcp.tool() decorators, and runs as a stdio process. The pattern scales from single-purpose (image generation) to multi-tool (queue status, model listing, system stats) without an explosion in complexity. This post shows the anatomy by dissecting the ComfyUI MCP server: how to build workflows, poll for completion, parse results, and return structured JSON that agents actually use. ...

Traefik forward-auth middleware fronting homelab services with Authentik SSO

Every homelab service behind one login: Traefik forward-auth with Authentik

TL;DR Every service I run — ComfyUI, Grafana, Vault, even the ancient app on a Mac across the network — lives behind a Traefik forward-auth middleware that hands off to Authentik. No per-service login page. One Authentik login shared across everything. The magic is a two-route IngressRoute pattern: a protected route with the middleware + an unprotected callback route for the OAuth flow itself. Adding a new service to the cluster takes five lines of YAML. Wiring a non-Kubernetes backend — like the Mac that runs ComfyUI and Ollama — takes a service-with-manual-endpoints proxy. ...

Tiered model storage across local SSD and MinIO object storage

Tiered model storage with MinIO and rclone: keep the SSD hot, archive the rest

TL;DR Stable Diffusion 3.5 Large is 15 GB. RealVisXL is 6.5 GB. Throw in a few LoRAs and a VAE, and your SSD hits the wall fast. I run a MinIO bucket as the long-tail model store, sync it to a local overflow directory on a 30-minute schedule via rclone, and register both the hot (SSD) and cold (synced overflow) paths in ComfyUI’s extra_model_paths.yaml. Models appear transparently; the loader searches both tiers. A fresh model lands in MinIO, appears locally within 30 minutes, and ComfyUI finds it without any manual shuffling. ...
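
The "loader searches both tiers" behavior can be pictured as an ordered path lookup. The sketch below is a stand-in written for this listing, not ComfyUI's loader: `find_model` and the directory names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_model(name: str, tiers: Sequence[Path]) -> Optional[Path]:
    """Return the first existing file called `name`, searching tiers in order.

    Listing the hot (SSD) directory before the cold (synced overflow)
    directory means a model present in both resolves to the SSD copy.
    """
    for tier in tiers:
        candidate = tier / name
        if candidate.is_file():
            return candidate
    return None
```

With assumed paths, `find_model("model.safetensors", [Path("/ssd/models"), Path("/overflow/models")])` returns the SSD copy when one exists, the synced copy otherwise, and `None` if the file has not landed in either tier yet.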

Mac Studio M3 Ultra as a GPU appliance proxied into a k3s cluster

The Mac Studio as a GPU appliance: serving Ollama and ComfyUI to a k3s cluster

TL;DR A Mac Studio M3 Ultra costs the same as a single 4090 but comes with 256 GB of unified memory and a 60-core GPU, all running at 100–200 W under inference. I stopped trying to pass MPS into containers and instead run Ollama and ComfyUI natively on macOS, then proxy them back into k3s as simple Kubernetes Services with manual Endpoints. Two Mac Studios connected via Thunderbolt 5 split the load: one handles hot-path LLM inference and embeddings, the other runs the heavy forge for diffusion and long-horizon reasoning. Both are cheaper to run than a single-socket A100 and require no special driver stacks. ...

An Ultima Online town NPC with a speech bubble driven by a local language model

When the peasant talks back: LLM NPCs in Ultima Online

TL;DR I run an Ultima Online shard on my homelab where the NPCs are driven by a local LLM instead of canned dialog trees. Each NPC rolls a persisted identity, remembers conversations with individual players across reboots, runs its own errands and cross-map journeys, and — the part I’m writing about today — strikes up ambient chatter with nearby NPCs on its own. The newest work extends all of that from townsfolk to language-speaking monsters: ogres, lizardmen, ratmen, gargoyles, daemons, and especially liches, who address each other like god-kings deigning to notice an insect. Inference is a local gemma-class model behind an in-cluster gateway, so it’s free and private, with the one tradeoff being cold-load latency. It’s single-shard hobby-scale and it absolutely shows the seams. I love it. ...

C# integration scripts wiring a local language model into an Ultima Online shard

How LLM-driven NPCs work in Ultima Online (ServUO)

TL;DR I open-sourced the integration that puts a local LLM behind the NPCs on my Ultima Online (ServUO) shard. It’s about 7,500 lines of C# that drop into a shard’s Scripts/Custom/ directory and compile at boot — no separate build, no service to deploy. This post is the code-level companion to the story version of the project: how config hot-reloads, how the model client marshals async results back onto the game thread, how the LLM is kept entirely out of the simulation loop, and how a deterministic allowlist makes a non-deterministic model safe to put in a stateful world. The whole thing is fail-open: if the model is slow, down, or wrong, the NPC silently degrades to a vanilla ServUO NPC. Code is on GitHub: ZoltyMat/uo-llm-npc. ...
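
The allowlist-plus-fail-open idea is language-independent, so here is a small Python sketch of it rather than the project's C#. The verb set, the "verb then argument" output format, and the `choose_action` function are assumptions made for illustration; the actual repository may structure this differently.

```python
from typing import Optional, Tuple

# Hypothetical verb allowlist; the post does not list the real verbs.
ALLOWED_VERBS = frozenset({"say", "emote", "walk_to"})


def choose_action(model_output: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (verb, argument) if the model picked an allowlisted verb, else None.

    None is the fail-open signal: the caller ignores the model and lets the
    NPC behave like a stock NPC. The model's text never runs as code; it only
    selects among actions that deterministic code already knows how to perform.
    """
    if not model_output:
        # Model was slow, down, or returned nothing.
        return None
    verb, _, arg = model_output.strip().partition(" ")
    verb = verb.lower()
    if verb not in ALLOWED_VERBS:
        # Model was "wrong": it asked for something that is not permitted.
        return None
    return verb, arg.strip()
```

Because every rejected or missing output collapses to `None`, the calling code needs only one fallback path no matter how the model misbehaves.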

A Surface tablet wall-mounted as a Home Assistant dashboard

A $150 Surface Pro 7 is the best Home Assistant wall panel you can buy

TL;DR Purpose-built smart-home wall panels are expensive, locked down, and usually underpowered. A used Microsoft Surface Pro 7 — Core i5 or i7, 16 GB RAM, a sharp 12.3" touchscreen — runs about $150 on the surplus market and makes a fantastic wall-mounted dashboard for Home Assistant, Grafana, or whatever you self-host. It’s a full x86 PC behind a great touchscreen, so it runs a real browser with your real dashboards, not a stripped-down panel app. Here’s the build. ...

Re-tuning an AI coding agent for a new model release

Re-tuning my Claude Code setup for a new Opus model

TL;DR A new Opus model shipped, so I sat down to re-tune the agent harness I drive it with — the CLAUDE.md files, skills, hooks, and settings that shape every session. The surprising part: the most valuable changes weren’t trimming prompts for the smarter model. They were wiring the agent into infrastructure I already run — offloading bulk work to a local LLM (≈$0), a live homelab statusline, session tracing for an “action audit,” and a goal-drift monitor that uses the local model as judge. I also learned not to trust the new model’s own suggestions about what to cut. It wanted to delete load-bearing guardrails. ...