LLM | zolty.systems

Multiple language models consulted as a panel

A panel of LLMs: using Gemini and Claude to pressure-test decisions

TL;DR One model = one blind spot. Claude and Gemini were trained on different datasets, have different architectures, and will miss different things. Consult for breadth, not consensus. The goal is to widen your analysis surface and catch blind spots, not to reach agreement. If all three models agree, that’s suspicious: run stress tests. The bridge is simple: Playwright drives a logged-in Gemini tab via Chrome DevTools Protocol, pastes your prompt, and scrapes the response when done. Real session, no API key. Two modes: (1) cold second opinion, where you hand off a problem and ask for an independent take; (2) adversarial debate, where you assign sides, force defense, then reconcile. Both work; adversarial is brutal and useful for high-stakes calls. Anchoring is the trap. If you show Model B what Model A said and ask “do you agree?”, you get theater. Always pose cold or ask for the opposite argument first.

Every Model Has Its Blind Spot

I make the same mistakes in code as everyone else. My reasoning hits walls. I miss architectural gotchas and cut corners on testing. An LLM does the same, just in different places. ...
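
The anchoring rule in the TL;DR can be made mechanical at the prompt-building step. The following is a minimal sketch, not the post's actual bridge: the function names and prompt wording are hypothetical, and the browser automation itself is left out. It only shows that a cold panel sends every model the same prompt with no other model's answer attached, and that a debate assigns opposite sides up front.

```python
def cold_prompt(problem: str) -> str:
    """Cold second opinion: the problem only, no prior answers."""
    return (
        "Assess the following problem independently. "
        "List risks and what you would do differently.\n\n" + problem
    )


def build_panel(problem: str, models: list[str]) -> dict[str, str]:
    """Every panel member gets the identical cold prompt."""
    prompt = cold_prompt(problem)
    return {model: prompt for model in models}


def debate_prompts(problem: str, proposal: str) -> dict[str, str]:
    """Adversarial mode: one side defends, the other attacks."""
    base = f"Problem:\n{problem}\n\nProposal:\n{proposal}\n\n"
    return {
        "defend": base + "Argue FOR this proposal as strongly as you can.",
        "attack": base + "Argue AGAINST this proposal as strongly as you can.",
    }
```

Reconciliation would then be a separate, later step that sees both arguments at once, so neither side is written with the other's text in view.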

Parallel agents sweeping repos for improvements under a token budget

Token-budgeted self-improvement: pointing parallel agents at my own repos

TL;DR I have $X in monthly Claude tokens I don’t always use. Instead of letting the unused credit evaporate, I built a parallel agent sweep that fans out autonomous scouts to scan for dependency upgrades, CVEs, CI waste, and quick wins across my repos. Each discovery agent returns a scored candidate list. The orchestrator triages and ranks them, then spins up isolated worktree agents to implement the safe ones — all under a hard token cap and with human gates between phases. The output is a pile of merge requests, not silent commits. Noise is real and review burden is the limiting factor, but when it lands right, an hour of agent work + human review beats a weekend of manual maintenance. ...
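
The triage step described above (rank scored candidates, implement only the safe ones, stay under a hard token cap) could be sketched roughly as below. This is an illustration under assumptions: the `Candidate` fields, the `"safe"` risk label, and the greedy selection are all hypothetical, and the post's orchestrator may rank and gate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    repo: str
    title: str
    score: float      # scout's value estimate, higher is better (assumed scale)
    risk: str         # "safe" or "needs-human" (hypothetical labels)
    est_tokens: int   # estimated tokens needed to implement


def plan_sweep(candidates: list[Candidate], token_cap: int):
    """Pick safe candidates by descending score without exceeding the cap."""
    safe = [c for c in candidates if c.risk == "safe"]
    ranked = sorted(safe, key=lambda c: c.score, reverse=True)
    chosen, spent = [], 0
    for c in ranked:
        if spent + c.est_tokens > token_cap:
            continue  # skip anything that would break the hard cap
        chosen.append(c)
        spent += c.est_tokens
    return chosen, spent
```

Everything not chosen stays in the candidate list for a human to look at, which matches the idea of gates between phases rather than silent commits.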

A GitLab CI pipeline using an LLM to review and fix merge requests

LLM-powered GitLab CI: auto-reviewing and auto-fixing merge requests

TL;DR I’ve wired LLMs into my GitLab CI pipeline to auto-review merge requests, post findings as comments, and (on command) generate patches and commit fixes. The key insight: deterministic gates run first. Before the LLM ever sees a diff, regex-enforced checks block deleted tests, committed secrets, and destructive commands. Regex is certain; LLM judgment is probabilistic. Gate first, judge second. The bot reviews silently unless it finds something, posts to the MR with confidence levels, and can be leveled up from read-only observer to trusted committer as it proves itself — hence the “autonomy ladder” (Rungs 0–4) that gates who decides what. Infrastructure repos cap at Rung 2 (never auto-merge). ...
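
A minimal sketch of "gate first, judge second" follows. The regex patterns, function names, and the shape of the result are illustrative assumptions, not the pipeline's real rules; `llm_review` stands in for whatever call produces the model's findings.

```python
import re

# Hypothetical patterns; a real pipeline would maintain a longer, tuned list.
SECRET = re.compile(r"AKIA[0-9A-Z]{16}|-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----")
DESTRUCTIVE = re.compile(r"\brm\s+-rf\s+/|\bDROP\s+TABLE\b", re.I)
TEST_PATH = re.compile(r"(^|/)tests?/|(^|/)test_[^/]+\.py$")


def run_gates(diff: str) -> list[str]:
    """Deterministic checks over a unified diff; returns blocking findings."""
    findings, current = [], None
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            current = line.split(" b/", 1)[-1]
        elif line.startswith("deleted file mode") and current and TEST_PATH.search(current):
            findings.append(f"deleted test: {current}")
        elif line.startswith("+") and not line.startswith("+++"):
            if SECRET.search(line):
                findings.append("committed secret")
            if DESTRUCTIVE.search(line):
                findings.append("destructive command")
    return findings


def review(diff: str, llm_review) -> dict:
    """The LLM only sees the diff if every deterministic gate passes."""
    findings = run_gates(diff)
    if findings:
        return {"blocked": True, "findings": findings}
    return {"blocked": False, "findings": llm_review(diff)}
```

The ordering is the point: a gate failure short-circuits before any model call is made.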

Langfuse tracing and cost dashboards for autonomous LLM agents

Tracing and budgeting LLM agents with Langfuse

TL;DR I run unattended LLM agents on my homelab — they write code, open MRs, generate content, rotate secrets. The problem: they fail silently and bill silently. Langfuse (a tracing platform) logs every LLM call with input/output tokens, latency, and cost. On top of those traces, I built three background monitors that run weekly: a goal-drift detector that compares an agent’s stated objective to what its commits actually did (via embedding similarity), a cost-spike alert that fires at 80% and 100% of a daily budget cap, and an action audit that exports traces and flags sessions where the tool-call sequence diverged from the plan. Together, these let me sleep while autonomous agents handle repetitive work. ...
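
The cost-spike alert can be illustrated with a few lines of plain Python. This sketch assumes the per-call costs have already been exported from the traces as a list of numbers; it does not use the Langfuse SDK, and the function name is hypothetical. It fires each threshold once, the first time the running total crosses it.

```python
def budget_alerts(costs, daily_cap, thresholds=(0.8, 1.0)):
    """Return (threshold, running_total) pairs, one per threshold crossed."""
    total = 0.0
    fired = set()
    alerts = []
    for cost in costs:
        total += cost
        for t in thresholds:
            if t not in fired and total >= t * daily_cap:
                fired.add(t)  # each threshold alerts at most once per day
                alerts.append((t, total))
    return alerts


# Example: a 10-unit daily cap and four calls over the day.
alerts = budget_alerts([3.0, 4.0, 1.5, 2.0], daily_cap=10.0)
```

In this example the 80% alert fires on the third call (running total 8.5) and the 100% alert on the fourth (10.5).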

A LiteLLM gateway routing many model providers behind one OpenAI-compatible endpoint

A LiteLLM gateway for the homelab: one endpoint, many models, hard cost caps

TL;DR I put a LiteLLM proxy gateway in front of every LLM I use — local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts make runaway costs impossible. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? ~50ms latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust. ...
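
To make the routing idea concrete, here is a toy model of what a gateway does with an alias and a per-key budget. It is not LiteLLM's API or config format: the `Gateway` class, the backend labels, and the local model name are hypothetical, and only the two aliases come from the post.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push a key past its budget."""


ROUTES = {
    # alias requested by the app -> where the gateway sends it (illustrative)
    "solar-expert": {"backend": "ollama", "model": "small-local-model"},
    "claude-opus-4-8": {"backend": "openrouter", "model": "claude-opus-4-8"},
}


class Gateway:
    def __init__(self, routes, key_budgets):
        self.routes = routes
        self.budgets = dict(key_budgets)
        self.spent = {key: 0.0 for key in key_budgets}

    def route(self, api_key, alias, est_cost):
        """Resolve an alias to a backend, enforcing the key's budget first."""
        if alias not in self.routes:
            raise KeyError(f"unknown model alias: {alias}")
        if self.spent[api_key] + est_cost > self.budgets[api_key]:
            raise BudgetExceeded(api_key)
        self.spent[api_key] += est_cost
        return self.routes[alias]
```

Callers only ever name the alias, so swapping the backend behind `solar-expert` is a change to the routing table, not to any app.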

A crowded Ultima Online street where every NPC has something to say

The peasant has friends now: rumors, routines, and a 3,200-strong crowd

TL;DR Last time I wrote about giving my Ultima Online shard’s NPCs a voice, a memory, and a small autonomous life. That post ended with “the peasant talks back now.” In the eight days since, the project grew six new systems: NPCs keep daily routines anchored to real places, every town runs a rumor board that traveling NPCs physically carry between cities, townsfolk gossip about players (your katana, your karma, your reputation), the GM avatar got actual powers governed by a genie rule, villagers hand out delivery quests, and a population director keeps every city stocked with 200 ambient “denizens” who hail you in the street. That’s ~3,200 new NPCs and maybe a dozen new LLM call sites, still running entirely on a local gemma-class model — the trick is that the model never gained a single new permission. Every new capability is deterministic code; the LLM still only ever produces words and picks verbs off allowlists. Also: I found out my RAG pipeline had been silently dead for days, and the lesson there is worth the price of admission. ...

Mac Studio M3 Ultra as a GPU appliance proxied into a k3s cluster

The Mac Studio as a GPU appliance: serving Ollama and ComfyUI to a k3s cluster

TL;DR A Mac Studio M3 Ultra costs the same as a single 4090 but comes with 256 GB of unified memory and a 60-core GPU, all running at 100–200 W under inference. I stopped trying to pass MPS into containers and instead run Ollama and ComfyUI natively on macOS, then proxy them back into k3s as simple Kubernetes Services with manual Endpoints. Two Mac Studios connected via Thunderbolt 5 split the load: one handles hot-path LLM inference and embeddings, the other runs the heavy forge for diffusion and long-horizon reasoning. Both are cheaper to run than a single-socket A100 and require no special driver stacks. ...
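
A Service with manual Endpoints is a selector-less Service plus an Endpoints object of the same name that points at an IP outside the cluster. The sketch below builds both manifests as Python dicts and prints them as JSON, which `kubectl apply -f -` accepts. The service name, namespace, and IP address are placeholders, not the post's values; 11434 is Ollama's default port.

```python
import json


def external_service(name: str, namespace: str, ip: str, port: int):
    """Build a selector-less Service and a matching manual Endpoints object."""
    meta = {"name": name, "namespace": namespace}
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": dict(meta),
        # No "selector": Kubernetes will not manage Endpoints for this Service.
        "spec": {"ports": [{"name": "http", "port": port, "targetPort": port}]},
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": dict(meta),  # must share the Service's name and namespace
        "subsets": [
            {"addresses": [{"ip": ip}], "ports": [{"name": "http", "port": port}]}
        ],
    }
    return service, endpoints


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for manifest in external_service("ollama", "ai", "192.168.1.50", 11434):
        print(json.dumps(manifest, indent=2))
```

Pods in the cluster would then reach the Mac at the usual in-cluster DNS name for that Service, with no GPU passthrough involved.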

An Ultima Online town NPC with a speech bubble driven by a local language model

When the peasant talks back: LLM NPCs in Ultima Online

TL;DR I run an Ultima Online shard on my homelab where the NPCs are driven by a local LLM instead of canned dialog trees. Each NPC rolls a persisted identity, remembers conversations with individual players across reboots, runs its own errands and cross-map journeys, and — the part I’m writing about today — strikes up ambient chatter with nearby NPCs on its own. The newest work extends all of that from townsfolk to language-speaking monsters: ogres, lizardmen, ratmen, gargoyles, daemons, and especially liches, who address each other like god-kings deigning to notice an insect. Inference is a local gemma-class model behind an in-cluster gateway, so it’s free and private, with the one tradeoff being cold-load latency. It’s single-shard hobby-scale and it absolutely shows the seams. I love it. ...

C# integration scripts wiring a local language model into an Ultima Online shard

How LLM-driven NPCs work in Ultima Online (ServUO)

TL;DR I open-sourced the integration that puts a local LLM behind the NPCs on my Ultima Online (ServUO) shard. It’s about 7,500 lines of C# that drop into a shard’s Scripts/Custom/ directory and compile at boot — no separate build, no service to deploy. This post is the code-level companion to the story version of the project: how config hot-reloads, how the model client marshals async results back onto the game thread, how the LLM is kept entirely out of the simulation loop, and how a deterministic allowlist makes a non-deterministic model safe to put in a stateful world. The whole thing is fail-open: if the model is slow, down, or wrong, the NPC silently degrades to a vanilla ServUO NPC. Code is on GitHub: ZoltyMat/uo-llm-npc. ...
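
The allowlist and fail-open ideas fit in a short sketch. The real integration is C#; this Python version is only an illustration, and the verb set, the JSON shape of the model output, and the function names are all assumptions rather than the repository's actual design.

```python
import json

ALLOWED_VERBS = {"say", "wave", "walk_to", "idle"}  # hypothetical allowlist


def parse_action(raw):
    """Return (verb, text) if the output is valid and allowlisted, else None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    verb, text = data.get("verb"), data.get("text", "")
    if not isinstance(verb, str) or verb not in ALLOWED_VERBS:
        return None
    if not isinstance(text, str):
        return None
    return verb, text[:200]  # cap speech length before it reaches the game


def npc_turn(call_model, vanilla_behavior):
    """Fail open: any model error or rejected output falls back to vanilla."""
    try:
        action = parse_action(call_model())
    except Exception:
        action = None
    return action if action is not None else vanilla_behavior()
```

The model can only choose among verbs the game code already knows how to execute, so a wrong or hostile output degrades to a no-op rather than a new capability.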

Two Mac Studios bridged by Thunderbolt 5 running a 1T parameter MoE

Running a 1T-parameter MoE locally on two Mac Studios over Thunderbolt 5

TL;DR Two M3 Ultra Mac Studios — 256GB unified memory each — connected by a Thunderbolt 5 cable can run mixture-of-experts models in the trillion-parameter range that no single 256GB box can fit. The hot path stays on Box 1; Box 2 hosts heavier experts and gets called via a local nginx proxy on port 11436. Real-world power draw is nowhere near the spec sheet. Some models still don’t fit even with two boxes (Kimi K2.6 native INT4), and that’s a genuinely useful constraint to know. ...
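
The fit-or-not question is mostly arithmetic on weight size. The sketch below is a rough estimate under stated assumptions: it counts weights only (no KV cache or runtime overhead), uses decimal gigabytes, and treats `usable_fraction` as a made-up knob for how much unified memory is actually available to the model. It is not a measurement of any specific model.

```python
GB = 10**9


def weights_gb(params: int, bits_per_param: float) -> float:
    """Approximate weight footprint: parameters x bits / 8, in decimal GB."""
    return params * bits_per_param / 8 / GB


def fits(params: int, bits: float, boxes: int, mem_per_box_gb: float,
         usable_fraction: float = 1.0) -> bool:
    """True if the weights alone fit in the combined usable memory."""
    have = boxes * mem_per_box_gb * usable_fraction
    return weights_gb(params, bits) <= have


one_trillion = 10**12
need = weights_gb(one_trillion, 4)                    # 4-bit weights
single_box = fits(one_trillion, 4, boxes=1, mem_per_box_gb=256)
two_boxes = fits(one_trillion, 4, boxes=2, mem_per_box_gb=256)
two_boxes_tight = fits(one_trillion, 4, boxes=2, mem_per_box_gb=256,
                       usable_fraction=0.9)           # assumed headroom
```

A trillion parameters at 4 bits is about 500 GB of weights, which is why one 256GB box cannot hold it and why two boxes are close enough to the limit that some models still do not fit once overhead is counted.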