An Ultima Online town NPC with a speech bubble driven by a local language model An Ultima Online town NPC with a speech bubble driven by a local language model

When the peasant talks back: LLM NPCs in Ultima Online

TL;DR I run an Ultima Online shard on my homelab where the NPCs are driven by a local LLM instead of canned dialog trees. Each NPC rolls a persisted identity, remembers conversations with individual players across reboots, runs its own errands and cross-map journeys, and — the part I’m writing about today — strikes up ambient chatter with nearby NPCs on its own. The newest work extends all of that from townsfolk to language-speaking monsters: ogres, lizardmen, ratmen, gargoyles, daemons, and especially liches, who address each other like god-kings deigning to notice an insect. Inference is a local gemma-class model behind an in-cluster gateway, so it’s free and private, with the one tradeoff being cold-load latency. It’s single-shard hobby-scale and it absolutely shows the seams. I love it. ...

June 3, 2026 · 13 min · zolty
C# integration scripts wiring a local language model into an Ultima Online shard C# integration scripts wiring a local language model into an Ultima Online shard

How LLM-driven NPCs work in Ultima Online (ServUO)

TL;DR I open-sourced the integration that puts a local LLM behind the NPCs on my Ultima Online (ServUO) shard. It’s about 7,500 lines of C# that drop into a shard’s Scripts/Custom/ directory and compile at boot — no separate build, no service to deploy. This post is the code-level companion to the story version of the project: how config hot-reloads, how the model client marshals async results back onto the game thread, how the LLM is kept entirely out of the simulation loop, and how a deterministic allowlist makes a non-deterministic model safe to put in a stateful world. The whole thing is fail-open: if the model is slow, down, or wrong, the NPC silently degrades to a vanilla ServUO NPC. Code is on GitHub: ZoltyMat/uo-llm-npc. ...

June 1, 2026 · 11 min · zolty
Two Mac Studios bridged by Thunderbolt 5 running a 1T parameter MoE Two Mac Studios bridged by Thunderbolt 5 running a 1T parameter MoE

Running a 1T-parameter MoE locally on two Mac Studios over Thunderbolt 5

TL;DR Two M3 Ultra Mac Studios — 256GB unified memory each — connected by a Thunderbolt 5 cable can run mixture-of-experts models in the trillion-parameter range that no single 256GB box can fit. The hot path stays on Box 1; Box 2 hosts heavier experts and gets called via a local nginx proxy on port 11436. Real-world power draw is nowhere near the spec sheet. Some models still don’t fit even with two boxes (Kimi K2.6 native INT4), and that’s a genuinely useful constraint to know. ...

May 6, 2026 · 6 min · zolty
LLM evaluator with masked headlines and dates LLM evaluator with masked headlines and dates

Blind Oracle: stripping dates, headlines, and tickers before trusting an LLM trading evaluator

TL;DR I run an LLM-driven trading hypothesis engine. For a while, every result that came back looked too good — Sharpe ratios above 5, win rates above 70%, all on out-of-sample windows. They were lies. The model was reading dates, headlines, and tickers in the prompt and pattern-matching against its training data, which extends well past my “out-of-sample” cutoff. The fix was a masking layer I now call Blind Oracle: strip every leak before evaluation, run the trigger before the eval, gate promotion on out-of-sample Sharpe with the masking enforced. After it shipped, the inflated numbers collapsed back to honest reality. Some hypotheses survived; most didn’t. That’s exactly what I needed to know. ...

May 4, 2026 · 5 min · zolty
Self-hosted AI setup with OpenClaw and Ollama Self-hosted AI setup with OpenClaw and Ollama

Self-Hosted AI on a 24GB GPU: OpenClaw + Ollama Setup Guide for Windows

TL;DR You have a 24GB VRAM GPU. You want a private, self-hosted AI assistant that rivals ChatGPT – no subscriptions, no data leaving your machine. This guide walks you through setting up Ollama (local model runtime) and OpenClaw (AI gateway with a web UI) on Windows using Docker Desktop. But the real value here is the model recommendations. I ran 5,475 evaluations across 21 prompt variants and 6 models on real trading data. The results contradicted almost everything the community recommends. Finance-tuned models performed worse than a coin flip. Chain-of-thought reasoning models were anti-patterns. The winners were general-purpose MoE (Mixture-of-Experts) models that nobody talks about for specialized tasks. ...

April 14, 2026 · 21 min · zolty
GLM-5.1 benchmark on Mac Studio GLM-5.1 benchmark on Mac Studio

Running GLM-5.1 (744B) Locally on a Mac Studio: Benchmark Results

TL;DR I loaded Z.ai’s GLM-5.1 — a 744B parameter MoE model with 40B active parameters — onto a Mac Studio M3 Ultra with 256GB unified memory using a 2-bit quantized GGUF via llama.cpp. It runs at 5.8 tok/s with a 120-second time to first token. The financial analysis quality is genuinely impressive, but it eats 222GB of the 256GB available, leaving room for literally nothing else. It’s a “clear the schedule” model, not an always-on one. ...

April 13, 2026 · 8 min · zolty
Dream Workers autonomous cluster agent Dream Workers autonomous cluster agent

Dream Workers: Letting an AI Agent Improve Your Cluster While You Sleep

TL;DR I built an “Ops Dream Worker” — a Kubernetes CronJob that runs at 3 AM, inspects the cluster, identifies improvements, and files GitHub issues with specific fixes. It runs entirely on local models (Mac Studio M3 Ultra), costs $0 per run, and went through 240 A/B test iterations to optimize the prompts. The anti-hallucination patterns were harder to get right than the analysis itself. The idea I have a k3s cluster with ~40 deployed services. I maintain it solo. There’s always something that could be better — a deployment missing resource limits, a CronJob that’s been failing silently, an ingress without SSO protection, a container image with known CVEs. These improvements pile up because I’m usually focused on building features, not auditing infrastructure. ...

April 8, 2026 · 8 min · zolty
PiKey Bluetooth keyboard emulator PiKey Bluetooth keyboard emulator

PiKey: A Raspberry Pi That Pretends to Be Your Keyboard

TL;DR PiKey is a Raspberry Pi project that spoofs a Logitech K380 Bluetooth keyboard and mouse. It jiggles the mouse to prevent idle detection and auto-types LLM-generated text to simulate human activity. The device appears as a standard Bluetooth HID peripheral – no drivers or software needed on the target machine. Three full implementations exist: Python (primary), Rust (static binary), and C (minimal dependencies). The whole thing was inspired by a Reddit thread on r/overemployed where someone asked for exactly this device. ...

March 27, 2026 · 6 min · zolty
OpenClaw AI gateway on k3s OpenClaw AI gateway on k3s

OpenClaw on k3s: Replacing Open WebUI with a Lighter AI Gateway

TL;DR I replaced Open WebUI with OpenClaw – a lighter, WebSocket-based AI assistant gateway that installs from npm, supports multiple chat channels (web, Telegram, Discord, WhatsApp), and deploys on k3s as a single Deployment with a custom Docker image. The primary model provider is Anthropic’s direct API (Claude Sonnet 4.5), with LiteLLM/Bedrock as a fallback. The biggest deployment lesson: OpenClaw binds to loopback by default, which makes it invisible to Kubernetes Services and health probes. The fix is --bind lan, which requires a gateway token for authentication. ...

March 23, 2026 · 13 min · zolty
Multi-model AI planning workflow diagram Multi-model AI planning workflow diagram

Multi-Model Planning: The Same Pattern That Shipped dnd-multi

TL;DR The Jellyfin HA conversion touches a .NET 10 codebase, Entity Framework Core migrations, Kubernetes manifests, Terraform infrastructure, PostgreSQL operations, and FFmpeg transcoding pipelines. No single AI model understands all of this equally well. So I used four of them — the same multi-model planning pattern that shipped dnd-multi in a single day and that I documented in the LLM GitHub PR workflow. This post covers how I adapted that pattern for infrastructure work, what each model caught, and why planning is where all the human time should go. ...

March 7, 2026 · 7 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.