TL;DR
I open-sourced the integration that puts a local LLM behind the NPCs on my Ultima Online (ServUO) shard. It’s about 7,500 lines of C# that drop into a shard’s Scripts/Custom/ directory and compile at boot — no separate build, no service to deploy. This post is the code-level companion to the story version of the project: how config hot-reloads, how the model client marshals async results back onto the game thread, how the LLM is kept entirely out of the simulation loop, and how a deterministic allowlist makes a non-deterministic model safe to put in a stateful world. The whole thing is fail-open: if the model is slow, down, or wrong, the NPC silently degrades to a vanilla ServUO NPC. Code is on GitHub: ZoltyMat/uo-llm-npc.
The shape of the problem
ServUO — the modern descendant of the RunUO emulator — has a property that makes this whole project tractable: it compiles the C# under Scripts/ at server boot. Drop a .cs file into Scripts/Custom/, restart the shard, and your code is live game logic with full access to the world’s object model. There’s no NuGet, no csproj, no separate artifact. The shard is the build system.
That’s why this ships as scripts, not as a service. The integration is sixteen files under one folder:
Scripts/Custom/LLMNpc/
├── LLMConfig.cs # config load + [LLMReload, default-file writer
├── LLMClient.cs # OpenAI-compatible chat client
├── LLMConversation.cs # per-(player, npc) short-term memory
├── LLMTalkingMobile.cs # base class for LLM-voiced creatures
├── LLMAmbientSpeech.cs # gives EXISTING vanilla NPCs a voice
├── LLMAmbientMemory.cs # disk-backed identity/relationships for those
├── NpcIdentity.cs # rolled, persisted per-NPC persona
├── NpcActions.cs # the cosmetic-action allowlist
├── NpcChatter.cs # NPC-to-NPC ambient conversation
├── Errand.cs # errand phase model
├── ErrandDirector.cs # the deterministic heartbeat
├── ErrandPolicy.cs # per-type roam limits
├── BritanniaGeography.cs # position → town an NPC serves
├── LLMRag.cs # optional Qdrant lore/style/journal grounding
├── AnomalyDirector.cs # rare 4th-wall "anomaly" events
└── LLMNpcCommands.cs # in-game GM commands
The interesting thing about this list is how little of it is “call the model.” LLMClient.cs is the only file that talks HTTP to an LLM. Everything else is the scaffolding that makes a clicked-on blacksmith feel like a person — and that scaffolding is the actual project.
Configuration you can reload without a restart
A UO shard is a long-lived stateful process. Restarting it to flip a config flag means kicking everyone off and reloading the entire world from disk. So the very first thing the integration needed was hot-reloadable config.
LLMConfig.cs reads a flat key=value file at Config/LLMNpc.cfg. If the file doesn’t exist on first boot, it writes a fully-commented default. An in-game GM command, [LLMReload, re-reads it live:
public static void Reload()
{
    var next = Load(Path.Combine(Core.BaseDirectory, "Config", "LLMNpc.cfg"));
    Volatile.Write(ref _current, next); // atomic swap; readers never tear
    Console.WriteLine("LLMNpc: config reloaded.");
}
Config is read on the hot path constantly, so it’s a single immutable snapshot swapped atomically rather than a mutable bag of fields. The knobs that matter most:
Enabled=false # master switch
BaseUrl=http://127.0.0.1:11434/v1 # any OpenAI-compatible endpoint
Model=qwen3-coder:30b
CooldownMs=2500 # min ms between calls per NPC
MaxMemoryTurns=6 # conversation turns replayed per pair
MaxReplyChars=240 # hard cap on spoken length
RagEnabled=false # optional Qdrant grounding
ChatterEnabled=false # NPC-to-NPC ambient conversation
AnomalyEnabled=false # rare 4th-wall events
The default endpoint is a local Ollama on 127.0.0.1, so a fresh clone targets a free, local, no-auth model out of the box. Point BaseUrl at a gateway and drop a key in ApiKey if you’d rather route through something like LiteLLM. The game code never knows the difference — it speaks the OpenAI chat API and nothing more.
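To make the "single immutable snapshot" idea concrete, here is a minimal sketch of how a flat key=value file could be parsed into one. It is not the repo’s LLMConfig.cs: the type name LlmConfigSnapshot, the Read* helpers, and the fallback values are assumptions for illustration, and it treats every `#` as the start of a comment, which would truncate a value that legitimately contains one.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical immutable snapshot; property names mirror the keys listed above.
public sealed class LlmConfigSnapshot
{
    public bool Enabled { get; }
    public string BaseUrl { get; }
    public int CooldownMs { get; }
    public int MaxMemoryTurns { get; }
    public int MaxReplyChars { get; }

    private LlmConfigSnapshot(Dictionary<string, string> kv)
    {
        // Assumed fallbacks for missing or unparseable keys.
        Enabled = ReadBool(kv, "Enabled", false);
        BaseUrl = ReadString(kv, "BaseUrl", "http://127.0.0.1:11434/v1");
        CooldownMs = ReadInt(kv, "CooldownMs", 2500);
        MaxMemoryTurns = ReadInt(kv, "MaxMemoryTurns", 6);
        MaxReplyChars = ReadInt(kv, "MaxReplyChars", 240);
    }

    public static LlmConfigSnapshot Parse(IEnumerable<string> lines)
    {
        var kv = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string raw in lines)
        {
            string line = raw ?? string.Empty;
            int hash = line.IndexOf('#');      // assumption: '#' always starts a comment
            if (hash >= 0) line = line.Substring(0, hash);
            int eq = line.IndexOf('=');
            if (eq <= 0) continue;             // blank, comment-only, or malformed line
            kv[line.Substring(0, eq).Trim()] = line.Substring(eq + 1).Trim();
        }
        return new LlmConfigSnapshot(kv);
    }

    private static string ReadString(Dictionary<string, string> kv, string key, string fallback)
    {
        string v;
        return kv.TryGetValue(key, out v) && v.Length > 0 ? v : fallback;
    }

    private static bool ReadBool(Dictionary<string, string> kv, string key, bool fallback)
    {
        string v;
        bool parsed;
        return kv.TryGetValue(key, out v) && bool.TryParse(v, out parsed) ? parsed : fallback;
    }

    private static int ReadInt(Dictionary<string, string> kv, string key, int fallback)
    {
        string v;
        int parsed;
        return kv.TryGetValue(key, out v)
            && int.TryParse(v, NumberStyles.Integer, CultureInfo.InvariantCulture, out parsed)
            ? parsed : fallback;
    }
}
```

Because every property is get-only, a reader that grabbed the old snapshot keeps a consistent view even while a reload swaps in a new one.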
One rule the example config repeats in three places: never commit your real Config/LLMNpc.cfg. It can hold an endpoint and an API key. The repo gitignores it and ships only LLMNpc.cfg.example. The whole point of publishing was to not publish secrets.
The model client and the fail-open contract
Here’s the constraint that shaped LLMClient.cs: ServUO is single-threaded for game logic. There is one game thread, and touching a Mobile (an NPC, a player) off that thread is how you corrupt a world. But an HTTP call to a model that might cold-load for four seconds absolutely cannot block that thread, or the entire shard freezes for every player at once.
So the client does the network call on a thread-pool task, then marshals the result back onto the game thread before it ever touches a game object:
public static void ChatAsync(string system, string user, Action<string> onReply)
{
    Task.Run(async () =>
    {
        string reply = null;
        try { reply = await PostChat(system, user); } // off-thread HTTP
        catch { /* swallow: fail-open */ }
        // Back onto the game thread before touching any Mobile.
        Timer.DelayCall(TimeSpan.Zero, () =>
        {
            if (!string.IsNullOrEmpty(reply))
                onReply(Sanitize(reply)); // NPC speaks here
        });
    });
}
Timer.DelayCall is ServUO’s idiom for “run this on the game thread next tick.” That one line is the whole threading model: async I/O on the pool, mutation on the game thread, never the reverse.
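For readers who don’t have ServUO in front of them, the pattern can be sketched with nothing but the base class library. This is a stand-in for the idea, not how ServUO implements Timer.DelayCall: GameThreadQueue, Post, and Drain are hypothetical names, and the "game loop" is simply whichever thread calls Drain.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for "run this on the game thread next tick".
public static class GameThreadQueue
{
    private static readonly ConcurrentQueue<Action> Pending = new ConcurrentQueue<Action>();

    // Safe to call from any thread, including thread-pool continuations.
    public static void Post(Action work)
    {
        if (work != null) Pending.Enqueue(work);
    }

    // Called only by the game loop, once per tick. Returns how many callbacks ran.
    public static int Drain()
    {
        int ran = 0;
        Action work;
        while (Pending.TryDequeue(out work))
        {
            try { work(); }
            catch (Exception) { /* one bad callback must not take down the tick */ }
            ran++;
        }
        return ran;
    }
}
```

Worker threads only ever enqueue; the callbacks themselves execute on the thread that drains, which is the property that keeps world mutation single-threaded.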
And notice the catch that swallows. That’s the fail-open contract, and it’s everywhere in this codebase. Every model call, every Qdrant call, has a timeout and a path where failure means nothing happens rather than an error. A dead endpoint doesn’t throw a player a stack trace — the NPC just behaves like an ordinary ServUO NPC that says one of its canned lines. The worst case of the entire LLM layer being down is that the game reverts to exactly how it shipped in 1997.
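The "timeout plus a path where failure means nothing happens" shape can be sketched as one small wrapper. This is an illustration under assumptions, not the repo’s code: FailOpen and TryAsync are hypothetical names, and the deadline only works if the wrapped call honors the CancellationToken it is handed (HttpClient’s async methods do).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FailOpen
{
    // Runs `call` under a deadline. Returns its result, or null on timeout,
    // cancellation, or any exception. Callers treat null as "say nothing new".
    public static async Task<string> TryAsync(
        Func<CancellationToken, Task<string>> call, TimeSpan timeout)
    {
        var cts = new CancellationTokenSource(timeout); // cancels itself at the deadline
        try
        {
            return await call(cts.Token).ConfigureAwait(false);
        }
        catch (Exception)
        {
            return null; // fail-open: no error surfaces to the caller
        }
        finally
        {
            cts.Dispose();
        }
    }
}
```

A caller then only has one branch to think about: a non-null string is a reply worth speaking, and null means fall back to the stock NPC behavior.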
The allowlist that makes a model safe in a stateful world
The moment you let a language model influence behavior in a world with durable state, you have to answer one question: what can it actually do? Generating text is harmless. But if any token the model emits can be interpreted as an in-world action, then a hallucination is now an event.
The answer is a closed allowlist of cosmetic-only actions, in NpcActions.cs, checked deterministically before any model output is trusted:
// Cosmetic-only. The model may request one of these by name; anything
// else it emits is never interpreted as an action.
private static readonly HashSet<string> Allowed = new()
{
    "bow", "nod", "shake_head", "laugh", "point",
    "wave", "shrug", "salute", "clap", "yawn",
};
public static bool TryResolve(string token, out int animationId)
{
    animationId = 0;
    if (token == null)
        return false;
    // Normalize once and use the same key for the lookup and the mapping.
    string key = token.Trim().ToLowerInvariant();
    if (!Allowed.Contains(key))
        return false; // not on the menu → dropped
    animationId = AnimationFor(key);
    return true;
}
The model picks from the menu; it cannot invent a verb. It can’t move an NPC, open a door, drop an item, or start combat by describing it. If it hallucinates “and then the lich razes the village,” the verb raze isn’t in the set, so the only thing that happens is text. This is the same principle I lean on for agentic CI in my day job: deterministic guardrails come before LLM judgment. Regex and allowlists are certain; the model’s good behavior is merely probable. You put the cheap, certain check in front of the expensive, fallible one. A non-deterministic system whose worst case is a weird animation is a system you can ship.
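One piece the snippet above leaves out is how an action token gets separated from the spoken text in the first place. The sketch below assumes the model is prompted to mark actions as a bracketed tag such as `[wave]`; that wire format, and the names ActionParser and Split, are my assumptions for illustration and may not match what the repo does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ActionParser
{
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.Ordinal)
        {
            "bow", "nod", "shake_head", "laugh", "point",
            "wave", "shrug", "salute", "clap", "yawn",
        };

    // Assumed tag shape: letters and underscores inside square brackets.
    private static readonly Regex Tag = new Regex(@"\[([A-Za-z_]+)\]", RegexOptions.Compiled);

    // Returns the spoken text with every tag stripped. `action` is the first
    // allowlisted tag found, or null if none qualified.
    public static string Split(string reply, out string action)
    {
        action = null;
        if (string.IsNullOrEmpty(reply)) return string.Empty;

        string found = null;
        string spoken = Tag.Replace(reply, m =>
        {
            string name = m.Groups[1].Value.ToLowerInvariant();
            if (found == null && Allowed.Contains(name)) found = name;
            return string.Empty; // tags never reach the speech bubble, allowed or not
        });

        action = found;
        return Regex.Replace(spoken, @"\s+", " ").Trim();
    }
}
```

The design choice worth copying is that an unrecognized tag is removed rather than passed through: a hallucinated verb neither triggers anything nor leaks into the text as stray brackets.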
Keeping the LLM out of the simulation loop
The biggest design rule in the whole project: the model generates words, never game state. NPCs run their own errands and cross-map journeys, but the LLM is never consulted to advance one. ErrandDirector.cs drives all of it from a single deterministic heartbeat that scans player-centrically:
private static void Heartbeat()
{
    foreach (NetState ns in NetState.Instances) // connected players only
    {
        Mobile player = ns.Mobile;
        if (player == null) continue;
        foreach (Mobile m in player.GetMobilesInRange(SimRange))
        {
            if (m is ILLMNpc npc && npc.HasErrand)
                npc.AdvanceErrand(); // pure state machine
        }
    }
}
Two things fall out of this. First, off-screen NPCs cost nothing — the heartbeat only ever looks at NPCs near a connected player, so an empty shard is idle and a model with nobody watching is never called. Second, errand progress is a plain state machine (Errand.cs defines the phases; ErrandPolicy.cs caps how far each NPC type may roam, so the banker stays at the counter and players can still bank). The model’s only involvement is optionally rewriting an errand’s purpose text into a richer one-liner — an async, fail-open flourish whose deterministic version always stands as the fallback. The simulation never waits on inference.
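A "plain state machine" with a roam cap and an optional text flourish might look roughly like this. The phase names, the tile-counting model of distance, and the ErrandSketch type are all hypothetical; the real phases are in Errand.cs and the real limits in ErrandPolicy.cs.

```csharp
using System;

// Hypothetical phase names for illustration only.
public enum ErrandPhase { Idle, Outbound, AtDestination, Returning, Done }

public sealed class ErrandSketch
{
    private readonly int _targetTiles;  // requested distance, clamped by the roam limit
    private int _tilesFromHome;

    public ErrandPhase Phase { get; private set; }
    public int FarthestTiles { get; private set; }
    public string Purpose { get; private set; }

    public ErrandSketch(string deterministicPurpose, int requestedTiles, int roamLimit)
    {
        Purpose = deterministicPurpose;
        _targetTiles = Math.Max(0, Math.Min(requestedTiles, roamLimit));
        Phase = ErrandPhase.Idle;
    }

    // One heartbeat step: no I/O, no model call, nothing to wait on.
    public void Advance()
    {
        switch (Phase)
        {
            case ErrandPhase.Idle:
                Phase = ErrandPhase.Outbound;
                break;
            case ErrandPhase.Outbound:
                if (_tilesFromHome >= _targetTiles) { Phase = ErrandPhase.AtDestination; break; }
                _tilesFromHome++;
                FarthestTiles = Math.Max(FarthestTiles, _tilesFromHome);
                break;
            case ErrandPhase.AtDestination:
                Phase = ErrandPhase.Returning;
                break;
            case ErrandPhase.Returning:
                if (_tilesFromHome <= 0) { Phase = ErrandPhase.Done; break; }
                _tilesFromHome--;
                break;
            case ErrandPhase.Done:
                break;
        }
    }

    // Optional flourish: model text replaces the purpose only if it is usable.
    // A null or blank result (timeout, error) leaves the deterministic text alone.
    public void OfferRicherPurpose(string modelText)
    {
        if (!string.IsNullOrWhiteSpace(modelText))
            Purpose = modelText.Trim();
    }
}
```

With a roam limit of 0 the errand never leaves home, which is the banker-stays-at-the-counter case; and Advance never reads Purpose, so the model’s text cannot influence where the NPC goes.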
The same heartbeat opportunistically powers NPC-to-NPC chatter (NpcChatter.cs) and the rare anomaly events (AnomalyDirector.cs) — they ride the existing scan rather than spinning their own timers, gated behind layered cooldowns and low odds so a town square stays a murmur instead of an inference stampede.
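"Layered cooldowns and low odds" can be sketched as three cheap checks in a row, all of which must pass before anything expensive happens. ChatterGate and its parameters are hypothetical names, and the specific layers (one global cooldown, one per-NPC cooldown, one probability roll) are my assumption about what the layering could be.

```csharp
using System;
using System.Collections.Generic;

public sealed class ChatterGate
{
    private readonly TimeSpan _globalCooldown;
    private readonly TimeSpan _perNpcCooldown;
    private readonly double _odds;   // 0.0 = never, 1.0 = always (once cooldowns allow)
    private readonly Random _rng;
    private readonly Dictionary<int, DateTime> _lastByNpc = new Dictionary<int, DateTime>();
    private DateTime _lastGlobal = DateTime.MinValue;

    public ChatterGate(TimeSpan globalCooldown, TimeSpan perNpcCooldown, double odds, Random rng)
    {
        _globalCooldown = globalCooldown;
        _perNpcCooldown = perNpcCooldown;
        _odds = odds;
        _rng = rng;
    }

    // Cheapest checks first; the caller only reaches the model when this returns true.
    public bool ShouldSpeak(int npcSerial, DateTime now)
    {
        if (now - _lastGlobal < _globalCooldown) return false;

        DateTime last;
        if (_lastByNpc.TryGetValue(npcSerial, out last) && now - last < _perNpcCooldown)
            return false;

        if (_rng.NextDouble() >= _odds) return false;

        _lastGlobal = now;
        _lastByNpc[npcSerial] = now;
        return true;
    }
}
```

Passing the clock in as an argument, rather than reading DateTime.UtcNow inside the gate, keeps it deterministic under test.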
Grounding replies with Qdrant
LLMRag.cs is the one optional, off-by-default subsystem, and it does three related jobs against a Qdrant vector database, all sharing one embedding endpoint (Ollama’s native /api/embeddings):
- Lore grounding (uo_lore): a vector search over a Britannia lore collection so an NPC’s claims about the world stay roughly canon.
- Voice-style exemplars (bg3_style): archetype-matched example lines (cadence and wit, setting stripped) that season a reply’s delivery without dictating its content.
- A per-NPC deed journal (npc_journal): each completed errand is embedded and stored per NPC, and a few of its own past deeds are recalled on the chat path, so a merchant remembers the trip it actually took.
Every one of these is fail-open in the same way as the chat path: any embedding or search error just drops the grounding for that one reply and the NPC speaks anyway. RAG makes the NPCs better; it is never load-bearing for them speaking at all. The blast radius of Qdrant being down is “replies are a little less grounded for a few seconds,” not “the town goes silent.”
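The "drop the grounding for that one reply" behavior can be shown as a loop that treats each source independently. This is a simplified sketch: GroundingSource and GroundingSketch are hypothetical names, the lookups are synchronous delegates standing in for what would really be embedding and search calls over HTTP, and the output layout is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical: a named lookup that may throw or come back empty.
public sealed class GroundingSource
{
    public string Name { get; }
    public Func<string, IList<string>> Lookup { get; }

    public GroundingSource(string name, Func<string, IList<string>> lookup)
    {
        Name = name;
        Lookup = lookup;
    }
}

public static class GroundingSketch
{
    // Builds the extra prompt context. A failing source costs only its own section.
    public static string BuildContext(string query, IEnumerable<GroundingSource> sources)
    {
        var sb = new StringBuilder();
        foreach (GroundingSource source in sources)
        {
            IList<string> hits;
            try { hits = source.Lookup(query); }
            catch (Exception) { continue; }   // fail-open: skip this source for this reply
            if (hits == null || hits.Count == 0) continue;

            sb.Append(source.Name).Append(":\n");
            foreach (string hit in hits)
                sb.Append("- ").Append(hit).Append('\n');
        }
        return sb.ToString(); // empty string means "speak ungrounded"
    }
}
```

An empty result is a normal outcome rather than an error, so the chat path has nothing extra to handle when Qdrant is unreachable.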
Identity and memory that survive a reboot
Two kinds of state make an NPC feel like a person instead of a stateless responder, and both are persisted with the world save rather than held in RAM.
NpcIdentity.cs rolls a persona once from vocation-keyed pools — town, origin, temperament, backstory seed, speech style, private drive, drifting mood — and serializes it onto the mobile. The same surly ex-soldier blacksmith from Minoc is the same character next week, after a reboot. The model isn’t inventing a character per call; it’s handed a fixed character sheet and asked to act.
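The roll-once-then-reuse behavior is easy to sketch. The pools, trait values, fallback vocation, and the PersonaSketch and PersonaRoller names below are invented for illustration; only the idea of vocation-keyed pools and a persisted result comes from the project, and real persistence would go through the world save rather than a returned object.

```csharp
using System;
using System.Collections.Generic;

public sealed class PersonaSketch
{
    public string Vocation { get; set; }
    public string Temperament { get; set; }
    public string SpeechStyle { get; set; }
}

public static class PersonaRoller
{
    // Hypothetical pools keyed by vocation; "commoner" is an assumed fallback.
    private static readonly Dictionary<string, string[]> Temperaments =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "blacksmith", new[] { "surly", "patient", "boastful" } },
            { "banker", new[] { "fussy", "discreet" } },
            { "commoner", new[] { "cheerful", "wary" } },
        };

    private static readonly string[] SpeechStyles = { "clipped", "florid", "folksy" };

    // Returns the persisted persona untouched if one exists; otherwise rolls a new one.
    public static PersonaSketch GetOrRoll(PersonaSketch persisted, string vocation, Random rng)
    {
        if (persisted != null) return persisted;   // already rolled: never re-roll

        string[] pool;
        if (vocation == null || !Temperaments.TryGetValue(vocation, out pool))
            pool = Temperaments["commoner"];

        return new PersonaSketch
        {
            Vocation = vocation ?? "commoner",
            Temperament = pool[rng.Next(pool.Length)],
            SpeechStyle = SpeechStyles[rng.Next(SpeechStyles.Length)],
        };
    }
}
```

The early return is the whole point: randomness is spent exactly once per NPC, and every later prompt is built from the stored sheet.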
For the existing vanilla NPCs that aren’t custom subclasses (the stock vendors, bankers, guards that LLMAmbientSpeech.cs voices via a single global speech listener), there’s no subclass to serialize onto — so LLMAmbientMemory.cs keeps a disk-backed store keyed by the mobile’s stable serial, carrying both the rolled identity and per-player relationships across reboots. LLMConversation.cs holds the short-term per-(player, NPC) turn memory, bounded by MaxMemoryTurns so a chatty regular never slowly inflates every prompt until inference crawls.
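The bounded short-term memory is essentially a ring of recent turns per (player, NPC) pair. A minimal sketch, assuming a "turn" is one stored string and that mobiles are identified by integer serials; ConversationSketch, Remember, and Recall are hypothetical names, not the API of LLMConversation.cs.

```csharp
using System;
using System.Collections.Generic;

public sealed class ConversationSketch
{
    private readonly int _maxTurns;
    private readonly Dictionary<Tuple<int, int>, Queue<string>> _turns =
        new Dictionary<Tuple<int, int>, Queue<string>>();

    public ConversationSketch(int maxTurns)
    {
        _maxTurns = Math.Max(1, maxTurns);
    }

    // Appends a turn and evicts the oldest ones beyond the cap.
    public void Remember(int playerSerial, int npcSerial, string turn)
    {
        Tuple<int, int> key = Tuple.Create(playerSerial, npcSerial);
        Queue<string> queue;
        if (!_turns.TryGetValue(key, out queue))
        {
            queue = new Queue<string>();
            _turns[key] = queue;
        }
        queue.Enqueue(turn);
        while (queue.Count > _maxTurns) queue.Dequeue();
    }

    // Oldest-first, ready to replay into a prompt. Empty for strangers.
    public string[] Recall(int playerSerial, int npcSerial)
    {
        Queue<string> queue;
        return _turns.TryGetValue(Tuple.Create(playerSerial, npcSerial), out queue)
            ? queue.ToArray()
            : new string[0];
    }
}
```

Because eviction happens on write, prompt size is bounded by MaxMemoryTurns no matter how long a player keeps talking.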
Testing a thing that only breaks at runtime
The lesson I learned the embarrassing way: a green build proves the code compiles, not that the NPC talks. The failures that matter all live at runtime, in-world — the endpoint is reachable but returns garbage, a persona prompt typo makes every NPC mute, the allowlist filter is too aggressive and eats every emote. None of those are visible to a compiler.
So testing is a headless harness that drives the world with no graphical client: boot the shard, connect as a synthetic player, walk up to an NPC, say something, and assert that a sane line comes back and any emote is on the allowlist. That catches the entire “compiles fine, says nothing” class of bug that no unit test would, because the bug only exists once the world is running and the model is actually in the loop. It also runs on the same house rules as every other workload I build — amd64-only images, --provenance=false on every buildx (k3s’s containerd chokes on the attestation manifests buildx adds by default), and every pull through a Harbor proxy cache.
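The assertion at the end of that loop can be small. Here is a sketch of what "a sane line comes back and any emote is on the allowlist" might reduce to; ReplyCheck and Failure are hypothetical names, and the harness that actually boots the shard and drives a synthetic player is out of scope here.

```csharp
using System;
using System.Collections.Generic;

public static class ReplyCheck
{
    // Returns null when the observed reply is acceptable, else a short reason.
    public static string Failure(
        string spoken, string emote, int maxReplyChars, ICollection<string> allowedEmotes)
    {
        if (string.IsNullOrWhiteSpace(spoken))
            return "NPC said nothing";
        if (spoken.Length > maxReplyChars)
            return "reply exceeds MaxReplyChars";
        if (emote != null && !allowedEmotes.Contains(emote))
            return "emote not on allowlist: " + emote;
        return null;
    }
}
```

Returning a reason instead of a bool makes the harness log say which of the three runtime failure modes was hit.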
Get the code
The integration is public, MIT-licensed, and scrubbed of anything environment-specific:
github.com/ZoltyMat/uo-llm-npc
It’s scripts, not a server — you bring your own ServUO shard, drop the folder into Scripts/Custom/, copy LLMNpc.cfg.example to Config/LLMNpc.cfg, point BaseUrl at a model, and set Enabled=true. Inference is whatever OpenAI-compatible endpoint you give it; a local Ollama on a Mac Studio is what I run, and the marginal cost of an NPC saying something snarky is electricity and nothing else.
If I were to change one thing, it’d be the cold-load latency — the first line after a quiet stretch hangs while the model reloads into VRAM. Pinning the model would fix it and starve everything else the box does, so instead the NPC just scratches its head and answers a beat late. For a hobby world I’m going to leave running indefinitely, that’s a tradeoff I’ll take. The fuller story of why I built this — and what it’s like when the liches start addressing each other like god-kings — is in the companion post.