When the peasant talks back: LLM NPCs in Ultima Online

TL;DR

I run an Ultima Online shard on my homelab where the NPCs are driven by a local LLM instead of canned dialog trees. Each NPC rolls a persisted identity, remembers conversations with individual players across reboots, runs its own errands and cross-map journeys, and — the part I’m writing about today — strikes up ambient chatter with nearby NPCs on its own. The newest work extends all of that from townsfolk to language-speaking monsters: ogres, lizardmen, ratmen, gargoyles, daemons, and especially liches, who address each other like god-kings deigning to notice an insect. Inference is a local gemma-class model behind an in-cluster gateway, so it’s free and private, with the one tradeoff being cold-load latency. It’s single-shard hobby-scale and it absolutely shows the seams. I love it.

What if the peasant could actually talk back?

If you’ve played any Ultima Online server, you know the bit: you double-click a townsperson, and they say one of four hardcoded lines. “Greetings, traveler.” “I have nothing for you.” The illusion that this is a living world survives for about eight seconds, right up until the second NPC says the exact same thing as the first.

I’d been running a private ServUO shard (ServUO is the modern descendant of RunUO, the C# server emulator that grew out of the community’s reverse-engineering of the UO protocol), mostly as a sandbox for messing with the world simulation. And the thing that kept nagging me wasn’t the combat or the crafting. It was that the people were furniture. A blacksmith with a name, a vendor inventory, and the inner life of a vending machine.

So the question that started this whole project: what if the peasant could actually talk back? Not “pick from a menu” talk back. Talk back like a person who has a name, a town, a bad opinion about the local tax collector, and a memory of the last time you bothered them.

The answer turned out to be a lot more interesting than “wire up a chatbot,” because the chat is the least interesting part.

This is not a web app

I want to get the architecture weirdness out of the way first, because it shaped everything.

UO is not HTTP. The client and server speak a binary UO protocol over a raw TCP socket — the same wire format that’s existed, with extensions, since 1997. There’s no request/response framing you’d recognize, no REST, no nice place to bolt middleware. The “server” is a long-lived stateful process holding the entire game world in memory and periodically flushing it to disk as a world save.

That has consequences when you put it on Kubernetes:

            Internet / LAN
                  │
          raw UO TCP (port 2593)
                  │
        ┌─────────▼─────────┐
        │  MetalLB LB IP    │   ← not an Ingress; L4, not L7
        └─────────┬─────────┘
                  │
        ┌─────────▼─────────┐
        │  shard pod (1/1)  │   Recreate strategy
        │  ServUO + world   │   single-writer world save
        └─────────┬─────────┘
                  │ in-cluster HTTP
        ┌─────────▼─────────┐
        │  LiteLLM gateway  │ → Ollama (Mac Studio, gemma-class)
        └───────────────────┘

A few things fall out of this that are unusual for a homelab service:

Game traffic is L4, not L7. It comes in over a MetalLB LoadBalancer IP on the UO port, not through Traefik. There’s no hostname routing, no TLS termination at the edge, no Ingress object. For someone whose homelab is otherwise all HTTP services behind an ingress controller, it was genuinely refreshing to remember that L4 load balancing is a thing that exists.
It’s a single writer. The world save is one process writing one set of files. You cannot scale this horizontally. There is exactly one replica and there will only ever be one replica.
A rollout is a world restart. Because of the above, the Deployment uses the Recreate strategy — kill the old pod fully, then start the new one. A RollingUpdate would briefly run two shards fighting over the same world state, which is how you corrupt a save. So every deploy is a brief, visible “the world blinked” for anyone logged in. You learn to deploy when nobody’s online.
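
As a rough illustration of the three points above, the Kubernetes objects might look something like this. It is a sketch of the shape only: the `uo-shard` name, the image reference, the mount path, and the volume claim are all placeholders, not the real manifests.

```yaml
# Placeholder names, image, and volume claim; only the shape matters.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uo-shard
spec:
  replicas: 1                 # single writer: never more than one
  strategy:
    type: Recreate            # stop the old pod fully before starting the new one
  selector:
    matchLabels:
      app: uo-shard
  template:
    metadata:
      labels:
        app: uo-shard
    spec:
      containers:
        - name: shard
          image: harbor.example.lan/games/uo-shard:latest   # placeholder
          ports:
            - containerPort: 2593
              protocol: TCP
          volumeMounts:
            - name: world
              mountPath: /data/saves                        # placeholder
      volumes:
        - name: world
          persistentVolumeClaim:
            claimName: uo-world                             # placeholder
---
apiVersion: v1
kind: Service
metadata:
  name: uo-shard
spec:
  type: LoadBalancer          # MetalLB assigns the IP; no Ingress involved
  selector:
    app: uo-shard
  ports:
    - name: uo
      port: 2593
      targetPort: 2593
      protocol: TCP
```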

None of this is how I’d build a web service, and that’s exactly why it was fun.

Inference: local, free, and occasionally asleep

The dialog itself is generated by a local model. There is no per-token bill and nothing about a player’s conversation leaves my network.

The shard talks HTTP to a LiteLLM gateway running in-cluster, which proxies to an Ollama instance on a Mac Studio doing the actual inference on a gemma-class model. The gateway indirection matters more than it looks: the game code targets a stable OpenAI-compatible endpoint and doesn’t know or care which model is behind it, so I can swap the backing model without touching a line of C#.
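
To make the "stable OpenAI-compatible endpoint" idea concrete, here is a minimal sketch of the two pure pieces on the game side: building a chat-completions request body and pulling the reply text out of the response. The `ChatRequest` class and the `npc-dialog` model alias used in the note below are invented for illustration, and the sketch assumes a runtime where `System.Text.Json` is available.

```csharp
using System.Text.Json;

// Hypothetical helper, not the shard's real code.
public static class ChatRequest
{
    // Builds the JSON body for an OpenAI-compatible chat-completions call.
    // The game only ever names an alias; the gateway decides what backs it.
    public static string Build(string modelAlias, string personaPrompt, string playerLine)
    {
        var body = new
        {
            model = modelAlias,
            messages = new[]
            {
                new { role = "system", content = personaPrompt },
                new { role = "user", content = playerLine }
            }
        };
        return JsonSerializer.Serialize(body);
    }

    // Reads choices[0].message.content from a chat-completions response.
    public static string ExtractReply(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        return doc.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString() ?? "";
    }
}
```

Sending the body is then an ordinary `HttpClient.PostAsync` to the gateway's `/v1/chat/completions` path with `application/json` content. Swapping the backing model means changing what an alias like `npc-dialog` points to in the gateway config, not this code.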

The tradeoff is cold-load latency. The Mac Studio is shared infrastructure: it does image generation, other local inference, the works. Ollama unloads idle models (after five minutes by default) to free up the Mac’s unified memory for those other jobs. So the first NPC line after a quiet stretch can hang for several seconds while the model loads back into memory. Subsequent lines are snappy. In a chat app you’d paper over this with a typing indicator. In a medieval fantasy world there’s no clean UI affordance for “the blacksmith is buffering,” so I leaned into it: NPCs occasionally pause, emote a *scratches his head*, and then answer. The cold-load became a personality tic. Not my proudest engineering, but it ships.
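
One way the pause-then-answer behavior could be wired up is to race the pending model call against a short timer and play the filler emote only if the timer wins. This is a sketch under that assumption; the `SlowReply` class, the `patience` parameter, and the timer-based trigger are illustrative choices, not a description of the shard's actual code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not the shard's real code.
public static class SlowReply
{
    // Waits for the model's line. If it has not arrived within `patience`,
    // fires a filler emote first, then keeps waiting for the real answer.
    public static async Task<string> WithFillerAsync(
        Task<string> pendingLine, TimeSpan patience, Action<string> emote)
    {
        var winner = await Task.WhenAny(pendingLine, Task.Delay(patience));
        if (winner != pendingLine)
        {
            emote("*scratches his head*");
        }
        return await pendingLine;
    }
}
```

A warm model answers before the timer fires and no emote plays. A cold one gets exactly one head-scratch and then the real line.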

I keep saying “free,” and I want to be precise about what that means: free of marginal cost. The Mac Studio was a capital expense that earns its keep across a dozen workloads. But once it’s sitting there, an NPC saying something snarky costs me electricity and nothing else, and the conversation never touches a third party. For a hobby world I’m going to leave running indefinitely, that economic shape is the whole game. A per-token API bill on idle-chatter NPCs would be a slow-motion budgeting disaster.

The interesting layer is the life, not the chat

Here’s the thing I underestimated. Getting an NPC to respond to a player is a weekend. Getting an NPC to feel like a person is the actual project, and almost none of it is the language model.

Persona: rolled once, persisted forever

When an NPC is created, it rolls an identity from vocation-keyed pools — a blacksmith draws from different tables than a healer or a brigand. The roll fixes:

town and origin — where they live, where they’re from (not always the same)
personality — a handful of trait axes
backstory — a seed the model expands on consistently
speech style — terse, florid, drunk, pious, whatever
motivation — what they want
mood — which drifts over time

Critically, this is rolled once and persisted. It’s not regenerated per conversation. The same blacksmith is the same surly ex-soldier from Minoc every time you talk to him, next week, after a server reboot. The LLM isn’t inventing a character on each call — it’s being handed a fixed character sheet and asked to act. That distinction is the difference between “a world” and “a slot machine that outputs medieval Mad Libs.”
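
A stripped-down sketch of what "rolled once from vocation-keyed pools" could look like follows. The `Persona` record, its fields, and the tiny pools are invented for illustration; real tables would be much larger, hand-authored, and would cover the remaining attributes (personality, backstory, mood).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical character sheet; field names are assumptions.
public sealed record Persona(
    string Town, string Origin, string SpeechStyle, string Motivation);

public static class PersonaRoller
{
    // Vocation-keyed pool: a blacksmith draws from a different table than a healer.
    private static readonly Dictionary<string, string[]> SpeechStyles = new()
    {
        ["blacksmith"] = new[] { "terse", "gruff" },
        ["healer"] = new[] { "pious", "gentle" },
    };

    private static readonly string[] Towns = { "Britain", "Minoc", "Trinsic" };
    private static readonly string[] Motivations = { "pay off a debt", "find an apprentice" };

    // Called once at NPC creation. The result would be saved with the world
    // state and handed to the model unchanged on every later call.
    public static Persona Roll(string vocation, Random rng)
    {
        var styles = SpeechStyles[vocation];
        return new Persona(
            Town: Pick(Towns, rng),
            Origin: Pick(Towns, rng),
            SpeechStyle: Pick(styles, rng),
            Motivation: Pick(Motivations, rng));
    }

    private static string Pick(string[] pool, Random rng) => pool[rng.Next(pool.Length)];
}
```

The important property is that `Roll` runs at creation time only. Everything after that reads the saved sheet, so the model is acting a part rather than inventing one.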

Memory: across reboots, per player

Each NPC remembers its prior conversations with individual players, and those memories survive restarts. The fifth time you shake down the same fence for stolen goods, he remembers the previous four. This is stored alongside the world state, so it’s durable the same way your character’s backpack is durable.

This is also where the cost discipline lives: I don’t stuff an NPC’s entire history into every prompt. There’s a summarization step so the context stays bounded — recent exchanges verbatim, older ones compressed to a gist. Otherwise a chatty regular would slowly inflate every prompt until inference crawled.
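
The bounded-context idea can be sketched in a few lines: keep the last few exchanges verbatim and fold everything older into a single gist. The `MemoryWindow` class, the `keepVerbatim` parameter, and the `"Earlier: "` prefix are assumptions for illustration; in practice `summarize` would itself be a model call whose result is cached.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the shard's real code.
public static class MemoryWindow
{
    // Keeps the last `keepVerbatim` exchanges as-is and compresses everything
    // older into one gist line produced by `summarize`.
    public static List<string> BuildContext(
        IReadOnlyList<string> history,
        int keepVerbatim,
        Func<IReadOnlyList<string>, string> summarize)
    {
        var context = new List<string>();
        int cut = Math.Max(0, history.Count - keepVerbatim);
        if (cut > 0)
        {
            context.Add("Earlier: " + summarize(history.Take(cut).ToList()));
        }
        context.AddRange(history.Skip(cut));
        return context;
    }
}
```

However long a regular's history gets, the prompt carries at most `keepVerbatim` exchanges plus one summary line.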

Autonomous life: errands and journeys

NPCs aren’t pinned to a spawn point waiting to be clicked. They run town errands — wander to the bank, visit a shop, loiter at the tavern — and occasionally undertake cross-map journeys to another town entirely. If you met a merchant in Britain on Tuesday, you might genuinely run into him on the road to Trinsic on Thursday because he decided to make the trip. The world keeps moving whether or not a player is watching, which is the entire point.

NPC-to-NPC ambient chatter — the new part

The newest townsfolk feature: two nearby NPCs will, on their own, strike up a short exchange. You round a corner and the baker and the guard are already mid-conversation about the price of grain, and neither of them is talking to you. You’re eavesdropping.

This sounds small and it changes the feel of a town enormously. A space where the inhabitants only ever address the player is a stage set. A space where they talk to each other — and you’re just one more body in the square — reads as alive. The exchanges are short by design (a few lines, then they disperse) both for atmosphere and, frankly, to keep a town square from turning into an inference stampede.

Now the monsters talk

Which brings me to the work I actually sat down to write about: extending all of this from townsfolk to language-speaking monsters.

In UO lore, plenty of creatures can speak — ogres, lizardmen, ratmen, gargoyles, daemons, liches. They’ve always technically spoken, in the same canned-line sense as the townsfolk. Now they get the full persona-and-banter treatment, with a few deliberate design constraints I’m happy with:

Monsters only banter with their own kind. A lich pronounces doom at another lich. It does not strike up grain-price chatter with a blacksmith, and it does not banter with an ogre. This is partly lore sanity — a daemon and a ratman have nothing to say to each other — and partly a hard scoping rule that keeps the combinatorial space of “who can talk to whom” small and authored rather than emergent and chaotic.

Each monster type gets an authored persona plus a scene hint. The lich’s is my favorite. The brief, in spirit, is a megalomaniacal god-king addressing an insect. So two liches in a crypt don’t chat — they hold a contest of grandiosity, each more contemptuous of the living than the last. Ratmen are scheming and paranoid. Ogres are dim and territorial. The persona does the heavy lifting; the model just has to stay in character.
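
Part of the appeal of the own-kind rule is how little code it needs. A minimal sketch follows, assuming a `Kind` enum and a tile-distance check as the stand-in for "nearby"; both are invented here, and the scene hints are paraphrased from the descriptions above.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical enum; the shard's real type hierarchy will differ.
public enum Kind { Townsfolk, Ogre, Lizardman, Ratman, Gargoyle, Daemon, Lich }

public static class Banter
{
    // Authored per-kind scene hints. Only three are filled in for this sketch.
    private static readonly Dictionary<Kind, string> SceneHints = new()
    {
        [Kind.Lich] = "A megalomaniacal god-king addressing an insect.",
        [Kind.Ratman] = "Scheming and paranoid.",
        [Kind.Ogre] = "Dim and territorial.",
    };

    // Own-kind only: liches with liches, townsfolk with townsfolk.
    // The tile range is an assumed stand-in for "nearby".
    public static bool CanBanter(Kind a, Kind b, int tilesApart, int maxTiles = 8)
        => a == b && tilesApart <= maxTiles;

    // Returns the authored hint, or null for kinds without one in this sketch.
    public static string? SceneHintFor(Kind kind)
        => SceneHints.TryGetValue(kind, out var hint) ? hint : null;
}
```

Because eligibility is a plain equality check, the set of possible conversations is one authored entry per kind rather than one per pair of kinds.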

A lich exchange reads roughly like:

Mal'akhar: You linger in this crypt, lesser shade. Do the worms
           still whisper your forgotten name?
Vyssth:    Forgotten? I am the silence between heartbeats, brother.
           The living are a rounding error I have not yet bothered
           to correct.

It’s ridiculous and I love it. It’s also doing real work — a dungeon where the inhabitants have opinions about each other is a dungeon that feels populated rather than spawned.

The guardrail that makes this safe to ship

Here’s where my homelab habits and my day-job instincts collided in a good way. The moment you let a language model drive behavior in a stateful world, you have to ask: what can it actually do? Generating text is one thing. If the model can emit anything that gets interpreted as an in-world action, a hallucinated line is now a hallucinated event.

The answer is a closed allowlist of safe cosmetic actions, checked deterministically before any LLM output is trusted. The model can ask an NPC to bow, nod, laugh, shake its head, point — a short, fixed set of purely visual emotes. That’s it. The allowlist is a hardcoded set in the game code, and anything the model produces that isn’t on it is dropped:

// Cosmetic-only. The LLM may request one of these; nothing else
// it emits is ever interpreted as an action.
private static readonly HashSet<string> AllowedActions = new()
{
    "bow", "nod", "shake_head", "laugh", "point",
    "wave", "shrug", "salute", "clap", "yawn",
};

The model never gets to move an NPC, open a door, drop an item, or initiate combat by talking about it. Even if it hallucinates “and then the lich destroys the village,” the destroy verb isn’t in the set, so nothing happens but text. This is the same principle I bang on about for agentic CI at work: deterministic guardrails come before LLM judgment. Regex and allowlists are not probabilistic; the model’s good behavior is. You put the cheap, certain check in front of the expensive, fallible one. A non-deterministic system that can only ever produce cosmetic emotes is a system whose worst case is a weird animation.
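
The check in front of that set can be sketched as a single function that either returns a known emote or returns nothing. The `ActionFilter` class and the trim-and-lowercase normalization are assumptions for illustration; the source only says that anything off the list is dropped.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical wrapper around the allowlist shown above.
public static class ActionFilter
{
    private static readonly HashSet<string> AllowedActions = new()
    {
        "bow", "nod", "shake_head", "laugh", "point",
        "wave", "shrug", "salute", "clap", "yawn",
    };

    // Returns the emote to play, or null if the request is missing or not
    // on the list. Normalizing case and whitespace is an assumption here.
    public static string? Sanitize(string? requested)
    {
        if (string.IsNullOrWhiteSpace(requested)) return null;
        var key = requested.Trim().ToLowerInvariant();
        return AllowedActions.Contains(key) ? key : null;
    }
}
```

There is no branch in which an unknown verb becomes an action, so `Sanitize("destroy")` has the same outcome as no request at all.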

A passing build is not proof the NPC speaks

The other lesson, and one I had to learn the embarrassing way: a green build tells you the code compiles, not that the NPC actually talks. The interesting failures all live at runtime, in-world — the model endpoint is reachable but returns garbage, the persona prompt has a typo that makes every NPC mute, the allowlist filter is too aggressive and eats every emote.

So there’s a headless test harness that drives the world without a graphical client: it spins up the shard, connects as a synthetic player, walks up to an NPC, says something, and asserts that a sane line comes back and that any emote is on the allowlist. That harness catches the “compiles fine, says nothing” class of bug that no unit test would, because the bug only exists once the world is actually running and the model is actually in the loop.
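
The assertion at the end of that flow might look roughly like the sketch below. It takes the "say something to the NPC" step as a delegate so the check can run against either the real headless client or a stub. The `HarnessChecks` name, the delegate shape, and the 400-character ceiling for a "sane" line are all assumptions.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical harness check, not the real harness.
public static class HarnessChecks
{
    // `sayToNpc` sends one line of player speech and returns what the NPC
    // said back plus any emote it played. Throws if the reply looks wrong.
    public static void AssertNpcSpeaks(
        Func<string, (string Line, string? Emote)> sayToNpc,
        ISet<string> allowedEmotes)
    {
        var (line, emote) = sayToNpc("Hail, friend.");

        if (string.IsNullOrWhiteSpace(line))
            throw new InvalidOperationException("NPC said nothing");

        if (line.Length > 400) // assumed ceiling for a sane spoken line
            throw new InvalidOperationException($"NPC rambled: {line.Length} chars");

        if (emote is not null && !allowedEmotes.Contains(emote))
            throw new InvalidOperationException($"disallowed emote '{emote}'");
    }
}
```

A mute NPC and an off-list emote both fail here, even though either bug compiles cleanly.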

The homelab build loop

Packaging follows the same rules as everything else I run on the cluster, which by now are muscle memory:

amd64-only images. No multi-arch, no arm64. The cluster nodes are x86 and I’m not paying the build tax for architectures I don’t run.
--provenance=false on every buildx. k3s’s containerd can’t resolve the attestation manifests that buildx adds by default, and you find this out via a baffling image-pull error at the worst time. Once burned, permanently remembered.
Everything through the Harbor proxy cache. No image pulls directly from Docker Hub or other public registries — they go through my Harbor pull-through cache, which means builds don’t break when an upstream registry rate-limits me and I’m not re-downloading the same base layers forever.

Nothing exotic. But it means the NPC-brain image builds and ships exactly like every other workload on the cluster, which is the whole point of having house rules.

Where the seams show

Let me be honest about what this is and isn’t, because I’d rather you hear it from me.

Cold-load latency is real. When the model has been evicted, that first line is slow enough to break immersion, and the head-scratch emote only papers over so much. If I cared about a “production” feel I’d pin the model in memory, but pinning it would starve the other things the Mac Studio does, and the tradeoff isn’t worth it for a hobby world.
It’s single-shard, hobby-scale. One world, one replica, however many friends I can talk into logging in. I have not stress-tested what fifty players simultaneously triggering NPC chatter does to inference throughput, and I suspect the answer is “the town square stops talking and starts queueing.” That’s a problem I’d love to have.
The model stays in character until it doesn’t. Most of the time the persona holds. Occasionally a gemma-class model produces a line that’s a little too modern, or a lich that briefly forgets it’s supposed to be a god-king. The persona scaffolding makes this rare, not impossible. I’ve decided the occasional break is part of the charm rather than something to chase to zero.
None of this makes the game better, exactly. It makes the world feel more alive, which is a different axis. A min-maxer grinding for loot will not care that the ratmen have opinions. I am not building for that person.

What I keep coming back to is that the language model was the easy 20%. The persona pools, the durable per-player memory, the autonomous errands, the own-kind-only banter rules, the deterministic action allowlist, the headless harness — that’s the 80%, and that’s the part that makes a clicked-on blacksmith feel like a person instead of a vending machine.

The peasant talks back now. Sometimes he even has a point.
