High-Availability

What's Still Broken and What Comes Next

TL;DR Over the last six posts, I’ve documented converting Jellyfin from a single-process media server into a two-replica, PostgreSQL-backed, sticky-session-coordinated deployment on k3s. Five of six failover tests passed cleanly. The key result: zero-downtime failover — killing a pod doesn’t take down the service. Users on the surviving replica see no interruption; displaced users reconnect in seconds. Node maintenance no longer kills Jellyfin for the household. But this project isn’t finished, and some problems can’t be solved with this architecture. This final post is an honest inventory of what’s still broken, what was deferred, and what the path forward looks like. ...

Scaling to Two Replicas and Failover Testing

TL;DR This is the moment everything was built for. Three phases of preparation (PostgreSQL provider on Day 3, storage migration on Day 4, state externalization on Day 5) all lead to a single kubectl scale command. This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and dissecting the one test that only partially passed. The headline result: killing a pod causes zero service downtime. Users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds. ...

Jellyfin state externalization architecture

State Externalization and the Sticky Session Compromise

TL;DR Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state (sessions, users, devices, tasks) in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces mean two independent views of reality. This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server. ...

Multi-model AI planning workflow diagram

Multi-Model Planning: The Same Pattern That Shipped dnd-multi

TL;DR The Jellyfin HA conversion touches a .NET 10 codebase, Entity Framework Core migrations, Kubernetes manifests, Terraform infrastructure, PostgreSQL operations, and FFmpeg transcoding pipelines. No single AI model understands all of this equally well. So I used four of them — the same multi-model planning pattern that shipped dnd-multi in a single day and that I documented in the LLM GitHub PR workflow. This post covers how I adapted that pattern for infrastructure work, what each model caught, and why planning is where all the human time should go. ...

Jellyfin single-instance architecture diagram

Why Jellyfin Can't Scale (And What We're Going to Do About It)

TL;DR Jellyfin is a fantastic open-source media server. It is also, architecturally, a single-process application that assumes it’s the only instance running. SQLite as the database. Eleven ConcurrentDictionary caches holding sessions, users, devices, and task queues in memory. A file-based config directory that gets written to at runtime. None of this survives a second pod. This is the first post in a seven-part series documenting how I converted Jellyfin into a highly available, multi-replica deployment on my home k3s cluster. The project spans two repositories, four phases, ~20 GitHub Issues executed by AI agents, and a live failover demo where I killed a pod and the service continued with zero downtime — users on the surviving replica never saw an interruption. ...