TL;DR
Over the last six posts, I’ve documented converting Jellyfin from a single-process media server into a two-replica, PostgreSQL-backed, sticky-session-coordinated deployment on k3s. Five of six failover tests passed cleanly. The key result: killing a pod no longer takes down the service. Users on the surviving replica see no interruption; displaced users reconnect in seconds. Node maintenance no longer means Jellyfin downtime for the household.
But this project isn’t finished, and some problems can’t be solved with this architecture. This final post is an honest inventory of what’s still broken, what was deferred, and what the path forward looks like.
What Works
Before the problems, let’s acknowledge what the project delivered:
| Capability | Before | After |
|---|---|---|
| Node maintenance | Jellyfin down for 2-3 minutes (pod reschedule) | Clients fail over to the second pod in ~12 seconds |
| Pod crash | Jellyfin down until StatefulSet restarts pod | Surviving pod serves all traffic immediately |
| Database corruption risk | SQLite lock contention with any concurrent access | PostgreSQL handles concurrent connections natively |
| Playback state durability | In memory — lost on pod restart | PostgreSQL — survives any pod/node failure |
| Task coordination | N/A (single pod) | Redis leader election prevents duplicate execution |
| Monitoring | Basic pod health | Per-replica Prometheus metrics + HA-specific alerts |
For a homelab media server, this is a meaningful improvement. The family no longer notices when I drain a node for kernel updates.
Problem 1: Transcoding Is Not HA
This is the biggest limitation and it’s fundamental. FFmpeg transcoding is an OS process that writes HLS segments to local disk. When a pod dies mid-transcode:
- The FFmpeg process is killed
- The partially-written segments are lost (they’re on the pod’s Longhorn PVC)
- The client must start a new transcode on the surviving pod
- The new transcode starts from the last saved playback position, not from where the video was playing
Gap: there’s a 5-10 second delay while the new FFmpeg process starts, seeks to the correct position, and generates enough segments for the client to buffer.
Why it’s hard to fix: distributed transcoding would require a shared filesystem for HLS segments (NFS would be too slow for real-time 4K segments at multi-gigabit rates), a central coordinator to assign transcode work, and a way to hand off a running FFmpeg process between nodes. This is the domain of purpose-built streaming platforms (Plex’s Relay, Netflix’s encoding pipeline) — not something you bolt onto a single-process media server.
Mitigation: most playback in the household is direct play (the client can decode the format natively). Transcoding only happens when a client can’t handle the video codec or when streaming remotely over a limited connection. For direct play, failover is near-instant — no FFmpeg involved.
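The 5-10 second gap exists partly because the new FFmpeg process has to seek back to an aligned segment boundary before it can serve the client. A minimal sketch of that arithmetic (the 6-second segment length and the function names are assumptions for illustration, not Jellyfin’s actual transcoding code):

```python
# Hypothetical sketch: map a saved playback position to the HLS segment a
# fresh transcode should resume from. The 6-second segment duration is an
# assumption; Jellyfin's real logic lives in its transcoding pipeline.
SEGMENT_SECONDS = 6.0

def resume_segment(position_seconds: float,
                   segment_seconds: float = SEGMENT_SECONDS) -> int:
    """Index of the segment containing the saved playback position."""
    if position_seconds < 0:
        raise ValueError("position must be non-negative")
    return int(position_seconds // segment_seconds)

def seek_offset(position_seconds: float,
                segment_seconds: float = SEGMENT_SECONDS) -> float:
    """Timestamp FFmpeg seeks to so segment numbering stays aligned."""
    return resume_segment(position_seconds, segment_seconds) * segment_seconds

# A stream killed at 02:05 (125 s) resumes at segment 20, i.e. 120 s in:
# resume_segment(125.0) == 20, seek_offset(125.0) == 120.0
```

The client then waits for the new process to emit enough aligned segments to refill its buffer, which is where most of the visible delay comes from.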
Problem 2: SyncPlay Is Pod-Local
SyncPlay — synchronized group playback — uses three interconnected ConcurrentDictionary instances and sub-second WebSocket coordination between group members. In Day 5, I deferred this entirely.
Current state: SyncPlay works on whichever pod the group creator is pinned to. If all group members happen to land on the same pod (likely with sticky sessions in a small household), it works fine. If they’re split across pods, commands only affect the local pod’s members.
Why it’s hard to fix: SyncPlay needs real-time pub/sub between pods. A play command on Pod A must reach Pod B’s WebSocket connections within ~100ms or visible desync occurs. Redis pub/sub could work, but every SyncPlay command would need to be serialized through Redis, and the existing SyncPlayManager would need a near-complete rewrite to support remote group membership.
Who it affects: households with 2+ people using SyncPlay. In practice, this feature is rarely used — none of my users have complained because they don’t use it.
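To make the pub/sub requirement concrete, here is a sketch of the cross-pod fan-out SyncPlay would need. An in-memory broker stands in for Redis pub/sub so the example runs standalone; in a real deployment each pod would SUBSCRIBE to a channel via redis-py, and all class and channel names here are illustrative, not Jellyfin’s actual SyncPlayManager API:

```python
# Sketch: serialize every SyncPlay command through a shared broker so all
# pods (including the sender's) apply commands in the same order. The
# FakeBroker is an in-memory stand-in for Redis pub/sub.
import json
from collections import defaultdict
from typing import Callable

class FakeBroker:
    """In-memory stand-in for Redis pub/sub."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in self._subs[channel]:
            handler(message)

class PodSyncPlay:
    """One pod's view of a SyncPlay group."""
    def __init__(self, name: str, broker: FakeBroker, group_id: str):
        self.name = name
        self.received = []  # commands delivered to this pod's WebSocket clients
        self._broker = broker
        self._channel = f"syncplay:{group_id}"
        broker.subscribe(self._channel, self._on_message)

    def send_command(self, command: str, position_ticks: int) -> None:
        payload = json.dumps(
            {"cmd": command, "pos": position_ticks, "from": self.name})
        self._broker.publish(self._channel, payload)

    def _on_message(self, message: str) -> None:
        self.received.append(json.loads(message))

broker = FakeBroker()
pod_a = PodSyncPlay("pod-a", broker, "movie-night")
pod_b = PodSyncPlay("pod-b", broker, "movie-night")
pod_a.send_command("pause", 12_000_000)
# Both pods see the command and can notify their local WebSocket clients.
```

The hard part isn’t this fan-out; it’s doing it inside ~100ms while rewriting the existing manager to treat group membership as remote state.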
Problem 3: QuickConnect Is Disabled
QuickConnect lets new devices pair by entering a code. The codes live in memory and can’t survive cross-pod routing. We disabled it entirely in Day 5.
Impact: users must authenticate with username/password instead of a 6-digit code. On a Smart TV with an on-screen keyboard, this is annoying. On a phone with saved credentials, it’s invisible.
Fix path: Track B (Redis-backed sessions) would solve this trivially — store pending codes in Redis instead of ConcurrentDictionary. If I ever implement Track B, QuickConnect is a 30-minute fix.
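The fix really is that small. Here is a sketch of the Redis-backed code store, with a dict-with-deadlines standing in for Redis SETEX/TTL so it runs standalone (with redis-py it would be roughly `r.setex(f"quickconnect:{code}", 300, payload)`; the names and the 5-minute TTL are assumptions):

```python
# Sketch of a shared QuickConnect code store. An expiring in-memory dict
# stands in for Redis keys with TTL; any pod holding the same backend
# could redeem a code created by another pod.
import secrets
import time

class CodeStore:
    def __init__(self, ttl_seconds: int = 300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._codes = {}  # code -> (payload, expiry)

    def create(self, device_info: str) -> str:
        code = f"{secrets.randbelow(1_000_000):06d}"  # 6-digit pairing code
        self._codes[code] = (device_info, self._clock() + self._ttl)
        return code

    def redeem(self, code: str):
        entry = self._codes.pop(code, None)
        if entry is None or entry[1] < self._clock():
            return None        # unknown or expired: pairing fails
        return entry[0]

store = CodeStore(ttl_seconds=300)
code = store.create("Living-room TV")
assert len(code) == 6
assert store.redeem(code) == "Living-room TV"   # works from any pod
assert store.redeem(code) is None               # codes are single-use
```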
Problem 4: The Fork Rebase Burden
The Jellyfin fork adds ~500 lines across ~10 files:
| Category | Files | Lines |
|---|---|---|
| Database provider interface + PostgreSQL implementation | 4 | ~200 |
| EF Core PostgreSQL migrations | 2 | ~150 |
| Task leader election + circuit breaker | 3 | ~148 |
| DI registration, config additions | 2 | ~20 |
This is a small fork, but it’s a fork. Every upstream Jellyfin release needs a rebase. The risk areas:
- `JellyfinDbContext.cs` — upstream adds a new DbSet, I need to add it to the PostgreSQL migration
- `Startup.cs` — upstream changes DI registration order or adds new services
- `TaskManager.cs` — upstream changes the task scheduling interface (unlikely but possible)
- `Directory.Packages.props` — upstream upgrades EF Core, which may break Npgsql compatibility
Historically, Jellyfin releases every 2-3 months. Each rebase takes 30-60 minutes if the diff is clean, longer if there are migration conflicts. This is ongoing maintenance, not a one-time cost.
Mitigation: the fork was designed to be minimal. The IJellyfinDatabaseProvider interface isolates all PostgreSQL-specific code. If upstream ever adds a pluggable database layer (there’s been discussion in the issue tracker), the fork can be retired.
Problem 5: Test Coverage Gaps
The project ships with:
- Integration tests for the PostgreSQL provider (Testcontainers, run in CI)
- Failover tests (manual, documented procedures from Day 6)
- Leader election unit tests (Redis mock, verify lease acquisition and circuit breaker)
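The leader election tests above exercise the lease-acquisition semantics, which boil down to Redis SET with NX and EX. A sketch of that core against a minimal in-memory stand-in (with redis-py it is `r.set(key, pod_id, nx=True, ex=ttl)`; the key name and 30-second TTL here are illustrative, not the fork’s actual values):

```python
# Sketch of lease-based leader election. FakeRedis mimics just enough of
# SET NX EX semantics: a set succeeds only if the key is absent or expired.
import time

class FakeRedis:
    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expiry)
        self._clock = clock

    def set_nx_ex(self, key, value, ttl):
        cur = self._data.get(key)
        if cur is not None and cur[1] > self._clock():
            return False                       # lease held and still live
        self._data[key] = (value, self._clock() + ttl)
        return True

    def get(self, key):
        cur = self._data.get(key)
        if cur is None or cur[1] <= self._clock():
            return None
        return cur[0]

def try_acquire(redis, pod_id, key="jellyfin:task-leader", ttl=30):
    """True if this pod is (or becomes) the task leader."""
    return redis.set_nx_ex(key, pod_id, ttl) or redis.get(key) == pod_id

clock = [0.0]
r = FakeRedis(clock=lambda: clock[0])
assert try_acquire(r, "pod-a")        # pod-a wins the lease
assert not try_acquire(r, "pod-b")    # pod-b stays follower
clock[0] = 31.0                       # pod-a dies; lease expires
assert try_acquire(r, "pod-b")        # pod-b takes over
```

The expiring lease is what makes failover automatic: a dead leader simply stops renewing, and the survivor acquires the key on its next attempt.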
What’s missing:
| Gap | Risk |
|---|---|
| Automated failover test suite | Manual tests are time-consuming and may not be re-run after future changes |
| Load testing under failover | Unknown: what happens with 10 concurrent streams during pod death? |
| Long-running soak test | Memory leaks in the leader election loop? Redis connection pool exhaustion? |
| Migration tool edge cases | Empty strings vs NULL, Unicode in metadata, very large libraries (50k+ items) |
For a homelab project, manual testing is acceptable. For production, I’d want a CI pipeline that spins up a test cluster, deploys the HA stack, runs automated failover scenarios, and gates the release.
Problem 6: Monitoring Blind Spots
The Prometheus alerts from Day 6 cover the basics: replica count, PostgreSQL availability, Redis availability, restart rate. But there are gaps:
| Blind Spot | Why It Matters |
|---|---|
| Sticky session distribution | No metric for how clients are distributed across pods. One pod could be serving 90% of traffic. |
| Session re-auth rate | No metric tracking how often clients re-authenticate. A spike indicates failover or sticky session misconfiguration. |
| PostgreSQL connection pool | No alert for approaching pool exhaustion. Npgsql defaults to a max pool size of 100 per pod, while stock PostgreSQL allows only 100 connections server-wide — two pods could theoretically exhaust the server. |
| NFS config latency | Shared config is on NFS over the network. If NFS degrades, both pods feel it. No latency metric. |
These are nice-to-have observability features, not critical gaps. The cluster’s existing Prometheus infrastructure (node-exporter, kube-state-metrics) catches the broad failures. The Jellyfin-specific gaps would matter more at scale.
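The connection-pool blind spot is simple arithmetic, which is exactly what the missing alert would encode. A sketch (the values are the stock defaults: Npgsql’s max pool size of 100 per process, PostgreSQL’s `max_connections` of 100, and 3 slots reserved for superusers; nothing here reflects what the fork actually configures):

```python
# Sketch of the headroom math behind a pool-exhaustion alert. Each pod's
# Npgsql pool can grow to max_pool_size_per_pod; the server only has
# max_connections slots, a few of which are reserved for superusers.
def pool_headroom(replicas: int,
                  max_pool_size_per_pod: int = 100,
                  server_max_connections: int = 100,
                  reserved_superuser: int = 3) -> int:
    """Server slots left if every pod's pool fills (negative = exhaustion)."""
    usable = server_max_connections - reserved_superuser
    return usable - replicas * max_pool_size_per_pod

# Two pods at stock defaults can demand 200 connections against 97 slots:
assert pool_headroom(replicas=2) == -103
# Capping each pod's pool at 40 restores headroom:
assert pool_headroom(replicas=2, max_pool_size_per_pod=40) == 17
```

In practice the pools rarely fill, which is why this hasn’t bitten yet; an alert would catch it before it does.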
Problem 7: Documentation Debt
The project produced:
- This 7-part blog series (you’re reading it)
- The execution plan in the infra repo (`docs/jellyfin-ha-plan.md`)
- Inline code comments in the fork
What’s missing:
- Operator runbook — step-by-step procedures for common operations (scale up, scale down, PostgreSQL backup/restore, force leader election, disable HA and revert to single pod)
- Architecture decision records — the Track A vs Track B decision, PostgreSQL vs MySQL, Redis vs etcd should be formally documented for future-me
- Disaster recovery playbook — what happens if both pods die and PostgreSQL has a corrupt volume? The S3 backups exist but the restore procedure isn’t documented.
This is technical debt that compounds over time. Future-me will forget why decisions were made and may repeat failed experiments.
What Track B Would Look Like
If I ever revisit this project for full stateless HA:
| Component | Track A (Current) | Track B (Future) |
|---|---|---|
| Sessions | Pod-local, sticky cookie | Redis-backed IDistributedCache |
| User cache | Pod-local, PostgreSQL fallback | Redis, all pods share |
| Device capabilities | Pod-local | Redis |
| QuickConnect | Disabled | Redis-backed |
| SyncPlay | Pod-local | Redis pub/sub + WebSocket bridge |
| Task coordination | Redis leader election | Redis distributed lock (unchanged) |
| Transcoding | Pod-local (sticky) | Pod-local (sticky) — no change possible |
Estimated work:
- Sessions in Redis: ~400 lines. Replace `ConcurrentDictionary<string, SessionInfo>` with `IDistributedCache`. Serialize `SessionInfo` to JSON. Add cache invalidation on logout.
- User cache in Redis: ~200 lines. Change `UserManager` to a read-through cache backed by Redis + PostgreSQL.
- QuickConnect in Redis: ~100 lines. Move codes and authorizations to Redis keys with TTL.
- SyncPlay pub/sub: ~800+ lines. Major redesign. Redis pub/sub for group commands, shared group state, cross-pod WebSocket notifications.
Total: ~1,500+ lines of Jellyfin fork changes (vs ~500 today). The fork rebase burden triples.
The trigger for Track B would be: consistently hitting sticky session problems with >10 concurrent users, or the SyncPlay limitation becoming unacceptable.
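The sessions piece of Track B can be sketched in a few lines. A plain dict stands in for the `IDistributedCache`/Redis backend so this runs standalone, and the fields on `SessionInfo` are illustrative, not Jellyfin’s actual model:

```python
# Sketch of Track B sessions: a shared store keyed by session id, with
# values serialized to JSON, so any pod can serve (or invalidate) any
# session. The dict backend stands in for Redis.
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionInfo:
    session_id: str
    user_id: str
    device: str
    last_position_ticks: int

class SharedSessionStore:
    def __init__(self, backend=None):
        self._backend = backend if backend is not None else {}

    def save(self, session: SessionInfo) -> None:
        # In Redis: SETEX session:<id> <ttl> <json>
        key = f"session:{session.session_id}"
        self._backend[key] = json.dumps(asdict(session))

    def load(self, session_id: str):
        raw = self._backend.get(f"session:{session_id}")
        return SessionInfo(**json.loads(raw)) if raw else None

    def logout(self, session_id: str) -> None:
        # Cache invalidation on logout: DEL session:<id>
        self._backend.pop(f"session:{session_id}", None)

backend = {}                                  # shared between "pods"
pod_a = SharedSessionStore(backend)
pod_b = SharedSessionStore(backend)
pod_a.save(SessionInfo("s1", "alice", "tv", 12_000_000))
restored = pod_b.load("s1")                   # any pod can serve the session
assert restored is not None and restored.user_id == "alice"
pod_b.logout("s1")
assert pod_a.load("s1") is None               # invalidation is global
```

The ~400-line estimate comes from wiring this pattern into every call site that touches the session dictionary today, not from the store itself.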
Standing on Shoulders
This project wouldn’t exist without two pieces of prior work:
- pseudopseudonym’s HA Jellyfin setup — proved that Jellyfin could survive node failures on Kubernetes, identified the Rook PVC release bottleneck, and built the `rook-push` tool that made Ceph failover practical. Their r/selfhosted post surfaced the key question from the community: “How do you synchronize the database between jellyfin instances?” — the top comment, and the exact question this series answers.
- The Hacker News SQLite concurrency discussion — crystallized the architectural argument for PostgreSQL over SQLite, and highlighted that the Jellyfin 10.11 EF Core refactor finally made a pluggable database layer feasible.
The gap between “Jellyfin can fail over in 2.5 minutes” and “Jellyfin runs active-active with 12-second failover” is entirely about solving the database and state problems that these discussions identified.
What I’d Do Differently
Looking back at the project with fresh eyes:
1. Start with the migration tool
I built the SQLite-to-PostgreSQL migration tool as part of Phase 2. In hindsight, I should have built it first and run it on a copy of my data before touching any Jellyfin code. The migration is the highest-risk step — if the tool has bugs, you discover them after you’ve already deployed PostgreSQL and can’t easily revert.
2. Test NFS config performance earlier
NFS config works well in practice, but I didn’t benchmark it until after deployment. If NFS had been too slow for Jellyfin’s config reads (which happen on every startup and settings change), I’d have needed to redesign the shared config approach.
3. Document before implementing
The blog series is being written after the project completed. Writing the architecture docs first (even as rough drafts) would have forced me to articulate assumptions that the AI agents silently resolved — some correctly, some not.
4. Automate the failover tests from day one
Manual failover testing is error-prone and won’t be repeated regularly. A simple shell script that runs the six tests and validates the expected outcomes would pay for itself on the first rebase.
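A skeleton of that harness, with the kubectl runner injected so the test logic can be exercised without a cluster. The pod name, namespace, and health probe are assumptions about my setup, and the real runs would pass `run_kubectl` plus an HTTP probe against the Service:

```python
# Skeleton of an automated failover test: kill one pod, then verify the
# service answers within the time budget. The runner and probe are
# injectable so a dry run needs no cluster.
import subprocess
import time

def run_kubectl(args):
    """Real runner: shell out to kubectl and return stdout."""
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def failover_test(runner, probe, pod="jellyfin-0", timeout_s=30):
    """Delete a pod, then poll until the service responds or time runs out."""
    runner(["delete", "pod", pod, "-n", "media", "--wait=false"])
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():                # e.g. GET /health against the Service VIP
            return True
        time.sleep(1)
    return False

# Dry run with fakes (no cluster needed):
calls = []
ok = failover_test(runner=lambda args: calls.append(args),
                   probe=lambda: True)
assert ok and calls[0][:3] == ["delete", "pod", "jellyfin-0"]
```

Each of the six Day 6 scenarios becomes one call with a different target pod and probe, and the whole suite can gate a rebase.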
The Numbers
| Metric | Value |
|---|---|
| Phases | 4 |
| GitHub Issues | 20 |
| PRs merged | 22 |
| Fork lines added | ~500 |
| Kubernetes manifest lines | ~350 |
| AI agent (Copilot) PRs | ~13 (60% of total) |
| Manual implementation PRs | ~9 (40% of total) |
| Failover test pass rate | 5/6 (83%) |
| Total disruption on pod failure | ~12-20 seconds (down from 2-3 minutes) |
| New infrastructure components | PostgreSQL StatefulSet, Redis Deployment, NFS share |
| Ongoing maintenance | Upstream rebase every 2-3 months (~30-60 min each) |
Series Recap
| Day | Post | Phase |
|---|---|---|
| 1 | Why Jellyfin Can’t Scale | Problem definition, strategy selection |
| 2 | Multi-Model Planning | AI planning workflow, gap analysis |
| 3 | PostgreSQL Database Provider | Phase 1: Fork, EF Core, Npgsql |
| 4 | Storage Migration | Phase 2: PostgreSQL deploy, volume restructure, data migration |
| 5 | State and Sessions | Phase 3: Redis leader election, sticky sessions, feature flags |
| 6 | Failover Testing | Phase 4: Scale to 2, six tests, Prometheus alerts |
| 7 | What’s Still Broken (this post) | Retrospective, limitations, future work |
Final Thoughts
This project proves that even architecturally single-process applications can be made highly available on Kubernetes with a pragmatic approach. You don’t need to solve every state problem — you need to solve enough of them that the user experience during failure is acceptable.
For a homelab media server, “re-authenticate and press Play” is an acceptable failure mode. It’s infinitely better than “wait 3 minutes for the pod to reschedule.”
The fork stays small. The maintenance stays manageable. And the family stops yelling from the living room when I need to patch a node.
Browse the code: The complete Jellyfin fork is at github.com/zolty-mat/jellyfin. The PostgreSQL provider, Redis leader election, and HA Dockerfile are all there. Star it if you’re thinking about doing something similar with your media server.
The infrastructure repo (Kubernetes manifests, Terraform, Ansible) at github.com/zolty-mat/home_k3s_cluster will be published once secrets remediation is complete. Watch the repo for updates.
Don’t have a homelab? This entire stack — Jellyfin, PostgreSQL, Redis, Traefik — can run on any Kubernetes cluster. A DigitalOcean managed cluster with $200 in free credits is enough to run the complete HA setup for several months.