TL;DR
Over the last six posts, I’ve documented converting Jellyfin from a single-process media server into a two-replica, PostgreSQL-backed, sticky-session-coordinated deployment on k3s. Five of six failover tests passed cleanly. The key result: killing a pod no longer takes down the service. Users on the surviving replica see no interruption; displaced users reconnect in seconds. Node maintenance no longer means Jellyfin downtime for the household.
But this project isn’t finished, and some problems can’t be solved with this architecture. This final post is an honest inventory of what’s still broken, what was deferred, and what the path forward looks like.
What Works
Before the problems, let’s acknowledge what the project delivered:
| Capability | Before | After |
|---|---|---|
| Node maintenance | Jellyfin down for 2-3 minutes (pod reschedule) | Clients fail over to the second pod in ~12 seconds |
| Pod crash | Jellyfin down until StatefulSet restarts pod | Surviving pod serves all traffic immediately |
| Database corruption risk | SQLite lock contention with any concurrent access | PostgreSQL handles concurrent connections natively |
| Playback state durability | In memory — lost on pod restart | PostgreSQL — survives any pod/node failure |
| Task coordination | N/A (single pod) | Redis leader election prevents duplicate execution |
| Monitoring | Basic pod health | Per-replica Prometheus metrics + HA-specific alerts |
For a homelab media server, this is a meaningful improvement. The family no longer notices when I drain a node for kernel updates.
Problem 1: Transcoding Is Not HA
This is the biggest limitation and it’s fundamental. FFmpeg transcoding is an OS process that writes HLS segments to local disk. When a pod dies mid-transcode:
- The FFmpeg process is killed
- The partially-written segments are lost (they’re on the pod’s Longhorn PVC)
- The client must start a new transcode on the surviving pod
- The new transcode starts from the last saved playback position, not from where the video was playing
Gap: there’s a 5-10 second delay while the new FFmpeg process starts, seeks to the correct position, and generates enough segments for the client to buffer.
Why it’s hard to fix: distributed transcoding would require a shared filesystem for HLS segments (NFS would be too slow for real-time 4K segments at multi-gigabit rates), a central coordinator to assign transcode work, and a way to hand off a running FFmpeg process between nodes. This is the domain of purpose-built streaming platforms (Plex’s Relay, Netflix’s encoding pipeline) — not something you bolt onto a single-process media server.
Mitigation: most playback in the household is direct play (the client can decode the format natively). Transcoding only happens when a client can’t handle the video codec or when streaming remotely over a limited connection. For direct play, failover is near-instant — no FFmpeg involved.
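The 5-10 second gap exists partly because the new FFmpeg process has to seek back to an aligned segment boundary before it can serve the client. A minimal sketch of that arithmetic (the 6-second segment length and the function names are assumptions for illustration, not Jellyfin’s actual transcoding code):

```python
# Hypothetical sketch: map a saved playback position to the HLS segment a
# fresh transcode should resume from. The 6-second segment duration is an
# assumption; Jellyfin's real logic lives in its transcoding pipeline.
SEGMENT_SECONDS = 6.0

def resume_segment(position_seconds: float,
                   segment_seconds: float = SEGMENT_SECONDS) -> int:
    """Index of the segment containing the saved playback position."""
    if position_seconds < 0:
        raise ValueError("position must be non-negative")
    return int(position_seconds // segment_seconds)

def seek_offset(position_seconds: float,
                segment_seconds: float = SEGMENT_SECONDS) -> float:
    """Timestamp FFmpeg seeks to so segment numbering stays aligned."""
    return resume_segment(position_seconds, segment_seconds) * segment_seconds

# A stream killed at 02:05 (125 s) resumes at segment 20, i.e. 120 s in:
# resume_segment(125.0) == 20, seek_offset(125.0) == 120.0
```

The client then waits for the new process to emit enough aligned segments to refill its buffer, which is where most of the visible delay comes from.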
Problem 2: SyncPlay Is Pod-Local
SyncPlay — synchronized group playback — uses three interconnected ConcurrentDictionary instances and sub-second WebSocket coordination between group members. In Day 5, I deferred this entirely.
Current state: SyncPlay works on whichever pod the group creator is pinned to. If all group members happen to land on the same pod (likely with sticky sessions in a small household), it works fine. If they’re split across pods, commands only affect the local pod’s members.
Why it’s hard to fix: SyncPlay needs real-time pub/sub between pods. A play command on Pod A must reach Pod B’s WebSocket connections within ~100ms or visible desync occurs. Redis pub/sub could work, but every SyncPlay command would need to be serialized through Redis, and the existing SyncPlayManager would need a near-complete rewrite to support remote group membership.
Who it affects: households with 2+ people using SyncPlay. In practice, this feature is rarely used — none of my users have complained because they don’t use it.
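To make the pub/sub requirement concrete, here is a sketch of the cross-pod fan-out SyncPlay would need. An in-memory broker stands in for Redis pub/sub so the example runs standalone; in a real deployment each pod would SUBSCRIBE to a channel via redis-py, and all class and channel names here are illustrative, not Jellyfin’s actual SyncPlayManager API:

```python
# Sketch: serialize every SyncPlay command through a shared broker so all
# pods (including the sender's) apply commands in the same order. The
# FakeBroker is an in-memory stand-in for Redis pub/sub.
import json
from collections import defaultdict
from typing import Callable

class FakeBroker:
    """In-memory stand-in for Redis pub/sub."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in self._subs[channel]:
            handler(message)

class PodSyncPlay:
    """One pod's view of a SyncPlay group."""
    def __init__(self, name: str, broker: FakeBroker, group_id: str):
        self.name = name
        self.received = []  # commands delivered to this pod's WebSocket clients
        self._broker = broker
        self._channel = f"syncplay:{group_id}"
        broker.subscribe(self._channel, self._on_message)

    def send_command(self, command: str, position_ticks: int) -> None:
        payload = json.dumps(
            {"cmd": command, "pos": position_ticks, "from": self.name})
        self._broker.publish(self._channel, payload)

    def _on_message(self, message: str) -> None:
        self.received.append(json.loads(message))

broker = FakeBroker()
pod_a = PodSyncPlay("pod-a", broker, "movie-night")
pod_b = PodSyncPlay("pod-b", broker, "movie-night")
pod_a.send_command("pause", 12_000_000)
# Both pods see the command and can notify their local WebSocket clients.
```

The hard part isn’t this fan-out; it’s doing it inside ~100ms while rewriting the existing manager to treat group membership as remote state.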
Problem 3: QuickConnect Is Disabled
QuickConnect lets new devices pair by entering a code. The codes live in memory and can’t survive cross-pod routing. We disabled it entirely in Day 5.
Impact: users must authenticate with username/password instead of a 6-digit code. On a Smart TV with an on-screen keyboard, this is annoying. On a phone with saved credentials, it’s invisible.
Fix path: Track B (Redis-backed sessions) would solve this trivially — store pending codes in Redis instead of ConcurrentDictionary. If I ever implement Track B, QuickConnect is a 30-minute fix.
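The fix really is that small. Here is a sketch of the Redis-backed code store, with a dict-with-deadlines standing in for Redis SETEX/TTL so it runs standalone (with redis-py it would be roughly `r.setex(f"quickconnect:{code}", 300, payload)`; the names and the 5-minute TTL are assumptions):

```python
# Sketch of a shared QuickConnect code store. An expiring in-memory dict
# stands in for Redis keys with TTL; any pod holding the same backend
# could redeem a code created by another pod.
import secrets
import time

class CodeStore:
    def __init__(self, ttl_seconds: int = 300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._codes = {}  # code -> (payload, expiry)

    def create(self, device_info: str) -> str:
        code = f"{secrets.randbelow(1_000_000):06d}"  # 6-digit pairing code
        self._codes[code] = (device_info, self._clock() + self._ttl)
        return code

    def redeem(self, code: str):
        entry = self._codes.pop(code, None)
        if entry is None or entry[1] < self._clock():
            return None        # unknown or expired: pairing fails
        return entry[0]

store = CodeStore(ttl_seconds=300)
code = store.create("Living-room TV")
assert len(code) == 6
assert store.redeem(code) == "Living-room TV"   # works from any pod
assert store.redeem(code) is None               # codes are single-use
```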
Problem 4: The Fork Rebase Burden
The Jellyfin fork adds ~500 lines across ~10 files:
| Category | Files | Lines |
|---|---|---|
| Database provider interface + PostgreSQL implementation | 4 | ~200 |
| EF Core PostgreSQL migrations | 2 | ~150 |
| Task leader election + circuit breaker | 3 | ~148 |
| DI registration, config additions | 2 | ~20 |
This is a small fork, but it’s a fork. Every upstream Jellyfin release needs a rebase. The risk areas:
- `JellyfinDbContext.cs` — upstream adds a new DbSet, I need to add it to the PostgreSQL migration
- `Startup.cs` — upstream changes DI registration order or adds new services
- `TaskManager.cs` — upstream changes the task scheduling interface (unlikely but possible)
- `Directory.Packages.props` — upstream upgrades EF Core, which may break Npgsql compatibility
Historically, Jellyfin releases every 2-3 months. Each rebase takes 30-60 minutes if the diff is clean, longer if there are migration conflicts. This is ongoing maintenance, not a one-time cost.
Mitigation: the fork was designed to be minimal. The IJellyfinDatabaseProvider interface isolates all PostgreSQL-specific code. If upstream ever adds a pluggable database layer (there’s been discussion in the issue tracker), the fork can be retired.
Problem 5: Test Coverage Gaps
The project ships with:
- Integration tests for the PostgreSQL provider (Testcontainers, run in CI)
- Failover tests (manual, documented procedures from Day 6)
- Leader election unit tests (Redis mock, verify lease acquisition and circuit breaker)
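The leader election tests above exercise the lease-acquisition semantics, which boil down to Redis SET with NX and EX. A sketch of that core against a minimal in-memory stand-in (with redis-py it is `r.set(key, pod_id, nx=True, ex=ttl)`; the key name and 30-second TTL here are illustrative, not the fork’s actual values):

```python
# Sketch of lease-based leader election. FakeRedis mimics just enough of
# SET NX EX semantics: a set succeeds only if the key is absent or expired.
import time

class FakeRedis:
    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expiry)
        self._clock = clock

    def set_nx_ex(self, key, value, ttl):
        cur = self._data.get(key)
        if cur is not None and cur[1] > self._clock():
            return False                       # lease held and still live
        self._data[key] = (value, self._clock() + ttl)
        return True

    def get(self, key):
        cur = self._data.get(key)
        if cur is None or cur[1] <= self._clock():
            return None
        return cur[0]

def try_acquire(redis, pod_id, key="jellyfin:task-leader", ttl=30):
    """True if this pod is (or becomes) the task leader."""
    return redis.set_nx_ex(key, pod_id, ttl) or redis.get(key) == pod_id

clock = [0.0]
r = FakeRedis(clock=lambda: clock[0])
assert try_acquire(r, "pod-a")        # pod-a wins the lease
assert not try_acquire(r, "pod-b")    # pod-b stays follower
clock[0] = 31.0                       # pod-a dies; lease expires
assert try_acquire(r, "pod-b")        # pod-b takes over
```

The expiring lease is what makes failover automatic: a dead leader simply stops renewing, and the survivor acquires the key on its next attempt.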
What’s missing:
| Gap | Risk |
|---|---|
| Automated failover test suite | Manual tests are time-consuming and may not be re-run after future changes |
| Load testing under failover | Unknown: what happens with 10 concurrent streams during pod death? |
| Long-running soak test | Memory leaks in the leader election loop? Redis connection pool exhaustion? |
| Migration tool edge cases | Empty strings vs NULL, Unicode in metadata, very large libraries (50k+ items) |
For a homelab project, manual testing is acceptable. For production, I’d want a CI pipeline that spins up a test cluster, deploys the HA stack, runs automated failover scenarios, and gates the release.
Problem 6: Monitoring Blind Spots
The Prometheus alerts from Day 6 cover the basics: replica count, PostgreSQL availability, Redis availability, restart rate. But there are gaps:
| Blind Spot | Why It Matters |
|---|---|
| Sticky session distribution | No metric for how clients are distributed across pods. One pod could be serving 90% of traffic. |
| Session re-auth rate | No metric tracking how often clients re-authenticate. A spike indicates failover or sticky session misconfiguration. |
| PostgreSQL connection pool | No alert for approaching pool exhaustion. Npgsql defaults to a max pool size of 100 per pod, while stock PostgreSQL allows only 100 connections server-wide — two pods could theoretically exhaust the server. |
| NFS config latency | Shared config is on NFS over the network. If NFS degrades, both pods feel it. No latency metric. |
These are nice-to-have observability features, not critical gaps. The cluster’s existing Prometheus infrastructure (node-exporter, kube-state-metrics) catches the broad failures. The Jellyfin-specific gaps would matter more at scale.
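The connection-pool blind spot is simple arithmetic, which is exactly what the missing alert would encode. A sketch (the values are the stock defaults: Npgsql’s max pool size of 100 per process, PostgreSQL’s `max_connections` of 100, and 3 slots reserved for superusers; nothing here reflects what the fork actually configures):

```python
# Sketch of the headroom math behind a pool-exhaustion alert. Each pod's
# Npgsql pool can grow to max_pool_size_per_pod; the server only has
# max_connections slots, a few of which are reserved for superusers.
def pool_headroom(replicas: int,
                  max_pool_size_per_pod: int = 100,
                  server_max_connections: int = 100,
                  reserved_superuser: int = 3) -> int:
    """Server slots left if every pod's pool fills (negative = exhaustion)."""
    usable = server_max_connections - reserved_superuser
    return usable - replicas * max_pool_size_per_pod

# Two pods at stock defaults can demand 200 connections against 97 slots:
assert pool_headroom(replicas=2) == -103
# Capping each pod's pool at 40 restores headroom:
assert pool_headroom(replicas=2, max_pool_size_per_pod=40) == 17
```

In practice the pools rarely fill, which is why this hasn’t bitten yet; an alert would catch it before it does.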
Problem 7: Documentation Debt
The project produced:
- This 7-part blog series (you’re reading it)
- The execution plan in the infra repo (`docs/jellyfin-ha-plan.md`)
- Inline code comments in the fork
What’s missing:
- Operator runbook — step-by-step procedures for common operations (scale up, scale down, PostgreSQL backup/restore, force leader election, disable HA and revert to single pod)
- Architecture decision records — the Track A vs Track B decision, PostgreSQL vs MySQL, Redis vs etcd should be formally documented for future-me
- Disaster recovery playbook — what happens if both pods die and PostgreSQL has a corrupt volume? The S3 backups exist but the restore procedure isn’t documented.
This is technical debt that compounds over time. Future-me will forget why decisions were made and may repeat failed experiments.
What Track B Would Look Like
If I ever revisit this project for full stateless HA:
| Component | Track A (Current) | Track B (Future) |
|---|---|---|
| Sessions | Pod-local, sticky cookie | Redis-backed IDistributedCache |
| User cache | Pod-local, PostgreSQL fallback | Redis, all pods share |
| Device capabilities | Pod-local | Redis |
| QuickConnect | Disabled | Redis-backed |
| SyncPlay | Pod-local | Redis pub/sub + WebSocket bridge |
| Task coordination | Redis leader election | Redis distributed lock (unchanged) |
| Transcoding | Pod-local (sticky) | Pod-local (sticky) — no change possible |
Estimated work:
- Sessions in Redis: ~400 lines. Replace `ConcurrentDictionary<string, SessionInfo>` with `IDistributedCache`. Serialize `SessionInfo` to JSON. Add cache invalidation on logout.
- User cache in Redis: ~200 lines. Change `UserManager` to a read-through cache backed by Redis + PostgreSQL.
- QuickConnect in Redis: ~100 lines. Move codes and authorizations to Redis keys with TTL.
- SyncPlay pub/sub: ~800+ lines. Major redesign. Redis pub/sub for group commands, shared group state, cross-pod WebSocket notifications.
Total: ~1,500+ lines of Jellyfin fork changes (vs ~500 today). The fork rebase burden triples.
The trigger for Track B would be: consistently hitting sticky session problems with >10 concurrent users, or the SyncPlay limitation becoming unacceptable.
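The sessions piece of Track B can be sketched in a few lines. A plain dict stands in for the `IDistributedCache`/Redis backend so this runs standalone, and the fields on `SessionInfo` are illustrative, not Jellyfin’s actual model:

```python
# Sketch of Track B sessions: a shared store keyed by session id, with
# values serialized to JSON, so any pod can serve (or invalidate) any
# session. The dict backend stands in for Redis.
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionInfo:
    session_id: str
    user_id: str
    device: str
    last_position_ticks: int

class SharedSessionStore:
    def __init__(self, backend=None):
        self._backend = backend if backend is not None else {}

    def save(self, session: SessionInfo) -> None:
        # In Redis: SETEX session:<id> <ttl> <json>
        key = f"session:{session.session_id}"
        self._backend[key] = json.dumps(asdict(session))

    def load(self, session_id: str):
        raw = self._backend.get(f"session:{session_id}")
        return SessionInfo(**json.loads(raw)) if raw else None

    def logout(self, session_id: str) -> None:
        # Cache invalidation on logout: DEL session:<id>
        self._backend.pop(f"session:{session_id}", None)

backend = {}                                  # shared between "pods"
pod_a = SharedSessionStore(backend)
pod_b = SharedSessionStore(backend)
pod_a.save(SessionInfo("s1", "alice", "tv", 12_000_000))
restored = pod_b.load("s1")                   # any pod can serve the session
assert restored is not None and restored.user_id == "alice"
pod_b.logout("s1")
assert pod_a.load("s1") is None               # invalidation is global
```

The ~400-line estimate comes from wiring this pattern into every call site that touches the session dictionary today, not from the store itself.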
Standing on Shoulders
This project wouldn’t exist without two pieces of prior work:
- pseudopseudonym’s HA Jellyfin setup — proved that Jellyfin could survive node failures on Kubernetes, identified the Rook PVC release bottleneck, and built the `rook-push` tool that made Ceph failover practical. Their r/selfhosted post surfaced the key question from the community: “How do you synchronize the database between jellyfin instances?” — the top comment, and the exact question this series answers.
- The Hacker News SQLite concurrency discussion — crystallized the architectural argument for PostgreSQL over SQLite, and highlighted that the Jellyfin 10.11 EF Core refactor finally made a pluggable database layer feasible.
The gap between “Jellyfin can fail over in 2.5 minutes” and “Jellyfin runs active-active with 12-second failover” is entirely about solving the database and state problems that these discussions identified.
What I’d Do Differently
Looking back at the project with fresh eyes:
1. Start with the migration tool
I built the SQLite-to-PostgreSQL migration tool as part of Phase 2. In hindsight, I should have built it first and run it on a copy of my data before touching any Jellyfin code. The migration is the highest-risk step — if the tool has bugs, you discover them after you’ve already deployed PostgreSQL and can’t easily revert.
2. Test NFS config performance earlier
NFS config works well in practice, but I didn’t benchmark it until after deployment. If NFS had been too slow for Jellyfin’s config reads (which happen on every startup and settings change), I’d have needed to redesign the shared config approach.
3. Document before implementing
The blog series is being written after the project completed. Writing the architecture docs first (even as rough drafts) would have forced me to articulate assumptions that the AI agents silently resolved — some correctly, some not.
4. Automate the failover tests from day one
Manual failover testing is error-prone and won’t be repeated regularly. A simple shell script that runs the six tests and validates the expected outcomes would pay for itself on the first rebase.
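A skeleton of that harness, with the kubectl runner injected so the test logic can be exercised without a cluster. The pod name, namespace, and health probe are assumptions about my setup, and the real runs would pass `run_kubectl` plus an HTTP probe against the Service:

```python
# Skeleton of an automated failover test: kill one pod, then verify the
# service answers within the time budget. The runner and probe are
# injectable so a dry run needs no cluster.
import subprocess
import time

def run_kubectl(args):
    """Real runner: shell out to kubectl and return stdout."""
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def failover_test(runner, probe, pod="jellyfin-0", timeout_s=30):
    """Delete a pod, then poll until the service responds or time runs out."""
    runner(["delete", "pod", pod, "-n", "media", "--wait=false"])
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():                # e.g. GET /health against the Service VIP
            return True
        time.sleep(1)
    return False

# Dry run with fakes (no cluster needed):
calls = []
ok = failover_test(runner=lambda args: calls.append(args),
                   probe=lambda: True)
assert ok and calls[0][:3] == ["delete", "pod", "jellyfin-0"]
```

Each of the six Day 6 scenarios becomes one call with a different target pod and probe, and the whole suite can gate a rebase.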
The Numbers
| Metric | Value |
|---|---|
| Phases | 4 |
| GitHub Issues | 20 |
| PRs merged | 22 |
| Fork lines added | ~500 |
| Kubernetes manifest lines | ~350 |
| AI agent (Copilot) PRs | ~13 (60% of total) |
| Manual implementation PRs | ~9 (40% of total) |
| Failover test pass rate | 5/6 (83%) |
| Total disruption on pod failure | ~12-20 seconds (down from 2-3 minutes) |
| New infrastructure components | PostgreSQL StatefulSet, Redis Deployment, NFS share |
| Ongoing maintenance | Upstream rebase every 2-3 months (~30-60 min each) |
Series Recap
| Day | Post | Phase |
|---|---|---|
| 1 | Why Jellyfin Can’t Scale | Problem definition, strategy selection |
| 2 | Multi-Model Planning | AI planning workflow, gap analysis |
| 3 | PostgreSQL Database Provider | Phase 1: Fork, EF Core, Npgsql |
| 4 | Storage Migration | Phase 2: PostgreSQL deploy, volume restructure, data migration |
| 5 | State and Sessions | Phase 3: Redis leader election, sticky sessions, feature flags |
| 6 | Failover Testing | Phase 4: Scale to 2, six tests, Prometheus alerts |
| 7 | What’s Still Broken (this post) | Retrospective, limitations, future work |
Final Thoughts
This project proves that even architecturally single-process applications can be made highly available on Kubernetes with a pragmatic approach. You don’t need to solve every state problem — you need to solve enough of them that the user experience during failure is acceptable.
For a homelab media server, “re-authenticate and press Play” is an acceptable failure mode. It’s infinitely better than “wait 3 minutes for the pod to reschedule.”
The fork stays small. The maintenance stays manageable. And the family stops yelling from the living room when I need to patch a node.
Browse the code: The complete Jellyfin fork is at github.com/zolty-mat/jellyfin. The PostgreSQL provider, Redis leader election, and HA Dockerfile are all there. Star it if you’re thinking about doing something similar with your media server.
The infrastructure repo (Kubernetes manifests, Terraform, Ansible) at github.com/zolty-mat/home_k3s_cluster will be published once secrets remediation is complete. Watch the repo for updates.
Don’t have a homelab? This entire stack — Jellyfin, PostgreSQL, Redis, Traefik — can run on any Kubernetes cluster. A DigitalOcean managed cluster with $200 in free credits is enough to run the complete HA setup for several months.