TL;DR

Over the last six posts, I’ve documented converting Jellyfin from a single-process media server into a two-replica, PostgreSQL-backed, sticky-session-coordinated deployment on k3s. Five of six failover tests passed cleanly. The key result: killing a pod no longer takes down the service. Users on the surviving replica see no interruption; displaced users reconnect in seconds. Node maintenance no longer kills Jellyfin for the household.

But this project isn’t finished, and some problems can’t be solved with this architecture. This final post is an honest inventory of what’s still broken, what was deferred, and what the path forward looks like.


What Works

Before the problems, let’s acknowledge what the project delivered:

| Capability | Before | After |
| --- | --- | --- |
| Node maintenance | Jellyfin down for 2-3 minutes (pod reschedule) | Clients fail over to the second pod in ~12 seconds |
| Pod crash | Jellyfin down until StatefulSet restarts the pod | Surviving pod serves all traffic immediately |
| Database corruption risk | SQLite lock contention with any concurrent access | PostgreSQL handles concurrent connections natively |
| Playback state durability | In memory — lost on pod restart | PostgreSQL — survives any pod/node failure |
| Task coordination | N/A (single pod) | Redis leader election prevents duplicate execution |
| Monitoring | Basic pod health | Per-replica Prometheus metrics + HA-specific alerts |

For a homelab media server, this is a meaningful improvement. The family no longer notices when I drain a node for kernel updates.

Problem 1: Transcoding Is Not HA

This is the biggest limitation and it’s fundamental. FFmpeg transcoding is an OS process that writes HLS segments to local disk. When a pod dies mid-transcode:

  1. The FFmpeg process is killed
  2. The partially-written segments are lost (they’re on the pod’s Longhorn PVC)
  3. The client must start a new transcode on the surviving pod
  4. The new transcode starts from the last saved playback position, not from where the video was playing

Gap: there’s a 5-10 second delay while the new FFmpeg process starts, seeks to the correct position, and generates enough segments for the client to buffer.

Why it’s hard to fix: distributed transcoding would require a shared filesystem for HLS segments (NFS would be too slow for real-time 4K segments at multi-gigabit rates), a central coordinator to assign transcode work, and a way to hand off a running FFmpeg process between nodes. This is the domain of purpose-built streaming platforms (Plex’s Relay, Netflix’s encoding pipeline) — not something you bolt onto a single-process media server.

Mitigation: most playback in the household is direct play (the client can decode the format natively). Transcoding only happens when a client can’t handle the video codec or when streaming remotely over a limited connection. For direct play, failover is near-instant — no FFmpeg involved.

Problem 2: SyncPlay Is Pod-Local

SyncPlay — synchronized group playback — uses three interconnected ConcurrentDictionary instances and sub-second WebSocket coordination between group members. In Day 5, I deferred this entirely.

Current state: SyncPlay works on whichever pod the group creator is pinned to. If all group members happen to land on the same pod (likely with sticky sessions in a small household), it works fine. If they’re split across pods, commands only affect the local pod’s members.

Why it’s hard to fix: SyncPlay needs real-time pub/sub between pods. A play command on Pod A must reach Pod B’s WebSocket connections within ~100ms or visible desync occurs. Redis pub/sub could work, but every SyncPlay command would need to be serialized through Redis, and the existing SyncPlayManager would need a near-complete rewrite to support remote group membership.

Who it affects: households with 2+ people using SyncPlay. In practice the feature goes unused here — nobody has complained because nobody uses it.
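The cross-pod fan-out SyncPlay would need is easier to see in miniature. In this sketch an in-memory bus stands in for Redis pub/sub, and the names (`SyncPlayPod`, the `syncplay:<group_id>` channel) are illustrative, not Jellyfin's real types:

```python
import json
from collections import defaultdict

class Bus:
    """In-memory stand-in for Redis pub/sub. The real version would use
    redis-py's pubsub on a 'syncplay:<group_id>' channel."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, handler):
        self.subscribers[channel].append(handler)

    def publish(self, channel, message):
        for handler in self.subscribers[channel]:
            handler(message)

class SyncPlayPod:
    """One Jellyfin pod: relays commands it originates onto the shared bus,
    and applies bus messages to its local group members."""
    def __init__(self, name, bus, group_id):
        self.name = name
        self.bus = bus
        self.channel = f"syncplay:{group_id}"
        self.received = []  # stands in for pushes to local WebSocket clients
        bus.subscribe(self.channel, self._on_bus_message)

    def _on_bus_message(self, raw):
        # Fan the command out to this pod's local WebSocket connections
        self.received.append(json.loads(raw))

    def send_command(self, command, position_ticks):
        # Publish instead of mutating local state directly, so every pod
        # (including this one) applies the command the same way.
        self.bus.publish(self.channel, json.dumps(
            {"cmd": command, "pos": position_ticks, "origin": self.name}))

bus = Bus()
pod_a = SyncPlayPod("pod-a", bus, "movie-night")
pod_b = SyncPlayPod("pod-b", bus, "movie-night")
pod_a.send_command("Pause", 123456789)
print(pod_b.received[0]["cmd"])  # Pause
```

The hard part isn't the pattern; it's the rewrite of SyncPlayManager so that group membership itself lives in shared state rather than in each pod's dictionaries.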

Problem 3: QuickConnect Is Disabled

QuickConnect lets new devices pair by entering a code. The codes live in memory and can’t survive cross-pod routing. We disabled it entirely in Day 5.

Impact: users must authenticate with username/password instead of a 6-digit code. On a Smart TV with an on-screen keyboard, this is annoying. On a phone with saved credentials, it’s invisible.

Fix path: Track B (Redis-backed sessions) would solve this trivially — store pending codes in Redis instead of ConcurrentDictionary. If I ever implement Track B, QuickConnect is a 30-minute fix.

Problem 4: The Fork Rebase Burden

The Jellyfin fork adds ~500 lines across ~10 files:

| Category | Files | Lines |
| --- | --- | --- |
| Database provider interface + PostgreSQL implementation | 4 | ~200 |
| EF Core PostgreSQL migrations | 2 | ~150 |
| Task leader election + circuit breaker | 3 | ~148 |
| DI registration, config additions | 2 | ~20 |

This is a small fork, but it’s a fork. Every upstream Jellyfin release needs a rebase. The risk areas:

  • JellyfinDbContext.cs — upstream adds a new DbSet, I need to add it to the PostgreSQL migration
  • Startup.cs — upstream changes DI registration order or adds new services
  • TaskManager.cs — upstream changes the task scheduling interface (unlikely but possible)
  • Directory.Packages.props — upstream upgrades EF Core, which may break Npgsql compatibility

Historically, Jellyfin releases every 2-3 months. Each rebase takes 30-60 minutes if the diff is clean, longer if there are migration conflicts. This is ongoing maintenance, not a one-time cost.

Mitigation: the fork was designed to be minimal. The IJellyfinDatabaseProvider interface isolates all PostgreSQL-specific code. If upstream ever adds a pluggable database layer (there’s been discussion in the issue tracker), the fork can be retired.

Problem 5: Test Coverage Gaps

The project ships with:

  • Integration tests for the PostgreSQL provider (Testcontainers, run in CI)
  • Failover tests (manual, documented procedures from Day 6)
  • Leader election unit tests (Redis mock, verify lease acquisition and circuit breaker)

What’s missing:

| Gap | Risk |
| --- | --- |
| Automated failover test suite | Manual tests are time-consuming and may not be re-run after future changes |
| Load testing under failover | Unknown: what happens with 10 concurrent streams during pod death? |
| Long-running soak test | Memory leaks in the leader election loop? Redis connection pool exhaustion? |
| Migration tool edge cases | Empty strings vs NULL, Unicode in metadata, very large libraries (50k+ items) |

For a homelab project, manual testing is acceptable. For production, I’d want a CI pipeline that spins up a test cluster, deploys the HA stack, runs automated failover scenarios, and gates the release.

Problem 6: Monitoring Blind Spots

The Prometheus alerts from Day 6 cover the basics: replica count, PostgreSQL availability, Redis availability, restart rate. But there are gaps:

| Blind Spot | Why It Matters |
| --- | --- |
| Sticky session distribution | No metric for how clients are distributed across pods. One pod could be serving 90% of traffic. |
| Session re-auth rate | No metric tracking how often clients re-authenticate. A spike indicates failover or sticky session misconfiguration. |
| PostgreSQL connection pool | No alert for approaching pool exhaustion. Npgsql's pool defaults to 100 connections per pod — two pods could theoretically exhaust PostgreSQL's default max_connections of 100. |
| NFS config latency | Shared config is on NFS over the network. If NFS degrades, both pods feel it. No latency metric. |

These are nice-to-have observability features, not critical gaps. The cluster’s existing Prometheus infrastructure (node-exporter, kube-state-metrics) catches the broad failures. The Jellyfin-specific gaps would matter more at scale.
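For the connection-pool blind spot specifically, there is a cheap guardrail: cap each pod's pool in the connection string so two replicas can't exhaust PostgreSQL's default max_connections of 100. This assumes the fork uses a standard Npgsql connection string (its keyword is `Maximum Pool Size`, default 100); the values below are hypothetical:

```
Server=postgres.jellyfin.svc.cluster.local;Database=jellyfin;Username=jellyfin;Password=<from-secret>;Maximum Pool Size=40
```

With two replicas that bounds combined demand at 80 connections, leaving headroom for the migration tool, backups, and ad-hoc psql sessions.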

Problem 7: Documentation Debt

The project produced:

  • This 7-part blog series (you’re reading it)
  • The execution plan in the infra repo (docs/jellyfin-ha-plan.md)
  • Inline code comments in the fork

What’s missing:

  • Operator runbook — step-by-step procedures for common operations (scale up, scale down, PostgreSQL backup/restore, force leader election, disable HA and revert to single pod)
  • Architecture decision records — the Track A vs Track B decision, PostgreSQL vs MySQL, Redis vs etcd should be formally documented for future-me
  • Disaster recovery playbook — what happens if both pods die and PostgreSQL has a corrupt volume? The S3 backups exist but the restore procedure isn’t documented.

This is technical debt that compounds over time. Future-me will forget why decisions were made and may repeat failed experiments.

What Track B Would Look Like

If I ever revisit this project for full stateless HA:

| Component | Track A (Current) | Track B (Future) |
| --- | --- | --- |
| Sessions | Pod-local, sticky cookie | Redis-backed IDistributedCache |
| User cache | Pod-local, PostgreSQL fallback | Redis, all pods share |
| Device capabilities | Pod-local | Redis |
| QuickConnect | Disabled | Redis-backed |
| SyncPlay | Pod-local | Redis pub/sub + WebSocket bridge |
| Task coordination | Redis leader election | Redis distributed lock (unchanged) |
| Transcoding | Pod-local (sticky) | Pod-local (sticky) — no change possible |

Estimated work:

  • Sessions in Redis: ~400 lines. Replace ConcurrentDictionary<string, SessionInfo> with IDistributedCache. Serialize SessionInfo to JSON. Add cache invalidation on logout.
  • User cache in Redis: ~200 lines. Change UserManager to read-through cache backed by Redis + PostgreSQL.
  • QuickConnect in Redis: ~100 lines. Move codes and authorizations to Redis keys with TTL.
  • SyncPlay pub/sub: ~800+ lines. Major redesign. Redis pub/sub for group commands, shared group state, cross-pod WebSocket notifications.

Total: ~1,500+ lines of Jellyfin fork changes (vs ~500 today). The fork rebase burden triples.
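The sessions item is the most mechanical of the four. A rough Python stand-in for the pattern (JSON-serialized session state in a shared store; the field names are illustrative, not Jellyfin's actual SessionInfo shape):

```python
import json

class DistributedSessionStore:
    """Sketch of the Track B session pattern: serialize session state to
    JSON and keep it in a shared backend (a dict here; Redis via
    IDistributedCache in the real fork)."""
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def save(self, session_id, session):
        # JSON instead of in-process objects, so any pod can deserialize it
        self.backend[f"session:{session_id}"] = json.dumps(session)

    def load(self, session_id):
        raw = self.backend.get(f"session:{session_id}")
        return json.loads(raw) if raw is not None else None

    def invalidate(self, session_id):
        # Called on logout so no pod serves a stale session
        self.backend.pop(f"session:{session_id}", None)

store = DistributedSessionStore()
store.save("abc123", {"user": "alice", "device": "tv", "position_ticks": 42})
print(store.load("abc123")["user"])  # alice: any replica sees the session
store.invalidate("abc123")
print(store.load("abc123"))          # None
```

The ~400-line estimate is mostly not this core; it's plumbing the pattern through every call site that currently assumes an in-process object.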

The trigger for Track B would be: consistently hitting sticky session problems with >10 concurrent users, or the SyncPlay limitation becoming unacceptable.

Standing on Shoulders

This project wouldn’t exist without two pieces of prior work:

  1. pseudopseudonym’s HA Jellyfin setup — proved that Jellyfin could survive node failures on Kubernetes, identified the Rook PVC release bottleneck, and built the rook-push tool that made Ceph failover practical. Their r/selfhosted post surfaced the key question from the community: “How do you synchronize the database between jellyfin instances?” — the top comment, and the exact question this series answers.

  2. The Hacker News SQLite concurrency discussion — crystallized the architectural argument for PostgreSQL over SQLite, and highlighted that the Jellyfin 10.11 EF Core refactor finally made a pluggable database layer feasible.

The gap between “Jellyfin can fail over in 2.5 minutes” and “Jellyfin runs active-active with 12-second failover” is entirely about solving the database and state problems that these discussions identified.

What I’d Do Differently

Looking back at the project with fresh eyes:

1. Start with the migration tool

I built the SQLite-to-PostgreSQL migration tool as part of Phase 2. In hindsight, I should have built it first and run it on a copy of my data before touching any Jellyfin code. The migration is the highest-risk step — if the tool has bugs, you discover them after you’ve already deployed PostgreSQL and can’t easily revert.
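A minimal pre-flight check along those lines: compare per-table row counts between the SQLite copy and the migrated database before cutting over. This demo runs against an in-memory SQLite database with a simulated target; the real check would point at both live databases (the same SQL works on a psycopg 3 connection):

```python
import sqlite3

def table_row_counts(conn, tables):
    """Count rows per table. Plain SQL, so it works for both sqlite3
    and a PostgreSQL connection object that supports .execute()."""
    counts = {}
    for table in tables:
        cur = conn.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts

def diff_counts(source, target):
    """Return tables whose row counts differ between source and target."""
    return {t: (source[t], target.get(t)) for t in source
            if source[t] != target.get(t)}

# Demo with a throwaway SQLite database and a simulated migration result
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE Users (id INTEGER)")
src.executemany("INSERT INTO Users VALUES (?)", [(1,), (2,), (3,)])
source_counts = table_row_counts(src, ["Users"])
target_counts = {"Users": 2}  # simulated: migration silently dropped a row
print(diff_counts(source_counts, target_counts))  # {'Users': (3, 2)}
```

Row counts won't catch the NULL-vs-empty-string or Unicode edge cases from the test-coverage table, but they catch the catastrophic failures cheaply.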

2. Test NFS config performance earlier

NFS config works well in practice, but I didn’t benchmark it until after deployment. If NFS had been too slow for Jellyfin’s config reads (which happen on every startup and settings change), I’d have needed to redesign the shared config approach.

3. Document before implementing

The blog series is being written after the project completed. Writing the architecture docs first (even as rough drafts) would have forced me to articulate assumptions that the AI agents silently resolved — some correctly, some not.

4. Automate the failover tests from day one

Manual failover testing is error-prone and won’t be repeated regularly. A simple shell script that runs the six tests and validates the expected outcomes would pay for itself on the first rebase.
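The core measurement such a script needs is small: the time from the first failed health check to the first successful one. In this sketch the probe is injectable so the logic runs without a cluster; in a real test it would be an HTTP GET against Jellyfin's health endpoint, polled while `kubectl delete pod` runs.

```python
def measure_downtime(probe, timeline):
    """Walk a sequence of timestamps, calling probe(t) at each.
    Returns (first_failure, recovery), or None if the service
    never went down during the timeline."""
    down_at = None
    for t in timeline:
        healthy = probe(t)
        if not healthy and down_at is None:
            down_at = t                 # outage begins
        elif healthy and down_at is not None:
            return (down_at, t)         # outage ends
    return None

# Simulated outage: unhealthy between t=5 and t=17 (a ~12s failover)
outage = lambda t: not (5 <= t < 17)
result = measure_downtime(outage, range(0, 30))
print(result)                                         # (5, 17)
print(result[1] - result[0], "seconds of downtime")   # 12 seconds
```

Wrap this in a loop over the six documented scenarios, assert each measured window is under the expected bound, and the rebase-time regression check becomes one command.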

The Numbers

| Metric | Value |
| --- | --- |
| Phases | 4 |
| GitHub Issues | 20 |
| PRs merged | 22 |
| Fork lines added | ~500 |
| Kubernetes manifest lines | ~350 |
| AI agent (Copilot) PRs | ~13 (60% of total) |
| Manual implementation PRs | ~9 (40% of total) |
| Failover test pass rate | 5/6 (83%) |
| Total disruption on pod failure | ~12-20 seconds (down from 2-3 minutes) |
| New infrastructure components | PostgreSQL StatefulSet, Redis Deployment, NFS share |
| Ongoing maintenance | Upstream rebase every 2-3 months (~30-60 min each) |

Series Recap

| Day | Post | Phase |
| --- | --- | --- |
| 1 | Why Jellyfin Can’t Scale | Problem definition, strategy selection |
| 2 | Multi-Model Planning | AI planning workflow, gap analysis |
| 3 | PostgreSQL Database Provider | Phase 1: Fork, EF Core, Npgsql |
| 4 | Storage Migration | Phase 2: PostgreSQL deploy, volume restructure, data migration |
| 5 | State and Sessions | Phase 3: Redis leader election, sticky sessions, feature flags |
| 6 | Failover Testing | Phase 4: Scale to 2, six tests, Prometheus alerts |
| 7 | What’s Still Broken (this post) | Retrospective, limitations, future work |

Final Thoughts

This project shows that even an application designed around a single process can be made highly available on Kubernetes with a pragmatic approach. You don’t need to solve every state problem — you need to solve enough of them that the user experience during failure is acceptable.

For a homelab media server, “re-authenticate and press Play” is an acceptable failure mode. It’s infinitely better than “wait 3 minutes for the pod to reschedule.”

The fork stays small. The maintenance stays manageable. And the family stops yelling from the living room when I need to patch a node.

Browse the code: The complete Jellyfin fork is at github.com/zolty-mat/jellyfin. The PostgreSQL provider, Redis leader election, and HA Dockerfile are all there. Star it if you’re thinking about doing something similar with your media server.

The infrastructure repo (Kubernetes manifests, Terraform, Ansible) at github.com/zolty-mat/home_k3s_cluster will be published once secrets remediation is complete. Watch the repo for updates.

Don’t have a homelab? This entire stack — Jellyfin, PostgreSQL, Redis, Traefik — can run on any Kubernetes cluster. A DigitalOcean managed cluster with $200 in free credits is enough to run the complete HA setup for several months.