TL;DR

Jellyfin dies mid-stream when a Kubernetes pod restarts because all transcode state is in-memory. I forked it, added a Redis-backed ITranscodeSessionStore, and wired in atomic lease-based pod takeover. The fork is at github.com/ZoltyMat/jellyfin-ha, and I also published a repo-level diff document at docs/FORK-DIFF.md showing exactly what changed versus upstream Jellyfin. Single-instance deployments need zero config changes, because the fork transparently falls back to a no-op store.

The Problem

Jellyfin is great. It’s also built with the assumption that exactly one server instance is running at a time. Transcode state — which pods are running FFmpeg, what segments have been written, who owns a given play session — lives entirely in memory. When the process dies, that state is gone.

In a typical homelab setup, that’s fine. One machine, one Jellyfin process, restart takes 10 seconds, your stream buffers and recovers. But I run Jellyfin on k3s with Longhorn storage and I wanted rolling restarts, node drain, and pod rescheduling to work without dropping active streams.

Turns out, Kubernetes and “state lives entirely in one process’s RAM” are fundamentally at odds.

What I Actually Built

The fork adds three things:

1. ITranscodeSessionStore — a new interface

A thin abstraction for durable transcode session state. It has five methods:

  • TryGetAsync — retrieve a session by play session ID
  • TryTakeoverAsync — atomically claim a session whose lease has expired
  • SetAsync — persist a new or updated session
  • RenewLeaseAsync — extend a session’s TTL to signal the pod is still alive
  • DeleteAsync — remove a session when transcoding ends
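The interface itself is C#; to make the contract concrete, here is a minimal Python sketch of the same five-method shape, together with the no-op fallback described in part 3 below. The method names mirror the fork's interface; everything else (parameter names, the dict payload) is illustrative:

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class TranscodeSessionStore(Protocol):
    """Python rendering of the five-method ITranscodeSessionStore contract."""
    def try_get(self, play_session_id: str) -> Optional[dict]: ...
    def try_takeover(self, play_session_id: str, new_owner_pod: str,
                     lease_seconds: int) -> bool: ...
    def set(self, session: dict, lease_seconds: int) -> None: ...
    def renew_lease(self, play_session_id: str, lease_seconds: int) -> bool: ...
    def delete(self, play_session_id: str) -> None: ...

class NullTranscodeSessionStore:
    """No-op fallback: discards all writes, misses on every lookup."""
    def try_get(self, play_session_id: str) -> Optional[dict]:
        return None
    def try_takeover(self, play_session_id: str, new_owner_pod: str,
                     lease_seconds: int) -> bool:
        return False
    def set(self, session: dict, lease_seconds: int) -> None:
        pass
    def renew_lease(self, play_session_id: str, lease_seconds: int) -> bool:
        return False
    def delete(self, play_session_id: str) -> None:
        pass
```

The point of keeping the surface this small is that a caller never needs to know which implementation it got: the null store satisfies the same contract.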

2. RedisTranscodeSessionStore — the real implementation

Uses Redis for distributed storage with TTL-based leases. Each TranscodeSession record contains:

  • PlaySessionId — the Jellyfin stream identifier
  • OwnerPod — which pod holds the current lease
  • LeaseExpiresUtc — when the lease expires (if the pod stops renewing)
  • ManifestPath — path to the .m3u8 file on shared storage
  • SegmentPathPrefix — path prefix for .ts segment files
  • LastCompletedSegmentIndex — where to resume FFmpeg from after a takeover
  • LastDurablePlaybackOffset — playback position in ticks for client resume
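To make the record shape concrete, here is an illustrative sketch of what one of these records might look like on the wire. The field names mirror the list above; the sample values and the JSON encoding are my assumptions, not the fork's exact serialization:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TranscodeSession:
    """Illustrative stand-in for the fork's TranscodeSession record."""
    PlaySessionId: str
    OwnerPod: str
    LeaseExpiresUtc: int            # .NET ticks (100 ns units)
    ManifestPath: str
    SegmentPathPrefix: str
    LastCompletedSegmentIndex: int
    LastDurablePlaybackOffset: int  # playback position in ticks

session = TranscodeSession(
    PlaySessionId="a1b2c3",
    OwnerPod="jellyfin-7d9f-abcde",
    LeaseExpiresUtc=638_600_000_000_000_000,
    ManifestPath="/transcode/a1b2c3.m3u8",
    SegmentPathPrefix="/transcode/a1b2c3",
    LastCompletedSegmentIndex=42,
    LastDurablePlaybackOffset=12_000_000_000,  # 1200 s at 10M ticks/s
)

# Serialize as the Redis value would be, then round-trip it back.
payload = json.dumps(asdict(session))
restored = TranscodeSession(**json.loads(payload))
assert restored == session
```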

Takeover is atomic via a Lua script that checks lease expiry and updates ownership in a single Redis operation — no race conditions between concurrent pods trying to claim the same session.

3. NullTranscodeSessionStore — the fallback

If you don’t configure a Redis connection string, this no-op implementation is registered instead. It silently discards all calls and returns null on lookups. Single-instance setups run exactly as they did before — no behavioral change, no performance overhead.

Architecture

┌─────────────────────────────────────────┐
│           Kubernetes Cluster            │
│                                         │
│  ┌──────────────┐  ┌──────────────┐    │
│  │  Jellyfin    │  │  Jellyfin    │    │
│  │   Pod A      │  │   Pod B      │    │
│  │              │  │              │    │
│  │ ┌──────────┐ │  │ ┌──────────┐ │    │
│  │ │Transcode │ │  │ │Transcode │ │    │
│  │ │Manager   │ │  │ │Manager   │ │    │
│  │ └────┬─────┘ │  │ └────┬─────┘ │    │
│  └──────┼───────┘  └──────┼───────┘    │
│         │                 │             │
│         └────────┬────────┘             │
│                  │                      │
│          ┌───────▼───────┐              │
│          │    Redis      │              │
│          │  (sessions +  │              │
│          │   leases)     │              │
│          └───────────────┘              │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │       Shared NFS / Longhorn RWX  │   │
│  │    /transcode/*.m3u8, *.ts       │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘

Pod A starts an HLS transcode, writes a session to Redis with a 30-second lease, and renews it every 15 seconds. If Pod A is killed, the lease expires in Redis after 30 seconds. Pod B receives the next client request, calls TryTakeoverAsync, and the Lua script atomically verifies the lease is expired and transfers ownership. Pod B picks up the FFmpeg session from the last completed segment on shared storage.

The client sees a brief reopen of the manifest — typically a few seconds of buffer pause, not an error.
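Putting that failover flow in code form: a Python sketch of the per-request decision each pod makes, with a dict-backed test double standing in for Redis. The names and return values are illustrative, not the fork's:

```python
class DictStore:
    """Dict-backed test double for the Redis store (not atomic, demo only)."""
    def __init__(self):
        self.sessions = {}

    def try_get(self, sid):
        return self.sessions.get(sid)

    def try_takeover(self, sid, pod, lease_ticks, now_ticks):
        s = self.sessions.get(sid)
        if s is None or s["LeaseExpiresUtc"] > now_ticks:
            return False
        s["OwnerPod"] = pod
        s["LeaseExpiresUtc"] = now_ticks + lease_ticks
        return True

def handle_segment_request(store, sid, my_pod, lease_ticks, now_ticks):
    """What a pod decides when a client asks it for a segment."""
    session = store.try_get(sid)
    if session is None:
        return "start-new-transcode"          # nobody has this session yet
    if session["OwnerPod"] == my_pod:
        return "serve"                        # we hold the lease
    if store.try_takeover(sid, my_pod, lease_ticks, now_ticks):
        return "resume-from-last-segment"     # previous owner's lease lapsed
    return "wait"                             # another pod's lease is live
```

The interesting branch is the third one: takeover is only attempted after a lookup shows someone else as owner, and the store itself decides atomically whether the lease has actually lapsed.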

The Lua Takeover Script

The atomic takeover is the interesting part. Redis is single-threaded for script execution, so this is safe even with many pods racing:

-- KEYS[1]: session key; ARGV[1]: current time in .NET ticks
-- ARGV[2]: claiming pod's name; ARGV[3]: lease duration in ms
local raw = redis.call('GET', KEYS[1])
if not raw then return 0 end  -- session doesn't exist
local session = cjson.decode(raw)
local currentTicks = tonumber(ARGV[1])
if session['LeaseExpiresUtc'] > currentTicks then return 0 end  -- lease still valid
session['OwnerPod'] = ARGV[2]  -- transfer ownership
local leaseDurationMs = tonumber(ARGV[3])
local newTicks = currentTicks + (leaseDurationMs * 10000)  -- 1 ms = 10,000 ticks
session['LeaseExpiresUtc'] = newTicks
-- write back and reset the key's TTL in the same operation
redis.call('SET', KEYS[1], cjson.encode(session), 'PX', leaseDurationMs)
return 1

Returns 1 on successful takeover, 0 if the lease is still valid or the session doesn’t exist. The C# caller gets a bool and acts accordingly.
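For readers who don't speak Lua, here is the same decision logic restated in Python, with a plain dict standing in for Redis. This is purely illustrative: a dict obviously doesn't give you the single-threaded atomicity that makes the real script safe.

```python
TICKS_PER_MS = 10_000  # one .NET tick is 100 ns, so 10,000 ticks per ms

def try_takeover(store: dict, key: str, now_ticks: int,
                 new_owner: str, lease_ms: int) -> int:
    """Restatement of the Lua takeover script against a plain dict."""
    session = store.get(key)
    if session is None:
        return 0                                  # session doesn't exist
    if session["LeaseExpiresUtc"] > now_ticks:
        return 0                                  # lease still valid, refuse
    session["OwnerPod"] = new_owner               # transfer ownership
    session["LeaseExpiresUtc"] = now_ticks + lease_ms * TICKS_PER_MS
    store[key] = session                          # Lua's SET also resets the key TTL
    return 1
```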

Lease-Aware Cleanup

The other piece that needed fixing was DeleteTranscodeFileTask. In stock Jellyfin, this task sweeps the transcode temp directory and deletes files belonging to sessions that are no longer active. In a multi-pod setup, “no longer active in this pod’s memory” is not the same as “no longer active globally” — and blowing away segments another pod is actively streaming is a very bad time.

The HA version checks the Redis session store before deleting anything. If a session record exists and has a valid (non-expired) lease, cleanup is skipped for that session. Only sessions whose Redis entry is missing or whose lease has expired get their temp files removed.
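The guard reduces to a small predicate. A sketch, with the current time expressed in the same tick units as the lease field (the names here are mine, not the fork's):

```python
def safe_to_delete(session_record, now_ticks: int) -> bool:
    """Cleanup guard: a pod may remove a session's temp files only when
    no other pod can still be streaming them."""
    if session_record is None:
        return True                              # no pod has claimed it
    return session_record["LeaseExpiresUtc"] <= now_ticks  # lease lapsed

# A live session is protected; a missing or expired one is fair game.
assert safe_to_delete(None, 100) is True
assert safe_to_delete({"LeaseExpiresUtc": 200}, 100) is False
assert safe_to_delete({"LeaseExpiresUtc": 50}, 100) is True
```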

Configuration

Two config keys, both under Jellyfin:TranscodeStore:

  • RedisConnectionString (default: empty) — StackExchange.Redis connection string. Empty = single-instance no-op mode.
  • LeaseDurationSeconds (default: 30) — how long a pod’s lease is valid before another pod may take over.

Single instance — no config needed:

dotnet run --project Jellyfin.Server/Jellyfin.Server.csproj

HA mode — environment variables:

export Jellyfin__TranscodeStore__RedisConnectionString="redis:6379,abortConnect=false"
export Jellyfin__TranscodeStore__LeaseDurationSeconds="30"

dotnet run --project Jellyfin.Server/Jellyfin.Server.csproj

HA mode — Kubernetes deployment:

env:
  - name: Jellyfin__TranscodeStore__RedisConnectionString
    valueFrom:
      secretKeyRef:
        name: jellyfin-redis
        key: connection-string
  - name: Jellyfin__TranscodeStore__LeaseDurationSeconds
    value: "30"
  - name: JELLYFIN_HA_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name

If RedisConnectionString is non-empty and Redis is unreachable at startup, the server throws and refuses to start. I made this intentional — silently falling back to the no-op store in HA mode would give you the illusion of safety without the actual guarantee.
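The wiring amounts to a few lines. A Python sketch of the fail-fast behavior, where connect_redis is a hypothetical factory that raises when Redis is unreachable (all names illustrative):

```python
class NullTranscodeSessionStore:
    """No-op store used in single-instance mode."""

def build_session_store(connection_string: str, connect_redis):
    # Empty connection string: single-instance mode, Redis never touched.
    if not connection_string:
        return NullTranscodeSessionStore()
    # Non-empty: HA mode was explicitly requested, so an unreachable
    # Redis must be fatal. connect_redis raises rather than returning
    # a dummy, and the exception propagates to abort startup.
    return connect_redis(connection_string)
```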

PostgreSQL

The fork also includes an experimental PostgreSQL database provider under src/Jellyfin.Database/Jellyfin.Database.Providers.PostgreSQL. Jellyfin normally uses SQLite, which doesn’t work well when multiple pods share a single data directory. PostgreSQL lets you point all pods at the same database.

I’m treating this as experimental — the EF Core migration infrastructure is there, but it hasn’t had the same testing time as the Redis session store. SQLite is still the default and the choice I’d recommend unless you specifically need the shared-DB properties.

# Add a new migration for PostgreSQL
dotnet ef migrations add InitialCreate \
  --project "src/Jellyfin.Database/Jellyfin.Database.Providers.PostgreSQL" \
  -- --migration-provider Jellyfin-PostgreSQL

Building from Source

Requirements: .NET 10 SDK.

git clone https://github.com/ZoltyMat/jellyfin-ha.git
cd jellyfin-ha

# Build
dotnet build Jellyfin.Server/Jellyfin.Server.csproj

# Run all tests (excludes Docker-dependent and integration tests)
dotnet test Jellyfin.sln \
  --configuration Release \
  --filter "Category!=RequiresDocker&FullyQualifiedName!~Integration"

# Run HA-specific tests only
dotnet test tests/Jellyfin.Server.Implementations.Tests \
  --configuration Release \
  --filter "FullyQualifiedName~TranscodeSession"

Docker

Build the image yourself — the dotnet publish step runs outside Docker for I/O performance, then the Dockerfile.runtime wraps the published output:

# Publish first
dotnet publish Jellyfin.Server/Jellyfin.Server.csproj \
  --configuration Release \
  --runtime linux-x64 \
  --self-contained false \
  --output ./publish-output

# Build runtime image
docker build -f Dockerfile.runtime \
  --platform linux/amd64 \
  -t jellyfin-ha:latest .

What This Is and What It Isn’t

This is a personal experiment — I built it to solve a specific problem in my homelab. It’s not an officially maintained fork and it doesn’t have a release cadence. If you use it, you’re pulling from main and getting whatever is there.

The changes are deliberately minimal and isolated. No core Jellyfin logic was modified — only extended via existing DI extension points. If the upstream project ever decides to support HA transcoding natively, the hope is the design here is close enough to be useful input.

If you want to push this upstream, that conversation belongs at jellyfin/jellyfin. I’d be happy to see it absorbed.

What’s Next

A few things I haven’t finished:

  • HlsSessionRenewWorker — a background IHostedService that automatically renews leases for sessions this pod owns, instead of requiring callers to manually call RenewLeaseAsync. The interface is there, the background worker isn’t wired up yet.
  • Client-side resume after takeover — currently the client sees a manifest reopen and has to rescan from the last segment. Proper resume from LastDurablePlaybackOffset would make this seamless.
  • Prometheus metrics — lease takeover rate, session count per pod, and cleanup coordination events would make this much more observable.
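For the first item, the renew loop itself is simple; the open work is the hosted-service plumbing around it. A bounded Python sketch of the intended behavior (a real worker runs until shutdown and sleeps between passes; all names are illustrative):

```python
def renew_loop(owned_session_ids, renew_lease, passes: int,
               interval_s: int = 15) -> int:
    """Sketch of an HlsSessionRenewWorker-style pass loop.

    Every pass renews the lease on each session this pod owns, keeping
    renewal well inside the 30 s lease window. `passes` bounds the loop
    so the sketch terminates; returns the number of renewals performed.
    """
    renewed = 0
    for _ in range(passes):
        for session_id in owned_session_ids:
            if renew_lease(session_id):   # stands in for RenewLeaseAsync
                renewed += 1
        # A real worker would: await asyncio.sleep(interval_s)
    return renewed
```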

The repo is at github.com/ZoltyMat/jellyfin-ha if you want to poke around, open issues, or send a PR. If you want the quick “what exactly changed compared to upstream?” version, start with docs/FORK-DIFF.md.