TL;DR
Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state — sessions, users, devices, tasks — in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces mean two independent views of reality.
This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server.
The State Problem, Revisited
In Day 1, I cataloged every stateful singleton in Jellyfin 10.12.0. Here’s the same table, now with the resolution strategy:
| Manager | State | HA Impact | Resolution |
|---|---|---|---|
| SessionManager | Active user sessions | Critical | Sticky sessions + graceful re-auth |
| SessionManager | Live stream tracking | Critical | Sticky sessions (stream stays on one pod) |
| UserManager | User cache | High | PostgreSQL is source of truth; cache warms on startup |
| DeviceManager | Client capabilities | Medium | Sticky sessions (client always hits same pod) |
| QuickConnectManager | Pairing requests | High | Disabled in HA mode |
| QuickConnectManager | Authorized secrets | High | Disabled in HA mode |
| SyncPlayManager | Group state | Deferred | Too complex, low usage — single-pod only |
| TaskManager | Scheduled tasks | High | Leader election via Redis |
| ProviderManager | Metadata refresh progress | Medium | Sticky sessions (progress per pod) |
| TranscodeManager | FFmpeg jobs | Deferred | Sticky sessions (transcode is pod-local) |
| ChannelManager | Channel cache | Low | Regenerable — each pod builds its own |
Three categories emerged:
- Solved by sticky sessions — if the client always hits the same pod, the in-memory state stays consistent. No code changes needed.
- Solved by external coordination — `TaskManager` needs a leader election or distributed lock to prevent duplicate task execution.
- Deferred or disabled — `SyncPlayManager` is architecturally incompatible with multi-pod. `QuickConnectManager` has a small user base and can be disabled.
Track A vs. Track B
Back in the planning phase (Day 2), the multi-model review identified two escalation paths:
Track A: Sticky Sessions + Minimal Changes
Keep state in memory. Use Traefik’s cookie-based session affinity to pin clients to pods. Accept that pod death means clients re-authenticate and lose transient state (playback position is persisted to PostgreSQL, so “Continue Watching” survives). Coordinate only TaskManager via Redis.
Track B: Full Redis Externalization
Replace every ConcurrentDictionary with a Redis-backed implementation. Sessions survive pod death transparently. Any pod can serve any request. True stateless application.
The plan specified Track A as the implementation target, with Track B as an escalation if real-world testing showed unacceptable user experience.
Why Track A Won
| Factor | Track A | Track B |
|---|---|---|
| Code changes to Jellyfin | ~200 lines (TaskManager, feature flags) | ~2,000+ lines (6 managers rewritten) |
| Fork maintenance burden | Minimal — touches 3 files | Heavy — every upstream release risks merge conflicts |
| User-visible impact of pod death | Re-authenticate, click Play again | Seamless (maybe a brief pause) |
| User-visible frequency of pod death | Rare (node drain, crash) | Same |
| Development time | 1 phase | 3+ phases |
| Redis dependency | Lightweight (leader election only) | Critical path (every API request) |
For a homelab with <10 concurrent users, the difference between “click Play again after a pod crash” and “seamless failover” doesn’t justify tripling the fork’s maintenance surface. Track A it is.
Deploying Redis
Even with Track A, we need Redis for one essential function: distributed locking for TaskManager. Without it, both pods would fire library scans, image extraction, and plugin updates simultaneously.
Redis runs as a single-replica Deployment in the jellyfin namespace:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin-redis
  namespace: jellyfin
  labels:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: cache
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jellyfin
        app.kubernetes.io/component: cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 192Mi
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 3
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: jellyfin-redis
  namespace: jellyfin
spec:
  selector:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
  ports:
    - port: 6379
```
Why Not Redis Sentinel or Cluster?
Redis itself isn’t on the critical path for user requests. It only coordinates background tasks. If Redis goes down:
- Sticky sessions still work (Traefik doesn’t depend on Redis)
- User authentication still works (PostgreSQL)
- Playback still works (FFmpeg is pod-local)
- The only impact: both pods might run a library scan simultaneously until Redis recovers
For this failure mode, single-replica Redis with a restart policy is sufficient. Sentinel adds 3 more pods and operational complexity for a failure scenario that causes duplicate work, not data loss.
The TaskManager Problem
TaskManager is the most dangerous of the 11 stateful managers. It holds a ConcurrentQueue of scheduled tasks and fires them on a configurable interval. With two pods, every scheduled task runs twice:
- Library scan: both pods scan the same media directory, potentially writing duplicate metadata
- Image extraction: both pods extract the same thumbnails, wasting CPU
- Chapter detection: FFmpeg processes run on both pods for the same files
- Plugin updates: both pods try to download and install the same plugin update
Leader Election via Redis
The solution: only one pod runs scheduled tasks at a time. A simple Redis-based leader election:
```csharp
public class RedisTaskLeaderElection : ITaskLeaderElection
{
    private readonly IConnectionMultiplexer _redis;
    private readonly string _instanceId;
    private readonly TimeSpan _leaseTimeout = TimeSpan.FromSeconds(30);

    public RedisTaskLeaderElection(
        IConnectionMultiplexer redis,
        IServerApplicationHost appHost)
    {
        _redis = redis;
        _instanceId = appHost.SystemId;
    }

    /// <summary>
    /// Attempts to acquire the task leader lease.
    /// Returns true if this instance is the current leader.
    /// </summary>
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var acquired = await db.StringSetAsync(
            "jellyfin:task-leader",
            _instanceId,
            _leaseTimeout,
            When.NotExists);

        if (acquired)
        {
            return true;
        }

        // Check if we already hold the lease
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");
        return currentLeader == _instanceId;
    }

    /// <summary>
    /// Renews the leader lease. Called periodically by the leader.
    /// </summary>
    public async Task RenewLeaseAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");
        if (currentLeader == _instanceId)
        {
            await db.KeyExpireAsync(
                "jellyfin:task-leader",
                _leaseTimeout);
        }
    }
}
```
The pattern:
- Pod starts up and tries to set `jellyfin:task-leader` to its instance ID with a 30-second TTL
- `StringSetAsync` with `When.NotExists` ensures only the first pod wins
- The leader renews the lease every 15 seconds (half the TTL)
- If the leader pod dies, the lease expires after 30 seconds
- The surviving pod acquires leadership on its next attempt
- `TaskManager` checks `TryAcquireLeadershipAsync()` before executing any scheduled task
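The fork's implementation is C#, but the lease protocol itself is easy to sanity-check in isolation. Below is a minimal Python simulation of the same SET-NX-with-TTL dance; `FakeRedis`, the class names, and the shortened TTLs are illustrative stand-ins, not code from the fork:

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for Redis SET NX EX / GET / EXPIRE."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        self._data.pop(key, None)  # lazily drop expired keys
        return None

    def set_nx_ex(self, key, value, ttl):
        if self._live(key) is not None:
            return False  # NX: refuse to overwrite a live key
        self._data[key] = (value, time.monotonic() + ttl)
        return True

    def get(self, key):
        return self._live(key)

    def expire(self, key, ttl):
        if self._live(key) is not None:
            self._data[key] = (self._data[key][0], time.monotonic() + ttl)

class LeaderElection:
    def __init__(self, redis, instance_id, ttl):
        self.redis, self.instance_id, self.ttl = redis, instance_id, ttl

    def try_acquire(self):
        # First writer wins; a loser may still be the leader from a prior round.
        if self.redis.set_nx_ex("jellyfin:task-leader", self.instance_id, self.ttl):
            return True
        return self.redis.get("jellyfin:task-leader") == self.instance_id

    def renew(self):
        # Only the current leader may extend its own lease.
        if self.redis.get("jellyfin:task-leader") == self.instance_id:
            self.redis.expire("jellyfin:task-leader", self.ttl)

redis = FakeRedis()
pod_a = LeaderElection(redis, "pod-a", ttl=0.5)
pod_b = LeaderElection(redis, "pod-b", ttl=0.5)

print(pod_a.try_acquire())  # True: first pod wins the lease
print(pod_b.try_acquire())  # False: lease is already held
time.sleep(0.6)             # pod-a "dies" and stops renewing
print(pod_b.try_acquire())  # True: survivor takes over after expiry
```

The fast TTLs only exist so the failover is observable in a script; the real fork uses the 30-second lease described above.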
Graceful Fallback
What if Redis is down entirely? The leader election uses a circuit breaker:
```csharp
public class ResilientTaskLeaderElection : ITaskLeaderElection
{
    private readonly RedisTaskLeaderElection _inner;
    private readonly ILogger<ResilientTaskLeaderElection> _logger;
    private int _consecutiveFailures;
    private const int CircuitBreakerThreshold = 3;

    public ResilientTaskLeaderElection(
        RedisTaskLeaderElection inner,
        ILogger<ResilientTaskLeaderElection> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    /// <inheritdoc />
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        try
        {
            var result = await _inner.TryAcquireLeadershipAsync(
                cancellationToken);
            Interlocked.Exchange(ref _consecutiveFailures, 0);
            return result;
        }
        catch (RedisConnectionException ex)
        {
            var failures = Interlocked.Increment(
                ref _consecutiveFailures);

            if (failures >= CircuitBreakerThreshold)
            {
                _logger.LogWarning(
                    "Redis unavailable ({Failures} consecutive failures). "
                    + "Falling back to local task execution.",
                    failures);
                return true; // Allow local execution
            }

            _logger.LogDebug(ex,
                "Redis connection failed ({Failures}/{Threshold})",
                failures, CircuitBreakerThreshold);
            return false;
        }
    }
}
```
After 3 consecutive Redis failures, the circuit breaker opens and both pods run tasks locally — the same behavior as without Redis. This prevents Redis downtime from blocking all scheduled tasks. Duplicate work is preferable to no work.
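The fail-open behavior is simple enough to model outside C#. Here is a minimal Python sketch of the same counter-based breaker; the class and function names are hypothetical, and `ConnectionError` stands in for `RedisConnectionException`:

```python
class CircuitBreaker:
    """Fail-open breaker: after N consecutive failures, stop gating
    on the coordinator and allow local execution."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, try_acquire):
        try:
            result = try_acquire()
            self.failures = 0  # any success closes the circuit
            return result
        except ConnectionError:
            self.failures += 1
            # Circuit open: duplicate work beats no work
            return self.failures >= self.threshold

def redis_down():
    raise ConnectionError("redis unreachable")

cb = CircuitBreaker(threshold=3)
print([cb.call(redis_down) for _ in range(4)])  # [False, False, True, True]
```

Note the asymmetry: failures below the threshold return False (no task runs on this pod this tick), which is safe because the other pod may still be a healthy leader.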
Disabling QuickConnect
QuickConnect is Jellyfin’s device pairing feature: you generate a 6-digit code on a new device, then authorize it from an already-authenticated session. The code and the authorization live in two ConcurrentDictionary instances that never leave memory.
With sticky sessions, there’s a ~50% chance the authorization request routes to a different pod than the one holding the code. The pairing silently fails.
Options:
1. Externalize to Redis — store codes and authorizations in Redis. Track B territory.
2. Route to leader — use a shared Redis key to identify which pod generated the code. Complex.
3. Disable it — QuickConnect is a convenience feature. Users can authenticate with username/password instead.
We went with option 3. A feature flag in `system.xml`:

```xml
<QuickConnectAvailable>false</QuickConnectAvailable>
```
The Jellyfin UI hides the QuickConnect option when it’s disabled. No code change needed — just a config setting. If Track B is ever implemented, it can be re-enabled.
SyncPlay: The Feature We Couldn’t Fix
SyncPlay lets multiple users watch the same content in sync — like a virtual movie night. The state for this feature spans three interconnected ConcurrentDictionary instances:
| Dictionary | Purpose |
|---|---|
| `_groups` | Active SyncPlay groups |
| `_sessionToGroupMap` | Which session belongs to which group |
| `_groupToSessionsMap` | Which sessions belong to which group |
SyncPlay requires sub-second coordination between all members of a group. Playback commands (play, pause, seek) must propagate to all clients simultaneously. This is fundamentally incompatible with sticky sessions — if two users in the same group are pinned to different pods, their play/pause commands only affect their local pod’s state.
Externalizing SyncPlay would require:
- Real-time pub/sub between pods (Redis pub/sub or WebSocket bridge)
- Shared group state in Redis
- Cross-pod session lookup
- Sub-100ms roundtrip to avoid visible playback desync
This is Track B territory and then some. SyncPlay works fine on jellyfin-0 (the pod that happens to be the leader). Users who need SyncPlay can use it — they just need to be on the same pod, which sticky sessions handle naturally as long as they log in around the same time.
For a household of 3-4 users, this is an acceptable compromise. For a public server with hundreds of users, it wouldn’t be.
Configuring Traefik Sticky Sessions
The linchpin of Track A: Traefik’s cookie-based session affinity. Every client gets a cookie that pins them to a specific pod. As long as that pod is alive, all requests from that client go to the same place.
The configuration lives in the Jellyfin IngressRoute:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`jellyfin.k3s.internal.strommen.systems`)
      kind: Rule
      services:
        - name: jellyfin
          port: 8096
          sticky:
            cookie:
              name: jellyfin_server_id
              secure: true
              httpOnly: true
              sameSite: strict
  tls:
    certResolver: letsencrypt-prod
```
How It Works
- Client makes first request to `jellyfin.k3s.internal.strommen.systems`
- Traefik routes to one of the two pods (round-robin on first hit)
- Response includes `Set-Cookie: jellyfin_server_id=<pod-hash>; Secure; HttpOnly; SameSite=Strict`
- All subsequent requests from that client include the cookie
- Traefik reads the cookie and routes to the same pod
What Happens When a Pod Dies
- Traefik detects the pod is gone (health check fails within ~10 seconds)
- Client’s next request includes the cookie, but the target pod doesn’t exist
- Traefik falls back to round-robin and routes to the surviving pod
- New cookie is set for the surviving pod
- Client’s session is no longer in memory — Jellyfin returns 401
- Client re-authenticates (username/password, or saved credentials on mobile apps)
- New session is created on the surviving pod
- Client clicks Play to resume — playback position is loaded from PostgreSQL
Total user impact: one authentication prompt and one Play button press. For a home server, this is comparable to the existing experience when the single pod restarts during node maintenance.
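To make the failover path concrete, here is a small Python simulation of cookie affinity with round-robin fallback. It mimics the Traefik behavior described above under simplified assumptions (a static healthy-pod set, no real HTTP); `StickyRouter` is an illustrative name, not a Traefik API:

```python
import itertools

class StickyRouter:
    """Cookie-based affinity with round-robin fallback, mimicking
    how a load balancer reroutes when the pinned pod disappears."""

    def __init__(self, pods):
        self.healthy = set(pods)
        self._rr = itertools.cycle(sorted(pods))

    def route(self, cookie=None):
        # Honor the cookie only while its pod is still healthy.
        if cookie in self.healthy:
            return cookie
        # Fallback: round-robin over healthy pods.
        # The response would set a fresh cookie for the chosen pod.
        while True:
            pod = next(self._rr)
            if pod in self.healthy:
                return pod

router = StickyRouter(["jellyfin-0", "jellyfin-1"])
cookie = router.route()                  # first request: round-robin pick
assert router.route(cookie) == cookie    # subsequent requests stay pinned

router.healthy.discard(cookie)           # the pinned pod dies
survivor = router.route(cookie)          # stale cookie ignored, rerouted
print(survivor != cookie)                # True: client lands on the other pod
```

The Jellyfin-side consequence of that reroute (the 401 and re-auth) is what the numbered steps above walk through.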
WebSocket Considerations
Jellyfin uses WebSocket connections for real-time features: playback progress updates, server events, SyncPlay commands. WebSocket connections are inherently sticky — once established, they stay on the pod that accepted the upgrade.
If the WebSocket pod dies, the connection drops and the client reconnects. The Jellyfin client SDKs handle reconnection automatically. The only user-visible effect is a brief “Connecting…” indicator.
The Session Lifecycle
To understand why sticky sessions are sufficient, consider the full session lifecycle:
```
Client Login
    │
    ▼
POST /Users/AuthenticateByName
    ├── SessionManager creates SessionInfo in ConcurrentDictionary
    ├── Returns access token + server ID
    └── Traefik sets sticky cookie
    │
    ▼
All subsequent requests include:
    ├── Authorization: MediaBrowser Token="<access_token>"
    └── Cookie: jellyfin_server_id=<pod_hash>
    │
    ▼
SessionManager.GetSession() looks up token
    ├── Found → request proceeds
    └── Not found → 401 Unauthorized → client re-auths
```
The access token is the key. It’s generated on the pod that handled authentication and stored only in that pod’s SessionManager. A different pod has no record of it. Sticky sessions ensure the token always goes back to the pod that created it.
If the pod dies and the client hits the other pod, the token lookup fails. The client re-authenticates, gets a new token on the new pod, and continues. Playback state is in PostgreSQL, so “Continue Watching” and watched markers are preserved across the re-auth.
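A toy Python model makes the cross-pod token failure obvious. Each `Pod` below stands in for one replica's in-memory SessionManager; the names and bare status-code returns are illustrative, not Jellyfin's API:

```python
import secrets

class Pod:
    """Each replica keeps access tokens only in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # token -> username (in-memory only)

    def authenticate(self, user):
        token = secrets.token_hex(8)
        self.sessions[token] = user
        return token

    def handle(self, token):
        if token in self.sessions:
            return 200
        return 401  # unknown token: client must re-auth

pod_a, pod_b = Pod("jellyfin-0"), Pod("jellyfin-1")
token = pod_a.authenticate("alice")

print(pod_a.handle(token))  # 200: sticky session keeps alice on pod_a
print(pod_b.handle(token))  # 401: pod_b never saw this token

# pod_a dies; alice re-authenticates against the survivor
token = pod_b.authenticate("alice")
print(pod_b.handle(token))  # 200: new session; watch state comes from PostgreSQL
```

The 401 in the middle is exactly the failover moment: it costs one login prompt, nothing more, because everything durable already lives in PostgreSQL.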
What This Phase Changed in the Fork
The fork changes for Phase 3 were minimal — the whole point of Track A:
| File | Change | Lines |
|---|---|---|
| `Jellyfin.Server.Implementations/ITaskLeaderElection.cs` | New interface | 18 |
| `Jellyfin.Server.Implementations/RedisTaskLeaderElection.cs` | Redis leader election | 65 |
| `Jellyfin.Server.Implementations/ResilientTaskLeaderElection.cs` | Circuit breaker wrapper | 52 |
| `Jellyfin.Server/Startup.cs` | Register Redis + leader election in DI | 12 |
| `Directory.Packages.props` | StackExchange.Redis version | 1 |
Total: ~148 lines of new code. Compare this to Track B’s estimated ~2,000+ lines. The fork stays small, rebases stay clean.
Coming Up Next
Tomorrow: scaling to two replicas and failover testing — the moment of truth where we set replicas: 2, kill a pod mid-stream, and see if the surviving replica actually keeps serving traffic.
Browse the code: The leader election and circuit breaker implementations are in the Jellyfin fork at github.com/zolty-mat/jellyfin. The Redis and Traefik manifests are in the infrastructure repo (coming soon once secrets are remediated).
Cloud alternative: If you’d rather not run Redis yourself, DigitalOcean Managed Redis provides a drop-in replacement with automatic failover.