TL;DR
Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state — sessions, users, devices, tasks — in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces mean two independent views of reality.
This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server.
The State Problem, Revisited
In Day 1, I cataloged every stateful singleton in Jellyfin 10.12.0. Here’s the same table, now with the resolution strategy:
| Manager | State | HA Impact | Resolution |
|---|---|---|---|
| SessionManager | Active user sessions | Critical | Sticky sessions + graceful re-auth |
| SessionManager | Live stream tracking | Critical | Sticky sessions (stream stays on one pod) |
| UserManager | User cache | High | PostgreSQL is source of truth; cache warms on startup |
| DeviceManager | Client capabilities | Medium | Sticky sessions (client always hits same pod) |
| QuickConnectManager | Pairing requests | High | Disabled in HA mode |
| QuickConnectManager | Authorized secrets | High | Disabled in HA mode |
| SyncPlayManager | Group state | Deferred | Too complex, low usage — single-pod only |
| TaskManager | Scheduled tasks | High | Leader election via Redis |
| ProviderManager | Metadata refresh progress | Medium | Sticky sessions (progress per pod) |
| TranscodeManager | FFmpeg jobs | Deferred | Sticky sessions (transcode is pod-local) |
| ChannelManager | Channel cache | Low | Regenerable — each pod builds its own |
Three categories emerged:
- Solved by sticky sessions — if the client always hits the same pod, the in-memory state stays consistent. No code changes needed.
- Solved by external coordination — `TaskManager` needs a leader election or distributed lock to prevent duplicate task execution.
- Deferred or disabled — `SyncPlayManager` is architecturally incompatible with multi-pod. `QuickConnectManager` has a small user base and can be disabled.
Track A vs. Track B
Back in the planning phase (Day 2), the multi-model review identified two escalation paths:
Track A: Sticky Sessions + Minimal Changes
Keep state in memory. Use Traefik’s cookie-based session affinity to pin clients to pods. Accept that pod death means clients re-authenticate and lose transient state (playback position is persisted to PostgreSQL, so “Continue Watching” survives). Coordinate only TaskManager via Redis.
Track B: Full Redis Externalization
Replace every ConcurrentDictionary with a Redis-backed implementation. Sessions survive pod death transparently. Any pod can serve any request. True stateless application.
The plan specified Track A as the implementation target, with Track B as an escalation if real-world testing showed unacceptable user experience.
Why Track A Won
| Factor | Track A | Track B |
|---|---|---|
| Code changes to Jellyfin | ~200 lines (TaskManager, feature flags) | ~2,000+ lines (6 managers rewritten) |
| Fork maintenance burden | Minimal — touches 3 files | Heavy — every upstream release risks merge conflicts |
| User-visible impact of pod death | Re-authenticate, click Play again | Seamless (maybe a brief pause) |
| User-visible frequency of pod death | Rare (node drain, crash) | Same |
| Development time | 1 phase | 3+ phases |
| Redis dependency | Lightweight (leader election only) | Critical path (every API request) |
For a homelab with <10 concurrent users, the difference between “click Play again after a pod crash” and “seamless failover” doesn’t justify tripling the fork’s maintenance surface. Track A it is.
Deploying Redis
Even with Track A, we need Redis for one essential function: distributed locking for TaskManager. Without it, both pods would fire library scans, image extraction, and plugin updates simultaneously.
Redis runs as a single-replica Deployment in the jellyfin namespace:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin-redis
  namespace: jellyfin
  labels:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: cache
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jellyfin
        app.kubernetes.io/component: cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 192Mi
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 3
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: jellyfin-redis
  namespace: jellyfin
spec:
  selector:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
  ports:
    - port: 6379
```
Why Not Redis Sentinel or Cluster?
Redis itself isn’t on the critical path for user requests. It only coordinates background tasks. If Redis goes down:
- Sticky sessions still work (Traefik doesn’t depend on Redis)
- User authentication still works (PostgreSQL)
- Playback still works (FFmpeg is pod-local)
- The only impact: both pods might run a library scan simultaneously until Redis recovers
For this failure mode, single-replica Redis with a restart policy is sufficient. Sentinel adds 3 more pods and operational complexity for a failure scenario that causes duplicate work, not data loss.
The TaskManager Problem
TaskManager is the most dangerous of the 11 stateful managers. It holds a ConcurrentQueue of scheduled tasks and fires them on a configurable interval. With two pods, every scheduled task runs twice:
- Library scan: both pods scan the same media directory, potentially writing duplicate metadata
- Image extraction: both pods extract the same thumbnails, wasting CPU
- Chapter detection: FFmpeg processes run on both pods for the same files
- Plugin updates: both pods try to download and install the same plugin update
Leader Election via Redis
The solution: only one pod runs scheduled tasks at a time. A simple Redis-based leader election:
```csharp
public class RedisTaskLeaderElection : ITaskLeaderElection
{
    private readonly IConnectionMultiplexer _redis;
    private readonly string _instanceId;
    private readonly TimeSpan _leaseTimeout = TimeSpan.FromSeconds(30);

    public RedisTaskLeaderElection(
        IConnectionMultiplexer redis,
        IServerApplicationHost appHost)
    {
        _redis = redis;
        _instanceId = appHost.SystemId;
    }

    /// <summary>
    /// Attempts to acquire the task leader lease.
    /// Returns true if this instance is the current leader.
    /// </summary>
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var acquired = await db.StringSetAsync(
            "jellyfin:task-leader",
            _instanceId,
            _leaseTimeout,
            When.NotExists);

        if (acquired)
        {
            return true;
        }

        // Check if we already hold the lease
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");
        return currentLeader == _instanceId;
    }

    /// <summary>
    /// Renews the leader lease. Called periodically by the leader.
    /// </summary>
    public async Task RenewLeaseAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");
        if (currentLeader == _instanceId)
        {
            await db.KeyExpireAsync(
                "jellyfin:task-leader",
                _leaseTimeout);
        }
    }
}
```
The pattern:
- Pod starts up and tries to set `jellyfin:task-leader` to its instance ID with a 30-second TTL
- `StringSetAsync` with `When.NotExists` ensures only the first pod wins
- The leader renews the lease every 15 seconds (half the TTL)
- If the leader pod dies, the lease expires after 30 seconds
- The surviving pod acquires leadership on its next attempt
- `TaskManager` checks `TryAcquireLeadershipAsync()` before executing any scheduled task
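The fork's implementation is C#, but the lease protocol itself is easy to sanity-check in isolation. Below is a minimal Python simulation of the same SET-NX-with-TTL dance; `FakeRedis`, the class names, and the shortened TTLs are illustrative stand-ins, not code from the fork:

```python
import time

class FakeRedis:
    """Tiny in-memory stand-in for Redis SET NX EX / GET / EXPIRE."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        self._data.pop(key, None)  # lazily drop expired keys
        return None

    def set_nx_ex(self, key, value, ttl):
        if self._live(key) is not None:
            return False  # NX: refuse to overwrite a live key
        self._data[key] = (value, time.monotonic() + ttl)
        return True

    def get(self, key):
        return self._live(key)

    def expire(self, key, ttl):
        if self._live(key) is not None:
            self._data[key] = (self._data[key][0], time.monotonic() + ttl)

class LeaderElection:
    def __init__(self, redis, instance_id, ttl):
        self.redis, self.instance_id, self.ttl = redis, instance_id, ttl

    def try_acquire(self):
        # First writer wins; a loser may still be the leader from a prior round.
        if self.redis.set_nx_ex("jellyfin:task-leader", self.instance_id, self.ttl):
            return True
        return self.redis.get("jellyfin:task-leader") == self.instance_id

    def renew(self):
        # Only the current leader may extend its own lease.
        if self.redis.get("jellyfin:task-leader") == self.instance_id:
            self.redis.expire("jellyfin:task-leader", self.ttl)

redis = FakeRedis()
pod_a = LeaderElection(redis, "pod-a", ttl=0.5)
pod_b = LeaderElection(redis, "pod-b", ttl=0.5)

print(pod_a.try_acquire())  # True: first pod wins the lease
print(pod_b.try_acquire())  # False: lease is already held
time.sleep(0.6)             # pod-a "dies" and stops renewing
print(pod_b.try_acquire())  # True: survivor takes over after expiry
```

The fast TTLs only exist so the failover is observable in a script; the real fork uses the 30-second lease described above.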
Graceful Fallback
What if Redis is down entirely? The leader election uses a circuit breaker:
```csharp
public class ResilientTaskLeaderElection : ITaskLeaderElection
{
    private readonly RedisTaskLeaderElection _inner;
    private readonly ILogger<ResilientTaskLeaderElection> _logger;
    private int _consecutiveFailures;
    private const int CircuitBreakerThreshold = 3;

    public ResilientTaskLeaderElection(
        RedisTaskLeaderElection inner,
        ILogger<ResilientTaskLeaderElection> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    /// <inheritdoc />
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        try
        {
            var result = await _inner.TryAcquireLeadershipAsync(
                cancellationToken);
            Interlocked.Exchange(ref _consecutiveFailures, 0);
            return result;
        }
        catch (RedisConnectionException ex)
        {
            var failures = Interlocked.Increment(
                ref _consecutiveFailures);

            if (failures >= CircuitBreakerThreshold)
            {
                _logger.LogWarning(
                    "Redis unavailable ({Failures} consecutive failures). "
                    + "Falling back to local task execution.",
                    failures);
                return true; // Allow local execution
            }

            _logger.LogDebug(ex,
                "Redis connection failed ({Failures}/{Threshold})",
                failures, CircuitBreakerThreshold);
            return false;
        }
    }
}
```
After 3 consecutive Redis failures, the circuit breaker opens and both pods run tasks locally — the same behavior as without Redis. This prevents Redis downtime from blocking all scheduled tasks. Duplicate work is preferable to no work.
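The fail-open behavior is simple enough to model outside C#. Here is a minimal Python sketch of the same counter-based breaker; the class and function names are hypothetical, and `ConnectionError` stands in for `RedisConnectionException`:

```python
class CircuitBreaker:
    """Fail-open breaker: after N consecutive failures, stop gating
    on the coordinator and allow local execution."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, try_acquire):
        try:
            result = try_acquire()
            self.failures = 0  # any success closes the circuit
            return result
        except ConnectionError:
            self.failures += 1
            # Circuit open: duplicate work beats no work
            return self.failures >= self.threshold

def redis_down():
    raise ConnectionError("redis unreachable")

cb = CircuitBreaker(threshold=3)
print([cb.call(redis_down) for _ in range(4)])  # [False, False, True, True]
```

Note the asymmetry: failures below the threshold return False (no task runs on this pod this tick), which is safe because the other pod may still be a healthy leader.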
Disabling QuickConnect
QuickConnect is Jellyfin’s device pairing feature: you generate a 6-digit code on a new device, then authorize it from an already-authenticated session. The code and the authorization live in two ConcurrentDictionary instances that never leave memory.
With sticky sessions, there’s a ~50% chance the authorization request routes to a different pod than the one holding the code. The pairing silently fails.
Options:
1. Externalize to Redis — store codes and authorizations in Redis. Track B territory.
2. Route to leader — use a shared Redis key to identify which pod generated the code. Complex.
3. Disable it — QuickConnect is a convenience feature. Users can authenticate with username/password instead.
We went with option 3. A feature flag in `system.xml`:

```xml
<QuickConnectAvailable>false</QuickConnectAvailable>
```
The Jellyfin UI hides the QuickConnect option when it’s disabled. No code change needed — just a config setting. If Track B is ever implemented, it can be re-enabled.
SyncPlay: The Feature We Couldn’t Fix
SyncPlay lets multiple users watch the same content in sync — like a virtual movie night. The state for this feature spans three interconnected ConcurrentDictionary instances:
| Dictionary | Purpose |
|---|---|
| `_groups` | Active SyncPlay groups |
| `_sessionToGroupMap` | Which session belongs to which group |
| `_groupToSessionsMap` | Which sessions belong to which group |
SyncPlay requires sub-second coordination between all members of a group. Playback commands (play, pause, seek) must propagate to all clients simultaneously. This is fundamentally incompatible with sticky sessions — if two users in the same group are pinned to different pods, their play/pause commands only affect their local pod’s state.
Externalizing SyncPlay would require:
- Real-time pub/sub between pods (Redis pub/sub or WebSocket bridge)
- Shared group state in Redis
- Cross-pod session lookup
- Sub-100ms roundtrip to avoid visible playback desync
This is Track B territory and then some. SyncPlay works fine on jellyfin-0 (the pod that happens to be the leader). Users who need SyncPlay can use it — they just need to be on the same pod, which sticky sessions handle naturally as long as they log in around the same time.
For a household of 3-4 users, this is an acceptable compromise. For a public server with hundreds of users, it wouldn’t be.
Configuring Traefik Sticky Sessions
The linchpin of Track A: Traefik’s cookie-based session affinity. Every client gets a cookie that pins them to a specific pod. As long as that pod is alive, all requests from that client go to the same place.
The configuration lives in the Jellyfin IngressRoute:
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`jellyfin.k3s.internal.strommen.systems`)
      kind: Rule
      services:
        - name: jellyfin
          port: 8096
          sticky:
            cookie:
              name: jellyfin_server_id
              secure: true
              httpOnly: true
              sameSite: strict
  tls:
    certResolver: letsencrypt-prod
```
How It Works
- Client makes first request to `jellyfin.k3s.internal.strommen.systems`
- Traefik routes to one of the two pods (round-robin on first hit)
- Response includes `Set-Cookie: jellyfin_server_id=<pod-hash>; Secure; HttpOnly; SameSite=Strict`
- All subsequent requests from that client include the cookie
- Traefik reads the cookie and routes to the same pod
What Happens When a Pod Dies
- Traefik detects the pod is gone (health check fails within ~10 seconds)
- Client’s next request includes the cookie, but the target pod doesn’t exist
- Traefik falls back to round-robin and routes to the surviving pod
- New cookie is set for the surviving pod
- Client’s session is no longer in memory — Jellyfin returns 401
- Client re-authenticates (username/password, or saved credentials on mobile apps)
- New session is created on the surviving pod
- Client clicks Play to resume — playback position is loaded from PostgreSQL
Total user impact: one authentication prompt and one Play button press. For a home server, this is comparable to the existing experience when the single pod restarts during node maintenance.
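To make the failover path concrete, here is a small Python simulation of cookie affinity with round-robin fallback. It mimics the Traefik behavior described above under simplified assumptions (a static healthy-pod set, no real HTTP); `StickyRouter` is an illustrative name, not a Traefik API:

```python
import itertools

class StickyRouter:
    """Cookie-based affinity with round-robin fallback, mimicking
    how a load balancer reroutes when the pinned pod disappears."""

    def __init__(self, pods):
        self.healthy = set(pods)
        self._rr = itertools.cycle(sorted(pods))

    def route(self, cookie=None):
        # Honor the cookie only while its pod is still healthy.
        if cookie in self.healthy:
            return cookie
        # Fallback: round-robin over healthy pods.
        # The response would set a fresh cookie for the chosen pod.
        while True:
            pod = next(self._rr)
            if pod in self.healthy:
                return pod

router = StickyRouter(["jellyfin-0", "jellyfin-1"])
cookie = router.route()                  # first request: round-robin pick
assert router.route(cookie) == cookie    # subsequent requests stay pinned

router.healthy.discard(cookie)           # the pinned pod dies
survivor = router.route(cookie)          # stale cookie ignored, rerouted
print(survivor != cookie)                # True: client lands on the other pod
```

The Jellyfin-side consequence of that reroute (the 401 and re-auth) is what the numbered steps above walk through.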
WebSocket Considerations
Jellyfin uses WebSocket connections for real-time features: playback progress updates, server events, SyncPlay commands. WebSocket connections are inherently sticky — once established, they stay on the pod that accepted the upgrade.
If the WebSocket pod dies, the connection drops and the client reconnects. The Jellyfin client SDKs handle reconnection automatically. The only user-visible effect is a brief “Connecting…” indicator.
The Session Lifecycle
To understand why sticky sessions are sufficient, consider the full session lifecycle:
```
Client Login
    │
    ▼
POST /Users/AuthenticateByName
    ├── SessionManager creates SessionInfo in ConcurrentDictionary
    ├── Returns access token + server ID
    └── Traefik sets sticky cookie
    │
    ▼
All subsequent requests include:
    ├── Authorization: MediaBrowser Token="<access_token>"
    └── Cookie: jellyfin_server_id=<pod_hash>
    │
    ▼
SessionManager.GetSession() looks up token
    ├── Found → request proceeds
    └── Not found → 401 Unauthorized → client re-auths
```
The access token is the key. It’s generated on the pod that handled authentication and stored only in that pod’s SessionManager. A different pod has no record of it. Sticky sessions ensure the token always goes back to the pod that created it.
If the pod dies and the client hits the other pod, the token lookup fails. The client re-authenticates, gets a new token on the new pod, and continues. Playback state is in PostgreSQL, so “Continue Watching” and watched markers are preserved across the re-auth.
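A toy Python model makes the cross-pod token failure obvious. Each `Pod` below stands in for one replica's in-memory SessionManager; the names and bare status-code returns are illustrative, not Jellyfin's API:

```python
import secrets

class Pod:
    """Each replica keeps access tokens only in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # token -> username (in-memory only)

    def authenticate(self, user):
        token = secrets.token_hex(8)
        self.sessions[token] = user
        return token

    def handle(self, token):
        if token in self.sessions:
            return 200
        return 401  # unknown token: client must re-auth

pod_a, pod_b = Pod("jellyfin-0"), Pod("jellyfin-1")
token = pod_a.authenticate("alice")

print(pod_a.handle(token))  # 200: sticky session keeps alice on pod_a
print(pod_b.handle(token))  # 401: pod_b never saw this token

# pod_a dies; alice re-authenticates against the survivor
token = pod_b.authenticate("alice")
print(pod_b.handle(token))  # 200: new session; watch state comes from PostgreSQL
```

The 401 in the middle is exactly the failover moment: it costs one login prompt, nothing more, because everything durable already lives in PostgreSQL.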
What This Phase Changed in the Fork
The fork changes for Phase 3 were minimal — the whole point of Track A:
| File | Change | Lines |
|---|---|---|
| `Jellyfin.Server.Implementations/ITaskLeaderElection.cs` | New interface | 18 |
| `Jellyfin.Server.Implementations/RedisTaskLeaderElection.cs` | Redis leader election | 65 |
| `Jellyfin.Server.Implementations/ResilientTaskLeaderElection.cs` | Circuit breaker wrapper | 52 |
| `Jellyfin.Server/Startup.cs` | Register Redis + leader election in DI | 12 |
| `Directory.Packages.props` | StackExchange.Redis version | 1 |
Total: ~148 lines of new code. Compare this to Track B’s estimated ~2,000+ lines. The fork stays small, rebases stay clean.
Coming Up Next
Tomorrow: scaling to two replicas and failover testing — the moment of truth where we set replicas: 2, kill a pod mid-stream, and see if the surviving replica actually keeps serving traffic.
Browse the code: The leader election and circuit breaker implementations are in the Jellyfin fork at github.com/zolty-mat/jellyfin. The Redis and Traefik manifests are in the infrastructure repo (coming soon once secrets are remediated).
Cloud alternative: If you’d rather not run Redis yourself, DigitalOcean Managed Redis provides a drop-in replacement with automatic failover.