TL;DR

Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state — sessions, users, devices, tasks — in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces means two independent views of reality.

This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server.


The State Problem, Revisited

In Day 1, I cataloged every stateful singleton in Jellyfin 10.12.0. Here’s the same table, now with the resolution strategy:

| Manager | State | HA Impact | Resolution |
| --- | --- | --- | --- |
| SessionManager | Active user sessions | Critical | Sticky sessions + graceful re-auth |
| SessionManager | Live stream tracking | Critical | Sticky sessions (stream stays on one pod) |
| UserManager | User cache | High | PostgreSQL is source of truth; cache warms on startup |
| DeviceManager | Client capabilities | Medium | Sticky sessions (client always hits same pod) |
| QuickConnectManager | Pairing requests | High | Disabled in HA mode |
| QuickConnectManager | Authorized secrets | High | Disabled in HA mode |
| SyncPlayManager | Group state | Deferred | Too complex, low usage — single-pod only |
| TaskManager | Scheduled tasks | High | Leader election via Redis |
| ProviderManager | Metadata refresh progress | Medium | Sticky sessions (progress per pod) |
| TranscodeManager | FFmpeg jobs | Deferred | Sticky sessions (transcode is pod-local) |
| ChannelManager | Channel cache | Low | Regenerable — each pod builds its own |

Three categories emerged:

  1. Solved by sticky sessions — if the client always hits the same pod, the in-memory state stays consistent. No code changes needed.
  2. Solved by external coordination — TaskManager needs a leader election or distributed lock to prevent duplicate task execution.
  3. Deferred or disabled — SyncPlayManager is architecturally incompatible with multi-pod. QuickConnectManager has a small user base and can be disabled.

Track A vs. Track B

Back in the planning phase (Day 2), the multi-model review identified two escalation paths:

Track A: Sticky Sessions + Minimal Changes

Keep state in memory. Use Traefik’s cookie-based session affinity to pin clients to pods. Accept that pod death means clients re-authenticate and lose transient state (playback position is persisted to PostgreSQL, so “Continue Watching” survives). Coordinate only TaskManager via Redis.

Track B: Full Redis Externalization

Replace every ConcurrentDictionary with a Redis-backed implementation. Sessions survive pod death transparently. Any pod can serve any request. True stateless application.

The plan specified Track A as the implementation target, with Track B as an escalation if real-world testing showed unacceptable user experience.

Why Track A Won

| Factor | Track A | Track B |
| --- | --- | --- |
| Code changes to Jellyfin | ~200 lines (TaskManager, feature flags) | ~2,000+ lines (6 managers rewritten) |
| Fork maintenance burden | Minimal — touches 3 files | Heavy — every upstream release risks merge conflicts |
| User-visible impact of pod death | Re-authenticate, click Play again | Seamless (maybe a brief pause) |
| User-visible frequency of pod death | Rare (node drain, crash) | Same |
| Development time | 1 phase | 3+ phases |
| Redis dependency | Lightweight (leader election only) | Critical path (every API request) |

For a homelab with <10 concurrent users, the difference between “click Play again after a pod crash” and “seamless failover” doesn’t justify tripling the fork’s maintenance surface. Track A it is.

Deploying Redis

Even with Track A, we need Redis for one essential function: distributed locking for TaskManager. Without it, both pods would fire library scans, image extraction, and plugin updates simultaneously.

Redis runs as a single-replica Deployment in the jellyfin namespace:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jellyfin-redis
  namespace: jellyfin
  labels:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: cache
  template:
    metadata:
      labels:
        app.kubernetes.io/name: jellyfin
        app.kubernetes.io/component: cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          command: ["redis-server", "--maxmemory", "128mb", "--maxmemory-policy", "allkeys-lru"]
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 192Mi
          livenessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 3
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: jellyfin-redis
  namespace: jellyfin
spec:
  selector:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: cache
  ports:
    - port: 6379

Why Not Redis Sentinel or Cluster?

Redis itself isn’t on the critical path for user requests. It only coordinates background tasks. If Redis goes down:

  • Sticky sessions still work (Traefik doesn’t depend on Redis)
  • User authentication still works (PostgreSQL)
  • Playback still works (FFmpeg is pod-local)
  • The only impact: both pods might run a library scan simultaneously until Redis recovers

For this failure mode, single-replica Redis with a restart policy is sufficient. Sentinel adds 3 more pods and operational complexity for a failure scenario that causes duplicate work, not data loss.

The TaskManager Problem

TaskManager is the most dangerous of the stateful managers. It holds a ConcurrentQueue of scheduled tasks and fires them on a configurable interval. With two pods, every scheduled task runs twice:

  • Library scan: both pods scan the same media directory, potentially writing duplicate metadata
  • Image extraction: both pods extract the same thumbnails, wasting CPU
  • Chapter detection: FFmpeg processes run on both pods for the same files
  • Plugin updates: both pods try to download and install the same plugin update

Leader Election via Redis

The solution: only one pod runs scheduled tasks at a time. A simple Redis-based leader election:

public class RedisTaskLeaderElection : ITaskLeaderElection
{
    private readonly IConnectionMultiplexer _redis;
    private readonly string _instanceId;
    private readonly TimeSpan _leaseTimeout = TimeSpan.FromSeconds(30);

    public RedisTaskLeaderElection(
        IConnectionMultiplexer redis,
        IServerApplicationHost appHost)
    {
        _redis = redis;
        _instanceId = appHost.SystemId;
    }

    /// <summary>
    /// Attempts to acquire the task leader lease.
    /// Returns true if this instance is the current leader.
    /// </summary>
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var acquired = await db.StringSetAsync(
            "jellyfin:task-leader",
            _instanceId,
            _leaseTimeout,
            When.NotExists);

        if (acquired)
        {
            return true;
        }

        // Check if we already hold the lease
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");
        return currentLeader == _instanceId;
    }

    /// <summary>
    /// Renews the leader lease. Called periodically by the leader.
    /// </summary>
    public async Task RenewLeaseAsync(
        CancellationToken cancellationToken = default)
    {
        var db = _redis.GetDatabase();
        var currentLeader = await db.StringGetAsync("jellyfin:task-leader");

        if (currentLeader == _instanceId)
        {
            await db.KeyExpireAsync(
                "jellyfin:task-leader",
                _leaseTimeout);
        }
    }
}

The pattern:

  1. Pod starts up and tries to set jellyfin:task-leader to its instance ID with a 30-second TTL
  2. StringSetAsync with When.NotExists ensures only the first pod wins
  3. The leader renews the lease every 15 seconds (half the TTL)
  4. If the leader pod dies, the lease expires after 30 seconds
  5. The surviving pod acquires leadership on its next attempt
  6. TaskManager checks TryAcquireLeadershipAsync() before executing any scheduled task
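
The lease mechanics are easier to see in isolation. Here's a minimal sketch in Python — an in-memory stand-in for Redis's SET NX EX semantics, not fork code; all names are illustrative:

```python
class FakeRedis:
    """In-memory stand-in for Redis SET NX EX / EXPIRE semantics."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl, now):
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # key exists and hasn't expired: NX fails
        self._data[key] = (value, now + ttl)
        return True

    def get(self, key, now):
        current = self._data.get(key)
        if current is None or current[1] <= now:
            return None  # missing or expired
        return current[0]

    def expire(self, key, ttl, now):
        if self.get(key, now) is not None:
            self._data[key] = (self._data[key][0], now + ttl)


LEASE_TTL = 30  # seconds, matching _leaseTimeout in the C# code

def try_acquire(redis, instance_id, now):
    """Mirror of TryAcquireLeadershipAsync: win the NX set, or already hold it."""
    if redis.set_nx_ex("jellyfin:task-leader", instance_id, LEASE_TTL, now):
        return True
    return redis.get("jellyfin:task-leader", now) == instance_id


r = FakeRedis()

# t=0: both pods race; pod-a wins the NX set, pod-b loses
assert try_acquire(r, "pod-a", now=0) is True
assert try_acquire(r, "pod-b", now=0) is False

# t=15: pod-a renews at half the TTL; pod-b remains locked out
r.expire("jellyfin:task-leader", LEASE_TTL, now=15)
assert try_acquire(r, "pod-b", now=20) is False

# pod-a dies right after renewing; its lease expires at t=45,
# and the survivor's next attempt takes over leadership
assert try_acquire(r, "pod-b", now=46) is True
```

The fake clock stands in for TTL expiry; the real implementation gets the same behavior for free from Redis key expiration.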

Graceful Fallback

What if Redis is down entirely? The leader election uses a circuit breaker:

public class ResilientTaskLeaderElection : ITaskLeaderElection
{
    private readonly RedisTaskLeaderElection _inner;
    private readonly ILogger<ResilientTaskLeaderElection> _logger;
    private int _consecutiveFailures;
    private const int CircuitBreakerThreshold = 3;

    public ResilientTaskLeaderElection(
        RedisTaskLeaderElection inner,
        ILogger<ResilientTaskLeaderElection> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    /// <inheritdoc />
    public async Task<bool> TryAcquireLeadershipAsync(
        CancellationToken cancellationToken = default)
    {
        try
        {
            var result = await _inner.TryAcquireLeadershipAsync(
                cancellationToken);
            Interlocked.Exchange(ref _consecutiveFailures, 0);
            return result;
        }
        catch (RedisConnectionException ex)
        {
            var failures = Interlocked.Increment(
                ref _consecutiveFailures);

            if (failures >= CircuitBreakerThreshold)
            {
                _logger.LogWarning(
                    "Redis unavailable ({Failures} consecutive failures). "
                    + "Falling back to local task execution.",
                    failures);
                return true; // Allow local execution
            }

            _logger.LogDebug(ex,
                "Redis connection failed ({Failures}/{Threshold})",
                failures, CircuitBreakerThreshold);
            return false;
        }
    }
}

After 3 consecutive Redis failures, the circuit breaker opens and both pods run tasks locally — the same behavior as without Redis. This prevents Redis downtime from blocking all scheduled tasks. Duplicate work is preferable to no work.

Disabling QuickConnect

QuickConnect is Jellyfin’s device pairing feature: you generate a 6-digit code on a new device, then authorize it from an already-authenticated session. The code and the authorization live in two ConcurrentDictionary instances that never leave memory.

With sticky sessions, there’s a ~50% chance the authorization request routes to a different pod than the one holding the code. The pairing silently fails.
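
The ~50% figure falls out of simple enumeration: the new device's code-generation request and the authorizing session's request each land on one of the two pods independently, and half of the four combinations split across pods. A quick illustrative check (not fork code):

```python
from itertools import product

pods = ["pod-a", "pod-b"]

# Each of the two requests is routed to one of two pods independently.
combos = list(product(pods, repeat=2))
mismatches = [c for c in combos if c[0] != c[1]]

print(len(mismatches) / len(combos))  # 0.5 — the pairing silently fails
```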

Options:

  1. Externalize to Redis — store codes and authorizations in Redis. Track B territory.
  2. Route to leader — use a shared Redis key to identify which pod generated the code. Complex.
  3. Disable it — QuickConnect is a convenience feature. Users can authenticate with username/password instead.

We went with option 3. A feature flag in system.xml:

<QuickConnectAvailable>false</QuickConnectAvailable>

The Jellyfin UI hides the QuickConnect option when it’s disabled. No code change needed — just a config setting. If Track B is ever implemented, it can be re-enabled.

SyncPlay: The Feature We Couldn’t Fix

SyncPlay lets multiple users watch the same content in sync — like a virtual movie night. The state for this feature spans three interconnected ConcurrentDictionary instances:

| Dictionary | Purpose |
| --- | --- |
| _groups | Active SyncPlay groups |
| _sessionToGroupMap | Which session belongs to which group |
| _groupToSessionsMap | Which sessions belong to which group |

SyncPlay requires sub-second coordination between all members of a group. Playback commands (play, pause, seek) must propagate to all clients simultaneously. This is fundamentally incompatible with sticky sessions — if two users in the same group are pinned to different pods, their play/pause commands only affect their local pod’s state.

Externalizing SyncPlay would require:

  • Real-time pub/sub between pods (Redis pub/sub or WebSocket bridge)
  • Shared group state in Redis
  • Cross-pod session lookup
  • Sub-100ms roundtrip to avoid visible playback desync

This is Track B territory and then some. SyncPlay works fine on jellyfin-0 (the pod that happens to be the leader). Users who need SyncPlay can use it — they just need to be on the same pod, which sticky sessions handle naturally as long as they log in around the same time.

For a household of 3-4 users, this is an acceptable compromise. For a public server with hundreds of users, it wouldn’t be.

Configuring Traefik Sticky Sessions

The linchpin of Track A: Traefik’s cookie-based session affinity. Every client gets a cookie that pins them to a specific pod. As long as that pod is alive, all requests from that client go to the same place.

The configuration lives in the Jellyfin IngressRoute:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`jellyfin.k3s.internal.strommen.systems`)
      kind: Rule
      services:
        - name: jellyfin
          port: 8096
          sticky:
            cookie:
              name: jellyfin_server_id
              secure: true
              httpOnly: true
              sameSite: strict
  tls:
    certResolver: letsencrypt-prod

How It Works

  1. Client makes first request to jellyfin.k3s.internal.strommen.systems
  2. Traefik routes to one of the two pods (round-robin on first hit)
  3. Response includes Set-Cookie: jellyfin_server_id=<pod-hash>; Secure; HttpOnly; SameSite=Strict
  4. All subsequent requests from that client include the cookie
  5. Traefik reads the cookie and routes to the same pod
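
The routing decision above can be sketched as a pure function. This is an illustrative model of cookie affinity, not Traefik's actual implementation (names like choose_pod are made up):

```python
import hashlib
from itertools import cycle

COOKIE = "jellyfin_server_id"

def pod_hash(pod):
    """Stable, opaque cookie value per backend, akin to Traefik's server hash."""
    return hashlib.sha256(pod.encode()).hexdigest()[:16]

def choose_pod(cookies, healthy_pods, round_robin):
    """Return (pod, set_cookie) for one request."""
    wanted = cookies.get(COOKIE)
    for pod in healthy_pods:
        if pod_hash(pod) == wanted:
            return pod, None  # cookie still points at a live pod: stay pinned
    # No cookie, or the pinned pod is gone: fall back to round-robin
    pod = next(round_robin)
    return pod, pod_hash(pod)


pods = ["jellyfin-0", "jellyfin-1"]
rr = cycle(pods)

# First request: no cookie, round-robin picks a pod and sets the cookie
pod, cookie = choose_pod({}, pods, rr)
assert cookie is not None

# Subsequent requests with the cookie stay on the same pod
pinned, set_again = choose_pod({COOKIE: cookie}, pods, rr)
assert pinned == pod and set_again is None

# The pinned pod dies: fall back to the survivor and re-set the cookie
survivors = [p for p in pods if p != pod]
new_pod, new_cookie = choose_pod({COOKIE: cookie}, survivors, cycle(survivors))
assert new_pod != pod and new_cookie is not None
```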

What Happens When a Pod Dies

  1. Traefik detects the pod is gone (health check fails within ~10 seconds)
  2. Client’s next request includes the cookie, but the target pod doesn’t exist
  3. Traefik falls back to round-robin and routes to the surviving pod
  4. New cookie is set for the surviving pod
  5. Client’s session is no longer in memory — Jellyfin returns 401
  6. Client re-authenticates (username/password, or saved credentials on mobile apps)
  7. New session is created on the surviving pod
  8. Client clicks Play to resume — playback position is loaded from PostgreSQL

Total user impact: one authentication prompt and one Play button press. For a home server, this is comparable to the existing experience when the single pod restarts during node maintenance.

WebSocket Considerations

Jellyfin uses WebSocket connections for real-time features: playback progress updates, server events, SyncPlay commands. WebSocket connections are inherently sticky — once established, they stay on the pod that accepted the upgrade.

If the WebSocket pod dies, the connection drops and the client reconnects. The Jellyfin client SDKs handle reconnection automatically. The only user-visible effect is a brief “Connecting…” indicator.

The Session Lifecycle

To understand why sticky sessions are sufficient, consider the full session lifecycle:

Client Login
POST /Users/AuthenticateByName
    ├── SessionManager creates SessionInfo in ConcurrentDictionary
    ├── Returns access token + server ID
    └── Traefik sets sticky cookie
All subsequent requests include:
    ├── Authorization: MediaBrowser Token="<access_token>"
    └── Cookie: jellyfin_server_id=<pod_hash>
SessionManager.GetSession() looks up token
    ├── Found → request proceeds
    └── Not found → 401 Unauthorized → client re-auths

The access token is the key. It’s generated on the pod that handled authentication and stored only in that pod’s SessionManager. A different pod has no record of it. Sticky sessions ensure the token always goes back to the pod that created it.

If the pod dies and the client hits the other pod, the token lookup fails. The client re-authenticates, gets a new token on the new pod, and continues. Playback state is in PostgreSQL, so “Continue Watching” and watched markers are preserved across the re-auth.
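
A toy model of that failover path, with each pod's SessionManager reduced to a per-process dictionary (illustrative only — not the fork's code):

```python
import secrets

class Pod:
    """Each pod's SessionManager reduced to a per-process token table."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # access token -> username

    def authenticate(self, username):
        token = secrets.token_hex(16)
        self.sessions[token] = username
        return token

    def handle_request(self, token):
        if token in self.sessions:
            return 200
        return 401  # token unknown here: client must re-authenticate


pod_a, pod_b = Pod("jellyfin-0"), Pod("jellyfin-1")

# Login lands on pod_a; the sticky cookie keeps the client there
token = pod_a.authenticate("alice")
assert pod_a.handle_request(token) == 200

# pod_a dies; the same token hits pod_b, whose dictionary is empty
assert pod_b.handle_request(token) == 401

# Client re-auths on pod_b and resumes; watched state and playback
# position live in PostgreSQL, so only the in-memory session was lost
new_token = pod_b.authenticate("alice")
assert pod_b.handle_request(new_token) == 200
```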

What This Phase Changed in the Fork

The fork changes for Phase 3 were minimal — the whole point of Track A:

| File | Change | Lines |
| --- | --- | --- |
| Jellyfin.Server.Implementations/ITaskLeaderElection.cs | New interface | 18 |
| Jellyfin.Server.Implementations/RedisTaskLeaderElection.cs | Redis leader election | 65 |
| Jellyfin.Server.Implementations/ResilientTaskLeaderElection.cs | Circuit breaker wrapper | 52 |
| Jellyfin.Server/Startup.cs | Register Redis + leader election in DI | 12 |
| Directory.Packages.props | StackExchange.Redis version | 1 |

Total: ~148 lines of new code. Compare this to Track B’s estimated ~2,000+ lines. The fork stays small, rebases stay clean.


Coming Up Next

Tomorrow: scaling to two replicas and failover testing — the moment of truth where we set replicas: 2, kill a pod mid-stream, and see if the surviving replica actually keeps serving traffic.

Browse the code: The leader election and circuit breaker implementations are in the Jellyfin fork at github.com/zolty-mat/jellyfin. The Redis and Traefik manifests are in the infrastructure repo (coming soon once secrets are remediated).

Cloud alternative: If you’d rather not run Redis yourself, DigitalOcean Managed Redis provides a drop-in replacement with automatic failover.