TL;DR

This is the moment everything was built for. Three phases of preparation — PostgreSQL provider (Day 3), storage migration (Day 4), state externalization (Day 5) — all leading to a single kubectl scale command.

This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and one test that only partially passed. The headline result: killing a pod causes zero service downtime — users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds.


The Scale-Up

The moment of truth is anticlimactic. One field change:

spec:
  replicas: 2

Apply it:

kubectl apply -f kubernetes/apps/jellyfin/statefulset.yaml
kubectl get pods -n jellyfin -w
NAME                          READY   STATUS    RESTARTS   AGE
jellyfin-0                    1/1     Running   0          3d
jellyfin-1                    0/1     Pending   0          2s
jellyfin-postgres-0           1/1     Running   0          4d
jellyfin-redis-7f8b6c4d9-x2k  1/1     Running   0          1d

jellyfin-1 goes from Pending to ContainerCreating to Running in about 45 seconds. Longhorn provisions the per-pod PVCs (transcode-jellyfin-1, cache-jellyfin-1), NFS config mounts immediately, and the pod connects to PostgreSQL.
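Those per-pod PVC names (transcode-jellyfin-1, cache-jellyfin-1) come from the StatefulSet's volumeClaimTemplates. A sketch of what that section might look like — the template names follow from the PVC names above, but the sizes are assumptions, not the actual manifest:

```yaml
# Hypothetical sketch — sizes are assumed; Longhorn is the provisioner per the text.
volumeClaimTemplates:
  - metadata:
      name: transcode        # becomes transcode-jellyfin-0, transcode-jellyfin-1, ...
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 20Gi
  - metadata:
      name: cache            # becomes cache-jellyfin-0, cache-jellyfin-1, ...
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 10Gi
```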

Pod Anti-Affinity

The whole point of two replicas is surviving a node failure. Placing both pods on the same node defeats the purpose:

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: jellyfin
                  app.kubernetes.io/component: web
              topologyKey: kubernetes.io/hostname

requiredDuringSchedulingIgnoredDuringExecution means the scheduler must place jellyfin-1 on a different node than jellyfin-0. If no node is available, the pod stays Pending rather than co-locating.
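The trade-off: with required anti-affinity, losing a node in a small cluster can leave the replacement pod Pending indefinitely. If you'd rather co-locate than run degraded, the soft variant is a hypothetical alternative (not what this deployment uses):

```yaml
# Sketch: preferred (soft) anti-affinity — the scheduler tries to spread
# the pods but will co-locate them if no other node is schedulable.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: jellyfin
              app.kubernetes.io/component: web
          topologyKey: kubernetes.io/hostname
```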

After scheduling:

kubectl get pods -n jellyfin -o wide
NAME            READY   NODE       IP
jellyfin-0      1/1     k3s-w01    10.42.1.47
jellyfin-1      1/1     k3s-w03    10.42.3.22

Two pods, two nodes. GPU passthrough gives both pods access to Intel UHD 630 iGPUs for QuickSync hardware transcoding.

Verifying Both Pods Serve Traffic

Before testing failover, verify that Traefik is routing to both pods:

# Hit the endpoint 10 times with an invalid sticky cookie so Traefik
# round-robins across backends; every request should return 200
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -b "jellyfin_server_id=intentionally_invalid" \
    https://jellyfin.k3s.internal.strommen.systems/health
done

Because the sticky cookie is invalid, Traefik falls back to round-robin. Every request should return 200, and both pods should log traffic; confirm the spread by tailing each pod's logs, or by inspecting the fresh jellyfin_server_id cookie Traefik sets on each response (its value identifies the backend). If only one pod receives traffic, check the Service selector — the component label trap is the #1 cause.
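For reference, a sketch of what a correct web Service selector looks like — labels assumed from the anti-affinity block above, and 8096 is Jellyfin's default HTTP port:

```yaml
# Sketch (assumed labels): select on the component label as well as the
# app name, or the Service can pick up non-web pods in the namespace.
apiVersion: v1
kind: Service
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  selector:
    app.kubernetes.io/name: jellyfin
    app.kubernetes.io/component: web   # omitting this is the "component label trap"
  ports:
    - name: http
      port: 8096
      targetPort: 8096
```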

Then verify sticky sessions work:

# First request — get the sticky cookie
COOKIE=$(curl -s -c - https://jellyfin.k3s.internal.strommen.systems/health \
  | grep jellyfin_server_id | awk '{print $NF}')

# Next 10 requests with the cookie — should all hit the same pod
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}" \
    -b "jellyfin_server_id=$COOKIE" \
    https://jellyfin.k3s.internal.strommen.systems/health
  echo ""
done

All 10 should return 200. If any return 502, the sticky cookie isn’t being honored — check the IngressRoute configuration from Day 5.
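An aside on the cookie capture above: curl's `-c -` writes the cookie jar to stdout in tab-separated Netscape format, with the cookie value in the last field — which is why awk '{print $NF}' pulls it out. A quick offline illustration (the value abc123 is made up):

```shell
# Simulated cookie-jar line in Netscape format (tab-separated);
# in the real command this line comes from `curl -c -`.
line="$(printf 'jellyfin.k3s.internal.strommen.systems\tFALSE\t/\tTRUE\t0\tjellyfin_server_id\tabc123')"

# The cookie value is the last whitespace-separated field.
echo "$line" | awk '{print $NF}'
# → abc123
```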

The Six Failover Tests

I designed six tests that cover the realistic failure modes for a homelab Jellyfin deployment:

| Test | Scenario | Expected Behavior |
|------|----------|-------------------|
| 1 | Kill idle pod | Surviving pod serves all traffic. No user impact. |
| 2 | Kill pod during active playback | Client reconnects to surviving pod. User re-authenticates. Playback resumes from last position. |
| 3 | Cordon + drain a node | Pod rescheduled to another node. Brief interruption during reschedule. |
| 4 | Kill PostgreSQL | Both pods degrade. Reads from cache continue briefly. Pods crash after connection retry exhaustion. |
| 5 | Kill Redis | Both pods continue serving traffic. Scheduled tasks may run on both pods until Redis recovers. |
| 6 | Rolling update (new image) | Pods restart one at a time. Zero-downtime deploy if client is on the pod that restarts second. |

Test 1: Kill Idle Pod

Procedure: Delete jellyfin-1 while no active playback is occurring.

kubectl delete pod jellyfin-1 -n jellyfin

Result: PASS

  • jellyfin-0 continues serving all traffic uninterrupted
  • Clients pinned to jellyfin-1 get routed to jellyfin-0 on next request
  • Those clients see a 401 (session not found), re-authenticate, and continue
  • StatefulSet controller recreates jellyfin-1 within 30 seconds
  • New jellyfin-1 remounts its existing Longhorn PVCs and NFS config
  • Total disruption to users on jellyfin-0: zero
  • Total disruption to users on jellyfin-1: one re-authentication

Test 2: Kill Pod During Active Playback

Procedure: Start playing a movie on the Android TV app (client pinned to jellyfin-0). Kill jellyfin-0 mid-stream.

# Verify which pod the client is using
kubectl logs -n jellyfin jellyfin-0 --tail=5 | grep -i playback

# Kill it
kubectl delete pod jellyfin-0 -n jellyfin

Result: PARTIAL PASS

  • Playback stops immediately on the Android TV client
  • The client shows “Unable to connect to server” for ~12 seconds (Traefik health check interval + client retry)
  • Client reconnects to jellyfin-1
  • Client prompts for re-authentication — saved credentials auto-fill on mobile, manual on Android TV
  • After re-auth, the user returns to the home screen with “Continue Watching” intact (position loaded from PostgreSQL)
  • User clicks Play — playback resumes from the saved position
  • But: the FFmpeg transcode job that was running on jellyfin-0 is gone. The client must start a new transcode on jellyfin-1. For a large 4K file, this means a ~5-10 second buffer before playback starts.

Why partial: the requirement was “playback resumes from last position.” It does — but there’s a ~20-second gap between pod death and playback resuming (12s reconnect + auth + 5-10s transcode start). This is far better than the previous single-pod behavior (pod death meant Jellyfin was completely unavailable until reschedule, which could take 2-3 minutes), but it’s not seamless.

Test 3: Cordon and Drain Node

Procedure: Cordon the node running jellyfin-0, then drain it. Simulates planned maintenance.

kubectl cordon k3s-w01
kubectl drain k3s-w01 --ignore-daemonsets --delete-emptydir-data

Result: PASS

  • jellyfin-0 is evicted from k3s-w01 and rescheduled to k3s-w02
  • During reschedule (~30 seconds), clients on jellyfin-0 fail over to jellyfin-1
  • After reschedule, jellyfin-0 is back on a new node with the same PVCs (Longhorn reattaches them)
  • New sticky cookies are issued; traffic rebalances naturally as clients make new requests
  • User experience: same as Test 1 — one re-authentication for affected clients

This is the primary use case. Node maintenance (OS updates, kernel upgrades, Proxmox patches) no longer means Jellyfin downtime.
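One guardrail worth adding for planned maintenance (a sketch, not from the actual manifests): a PodDisruptionBudget, which kubectl drain honors, guarantees that a drain never evicts both replicas at once.

```yaml
# Sketch: keep at least one web pod running during voluntary disruptions
# (drains, node upgrades). Labels assumed to match the StatefulSet pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: web
```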

Test 4: Kill PostgreSQL

Procedure: Delete the PostgreSQL pod.

kubectl delete pod jellyfin-postgres-0 -n jellyfin

Result: PASS (expected degradation)

  • Both Jellyfin pods log NpgsqlException: Connection refused
  • EF Core’s EnableRetryOnFailure(3) retries 3 times over ~15 seconds
  • During retry window: cached data (user list, sessions) serves existing authenticated requests
  • After retry exhaustion: pods return 500 on any database-dependent endpoint
  • StatefulSet controller restarts jellyfin-postgres-0 within 20 seconds
  • PostgreSQL pod reconnects to its Longhorn PVC — all data intact
  • Jellyfin pods auto-reconnect on next retry cycle
  • Total degradation window: ~30-40 seconds
  • No data loss

This test validates the retry logic and confirms that the S3 backup isn’t needed for a pod restart (the Longhorn PVC survives it). The degradation window matches expectations — PostgreSQL is a single point of failure by design, but it recovers quickly.
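If you want Jellyfin's reconnection gated on PostgreSQL actually accepting connections rather than on retry timing, a readinessProbe on the postgres container is one option. A sketch — the probe timings and the database user are assumptions:

```yaml
# Sketch: mark the postgres container ready only once it accepts
# connections; pg_isready checks the local socket inside the container.
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "jellyfin"]   # username assumed
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```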

Test 5: Kill Redis

Procedure: Delete the Redis pod.

kubectl delete pod -n jellyfin -l app.kubernetes.io/component=cache

Result: PASS

  • Leader election fails — circuit breaker opens after 3 attempts (~10 seconds)
  • Both pods fall back to local task execution
  • User-facing traffic: zero impact — Redis is not in the request path
  • Background tasks: both pods run scheduled tasks (duplicate work, not data corruption)
  • Redis Deployment controller restarts the pod within 15 seconds
  • Leader election resumes — one pod reacquires leadership
  • Duplicate tasks stop

This confirms the circuit breaker from Day 5 works as designed. Redis downtime is invisible to users.

Test 6: Rolling Update

Procedure: Push a new image tag and apply the updated StatefulSet.

# Update the image tag
kubectl set image statefulset/jellyfin \
  jellyfin=855878721457.dkr.ecr.us-east-1.amazonaws.com/k3s-homelab/jellyfin-ha:sha-abc1234 \
  -n jellyfin

Result: PASS

  • StatefulSet controller updates pods in reverse ordinal order: jellyfin-1 first, then jellyfin-0
  • While jellyfin-1 restarts, its clients fail over to jellyfin-0
  • jellyfin-1 comes up with the new image, becomes ready
  • jellyfin-0 restarts — its clients fail over to jellyfin-1
  • jellyfin-0 comes up, all pods running new image
  • Clients on jellyfin-1 (restarted first): one re-auth when failing over to jellyfin-0, and possibly another when jellyfin-0 restarts and they fail back
  • Clients on jellyfin-0 (restarted second): one re-auth when jellyfin-0 restarts and they fail over to jellyfin-1

The updateStrategy is the default RollingUpdate with partition: 0, which updates all pods. For a canary approach, you could set partition: 1 to update only jellyfin-1 first.
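The canary variant would look like this (a sketch): with partition: 1, only pods with an ordinal >= 1 are updated, so jellyfin-1 gets the new image while jellyfin-0 stays on the old one until the partition is lowered back to 0.

```yaml
# Sketch: canary rollout — only ordinals >= partition are updated.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1   # updates jellyfin-1 only; set back to 0 to finish the rollout
```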

Results Summary

| Test | Scenario | Result | User Impact |
|------|----------|--------|-------------|
| 1 | Kill idle pod | PASS | One re-auth for affected clients |
| 2 | Kill pod during playback | PARTIAL | ~20s gap: reconnect + re-auth + transcode restart |
| 3 | Node cordon + drain | PASS | One re-auth during reschedule |
| 4 | Kill PostgreSQL | PASS | ~30-40s degradation, auto-recovery |
| 5 | Kill Redis | PASS | Zero user impact |
| 6 | Rolling update | PASS | One re-auth per client during rollout |

Five clean passes and one partial. The partial (Test 2) is a fundamental limitation of FFmpeg transcoding — the process is tied to the pod. Solving this would require distributed transcoding, which is Track B+ territory and out of scope.

Prometheus Monitoring

With two replicas running, we need alerts for the failure scenarios the tests just exercised:

ServiceMonitor

Both Jellyfin pods expose /metrics for Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jellyfin
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: web
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

PrometheusRule Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jellyfin-alerts
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: jellyfin.rules
      rules:
        - alert: JellyfinReplicaCountLow
          expr: |
            kube_statefulset_status_replicas_ready{
              namespace="jellyfin",
              statefulset="jellyfin"
            } < 2
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin has fewer than 2 ready replicas"
            description: >
              Only {{ $value }} Jellyfin replica(s) are ready.
              HA failover capacity is degraded.

        - alert: JellyfinPostgresDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-postgres.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Jellyfin PostgreSQL is down"
            description: >
              The PostgreSQL pod is not responding.
              Both Jellyfin replicas will degrade within ~15 seconds.

        - alert: JellyfinRedisDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-redis.*"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin Redis is down"
            description: >
              Redis is unavailable. Task leader election has
              fallen back to local execution. Both pods may run
              duplicate scheduled tasks.

        - alert: JellyfinHighRestartRate
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="jellyfin",
              container="jellyfin"
            }[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin pod restarting frequently"
            description: >
              {{ $labels.pod }} has restarted {{ $value }}
              times in the last hour.

Key Alerts

  • JellyfinReplicaCountLow fires when only 1 replica is ready for 2+ minutes. This catches extended failover scenarios where the StatefulSet controller can’t reschedule (e.g., no node with enough resources, anti-affinity can’t be satisfied).

  • JellyfinPostgresDown fires after 1 minute at critical severity. PostgreSQL downtime is the most impactful failure — both pods degrade.

  • JellyfinRedisDown fires after 5 minutes at warning. Redis downtime causes duplicate task execution but no user impact. The longer for duration avoids alerting on brief restarts.

  • JellyfinHighRestartRate catches CrashLoopBackOff scenarios where a pod is repeatedly restarting due to a bug or misconfiguration.

The Grafana Dashboard

A dedicated Jellyfin HA dashboard shows:

  • Replica status: gauge showing 0/1/2 ready replicas with color coding
  • Pod CPU and memory: per-pod resource usage
  • PostgreSQL connections: active connection count from both pods
  • Redis operations: leader election attempts and circuit breaker state
  • Request rate: per-pod HTTP request rate (verifies traffic is balanced with sticky sessions)
  • Transcode jobs: active FFmpeg processes per pod

The dashboard is provisioned as a ConfigMap with the grafana_dashboard=1 label, auto-loaded by the Grafana sidecar — the same pattern used for every other dashboard in the cluster.
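The sidecar pattern in miniature, for reference — a sketch with the dashboard JSON reduced to a placeholder:

```yaml
# Sketch: the Grafana sidecar watches for ConfigMaps carrying this label
# and auto-loads the embedded JSON as a dashboard.
apiVersion: v1
kind: ConfigMap
metadata:
  name: jellyfin-ha-dashboard
  namespace: jellyfin
  labels:
    grafana_dashboard: "1"
data:
  # Placeholder — the real dashboard JSON body goes here.
  jellyfin-ha.json: |
    { "title": "Jellyfin HA", "panels": [] }
```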


Coming Up Next

Tomorrow: what’s still broken and what comes next — an honest assessment of what this project didn’t solve, the limitations of the Track A approach, and what Track B would look like if we revisited it.

Browse the code: All Kubernetes manifests — StatefulSet, IngressRoute, ServiceMonitor, PrometheusRule — are in the Jellyfin fork at github.com/zolty-mat/jellyfin. Infrastructure manifests will be published to the cluster repo once secrets are remediated.

Don’t have a cluster? You can run this entire test suite on a managed Kubernetes cluster. DigitalOcean Kubernetes with $200 in free credits is enough for a 3-node cluster to replicate every test.