TL;DR
This is the moment everything was built for. Three phases of preparation — PostgreSQL provider (Day 3), storage migration (Day 4), state externalization (Day 5) — all leading to a single kubectl scale command.
This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and one test that only partially passed. The headline result: killing a pod causes zero service downtime — users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds.
The Scale-Up
The moment of truth is anticlimactic. One field change:
```yaml
spec:
  replicas: 2
```
Apply it:
```bash
kubectl apply -f kubernetes/apps/jellyfin/statefulset.yaml
kubectl get pods -n jellyfin -w
```

```
NAME                           READY   STATUS    RESTARTS   AGE
jellyfin-0                     1/1     Running   0          3d
jellyfin-1                     0/1     Pending   0          2s
jellyfin-postgres-0            1/1     Running   0          4d
jellyfin-redis-7f8b6c4d9-x2k   1/1     Running   0          1d
```
jellyfin-1 goes from Pending to ContainerCreating to Running in about 45 seconds. Longhorn provisions the per-pod PVCs (transcode-jellyfin-1, cache-jellyfin-1), NFS config mounts immediately, and the pod connects to PostgreSQL.
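Those per-pod PVCs come from the StatefulSet's `volumeClaimTemplates`. A sketch of what that section plausibly looks like — the sizes and storage class here are assumptions, not the actual manifest:

```yaml
# Sketch only — sizes and storageClassName are illustrative assumptions.
# Each template yields one PVC per ordinal: transcode-jellyfin-0,
# transcode-jellyfin-1, cache-jellyfin-0, cache-jellyfin-1.
volumeClaimTemplates:
  - metadata:
      name: transcode
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 20Gi
  - metadata:
      name: cache
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 10Gi
```

Because these PVCs are named per ordinal, the recreated `jellyfin-1` in the failover tests below reattaches to the same volumes rather than provisioning fresh ones.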
Pod Anti-Affinity
The whole point of two replicas is surviving a node failure. Both pods on the same node defeats the purpose:
```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: jellyfin
                  app.kubernetes.io/component: web
              topologyKey: kubernetes.io/hostname
```
requiredDuringSchedulingIgnoredDuringExecution means the scheduler must place jellyfin-1 on a different node than jellyfin-0. If no node is available, the pod stays Pending rather than co-locating.
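If you would rather degrade to co-location than sit Pending during single-node maintenance, the soft variant trades the guarantee away. A sketch of that alternative (not what this cluster runs):

```yaml
# Sketch — soft anti-affinity: the scheduler prefers separate nodes
# but will co-locate both pods if no other node is available
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: jellyfin
              app.kubernetes.io/component: web
          topologyKey: kubernetes.io/hostname
```

The hard rule is the right call here: two replicas on one node give zero protection against node failure, which is the whole point of the exercise.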
After scheduling:
```bash
kubectl get pods -n jellyfin -o wide
```

```
NAME         READY   NODE      IP
jellyfin-0   1/1     k3s-w01   10.42.1.47
jellyfin-1   1/1     k3s-w03   10.42.3.22
```
Two pods, two nodes. GPU passthrough gives both pods access to Intel UHD 630 iGPUs for QuickSync hardware transcoding.
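The post doesn't show the GPU wiring, but a common homelab pattern for QuickSync in Kubernetes is mounting the host's `/dev/dri` render device into each pod. A sketch under that assumption — the actual manifest may instead request a `gpu.intel.com/i915` resource via the Intel GPU device plugin:

```yaml
# Sketch — assumes host-device mounting; gid 109 for the render
# group is an assumption and varies by distro
spec:
  template:
    spec:
      securityContext:
        supplementalGroups: [109]   # host render group, so FFmpeg can open the device
      containers:
        - name: jellyfin
          volumeMounts:
            - name: dri
              mountPath: /dev/dri
      volumes:
        - name: dri
          hostPath:
            path: /dev/dri
```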
Verifying Both Pods Serve Traffic
Before testing failover, verify that Traefik is routing to both pods:
```bash
# Hit the endpoint 10 times and check which pod responds
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}" \
    -b "jellyfin_server_id=intentionally_invalid" \
    https://jellyfin.k3s.internal.strommen.systems/health
  echo ""
done
```
By sending an invalid sticky cookie, Traefik falls back to round-robin. Both pods should respond with 200. If only one responds, check the Service selector — the component label trap is the #1 cause.
Then verify sticky sessions work:
# First request — get the sticky cookie
```bash
# First request — get the sticky cookie
COOKIE=$(curl -s -c - https://jellyfin.k3s.internal.strommen.systems/health \
  | grep jellyfin_server_id | awk '{print $NF}')

# Next 10 requests with the cookie — should all hit the same pod
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}" \
    -b "jellyfin_server_id=$COOKIE" \
    https://jellyfin.k3s.internal.strommen.systems/health
  echo ""
done
```
All 10 should return 200. If any return 502, the sticky cookie isn’t being honored — check the IngressRoute configuration from Day 5.
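For reference, the sticky cookie is configured on the Traefik IngressRoute's service definition. A sketch of the relevant fragment, assuming the Day 5 setup (the API version and port are assumptions; older Traefik releases use `traefik.containo.us/v1alpha1`):

```yaml
# Sketch — reconstructed from the cookie name used above,
# not the actual Day 5 manifest
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`jellyfin.k3s.internal.strommen.systems`)
      kind: Rule
      services:
        - name: jellyfin
          port: 8096
          sticky:
            cookie:
              name: jellyfin_server_id
              httpOnly: true
```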
The Six Failover Tests
I designed six tests that cover the realistic failure modes for a homelab Jellyfin deployment:
| Test | Scenario | Expected Behavior |
|---|---|---|
| 1 | Kill idle pod | Surviving pod serves all traffic. No user impact. |
| 2 | Kill pod during active playback | Client reconnects to surviving pod. User re-authenticates. Playback resumes from last position. |
| 3 | Cordon + drain a node | Pod rescheduled to another node. Brief interruption during reschedule. |
| 4 | Kill PostgreSQL | Both pods degrade. Reads from cache continue briefly. Pods crash after connection retry exhaustion. |
| 5 | Kill Redis | Both pods continue serving traffic. Scheduled tasks may run on both pods until Redis recovers. |
| 6 | Rolling update (new image) | Pods restart one at a time. Zero-downtime deploy if client is on the pod that restarts second. |
Test 1: Kill Idle Pod
Procedure: Delete jellyfin-1 while no active playback is occurring.
```bash
kubectl delete pod jellyfin-1 -n jellyfin
```
Result: PASS
- `jellyfin-0` continues serving all traffic uninterrupted
- Clients pinned to `jellyfin-1` get routed to `jellyfin-0` on next request
- Those clients see a 401 (session not found), re-authenticate, and continue
- StatefulSet controller recreates `jellyfin-1` within 30 seconds
- New `jellyfin-1` remounts its existing Longhorn PVCs and NFS config
- Total disruption to users on `jellyfin-0`: zero
- Total disruption to users on `jellyfin-1`: one re-authentication
Test 2: Kill Pod During Active Playback
Procedure: Start playing a movie on the Android TV app (client pinned to jellyfin-0). Kill jellyfin-0 mid-stream.
```bash
# Verify which pod the client is using
kubectl logs -n jellyfin jellyfin-0 --tail=5 | grep -i playback

# Kill it
kubectl delete pod jellyfin-0 -n jellyfin
```
Result: PARTIAL PASS
- Playback stops immediately on the Android TV client
- The client shows "Unable to connect to server" for ~12 seconds (Traefik health check interval + client retry)
- Client reconnects to `jellyfin-1`
- Client prompts for re-authentication — saved credentials auto-fill on mobile, manual on Android TV
- After re-auth, the user returns to the home screen with "Continue Watching" intact (position loaded from PostgreSQL)
- User clicks Play — playback resumes from the saved position
- But: the FFmpeg transcode job that was running on `jellyfin-0` is gone. The client must start a new transcode on `jellyfin-1`. For a large 4K file, this means a ~5-10 second buffer before playback starts.
Why partial: the requirement was “playback resumes from last position.” It does — but there’s a ~20-second gap between pod death and playback resuming (12s reconnect + auth + 5-10s transcode start). This is significantly better than the current single-pod behavior (pod death = Jellyfin completely unavailable until reschedule, which can take 2-3 minutes), but it’s not seamless.
Test 3: Cordon and Drain Node
Procedure: Cordon the node running jellyfin-0, then drain it. Simulates planned maintenance.
```bash
kubectl cordon k3s-w01
kubectl drain k3s-w01 --ignore-daemonsets --delete-emptydir-data
```
Result: PASS
- `jellyfin-0` is evicted from `k3s-w01` and rescheduled to `k3s-w02`
- During reschedule (~30 seconds), clients on `jellyfin-0` fail over to `jellyfin-1`
- After reschedule, `jellyfin-0` is back on a new node with the same PVCs (Longhorn reattaches them)
- New sticky cookies are issued; traffic rebalances naturally as clients make new requests
- User experience: same as Test 1 — one re-authentication for affected clients
This is the primary use case. Node maintenance (OS updates, kernel upgrades, Proxmox patches) no longer means Jellyfin downtime. Once maintenance is done, `kubectl uncordon k3s-w01` returns the node to the scheduling pool.
Test 4: Kill PostgreSQL
Procedure: Delete the PostgreSQL pod.
```bash
kubectl delete pod jellyfin-postgres-0 -n jellyfin
```
Result: PASS (expected degradation)
- Both Jellyfin pods log `NpgsqlException: Connection refused`
- EF Core's `EnableRetryOnFailure(3)` retries 3 times over ~15 seconds
- During retry window: cached data (user list, sessions) serves existing authenticated requests
- After retry exhaustion: pods return 500 on any database-dependent endpoint
- StatefulSet controller restarts `jellyfin-postgres-0` within 20 seconds
- PostgreSQL pod reconnects to its Longhorn PVC — all data intact
- Jellyfin pods auto-reconnect on next retry cycle
- Total degradation window: ~30-40 seconds
- No data loss
This test validates the retry logic and confirms the S3 backup isn't needed for a simple pod restart (the Longhorn PVC survives intact). The degradation window matches expectations — PostgreSQL is a single point of failure by design, but it recovers quickly.
Test 5: Kill Redis
Procedure: Delete the Redis pod.
```bash
kubectl delete pod -n jellyfin -l app.kubernetes.io/component=cache
```
Result: PASS
- Leader election fails — circuit breaker opens after 3 attempts (~10 seconds)
- Both pods fall back to local task execution
- User-facing traffic: zero impact — Redis is not in the request path
- Background tasks: both pods run scheduled tasks (duplicate work, not data corruption)
- Redis Deployment controller restarts the pod within 15 seconds
- Leader election resumes — one pod reacquires leadership
- Duplicate tasks stop
This confirms the circuit breaker from Day 5 works as designed. Redis downtime is invisible to users.
Test 6: Rolling Update
Procedure: Push a new image tag and apply the updated StatefulSet.
```bash
# Update the image tag
kubectl set image statefulset/jellyfin \
  jellyfin=855878721457.dkr.ecr.us-east-1.amazonaws.com/k3s-homelab/jellyfin-ha:sha-abc1234 \
  -n jellyfin
```
Result: PASS
- StatefulSet controller updates pods in reverse ordinal order: `jellyfin-1` first, then `jellyfin-0`
- While `jellyfin-1` restarts, its clients fail over to `jellyfin-0`
- `jellyfin-1` comes up with the new image, becomes ready
- `jellyfin-0` restarts — its clients fail over to `jellyfin-1`
- `jellyfin-0` comes up, all pods running new image
- Clients on `jellyfin-0` (first group): one re-auth when failing to `jellyfin-1`, another when failing back (if they were re-routed during the window)
- Clients on `jellyfin-1` (second group): one re-auth when `jellyfin-1` restarts
The updateStrategy is the default RollingUpdate with partition: 0, which updates all pods. For a canary approach, you could set partition: 1 to update only jellyfin-1 first.
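A canary rollout would be a small change to the StatefulSet spec. A sketch of that variant (this is the standard StatefulSet field, not what the cluster currently runs):

```yaml
# Sketch — partition: 1 means only ordinals >= 1 (jellyfin-1) receive
# the new image; jellyfin-0 stays on the old one until you lower
# the partition back to 0
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1
```

With two replicas the canary window is narrow, but it lets you watch `jellyfin-1` serve real traffic on the new image before committing `jellyfin-0`.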
Results Summary
| Test | Scenario | Result | User Impact |
|---|---|---|---|
| 1 | Kill idle pod | PASS | One re-auth for affected clients |
| 2 | Kill pod during playback | PARTIAL | ~20s gap: reconnect + re-auth + transcode restart |
| 3 | Node cordon + drain | PASS | One re-auth during reschedule |
| 4 | Kill PostgreSQL | PASS | ~30-40s degradation, auto-recovery |
| 5 | Kill Redis | PASS | Zero user impact |
| 6 | Rolling update | PASS | One re-auth per client during rollout |
Five clean passes and one partial. The partial (Test 2) is a fundamental limitation of FFmpeg transcoding — the process is tied to the pod. Solving this would require distributed transcoding, which is Track B+ territory and out of scope.
Prometheus Monitoring
With two replicas running, we need alerts for scenarios the failover tests validated:
ServiceMonitor
Both Jellyfin pods expose /metrics for Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jellyfin
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: web
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
PrometheusRule Alerts
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jellyfin-alerts
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: jellyfin.rules
      rules:
        - alert: JellyfinReplicaCountLow
          expr: |
            kube_statefulset_status_replicas_ready{
              namespace="jellyfin",
              statefulset="jellyfin"
            } < 2
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin has fewer than 2 ready replicas"
            description: >
              Only {{ $value }} Jellyfin replica(s) are ready.
              HA failover capacity is degraded.

        - alert: JellyfinPostgresDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-postgres.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Jellyfin PostgreSQL is down"
            description: >
              The PostgreSQL pod is not responding.
              Both Jellyfin replicas will degrade within ~15 seconds.

        - alert: JellyfinRedisDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-redis.*"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin Redis is down"
            description: >
              Redis is unavailable. Task leader election has
              fallen back to local execution. Both pods may run
              duplicate scheduled tasks.

        - alert: JellyfinHighRestartRate
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="jellyfin",
              container="jellyfin"
            }[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin pod restarting frequently"
            description: >
              {{ $labels.pod }} has restarted {{ $value }}
              times in the last hour.
```
Key Alerts
- `JellyfinReplicaCountLow` fires when only 1 replica is ready for 2+ minutes. This catches extended failover scenarios where the StatefulSet controller can't reschedule (e.g., no node with enough resources, anti-affinity can't be satisfied).
- `JellyfinPostgresDown` fires after 1 minute at `critical` severity. PostgreSQL downtime is the most impactful failure — both pods degrade.
- `JellyfinRedisDown` fires after 5 minutes at `warning`. Redis downtime causes duplicate task execution but no user impact. The longer `for` duration avoids alerting on brief restarts.
- `JellyfinHighRestartRate` catches CrashLoopBackOff scenarios where a pod is repeatedly restarting due to a bug or misconfiguration.
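The severity labels only matter if Alertmanager routes them differently. A hedged sketch of how that routing could look — the receiver names are placeholders, and the post doesn't show the actual Alertmanager config:

```yaml
# Sketch — placeholder receivers; critical pages immediately,
# warning goes to a low-urgency channel
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: chat
      repeat_interval: 12h
```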
The Grafana Dashboard
A dedicated Jellyfin HA dashboard shows:
- Replica status: gauge showing 0/1/2 ready replicas with color coding
- Pod CPU and memory: per-pod resource usage
- PostgreSQL connections: active connection count from both pods
- Redis operations: leader election attempts and circuit breaker state
- Request rate: per-pod HTTP request rate (verifies traffic is balanced with sticky sessions)
- Transcode jobs: active FFmpeg processes per pod
The dashboard is provisioned as a ConfigMap with the grafana_dashboard=1 label, auto-loaded by the Grafana sidecar — the same pattern used for every other dashboard in the cluster.
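The skeleton of that ConfigMap might look like the following sketch — the ConfigMap name and file key are assumptions; only the `grafana_dashboard` label is load-bearing, since it's what the Grafana sidecar watches for:

```yaml
# Sketch — name and key are illustrative; paste the full dashboard
# JSON exported from Grafana into the data block
apiVersion: v1
kind: ConfigMap
metadata:
  name: jellyfin-ha-dashboard
  namespace: jellyfin
  labels:
    grafana_dashboard: "1"
data:
  jellyfin-ha.json: |
    {"title": "Jellyfin HA", "panels": []}
```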
Coming Up Next
Tomorrow: what’s still broken and what comes next — an honest assessment of what this project didn’t solve, the limitations of the Track A approach, and what Track B would look like if we revisited it.
Browse the code: All Kubernetes manifests — StatefulSet, IngressRoute, ServiceMonitor, PrometheusRule — are in the Jellyfin fork at github.com/zolty-mat/jellyfin. Infrastructure manifests will be published to the cluster repo once secrets are remediated.
Don’t have a cluster? You can run this entire test suite on a managed Kubernetes cluster. DigitalOcean Kubernetes with $200 in free credits is enough for a 3-node cluster to replicate every test.