TL;DR
This is the moment everything was built for. Three phases of preparation — PostgreSQL provider (Day 3), storage migration (Day 4), state externalization (Day 5) — all leading to a single kubectl scale command.
This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and one test that only partially passed. The headline result: killing a pod causes zero service downtime — users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds.
The Scale-Up
The moment of truth is anticlimactic. One field change:
```yaml
spec:
  replicas: 2
```
Apply it:
```bash
kubectl apply -f kubernetes/apps/jellyfin/statefulset.yaml
kubectl get pods -n jellyfin -w
```

```
NAME                           READY   STATUS    RESTARTS   AGE
jellyfin-0                     1/1     Running   0          3d
jellyfin-1                     0/1     Pending   0          2s
jellyfin-postgres-0            1/1     Running   0          4d
jellyfin-redis-7f8b6c4d9-x2k   1/1     Running   0          1d
```
jellyfin-1 goes from Pending to ContainerCreating to Running in about 45 seconds. Longhorn provisions the per-pod PVCs (transcode-jellyfin-1, cache-jellyfin-1), NFS config mounts immediately, and the pod connects to PostgreSQL.
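Those per-pod PVCs come from the StatefulSet's `volumeClaimTemplates`. A sketch of what that section plausibly looks like — the sizes and storage class here are assumptions, not the actual manifest:

```yaml
# Sketch only — sizes and storageClassName are illustrative assumptions.
# Each template yields one PVC per ordinal: transcode-jellyfin-0,
# transcode-jellyfin-1, cache-jellyfin-0, cache-jellyfin-1.
volumeClaimTemplates:
  - metadata:
      name: transcode
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 20Gi
  - metadata:
      name: cache
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 10Gi
```

Because these PVCs are named per ordinal, the recreated `jellyfin-1` in the failover tests below reattaches to the same volumes rather than provisioning fresh ones.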
Pod Anti-Affinity
The whole point of two replicas is surviving a node failure. Both pods on the same node defeats the purpose:
```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: jellyfin
                  app.kubernetes.io/component: web
              topologyKey: kubernetes.io/hostname
```
requiredDuringSchedulingIgnoredDuringExecution means the scheduler must place jellyfin-1 on a different node than jellyfin-0. If no node is available, the pod stays Pending rather than co-locating.
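If you would rather degrade to co-location than sit Pending during single-node maintenance, the soft variant trades the guarantee away. A sketch of that alternative (not what this cluster runs):

```yaml
# Sketch — soft anti-affinity: the scheduler prefers separate nodes
# but will co-locate both pods if no other node is available
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: jellyfin
              app.kubernetes.io/component: web
          topologyKey: kubernetes.io/hostname
```

The hard rule is the right call here: two replicas on one node give zero protection against node failure, which is the whole point of the exercise.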
After scheduling:
```bash
kubectl get pods -n jellyfin -o wide
```

```
NAME         READY   NODE      IP
jellyfin-0   1/1     k3s-w01   10.42.1.47
jellyfin-1   1/1     k3s-w03   10.42.3.22
```
Two pods, two nodes. GPU passthrough gives both pods access to Intel UHD 630 iGPUs for QuickSync hardware transcoding.
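The post doesn't show the GPU wiring, but a common homelab pattern for QuickSync in Kubernetes is mounting the host's `/dev/dri` render device into each pod. A sketch under that assumption — the actual manifest may instead request a `gpu.intel.com/i915` resource via the Intel GPU device plugin:

```yaml
# Sketch — assumes host-device mounting; gid 109 for the render
# group is an assumption and varies by distro
spec:
  template:
    spec:
      securityContext:
        supplementalGroups: [109]   # host render group, so FFmpeg can open the device
      containers:
        - name: jellyfin
          volumeMounts:
            - name: dri
              mountPath: /dev/dri
      volumes:
        - name: dri
          hostPath:
            path: /dev/dri
```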
Verifying Both Pods Serve Traffic
Before testing failover, verify that Traefik is routing to both pods:
```bash
# Hit the endpoint 10 times and check which pod responds
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}" \
    -b "jellyfin_server_id=intentionally_invalid" \
    https://jellyfin.k3s.internal.strommen.systems/health
  echo ""
done
```
By sending an invalid sticky cookie, Traefik falls back to round-robin. Both pods should respond with 200. If only one responds, check the Service selector — the component label trap is the #1 cause.
Then verify sticky sessions work:
# First request — get the sticky cookie
```bash
# First request — get the sticky cookie
COOKIE=$(curl -s -c - https://jellyfin.k3s.internal.strommen.systems/health \
  | grep jellyfin_server_id | awk '{print $NF}')

# Next 10 requests with the cookie — should all hit the same pod
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}" \
    -b "jellyfin_server_id=$COOKIE" \
    https://jellyfin.k3s.internal.strommen.systems/health
  echo ""
done
```
All 10 should return 200. If any return 502, the sticky cookie isn’t being honored — check the IngressRoute configuration from Day 5.
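For reference, the sticky cookie is configured on the Traefik IngressRoute's service definition. A sketch of the relevant fragment, assuming the Day 5 setup (the API version and port are assumptions; older Traefik releases use `traefik.containo.us/v1alpha1`):

```yaml
# Sketch — reconstructed from the cookie name used above,
# not the actual Day 5 manifest
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: jellyfin
  namespace: jellyfin
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`jellyfin.k3s.internal.strommen.systems`)
      kind: Rule
      services:
        - name: jellyfin
          port: 8096
          sticky:
            cookie:
              name: jellyfin_server_id
              httpOnly: true
```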
The Six Failover Tests
I designed six tests that cover the realistic failure modes for a homelab Jellyfin deployment:
| Test | Scenario | Expected Behavior |
|---|---|---|
| 1 | Kill idle pod | Surviving pod serves all traffic. No user impact. |
| 2 | Kill pod during active playback | Client reconnects to surviving pod. User re-authenticates. Playback resumes from last position. |
| 3 | Cordon + drain a node | Pod rescheduled to another node. Brief interruption during reschedule. |
| 4 | Kill PostgreSQL | Both pods degrade. Reads from cache continue briefly. Pods crash after connection retry exhaustion. |
| 5 | Kill Redis | Both pods continue serving traffic. Scheduled tasks may run on both pods until Redis recovers. |
| 6 | Rolling update (new image) | Pods restart one at a time. Zero-downtime deploy if client is on the pod that restarts second. |
Test 1: Kill Idle Pod
Procedure: Delete jellyfin-1 while no active playback is occurring.
```bash
kubectl delete pod jellyfin-1 -n jellyfin
```
Result: PASS
- `jellyfin-0` continues serving all traffic uninterrupted
- Clients pinned to `jellyfin-1` get routed to `jellyfin-0` on next request
- Those clients see a 401 (session not found), re-authenticate, and continue
- StatefulSet controller recreates `jellyfin-1` within 30 seconds
- New `jellyfin-1` remounts its existing Longhorn PVCs and NFS config
- Total disruption to users on `jellyfin-0`: zero
- Total disruption to users on `jellyfin-1`: one re-authentication
Test 2: Kill Pod During Active Playback
Procedure: Start playing a movie on the Android TV app (client pinned to jellyfin-0). Kill jellyfin-0 mid-stream.
```bash
# Verify which pod the client is using
kubectl logs -n jellyfin jellyfin-0 --tail=5 | grep -i playback

# Kill it
kubectl delete pod jellyfin-0 -n jellyfin
```
Result: PARTIAL PASS
- Playback stops immediately on the Android TV client
- The client shows "Unable to connect to server" for ~12 seconds (Traefik health check interval + client retry)
- Client reconnects to `jellyfin-1`
- Client prompts for re-authentication — saved credentials auto-fill on mobile, manual on Android TV
- After re-auth, the user returns to the home screen with "Continue Watching" intact (position loaded from PostgreSQL)
- User clicks Play — playback resumes from the saved position
- But: the FFmpeg transcode job that was running on `jellyfin-0` is gone. The client must start a new transcode on `jellyfin-1`. For a large 4K file, this means a ~5-10 second buffer before playback starts.
Why partial: the requirement was “playback resumes from last position.” It does — but there’s a ~20-second gap between pod death and playback resuming (12s reconnect + auth + 5-10s transcode start). This is significantly better than the current single-pod behavior (pod death = Jellyfin completely unavailable until reschedule, which can take 2-3 minutes), but it’s not seamless.
Test 3: Cordon and Drain Node
Procedure: Cordon the node running jellyfin-0, then drain it. Simulates planned maintenance.
```bash
kubectl cordon k3s-w01
kubectl drain k3s-w01 --ignore-daemonsets --delete-emptydir-data
```
Result: PASS
- `jellyfin-0` is evicted from `k3s-w01` and rescheduled to `k3s-w02`
- During reschedule (~30 seconds), clients on `jellyfin-0` fail over to `jellyfin-1`
- After reschedule, `jellyfin-0` is back on a new node with the same PVCs (Longhorn reattaches them)
- New sticky cookies are issued; traffic rebalances naturally as clients make new requests
- User experience: same as Test 1 — one re-authentication for affected clients
This is the primary use case. Node maintenance (OS updates, kernel upgrades, Proxmox patches) no longer means Jellyfin downtime. Once maintenance is done, `kubectl uncordon k3s-w01` returns the node to the scheduling pool.
Test 4: Kill PostgreSQL
Procedure: Delete the PostgreSQL pod.
```bash
kubectl delete pod jellyfin-postgres-0 -n jellyfin
```
Result: PASS (expected degradation)
- Both Jellyfin pods log `NpgsqlException: Connection refused`
- EF Core's `EnableRetryOnFailure(3)` retries 3 times over ~15 seconds
- During retry window: cached data (user list, sessions) serves existing authenticated requests
- After retry exhaustion: pods return 500 on any database-dependent endpoint
- StatefulSet controller restarts `jellyfin-postgres-0` within 20 seconds
- PostgreSQL pod reconnects to its Longhorn PVC — all data intact
- Jellyfin pods auto-reconnect on next retry cycle
- Total degradation window: ~30-40 seconds
- No data loss
This test validates the retry logic and confirms the S3 backup isn't needed for a simple pod restart (the Longhorn PVC survives intact). The degradation window matches expectations — PostgreSQL is a single point of failure by design, but it recovers quickly.
Test 5: Kill Redis
Procedure: Delete the Redis pod.
```bash
kubectl delete pod -n jellyfin -l app.kubernetes.io/component=cache
```
Result: PASS
- Leader election fails — circuit breaker opens after 3 attempts (~10 seconds)
- Both pods fall back to local task execution
- User-facing traffic: zero impact — Redis is not in the request path
- Background tasks: both pods run scheduled tasks (duplicate work, not data corruption)
- Redis Deployment controller restarts the pod within 15 seconds
- Leader election resumes — one pod reacquires leadership
- Duplicate tasks stop
This confirms the circuit breaker from Day 5 works as designed. Redis downtime is invisible to users.
Test 6: Rolling Update
Procedure: Push a new image tag and apply the updated StatefulSet.
```bash
# Update the image tag
kubectl set image statefulset/jellyfin \
  jellyfin=855878721457.dkr.ecr.us-east-1.amazonaws.com/k3s-homelab/jellyfin-ha:sha-abc1234 \
  -n jellyfin
```
Result: PASS
- StatefulSet controller updates pods in reverse ordinal order: `jellyfin-1` first, then `jellyfin-0`
- While `jellyfin-1` restarts, its clients fail over to `jellyfin-0`
- `jellyfin-1` comes up with the new image, becomes ready
- `jellyfin-0` restarts — its clients fail over to `jellyfin-1`
- `jellyfin-0` comes up, all pods running new image
- Clients on `jellyfin-0` (first group): one re-auth when failing to `jellyfin-1`, another when failing back (if they were re-routed during the window)
- Clients on `jellyfin-1` (second group): one re-auth when `jellyfin-1` restarts
The updateStrategy is the default RollingUpdate with partition: 0, which updates all pods. For a canary approach, you could set partition: 1 to update only jellyfin-1 first.
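A canary rollout would be a small change to the StatefulSet spec. A sketch of that variant (this is the standard StatefulSet field, not what the cluster currently runs):

```yaml
# Sketch — partition: 1 means only ordinals >= 1 (jellyfin-1) receive
# the new image; jellyfin-0 stays on the old one until you lower
# the partition back to 0
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1
```

With two replicas the canary window is narrow, but it lets you watch `jellyfin-1` serve real traffic on the new image before committing `jellyfin-0`.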
Results Summary
| Test | Scenario | Result | User Impact |
|---|---|---|---|
| 1 | Kill idle pod | PASS | One re-auth for affected clients |
| 2 | Kill pod during playback | PARTIAL | ~20s gap: reconnect + re-auth + transcode restart |
| 3 | Node cordon + drain | PASS | One re-auth during reschedule |
| 4 | Kill PostgreSQL | PASS | ~30-40s degradation, auto-recovery |
| 5 | Kill Redis | PASS | Zero user impact |
| 6 | Rolling update | PASS | One re-auth per client during rollout |
Five clean passes and one partial. The partial (Test 2) is a fundamental limitation of FFmpeg transcoding — the process is tied to the pod. Solving this would require distributed transcoding, which is Track B+ territory and out of scope.
Prometheus Monitoring
With two replicas running, we need alerts for scenarios the failover tests validated:
ServiceMonitor
Both Jellyfin pods expose /metrics for Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jellyfin
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: jellyfin
      app.kubernetes.io/component: web
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
PrometheusRule Alerts
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jellyfin-alerts
  namespace: jellyfin
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: jellyfin.rules
      rules:
        - alert: JellyfinReplicaCountLow
          expr: |
            kube_statefulset_status_replicas_ready{
              namespace="jellyfin",
              statefulset="jellyfin"
            } < 2
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin has fewer than 2 ready replicas"
            description: >
              Only {{ $value }} Jellyfin replica(s) are ready.
              HA failover capacity is degraded.

        - alert: JellyfinPostgresDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-postgres.*"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Jellyfin PostgreSQL is down"
            description: >
              The PostgreSQL pod is not responding.
              Both Jellyfin replicas will degrade within ~15 seconds.

        - alert: JellyfinRedisDown
          expr: |
            up{namespace="jellyfin", pod=~"jellyfin-redis.*"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin Redis is down"
            description: >
              Redis is unavailable. Task leader election has
              fallen back to local execution. Both pods may run
              duplicate scheduled tasks.

        - alert: JellyfinHighRestartRate
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="jellyfin",
              container="jellyfin"
            }[1h]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Jellyfin pod restarting frequently"
            description: >
              {{ $labels.pod }} has restarted {{ $value }}
              times in the last hour.
```
Key Alerts
- `JellyfinReplicaCountLow` fires when only 1 replica is ready for 2+ minutes. This catches extended failover scenarios where the StatefulSet controller can't reschedule (e.g., no node with enough resources, anti-affinity can't be satisfied).
- `JellyfinPostgresDown` fires after 1 minute at `critical` severity. PostgreSQL downtime is the most impactful failure — both pods degrade.
- `JellyfinRedisDown` fires after 5 minutes at `warning`. Redis downtime causes duplicate task execution but no user impact. The longer `for` duration avoids alerting on brief restarts.
- `JellyfinHighRestartRate` catches CrashLoopBackOff scenarios where a pod is repeatedly restarting due to a bug or misconfiguration.
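The severity labels only matter if Alertmanager routes them differently. A hedged sketch of how that routing could look — the receiver names are placeholders, and the post doesn't show the actual Alertmanager config:

```yaml
# Sketch — placeholder receivers; critical pages immediately,
# warning goes to a low-urgency channel
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: chat
      repeat_interval: 12h
```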
The Grafana Dashboard
A dedicated Jellyfin HA dashboard shows:
- Replica status: gauge showing 0/1/2 ready replicas with color coding
- Pod CPU and memory: per-pod resource usage
- PostgreSQL connections: active connection count from both pods
- Redis operations: leader election attempts and circuit breaker state
- Request rate: per-pod HTTP request rate (verifies traffic is balanced with sticky sessions)
- Transcode jobs: active FFmpeg processes per pod
The dashboard is provisioned as a ConfigMap with the grafana_dashboard=1 label, auto-loaded by the Grafana sidecar — the same pattern used for every other dashboard in the cluster.
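The skeleton of that ConfigMap might look like the following sketch — the ConfigMap name and file key are assumptions; only the `grafana_dashboard` label is load-bearing, since it's what the Grafana sidecar watches for:

```yaml
# Sketch — name and key are illustrative; paste the full dashboard
# JSON exported from Grafana into the data block
apiVersion: v1
kind: ConfigMap
metadata:
  name: jellyfin-ha-dashboard
  namespace: jellyfin
  labels:
    grafana_dashboard: "1"
data:
  jellyfin-ha.json: |
    {"title": "Jellyfin HA", "panels": []}
```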
Coming Up Next
Tomorrow: what’s still broken and what comes next — an honest assessment of what this project didn’t solve, the limitations of the Track A approach, and what Track B would look like if we revisited it.
Browse the code: All Kubernetes manifests — StatefulSet, IngressRoute, ServiceMonitor, PrometheusRule — are in the Jellyfin fork at github.com/zolty-mat/jellyfin. Infrastructure manifests will be published to the cluster repo once secrets are remediated.
Don’t have a cluster? You can run this entire test suite on a managed Kubernetes cluster. DigitalOcean Kubernetes with $200 in free credits is enough for a 3-node cluster to replicate every test.