
Ditching AWS ECR for Self-Hosted Harbor: Why and How

TL;DR AWS ECR authorization tokens expire every 12 hours. Every time the cron job that refreshes the pull secret fails, image pulls break cluster-wide. Docker Hub’s anonymous rate limit (100 pulls per 6 hours) started biting during CI builds that pull nginx:alpine and python:3.12-slim. I replaced both with self-hosted Harbor for container images and Gitea for package registries (PyPI, npm), backed by NFS on the NAS, deployed via Ansible and Helm, with Trivy vulnerability scanning on push. Thirteen CI workflows were updated in a single commit. Pull secrets never expire. Pulls never rate-limit. Monthly ECR cost drops to zero. ...
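
What makes the replacement durable is that Harbor robot accounts can be created with no expiry, so the pull secret is generated once and reused forever. A minimal Python sketch of building the .dockerconfigjson payload for such a secret (the hostname, robot name, and password are placeholders, not the values from this deployment):

```python
import base64
import json

# Placeholder values: substitute your Harbor hostname and a robot
# account created under the target project. Robot accounts can be
# given no expiry, which is what makes the secret permanent.
REGISTRY = "harbor.lab.example"
USERNAME = "robot$ci-puller"
PASSWORD = "changeme"

auth = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
dockerconfig = {"auths": {REGISTRY: {"auth": auth}}}

# This JSON goes into the .dockerconfigjson key of a
# kubernetes.io/dockerconfigjson Secret; unlike an ECR token,
# it never needs a refresh cron job.
print(base64.b64encode(json.dumps(dockerconfig).encode()).decode())
```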

March 21, 2026 · 5 min · zolty

One Month Retrospective: From Bare Metal to Production Platform

TL;DR One month ago, I had three empty Lenovo ThinkCentre M920q mini PCs and a Proxmox installer USB. Today, the cluster runs 8 Kubernetes nodes, 15+ applications, full observability with Prometheus and Grafana, AI-powered alert analysis, self-hosted CI/CD, 10GbE networking, and a 3D printer fabricating custom hardware. Total hardware cost: under $800. This post traces the entire journey, day by day, including the things that went wrong. ...

March 21, 2026 · 10 min · zolty

Jellyfin HA on Kubernetes: Redis-Backed Transcode Session Failover

TL;DR Jellyfin dies mid-stream when a Kubernetes pod restarts because all transcode state is in-memory. I forked it, added a Redis-backed ITranscodeSessionStore, and wired in atomic lease-based pod takeover. The fork is at github.com/ZoltyMat/jellyfin-ha, and I also published a repo-level diff document at docs/FORK-DIFF.md showing exactly what changed versus upstream Jellyfin. Single-instance deployments need zero config changes: without Redis configured, the fork transparently falls back to a no-op store.

The Problem

Jellyfin is great. It’s also built with the assumption that exactly one server instance is running at a time. Transcode state — which pods are running FFmpeg, what segments have been written, who owns a given play session — lives entirely in memory. When the process dies, that state is gone. ...
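
The fork itself is C#, but the takeover primitive is easy to sketch. Here is the lease idea in Python with redis-py, assuming a reachable Redis service; the key prefix and the 30-second TTL are illustrative, not the fork’s actual values:

```python
import redis

r = redis.Redis(host="redis", port=6379)  # hypothetical service name

LEASE_TTL = 30  # seconds; the owning pod must renew before expiry

def try_acquire_session(session_id: str, pod_name: str) -> bool:
    """Atomically claim a transcode session.

    SET ... NX EX succeeds only if no other pod holds the lease,
    so exactly one replica owns a session at a time.
    """
    return bool(r.set(f"transcode:lease:{session_id}", pod_name,
                      nx=True, ex=LEASE_TTL))

def renew_lease(session_id: str, pod_name: str) -> bool:
    """Extend the lease only if this pod still owns it (atomic via Lua)."""
    script = """
    if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return 0
    """
    return bool(r.eval(script, 1, f"transcode:lease:{session_id}",
                       pod_name, LEASE_TTL))
```

The TTL is what makes takeover work: a dead pod stops renewing, the key expires, and a surviving replica’s try_acquire_session starts succeeding.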

March 14, 2026 · 7 min · zolty

What's Still Broken and What Comes Next

TL;DR Over the last six posts, I’ve documented converting Jellyfin from a single-process media server into a two-replica, PostgreSQL-backed, sticky-session-coordinated deployment on k3s. Five of six failover tests passed cleanly. The key result: zero-downtime failover — killing a pod doesn’t take down the service. Users on the surviving replica see no interruption; displaced users reconnect in seconds. Node maintenance no longer kills Jellyfin for the household. But this project isn’t finished, and some problems can’t be solved with this architecture. This final post is an honest inventory of what’s still broken, what was deferred, and what the path forward looks like. ...

March 12, 2026 · 10 min · zolty

Scaling to Two Replicas and Failover Testing

TL;DR This is the moment everything was built for. Three phases of preparation — PostgreSQL provider (Day 3), storage migration (Day 4), state externalization (Day 5) — all leading to a single kubectl scale command. This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and one test that only partially passed. The headline result: killing a pod causes zero service downtime — users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds. ...
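
A rough harness for that kind of test can be sketched in a few lines of Python: kill one replica, poll the service, report the gap. The namespace, pod name, and health URL below are assumptions, not the post’s exact setup:

```python
import subprocess
import time

import requests

URL = "https://jellyfin.lab.example/health"  # hypothetical ingress host

# Delete one replica without waiting, then measure how long (if at all)
# the service stops answering. With working failover the loop should
# exit on the first poll.
subprocess.run(["kubectl", "-n", "media", "delete", "pod", "jellyfin-0",
                "--wait=false"], check=True)

start = time.monotonic()
while True:
    try:
        if requests.get(URL, timeout=2).ok:
            break
    except requests.RequestException:
        pass
    time.sleep(0.5)

print(f"observed gap: {time.monotonic() - start:.1f}s")
```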

March 11, 2026 · 10 min · zolty

State Externalization and the Sticky Session Compromise

TL;DR Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state — sessions, users, devices, tasks — in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces mean two independent views of reality. This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server. ...
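
To make the externalization pattern concrete, here is a toy Python stand-in for one of those caches: the same dictionary surface, but backed by a Redis hash so both pods share one view. An illustration only; the fork’s real store is C# and this key layout is invented:

```python
import json

import redis

class RedisDict:
    """Dict-like cache whose state lives in Redis instead of pod memory,
    so every replica reads and writes the same view."""

    def __init__(self, r: redis.Redis, name: str):
        self.r = r
        self.key = f"jellyfin:{name}"  # invented key layout

    def __setitem__(self, field: str, value: dict) -> None:
        self.r.hset(self.key, field, json.dumps(value))

    def __getitem__(self, field: str) -> dict:
        raw = self.r.hget(self.key, field)
        if raw is None:
            raise KeyError(field)
        return json.loads(raw)

sessions = RedisDict(redis.Redis(host="redis"), "sessions")
sessions["device-42"] = {"user": "zolty", "position_s": 1312}
```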

March 10, 2026 · 11 min · zolty

Storage Refactoring and the SQLite-to-PostgreSQL Migration

TL;DR Phase 2 is the scariest phase. It’s where we take a running Jellyfin instance with years of playback history, user preferences, and media metadata — then swap the database from SQLite to PostgreSQL and restructure every volume. One wrong move and the family discovers their “Continue Watching” list is gone. This post covers deploying PostgreSQL as a k3s StatefulSet, restructuring Jellyfin’s volume layout from a monolithic RWO PVC to NFS shared config + Longhorn per-pod storage, and building a SQLite-to-PostgreSQL migration tool. ...
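
The core of a migration tool like that is a copy loop. A minimal Python sketch, assuming psycopg2 and made-up connection details; the table and column names are illustrative, not Jellyfin’s exact schema:

```python
import sqlite3

import psycopg2

# Copy one table row-for-row from the SQLite file into PostgreSQL.
# The real tool also has to map SQLite's loose typing onto proper
# PostgreSQL columns, and it runs with Jellyfin stopped.
src = sqlite3.connect("library.db")
dst = psycopg2.connect(host="postgres.media.svc", dbname="jellyfin",
                       user="jellyfin", password="changeme")

rows = src.execute(
    "SELECT ItemId, UserId, PlaybackPositionTicks FROM UserDatas"
).fetchall()

with dst, dst.cursor() as cur:  # connection context commits on success
    cur.executemany(
        "INSERT INTO user_datas (item_id, user_id, playback_position_ticks) "
        "VALUES (%s, %s, %s)",
        rows,
    )
```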

March 9, 2026 · 8 min · zolty

Why Jellyfin Can't Scale (And What We're Going to Do About It)

TL;DR Jellyfin is a fantastic open-source media server. It is also, architecturally, a single-process application that assumes it’s the only instance running. SQLite as the database. Eleven ConcurrentDictionary caches holding sessions, users, devices, and task queues in memory. A file-based config directory that gets written to at runtime. None of this survives a second pod. This is the first post in a seven-part series documenting how I converted Jellyfin into a highly available, multi-replica deployment on my home k3s cluster. The project spans two repositories, four phases, ~20 GitHub Issues executed by AI agents, and a live failover demo where I killed a pod and the service continued with zero downtime — users on the surviving replica never saw an interruption. ...
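
A toy illustration of the core failure (in Python rather than Jellyfin’s C#): each replica keeps its own session map, so state written on one pod simply does not exist on the other:

```python
# Two replicas, two private in-memory session maps.
pod_a_sessions: dict[str, str] = {}
pod_b_sessions: dict[str, str] = {}

# A login request lands on pod A and is recorded there only.
pod_a_sessions["token-123"] = "zolty"

# The load balancer sends the next request to pod B, which has
# never heard of the token: the user appears logged out.
print(pod_b_sessions.get("token-123"))  # -> None
```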

March 6, 2026 · 9 min · zolty

Seven Alerts, Three Bugs, One AI Debug Session: A Kubernetes Incident Report

TL;DR A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale manually-created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge. ...

March 4, 2026 · 8 min · zolty

Self-Hosted AI Chat: Open WebUI, LiteLLM, and AWS Bedrock on k3s

TL;DR I deployed a private, self-hosted ChatGPT alternative on the homelab k3s cluster. Open WebUI provides a polished chat interface. LiteLLM acts as a proxy that translates the OpenAI API into AWS Bedrock’s Converse API. Four models are available: Claude Sonnet 4, Claude Haiku 4.5, Amazon Nova Micro, and Amazon Nova Lite. Authentication is handled by the existing OAuth2 Proxy — no additional SSO configuration needed. The whole stack runs in three pods consuming under 500MB of RAM, and the only ongoing cost is per-request Bedrock pricing. No API keys from OpenAI or Anthropic required. ...
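
Because LiteLLM speaks the OpenAI wire protocol, any OpenAI-compatible client can talk to the stack. A minimal sketch with the official openai Python package; the in-cluster URL, key, and model alias are assumptions, not the deployment’s actual values:

```python
from openai import OpenAI  # LiteLLM exposes an OpenAI-compatible endpoint

# Hypothetical in-cluster service and virtual key; LiteLLM forwards
# the request to Bedrock's Converse API behind the scenes.
client = OpenAI(base_url="http://litellm.ai.svc:4000/v1",
                api_key="sk-litellm-changeme")

resp = client.chat.completions.create(
    model="claude-haiku-4.5",  # whatever alias the proxy config defines
    messages=[{"role": "user", "content": "Say hi from the homelab."}],
)
print(resp.choices[0].message.content)
```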

March 4, 2026 · 8 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.