Jellyfin failover testing on k3s

Scaling to Two Replicas and Failover Testing

TL;DR This is the moment everything was built for. Three phases of preparation — PostgreSQL provider (Day 3), storage migration (Day 4), state externalization (Day 5) — all leading to a single kubectl scale command. This post covers Phase 4: scaling the Jellyfin StatefulSet to 2 replicas, configuring anti-affinity to spread pods across nodes, running six structured failover tests, building Prometheus alerts, and examining the one test that only partially passed. The headline result: killing a pod causes zero service downtime — users on the surviving replica experience no interruption at all, and displaced users reconnect within seconds. ...

March 11, 2026 · 10 min · zolty
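The anti-affinity arrangement described in the teaser can be sketched as a pod template fragment like the following. This is a hypothetical illustration — the `app: jellyfin` label and the hard (`required`) constraint are assumptions, not the series' actual manifests:

```yaml
# Hypothetical fragment: force Jellyfin replicas onto different nodes.
# Label name and topology key are illustrative assumptions.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: jellyfin
        topologyKey: kubernetes.io/hostname
```

With a constraint like this in place, `kubectl scale statefulset jellyfin --replicas=2` schedules each replica on a distinct node, so losing one node cannot take out both pods.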
Jellyfin state externalization architecture

State Externalization and the Sticky Session Compromise

TL;DR Phase 3 is where the rubber meets the road. We have PostgreSQL for persistent data (Day 4) and NFS for shared config. But Jellyfin still holds critical runtime state — sessions, users, devices, tasks — in 11 ConcurrentDictionary instances scattered across singleton managers. Two pods with independent memory spaces mean two independent views of reality. This post covers the state externalization decision: what got moved to Redis, what got solved by sticky sessions, what got disabled entirely, and why pragmatism beat perfection for a homelab media server. ...

March 10, 2026 · 11 min · zolty
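The "moved to Redis" half of that decision boils down to serializing what used to live in an in-process dictionary. A minimal sketch, assuming a JSON-over-Redis layout — the key format, TTL, and field names here are illustrative, not Jellyfin's actual schema:

```python
import json
import time

# Assumed TTL after which an abandoned session expires from Redis.
SESSION_TTL = 3600  # seconds

def session_key(session_id: str) -> str:
    # Hypothetical key layout: one Redis key per session.
    return f"session:{session_id}"

def serialize_session(session: dict) -> str:
    # A JSON blob in Redis survives a pod restart; an entry in an
    # in-memory ConcurrentDictionary does not.
    return json.dumps({**session, "updated_at": time.time()})

def deserialize_session(raw: str) -> dict:
    return json.loads(raw)

# With a real client (e.g. redis-py), each write would be roughly:
#   r.set(session_key(sid), serialize_session(s), ex=SESSION_TTL)
```

Either pod can then rehydrate a session by key, which is what lets a displaced user land on the surviving replica and keep going.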
Jellyfin storage architecture diagram

Storage Refactoring and the SQLite-to-PostgreSQL Migration

TL;DR Phase 2 is the scariest phase. It’s where we take a running Jellyfin instance with years of playback history, user preferences, and media metadata — then swap the database from SQLite to PostgreSQL and restructure every volume. One wrong move and the family discovers their “Continue Watching” list is gone. This post covers deploying PostgreSQL as a k3s StatefulSet, restructuring Jellyfin’s volume layout from a monolithic RWO PVC to NFS shared config + Longhorn per-pod storage, and building a SQLite-to-PostgreSQL migration tool. ...

March 9, 2026 · 8 min · zolty
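The core loop of a SQLite-to-PostgreSQL migration tool is simple to sketch: read each table out of SQLite and replay the rows as parameterized INSERTs against PostgreSQL. This is a minimal illustration under assumed names — the `users` table and column set are hypothetical, not Jellyfin's real schema, and the `%s` placeholders assume a psycopg-style driver on the PostgreSQL side:

```python
import sqlite3

def rows_to_inserts(conn: sqlite3.Connection, table: str):
    """Read every row of a SQLite table and pair it with a
    PostgreSQL-style parameterized INSERT statement."""
    cur = conn.execute(f"SELECT * FROM {table}")  # sketch only; table name is trusted
    columns = [d[0] for d in cur.description]
    placeholders = ", ".join(["%s"] * len(columns))  # psycopg-style params
    sql = f'INSERT INTO {table} ({", ".join(columns)}) VALUES ({placeholders})'
    return [(sql, tuple(row)) for row in cur]

# Worked example against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
inserts = rows_to_inserts(conn, "users")
```

A real tool also has to handle type coercion (SQLite's dynamic typing vs. PostgreSQL's strict types), identity columns, and foreign-key ordering, which is where most of the actual migration effort goes.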
Jellyfin PostgreSQL database provider architecture

Forking Jellyfin: A PostgreSQL Database Provider in .NET 10

TL;DR Jellyfin stores everything in SQLite. Metadata, users, activity logs, authentication — all of it lives in .db files that lock under concurrent access. To run multiple replicas, we need a real network-accessible database. This post covers Phase 1 of the HA conversion: forking Jellyfin, designing a pluggable database provider interface, implementing it for PostgreSQL with Npgsql, generating EF Core migrations, writing integration tests with Testcontainers, and building a custom Docker image. ...

March 8, 2026 · 8 min · zolty
Multi-model AI planning workflow diagram

Multi-Model Planning: The Same Pattern That Shipped dnd-multi

TL;DR The Jellyfin HA conversion touches a .NET 10 codebase, Entity Framework Core migrations, Kubernetes manifests, Terraform infrastructure, PostgreSQL operations, and FFmpeg transcoding pipelines. No single AI model understands all of this equally well. So I used four of them — the same multi-model planning pattern that shipped dnd-multi in a single day and that I documented in the LLM GitHub PR workflow. This post covers how I adapted that pattern for infrastructure work, what each model caught, and why planning is where all the human time should go. ...

March 7, 2026 · 7 min · zolty
Jellyfin single-instance architecture diagram

Why Jellyfin Can't Scale (And What We're Going to Do About It)

TL;DR Jellyfin is a fantastic open-source media server. It is also, architecturally, a single-process application that assumes it’s the only instance running. SQLite as the database. Eleven ConcurrentDictionary caches holding sessions, users, devices, and task queues in memory. A file-based config directory that gets written to at runtime. None of this survives a second pod. This is the first post in a seven-part series documenting how I converted Jellyfin into a highly available, multi-replica deployment on my home k3s cluster. The project spans two repositories, four phases, ~20 GitHub Issues executed by AI agents, and a live failover demo where I killed a pod and the service continued with zero downtime — users on the surviving replica never saw an interruption. ...

March 6, 2026 · 9 min · zolty
AI-driven Kubernetes incident response — seven alerts resolved

Seven Alerts, Three Bugs, One AI Debug Session: A Kubernetes Incident Report

TL;DR A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale, manually created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge. ...

March 4, 2026 · 8 min · zolty
Self-hosted AI chat deployment

Self-Hosted AI Chat: Open WebUI, LiteLLM, and AWS Bedrock on k3s

TL;DR I deployed a private, self-hosted ChatGPT alternative on the homelab k3s cluster. Open WebUI provides a polished chat interface. LiteLLM acts as a proxy that translates the OpenAI API into AWS Bedrock’s Converse API. Four models are available: Claude Sonnet 4, Claude Haiku 4.5, Amazon Nova Micro, and Amazon Nova Lite. Authentication is handled by the existing OAuth2 Proxy — no additional SSO configuration needed. The whole stack runs in three pods consuming under 500MB of RAM, and the only ongoing cost is per-request Bedrock pricing. No API keys from OpenAI or Anthropic required. ...

March 4, 2026 · 8 min · zolty
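The OpenAI-to-Bedrock translation in that stack is driven by LiteLLM's proxy config. A hedged sketch of what such a config might look like — the `model_name` aliases are arbitrary, and the Bedrock model identifiers are placeholders rather than the exact IDs the post used:

```yaml
# Hypothetical LiteLLM proxy config fragment. Clients speak the OpenAI
# API to the proxy using the model_name aliases; LiteLLM forwards each
# request to the mapped Bedrock model via the Converse API.
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: bedrock/anthropic.claude-sonnet-4   # substitute the real Bedrock model ID
  - model_name: nova-micro
    litellm_params:
      model: bedrock/amazon.nova-micro-v1:0      # placeholder Bedrock model ID
```

Because Open WebUI only ever sees an OpenAI-compatible endpoint, swapping or adding Bedrock models is a config change on the proxy, not a change to the chat frontend.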
AI coding governance framework for engineering teams

Governing AI Coding Tools Across an Engineering Team

TL;DR AI coding tools are now default behavior for most developers, not an experiment. If you manage a team and you haven’t formalized this, you have ungoverned spend, security exposure, and inconsistent behavior happening right now. The fix isn’t to take the tools away — it’s to pick one, pay for it centrally, encode your policies into the AI itself using instruction files and skills, and govern the control folder rather than individual usage. Here’s the framework I’d implement. ...

March 3, 2026 · 10 min · zolty
AI failure patterns and guardrails

When the AI Breaks Production: Failure Patterns, Guardrails, and Measuring What Works

TL;DR AI tools have caused multiple production incidents in this cluster. The AI alert responder agent alone generated 14 documented failure patterns before it became reliable. A security scanner deployed by AI applied restricted PodSecurity labels to every namespace, silently blocking pod creation for half the applications in the cluster. The service selector trap — where AI routes 50% of requests to PostgreSQL instead of the application — appeared in 4 separate incidents before guardrails stopped it. This post catalogs the failure patterns, the five-layer guardrail architecture built to prevent them, and an honest assessment of what still goes wrong. ...

March 2, 2026 · 14 min · zolty
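The service selector trap mentioned in the teaser is easy to reproduce: a Service whose selector matches a label shared by both the application pods and the database pods will load-balance across all of them. A hypothetical illustration, with made-up names:

```yaml
# Hypothetical illustration of the selector trap. If both the app
# Deployment and the PostgreSQL pods carry app: myapp, this Service
# round-robins across them, and roughly half of HTTP requests land
# on PostgreSQL and fail.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp          # matches the app pods AND the postgres pods
  ports:
    - port: 80
      targetPort: 8080
# Fix: give the database pods a distinct label (e.g. app: myapp-postgres)
# or narrow the selector with an extra label such as component: web.
```

A guardrail that diffs a Service's selector against the labels of every pod it would match catches this class of mistake before it ships.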

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.