Storage

Tiered model storage with MinIO and rclone: keep the SSD hot, archive the rest

TL;DR Stable Diffusion 3.5 Large is 15 GB. RealVisXL is 6.5 GB. Throw in a few LoRAs and a VAE, and your SSD hits the wall fast. I run a MinIO bucket as the long-tail model store, sync it to a local overflow directory on a 30-minute schedule via rclone, and register both the hot (SSD) and cold (synced overflow) paths in ComfyUI’s extra_model_paths.yaml. Models appear transparently; the loader searches both tiers. A fresh model lands in MinIO, appears locally within 30 minutes, and ComfyUI finds it without any manual shuffling. ...
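
To make the "loader searches both tiers" idea concrete, here is a minimal Python sketch of a hot-then-cold lookup. It is an illustration only, not ComfyUI's actual loader: the function name `find_model` and the directory paths are hypothetical, and the real tier registration happens in extra_model_paths.yaml as described above.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_model(name: str, tiers: Iterable[Path]) -> Optional[Path]:
    """Return the first existing file called `name`, checking tiers in order.

    Listing the hot (SSD) directory before the cold (synced overflow)
    directory means the SSD copy wins when a model exists in both.
    """
    for tier in tiers:
        candidate = tier / name
        if candidate.is_file():
            return candidate
    return None  # not present in any tier


if __name__ == "__main__":
    # Hypothetical paths: adjust to your own layout.
    hot = Path("/mnt/ssd/models/checkpoints")
    cold = Path("/mnt/overflow/models/checkpoints")
    print(find_model("example.safetensors", [hot, cold]))
```

The order of the list is the whole design: a model that has been promoted to the SSD shadows its archived copy, and anything that exists only in the overflow directory is still found.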

Storage Refactoring and the SQLite-to-PostgreSQL Migration

TL;DR Phase 2 is the scariest phase. It’s where we take a running Jellyfin instance with years of playback history, user preferences, and media metadata — then swap the database from SQLite to PostgreSQL and restructure every volume. One wrong move and the family discovers their “Continue Watching” list is gone. This post covers deploying PostgreSQL as a k3s StatefulSet, restructuring Jellyfin’s volume layout from a monolithic RWO PVC to NFS shared config + Longhorn per-pod storage, and building a SQLite-to-PostgreSQL migration tool. ...

Monitoring goes blind — Longhorn storage corruption incident report

TL;DR Grafana went completely dark for about 26 hours on my home k3s cluster. Two things broke simultaneously: Loki entered CrashLoopBackOff, and Prometheus silently stopped ingesting metrics while its pods showed as healthy and 2/2 Running the whole time. The actual cause was Longhorn’s auto-balancer migrating replicas onto a freshly added cluster node (k3s-agent-4) that had unstable storage during its first 48 hours. The replica I/O errors propagated directly into the workloads, corrupting files that were mid-write: a Prometheus WAL segment and a Loki TSDB index file. Both services needed offline surgery via a busybox pod to delete the corrupted files before they could recover. ...