Benchmarking every subsystem on four Lenovo M920q Proxmox hosts — NVMe, CPU, memory, and 10GbE network

Benchmarking Every Subsystem: NVMe, CPU, Memory, and 10GbE on Four Proxmox Hosts

TL;DR Prometheus and Grafana both crashed with I/O errors on the same node. Before assuming software, I ran a full hardware audit across all four Proxmox hosts — SMART health, NVMe disk benchmarks (fio), CPU benchmarks (sysbench), memory bandwidth tests, and 10GbE network throughput (iperf3). The result: all hardware is healthy. The I/O errors were Longhorn CSI virtual block device corruption, not physical disk failure. Along the way, I established baseline performance numbers for every subsystem and discovered that custom cooling makes a dramatic difference in thermal performance. ...

February 22, 2026 · 11 min · zolty
Natural language media requests via Jellyseerr

I Am Zolty: Building a Natural Language Media Request System

TL;DR Jellyseerr already knows what I have. Radarr and Sonarr already know how to find things. The missing piece was a front door that understood intent instead of requiring me to search for specific titles. I wired Jellyseerr’s REST API to Claude and gave it a system prompt that knows my taste profile. Now I can say “download 100GB of family-friendly anime I might like” and get a queue of requests back. A Kubernetes CronJob runs the same prompt on a schedule so the library grows without me thinking about it. ...

February 21, 2026 · 5 min · zolty
10GbE networking upgrade

10GbE Networking on a Budget: Mellanox ConnectX-3 and Bricked NICs

TL;DR I upgraded the inter-node networking from 1GbE to 10GbE using Mellanox ConnectX-3 NICs from eBay (~$15-20 each). Two of the three NICs worked immediately. The third needed a firmware flash from 2.33.5220 to 2.42.5000 using mstflint, which required two cold boots to recover from a partially bricked state. All three Proxmox hosts now have 10GbE connectivity via active-backup bonds. I also deployed Security Scanner infrastructure and fixed ARC runner Pod Security Standards. ...

February 20, 2026 · 6 min · zolty
Monitoring stack

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics. The Monitoring Architecture ┌──────────────────────────────────────────────────┐ │ Grafana │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Metrics │ │ Logs │ │ Alerts │ │ │ │ Explorer │ │ Explorer │ │ Rules │ │ │ └──────┬───┘ └──────┬───┘ └──────┬───┘ │ └─────────┼──────────────┼─────────────┼───────────┘ │ │ │ ┌─────┴─────┐ ┌─────┴─────┐ │ │Prometheus │ │ Loki │ │ │ (metrics) │ │ (logs) │ │ └─────┬─────┘ └─────┬─────┘ │ │ │ ┌─────┴──────┐ ┌──────┴──────┐ ┌─────┴────┐ │AlertManager│ │ Exporters │ │Promtail │ │ → Slack │ │ node │ │(log │ └────────────┘ │ kube-state │ │ shipper) │ │ cAdvisor │ └──────────┘ │ custom │ └─────────────┘ kube-prometheus-stack The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...

February 19, 2026 · 6 min · zolty
GPU passthrough

GPU Passthrough on Proxmox for Hardware Transcoding

TL;DR Intel iGPU passthrough from Proxmox to a k3s VM enables hardware video transcoding with minimal CPU overhead. This guide covers the complete process: enabling IOMMU, configuring VFIO, blacklisting the i915 driver, rebuilding the VM with q35/OVMF, and verifying VA-API inside the VM. The Intel UHD 630 in the M920q handles H.264 and HEVC encode/decode at real-time speeds. ...

February 18, 2026 · 5 min · zolty
Media stack on Kubernetes

Building a Complete Media Stack with Kubernetes

TL;DR The media stack is now fully automated: content gets sourced, synced from a remote seedbox to the local NAS via an rclone CronJob, organized by Radarr/Sonarr, and served by Jellyfin with Intel iGPU hardware transcoding. I also deployed a Media Controller for lifecycle management and a Media Profiler for content analysis. This post covers the full pipeline from acquisition to playback. The Media Pipeline Jellyseerr (request) │ ▼ Radarr/Sonarr (search + organize) │ ▼ Prowlarr (indexer management) │ ▼ Seedbox (remote acquisition) │ ▼ rclone-sync CronJob (seedbox → NAS) │ ▼ NAS (TrueNAS, VLAN 30) │ ▼ Jellyfin (playback + GPU transcode) Each component runs as a Kubernetes deployment in the media namespace. ...

February 17, 2026 · 5 min · zolty
VLAN migration

VLAN Migration: Moving a Live Kubernetes Cluster Without Downtime

TL;DR Today was the biggest infrastructure day yet. I migrated the entire k3s cluster from a flat network to a proper VLAN architecture: Server VLAN 20 for k3s nodes and services, Storage VLAN 30 for the NAS, and the existing default VLAN 1 for clients. This involved changing IPs on all VMs, updating MetalLB, reconfiguring Traefik, and recovering from an etcd quorum loss when I moved too many nodes at once. I also deployed the media stack (Jellyfin, Radarr, Sonarr, Prowlarr, Jellyseerr) and configured Intel iGPU passthrough infrastructure. ...

February 16, 2026 · 6 min · zolty
Production failures and lessons

Top 10 Production Failures and What I Learned

TL;DR After one week of operating this cluster with real workloads, I have accumulated a healthy list of production failures. Each one taught me something about Kubernetes, infrastructure, or my own assumptions. Here are the top 10, ranked by how much time they cost to investigate and fix. 1. The Longhorn S3 Backup Credential Rotation Impact: All Longhorn backups silently failed for 12 hours. What happened: I rotated the IAM credentials used for S3 backups and updated the Kubernetes secret. But Longhorn caches credentials at startup — it does not re-read the secret dynamically. All backup jobs continued using the stale credentials and failing silently. ...

February 15, 2026 · 6 min · zolty
AI-powered alert analysis

Building an AI-Powered Alert System with AWS Bedrock

TL;DR Today I deployed two significant additions to the cluster: an AI-powered Alert Responder that uses AWS Bedrock (Amazon Nova Micro) to analyze Prometheus alerts and post remediation suggestions to Slack, and a multi-user dev workspace with per-user environments. I also hardened the cluster by constraining all workloads to the correct architecture nodes and fixing arm64 scheduling issues. The Alert Responder Running 13+ applications on a homelab cluster means alerts fire regularly. Most are straightforward — high memory, restart loops, certificate expiry warnings — but analyzing each one, determining root cause, and knowing the right remediation command gets tedious, especially at 2 AM. ...

February 14, 2026 · 5 min · zolty
Microservices architecture

Deploying a Microservices Architecture on k3s

TL;DR Today I deployed the most architecturally complex application on the cluster: a video service platform with a Vue.js frontend, 7 FastAPI backend microservices, NATS for messaging, PostgreSQL for persistence, and Redis for caching. This post covers the deployment patterns for NATS-based microservices on k3s and the RBAC fixes needed for Helm-based deployments. The Application Architecture The video service platform is a full microservices stack: ┌──────────────┐ │ Vue.js │ Frontend SPA │ Frontend │ └──────┬───────┘ │ HTTP/REST ┌──────┴───────────────────────────────────────┐ │ API Gateway │ └──────┬───────────────────────────────────────┘ │ ┌──────┴───────────────────────────────────────┐ │ FastAPI Microservices │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │Auth │ │Video│ │Media│ │Queue│ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │Stats│ │User │ │Notif│ │ │ └─────┘ └─────┘ └─────┘ │ └──────────────────────────────────────────────┘ │ │ │ ┌────┴────┐ ┌────┴────┐ ┌────┴────┐ │PostgreSQL│ │ NATS │ │ Redis │ └─────────┘ └─────────┘ └─────────┘ Seven FastAPI services communicate via NATS for asynchronous messaging and Redis for shared state. PostgreSQL handles persistent data. ...

February 13, 2026 · 5 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.