Monitoring stack

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics.

The Monitoring Architecture

[Architecture diagram: Grafana on top with metrics, logs, and alert views; Prometheus (metrics) and Loki (logs) as its data sources; AlertManager routing alerts to Slack; node, kube-state, cAdvisor, and custom exporters feeding Prometheus; Promtail shipping logs to Loki.]

kube-prometheus-stack

The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...
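The excerpt cuts off before the Helm values, so the following is only a rough sketch of what a kube-prometheus-stack values file for a setup like this could look like; the retention period, Longhorn storage class, and Slack receiver are my assumptions, not the post's actual configuration.

```yaml
# values.yaml sketch for the kube-prometheus-stack Helm chart.
# Retention, storage class, and Slack settings are assumptions.
prometheus:
  prometheusSpec:
    retention: 15d                      # how long to keep metrics
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn    # assumes Longhorn-backed PVCs
          resources:
            requests:
              storage: 50Gi
alertmanager:
  config:
    route:
      receiver: slack                   # default receiver for all alerts
      group_by: ["alertname", "namespace"]
    receivers:
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/REPLACE_ME
            channel: "#alerts"
grafana:
  persistence:
    enabled: true
    storageClassName: longhorn
```

Applied with something like `helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml`.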

February 19, 2026 · 6 min · zolty
Media stack on Kubernetes

Building a Complete Media Stack with Kubernetes

TL;DR The media stack is now fully automated: content gets sourced, synced from a remote seedbox to the local NAS via an rclone CronJob, organized by Radarr/Sonarr, and served by Jellyfin with Intel iGPU hardware transcoding. I also deployed a Media Controller for lifecycle management and a Media Profiler for content analysis. This post covers the full pipeline from acquisition to playback.

The Media Pipeline

Jellyseerr (request) → Radarr/Sonarr (search + organize) → Prowlarr (indexer management) → Seedbox (remote acquisition) → rclone-sync CronJob (seedbox → NAS) → NAS (TrueNAS, VLAN 30) → Jellyfin (playback + GPU transcode)

Each component runs as a Kubernetes deployment in the media namespace. ...
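The excerpt names the rclone-sync CronJob but stops before any manifests; a minimal sketch of such a CronJob is below, assuming an rclone remote called seedbox, an NFS-mounted NAS share, and a 30-minute schedule. None of these details are taken from the post.

```yaml
# Illustrative rclone-sync CronJob; remote name, paths, NAS IP, and schedule are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rclone-sync
  namespace: media
spec:
  schedule: "*/30 * * * *"              # assumed: every 30 minutes
  concurrencyPolicy: Forbid             # never overlap two sync runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rclone
              image: rclone/rclone:latest
              args:
                - sync
                - seedbox:/downloads/complete   # remote defined in rclone.conf
                - /nas/media/incoming           # NAS share mounted below
                - --transfers=4
              volumeMounts:
                - name: rclone-config
                  mountPath: /config/rclone     # default config dir in the rclone image
                - name: nas
                  mountPath: /nas/media
          volumes:
            - name: rclone-config
              secret:
                secretName: rclone-config       # holds rclone.conf
            - name: nas
              nfs:
                server: 10.0.30.10              # TrueNAS on VLAN 30 (assumed IP)
                path: /mnt/tank/media
```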

February 17, 2026 · 5 min · zolty
VLAN migration

VLAN Migration: Moving a Live Kubernetes Cluster Without Downtime

TL;DR Today was the biggest infrastructure day yet. I migrated the entire k3s cluster from a flat network to a proper VLAN architecture: Server VLAN 20 for k3s nodes and services, Storage VLAN 30 for the NAS, and the existing default VLAN 1 for clients. This involved changing IPs on all VMs, updating MetalLB, reconfiguring Traefik, and recovering from an etcd quorum loss when I moved too many nodes at once. I also deployed the media stack (Jellyfin, Radarr, Sonarr, Prowlarr, Jellyseerr) and configured Intel iGPU passthrough infrastructure. ...
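To make the MetalLB piece concrete: moving LoadBalancer services onto Server VLAN 20 means repointing the address pool at the new subnet. A sketch using MetalLB's CRD-based configuration follows; the 10.0.20.x subnet and range are assumptions, not the addresses used in the post.

```yaml
# Illustrative MetalLB pool for the new server VLAN; the address range is assumed.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: server-vlan-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.20.200-10.0.20.220    # LoadBalancer IPs carved out of VLAN 20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: server-vlan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - server-vlan-pool
```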

February 16, 2026 · 5 min · zolty
Production failures and lessons

Top 10 Production Failures and What I Learned

TL;DR After one week of operating this cluster with real workloads, I have accumulated a healthy list of production failures. Each one taught me something about Kubernetes, infrastructure, or my own assumptions. Here are the top 10, ranked by how much time they cost to investigate and fix.

1. The Longhorn S3 Backup Credential Rotation

Impact: All Longhorn backups silently failed for 12 hours.

What happened: I rotated the IAM credentials used for S3 backups and updated the Kubernetes secret. But Longhorn caches credentials at startup — it does not re-read the secret dynamically. All backup jobs continued using the stale credentials and failing silently. ...
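For context on that first failure: Longhorn points its backup target at an S3 bucket and reads the credentials from a Kubernetes Secret with AWS-style keys. A sketch of such a secret is below; the secret name and endpoint are placeholders, not values from the post.

```yaml
# Illustrative Longhorn backup credential secret; name, endpoint, and values are placeholders.
# Longhorn reads these AWS_* keys from the secret referenced by its backup-target settings.
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-backup-secret
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: REPLACE_ME
  AWS_SECRET_ACCESS_KEY: REPLACE_ME
  AWS_ENDPOINTS: https://s3.example.com   # only needed for non-AWS S3-compatible storage
```

The backup target itself is configured separately (an s3://bucket@region/ style URL plus the name of this secret), which is why it is easy to rotate the secret contents without realizing nothing re-reads them.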

February 15, 2026 · 6 min · zolty
Microservices architecture

Deploying a Microservices Architecture on k3s

TL;DR Today I deployed the most architecturally complex application on the cluster: a video service platform with a Vue.js frontend, 7 FastAPI backend microservices, NATS for messaging, PostgreSQL for persistence, and Redis for caching. This post covers the deployment patterns for NATS-based microservices on k3s and the RBAC fixes needed for Helm-based deployments.

The Application Architecture

The video service platform is a full microservices stack:

[Architecture diagram: a Vue.js frontend SPA talks HTTP/REST to an API gateway, which fans out to seven FastAPI microservices (Auth, Video, Media, Queue, Stats, User, Notif), backed by PostgreSQL, NATS, and Redis.]

Seven FastAPI services communicate via NATS for asynchronous messaging and Redis for shared state. PostgreSQL handles persistent data. ...
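The excerpt stops before the manifests, so the Deployment below is only a sketch of the kind of wiring such a service needs: a hypothetical video-service with NATS, Redis, and PostgreSQL connection details injected through env vars. The names, image, and URLs are all placeholders.

```yaml
# Illustrative Deployment for one FastAPI service; name, image, and URLs are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: video-service
  namespace: video-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: video-service
  template:
    metadata:
      labels:
        app: video-service
    spec:
      containers:
        - name: api
          image: registry.example.com/video-service:latest
          ports:
            - containerPort: 8000                 # uvicorn/FastAPI port
          env:
            - name: NATS_URL
              value: nats://nats.video-platform.svc:4222
            - name: REDIS_URL
              value: redis://redis.video-platform.svc:6379/0
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials      # created separately
                  key: url
```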

February 13, 2026 · 5 min · zolty
Self-hosted CI/CD

Self-Hosted CI/CD: Running GitHub Actions Runners on k3s

TL;DR Running self-hosted GitHub Actions runners on the same k3s cluster they deploy to is a powerful pattern. GitHub Actions Runner Controller (ARC) manages runner pods as Kubernetes resources, scaling them based on workflow demand. This post covers the full setup, the RBAC model that makes it work, and every gotcha I encountered.

Why Self-Hosted Runners?

GitHub-hosted runners are convenient but have limitations:

Cost: The free tier gives 2,000 minutes/month. With 5+ repositories doing multiple deploys per day, that burns fast.

Speed: GitHub-hosted runners are shared infrastructure. Cold starts take 20-30 seconds, and you are competing with other users.

Access: GitHub-hosted runners cannot reach my private cluster network. Every deployment would need a VPN or tunnel.

Control: I want to install whatever tools I need (kubectl, helm, terraform, ansible) without Docker layer caching tricks.

Self-hosted runners solve all of these: they run inside the cluster, have direct network access to all services, come with pre-configured tools, and have no usage limits. ...
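Assuming the summerwind-flavored ARC CRDs (the newer gha-runner-scale-set chart uses different resources), a runner pool plus autoscaler could be sketched like this; the organization name, labels, and replica bounds are placeholders.

```yaml
# Illustrative ARC runner pool (summerwind-style CRDs); org, labels, and bounds are assumptions.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: k3s-runners
  namespace: actions-runner-system
spec:
  template:
    spec:
      organization: my-github-org       # runners registered at the org level
      labels:
        - self-hosted
        - k3s
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: k3s-runners-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: k3s-runners
  minReplicas: 1
  maxReplicas: 6
  metrics:
    - type: PercentageRunnersBusy       # add runners as existing ones get busy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"
```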

February 12, 2026 · 6 min · zolty
Digital Signage deployment

Migrating a Full-Stack App to Kubernetes: Digital Signage on k3s

TL;DR Today I migrated Digital Signage — an Angular SPA backed by 7 Flask microservices, an MQTT broker, and PostgreSQL — from a development environment to the k3s cluster. This is the most complex application on the cluster so far, and deploying it taught me a lot about managing multi-service applications in Kubernetes.

The Application

Digital Signage started as a side project back in May 2025, designed to drive informational displays on Raspberry Pi kiosk devices. It evolved over the months into a surprisingly complex system: ...
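The excerpt doesn't show any manifests; purely as an illustration of the MQTT piece, a broker Deployment and Service might look like the sketch below, assuming Mosquitto as the broker and a signage namespace (both assumptions, not details from the post).

```yaml
# Illustrative MQTT broker for the signage stack; broker choice, namespace, and names are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mqtt-broker
  namespace: signage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mqtt-broker
  template:
    metadata:
      labels:
        app: mqtt-broker
    spec:
      containers:
        - name: mosquitto
          image: eclipse-mosquitto:2
          ports:
            - containerPort: 1883       # MQTT
---
apiVersion: v1
kind: Service
metadata:
  name: mqtt-broker
  namespace: signage
spec:
  selector:
    app: mqtt-broker
  ports:
    - name: mqtt
      port: 1883
      targetPort: 1883
```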

February 11, 2026 · 5 min · zolty
Home Assistant and Proxmox monitoring

Home Assistant on Kubernetes and Building a Proxmox Watchdog

TL;DR Home Assistant runs on k3s using hostNetwork: true for mDNS/SSDP device discovery. I implemented split DNS routing so it is accessible both externally via Traefik and internally via its host IP. Then I built a Proxmox Watchdog — a custom service that monitors all Proxmox hosts via their API and automatically power-cycles unresponsive nodes using TP-Link Kasa HS300 smart power strips.

Home Assistant on Kubernetes

Home Assistant is one of those applications that does not play well with Kubernetes out of the box. It needs to discover devices on the local network via mDNS, SSDP, and other broadcast protocols. Put it in a regular Kubernetes pod with cluster networking and it cannot see any of your smart home devices. ...
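A minimal sketch of the hostNetwork arrangement described above follows; the namespace, image tag, and PVC name are assumptions, not the post's actual manifest.

```yaml
# Minimal hostNetwork sketch for Home Assistant; namespace, tag, and PVC name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: home-assistant
  namespace: home-automation
spec:
  replicas: 1
  selector:
    matchLabels:
      app: home-assistant
  template:
    metadata:
      labels:
        app: home-assistant
    spec:
      hostNetwork: true                      # lets HA see mDNS/SSDP broadcasts on the node's network
      dnsPolicy: ClusterFirstWithHostNet     # keep cluster DNS resolution while on the host network
      containers:
        - name: home-assistant
          image: ghcr.io/home-assistant/home-assistant:stable
          ports:
            - containerPort: 8123            # web UI, reachable on the node's IP
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          persistentVolumeClaim:
            claimName: home-assistant-config
```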

February 10, 2026 · 5 min · zolty
First application deployments

Deploying First Applications: From Zero to Production in 24 Hours

TL;DR Day two of the cluster was a marathon. I deployed two full-stack applications (Cardboard TCG tracker and Trade Bot), set up PostgreSQL with Longhorn persistent storage, created a cluster dashboard, configured Prometheus service monitors, built a dev workspace for remote SSH, and scaled the ARC runners. By the end, the cluster was running real workloads and I had a proper development workflow.

The Deployment Pattern

Before diving into the applications, I established a consistent deployment pattern that every service follows: ...
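The excerpt cuts off before the pattern itself, so the sketch below is not the post's pattern; it is just an illustrative ServiceMonitor of the kind mentioned above, with a placeholder app name, namespace, and port.

```yaml
# Illustrative ServiceMonitor; app name, namespace, port, and selector label are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cardboard
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # assumes the default chart selector on the release label
spec:
  namespaceSelector:
    matchNames:
      - cardboard
  selector:
    matchLabels:
      app: cardboard
  endpoints:
    - port: http                      # must match the Service port name, not the number
      path: /metrics
      interval: 30s
```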

February 9, 2026 · 6 min · zolty
Cluster bootstrapping

Day One: Bootstrapping a k3s Cluster with Terraform and Ansible

TL;DR Today was cluster genesis. Starting from 3 bare Proxmox hosts, I built the entire infrastructure-as-code pipeline: Terraform to provision VMs from cloud-init templates, Ansible to configure and bootstrap k3s, and a full GitOps deployment model with SOPS-encrypted secrets and S3-backed Terraform state. By end of day: 3 server nodes, 3 agent nodes, cert-manager with Route53 DNS-01 validation, and self-hosted GitHub Actions runners on the cluster itself.

The Architecture

The design goal was simple: everything as code, nothing manual, everything reproducible. ...
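The excerpt doesn't include the playbooks, so this is only a sketch of what the Ansible side of bootstrapping the first k3s server can look like; the host group name and task structure are assumptions, while the install script URL and node-token path are k3s's standard ones.

```yaml
# Illustrative Ansible play for the first k3s server; host group and structure are assumptions.
- name: Bootstrap first k3s server
  hosts: k3s_servers[0]
  become: true
  tasks:
    - name: Install k3s in cluster-init mode
      ansible.builtin.shell: |
        curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init" sh -
      args:
        creates: /usr/local/bin/k3s            # skip if k3s is already installed

    - name: Read the node token used to join the remaining nodes
      ansible.builtin.slurp:
        src: /var/lib/rancher/k3s/server/node-token
      register: k3s_node_token
```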

February 8, 2026 · 6 min · zolty