TL;DR

The homelab runs 42 Kubernetes namespaces across 7 nodes (3 control plane, 4 workers) on 4 Lenovo ThinkCentre M920q mini PCs running Proxmox VE. This post is the result of a full infrastructure audit — reconciling what’s actually running against what’s documented, catching version drift, and noting what’s been added, removed, or broken since the last check.

Compute

Four Lenovo ThinkCentre M920q nodes form the physical layer:

Host   CPU        RAM    NVMe    Role
pve1   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve2   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve3   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve4   i7-8700T   32GB   512GB   1 agent VM (GPU passthrough)

The k3s cluster runs v1.34.4+k3s1 with embedded etcd for HA. All 7 nodes report Ready. Server VMs get 2 cores and 6GB each — just enough for etcd and the API server. Agent VMs are beefier: 6 cores and 22GB on pve1-3, 12 cores and 28GB on pve4.
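
The embedded-etcd HA setup comes down to a small config file on each server VM. Here is a minimal sketch, not the cluster's actual config; the API address, the disabled servicelb, and the join snippet are assumptions:

```yaml
# /etc/rancher/k3s/config.yaml on the first server VM (sketch, values assumed)
cluster-init: true          # bootstrap embedded etcd instead of the default SQLite
tls-san:
  - 192.168.20.20           # extra SAN for the API endpoint on VLAN 20 (assumed address)
disable:
  - servicelb               # assumed, since MetalLB hands out LoadBalancer IPs

# The other server VMs would join with something like:
# server: https://192.168.20.20:6443
# token: <cluster-token>
```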

All three pve1-3 agents have Intel UHD 630 GPUs passed through via VFIO for hardware transcoding (h264_qsv, hevc_qsv, hevc_vaapi). pve4’s GPU passthrough was restored on March 16 after a failed LXC experiment — the LXC containers on pve4 were abandoned due to DNS failures and networking instability. Lesson learned: don’t use LXC on pve4 again.

Storage

Two tiers:

Longhorn (v1.8.2) provides distributed block storage across the 3 main worker nodes (~1.2 TiB total). 2 replicas, best-effort auto-balance. Longhorn is disabled on k3s-agent-4 due to the aging NVMe on pve4. All Longhorn volumes use the boot disk at /var/lib/longhorn/ — no dedicated data disks yet, though Terraform supports adding them via additional_disks.
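
The replica count and auto-balance behavior can be pinned in a Longhorn StorageClass, roughly like the sketch below; the class name is an assumption, and keeping k3s-agent-4 out of Longhorn is done by disabling scheduling on that node rather than anything in this manifest:

```yaml
# Longhorn StorageClass sketch matching the settings described above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2replica          # assumed name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  replicaAutoBalance: "best-effort"
  staleReplicaTimeout: "2880"
```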

NFS on the Ugreen DXP4800 NAS (192.168.30.10, 19TB) provides shared storage for:

  • Media library (movies, TV, music) — shared by Jellyfin, Plex, Radarr, Sonarr, Bazarr, Tdarr
  • Monitoring data (Prometheus, Loki, Grafana, AlertManager)
  • Harbor container registry storage
  • Gitea package registry storage
  • RAG document staging

The NAS has dual 2.5GbE in an 802.3ad LACP LAG for 5Gbps aggregate throughput to the USW Aggregation switch. Three of the four Proxmox hosts have Mellanox ConnectX-3 10GbE NICs in active-backup bonds.
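
For the shares above, the usual wiring is a static NFS PersistentVolume plus a matching claim. A sketch, where only the NAS address comes from this post and the export path, size, and names are placeholders:

```yaml
# Static NFS volume for the media share (path, size, and names assumed)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-nfs
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.30.10
    path: /volume1/media           # assumed export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-nfs
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # bind to the static PV above, not a provisioner
  volumeName: media-nfs
  resources:
    requests:
      storage: 10Ti
```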

Networking

Three VLANs keep things separated:

  • VLAN 20 (Server/K8s): k3s nodes (192.168.20.20-33), MetalLB pool (192.168.20.200-220)
  • VLAN 30 (Storage): NAS at 192.168.30.10, Proxmox storage interfaces
  • Default VLAN: Proxmox management (192.168.1.105-108), IoT devices, UniFi gear

Traefik (bundled with k3s) handles all ingress at 192.168.20.200. MetalLB v0.14.3 in L2 mode assigns LoadBalancer IPs. cert-manager v1.14.2 handles TLS via Let’s Encrypt DNS-01 against Route53.
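
A minimal sketch of the MetalLB pool and a Route53 DNS-01 issuer matching that description; resource names, email, region, and credential wiring are assumptions:

```yaml
# MetalLB L2 pool covering the range above
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.200-192.168.20.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
---
# cert-manager issuer doing DNS-01 against Route53
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-route53
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                    # placeholder
    privateKeySecretRef:
      name: letsencrypt-route53-key
    solvers:
      - dns01:
          route53:
            region: us-east-1                   # assumed
            accessKeyID: AKIAEXAMPLE            # placeholder; a Secret-backed key also works
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secret-access-key
```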

Public access relies on a DDNS CronJob that updates Route53 every 5 minutes, with OAuth2 Proxy (Google auth) protecting external endpoints. Port forwarding on the UDM Pro routes 80/443 to Traefik.
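
The DDNS piece is roughly a CronJob along these lines; only the 5-minute cadence comes from the setup above, while the image, script, and secret layout are placeholders:

```yaml
# DDNS CronJob skeleton: look up the current public IP, UPSERT the A record
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ddns-route53
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ddns
              image: amazon/aws-cli:2.15.0      # placeholder image/tag; assumes curl is available
              command: ["/bin/sh", "-c"]
              args:
                - |
                  IP=$(curl -s https://checkip.amazonaws.com)
                  aws route53 change-resource-record-sets \
                    --hosted-zone-id "$ZONE_ID" \
                    --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$RECORD\",\"Type\":\"A\",\"TTL\":300,\"ResourceRecords\":[{\"Value\":\"$IP\"}]}}]}"
              envFrom:
                - secretRef:
                    name: ddns-route53-env      # ZONE_ID, RECORD, AWS credentials (assumed)
```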

Media Stack

The media namespace is the heaviest workload:

  • Jellyfin (HA, 2-replica StatefulSet) with PostgreSQL + Redis — GPU-accelerated transcoding
  • Plex (v1.43.0) — alternative media server, also with GPU preference
  • Jellyseerr — request portal with Jellyfin SSO
  • Radarr + Sonarr — movie and TV management with NFS media mounts
  • Prowlarr — indexer management
  • Bazarr — subtitle automation
  • Tdarr — distributed GPU transcoding (DaemonSet, 1 worker per GPU node = 3 total)
  • FlareSolverr — Cloudflare bypass for indexers

Content pipeline: Jellyseerr request -> Radarr/Sonarr -> legacy seedbox (rclone sync every 4h) -> NAS NFS -> Jellyfin/Plex. The local qBittorrent + gluetun VPN setup was removed since the last audit.

Three GPU workers handle hardware transcoding via Intel QSV/VA-API. Each of these nodes has the Intel UHD 630 passed through, with gpu=true:NoSchedule taints ensuring only media workloads land there.
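
Scheduling-wise, a transcoding pod gets onto those nodes by tolerating the taint and selecting them. A sketch, where the taint key and effect come from the setup above and the node label, image, and /dev/dri hostPath are assumptions:

```yaml
# Pod sketch: tolerate the GPU taint, pin to GPU nodes, expose /dev/dri for QSV/VA-API
apiVersion: v1
kind: Pod
metadata:
  name: transcode-example
  namespace: media
spec:
  nodeSelector:
    gpu: "true"                    # assumed node label paired with the taint
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: worker
      image: ghcr.io/example/transcoder:latest   # placeholder
      securityContext:
        privileged: true           # simplest way to reach /dev/dri; a device plugin is cleaner
      volumeMounts:
        - name: dri
          mountPath: /dev/dri
  volumes:
    - name: dri
      hostPath:
        path: /dev/dri
```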

Monitoring & Observability

The monitoring stack runs on NFS to avoid Longhorn WAL corruption issues (a painful lesson):

  • Prometheus (kube-prometheus-stack Helm) — scraping 80+ ServiceMonitors
  • Grafana — dashboards as ConfigMaps (auto-loaded by sidecar; see the sketch after this list), persistence disabled after SQLite corruption
  • Loki + Promtail — log aggregation from all 7 nodes
  • AlertManager — alerts route to Slack via the Alert Responder bot (Amazon Nova Micro for AI analysis)
  • Custom exporters: Proxmox (pve-exporter), NAS, seedbox, AWS cost, Anthropic cost, OpenRouter cost, GitHub
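
The dashboards-as-ConfigMaps pattern from the Grafana item deserves a quick illustration: the sidecar watches for ConfigMaps carrying a specific label (grafana_dashboard: "1" is the kube-prometheus-stack default) and loads the embedded JSON. A stub, with the namespace and dashboard content assumed:

```yaml
# Any ConfigMap with this label gets picked up by the Grafana sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-nas
  namespace: monitoring            # assumed namespace
  labels:
    grafana_dashboard: "1"
data:
  nas.json: |
    { "title": "NAS Overview", "panels": [] }
```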

AI / LLM Services

This is where most of the recent growth has been:

  • OpenClaw (open-webui) — Personal AI assistant gateway with 2-replica agent pool, Anthropic direct + LiteLLM Bedrock proxy, Telegram + Slack channels, persistent memory, GitHub access
  • OpenClaw Ops (openclaw-ops) — Cluster observer with knowledge graph, event ingest from GitHub/Prometheus/K8s, Jellyfin release watcher with auto-PR
  • OpenClaw Personal (openclaw-personal) — Job search agent with resume audit, interview prep, daily job board crawling
  • RAG Platform (rag) — Qdrant vector DB + dual ingestion pipelines (NFS docs via Ollama embeddings, repo content via Bedrock Titan embeddings into pgvector)
  • Alert Responder — AI-powered alert triage using Bedrock Nova Micro, posts threaded analysis to Slack
  • Auto Brand — AI video factory (7 FastAPI services + NATS + Vue.js) using Bedrock Nova Reel for video generation
  • Trivy Operator — Continuous vulnerability scanning, results consumed by OpenClaw Ops

Other Services

The long tail of apps running in the cluster:

Service            Namespace          Purpose
Habit Tracker      ham                React + Fastify habit tracker (Harbor images)
Cardboard          cardboard          TCG price tracker with Chrome scraper CronJobs
Trade Bot          trade-bot          Robinhood trading bot (dry run mode)
Digital Signage    digital-signage    Angular SPA + 7 Flask services + MQTT for Raspberry Pi displays
DnD Platform       dnd                Multiplayer D&D with LiveKit voice, pgvector lore, AI DM
Cat Game           cat-game           Browser game with nginx exporter
Aja Recipes        aja-recipes        Recipe manager (2-replica with HPA; sketch below)
Tshirt Cannon      tshirt-cannon      AI merch factory MVP (storefront + vote API)
Wiki.js            wiki               Internal knowledge base
Security Scanner   security-scanner   Vulnerability scanner (XSS, SQLi, CSRF, etc.)
Media Library      media-library      Blog asset manager with S3 CDN + YouTube integration
Media Profiler     media-profiler     Media preference quiz with psychological profiling
Jupyter            jupyter            JupyterLab with cluster analysis notebooks
GHA Dashboard      gha-dashboard      GitHub Actions workflow history viewer
Kube Utils         kube-utils         Security honeypot mimicking node-exporter
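
The Aja Recipes autoscaler called out in the table is the standard autoscaling/v2 shape. A sketch where only the 2-replica floor and the namespace come from the table; the target Deployment name, CPU target, and ceiling are assumptions:

```yaml
# HPA sketch for the aja-recipes service
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aja-recipes
  namespace: aja-recipes
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aja-recipes              # assumed workload name
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```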

Infrastructure Services

Service            Purpose
Harbor             Self-hosted container registry (replacing ECR)
Gitea              Package registry (PyPI, npm, Maven, Go)
Home Assistant     IoT hub with Google Assistant, Hue, UniFi
Email Gateway      Postfix relay to AWS SES
Proxmox Watchdog   Auto power-cycle via Kasa smart strip
ARC Runners        8 self-hosted GitHub Actions runners
cert-manager       TLS via Let’s Encrypt DNS-01
MetalLB            L2 LoadBalancer IPs
Longhorn           Distributed block storage
CoreDNS            2 replicas for DNS HA

What Got Removed

  • Dev Workspace — code-server pods replaced by local dev + Claude Code CLI. Namespace stuck in Terminating.
  • NUT DaemonSet — UPS monitoring moved to host-level NUT clients via Ansible
  • qBittorrent — VPN-tunneled torrent client removed from media stack

What Needs Attention

  • dev-workspace namespace stuck in Terminating — needs manual finalizer cleanup
  • cluster-health-monitor and dnd-multi are empty namespaces (candidates for deletion)
  • Radarr pod has been restarting intermittently
  • Several services still pulling from ECR instead of Harbor (migration in progress)
  • Longhorn running on boot disks only — dedicated disks via Terraform additional_disks available but not provisioned

By the Numbers

  • Physical nodes: 4 (128GB RAM total, 2TB NVMe total)
  • Virtual nodes: 7 (3 server + 4 agent)
  • Namespaces: 42 active
  • Deployments: 80+
  • StatefulSets: 29
  • CronJobs: 26
  • DaemonSets: 9
  • Ingress hosts: 26
  • PostgreSQL instances: 15
  • GPU workers: 3 (Intel UHD 630 QSV/VA-API)
  • NAS storage: 19TB
  • Longhorn capacity: ~1.2 TiB (3 nodes)

The cluster has grown from a weekend project into a genuine private cloud. Every service has Prometheus metrics, most have CI/CD pipelines on the self-hosted runners, and the AI layer (OpenClaw + its satellite agents) is starting to operationally manage parts of the cluster itself. The next big milestones are Authentik for proper SSO, Linkerd for service mesh mTLS, and local LLM inference via Ray Serve + vLLM.