TL;DR
The homelab runs 42 Kubernetes namespaces across 7 nodes (3 control plane, 4 workers) on 4 Lenovo ThinkCentre M920q mini PCs running Proxmox VE. This post is the result of a full infrastructure audit — reconciling what’s actually running against what’s documented, catching version drift, and noting what’s been added, removed, or broken since the last check.
Compute
Four Lenovo ThinkCentre M920q nodes form the physical layer:
| Host | CPU | RAM | NVMe | Role |
|---|---|---|---|---|
| pve1 | i5-8500T | 32GB | 512GB | 1 server VM + 1 agent VM |
| pve2 | i5-8500T | 32GB | 512GB | 1 server VM + 1 agent VM |
| pve3 | i5-8500T | 32GB | 512GB | 1 server VM + 1 agent VM |
| pve4 | i7-8700T | 32GB | 512GB | 1 agent VM (GPU passthrough) |
The k3s cluster runs v1.34.4+k3s1 with embedded etcd for HA. All 7 nodes report Ready. Server VMs get 2 cores and 6GB each — just enough for etcd and the API server. Agent VMs are beefier: 6 cores and 22GB on pve1-3, 12 cores and 28GB on pve4.
The agent VMs on pve1-3 each have the host's Intel UHD 630 passed through via VFIO for hardware transcoding (h264_qsv, hevc_qsv, hevc_vaapi). pve4's GPU passthrough was restored on March 16 after a failed LXC experiment: the LXC containers on pve4 were abandoned due to DNS failures and networking instability. Lesson learned: no more LXC on pve4.
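For reference, the guest side of a passthrough like this comes down to a couple of lines in the VM config. A sketch only: the VM ID is hypothetical, 0000:00:02.0 is the usual PCI address of an Intel iGPU, and the host needs intel_iommu=on plus the vfio modules set up first.

```
# /etc/pve/qemu-server/<vmid>.conf (illustrative excerpt)
machine: q35
hostpci0: 0000:00:02.0,pcie=1
```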
Storage
Two tiers:
Longhorn (v1.8.2) provides distributed block storage across the 3 main worker nodes (~1.2 TiB total), with 2 replicas per volume and best-effort replica auto-balance. Longhorn is disabled on k3s-agent-4 due to the aging NVMe on pve4. All Longhorn volumes use the boot disk at `/var/lib/longhorn/` — no dedicated data disks yet, though Terraform supports adding them via `additional_disks`.
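Those defaults can be captured in the Longhorn StorageClass. A sketch with values mirroring the settings above (replicaAutoBalance is Longhorn's per-volume name for the auto-balance knob):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"              # matches the 2-replica default above
  replicaAutoBalance: "best-effort"
  staleReplicaTimeout: "30"          # minutes before a failed replica is considered stale
```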
NFS on the Ugreen DXP4800 NAS (192.168.30.10, 19TB) provides shared storage for:
- Media library (movies, TV, music) — shared by Jellyfin, Plex, Radarr, Sonarr, Bazarr, Tdarr
- Monitoring data (Prometheus, Loki, Grafana, AlertManager)
- Harbor container registry storage
- Gitea package registry storage
- RAG document staging
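For example, the media share can be mounted cluster-wide as a ReadWriteMany volume. A minimal sketch, assuming a hypothetical export path on the NAS:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-nfs
spec:
  capacity:
    storage: 10Ti                  # illustrative size
  accessModes:
    - ReadWriteMany                # shared by Jellyfin, Plex, and the *arr stack
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
  nfs:
    server: 192.168.30.10
    path: /volume1/media           # hypothetical export path
```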
The NAS has dual 2.5GbE ports in an 802.3ad LACP LAG, giving 5Gbps aggregate throughput to the USW Aggregation switch. Three of the four Proxmox hosts have Mellanox ConnectX-3 10GbE NICs in active-backup bonds.
Networking
Three VLANs keep things separated:
- VLAN 20 (Server/K8s): k3s nodes (192.168.20.20-33), MetalLB pool (192.168.20.200-220)
- VLAN 30 (Storage): NAS at 192.168.30.10, Proxmox storage interfaces
- Default VLAN: Proxmox management (192.168.1.105-108), IoT devices, UniFi gear
Traefik (bundled with k3s) handles all ingress at 192.168.20.200. MetalLB v0.14.3 in L2 mode assigns LoadBalancer IPs. cert-manager v1.14.2 handles TLS via Let’s Encrypt DNS-01 against Route53.
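The MetalLB side of that is two small resources. A sketch with illustrative names, using the pool range above:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.200-192.168.20.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool                 # answer ARP for IPs from this pool
```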
Public access goes through a DDNS CronJob that updates Route53 every 5 minutes, with OAuth2 Proxy (Google auth) protecting external endpoints. Port forwarding on the UDM Pro routes 80/443 to Traefik.
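A minimal sketch of what such a DDNS CronJob can look like, assuming AWS credentials, region, and the hosted zone ID live in a Secret, a hypothetical home.example.com record, and an image that provides curl:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ddns-route53
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ddns
              image: amazon/aws-cli:latest   # assumes curl is available in the image
              envFrom:
                - secretRef:
                    name: route53-creds      # AWS keys, region, ZONE_ID
              command: ["/bin/sh", "-c"]
              args:
                - |
                  IP=$(curl -s https://checkip.amazonaws.com)
                  aws route53 change-resource-record-sets \
                    --hosted-zone-id "$ZONE_ID" \
                    --change-batch "{\"Changes\": [{\"Action\": \"UPSERT\", \"ResourceRecordSet\":
                      {\"Name\": \"home.example.com\", \"Type\": \"A\", \"TTL\": 300,
                      \"ResourceRecords\": [{\"Value\": \"$IP\"}]}}]}"
```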
Media Stack
The media namespace is the heaviest workload:
- Jellyfin (HA, 2-replica StatefulSet) with PostgreSQL + Redis — GPU-accelerated transcoding
- Plex (v1.43.0) — alternative media server, also with GPU preference
- Jellyseerr — request portal with Jellyfin SSO
- Radarr + Sonarr — movie and TV management with NFS media mounts
- Prowlarr — indexer management
- Bazarr — subtitle automation
- Tdarr — distributed GPU transcoding (DaemonSet, 1 worker per GPU node = 3 total)
- FlareSolverr — Cloudflare bypass for indexers
Content pipeline: Jellyseerr request -> Radarr/Sonarr -> legacy seedbox (rclone sync every 4h) -> NAS NFS -> Jellyfin/Plex. The local qBittorrent + gluetun VPN setup was removed since the last audit.
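The seedbox leg of that pipeline is a straightforward scheduled sync. A sketch, assuming an rclone remote named seedbox configured via a Secret and illustrative paths:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: seedbox-sync
  namespace: media
spec:
  schedule: "0 */4 * * *"            # every 4 hours
  concurrencyPolicy: Forbid          # don't overlap long-running syncs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rclone
              image: rclone/rclone:latest
              args: ["sync", "seedbox:downloads/complete", "/media/staging"]
              volumeMounts:
                - name: rclone-config
                  mountPath: /config/rclone   # default config location in this image
                - name: media
                  mountPath: /media
          volumes:
            - name: rclone-config
              secret:
                secretName: rclone-config     # holds rclone.conf with the seedbox remote
            - name: media
              nfs:
                server: 192.168.30.10
                path: /volume1/media          # hypothetical export path
```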
3 GPU workers handle hardware transcoding via Intel QSV/VA-API. Each of those nodes has the Intel UHD 630 passed through, with `gpu=true:NoSchedule` taints ensuring only media workloads land there.
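The scheduling side is plain taints and tolerations. A sketch of the relevant parts of such a DaemonSet (the image reference and node label are assumptions; the toleration matches the taint above):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: tdarr-node
  namespace: media
spec:
  selector:
    matchLabels: {app: tdarr-node}
  template:
    metadata:
      labels: {app: tdarr-node}
    spec:
      nodeSelector:
        gpu: "true"                # assumes GPU nodes carry a matching label
      tolerations:
        - key: gpu
          operator: Equal
          value: "true"
          effect: NoSchedule       # tolerate the gpu=true:NoSchedule taint
      containers:
        - name: tdarr-node
          image: ghcr.io/haveagitgat/tdarr_node:latest   # illustrative image ref
          securityContext:
            privileged: true       # simplest route to the device; a device plugin is cleaner
          volumeMounts:
            - name: dri
              mountPath: /dev/dri  # render device for QSV/VA-API
      volumes:
        - name: dri
          hostPath:
            path: /dev/dri
```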
Monitoring & Observability
The monitoring stack runs on NFS to avoid Longhorn WAL corruption issues (a painful lesson):
- Prometheus (kube-prometheus-stack Helm) — scraping 80+ ServiceMonitors (see the sketch after this list)
- Grafana — dashboards as ConfigMaps (auto-loaded by sidecar), persistence disabled after SQLite corruption
- Loki + Promtail — log aggregation from all 7 nodes
- AlertManager — alerts route to Slack via the Alert Responder bot (Amazon Nova Micro for AI analysis)
- Custom exporters: Proxmox (pve-exporter), NAS, seedbox, AWS cost, Anthropic cost, OpenRouter cost, GitHub
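Two of those wiring details are worth sketching: the ServiceMonitors Prometheus discovers by label, and the ConfigMap dashboards the Grafana sidecar auto-loads. Names, namespaces, and the `release` label below are illustrative and depend on the actual Helm release:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the release's ServiceMonitor selector
spec:
  selector:
    matchLabels:
      app: my-app                    # Service (not pod) labels to scrape
  namespaceSelector:
    matchNames: [my-app]
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"           # sidecar watches for this label
data:
  my-dashboard.json: |
    {"title": "My Dashboard", "panels": [], "schemaVersion": 39}
```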
AI / LLM Services
This is where most of the recent growth has been:
- OpenClaw (`open-webui`) — Personal AI assistant gateway with 2-replica agent pool, Anthropic direct + LiteLLM Bedrock proxy, Telegram + Slack channels, persistent memory, GitHub access
- OpenClaw Ops (`openclaw-ops`) — Cluster observer with knowledge graph, event ingest from GitHub/Prometheus/K8s, Jellyfin release watcher with auto-PR
- OpenClaw Personal (`openclaw-personal`) — Job search agent with resume audit, interview prep, daily job board crawling
- RAG Platform (`rag`) — Qdrant vector DB + dual ingestion pipelines (NFS docs via Ollama embeddings, repo content via Bedrock Titan embeddings into pgvector)
- Alert Responder — AI-powered alert triage using Bedrock Nova Micro, posts threaded analysis to Slack
- Auto Brand — AI video factory (7 FastAPI services + NATS + Vue.js) using Bedrock Nova Reel for video generation
- Trivy Operator — Continuous vulnerability scanning, results consumed by OpenClaw Ops
Other Services
The long tail of apps running in the cluster:
| Service | Namespace | Purpose |
|---|---|---|
| Habit Tracker | ham | React + Fastify habit tracker (Harbor images) |
| Cardboard | cardboard | TCG price tracker with Chrome scraper CronJobs |
| Trade Bot | trade-bot | Robinhood trading bot (dry run mode) |
| Digital Signage | digital-signage | Angular SPA + 7 Flask services + MQTT for Raspberry Pi displays |
| DnD Platform | dnd | Multiplayer D&D with LiveKit voice, pgvector lore, AI DM |
| Cat Game | cat-game | Browser game with nginx exporter |
| Aja Recipes | aja-recipes | Recipe manager (2-replica with HPA) |
| Tshirt Cannon | tshirt-cannon | AI merch factory MVP (storefront + vote API) |
| Wiki.js | wiki | Internal knowledge base |
| Security Scanner | security-scanner | Vulnerability scanner (XSS, SQLi, CSRF, etc.) |
| Media Library | media-library | Blog asset manager with S3 CDN + YouTube integration |
| Media Profiler | media-profiler | Media preference quiz with psychological profiling |
| Jupyter | jupyter | JupyterLab with cluster analysis notebooks |
| GHA Dashboard | gha-dashboard | GitHub Actions workflow history viewer |
| Kube Utils | kube-utils | Security honeypot mimicking node-exporter |
Infrastructure Services
| Service | Purpose |
|---|---|
| Harbor | Self-hosted container registry (replacing ECR) |
| Gitea | Package registry (PyPI, npm, Maven, Go) |
| Home Assistant | IoT hub with Google Assistant, Hue, UniFi |
| Email Gateway | Postfix relay to AWS SES |
| Proxmox Watchdog | Auto power-cycle via Kasa smart strip |
| ARC Runners | 8 self-hosted GitHub Actions runners |
| cert-manager | TLS via Let’s Encrypt DNS-01 |
| MetalLB | L2 LoadBalancer IPs |
| Longhorn | Distributed block storage |
| CoreDNS | 2 replicas for DNS HA |
What Got Removed
- Dev Workspace — code-server pods replaced by local dev + Claude Code CLI. Namespace stuck in Terminating.
- NUT DaemonSet — UPS monitoring moved to host-level NUT clients via Ansible
- qBittorrent — VPN-tunneled torrent client removed from media stack
What Needs Attention
- `dev-workspace` namespace stuck in Terminating — needs manual finalizer cleanup
- `cluster-health-monitor` and `dnd-multi` are empty namespaces (candidates for deletion)
- Radarr pod has been restarting intermittently
- Several services still pulling from ECR instead of Harbor (migration in progress)
- Longhorn running on boot disks only — dedicated disks via Terraform `additional_disks` available but not provisioned
By the Numbers
- Physical nodes: 4 (128GB RAM total, 2TB NVMe total)
- Virtual nodes: 7 (3 server + 4 agent)
- Namespaces: 42 active
- Deployments: 80+
- StatefulSets: 29
- CronJobs: 26
- DaemonSets: 9
- Ingress hosts: 26
- PostgreSQL instances: 15
- GPU workers: 3 (Intel UHD 630 QSV/VA-API)
- NAS storage: 19TB
- Longhorn capacity: ~1.2 TiB (3 nodes)
The cluster has grown from a weekend project into a genuine private cloud. Every service has Prometheus metrics, most have CI/CD pipelines on the self-hosted runners, and the AI layer (OpenClaw + its satellite agents) is starting to operationally manage parts of the cluster itself. The next big milestones are Authentik for proper SSO, Linkerd for service mesh mTLS, and local LLM inference via Ray Serve + vLLM.