TL;DR

The homelab runs 42 Kubernetes namespaces across 7 nodes (3 control plane, 4 workers) on 4 Lenovo ThinkCentre M920q mini PCs running Proxmox VE. This post is the result of a full infrastructure audit — reconciling what’s actually running against what’s documented, catching version drift, and noting what’s been added, removed, or broken since the last check.

Compute

Four Lenovo ThinkCentre M920q nodes form the physical layer:

Host   CPU        RAM    NVMe    Role
pve1   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve2   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve3   i5-8500T   32GB   512GB   1 server VM + 1 agent VM
pve4   i7-8700T   32GB   512GB   1 agent VM (GPU passthrough)

The k3s cluster runs v1.34.4+k3s1 with embedded etcd for HA. All 7 nodes report Ready. Server VMs get 2 cores and 6GB each — just enough for etcd and the API server. Agent VMs are beefier: 6 cores and 22GB on pve1-3, 12 cores and 28GB on pve4.
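
The embedded-etcd HA setup comes down to a small config file on each server VM. Here is a minimal sketch, not the cluster's actual config; the API address, the disabled servicelb, and the join snippet are assumptions:

```yaml
# /etc/rancher/k3s/config.yaml on the first server VM (sketch, values assumed)
cluster-init: true          # bootstrap embedded etcd instead of the default SQLite
tls-san:
  - 192.168.20.20           # extra SAN for the API endpoint on VLAN 20 (assumed address)
disable:
  - servicelb               # assumed, since MetalLB hands out LoadBalancer IPs

# The other server VMs would join with something like:
# server: https://192.168.20.20:6443
# token: <cluster-token>
```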

All three pve1-3 agents have Intel UHD 630 GPUs passed through via VFIO for hardware transcoding (h264_qsv, hevc_qsv, hevc_vaapi). pve4’s GPU passthrough was restored on March 16 after a failed LXC experiment — the LXC containers on pve4 were abandoned due to DNS failures and networking instability. Lesson learned: don’t use LXC on pve4 again.

Storage

Two tiers:

Longhorn (v1.8.2) provides distributed block storage across the 3 main worker nodes (~1.2 TiB total). 2 replicas, best-effort auto-balance. Longhorn is disabled on k3s-agent-4 due to the aging NVMe on pve4. All Longhorn volumes use the boot disk at /var/lib/longhorn/ — no dedicated data disks yet, though Terraform supports adding them via additional_disks.
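
The replica count and auto-balance behavior can be pinned in a Longhorn StorageClass, roughly like the sketch below; the class name is an assumption, and keeping k3s-agent-4 out of Longhorn is done by disabling scheduling on that node rather than anything in this manifest:

```yaml
# Longhorn StorageClass sketch matching the settings described above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2replica          # assumed name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  replicaAutoBalance: "best-effort"
  staleReplicaTimeout: "2880"
```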

NFS on the Ugreen DXP4800 NAS (192.168.30.10, 19TB) provides shared storage for:

  • Media library (movies, TV, music) — shared by Jellyfin, Plex, Radarr, Sonarr, Bazarr, Tdarr
  • Monitoring data (Prometheus, Loki, Grafana, AlertManager)
  • Harbor container registry storage
  • Gitea package registry storage
  • RAG document staging

The NAS has dual 2.5GbE in an 802.3ad LACP LAG for 5Gbps aggregate throughput to the USW Aggregation switch. Three of the four Proxmox hosts have Mellanox ConnectX-3 10GbE NICs in active-backup bonds.
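
For the shares above, the usual wiring is a static NFS PersistentVolume plus a matching claim. A sketch, where only the NAS address comes from this post and the export path, size, and names are placeholders:

```yaml
# Static NFS volume for the media share (path, size, and names assumed)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-nfs
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.30.10
    path: /volume1/media           # assumed export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-nfs
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # bind to the static PV above, not a provisioner
  volumeName: media-nfs
  resources:
    requests:
      storage: 10Ti
```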

Networking

Three VLANs keep things separated:

  • VLAN 20 (Server/K8s): k3s nodes (192.168.20.20-33), MetalLB pool (192.168.20.200-220)
  • VLAN 30 (Storage): NAS at 192.168.30.10, Proxmox storage interfaces
  • Default VLAN: Proxmox management (192.168.1.105-108), IoT devices, UniFi gear

Traefik (bundled with k3s) handles all ingress at 192.168.20.200. MetalLB v0.14.3 in L2 mode assigns LoadBalancer IPs. cert-manager v1.14.2 handles TLS via Let’s Encrypt DNS-01 against Route53.
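
A minimal sketch of the MetalLB pool and a Route53 DNS-01 issuer matching that description; resource names, email, region, and credential wiring are assumptions:

```yaml
# MetalLB L2 pool covering the range above
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.200-192.168.20.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
---
# cert-manager issuer doing DNS-01 against Route53
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-route53
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                    # placeholder
    privateKeySecretRef:
      name: letsencrypt-route53-key
    solvers:
      - dns01:
          route53:
            region: us-east-1                   # assumed
            accessKeyID: AKIAEXAMPLE            # placeholder; a Secret-backed key also works
            secretAccessKeySecretRef:
              name: route53-credentials
              key: secret-access-key
```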

Public access relies on a DDNS CronJob that updates Route53 every 5 minutes, with OAuth2 Proxy (Google auth) protecting external endpoints. Port forwarding on the UDM Pro routes 80/443 to Traefik.
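
The DDNS piece is roughly a CronJob along these lines; only the 5-minute cadence comes from the setup above, while the image, script, and secret layout are placeholders:

```yaml
# DDNS CronJob skeleton: look up the current public IP, UPSERT the A record
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ddns-route53
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ddns
              image: amazon/aws-cli:2.15.0      # placeholder image/tag; assumes curl is available
              command: ["/bin/sh", "-c"]
              args:
                - |
                  IP=$(curl -s https://checkip.amazonaws.com)
                  aws route53 change-resource-record-sets \
                    --hosted-zone-id "$ZONE_ID" \
                    --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$RECORD\",\"Type\":\"A\",\"TTL\":300,\"ResourceRecords\":[{\"Value\":\"$IP\"}]}}]}"
              envFrom:
                - secretRef:
                    name: ddns-route53-env      # ZONE_ID, RECORD, AWS credentials (assumed)
```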

Media Stack

The media namespace is the heaviest workload:

  • Jellyfin (HA, 2-replica StatefulSet) with PostgreSQL + Redis — GPU-accelerated transcoding
  • Plex (v1.43.0) — alternative media server, also with GPU preference
  • Jellyseerr — request portal with Jellyfin SSO
  • Radarr + Sonarr — movie and TV management with NFS media mounts
  • Prowlarr — indexer management
  • Bazarr — subtitle automation
  • Tdarr — distributed GPU transcoding (DaemonSet, 1 worker per GPU node = 3 total)
  • FlareSolverr — Cloudflare bypass for indexers

Content pipeline: Jellyseerr request -> Radarr/Sonarr -> legacy seedbox (rclone sync every 4h) -> NAS NFS -> Jellyfin/Plex. The local qBittorrent + gluetun VPN setup was removed since the last audit.

Three GPU workers handle hardware transcoding via Intel QSV/VA-API. Each of these nodes has the Intel UHD 630 passed through, with gpu=true:NoSchedule taints ensuring only media workloads land there.
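
Scheduling-wise, a transcoding pod gets onto those nodes by tolerating the taint and selecting them. A sketch, where the taint key and effect come from the setup above and the node label, image, and /dev/dri hostPath are assumptions:

```yaml
# Pod sketch: tolerate the GPU taint, pin to GPU nodes, expose /dev/dri for QSV/VA-API
apiVersion: v1
kind: Pod
metadata:
  name: transcode-example
  namespace: media
spec:
  nodeSelector:
    gpu: "true"                    # assumed node label paired with the taint
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: worker
      image: ghcr.io/example/transcoder:latest   # placeholder
      securityContext:
        privileged: true           # simplest way to reach /dev/dri; a device plugin is cleaner
      volumeMounts:
        - name: dri
          mountPath: /dev/dri
  volumes:
    - name: dri
      hostPath:
        path: /dev/dri
```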

Monitoring & Observability

The monitoring stack runs on NFS to avoid Longhorn WAL corruption issues (a painful lesson):

  • Prometheus (kube-prometheus-stack Helm) — scraping 80+ ServiceMonitors
  • Grafana — dashboards as ConfigMaps (auto-loaded by sidecar; see the sketch after this list), persistence disabled after SQLite corruption
  • Loki + Promtail — log aggregation from all 7 nodes
  • AlertManager — alerts route to Slack via the Alert Responder bot (Amazon Nova Micro for AI analysis)
  • Custom exporters: Proxmox (pve-exporter), NAS, seedbox, AWS cost, Anthropic cost, OpenRouter cost, GitHub
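
The dashboards-as-ConfigMaps pattern from the Grafana item deserves a quick illustration: the sidecar watches for ConfigMaps carrying a specific label (grafana_dashboard: "1" is the kube-prometheus-stack default) and loads the embedded JSON. A stub, with the namespace and dashboard content assumed:

```yaml
# Any ConfigMap with this label gets picked up by the Grafana sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-nas
  namespace: monitoring            # assumed namespace
  labels:
    grafana_dashboard: "1"
data:
  nas.json: |
    { "title": "NAS Overview", "panels": [] }
```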

AI / LLM Services

This is where most of the recent growth has been:

  • OpenClaw (open-webui) — Personal AI assistant gateway with 2-replica agent pool, Anthropic direct + LiteLLM Bedrock proxy, Telegram + Slack channels, persistent memory, GitHub access
  • OpenClaw Ops (openclaw-ops) — Cluster observer with knowledge graph, event ingest from GitHub/Prometheus/K8s, Jellyfin release watcher with auto-PR
  • OpenClaw Personal (openclaw-personal) — Job search agent with resume audit, interview prep, daily job board crawling
  • RAG Platform (rag) — Qdrant vector DB + dual ingestion pipelines (NFS docs via Ollama embeddings, repo content via Bedrock Titan embeddings into pgvector)
  • Alert Responder — AI-powered alert triage using Bedrock Nova Micro, posts threaded analysis to Slack
  • Auto Brand — AI video factory (7 FastAPI services + NATS + Vue.js) using Bedrock Nova Reel for video generation
  • Trivy Operator — Continuous vulnerability scanning, results consumed by OpenClaw Ops

Other Services

The long tail of apps running in the cluster:

Service            Namespace          Purpose
Habit Tracker      ham                React + Fastify habit tracker (Harbor images)
Cardboard          cardboard          TCG price tracker with Chrome scraper CronJobs
Trade Bot          trade-bot          Robinhood trading bot (dry run mode)
Digital Signage    digital-signage    Angular SPA + 7 Flask services + MQTT for Raspberry Pi displays
DnD Platform       dnd                Multiplayer D&D with LiveKit voice, pgvector lore, AI DM
Cat Game           cat-game           Browser game with nginx exporter
Aja Recipes        aja-recipes        Recipe manager (2-replica with HPA; sketch below)
Tshirt Cannon      tshirt-cannon      AI merch factory MVP (storefront + vote API)
Wiki.js            wiki               Internal knowledge base
Security Scanner   security-scanner   Vulnerability scanner (XSS, SQLi, CSRF, etc.)
Media Library      media-library      Blog asset manager with S3 CDN + YouTube integration
Media Profiler     media-profiler     Media preference quiz with psychological profiling
Jupyter            jupyter            JupyterLab with cluster analysis notebooks
GHA Dashboard      gha-dashboard      GitHub Actions workflow history viewer
Kube Utils         kube-utils         Security honeypot mimicking node-exporter
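
The Aja Recipes autoscaler called out in the table is the standard autoscaling/v2 shape. A sketch where only the 2-replica floor and the namespace come from the table; the target Deployment name, CPU target, and ceiling are assumptions:

```yaml
# HPA sketch for the aja-recipes service
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aja-recipes
  namespace: aja-recipes
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: aja-recipes              # assumed workload name
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```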

Infrastructure Services

Service            Purpose
Harbor             Self-hosted container registry (replacing ECR)
Gitea              Package registry (PyPI, npm, Maven, Go)
Home Assistant     IoT hub with Google Assistant, Hue, UniFi
Email Gateway      Postfix relay to AWS SES
Proxmox Watchdog   Auto power-cycle via Kasa smart strip
ARC Runners        8 self-hosted GitHub Actions runners
cert-manager       TLS via Let’s Encrypt DNS-01
MetalLB            L2 LoadBalancer IPs
Longhorn           Distributed block storage
CoreDNS            2 replicas for DNS HA

What Got Removed

  • Dev Workspace — code-server pods replaced by local dev + Claude Code CLI. Namespace stuck in Terminating.
  • NUT DaemonSet — UPS monitoring moved to host-level NUT clients via Ansible
  • qBittorrent — VPN-tunneled torrent client removed from media stack

What Needs Attention

  • dev-workspace namespace stuck in Terminating — needs manual finalizer cleanup
  • cluster-health-monitor and dnd-multi are empty namespaces (candidates for deletion)
  • Radarr pod has been restarting intermittently
  • Several services still pulling from ECR instead of Harbor (migration in progress)
  • Longhorn running on boot disks only — dedicated disks via Terraform additional_disks available but not provisioned

By the Numbers

  • Physical nodes: 4 (128GB RAM total, 2TB NVMe total)
  • Virtual nodes: 7 (3 server + 4 agent)
  • Namespaces: 42 active
  • Deployments: 80+
  • StatefulSets: 29
  • CronJobs: 26
  • DaemonSets: 9
  • Ingress hosts: 26
  • PostgreSQL instances: 15
  • GPU workers: 3 (Intel UHD 630 QSV/VA-API)
  • NAS storage: 19TB
  • Longhorn capacity: ~1.2 TiB (3 nodes)

The cluster has grown from a weekend project into a genuine private cloud. Every service has Prometheus metrics, most have CI/CD pipelines on the self-hosted runners, and the AI layer (OpenClaw + its satellite agents) is starting to operationally manage parts of the cluster itself. The next big milestones are Authentik for proper SSO, Linkerd for service mesh mTLS, and local LLM inference via Ray Serve + vLLM.