
Building a VPN Mesh for a Tech Collective

TL;DR I am designing a WireGuard VPN mesh to connect a small tech collective – a group of friends who each run their own infrastructure. The topology is hub-and-spoke with my k3s cluster as the hub, connecting 4+ remote sites over encrypted tunnels. Shared services include Jellyfin media federation, distributed CI/CD runners, LAN gaming, and centralized monitoring. The logging pipeline is privacy-first: all log filtering and anonymization happens at the edge (spoke side) before anything ships to the hub. This post covers the network design, the three-layer firewall architecture, the privacy model, and the phased rollout plan. ...

March 27, 2026 · 8 min · zolty
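
The edge-side filtering and anonymization summarized in the post above could look roughly like the following. This is a minimal sketch under stated assumptions, not the author's pipeline: the excerpt does not say which fields are scrubbed or how, so the choice of IPv4 addresses, the salted SHA-256 token, and the names `SITE_SALT`, `scrub_line`, and `filter_and_scrub` are all hypothetical. Salted hashing is strictly pseudonymization, which may or may not be strong enough for a given privacy model.

```python
import hashlib
import ipaddress
import re

# Hypothetical per-site secret; the idea is that it never leaves the spoke.
SITE_SALT = b"example-spoke-salt"

# Matches dotted-quad candidates; validity is checked with ipaddress below.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(text: str, salt: bytes = SITE_SALT) -> str:
    """Return a stable, salted token for one IP address string."""
    digest = hashlib.sha256(salt + text.encode("ascii")).hexdigest()
    return f"ip-{digest[:12]}"


def scrub_line(line: str, salt: bytes = SITE_SALT) -> str:
    """Replace every valid IPv4 address in a log line with its token."""

    def replace(match: re.Match) -> str:
        candidate = match.group(0)
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            return candidate  # not a real address (e.g. an octet above 255)
        return pseudonymize_ip(candidate, salt)

    return IPV4_CANDIDATE.sub(replace, line)


def filter_and_scrub(lines, drop_substrings=("DEBUG",), salt: bytes = SITE_SALT):
    """Drop unwanted lines, then scrub the rest, before anything is shipped."""
    for line in lines:
        if any(marker in line for marker in drop_substrings):
            continue
        yield scrub_line(line, salt)
```

Because the token is derived from the address and the salt, the same client still correlates across log lines at the hub, while the raw address stays at the spoke.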

When the AI Breaks Production: Failure Patterns, Guardrails, and Measuring What Works

TL;DR AI tools have caused multiple production incidents in this cluster. The AI alert responder agent alone generated 14 documented failure patterns before it became reliable. A security scanner deployed by AI applied restricted PodSecurity labels to every namespace, silently blocking pod creation for half the applications in the cluster. The service selector trap – where AI routes 50% of requests to PostgreSQL instead of the application – appeared in 4 separate incidents before guardrails stopped it. This post catalogs the failure patterns, the five-layer guardrail architecture built to prevent them, and an honest assessment of what still goes wrong. ...

March 2, 2026 · 14 min · zolty
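
One plausible reading of the service selector trap mentioned above is a Service selector broad enough to match both the application pod and its PostgreSQL pod. The sketch below illustrates that reading only; the excerpt does not spell out the mechanism, and the pod names, label values, and the helper `components_behind` are hypothetical. It relies on one fact about Kubernetes: an equality-based selector matches a pod when every key/value pair in the selector appears in the pod's labels.

```python
COMPONENT_KEY = "app.kubernetes.io/component"


def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True when every selector key/value pair is present in the pod labels."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


def components_behind(selector: dict, pods: list, component_key: str = COMPONENT_KEY) -> set:
    """Return the distinct component labels of all pods the selector matches."""
    return {
        pod["labels"].get(component_key, "<unlabeled>")
        for pod in pods
        if selector_matches(selector, pod["labels"])
    }


# Hypothetical pods: a web pod and a PostgreSQL pod sharing the same name label.
PODS = [
    {"name": "shop-web-0",
     "labels": {"app.kubernetes.io/name": "shop", COMPONENT_KEY: "web"}},
    {"name": "shop-db-0",
     "labels": {"app.kubernetes.io/name": "shop", COMPONENT_KEY: "postgres"}},
]

broad = {"app.kubernetes.io/name": "shop"}
narrow = {"app.kubernetes.io/name": "shop", COMPONENT_KEY: "web"}

print(components_behind(broad, PODS))   # both components sit behind the Service
print(components_behind(narrow, PODS))  # only the web component
```

A guardrail in this spirit would reject any Service whose selector resolves to more than one component. With one pod of each kind behind the broad selector, an even split of traffic would be consistent with the 50% figure in the excerpt.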

Monitoring Everything: Prometheus, Grafana, and Loki on k3s

TL;DR After running the cluster for nearly two weeks, today I took a step back to document and optimize the monitoring stack. This covers kube-prometheus-stack (Prometheus + Grafana + AlertManager), Loki for log aggregation, custom dashboards for every service, alert tuning to reduce noise, and the cluster-wide performance benchmarks I ran to establish baseline metrics.

The Monitoring Architecture

[Diagram: Grafana (Metrics Explorer, Logs Explorer, Alerts Rules) at the top; Prometheus (metrics) and Loki (logs) beneath it; AlertManager (→ Slack), the exporters (node, kube-state, cAdvisor, custom), and Promtail (log shipper) at the bottom.]

kube-prometheus-stack

The foundation is kube-prometheus-stack, deployed via Helm. This single chart installs: ...

February 19, 2026 · 6 min · zolty

Building an AI-Powered Alert System with AWS Bedrock

TL;DR Today I deployed two significant additions to the cluster: an AI-powered Alert Responder that uses AWS Bedrock (Amazon Nova Micro) to analyze Prometheus alerts and post remediation suggestions to Slack, and a multi-user dev workspace with per-user environments. I also hardened the cluster by constraining all workloads to the correct architecture nodes and fixing arm64 scheduling issues.

The Alert Responder

Running 13+ applications on a homelab cluster means alerts fire regularly. Most are straightforward (high memory, restart loops, certificate expiry warnings), but analyzing each one, determining root cause, and knowing the right remediation command gets tedious, especially at 2 AM. ...

February 14, 2026 · 5 min · zolty
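
The first step of an alert responder like the one summarized above is turning an alert into text a model can reason about. The sketch below shows one way to do that, assuming the alerts arrive in the Alertmanager webhook format (an `alerts` list whose entries carry `status`, `labels`, `annotations`, and `startsAt`). The prompt wording and the function name `build_prompt` are hypothetical, and the call to the model is deliberately left out because the excerpt does not describe it.

```python
def build_prompt(payload: dict) -> str:
    """Render the firing alerts of an Alertmanager webhook payload as a prompt."""
    sections = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no remediation advice
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        sections.append(
            "\n".join(
                [
                    f"Alert: {labels.get('alertname', 'unknown')}",
                    f"Severity: {labels.get('severity', 'unknown')}",
                    f"Namespace: {labels.get('namespace', 'unknown')}",
                    f"Summary: {annotations.get('summary', 'n/a')}",
                    f"Started: {alert.get('startsAt', 'unknown')}",
                ]
            )
        )
    if not sections:
        return ""  # nothing firing, so there is nothing to ask the model
    # Hypothetical instruction text; the real prompt is not shown in the excerpt.
    header = (
        "You are assisting with a Kubernetes cluster. For each alert below, "
        "suggest a likely root cause and one remediation command."
    )
    return header + "\n\n" + "\n\n".join(sections)
```

Returning an empty string when nothing is firing lets the caller skip the model call entirely.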

Home Assistant on Kubernetes and Building a Proxmox Watchdog

TL;DR Home Assistant runs on k3s using hostNetwork: true for mDNS/SSDP device discovery. I implemented split DNS routing so it is accessible both externally via Traefik and internally via its host IP. Then I built a Proxmox Watchdog — a custom service that monitors all Proxmox hosts via their API and automatically power-cycles unresponsive nodes using TP-Link Kasa HS300 smart power strips. ...

February 10, 2026 · 5 min · zolty
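
The decision logic behind a watchdog like the one summarized above can be sketched without touching any real API. This is an illustration under assumptions, not the author's implementation: the health check and the power-cycle action are injected as hypothetical callables (standing in for a Proxmox API probe and a smart power strip outlet), and the failure threshold and cooldown values are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Watchdog:
    is_healthy: Callable[[str], bool]   # hypothetical: wraps a Proxmox API check
    power_cycle: Callable[[str], None]  # hypothetical: toggles the host's outlet
    failure_threshold: int = 3          # consecutive failures before acting
    cooldown_seconds: float = 600.0     # minimum gap between power cycles
    _failures: Dict[str, int] = field(default_factory=dict)
    _last_cycle: Dict[str, float] = field(default_factory=dict)

    def check(self, host: str, now: float) -> bool:
        """Run one check for host; return True if a power cycle was triggered."""
        if self.is_healthy(host):
            self._failures[host] = 0
            return False
        self._failures[host] = self._failures.get(host, 0) + 1
        if self._failures[host] < self.failure_threshold:
            return False  # could be a transient blip; wait for more evidence
        last = self._last_cycle.get(host)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # recently cycled; give the node time to boot
        self.power_cycle(host)
        self._last_cycle[host] = now
        self._failures[host] = 0
        return True
```

Requiring several consecutive failures and enforcing a cooldown are the two guards that keep a watchdog of this kind from power-cycling a node that is merely slow or still booting.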

Deploying First Applications: From Zero to Production in 24 Hours

TL;DR Day two of the cluster was a marathon. I deployed two full-stack applications (Cardboard TCG tracker and Trade Bot), set up PostgreSQL with Longhorn persistent storage, created a cluster dashboard, configured Prometheus service monitors, built a dev workspace for remote SSH, and scaled the ARC runners. By the end, the cluster was running real workloads and I had a proper development workflow.

The Deployment Pattern

Before diving into the applications, I established a consistent deployment pattern that every service follows: ...

February 9, 2026 · 6 min · zolty

Affiliate Disclosure: Some links on this site are affiliate links (Amazon Associates, DigitalOcean referral). As an Amazon Associate, I earn from qualifying purchases. This does not affect the price you pay or my editorial independence — I only recommend products and services I personally use and trust.