TL;DR

Over the past two months, I have made a series of stability improvements to my k3s homelab cluster. The biggest wins: migrating from AWS ECR to self-hosted Harbor (eliminating 12-hour token expiry), fixing recurring Grafana crashes caused by SQLite corruption on Longhorn, recovering pve4 after a failed LXC experiment, hardening NetworkPolicies to close gaps in pod-to-host traffic rules, and patching multiple CVEs across the media stack. The cluster now runs 7/7 nodes on k3s v1.34.4, all services monitored, all images pulled from Harbor with static credentials that never expire.

Harbor: Killing ECR Token Expiry

The single biggest reliability improvement was migrating container images from AWS ECR to self-hosted Harbor.

The problem with ECR: tokens expire every 12 hours. A CronJob refreshes them, but if the CronJob fails (node pressure, DNS hiccup, API rate limit), image pulls start failing silently. Pods that restart during the gap get ImagePullBackOff. I have been burned by this at least three times, always at the worst time.

Harbor pull secrets are static. They do not expire. Once configured, image pulls just work indefinitely. The migration touched 26 files across every namespace – openclaw, media services, CI runners, monitoring – all now pointing at harbor.k3s.internal.zolty.systems.
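The static credential is just a regular dockerconfigjson pull secret backed by a Harbor robot account. A minimal sketch (the secret name, namespace, and robot account are illustrative, not the actual manifests; the registry hostname is the one used above):

```yaml
# Hypothetical example of a static Harbor pull secret.
# The robot account name and token are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: harbor-pull
  namespace: media
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
      "auths": {
        "harbor.k3s.internal.zolty.systems": {
          "auth": "<base64 of robot-account:token>"
        }
      }
    }
```

Pods then reference it via imagePullSecrets in the pod spec, and there is no CronJob, no refresh, nothing to fail.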

Beyond credential stability, Harbor also provides:

  • Trivy vulnerability scanning on every push
  • Image promotion workflow – CI pushes to staging, manual promotion to production
  • Pull-through cache – proxies upstream images (Docker Hub, GitHub Container Registry) through Harbor, eliminating rate limiting
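Using the pull-through cache is just a matter of routing image references through a Harbor proxy project. A sketch, assuming a proxy project named dockerhub-proxy (the project name is illustrative):

```yaml
# Hypothetical container spec fragment: pulling a Docker Hub image
# through a Harbor proxy-cache project instead of docker.io directly.
spec:
  containers:
    - name: nginx
      # Instead of docker.io/library/nginx:1.27 — Harbor fetches it once,
      # caches it, and serves all subsequent pulls locally.
      image: harbor.k3s.internal.zolty.systems/dockerhub-proxy/library/nginx:1.27
```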

The ECR repos still exist in AWS (Terraform state gap I have not cleaned up yet), but nothing references them anymore.

Grafana SQLite Corruption Fix

Grafana ships with SQLite as its default database. On a normal server, this is fine. On Kubernetes with Longhorn distributed storage, it causes recurring corruption.

The failure mode: Grafana writes to its SQLite database on a Longhorn PVC. Longhorn replicates the volume across nodes. During replica sync or node failover, SQLite's write-ahead log (WAL) gets out of sync with the main database file. Grafana starts throwing "database is locked" or "database disk image is malformed" errors. The pod crashes, restarts, and either recovers (if Longhorn's replica is intact) or loses all dashboard and datasource configuration (if the WAL divergence corrupted the database).

This happened three times before I fixed it properly.

The fix: stop using persistent storage for Grafana entirely. All dashboards are now provisioned from ConfigMaps via the sidecar container. All datasources are defined in the Helm values. Grafana runs on emptyDir – if the pod restarts, it rebuilds its state from the provisioned configs in seconds. Nothing is lost because no unique state lives in the pod; everything is reconstructed from the provisioned sources.

```yaml
# Before: Longhorn PVC (prone to SQLite corruption)
persistence:
  enabled: true
  storageClassName: longhorn

# After: emptyDir (stateless, config-driven)
persistence:
  enabled: false
```

The trade-off is that ad-hoc dashboard changes made through the Grafana UI are lost on restart. But I was already managing dashboards as ConfigMaps for exactly this reason. The PVC was just a legacy configuration from the initial Helm deployment.
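A provisioned dashboard is just a labeled ConfigMap that the sidecar watches. A sketch (names are illustrative; the label key depends on the chart's sidecar.dashboards.label setting, for which grafana_dashboard is the common default):

```yaml
# Hypothetical sidecar-provisioned dashboard. The sidecar mounts any
# ConfigMap carrying this label into Grafana's provisioning directory.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  cluster-overview.json: |
    { "title": "Cluster Overview", "panels": [] }
```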

Security Patches

Several CVEs hit components in the cluster over the past two months. All patched:

| Component | Version | CVE | Impact |
| --- | --- | --- | --- |
| Sonarr | 4.0.17 | CVE-2026-30976 | Remote code execution |
| Jellyfin PostgreSQL | 16.13-alpine | CVE-2026-2005 | SQL injection in extensions |
| OAuth2 Proxy | 7.14.3 | Auth bypass | Cookie validation flaw |
| gh CLI | 2.88.1 (pinned) | CVE-2026-33186 | gRPC dependency vulnerability |

I also removed AWS credentials from a tracked file that should never have been committed. The credentials were rotated immediately, and a .gitignore rule was added to prevent recurrence.

Networking Hardening

Two networking changes that closed real security gaps:

NetworkPolicy ipBlock Rules

Kubernetes NetworkPolicies with namespaceSelector: {} look like they allow traffic from everywhere. They do not. They only match pod CIDR traffic. Host-network IPs – the Kubernetes API server, kubelets, Proxmox hosts, the NAS – are invisible to namespace selectors.

This bit me when Prometheus lost all service discovery. The monitoring namespace had a NetworkPolicy with namespaceSelector: {} for egress, which should have allowed Prometheus to reach everything. But Prometheus talks to the Kubernetes API server (a host-network IP) for service discovery targets. The policy blocked it silently.

The fix: explicit ipBlock CIDR rules for the node subnet:

```yaml
egress:
  - to:
      - ipBlock:
          cidr: 192.168.20.0/24   # Node network
      - ipBlock:
          cidr: 192.168.1.0/24    # Infrastructure network (NAS, Proxmox)
```
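In context, the policy combines both selector types, since each covers traffic the other cannot. A full-policy sketch (the namespace and pod selector are illustrative):

```yaml
# Sketch of a complete egress policy: namespaceSelector covers pod CIDR
# traffic, while ipBlock covers host-network IPs it can never match.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-egress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}       # all pods, in all namespaces
        - ipBlock:
            cidr: 192.168.20.0/24     # nodes (API server, kubelets)
        - ipBlock:
            cidr: 192.168.1.0/24      # infrastructure (NAS, Proxmox)
```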

UFW VXLAN Allow Rules

K3s uses Flannel VXLAN for pod networking. When adding a new node, UFW on existing nodes blocks VXLAN UDP traffic (port 8472) from the new node’s IP. Packets arrive at the network interface (visible in tcpdump) but never reach the Flannel interface for decapsulation.

The symptom: the new node joins the cluster, kubectl get nodes shows it as Ready, but pods on the new node cannot communicate with pods on existing nodes. Extremely confusing if you do not know to check UFW.

The fix is now automated in Ansible. The k3s_external_node_ips variable in group_vars/all.yml lists all node IPs. The hardening role adds ufw allow from <ip> for each one on every node.
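As a sketch, the hardening task looks something like this (the variable name is from the post; the task structure is assumed, and the source rule is a blanket allow per IP, though it could be tightened to just UDP/8472):

```yaml
# Hypothetical Ansible task: open UFW to every cluster node IP so
# Flannel VXLAN traffic is not silently dropped on new-node joins.
- name: Allow traffic from k3s cluster nodes
  community.general.ufw:
    rule: allow
    from_ip: "{{ item }}"
  loop: "{{ k3s_external_node_ips }}"
```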

Recovering pve4

The pve4 worker node went through an adventure in mid-March. I tried running k3s agents as LXC containers (k3s-agent-5 and k3s-agent-6) instead of the existing VM (k3s-agent-4). The idea was more granular resource allocation and faster boot times.

It did not work. LXC containers on pve4 had persistent DNS resolution failures and networking instability. CoreDNS queries from LXC-based pods would intermittently fail, causing service discovery to break. After three days of debugging, I concluded that the combination of Proxmox LXC networking, VXLAN encapsulation, and k3s Flannel was fundamentally unreliable on this hardware.

The resolution: remove both LXC containers, reinstate k3s-agent-4 as a VM, and add a permanent note to the documentation: do not use LXC on pve4 again.

One constraint remains: Longhorn is disabled on k3s-agent-4 because pve4’s NVMe is aging and I do not want to risk storing replicated data on it. The node handles stateless workloads and GPU-accelerated transcoding (Intel UHD 630 is available for VFIO passthrough).

CI/CD Reliability

Three fixes that reduced flaky CI/CD failures:

  1. Runner RBAC. ARC runners need escalate and bind verbs in their Role to grant permissions to the jobs they run. Without these, any workflow that creates RBAC resources fails with a permissions error. This is not obvious from the error message – it just says “forbidden.”

  2. Test isolation. Some CI tests were running kubectl apply against the live cluster. If the runner’s ServiceAccount lacked the right RBAC (e.g., NetworkPolicy management), the test failed. Moved these to dry-run mode where the runner cannot affect cluster state.

  3. Workflow triggers. A workflow that triggered on push to specific paths was missing its own workflow file in the path filter. Changes to the workflow itself would not trigger a run, so you could merge a broken workflow without catching it until the next unrelated push.
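For the RBAC fix in particular, the escalate and bind verbs have to be listed explicitly in the runner's Role. A sketch (Role name, namespace, and the trimmed rule list are illustrative):

```yaml
# Hypothetical ARC runner Role fragment. "escalate" allows creating Roles
# with permissions the ServiceAccount holds; "bind" allows creating
# RoleBindings that reference them. Without both, workflows that create
# RBAC resources fail with a bare "forbidden".
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: arc-runner
  namespace: ci
rules:
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["roles", "rolebindings"]
    verbs: ["create", "get", "list", "update", "delete", "escalate", "bind"]
```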

Prometheus Cardinality Reduction

The monitoring stack was generating too many time series. Six cardinality-suppression rules on kubeApiServer metrics dropped approximately 315,000 high-cardinality histogram buckets that were consuming storage and query time without providing actionable insight.

Before the suppression rules, a Prometheus query for API server latency would scan hundreds of thousands of series. After: the same query is an order of magnitude faster, and the metrics that remain are the ones I actually use in dashboards and alerts.
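The suppression rules are metric relabelings applied at scrape time, so the dropped series never hit storage at all. A sketch in kube-prometheus-stack values form (the regex is illustrative, not the actual six rules):

```yaml
# Hypothetical cardinality-suppression rule: drop high-cardinality
# API server histogram buckets before ingestion.
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: apiserver_request_(duration_seconds|slo_duration_seconds)_bucket
        action: drop
```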

Current State

As of today, the cluster is in good shape:

| Metric | Value |
| --- | --- |
| Nodes | 7/7 Ready |
| K3s version | v1.34.4+k3s1 |
| MetalLB | v0.14.3 (L2) |
| Longhorn | v1.8.2 (all workers except pve4) |
| Container registry | Harbor (self-hosted, no token expiry) |
| Monitoring | kube-prometheus-stack + Loki |
| TLS | cert-manager with Let's Encrypt DNS-01 |

There are still a couple of open items:

  • Radarr intermittent restarts – under investigation, likely a memory limit issue
  • dev-workspace namespace stuck in Terminating – stale finalizer, needs manual cleanup

But these are minor. The cluster has not had an unplanned outage in the past three weeks, which is the longest streak since I started this project.

Lessons Learned

  1. Self-hosted registries eliminate an entire class of failure. ECR token expiry was a reliability tax I paid every week. Harbor with static credentials reduced image pull failures to zero.

  2. SQLite and distributed storage do not mix. If your application uses SQLite and your storage layer replicates across nodes, you will hit WAL corruption eventually. Make the application stateless and provision config from external sources.

  3. NetworkPolicy namespaceSelector does not mean what you think it means. It only matches pod CIDRs. Host-network IPs require explicit ipBlock rules. This is documented in the Kubernetes docs but easy to miss.

  4. LXC is not a drop-in replacement for VMs in k3s. The networking stack differences (particularly around VXLAN encapsulation) cause subtle, hard-to-debug failures. VMs are heavier but behave predictably.

  5. CI/CD RBAC errors are deceptive. The error message says “forbidden” but the fix is not always adding permissions to the ServiceAccount. Sometimes the ServiceAccount needs escalate and bind verbs so it can grant permissions to other resources.

  6. Cardinality is a silent performance killer. High-cardinality metrics do not cause errors – they just make everything slower. Regular audits of metric cardinality pay for themselves in query performance.

  7. The longest debugging sessions are networking problems. DNS, VXLAN, UFW, NetworkPolicy – networking failures are always the hardest to diagnose because the symptoms appear far from the cause. tcpdump on the node is usually the fastest path to understanding.