Context: This is the live docs/ai-lessons.md from the home k3s cluster repository, referenced extensively across posts on this blog — starting with AI Memory System and GitHub Copilot Setup Guide. Every entry exists because its absence caused a production incident. Personal identifiers and internal domains have been replaced with generic placeholders.

Updated: 2026-03-03


Rules discovered through production breakage. Each entry prevents recurrence of a specific failure. Update this file whenever a new non-obvious failure pattern is discovered.

Kubernetes

  • Traefik v3 upgrade (k3s v1.34+) breaks port-9000 Ingress resources: Traefik v3 removes port 9000 from the Kubernetes Service spec (only 80/443 remain). Any Kubernetes Ingress resource targeting traefik:9000 will fail to route and cause non-stop ERR Cannot create service: service port not found spam in Traefik logs every few seconds, which can cause intermittent route reloads affecting other services. Fix: delete the Kubernetes Ingress and use an IngressRoute CRD with api@internal via TraefikService kind instead.
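A minimal sketch of the replacement IngressRoute (Traefik v3 API group; the hostname and entrypoint name are placeholders):

```yaml
apiVersion: traefik.io/v1alpha1   # v3 CRD group; v2 used traefik.containo.us
kind: IngressRoute
metadata:
  name: traefik-dashboard
  namespace: kube-system
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`traefik.example.internal`)
      kind: Rule
      services:
        - name: api@internal      # Traefik's built-in dashboard/API service
          kind: TraefikService
```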

  • Traefik IngressRoute TLS secret must be in same namespace: When an IngressRoute in kube-system specifies tls.secretName: k3s-wildcard-tls, Traefik looks for the secret in kube-system. If the cert-manager Certificate was created in default namespace, Traefik logs Error configuring TLS: secret kube-system/k3s-wildcard-tls does not exist continuously. Fix: either remove secretName (use tls: {} for entrypoint-level TLS), or create a Certificate in the same namespace as the IngressRoute, or manually copy the secret (note: manual copies become stale on renewal).

  • ConfigMaps are read-only mounts: Mounting a ConfigMap directly over a path that an init script needs to modify causes “Resource busy” errors. Mount to /tmp/config-template/ and copy to target in a command override.
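The copy-on-startup pattern above, as a hedged Deployment fragment (image, paths, and names are hypothetical):

```yaml
containers:
  - name: app
    image: example/app:latest
    command: ["/bin/sh", "-c"]
    # Copy the read-only template into a writable location, then start.
    args:
      - cp /tmp/config-template/* /etc/app/ && exec /usr/local/bin/app
    volumeMounts:
      - name: config-template
        mountPath: /tmp/config-template   # read-only ConfigMap mount
      - name: config-writable
        mountPath: /etc/app               # writable copy the app uses
volumes:
  - name: config-template
    configMap:
      name: app-config
  - name: config-writable
    emptyDir: {}
```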

  • Namespace must exist before RBAC: Do not apply Role/RoleBinding to a namespace before the Namespace resource exists. Create namespace first.

  • Pods don’t reload ConfigMaps: After updating a ConfigMap, you must kubectl rollout restart deployment/<name>. Pods do not auto-detect changes.

  • Longhorn storage-over-provisioning: Setting to 100% blocks replica scheduling when nodes are >75% utilized. Use 200% (Longhorn default).

  • Longhorn 50GB volumes can’t do 3 replicas: Each agent has 69GB schedulable; a single 50GB replica plus the replicas of other volumes exhausts a node’s capacity, so three nodes can’t each host one. Use 2 replicas for large volumes.

  • Longhorn volume I/O corruption (Grafana SQLite) — PERMANENTLY FIXED 2026-03-02: Longhorn volume can report attached/healthy while the underlying data returns I/O errors, corrupting Grafana’s SQLite DB. Every request fails with “unable to open database file: input/output error”. Root fix: Switch Grafana to persistence.enabled: false (emptyDir). All dashboards and datasources are provisioned from ConfigMaps via the sidecar — only sessions/preferences are in SQLite, making it safely ephemeral. No more Longhorn PVC = no Longhorn I/O corruption. Also removed deploymentStrategy: Recreate (was only needed for RWO PVC). Template updated in ansible/roles/cluster_services/templates/prometheus-values.yaml.j2. Users must re-login after a pod restart (expected). Do NOT re-add a Longhorn PVC for Grafana.
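The relevant values fragment for the fix above (key paths follow the Grafana Helm chart; this is a sketch, not the full template):

```yaml
grafana:
  persistence:
    enabled: false    # emptyDir-backed; SQLite is ephemeral by design
  sidecar:
    dashboards:
      enabled: true   # dashboards come from ConfigMaps, not the volume
    datasources:
      enabled: true
```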

  • Longhorn volume I/O corruption recovery (general pattern): When any workload (not just Grafana) shows I/O errors but Longhorn reports volume attached/healthy, the volume data is corrupted at the replica level. General recovery: (1) kubectl scale --replicas=0 the Deployment/StatefulSet, (2) delete the PVC (Longhorn auto-deletes the volume), (3) restore from Longhorn S3 backup: list backups in Longhorn UI → create volume from backup → create PV/PVC from the restored volume, (4) update the workload’s PVC name if it changed, (5) scale back up. This was used successfully to restore Jellyfin after rapid Proxmox reboots caused I/O corruption across multiple volumes.

  • PostgreSQL PGDATA on Longhorn: Must set PGDATA=/var/lib/postgresql/data/pgdata — Longhorn PVCs contain lost+found that breaks default initdb path.
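A container fragment showing the required layout (volume name hypothetical):

```yaml
env:
  - name: PGDATA
    value: /var/lib/postgresql/data/pgdata  # subdirectory dodges lost+found
volumeMounts:
  - name: data                              # the Longhorn PVC
    mountPath: /var/lib/postgresql/data
```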

  • code-server emptyDir shadows PVC writes: If an emptyDir volume is mounted at a subpath of the home PVC (e.g., /home/coder/.kube), init container writes to that path go to the PVC but the main container’s emptyDir mount hides them. Fix: mount the emptyDir in the init container too, or copy from a secret mount in the main container’s startup script.

  • code-server apt-get fails on startup (DNS not ready): Container startup apt-get commands fail silently when DNS isn’t ready yet (pod networking starts before CoreDNS is reachable). Fix: retry apt-get update up to 3 times with 10s sleep between attempts. Never suppress stderr entirely (> /dev/null 2>&1) — pipe through | tail -5 to catch errors.
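A sketch of the retry pattern for a startup script. The `RETRY_SLEEP` knob is an addition for testability; the entry above uses a fixed 10s:

```shell
RETRY_SLEEP=${RETRY_SLEEP:-10}

# retry <attempts> <command...>: re-run a flaky command, keeping the last
# lines of its output instead of discarding stderr wholesale.
retry() {
  attempts=$1; shift
  i=1
  while true; do
    if out=$("$@" 2>&1); then
      printf '%s\n' "$out" | tail -5
      return 0
    fi
    printf '%s\n' "$out" | tail -5 >&2
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$RETRY_SLEEP"
  done
}

# usage in a container startup script:
# retry 3 apt-get update
```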

  • code-server init container lacks unzip: The ghcr.io/coder/code-server:4.109.2 base image doesn’t include unzip. Terraform is distributed as a ZIP, so sudo apt-get install -y -qq unzip is needed before downloading terraform in the init container.

  • Longhorn PVCs can’t attach to Lima node (RESOLVED): Previously CSINode lima-k3s-agent does not contain driver driver.longhorn.io — this was a Docker-in-Docker era limitation. With the Lima VM (Debian 13 arm64), Longhorn works fully: DaemonSets run, disk is schedulable, replicas are placed. However, workloads with ReadWriteOnce PVCs that must not land on Lima for architectural reasons should still use node affinity runtime NotIn [lima].

  • Longhorn on root disk wastes ~60% of NVMe: Without dedicated disks, Longhorn shares the OS root filesystem. On a 512GB NVMe with 50GB server + 100GB agent boot disks, only ~98 GiB (agent) and ~49 GiB (server) are visible to Longhorn — the rest is consumed by OS, container images, and k3s. Fix: add dedicated Longhorn disks via Terraform additional_disks + Ansible longhorn_disk role. This separates I/O and reclaims stranded capacity.

  • Longhorn replica-auto-balance: Set to best-effort to redistribute replicas across all schedulable nodes. Without this, newly added nodes (like the Lima 1TB disk) get zero replicas until volumes are recreated. Existing volumes keep their original placement until auto-balance migrates them.

  • Longhorn instance-manager requires privileged PodSecurity: The longhorn-system namespace MUST have pod-security.kubernetes.io/enforce: privileged. Using baseline blocks instance-manager pods from starting (hostPath volumes + privileged container). Symptom: all Longhorn volumes stuck in attaching state forever, error violates PodSecurity "baseline:latest": hostPath volumes. Fix: patch kubernetes/core/pod-security-standards.yaml to set enforced: privileged for longhorn-system and re-apply. This caused a full Prometheus + postgres outage after a pve1 cold boot.
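The namespace stanza that must end up applied (fragment; the repo file referenced above carries the full set of namespaces):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: longhorn-system
  labels:
    pod-security.kubernetes.io/enforce: privileged  # baseline blocks instance-manager
```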

  • Longhorn: lima (arm64) replicas get trimmed when amd64 node recovers — [HISTORICAL — Lima VM removed 2025-06-25]: When k3s-agent-1 (or any node) goes down, Longhorn creates replacement replicas — some may land on lima. When the node recovers and its original replicas re-activate, the volume briefly has 4+ healthy replicas. Longhorn trims the “extras” and preferentially removes the most recently added ones (lima’s). Lima ends up with 0 replicas again even though the scheduler initially chose it. This is expected behavior, not a bug. New PVCs created after the recovery will land on lima fairly.

  • Longhorn server node disk pressure (shared root disk): Control plane nodes (k3s-server-1/2/3) share the root filesystem with Longhorn. When the root disk hits ~75% usage, Longhorn’s storageAvailable drops below storageReserved (30% of disk), causing silent disk pressure — the Longhorn node condition won’t mark it as DiskPressure until scheduling is attempted. Fix: disable allowScheduling on that disk (kubectl patch nodes.longhorn.io k3s-server-X --type=merge -p '{"spec":{"disks":{"..":{"allowScheduling":false,...}}}}'). Long-term: provision dedicated /dev/sdb Longhorn disks on server nodes via Terraform additional_disks.

  • Longhorn disk overcommit causes DiskPressure + Unschedulable: With storage-over-provisioning-percentage: 200, Longhorn can schedule 2× the disk size, which is fine until ACTUAL disk usage catches up. When storageScheduled > storageMaximum AND storageAvailable < storageReserved, the node shows Unschedulable in the Longhorn UI. Fix: enable disk eviction (evictionRequested: true, allowScheduling: false) to drain all replicas off the disk. Only THEN re-enable scheduling. Simply deleting a few replicas is insufficient; use full eviction. After eviction, k3s-server-3 dropped from 61 GiB scheduled to 0.

  • Never allow Longhorn to schedule replicas on server nodes (PERMANENT POLICY — 2026-03-03): Server nodes (k3s-server-1/2/3) have 80GB root disks shared with the OS, k3s binaries, container images, and etcd WAL. Longhorn overcommit (200%) makes this appear spacious, but actual write pressure fills the disk and returns I/O errors to the attached volume — causing WAL corruption in Prometheus and any other write-heavy workload. Agent nodes (k3s-agent-1/2/3/4) have 300GB disks with ample headroom. Resolution: permanently disable allowScheduling on all server node Longhorn disks. The longhorn_no_schedule_nodes variable in group_vars/all.yml lists all three server nodes; the cluster_services Ansible role re-applies this on every run. To evict existing replicas: kubectl patch nodes.longhorn.io k3s-server-X -n longhorn-system --type=merge -p '{"spec":{"disks":{"<disk-id>":{...,"allowScheduling":false,"evictionRequested":true}}}}'. Get the disk ID first: kubectl get nodes.longhorn.io k3s-server-X -n longhorn-system -o json | python3 -c "import sys,json; d=json.load(sys.stdin); [print(k) for k in d['spec']['disks']]". Eviction is complete when replica count on the node drops to 0 (kubectl get replicas.longhorn.io -n longhorn-system -o json | python3 -c "import sys,json; r=json.load(sys.stdin); print(sum(1 for i in r['items'] if i['spec'].get('nodeID')=='k3s-server-X'))").

  • Tdarr (s6-overlay) containers crash with runAsUser: Tdarr uses s6-overlay init system which must start as root to set supplementary groups and drop privileges via PUID/PGID env vars. Setting securityContext.runAsUser causes s6-applyuidgid: fatal: unable to set supplementary group list: Operation not permitted in an infinite restart loop. Fix: remove runAsUser/runAsGroup/fsGroup from the pod/container securityContext. This applies to most LinuxServer.io-style images that use s6-overlay (Radarr, Sonarr, Bazarr, etc.) — they handle user switching internally.

  • Tdarr workers need privileged: true for VA-API: Same as Jellyfin — non-privileged containers can’t perform DRM ioctls needed for Intel QuickSync/VA-API hardware encoding. Workers detect h264_qsv, hevc_qsv, hevc_vaapi as working only with privileged mode.

  • GPU taints cause more harm than good — removed: gpu=true:NoSchedule taints on agent nodes blocked ALL non-GPU pods (CI/CD runners, general workloads) from scheduling. With the taints in place, only 4 of the cluster’s 7 nodes (3 control-plane + 4 workers) accepted general pods, and the cluster was resource-exhausted. GPU workloads (Jellyfin, Tdarr) already use nodeSelector: gpu: intel-uhd-630 to target GPU nodes, making taints redundant. Fix: removed taints entirely. GPU targeting happens via labels + nodeSelector, not taints. If GPU isolation is ever needed again, use PreferNoSchedule (soft) instead of NoSchedule (hard).

  • Security scanner over-labeling namespaces with restrict PodSecurity: A deployed security scanner service labeled ALL application namespaces with pod-security.kubernetes.io/enforce: restricted (or baseline), blocking pod creation for services that legitimately need elevated permissions. Affected: cardboard, trade-bot (Postgres + Flask → need baseline), home-assistant (hostNetwork=true → needs privileged), dev-workspace (docker-in-docker privileged container → needs privileged), proxmox-watchdog (hostNetwork=true → needs privileged). Pods don’t get evicted, but once they crash or are recreated they hit FailedCreate forever — symptom is 0 pods in namespace despite replica count > 0. Fix: kubectl label namespace <ns> pod-security.kubernetes.io/enforce=<level> --overwrite, then kubectl rollout restart. Always bake the correct PodSecurity labels directly into the namespace YAML in the repo so re-applies don’t clobber them. Mapping: postgres+Flask → baseline; hostNetwork/hostPort/docker-in-docker → privileged; pure stateless apps with secure images → restricted.

  • Longhorn replica failedAt + stale diskID after VM rebuild: When an agent VM is destroyed and rebuilt, its Longhorn disk ID changes. Replicas referencing the old disk ID fail with “cannot find disk name for replica” and the volume goes “faulted/detached”. Clearing failedAt on the replica doesn’t help if the disk ID no longer exists on any node. Fix: delete the volume and restore from S3 backup, or delete and recreate fresh if data is disposable.

  • Longhorn Volume CRD requires frontend: blockdev: Creating a Longhorn Volume via kubectl with fromBackup for restore MUST include spec.frontend: blockdev and spec.accessMode: rwo. Without frontend, the admission webhook rejects with “invalid volume frontend specified”. Copy the full spec from an existing healthy volume when in doubt.

  • Workloads targeting GPU nodes use nodeSelector, not tolerations: GPU workloads (Jellyfin, Tdarr, Home Assistant) use nodeSelector: gpu: intel-uhd-630 or kubernetes.io/hostname to land on agent nodes. Taints were removed — tolerations in manifests are now harmless but unnecessary.

  • NFS mounts fail on Lima node (192.168.1.56) — [HISTORICAL — Lima VM removed 2025-06-25]: The UGreen NAS (192.168.30.10) only allows NFS from the k8s VLAN (192.168.20.0/24). Pods that mount NFS PVs (radarr, sonarr, bazarr, jellyfin, media-controller, etc.) MUST use nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution with runtime NotIn [lima]. Without this, Longhorn volume affinity can schedule pods on the Lima node where NFS mount fails with “access denied by server”.

  • Always use Recreate strategy for single-replica apps: RollingUpdate with maxUnavailable: 0, maxSurge: 1 creates a new pod, waits for it to become ready, THEN kills the old one. With Longhorn RWO PVCs, the new pod can’t attach the volume until the old pod releases it — deadlock. Recreate kills old first, then starts new. All single-replica stateful app Deployments must set strategy: type: Recreate.
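The Deployment fragment this prescribes (one replica, RWO PVC):

```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod dies first, releasing the RWO volume
```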

  • Bound PVC volumeName is immutable — must match manifest: When a PVC is restored from a backup PV (e.g., via Longhorn restore workflow), the cluster PVC gets spec.volumeName set to the PV name. If the manifest omits volumeName, kubectl apply fails with “spec is immutable after creation: volumeName”. Fix: add volumeName: <pv-name> to the PVC manifest to match cluster state. Run kubectl get pvc <name> -n <ns> -o yaml | grep volumeName to find the bound PV name.

  • Workloads rescheduled to lima after node drain — [HISTORICAL — Lima VM removed 2025-06-25]: When amd64 worker nodes go down (e.g., pve1 shutdown for hardware upgrade), pods without nodeSelector or nodeAffinity may reschedule onto the lima arm64 node. This causes: (1) exec format error for amd64-only images (all digital-signage microservices, tdarr), (2) NFS access denied for media stack workloads, (3) unexpected StatefulSet issues (trade-bot postgres). Fix: add nodeSelector: {kubernetes.io/arch: amd64} to ALL amd64-only workloads. For critical workloads with NFS dependencies, also add runtime NotIn [lima] affinity.

  • K8s env var $(VAR) interpolation requires ordering: In a Deployment’s env list, $(DB_USER) is only interpolated if DB_USER is defined EARLIER in the list. Define referenced vars BEFORE the var that uses them. Otherwise, the literal string $(DB_USER) is passed to the container.
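A sketch of correct ordering (DATABASE_URL is a hypothetical consumer):

```yaml
env:
  - name: DB_USER          # must be defined first...
    value: app
  - name: DATABASE_URL     # ...so this reference interpolates
    value: postgres://$(DB_USER)@postgres:5432/app
```

Swapping the two entries would pass the literal string `$(DB_USER)` to the container.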

  • K8s optional: true on secretKeyRef for phased deployments: When a Secret won’t exist until after an OAuth flow or manual step, use optional: true on all secretKeyRef entries referencing it. Without this, the pod fails to start with CreateContainerConfigError because K8s can’t find the Secret. This is useful for YouTube/OAuth credentials that require a one-time human authorization step before they can be populated.
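The shape of such an entry (Secret and key names hypothetical):

```yaml
env:
  - name: YT_REFRESH_TOKEN
    valueFrom:
      secretKeyRef:
        name: youtube-oauth   # may not exist until the OAuth flow runs
        key: refresh_token
        optional: true        # pod starts even while the Secret is absent
```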

ARC (GitHub Actions Runner Controller)

  • ARC labels field replaces defaults: If you set labels, self-hosted disappears. Jobs with runs-on: [self-hosted, ...] won’t match. ALWAYS include self-hosted explicitly.
  • ARC API group is .dev not .net: Use actions.summerwind.dev/v1alpha1. Common copy/paste error with .net.
  • ARC secret key name: Must be github_token (lowercase, underscore). NOT github_pat or GITHUB_TOKEN.
  • ARC metrics behind kube-rbac-proxy: ARC v0.27.x uses kube-rbac-proxy sidecar on port 8443 (HTTPS) for metrics instead of direct port 8080. Prometheus scrape config needs scheme: https, tls_config: insecure_skip_verify: true, bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token, and a ClusterRoleBinding granting the Prometheus SA get on non-resource URL /metrics.
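Assembled from the fields above into a scrape-config sketch (job name, namespace, and the port-matching relabel rule are assumptions; adapt to your discovery setup):

```yaml
- job_name: arc-controller
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [actions-runner-system]
  relabel_configs:
    # keep only the kube-rbac-proxy sidecar port
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex: "8443"
      action: keep
```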
  • Do not use standard HPA with ARC: Use HorizontalRunnerAutoscaler CRD with PercentageRunnersBusy metric.
  • Do not mix ARC v1 and v2 fields: githubConfigUrl is v2 only.
  • ARC runner image lacks pip/node/npm: Bootstrap with ensurepip or apt-get install python3-pip. Use --break-system-packages (PEP 668).
  • Org-level GitHub secrets don’t work with ARC: Use repo-level secrets for CI/CD.
  • ARC runners on arm64 nodes break CI/CD — [HISTORICAL — Lima VM removed 2025-06-25, no more arm64 nodes]: runtime: docker-desktop NotIn affinity didn’t match Lima VM on Mac Mini. Runners scheduled on arm64 failed with exec format error when installing x86_64 AWS CLI or building amd64 Docker images. ALWAYS use kubernetes.io/arch: amd64 nodeAffinity for runners. Same applies to any workload whose Docker image is built for amd64 only (including all app deployments and CronJobs).
  • CI/CD RBAC escalation: When CI creates Roles in target namespaces, the runner’s own Role needs escalate + bind verbs on rbac.authorization.k8s.io/roles,rolebindings, AND every verb it grants must already be held by the runner.
  • Runner RBAC bootstrapping — must apply manually: kubernetes/github-runners/zolty-mat-runners.yaml is NOT applied by the “Deploy K8s Applications” workflow (bootstrapping problem — the runner can’t grant itself new permissions). When adding new namespaces or resources to runner Roles, you must kubectl apply -f kubernetes/github-runners/zolty-mat-runners.yaml from a local kubeconfig with cluster-admin. Always do this before triggering a deploy that needs the new permissions.
  • Runner missing serviceaccounts verb causes apply failure: When the alert-responder (or any) manifest includes a ServiceAccount resource, the runner must have get/list/create/patch/update on serviceaccounts in that namespace. The error is: serviceaccounts "<name>" is forbidden: User "system:serviceaccount:arc-runner-system:github-runner" cannot get resource "serviceaccounts". Add serviceaccounts to the namespace’s deploy Role resources list.
  • Rollout timeouts 120s too short with Longhorn PVCs: After Recreate strategy kills the old pod, a Longhorn ReadWriteOnce PVC must detach from the old node before the new pod can start (30-60s). Combined with image pull time, the total easily exceeds 120s. All CI/CD workflows use --timeout=300s for kubectl rollout status. Never go lower than 300s for any deployment using Longhorn PVCs.

MetalLB

  • MetalLB pool annotation: Do NOT use metallb.universe.tf/address-pool unless targeting a specific pool. Omit to auto-assign from default homelab-pool. Using default as pool name fails with “unknown pool”.
  • MetalLB v0.14.x labels: Native manifest uses app=metallb,component=controller (NOT app.kubernetes.io/component=controller).
  • spec.loadBalancerIP + MetalLB annotation = conflict: Using both spec.loadBalancerIP (deprecated K8s field) AND metallb.universe.tf/loadBalancerIPs annotation on the same Service causes MetalLB to reject or ignore the request. Use ONLY the annotation. When Traefik’s HelmChartConfig sets loadBalancerIP via spec, patch the live Service to remove it and add the annotation instead.
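A sketch of the annotation-only form (IP is a placeholder):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: kube-system
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.20.100  # placeholder IP
spec:
  type: LoadBalancer
  # no spec.loadBalancerIP here -- the annotation is the only source of truth
```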
  • HelmChartConfig on-disk manifest reverts kubectl patches: k3s auto-applies manifests from /var/lib/rancher/k3s/server/manifests/custom/ on every restart. If you fix a HelmChartConfig via kubectl apply but don’t update the on-disk YAML, the next k3s restart silently reverts your fix. Always update the on-disk manifest (on server-1) AND the Ansible template simultaneously.

k3s Upgrades

  • k3s servicelb hijacks port 22 — breaks SSH to all nodes: k3s servicelb creates DaemonSets (svclb-*) that bind host ports via iptables for every LoadBalancer Service. If ANY Service uses port: 22 (e.g., code-server SSH), servicelb binds port 22 on EVERY node, intercepting SSH connections and routing them to the Service’s pods instead of the real sshd. Symptoms: SSH connects but fails authentication/key exchange (you’re talking to code-server, not sshd), sshd logs show zero incoming connections. Diagnosis: kubectl get ds -n kube-system | grep svclb to find offending DaemonSets, iptables -t nat -L -n | grep 22 to see the DNAT rules. Fix: disable servicelb entirely when using MetalLB. MetalLB handles LB allocation without binding host ports.
  • k3s --disable flags lost during manual upgrade: When upgrading k3s via the install script (curl -sfL https://get.k3s.io | sh -s - server), flags like --disable=servicelb from the original install are NOT preserved. The systemd ExecStart is overwritten. Fix: use /etc/rancher/k3s/config.yaml with disable: [servicelb] — this persists across upgrades. Create this file on ALL server nodes before upgrading.
  • /etc/rancher/k3s/config.yaml is the upgrade-safe config method: Command-line args in systemd ExecStart or Ansible k3s_extra_server_args can be lost during k3s upgrades. The config.yaml file is always read by k3s on startup regardless of how the binary was upgraded. Prefer config.yaml for critical settings like disable, tls-san, cluster-cidr, etc.
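A minimal config.yaml sketch combining the settings mentioned above (values are placeholders for your environment):

```yaml
# /etc/rancher/k3s/config.yaml -- read on every k3s start, survives upgrades
disable:
  - servicelb
tls-san:
  - 192.168.20.20   # placeholder; list your API endpoints/VIPs
```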
  • k3s upgrade wipes authorized_keys on Debian: Observed during v1.29→v1.34 upgrade — /root/.ssh/authorized_keys was empty on all nodes after upgrade. Root cause unclear (possibly cloud-init re-run). Always verify SSH access immediately after upgrading k3s. Use qm guest exec via Proxmox as out-of-band recovery if SSH breaks.
  • After disabling servicelb, host keys change: When servicelb was intercepting port 22, ssh-keyscan captured the code-server pod’s host key, not the real sshd. After disabling servicelb, the real sshd responds with different keys. Must re-run ssh-keygen -R + ssh-keyscan for all node IPs after the fix.

Networking & DNS

  • LACP order-of-operations: NAS/client bond mode MUST change before switch LAG: When converting from active-backup to 802.3ad LACP, change the client side (NAS bond mode) FIRST, then enable LAG on the switch. If the switch is set to “Aggregating” (LACP) while the client is still in active-backup, the switch enters LAG mode and starts sending LACP PDUs. The active-backup client ignores them, traffic drops to zero, and the client networking stack may crash/hang (observed with UGOS Pro DXP4800 — took a full power cycle to recover). Correct order: (1) change the NAS bond to Dynamic link aggregation-bond4 (802.3ad) in the UGOS Pro UI, (2) verify the NAS is back up, (3) then set the switch ports to Aggregating. Also: in swctrl port show detail, LAG state (D) means detached/failed and (U) means up/active. Verify with cat /proc/switch/mac_table | grep vlan=30 — both NAS MACs should show type=lag.
  • 802.3ad LACP distributes by flow hash, not bytes: A single TCP flow always uses one LAG member (based on src/dst IP/MAC hash). Two simultaneous clients from different IPs will each use a different member, utilizing both links. Don’t expect counter equality on both ports from a single client — asymmetry is correct and expected.
  • UniFi LAG API field is op_mode: "aggregate" + aggregate_members: The working API format for LAG via PUT to /proxy/network/api/s/default/rest/device/<id> uses port_overrides[].op_mode: "aggregate" with aggregate_members: [7,8] and lag_idx: 1. The field aggregate_num_ports does NOT work. The UniFi UI sets this correctly — use the UI for LAG changes and verify via API.
  • Debian 13 cloud images only enable the main apt component: intel-media-va-driver-non-free and other non-free packages aren’t available until the non-free and non-free-firmware components are added to /etc/apt/sources.list.d/debian.sources. The Ansible gpu_worker role handles this automatically with ansible.builtin.replace on the Components: line.
  • kubeconfig default server is k3s-server-1’s static IP, not a floating VIP: The default kubeconfig points to https://192.168.20.20:6443 (k3s-server-1 on pve1) — this is NOT a kube-vip floating address, it’s the static node IP. When pve1 goes down, all kubectl commands fail with dial tcp 192.168.20.20:6443: connect: network is unreachable. Workaround: target another control plane with kubectl --server=https://192.168.20.21:6443 --insecure-skip-tls-verify <cmd>. etcd retains quorum as long as 2/3 control plane nodes are up (pve2 and pve3). Long-term fix: configure kube-vip for a floating VIP that follows the etcd leader.
  • After a Proxmox host outage, pods with Longhorn volumes show I/O errors even after nodes rejoin: When pve1 (or any host) goes down and comes back, previously-mounted Longhorn volumes may return Input/output error to running pods (e.g., failed to load config.xml: Input/output error). The Longhorn volume itself recovers to robustness: healthy within minutes of the node rejoining. The pod’s existing mount point is stale — it was established while the replica was degraded. Fix: kubectl rollout restart deployment/<name>. Do NOT assume I/O errors mean volume data corruption — always check kubectl get volume -n longhorn-system <vol> -o jsonpath='{.status.robustness}' first. If healthy, a rollout restart is sufficient. Observed with Sonarr and Jellyfin config volumes after 2026-03-01 pve1 hard shutdown.
  • Plex transcode EmptyDir eviction — evicted pod stays as Error indefinitely: Plex uses an emptyDir with sizeLimit: 5Gi for transcoding scratch space. When active transcoding fills it, the kubelet evicts the pod (exit code 137, reason: Evicted, message: Usage of EmptyDir volume "transcode" exceeds the limit). The Deployment’s ReplicaSet immediately creates a replacement pod on the same node. The evicted pod remains in Error state indefinitely — it does NOT get cleaned up automatically. The cluster will show two Plex pods: one Running (replacement) and one Error (evicted). Fix: kubectl delete pod -n media <evicted-pod-name>. Consider increasing emptyDir.sizeLimit if transcoding large files, or ensuring Tdarr pre-transcodes to lower bitrates before Plex serves them.
  • Seedbox sync files accumulate in /media/staging when Radarr/Sonarr are down: The seedbox-sync CronJob (runs every 4h) uses rsync to pull from user@<seedbox-ip> (port 2222) path /home/user/Downloads/complete/ into /media/staging/, then triggers DownloadedMoviesScan/DownloadedEpisodesScan API calls. If Radarr or Sonarr are crashed/restarting when the cron fires, the rsync succeeds (exit 0) but the API call fails (non-fatal WARN). Files accumulate in /media/staging without being imported, causing the seedbox disk to fill. After recovering the arr services, manually trigger both scans: wget --header='X-Api-Key: <key>' --post-data='{"name":"DownloadedMoviesScan","path":"/media/staging"}' --header='Content-Type: application/json' http://10.43.82.213:7878/api/v3/command (Radarr) and the same pattern for Sonarr at 10.43.85.233:8989. API keys are in the arr-api-keys Secret in the media namespace.
  • pve4 NVMe PCIe AER RxErr — symptom of aging consumer NVMe: The WD PC SN720 512GB NVMe (PCI ID 15b7:5002) on pve4 logs PCIe Physical Layer correctable RxErr errors in dmesg (nvme 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer). SMART reports PASSED with zero NVMe error log entries and zero media/data integrity errors — the PCIe link is correcting these errors transparently. However, recurring correctable errors indicate signal integrity degradation (loose M.2 connector, thermally fatigued PCIe traces, or end-of-life controller). At 42,347 power-on hours (~4.8 years continuous runtime) on a consumer NVMe, this drive is past comfortable service life. Action: (1) Reseat the M.2 drive — loose connector is the most common cause. (2) Plan replacement: SN720 is the pve4 OS boot disk; failure takes the entire host offline. A 512GB+ NVMe is ~$50-70. Longhorn replicates all k3s data to 3 other nodes so workload data is safe, but pve4 OS loss requires Proxmox reinstall. Monitor with dmesg | grep -c RxErr — count was 64 as of 2026-03-01.
  • VMs on untagged 192.168.1.x subnet have no internet: The UDM Pro doesn’t NAT traffic from VMs placed on the untagged/native VLAN (192.168.1.x). VMs can reach other VLANs via inter-VLAN routing but cannot reach the internet. All k3s VMs must be on VLAN 20 (192.168.20.x) for internet access. vlan_id = 0 (or omitted) in Terraform = no VLAN tag = 192.168.1.x = no internet.
  • ansible/.env sets AWS env vars that override AWS_PROFILE: source .env in the ansible directory sets AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (cert-manager-dns01 credentials). These environment variables take precedence over AWS_PROFILE, silently overriding the intended IAM identity. When switching between Ansible and Terraform, always unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY before using AWS_PROFILE=terraform.
  • Proxmox VLAN-aware bridge needs bridge-vids: Setting bridge-vlan-aware yes on vmbr0 and creating sub-interfaces (vmbr0.20, vmbr0.30) is NOT sufficient. The physical bridge port (nic0) defaults to only VLAN 1 (PVID). Tagged VLAN frames from the switch are dropped at the bridge port before reaching the sub-interfaces. Fix: add bridge-vids 2-4094 to the vmbr0 stanza in /etc/network/interfaces. Runtime fix: bridge vlan add dev nic0 vid 20 && bridge vlan add dev nic0 vid 30. Verify with bridge vlan show.
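The shape of the /etc/network/interfaces stanza (address/gateway lines omitted; nic0 per the entry above):

```
auto vmbr0
iface vmbr0 inet static
    bridge-ports nic0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```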
  • UniFi Teleport VPN reserves 192.168.2.0/24: Teleport silently claims this subnet. Creating a VLAN network on the same range fails with api.err.SettingSubnetOverlapped (key: teleport). Check Teleport subnet via API: GET /proxy/network/api/s/default/get/setting/teleport. Either disable Teleport, change its subnet, or use a different /24 for your VLAN.
  • CoreDNS single-replica on Mac Mini: k3s defaults to 1 CoreDNS pod. If it lands on Mac Mini (Docker-in-Docker), VM-based pods can’t resolve DNS → full cluster outage (Longhorn CSI crash → all PVC pods stuck). Ensure 2+ replicas via coredns_replicas Ansible variable.
  • Alpine localhost resolves to ::1 (IPv6): Python HTTPServer bound to 0.0.0.0 won’t respond to wget localhost. Use 127.0.0.1 explicitly. Kubelet probes use pod IP (IPv4) so they work.
  • Traefik port 80 serves HTTPS: Entrypoint forces TLS. Must use https:// URL with SSL context when probing Traefik internally.
  • VS Code Remote-SSH requires TCP port forwarding: sshd_config MUST have AllowTcpForwarding yes or connection fails with “administratively prohibited”.

AWS & Terraform

  • ECR tokens expire after 12 hours: Pull secrets must be refreshed on every deploy.
  • dynamodb_table is deprecated: Use use_lockfile = true in S3 backend (Terraform 1.13+).
  • S3 lock files are .tflock objects: If stuck, delete with aws s3 rm s3://bucket/key.tflock.
  • Backend migration stale locks: Can chain. Use -lock=false to break the cycle.
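A backend sketch using the lockfile mechanism (bucket/key/region are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket       = "example-tf-state"
    key          = "k3s/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true   # replaces the deprecated dynamodb_table
  }
}
```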
  • Grafana CloudWatch datasource needs .env file: The GRAFANA_CW_SECRET_KEY env var must be set in ansible/.env (sourced before running playbooks). Without it, the Jinja2 template falls back to CHANGE_ME, causing SignatureDoesNotMatch errors on all CloudWatch/billing dashboards. Retrieve the secret from Terraform: terraform output -raw grafana_cloudwatch_secret_access_key.
  • Terraform cloud-init IP applied on VM reboot — verify tfvars match reality: When terraform apply updates VM parameters (e.g., memory), cloud-init config is also regenerated from tfvars, even if the apply errors with “ide2: hotplug problem”. The rebooted VM picks up the NEW cloud-init config. If ip_address in terraform.tfvars is wrong (e.g., 192.168.1.20 instead of 192.168.20.20), the VM boots with the wrong IP, breaking etcd peering and cluster access. Fix: always verify tfvars IPs match actual deployed IPs BEFORE running terraform apply. If already applied with wrong IP: fix via qm guest exec <vmid> -- sed -i 's/old/new/' /etc/netplan/50-cloud-init.yaml && qm guest exec <vmid> -- netplan apply, then fix tfvars and re-apply.

Helm

  • Stuck Helm release (pending-upgrade/pending-rollback): If a previous helm upgrade or rollback failed mid-operation, subsequent upgrades fail with “another operation is in progress”. Fix: helm history <release> -n <ns> to find stuck revisions, then delete their secrets (kubectl delete secret sh.helm.release.v1.<release>.v<N>). Only delete pending-* revisions, not the last deployed one.
  • StatefulSet strategy is NOT Recreate: Helm charts using StatefulSets (e.g., Open WebUI) reject strategy.type: Recreate — StatefulSets only support RollingUpdate or OnDelete. Use OnDelete for single-replica StatefulSets with Longhorn RWO PVCs. This is different from Deployments which do support Recreate.
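For charts that expose the raw Kubernetes field, the working setting from the StatefulSet entry above looks like this (a sketch of the API shape; the values path to reach it varies by chart):

```yaml
spec:
  updateStrategy:
    type: OnDelete   # StatefulSets accept only RollingUpdate or OnDelete
```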

AWS Bedrock

  • Bedrock model access is account-level, not IAM-level: Even with correct IAM permissions (bedrock:InvokeModel, bedrock:Converse), models fail with “Model access is denied due to IAM user or service role is not authorized to perform the required AWS Marketplace actions.” Fix: go to AWS Console → Bedrock → Model access → Request access for the specific models. IAM policy also needs aws-marketplace:ViewSubscriptions and aws-marketplace:Subscribe permissions.
  • LiteLLM image tags use -stable suffix: The correct LiteLLM container image tag format is main-v1.81.12-stable, not main-v1.63.14. Tags without -stable may not exist on GHCR. Always check https://github.com/BerriAI/litellm/pkgs/container/litellm for current tags.
  • LiteLLM cross-region inference profiles: Bedrock model IDs prefixed with us. (e.g., us.anthropic.claude-sonnet-4-20250514-v1:0) use cross-region inference profiles for automatic failover. These require inference-profile/* in IAM resource ARNs, not just foundation-model/*.
  • Bedrock Claude 3.5 Haiku model access denied: Claude 3.5 Haiku (anthropic.claude-3-5-haiku-20241022-v1:0) may require explicit marketplace subscription even when other Claude models work. Claude Haiku 4.5 (anthropic.claude-haiku-4-5-20251001-v1:0) works without additional marketplace actions. Use Haiku 4.5 as the fast/cheap model.
  • Terraform k3s-homelab-ci lacks IAM create perms: The terraform AWS profile uses k3s-homelab-ci user which cannot iam:CreateUser. Use the default profile (homelab-admin) for IAM provisioning, or add IAM management permissions to the CI user.
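A hedged IAM statement fragment for the cross-region inference-profile entry above (the account ID is a placeholder; foundation-model ARNs have an empty account field because the models are AWS-owned):

```json
{
  "Effect": "Allow",
  "Action": ["bedrock:InvokeModel", "bedrock:Converse"],
  "Resource": [
    "arn:aws:bedrock:*::foundation-model/*",
    "arn:aws:bedrock:*:123456789012:inference-profile/*"
  ]
}
```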

Observability & Alerting

  • Prometheus disk full kills ALL metrics ingestion: When Prometheus WAL fills the PVC, no new samples are written. All dashboards go stale. Check df -h /prometheus when dashboards show gaps. Fix: patch PVC to larger size, delete the prometheus pod (StatefulSet recreates it, Longhorn auto-expands filesystem). prometheus_storage_size: "30Gi" and prometheus_retention: "14d" are safe defaults for a homelab.
  • Prometheus high-cardinality apiserver buckets: apiserver_request_duration_seconds_bucket, etcd_request_duration_seconds_bucket, apiserver_request_sli_duration_seconds_bucket, apiserver_request_body_size_bytes_bucket, apiserver_response_sizes_bucket, and apiserver_watch_events_sizes_bucket generate ~315K series (55% of TSDB) but are rarely used in dashboards. Drop them via kubeApiServer.serviceMonitor.metricRelabelings in the Helm values. The _sum and _count aggregates are kept, which is sufficient for latency percentile monitoring via recording rules.
  • Longhorn default-replica-count vs existing volumes: Changing the Longhorn default-replica-count setting only affects NEW volumes. Existing volumes retain their original replica count. To reduce replicas on existing volumes, patch each one: kubectl -n longhorn-system patch volumes.longhorn.io <vol> --type merge -p '{"spec":{"numberOfReplicas":2}}'. Longhorn evicts extra replicas automatically and volumes stay healthy throughout.
  • pve-exporter v3.4.5 metric names changed: Older dashboards use pve_node_cpu_usage, pve_node_memory_*, pve_storage_*, pve_node_uptime_seconds. Actual metric names are pve_cpu_usage_ratio, pve_memory_usage_bytes, pve_disk_usage_bytes, pve_uptime_seconds with id labels like node/pve1, storage/pve1/local-lvm. Filter with {id=~"node/.*"} for nodes.
  • github-exporter /metrics blocks on API calls: The githubexporter/github-exporter container’s /metrics endpoint makes synchronous GitHub API calls. With default 5s probe timeout, kubelet kills the container → CrashLoopBackOff. Fix: set probe timeoutSeconds to 30s, add startupProbe with failureThreshold 10. Also increase memory limit to 256Mi.
  • github-exporter.yaml Secret overwrites real PAT: If the manifest has stringData.github_token: "REPLACE_ME" and you kubectl apply -f the whole file, it overwrites the manually-set secret. Always update the PAT separately: kubectl create secret generic github-exporter-token -n monitoring --from-literal=github_token=<PAT> --dry-run=client -o yaml | kubectl apply -f -
  • Traefik scrape config relabel bug: The additionalScrapeConfigs relabel for Traefik used __meta_kubernetes_pod_annotation_prometheus_io_port as the sole source for __address__, producing 9100:9100 (port:port) instead of ip:port. Fix: use two source labels [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port] with regex: (.+);(.+) and replacement: ${1}:${2}.
  • UniFi poller 429 death spiral → account lockout: UnPoller re-authenticates every poll interval (~15s) even while in its own “retry backoff” state. After CrashLoopBackOff restarts, hundreds of rapid auth requests trigger UDM Pro’s brute-force protection (AUTHENTICATION_FAILED_LIMIT_REACHED). The account lockout is per-account (not per-IP) and persists for 30+ minutes even with zero traffic. Poll interval alone doesn’t prevent it. Fix: (1) increased UP_UNIFI_DEFAULT_POLL_INTERVAL to 120s, (2) added startupProbe with failureThreshold: 10 so startup auth failures don’t trigger immediate restarts, (3) increased livenessProbe.periodSeconds to 120 to reduce restart frequency. To recover from lockout: log into UDM Pro UI as admin → Settings → Security → unlock the service account.
  • ServiceMonitors without valid /metrics cause permanent TargetDown: Before adding a ServiceMonitor, verify the service actually exposes /metrics. oauth2-proxy v7.6.0 requires --metrics-address=:44180 for a separate metrics port. code-server does not expose Prometheus metrics at all. Home Assistant requires Prometheus integration + Long-Lived Access Token configured before its /api/prometheus endpoint returns 200.
  • Grafana community dashboards from 2014-2018 don’t work: gnetId 139 (AWS Billing), 617 (EC2), 575 (S3) use legacy query formats incompatible with modern Grafana CloudWatch plugin. Replace with custom ConfigMap dashboards using metricEditorMode: 0 and metricQueryType: 0 fields. Dashboard ConfigMaps with grafana_dashboard: "1" label are auto-loaded by sidecar.
  • pve_guest_info has no status label in pve-exporter v3.4.5: The PveVmDown alert using pve_guest_info{status!="running"} fires for ALL non-template VMs because the status label doesn’t exist (empty string != “running” is always true). Use pve_up{id=~"qemu/.*"} == 0 joined with pve_guest_info{template="0"} instead.
  • k3s has no kube-proxy: k3s uses built-in iptables/nftables, not kube-proxy. Enabling kubeProxy monitoring in kube-prometheus-stack creates a ServiceMonitor that can never scrape anything, causing permanent TargetDown alerts. Set kubeProxy.enabled: false.
  • Alertmanager email with empty SMTP credentials: If alertmanager_smtp_enabled: true but ALERTMANAGER_SMTP_USER/ALERTMANAGER_SMTP_PASSWORD env vars are not set, Alertmanager generates 100% email send failures, triggering AlertmanagerFailedToSendAlerts and AlertmanagerClusterFailedToSendAlerts. Disable SMTP until credentials are configured.
  • Loki compactor corruption on Longhorn: Loki’s compactor can get corrupted files (input/output error on /var/loki/compactor/deletion/delete_requests), causing infinite CrashLoopBackOff. Fix: scale StatefulSet to 0, run busybox pod with same PVC to rm -rf /var/loki/compactor/deletion, scale back to 1. Data loss is minimal (only pending delete requests).
  • Loki TSDB index corruption on Longhorn: Loki’s TSDB shipper cache index files can become corrupted (input/output error on /var/loki/chunks/index/index_<N>/fake/*.tsdb.gz), causing CrashLoopBackOff on startup (error: “error initialising module: store”). The Longhorn volume is healthy but the individual index file is corrupt. Fix: scale StatefulSet to 0, run kubectl run loki-cleanup --image=busybox --restart=Never -n monitoring --overrides='...' mounting the PVC to rm -rf /var/loki/chunks/index/index_<N> /var/loki/tsdb-shipper-cache/index_<N>, then scale back to 1. Data loss is one day’s TSDB index table (Loki resyncs from object storage). Grafana shows no data while Loki is down because the Loki datasource returns errors.
  • Prometheus WAL corruption on Longhorn (“Grafana shows no data”): When the Prometheus WAL has a corrupt segment on Longhorn (write to WAL: log samples: write /prometheus/wal/000000XX: input/output error), Prometheus continues running (2/2 Running) but silently drops ALL scrape data — TSDB head has data but maxTime is hours/days stale. Grafana shows “No data” on every panel. The file can’t be removed with rm while mounted (also returns I/O error). Fix: (1) scale the Prometheus Operator to 0 first (kubectl scale deployment prometheus-kube-prometheus-operator -n monitoring --replicas=0) to prevent it fighting back, (2) scale the StatefulSet to 0, (3) run a busybox pod mounting the PVC — the PVC root is /data/prometheus-db/wal/ NOT /prometheus/wal/ — and rm -f all WAL segments (keep checkpoint.*), (4) scale StatefulSet back to 1, (5) scale operator back to 1. Data loss is WAL segments only (minutes of data); existing TSDB blocks on the PVC survive intact. Verify recovery with count(up{job!=""} == 1) query returning non-zero.
  • Longhorn replica ERR silently propagates I/O errors to workload mid-write: When a Longhorn volume replica goes ERR (e.g., due to a node losing storage connectivity or the instance-manager being restarted during heavy I/O), the volume remains “attached/healthy” at the Longhorn API level but write calls from the workload return input/output error. Any file mid-write at that moment becomes permanently corrupted — the file exists on disk but is unreadable. This is how both Prometheus WAL and Loki TSDB become corrupted simultaneously: both volumes had replicas on the same node that experienced a brief storage disruption. Root cause for the Feb 24 2026 outage: k3s-agent-4 was added to the cluster Feb 21, and Longhorn auto-balance (best-effort) migrated replicas onto it. During the first 24-48h the node had storage-level disruptions (possibly related to UPS testing/configuration), causing the instance-manager to restart and ERR all replicas on that node. Any volume with a replica on k3s-agent-4 could have received I/O errors during writes. Prevention: Monitor longhorn_volume_robustness (should be 0 = Healthy). Alert on longhorn_volume_robustness > 0. Consider setting replicaSoftAntiAffinity: false to prevent all replicas landing on a single node. After adding a new node, watch Longhorn events for several hours before considering the node stable.
  • Monitoring storage on NAS NFS (migrated Feb 2026): After the Feb 24 2026 Longhorn WAL/TSDB corruption incident, Prometheus TSDB, Loki chunks, Grafana state, and AlertManager state were all migrated from longhorn StorageClass to nfs-monitoring StorageClass backed by the DXP4800 NAS at /volume1/monitoring. This eliminates exposure to Longhorn replica I/O errors for these high-write workloads. The provisioner is nfs-subdir-external-provisioner deployed via Helm into the monitoring namespace — it dynamically creates subdirectories per PVC. Prerequisites before running Ansible: Create the /volume1/monitoring NFS share in UGOS Pro UI (Control Panel → File Services → NFS → Add share → path /volume1/monitoring → client 192.168.20.0/24 → permissions Read/Write, root_squash). This share must exist BEFORE the cluster_services role runs or the provisioner pod will CrashLoopBackOff. Migration is destructive: existing Longhorn PVCs for prometheus-db-prometheus-*, loki-*, grafana-*, alertmanager-* must be manually deleted after scaling down the stack — data is lost (acceptable for metrics/logs). New PVCs auto-provision on NFS. Note: After migration, reclaimPolicy=Retain means deleting a PVC does NOT delete the NFS subdirectory. Clean up /volume1/monitoring/ manually if needed.
  • bpg/proxmox v0.94.0: The Terraform provider does not support a timeouts block on its resources.
  • bpg/proxmox hostpci device vs id fields are counterintuitive: In the hostpci block, device is the PCI slot name ("hostpci0", "hostpci1", etc.) and id is the actual PCI address ("0000:00:02.0"). These names are the opposite of what you’d expect. Getting them swapped causes "property is not defined in schema" errors because the PCI address fails slot-name validation.
  • bpg/proxmox efi_disk requires file_format and pre_enrolled_keys: Omitting file_format = "raw" and pre_enrolled_keys = false from the efi_disk block causes "efidisk0: invalid format - missing key". Both fields are required even though the provider docs don’t emphasize them.
  • Proxmox cloud-init can’t be hotplugged: Changing cloud-init parameters (ipconfig0, network config) on a running VM fails with "ide2: hotplug problem - unable to change media type". Must stop and start the VM (or destroy/recreate via Terraform) for cloud-init changes to apply.
  • Let’s Encrypt URL typo: The production ACME directory is https://acme-v02.api.letsencrypt.org/directory (acme-v02, not acme-v2).
  • Grafana provisioned dashboards from grafana.com — ${DS_PROMETHEUS} unresolved: The Grafana Helm chart downloads dashboards from grafana.com using an init container. When datasource: Prometheus (string format) is used, the chart generates a generic sed (s/"datasource":.*,/"datasource": "Prometheus",/g) that only handles old-style string datasource refs. Modern dashboards use object-style refs ({"type":"prometheus","uid":"${DS_PROMETHEUS}"}) which the sed can’t match or corrupts. Fix: use list format in Helm values: datasource: [{name: DS_PROMETHEUS, value: prometheus}]. This generates proper targeted replacement. Dashboard panels show “No data” when ${DS_PROMETHEUS} is left unresolved since Grafana can’t find a datasource with that UID.
  • Grafana starred dashboards lost on PVC recreation: Starred dashboards are per-user preferences stored in Grafana’s SQLite DB on the PVC. When the PVC is recreated (I/O corruption recovery), all stars are lost. Fix: use a “Home Hub” dashboard (home-hub UID) provisioned via ConfigMap (grafana_dashboard: "1" label) as the default landing page. Set it as home via both grafana.ini (default_home_dashboard_path) in Helm values AND Grafana API (PUT /api/org/preferences). ConfigMap-provisioned dashboards survive PVC recreation. The API call is needed for immediate effect; the grafana.ini path persists it across Helm upgrades.
  • Prometheus PVC size mismatch after recreation: When a Prometheus PVC is deleted and recreated (e.g., I/O corruption recovery), the new PVC may use a smaller default size instead of the configured prometheus_storage_size. Always verify PVC capacity matches after recreation. Longhorn supports online volume expansion: kubectl patch pvc <name> --type=json -p='[{"op":"replace","path":"/spec/resources/requests/storage","value":"30Gi"}]'.
  • NetworkPolicy namespaceSelector: {} does NOT match host-network IPs: Egress rules using namespaceSelector: {} only match pod CIDRs within the cluster. The Kubernetes API server (10.43.0.1:443 → endpoints at 192.168.20.20-22:6443) and kubelets (:10250) run on host-network IPs that are NOT pod IPs. An egress NetworkPolicy with only namespaceSelector: {} blocks Prometheus from reaching the API server, killing ALL Kubernetes service discovery — only static/file-based scrape configs (additionalScrapeConfigs) work. Same applies to any external host IPs (Proxmox node-exporters, NAS, etc.). Fix: add explicit ipBlock rules for node subnets (192.168.20.0/24 for VLAN 20, 192.168.1.0/24 for management VLAN). The monitoring namespace needs broad egress — it scrapes API server, kubelets, node-exporters, and application pods across many ports. Symptom: Prometheus logs show dial tcp 10.43.0.1:443: connect: connection refused on every namespace’s Service/Endpoints/Pod list.
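A hedged kube-prometheus-stack values fragment implementing the high-cardinality bucket drop described above (only the _bucket series named in that entry; _sum and _count survive):

```yaml
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - action: drop
        sourceLabels: [__name__]
        regex: "apiserver_(request_duration_seconds|request_sli_duration_seconds|request_body_size_bytes|response_sizes|watch_events_sizes)_bucket|etcd_request_duration_seconds_bucket"
```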
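The corrected Traefik relabel rule from the scrape-config entry above, as a fragment (the default source-label separator is `;`):

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: "(.+);(.+)"        # capture pod IP and annotation port separately
    replacement: "${1}:${2}"  # produce ip:port, not port:port
    target_label: __address__
```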
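Steps of the Prometheus WAL recovery above can be sketched as a one-off cleanup pod (PVC and pod names are illustrative; note the PVC-root path differs from the in-pod /prometheus path):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prom-wal-cleanup
  namespace: monitoring
spec:
  restartPolicy: Never
  containers:
    - name: cleanup
      image: busybox
      # WAL segment files are numeric (00000012, ...); rm -f skips the
      # checkpoint.* directories, which must be kept.
      command: ["sh", "-c", "rm -f /data/prometheus-db/wal/0* && ls -la /data/prometheus-db/wal"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: prometheus-db-prometheus-0   # illustrative; use the real PVC name
```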
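The robustness alert suggested in the Longhorn replica entry above, as a hedged PrometheusRule sketch. Caveat: upstream Longhorn docs describe robustness values as 0=unknown, 1=healthy, 2=degraded, 3=faulted, while the entry above observed 0 = Healthy, so verify which value your Longhorn version exports as healthy before relying on > 0:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-robustness
  namespace: monitoring
spec:
  groups:
    - name: longhorn
      rules:
        - alert: LonghornVolumeNotHealthy
          expr: longhorn_volume_robustness > 0   # per the entry above: 0 = Healthy
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} robustness is {{ $value }}"
```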
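An egress fragment matching the NetworkPolicy entry above (ports omitted for brevity; subnets as listed in that entry):

```yaml
egress:
  - to:
      - namespaceSelector: {}        # matches pod CIDRs only, never host-network IPs
  - to:
      - ipBlock:
          cidr: 192.168.20.0/24      # node VLAN: API server endpoints, kubelets
      - ipBlock:
          cidr: 192.168.1.0/24       # management VLAN: Proxmox exporters, NAS
```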

Docker-in-Docker (Mac Mini) — DEPRECATED

The Mac Mini was migrated from Docker-in-Docker to a Lima VM (Debian 13 arm64) to enable Longhorn storage and full DaemonSet compatibility. The Lima VM was subsequently removed from the cluster on 2025-06-25 — the Mac Mini is no longer a k3s node. Do not re-add the Lima VM. All DinD and Lima lessons below are kept for historical reference only.

  • DaemonSets with host path mounts fail: Longhorn, Promtail, node-exporter require host FS. In DinD, exclude via node affinity: runtime NotIn [docker-desktop]. In Lima VMs, all DaemonSets work natively.
  • k3s auto-labels instance-type=k3s: Must --overwrite to set custom values like mac-mini.
  • Lima VM replaces Docker-in-Docker: Docker-in-Docker on macOS cannot provide host-level block devices, iSCSI, or persistent paths that Longhorn requires. Lima VM with Debian 13 arm64 gives a real Linux kernel with full /dev, /proc, /sys access. External disk passed as raw disk image via Lima additionalDisks. Node label changed from runtime=docker-desktop to runtime=lima.
  • Lima bridged networking (NOT shared): Lima’s default SLIRP networking NATs the VM behind the host, making k3s pod-to-pod and MetalLB impossible. Use networks: [{lima: bridged}] for real LAN IP via socket_vmnet. shared mode gives a 192.168.105.x NAT IP which also fails. Requires one-time limactl sudoers setup plus brew install socket_vmnet and copying the binary to /opt/socket_vmnet/bin/socket_vmnet (Lima rejects symlinks). After first boot, set a DHCP reservation in UniFi for a stable IP.
  • Lima additional disk naming: Create a raw disk image on external drive (qemu-img create -f raw /path/datadisk 1000G). The file MUST be named datadisk (not <name>.raw). In Lima v2, additionalDisks uses string form (- "longhorn") not object form. Lima auto-formats and mounts additional disks at /mnt/lima-<name>, so do NOT manually mkfs in provisioning — just symlink /var/lib/longhorn -> /mnt/lima-longhorn.
  • UFW firewall blocks VXLAN for new nodes: The Ansible hardening role enables UFW with deny incoming policy and allows traffic only from IPs in the k3s_cluster inventory group. When adding a node NOT in the Ansible inventory (e.g., Lima VM), VXLAN overlay networking breaks silently: packets leave the new node, arrive at other nodes’ eth0 (visible in tcpdump), but the kernel’s INPUT chain drops them before VXLAN decapsulation. Symptoms: pods on the new node can ping LAN IPs (no overlay) but NOT pod CIDRs on other nodes; DNS via ClusterIP fails; longhorn-csi-plugin CrashLoopBackOff. Fix: add ufw allow from <new-node-ip> on ALL existing nodes. Ansible: add IP to k3s_external_node_ips in group_vars/all.yml.
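For historical reference, the Lima networking and disk settings from the entries above as a hedged lima.yaml fragment (Lima v2 syntax):

```yaml
networks:
  - lima: bridged      # real LAN IP via socket_vmnet; 'shared' NATs behind the host
additionalDisks:
  - "longhorn"         # string form; the backing file on disk must be named 'datadisk'
```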

VLAN Migration & etcd

  • netplan apply does NOT reorder kernel IPs: When a VM boots with 192.168.20.x listed first in netplan, the kernel assigns it as the primary address. Subsequent netplan apply with the IP order swapped (old IP first) does NOT change the kernel’s address order — it only adds/removes addresses. Only a full reboot resets the order. Verify with ip addr show eth0, not ip route get.
  • k3s etcd binds to first interface IP, not routing table source: k3s detects the node IP from the first non-loopback address on the default-route interface (ip addr show). This determines where etcd listens (:2379, :2380). Even if ip route get 8.8.8.8 reports src 192.168.1.x, etcd binds to whichever IP appears first in ip addr show. Use --node-ip to force a specific IP.
  • Never delete TLS certs on multiple k3s servers simultaneously: When TLS cert directories are deleted on all servers, each independently generates a new CA on startup. Since the CAs don’t match, etcd peers reject each other with “tls: bad certificate” and quorum can never form. Recovery requires --cluster-reset on one server (which extracts the original CA from the etcd datastore), then wiping etcd data on other servers and re-joining.
  • k3s --cluster-reset is the nuclear but reliable etcd recovery: Converts a 3-node etcd cluster to single-node on the server that runs it. All workload data is preserved. Afterward: start that server normally, wipe /var/lib/rancher/k3s/server/db/ and /var/lib/rancher/k3s/server/tls/ on other servers, then start them — they rejoin and sync. Agent nodes may need restart to refresh their local load balancer’s TLS state.
  • etcd member URLs are stored in etcd data, not config files: When k3s starts with a wrong IP (e.g., 192.168.20.x instead of 192.168.1.x), it writes that IP into etcd’s member/peer URL table. Even after reverting k3s service files and netplan, etcd data still contains the wrong peer URLs. This causes a deadlock: nodes advertise old IPs in config but etcd expects new IPs from data. Fix: either update member URLs via etcdctl member update, or use --cluster-reset + wipe.
  • VLAN-tagged VMs can’t use old-subnet IPs simultaneously: With Proxmox tag=20, the VM’s eth0 traffic is VLAN-tagged. Old-subnet IPs (192.168.1.x, VLAN 1/untagged) become unreachable because the gateway (UDM) receives them on VLAN 20 but expects them on VLAN 1. Dual-IP migration requires VMs to remain untagged during the transition, with VLAN tags applied only after old IPs are fully removed.
  • k3s “TLS newer than datastore” fatal error: When k3s generates new TLS certs (e.g., after deleting the tls/ directory) but can’t form etcd quorum, the new certs get written to disk but NOT to the etcd datastore. On next restart, k3s detects the disk certs are newer than the datastore copy and refuses to start to prevent cluster-wide cert mismatch. Fix: delete /var/lib/rancher/k3s/server/tls/ AND /var/lib/rancher/k3s/server/db/etcd-tmp/ (the staging area from partial starts), then restart.
  • Proxmox lock files from concurrent qm operations: qm reboot or qm guest exec can leave stale lock files at /var/lock/qemu-server/lock-<vmid>.conf. Subsequent qm commands fail with “VM is locked”. Fix: rm -f /var/lock/qemu-server/lock-<vmid>.conf on the Proxmox host.
  • systemd caches unit files across reboots: After editing k3s service files on VMs, if k3s auto-starts on boot (enabled service), it uses the cached pre-edit unit file. Running systemctl daemon-reload && systemctl restart k3s after boot is required for changes like --node-ip to take effect. The daemon-reload must happen BEFORE the restart.
  • --node-ip is mandatory for VLAN migration: Without --node-ip, kubelet auto-detects from the existing node object in etcd, which still has the old IP. Even though the VM only has the new IP in netplan, the node registers with the stale cached IP until --node-ip forces the correct address. Required on ALL nodes (servers AND agents).
  • Successful full-stop VLAN migration procedure (~20 min total downtime):
    1. Stop all k3s on all nodes.
    2. Write new-IP-only netplans.
    3. Update k3s service files with --node-ip=<new> on ALL nodes plus --server=https://<new-server-1>:6443.
    4. Clean TLS on server-1; wipe TLS+DB on servers 2-3.
    5. Set the VLAN tag in Proxmox.
    6. Reboot VMs (qm stop + qm start is more reliable than qm reboot).
    7. Stop auto-started k3s on all nodes.
    8. Run --cluster-reset on server-1.
    9. Start server-1, then servers 2-3, then agents, each with daemon-reload first.
    10. Update Lima VM K3S_URL and restart the agent.
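As a hedged alternative to editing service-file args, k3s also reads --node-ip from its config file (sketch; IP is illustrative and must be set per node):

```yaml
# /etc/rancher/k3s/config.yaml — keys are CLI flag names without leading dashes
node-ip: 192.168.20.21
```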

Proxmox & Hardware

  • Intel e1000e “Hardware Unit Hang”: ThinkCentre M920q Intel I219-LM NICs hang with TSO/GSO/tx-checksumming enabled → “NIC Link is Down” → 3-5s network outage. Fix: disable all hardware offloading via ethtool. See PROXMOX_E1000E_FIX.md.
  • bridge-vids 2-4094 causes “No space left on device” on bonded interfaces: When creating an active-backup bond with a VLAN-aware bridge on Proxmox, using bridge-vids 2-4094 overflows the bridge VLAN table. ifreload fails with “No space left on device” and can take the network down. Fix: only specify the VLANs you actually need (e.g., bridge-vids 20 30). Never use the full range.
  • Mellanox ConnectX-3 Flash Recovery mode can permanently brick the card: When a ConnectX-3 card enters Flash Recovery mode (PCI ID 15b3:01f6), attempting a firmware flash via mstflint may appear to succeed but doesn’t actually write. After reboots, the card can disappear entirely from the PCI bus — the CPU’s PEG root port (PCIe 00:01.0) is disabled by BIOS because PCIe link training fails with the bricked card. The DEVEN register (PCI 00:00.0 offset 0x54 on Intel Q370) controls root port enable/disable: bit 3 = PEG10 (x16 slot). 0x00008031 = disabled, 0x00008039 = enabled. Write attempts to DEVEN are silently rejected by BIOS lock. Physical reseat + cold boot does not help. Remote recovery is impossible — card must be physically removed and replaced. MCX311A-XCAT replacement cards are ~$15-20 on eBay. Resolution (Feb 20, 2026): Replacement card installed in pve1, configured with active-backup bond matching pve2/pve3. All three hosts now have 10GbE.
  • Mellanox ConnectX-3 firmware flash procedure (mstflint): Debian’s mstflint package (apt, v4.31.0) can flash firmware but CANNOT use /dev/mst/ device paths — it errors with “Cannot open MST device”. Use the PCI BDF address instead: mstflint -d 0000:01:00.0 -i <fw_image.bin> burn. After flashing, two cold boots (full power-off, not just reboot) are required before the new firmware version appears in ethtool -i enp1s0. A single reboot is NOT sufficient — the first cold boot loads the new firmware into the NIC’s flash, the second fully initializes it. Verify with ethtool -i enp1s0 | grep firmware-version. FW 2.42.5000 is the final GA release for ConnectX-3.
  • M920q BIOS updates shared across Lenovo Tiny family: The M920q shares its BIOS with M720t/M720s/M720q/M920t/M920s/M920x/P330 Tiny. Latest: M1UKT78A (Jan 2026). Download from Lenovo DS503907. USB EFI flash is the recommended method for Proxmox hosts (no Windows needed). BIOS M1UKT45A introduced a permanent downgrade lock — cannot flash to any version below M1UKT44A. Contains critical patches: Intel Downfall/GDS (CVE-2022-40982), multiple CPU microcode updates, Ubuntu freeze fix, NVMe SSD detection improvements, PXE boot fixes.
  • Diagnosing PCIe device absence: If lspci doesn’t show a PCIe card, check whether the root port itself is enumerated (e.g., 00:01.0 for PEG10). If the root port is missing, the CPU has disabled it. Read the DEVEN register: setpci -s 00:00.0 0x54.L. Compare against a working node. If the bit for your slot is 0, the BIOS disabled it because link training failed — the card is likely bricked.
  • PVE exporter with cluster=1 — single target only: When cluster=1 is set, any single PVE node returns metrics for the entire cluster (all nodes, VMs, storage). Using multiple static targets creates N× duplicate series with mismatched instance labels (e.g., id=node/pve2 with instance=192.168.20.105). Fix: use ONE static target. Any PVE node works as entry point. The pve_node relabel based on target IP becomes misleading and should be removed.
  • software-properties-common doesn’t exist on Debian 13: Ubuntu-only package.
  • community.general.timezone fails on Debian 13: Use timedatectl command instead.
  • SSH sshd_config.d drop-ins: Cannot be validated standalone with sshd -t -f %s.
  • fail2ban on Debian 13: Needs backend = systemd (no /var/log/auth.log by default).
  • k3s v1.29 /healthz returns 401: API readiness checks must accept [200, 401] for unauthenticated probes.
  • k3s upgrade installer overwrites agent env file: Running the k3s install script (even with INSTALL_K3S_SKIP_START=true) creates a fresh empty /etc/systemd/system/k3s-agent.service.env, wiping K3S_TOKEN and K3S_URL. After every agent install, you MUST restore the env file before starting the service. Server nodes are unaffected because they use --cluster-init from service file args. This applies to all upgrade methods (curl installer, Ansible).
  • k3s upgrade must go through each minor version: Skipping minor versions (e.g., v1.29→v1.31) is unsupported and risks etcd/API incompatibilities. Always step through: v1.29→v1.30→v1.31→v1.32→etc. Take etcd snapshots between steps.
  • k3s upgrade: drain with --disable-eviction for Longhorn: Longhorn PodDisruptionBudgets block normal drains. Use kubectl drain --ignore-daemonsets --delete-emptydir-data --force --timeout=90s --disable-eviction. StatefulSets (loki-0, prometheus) may still time out; force-delete them with kubectl delete pod --force --grace-period=0.
  • k3s upgrade: Traefik still pinned to v2 image: Even with k3s v1.34 bundling Traefik Helm chart v27, k3s intentionally pins tag: "2.11.24" in the default traefik.yaml manifest. Traefik v3 migration requires explicit image override — it does NOT happen automatically with k3s upgrades.
  • Longhorn upgrade path has gaps: Not all Longhorn minor versions have the latest patch. For example, v1.7.4 returns 404 from GitHub — use v1.7.3 instead. Always verify the release exists before applying.
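A hedged sketch of persisting the e1000e offload fix from the “Hardware Unit Hang” entry above via /etc/network/interfaces (NIC name is illustrative; PROXMOX_E1000E_FIX.md in the repo is authoritative for the exact offload list):

```
auto nic0
iface nic0 inet manual
    post-up /usr/sbin/ethtool -K nic0 tso off gso off tx off
```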

Shell & Secrets

  • Passwords with ! in kubectl commands: Bash interprets ! as history expansion in double quotes. Use single quotes: --from-literal=PASSWORD='MyPass!123'.
  • gcloud CLI from Homebrew not on PATH: brew install --cask google-cloud-sdk installs to /opt/homebrew/share/google-cloud-sdk/bin/gcloud on Apple Silicon but does NOT add it to $PATH. Either use the full path or add source /opt/homebrew/share/google-cloud-sdk/path.zsh.inc to shell profile.
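A minimal demonstration of the quoting rule (placeholder password):

```shell
# Single quotes keep '!' literal. In an interactive bash session,
# double quotes would invite history expansion on "!123".
PASSWORD='MyPass!123'
printf '%s\n' "$PASSWORD"
```

Passing it on as --from-literal=PASSWORD="$PASSWORD" is safe: history expansion only applies to a literal ! typed inside double quotes on the command line, not to variable expansion.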

Docker Build & Deployment

  • Docker build platform mismatch (arm64 Mac → amd64 k3s): Building on an Apple Silicon Mac without --platform linux/amd64 produces an arm64 image. k3s nodes (Debian 13 amd64) fail with no match for platform in manifest: not found. Always use docker build --platform linux/amd64 --provenance=false. The --provenance=false flag is critical — Docker Desktop adds attestation manifests that k3s containerd can’t resolve, causing the same not found error even with the correct platform.
  • Docker Desktop on macOS hangs silently: docker ps, docker info, docker buildx ls can all hang indefinitely with no error. When this happens, don’t waste time restarting Docker Desktop — build on a cluster node instead. Install docker.io + awscli on a server node, SCP the project, build natively (amd64), push to ECR from the server.
  • ECR auth token piping via SSH: Generate ECR token locally (aws ecr get-login-password), pipe to the remote server via SSH: echo "$TOKEN" | ssh server 'sudo docker login --username AWS --password-stdin <registry>'. Avoids provisioning AWS credentials on the build node.
  • requirements.txt ranges over exact pins: Exact version pins (==X.Y.Z) cause pip install failures when specific versions aren’t available on the target platform (e.g., amd64 Debian). Use >=X.Y.Z,<(X+1).0 ranges for cross-platform compatibility.
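For example (package names and versions illustrative):

```
# requirements.txt — ranges tolerate per-platform version gaps
requests>=2.31,<3.0
boto3>=1.34,<2.0
```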

UniFi Dream Machine Pro

  • API key auth returns 401 on all endpoints: UDM Pro (UniFi OS 3.x) API keys appear to be read-only or scoped. For write operations (DHCP reservations, port forwards), use cookie-based auth: POST /api/auth/login with username/password, then capture TOKEN cookie and X-CSRF-Token header. Include both on subsequent requests.
  • CSRF token required on all mutating requests: After cookie auth, every POST/PUT/DELETE to UDM Pro must include the X-CSRF-Token header from the login response. Omitting it returns 403 Forbidden even with a valid session cookie.
  • api.err.InvalidFixedIP on specific IPs: Certain IPs (e.g., .20) are rejected by UniFi firmware even when not in use. May be reserved by UDM Pro for internal use. Workaround: assign a different IP or set static IP at the OS level.
  • api.err.FixedIpAlreadyUsedByClient persists after forget: After forgetting a client via cmd/stamgr, the conflict cache takes time to clear. The forget-sta command returns rc: ok but DHCP reservation still fails. May need to wait for cache expiry or fix via UniFi UI.
  • DHCP reservation API path: POST /proxy/network/api/s/default/rest/fixedip with body {"mac": "aa:bb:cc:dd:ee:ff", "fixed_ip": "192.168.1.X", "network_id": "<network-uuid>"}. Network ID for Default LAN: <network-uuid>.
  • Port forward API path: POST /proxy/network/api/s/default/rest/portforward with body including name, fwd, fwd_port, dst_port, proto, src, enabled.
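A minimal Python sketch of the cookie + CSRF flow above. Function names are mine; the endpoint paths and payload fields come from the entries in this section. TLS-verification handling for the UDM's self-signed certificate is omitted for brevity.

```python
# Cookie-based auth against a UDM Pro (UniFi OS 3.x), since API keys
# return 401 on write endpoints. The TOKEN cookie and X-CSRF-Token
# header must BOTH be sent on every mutating request, or the UDM
# answers 403 Forbidden.
import json
import urllib.request
from http.cookiejar import CookieJar


def build_fixedip_payload(mac: str, ip: str, network_id: str) -> dict:
    """Body for POST /proxy/network/api/s/default/rest/fixedip."""
    return {"mac": mac, "fixed_ip": ip, "network_id": network_id}


def login(base_url: str, username: str, password: str):
    """POST /api/auth/login; returns (opener holding the TOKEN cookie,
    csrf token from the login response headers)."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    req = urllib.request.Request(
        f"{base_url}/api/auth/login",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    resp = opener.open(req)
    return opener, resp.headers.get("X-CSRF-Token")


def reserve_ip(opener, csrf: str, base_url: str, payload: dict) -> bytes:
    """Create a DHCP reservation using the authenticated session."""
    req = urllib.request.Request(
        f"{base_url}/proxy/network/api/s/default/rest/fixedip",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-CSRF-Token": csrf},
    )
    return opener.open(req).read()
```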

Public Ingress & OAuth2

  • OAuth2 Proxy forwardAuth mode: When using Traefik’s forwardAuth middleware, OAuth2 Proxy must be configured with upstreams = ["static://200"] and reverse_proxy = true. It returns 202 (authenticated) or 401 (redirect to login). Do NOT set it as a reverse proxy upstream.
  • amazon/aws-cli container has curl but NOT wget: The amazon/aws-cli:2.15.0 image includes curl but not wget. DDNS scripts must use curl -sf instead of wget -qO-.
  • DDNS script multi-provider fallback: Use multiple IP-check providers (ifconfig.me, api.ipify.org, icanhazip.com, checkip.amazonaws.com) with fallback. Any single provider can timeout or return HTML instead of plain text.
  • Specific DNS records override wildcard: A Route53 A record for ha.k3s.internal.zolty.systems → 192.168.20.202 (private IP) silently overrode the wildcard *.k3s.internal.zolty.systems → <public-ip>. External clients resolved to a private LAN IP, causing connection timeouts. TCP/TLS tests with hardcoded IPs masked the issue. Fix: delete specific A records for subdomains that should use the wildcard. Diagnostic: time_connect: 0.000000 with time_namelookup > 0 in curl timing means DNS resolved but TCP couldn’t reach the resolved IP.
  • ECR pull secret expiry breaks CronJobs silently: ECR tokens last 12h. If no app deploys refresh the pull secret, daily CronJobs (etcd-backup, postgres-backup) fail to pull images the next morning. Fix: kubernetes/core/ecr-token-refresh/cronjob.yaml runs every 6h and refreshes ecr-pull-secret across all namespaces using the Kubernetes API directly (no kubectl needed). Uses ecr-pull-k3s IAM credentials stored in ecr-refresh-aws-credentials secret.
  • ServiceMonitors without valid /metrics cause permanent TargetDown: Before adding a ServiceMonitor, verify the service actually exposes /metrics. code-server does not expose Prometheus metrics at all — ServiceMonitors targeting /healthz create permanent TargetDown alerts.
  • Traefik errors middleware preserves original status code: The errors middleware intercepts a matching response status (e.g. 401) and replaces the body with the error handler service’s response — but keeps the original 401 status code, not the error handler’s. This means pointing the errors handler at /oauth2/start (which returns 302) still yields a 401 to the client. Use this middleware to pass through redirect headers+cookies even though the browser sees 401; Chrome does follow Location on top-level-navigation 401s, so this partially works in practice.
  • oauth2-proxy ForwardAuth redirect_uri must be hardcoded: Without redirect_url = "https://auth.k3s.internal.zolty.systems/oauth2/callback" in the oauth2-proxy config, it infers the callback URI from the incoming Host or X-Forwarded-Host header. When the Traefik errors middleware calls oauth2-proxy internally after a ForwardAuth 401 (from a jellyseerr request), the inferred host becomes jellyseerr.k3s.internal.zolty.systems/oauth2/callback — not registered in Google Cloud Console → OAuth fails. Always set redirect_url explicitly.
  • Traefik errors middleware order matters for ForwardAuth: List oauth2-redirect-errors before google-oauth in IngressRoute middlewares so it wraps the entire ForwardAuth chain and can intercept the 401. If listed after, the chain stops at ForwardAuth and the errors middleware never runs.
  • Full reverse-proxy oauth2-proxy vs ForwardAuth: ForwardAuth mode always returns 401 for unauthenticated requests; browsers may or may not redirect from that. Full reverse-proxy mode (oauth2-proxy as the actual upstream — used for media-profiler) handles the redirect itself (302 → Google) with zero ambiguity. For services where seamless browser redirect is critical, use full reverse-proxy mode (separate per-service oauth2-proxy deployment with --upstream=http://service-url). ForwardAuth mode is simpler but requires workarounds for the sign-in flow.
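The multi-provider fallback from the DDNS entry earlier in this section can be sketched as follows; the `fetch` parameter is injectable purely so the loop is testable offline, and the provider URLs are the ones listed above.

```python
# Public-IP lookup with provider fallback. Any single provider can
# time out or return HTML, so each response is validated as an IPv4
# address before being trusted.
import ipaddress
import urllib.request

PROVIDERS = [
    "https://ifconfig.me/ip",
    "https://api.ipify.org",
    "https://icanhazip.com",
    "https://checkip.amazonaws.com",
]


def _http_fetch(url: str, timeout: int = 5) -> str:
    return urllib.request.urlopen(url, timeout=timeout).read().decode()


def get_public_ip(providers=PROVIDERS, fetch=_http_fetch) -> str:
    """Return the first response that parses as a valid IPv4 address.
    Providers that time out, error, or return HTML are skipped."""
    for url in providers:
        try:
            candidate = fetch(url).strip()
            ipaddress.IPv4Address(candidate)  # raises on HTML or garbage
            return candidate
        except Exception:
            continue
    raise RuntimeError("all IP-check providers failed")
```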

Wiki.js

  • Wiki.js 2.5 pages.update mutation hangs forever: The GraphQL mutation succeeds (content is saved, render completes) but the HTTP response is never returned. This is a Wiki.js 2.5 bug — the render subprocess completes but the main process never sends back the response. Workaround: fire-and-forget — send mutation with a 3-second timeout, ignore the timeout error, wait 8 seconds for render, then verify via a pages.single read query. See scripts/add-wiki-metrics.py for the pattern.
  • Wiki.js mutation requires tags field: Omitting tags: [] from the update mutation causes “Cannot read properties of undefined (reading ‘map’)”. Always include tags: [].
  • Use GraphQL variables for Wiki.js mutations: String interpolation with manual escaping breaks on Unicode characters (em-dashes U+2014, emoji). Use proper GraphQL $variables with json.dumps() which handles escaping correctly.
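A sketch of the variables-based mutation payload, assuming a simplified `pages.update` signature (the real Wiki.js schema has more required fields). The point is that `json.dumps` handles Unicode escaping and that `tags: []` is always present, per the entries above.

```python
# Build a Wiki.js GraphQL request with $variables instead of string
# interpolation. json.dumps escapes em-dashes, emoji, and quotes
# correctly; manual escaping breaks on exactly those characters.
import json

MUTATION = """
mutation ($id: Int!, $content: String!, $tags: [String]!) {
  pages {
    update(id: $id, content: $content, tags: $tags) {
      responseResult { succeeded }
    }
  }
}
"""


def build_update_request(page_id: int, content: str) -> str:
    payload = {
        "query": MUTATION,
        # tags must always be included, even when empty, or Wiki.js
        # fails with "Cannot read properties of undefined (reading 'map')".
        "variables": {"id": page_id, "content": content, "tags": []},
    }
    return json.dumps(payload)
```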

Alert Responder

  • Longhorn PVC + non-root user = permission denied: When a Dockerfile creates a non-root user (e.g. useradd appuser) and the pod mounts a Longhorn PVC, the volume is owned by root. The non-root user can’t write to it. Fix: add securityContext.fsGroup: <GID> to the pod spec so Kubernetes sets volume group ownership.
  • K8s manifest placeholder secrets overwrite real secrets: If a manifest includes Secret resources with stringData: REPLACE_ME placeholders, running kubectl apply -f manifest.yaml will overwrite any previously-created secrets with the placeholder values. Solution: remove Secret definitions from the manifest entirely and create secrets out-of-band via kubectl create secret.
  • Slack Block Kit 3000-char limit: Slack section blocks have a 3000 character limit on the text field. LLM-generated analysis + code-block remediation prompts can easily exceed this. Always truncate text to ~2800 chars before inserting into blocks.
  • Slack bot not_in_channel error: A Slack bot with chat:write scope cannot post to a channel it hasn’t been invited to. After deploying a new bot, manually /invite @BotName in the target channel. The channels:join scope would allow self-join but isn’t needed if you invite manually.
  • Slack Socket Mode requires outgoing WS pings for event delivery: The slack_sdk SocketModeClient has a threading bug in its Connection class — WebSocket frames corrupt after ~30-50 seconds. Even with a custom websocket-client implementation, Slack only delivers interaction events (button clicks, modals) when the client sends outgoing WebSocket pings. ping_interval=10 (matching the SDK default) is the critical setting. App-level {"type":"ping"} JSON messages alone are NOT sufficient — Slack needs protocol-level WS pings to mark the connection as active. Without them, the WebSocket is bidirectional (can send/receive raw frames) but Slack never routes block_actions or view_submission payloads.
  • Slack action buttons must be top-level blocks, not inside attachments: Slack’s attachments field does not support actions block type. Interactive buttons must be in the top-level blocks array as {"type": "actions", "elements": [...]}. Buttons placed in attachments render but clicks silently fail with no event dispatched.
  • Bedrock Converse API toolResult wrapper required: Tool results in the user role message content array MUST be wrapped in {"toolResult": {"toolUseId": "...", "content": [{"text": "..."}], "status": "success"}}. Putting toolUseId and content directly in the content array item (without the toolResult key) causes ParamValidationError: Invalid number of parameters set for tagged union structure. The content array uses tagged unions — each item needs exactly one discriminator key (text, toolResult, toolUse, image, etc.).
  • Claude models require Marketplace subscription for Bedrock tool-use: Claude models (Sonnet, Haiku) on Bedrock may work for basic converse() calls but return AccessDeniedException when toolConfig is included — even with the model listed as “Access granted” in the Bedrock console. Tool-use requires an explicit AWS Marketplace subscription. Amazon-native models (Nova Micro, Nova Lite, Nova Pro) work with tools without any additional subscription. Use Nova Micro for cost-effective agentic workloads; upgrade to Claude once Marketplace subscription is enabled.
  • ConfigMap env vars override Python code defaults — must update both: os.getenv("VAR", default) reads from the container environment. If a K8s ConfigMap sets that env var, it takes precedence over the code default. Changing only the Python default while the ConfigMap still has the old value has no effect. Always update the ConfigMap AND the code default together. Verify with: kubectl exec <pod> -- python -c "from app.config import Config; print(Config.VAR)".
  • Remediation agent runs inside a minimal container — NOT on nodes: The agent’s run_shell tool executes inside the alert-responder pod (Python 3.12-slim with only kubectl, helm, curl, git). Commands like journalctl, systemctl, service, ps, top, netstat DO NOT EXIST. The agent wasted 5+ steps in a session trying host-level commands that all failed. Fix: added unavailable-command detection, improved tool descriptions to explicitly state container limitations, and added kubectl_exec tool for pod-level diagnostics.
  • Agent guesses pod names instead of looking them up: Job pods have random suffixes (e.g. postgres-backup-29520495-abcde). The agent fabricated postgres-backup-29520495-xxxxxx and got NotFound. Fix: added label_selector parameter to kubectl_get so the agent can query pods -l job-name=<job-name> to discover actual pod names before fetching logs.
  • Agent confuses namespaces across apps: Alert was for cardboard namespace but the agent described a job in trade-bot namespace. Fix: added explicit “Namespace Discipline” rules in the system prompt requiring the agent to always use the namespace from the alert and listing common confusion pairs (trade-bot vs cardboard, longhorn vs longhorn-system).
  • Agent tried wrong Longhorn namespace: Agent queried kubectl get pods -n longhorn (empty) instead of longhorn-system. Fix: added explicit namespace for every infrastructure component in the system prompt (Longhorn → longhorn-system, Traefik → kube-system, monitoring → monitoring, MetalLB → metallb-system, cert-manager → cert-manager).
  • RBAC missing storage.k8s.io and longhorn.io API groups: Agent got Forbidden when listing StorageClasses. Fix: added storage.k8s.io (storageclasses, volumeattachments) and longhorn.io (volumes, replicas, nodes, engines, engineimages) to the ClusterRole.
  • Agent fabricates Job manifests instead of triggering from CronJob: When asked to manually run a backup, the agent created an ad-hoc batch/v1 Job manifest from scratch. The fabricated manifest referenced a PVC (postgres-pvc) that doesn’t exist in the actual CronJob spec, used environment: instead of env: (invalid field), and omitted volumes/service accounts. Each fabrication failure triggered an approval request, wasting 3 approval cycles. Root cause: kubectl_apply allowed kind: Job manifests. Fix: (1) added kubectl_trigger_cronjob tool (LOW-risk) that runs kubectl create job --from=cronjob/<name>, which inherits the full spec automatically, (2) blocked Job creation via kubectl_apply with a clear error redirecting to kubectl_trigger_cronjob, (3) added system prompt rules explicitly forbidding ad-hoc Job manifests.
  • Agent attempted self-RBAC escalation after hitting Forbidden: When kubectl_apply failed with Forbidden, the agent attempted to create a RoleBinding granting itself more permissions. This always fails (ClusterRoleBinding is required and the agent lacks bind verb). The escalation attempt triggered another approval cycle and wasted 2 more steps. Fix: (1) added manifest-level guard in exec_kubectl_apply that detects RBAC manifests targeting alert-responder-agent and returns a BLOCKED error with explanation, (2) added system prompt rule: “If you hit a Forbidden error, STOP and report — never attempt to modify your own RBAC.”
  • Agent never diagnoses the actual root cause when its own actions create noise: In session #15, the original alert was caused by a Longhorn Multi-Attach / volume not ready error on postgres-0 at 3 AM (visible in kubectl_events output). The agent saw the event but didn’t follow it — instead it focused on executing fabricated jobs. The stuck agent-created Job then blocked the CronJob’s concurrencyPolicy: Forbid, preventing the real fix (a clean retrigger). Fix: added a “Diagnosing Backup Failures” protocol in the system prompt that mandates checking postgres StatefulSet health and Longhorn PVC attach status before attempting any Job creation, and requires deleting stuck jobs before retriggering.
  • BackoffLimitExceeded on a CronJob Job leaves a stuck Job blocking concurrencyPolicy: Forbid: When a CronJob Job exhausts its backoff limit, the Job stays in Failed state and is NOT automatically deleted. With concurrencyPolicy: Forbid, the next scheduled CronJob run is skipped entirely (CronJob controller sees an active Job). Diagnosis: kubectl get jobs -n <ns> — Failed jobs still count as “active” for concurrency purposes. Fix: kubectl delete job <name> -n <ns> to clear the stuck job, then kubectl_trigger_cronjob to retrigger.
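The Bedrock `toolResult` wrapper described earlier in this section, as a message builder. This is a sketch: the returned dict is what would go into the `messages` list passed to a bedrock-runtime `converse()` call.

```python
# Converse-API user message carrying a tool result. The content array
# uses tagged unions: each item must have exactly ONE discriminator key
# (text, toolResult, toolUse, image, ...). Putting toolUseId/content
# directly in the item (without the toolResult wrapper) raises
# ParamValidationError.
def tool_result_message(tool_use_id: str, text: str, status: str = "success") -> dict:
    return {
        "role": "user",
        "content": [
            {
                "toolResult": {
                    "toolUseId": tool_use_id,
                    "content": [{"text": text}],
                    "status": status,
                }
            }
        ],
    }
```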

OpenClaw (AI Assistant Gateway)

  • OpenClaw has no published Docker image: Must build custom image from node:22-bookworm-slim + npm install -g openclaw@latest. The @discordjs/opus native module requires build dependencies (python3 make g++ libopus-dev) for node-gyp compilation. Clean build deps after install to save ~200MB.
  • OpenClaw config gateway.bind only accepts keywords: Valid values are loopback|lan|tailnet|auto|custom. Setting gateway.bind: "0.0.0.0" causes validation error and CrashLoopBackOff with “Invalid input”. Use CLI flag --bind lan in Dockerfile CMD instead of config file.
  • OpenClaw --bind lan requires authentication: Gateway refuses to bind to LAN without auth — “Refusing to bind gateway to lan without auth.” Must set OPENCLAW_GATEWAY_TOKEN env var (or --token flag) and use gateway.auth.mode: "token" in config. Use a random hex token stored in K8s Secret.
  • OpenClaw default bind is loopback: Without --bind lan, gateway listens only on ws://127.0.0.1:<port> — invisible to k8s Service/readiness probes. The pod appears Running but readiness probe fails and Service routes zero traffic. Always use --bind lan for k8s deployments.
  • Longhorn encrypted PVCs have persistent CSI staging path failures: After deleting and recreating encrypted PVCs, new volumes get stale staging paths from previous mounts: “Staging target path … is no longer valid for volume …”. This happened 3 consecutive times. Fix: use regular longhorn StorageClass instead. The encryption benefit is marginal for app-level data that isn’t highly sensitive.
  • OpenClaw device pairing blocks Control UI behind reverse proxy: OpenClaw has a device pairing system separate from token/password auth. New devices (browsers) must be explicitly paired before WebSocket connects. Loopback connections auto-approve silently, but remote connections (via Traefik) require manual approval via kubectl exec deploy/openclaw -- openclaw devices approve <request-id>. Symptom: Control UI loads but shows “disconnected (1008): pairing required” even with valid token. Fix: list pending requests with openclaw devices list, then approve with openclaw devices approve <id>. For multi-user deployments behind OAuth2 Proxy, set gateway.controlUi.dangerouslyDisableDeviceAuth: true to skip per-device pairing when shared token auth is already configured.
  • OpenClaw trustedProxies required for proxy header trust: Without gateway.trustedProxies: ["10.42.0.0/16"], gateway logs “Proxy headers from untrusted address” and treats all connections as direct (non-proxied). This breaks client IP detection and causes auth edge cases. Always configure trustedProxies with the pod CIDR when behind Traefik or any reverse proxy.

Operational Lessons

  • Longhorn S3 backup credentials from Terraform: The longhorn-s3-credentials secret in longhorn-system must match the IAM user created by terraform/modules/s3_backups/. Get real values via cd terraform/environments/aws && terraform output -raw backup_user_access_key / backup_user_secret_key. Placeholder credentials (wrong length) cause InvalidAccessKeyId and backup target shows available: false.
  • Don’t remove live kubectl patches before Helm upgrade: When a kubectl patch (e.g., nodeSelector) is protecting pods, don’t remove it until after helm upgrade deploys the equivalent constraint via chart values. Removing the patch first allows pods to reschedule on incorrect nodes during the gap. Sequence: update chart values → helm upgrade → verify pods → patches are now redundant.
  • Stale k3s nodes after migration: When a node is replaced (e.g., mac-mini-agent → lima-k3s-agent), the old node object persists in NotReady state, generating ~12 alerts (NodeNotReady, TargetDown for DaemonSets, etc.). Fix: kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data && kubectl delete node <old-node>.
  • amd64-only images on arm64 nodes: Container images built only for amd64 fail with exec format error when scheduled on arm64 nodes. The pod enters CrashLoopBackOff but the error message may not be obvious in kubectl describe. Fix: add nodeSelector: {kubernetes.io/arch: amd64} or nodeAffinity with kubernetes.io/arch In [amd64]. For Helm charts, update values.yaml nodeAffinity section.
  • PVE memory alerts are false positives on over-provisioned hosts: Proxmox hosts running multiple VMs intentionally use most RAM. High memory usage alerts fire constantly. Monitor swap usage instead — (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5 from host node-exporters (job="proxmox") indicates actual memory pressure.
  • Prometheus additionalScrapeConfigs relabel pitfall: The relabel config source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port] with target_label: __address__ sets the address to just the port value (e.g., 9100:9100). Must use two source labels with [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port] and regex: (.+);(.+) / replacement: ${1}:${2} to construct ip:port.
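The corrected relabel rule from the entry above, as it would appear in `additionalScrapeConfigs` (the label names are the standard `kubernetes_sd` pod-role meta labels):

```yaml
relabel_configs:
  # Join pod IP and the prometheus.io/port annotation into ip:port.
  # Using only the port annotation as source_labels sets __address__
  # to just "9100", which Prometheus turns into "9100:9100".
  - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: (.+);(.+)        # default source-label separator is ";"
    replacement: ${1}:${2}
    target_label: __address__
```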

Networking

  • Proxmox active-backup bond configuration pattern: For 10GbE + 1GbE failover, use bond-mode active-backup with bond-primary enp1s0 (10GbE). The bond (bond0) replaces the bare NIC as the bridge port for vmbr0. Config: bond-slaves nic0 enp1s0, bond-miimon 100, bond-primary enp1s0. Bridge gets bridge-ports bond0 and bridge-vids 20 30 (only needed VLANs). If 10GbE fails, traffic falls back to 1GbE automatically. Verify with cat /proc/net/bonding/bond0.
  • NAS split-VLAN isolation (both switch ports must match): When a NAS has two NICs on separate switch ports, BOTH ports must be assigned the same VLAN port profile. If Port A is on VLAN 30 (Storage) but Port B is still on VLAN 1 (Default), the NAS ARP-replies from both NICs. The UDM learns the MAC→VLAN mapping from the wrong port, causing return traffic to arrive on the wrong VLAN wire. The NAS receives ICMP/TCP but drops replies because they egress on the VLAN-1 NIC while the source IP is 192.168.30.x. Symptom: NAS ARP-resolves (visible in ip neigh) and tcpdump shows outbound ARP requests, but all inbound ICMP/TCP/NFS is blackholed. Every port shows filtered in nmap. Fix: ensure ALL switch ports connected to the NAS have the same native VLAN (e.g., “Storage Only” profile). Verify via UniFi API: port_overrides must include entries for ALL NAS ports, not just one.
  • DNS records must be updated after VLAN re-IP: After migrating nodes to new VLANs (e.g., flat 192.168.1.0/24 → VLAN 20 192.168.20.0/24), updating Terraform code is not enough — you must also run terraform apply to push the changes to Route53. Stale DNS records (internal wildcard, PVE hosts) will silently resolve to old IPs, causing TLS cert validation failures and unreachable dashboards. Always verify live DNS with nslookup <host> 8.8.8.8 after a network migration. The cert-manager-dns01 IAM user in ansible/.env does NOT have S3 permissions for terraform state — use the default AWS profile (~/.aws/credentials) for terraform operations.
  • MetalLB IPAddressPool must be re-applied after VLAN re-IP: Updating metallb_ip_range in Ansible group_vars/all.yml does NOT update the live cluster — the Ansible playbook must be re-run (or kubectl patch the IPAddressPool directly). Stale pool causes MetalLB to continue assigning old-VLAN IPs. When the pool is updated, auto-assigned services grab IPs in creation order — NOT their previous assignments. Services with loadBalancerIP pointing to the old range go <pending> because the IP is no longer valid. Fix: (1) patch IPAddressPool, (2) annotate Traefik with metallb.universe.tf/loadBalancerIPs and remove stale spec.loadBalancerIP via JSON patch (op: remove, path: /spec/loadBalancerIP), (3) pin all other LB services via metallb.universe.tf/loadBalancerIPs annotation. Never use both spec.loadBalancerIP AND metallb.universe.tf/loadBalancerIPs annotation on the same service — MetalLB rejects the conflict. CRITICAL: the manifest file at /var/lib/rancher/k3s/server/manifests/custom/metallb-config.yaml must be updated on ALL THREE server nodes — k3s Addon reconciler watches this file and will revert kubectl patch changes if the on-disk manifest still contains the old IP range. Updating only one server is insufficient; k3s syncs manifests via etcd and whichever server reconciles the addon next will apply whatever is on its local disk.
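A sketch of the post-re-IP pinning pattern above; the pool name and IP values are illustrative, not the cluster's actual configuration:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.200-192.168.20.220   # new-VLAN range (illustrative)
---
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: kube-system
  annotations:
    # Pin via annotation ONLY; spec.loadBalancerIP must be absent
    # (remove it with a JSON patch if set), or MetalLB rejects the
    # conflicting configuration.
    metallb.universe.tf/loadBalancerIPs: 192.168.20.200
spec:
  type: LoadBalancer
```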

Media Stack Architecture

Context for AI sessions implementing the GPU-accelerated media stack. Read this section before modifying anything under kubernetes/apps/media/.

Content Pipeline

```
Jellyseerr (user requests)
  ↓
Radarr (movies) / Sonarr (TV)
  ↓ search indexers
Prowlarr → TorrentLeech
  ↓ add to seedbox
RapidSeedbox (<seedbox-ip>:2222)
  rTorrent downloads → /home/user/Downloads/
  ↓ AutoTools plugin (move on complete)
/home/user/Downloads/complete/
  ↓ ratio plugin: seed until 1.0 ratio OR 2 weeks, then stop
  ↓ rsync --partial over SSH (CronJob every 4h)
  ↓ sync to /volume1/media/downloads/{movies,tv}
DXP4800 NAS (192.168.30.10, Storage VLAN 30)
  ↓ Radarr/Sonarr import & rename → /volume1/media/{movies,tv}
  ↓ NFSv3
Jellyfin (GPU transcode) → Clients
  ↓
Bazarr (subtitle management)
```

Key Design Decisions

  • Jellyfin is monolithic: Cannot split into “direct play” and “transcode” Deployments. The transcode decision is per-stream internal to the server. Deploy as a single Deployment with preferredDuringSchedulingIgnoredDuringExecution GPU node affinity. If GPU node dies → reschedules to any worker → software transcode fallback.
  • NFS for media, Longhorn for state: NFS is ReadOnlyMany or ReadWriteMany depending on service. Longhorn is for config DBs, PostgreSQL, and any stateful components. Never use iSCSI for media — it’s ReadWriteOnce block storage.
  • NFSv3, NOT NFSv4: UGOS Pro’s NFS server doesn’t expose a pseudo-root filesystem. NFSv4 mounts fail with “No such file or directory” even though the export path is correct. NFSv3 works immediately. All PVs use mountOptions: [nfsvers=3].
  • Seedbox sync needs a K8s CronJob: rclone running in a pod with both SFTP (to seedbox) and NFS (to NAS) access. Credentials in K8s Secret created out-of-band. Schedule: every 4h or on-demand.
  • Arr stack for content automation: Radarr (movies) + Sonarr (TV) + Prowlarr (indexers) manage the full download lifecycle. Jellyseerr sends requests → Radarr/Sonarr search via Prowlarr → TorrentLeech → rTorrent downloads to seedbox ~/Downloads/ → AutoTools moves completed to ~/Downloads/complete/ → ratio plugin seeds to 1.0 or 2 weeks then stops → rsync syncs complete/ to NAS downloads/ dir → Radarr/Sonarr import & rename to library.
  • Per-service NAS accounts: Each arr app runs as its own NAS UID (10010-10015) via PUID/PGID env vars (linuxserver images) or runAsUser (Jellyfin). NFS root_squash preserves UIDs. Group media-services (GID 10000) provides shared read access.
  • IngressRoutes per-service, no OAuth: Each media service has its own IngressRoute + Certificate in media namespace. OAuth is not used — TV apps and arr API keys are incompatible with forwardAuth.
  • GPU passthrough requires q35 + OVMF: Existing VMs use i440fx/SeaBIOS. The GPU worker VM must be destroyed and recreated with machine = "q35" and bios = "ovmf". Drain workloads first, verify Longhorn replicas exist elsewhere.
  • hostpci requires root@pam auth: The bpg/proxmox Terraform provider cannot assign PCI devices with API tokens. Verify provider auth before attempting GPU passthrough.
  • Inter-VLAN routing for NFS: k3s nodes (VLAN 20) must reach NAS (VLAN 30). UDM Pro must allow VLAN 20 → VLAN 30 on TCP 2049 (NFS) + UDP 111 (portmapper).
  • DXP4800 is NOT Synology: It runs UGOS Pro. NFS/SMB configuration is in UGOS Control Panel → File Service, not Synology DSM. Don’t assume DSM-style paths or APIs.
  • UGOS Pro REST API auth is unusable: The API (/webman/login.cgi) requires client-side RSA encryption of credentials using a server-provided public key. This makes API automation impractical. Use SSH instead (enable temporarily via Control Panel → Terminal → Enable SSH, auto-disables after 6h).
  • NAS SSH username is mat, not admin: SSH to the DXP4800 uses ssh mat@192.168.30.10, password <nas-password>. The admin username only works for the UGOS Pro web UI, not SSH. sudo also works: echo "<nas-password>" | sudo -S <command>. sshpass on macOS Homebrew fails due to TTY issues with newer OpenSSH — use expect or interactive SSH instead.
  • SCP to NAS /tmp fails (SFTP server restriction): scp uses the SFTP subsystem which restricts writable paths on UGOS Pro — scp mat@nas:/tmp/file returns “No such file or directory”. Direct shell writes to /tmp DO work via SSH interactive sessions (printf '...' > /tmp/file). Use base64-encoded file content split into 80-char chunks sent via expect + printf >> /tmp/file to transfer scripts to the NAS.
  • UGOS Pro mat user has no home directory: /home/mat does not exist. mat drops to / on login. Do not depend on ~ for script paths. Store persistent scripts in /volume1/scripts/ (survives reboots, same volume as media and data).
  • UGOS Pro cron.d entries likely persist across reboots: Unlike /etc/exports (which is regenerated from the UGOS Pro database on every boot), /etc/cron.d/ uses standard Debian cron and is NOT managed by UGOS Pro. Entries installed there should survive reboots. However, if a cron job disappears after a NAS update, re-run scripts/nas/deploy-nas-alerts.sh to reinstall.
  • NAS VLAN 30 has outbound internet access: The NAS (192.168.30.x) can POST to external URLs (Slack webhooks, AWS SES) — confirmed via curl in 2026-02. No explicit inter-VLAN firewall blocking outbound from Storage VLAN. This means the NAS can use SES SMTP directly without going through the cluster-internal email-gateway.
  • NAS Slack alerts script location: /volume1/scripts/nas-slack-alerts.sh (NAS) + scripts/nas/nas-slack-alerts.sh (repo). Uses journalctl --after-cursor to track position. If alerts stop, check /volume1/scripts/.nas-alerts-cursor exists and /var/log/nas-slack-alerts.log is not filling with errors.
  • /etc/exports on UGOS Pro does NOT persist across reboots: UGOS Pro regenerates NFS config on boot from its internal database. Writing directly to /etc/exports and running exportfs -ra restores access immediately but will be lost on next reboot. The persistent fix is via UGOS Pro UI: Control Panel → File Services → NFS → edit the share and add 192.168.20.0/24 as an allowed client. After a NAS reboot, if NFS gives “access denied”, always re-apply via SSH: echo "<nas-password>" | sudo -S bash -c "printf '/volume1/media 192.168.20.0/24(rw,sync,no_subtree_check,root_squash)\n' > /etc/exports" && echo "<nas-password>" | sudo -S exportfs -ra.
  • UGOS Pro NFS real persistence mechanism is /usr/local/bin/restore-nfs-exports.sh + @reboot cron: Despite the earlier note that UGOS Pro regenerates NFS config from its internal database, the actual mechanism is simpler — a script, /usr/local/bin/restore-nfs-exports.sh, is called via /etc/cron.d/nfs-exports at @reboot. Editing this script directly (as root/sudo) is safe and will survive reboots. The UGOS Pro UI writes both the NFS global config to /etc/nfs.json AND updates this restore script, so you can bypass the UI entirely by editing /usr/local/bin/restore-nfs-exports.sh. The script lives in the overlay filesystem (/overlay/upper/usr/local/bin/), so it persists across NAS reboots. Use printf '...\n' >> /etc/exports in the script (not echo) to avoid newline issues. Also update /etc/exports directly and run exportfs -ra for immediate effect. SSH access: ssh mat@192.168.30.10 (via jump k3s-server-1, which is VLAN 20 → VLAN 30), password <nas-password>; sudo works.
  • NFS export shell quoting through nested SSH: When writing /etc/exports on the NAS via triple-nested SSH (local → jump → NAS), echo with single-quoted strings often produces empty files due to quoting escaping. Use printf instead: printf '/volume1/media 192.168.20.0/24(rw,sync,no_subtree_check,root_squash)\n' | sudo tee /etc/exports.
  • Seedbox credentials are secrets: Seedbox SFTP creds, VPS control panel creds, and NAS admin creds must NEVER be committed. Store in K8s Secrets created out-of-band.
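The NFSv3 decision above translates into PVs like this (name, size, and access mode are illustrative; the server, path, and mount option are from the entries in this section):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-movies
spec:
  capacity:
    storage: 1Ti
  accessModes: [ReadOnlyMany]
  mountOptions: [nfsvers=3]   # NFSv4 fails against UGOS Pro (no pseudo-root)
  nfs:
    server: 192.168.30.10
    path: /volume1/media/movies
```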

Seedbox SFTP & rclone Debugging Lessons

  • rclone CronJob activeDeadlineSeconds kills large syncs: A 4-hour deadline (14400s) is fine for incremental syncs (~GBs) but kills the initial bulk sync when hundreds of GBs are waiting on the seedbox. At ~2-5 MiB/s over SFTP, 292 GiB needs 1-2 days. Every scheduled run restarts from zero because rclone sync over SFTP doesn’t resume partial file transfers — so nothing ever completes and no files appear in Jellyfin. Fix: remove activeDeadlineSeconds entirely. concurrencyPolicy: Forbid already prevents overlapping runs. After the initial sync, subsequent 4-hourly runs are incremental and fast.
  • rsync --partial over SSH is better than rclone SFTP for large syncs: rclone sync over SFTP writes to temp files and discards on interruption — no resume. rsync --partial --partial-dir=.rsync-partial keeps partial files in a hidden directory and resumes byte-level on the next run. For 200+ GiB initial syncs at 2-5 MiB/s, this is the difference between 1-2 days total vs an infinite retry loop.
  • Alpine containers running as non-root can’t apk add: If a pod has runAsUser: 10012 (non-root), apk add fails silently. Fix: use an init container running as root (securityContext.runAsUser: 0) to install packages and copy binaries + shared libraries to a shared emptyDir volume. Main container then adds the tools dir to PATH and LD_LIBRARY_PATH.
  • SSH in containers fails with “No user exists for uid N”: OpenSSH client requires the current UID to exist in /etc/passwd. Alpine base images only have root. Fix: in the init container, write a custom passwd file to the shared tools volume, then mount it as /etc/passwd (via subPath) into the main container.
  • Seedbox download paths are case-sensitive: RapidSeedbox default download directory is /home/user/Downloads/ (capital D) on the DATA partition (1.2TB). The lowercase /home/user/downloads/complete/ is on the tiny OS disk (53GB, 0 bytes free). Getting this wrong causes all torrents to stop immediately. Both Radarr/Sonarr download client configs AND remote path mappings must use the correct cased path.
  • SFTP subsystem can be broken while SSH auth works: The SSH daemon may accept connections, complete key exchange, and authenticate successfully — but the sftp-server subprocess never sends the SFTP version/init packet. This manifests as an indefinite hang after “Authenticated using password” in every SFTP client (rclone, native sftp, lftp, curl/libssh2). Always test with native sftp -P 2222 user@host FIRST before debugging rclone config.
  • rclone SFTP config for restricted servers: For seedboxes that are SFTP-only (no shell): shell_type = none, md5sum_command = none, sha1sum_command = none, disable_hashcheck = true, key_use_agent = false. Without these, rclone tries to run shell commands that hang or fail.
  • rclone known_hosts format: For non-standard ports, the format is [host]:port keytype base64key. Scan with ssh-keyscan -p 2222 host 2>/dev/null. Mount as a ConfigMap file and reference via known_hosts_file in rclone.conf.
  • Hung SFTP connections exhaust MaxSessions: Each failed rclone/sftp test that hangs consumes an SSH session on the server. After dozens of tests, no new connections succeed (even from different source IPs). Always kill hung processes and wait before retrying. The seedbox may need a service restart via the control panel.
  • rTorrent XMLRPC uses d.multicall.filtered: NOT d.multicall2. The call signature is d.multicall.filtered("","",field1,field2,...) with two empty string params before the field list.
  • rTorrent d.directory_base.set requires stopping first: You cannot change a torrent’s download directory while it’s active. XMLRPC returns a fault. Pattern: d.stop → d.directory_base.set → d.start. This applies to any migration that moves files and needs to update rTorrent’s internal tracking.
  • ruTorrent plugin settings persist in dat files, not .rtorrent.rc: AutoTools, ratio, and other ruTorrent plugins store their config in /var/www/rutorrent/share/users/<user>/settings/<plugin>.dat as PHP serialized objects. These survive rTorrent restarts (the plugins re-register XMLRPC event handlers when ruTorrent loads). On RapidSeedbox, .rtorrent.rc is auto-generated by the panel — direct edits are overwritten. Use ruTorrent plugins for persistent configuration instead.
  • Configure ruTorrent plugins via POST to action.php: AutoTools: POST /rutorrent/plugins/autotools/action.php with enable_move=1&path_to_finished=/path&fileop_type=Move&add_name=1. Ratio: POST /rutorrent/plugins/ratio/action.php with rat_action0=0&rat_min0=100&rat_max0=200&rat_time0=336&rat_name0=default-seed&default=0. Must use HTTPS to localhost from the seedbox — external DNS for *.seedbox.xip may not resolve.
  • Seedbox SSH rate limiting: Too many rapid SSH connections (>15 in quick succession) trigger rate limiting — connections fail with exit code 255. Use ConnectTimeout=20 and space out connections. This is especially relevant during migration scripts that make many XMLRPC-over-SSH calls.
  • rclone SFTP password must be obscured: Use rclone obscure <plaintext> to generate the value for RCLONE_SFTP_PASS. The K8s Secret stores the obscured form, NOT the plaintext.
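The restricted-server settings above assemble into a stanza like this (remote name, host, user, and the known_hosts path are placeholders):

```shell
# Write an example rclone.conf stanza for an SFTP-only seedbox (no shell).
# The pass line is deliberately omitted — set it from `rclone obscure`
# output, never plaintext.
cat > /tmp/rclone-seedbox.conf <<'EOF'
[seedbox]
type = sftp
host = seedbox.example.net
port = 2222
user = user
shell_type = none
md5sum_command = none
sha1sum_command = none
disable_hashcheck = true
key_use_agent = false
known_hosts_file = /config/known_hosts
EOF
grep -c ' = none' /tmp/rclone-seedbox.conf   # prints 3
```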

Seedbox Details (Architecture Only — No Credentials)

  • Provider: RapidSeedbox
  • IP:
  • Services: Deluge (web UI), ruTorrent (web UI), SFTP (:2222), FTP (:21), OpenVPN, Remote Desktop (:300)
  • Plex: Available but not unlocked (using Jellyfin on-prem instead)
  • Sync method: rsync --partial over SSH from seedbox ~/Downloads/complete/ → NFS mount on DXP4800 (via k3s CronJob seedbox-sync). Only syncs completed downloads — AutoTools plugin moves finished torrents from ~/Downloads/ to ~/Downloads/complete/. Originally rclone SFTP, but rclone can’t resume partial files over SFTP. rsync resumes byte-level from where it left off via --partial-dir=.rsync-partial.
  • Torrent client: rTorrent (engine) + ruTorrent (PHP web UI). Runs in a SCREEN session. AutoTools plugin: move completed to ~/Downloads/complete/. Ratio plugin: seed to 1.0 ratio or 2 weeks, then stop. Throttle: 5 downloads + 5 uploads max (throttle.max_downloads.global.set = 5, throttle.max_uploads.global.set = 5).
  • rTorrent memory: ~63MB RSS (2.4% of 2.5GB) at steady state with ~33 torrents. Initial spike to ~300MB during startup while loading torrent metadata, settles within minutes.
  • VPS Control Panel: master2.rapidseedbox.com:5656 (credentials in password manager)
  • Secret name: seedbox-ssh in media namespace (keys: SSH_HOST, SSH_PORT, SSH_USER, SSH_PASS — plaintext, not rclone-obscured). Legacy seedbox-sftp and seedbox-ftp secrets still exist (unused).
  • Known SFTP issue (2025-07, RESOLVED 2026-02): SFTP subsystem was broken server-side in July 2025. As of Feb 2026, SSH/SFTP are working again — rsync over SSH confirmed operational.
  • Seedbox session-creation hang (2025-07, RESOLVED): Was a data partition mount issue. Resolved — SSH sessions now work normally.
  • Don’t exhaust seedbox with test connections: Each FTP/SSH connection that hangs after auth consumes memory (~20-50MB per hung process). With only 2.5GB RAM, after ~20 hung tests the ENTIRE server becomes unresponsive (even HTTPS dies). Always use tight timeouts (-m 10) and make ONLY ONE test per reboot.
  • VPS panel VM ID changes per session: The vi= parameter for _vm_remote.php API calls changes every login session. Must scrape it from control.php JavaScript (vi:"<id>"), not reuse old values.

Exportarr Instrumentation Patterns

  • Exportarr v2.3.0 is distroless: Image ghcr.io/onedr0p/exportarr:v2.3.0 uses gcr.io/distroless/static:nonroot (UID 65534). No shell available — can’t use wrapper scripts or shell commands.
  • CONFIG option only parses XML: Exportarr’s CONFIG env var reads ApiKey + Port from *arr’s config.xml. Works for Radarr/Sonarr/Prowlarr. Does NOT work for Bazarr (uses YAML config, not XML). Bazarr needs explicit API_KEY from a K8s secret.
  • fsGroup grants sidecar read access: Pod-level fsGroup: 10000 adds supplemental group 10000 to all containers (including exportarr at UID 65534), enabling read access to config.xml files created by linuxserver images (PGID=10000).
  • JellyseerrDown alert was permanently firing: Original alert used absent(up{job="jellyseerr"}) but no ServiceMonitor existed for Jellyseerr (no native Prometheus metrics). The up metric was always absent, so absent() always returned 1, causing the alert to fire permanently. Fixed to use kube_deployment_status_replicas_available instead.
  • Exportarr sidecar port 9707 is safe: Each *arr app runs in its own pod, so all exportarr sidecars can use the same port 9707 without conflicts.
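A hedged sketch of the resulting pod spec — the exportarr image tag, port 9707, and fsGroup: 10000 come from the entries above, while the Radarr container details (image, port 7878, args) are illustrative assumptions:

```yaml
spec:
  securityContext:
    fsGroup: 10000            # grants exportarr (uid 65534) read access to config.xml
  containers:
    - name: radarr
      image: lscr.io/linuxserver/radarr:latest   # assumption: linuxserver image, PGID=10000
    - name: exportarr
      image: ghcr.io/onedr0p/exportarr:v2.3.0    # distroless, runs as uid 65534
      args: ["radarr"]
      env:
        - name: PORT
          value: "9707"
        - name: URL
          value: "http://localhost:7878"         # assumption: Radarr default port
        - name: CONFIG
          value: "/config/config.xml"            # ApiKey + Port parsed from XML
      ports:
        - containerPort: 9707
```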

Mermaid Diagrams (Wiki.js 2.5)

Wiki.js 2.5 bundles Mermaid 8.8.2 (hardcoded in package.json, never updated). Many modern Mermaid features silently fail with “Syntax error in graph”. Wiki.js 3.0 does not exist. These rules apply to all diagram content on the wiki.

Syntax NOT supported in Mermaid 8.8.2

  • direction TB / direction LR inside subgraphs: Only the top-level graph TB / graph LR directive controls direction. Subgraph-level direction was added in Mermaid 9.x. Remove any direction keyword inside subgraphs — the parent graph direction applies.
  • <--> bidirectional arrows: Not supported. Use two separate one-way arrows: A --> B and B --> A.
  • & multi-target connections (e.g., A --> B & C & D): Not supported. Expand to individual lines: A --> B, A --> C, A --> D. Same for source-side: A & B --> C becomes A --> C, B --> C.
  • ::: class shorthand: Not supported. Use style commands instead.
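The four rules above combine into this 8.8.2-safe pattern (node names illustrative):

```mermaid
graph LR
  %% bidirectional link: two one-way arrows instead of A <--> B
  A["🔒 Firewall"] --> B[Router]
  B --> A
  %% multi-target: expanded instead of A --> C & D
  A --> C[Switch]
  A --> D[AP]
  %% styling: style command instead of ::: class shorthand
  style A fill:#fdd
```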

Wiki.js GraphQL API patterns

  • Mutations hang forever (Wiki.js 2.5 bug): pages.create and pages.update mutations never return a response. Use fire-and-forget: 3s socket timeout, 8s render wait, then read-back to verify.
  • Tags field required: All page create/update mutations MUST include a tags array (can be empty). Omitting it causes silent failures.
  • Use GraphQL variables for content: Pass page content as a $content: String! variable, not inlined in the query string. Inline content breaks on special characters.
  • Auth tokens expire ~30 min: Re-authenticate before bulk operations.
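A sketch of the variables + fire-and-forget pattern (hostname and page id are hypothetical, and pages.update accepts more optional fields than shown here):

```shell
# Build the mutation payload with content passed as a GraphQL variable
# (never inlined) and the mandatory tags array included.
cat > /tmp/wiki-mutation.json <<'EOF'
{
  "query": "mutation ($content: String!) { pages { update (id: 42, content: $content, tags: []) { responseResult { succeeded } } } }",
  "variables": { "content": "# Updated page\nSpecial chars like \"quotes\" and {braces} are safe here." }
}
EOF
# Fire-and-forget: 3s socket timeout, swallow the hang, read back afterwards.
# curl -sS -m 3 -H "Authorization: Bearer $WIKI_API_KEY" \
#   -H 'Content-Type: application/json' \
#   -d @/tmp/wiki-mutation.json https://wiki.example.net/graphql || true
```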

Diagram authoring rules

  • Always start with graph TB or graph LR at the top level only.
  • Never use direction inside subgraphs.
  • Never use <-->, &, or ::: in connection syntax.
  • Use --> for directional, --- for undirected, -.-> for dashed.
  • Emoji in node labels works fine (e.g., A["🔒 Firewall"]).
  • Test with python3 scripts/upload-wiki-diagrams.py --dry-run before uploading.
  • The upload script supports idempotent create-or-update via get_page_by_path() check.

Mellanox ConnectX-3 Temperature Monitoring

MCX311A-XCAT (PCI ID 0x1003) on pve2 (01:00.0) and pve3. Driver: mlx4. Role: ansible/roles/mellanox_mft/. Playbook: ansible/playbooks/proxmox-mellanox-mft.yml.

Key Facts

  • Debian mstflint ≠ NVIDIA MFT: Debian’s package only provides the firmware flasher. It does NOT include the mst device manager, mget_temp, or mget_temp_ext. Installing mstflint via apt is not enough.
  • Full MFT download URL (v4.29.0 — v4.30/4.31 are 404): https://content.mellanox.com/MFT/mft-4.29.0-131-x86_64-deb.tgz
    • Extracts to /tmp/mft-4.29.0-131-x86_64-deb/
    • Userspace debs: DEBS/mft_4.29.0-131_amd64.deb
    • DKMS deb: SDEBS/kernel-mft-dkms_4.29.0-131_all.deb
  • DKMS build always fails on Proxmox kernels: mst_pci_bc.c can’t find nnt_ioctl.h because the DKMS Makefile uses EXTRA_CFLAGS= -I$(PWD)/$(NNT_DRIVER_LOCATION) where $(PWD) resolves to the kernel source dir in DKMS context, not the module source dir.
    • Workaround: Copy all *.h from /usr/src/kernel-mft-dkms-4.29.0/nnt_driver/ into mst_backward_compatibility/mst_pci/ and mst_backward_compatibility/mst_pciconf/, then make KPVER=$(uname -r) from each directory. Install resulting .ko to /lib/modules/$(uname -r)/extra/ manually. Ansible role handles this automatically.
  • ConnectX-3 not in MFT 4.29’s device list: mst start creates empty /dev/mst/ because ConnectX-3 (0x1003) predates the supported device list (oldest entry is ConnectX-4 0x1013). Fix: mst start --with_unknown. Creates /dev/mst/mt4099_pciconf0 and /dev/mst/mt4099_pci_cr0.
  • Use mget_temp_ext, not mget_temp or mstmget_temp: mget_temp_ext -d mlx4_0 takes the InfiniBand device name from /sys/class/infiniband/, not the /dev/mst/ path. This is the tool the Prometheus collector uses.
  • mlx4 exposes /sys/class/infiniband/mlx4_0 in Ethernet-only mode: No InfiniBand cable needed. The sysfs path exists as long as the mlx4 driver is loaded.
  • Collector already in Debian package: prometheus-node-exporter-collectors ships mellanox_hca_temp script and prometheus-node-exporter-mellanox-hca-temp.{timer,service}. Just enable the timer after MFT + mst-startup are configured.
  • Prometheus metric: node_infiniband_hca_temp_celsius{device="mlx4_0"} collected every 60s via textfile at /var/lib/prometheus/node-exporter/mellanox_hca_temp.prom.
  • Expected operating temps: ConnectX-3 ASIC runs hot — 80–95°C is normal (max ASIC temp ~110°C). Dashboard thresholds: yellow >80, red >95.
  • Grafana panels: id 8 (gauge) + id 9 (timeseries) at y=24 in kubernetes/core/grafana-dashboard-proxmox-hardware.yaml. Query: node_infiniband_hca_temp_celsius{job="proxmox"}.
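Once the timer fires, the textfile collector output looks like this (temperature value illustrative):

```
# /var/lib/prometheus/node-exporter/mellanox_hca_temp.prom
node_infiniband_hca_temp_celsius{device="mlx4_0"} 87
```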

Wiki.js API Key Invalidation

Symptom: pages.create returns Forbidden at auth.js:47 even though pages.list works and the API key appears valid (correct length, correct grp field).

Root Cause

Wiki.js stores RSA certificate pairs in the settings table (key='certs'). When regenerateCertificates() is called (via admin UI or upgrade), new RSA keys are generated and stored. Any existing API key JWTs — which were signed with the old private key — fail signature verification with the new public key.

The auth middleware falls back to guest user when JWT verification fails. Guest user has read:pages only — enough to pass pages.list (which accepts read:pages) but NOT pages.create (which requires write:pages or manage:system). This creates a misleading partial-auth appearance.

Debugging signal: site.config (requires manage:system exclusively) → Forbidden, while pages.list → OK means JWT verification has failed and user is guest.

Fix

Generate a new JWT using the current private key (stored encrypted in DB) + current session secret (stored in settings.sessionSecret.v):

// Run via: kubectl exec -n wiki <wiki-pod> -- sh -c 'NODE_PATH=/wiki/node_modules node /tmp/gen_key.js'
// (kubectl exec does not invoke a shell, so the NODE_PATH assignment needs sh -c)
const jwt = require('jsonwebtoken');
const sessionSecret = '<value of settings.sessionSecret.v from DB>';
const encryptedPrivKey = '<value of settings.certs.private from DB>';
const now = Math.floor(Date.now() / 1000);
const payload = { api: <key_id>, grp: <group_id>, iat: now, exp: now + 31536000 /* 1 year */, aud: 'urn:wiki.js', iss: 'urn:wiki.js' };
const newToken = jwt.sign(payload, { key: encryptedPrivKey, passphrase: sessionSecret }, { algorithm: 'RS256', noTimestamp: true });
console.log(newToken);

Then update the apiKeys table: UPDATE "apiKeys" SET key='<newToken>' WHERE id=<key_id>;

And update WIKI_API_KEY in your local environment.
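A quick way to see which identity a suspect token actually carries is to base64url-decode its payload locally (no signature check — the dummy unsigned token below has payload {"api":2,"grp":1}; substitute "$WIKI_API_KEY" in practice):

```shell
# Extract the middle JWT segment, restore base64 padding, and decode.
TOKEN='eyJhbGciOiJSUzI1NiJ9.eyJhcGkiOjIsImdycCI6MX0.sig'
payload=$(printf '%s' "$TOKEN" | cut -d. -f2)
case $(( ${#payload} % 4 )) in
  2) payload="${payload}==" ;;
  3) payload="${payload}=" ;;
esac
# base64url -> base64, then decode; grp should match the apiKeys row.
printf '%s' "$payload" | tr '_-' '/+' | base64 -d
```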

Key DB Queries

-- Get session secret (passphrase for encrypted private key)
SELECT value FROM settings WHERE key='sessionSecret';

-- Get encrypted certs (contains private key)
SELECT value FROM settings WHERE key='certs';

-- Update API key with newly generated JWT
UPDATE "apiKeys" SET key='<new_jwt>' WHERE id=2;

kube-router IS Enforcing NetworkPolicies (Not Just Flannel)

Symptom: pg_isready returns “no response” from backup pods to postgres, even though postgres is 1/1 Running and kubectl exec inside postgres-0 works fine.

Root Cause

This cluster runs kube-router as a NetworkPolicy controller (visible in iptables -L FORWARD -n as KUBE-ROUTER-FORWARD chain). UFW’s default forward policy is deny (routed). kube-router marks compliant packets 0x20000; only marked packets are ACCEPTed. Packets from pods without a matching NetworkPolicy are dropped.

The postgres-backup pods have label app.kubernetes.io/name: postgres-backup. The namespace default-deny NetworkPolicy (podSelector: {}, policyTypes Ingress+Egress) blocks ALL their traffic. The existing *-allow-egress policies only cover app.kubernetes.io/name: cardboard (or trade-bot) pods.

Debugging Path

  1. kubectl exec into postgres pod → connect works (runs inside the pod, no NetworkPolicy check on self)
  2. kubectl run debug-pg --image=postgres:16-alpine -- pg_isready -h postgres ... → “no response” (blocked by kube-router)
  3. nc from k3s-agent-1 HOST to 10.42.0.228:5432 → “Connection refused” (TCP RST, not timeout — packet reaches destination but iptables on ingress path blocks it)
  4. sudo iptables -L FORWARD -n on any node → reveals KUBE-ROUTER-FORWARD chain

Fix

Add a NetworkPolicy allowing egress from backup pods. Added to kubernetes/apps/cardboard/postgres-backup-cronjob.yaml and kubernetes/apps/trade-bot/postgres-backup-cronjob.yaml:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-backup-allow-egress
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgres-backup
  policyTypes: [Egress]
  egress:
    - ports: [{port: 53, protocol: UDP}, {port: 53, protocol: TCP}]
      to: [{namespaceSelector: {matchLabels: {kubernetes.io/metadata.name: kube-system}}, podSelector: {matchLabels: {k8s-app: kube-dns}}}]
    - ports: [{port: 5432, protocol: TCP}]
      to: [{podSelector: {}}]
    - ports: [{port: 443, protocol: TCP}]
      to: [{ipBlock: {cidr: 0.0.0.0/0, except: [10.0.0.0/8, 192.168.0.0/16]}}]

Note: postgres pods already have app.kubernetes.io/name: cardboard (or trade-bot) label, so the existing *-allow-ingress policies cover their ingress — no separate ingress rule needed for postgres.

Key Fact

The postgres pod labels matter: postgres StatefulSets use app.kubernetes.io/name: cardboard (same as the web app), which gives them ingress coverage from the existing cardboard-allow-ingress policy. If postgres were labeled differently, it would need its own ingress NetworkPolicy.

proxmox-watchdog: nodeSelector Required for Kasa Access — [HISTORICAL — Lima requirement resolved 2025-06-25]

Symptom was: Watchdog pod in ImagePullBackOff, then CrashLoop when Kasa unreachable.

Architecture Lessons (historical)

  1. Kasa HS300 is on 192.168.1.x — cluster nodes are on the 192.168.20.x VLAN, and at the time inter-VLAN routing was believed to be unavailable. Only the lima node (192.168.1.56) could reach Kasa, which is why the watchdog was pinned to Lima.
  2. Resolution (2025-06-25): The proxmox-watchdog was updated to use nodeSelector: kubernetes.io/arch: amd64 along with hostNetwork: true. With hostNetwork, pods on VLAN 20 nodes can reach 192.168.1.x targets directly (UDM Pro routes inter-VLAN). The Lima node requirement was eliminated.
  3. Multi-arch image no longer required — amd64-only build now used (--platform linux/amd64 --provenance=false).

Kasa Degraded Mode

If Kasa is unreachable (offline, DHCP lease changed, physical disconnect), the watchdog retries with backoff [5, 15, 30, 60, 120] seconds, then runs in degraded mode: monitors Proxmox host health but cannot power-cycle. The power_cycle_outlet() and get_outlet_power_metrics() methods guard with if not self.strip: return.

To find Kasa’s new IP if it moves: nmap -sn 192.168.1.0/24 | grep -iE 'tp-link|kasa' (the unquoted tp-link\|kasa form breaks in most shells) or check the UniFi client list. Then: kubectl patch configmap -n proxmox-watchdog watchdog-config --type=merge -p '{"data":{"KASA_IP":"NEW_IP"}}' + pod restart.
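The backoff schedule sums to just under four minutes of retrying before degraded mode begins:

```shell
# Retry delays from this section: 5, 15, 30, 60, 120 seconds.
total=0
for delay in 5 15 30 60 120; do total=$((total + delay)); done
echo "degraded mode after ${total}s of retries"   # 230s, ~4 minutes
```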

Jellyfin Has No Native /metrics Endpoint

Symptom: JellyfinDown Prometheus alert always fires even when Jellyfin is running.

Jellyfin does not expose a /metrics endpoint natively. A ServiceMonitor pointing to it always gets “no data” which triggers absent() alerts.

Fix: Delete the ServiceMonitor. Change the alert to use kube_deployment_status_replicas_available{namespace="media", deployment="jellyfin"} == 0.
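A hedged sketch of the corrected alert rule — the expr is quoted from this section; the for: duration and severity label are assumptions:

```yaml
- alert: JellyfinDown
  expr: kube_deployment_status_replicas_available{namespace="media", deployment="jellyfin"} == 0
  for: 5m
  labels:
    severity: critical
```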

Note: Jellyfin’s IngressRoute uses traefik.io/v1alpha1 (not the deprecated traefik.containo.us API group), and the full hostname is jellyfin.k3s.internal.zolty.systems.

LiteLLM / Open WebUI

  • LiteLLM disables end_user in Prometheus by default: In litellm/utils.py function get_end_user_id_for_cost_tracking, LiteLLM returns None for the end_user label when service_type == "prometheus" unless enable_end_user_cost_tracking_prometheus_only is true. This is intentional to avoid high-cardinality metrics. Fix: set litellm_settings.enable_end_user_cost_tracking_prometheus_only: true in config.yaml.
  • Open WebUI doesn’t pass user in request body for standard models: Only pipeline-type models get user injected in the chat completions request body. Standard OpenAI-compatible backends receive no user identification in the request body at all.
  • Header-based user tracking pipeline: Open WebUI’s ENABLE_FORWARD_USER_INFO_HEADERS=true sends X-OpenWebUI-User-Email (and X-OpenWebUI-User-Name, X-OpenWebUI-User-Id, X-OpenWebUI-User-Role) headers on every LLM request. LiteLLM’s general_settings.user_header_name: "X-OpenWebUI-User-Email" reads this header and maps it to the end_user field used in Prometheus metrics. Both settings must be set for per-user cost tracking to work.
  • LiteLLM master_key required for Prometheus callback: The success_callback: ["prometheus"] silently does nothing without a master_key set. The master key also becomes the API key that Open WebUI must use (openaiApiKey in Helm values).
  • LiteLLM OOM at 512Mi with master_key: Enabling master_key increases LiteLLM’s memory footprint significantly (auth middleware, user tracking state). Minimum viable memory limit is 1Gi with 512Mi request.
  • Open WebUI OOM at 1Gi on first boot: On first start, Open WebUI downloads a ~657MB sentence transformer model (all-MiniLM-L6-v2). This exceeds 1Gi memory limit, causing OOMKill. Fix: increase memory limit to 2Gi, OR disable RAG embedding (RAG_EMBEDDING_ENGINE=openai, RAG_EMBEDDING_MODEL="").
  • LiteLLM /metrics vs /metrics/: The LiteLLM metrics endpoint redirects /metrics to /metrics/ with HTTP 307. Use /metrics/ (trailing slash) or curl -L. The ServiceMonitor can use /metrics (Prometheus follows redirects).
  • Longhorn encrypted StorageClass requires cryptsetup: longhorn-encrypted StorageClass with LUKS2 needs cryptsetup installed on all worker nodes that might schedule the volume. Without it, PVC stays Pending forever with no clear error. Install via sudo apt-get install -y cryptsetup on all agents.
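The LiteLLM-side settings above combine into one config.yaml fragment (the os.environ/ indirection for master_key is an assumption about how the secret is injected; Open WebUI separately needs ENABLE_FORWARD_USER_INFO_HEADERS=true):

```yaml
litellm_settings:
  success_callback: ["prometheus"]
  enable_end_user_cost_tracking_prometheus_only: true   # keep end_user label
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY             # required for the prometheus callback
  user_header_name: "X-OpenWebUI-User-Email"            # maps header -> end_user
```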

Hugo / Blog (zolty-blog)

  • Hugo shortcodes don’t render inside code blocks: {{< amzn >}}, {{< youtube >}}, and all other shortcodes are NOT processed when placed inside triple-backtick fenced code blocks. Hugo treats everything between ``` as literal text. If you need a product link near a code example, place the shortcode BEFORE or AFTER the code block, never inside it.
  • HEIC images must be converted to JPEG for Hugo: Media library stores originals as HEIC. Hugo and web browsers don’t render HEIC. Convert with macOS sips: sips -s format jpeg -s formatOptions 85 input.heic --out output.jpg. Place converted images in the page bundle directory alongside index.md.
  • Page bundle image convention: Each blog post is a page bundle (hugo/content/posts/<slug>/index.md). Images go directly in the same directory (not in a subdirectory). Reference with ![alt text](filename.jpg).
  • Amazon Associates shortcode: Use {{< amzn search="Product Name" >}}link text{{< /amzn >}} for inline affiliate links. Amazon tag zoltyblog07-20 is configured in hugo.toml params. The shortcode generates Amazon search URLs, not direct product links (more resilient to URL changes).
  • YouTube Data API v3 requires human OAuth consent: Service accounts cannot upload to YouTube — only OAuth 2.0 with user consent works for the youtube.upload scope. The OAuth flow requires a one-time browser-based authorization. Until the GCP project passes YouTube’s compliance audit, uploads are restricted to private visibility.
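For example, in a post’s index.md the shortcode sits outside the code block (product name illustrative):

```markdown
Grab the drive here: {{< amzn search="WD Red Plus 8TB" >}}WD Red Plus{{< /amzn >}}

    # A shortcode placed inside this code block would render as the
    # literal text {{< amzn >}} — never as a link.
    sudo apt-get install -y smartmontools
```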

GCP / YouTube Integration

  • GCP project for YouTube: Project youtube-k3s in Google Cloud. YouTube Data API v3 enabled. OAuth 2.0 credentials (Web application type) with redirect URI https://media-library.k3s.internal.zolty.systems/api/youtube/callback.
  • YouTube upload quota: YouTube Data API v3 has a 10,000 unit daily quota. Each video upload costs ~1,600 units. That’s roughly 6 uploads per day maximum.
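The quota arithmetic, for the record:

```shell
# 10,000 daily units / ~1,600 units per upload (integer division).
uploads_per_day=$(( 10000 / 1600 ))
echo "$uploads_per_day"   # prints 6
```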

AI Skills / Context Window Management

  • Generic skills duplicated across repos waste context: Claude skills in .claude/skills/ are loaded into context when relevant. In a multi-repo workspace (5 repos), generic skills (gh-cli, refactor, systematic-debugging, test-driven-development, git-commit) were identically duplicated across all repos — 25 files totaling ~401KB (~100K tokens). The gh-cli skill alone was 40KB per copy (42% of all skill content). Fix: removed all 5 generic skills from all repos (2026-02-23). Only project-specific skills are retained. If generic skills are needed again, keep them in ONE repo only, never duplicate across a multi-repo workspace.
  • Multi-repo workspaces multiply context baseline: Each repo’s copilot-instructions.md is injected into every message. With 5 repos open, that’s ~28KB (~7K tokens) of baseline context before any question is asked. Plus skill descriptions (~3KB) and instruction file references (~2KB). Consider splitting into per-project workspaces for long sessions.

NUT / UPS Shutdown

  • NUT on NAS: upsd.users only has the internal nut master user by default — UGOS Pro ships NUT with a single user [nut] (password nut, upsmon master). The upsmon slave user (used by the k3s nodes) does NOT exist until you add it. Fix: appended [upsmon]\n password = <NUT_MONITOR_PASSWORD>\n upsmon slave to /etc/nut/upsd.users on the NAS and restarted nut-server. Then re-ran the Ansible nut-client.yml playbook with NUT_MONITOR_PASSWORD exported (same value) to deploy the correct upsmon.conf to all nodes.
  • NAS SSH is not reachable from Mac (no VLAN 30 route) but works from k3s nodes — My Mac is on VLAN 1 with no route to VLAN 30. Directly SSHing to 192.168.30.10 returns “Network is unreachable”. Use a k3s node as a jump host: ssh -F ssh_config k3s-server-1 "sshpass -p '<nas-password>' ssh -o StrictHostKeyChecking=no mat@192.168.30.10 'command'". sshpass must be installed on the k3s nodes (it is — verified).
  • nc -z timeout to a port does NOT mean it’s blocked — nc to the NAS port 3493 timed out, but upsc connected fine. upsd uses its own NUT protocol; nc -z (zero I/O) appears to time out waiting for a NUT banner that never comes. Always test NUT connectivity with upsc ups0@host:3493 ups.status, not nc.
  • UDM Pro inter-VLAN has zero custom firewall rules — All 5 networks (Default, Server VLAN20, Storage VLAN30, 2x WAN) have purpose=corporate with no firewall rules. Cross-VLAN routing is fully open. Port blocking from VLAN20→30 was entirely the NAS-side UG_INPUT chain (UGOS Pro iptables). That chain only ACCEPTs established/related + port 5443 (management). But new TCP connections from VLAN20 are silently dropped UNLESS they hit the established-state rule. NFS works because the connection is pre-established. upsd connections need to be NEW, so they were timing out.
  • NUT_MONITOR_PASSWORD must be in ansible/.env before running nut-client.yml — The role defaults to changeme via lookup('env', 'NUT_MONITOR_PASSWORD') | default('changeme', true). If the env var is not exported before the playbook run, all 7 nodes get changeme as the upsmon password which won’t match the NAS config.
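The NAS-side fix from the first bullet, sketched against a temp file (the real target is /etc/nut/upsd.users written via sudo tee; password shown as a placeholder):

```shell
# Append the missing slave user stanza; on the NAS this is followed by a
# nut-server restart and re-running the nut-client.yml playbook.
cat > /tmp/upsd.users.add <<'EOF'
[upsmon]
  password = <NUT_MONITOR_PASSWORD>
  upsmon slave
EOF
grep -c . /tmp/upsd.users.add   # prints 3
```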

Docker Build & Deployment

  • Service Dockerfiles need .dockerignore: The alert-responder and media-controller services in services/ had no .dockerignore, shipping __pycache__/, .git/, test files, and IDE configs into the build context and image layers. Added .dockerignore to both services (2026-02-24) excluding dev files, tests, and CI artefacts.
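A typical minimal .dockerignore for these Python services (entries are representative assumptions, not the exact committed list):

```
__pycache__/
*.pyc
.git/
tests/
.pytest_cache/
.vscode/
.idea/
```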

Media Stack / Prowlarr / TorrentLeech

  • Prowlarr database is not auto-restored after PVC rebind — If the Prowlarr pod restarts and its Longhorn PVC gets a new binding (e.g. after node failure + PVC recreation), the SQLite database (prowlarr.db) starts fresh: no indexers, no app connections. Radarr/Sonarr will still show stale Prowlarr-synced indexers pointing to prowlarr:9696/<id>/ that 404. Signs: prowlarr-config PVC is 0–1 day old while other PVCs are weeks old. Fix: re-run the Prowlarr setup script (/tmp/prowlarr-setup.py) and FlareSolverr config script (/tmp/prowlarr-flaresolverr.py). TorrentLeech credentials are in kubectl get secret torrentleech-credentials -n media.
  • TorrentLeech requires FlareSolverr to authenticate — TorrentLeech uses Cloudflare DDoS protection. Prowlarr’s Cardigann scraper for TorrentLeech cannot pass the Cloudflare challenge, so all searches return the HTML login page (<title>Login :: TorrentLeech.org</title>) instead of results. Fix: deploy FlareSolverr (kubernetes/apps/media/flaresolverr.yaml), add it as an indexerproxy in Prowlarr (POST /api/v1/indexerproxy with host: http://flaresolverr:8191/), create a flaresolverr tag, and apply that tag to both the proxy and the TorrentLeech indexer. FlareSolverr runs a headless Chrome and relays requests through it.
  • Prowlarr app sync uses per-app endpoint, not bulk — The bulk sync endpoint POST /api/v1/applications/sync returns 405. Trigger per-app sync with POST /api/v1/applications/{id}/sync… but this also returns 405 in v1.28.2. Sync happens automatically when an app is added or an indexer changes. Don’t rely on a manual sync endpoint — just wait ~30s or check Radarr/Sonarr indexer list.
  • Prowlarr API key is not in arr-api-keys secret — Only RADARR_API_KEY and SONARR_API_KEY are in that secret. Get Prowlarr’s key from kubectl exec -n media deployment/prowlarr -c prowlarr -- cat /config/config.xml | grep ApiKey.
  • Radarr’s 429 during Prowlarr sync is a self-test artifact — When Prowlarr syncs a new indexer to Radarr, Radarr validates it by querying prowlarr:9696/<id>/api. If Prowlarr returns 429 (rate-limiting during auth) the sync validation fails with “Unable to connect to indexer”. This is transient — wait for TL authentication to complete and re-trigger the sync (or just make a new Jellyseerr request which forces a search).
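A hedged sketch of the FlareSolverr proxy registration (the fields layout follows the *arr v1 API convention — confirm exact names against Prowlarr’s indexerproxy schema before relying on it; the tag id 1 is illustrative):

```shell
# Build the indexerproxy payload; the commented curl is the POST from
# this entry, with the API key obtained per the bullet above.
cat > /tmp/flaresolverr-proxy.json <<'EOF'
{
  "name": "FlareSolverr",
  "implementation": "FlareSolverr",
  "configContract": "FlareSolverrSettings",
  "fields": [
    { "name": "host", "value": "http://flaresolverr:8191/" }
  ],
  "tags": [1]
}
EOF
# curl -sS -X POST "http://prowlarr:9696/api/v1/indexerproxy" \
#   -H "X-Api-Key: $PROWLARR_API_KEY" -H 'Content-Type: application/json' \
#   -d @/tmp/flaresolverr-proxy.json
```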