Context: This is the live `docs/ai-lessons.md` from the home k3s cluster repository, referenced extensively across posts on this blog — starting with AI Memory System and GitHub Copilot Setup Guide. Every entry exists because its absence caused a production incident. Personal identifiers and internal domains have been replaced with generic placeholders.

Updated: 2026-03-03
Rules discovered through production breakage. Each entry prevents recurrence of a specific failure. Update this file whenever a new non-obvious failure pattern is discovered.
## Kubernetes
- Traefik v3 upgrade (k3s v1.34+) breaks port-9000 Ingress resources: Traefik v3 removes port 9000 from the Kubernetes Service spec (only 80/443 remain). Any Kubernetes `Ingress` resource targeting `traefik:9000` will fail to route and cause non-stop `ERR Cannot create service: service port not found` spam in Traefik logs every few seconds, which can cause intermittent route reloads affecting other services. Fix: delete the Kubernetes `Ingress` and use an `IngressRoute` CRD with `api@internal` via the `TraefikService` kind instead.
- Traefik IngressRoute TLS secret must be in the same namespace: When an `IngressRoute` in `kube-system` specifies `tls.secretName: k3s-wildcard-tls`, Traefik looks for the secret in `kube-system`. If the cert-manager Certificate was created in the `default` namespace, Traefik logs `Error configuring TLS: secret kube-system/k3s-wildcard-tls does not exist` continuously. Fix: either remove `secretName` (use `tls: {}` for entrypoint-level TLS), or create a Certificate in the same namespace as the IngressRoute, or manually copy the secret (note: manual copies become stale on renewal).
- ConfigMaps are read-only mounts: Mounting a ConfigMap directly over a path that an init script needs to modify causes "Resource busy" errors. Mount to `/tmp/config-template/` and copy to the target in a command override.
- Namespace must exist before RBAC: Do not apply a Role/RoleBinding to a namespace before the Namespace resource exists. Create the namespace first.
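A minimal sketch of the read-only ConfigMap workaround above (image, paths, and names are illustrative, not from the repo):

```yaml
# Hypothetical pod spec fragment: mount the ConfigMap at a staging path,
# then copy it somewhere writable before the app starts.
containers:
  - name: app
    image: example/app:latest   # illustrative image
    command: ["/bin/sh", "-c"]
    args:
      - cp /tmp/config-template/* /etc/app/ && exec /usr/local/bin/app
    volumeMounts:
      - name: config-template
        mountPath: /tmp/config-template   # read-only ConfigMap mount
      - name: config-writable
        mountPath: /etc/app               # emptyDir the init script can modify
volumes:
  - name: config-template
    configMap:
      name: app-config
  - name: config-writable
    emptyDir: {}
```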
- Pods don't reload ConfigMaps: After updating a ConfigMap, you must `kubectl rollout restart deployment/<name>`. Pods do not auto-detect changes.
- Longhorn storage-over-provisioning: Setting it to 100% blocks replica scheduling when nodes are >75% utilized. Use 200% (the Longhorn default).
- Longhorn 50GB volumes can't do 3 replicas: Agents have 69GB schedulable. One 50GB replica plus the others exhausts capacity. Use 2 replicas for large volumes.
- Longhorn volume I/O corruption (Grafana SQLite) — PERMANENTLY FIXED 2026-03-02: A Longhorn volume can report `attached/healthy` while the underlying data returns I/O errors, corrupting Grafana's SQLite DB. Every request fails with "unable to open database file: input/output error". Root fix: switch Grafana to `persistence.enabled: false` (emptyDir). All dashboards and datasources are provisioned from ConfigMaps via the sidecar — only sessions/preferences are in SQLite, making it safely ephemeral. No Longhorn PVC = no Longhorn I/O corruption. Also removed `deploymentStrategy: Recreate` (it was only needed for the RWO PVC). Template updated in `ansible/roles/cluster_services/templates/prometheus-values.yaml.j2`. Users must re-login after a pod restart (expected). Do NOT re-add a Longhorn PVC for Grafana.
- Longhorn volume I/O corruption recovery (general pattern): When any workload (not just Grafana) shows I/O errors but Longhorn reports the volume `attached/healthy`, the volume data is corrupted at the replica level. General recovery: (1) `kubectl scale --replicas=0` the Deployment/StatefulSet, (2) delete the PVC (Longhorn auto-deletes the volume), (3) restore from Longhorn S3 backup: list backups in the Longhorn UI → create volume from backup → create PV/PVC from the restored volume, (4) update the workload's PVC name if it changed, (5) scale back up. This was used successfully to restore Jellyfin after rapid Proxmox reboots caused I/O corruption across multiple volumes.
- PostgreSQL PGDATA on Longhorn: Must set `PGDATA=/var/lib/postgresql/data/pgdata` — Longhorn PVCs contain `lost+found`, which breaks the default initdb path.
- code-server emptyDir shadows PVC writes: If an `emptyDir` volume is mounted at a subpath of the `home` PVC (e.g., `/home/coder/.kube`), init container writes to that path go to the PVC, but the main container's emptyDir mount hides them. Fix: mount the emptyDir in the init container too, or copy from a secret mount in the main container's startup script.
- code-server apt-get fails on startup (DNS not ready): Container startup apt-get commands fail silently when DNS isn't ready yet (pod networking starts before CoreDNS is reachable). Fix: retry
`apt-get update` up to 3 times with a 10s sleep between attempts. Never suppress stderr entirely (`> /dev/null 2>&1`) — pipe through `| tail -5` to catch errors.
- code-server init container lacks unzip: The `ghcr.io/coder/code-server:4.109.2` base image doesn't include `unzip`. Terraform is distributed as a ZIP, so `sudo apt-get install -y -qq unzip` is needed before downloading terraform in the init container.
- Longhorn PVCs can't attach to Lima node (RESOLVED): Previously
`CSINode lima-k3s-agent does not contain driver driver.longhorn.io` — this was a Docker-in-Docker era limitation. With the Lima VM (Debian 13 arm64), Longhorn works fully: DaemonSets run, the disk is schedulable, replicas are placed. However, workloads with `ReadWriteOnce` PVCs that must not land on Lima for architectural reasons should still use node affinity `runtime NotIn [lima]`.
- Longhorn on root disk wastes ~60% of NVMe: Without dedicated disks, Longhorn shares the OS root filesystem. On a 512GB NVMe with 50GB server + 100GB agent boot disks, only ~98 GiB (agent) and ~49 GiB (server) are visible to Longhorn — the rest is consumed by the OS, container images, and k3s. Fix: add dedicated Longhorn disks via Terraform `additional_disks` + the Ansible `longhorn_disk` role. This separates I/O and reclaims stranded capacity.
- Longhorn replica-auto-balance: Set to `best-effort` to redistribute replicas across all schedulable nodes. Without this, newly added nodes (like the Lima 1TB disk) get zero replicas until volumes are recreated. Existing volumes keep their original placement until auto-balance migrates them.
- Longhorn instance-manager requires `privileged` PodSecurity: The `longhorn-system` namespace MUST have `pod-security.kubernetes.io/enforce: privileged`. Using `baseline` blocks instance-manager pods from starting (hostPath volumes + privileged container). Symptom: all Longhorn volumes stuck in `attaching` state forever, with the error `violates PodSecurity "baseline:latest": hostPath volumes`. Fix: patch `kubernetes/core/pod-security-standards.yaml` to set `enforced: privileged` for `longhorn-system` and re-apply. This caused a full Prometheus + postgres outage after a pve1 cold boot.
- Longhorn: lima (arm64) replicas get trimmed when the amd64 node recovers — [HISTORICAL — Lima VM removed 2025-06-25]: When k3s-agent-1 (or any node) goes down, Longhorn creates replacement replicas — some may land on lima. When the node recovers and its original replicas re-activate, the volume briefly has 4+ healthy replicas. Longhorn trims the "extras" and preferentially removes the most recently added ones (lima's). Lima ends up with 0 replicas again even though the scheduler initially chose it. This is expected behavior, not a bug. New PVCs created after the recovery will land on lima fairly.
- Longhorn server node disk pressure (shared root disk): Control plane nodes (k3s-server-1/2/3) share the root filesystem with Longhorn. When the root disk hits ~75% usage, Longhorn's `storageAvailable` drops below `storageReserved` (30% of the disk), causing silent disk pressure — the Longhorn node condition won't mark it as `DiskPressure` until scheduling is attempted. Fix: disable `allowScheduling` on that disk (`kubectl patch nodes.longhorn.io k3s-server-X --type=merge -p '{"spec":{"disks":{"..":{"allowScheduling":false,...}}}}'`). Long-term: provision dedicated `/dev/sdb` Longhorn disks on server nodes via Terraform `additional_disks`.
- Longhorn disk overcommit causes DiskPressure + Unschedulable: With `storage-over-provisioning-percentage: 200`, Longhorn can schedule 2× the disk size, which is fine until ACTUAL disk usage catches up. When `storageScheduled > storageMaximum` AND `storageAvailable < storageReserved`, the node shows `Unschedulable` in the Longhorn UI. Fix: enable disk eviction (`evictionRequested: true`, `allowScheduling: false`) to drain all replicas off the disk. Only THEN re-enable scheduling. Simply deleting a few replicas is insufficient; use full eviction. After eviction, k3s-server-3 dropped from 61 GiB scheduled to 0.
- Never allow Longhorn to schedule replicas on server nodes (PERMANENT POLICY — 2026-03-03): Server nodes (k3s-server-1/2/3) have 80GB root disks shared with the OS, k3s binaries, container images, and the etcd WAL. Longhorn overcommit (200%) makes this appear spacious, but actual write pressure fills the disk and returns I/O errors to the attached volume — causing WAL corruption in Prometheus and any other write-heavy workload. Agent nodes (k3s-agent-1/2/3/4) have 300GB disks with ample headroom. Resolution: permanently disable `allowScheduling` on all server node Longhorn disks. The `longhorn_no_schedule_nodes` variable in `group_vars/all.yml` lists all three server nodes; the `cluster_services` Ansible role re-applies this on every run. To evict existing replicas: `kubectl patch nodes.longhorn.io k3s-server-X -n longhorn-system --type=merge -p '{"spec":{"disks":{"<disk-id>":{...,"allowScheduling":false,"evictionRequested":true}}}}'`. Get the disk ID first: `kubectl get nodes.longhorn.io k3s-server-X -n longhorn-system -o json | python3 -c "import sys,json; d=json.load(sys.stdin); [print(k) for k in d['spec']['disks']]"`. Eviction is complete when the replica count on the node drops to 0 (`kubectl get replicas.longhorn.io -n longhorn-system -o json | python3 -c "import sys,json; r=json.load(sys.stdin); print(sum(1 for i in r['items'] if i['spec'].get('nodeID')=='k3s-server-X'))"`).
- Tdarr (s6-overlay) containers crash with
`runAsUser`: Tdarr uses the s6-overlay init system, which must start as root to set supplementary groups and drop privileges via `PUID`/`PGID` env vars. Setting `securityContext.runAsUser` causes `s6-applyuidgid: fatal: unable to set supplementary group list: Operation not permitted` in an infinite restart loop. Fix: remove `runAsUser`/`runAsGroup`/`fsGroup` from the pod/container securityContext. This applies to most LinuxServer.io-style images that use s6-overlay (Radarr, Sonarr, Bazarr, etc.) — they handle user switching internally.
- Tdarr workers need `privileged: true` for VA-API: Same as Jellyfin — non-privileged containers can't perform the DRM ioctls needed for Intel QuickSync/VA-API hardware encoding. Workers detect `h264_qsv`, `hevc_qsv`, `hevc_vaapi` as working only in privileged mode.
- GPU taints cause more harm than good — removed: `gpu=true:NoSchedule` taints on agent nodes blocked ALL non-GPU pods (CI/CD runners, general workloads) from scheduling. With only 4 non-tainted nodes (3 control-plane + 4 workers), the cluster was resource-exhausted. GPU workloads (Jellyfin, Tdarr) already use `nodeSelector: gpu: intel-uhd-630` to target GPU nodes, making taints redundant. Fix: removed taints entirely. GPU targeting happens via labels + nodeSelector, not taints. If GPU isolation is ever needed again, use `PreferNoSchedule` (soft) instead of `NoSchedule` (hard).
- Security scanner over-labeling namespaces with restricted PodSecurity: A deployed security scanner service labeled ALL application namespaces with `pod-security.kubernetes.io/enforce: restricted` (or `baseline`), blocking pod creation for services that legitimately need elevated permissions. Affected: `cardboard`, `trade-bot` (Postgres + Flask → need `baseline`), `home-assistant` (hostNetwork=true → needs `privileged`), `dev-workspace` (docker-in-docker `privileged` container → needs `privileged`), `proxmox-watchdog` (hostNetwork=true → needs `privileged`). Pods don't get evicted, but once they crash or are recreated they hit `FailedCreate` forever — the symptom is 0 pods in a namespace despite a replica count > 0. Fix: `kubectl label namespace <ns> pod-security.kubernetes.io/enforce=<level> --overwrite`, then `kubectl rollout restart`. Always bake the correct PodSecurity labels directly into the namespace YAML in the repo so re-applies don't clobber them. Mapping: postgres+Flask → `baseline`; hostNetwork/hostPort/docker-in-docker → `privileged`; pure stateless apps with secure images → `restricted`.
- Longhorn replica
`failedAt` + stale diskID after VM rebuild: When an agent VM is destroyed and rebuilt, its Longhorn disk ID changes. Replicas referencing the old disk ID fail with "cannot find disk name for replica" and the volume goes "faulted/detached". Clearing `failedAt` on the replica doesn't help if the disk ID no longer exists on any node. Fix: delete the volume and restore from S3 backup, or delete and recreate fresh if the data is disposable.
- Longhorn Volume CRD requires `frontend: blockdev`: Creating a Longhorn Volume via kubectl with `fromBackup` for restore MUST include `spec.frontend: blockdev` and `spec.accessMode: rwo`. Without `frontend`, the admission webhook rejects it with "invalid volume frontend specified". Copy the full spec from an existing healthy volume when in doubt.
- Workloads targeting GPU nodes use nodeSelector, not tolerations: GPU workloads (Jellyfin, Tdarr, Home Assistant) use `nodeSelector: gpu: intel-uhd-630` or `kubernetes.io/hostname` to land on agent nodes. Taints were removed — tolerations in manifests are now harmless but unnecessary.
- NFS mounts fail on Lima node (192.168.1.56) — [HISTORICAL — Lima VM removed 2025-06-25]: The UGreen NAS (192.168.30.10) only allows NFS from the k8s VLAN (192.168.20.0/24). Pods that mount NFS PVs (radarr, sonarr, bazarr, jellyfin, media-controller, etc.) MUST use `nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution` with `runtime NotIn [lima]`. Without this, Longhorn volume affinity can schedule pods on the Lima node, where the NFS mount fails with "access denied by server".
- Always use `Recreate` strategy for single-replica apps: `RollingUpdate` with `maxUnavailable: 0, maxSurge: 1` creates a new pod, waits for it to become ready, THEN kills the old one. With Longhorn RWO PVCs, the new pod can't attach the volume until the old pod releases it — deadlock. `Recreate` kills the old pod first, then starts the new one. All single-replica stateful app Deployments must set `strategy: type: Recreate`.
- Bound PVC
`volumeName` is immutable — must match manifest: When a PVC is restored from a backup PV (e.g., via the Longhorn restore workflow), the cluster PVC gets `spec.volumeName` set to the PV name. If the manifest omits `volumeName`, `kubectl apply` fails with "spec is immutable after creation: volumeName". Fix: add `volumeName: <pv-name>` to the PVC manifest to match cluster state. Run `kubectl get pvc <name> -n <ns> -o yaml | grep volumeName` to find the bound PV name.
- Workloads rescheduled to lima after node drain — [HISTORICAL — Lima VM removed 2025-06-25]: When amd64 worker nodes go down (e.g., pve1 shutdown for a hardware upgrade), pods without `nodeSelector` or `nodeAffinity` may reschedule onto the lima arm64 node. This causes: (1) `exec format error` for amd64-only images (all digital-signage microservices, tdarr), (2) NFS `access denied` for media stack workloads, (3) unexpected StatefulSet issues (trade-bot postgres). Fix: add `nodeSelector: {kubernetes.io/arch: amd64}` to ALL amd64-only workloads. For critical workloads with NFS dependencies, also add `runtime NotIn [lima]` affinity.
- K8s env var $(VAR) interpolation requires ordering: In a Deployment's env list, `$(DB_USER)` is only interpolated if `DB_USER` is defined EARLIER in the list. Define referenced vars BEFORE the var that uses them. Otherwise, the literal string `$(DB_USER)` is passed to the container.
- K8s `optional: true` on secretKeyRef for phased deployments: When a Secret won't exist until after an OAuth flow or manual step, use `optional: true` on all `secretKeyRef` entries referencing it. Without this, the pod fails to start with `CreateContainerConfigError` because K8s can't find the Secret. This is useful for YouTube/OAuth credentials that require a one-time human authorization step before they can be populated.
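The `optional: true` pattern from the last entry might look like this (secret and key names are illustrative):

```yaml
# Hypothetical env entry: the pod starts even if the Secret doesn't exist yet.
env:
  - name: YOUTUBE_OAUTH_TOKEN
    valueFrom:
      secretKeyRef:
        name: youtube-oauth      # created later by a one-time OAuth flow
        key: token
        optional: true           # without this: CreateContainerConfigError
```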
## ARC (GitHub Actions Runner Controller)
- ARC `labels` field replaces defaults: If you set labels, `self-hosted` disappears. Jobs with `runs-on: [self-hosted, ...]` won't match. ALWAYS include `self-hosted` explicitly.
- ARC API group is `.dev` not `.net`: Use `actions.summerwind.dev/v1alpha1`. A common copy/paste error is `.net`.
- ARC secret key name: Must be `github_token` (lowercase, underscore). NOT `github_pat` or `GITHUB_TOKEN`.
- ARC metrics behind kube-rbac-proxy: ARC v0.27.x uses a `kube-rbac-proxy` sidecar on port 8443 (HTTPS) for metrics instead of direct port 8080. The Prometheus scrape config needs `scheme: https`, `tls_config: insecure_skip_verify: true`, `bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token`, and a ClusterRoleBinding granting the Prometheus SA `get` on the non-resource URL `/metrics`.
- Do not use standard HPA with ARC: Use the `HorizontalRunnerAutoscaler` CRD with the `PercentageRunnersBusy` metric.
- Do not mix ARC v1 and v2 fields: `githubConfigUrl` is v2 only.
- ARC runner image lacks pip/node/npm: Bootstrap with `ensurepip` or `apt-get install python3-pip`. Use `--break-system-packages` (PEP 668).
- Org-level GitHub secrets don't work with ARC: Use repo-level secrets for CI/CD.
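The kube-rbac-proxy scrape settings described a few entries up could be sketched as a Prometheus scrape job (the job name and target address are illustrative assumptions, not from the repo):

```yaml
# Hypothetical scrape job for ARC controller metrics behind kube-rbac-proxy.
- job_name: arc-controller
  scheme: https                   # sidecar serves HTTPS on 8443
  tls_config:
    insecure_skip_verify: true    # self-signed cert on the sidecar
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  static_configs:
    - targets: ["arc-metrics.arc-runner-system.svc:8443"]  # illustrative
```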
- ARC runners on arm64 nodes break CI/CD — [HISTORICAL — Lima VM removed 2025-06-25, no more arm64 nodes]: A `runtime: docker-desktop` NotIn affinity didn't match the Lima VM on the Mac Mini. Runners scheduled on arm64 failed with `exec format error` when installing the x86_64 AWS CLI or building amd64 Docker images. ALWAYS use `kubernetes.io/arch: amd64` nodeAffinity for runners. The same applies to any workload whose Docker image is built for amd64 only (including all app deployments and CronJobs).
- CI/CD RBAC escalation: When CI creates Roles in target namespaces, the runner's own Role needs `escalate` + `bind` verbs on `rbac.authorization.k8s.io` `roles`,`rolebindings`, AND every verb it grants must already be held by the runner.
- Runner RBAC bootstrapping — must apply manually: `kubernetes/github-runners/zolty-mat-runners.yaml` is NOT applied by the "Deploy K8s Applications" workflow (a bootstrapping problem — the runner can't grant itself new permissions). When adding new namespaces or resources to runner Roles, you must `kubectl apply -f kubernetes/github-runners/zolty-mat-runners.yaml` from a local kubeconfig with cluster-admin. Always do this before triggering a deploy that needs the new permissions.
- Runner missing `serviceaccounts` verb causes apply failure: When the alert-responder (or any) manifest includes a `ServiceAccount` resource, the runner must have `get/list/create/patch/update` on `serviceaccounts` in that namespace. The error is: `serviceaccounts "<name>" is forbidden: User "system:serviceaccount:arc-runner-system:github-runner" cannot get resource "serviceaccounts"`. Add `serviceaccounts` to the namespace's deploy Role resources list.
- Rollout timeouts of 120s are too short with Longhorn PVCs: After the Recreate strategy kills the old pod, a Longhorn `ReadWriteOnce` PVC must detach from the old node before the new pod can start (30-60s). Combined with image pull time, the total easily exceeds 120s. All CI/CD workflows use `--timeout=300s` for `kubectl rollout status`. Never go lower than 300s for any deployment using Longhorn PVCs.
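A minimal `HorizontalRunnerAutoscaler` using `PercentageRunnersBusy`, as recommended above (resource names and thresholds are illustrative):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1   # note: .dev, not .net
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler
spec:
  scaleTargetRef:
    name: my-runner-deployment    # illustrative RunnerDeployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"    # scale up when >75% of runners are busy
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"
```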
## MetalLB
- MetalLB pool annotation: Do NOT use `metallb.universe.tf/address-pool` unless targeting a specific pool. Omit it to auto-assign from the default `homelab-pool`. Using `default` as the pool name fails with "unknown pool".
- MetalLB v0.14.x labels: The native manifest uses `app=metallb,component=controller` (NOT `app.kubernetes.io/component=controller`).
- `spec.loadBalancerIP` + MetalLB annotation = conflict: Using both `spec.loadBalancerIP` (a deprecated K8s field) AND the `metallb.universe.tf/loadBalancerIPs` annotation on the same Service causes MetalLB to reject or ignore the request. Use ONLY the annotation. When Traefik's HelmChartConfig sets `loadBalancerIP` via spec, patch the live Service to remove it and add the annotation instead.
- HelmChartConfig on-disk manifest reverts kubectl patches: k3s auto-applies manifests from `/var/lib/rancher/k3s/server/manifests/custom/` on every restart. If you fix a HelmChartConfig via `kubectl apply` but don't update the on-disk YAML, the next k3s restart silently reverts your fix. Always update the on-disk manifest (on server-1) AND the Ansible template simultaneously.
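The annotation-only pattern from the conflict entry above might look like this (the IP and Service name are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.20.50  # annotation only
spec:
  type: LoadBalancer
  # no spec.loadBalancerIP here -- mixing it with the annotation conflicts
  ports:
    - port: 443
      targetPort: 8443
```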
## k3s Upgrades
- k3s servicelb hijacks port 22 — breaks SSH to all nodes: k3s servicelb creates DaemonSets (`svclb-*`) that bind host ports via iptables for every LoadBalancer Service. If ANY Service uses `port: 22` (e.g., code-server SSH), servicelb binds port 22 on EVERY node, intercepting SSH connections and routing them to the Service's pods instead of the real sshd. Symptoms: SSH connects but fails authentication/key exchange (you're talking to code-server, not sshd), and sshd logs show zero incoming connections. Diagnosis: `kubectl get ds -n kube-system | grep svclb` to find offending DaemonSets, `iptables -t nat -L -n | grep 22` to see the DNAT rules. Fix: disable servicelb entirely when using MetalLB. MetalLB handles LB allocation without binding host ports.
- k3s `--disable` flags lost during manual upgrade: When upgrading k3s via the install script (`curl -sfL https://get.k3s.io | sh -s - server`), flags like `--disable=servicelb` from the original install are NOT preserved. The systemd ExecStart is overwritten. Fix: use `/etc/rancher/k3s/config.yaml` with `disable: [servicelb]` — this persists across upgrades. Create this file on ALL server nodes before upgrading.
- `/etc/rancher/k3s/config.yaml` is the upgrade-safe config method: Command-line args in the systemd ExecStart or the Ansible `k3s_extra_server_args` can be lost during k3s upgrades. The config.yaml file is always read by k3s on startup regardless of how the binary was upgraded. Prefer config.yaml for critical settings like `disable`, `tls-san`, `cluster-cidr`, etc.
- k3s upgrade wipes `authorized_keys` on Debian: Observed during the v1.29→v1.34 upgrade — `/root/.ssh/authorized_keys` was empty on all nodes after the upgrade. Root cause unclear (possibly a cloud-init re-run). Always verify SSH access immediately after upgrading k3s. Use `qm guest exec` via Proxmox as out-of-band recovery if SSH breaks.
- After disabling servicelb, host keys change: When servicelb was intercepting port 22, `ssh-keyscan` captured the code-server pod's host key, not the real sshd's. After disabling servicelb, the real sshd responds with different keys. Re-run `ssh-keygen -R` + `ssh-keyscan` for all node IPs after the fix.
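An upgrade-safe `/etc/rancher/k3s/config.yaml` per the entries above might look like this (the `tls-san` value is an illustrative assumption):

```yaml
# /etc/rancher/k3s/config.yaml -- read on every k3s start, so it survives
# install-script upgrades that rewrite the systemd ExecStart.
disable:
  - servicelb        # MetalLB owns LoadBalancer Services
tls-san:
  - 192.168.20.20    # illustrative extra SAN for the API server cert
```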
## Networking & DNS
- LACP order-of-operations: the NAS/client bond mode MUST change before the switch LAG: When converting from active-backup to 802.3ad LACP, change the client side (NAS bond mode) FIRST, then enable LAG on the switch. If the switch is set to "Aggregating" (LACP) while the client is still in active-backup, the switch enters LAG mode and starts sending LACP PDUs. The active-backup client ignores them, traffic drops to zero, and the client networking stack may crash/hang (observed with a UGOS Pro DXP4800 — it took a full power cycle to recover). Correct order: (1) change the NAS bond to `Dynamic link aggregation` (bond4, 802.3ad) in the UGOS Pro UI, (2) verify the NAS is back up, (3) then set the switch ports to Aggregating. Also: in the `swctrl port show detail` LAG state, `(D)` = detached/failed, `(U)` = UP/active and working. Verify with `cat /proc/switch/mac_table | grep vlan=30` — both NAS MACs should show `type=lag`.
- 802.3ad LACP distributes by flow hash, not bytes: A single TCP flow always uses one LAG member (based on a src/dst IP/MAC hash). Two simultaneous clients from different IPs will each use a different member, utilizing both links. Don't expect counter equality on both ports from a single client — asymmetry is correct and expected.
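As a toy illustration of the flow-hash behavior above (real switches use vendor-specific hashes over MAC/IP/port fields; this is not the actual algorithm):

```python
# Toy model: XOR the last byte of src/dst MAC, mod the member count.
def lag_member(src_mac: str, dst_mac: str, members: int = 2) -> int:
    """Deterministically pick a LAG member for a (src, dst) pair."""
    src = int(src_mac.split(":")[-1], 16)
    dst = int(dst_mac.split(":")[-1], 16)
    return (src ^ dst) % members

# The same pair always hashes to the same member, so one TCP flow
# can never exceed a single link's bandwidth...
m1 = lag_member("aa:bb:cc:dd:ee:01", "00:11:22:33:44:10")
# ...while a second client with a different MAC may land on the other link.
m2 = lag_member("aa:bb:cc:dd:ee:02", "00:11:22:33:44:10")
```

This is why per-port counters stay asymmetric for a single client: the hash is deterministic per flow, and balancing only emerges across many distinct flows.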
- UniFi LAG API field is `op_mode: "aggregate"` + `aggregate_members`: The working API format for LAG via PUT to `/proxy/network/api/s/default/rest/device/<id>` uses `port_overrides[].op_mode: "aggregate"` with `aggregate_members: [7,8]` and `lag_idx: 1`. The field `aggregate_num_ports` does NOT work. The UniFi UI sets this correctly — use the UI for LAG changes and verify via the API.
- Debian 13 cloud images only enable the `main` apt component: `intel-media-va-driver-non-free` and other non-free packages aren't available until the `non-free non-free-firmware` components are added to `/etc/apt/sources.list.d/debian.sources`. The Ansible gpu_worker role handles this automatically with `ansible.builtin.replace` on the `Components:` line.
- kubeconfig default server is k3s-server-1's static IP, not a floating VIP: The default kubeconfig points to `https://192.168.20.20:6443` (k3s-server-1 on pve1) — this is NOT a kube-vip floating address, it's the static node IP. When pve1 goes down, all `kubectl` commands fail with `dial tcp 192.168.20.20:6443: connect: network is unreachable`. Workaround: target another control plane with `kubectl --server=https://192.168.20.21:6443 --insecure-skip-tls-verify <cmd>`. etcd retains quorum as long as 2/3 control plane nodes are up (pve2 and pve3). Long-term fix: configure kube-vip for a floating VIP that follows the etcd leader.
- After a Proxmox host outage, pods with Longhorn volumes show I/O errors even after nodes rejoin: When pve1 (or any host) goes down and comes back, previously-mounted Longhorn volumes may return `Input/output error` to running pods (e.g., `failed to load config.xml: Input/output error`). The Longhorn volume itself recovers to `robustness: healthy` within minutes of the node rejoining. The pod's existing mount point is stale — it was established while the replica was degraded. Fix: `kubectl rollout restart deployment/<name>`. Do NOT assume I/O errors mean volume data corruption — always check `kubectl get volume -n longhorn-system <vol> -o jsonpath='{.status.robustness}'` first. If `healthy`, a rollout restart is sufficient. Observed with Sonarr and Jellyfin config volumes after the 2026-03-01 pve1 hard shutdown.
- Plex transcode EmptyDir eviction — evicted pod stays as Error indefinitely: Plex uses an `emptyDir` with `sizeLimit: 5Gi` for transcoding scratch space. When active transcoding fills it, the kubelet evicts the pod (exit code 137, reason: `Evicted`, message: `Usage of EmptyDir volume "transcode" exceeds the limit`). The Deployment's ReplicaSet immediately creates a replacement pod on the same node. The evicted pod remains in `Error` state indefinitely — it does NOT get cleaned up automatically. The cluster will show two Plex pods: one `Running` (replacement) and one `Error` (evicted). Fix: `kubectl delete pod -n media <evicted-pod-name>`. Consider increasing `emptyDir.sizeLimit` if transcoding large files, or ensure Tdarr pre-transcodes to lower bitrates before Plex serves them.
- Seedbox sync files accumulate in /media/staging when Radarr/Sonarr are down: The `seedbox-sync` CronJob (runs every 4h) uses rsync to pull from `user@<seedbox-ip>:2222:/home/user/Downloads/complete/` → `/media/staging/` and then triggers `DownloadedMoviesScan`/`DownloadedEpisodesScan` API calls. If Radarr or Sonarr are crashed/restarting when the cron fires, the rsync succeeds (exit 0) but the API call fails (a non-fatal WARN). Files accumulate in `/media/staging` without being imported, causing the seedbox disk to fill. After recovering the arr services, manually trigger both scans: `wget --header='X-Api-Key: <key>' --post-data='{"name":"DownloadedMoviesScan","path":"/media/staging"}' --header='Content-Type: application/json' http://10.43.82.213:7878/api/v3/command` (Radarr) and the same pattern for Sonarr at `10.43.85.233:8989`. API keys are in the `arr-api-keys` Secret in the `media` namespace.
- pve4 NVMe PCIe AER RxErr — symptom of aging consumer NVMe: The WD PC SN720 512GB NVMe (PCI ID `15b7:5002`) on pve4 logs PCIe Physical Layer correctable `RxErr` errors in dmesg (`nvme 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer`). SMART reports `PASSED` with zero NVMe error log entries and zero media/data integrity errors — the PCIe link is correcting these errors transparently. However, recurring correctable errors indicate signal integrity degradation (a loose M.2 connector, thermally fatigued PCIe traces, or an end-of-life controller). At 42,347 power-on hours (~4.8 years of continuous runtime) on a consumer NVMe, this drive is past its comfortable service life. Action: (1) reseat the M.2 drive — a loose connector is the most common cause. (2) Plan replacement: the SN720 is the pve4 OS boot disk; failure takes the entire host offline. A 512GB+ NVMe is ~$50-70. Longhorn replicates all k3s data to 3 other nodes so workload data is safe, but pve4 OS loss requires a Proxmox reinstall. Monitor with `dmesg | grep -c RxErr` — the count was 64 as of 2026-03-01.
- VMs on the untagged 192.168.1.x subnet have no internet: The UDM Pro doesn't NAT traffic from VMs placed on the untagged/native VLAN (192.168.1.x). VMs can reach other VLANs via inter-VLAN routing but cannot reach the internet. All k3s VMs must be on VLAN 20 (192.168.20.x) for internet access.
`vlan_id = 0` (or omitted) in Terraform = no VLAN tag = 192.168.1.x = no internet.
- ansible/.env sets AWS env vars that override AWS_PROFILE: `source .env` in the ansible directory sets `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (cert-manager-dns01 credentials). These environment variables take precedence over `AWS_PROFILE`, silently overriding the intended IAM identity. When switching between Ansible and Terraform, always `unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY` before using `AWS_PROFILE=terraform`.
- Proxmox VLAN-aware bridge needs `bridge-vids`: Setting `bridge-vlan-aware yes` on vmbr0 and creating sub-interfaces (vmbr0.20, vmbr0.30) is NOT sufficient. The physical bridge port (nic0) defaults to only VLAN 1 (PVID). Tagged VLAN frames from the switch are dropped at the bridge port before reaching the sub-interfaces. Fix: add `bridge-vids 2-4094` to the vmbr0 stanza in `/etc/network/interfaces`. Runtime fix: `bridge vlan add dev nic0 vid 20 && bridge vlan add dev nic0 vid 30`. Verify with `bridge vlan show`.
- UniFi Teleport VPN reserves 192.168.2.0/24: Teleport silently claims this subnet. Creating a VLAN network on the same range fails with `api.err.SettingSubnetOverlapped (key: teleport)`. Check the Teleport subnet via the API: `GET /proxy/network/api/s/default/get/setting/teleport`. Either disable Teleport, change its subnet, or use a different /24 for your VLAN.
- CoreDNS single-replica on Mac Mini: k3s defaults to 1 CoreDNS pod. If it lands on the Mac Mini (Docker-in-Docker), VM-based pods can't resolve DNS → full cluster outage (Longhorn CSI crash → all PVC pods stuck). Ensure 2+ replicas via the `coredns_replicas` Ansible variable.
- Alpine `localhost` resolves to `::1` (IPv6): A Python HTTPServer bound to `0.0.0.0` won't respond to `wget localhost`. Use `127.0.0.1` explicitly. Kubelet probes use the pod IP (IPv4), so they work.
- Traefik port 80 serves HTTPS: The entrypoint forces TLS. Must use an `https://` URL with an SSL context when probing Traefik internally.
- VS Code Remote-SSH requires TCP port forwarding: sshd_config MUST have `AllowTcpForwarding yes` or the connection fails with "administratively prohibited".
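The triage rule from the Longhorn I/O entry above (check `robustness` before assuming corruption) can be sketched as follows; the returned action strings are illustrative shorthand for the documented recovery steps:

```python
def longhorn_io_error_action(robustness: str) -> str:
    """Map a volume's status.robustness to the documented recovery step."""
    if robustness == "healthy":
        # Stale mount after a host outage: restarting the pod is enough.
        return "kubectl rollout restart"
    if robustness == "faulted":
        # Replica-level corruption: scale to 0, delete the PVC, restore from S3.
        return "restore from Longhorn S3 backup"
    # degraded / unknown: let replica rebuild finish before acting.
    return "wait and re-check"
```

The key point encoded here: an I/O error in the pod does not by itself mean data loss; only a non-healthy volume justifies the destructive restore path.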
## AWS & Terraform
- ECR tokens expire after 12 hours: Pull secrets must be refreshed on every deploy.
- `dynamodb_table` is deprecated: Use `use_lockfile = true` in the S3 backend (Terraform 1.13+).
- S3 lock files are `.tflock` objects: If stuck, delete with `aws s3 rm s3://bucket/key.tflock`.
- Backend migration stale locks: These can chain. Use `-lock=false` to break the cycle.
- Grafana CloudWatch datasource needs the `.env` file: The `GRAFANA_CW_SECRET_KEY` env var must be set in `ansible/.env` (sourced before running playbooks). Without it, the Jinja2 template falls back to `CHANGE_ME`, causing `SignatureDoesNotMatch` errors on all CloudWatch/billing dashboards. Retrieve the secret from Terraform: `terraform output -raw grafana_cloudwatch_secret_access_key`.
- Terraform cloud-init IP applied on VM reboot — verify tfvars match reality: When `terraform apply` updates VM parameters (e.g., memory), the cloud-init config is also regenerated from tfvars, even if the apply errors with "ide2: hotplug problem". The rebooted VM picks up the NEW cloud-init config. If `ip_address` in `terraform.tfvars` is wrong (e.g., `192.168.1.20` instead of `192.168.20.20`), the VM boots with the wrong IP, breaking etcd peering and cluster access. Fix: always verify tfvars IPs match actual deployed IPs BEFORE running `terraform apply`. If already applied with a wrong IP: fix via `qm guest exec <vmid> -- sed -i 's/old/new/' /etc/netplan/50-cloud-init.yaml && qm guest exec <vmid> -- netplan apply`, then fix tfvars and re-apply.
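The S3 backend change from the `dynamodb_table` entry might look like this (bucket, key, and region are illustrative):

```hcl
terraform {
  backend "s3" {
    bucket       = "my-tfstate-bucket"    # illustrative
    key          = "k3s/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true   # replaces the deprecated dynamodb_table lock
  }
}
```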
## Helm
- **Stuck Helm release (pending-upgrade/pending-rollback)**: If a previous `helm upgrade` or rollback failed mid-operation, subsequent upgrades fail with "another operation is in progress". Fix: `helm history <release> -n <ns>` to find stuck revisions, then delete their secrets (`kubectl delete secret sh.helm.release.v1.<release>.v<N>`). Only delete `pending-*` revisions, not the last `deployed` one.
- **StatefulSet strategy is NOT `Recreate`**: Helm charts using StatefulSets (e.g., Open WebUI) reject `strategy.type: Recreate` — StatefulSets only support `RollingUpdate` or `OnDelete`. Use `OnDelete` for single-replica StatefulSets with Longhorn RWO PVCs. This is different from Deployments, which do support `Recreate`.
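For charts that expose the StatefulSet update strategy in values, the fix above is a one-line override. A sketch (the `updateStrategy` key name is an assumption; check your chart's values):

```yaml
# Single-replica StatefulSet on a Longhorn RWO PVC
updateStrategy:
  type: OnDelete   # StatefulSets only support RollingUpdate or OnDelete
```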
## AWS Bedrock
- **Bedrock model access is account-level, not IAM-level**: Even with correct IAM permissions (`bedrock:InvokeModel`, `bedrock:Converse`), models fail with "Model access is denied due to IAM user or service role is not authorized to perform the required AWS Marketplace actions." Fix: go to AWS Console → Bedrock → Model access → Request access for the specific models. The IAM policy also needs `aws-marketplace:ViewSubscriptions` and `aws-marketplace:Subscribe` permissions.
- **LiteLLM image tags use a `-stable` suffix**: The correct LiteLLM container image tag format is `main-v1.81.12-stable`, not `main-v1.63.14`. Tags without `-stable` may not exist on GHCR. Always check `https://github.com/BerriAI/litellm/pkgs/container/litellm` for current tags.
- **LiteLLM cross-region inference profiles**: Bedrock model IDs prefixed with `us.` (e.g., `us.anthropic.claude-sonnet-4-20250514-v1:0`) use cross-region inference profiles for automatic failover. These require `inference-profile/*` in IAM resource ARNs, not just `foundation-model/*`.
- **Bedrock Claude 3.5 Haiku model access denied**: Claude 3.5 Haiku (`anthropic.claude-3-5-haiku-20241022-v1:0`) may require an explicit marketplace subscription even when other Claude models work. Claude Haiku 4.5 (`anthropic.claude-haiku-4-5-20251001-v1:0`) works without additional marketplace actions. Use Haiku 4.5 as the fast/cheap model.
- **Terraform `k3s-homelab-ci` lacks IAM create perms**: The `terraform` AWS profile uses the `k3s-homelab-ci` user, which cannot `iam:CreateUser`. Use the `default` profile (`homelab-admin`) for IAM provisioning, or add IAM management permissions to the CI user.
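Putting the IAM pieces above together, a policy sketch (ARNs are deliberately broad here; scope them to specific models in practice):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:Converse"],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/*",
        "arn:aws:bedrock:*:*:inference-profile/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["aws-marketplace:ViewSubscriptions", "aws-marketplace:Subscribe"],
      "Resource": "*"
    }
  ]
}
```

Note that `inference-profile/*` is required for `us.`-prefixed model IDs, and the marketplace actions are still gated on account-level model access being granted in the console.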
## Observability & Alerting
- **Prometheus disk full kills ALL metrics ingestion**: When the Prometheus WAL fills the PVC, no new samples are written. All dashboards go stale. Check `df -h /prometheus` when dashboards show gaps. Fix: patch the PVC to a larger size, delete the prometheus pod (the StatefulSet recreates it; Longhorn auto-expands the filesystem). `prometheus_storage_size: "30Gi"` and `prometheus_retention: "14d"` are safe defaults for a homelab.
- **Prometheus high-cardinality apiserver buckets**: `apiserver_request_duration_seconds_bucket`, `etcd_request_duration_seconds_bucket`, `apiserver_request_sli_duration_seconds_bucket`, `apiserver_request_body_size_bytes_bucket`, `apiserver_response_sizes_bucket`, and `apiserver_watch_events_sizes_bucket` generate ~315K series (55% of the TSDB) but are rarely used in dashboards. Drop them via `kubeApiServer.serviceMonitor.metricRelabelings` in the Helm values. The `_sum` and `_count` aggregates are kept, which is sufficient for latency percentile monitoring via recording rules.
- **Longhorn default-replica-count vs existing volumes**: Changing the Longhorn `default-replica-count` setting only affects NEW volumes. Existing volumes retain their original replica count. To reduce replicas on existing volumes, patch each one: `kubectl -n longhorn-system patch volumes.longhorn.io <vol> --type merge -p '{"spec":{"numberOfReplicas":2}}'`. Longhorn evicts extra replicas automatically and volumes stay healthy throughout.
- **pve-exporter v3.4.5 metric names changed**: Older dashboards use `pve_node_cpu_usage`, `pve_node_memory_*`, `pve_storage_*`, `pve_node_uptime_seconds`. The actual metric names are `pve_cpu_usage_ratio`, `pve_memory_usage_bytes`, `pve_disk_usage_bytes`, `pve_uptime_seconds` with `id` labels like `node/pve1`, `storage/pve1/local-lvm`. Filter with `{id=~"node/.*"}` for nodes.
- **github-exporter /metrics blocks on API calls**: The `githubexporter/github-exporter` container's `/metrics` endpoint makes synchronous GitHub API calls. With the default 5s probe timeout, kubelet kills the container → CrashLoopBackOff. Fix: set probe `timeoutSeconds` to 30, add a startupProbe with `failureThreshold: 10`. Also increase the memory limit to 256Mi.
- **github-exporter.yaml Secret overwrites the real PAT**: If the manifest has `stringData.github_token: "REPLACE_ME"` and you `kubectl apply -f` the whole file, it overwrites the manually-set secret. Always update the PAT separately: `kubectl create secret generic github-exporter-token -n monitoring --from-literal=github_token=<PAT> --dry-run=client -o yaml | kubectl apply -f -`
- **Traefik scrape config relabel bug**: The additionalScrapeConfigs relabel for Traefik used `__meta_kubernetes_pod_annotation_prometheus_io_port` as the sole source for `__address__`, producing `9100:9100` (port:port) instead of ip:port. Fix: use two source labels `[__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]` with `regex: (.+);(.+)` and `replacement: ${1}:${2}`.
- **UniFi poller 429 death spiral → account lockout**: UnPoller re-authenticates every poll interval (~15s) even while in its own "retry backoff" state. After CrashLoopBackOff restarts, hundreds of rapid auth requests trigger the UDM Pro's brute-force protection (
`AUTHENTICATION_FAILED_LIMIT_REACHED`). The account lockout is per-account (not per-IP) and persists for 30+ minutes even with zero traffic. The poll interval alone doesn't prevent it. Fix: (1) increase `UP_UNIFI_DEFAULT_POLL_INTERVAL` to `120s`, (2) add a `startupProbe` with `failureThreshold: 10` so startup auth failures don't trigger immediate restarts, (3) increase `livenessProbe.periodSeconds` to `120` to reduce restart frequency. To recover from lockout: log into the UDM Pro UI as admin → Settings → Security → unlock the service account.
- **ServiceMonitors without a valid /metrics cause permanent TargetDown**: Before adding a ServiceMonitor, verify the service actually exposes `/metrics`. oauth2-proxy v7.6.0 requires `--metrics-address=:44180` for a separate metrics port. code-server does not expose Prometheus metrics at all. Home Assistant requires the Prometheus integration + a Long-Lived Access Token configured before its `/api/prometheus` endpoint returns 200.
- **Grafana community dashboards from 2014-2018 don't work**: gnetId 139 (AWS Billing), 617 (EC2), 575 (S3) use legacy query formats incompatible with the modern Grafana CloudWatch plugin. Replace them with custom ConfigMap dashboards using the `metricEditorMode: 0` and `metricQueryType: 0` fields. Dashboard ConfigMaps with the `grafana_dashboard: "1"` label are auto-loaded by the sidecar.
- **pve_guest_info has no `status` label in pve-exporter v3.4.5**: A `PveVmDown` alert using `pve_guest_info{status!="running"}` fires for ALL non-template VMs because the `status` label doesn't exist (empty string != "running" is always true). Use `pve_up{id=~"qemu/.*"} == 0` joined with `pve_guest_info{template="0"}` instead.
- **k3s has no kube-proxy**: k3s uses built-in iptables/nftables, not kube-proxy. Enabling `kubeProxy` monitoring in kube-prometheus-stack creates a ServiceMonitor that can never scrape anything, causing permanent `TargetDown` alerts. Set `kubeProxy.enabled: false`.
- **Alertmanager email with empty SMTP credentials**: If `alertmanager_smtp_enabled: true` but the `ALERTMANAGER_SMTP_USER`/`ALERTMANAGER_SMTP_PASSWORD` env vars are not set, Alertmanager generates 100% email send failures, triggering `AlertmanagerFailedToSendAlerts` and `AlertmanagerClusterFailedToSendAlerts`. Disable SMTP until credentials are configured.
- **Loki compactor corruption on Longhorn**: Loki's compactor can get corrupted files (`input/output error` on `/var/loki/compactor/deletion/delete_requests`), causing infinite CrashLoopBackOff. Fix: scale the StatefulSet to 0, run a busybox pod with the same PVC to `rm -rf /var/loki/compactor/deletion`, scale back to 1. Data loss is minimal (only pending delete requests).
- **Loki TSDB index corruption on Longhorn**: Loki's TSDB shipper cache index files can become corrupted (`input/output error` on `/var/loki/chunks/index/index_<N>/fake/*.tsdb.gz`), causing CrashLoopBackOff on startup (error: "error initialising module: store"). The Longhorn volume is healthy but the individual index file is corrupt. Fix: scale the StatefulSet to 0, run `kubectl run loki-cleanup --image=busybox --restart=Never -n monitoring --overrides='...'` mounting the PVC to `rm -rf /var/loki/chunks/index/index_<N> /var/loki/tsdb-shipper-cache/index_<N>`, then scale back to 1. Data loss is one day's TSDB index table (Loki resyncs from object storage). Grafana shows no data while Loki is down because the Loki datasource returns errors.
- **Prometheus WAL corruption on Longhorn ("Grafana shows no data")**: When the Prometheus WAL has a corrupt segment on Longhorn (
`write to WAL: log samples: write /prometheus/wal/000000XX: input/output error`), Prometheus continues running (2/2 Running) but silently drops ALL scrape data — the TSDB head has data but maxTime is hours/days stale. Grafana shows "No data" on every panel. The file can't be removed with `rm` while mounted (that also returns an I/O error). Fix: (1) scale the Prometheus Operator to 0 first (`kubectl scale deployment prometheus-kube-prometheus-operator -n monitoring --replicas=0`) to prevent it fighting back, (2) scale the StatefulSet to 0, (3) run a busybox pod mounting the PVC — the PVC root is `/data/prometheus-db/wal/`, NOT `/prometheus/wal/` — and `rm -f` all WAL segments (keep `checkpoint.*`), (4) scale the StatefulSet back to 1, (5) scale the operator back to 1. Data loss is WAL segments only (minutes of data); existing TSDB blocks on the PVC survive intact. Verify recovery with a `count(up{job!=""} == 1)` query returning non-zero.
- **Longhorn replica ERR silently propagates I/O errors to the workload mid-write**: When a Longhorn volume replica goes ERR (e.g., due to a node losing storage connectivity or the instance-manager being restarted during heavy I/O), the volume remains "attached/healthy" at the Longhorn API level but write calls from the workload return `input/output error`. Any file mid-write at that moment becomes permanently corrupted — the file exists on disk but is unreadable. This is how both the Prometheus WAL and the Loki TSDB became corrupted simultaneously: both volumes had replicas on the same node that experienced a brief storage disruption. Root cause for the Feb 24 2026 outage: k3s-agent-4 was added to the cluster Feb 21, and Longhorn auto-balance (best-effort) migrated replicas onto it. During the first 24-48h the node had storage-level disruptions (possibly related to UPS testing/configuration), causing the instance-manager to restart and ERR all replicas on that node. Any volume with a replica on k3s-agent-4 could have received I/O errors during writes. Prevention: monitor `longhorn_volume_robustness` (should be `0` = Healthy). Alert on `longhorn_volume_robustness > 0`. Consider setting `replicaSoftAntiAffinity: false` to prevent all replicas landing on a single node. After adding a new node, watch Longhorn events for several hours before considering the node stable.
- **Monitoring storage on NAS NFS (migrated Feb 2026)**: After the Feb 24 2026 Longhorn WAL/TSDB corruption incident, Prometheus TSDB, Loki chunks, Grafana state, and AlertManager state were all migrated from the `longhorn` StorageClass to the `nfs-monitoring` StorageClass backed by the DXP4800 NAS at `/volume1/monitoring`. This eliminates exposure to Longhorn replica I/O errors for these high-write workloads. The provisioner is `nfs-subdir-external-provisioner`, deployed via Helm into the `monitoring` namespace — it dynamically creates subdirectories per PVC. Prerequisites before running Ansible: create the `/volume1/monitoring` NFS share in the UGOS Pro UI (Control Panel → File Services → NFS → Add share → path `/volume1/monitoring` → client `192.168.20.0/24` → permissions `Read/Write`, `root_squash`). This share must exist BEFORE the `cluster_services` role runs or the provisioner pod will CrashLoopBackOff. Migration is destructive: existing Longhorn PVCs for `prometheus-db-prometheus-*`, `loki-*`, `grafana-*`, `alertmanager-*` must be manually deleted after scaling down the stack — data is lost (acceptable for metrics/logs). New PVCs auto-provision on NFS. Note: after migration, `reclaimPolicy=Retain` means deleting a PVC does NOT delete the NFS subdirectory. Clean up `/volume1/monitoring/` manually if needed.
- **bpg/proxmox v0.94.0**: Does not support
the `timeouts` block.
- **bpg/proxmox hostpci `device` vs `id` fields are counterintuitive**: In the `hostpci` block, `device` is the PCI slot name (`"hostpci0"`, `"hostpci1"`, etc.) and `id` is the actual PCI address (`"0000:00:02.0"`). These names are the opposite of what you'd expect. Getting them swapped causes `"property is not defined in schema"` errors because the PCI address fails slot-name validation.
- **bpg/proxmox efi_disk requires `file_format` and `pre_enrolled_keys`**: Omitting `file_format = "raw"` and `pre_enrolled_keys = false` from the `efi_disk` block causes `"efidisk0: invalid format - missing key"`. Both fields are required even though the provider docs don't emphasize them.
- **Proxmox cloud-init can't be hotplugged**: Changing cloud-init parameters (`ipconfig0`, network config) on a running VM fails with `"ide2: hotplug problem - unable to change media type"`. Must stop and start the VM (or destroy/recreate via Terraform) for cloud-init changes to apply.
- **Let's Encrypt URL typo**: Use `acme-v02`, not `acme-v2`.
- Grafana provisioned dashboards from grafana.com —
`${DS_PROMETHEUS}` unresolved: The Grafana Helm chart downloads dashboards from grafana.com using an init container. When `datasource: Prometheus` (string format) is used, the chart generates a generic `sed` (`s/"datasource":.*,/"datasource": "Prometheus",/g`) that only handles old-style string datasource refs. Modern dashboards use object-style refs (`{"type":"prometheus","uid":"${DS_PROMETHEUS}"}`), which the sed can't match or corrupts. Fix: use the list format in Helm values: `datasource: [{name: DS_PROMETHEUS, value: prometheus}]`. This generates a proper targeted replacement. Dashboard panels show "No data" when `${DS_PROMETHEUS}` is left unresolved, since Grafana can't find a datasource with that UID.
- **Grafana starred dashboards lost on PVC recreation**: Starred dashboards are per-user preferences stored in Grafana's SQLite DB on the PVC. When the PVC is recreated (I/O corruption recovery), all stars are lost. Fix: use a "Home Hub" dashboard (`home-hub` UID) provisioned via ConfigMap (`grafana_dashboard: "1"` label) as the default landing page. Set it as home via both `grafana.ini` (`default_home_dashboard_path`) in the Helm values AND the Grafana API (`PUT /api/org/preferences`). ConfigMap-provisioned dashboards survive PVC recreation. The API call is needed for immediate effect; the `grafana.ini` path persists it across Helm upgrades.
- **Prometheus PVC size mismatch after recreation**: When a Prometheus PVC is deleted and recreated (e.g., I/O corruption recovery), the new PVC may use a smaller default size instead of the configured `prometheus_storage_size`. Always verify the PVC capacity matches after recreation. Longhorn supports online volume expansion: `kubectl patch pvc <name> --type=json -p='[{"op":"replace","path":"/spec/resources/requests/storage","value":"30Gi"}]'`.
- **NetworkPolicy `namespaceSelector: {}` does NOT match host-network IPs**: Egress rules using `namespaceSelector: {}` only match pod CIDRs within the cluster. The Kubernetes API server (10.43.0.1:443 → endpoints at 192.168.20.20-22:6443) and kubelets (`:10250`) run on host-network IPs that are NOT pod IPs. An egress NetworkPolicy with only `namespaceSelector: {}` blocks Prometheus from reaching the API server, killing ALL Kubernetes service discovery — only static/file-based scrape configs (additionalScrapeConfigs) work. The same applies to any external host IPs (Proxmox node-exporters, NAS, etc.). Fix: add explicit `ipBlock` rules for the node subnets (`192.168.20.0/24` for VLAN 20, `192.168.1.0/24` for the management VLAN). The monitoring namespace needs broad egress — it scrapes the API server, kubelets, node-exporters, and application pods across many ports. Symptom: Prometheus logs show `dial tcp 10.43.0.1:443: connect: connection refused` on every namespace's Service/Endpoints/Pod list.
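The egress fix above can be sketched as a NetworkPolicy (the policy name is hypothetical; the CIDRs are the subnets named in the entry):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring-egress        # hypothetical name
  namespace: monitoring
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}        # in-cluster pod CIDRs only
        - ipBlock:
            cidr: 192.168.20.0/24      # node subnet: API server endpoints, kubelets
        - ipBlock:
            cidr: 192.168.1.0/24       # management VLAN: Proxmox exporters, NAS
```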
## Docker-in-Docker (Mac Mini) — DEPRECATED
The Mac Mini was migrated from Docker-in-Docker to a Lima VM (Debian 13 arm64) to enable Longhorn storage and full DaemonSet compatibility. The Lima VM was subsequently removed from the cluster on 2025-06-25 — the Mac Mini is no longer a k3s node. Do not re-add the Lima VM. All DinD and Lima lessons below are kept for historical reference only.
- **DaemonSets with host path mounts fail**: Longhorn, Promtail, node-exporter require the host FS. In DinD, exclude them via node affinity: `runtime NotIn [docker-desktop]`. In Lima VMs, all DaemonSets work natively.
- **k3s auto-labels `instance-type=k3s`**: Must `--overwrite` to set custom values like `mac-mini`.
- **Lima VM replaces Docker-in-Docker**: Docker-in-Docker on macOS cannot provide the host-level block devices, iSCSI, or persistent paths that Longhorn requires. A Lima VM with Debian 13 arm64 gives a real Linux kernel with full `/dev`, `/proc`, `/sys` access. The external disk is passed as a raw disk image via Lima `additionalDisks`. Node label changed from `runtime=docker-desktop` to `runtime=lima`.
- **Lima bridged networking (NOT shared)**: Lima's default SLIRP networking NATs the VM behind the host, making k3s pod-to-pod traffic and MetalLB impossible. Use `networks: [{lima: bridged}]` for a real LAN IP via socket_vmnet. `shared` mode gives a 192.168.105.x NAT IP, which also fails. Requires one-time `limactl sudoers` setup plus `brew install socket_vmnet` and copying the binary to `/opt/socket_vmnet/bin/socket_vmnet` (Lima rejects symlinks). After first boot, set a DHCP reservation in UniFi for a stable IP.
- **Lima additional disk naming**: Create a raw disk image on the external drive (`qemu-img create -f raw /path/datadisk 1000G`). The file MUST be named `datadisk` (not `<name>.raw`). In Lima v2, `additionalDisks` uses string form (`- "longhorn"`), not object form. Lima auto-formats and mounts additional disks at `/mnt/lima-<name>`, so do NOT manually `mkfs` in provisioning — just symlink `/var/lib/longhorn -> /mnt/lima-longhorn`.
- **UFW firewall blocks VXLAN for new nodes**: The Ansible hardening role enables UFW with a `deny incoming` policy and allows traffic only from IPs in the `k3s_cluster` inventory group. When adding a node NOT in the Ansible inventory (e.g., the Lima VM), VXLAN overlay networking breaks silently: packets leave the new node, arrive at other nodes' eth0 (visible in tcpdump), but the kernel's INPUT chain drops them before VXLAN decapsulation. Symptoms: pods on the new node can ping LAN IPs (no overlay) but NOT pod CIDRs on other nodes; DNS via ClusterIP fails; longhorn-csi-plugin CrashLoopBackOff. Fix: add `ufw allow from <new-node-ip>` on ALL existing nodes. Ansible: add the IP to `k3s_external_node_ips` in `group_vars/all.yml`.
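On the Ansible side, the UFW fix above is a single variable entry. A sketch (the IP is a hypothetical out-of-inventory node; the variable name comes from the entry):

```yaml
# group_vars/all.yml
k3s_external_node_ips:
  - 192.168.20.50   # hypothetical node outside the k3s_cluster inventory group
```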
## VLAN Migration & etcd
- **netplan apply does NOT reorder kernel IPs**: When a VM boots with `192.168.20.x` listed first in netplan, the kernel assigns it as the primary address. A subsequent `netplan apply` with the IP order swapped (old IP first) does NOT change the kernel's address order — it only adds/removes addresses. Only a full reboot resets the order. Verify with `ip addr show eth0`, not `ip route get`.
- **k3s etcd binds to the first interface IP, not the routing-table source**: k3s detects the node IP from the first non-loopback address on the default-route interface (`ip addr show`). This determines where etcd listens (`:2379`, `:2380`). Even if `ip route get 8.8.8.8` reports `src 192.168.1.x`, etcd binds to whichever IP appears first in `ip addr show`. Use `--node-ip` to force a specific IP.
- **Never delete TLS certs on multiple k3s servers simultaneously**: When TLS cert directories are deleted on all servers, each independently generates a new CA on startup. Since the CAs don't match, etcd peers reject each other with "tls: bad certificate" and quorum can never form. Recovery requires `--cluster-reset` on one server (which extracts the original CA from the etcd datastore), then wiping etcd data on the other servers and re-joining.
- **k3s `--cluster-reset` is the nuclear but reliable etcd recovery**: Converts a 3-node etcd cluster to single-node on the server that runs it. All workload data is preserved. Afterward: start that server normally, wipe `/var/lib/rancher/k3s/server/db/` and `/var/lib/rancher/k3s/server/tls/` on the other servers, then start them — they rejoin and sync. Agent nodes may need a restart to refresh their local load balancer's TLS state.
- **etcd member URLs are stored in etcd data, not config files**: When k3s starts with a wrong IP (e.g., 192.168.20.x instead of 192.168.1.x), it writes that IP into etcd's member/peer URL table. Even after reverting the k3s service files and netplan, etcd data still contains the wrong peer URLs. This causes a deadlock: nodes advertise old IPs in config but etcd expects new IPs from data. Fix: either update member URLs via `etcdctl member update`, or use `--cluster-reset` + wipe.
- **VLAN-tagged VMs can't use old-subnet IPs simultaneously**: With Proxmox `tag=20`, the VM's eth0 traffic is VLAN-tagged. Old-subnet IPs (192.168.1.x, VLAN 1/untagged) become unreachable because the gateway (UDM) receives them on VLAN 20 but expects them on VLAN 1. Dual-IP migration requires VMs to remain untagged during the transition, with VLAN tags applied only after the old IPs are fully removed.
- **k3s "TLS newer than datastore" fatal error**: When k3s generates new TLS certs (e.g., after deleting the tls/ directory) but can't form etcd quorum, the new certs get written to disk but NOT to the etcd datastore. On the next restart, k3s detects that the disk certs are newer than the datastore copy and refuses to start, to prevent a cluster-wide cert mismatch. Fix: delete `/var/lib/rancher/k3s/server/tls/` AND `/var/lib/rancher/k3s/server/db/etcd-tmp/` (the staging area from partial starts), then restart.
- **Proxmox lock files from concurrent qm operations**: `qm reboot` or `qm guest exec` can leave stale lock files at `/var/lock/qemu-server/lock-<vmid>.conf`. Subsequent `qm` commands fail with "VM is locked". Fix: `rm -f /var/lock/qemu-server/lock-<vmid>.conf` on the Proxmox host.
- **systemd caches unit files across reboots**: After editing k3s service files on VMs, if k3s auto-starts on boot (enabled service), it uses the cached pre-edit unit file. Running `systemctl daemon-reload && systemctl restart k3s` after boot is required for changes like `--node-ip` to take effect. The `daemon-reload` must happen BEFORE the restart.
- **`--node-ip` is mandatory for VLAN migration**: Without `--node-ip`, kubelet auto-detects from the existing node object in etcd, which still has the old IP. Even though the VM only has the new IP in netplan, the node registers with the stale cached IP until `--node-ip` forces the correct address. Required on ALL nodes (servers AND agents).
- **Successful full-stop VLAN migration procedure**: 1) Stop all k3s on all nodes, 2) write new-IP-only netplans, 3) update k3s service files with `--node-ip=<new>` on ALL nodes + `--server=https://<new-server-1>:6443`, 4) clean TLS on server-1, wipe TLS+DB on servers 2-3, 5) set the VLAN tag in Proxmox, 6) reboot VMs (`qm stop` + `qm start` is more reliable than `qm reboot`), 7) stop auto-started k3s on all nodes, 8) `--cluster-reset` on server-1, 9) start server-1, then 2-3, then agents — each with `daemon-reload` first, 10) update the Lima VM `K3S_URL` + restart the agent. Entire procedure: ~20 min downtime.
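The service-file change in step 3 might look like this excerpt (a sketch; the IPs are illustrative post-migration VLAN 20 addresses, and the flag layout follows the standard k3s unit file):

```ini
# /etc/systemd/system/k3s.service excerpt on a secondary server
ExecStart=/usr/local/bin/k3s \
    server \
    --node-ip=192.168.20.21 \
    --server=https://192.168.20.20:6443
```

Remember the lesson above: after editing, `systemctl daemon-reload` must run before `systemctl restart k3s` or the cached unit file is used.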
## Proxmox & Hardware
- **Intel e1000e "Hardware Unit Hang"**: ThinkCentre M920q Intel I219-LM NICs hang with TSO/GSO/tx-checksumming enabled → "NIC Link is Down" → 3-5s network outage. Fix: disable all hardware offloading via ethtool. See `PROXMOX_E1000E_FIX.md`.
- **`bridge-vids 2-4094` causes "No space left on device" on bonded interfaces**: When creating an active-backup bond with a VLAN-aware bridge on Proxmox, using `bridge-vids 2-4094` overflows the bridge VLAN table. `ifreload` fails with "No space left on device" and can take the network down. Fix: only specify the VLANs you actually need (e.g., `bridge-vids 20 30`). Never use the full range.
- **Mellanox ConnectX-3 Flash Recovery mode can permanently brick the card**: When a ConnectX-3 card enters Flash Recovery mode (PCI ID `15b3:01f6`), attempting a firmware flash via `mstflint` may appear to succeed but doesn't actually write. After reboots, the card can disappear entirely from the PCI bus — the CPU's PEG root port (PCIe 00:01.0) is disabled by the BIOS because PCIe link training fails with the bricked card. The DEVEN register (PCI 00:00.0 offset 0x54 on Intel Q370) controls root port enable/disable: bit 3 = PEG10 (x16 slot). `0x00008031` = disabled, `0x00008039` = enabled. Write attempts to DEVEN are silently rejected by the BIOS lock. Physical reseat + cold boot does not help. Remote recovery is impossible — the card must be physically removed and replaced. MCX311A-XCAT replacement cards are ~$15-20 on eBay. Resolution (Feb 20, 2026): replacement card installed in pve1, configured with an active-backup bond matching pve2/pve3. All three hosts now have 10GbE.
- **Mellanox ConnectX-3 firmware flash procedure (mstflint)**: Debian's `mstflint` package (apt, v4.31.0) can flash firmware but CANNOT use `/dev/mst/` device paths — it errors with "Cannot open MST device". Use the PCI BDF address instead: `mstflint -d 0000:01:00.0 -i <fw_image.bin> burn`. After flashing, two cold boots (full power-off, not just reboot) are required before the new firmware version appears in `ethtool -i enp1s0`. A single reboot is NOT sufficient — the first cold boot loads the new firmware into the NIC's flash, the second fully initializes it. Verify with `ethtool -i enp1s0 | grep firmware-version`. FW 2.42.5000 is the final GA release for ConnectX-3.
- **M920q BIOS updates are shared across the Lenovo Tiny family**: The M920q shares its BIOS with M720t/M720s/M720q/M920t/M920s/M920x/P330 Tiny. Latest: M1UKT78A (Jan 2026). Download from Lenovo DS503907. USB EFI flash is the recommended method for Proxmox hosts (no Windows needed). BIOS M1UKT45A introduced a permanent downgrade lock — cannot flash to any version below M1UKT44A. Contains critical patches: Intel Downfall/GDS (CVE-2022-40982), multiple CPU microcode updates, an Ubuntu freeze fix, NVMe SSD detection improvements, PXE boot fixes.
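The `bridge-vids` rule above translates to an `/etc/network/interfaces` stanza like this sketch (interface names and VLAN IDs are illustrative):

```
auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-vlan-aware yes
    bridge-vids 20 30   # only the VLANs in use; never 2-4094 on a bonded bridge
```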
- **Diagnosing PCIe device absence**: If `lspci` doesn't show a PCIe card, check whether the root port itself is enumerated (e.g., `00:01.0` for PEG10). If the root port is missing, the CPU has disabled it. Read the DEVEN register: `setpci -s 00:00.0 0x54.L`. Compare against a working node. If the bit for your slot is 0, the BIOS disabled it because link training failed — the card is likely bricked.
- **PVE exporter with `cluster=1` — single target only**: When `cluster=1` is set, any single PVE node returns metrics for the entire cluster (all nodes, VMs, storage). Using multiple static targets creates N× duplicate series with mismatched `instance` labels (e.g., `id=node/pve2` with `instance=192.168.20.105`). Fix: use ONE static target. Any PVE node works as the entry point. The `pve_node` relabel based on target IP becomes misleading and should be removed.
- **`software-properties-common` doesn't exist on Debian 13**: Ubuntu-only package.
- **`community.general.timezone` fails on Debian 13**: Use the `timedatectl` command instead.
- **SSH `sshd_config.d` drop-ins**: Cannot be validated standalone with `sshd -t -f %s`.
- **fail2ban on Debian 13**: Needs `backend = systemd` (no `/var/log/auth.log` by default).
- **k3s v1.29 /healthz returns 401**: API readiness checks must accept [200, 401] for unauthenticated probes.
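The fail2ban fix above is a one-line jail setting. A sketch (the jail file path is the conventional one; only `backend = systemd` comes from the entry):

```ini
# /etc/fail2ban/jail.local on Debian 13 — journald instead of /var/log/auth.log
[sshd]
enabled = true
backend = systemd
```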
- **k3s upgrade installer overwrites the agent env file**: Running the k3s install script (even with `INSTALL_K3S_SKIP_START=true`) creates a fresh empty `/etc/systemd/system/k3s-agent.service.env`, wiping `K3S_TOKEN` and `K3S_URL`. After every agent install, you MUST restore the env file before starting the service. Server nodes are unaffected because they use `--cluster-init` from the service file args. This applies to all upgrade methods (curl installer, Ansible).
- **k3s upgrades must go through each minor version**: Skipping minor versions (e.g., v1.29→v1.31) is unsupported and risks etcd/API incompatibilities. Always step through: v1.29→v1.30→v1.31→v1.32→etc. Take etcd snapshots between steps.
- **k3s upgrade: drain with `--disable-eviction` for Longhorn**: Longhorn PodDisruptionBudgets block normal drains. Use `kubectl drain --ignore-daemonsets --delete-emptydir-data --force --timeout=90s --disable-eviction`. StatefulSets (loki-0, prometheus) may still time out — force-delete them with `kubectl delete pod --force --grace-period=0`.
- **k3s upgrade: Traefik still pinned to the v2 image**: Even with k3s v1.34 bundling Traefik Helm chart v27, k3s intentionally pins `tag: "2.11.24"` in the default traefik.yaml manifest. Traefik v3 migration requires an explicit image override — it does NOT happen automatically with k3s upgrades.
- **Longhorn upgrade path has gaps**: Not all Longhorn minor versions have the latest patch. For example, v1.7.4 returns 404 from GitHub — use v1.7.3 instead. Always verify the release exists before applying.
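The Traefik image override above can be done with k3s's `HelmChartConfig` mechanism for customizing the bundled chart. A sketch (the v3 tag shown is hypothetical; pick a current release):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    image:
      tag: "3.3.0"   # hypothetical Traefik v3 tag, replaces the pinned 2.11.24
```

Remember the Kubernetes lesson above: moving to Traefik v3 also breaks port-9000 Ingress resources, so plan the IngressRoute migration at the same time.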
## Shell & Secrets
- **Passwords with `!` in kubectl commands**: Bash interprets `!` as history expansion in double quotes. Use single quotes: `--from-literal=PASSWORD='MyPass!123'`.
- **gcloud CLI from Homebrew not on PATH**: `brew install --cask google-cloud-sdk` installs to `/opt/homebrew/share/google-cloud-sdk/bin/gcloud` on Apple Silicon but does NOT add it to `$PATH`. Either use the full path or add `source /opt/homebrew/share/google-cloud-sdk/path.zsh.inc` to the shell profile.
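The quoting rule above in context (the secret and key names are hypothetical; the point is that single quotes keep `!` inert):

```shell
# Single quotes prevent bash history expansion from mangling the password
PASSWORD='MyPass!123'

# Hypothetical secret creation using the safely quoted value:
# kubectl create secret generic app-credentials \
#   --from-literal=PASSWORD="$PASSWORD"
echo "$PASSWORD"
```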
## Docker Build & Deployment
- **Docker build platform mismatch (arm64 Mac → amd64 k3s)**: Building on an Apple Silicon Mac without `--platform linux/amd64` produces an arm64 image. k3s nodes (Debian 13 amd64) fail with `no match for platform in manifest: not found`. Always use `docker build --platform linux/amd64 --provenance=false`. The `--provenance=false` flag is critical — Docker Desktop adds attestation manifests that k3s containerd can't resolve, causing the same `not found` error even with the correct platform.
- **Docker Desktop on macOS hangs silently**: `docker ps`, `docker info`, `docker buildx ls` can all hang indefinitely with no error. When this happens, don't waste time restarting Docker Desktop — build on a cluster node instead. Install `docker.io` + `awscli` on a server node, SCP the project over, build natively (amd64), and push to ECR from the server.
- **ECR auth token piping via SSH**: Generate the ECR token locally (`aws ecr get-login-password`), then pipe it to the remote server via SSH: `echo "$TOKEN" | ssh server 'sudo docker login --username AWS --password-stdin <registry>'`. This avoids provisioning AWS credentials on the build node.
- **requirements.txt ranges over exact pins**: Exact version pins (`==X.Y.Z`) cause `pip install` failures when specific versions aren't available on the target platform (e.g., amd64 Debian). Use `>=X.Y.Z,<(X+1).0` ranges for cross-platform compatibility.
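The pinning rule above in a concrete `requirements.txt` sketch (the package names and bounds are illustrative, not from the repo):

```
# Ranges, not exact pins: survives platform-specific version gaps
requests>=2.31,<3.0
boto3>=1.34,<2.0
```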
## UniFi Dream Machine Pro
- **API key auth returns 401 on all endpoints**: UDM Pro (UniFi OS 3.x) API keys appear to be read-only or scoped. For write operations (DHCP reservations, port forwards), use cookie-based auth: POST `/api/auth/login` with username/password, then capture the `TOKEN` cookie and `X-CSRF-Token` header. Include both on subsequent requests.
- **CSRF token required on all mutating requests**: After cookie auth, every POST/PUT/DELETE to the UDM Pro must include the `X-CSRF-Token` header from the login response. Omitting it returns 403 Forbidden even with a valid session cookie.
- **`api.err.InvalidFixedIP` on specific IPs**: Certain IPs (e.g., `.20`) are rejected by UniFi firmware even when not in use. May be reserved by the UDM Pro for internal use. Workaround: assign a different IP or set a static IP at the OS level.
- **`api.err.FixedIpAlreadyUsedByClient` persists after forget**: After forgetting a client via `cmd/stamgr`, the conflict cache takes time to clear. The `forget-sta` command returns `rc: ok` but the DHCP reservation still fails. May need to wait for cache expiry or fix via the UniFi UI.
- **DHCP reservation API path**: `POST /proxy/network/api/s/default/rest/fixedip` with body `{"mac": "aa:bb:cc:dd:ee:ff", "fixed_ip": "192.168.1.X", "network_id": "<network-uuid>"}`. Network ID for the Default LAN: `<network-uuid>`.
- **Port forward API path**: `POST /proxy/network/api/s/default/rest/portforward` with body including `name`, `fwd`, `fwd_port`, `dst_port`, `proto`, `src`, `enabled`.
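Putting the cookie + CSRF flow above together as a shell sketch (the host, credentials, and IPs are placeholders; the curl calls are shown commented out because they only make sense against a live controller):

```shell
UDM="https://udm.example.internal"   # placeholder controller address

# Pull X-CSRF-Token out of saved login response headers
extract_csrf() {
  awk -F': ' 'tolower($1)=="x-csrf-token" {print $2}' "$1" | tr -d '\r'
}

# 1) Login: saves the TOKEN cookie and response headers
# curl -sk -c cookies.txt -D headers.txt -H 'Content-Type: application/json' \
#      -d '{"username":"admin","password":"REDACTED"}' "$UDM/api/auth/login"
# CSRF=$(extract_csrf headers.txt)

# 2) Mutating request: cookie AND CSRF header are both required
# curl -sk -b cookies.txt -H "X-CSRF-Token: $CSRF" \
#      -H 'Content-Type: application/json' \
#      -d '{"mac":"aa:bb:cc:dd:ee:ff","fixed_ip":"192.168.1.50","network_id":"<network-uuid>"}' \
#      "$UDM/proxy/network/api/s/default/rest/fixedip"
```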
## Public Ingress & OAuth2
- **OAuth2 Proxy forwardAuth mode**: When using Traefik's forwardAuth middleware, OAuth2 Proxy must be configured with `upstreams = ["static://200"]` and `reverse_proxy = true`. It returns 202 (authenticated) or 401 (redirect to login). Do NOT set it as a reverse proxy upstream.
- **amazon/aws-cli container has curl but NOT wget**: The `amazon/aws-cli:2.15.0` image includes curl but not wget. DDNS scripts must use `curl -sf` instead of `wget -qO-`.
- **DDNS script multi-provider fallback**: Use multiple IP-check providers (ifconfig.me, api.ipify.org, icanhazip.com, checkip.amazonaws.com) with fallback. Any single provider can time out or return HTML instead of plain text.
- Specific DNS records override wildcard: A Route53 A record for `ha.k3s.internal.zolty.systems → 192.168.20.202` (private IP) silently overrode the wildcard `*.k3s.internal.zolty.systems → <public-ip>`. External clients resolved to a private LAN IP, causing connection timeouts. TCP/TLS tests with hardcoded IPs masked the issue. Fix: delete specific A records for subdomains that should use the wildcard. Diagnostic: `time_connect: 0.000000` with `time_namelookup > 0` in curl timing means DNS resolved but TCP couldn't reach the resolved IP.
- ECR pull secret expiry breaks CronJobs silently: ECR tokens last 12h. If no app deploys refresh the pull secret, daily CronJobs (etcd-backup, postgres-backup) fail to pull images the next morning. Fix: `kubernetes/core/ecr-token-refresh/cronjob.yaml` runs every 6h and refreshes `ecr-pull-secret` across all namespaces using the Kubernetes API directly (no kubectl needed). Uses `ecr-pull-k3s` IAM credentials stored in the `ecr-refresh-aws-credentials` secret.
- ServiceMonitors without valid /metrics cause permanent TargetDown: Before adding a ServiceMonitor, verify the service actually exposes `/metrics`. code-server does not expose Prometheus metrics at all — ServiceMonitors targeting `/healthz` create permanent TargetDown alerts.
- Traefik `errors` middleware preserves original status code: The `errors` middleware intercepts a matching response status (e.g. 401) and replaces the body with the error handler service's response — but keeps the original 401 status code, not the error handler's. This means pointing the errors handler at `/oauth2/start` (which returns 302) still yields a 401 to the client. Use this middleware to pass through redirect headers + cookies even though the browser sees 401; Chrome does follow Location on top-level-navigation 401s, so this partially works in practice.
- oauth2-proxy ForwardAuth `redirect_url` must be hardcoded: Without `redirect_url = "https://auth.k3s.internal.zolty.systems/oauth2/callback"` in the oauth2-proxy config, it infers the callback URI from the incoming Host or X-Forwarded-Host header. When the Traefik `errors` middleware calls oauth2-proxy internally after a ForwardAuth 401 (from a jellyseerr request), the inferred host becomes `jellyseerr.k3s.internal.zolty.systems/oauth2/callback` — not registered in Google Cloud Console → OAuth fails. Always set `redirect_url` explicitly.
- Traefik `errors` middleware order matters for ForwardAuth: List `oauth2-redirect-errors` before `google-oauth` in IngressRoute middlewares so it wraps the entire ForwardAuth chain and can intercept the 401. If listed after, the chain stops at ForwardAuth and the errors middleware never runs.
- Full reverse-proxy oauth2-proxy vs ForwardAuth: ForwardAuth mode always returns 401 for unauthenticated requests; browsers may or may not redirect from that. Full reverse-proxy mode (oauth2-proxy as the actual upstream — used for media-profiler) handles the redirect itself (302 → Google) with zero ambiguity. For services where seamless browser redirect is critical, use full reverse-proxy mode (separate per-service oauth2-proxy deployment with `--upstream=http://service-url`). ForwardAuth mode is simpler but requires workarounds for the sign-in flow.
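The forwardAuth-mode settings above combine into a config fragment like this (a sketch; only the keys named in the lessons above, values illustrative):

```toml
# oauth2-proxy config sketch for Traefik forwardAuth mode
reverse_proxy = true
upstreams = ["static://200"]   # never proxy traffic; just answer 202/401

# Must be hardcoded: otherwise the callback is inferred from
# X-Forwarded-Host, which breaks when the Traefik errors middleware
# invokes oauth2-proxy for another service's hostname.
redirect_url = "https://auth.k3s.internal.zolty.systems/oauth2/callback"
```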
## Wiki.js
- Wiki.js 2.5 `pages.update` mutation hangs forever: The GraphQL mutation succeeds (content is saved, render completes) but the HTTP response is never returned. This is a Wiki.js 2.5 bug — the render subprocess completes but the main process never sends back the response. Workaround: fire-and-forget — send the mutation with a 3-second timeout, ignore the timeout error, wait 8 seconds for render, then verify via a `pages.single` read query. See `scripts/add-wiki-metrics.py` for the pattern.
- Wiki.js mutation requires `tags` field: Omitting `tags: []` from the update mutation causes "Cannot read properties of undefined (reading 'map')". Always include `tags: []`.
- Use GraphQL variables for Wiki.js mutations: String interpolation with manual escaping breaks on Unicode characters (em-dashes U+2014, emoji). Use proper GraphQL `$variables` with `json.dumps()`, which handles escaping correctly.
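The variables-plus-`json.dumps()` pattern can be sketched as a payload builder (the mutation's field list is illustrative; the real schema is Wiki.js's own):

```python
import json

# Hypothetical field selection; the Wiki.js GraphQL schema defines the
# real arguments for pages.update.
MUTATION = """
mutation ($id: Int!, $content: String!, $tags: [String]!) {
  pages {
    update(id: $id, content: $content, tags: $tags) {
      responseResult { succeeded }
    }
  }
}
"""

def build_update_payload(page_id: int, content: str) -> str:
    """GraphQL variables + json.dumps handle Unicode (em-dashes, emoji)
    without manual escaping. tags: [] is mandatory, or the mutation
    fails with "Cannot read properties of undefined (reading 'map')"."""
    return json.dumps({
        "query": MUTATION,
        "variables": {"id": page_id, "content": content, "tags": []},
    })
```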
## Alert Responder
- Longhorn PVC + non-root user = permission denied: When a Dockerfile creates a non-root user (e.g. `useradd appuser`) and the pod mounts a Longhorn PVC, the volume is owned by root. The non-root user can't write to it. Fix: add `securityContext.fsGroup: <GID>` to the pod spec so Kubernetes sets volume group ownership.
- K8s manifest placeholder secrets overwrite real secrets: If a manifest includes Secret resources with `stringData: REPLACE_ME` placeholders, running `kubectl apply -f manifest.yaml` will overwrite any previously-created secrets with the placeholder values. Solution: remove Secret definitions from the manifest entirely and create secrets out-of-band via `kubectl create secret`.
- Slack Block Kit 3000-char limit: Slack section blocks have a 3000-character limit on the `text` field. LLM-generated analysis + code-block remediation prompts can easily exceed this. Always truncate text to ~2800 chars before inserting into blocks.
- Slack bot `not_in_channel` error: A Slack bot with `chat:write` scope cannot post to a channel it hasn't been invited to. After deploying a new bot, manually `/invite @BotName` in the target channel. The `channels:join` scope would allow self-join but isn't needed if you invite manually.
- Slack Socket Mode requires outgoing WS pings for event delivery: The `slack_sdk` `SocketModeClient` has a threading bug in its `Connection` class — WebSocket frames corrupt after ~30-50 seconds. Even with a custom `websocket-client` implementation, Slack only delivers interaction events (button clicks, modals) when the client sends outgoing WebSocket pings. `ping_interval=10` (matching the SDK default) is the critical setting. App-level `{"type":"ping"}` JSON messages alone are NOT sufficient — Slack needs protocol-level WS pings to mark the connection as active. Without them, the WebSocket is bidirectional (can send/receive raw frames) but Slack never routes `block_actions` or `view_submission` payloads.
- Slack action buttons must be top-level blocks, not inside attachments: Slack's `attachments` field does not support the `actions` block type. Interactive buttons must be in the top-level `blocks` array as `{"type": "actions", "elements": [...]}`. Buttons placed in attachments render but clicks silently fail with no event dispatched.
- Bedrock Converse API toolResult wrapper required: Tool results in the `user`-role message content array MUST be wrapped in `{"toolResult": {"toolUseId": "...", "content": [{"text": "..."}], "status": "success"}}`. Putting `toolUseId` and `content` directly in the content array item (without the `toolResult` key) causes `ParamValidationError: Invalid number of parameters set for tagged union structure`. The content array uses tagged unions — each item needs exactly one discriminator key (`text`, `toolResult`, `toolUse`, `image`, etc.).
- Claude models require Marketplace subscription for Bedrock tool-use: Claude models (Sonnet, Haiku) on Bedrock may work for basic `converse()` calls but return `AccessDeniedException` when `toolConfig` is included — even with the model listed as "Access granted" in the Bedrock console. Tool-use requires an explicit AWS Marketplace subscription. Amazon-native models (Nova Micro, Nova Lite, Nova Pro) work with tools without any additional subscription. Use Nova Micro for cost-effective agentic workloads; upgrade to Claude once the Marketplace subscription is enabled.
- ConfigMap env vars override Python code defaults — must update both: `os.getenv("VAR", default)` reads from the container environment. If a K8s ConfigMap sets that env var, it takes precedence over the code default. Changing only the Python default while the ConfigMap still has the old value has no effect. Always update the ConfigMap AND the code default together. Verify with: `kubectl exec <pod> -- python -c "from app.config import Config; print(Config.VAR)"`.
- Remediation agent runs inside a minimal container — NOT on nodes: The agent's `run_shell` tool executes inside the alert-responder pod (Python 3.12-slim with only kubectl, helm, curl, git). Commands like `journalctl`, `systemctl`, `service`, `ps`, `top`, `netstat` DO NOT EXIST. The agent wasted 5+ steps in a session trying host-level commands that all failed. Fix: added unavailable-command detection, improved tool descriptions to explicitly state container limitations, and added a `kubectl_exec` tool for pod-level diagnostics.
- Agent guesses pod names instead of looking them up: Job pods have random suffixes (e.g. `postgres-backup-29520495-abcde`). The agent fabricated `postgres-backup-29520495-xxxxxx` and got NotFound. Fix: added a `label_selector` parameter to `kubectl_get` so the agent can query `pods -l job-name=<job-name>` to discover actual pod names before fetching logs.
- Agent confuses namespaces across apps: The alert was for the `cardboard` namespace but the agent described a job in the `trade-bot` namespace. Fix: added explicit "Namespace Discipline" rules in the system prompt requiring the agent to always use the namespace from the alert, and listing common confusion pairs (trade-bot vs cardboard, longhorn vs longhorn-system).
- Agent tried wrong Longhorn namespace: The agent queried `kubectl get pods -n longhorn` (empty) instead of `longhorn-system`. Fix: added an explicit namespace for every infrastructure component in the system prompt (Longhorn → `longhorn-system`, Traefik → `kube-system`, monitoring → `monitoring`, MetalLB → `metallb-system`, cert-manager → `cert-manager`).
- RBAC missing `storage.k8s.io` and `longhorn.io` API groups: The agent got Forbidden when listing StorageClasses. Fix: added `storage.k8s.io` (storageclasses, volumeattachments) and `longhorn.io` (volumes, replicas, nodes, engines, engineimages) to the ClusterRole.
- Agent fabricates Job manifests instead of triggering from CronJob: When asked to manually run a backup, the agent created an ad-hoc `batch/v1 Job` manifest from scratch. The fabricated manifest referenced a PVC (`postgres-pvc`) that doesn't exist in the actual CronJob spec, used `environment:` instead of `env:` (invalid field), and omitted volumes/service accounts. Each fabrication failure triggered an approval request, wasting 3 approval cycles. Root cause: `kubectl_apply` allowed `kind: Job` manifests. Fix: (1) added a `kubectl_trigger_cronjob` tool (LOW-risk) that runs `kubectl create job --from=cronjob/<name>`, which inherits the full spec automatically, (2) blocked Job creation via `kubectl_apply` with a clear error redirecting to `kubectl_trigger_cronjob`, (3) added system prompt rules explicitly forbidding ad-hoc Job manifests.
- Agent attempted self-RBAC escalation after hitting Forbidden: When `kubectl_apply` failed with `Forbidden`, the agent attempted to create a `RoleBinding` granting itself more permissions. This always fails (a ClusterRoleBinding is required and the agent lacks the `bind` verb). The escalation attempt triggered another approval cycle and wasted 2 more steps. Fix: (1) added a manifest-level guard in `exec_kubectl_apply` that detects RBAC manifests targeting `alert-responder-agent` and returns a `BLOCKED` error with explanation, (2) added a system prompt rule: "If you hit a Forbidden error, STOP and report — never attempt to modify your own RBAC."
- Agent never diagnoses the actual root cause when its own actions create noise: In session #15, the original alert was caused by a Longhorn `Multi-Attach / volume not ready` error on `postgres-0` at 3 AM (visible in `kubectl_events` output). The agent saw the event but didn't follow it — instead it focused on executing fabricated jobs. The stuck agent-created Job then blocked the CronJob's `concurrencyPolicy: Forbid`, preventing the real fix (a clean retrigger). Fix: added a "Diagnosing Backup Failures" protocol in the system prompt that mandates checking postgres StatefulSet health and Longhorn PVC attach status before attempting any Job creation, and requires deleting stuck jobs before retriggering.
- `BackoffLimitExceeded` on a CronJob Job leaves a stuck Job blocking `concurrencyPolicy: Forbid`: When a CronJob's Job exhausts its backoff limit, the Job stays in `Failed` state and is NOT automatically deleted. With `concurrencyPolicy: Forbid`, the next scheduled CronJob run is skipped entirely (the CronJob controller sees an active Job). Diagnosis: `kubectl get jobs -n <ns>` — Failed jobs still count as "active" for concurrency purposes. Fix: `kubectl delete job <name> -n <ns>` to clear the stuck job, then `kubectl_trigger_cronjob` to retrigger.
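The tagged-union shape required by the Converse API can be captured in a small helper (a sketch; the field values are illustrative, the structure is the one described above):

```python
def tool_result_message(tool_use_id: str, text: str,
                        status: str = "success") -> dict:
    """Bedrock Converse API: a tool result goes back as a user-role
    message whose content item has exactly ONE discriminator key,
    `toolResult`. Putting toolUseId/content at the top level of the
    item raises ParamValidationError (tagged union structure)."""
    return {
        "role": "user",
        "content": [
            {
                "toolResult": {
                    "toolUseId": tool_use_id,
                    "content": [{"text": text}],
                    "status": status,
                }
            }
        ],
    }
```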
## OpenClaw (AI Assistant Gateway)
- OpenClaw has no published Docker image: Must build a custom image from `node:22-bookworm-slim` + `npm install -g openclaw@latest`. The `@discordjs/opus` native module requires build dependencies (python3 make g++ libopus-dev) for node-gyp compilation. Clean up build deps after install to save ~200MB.
- OpenClaw config `gateway.bind` only accepts keywords: Valid values are `loopback|lan|tailnet|auto|custom`. Setting `gateway.bind: "0.0.0.0"` causes a validation error and CrashLoopBackOff with "Invalid input". Use the CLI flag `--bind lan` in the Dockerfile CMD instead of the config file.
- OpenClaw `--bind lan` requires authentication: The gateway refuses to bind to LAN without auth — "Refusing to bind gateway to lan without auth." Must set the `OPENCLAW_GATEWAY_TOKEN` env var (or `--token` flag) and use `gateway.auth.mode: "token"` in config. Use a random hex token stored in a K8s Secret.
- OpenClaw default bind is loopback: Without `--bind lan`, the gateway listens only on `ws://127.0.0.1:<port>` — invisible to k8s Service/readiness probes. The pod appears Running but the readiness probe fails and the Service routes zero traffic. Always use `--bind lan` for k8s deployments.
- Longhorn encrypted PVCs have persistent CSI staging path failures: After deleting and recreating encrypted PVCs, new volumes get stale staging paths from previous mounts: "Staging target path … is no longer valid for volume …". This happened 3 consecutive times. Fix: use the regular `longhorn` StorageClass instead. The encryption benefit is marginal for app-level data that isn't highly sensitive.
- OpenClaw device pairing blocks the Control UI behind a reverse proxy: OpenClaw has a device pairing system separate from token/password auth. New devices (browsers) must be explicitly paired before the WebSocket connects. Loopback connections auto-approve silently, but remote connections (via Traefik) require manual approval via `kubectl exec deploy/openclaw -- openclaw devices approve <request-id>`. Symptom: the Control UI loads but shows "disconnected (1008): pairing required" even with a valid token. Fix: approve pending requests with `openclaw devices list` → `openclaw devices approve <id>`. For multi-user deployments behind OAuth2 Proxy, set `gateway.controlUi.dangerouslyDisableDeviceAuth: true` to skip per-device pairing when shared token auth is already configured.
- OpenClaw `trustedProxies` required for proxy header trust: Without `gateway.trustedProxies: ["10.42.0.0/16"]`, the gateway logs "Proxy headers from untrusted address" and treats all connections as direct (non-proxied). This breaks client IP detection and causes auth edge cases. Always configure `trustedProxies` with the pod CIDR when behind Traefik or any reverse proxy.
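Pulling the config-file lessons together (a sketch; the config file format and location are assumptions, and only the keys named above are used — bind mode and the token are supplied via `--bind lan` and `OPENCLAW_GATEWAY_TOKEN` rather than the config file):

```json
{
  "gateway": {
    "auth": { "mode": "token" },
    "trustedProxies": ["10.42.0.0/16"],
    "controlUi": { "dangerouslyDisableDeviceAuth": true }
  }
}
```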
## Operational Lessons
- Longhorn S3 backup credentials from Terraform: The `longhorn-s3-credentials` secret in `longhorn-system` must match the IAM user created by `terraform/modules/s3_backups/`. Get real values via `cd terraform/environments/aws && terraform output -raw backup_user_access_key` / `backup_user_secret_key`. Placeholder credentials (wrong length) cause `InvalidAccessKeyId` and the backup target shows `available: false`.
- Don't remove live kubectl patches before Helm upgrade: When a `kubectl patch` (e.g., a nodeSelector) is protecting pods, don't remove it until after `helm upgrade` deploys the equivalent constraint via chart values. Removing the patch first allows pods to reschedule on incorrect nodes during the gap. Sequence: update chart values → helm upgrade → verify pods → patches are now redundant.
- Stale k3s nodes after migration: When a node is replaced (e.g., mac-mini-agent → lima-k3s-agent), the old node object persists in `NotReady` state, generating ~12 alerts (NodeNotReady, TargetDown for DaemonSets, etc.). Fix: `kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data && kubectl delete node <old-node>`.
- amd64-only images on arm64 nodes: Container images built only for amd64 fail with `exec format error` when scheduled on arm64 nodes. The pod enters CrashLoopBackOff but the error message may not be obvious in `kubectl describe`. Fix: add `nodeSelector: {kubernetes.io/arch: amd64}` or nodeAffinity with `kubernetes.io/arch In [amd64]`. For Helm charts, update the `values.yaml` nodeAffinity section.
- PVE memory alerts are false positives on over-provisioned hosts: Proxmox hosts running multiple VMs intentionally use most RAM, so high memory usage alerts fire constantly. Monitor swap usage instead — `(1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5` from host node-exporters (`job="proxmox"`) indicates actual memory pressure.
- Prometheus additionalScrapeConfigs relabel pitfall: The relabel config `source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]` with `target_label: __address__` sets the address to just the port value (e.g., `9100:9100`). Must use two source labels with `[__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]` and `regex: (.+);(.+)` / `replacement: ${1}:${2}` to construct `ip:port`.
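The working relabel rule described above looks like this in full (a sketch; only the labels and regex named in the lesson):

```yaml
# additionalScrapeConfigs: build ip:port from pod IP + annotation
relabel_configs:
  - source_labels:
      - __meta_kubernetes_pod_ip
      - __meta_kubernetes_pod_annotation_prometheus_io_port
    separator: ";"
    regex: (.+);(.+)
    target_label: __address__
    replacement: ${1}:${2}
```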
## Networking
- Proxmox active-backup bond configuration pattern: For 10GbE + 1GbE failover, use `bond-mode active-backup` with `bond-primary enp1s0` (10GbE). The bond (`bond0`) replaces the bare NIC as the bridge port for `vmbr0`. Config: `bond-slaves nic0 enp1s0`, `bond-miimon 100`, `bond-primary enp1s0`. The bridge gets `bridge-ports bond0` and `bridge-vids 20 30` (only needed VLANs). If the 10GbE link fails, traffic falls back to 1GbE automatically. Verify with `cat /proc/net/bonding/bond0`.
- NAS split-VLAN isolation (both switch ports must match): When a NAS has two NICs on separate switch ports, BOTH ports must be assigned the same VLAN port profile. If Port A is on VLAN 30 (Storage) but Port B is still on VLAN 1 (Default), the NAS ARP-replies from both NICs. The UDM learns the MAC→VLAN mapping from the wrong port, causing return traffic to arrive on the wrong VLAN wire. The NAS receives ICMP/TCP but drops replies because they egress on the VLAN-1 NIC while the source IP is 192.168.30.x. Symptom: the NAS ARP-resolves (visible in `ip neigh`) and tcpdump shows outbound ARP requests, but all inbound ICMP/TCP/NFS is blackholed; every port shows `filtered` in nmap. Fix: ensure ALL switch ports connected to the NAS have the same native VLAN (e.g., a "Storage Only" profile). Verify via UniFi API: `port_overrides` must include entries for ALL NAS ports, not just one.
- DNS records must be updated after VLAN re-IP: After migrating nodes to new VLANs (e.g., flat 192.168.1.0/24 → VLAN 20 192.168.20.0/24), updating Terraform code is not enough — you must also run `terraform apply` to push the changes to Route53. Stale DNS records (internal wildcard, PVE hosts) will silently resolve to old IPs, causing TLS cert validation failures and unreachable dashboards. Always verify live DNS with `nslookup <host> 8.8.8.8` after a network migration. The `cert-manager-dns01` IAM user in `ansible/.env` does NOT have S3 permissions for terraform state — use the default AWS profile (`~/.aws/credentials`) for terraform operations.
- MetalLB IPAddressPool must be re-applied after VLAN re-IP: Updating `metallb_ip_range` in Ansible `group_vars/all.yml` does NOT update the live cluster — the Ansible playbook must be re-run (or `kubectl patch` the IPAddressPool directly). A stale pool causes MetalLB to continue assigning old-VLAN IPs. When the pool is updated, auto-assigned services grab IPs in creation order — NOT their previous assignments. Services with `loadBalancerIP` pointing to the old range go `<pending>` because the IP is no longer valid. Fix: (1) patch the IPAddressPool, (2) annotate Traefik with `metallb.universe.tf/loadBalancerIPs` and remove the stale `spec.loadBalancerIP` via JSON patch (`op: remove`, `path: /spec/loadBalancerIP`), (3) pin all other LB services via the `metallb.universe.tf/loadBalancerIPs` annotation. Never use both `spec.loadBalancerIP` AND the `metallb.universe.tf/loadBalancerIPs` annotation on the same service — MetalLB rejects the conflict. CRITICAL: the manifest file at `/var/lib/rancher/k3s/server/manifests/custom/metallb-config.yaml` must be updated on ALL THREE server nodes — the k3s Addon reconciler watches this file and will revert `kubectl patch` changes if the on-disk manifest still contains the old IP range. Updating only one server is insufficient; k3s syncs manifests via etcd and whichever server reconciles the addon next will apply whatever is on its local disk.
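The pool-plus-annotation pinning pattern looks roughly like this (a sketch; pool name, namespace, and IPs are illustrative, the annotation key is the one above):

```yaml
# IPAddressPool carrying the new VLAN range
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool            # illustrative name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.20.240-192.168.20.250
---
# Pin a LoadBalancer service via annotation. Do NOT also set
# spec.loadBalancerIP -- MetalLB rejects the conflict.
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: kube-system
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.20.240
spec:
  type: LoadBalancer
```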
## Media Stack Architecture
Context for AI sessions implementing the GPU-accelerated media stack. Read this section before modifying anything under `kubernetes/apps/media/`.
### Content Pipeline
```text
Jellyseerr (user requests)
  ↓
Radarr (movies) / Sonarr (TV)
  ↓ search indexers
Prowlarr → TorrentLeech
  ↓ add to seedbox
RapidSeedbox (<seedbox-ip>:2222)
  rTorrent downloads → /home/user/Downloads/
  ↓ AutoTools plugin (move on complete)
/home/user/Downloads/complete/
  ↓ ratio plugin: seed until 1.0 ratio OR 2 weeks, then stop
  ↓ rsync --partial over SSH (CronJob every 4h)
  ↓ sync to /volume1/media/downloads/{movies,tv}
DXP4800 NAS (192.168.30.10, Storage VLAN 30)
  ↓ Radarr/Sonarr import & rename → /volume1/media/{movies,tv}
  ↓ NFSv3
Jellyfin (GPU transcode) → Clients
  ↑
Bazarr (subtitle management)
```
### Key Design Decisions
- Jellyfin is monolithic: Cannot split into "direct play" and "transcode" Deployments — the transcode decision is per-stream, internal to the server. Deploy as a single Deployment with `preferredDuringSchedulingIgnoredDuringExecution` GPU node affinity. If the GPU node dies → reschedules to any worker → software transcode fallback.
- NFS for media, Longhorn for state: NFS is `ReadOnlyMany` or `ReadWriteMany` depending on the service. Longhorn is for config DBs, PostgreSQL, and any stateful components. Never use iSCSI for media — it's `ReadWriteOnce` block storage.
- NFSv3, NOT NFSv4: UGOS Pro's NFS server doesn't expose a pseudo-root filesystem. NFSv4 mounts fail with "No such file or directory" even though the export path is correct. NFSv3 works immediately. All PVs use `mountOptions: [nfsvers=3]`.
- Seedbox sync needs a K8s CronJob: `rclone` running in a pod with both SFTP (to seedbox) and NFS (to NAS) access. Credentials in a K8s Secret created out-of-band. Schedule: every 4h or on-demand.
- Arr stack for content automation: Radarr (movies) + Sonarr (TV) + Prowlarr (indexers) manage the full download lifecycle. Jellyseerr sends requests → Radarr/Sonarr search via Prowlarr → TorrentLeech → rTorrent downloads to seedbox `~/Downloads/` → AutoTools moves completed downloads to `~/Downloads/complete/` → ratio plugin seeds to 1.0 or 2 weeks then stops → rsync syncs `complete/` to the NAS downloads/ dir → Radarr/Sonarr import & rename into the library.
- Per-service NAS accounts: Each arr app runs as its own NAS UID (10010-10015) via PUID/PGID env vars (linuxserver images) or `runAsUser` (Jellyfin). NFS `root_squash` preserves UIDs. Group `media-services` (GID 10000) provides shared read access.
- IngressRoutes per-service, no OAuth: Each media service has its own IngressRoute + Certificate in the media namespace. OAuth is not used — TV apps and arr API keys are incompatible with forwardAuth.
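The NFSv3 mount option above ends up on every media PV, roughly like this (a sketch; PV name, capacity, and access mode are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-library           # illustrative name
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadOnlyMany              # read-only for Jellyfin playback
  mountOptions:
    - nfsvers=3                 # UGOS Pro's NFSv4 pseudo-root is broken
  nfs:
    server: 192.168.30.10
    path: /volume1/media
```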
- GPU passthrough requires q35 + OVMF: Existing VMs use i440fx/SeaBIOS. The GPU worker VM must be destroyed and recreated with `machine = "q35"` and `bios = "ovmf"`. Drain workloads first and verify Longhorn replicas exist elsewhere.
- `hostpci` requires `root@pam` auth: The bpg/proxmox Terraform provider cannot assign PCI devices with API tokens. Verify provider auth before attempting GPU passthrough.
- Inter-VLAN routing for NFS: k3s nodes (VLAN 20) must reach the NAS (VLAN 30). The UDM Pro must allow VLAN 20 → VLAN 30 on TCP 2049 (NFS) + UDP 111 (portmapper).
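In bpg/proxmox provider terms, the q35/OVMF requirement might look like this (a sketch only; `machine` and `bios` values come from the lesson above, while the `hostpci` block shape, resource name, and PCI address are assumptions to verify against the provider docs):

```hcl
resource "proxmox_virtual_environment_vm" "gpu_worker" {
  name      = "k3s-gpu-worker"   # hypothetical VM name
  node_name = "pve1"             # hypothetical PVE node

  machine = "q35"    # required for PCIe passthrough
  bios    = "ovmf"   # UEFI; i440fx/SeaBIOS VMs must be recreated

  # PCI passthrough requires root@pam provider auth, not an API token.
  hostpci {
    device = "hostpci0"
    id     = "0000:01:00.0"      # assumed GPU PCI address
    pcie   = true
  }
}
```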
- DXP4800 is NOT Synology: It runs UGOS Pro. NFS/SMB configuration is in UGOS Control Panel → File Service, not Synology DSM. Don’t assume DSM-style paths or APIs.
- UGOS Pro REST API auth is unusable: The API (`/webman/login.cgi`) requires client-side RSA encryption of credentials using a server-provided public key. This makes API automation impractical. Use SSH instead (enable temporarily via Control Panel → Terminal → Enable SSH; auto-disables after 6h).
- NAS SSH username is `mat`, not `admin`: SSH to the DXP4800 uses `ssh mat@192.168.30.10`, password `<nas-password>`. The `admin` username only works for the UGOS Pro web UI, not SSH. sudo also works: `echo "<nas-password>" | sudo -S <command>`. sshpass on macOS Homebrew fails due to TTY issues with newer OpenSSH — use `expect` or interactive SSH instead.
- SCP to NAS /tmp fails (SFTP server restriction): `scp` uses the SFTP subsystem, which restricts writable paths on UGOS Pro — `scp mat@nas:/tmp/file` returns "No such file or directory". Direct shell writes to /tmp DO work via SSH interactive sessions (`printf '...' > /tmp/file`). Use base64-encoded file content split into 80-char chunks sent via `expect` + `printf >> /tmp/file` to transfer scripts to the NAS.
- UGOS Pro `mat` user has no home directory: `/home/mat` does not exist; mat drops to `/` on login. Do not depend on `~` for script paths. Store persistent scripts in `/volume1/scripts/` (survives reboots, same volume as media and data).
- UGOS Pro cron.d entries likely persist across reboots: Unlike `/etc/exports` (which is regenerated from the UGOS Pro database on every boot), `/etc/cron.d/` uses standard Debian cron and is NOT managed by UGOS Pro. Entries installed there should survive reboots. However, if a cron job disappears after a NAS update, re-run `scripts/nas/deploy-nas-alerts.sh` to reinstall.
- NAS VLAN 30 has outbound internet access: The NAS (192.168.30.x) can POST to external URLs (Slack webhooks, AWS SES) — confirmed via `curl` in 2026-02. No explicit inter-VLAN firewall blocks outbound traffic from the Storage VLAN. This means the NAS can use SES SMTP directly without going through the cluster-internal email-gateway.
- NAS Slack alerts script location: `/volume1/scripts/nas-slack-alerts.sh` (NAS) + `scripts/nas/nas-slack-alerts.sh` (repo). Uses `journalctl --after-cursor` to track position. If alerts stop, check that `/volume1/scripts/.nas-alerts-cursor` exists and `/var/log/nas-slack-alerts.log` is not filling with errors.
- `/etc/exports` on UGOS Pro does NOT persist across reboots: UGOS Pro regenerates the NFS config on boot from its internal database. Writing directly to `/etc/exports` and running `exportfs -ra` restores access immediately but will be lost on the next reboot. The persistent fix is via the UGOS Pro UI: Control Panel → File Services → NFS → edit the share and add `192.168.20.0/24` as an allowed client. After a NAS reboot, if NFS gives "access denied", re-apply via SSH: `echo "<nas-password>" | sudo -S bash -c "printf '/volume1/media 192.168.20.0/24(rw,sync,no_subtree_check,root_squash)\n' > /etc/exports" && echo "<nas-password>" | sudo -S exportfs -ra`.
- UGOS Pro NFS real persistence mechanism is `/usr/local/bin/restore-nfs-exports.sh` + `@reboot` cron: Despite the claim that "UGOS Pro manages NFS from its internal database", the actual mechanism is simpler — the script `/usr/local/bin/restore-nfs-exports.sh` is called via `/etc/cron.d/nfs-exports` at `@reboot`. Editing this script directly (as root/sudo) is safe and survives reboots. The UGOS Pro UI writes both the NFS global config to `/etc/nfs.json` AND updates this restore script, so you can bypass the UI entirely by editing the script. It lives in the overlay filesystem (`/overlay/upper/usr/local/bin/`), so it persists across NAS reboots. Use `printf '...\n' >> /etc/exports` in the script (not echo) to avoid newline issues. Also update `/etc/exports` directly and run `exportfs -ra` for immediate effect. SSH access: `ssh mat@192.168.30.10` (via jump host k3s-server-1, which routes VLAN 20 → VLAN 30), password `<nas-password>`; sudo works.
- NFS export shell quoting through nested SSH: When writing `/etc/exports` on the NAS via triple-nested SSH (local → jump → NAS), `echo` with single-quoted strings often produces empty files due to quote escaping. Use `printf` instead: `printf '/volume1/media 192.168.20.0/24(rw,sync,no_subtree_check,root_squash)\n' | sudo tee /etc/exports`.
- Seedbox credentials are secrets: Seedbox SFTP creds, VPS control panel creds, and NAS admin creds must NEVER be committed. Store in K8s Secrets created out-of-band.
## Seedbox SFTP & rclone Debugging Lessons
- rclone CronJob `activeDeadlineSeconds` kills large syncs: A 4-hour deadline (14400s) is fine for incremental syncs (~GBs) but kills the initial bulk sync when hundreds of GBs are waiting on the seedbox. At ~2-5 MiB/s over SFTP, 292 GiB needs 1-2 days. Every scheduled run restarts from zero because rclone `sync` over SFTP doesn't resume partial file transfers — so nothing ever completes and no files appear in Jellyfin. Fix: remove `activeDeadlineSeconds` entirely. `concurrencyPolicy: Forbid` already prevents overlapping runs. After the initial sync, subsequent 4-hourly runs are incremental and fast.
- rsync `--partial` over SSH is better than rclone SFTP for large syncs: rclone `sync` over SFTP writes to temp files and discards them on interruption — no resume. rsync `--partial --partial-dir=.rsync-partial` keeps partial files in a hidden directory and resumes byte-level on the next run. For 200+ GiB initial syncs at 2-5 MiB/s, this is the difference between 1-2 days total vs an infinite retry loop.
- Alpine containers running as non-root can't `apk add`: If a pod has `runAsUser: 10012` (non-root), `apk add` fails silently. Fix: use an init container running as root (`securityContext.runAsUser: 0`) to install packages and copy binaries + shared libraries to a shared `emptyDir` volume. The main container then adds the tools dir to `PATH` and `LD_LIBRARY_PATH`.
- SSH in containers fails with "No user exists for uid N": The OpenSSH client requires the current UID to exist in `/etc/passwd`. Alpine base images only have root. Fix: in the init container, write a custom passwd file to the shared tools volume, then mount it as `/etc/passwd` (via `subPath`) into the main container.
- Seedbox download paths are case-sensitive: The RapidSeedbox default download directory is `/home/user/Downloads/` (capital D) on the DATA partition (1.2TB). The lowercase `/home/user/downloads/complete/` is on the tiny OS disk (53GB, 0 bytes free). Getting this wrong causes all torrents to stop immediately. Both Radarr/Sonarr download client configs AND remote path mappings must use the correctly cased path.
- SFTP subsystem can be broken while SSH auth works: The SSH daemon may accept connections, complete key exchange, and authenticate successfully — but the sftp-server subprocess never sends the SFTP version/init packet. This manifests as an indefinite hang after "Authenticated using password" in every SFTP client (rclone, native sftp, lftp, curl/libssh2). Always test with native `sftp -P 2222 user@host` FIRST before debugging rclone config.
- rclone SFTP config for restricted servers: For seedboxes that are SFTP-only (no shell): `shell_type = none`, `md5sum_command = none`, `sha1sum_command = none`, `disable_hashcheck = true`, `key_use_agent = false`. Without these, rclone tries to run shell commands that hang or fail.
- rclone known_hosts format: For non-standard ports, the format is `[host]:port keytype base64key`. Scan with `ssh-keyscan -p 2222 host 2>/dev/null`. Mount as a ConfigMap file and reference via `known_hosts_file` in rclone.conf.
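Combining the SFTP lessons into one rclone.conf sketch (remote name, host, user, and file paths are illustrative; the options are the ones listed above):

```ini
[seedbox]
type = sftp
host = <seedbox-ip>
port = 2222
user = user
# value produced by `rclone obscure`, never the plaintext password
pass = <obscured-password>
# SFTP-only server: no shell, no remote hash commands
shell_type = none
md5sum_command = none
sha1sum_command = none
disable_hashcheck = true
key_use_agent = false
# file contains entries in the non-standard-port format:
#   [<seedbox-ip>]:2222 ssh-ed25519 <base64key>
known_hosts_file = /config/known_hosts
```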
killhung processes and wait before retrying. The seedbox may need a service restart via the control panel. - rTorrent XMLRPC uses d.multicall.filtered: NOT
d.multicall2. The call signature isd.multicall.filtered("","",field1,field2,...)with two empty string params before the field list. - rTorrent
d.directory_base.setrequires stopping first: You cannot change a torrent’s download directory while it’s active. XMLRPC returns a fault. Pattern:d.stop→d.directory_base.set→d.start. This applies to any migration that moves files and needs to update rTorrent’s internal tracking. - ruTorrent plugin settings persist in dat files, not .rtorrent.rc: AutoTools, ratio, and other ruTorrent plugins store their config in
/var/www/rutorrent/share/users/<user>/settings/<plugin>.datas PHP serialized objects. These survive rTorrent restarts (the plugins re-register XMLRPC event handlers when ruTorrent loads). On RapidSeedbox,.rtorrent.rcis auto-generated by the panel — direct edits are overwritten. Use ruTorrent plugins for persistent configuration instead. - Configure ruTorrent plugins via POST to action.php: AutoTools:
POST /rutorrent/plugins/autotools/action.phpwithenable_move=1&path_to_finished=/path&fileop_type=Move&add_name=1. Ratio:POST /rutorrent/plugins/ratio/action.phpwithrat_action0=0&rat_min0=100&rat_max0=200&rat_time0=336&rat_name0=default-seed&default=0. Must use HTTPS to localhost from the seedbox — external DNS for*.seedbox.xipmay not resolve. - Seedbox SSH rate limiting: Too many rapid SSH connections (>15 in quick succession) trigger rate limiting — connections fail with exit code 255. Use
ConnectTimeout=20and space out connections. This is especially relevant during migration scripts that make many XMLRPC-over-SSH calls. - rclone SFTP password must be obscured: Use
rclone obscure <plaintext>to generate the value forRCLONE_SFTP_PASS. The K8s Secret stores the obscured form, NOT the plaintext.
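The restricted-SFTP settings above combine into a remote definition like this sketch (remote name, host, and mount path are placeholders; the `pass` value must be the `rclone obscure` output, never plaintext):

```ini
# Hypothetical rclone.conf remote for a shell-less seedbox (names are placeholders)
[seedbox]
type = sftp
host = seedbox.example.net
port = 2222
user = user
pass = <output of rclone obscure>
shell_type = none
md5sum_command = none
sha1sum_command = none
disable_hashcheck = true
key_use_agent = false
known_hosts_file = /config/known_hosts
```

The `known_hosts` file referenced on the last line is the ConfigMap-mounted file in `[host]:port keytype base64key` format.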
## Seedbox Details (Architecture Only — No Credentials)
- Provider: RapidSeedbox
- IP:
- Services: Deluge (web UI), ruTorrent (web UI), SFTP (:2222), FTP (:21), OpenVPN, Remote Desktop (:300)
- Plex: Available but not unlocked (using Jellyfin on-prem instead)
- Sync method: `rsync --partial` over SSH from seedbox `~/Downloads/complete/` → NFS mount on DXP4800 (via k3s CronJob `seedbox-sync`). Only syncs completed downloads — the AutoTools plugin moves finished torrents from `~/Downloads/` to `~/Downloads/complete/`. Originally rclone SFTP, but rclone can't resume partial files over SFTP; rsync resumes byte-level from where it left off via `--partial-dir=.rsync-partial`.
- Torrent client: rTorrent (engine) + ruTorrent (PHP web UI). Runs in a SCREEN session. AutoTools plugin: move completed to `~/Downloads/complete/`. Ratio plugin: seed to 1.0 ratio or 2 weeks, then stop. Throttle: 5 downloads + 5 uploads max (`throttle.max_downloads.global.set = 5`, `throttle.max_uploads.global.set = 5`).
- rTorrent memory: ~63MB RSS (2.4% of 2.5GB) at steady state with ~33 torrents. Initial spike to ~300MB during startup while loading torrent metadata; settles within minutes.
- VPS Control Panel: master2.rapidseedbox.com:5656 (credentials in password manager)
- Secret name: `seedbox-ssh` in `media` namespace (keys: `SSH_HOST`, `SSH_PORT`, `SSH_USER`, `SSH_PASS` — plaintext, not rclone-obscured). Legacy `seedbox-sftp` and `seedbox-ftp` secrets still exist (unused).
- Known SFTP issue (2025-07, RESOLVED 2026-02): SFTP subsystem was broken server-side in July 2025. As of Feb 2026, SSH/SFTP are working again — rsync over SSH confirmed operational.
- Seedbox session-creation hang (2025-07, RESOLVED): Was a data partition mount issue. Resolved — SSH sessions now work normally.
- Don't exhaust seedbox with test connections: Each FTP/SSH connection that hangs after auth consumes memory (~20–50MB per hung process). With only 2.5GB RAM, after ~20 hung tests the ENTIRE server becomes unresponsive (even HTTPS dies). Always use tight timeouts (`-m 10`) and make ONLY ONE test per reboot.
- VPS panel VM ID changes per session: The `vi=` parameter for `_vm_remote.php` API calls changes every login session. Must scrape it from `control.php` JavaScript (`vi:"<id>"`), not reuse old values.
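The `vi:"<id>"` scrape is a plain grep. This sketch runs the pattern against a made-up sample of the panel JavaScript — in real use the input would come from fetching `control.php` with an authenticated session:

```shell
# Extract the per-session VM id from control.php JavaScript.
# Real input: curl -s --cookie "$SESSION" 'https://master2.rapidseedbox.com:5656/control.php'
sample='var cfg = { vi:"abc123", theme:"dark" };'   # made-up sample page content
vm_id=$(printf '%s' "$sample" | grep -o 'vi:"[^"]*"' | cut -d'"' -f2)
echo "$vm_id"   # the id to pass as vi= to _vm_remote.php calls
# → abc123
```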
## Exportarr Instrumentation Patterns
- **Exportarr v2.3.0 is distroless:** Image `ghcr.io/onedr0p/exportarr:v2.3.0` uses `gcr.io/distroless/static:nonroot` (UID 65534). No shell available — can't use wrapper scripts or shell commands.
- **CONFIG option only parses XML:** Exportarr's `CONFIG` env var reads `ApiKey` + `Port` from *arr's `config.xml`. Works for Radarr/Sonarr/Prowlarr. Does NOT work for Bazarr (uses YAML config, not XML). Bazarr needs an explicit `API_KEY` from a K8s secret.
- **fsGroup grants sidecar read access:** Pod-level `fsGroup: 10000` adds supplemental group 10000 to all containers (including exportarr at UID 65534), enabling read access to config.xml files created by linuxserver images (PGID=10000).
- **JellyseerrDown alert was permanently firing:** The original alert used `absent(up{job="jellyseerr"})` but no ServiceMonitor existed for Jellyseerr (no native Prometheus metrics). The `up` metric was always absent, so `absent()` always returned 1, causing the alert to fire permanently. Fixed to use `kube_deployment_status_replicas_available` instead.
- **Exportarr sidecar port 9707 is safe:** Each *arr app runs in its own pod, so all exportarr sidecars can use the same port 9707 without conflicts.
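The replacement availability alert looks roughly like this rule fragment — a sketch only; the `for:` duration, severity label, and annotation text are assumptions, while the `expr` pattern is the fix described above:

```yaml
# Sketch of a deployment-based availability alert (durations/labels assumed)
- alert: JellyseerrDown
  expr: kube_deployment_status_replicas_available{namespace="media", deployment="jellyseerr"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Jellyseerr has no available replicas"
```

Unlike `absent(up{...})`, this expression is backed by kube-state-metrics, so it only fires when the deployment truly has zero ready replicas.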
## Mermaid Diagrams (Wiki.js 2.5)
Wiki.js 2.5 bundles Mermaid 8.8.2 (hardcoded in `package.json`, never updated). Many modern Mermaid features silently fail with "Syntax error in graph". Wiki.js 3.0 does not exist. These rules apply to all diagram content on the wiki.
### Syntax NOT supported in Mermaid 8.8.2
- **`direction TB`/`direction LR` inside subgraphs:** Only the top-level `graph TB`/`graph LR` directive controls direction. Subgraph-level `direction` was added in Mermaid 9.x. Remove any `direction` keyword inside subgraphs — the parent graph direction applies.
- **`<-->` bidirectional arrows:** Not supported. Use two separate one-way arrows: `A --> B` and `B --> A`.
- **`&` multi-target connections** (e.g., `A --> B & C & D`): Not supported. Expand to individual lines: `A --> B`, `A --> C`, `A --> D`. Same for source-side: `A & B --> C` becomes `A --> C`, `B --> C`.
- **`:::class` shorthand:** Not supported. Use `style` commands instead.
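A quick before/after of the same small graph (node names are arbitrary). This breaks in 8.8.2:

```
graph LR
  A <--> B
  A --> C & D
```

The 8.8.2-safe rewrite with one-way arrows and expanded targets:

```
graph LR
  A --> B
  B --> A
  A --> C
  A --> D
```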
### Wiki.js GraphQL API patterns
- **Mutations hang forever (Wiki.js 2.5 bug):** `pages.create` and `pages.update` mutations never return a response. Use fire-and-forget: 3s socket timeout, 8s render wait, then read-back to verify.
- **Tags field required:** All page create/update mutations MUST include a `tags` array (can be empty). Omitting it causes silent failures.
- **Use GraphQL variables for content:** Pass page content as a `$content: String!` variable, not inlined in the query string. Inline content breaks on special characters.
- **Auth tokens expire ~30 min:** Re-authenticate before bulk operations.
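Putting those rules together, a page-update call might be shaped like this sketch — the argument names beyond `content` and `tags`, and the response selection, are assumptions about the Wiki.js 2.5 schema, not verbatim from it:

```graphql
# Sketch: content passed as a variable, tags always present (may be empty)
mutation ($id: Int!, $content: String!, $tags: [String]!) {
  pages {
    update(id: $id, content: $content, tags: $tags) {
      responseResult { succeeded errorCode message }
    }
  }
}
```

Remember this mutation may never return (the 2.5 hang bug): send it with a short socket timeout and verify via a follow-up `pages.single` read-back instead of trusting the response.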
### Diagram authoring rules
- Always start with `graph TB` or `graph LR` at the top level only.
- Never use `direction` inside subgraphs.
- Never use `<-->`, `&`, or `:::` in connection syntax.
- Use `-->` for directional, `---` for undirected, `-.->` for dashed.
- Emoji in node labels works fine (e.g., `A["🔒 Firewall"]`).
- Test with `python3 scripts/upload-wiki-diagrams.py --dry-run` before uploading.
- The upload script supports idempotent create-or-update via a `get_page_by_path()` check.
## Mellanox ConnectX-3 Temperature Monitoring
MCX311A-XCAT (PCI ID `0x1003`) on pve2 (`01:00.0`) and pve3. Driver: mlx4. Role: `ansible/roles/mellanox_mft/`. Playbook: `ansible/playbooks/proxmox-mellanox-mft.yml`.
### Key Facts
- **Debian `mstflint` ≠ NVIDIA MFT:** Debian's package only provides the firmware flasher. It does NOT include the `mst` device manager, `mget_temp`, or `mget_temp_ext`. Installing `mstflint` via apt is not enough.
- **Full MFT download URL (v4.29.0 — v4.30/4.31 are 404):** `https://content.mellanox.com/MFT/mft-4.29.0-131-x86_64-deb.tgz`
  - Extracts to `/tmp/mft-4.29.0-131-x86_64-deb/`
  - Userspace debs: `DEBS/mft_4.29.0-131_amd64.deb`
  - DKMS deb: `SDEBS/kernel-mft-dkms_4.29.0-131_all.deb`
- **DKMS build always fails on Proxmox kernels:** `mst_pci_bc.c` can't find `nnt_ioctl.h` because the DKMS Makefile uses `EXTRA_CFLAGS= -I$(PWD)/$(NNT_DRIVER_LOCATION)`, where `$(PWD)` resolves to the kernel source dir in DKMS context, not the module source dir.
  - Workaround: Copy all `*.h` from `/usr/src/kernel-mft-dkms-4.29.0/nnt_driver/` into `mst_backward_compatibility/mst_pci/` and `mst_backward_compatibility/mst_pciconf/`, then `make KPVER=$(uname -r)` from each directory. Install the resulting `.ko` to `/lib/modules/$(uname -r)/extra/` manually. The Ansible role handles this automatically.
- **ConnectX-3 not in MFT 4.29's device list:** `mst start` creates an empty `/dev/mst/` because ConnectX-3 (`0x1003`) predates the supported device list (oldest entry is ConnectX-4, `0x1013`). Fix: `mst start --with_unknown`. Creates `/dev/mst/mt4099_pciconf0` and `/dev/mst/mt4099_pci_cr0`.
- **Use `mget_temp_ext`, not `mget_temp` or `mstmget_temp`:** `mget_temp_ext -d mlx4_0` takes the InfiniBand device name from `/sys/class/infiniband/`, not the `/dev/mst/` path. This is the tool the Prometheus collector uses.
- **mlx4 exposes `/sys/class/infiniband/mlx4_0` in Ethernet-only mode:** No InfiniBand cable needed. The sysfs path exists as long as the mlx4 driver is loaded.
- **Collector already in Debian package:** `prometheus-node-exporter-collectors` ships the `mellanox_hca_temp` script and `prometheus-node-exporter-mellanox-hca-temp.{timer,service}` units. Just enable the timer after MFT + mst-startup are configured.
- **Prometheus metric:** `node_infiniband_hca_temp_celsius{device="mlx4_0"}`, collected every 60s via textfile at `/var/lib/prometheus/node-exporter/mellanox_hca_temp.prom`.
- **Expected operating temps:** ConnectX-3 ASIC runs hot — 80–95°C is normal (max ASIC temp ~110°C). Dashboard thresholds: yellow >80, red >95.
- **Grafana panels:** id 8 (gauge) + id 9 (timeseries) at `y=24` in `kubernetes/core/grafana-dashboard-proxmox-hardware.yaml`. Query: `node_infiniband_hca_temp_celsius{job="proxmox"}`.
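The textfile collector boils down to formatting one gauge line. This sketch hardcodes a sample temperature instead of calling `mget_temp_ext -d mlx4_0` (which needs the hardware present), then prints the line node-exporter scrapes from the `.prom` file:

```shell
# Sketch: emit the Mellanox HCA temp metric in Prometheus textfile format.
# A real run would do: temp=$(mget_temp_ext -d mlx4_0)
temp=86   # sample value; 80-95 C is normal for a ConnectX-3 ASIC
printf 'node_infiniband_hca_temp_celsius{device="mlx4_0"} %s\n' "$temp"
# The shipped collector writes this to /var/lib/prometheus/node-exporter/mellanox_hca_temp.prom
```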
## Wiki.js API Key Invalidation
Symptom: `pages.create` returns `Forbidden` at `auth.js:47` even though `pages.list` works and the API key appears valid (correct length, correct `grp` field).
### Root Cause
Wiki.js stores RSA certificate pairs in the settings table (`key='certs'`). When `regenerateCertificates()` is called (via admin UI or upgrade), new RSA keys are generated and stored. Any existing API-key JWTs — which were signed with the old private key — fail signature verification with the new public key.

The auth middleware falls back to the guest user when JWT verification fails. Guest has `read:pages` only — enough to pass `pages.list` (which accepts `read:pages`) but NOT `pages.create` (which requires `write:pages` or `manage:system`). This creates a misleading partial-auth appearance.

Debugging signal: `site.config` (requires `manage:system` exclusively) → Forbidden, while `pages.list` → OK, means JWT verification has failed and the user is guest.
### Fix
Generate a new JWT using the current private key (stored encrypted in the DB) + current session secret (stored in `settings.sessionSecret.v`):

```js
// Run in: kubectl exec -n wiki <wiki-pod> -- NODE_PATH=/wiki/node_modules node /tmp/gen_key.js
// Substitute the <...> placeholders with values pulled from the DB (queries below).
const jwt = require('jsonwebtoken');
const sessionSecret = '<value of settings.sessionSecret.v from DB>';
const encryptedPrivKey = '<value of settings.certs.private from DB>';
const now = Math.floor(Date.now() / 1000);
const payload = {
  api: <key_id>,    // id from the apiKeys table
  grp: <group_id>,  // group the key belongs to
  iat: now,
  exp: now + 365 * 24 * 3600, // pick an expiry
  aud: 'urn:wiki.js',
  iss: 'urn:wiki.js',
};
// RS256-sign with the encrypted private key, unlocked via the session secret
const newToken = jwt.sign(payload, { key: encryptedPrivKey, passphrase: sessionSecret },
  { algorithm: 'RS256', noTimestamp: true });
console.log(newToken);
```

Then update the apiKeys table: `UPDATE "apiKeys" SET key='<newToken>' WHERE id=<key_id>;`
And update `WIKI_API_KEY` in your local environment.
### Key DB Queries

```sql
-- Get session secret (passphrase for encrypted private key)
SELECT value FROM settings WHERE key='sessionSecret';
-- Get encrypted certs (contains private key)
SELECT value FROM settings WHERE key='certs';
-- Update API key with newly generated JWT
UPDATE "apiKeys" SET key='<new_jwt>' WHERE id=2;
```
## kube-router IS Enforcing NetworkPolicies (Not Just Flannel)
Symptom: `pg_isready` returns "no response" from backup pods to postgres, even though postgres is 1/1 Running and `kubectl exec` inside postgres-0 works fine.
### Root Cause
This cluster runs kube-router as a NetworkPolicy controller (visible in `iptables -L FORWARD -n` as the `KUBE-ROUTER-FORWARD` chain). UFW's default forward policy is deny (routed). kube-router marks compliant packets `0x20000`; only marked packets are ACCEPTed. Packets from pods without a matching NetworkPolicy are dropped.

The postgres-backup pods have label `app.kubernetes.io/name: postgres-backup`. The namespace default-deny NetworkPolicy (`podSelector: {}`, policyTypes Ingress+Egress) blocks ALL their traffic. The existing `*-allow-egress` policies only cover `app.kubernetes.io/name: cardboard` (or `trade-bot`) pods.
### Debugging Path
1. `kubectl exec` into postgres pod → connect works (runs inside the pod, no NetworkPolicy check on self)
2. `kubectl run debug-pg --image=postgres:16-alpine -- pg_isready -h postgres ...` → "no response" (blocked by kube-router)
3. `nc` from k3s-agent-1 HOST to 10.42.0.228:5432 → "Connection refused" (TCP RST, not timeout — packet reaches destination but iptables on ingress path blocks it)
4. `sudo iptables -L FORWARD -n` on any node → reveals `KUBE-ROUTER-FORWARD` chain
### Fix
Add a NetworkPolicy allowing egress from backup pods. Added to `kubernetes/apps/cardboard/postgres-backup-cronjob.yaml` and `kubernetes/apps/trade-bot/postgres-backup-cronjob.yaml`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-backup-allow-egress
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: postgres-backup
  policyTypes: [Egress]
  egress:
    # DNS to kube-dns in kube-system
    - ports: [{port: 53, protocol: UDP}, {port: 53, protocol: TCP}]
      to: [{namespaceSelector: {matchLabels: {kubernetes.io/metadata.name: kube-system}}, podSelector: {matchLabels: {k8s-app: kube-dns}}}]
    # postgres in the same namespace
    - ports: [{port: 5432, protocol: TCP}]
      to: [{podSelector: {}}]
    # HTTPS out, excluding private ranges
    - ports: [{port: 443, protocol: TCP}]
      to: [{ipBlock: {cidr: 0.0.0.0/0, except: [10.0.0.0/8, 192.168.0.0/16]}}]
```
Note: postgres pods already have the `app.kubernetes.io/name: cardboard` (or `trade-bot`) label, so the existing `*-allow-ingress` policies cover their ingress — no separate ingress rule needed for postgres.
### Key Fact
The postgres pod labels matter: postgres StatefulSets use `app.kubernetes.io/name: cardboard` (same as the web app), which gives them ingress coverage from the existing `cardboard-allow-ingress` policy. If postgres were labeled differently, it would need its own ingress NetworkPolicy.
## proxmox-watchdog: nodeSelector Required for Kasa Access — [HISTORICAL — Lima requirement resolved 2025-06-25]
Symptom was: Watchdog pod in ImagePullBackOff, then CrashLoop when Kasa unreachable.

### Architecture Lessons (historical)
- Kasa HS300 is on 192.168.1.x — cluster nodes are on the 192.168.20.x VLAN with no inter-VLAN routing. Previously only the lima node (192.168.1.56) could reach Kasa, which is why the watchdog was pinned to Lima.
- Resolution (2025-06-25): The proxmox-watchdog was updated to use `nodeSelector: kubernetes.io/arch: amd64` along with `hostNetwork: true`. With hostNetwork, pods on VLAN 20 nodes can reach 192.168.1.x targets directly (the UDM Pro routes inter-VLAN). The Lima node requirement was eliminated.
- Multi-arch image no longer required — amd64-only build now used (`--platform linux/amd64 --provenance=false`).
### Kasa Degraded Mode
If Kasa is unreachable (offline, DHCP lease changed, physical disconnect), the watchdog retries with backoff [5, 15, 30, 60, 120] seconds, then runs in degraded mode: it monitors Proxmox host health but cannot power-cycle. The `power_cycle_outlet()` and `get_outlet_power_metrics()` methods guard with `if not self.strip: return`.

To find Kasa's new IP if it moves: `nmap -sn 192.168.1.0/24 | grep -i 'tp-link\|kasa'` or check the UniFi client list. Then: `kubectl patch configmap -n proxmox-watchdog watchdog-config --type=merge -p '{"data":{"KASA_IP":"NEW_IP"}}'` + pod restart.
## Jellyfin Has No Native /metrics Endpoint
Symptom: `JellyfinDown` Prometheus alert always fires even when Jellyfin is running.

Jellyfin does not expose a `/metrics` endpoint natively. A ServiceMonitor pointing to it always gets "no data", which triggers `absent()` alerts.

Fix: Delete the ServiceMonitor. Change the alert to use `kube_deployment_status_replicas_available{namespace="media", deployment="jellyfin"} == 0`.

Note: Jellyfin's IngressRoute uses `traefik.io/v1alpha1` (not `traefik.containo.us`), and the hostname is `jellyfin.k3s.internal.zolty.systems`.
## LiteLLM / Open WebUI
- **LiteLLM disables end_user in Prometheus by default:** In `litellm/utils.py`, function `get_end_user_id_for_cost_tracking` returns `None` for the `end_user` label when `service_type == "prometheus"` unless `enable_end_user_cost_tracking_prometheus_only` is `true`. This is intentional, to avoid high-cardinality metrics. Fix: set `litellm_settings.enable_end_user_cost_tracking_prometheus_only: true` in config.yaml.
- **Open WebUI doesn't pass `user` in the request body for standard models:** Only pipeline-type models get `user` injected in the chat completions request body. Standard OpenAI-compatible backends receive no user identification in the request body at all.
- **Header-based user tracking pipeline:** Open WebUI's `ENABLE_FORWARD_USER_INFO_HEADERS=true` sends `X-OpenWebUI-User-Email` (and `X-OpenWebUI-User-Name`, `X-OpenWebUI-User-Id`, `X-OpenWebUI-User-Role`) headers on every LLM request. LiteLLM's `general_settings.user_header_name: "X-OpenWebUI-User-Email"` reads this header and maps it to the `end_user` field used in Prometheus metrics. Both settings must be set for per-user cost tracking to work.
- **LiteLLM master_key required for Prometheus callback:** The `success_callback: ["prometheus"]` silently does nothing without a `master_key` set. The master key also becomes the API key that Open WebUI must use (`openaiApiKey` in Helm values).
- **LiteLLM OOM at 512Mi with master_key:** Enabling `master_key` increases LiteLLM's memory footprint significantly (auth middleware, user-tracking state). Minimum viable memory limit is 1Gi with a 512Mi request.
- **Open WebUI OOM at 1Gi on first boot:** On first start, Open WebUI downloads a ~657MB sentence-transformer model (`all-MiniLM-L6-v2`). This exceeds a 1Gi memory limit, causing OOMKill. Fix: increase the memory limit to 2Gi, OR disable RAG embedding (`RAG_EMBEDDING_ENGINE=openai`, `RAG_EMBEDDING_MODEL=""`).
- **LiteLLM /metrics vs /metrics/:** The LiteLLM metrics endpoint redirects `/metrics` → `/metrics/` with HTTP 307. Use `/metrics/` (trailing slash) or `curl -L`. A ServiceMonitor can use `/metrics` (k8s follows redirects).
- **Longhorn encrypted StorageClass requires cryptsetup:** The `longhorn-encrypted` StorageClass with LUKS2 needs `cryptsetup` installed on all worker nodes that might schedule the volume. Without it, the PVC stays `Pending` forever with no clear error. Install via `sudo apt-get install -y cryptsetup` on all agents.
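The per-user cost-tracking settings above land in two places. This sketch shows only the tracking-related LiteLLM config.yaml keys (the model list and other required sections are omitted; pulling the master key from an env var is an assumption about this deployment):

```yaml
# LiteLLM config.yaml fragment (sketch; only tracking-related keys shown)
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # required, or the prometheus callback is a no-op
  user_header_name: "X-OpenWebUI-User-Email"  # map forwarded header -> end_user label
litellm_settings:
  success_callback: ["prometheus"]
  enable_end_user_cost_tracking_prometheus_only: true
```

On the Open WebUI side, set `ENABLE_FORWARD_USER_INFO_HEADERS=true` in the deployment env so the `X-OpenWebUI-User-*` headers are actually sent.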
## Hugo / Blog (zolty-blog)
- **Hugo shortcodes don't render inside code blocks:** `{{< amzn >}}`, `{{< youtube >}}`, and all other shortcodes are NOT processed when placed inside triple-backtick fenced code blocks — Hugo treats everything between the fences as literal text. If you need a product link near a code example, place the shortcode BEFORE or AFTER the code block, never inside it.
- **HEIC images must be converted to JPEG for Hugo:** The media library stores originals as HEIC. Hugo and web browsers don't render HEIC. Convert with macOS `sips`: `sips -s format jpeg -s formatOptions 85 input.heic --out output.jpg`. Place converted images in the page bundle directory alongside `index.md`.
- **Page bundle image convention:** Each blog post is a page bundle (`hugo/content/posts/<slug>/index.md`). Images go directly in the same directory (not in a subdirectory). Reference with relative Markdown image syntax (`![alt text](image.jpg)`).
- **Amazon Associates shortcode:** Use `{{< amzn search="Product Name" >}}link text{{< /amzn >}}` for inline affiliate links. Amazon tag `zoltyblog07-20` is configured in `hugo.toml` params. The shortcode generates Amazon search URLs, not direct product links (more resilient to URL changes).
- **YouTube Data API v3 requires human OAuth consent:** Service accounts cannot upload to YouTube — only OAuth 2.0 with user consent works for the `youtube.upload` scope. The OAuth flow requires a one-time browser-based authorization. Until the GCP project passes YouTube's compliance audit, uploads are restricted to `private` visibility.
## GCP / YouTube Integration
- **GCP project for YouTube:** Project `youtube-k3s` in Google Cloud. YouTube Data API v3 enabled. OAuth 2.0 credentials (Web application type) with redirect URI `https://media-library.k3s.internal.zolty.systems/api/youtube/callback`.
- **YouTube upload quota:** YouTube Data API v3 has a 10,000-unit daily quota. Each video upload costs ~1,600 units — roughly 6 uploads per day maximum.
## AI Skills / Context Window Management
- **Generic skills duplicated across repos waste context:** Claude skills in `.claude/skills/` are loaded into context when relevant. In a multi-repo workspace (5 repos), generic skills (gh-cli, refactor, systematic-debugging, test-driven-development, git-commit) were identically duplicated across all repos — 25 files totaling ~401KB (~100K tokens). The `gh-cli` skill alone was 40KB per copy (42% of all skill content). Fix: removed all 5 generic skills from all repos (2026-02-23). Only project-specific skills are retained. If generic skills are needed again, keep them in ONE repo only; never duplicate across a multi-repo workspace.
- **Multi-repo workspaces multiply context baseline:** Each repo's `copilot-instructions.md` is injected into every message. With 5 repos open, that's ~28KB (~7K tokens) of baseline context before any question is asked, plus skill descriptions (~3KB) and instruction-file references (~2KB). Consider splitting into per-project workspaces for long sessions.
## NUT / UPS Shutdown
- **NUT on NAS: `upsd.users` only has the internal `nut` master user by default** — UGOS Pro ships NUT with one user: `[nut]` with `password = nut` and `upsmon master`. The `upsmon` slave user (used by k3s nodes) does NOT exist until you add it. Fix: appended an `[upsmon]` stanza with `password = <NUT_PASSWORD>` and `upsmon slave` to `/etc/nut/upsd.users` on the NAS and restarted `nut-server`. Then re-ran the Ansible `nut-client.yml` playbook with `NUT_MONITOR_PASSWORD` exported (same password) to deploy a correct `upsmon.conf` to all nodes.
- **NAS SSH is not reachable from Mac (no VLAN 30 route) but works from k3s nodes** — My Mac is on VLAN 1 with no route to VLAN 30. Directly SSHing to 192.168.30.10 returns "Network is unreachable". Use a k3s node as a jump: `ssh -F ssh_config k3s-server-1 "sshpass -p '<NAS_PASSWORD>' ssh -o StrictHostKeyChecking=no <nas-user>@192.168.30.10 'command'"`. sshpass must be installed on k3s nodes (it is — verified).
- **`nc -z` timeout to a port does NOT mean it's blocked** — nc to NAS port 3493 timed out, but upsc connected fine. upsd speaks its own NUT protocol; `nc -z` (zero I/O) appears to time out waiting for a NUT banner that never comes. Always test NUT connectivity with `upsc ups0@host:3493 ups.status`, not nc.
- **UDM Pro inter-VLAN has zero custom firewall rules** — All 5 networks (Default, Server VLAN20, Storage VLAN30, 2x WAN) have purpose=corporate with no firewall rules. Cross-VLAN routing is fully open. Port blocking from VLAN20→30 was entirely the NAS-side `UG_INPUT` chain (UGOS Pro iptables). That chain only ACCEPTs established/related + port 5443 (management); new TCP connections from VLAN20 are silently dropped UNLESS they hit the established-state rule. NFS works because the connection is pre-established. `upsd` connections need to be NEW, so they were timing out.
- **`NUT_MONITOR_PASSWORD` must be in `ansible/.env` before running `nut-client.yml`** — The role defaults to `changeme` via `lookup('env', 'NUT_MONITOR_PASSWORD') | default('changeme', true)`. If the env var is not exported before the playbook run, all 7 nodes get `changeme` as the upsmon password, which won't match the NAS config.
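The slave-user stanza appended to `/etc/nut/upsd.users` on the NAS looks like this (the password placeholder stands in for the real value kept in the password manager, and must match `NUT_MONITOR_PASSWORD` on the clients):

```
[upsmon]
    password = <NUT_PASSWORD>
    upsmon slave
```

After editing, restart `nut-server` on the NAS so upsd re-reads the users file, then verify from a k3s node with `upsc ups0@192.168.30.10:3493 ups.status`.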
## Docker Build & Deployment
- **Service Dockerfiles need .dockerignore:** The `alert-responder` and `media-controller` services in `services/` had no `.dockerignore`, shipping `__pycache__/`, `.git/`, test files, and IDE configs into the build context and image layers. Added `.dockerignore` to both services (2026-02-24) excluding dev files, tests, and CI artefacts.
## Media Stack / Prowlarr / TorrentLeech
- **Prowlarr database is not auto-restored after PVC rebind** — If the Prowlarr pod restarts and its Longhorn PVC gets a new binding (e.g. after node failure + PVC recreation), the SQLite database (`prowlarr.db`) starts fresh: no indexers, no app connections. Radarr/Sonarr will still show stale Prowlarr-synced indexers pointing to `prowlarr:9696/<id>/` that 404. Signs: the `prowlarr-config` PVC is 0–1 day old while other PVCs are weeks old. Fix: re-run the Prowlarr setup script (`/tmp/prowlarr-setup.py`) and FlareSolverr config script (`/tmp/prowlarr-flaresolverr.py`). TorrentLeech credentials are in `kubectl get secret torrentleech-credentials -n media`.
- **TorrentLeech requires FlareSolverr to authenticate** — TorrentLeech uses Cloudflare DDoS protection. Prowlarr's Cardigann scraper for TorrentLeech cannot pass the Cloudflare challenge, so all searches return the HTML login page (`<title>Login :: TorrentLeech.org</title>`) instead of results. Fix: deploy FlareSolverr (`kubernetes/apps/media/flaresolverr.yaml`), add it as an indexer proxy in Prowlarr (`POST /api/v1/indexerproxy` with `host: http://flaresolverr:8191/`), create a `flaresolverr` tag, and apply that tag to both the proxy and the TorrentLeech indexer. FlareSolverr runs a headless Chrome and relays requests through it.
- **Prowlarr app sync uses the per-app endpoint, not bulk** — The bulk sync endpoint `POST /api/v1/applications/sync` returns 405. Trigger per-app sync with `POST /api/v1/applications/{id}/sync` … but this also returns 405 in v1.28.2. Sync happens automatically when an app is added or an indexer changes. Don't rely on a manual sync endpoint — just wait ~30s or check the Radarr/Sonarr indexer list.
- **Prowlarr API key is not in the `arr-api-keys` secret** — Only `RADARR_API_KEY` and `SONARR_API_KEY` are in that secret. Get Prowlarr's key from `kubectl exec -n media deployment/prowlarr -c prowlarr -- cat /config/config.xml | grep ApiKey`.
- **Radarr's 429 during Prowlarr sync is a self-test artifact** — When Prowlarr syncs a new indexer to Radarr, Radarr validates it by querying `prowlarr:9696/<id>/api`. If Prowlarr returns 429 (rate-limiting during auth), the sync validation fails with "Unable to connect to indexer". This is transient — wait for TL authentication to complete and re-trigger the sync (or just make a new Jellyseerr request, which forces a search).
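The FlareSolverr indexer-proxy registration above might look roughly like this request body — a sketch only: the `host` value is from the fix above, but the other field names, the timeout, and the tag id are assumptions about Prowlarr's FlareSolverr settings contract and should be checked against the live API before use:

```json
{
  "name": "FlareSolverr",
  "implementation": "FlareSolverr",
  "configContract": "FlareSolverrSettings",
  "fields": [
    { "name": "host", "value": "http://flaresolverr:8191/" },
    { "name": "requestTimeout", "value": 60 }
  ],
  "tags": [1]
}
```

The `tags` id must be the `flaresolverr` tag's numeric id, and the same tag must also be applied to the TorrentLeech indexer or the proxy is never used.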