TL;DR

OpenClaw, my self-hosted AI trading agent, was running in a fat container with 46 Critical CVEs, no network restrictions, and no automated vulnerability scanning. I fixed all three: a multi-stage Dockerfile dropped the Critical CVE count to single digits, default-deny NetworkPolicies locked down traffic, and a daily CronJob triages Trivy scan results via a local LLM and posts a digest to Slack. Total cost of the automated triage: $0/day.

The problem with AI agent containers

AI agent containers are uniquely bad from a security perspective. They need:

  • Build tools (g++, make, python3-dev) for native module compilation
  • CLI tools (git, gh, curl, jq) for interacting with external services
  • Multiple runtimes (Node.js + Python) for the agent framework and trading libraries
  • Large dependency trees (npm + pip) that pull in hundreds of transitive packages

If you install all of this in a single Docker stage, every build tool and dev header ships in the final image. The OpenClaw image had g++, make, python3-pip, libopus-dev, and gnupg sitting in production. Each one carries its own CVE surface area.

Multi-stage Dockerfile

The fix is textbook but the details matter for AI agents. Two stages:

Stage 1 (builder): Install everything needed to compile native modules and set up the agent.

FROM node:22-bookworm AS builder

RUN apt-get update && apt-get install -y \
    g++ make python3-dev libopus-dev \
    git gpg curl jq ca-certificates

# Install GitHub CLI
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
    | gpg --dearmor -o /usr/share/keyrings/githubcli-archive-keyring.gpg \
    && echo "deb [arch=amd64 signed-by=...] ..." > /etc/apt/sources.list.d/github-cli.list \
    && apt-get update && apt-get install -y gh=2.88.1-1

# Install npm packages (compile native addons here)
RUN npm install -g qmd@2.1.0 openclaw@2026.4.9 clawhub@0.9.0

# Install Python packages to isolated directory
RUN pip install --target /opt/python-libs \
    robin_stocks fredapi finnhub-python pyotp duckdb

Stage 2 (runtime): Copy only the built artifacts.

FROM node:22-bookworm-slim

# Runtime deps only — no compilers, no dev headers
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 libopus0 curl jq ca-certificates sqlite3 \
    ripgrep ffmpeg imagemagick poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Copy built artifacts from builder
COPY --from=builder /usr/local/lib/node_modules /usr/local/lib/node_modules
# kubectl and helm are also installed in the builder (install steps elided above)
COPY --from=builder /usr/local/bin/kubectl /usr/local/bin/kubectl
COPY --from=builder /usr/local/bin/helm /usr/local/bin/helm
COPY --from=builder /usr/bin/gh /usr/bin/gh
COPY --from=builder /opt/python-libs /opt/python-libs

ENV PYTHONPATH=/opt/python-libs

# Recreate npm bin stubs (can't just copy symlinks across stages)
RUN npm link openclaw clawhub @tobilu/qmd

The key detail is the Python isolation. pip install --target /opt/python-libs installs packages to a specific directory instead of the system site-packages. In the runtime stage, we just copy that directory and set PYTHONPATH. No pip needed at runtime.

The npm link step at the end is necessary because npm global packages expose their CLIs as symlinks in /usr/local/bin/ that point into /usr/local/lib/node_modules/. Copying node_modules with COPY --from=builder brings the packages across but not those bin symlinks, so npm link recreates them in the runtime stage.

CVE impact

Before (single stage): 46 Critical, 120+ High CVEs. Most came from g++, make, python3-pip, and their dependency chains.

After (multi-stage): Single-digit Critical CVEs. The remaining ones are in runtime dependencies (Node.js itself, base OS packages) that can’t be eliminated without switching base images.

Default-deny NetworkPolicy

The second layer is network isolation. By default, Kubernetes pods can talk to anything — other pods, external services, the internet. For an AI agent that has kubectl access and GitHub credentials, that’s a large blast radius.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    - ports:
        - protocol: TCP
          port: 443

This says:

  • Ingress: Only pods from kube-system (Traefik) can reach pods in this namespace. No other namespace, no external traffic.
  • Egress: Pods can only do DNS lookups (port 53) and make HTTPS connections (port 443). No other outbound ports.

In practice, the OpenClaw agent can resolve DNS and reach anything over HTTPS (the Kubernetes API, GitHub, Slack) but nothing else. If a compromised dependency tries to exfiltrate data over HTTP (port 80) or a non-standard port, the NetworkPolicy blocks it. One limit worth noting: egress on port 443 is open to any destination, so exfiltration over HTTPS is still possible; tightening that requires ipBlock CIDRs or an egress proxy.

The gotcha with namespaceSelector

One thing that trips people up: namespaceSelector: {} (an empty selector) matches pods in all namespaces, which means pod IPs only, not all IPs. Host-network endpoints such as the Kubernetes API server, kubelets, and external hosts (Proxmox, NAS) have addresses outside the pod CIDR, so namespace selectors never match them. If your monitoring stack needs to scrape nodes, you need ipBlock rules for the node subnet.
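Poking that hole looks like the sketch below: a second, more specific policy that coexists with the default-deny (NetworkPolicies are additive, so allows from any matching policy apply). The namespace, pod labels, and node subnet here are assumptions for illustration; substitute your own.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-node-scrape
  namespace: monitoring          # assumed monitoring namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # assumed label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 192.168.1.0/24   # node subnet (assumption; adjust to your network)
      ports:
        - protocol: TCP
          port: 10250              # kubelet metrics endpoint
```

Because it only selects the scraper pods, every other pod in the namespace stays under the default-deny rules.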

Automated CVE triage

Trivy Operator runs in the cluster and creates VulnerabilityReport CRDs for every running container. The reports exist — but nobody reads them unless something breaks. I wanted a daily digest that surfaces Critical and High CVEs without me having to remember to check.
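The triage starts from those CRDs. An abridged sketch of what one looks like, with field names following Trivy Operator's v1alpha1 schema; the metadata and counts here are illustrative, and the CVE shown is one of the axios findings covered later:

```yaml
apiVersion: aquasecurity.github.io/v1alpha1
kind: VulnerabilityReport
metadata:
  name: replicaset-openclaw-v2-openclaw   # named after the scanned workload (illustrative)
  namespace: openclaw
report:
  summary:
    criticalCount: 2
    highCount: 7
  vulnerabilities:
    - vulnerabilityID: CVE-2025-62718
      severity: CRITICAL
      resource: axios
      installedVersion: "1.13.6"
      fixedVersion: "1.15.0"
```

Everything the digest needs (severity counts, affected package, fix version) is already structured, so the agent's job is reading, filtering, and summarizing rather than scanning.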

The CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: openclaw-security-triage
  namespace: openclaw
spec:
  schedule: "0 9 * * *"
  timeZone: "America/New_York"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300
      template:
        spec:
          serviceAccountName: openclaw-patrol
          containers:
          - name: triage
            image: harbor.k3s.internal.zolty.systems/production/openclaw-v2:latest
            command: ["uv", "run", "python", "-m", "openclaw.agents.security"]
            resources:
              requests:
                cpu: 100m
                memory: 256Mi
              limits:
                cpu: 500m
                memory: 512Mi
          restartPolicy: Never

The triage agent reads VulnerabilityReport CRDs, summarizes the Critical and High findings via a local LLM (Gemma 4 26B on the Mac Studio), and posts a digest to the #cat-ops Slack channel at 9 AM ET.

RBAC for the patrol agent

The ServiceAccount needs read access to Trivy’s CRDs plus limited write access for remediation:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openclaw-cluster-reader
rules:
  # Read cluster state
  - apiGroups: [""]
    resources: [pods, nodes, events, persistentvolumeclaims, namespaces, services]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, statefulsets, daemonsets]
    verbs: [get, list]
  # Read Trivy vulnerability reports
  - apiGroups: [aquasecurity.github.io]
    resources: [vulnerabilityreports]
    verbs: [get, list]
  # Limited remediation (delete stuck pods/jobs)
  - apiGroups: [""]
    resources: [pods]
    verbs: [delete]
  - apiGroups: [batch]
    resources: [jobs, cronjobs]
    verbs: [delete]

The delete verbs on pods and jobs allow the patrol agent to clean up stuck pods and orphaned batch jobs automatically. It can’t create or modify deployments — that requires human approval via PR.
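The ClusterRole on its own grants nothing until it is bound to the CronJob's ServiceAccount. A minimal binding sketch, using the names from the manifests above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: openclaw-cluster-reader
subjects:
  - kind: ServiceAccount
    name: openclaw-patrol
    namespace: openclaw
roleRef:
  kind: ClusterRole
  name: openclaw-cluster-reader
  apiGroup: rbac.authorization.k8s.io
```

A ClusterRoleBinding is needed (rather than a namespaced RoleBinding) because nodes, namespaces, and the VulnerabilityReports in other namespaces are cluster-scoped reads.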

Managing accepted CVEs

Not every CVE needs immediate action. Some are in transitive dependencies you can’t control, with low real-world risk. I track these in .trivyignore with explicit risk assessments:

# CVE-2025-62718 — axios 1.13.6 SSRF via NO_PROXY bypass
# Risk: Low — outbound API calls within k8s pod only
# Fix: Requires axios >= 1.15.0 (upstream openclaw dependency)
CVE-2025-62718

# CVE-2026-40175 — axios cloud metadata exfiltration
# Risk: Low — pod-internal only, no user-controlled headers
# Fix: Same axios >= 1.15.0 upgrade
CVE-2026-40175

Each entry documents the CVE number, what it affects, why the risk is low in this context, and what upgrade path removes it. When the upstream dependency bumps axios, I remove the entries and the Trivy scan goes clean.
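With the standalone trivy CLI the ignore file is picked up from the working directory or passed via --ignorefile. Trivy Operator doesn't read a local file; the equivalent is the ignoreFile entry in its configuration ConfigMap. A sketch, assuming the default ConfigMap and key names from the operator docs (and a trivy-system install namespace); verify both against your installed operator version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: trivy-operator-trivy-config
  namespace: trivy-system        # assumed operator namespace
data:
  trivy.ignoreFile: |
    CVE-2025-62718
    CVE-2026-40175
```

The comments and risk assessments stay in the version-controlled .trivyignore; only the bare CVE IDs need to reach the operator.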

This is better than ignoring CVEs silently. The .trivyignore file is version-controlled, reviewed in PRs, and serves as a risk register. If I get hit by a CVE I’ve accepted, the risk assessment is already documented.

What changed operationally

Before these changes, security was something I thought about during deployments and forgot about between them. Now:

  • Build time: Multi-stage build adds ~30 seconds but the image is 40% smaller
  • Network: Default-deny catches any unexpected egress immediately (pod fails to connect instead of silently exfiltrating)
  • Daily triage: 9 AM Slack message either says “all clear” or links to specific CVEs with affected images

The daily triage is the highest-value change. It turns “I should check for CVEs sometime” into “the CVE report is in my Slack every morning.” Most days it’s clean. When it’s not, I know immediately.

Lessons

  • Multi-stage builds are table stakes for AI containers. The dependency trees are huge and full of build-time tools that have no business in production. Separate build from runtime.
  • Default-deny NetworkPolicies should be the starting point, not an afterthought. Add them when you create the namespace, then poke holes as needed. It’s much harder to add restrictions to a running system.
  • Automated triage beats manual scanning. The data exists in Trivy CRDs — you just need something to read it and surface the important bits. A local LLM does this for free.
  • Document your accepted CVEs. .trivyignore without comments is a liability. .trivyignore with risk assessments is a risk register.
  • namespaceSelector doesn’t cover host-network IPs. This bit me with Prometheus scraping. Use ipBlock for node subnets.

Don’t have a homelab? Trivy Operator and NetworkPolicies work on any Kubernetes cluster. The CVE triage CronJob can use a cloud LLM instead of local inference — the pattern is the same, you just pay per-token. A DigitalOcean managed Kubernetes cluster supports NetworkPolicies out of the box with Cilium.