TL;DR

I built an “Ops Dream Worker” — a Kubernetes CronJob that runs at 3 AM, inspects the cluster, identifies improvements, and files GitHub issues with specific fixes. It runs entirely on local models (Mac Studio M3 Ultra), costs $0 per run, and went through 240 A/B test iterations to optimize the prompts. The anti-hallucination patterns were harder to get right than the analysis itself.

The idea

I have a k3s cluster with ~40 deployed services. I maintain it solo. There’s always something that could be better — a deployment missing resource limits, a CronJob that’s been failing silently, an ingress without SSO protection, a container image with known CVEs. These improvements pile up because I’m usually focused on building features, not auditing infrastructure.

What if an AI agent could do the audit overnight and hand me a prioritized list of issues in the morning?

Architecture

The Dream Worker is a 5-phase pipeline inspired by how the brain consolidates memories during sleep — observe, diagnose, prescribe, validate, act. Each phase uses a different local model selected for its strengths:

| Phase | Model | Why this model |
| --- | --- | --- |
| Observe | kubectl (no LLM) | Raw data collection — pods, events, PVCs, images, ingress |
| Diagnose | Qwen3 (109B MoE, 41 tok/s) | Fast pattern recognition across observations |
| Prescribe | Gemma 4 (31B dense, 26 tok/s) | Detailed reasoning for specific fixes |
| Validate | DeepSeek-R1 (32B, 27 tok/s) | Chain-of-thought safety checking |
| Act | GitHub API (no LLM) | Files issues and PRs |

All models run locally on the Mac Studio M3 Ultra via Ollama, proxied through LiteLLM. The total cost per dream run is $0.00 — about 50K tokens of local inference.
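Because everything goes through one OpenAI-compatible endpoint, the pipeline code stays model-agnostic: each phase is just a different model string in the request. A minimal Python sketch of that call path, assuming a LiteLLM proxy on localhost:4000 (the URL, model aliases, and temperature here are illustrative assumptions, not the worker's actual config):

```python
import json
from urllib import request

# Assumed LiteLLM proxy address; adjust host/port for your setup.
LITELLM_URL = "http://localhost:4000/v1/chat/completions"

def build_chat_payload(model: str, system: str, user: str,
                       max_tokens: int = 4000) -> dict:
    """Build an OpenAI-compatible chat request for the LiteLLM proxy."""
    return {
        "model": model,  # e.g. "qwen3" for diagnosis, "deepseek-r1" for validation
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature: consistent, grounded analysis
    }

def call_local_model(payload: dict) -> str:
    """POST the payload and return the assistant message.

    Requires a running LiteLLM/Ollama proxy; not exercised here.
    """
    req = request.Request(
        LITELLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping the diagnosis model for the prescription model between phases is then a one-string change in the payload.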

The observation phase

Phase 1 doesn’t use an LLM at all. It runs kubectl commands to collect cluster state:

kubectl get pods --all-namespaces -o json    # pod health, restart counts
kubectl get events --all-namespaces           # warnings, errors
kubectl get pvc --all-namespaces              # storage health
kubectl get deployments -A -o json            # image versions, replicas
kubectl get ingressroute -A -o json           # security audit

The raw output feeds into a structured observation document with sections for unhealthy pods, warning events, error logs, image versions, and ingress security posture.

This is important: the LLM never runs kubectl itself. The observation phase is pure scripting. The LLM only sees the output. This eliminates an entire class of hallucination — the model can’t invent cluster state that doesn’t exist.
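As a sketch of what "pure scripting" means here, this hypothetical filter turns `kubectl get pods -A -o json` output into the unhealthy-pods section of the observation document. The restart threshold and output field names are illustrative, not the worker's actual schema:

```python
def unhealthy_pods(pods_json: dict, restart_threshold: int = 5) -> list:
    """Reduce `kubectl get pods -A -o json` output to pods worth flagging:
    anything not Running/Succeeded, or restarting past the threshold."""
    findings = []
    for pod in pods_json.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        restarts = sum(s.get("restartCount", 0)
                       for s in status.get("containerStatuses", []))
        phase = status.get("phase", "Unknown")
        if phase not in ("Running", "Succeeded") or restarts >= restart_threshold:
            findings.append({
                "namespace": meta["namespace"],
                "name": meta["name"],
                "phase": phase,
                "restarts": restarts,
            })
    return findings
```

The LLM only ever sees the serialized output of functions like this, never the cluster itself.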

The diagnosis phase

The fast model (Qwen3) receives the observation document and produces exactly 4 findings, each categorized:

FINDING: [error|waste|reliability|performance|security|debt]
Title: ...
Evidence: [specific pod names, error messages, restart counts from observation]
Impact: ...
Already filed? YES #123 / NO

The category constraint matters. Without it, models tend to produce vague “you should add monitoring” findings. Forcing a category forces specificity.

The “Already filed?” check is the dedup mechanism — the observation phase fetches all open and closed issues labeled ops-dream from GitHub and passes them to the diagnosis. More on this in the anti-hallucination section.
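Assembling the diagnosis prompt is mechanical: observation data, the dedup list, and the strict output format all get concatenated. A sketch of that assembly (the production prompt's exact wording isn't shown in this post, so treat the phrasing as illustrative):

```python
CATEGORIES = ("error", "waste", "reliability", "performance", "security", "debt")

def build_diagnosis_prompt(observation: str, filed_issues: list) -> str:
    """Assemble the diagnosis prompt: format constraint, grounding rule,
    dedup list, then the raw observation document."""
    issues = "\n".join(f"- {title}" for title in filed_issues) or "- (none)"
    return (
        "Produce EXACTLY 4 findings from the observation below.\n"
        f"Each finding must start with: FINDING: [{'|'.join(CATEGORIES)}]\n"
        "Evidence may ONLY cite pod names, events, and messages that appear "
        "in the observation. NEVER fabricate CVE numbers or error messages.\n\n"
        "Already-filed ops-dream issues (open and closed); if a finding "
        "matches a CLOSED issue, do NOT re-file it:\n"
        f"{issues}\n\n"
        "=== OBSERVATION ===\n"
        f"{observation}\n"
    )
```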

The prescription phase

The deep model (Gemma 4) takes each finding and generates actionable fixes:

  • Specific file paths in the repository
  • YAML diffs showing exactly what to change
  • Verification commands to confirm the fix works

This is where model quality matters most. A fast model produces fixes that look plausible but miss edge cases. The dense 31B model catches things like “if you add resource limits to this deployment, you also need to update the PodDisruptionBudget.”

The validation phase

The reasoning model (DeepSeek-R1) acts as an adversarial reviewer. For each fix, it evaluates:

  • Blast radius: Does this change affect other services?
  • Confidence: Is the evidence strong enough to act on?
  • Verdict: PR (auto-file), ISSUE (human review), or SKIP (not worth it)

The mandatory SKIP threshold is critical. Without it, the model rubber-stamps everything. With it, about 30% of findings get killed — which is the right ratio for an automated system.
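Because the verdict gates automated action, parsing it should fail closed: anything the validator doesn't state cleanly defaults to SKIP. A small sketch of that rule (the function name and verdict line format are assumptions):

```python
import re

def parse_verdict(review: str) -> str:
    """Extract the validator's verdict. Anything ambiguous or missing
    defaults to SKIP, so an unparseable review can never auto-file a change."""
    m = re.search(r"\bVerdict:\s*(PR|ISSUE|SKIP)\b", review, re.IGNORECASE)
    return m.group(1).upper() if m else "SKIP"
```

Failing closed is the point: a reasoning model that rambles past its token budget produces a SKIP, not an accidental auto-filed PR.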

The action phase

Validated findings become GitHub issues with a structured template:

## Problem
[One sentence describing the issue]

## Evidence
[Raw kubectl output — pod names, error messages, restart counts]

## Impact
[What breaks if this isn't fixed]

## Suggested Fix
[File path + YAML diff]

## Verification
[kubectl command to confirm the fix worked]

Issues get two labels, ops-dream and automated. High/critical findings include PR-ready diffs. Low/medium findings are issues for human review.
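Filing is then a matter of rendering the template and POSTing it to GitHub's create-issue endpoint (POST /repos/{owner}/{repo}/issues accepts title, body, and labels). A sketch of the payload construction, with the labels as described above:

```python
# Issue body template, mirroring the structure described above.
ISSUE_TEMPLATE = """## Problem
{problem}

## Evidence
{evidence}

## Impact
{impact}

## Suggested Fix
{fix}

## Verification
{verify}
"""

def issue_payload(title: str, **sections: str) -> dict:
    """Render the template into the JSON body for
    POST /repos/{owner}/{repo}/issues (sent with an Authorization header)."""
    return {
        "title": title,
        "body": ISSUE_TEMPLATE.format(**sections),
        "labels": ["ops-dream", "automated"],
    }
```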

The 240-run prompt optimization

The first version of the Dream Worker produced garbage. Vague findings, hallucinated CVE numbers, duplicate issues for things already fixed. Getting from “technically works” to “actually useful” took 240 test runs.

I built a test harness that ran synthetic cluster observations through the pipeline and graded outputs on five dimensions:

| Dimension | What it measures |
| --- | --- |
| Originality | Non-obvious connections (not just “pod is crashing”) |
| Coherence | Logical flow across phases |
| Rigor | Falsifiable claims with specific evidence |
| Actionability | Clear fixes with file paths and diffs |
| Calibration | Honest about confidence, avoids overstatement |

Four prompt variants were tested:

| Variant | Score (/25) | Truncation rate | Time/run |
| --- | --- | --- | --- |
| v1 baseline | 18.0 | 65% | 60s |
| v2 sharper | 19.5 | 85% | 46s |
| v3 structured | 19.5 | 47% | 48s |
| v4 hybrid | 19.4 | 20% | 41s |

v4 won despite a marginally lower score because it had the lowest truncation rate (20% vs 65% baseline) and fastest runtime. The truncation issue was killing v1 and v2 — the models were generating responses that exceeded the token budget, so the validation and synthesis phases were getting cut off mid-sentence.

The fix was prosaic: bump max_tokens from 2000 to 4000 across all phases. The models weren’t being verbose — the structured output format just takes more tokens than free-form prose.

Total cost of the 240-run optimization: $0.00 — 685K tokens on local Ollama. This is the advantage of local inference for prompt engineering. You can iterate without watching a billing dashboard.
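The harness arithmetic is simple: five dimensions scored 0-5 sum to the /25 score, and the truncation rate is the fraction of runs whose output got cut off. A sketch of the per-variant summary, with the run-record shape as an assumption:

```python
from statistics import mean

DIMENSIONS = ("originality", "coherence", "rigor", "actionability", "calibration")

def grade_variant(runs: list) -> dict:
    """Summarize one prompt variant: mean total score (five dimensions,
    0-5 each, /25) and the fraction of runs whose output was truncated."""
    scores = [sum(r["scores"][d] for d in DIMENSIONS) for r in runs]
    return {
        "score": round(mean(scores), 1),
        "truncation_rate": sum(r["truncated"] for r in runs) / len(runs),
    }
```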

Anti-hallucination patterns

This was the hardest part. The first deployments filed issues for problems that didn’t exist. Three patterns fixed it:

1. TLS/SSO confusion

The model kept filing “missing TLS” issues for services that had TLS but lacked SSO. The fix was renaming the audit output from [UNPROTECTED] to [NO_AUTHENTIK_SSO] and adding an explicit instruction:

ALL ingresses already have TLS via cert-manager. Do NOT file TLS issues. The issue is AUTHENTICATION (SSO login).

Seems obvious in hindsight. The model was pattern-matching “unprotected” to “no encryption” instead of “no authentication.”

2. Fabricated CVE numbers

The model occasionally invented CVE numbers that didn’t exist. The fix:

NEVER fabricate CVE numbers, error messages, or data not in the observation. You can ONLY cite data that appears in the input below. If a CVE scan wasn’t run, you CANNOT claim a CVE exists.

This explicit constraint eliminated fabricated CVEs completely. The model needs to be told it can’t make things up — implicit expectations don’t work.

3. Issue deduplication

The model would re-file issues that were already open or had been closed as resolved. The fix was a three-step dedup:

  1. Fetch all open AND closed ops-dream issues from GitHub
  2. For closed issues, also fetch the resolution comment (last comment on the issue)
  3. Pass the full list to the diagnosis phase with: “If an issue was already filed and CLOSED (resolved or false positive), do NOT re-file it.”

The resolution context matters — without it, the model sees a closed issue titled “Add resource limits to service X” and re-files it because it can’t tell if it was fixed or just closed as wontfix.
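The first step of the dedup, matching a new finding against known issues, stays out of the LLM entirely. A sketch assuming the issue list was fetched with GET /repos/{owner}/{repo}/issues?state=all&labels=ops-dream (the title-normalization rule is illustrative):

```python
def already_filed(title: str, known_issues: list) -> bool:
    """True if an ops-dream issue with the same (normalized) title exists,
    open or closed. The resolution comment of closed matches is passed to
    the model separately, so it can judge whether a genuinely new variant
    of the problem still deserves a fresh issue."""
    wanted = title.strip().lower()
    return any(i["title"].strip().lower() == wanted for i in known_issues)
```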

After 6 test runs with these patterns, false positives went from 3 per run to 0.

Scheduling

The Ops Dream Worker runs as a Kubernetes CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: openclaw-ops-dream-worker
  namespace: openclaw
spec:
  schedule: "0 3 * * 1-5"
  timeZone: "America/New_York"
  concurrencyPolicy: Forbid
  activeDeadlineSeconds: 900

3 AM ET on weeknights — when the cluster is quietest and I’m asleep. The 15-minute hard deadline prevents runaway inference from blocking the next scheduled run. Forbid concurrency means if a run is still going when the next one triggers, it skips rather than stacking.

Typical runtime is 5-8 minutes; the kubectl operations and GitHub API calls take longer than the LLM inference itself.

What the morning looks like

I wake up to a Slack message in #cat-ops with a summary of what the Dream Worker found. Most mornings it’s “0 new findings” — which is the goal. When it does find something, I get a link to the GitHub issue with the full analysis and suggested fix.

Over the first two weeks of operation, it found:

  • 3 deployments missing memory limits
  • 1 CronJob that had been failing silently for 5 days
  • 2 ingresses without Authentik SSO protection
  • 1 Longhorn volume with degraded replica count

All real issues. All things I would have found eventually — but “eventually” is the problem when you’re a solo operator.

Lessons

  • Separate observation from reasoning. The LLM should never collect its own data. Script the data collection, feed the output to the model. This eliminates hallucinated cluster state.
  • Local models make prompt engineering free. 240 test runs would have cost hundreds of dollars on Claude or GPT-4. On local hardware, it cost electricity.
  • Explicit constraints beat implicit expectations. “Don’t hallucinate” doesn’t work. “You can ONLY cite data that appears in the input below” works.
  • Dedup needs resolution context. Knowing an issue was closed isn’t enough — the model needs to know WHY it was closed to decide if the finding is still relevant.
  • The validation phase is load-bearing. Without adversarial review, the system files too many low-value issues. The mandatory SKIP threshold keeps signal-to-noise high.

Don’t have a homelab? You can run the same pattern with cloud LLMs against any Kubernetes cluster. The prompt patterns and anti-hallucination techniques apply regardless of where the inference runs — you’ll just pay per token instead of per kilowatt-hour. A DigitalOcean Kubernetes cluster is a good starting point.