TL;DR

A new Opus model shipped, so I sat down to re-tune the agent harness I drive it with: the CLAUDE.md files, skills, hooks, and settings that shape every session. The surprising part: the most valuable changes weren’t trimming prompts for the smarter model. They were wiring the agent into infrastructure I already run: bulk work offloaded to a local LLM (≈$0), a live homelab statusline, session tracing for an “action audit,” and a goal-drift monitor that uses the local model as judge. I also learned not to trust the new model’s own suggestions about what to cut. It wanted to delete load-bearing guardrails.

Why bother re-tuning at all?

Here’s the thing nobody tells you about running an AI coding agent as your daily driver: your config is tuned for a specific model, and you don’t notice until the model changes.

I’ve accumulated a pile of agent context over months — a global instructions file, a stack of per-repo instruction files, a couple dozen skills, a handful of hooks, and a settings file with permissions and environment tweaks. All of it was written against the previous model’s quirks. Some of it exists purely to compensate for things the old model got wrong: extra hand-holding, repeated reminders, verbose examples.

A new, more capable model is an opportunity to delete the scaffolding. It’s also a trap, which I’ll get to.

So when the latest Opus dropped, I gave the agent a deceptively simple instruction: review all of my agent config and improve it for the new model, where “better” means finishing tasks more completely, faster, and with fewer tokens.

Then I watched what it did.

The trap: the smarter model over-trims its own config

I fanned the review out across a few sub-agents, each auditing a slice: skills, instruction files, settings. They came back enthusiastic. One claimed I could cut ~500 lines of “bloat.” Great, right?

No. I verified every suggested cut, and most of them didn’t survive contact with reality:

  • It flagged a section of my secrets-helper skill as “resolved historical context, safe to trim.” That section was actually a set of load-bearing constraints — “don’t upgrade this CLI past version X, it breaks headless auth.” Deleting it would have reintroduced a silent failure I’d already paid for once.
  • It claimed one repo’s instructions duplicated a global rule and should be deleted. The file’s first line was an @import of the canonical doc. It wasn’t a duplicate — it was an include. Deleting it would have orphaned real context.
  • It “found” a redundant rule in another file that, when I actually opened the file, did not exist. The line it cited was about something else entirely. A clean hallucination, delivered with total confidence.

This is the counterintuitive lesson: a stronger model is more persuasive when it’s wrong, not less. It writes a tidy justification for cutting something it doesn’t fully understand. If you let it edit aggressively, you get a leaner config that quietly does less.

Rule I now follow: treat the model’s suggestions about its own guardrails as a hypothesis, not a verdict. Open the file. Read the thing it wants to delete. Ask “what breaks if this is gone?” Removing a guardrail makes tasks less reliable — which is the opposite of the goal.
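Part of that check can be made mechanical before a human ever reads the suggestion. A minimal sketch, assuming each suggestion names a file and quotes the text it wants to cut, and that an include line starts with "@"; `check_cut` and its verdict strings are hypothetical names for illustration, not part of any tool.

```python
from pathlib import Path


def check_cut(path: str, cited_text: str) -> str:
    """Sanity-check a suggested deletion before a human reviews it.

    Returns one of three illustrative verdicts:
      "hallucinated" - the file or the cited text does not exist
      "include"      - the file's first line starts with "@", so it is an
                       include of another doc rather than a duplicate
      "needs-review" - the text exists; a human still decides what breaks
    """
    file = Path(path)
    if not file.is_file():
        return "hallucinated"
    text = file.read_text(encoding="utf-8")
    if cited_text.strip() not in text:
        return "hallucinated"
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line.startswith("@"):  # assumed include syntax
        return "include"
    return "needs-review"
```

Nothing here decides whether a cut is safe. It only filters out suggestions that cannot be right, so the manual "what breaks if this is gone?" pass is spent on real candidates.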

The actual cleanup from the review was modest: one genuinely stale doc section, one skill with broken frontmatter that meant it never triggered, a few byte-identical duplicate skills. Useful, but not the headline.
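Two of those findings are cheap to detect without a model at all. A minimal sketch, assuming skills live as SKILL.md files under one directory and that valid frontmatter means the file opens with a `---` line and closes it with a later `---`; the layout, the function names, and that frontmatter rule are illustrative assumptions, not a description of how any agent actually loads skills.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: str, pattern: str = "SKILL.md") -> list[list[str]]:
    """Group files under root whose bytes are identical (same SHA-256)."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_hash[digest].append(str(path))
    return [paths for paths in by_hash.values() if len(paths) > 1]


def has_frontmatter(path: str) -> bool:
    """True if the file starts with '---' and a later line closes it with '---'."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return False
    return any(line.strip() == "---" for line in lines[1:])
```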

The headline was everything I bolted on next.

The real wins: wire the agent into the infrastructure you already run

If you run a homelab, you almost certainly have idle capacity the agent isn’t using: a local LLM box, an observability stack, a vector database, a chat server. The highest-leverage change wasn’t editing prompts — it was connecting the agent to all of that.

1. Offload bulk work to a local model

The agent spawns sub-agents constantly — for searching, summarizing, scanning files. By default those run on a paid cloud model. But most of that work is low-judgment: classify this, summarize that, extract the fields. I have a local inference box (Apple Silicon, enough unified memory to run 30B–120B models via Ollama) sitting right there.

So I wrote a tiny wrapper that routes bulk work to it:

#!/usr/bin/env bash
# llm-local — route bulk/low-judgment work to the local Ollama box ($0).
# Reserve the expensive cloud model for synthesis; let local handle the grunt work.
set -euo pipefail
HOST="${OLLAMA_HOST:-http://LOCAL_LLM_BOX:11434}"
MODEL="${LLM_LOCAL_MODEL:-qwen3-coder:30b}"
SYS=""
while [ $# -gt 0 ]; do
  case "$1" in
    -m) MODEL="$2"; shift 2;;
    -s) SYS="$2"; shift 2;;
    *) break;;
  esac
done
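PROMPT="${1:-$(cat)}"   # prompt from the first arg, else stdin
# Hand the prompt to Python on fd 3 rather than argv, so large inputs
# don't hit the OS argument-size limit.
python3 - "$HOST" "$MODEL" "$SYS" 3<<<"$PROMPT" <<'PY'
import sys, json, urllib.request
host, model, sysmsg = sys.argv[1:4]
prompt = open(3, encoding="utf-8").read()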
msgs = ([{"role":"system","content":sysmsg}] if sysmsg else []) + [{"role":"user","content":prompt}]
req = urllib.request.Request(host.rstrip("/")+"/api/chat",
    data=json.dumps({"model":model,"messages":msgs,"stream":False}).encode(),
    headers={"Content-Type":"application/json"})
print(json.load(urllib.request.urlopen(req, timeout=600))["message"]["content"].strip())
PY

Now a skill that needs to summarize 40 files pipes them through llm-local instead of burning cloud tokens. Bonus: some of the local models carry a far larger context window than I’d want to pay for in the cloud, so whole-repo scans that would otherwise need chunking just… fit.

The principle is local compute first: the cloud model is for the parts that actually need its judgment.
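From a skill or script, calling the wrapper is one subprocess per file. A minimal sketch, assuming `llm-local` is on PATH and behaves as shown above (`-s` for a system prompt, prompt on stdin); `summarize_files`, the system prompt text, and the injectable `run` parameter are hypothetical names added for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable

SYSTEM = "Summarize the file in two sentences. No preamble."  # illustrative prompt


def run_llm_local(prompt: str, system: str = SYSTEM) -> str:
    """Send one prompt to the llm-local wrapper on stdin and return its reply."""
    result = subprocess.run(
        ["llm-local", "-s", system],
        input=prompt, capture_output=True, text=True, check=True, timeout=660,
    )
    return result.stdout.strip()


def summarize_files(paths: Iterable[str],
                    run: Callable[[str], str] = run_llm_local) -> dict[str, str]:
    """Return {path: summary}; a failed file gets an empty summary instead of aborting."""
    summaries: dict[str, str] = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        try:
            summaries[path] = run(f"FILE: {path}\n\n{text}")
        except (subprocess.SubprocessError, OSError):
            summaries[path] = ""  # local box down or model error: skip, don't crash
    return summaries
```

Making `run` a parameter keeps the batching logic testable without the local box, and gives one obvious place to swap in a different model.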

2. A statusline that shows the homelab, not just the model

The agent’s statusline was wasted real estate. I replaced it with a live glance at cluster health:

Opus 4.8 · my-repo · ⎇ feature/thing · k3s 7/7 · pods ✓ · llm ●

The catch with statuslines is they render constantly — you cannot put a network call in the hot path or every keystroke lags. So the architecture is two pieces:

  • A background refresher that hits the cluster (read-only kubectl, a ping to the local LLM) every couple of minutes and writes a small JSON cache.
  • The statusline script itself only reads that cache. If the cache is stale, it kicks the refresher in the background and shows last-known state. Render time stays in single-digit milliseconds.
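The cache-only read path can be sketched as follows. This is a minimal illustration, not the actual statusline script: the cache path, the JSON field names (`nodes_ready`, `nodes_total`, `pods_ok`, `llm_up`), the two-minute staleness window, and the `statusline-refresh` command are all assumptions.

```python
import json
import subprocess
import time
from pathlib import Path

CACHE = Path.home() / ".cache" / "homelab-status.json"  # hypothetical cache location
MAX_AGE_S = 120                                          # assumed staleness window
REFRESHER = ["statusline-refresh"]                       # hypothetical refresher command


def read_cache(path: Path = CACHE) -> tuple[dict, float]:
    """Return (state, age_in_seconds); a missing or corrupt cache counts as stale."""
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(state, dict):
            return {}, float("inf")
        return state, time.time() - path.stat().st_mtime
    except (OSError, ValueError):
        return {}, float("inf")


def kick_refresher() -> None:
    """Start the refresher detached and return immediately; never wait on it."""
    try:
        subprocess.Popen(REFRESHER, stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL, start_new_session=True)
    except OSError:
        pass  # refresher not installed: keep showing last-known state


def render(state: dict) -> str:
    """Format the homelab segment from cached state only (no network calls)."""
    if not state:
        return "k3s ?"
    pods = "✓" if state.get("pods_ok") else "✗"
    llm = "●" if state.get("llm_up") else "○"
    nodes = f"{state.get('nodes_ready', '?')}/{state.get('nodes_total', '?')}"
    return f"k3s {nodes} · pods {pods} · llm {llm}"


def statusline(path: Path = CACHE) -> str:
    """Render from cache; if the cache is stale, trigger a background refresh."""
    state, age = read_cache(path)
    if age > MAX_AGE_S:
        kick_refresher()
    return render(state)
```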

The first time it rendered, it told me I had three unhealthy pods I didn’t know about. Observability you see for free, on every prompt, beats a dashboard you have to remember to open.

3. Session tracing — the “action audit”

I already send all my CI agent activity to an observability backend (Langfuse). My interactive sessions weren’t traced at all. That’s a blind spot: I had no record of what the agent actually did across sessions — what it touched, how many tool calls, whether it stayed on task.

A SessionEnd hook fixes that. It parses the session transcript and posts one trace per session with the objective, the model used, a per-tool call count, and the list of files touched:

# SessionEnd hook (sketch): parse transcript → one observability trace
info = parse_transcript(transcript_path)   # objective, model, tools, files
trace = {
    "name": "agent-session",
    "input": info["objective"],
    "metadata": {
        "tool_calls": info["tool_calls"],
        "tools": info["tools"],            # {"Bash": 568, "Edit": 43, ...}
        "files_touched": info["files"],
    },
}
post_to_observability(trace)   # no-op silently if no API key configured

Two design rules that matter here: the hook never blocks or errors out a session (a tracing failure must be invisible), and it no-ops cleanly when no credentials are present so it’s safe to ship before the key exists.
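Both rules fit in a small wrapper around the hook body. A minimal sketch, assuming the hook receives a JSON payload on stdin with a `transcript_path` field and that the credential lives in an environment variable; the variable name `TRACE_API_KEY` and the function names are hypothetical, and `send` stands in for whatever actually posts the trace.

```python
import json
import os
import sys
from typing import Callable, Mapping, Optional


def run_hook(raw_payload: str, send: Callable[[dict], None],
             env: Optional[Mapping[str, str]] = None) -> str:
    """Run the tracing hook and report what happened; never raise.

    Returns "skipped" (no credentials), "sent", or "failed", so a dry run
    can show the outcome without anything reaching the session.
    """
    env = os.environ if env is None else env
    if not env.get("TRACE_API_KEY"):        # hypothetical variable name
        return "skipped"                     # safe to ship before the key exists
    try:
        payload = json.loads(raw_payload)
        send({"name": "agent-session",
              "metadata": {"transcript": payload.get("transcript_path")}})
        return "sent"
    except Exception:                        # tracing failures must stay invisible
        return "failed"


def main(send: Callable[[dict], None]) -> int:
    """Entry point for a hook script: reads stdin, always returns exit code 0."""
    run_hook(sys.stdin.read(), send)
    return 0
```

The broad `except Exception` is deliberate here: for a hook that only observes, losing a trace is cheaper than interrupting a session.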

4. A goal-drift monitor that judges with the local model

Once sessions are traced, you can ask a question that’s hard to answer manually: did the agent actually stay on the objective, or did it wander? Long autonomous sessions drift. They start fixing a bug and end up refactoring three unrelated files.

So I built a monitor. It scans recent session transcripts, pulls out the stated objective and the actions taken, and asks a model to score the divergence. The judge is the local LLM — this is exactly the kind of bounded, repetitive call that shouldn’t cost a cent:

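import json

# objective, tools, and files come from the parsed session transcript.
# run_local_llm is assumed to send the prompt to the local model and
# return its text reply.
prompt = (f"OBJECTIVE: {objective}\n"
          f"ACTIONS: tools={tools}; files_edited={files}\n\n"
          "Did the actions stay on the stated objective? Consider scope creep "
          'and abandoned goals. Reply compact JSON: {"drift":0.0-1.0,"why":"<=15 words"}')
verdict = json.loads(run_local_llm(prompt))   # {"drift":0.2,"why":"stayed on target"}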

Anything over a threshold gets posted to my chat server. The funniest part of building this: I ran it against my own recent sessions and it flagged a couple of real ones — a “restart my computer” task that never finished the restart, and a session that “wandered into unrelated tool development.” Both fair cops.
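Local models do not always return clean JSON, so the verdict parsing and the threshold check deserve a little defensiveness. A minimal sketch; the 0.6 threshold, the function names, and the fallback behavior are assumptions, not the values the monitor above uses.

```python
import json
import re
from typing import Optional

DRIFT_THRESHOLD = 0.6  # assumed cut-off; tune against your own sessions


def parse_verdict(reply: str) -> Optional[dict]:
    """Pull the outermost {...} span out of a model reply and validate it.

    Returns {"drift": float in [0, 1], "why": str}, or None if unusable.
    """
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
        drift = float(data["drift"])
    except (ValueError, KeyError, TypeError):
        return None
    drift = min(1.0, max(0.0, drift))  # clamp out-of-range scores
    return {"drift": drift, "why": str(data.get("why", ""))}


def should_notify(verdict: Optional[dict],
                  threshold: float = DRIFT_THRESHOLD) -> bool:
    """Notify only on a valid verdict above the threshold; bad replies stay quiet."""
    return verdict is not None and verdict["drift"] > threshold
```

Treating an unparseable reply as "no alert" keeps a flaky judge from spamming the chat server. The opposite choice, alerting on parse failure, is equally defensible if silent misses worry you more.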

5. Schedule it the boring way

The monitor is a deterministic script — it does its own judging and its own notifying. It does not need to spin up a full agent session to run. So I scheduled it with plain launchd, not the agent’s own routine system:

<!-- ~/Library/LaunchAgents/com.zolty.agent.goal-drift-check.plist -->
<key>ProgramArguments</key>
<array>
  <string>/bin/bash</string>
  <string>-lc</string>
  <string>$HOME/bin/goal-drift-check --days 1 --notify</string>
</array>
<key>StartCalendarInterval</key>
<dict><key>Hour</key><integer>8</integer><key>Minute</key><integer>0</integer></dict>

Daily at 08:00, zero cloud tokens, survives reboot, runs whether or not I have a session open. Reaching for the fancy agent-scheduling system here would have meant paying for an LLM session to run a script that already contains its own LLM call. Simplicity wins.
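The fragment above omits the surrounding plist boilerplate. A minimal sketch that generates a complete job definition with Python's standard `plistlib`; the label and command are taken from the fragment, while the function name and the choice to generate the file from Python are illustrative assumptions.

```python
import plistlib

LABEL = "com.zolty.agent.goal-drift-check"


def build_plist(hour: int = 8, minute: int = 0) -> bytes:
    """Return a complete launchd job definition as XML plist bytes."""
    job = {
        "Label": LABEL,
        # bash -lc expands $HOME at run time; launchd itself does no shell expansion.
        "ProgramArguments": ["/bin/bash", "-lc",
                             "$HOME/bin/goal-drift-check --days 1 --notify"],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job, fmt=plistlib.FMT_XML)
```

Writing the bytes to the LaunchAgents path shown in the fragment and loading the job with launchctl is left out of the sketch.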

Results

  • Bulk work moved to local compute. The grunt-work sub-agents now cost roughly nothing and run against models with bigger context windows than I’d pay for in the cloud.
  • Observability on every prompt. The statusline surfaced unhealthy pods I hadn’t noticed, and every session now lands as a trace I can query later.
  • Drift monitoring is live and earning its keep. It correctly scored a focused session at 0.2 (“stayed on target”) and flagged two genuinely drifted ones at 0.8.
  • The config got more careful, not just smaller. I cut the few things that were actually stale and fixed a skill that was silently never firing — but I kept every guardrail the review wanted to delete.

Lessons learned

  • A smarter model over-trims its own guardrails. It will confidently recommend deleting the exact constraints that keep your tasks reliable. Verify every cut against the actual file. “Less context” is only a win if the context was dead weight.
  • The biggest gains are integration, not prompt-golfing. Re-tuning text in instruction files is fiddly and marginal. Wiring the agent into the local LLM, the observability stack, and a scheduler changed how it works.
  • Local compute first. Anything bounded and low-judgment — classify, summarize, extract, score — should run on hardware you already own. Save the cloud model for synthesis.
  • Make hooks fail safe. Anything that runs on every edit or at every session end must no-op silently without credentials and must never block the session. Build the dry-run mode first.
  • Don’t schedule a deterministic script with an LLM. If a job can judge and notify itself, a plain cron/launchd entry beats spinning up a whole agent run.

What’s next

The session traces are the foundation for a longer-running idea: feed them into semantic recall, so I can ask “what did I decide about X three months ago” across every session, with the embeddings computed locally. That’s a post for another day.