TL;DR
I have $X in monthly Claude tokens I don’t always use. Instead of letting the unused credit evaporate, I built a parallel agent sweep that fans out autonomous scouts to scan for dependency upgrades, CVEs, CI waste, and quick wins across my repos. Each discovery agent returns a scored candidate list. The orchestrator triages and ranks them, then spins up isolated worktree agents to implement the safe ones — all under a hard token cap and with human gates between phases. The output is a pile of merge requests, not silent commits. Noise is real and review burden is the limiting factor, but when it lands right, an hour of agent work + human review beats a weekend of manual maintenance.
The problem: idle tokens after reset
Every month I get a Claude token allotment through the Claude Code subscription. Some months I use it all; other months I’m halfway through a task when my renewal hits and I’ve got leftover credit. Letting that burn feels wasteful. And I have a growing list of maintenance chores: dependency upgrades that aren’t urgent, CI jobs that could be 10× faster on a different runner, test coverage gaps, redundant patterns that the templating system already solved. I’ve written about the case for agent autonomy and the trust ladder to earn it, and this loop puts that theory into practice.
The tension: I could do a pass through my repos manually, but it’s tedious enough that I don’t. I could write automation to do it, but then I’m building the automation instead of shipping features. The middle ground is to farm out the tedious work to agents under a hard budget, let them build merge requests, and then I decide what’s worth merging. The agents become a force multiplier on maintenance work I’ve been deferring.
Architecture: seven phases under a token ceiling
The system, which I call improve-loop, fans out in two dimensions: domains (where to look) and intensity (how many agents to spawn). It’s gated by hard safety checks and capped by a tunable token budget.
Preflight (single agent, sequential):
- Check cluster health (all nodes Ready, no degraded Longhorn volumes, no active patching session)
- Read my project memory to identify exclusion zones (active work that agents shouldn’t touch)
- Compute fan-out from intensity level: light=3 agents, medium=6, heavy=10-12
- Write a lockfile to prevent concurrent runs
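A minimal sketch of the last two preflight steps, assuming the lock is a plain file created atomically. The function names and the choice to store the upper bound of the heavy range are assumptions for illustration, not the loop's actual code.

```python
import os
from pathlib import Path

# Fan-out per intensity, from the list above. Heavy is a range (10-12),
# so this sketch stores the upper bound (an assumption).
FAN_OUT = {"light": 3, "medium": 6, "heavy": 12}


def acquire_lock(lock_path: Path) -> bool:
    """Create the lockfile atomically; return False if another run holds it."""
    try:
        # O_EXCL makes creation fail when the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(lock_path: Path) -> None:
    """Remove the lockfile; a missing file is not an error."""
    lock_path.unlink(missing_ok=True)
```

Atomic creation matters here: a check-then-write sequence would let two runs started at the same moment both believe they hold the lock.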
Discovery (parallel fan-out, one agent per domain): Each discovery agent gets a domain (releases, Reddit, HN, CVEs, CI inefficiencies, HuggingFace models, AWS changes) and returns a ranked JSON list of candidates with structure like:
{
  "id": "traefik-3.6-grpc-retry",
  "source": "github-releases",
  "summary": "Traefik 3.6 adds gRPC retry middleware",
  "value_for_us": "Reduces gRPC 502s in openclaw worker pools",
  "effort_estimate_hours": 2,
  "risk": "low",
  "reversibility": "instant",
  "target_repo": "home_k3s_cluster",
  "target_paths": ["kubernetes/apps/traefik/", "helm-values/"]
}
Triage (single Claude pass):
Aggregate, dedupe, and rank by (risk_score × reversibility_score × value_weight) / effort_hours. Apply hard exclusions:
- Touches cluster-wide blast radius (Traefik, MetalLB, cert-manager) without an explicit safety story → discard
- Contradicts an active project memory finding → discard
- Already shipped in the last 14 days → discard
- Touches files flagged as “do not auto-touch” in feedback docs → discard
Pick the top K by score (light=3, medium=6, heavy=8). Print and wait for human triage (unless --auto flag is set).
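The ranking formula and two of the hard exclusions can be sketched as follows. The numeric mappings for risk and reversibility, the `value_weight` field, and the minimum-effort floor are assumptions: the post names the factors but not their values.

```python
from dataclasses import dataclass, field

# Hypothetical numeric mappings; only the factor names come from the post.
RISK_SCORE = {"low": 1.0, "medium": 0.5, "high": 0.1}
REVERSIBILITY_SCORE = {"instant": 1.0, "rollback": 0.6, "manual": 0.2}
TOP_K = {"light": 3, "medium": 6, "heavy": 8}  # from the post


@dataclass
class Candidate:
    id: str
    risk: str
    reversibility: str
    effort_estimate_hours: float
    value_weight: float = 1.0  # hypothetical field
    target_paths: list[str] = field(default_factory=list)


def score(c: Candidate) -> float:
    """(risk_score * reversibility_score * value_weight) / effort_hours."""
    effort = max(c.effort_estimate_hours, 0.25)  # floor avoids division by zero
    return RISK_SCORE[c.risk] * REVERSIBILITY_SCORE[c.reversibility] * c.value_weight / effort


def is_excluded(c: Candidate, protected_prefixes, recently_shipped_ids) -> bool:
    """Hard exclusions: already shipped, or touching a protected path."""
    if c.id in recently_shipped_ids:
        return True
    prefixes = tuple(protected_prefixes)
    return any(path.startswith(prefixes) for path in c.target_paths)


def triage(candidates, intensity, protected_prefixes=(), recently_shipped_ids=frozenset()):
    unique = {c.id: c for c in candidates}  # dedupe by id, last one wins
    kept = [c for c in unique.values()
            if not is_excluded(c, protected_prefixes, recently_shipped_ids)]
    kept.sort(key=score, reverse=True)
    return kept[:TOP_K[intensity]]
```

Exclusions run before scoring so that a high-scoring candidate can never outrank a hard rule.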
Implementation (parallel fan-out, worktree-isolated): For each accepted candidate, spawn an agent in an isolated git worktree. Per-agent budget: 30-minute wall-clock limit and $5 token cap (the agent self-reports at 80%). The agent:
- Reads the repo and candidate metadata
- Makes the change, tests it locally if applicable
- Opens a merge request with an explanation in the description
- Exits, leaving the worktree on disk for easy inspection
Worktree isolation is critical — when multiple agents edit the same repo in parallel, their working directories don’t collide, and changes are staged per-branch. If an agent runs away or hits the token cap, the damage is contained to that branch.
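As a rough illustration of the isolation step, the sketch below builds and runs a `git worktree add` command per candidate. The branch naming scheme and the sibling `-worktrees` directory layout are assumptions, not the loop's actual conventions.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, candidate_id: str, base: str = "main") -> list[str]:
    """Build the git command that creates one branch and one directory per candidate."""
    branch = f"improve-loop/{candidate_id}"  # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-worktrees" / candidate_id  # hypothetical layout
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, candidate_id: str, base: str = "main") -> Path:
    """Create the worktree and return its directory; raises if git fails."""
    cmd = worktree_command(repo, candidate_id, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])
```

Because each agent gets its own directory and branch, a failed run can be discarded with `git worktree remove` without touching anyone else's checkout.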
Validation (automated CI): My GitLab CI already runs on every MR: linting, image builds (with Trivy security scans), manifest dry-runs, and an LLM-powered code review that leaves inline comments. I don’t re-validate; I just poll the pipeline status.
Shipping (orchestrator decision):
- Green pipeline + clean LLM review → tag improve-loop/auto-mergeable (I merge manually later)
- Green pipeline + reviewer flagged something → tag improve-loop/needs-review, keep it open with a summary
- Pipeline failed → close with an explanation comment, mark the candidate as “needs human plan”
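The three shipping rules above reduce to a small decision function. This sketch assumes the pipeline status arrives as GitLab's `success` or `failed` strings and that the LLM review is summarized as a count of flagged comments; both inputs are assumptions about how the orchestrator sees the MR.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    AUTO_MERGEABLE = "improve-loop/auto-mergeable"
    NEEDS_REVIEW = "improve-loop/needs-review"
    CLOSE = "close-with-comment"  # hypothetical marker, not a label from the post


def shipping_action(pipeline_status: str, review_flags: int) -> Optional[Action]:
    """Map pipeline and review results to an action; None means keep polling."""
    if pipeline_status == "failed":
        return Action.CLOSE
    if pipeline_status != "success":
        return None  # still pending or running
    return Action.AUTO_MERGEABLE if review_flags == 0 else Action.NEEDS_REVIEW
```

None of the outcomes merges anything: the best case is only a label that a human acts on later.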
Cleanup:
Remove the lockfile, update docs/runtime/improve-loop-state.yaml with the run summary, and post a summary message to Slack.
Three invocation modes: full Claude vs. local workers vs. cron
Default (full Claude): Orchestrator and all agents use Claude. Highest quality, highest cost. Budget: medium = $15 for the full run.
Local mode (--local): Orchestrator is Claude; agents are Bash loops calling scripts/ci/llm_call.py, which dispatches to LiteLLM (free Qwen 36B coder + Claude escalation for hard thinking). $0 LLM cost for workers, ~$2-5 Claude cost for orchestration.
Cron mode (future): The full pipeline runs via CI scheduled job, no Claude at all — just the scripts/ci/llm_call.py machinery with LiteLLM. Already wired up in the CI; not yet hooked into the skill.
Invocation is a slash command with flags:
/improve-loop # Full sweep, medium intensity, home_k3s_cluster
/improve-loop --light # Discovery only, no implementation
/improve-loop --target zolty-blog # Focus on one repo
/improve-loop --domain ci # One source domain (releases, reddit, ci, aws, etc.)
/improve-loop --local # Use LiteLLM workers, not Claude
/improve-loop --intensity heavy # Up to 12 parallel agents, 6-hour budget
Sources: where agents look
Each discovery agent targets a concrete source:
| Domain | Sources |
|---|---|
| releases | GitHub release feeds for Traefik, Longhorn, cert-manager, k3s, Terraform providers, etc. |
| reddit | Top posts from r/kubernetes, r/selfhosted, r/homelab, r/devops, r/LocalLLaMA, etc. (7-day window) |
| hn | HN Algolia API searching for keywords (kubernetes, k3s, terraform, ollama, harbor, etc.) |
| patches | CVE tracking via /patching status (urgent security drifts) |
| hf | HuggingFace model releases that fit my inference constraints |
| aws | Bedrock model deprecations, ECR/CloudFront pricing/feature changes |
| ci | Scan .gitlab-ci.yml for slow jobs, redundant logic, uncached steps, direct registry pulls |
| cves | Critical CVEs from Dependabot alerts + Trivy operator (when deployed) |
| internal-issues | My own issues labeled improve-loop, i.e. explicit “do this next” requests |
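For the hn row, a discovery agent's query could look roughly like this. The sketch only builds the request URL for the public HN Algolia search endpoint; the 7-day window is borrowed from the reddit row as an assumption, and the function name is hypothetical.

```python
import time
from typing import Optional
from urllib.parse import urlencode

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"


def hn_query_url(keyword: str, window_days: int = 7, now: Optional[float] = None) -> str:
    """Build a search URL for stories matching `keyword` inside the time window."""
    now = time.time() if now is None else now
    since = int(now - window_days * 86400)
    params = {
        "query": keyword,
        "tags": "story",  # skip comments, keep submissions only
        "numericFilters": f"created_at_i>{since}",
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the time window easy to test without network access.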
The honest caveats
Noise is real. A discovery agent might surface “Kubernetes 1.32 added a new admission controller,” which is technically new but totally irrelevant to me. I trade false positives for recall — I’d rather filter 20 candidates down to 5 useful ones than miss one that matters.
Review burden exceeds savings sometimes. If the loop opens 8 MRs and I spend an hour triaging them, but only 2 are mergeable, I’ve lost the win. The loop pays off when the hit rate is high — 60%+ of candidates are genuinely useful. When candidates are noisy, the next run gets tuned with tighter exclusion rules.
Implementation agents aren’t magic. If the change needs architectural thinking or cross-system refactoring, a 30-minute agent budget won’t cut it. The loop works best for small, reversible wins: dependency bumps with green tests, config tweaks, CI speedups, shell-script polishes.
Deterministic safety is mandatory. I don’t let the loop auto-merge. I also don’t trust LLM safety checks alone — the hard gates (no degraded volumes, no cluster-wide changes without a story, no test deletion, no secrets) are enforced via Bash regex on the CI runner, not prompts. An agent can’t bypass them.
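The post enforces these gates with Bash regex on the CI runner. The sketch below expresses the same idea in Python for readability; the patterns are illustrative assumptions and far from a complete rule set.

```python
import re

# Illustrative patterns only; a real gate list would be longer and stricter.
GATES = {
    "secret": re.compile(r"(AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----)"),
    "destructive-command": re.compile(r"\b(rm\s+-rf\s+/|kubectl\s+delete\s+(ns|namespace)\b)"),
}
TEST_FILE_HEADER = re.compile(r"^diff --git a/(\S*tests?/\S+|\S*test_\S+\.py) ")
DELETED_FILE = re.compile(r"^deleted file mode \d+$")


def gate_violations(diff_text: str) -> list[str]:
    """Scan a unified git diff and return the name of every gate it trips."""
    violations = []
    in_test_file = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_test_file = bool(TEST_FILE_HEADER.match(line))
        elif in_test_file and DELETED_FILE.match(line):
            violations.append("test-deletion")
        elif line.startswith("+") and not line.startswith("+++"):
            # Only added lines are checked for secrets and destructive commands.
            for name, pattern in GATES.items():
                if pattern.search(line):
                    violations.append(name)
    return violations
```

A non-empty result fails the job, so the outcome does not depend on how the agent was prompted.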
Token spillage happens. I set budgets, but if a discovery agent gets stuck pulling changelogs or parsing API responses, it might burn $3 instead of the planned $2. I monitor Langfuse traces and kill runaway agents if they hit 80% of their cap. Not perfect, but acceptable.
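A per-agent budget with an 80% warning threshold can be tracked with a few lines. This is a sketch: the class name and the three status strings are assumptions, and in practice the cost figures would come from whatever tracing is in place.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    cap_usd: float
    spent_usd: float = 0.0
    warn_ratio: float = 0.8  # the 80% threshold from the post

    def record(self, cost_usd: float) -> str:
        """Add a cost and report 'ok', 'warn' (at or past 80%), or 'stop' (cap reached)."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.cap_usd:
            return "stop"
        if self.spent_usd >= self.cap_usd * self.warn_ratio:
            return "warn"
        return "ok"
```

With the $5 implementation cap, `record` returns "warn" once spend reaches $4, which leaves room to kill the agent before the cap is hit.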
Anti-patterns that bit me
- Two levels of parallelism. A discovery agent that itself spawns more agents = exponential token burn + trace loss. I keep the fan-out flat.
- Merging to main without human gates. The loop opens MRs, period. I always merge manually, even if the pipeline is green. Human judgment is final. (This ties to the broader CI/CD automation and code review framework I use.)
- Running on a degraded cluster. The loop is to improve, not hide instability. If Longhorn is degraded or a node is NotReady, the whole thing aborts.
- Expanding scope mid-implementation. If an agent realizes the work spills outside the candidate’s target paths, it stops and files the spillover as a separate candidate. Scope creep is the loop’s enemy.
- Infra bumps through the loop. Traefik 3.6, Longhorn upgrades, and Terraform provider bumps go through /patching, not the improve loop. The loop is for apps, CI, docs, and config.
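The scope rule in the list above (stop when work spills outside the candidate's target paths) is easy to check mechanically. A minimal sketch, assuming the changed file list comes from something like `git diff --name-only`; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_files: list[str], target_paths: list[str]) -> list[str]:
    """Return the changed files that fall outside every target path."""
    roots = [PurePosixPath(p) for p in target_paths]  # trailing slashes are normalized
    spill = []
    for name in changed_files:
        path = PurePosixPath(name)
        if not any(path == root or root in path.parents for root in roots):
            spill.append(name)
    return spill
```

Comparing path components instead of raw string prefixes avoids treating `helm-values-old/` as part of `helm-values/`.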
What’s actually landed
The loop has been running for a few months in light and medium modes. Some wins:
- CI speedup: Migrated lint:ansible and lint:python jobs from cluster runners to Mac runners, cut feedback time from 60s to 8s. The discovery agent found it by scanning job runtimes; the implementer changed .gitlab-ci.yml to use a different runner tag.
- Harbor inventory: Removed 13 stale .build_harbor directives as part of Harbor retirement. The loop discovered the stale jobs, the implementer updated them to use the new .build_registry template and verified green builds.
- Kubectl latest unpinned: Caught a .kubectl template using latest tag (exact violation of the repo’s own linting rules). Implementer pinned it to a digest.
But also noise:
- False positive upgrades: The releases agent surfaced “new Terraform AWS provider version,” but the candidate was a pre-release and skipped by design.
- Already-done work: Candidates for changes that shipped during the run itself (the lockfile exists to prevent concurrent runs, but not runs against a fast-moving main).
The hit rate was ~50% after triage — workable, not great. The next sweep tuned the triage exclusions and the “already shipped” check got more aggressive.
The math
Token cost scales with intensity and agent depth:
| Intensity | Discovery cost | Impl cost | Total budget |
|---|---|---|---|
| light | ~$2 | $0 | $1 |
| medium | ~$4 | ~$8 | $15 |
| heavy | ~$8 | ~$30 | $40 |
Local mode cuts implementer cost to ~$0 (LiteLLM) and discovery to ~$1-2 (cheaper extract phase).
Against a monthly allotment, medium once a week is sustainable and catches enough signal to make merging 2-3 MRs per run feel worthwhile. heavy weekly would burn the whole month; I reserve it for “backlog cleanup” sprints.
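The monthly arithmetic, worked through using the total budgets from the table. The four-runs-per-month figure is an assumption standing in for "once a week", and the allotment is a parameter because the post does not state it.

```python
# Total budget per run in USD, taken from the table above.
TOTAL_BUDGET_USD = {"light": 1, "medium": 15, "heavy": 40}


def monthly_cost(intensity: str, runs_per_month: int = 4) -> int:
    """Worst-case monthly spend if every run uses its full budget."""
    return TOTAL_BUDGET_USD[intensity] * runs_per_month


def fits_allotment(intensity: str, allotment_usd: float, runs_per_month: int = 4) -> bool:
    """True when the worst-case monthly spend stays within the allotment."""
    return monthly_cost(intensity, runs_per_month) <= allotment_usd


print(monthly_cost("medium"))  # 60
print(monthly_cost("heavy"))   # 160
```

Weekly heavy runs cost roughly 2.7 times what weekly medium runs do, which matches the post's choice to reserve heavy for occasional cleanup sprints.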
Lessons
- Agents as force multipliers on deferred work, not autonomous decision-makers. They find candidates and implement, I decide what lands.
- Worktree isolation is non-negotiable. Running multiple agents in main = chaos. Worktrees cost a few extra seconds and prevent every agent from stepping on every other.
- Hard gates before soft reasoning. Bash regex enforcement of destructive-command blocks, secret-detection rules, and test-deletion guards beats any LLM safety prompt.
- State files are your friend. improve-loop-state.yaml is the memory of what shipped, what failed, and what to try next. Without it, the loop drifts.
- Human gates between phases prevent surprises. The default is --no-auto, so I review the candidate list before any implementation starts. The loop is a tool, not a runaway process.
No homelab, no git server? The same pattern scales down to a single repo. A single-agent discovery sweep of GitHub releases and Dependabot alerts, implementation in a branch, and MRs to your default branch gives you the same “maintenance automation with human review” model. DigitalOcean’s App Platform runs the whole CI/CD pipeline in one place — no need for self-hosted runners.