TL;DR

Most AI tooling still treats an LLM like a search bar — you prompt, it answers, the loop ends. Useful, but not what I wanted. For my homelab’s ops + trading intelligence platform (OpenClaw), I needed agents that could run for hours, do real work against a real cluster, and then tap me on the shoulder when they found something I should see. Claude turned out to be the model I kept coming back to for the “thinking” layer — it’s both comfortable with long tool-use chains and happy to write structured output a human won’t need to decode. This is a tour of how I’ve actually wired that up: k3s CronJobs doing the heavy lifting, LiteLLM as the routing layer, Slack as the interrupt bus, and named cat-bot personas so I can tell at a glance who’s knocking.

The shape of the problem

I have a homelab with ~30 long-running workers. Some ingest market data every 8 minutes. Some scrape YouTube creators every 6 hours. A cluster patrol sweeps for sick pods every 15 minutes. A “dream worker” wakes up nightly to generate trading hypotheses. A security triage agent reads Trivy reports at 9am daily.

Each of these jobs has three distinct phases:

  1. Gather — cheap, deterministic data fetching. Yahoo Finance, Reddit JSON, Longhorn CRDs, CVE reports.
  2. Reason — expensive, probabilistic LLM work. Score sentiment, summarize earnings calls, suggest remediation, generate falsifiable hypotheses.
  3. Decide & report — cheap, deterministic. Write to Postgres. Post to Slack. Maybe fix the broken thing itself.
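Stripped of the specifics, the three phases reduce to a tiny harness shape: deterministic bread around a probabilistic middle. A minimal sketch with stub phases (the function names are illustrative, not OpenClaw's actual API):

```python
def run_agent(gather, reason, act):
    """Three-phase harness: cheap deterministic gather, expensive
    probabilistic reasoning, cheap deterministic action."""
    context = gather()           # phase 1: APIs, DBs, CRDs
    decisions = reason(context)  # phase 2: the LLM turn(s)
    return act(decisions)        # phase 3: writes, posts, fixes

# Stub phases standing in for real fetchers and an LLM call:
report = run_agent(
    gather=lambda: {"unhealthy_pods": 2},
    reason=lambda ctx: [{"action": "delete_pod"}] * ctx["unhealthy_pods"],
    act=lambda plan: f"executed {len(plan)} actions",
)
# report == "executed 2 actions"
```

The value of keeping the shape this explicit is that phases 1 and 3 stay unit-testable without ever touching a model.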

Phase 2 is where Claude earned its seat on the platform. I started with smaller local models (Gemma 4 via Ollama on a Mac Studio) and they’re good — they handle the high-volume, bounded tasks. But for anything that needs actual judgment — “given these 14 unhealthy volumes, what’s the narrative, what’s the blast radius, and what should a groggy homelab operator do about it at 2am?” — I wanted the strongest model I could afford to call. Claude was it.

The question was how to let a long-running agent call Claude without burning money on chat-style ping-pong.

Shape of an agent: the dream worker

The easiest way to explain the pattern is to walk through one agent end to end. The dream worker is my favorite, partly because it’s the most creative and partly because it has the most personality.

It runs nightly at 10pm ET, takes about 20 minutes, and its whole job is:

Read recent market data, news headlines, Reddit sentiment, and YouTube transcripts. Generate 10-30 falsifiable trading hypotheses. Queue them into dreams.hypothesis_queue. Pick the most interesting five and post them to Slack for my morning review.

That’s a lot of surface area for one process, so it’s structured as a tool-use loop:

┌─ Gather (Python, 2 min) ─────────────────────────────────────┐
│  pull last 24h from Postgres                                 │
│  pull fresh headlines from FRED + Finnhub                    │
│  pull hot Reddit posts from the poller's DuckDB              │
└──────────────────────────────────────────────────────────────┘
┌─ Reason (Claude, 10-15 min) ─────────────────────────────────┐
│  system prompt: "You are a skeptical quant. Generate         │
│                  falsifiable hypotheses with p_true 10-25%." │
│  tools:  lookup_ticker_history, sample_recent_news,          │
│          score_correlation, query_prior_hypothesis           │
│  budget: 40 tool calls, 30k output tokens                    │
└──────────────────────────────────────────────────────────────┘
┌─ Decide & report (Python, 2 min) ────────────────────────────┐
│  INSERT into dreams.hypothesis_queue (verdict=WATCH)         │
│  pick top 5 by novelty + conviction                          │
│  slack.chat.postMessage to #snek-daily as "Fluffie"          │
│  attach interactive Accept / Reject / Test Now buttons       │
└──────────────────────────────────────────────────────────────┘

The long-running part isn’t the LLM call — it’s the orchestration around it. Claude only sees one turn at a time, but the Python harness is the thing that runs for 20 minutes, deciding which tools to expose, batching context into each prompt, and writing progress back into the database after every reasoning step so a mid-run pod eviction doesn’t waste the entire job.

That durability is the part most “agent” frameworks quietly skip over. My homelab gets flaky — Longhorn volumes reattach, etcd blips, a Proxmox host decides to swap — and the only thing that keeps these agents viable is assuming the pod will die mid-thought and designing for resumption from the last committed step.
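The resumption logic itself is small. My real harness commits progress to Postgres; a JSON-file sketch shows the same idea (names are hypothetical, not OpenClaw code):

```python
import json
from pathlib import Path

def run_steps(steps, checkpoint: Path):
    """Run named steps in order, committing after each one so a pod
    evicted mid-run resumes from the last completed step on restart."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for name, fn in steps:
        if name in done:
            continue  # already committed by an earlier, evicted run
        fn()
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # commit before moving on
    return done
```

The discipline that matters is committing after every step, not at the end: a job that checkpoints only on success has no durability at all.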

How Claude actually gets called

I don’t let Claude into OpenClaw directly. Everything routes through LiteLLM, which I run as a gateway inside the openclaw namespace. That gives me one place to:

  • map a short logical name (smart, fast, reasoner, finance) to whichever underlying model is currently best for that role
  • aggregate spend per agent via Prometheus labels
  • fall back from Claude to a local Gemma 4 if the API is down, without the agent knowing
  • rate-limit each agent individually so one runaway dream session can’t blow through the month’s budget in an hour

A typical Python call inside a worker looks like this:

import litellm
from openclaw.config import MODEL_SMART  # -> "smart" -> routed to Claude

resp = litellm.completion(
    model=MODEL_SMART,
    messages=[
        {"role": "system", "content": SKEPTICAL_QUANT_PROMPT},
        {"role": "user", "content": context_block},
    ],
    tools=HYPOTHESIS_TOOLS,
    tool_choice="auto",
    max_tokens=30000,
    metadata={"agent": "dream-worker", "run_id": run_id},
)

The agent doesn’t know or care that smart is Claude today. If next month I decide Gemini is better for this job, or I want to A/B test a new model, I change one line in the LiteLLM ConfigMap and every OpenClaw worker starts using the replacement on the next scheduled run. No rebuild, no deploy.
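For reference, a sketch of what that ConfigMap mapping can look like in LiteLLM's config format. Model IDs, endpoints, and the fallback wiring below are illustrative, and LiteLLM's config fields change between releases, so treat this as a shape rather than a drop-in file:

```yaml
model_list:
  - model_name: smart                # logical alias the workers call
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514   # illustrative model ID
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-local          # local fallback target
    litellm_params:
      model: ollama/gemma            # illustrative Ollama tag
      api_base: http://mac-studio:11434

router_settings:
  fallbacks:
    - smart: [smart-local]           # if the API is down, route locally
```

Swapping models then really is a one-line change to the `model` field under the `smart` alias, followed by a proxy reload.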

That abstraction has mattered more than I expected. Model prices move quarterly. New capabilities show up in point releases. Having a cheap place to rewire it all has been the difference between “experiments I ship” and “experiments I give up on because the upgrade path is too painful.”

The interrupt bus: Slack + cat-bot personas

Long-running agents are useless if they can’t reach me. But they’re also useless if they reach me too much — I don’t want every 15-minute cluster-patrol run to ping my phone. The constraint I landed on: agents must be actionable or silent. There’s no third category.

To enforce that in practice, every agent has its own Slack persona. Actual names, actual personalities, actual app IDs in Slack. Fluffie runs dream sessions. Vincat runs cluster-patrol. Kat Highwind runs the morning market briefing. Purrith runs the security triage. Sephi-furr-oth is the incident alert-responder. Red Purrteen runs option roulette.

This sounds frivolous until you’re standing in line at the grocery store at 6pm, your phone buzzes, you glance at the notification, and you already know what kind of problem it is based on who sent it:

  • Fluffie posting means the dream worker wants me to review new hypotheses. Not urgent, tonight after dinner is fine.
  • Vincat posting means a pod in the cluster is actually sick and not self-healing. I should look.
  • Sephi-furr-oth posting means a Prometheus alert fired that it couldn’t suppress or auto-remediate. Drop everything.

None of that is Claude’s job, exactly. Claude writes the message body. But the identity layer in front is the thing that lets an agentic platform stay ambient without becoming noise. It’s the homelab version of “who’s on call?” — except the cats are on call, and they only text me when it’s actually my turn to do something.

Under the hood it’s just a per-agent Slack Bot Token, a per-agent channel list, and a shared helper module (openclaw.slack) that enforces house style: every message leads with who’s talking, what happened, and — critically — what the human is supposed to do about it.
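A sketch of the message contract such a helper can enforce. The real openclaw.slack module's API isn't shown in this post, so the function and format here are hypothetical; the point is that the house style is validated in code, not by convention:

```python
def format_alert(persona: str, happened: str, action: str) -> str:
    """House style: who's talking, what happened, what the human
    should do. Empty fields are rejected: actionable or silent."""
    for field in (persona, happened, action):
        if not field.strip():
            raise ValueError("actionable or silent: no empty fields")
    return f"*{persona}*: {happened}\n→ {action}"

msg = format_alert(
    "Vincat",
    "pod api-7f9 is CrashLoopBackOff (8 restarts)",
    "check the node's disk pressure",
)
```

Making the "what should the human do" field mandatory is the cheapest way to keep agents honest: if an agent can't fill it in, it probably shouldn't be posting.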

Where the agent can actually act

The other part most “can Claude talk back?” stories miss is that talking back is barely useful unless the agent can also do something. My cluster-patrol worker is the clearest example.

It runs every 15 minutes, sweeps for unhealthy pods, stale CronJob runs, NotReady nodes, expiring certs, and angry Longhorn volumes. When it finds something, it asks Claude — via the smart alias in LiteLLM — to:

  1. decide whether the issue is self-healing or genuine
  2. if genuine, classify into auto-remediate or human-required
  3. for auto-remediate, return a structured plan: which pods to delete, which stale Jobs to clean up, which remediation runbook to link

The Python harness takes that structured output and actually executes it. Deletes the CrashLoopBackOff pod that’s been stuck for 8 restarts. Cleans up the stale GitLab runner pods that never finished. Then posts a single Slack message as Vincat, summarizing what was found and what it already did. The human doesn’t have to do anything unless there’s something in the human-required bucket.

Most weeks, I only hear from Vincat a handful of times. Most of what goes wrong in the cluster is tedious and fixable, and the pattern of “LLM as the judgment layer, Python as the hands” is what finally made it boring enough to leave alone. Claude’s job isn’t to SSH into the node. Claude’s job is to read the state, make a call, and describe the call in a format the harness can execute.

The rule I’ve landed on: the LLM never touches the cluster directly. It returns a plan. The harness executes the plan. The plan is either boring enough to auto-run, or it’s a message to the human. That separation is what lets me trust the system to operate unattended — the blast radius is bounded by what the harness is willing to do, not by what the LLM dreamed up.
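One way to sketch that boundary, assuming a hypothetical allowlist of verbs the harness knows how to perform. (The real harness talks to the Kubernetes API; returning command strings here keeps the sketch self-contained.)

```python
# The harness only executes verbs it already knows; anything else is
# escalated to a human. Action names are illustrative, not OpenClaw's set.
SAFE_ACTIONS = {
    "delete_pod": lambda target: f"kubectl delete pod {target}",
    "delete_job": lambda target: f"kubectl delete job {target}",
}

def execute_plan(plan):
    """Run allowlisted steps; route unknown verbs to the human bucket."""
    executed, escalate = [], []
    for step in plan:
        handler = SAFE_ACTIONS.get(step.get("action"))
        if handler is None:
            escalate.append(step)  # human-required: unknown verb
        else:
            executed.append(handler(step["target"]))
    return executed, escalate
```

However creative the model gets, the blast radius is exactly the keys of `SAFE_ACTIONS`.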

What Claude is actually good at here

A few things I learned by running this pattern for months:

Claude is excellent at structured-output-plus-narrative. I can ask for “a JSON array of remediation actions, followed by a plain-language summary for Slack”, and it does both in one pass without drift. Smaller local models tend to mangle one or the other when asked to do both. That single-pass efficiency matters when you’re paying per token on a job that runs 96 times a day.

It’s unusually good at knowing when to say nothing. When cluster-patrol asks “what should a groggy operator do about this?” and the answer is “actually, nothing, this will self-heal on the next reconcile” — Claude says that. A lot of models feel compelled to fill the silence with a remediation even when the correct answer is wait and see. For an ambient ops agent, that’s the whole ballgame.

It makes good use of long context. The dream worker prompts can get to 40-60k input tokens — market snapshots, 20 YouTube transcript summaries, a few hundred Reddit posts, prior hypotheses to avoid duplicating. Claude handles that comfortably and the output quality doesn’t visibly degrade at the top of the window the way some models do.

It’s willing to disagree. When I ask an agent to debate itself — one persona arguing bull, another arguing bear, both contributing testable hypotheses into the queue — Claude writes genuinely adversarial bear cases instead of reflexively steelmanning whatever I primed it with. That’s rarer than it should be.

What I’d do differently if I were starting today

A few things that bit me on the way here:

  • Start with LiteLLM on day one. I tried to call Anthropic’s API directly from the first few workers. Ripping that out later once I had 8 workers all hardcoded to one SDK was a miserable afternoon. The abstraction layer costs nothing and buys you everything.
  • Every agent gets a persona before it gets a schedule. I built three workers before I added Slack personas, and retrofitting the cat-bots was annoying because the posting code was scattered. Now the persona is a required field on the agent config and the helper module enforces it.
  • Budgets at the agent level, not the model level. LiteLLM supports both. Per-agent caps — “dream-worker is allowed $2/day, cluster-patrol is allowed $0.20/day” — stops a bug in one agent from emptying the shared budget for the rest.
  • Structured output is not optional for auto-remediation. If the plan comes back as prose, the harness has to parse it, and every parser eventually meets a model output it can’t read. Require JSON. Validate the schema. If validation fails, fall back to human-required and post the raw output to Slack for review.
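A minimal sketch of that last fallback, with an illustrative two-key schema standing in for a real validator:

```python
import json

REQUIRED_KEYS = {"action", "target"}  # illustrative schema

def parse_plan(raw: str):
    """Parse the model's output. Any failure returns None, so the
    caller falls back to human-required and posts the raw text."""
    try:
        plan = json.loads(raw)
        if not isinstance(plan, list):
            return None
        if any(not REQUIRED_KEYS <= set(step) for step in plan):
            return None
        return plan
    except (json.JSONDecodeError, TypeError):
        return None
```

The shape matters more than the library: whether you use a hand-rolled check like this or a schema validator, the failure path must terminate in "show a human the raw output", never in a best-effort guess.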

Try it yourself

If you want to play with Claude for your own long-running agent work, I’ve got a referral link. It gets us both a little credit — no pressure, but if this post was useful, that’s the easiest way to say thanks:

claude.ai via my referral →

The pattern in this post isn’t specific to OpenClaw. If you’ve got any long-running process that needs a judgment layer — a CI pipeline, a data pipeline, a batch job, a scheduled scraper — the same shape applies. Python does the gathering and the acting. The LLM does the judgment. A gateway like LiteLLM lets you swap models without rewriting agents. A notification channel with identity (Slack, Discord, email, doesn’t matter) lets the agent reach you when it actually needs to. And structured output is the contract that lets the harness trust what comes back.

The agents that feel most useful aren’t the chatty ones. They’re the quiet ones that disappear for six hours, do real work, and come back with a three-line summary and a button that says “yes, do that.”

More on the broader OpenClaw platform in future posts. If you want to see the skeleton today, the overall architecture lives in my OpenClaw on k3s write-up.