Tracing and budgeting LLM agents with Langfuse

TL;DR

I run unattended LLM agents on my homelab — they write code, open MRs, generate content, rotate secrets. The problem: they fail silently and bill silently. Langfuse (a tracing platform) logs every LLM call with input/output tokens, latency, and cost. On top of those traces, I built three background monitors that run weekly: a goal-drift detector that compares an agent’s stated objective to what its commits actually did (via embedding similarity), a cost-spike alert that fires at 80% and 100% of a daily budget cap, and an action audit that exports traces and flags sessions where the tool-call sequence diverged from the plan. Together, these let me sleep while autonomous agents handle repetitive work.

Why agents need eyes (and a budget)

The appeal of autonomous agents is obvious: declare an intent (“fix this failing test”), and the agent submits a diff. No back-and-forth, no context-switching. Reality, though, is messier. I’ve written about the trust ladder and what it takes to earn agent autonomy, and a core requirement is observability.

Agents fail silently. A hallucinated completion that passes the wrong test and gets committed anyway. A misunderstood spec that generates a thousand dollars’ worth of incorrect cloud infrastructure. A loop that issues tokens for an hour and then says “I’m done” because the LLM got confused about the exit condition.

And agents spend silently. Every LLM call is a line item. A single agent that runs daily across your repos burns ~$2–20/day depending on model, context size, and token churn. Times five repos times three agents, and you’re at $300/month before you’ve written a single production line. The math is worse if you auto-run these things — there’s no human gate, no monthly invoice that lands in email. The card just gets charged.

I run six agents in CI/CD: mr-review (code review on every MR), pipeline-medic (diagnoses test failures), llm-fix (generates patches), issue-executor (closes issues), evolve-claude-md (tunes system prompts), and a weekly improve-loop that scans for library upgrades. (The improve-loop is described in detail here.) Each one is cheap — I use a local Qwen 32B model (qwen36-coder) for the first pass and escalate to Claude Sonnet on OpenRouter for anything that touches infrastructure — but they run constantly. Without observability, I’d be flying blind.

Langfuse is the observability layer. Every call gets logged with tokens, latency, user ID, custom tags, and structured traces. The interface is a set of dashboards: costs per feature, costs per day, trace latency p95. The real power is that traces are queryable: I can filter by tag, reconstruct the call order, and export the raw data to build monitors on top.

How traces work — and why they matter

A trace is a tree of LLM calls and tool uses from a single user session. Here’s a real one from a pipeline-medic run:

Session ID: abc123def456
Start: 2026-06-12 10:14:23 UTC
Initial objective: "diagnose why test:unit failed"

  Trace [1] LLM Call (extraction stage)
    Input: "Here are the test logs..."
    Tokens in: 48,192 | out: 1,247 | cost: $0.087
    Latency: 3.2s
    Model: qwen36-coder
    Metadata: {"feature": "pipeline-medic", "stage": "extraction"}

    Trace [1.1] Tool Call → logs parsed
      Tool: parse_failure_logs
      Result: "AssertionError in test_redis_connection"

  Trace [2] LLM Call (diagnosis stage)
    Input: "The test failed with: AssertionError in test_redis_connection..."
    Tokens in: 31,045 | out: 2,118 | cost: $0.063
    Latency: 2.8s
    Model: qwen36-coder

Total cost: $0.150
Total latency: 6.0s

Each level — session, trace, span — is a unit of observability. Sessions aggregate cost and latency per agent per day. Traces show the call tree for a single request. And spans (the individual LLM calls + tool invocations) are where you see exactly what happened: which model, which prompt, which context size, how many tokens the LLM burned.
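To make the hierarchy concrete, here is a minimal sketch of how cost and latency roll up from spans to traces to sessions. The class names and fields are my own illustration, not Langfuse's data model; the numbers are the ones from the pipeline-medic run above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One LLM call or tool invocation (illustrative, not Langfuse's schema)."""
    name: str
    cost_usd: float = 0.0
    latency_s: float = 0.0


@dataclass
class Trace:
    """The call tree for a single request."""
    name: str
    spans: list = field(default_factory=list)

    def cost_usd(self):
        return sum(span.cost_usd for span in self.spans)

    def latency_s(self):
        return sum(span.latency_s for span in self.spans)


@dataclass
class Session:
    """All traces produced by one agent run."""
    session_id: str
    traces: list = field(default_factory=list)

    def cost_usd(self):
        return sum(trace.cost_usd() for trace in self.traces)

    def latency_s(self):
        return sum(trace.latency_s() for trace in self.traces)


session = Session("abc123def456", [
    Trace("extraction", [Span("llm-call", 0.087, 3.2), Span("parse_failure_logs")]),
    Trace("diagnosis", [Span("llm-call", 0.063, 2.8)]),
])
print(f"${session.cost_usd():.3f} over {session.latency_s():.1f}s")  # $0.150 over 6.0s
```

Tool calls carry no token cost here, so only the two LLM calls contribute to the session total.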

The cost tracking is the key. Langfuse multiplies input/output token counts by the model’s per-token price. A qwen36-coder call at 50K input tokens costs ~$0.15; a Claude Sonnet call at the same size is ~$3. Run 30 of those a day, and you’re at $90. Scale to 10 repos, and it’s $900. You need to know this before you hit the bill.
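The arithmetic behind that is simple enough to sketch. The prices below are placeholders I made up for illustration, not the actual rates for any model; Langfuse applies whatever per-token prices are configured for the model in question.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one LLM call in dollars, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def daily_cost(cost_per_call, calls_per_day, repos=1):
    """Project a per-call cost to a daily total across repos."""
    return cost_per_call * calls_per_day * repos


# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
per_call = call_cost(50_000, 2_000, input_price_per_m=1.0, output_price_per_m=4.0)
per_day = daily_cost(per_call, calls_per_day=30, repos=10)
print(f"per call: ${per_call:.3f}, per day: ${per_day:.2f}")
```

Input tokens dominate for agents that stuff logs or diffs into the prompt, which is why context size matters more than output length in the daily total.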

Building the monitors — cost, drift, and action audits

Knowing the cost per call is table stakes. The real win is treating Langfuse traces as a data source for operational sanity checks. I built three cron jobs that fire weekly and pull from Langfuse:

1. Cost-spike detector

This one is the simplest: a Mattermost notification at 80% and 100% of a daily budget.

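# Sketch: weekly cost monitor.
# fetch_daily_cost(day) and notify(text) are hypothetical helpers: the first
# sums trace costs for one day from Langfuse, the second posts to Mattermost.
from datetime import date, timedelta

DAILY_CAP = 10.0  # dollars

def check_daily_cost(fetch_daily_cost, notify, days=7):
    for offset in range(days):
        day = date.today() - timedelta(days=offset)
        cost = fetch_daily_cost(day)
        if cost >= DAILY_CAP:
            notify(f"🚨 COST SPIKE on {day}: ${cost:.2f} (at or over cap!)")
        elif cost >= DAILY_CAP * 0.80:
            notify(f"⚠️ Cost on {day}: ${cost:.2f} (over 80% of cap)")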

The homelab LiteLLM gateway has a server-side hard limit of $10/day for the CI virtual key. If we hit it, all LLM calls fail gracefully (they return an error, scripts exit cleanly, no tokens wasted). But I want to know we’re approaching the limit before the wall hits. Eighty percent is the warning; one hundred percent is the page.

2. Goal-drift detector

This is the more sophisticated one. The idea: an agent has an objective (stated at the beginning of the session), and its actions (the commits, MRs, code it actually wrote). Those should correlate. If I asked the agent to “add a new test,” and instead it rewrote half the module and deleted three existing tests, I want a flag.

Langfuse traces include the session’s initial objective (I log it when the agent starts). The drift detector pulls it, then fetches the MR’s final commit message and runs a cosine similarity check between the embeddings:

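# Sketch: fetch_sessions, get_final_commit_message, embed, and alert are
# hypothetical helpers passed in by the caller.
import numpy as np

DRIFT_THRESHOLD = 0.6

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_goal_drift(fetch_sessions, get_final_commit_message, embed, alert):
    for session in fetch_sessions(days=7):
        objective = session.metadata.get("objective")
        mr_iid = session.metadata.get("mr_iid")
        if not objective or not mr_iid:
            continue  # untagged session: nothing to compare

        commit_msg = get_final_commit_message(mr_iid)  # final commit on the MR
        similarity = cosine_similarity(embed(objective), embed(commit_msg))

        if similarity < DRIFT_THRESHOLD:
            alert(
                f"Goal drift detected:\n"
                f"Stated: {objective}\n"
                f"Did: {commit_msg}\n"
                f"Similarity: {similarity:.2f}"
            )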

Cosine similarity ranges from -1 to 1; for text embeddings the scores usually land between roughly 0 (unrelated) and 1 (same meaning). A threshold of 0.6 is loose enough to handle rephrasings and tight enough to catch true drift. I’ve seen this catch two failure modes:

Scope creep: Agent was asked to “bump the dependency version” and ended up refactoring the module. Similarity: 0.45 → alert.
Hallucinated completion: Agent thought it was done when it actually wasn’t, then wrote a misleading commit message. Similarity: 0.38 → alert.

Once the alert fires, I review the session in Langfuse (click the trace, see the entire call tree and tool uses) and decide whether the agent overstepped or the initial objective was just poorly worded.
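To make the threshold concrete, here is a dependency-free sketch of the similarity check on toy vectors. Real embeddings have hundreds of dimensions; the two-dimensional vectors and the is_drift helper are only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_drift(similarity, threshold=0.6):
    """True when the objective and the commit message look unrelated."""
    return similarity < threshold


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])       # same direction
close = cosine_similarity([3.0, 4.0], [4.0, 3.0])      # similar direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # opposite direction

for label, score in [("same", same), ("close", close),
                     ("unrelated", unrelated), ("opposite", opposite)]:
    print(f"{label}: {score:.2f} drift={is_drift(score)}")
```

Only the direction of the vectors matters, not their length, so a long objective and a terse commit message can still score high if they point the same way.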

3. Action audit — weekly trace export

The third monitor is a weekly cron that exports all Langfuse traces and flags sessions where the plan diverged from the execution. This is more nuanced than drift. A session might be on-goal but wasteful — it issued 15 tool calls when 3 would have sufficed, or it re-queried the same data three times.

# Sketch: fetch_sessions, extract_plan, has_repeated_calls, and has_backtracking
# are hypothetical helpers passed in by the caller.
import logging

log = logging.getLogger("action-audit")

def audit_action_sequences(fetch_sessions, extract_plan,
                           has_repeated_calls, has_backtracking):
    for session in fetch_sessions(days=7, min_tool_calls=5):
        traces = session.traces

        # Reconstruct the agent's call sequence
        plan = extract_plan(traces[0])
        actions = [t.tool_name for t in traces if t.is_tool_call]

        # Check for divergence: repeated tool calls, backtracking, etc.
        if has_repeated_calls(actions):
            log.warning("Session %s: repeated tool calls detected: %s", session.id, actions)
        if has_backtracking(plan, actions):
            log.warning("Session %s: plan/action mismatch: plan=%s, actions=%s",
                        session.id, plan, actions)

“Repeated tool calls” looks for calls to the same tool within a short time window — a sign the agent got confused or didn’t parse the result. “Backtracking” is harder to define, but it’s usually a tool call that contradicts the prior result (e.g., querying for the status of a file that was just deleted, or reading a key that was just written).
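Both checks can be sketched in a few lines. This is one possible reading of the two definitions, not the exact logic from my audit job: the tool names (delete_file, read_file, stat_file, write_file), the 30-second window, and the timestamped input format are all assumptions for the example.

```python
def has_repeated_calls(calls, window_s=30.0):
    """calls: (tool_name, timestamp_seconds) pairs sorted by time.

    True if the same tool is called again within window_s seconds.
    """
    last_seen = {}
    for name, ts in calls:
        if name in last_seen and ts - last_seen[name] <= window_s:
            return True
        last_seen[name] = ts
    return False


def reads_after_delete(events):
    """events: (tool_name, target) pairs in call order.

    True if a target is read or stat'ed after being deleted and before
    being written again, which is one narrow form of backtracking.
    """
    deleted = set()
    for tool, target in events:
        if tool == "delete_file":
            deleted.add(target)
        elif tool == "write_file":
            deleted.discard(target)
        elif tool in ("read_file", "stat_file") and target in deleted:
            return True
    return False


calls = [("parse_failure_logs", 0.0), ("read_file", 5.0), ("parse_failure_logs", 12.0)]
events = [("delete_file", "tmp/out.log"), ("stat_file", "tmp/out.log")]
print(has_repeated_calls(calls), reads_after_delete(events))
```

A time window produces fewer false alarms than a plain duplicate check, because legitimate workflows often call the same tool twice with minutes of other work in between.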

These get logged to a weekly report that I review Thursday mornings. Usually they’re noise — legitimate retries after transient API failures. But once a month, there’s a real bug: an agent got stuck in a loop, or misunderstood a tool’s semantics, or hit an edge case in its logic. The audit catches those before they become bigger problems.

Hooking up the traces — the agent side

All of this hinges on actually logging the traces. Langfuse SDKs exist for Python and JavaScript. For my CI agents, I use the Python SDK:

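import os

from langfuse import Langfuse
# Langfuse's drop-in OpenAI wrapper (this sketch assumes the v2-style Python SDK)
from langfuse.openai import AsyncOpenAI

# Initialize at script start
langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    host="https://langfuse.k3s.internal.zolty.systems"  # internal self-hosted
)

# Wrapped LLM client: calls made through it are logged as generations.
# The wrapper reads its own settings from the LANGFUSE_* environment variables,
# so LANGFUSE_HOST should point at the same instance.
client = AsyncOpenAI(api_key=os.environ["LITELLM_API_KEY"], base_url="https://litellm.k3s.internal.zolty.systems/v1")

async def run_pipeline_medic(messages, test_output_snippet):
    # Create a trace for this session
    trace = langfuse.trace(
        name="pipeline-medic",
        input={"logs": test_output_snippet},
        metadata={
            "feature": "pipeline-medic",
            "mr_iid": os.environ.get("CI_MERGE_REQUEST_IID"),
            "objective": "diagnose test failure",
            "repo": os.environ["CI_PROJECT_PATH"]
        }
    )

    # Make the LLM call; the wrapper logs it under the trace
    response = await client.chat.completions.create(
        model="qwen36-coder",
        messages=messages,
        trace_id=trace.id  # tie this call to the session
    )

    trace.update(output=response.choices[0].message.content)
    langfuse.flush()  # send queued events before the script exits
    return response

The key bits:

Langfuse() client pointed at the self-hosted instance (internal k3s ingress).
langfuse.openai wrapper in place of the plain OpenAI client: this is what logs each LLM call. The plain client rejects unknown keyword arguments, so the trace ID can only be passed through the wrapper.
trace() with metadata: tags this session with feature name, MR ID, objective, repo.
trace_id on the LLM call: ties the call to the parent trace.
flush(): sends any queued events to the server before the script exits. The SDK otherwise batches writes in the background, so the LLM calls themselves are not blocked.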

That’s it. Once that’s in place, every LLM call in the script shows up in the Langfuse dashboard, and the cron jobs can query them.

Self-hosting Langfuse is optional. Langfuse Cloud exists and handles the same data — you’d just point host at https://cloud.langfuse.com and authenticate with your cloud API key. For a homelab, self-hosting makes sense: one more service on k3s, data stays internal, and the cost is just the compute (negligible).

Caveats — not magic, just better

I’m not going to pretend this is a silver bullet.

Instrumentation has overhead. Every trace write is a network call (even if async). A script that’s already running 30 API calls will add a few more. Langfuse SDKs are designed to be fast and non-blocking — they batch writes, retry on failure, and fail open (if Langfuse is down, the agent still runs) — but there’s still a cost. For CI agents running every few minutes, it’s fine. For high-frequency inference, you’d want to sample.
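If you do need to sample, one simple approach is to decide per session rather than per call, so a sampled session keeps its full call tree. This is a sketch of that idea, not a Langfuse feature; should_trace and the 10% rate are my own assumptions.

```python
import hashlib


def should_trace(session_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly sample_rate of all sessions.

    Hashing the session ID means every call in a session gets the same
    decision, and a rerun of the same session is sampled the same way.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


kept = sum(should_trace(f"session-{i}", 0.10) for i in range(10_000))
print(f"kept {kept} of 10000 sessions")
```

Sampling trades away exact cost totals, so it only makes sense where the gateway or provider invoice remains the source of truth for spend.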

Cosine similarity is a heuristic, not ground truth. If the agent completely misunderstood the spec and wrote perfect code for the wrong thing, goal-drift might not catch it. The detector is looking for linguistic correlation between the objective and the commit message — if the commit message is a lie, the detector is fooled. It’s a guardrail, not a guarantee.

Cost attribution is only as good as your tagging. If every trace is tagged with {"feature": "unnamed"}, the cost report is useless. I have to remember to include feature, repo, mr_iid, objective in every trace’s metadata. Lapses in discipline → blind spots in cost.
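A cheap way to enforce that discipline is to validate the metadata before creating the trace. The sketch below is an assumption about how such a check could look; the helper name and the choice to treat empty values as missing are mine.

```python
REQUIRED_KEYS = ("feature", "repo", "mr_iid", "objective")


def missing_metadata(metadata):
    """Return the required keys that are absent or empty in a trace's metadata."""
    return [key for key in REQUIRED_KEYS if not metadata.get(key)]


good = {
    "feature": "pipeline-medic",
    "repo": "group/project",
    "mr_iid": "42",
    "objective": "diagnose test failure",
}
bad = {"feature": "pipeline-medic", "mr_iid": None}

print(missing_metadata(good))  # []
print(missing_metadata(bad))   # ['repo', 'mr_iid', 'objective']
```

Whether a gap should abort the agent or just log a warning is a judgment call; for pipelines that run outside an MR, mr_iid will legitimately be empty.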

Don’t confuse “traced” with “safe”. Observability shows you what happened. It doesn’t prevent bad things. Langfuse finds the drift after the MR is written. To actually stop bad commits, you need gates: tests, linters, regex checks on diffs for destructive commands (my settings.json has a pre-tool-use hook that blocks kubectl delete unless the command is in an allowlist). Traces are the lens; gates are the lock.
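As an illustration of what such a gate can look like, here is a minimal allowlist check for destructive commands. It is not the actual hook from my settings.json; the regex and the allowlisted command are hypothetical.

```python
import re

# Commands treated as destructive unless explicitly allowlisted.
DESTRUCTIVE = re.compile(r"\bkubectl\s+delete\b")

# Hypothetical allowlist entry, matched after whitespace normalization.
ALLOWLIST = {"kubectl delete pod -l app=ci-runner"}


def is_blocked(command: str) -> bool:
    """True if the command is destructive and not on the allowlist."""
    normalized = " ".join(command.split())
    if normalized in ALLOWLIST:
        return False
    return bool(DESTRUCTIVE.search(normalized))


for cmd in ("kubectl get pods",
            "kubectl delete namespace prod",
            "kubectl  delete pod -l app=ci-runner"):
    print(f"{cmd!r}: blocked={is_blocked(cmd)}")
```

Exact-match allowlisting is deliberately strict: a command that merely starts with an allowed prefix is still blocked, which errs on the side of a human looking at it.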

The honest caveats

None of this is free from operational burden.

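One more service to run. Langfuse is a Node.js (Next.js) application backed by PostgreSQL, and recent major versions also require ClickHouse, Redis, and object storage. Self-hosted adds ~2GB RAM and a few CPU cores. If you’re running Cloud, it’s $10–50/mo depending on trace volume. Manageable, but real.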
Alerting is manual setup. Mattermost webhooks, weekly cron jobs, embedding API calls (I use local embeddings via a small model, but you could use OpenAI’s API). The infrastructure is simple, but it’s on me to write and maintain it.
Drift detection has false positives. An agent that does exactly what you asked but phrases the commit message differently will trigger the detector. I get ~1–2 false positives per week (out of ~50 traces), which is acceptable — better to over-alert than under-alert on autonomy concerns.
Traces only matter if you read them. A dashboard full of data is useless if no one looks. I’ve built the habit of reviewing the weekly audit report and spot-checking any drift alerts within 24 hours. If I fell out of that habit, I’d be back to flying blind.

Lessons

Observe before trusting. You can’t tune what you don’t measure. Langfuse is the measurement. Cost spikes and goal drift are the tuning levers.
Cost caps are non-negotiable for unattended agents. A $10/day hard limit prevents runaway spend. It’s loose enough that normal operations don’t hit it; tight enough that bugs get expensive before they get catastrophic.
Goal-drift is the proxy for “did the agent do what I asked.” Embedding similarity is imperfect, but it’s better than nothing. And it costs ~$0.001 to run (local model) and takes 100ms.
Action audits catch patterns, not individual failures. You won’t spot every bug in the traces, but repeated-tool-call patterns, backtracking, and long tool-call chains are real signals of confusion. Log them and review weekly.
Self-hosted Langfuse fits the homelab. You already have a k3s cluster; running another service is free in marginal terms. And the data stays on your network, which is worth the operational lift if you care about privacy.

Running agents on a budget? Start with cloud Langfuse and the OpenAI SDK — no self-hosted infra needed, and the traces still give you per-call cost and drift detection. DigitalOcean’s Kubernetes service plus a managed database can run Langfuse self-hosted with the same pattern — the key is getting any observability in place before autonomy becomes a liability.
