TL;DR

AI agents fail when they don’t know what you know. I built a Slack bot that conducts structured 5-layer interviews to extract tacit knowledge — operating rhythms, decision criteria, dependencies, friction points, leverage opportunities — and generates soul.md, user.md, and heartbeat.md config files for provisioning agents. The interview surfaces ~30% more actionable context than documentation alone. Full source code below.

The Problem Nobody’s Talking About

Nate B. Jones has a video that nails the core issue with AI agents: they fail because they lack tacit knowledge. Not the stuff in your docs — the stuff in your head. The 20-year veteran who just knows that the staging deploy takes longer on Thursdays because the batch job runs. The designer who can feel when a color palette is wrong without being able to articulate why.

Michael Polanyi formalized this in the 1960s: “We know more than we can tell.”

I’ve been building AI agents for my homelab for months — trading bots, alert responders, security triagers, dream workers. Every one of them needed the same thing: my operating context. Not the architecture docs (those exist), but the operational knowledge that lives in my head:

  • When do I actually check things? (Not what the calendar says — what really happens.)
  • What signals tell me something is “good enough” vs. needs more work?
  • What are my unwritten rules? (Never interrupt between 11pm and 1am. Don’t make it easier after failure. Match the energy, don’t lead it.)
  • What would I delegate if I could trust the delegation?

You can write this down manually. I have. My CLAUDE.md files are extensive. But there’s a category of knowledge that only surfaces when someone asks the right follow-up question — “You mentioned you adjust the Elo rating manually. What specifically triggers that?” That’s the stuff that makes the difference between a useful agent and one that produces plausible-sounding but contextually wrong output.

The 5-Layer Interview Framework

The interviewer conducts a structured conversation across five layers, each targeting a different type of tacit knowledge:

Layer   Name                  What It Extracts
1       Operating Rhythms     Daily/weekly/monthly patterns — the real schedule, not the aspirational one
2       Recurring Decisions   Judgment calls, heuristics, “I just know” criteria
3       Dependencies          Who provides what, blockers, handoff points
4       Friction Points       Time sinks, context-switch costs, recurring fires
5       Leverage Points       Highest-ROI delegation opportunities

Each layer has 5 primary questions. After each answer, the bot generates a contextual follow-up that digs deeper into something specific the interviewee mentioned. That’s 25 primary questions + up to 25 follow-ups = ~50 exchanges per interview.
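The structure is small enough to sketch directly. A hypothetical representation — the layer names come from the framework table, but the code shape is my assumption, not the bot’s actual data structure:

```python
# Hypothetical sketch of the interview structure. Layer names match the
# framework table; everything else is illustrative.
LAYER_NAMES = [
    "Operating Rhythms",
    "Recurring Decisions",
    "Dependencies",
    "Friction Points",
    "Leverage Points",
]
QUESTIONS_PER_LAYER = 5

def max_exchanges(num_layers=len(LAYER_NAMES),
                  per_layer=QUESTIONS_PER_LAYER,
                  follow_ups_per_question=1):
    """Upper bound on exchanges: every primary question gets one follow-up."""
    primary = num_layers * per_layer
    return primary + primary * follow_ups_per_question
```

With the defaults, `max_exchanges()` gives the ~50 exchanges quoted above; with follow-ups disabled it drops to 25.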

The follow-ups are the key innovation. A static questionnaire gives you surface-level answers. A conversational follow-up that says “You mentioned X — can you walk me through exactly what happens when X?” extracts the operational detail that docs never capture.

How It Works

The bot is a single Python file (~870 lines) that runs as a Kubernetes Deployment polling a Slack channel every 15 seconds. The architecture:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Slack API   │────▶│  Interviewer │────▶│  PostgreSQL  │
│  (polling)   │◀────│  (Python)    │────▶│  (knowledge) │
└──────────────┘     └──────┬───────┘     └──────────────┘
                     ┌──────▼───────┐
                     │   LiteLLM    │
                     │  (gemma4/    │
                     │   scout)     │
                     └──────────────┘

Trigger: User posts “interview me” in the designated Slack channel.

State machine: The bot tracks interview state in PostgreSQL — current layer, current question, pending answers, follow-up status. Each exchange (question + answer + optional follow-up) is stored with timestamps for later analysis.
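The per-session state boils down to a small transition function. This is a sketch of how the transitions could work given the description above (layer index, question index, and whether a follow-up is pending) — not the bot’s actual schema:

```python
# Sketch of the interview state machine. An answer either triggers a
# follow-up (after a primary question) or advances to the next question,
# the next layer, or completion.
NUM_LAYERS = 5
QUESTIONS_PER_LAYER = 5

def advance(layer, question, phase):
    """Return the next state after an answer arrives.

    phase is what the just-received answer responded to:
    'primary'   -> ask a contextual follow-up next
    'follow_up' -> move on to the next primary question
    """
    if phase == "primary":
        return layer, question, "follow_up"
    if question + 1 < QUESTIONS_PER_LAYER:
        return layer, question + 1, "primary"
    if layer + 1 < NUM_LAYERS:
        return layer + 1, 0, "primary"
    return layer, question, "done"
```

Keeping the transition pure like this makes it trivial to persist each step to PostgreSQL and resume an interview after a restart.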

Two-model approach:

  • scout (small/fast) handles the interview conversation — asking questions and generating follow-ups. It’s good at this because the questions are pre-defined and follow-ups just need to be contextually relevant.
  • gemma4 (26B MoE) handles synthesis — distilling 50 exchanges into structured knowledge nuggets and generating the config files. This is where model quality matters.

Output: Three markdown config files (soul.md, user.md, heartbeat.md) plus a database of individual knowledge nuggets tagged by layer and category.
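Persisting the three files is the simple part. A sketch — the dict keys match the JSON keys the synthesis prompt asks for, but the path handling is illustrative:

```python
from pathlib import Path

# Sketch: write the synthesized configs to disk. Key names match the
# synthesis prompt's JSON keys; the output directory is illustrative.
FILENAMES = {
    "soul_md": "soul.md",
    "user_md": "user.md",
    "heartbeat_md": "heartbeat.md",
}

def write_config_files(configs, out_dir="."):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for key, filename in FILENAMES.items():
        (out / filename).write_text(configs[key], encoding="utf-8")
```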

Key Code: The Interview State Machine

The core loop polls Slack, detects triggers, and advances the interview state:

while not _shutdown:
    try:
        messages = slack_get_channel_messages(SLACK_CHANNEL, oldest=last_channel_check)
        for msg in messages:
            # Normalize the trigger text: lowercase, strip the bot @-mention,
            # and drop the "*Sent using*" suffix some Slack clients append.
            raw_text = msg.get("text", "").lower().strip()
            text = re.sub(r"<@[a-z0-9]+>\s*", "", raw_text).strip()
            text = re.sub(r"\s*\*sent using\*.*$", "", text).strip()
            if text in ("interview me", "start interview", "begin interview"):
                existing = get_active_session(conn, SLACK_CHANNEL)
                if existing:
                    slack_post(SLACK_CHANNEL, "⚠️ Active interview exists. Finish or pause first.")
                else:
                    start_interview(conn, SLACK_CHANNEL)
        last_channel_check = str(time.time())

        active = get_active_session(conn, SLACK_CHANNEL)
        if active:
            process_session(conn, active)
    except Exception:
        log.exception("Error in main loop")
    time.sleep(POLL_INTERVAL)

Key Code: Follow-Up Generation

The follow-up questions are what make this more than a questionnaire. After each answer, the bot calls the LLM to generate a probing follow-up:

def generate_follow_up(question, answer, layer_name):
    system = (
        "You are conducting a domain knowledge interview. The person just answered "
        "a question. Generate ONE concise follow-up question that digs deeper into "
        "something specific they mentioned. Focus on extracting operational details, "
        "decision criteria, or unwritten rules. Do NOT repeat what they said — probe "
        "the part they probably haven't fully articulated yet."
    )
    user = f"Layer: {layer_name}\nQuestion: {question}\nTheir answer: {answer}\n\nFollow-up:"
    return llm_call(system, user, model=MODEL, max_tokens=200, temperature=0.4)
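generate_follow_up leans on a shared llm_call helper that isn’t shown above. Here’s a minimal stand-in, assuming LiteLLM’s OpenAI-compatible /v1/chat/completions endpoint — the bot’s real helper may differ:

```python
import json
import os
import urllib.request

LITELLM_URL = os.environ.get(
    "LITELLM_URL", "http://litellm.openclaw.svc.cluster.local:4000"
)

def llm_call(system, user, model, max_tokens=500, temperature=0.4):
    """Minimal stand-in for the bot's llm_call helper, assuming LiteLLM's
    OpenAI-compatible chat completions endpoint."""
    payload = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{LITELLM_URL}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Because LiteLLM speaks the OpenAI wire format, the same helper works unchanged if you point it at a hosted provider instead of local models.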

Key Code: Config File Synthesis

After all 5 layers complete, the bot feeds the accumulated knowledge to the synthesis model:

def generate_config_files(all_knowledge):
    system = (
        "You are an expert at configuring AI agents. Given extracted knowledge, "
        "generate three markdown configuration files.\n\n"
        "Return ONLY a valid JSON object with three keys:\n"
        '- "soul_md": Agent role, tone, boundaries, decision framework\n'
        '- "user_md": Human profile — preferences, schedule, communication style\n'
        '- "heartbeat_md": Periodic checklist for the agent\n\n'
        "Be specific and actionable — every line should reflect the interview data."
    )
    raw = llm_call(system, f"Extracted knowledge:\n{json.dumps(all_knowledge, indent=2)}",
                    model=SYNTHESIS_MODEL, max_tokens=6000, temperature=0.3)
    return parse_json_response(raw)
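Asking for “ONLY a valid JSON object” doesn’t guarantee you get one — models love markdown fences and preambles. The function name parse_json_response comes from the code above; this body is my assumption about one defensive way to implement it:

```python
import json
import re

def parse_json_response(raw):
    """Defensively parse JSON out of an LLM reply that may be wrapped in
    markdown fences or surrounded by prose. Illustrative implementation."""
    text = raw.strip()
    # Strip a ```json ... ``` fence if the model added one.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # Fall back to the outermost {...} span if prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return json.loads(text)
```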

What I Actually Learned: Baseline vs. Interview

To test the interviewer, I ran a controlled experiment. I had Claude generate config files for a vet-academy tutoring agent (a project I’m building for my kid) two ways:

  1. Baseline: Generated solely from documentation — the 1,200-line planning doc and two memory files
  2. Interview: I answered all 25 questions + follow-ups, then had gemma4 synthesize configs from the raw answers

What the docs captured well

The baseline files were comprehensive on architecture: Elo rating system parameters (K=32, 65% target), scaffolding levels, session structure, curriculum mapping, tech stack, cost model. Documentation is great for what the system does.

What only the interview surfaced

  • Tone editing is 30% of content authoring time — the specific patterns that drift (“Great job!” → should be “The fox settled down”). Docs describe the desired tone; the interview revealed the failure modes.
  • The “Ghibli immersion trigger” — watching 30 seconds of Mononoke to re-enter the creative headspace. A tacit creative process, never documented.
  • Co-parenting as a scheduling constraint — session continuity across two households. Personal logistics, not a design-doc topic.
  • “Adjust the system, not the content” — max 1 change per day, never mid-session. An operational rule that emerged from experience, not planning.
  • Content authoring template: 7 steps, 1-2 hours per case file, with specific time breakdowns. Process knowledge that’s muscle memory.
  • “The platform’s biggest risk is me, not the technology” — perfectionism and scope-creep patterns. Self-awareness that wouldn’t appear in a planning doc.

The interview captured the operating context — how the builder actually works, what breaks, what they’d delegate, what they protect. The docs captured the system design. Both are needed; neither is sufficient.

Example: Blog Agent Config Files

To show this isn’t just a one-off, I ran the same exercise for a different domain: a blog writing agent for this site. I generated baseline configs from my blog documentation, then simulated the interview with gemma4 answering as the blog operator.

Baseline soul.md (from docs)

The documentation-derived config nailed the mechanics: the zolty voice rules, structural skeleton (TL;DR → Motivation → Implementation → Results), anti-patterns (“just”, “simply”, “In today’s fast-paced world”), affiliate integration, privacy rules. Everything a blog agent needs to follow the rules.

What the interview added

When gemma4 answered as the blog operator, it produced answers with genuine operational texture. On the daily schedule:

“My calendar is a lie. It’s a collection of aspirational blocks designed to make me feel like I have my life together.”

The synthesis captured patterns the docs never articulated:

Writing process insights:

  • “The Trigger” — writing begins when a failure is too significant to forget (“the ‘I’ll forget this by Tuesday’ signal”)
  • Three named energy windows: “The Janitor Shift” (7-10am audit), “Dad Mode” (daytime), “The Creative Peak” (11pm-1am)
  • Publishing is reactive, not scheduled — “there is no content calendar”

Named concepts the baseline missed:

  • “The Family Override” — a hard-coded, non-negotiable interrupt. Family needs terminate the session instantly.
  • “The Copy-Paste Test” — every command must be functional. A “command not found” error ruins credibility.
  • “The Pain-to-Utility Ratio” — content must be driven by actual technical struggle, not synthetic tutorials

Decision criteria from the interview:

  • The “Post or Commit” triage — whenever I fix something, an automatic mental filter runs: “Is this just a typo, or is there a story here?”
  • Topics get killed by “The Shiny Object” filter — don’t write about trending tools unless they impact the actual stack
  • The quality test: “would I be embarrassed if a senior engineer read this at 9 AM on a Monday?”

The combined config merges both: documentation provides the editorial rules and quality gates, while the interview provides the operational rhythm, named heuristics, and tacit judgment criteria. The baseline tells the agent what to do. The interview tells it how to think.

Deploying the Interviewer

The full stack:

# Kubernetes Deployment + ConfigMap (abbreviated)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-interviewer
  namespace: openclaw
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: interviewer
          image: python:3.12-slim
          env:
            - name: LITELLM_URL
              value: "http://litellm.openclaw.svc.cluster.local:4000"
            - name: INTERVIEW_MODEL
              value: "scout"
            - name: SYNTHESIS_MODEL
              value: "gemma4"
            - name: PYTHONUNBUFFERED
              value: "1"
          volumeMounts:
            - name: script
              mountPath: /script
      volumes:
        - name: script
          configMap:
            name: openclaw-interviewer-script

Requirements:

  • A Slack bot token with channels:history, channels:read, chat:write scopes
  • PostgreSQL for interview state and knowledge storage
  • An LLM endpoint (I use LiteLLM proxying to local Ollama models)
  • A Kubernetes cluster (or just run the Python script directly)

Pro tip: use Apple dictation

Typing 400-word answers to 50 questions is tedious enough to kill the interview before it finishes. Apple’s built-in dictation (System Settings → Keyboard → Dictation, or double-tap Fn) is surprisingly good for this. Enable “Enhanced Dictation” for offline processing, then just talk through your answer naturally in Slack. The bot doesn’t care how you typed it — and spoken answers tend to be more honest and less polished than typed ones, which is exactly what you want.

Bugs I Fixed During Testing

Three bugs surfaced when I first deployed:

  1. Slack MCP suffix: Claude Code appends *Sent using* Claude to messages. The bot’s exact-match on "interview me" failed. Fix: strip the suffix with regex.

  2. Case-sensitive regex after .lower(): The mention-stripping regex used [A-Z0-9]+, but the text had already been lowercased, so <@u0akdtfntny> never matched. Fix: [a-z0-9]+ (or compile with re.IGNORECASE).

  3. Python stdout buffering: In a container, sys.stdout is fully buffered (not line-buffered). The bot appeared to produce no logs after initialization. Fix: PYTHONUNBUFFERED=1 env var.

All three are the kind of thing that makes you feel stupid in hindsight. All three were invisible until real traffic hit the bot.

Lessons Learned

The interview questions matter more than the synthesis model. The bot’s questions are pre-defined and battle-tested — they consistently extract useful knowledge regardless of who’s being interviewed. The synthesis step (turning answers into config files) is where model quality matters, but a bad model with good questions still produces useful raw data. A good model with bad questions produces nothing.

Follow-ups are 70% of the value. The primary questions get surface-level answers. The follow-ups (“You mentioned X — walk me through what happens when…”) extract the specific examples, numbers, and heuristics that make config files actionable.

Two-stage synthesis is lossy. My initial pipeline extracted “knowledge nuggets” per-layer (small model), then generated configs from the nuggets (also small model). This double-compression threw away most of the detail. The fix: feed raw interview exchanges directly to the synthesis model. Skip the nugget extraction step.

Documentation + interview > either alone. Docs capture architecture, specs, and rules. Interviews capture process, judgment, and context. The ideal agent config is a merge: docs for the “what,” interview for the “how” and “why.”

What’s Next

The interviewer currently produces static config files. The next step is making it a continuous loop:

  • Re-interview quarterly to capture how operating patterns have evolved
  • Diff the new configs against the old ones to surface drift
  • Auto-update agent configurations when the delta exceeds a threshold
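The diff-and-threshold step can be as simple as a line-level similarity ratio. A sketch with the stdlib difflib — the 25% threshold is an illustrative placeholder, not a tuned value:

```python
import difflib

def config_drift(old_text, new_text):
    """Fraction of the two configs that differs, line by line."""
    matcher = difflib.SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    )
    return 1.0 - matcher.ratio()

def needs_update(old_text, new_text, threshold=0.25):
    """True when drift between re-interviews exceeds the threshold."""
    return config_drift(old_text, new_text) > threshold
```

A ratio is crude — it treats a reworded rule the same as a new one — but it’s enough to flag which of the three files deserves a human look each quarter.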

The knowledge nuggets are stored in PostgreSQL. Any agent in the cluster can query them:

SELECT insight, category, confidence
FROM interviews.knowledge
WHERE session_id = '<latest>'
AND category = 'Operating Rhythms'
ORDER BY confidence DESC;

The full source code is in the home_k3s_cluster repo — ConfigMap with inline Python, Deployment, and init SQL.


Don’t have a homelab? You can run the same pattern on any machine with Python and a Slack bot token. Swap LiteLLM for direct API calls to Claude or OpenAI. The interview framework is model-agnostic — the questions are the hard part, not the infrastructure. A DigitalOcean Kubernetes cluster is a good starting point if you want the full k8s setup.