Ai | zolty.systems

Background removal and batch image generation across two Mac Studios

Beyond cover art: background removal, batch resources, and two GPUs of throwaway pixels

TL;DR Cover art was the gateway drug. The same local ComfyUI install that generates this blog’s headers also strips the cluttered background off a photo of hardware on my bench, upscales a small generation to retina resolution, and batch-produces a consistent set of illustrations from a prompt template. Two Mac Studios mean I can fire a batch at one box and keep working on the other. It’s all driven from scripts and agents, and it all costs $0 per image because it never leaves the house. ...

Prompt to ComfyUI to S3 to Hugo image generation pipeline

From prompt to published: how every image on this blog comes out of a local ComfyUI

TL;DR I don’t pay for stock photos and I don’t open Canva. Every raster image on this blog is generated on a Mac Studio sitting three feet from me, by asking Claude Code to call a generate_image MCP tool that wraps ComfyUI. The pipeline is: prompt → ComfyUI (MPS) → PNG on disk → upload_media.py → S3 → CloudFront → a Markdown reference in the post. It costs $0 per image, takes ~15 seconds, and the whole thing is repeatable because the prompt and settings live in the commit history. ...

A LiteLLM gateway routing many model providers behind one OpenAI-compatible endpoint

A LiteLLM gateway for the homelab: one endpoint, many models, hard cost caps

TL;DR I put a LiteLLM proxy gateway in front of every LLM I use — local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts make runaway costs impossible. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? ~50ms latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust. ...
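A rough sketch of the intent-based routing and per-key cap idea in that excerpt. This is plain Python for illustration, not LiteLLM's config format or API; the backend strings, the `ApiKey` type, and the `route` function are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical alias table: callers ask for an intent-level name and the
# gateway decides which backend serves it. Backend strings are assumptions.
ROUTES = {
    "solar-expert": "ollama/small-local-model",
    "claude-opus-4-8": "openrouter/claude-opus-4-8",
}


@dataclass
class ApiKey:
    name: str
    budget_usd: float       # hard cap for this key
    spent_usd: float = 0.0  # running total charged against the cap


def route(alias: str, key: ApiKey, est_cost_usd: float) -> str:
    """Return the backend for an alias, refusing if the key would exceed its cap."""
    if alias not in ROUTES:
        raise KeyError(f"unknown model alias: {alias}")
    if key.spent_usd + est_cost_usd > key.budget_usd:
        raise PermissionError(f"key {key.name!r} would exceed its budget")
    key.spent_usd += est_cost_usd
    return ROUTES[alias]
```

The point of the shape is that a cron job only ever names an alias, so swapping the model behind it is a one-line change to the table rather than a code change in the caller.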

An MCP server wrapping a local homelab API for AI agents

Writing MCP servers for your homelab: five tools, 200 lines, and your agents get hands

TL;DR Model Context Protocol (MCP) is a transport layer that lets Claude and other LLM agents call local tools with typed signatures and structured responses. Any HTTP API running on your homelab — ComfyUI, a wiki, a dashboard, a custom service — can become a set of agent-callable tools by wrapping it in a FastMCP server. A typical server takes 150–250 lines of Python, exposes 3–5 tools via @mcp.tool() decorators, and runs as a stdio process. The pattern scales from single-purpose (image generation) to multi-tool (queue status, model listing, system stats) without complexity explosion. This post shows the anatomy by dissecting the ComfyUI MCP server: how to build workflows, poll for completion, parse results, and return structured JSON that agents actually use. ...
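The "poll for completion, return structured JSON" step mentioned there could look something like the sketch below. It uses only the standard library and is not the post's actual server code. `fetch_status` is a hypothetical stand-in for whatever HTTP call asks the wrapped service about a job, and the `done` / `outputs` keys are assumed field names.

```python
import json
import time
from typing import Callable


def wait_for_result(fetch_status: Callable[[], dict],
                    timeout_s: float = 60.0,
                    interval_s: float = 1.0) -> str:
    """Poll fetch_status() until it reports done, then return structured JSON."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("done"):
            # Success: hand the agent a small, predictable payload.
            return json.dumps({"ok": True, "outputs": status.get("outputs", [])})
        if time.monotonic() >= deadline:
            # Failure is also structured, so the agent can branch on "ok".
            return json.dumps({"ok": False, "error": "timed out"})
        time.sleep(interval_s)
```

A helper like this would typically be called from inside a tool function, so that the agent always gets back JSON with an explicit `ok` flag instead of a hung call or a raw exception.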

Four-rung ladder showing supervised, monitored, trusted, full autonomy stages

The agent autonomy trust ladder: supervised → monitored → trusted → full

TL;DR I run a growing fleet of autonomous agents — homelab ops, trading research, content generation. Most blow up the first few times they try anything new. I needed a way to decide what an agent is allowed to do without asking me, and what still requires a human checkpoint. The answer is a four-rung trust ladder — supervised, monitored, trusted, full autonomy. Agents earn rungs through track record, not promises. Demotions are possible and routine. The framework took the question “should this agent be allowed to do X” out of my head every single time and turned it into a policy I can apply consistently. ...
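One way the four rungs might be encoded as policy, as a minimal sketch. The rung names come from the excerpt; the promotion threshold, the one-rung-at-a-time rule, and the function names are assumptions made for illustration.

```python
from enum import IntEnum


class Rung(IntEnum):
    SUPERVISED = 0
    MONITORED = 1
    TRUSTED = 2
    FULL = 3


# Hypothetical threshold: clean runs in a row needed to climb one rung.
PROMOTE_AFTER = 10


def next_rung(current: Rung, clean_streak: int, last_run_failed: bool) -> Rung:
    """Demote one rung on failure; promote one rung after a long clean streak."""
    if last_run_failed:
        return Rung(max(current - 1, Rung.SUPERVISED))
    if clean_streak >= PROMOTE_AFTER:
        return Rung(min(current + 1, Rung.FULL))
    return current


def needs_human(agent_rung: Rung, action_min_rung: Rung) -> bool:
    """An action needs a human checkpoint if the agent sits below its required rung."""
    return agent_rung < action_min_rung
```

With something like this, "should this agent be allowed to do X" reduces to a lookup of the action's minimum rung and a comparison, for example `needs_human(Rung.MONITORED, Rung.TRUSTED)`.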

Multiple Claude sessions posting to a shared Mattermost channel

Coordinating 3-5 parallel Claude sessions through a shared Mattermost channel

TL;DR I run 3-5 Claude Code sessions in parallel at staggered cadences. They coordinate through a shared #mat-claude-sessions Mattermost channel plus a small coordination board file. Each session announces what it’s about to touch, claims it, and announces when it’s done. Conflicts are rare; throughput is dramatically higher than running one session at a time and waiting.

Why parallel

A single Claude Code session running a long task (refactor across a few repos, work through a debugging session, draft a blog post) is mostly me waiting. The model is fast but tasks are bounded by my decisions, my reviews, and my edits. If I’m waiting on Session A to finish a build, Session B can be drafting something unrelated. Session C can be running a slow eval. The bottleneck stops being the model and becomes my own attention rotation. ...

Two Mac Studios bridged by Thunderbolt 5 running a 1T parameter MoE

Running a 1T-parameter MoE locally on two Mac Studios over Thunderbolt 5

TL;DR Two M3 Ultra Mac Studios — 256GB unified memory each — connected by a Thunderbolt 5 cable can run mixture-of-experts models in the trillion-parameter range that no single 256GB box can fit. The hot path stays on Box 1; Box 2 hosts heavier experts and gets called via a local nginx proxy on port 11436. Real-world power draw is nowhere near the spec sheet. Some models still don’t fit even with two boxes (Kimi K2.6 native INT4), and that’s a genuinely useful constraint to know. ...

LLM evaluator with masked headlines and dates

Blind Oracle: stripping dates, headlines, and tickers before trusting an LLM trading evaluator

TL;DR I run an LLM-driven trading hypothesis engine. For a while, every result that came back looked too good — Sharpe ratios above 5, win rates above 70%, all on out-of-sample windows. They were lies. The model was reading dates, headlines, and tickers in the prompt and pattern-matching against its training data, which extends well past my “out-of-sample” cutoff. The fix was a masking layer I now call Blind Oracle: strip every leak before evaluation, run the trigger before the eval, gate promotion on out-of-sample Sharpe with the masking enforced. After it shipped, the inflated numbers collapsed back to honest reality. Some hypotheses survived; most didn’t. That’s exactly what I needed to know. ...
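A minimal sketch of what a masking pass of this kind might do, assuming regex-based replacement. It is not the Blind Oracle implementation: it covers only ISO dates, bare four-digit years, and a caller-supplied ticker list, and leaves headline handling out. The placeholder tokens and the `mask` function name are hypothetical.

```python
import re

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # e.g. 2021-03-04
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")         # bare years like 2008


def mask(text: str, tickers: list[str]) -> str:
    """Replace dates, years, and known tickers with neutral placeholders."""
    text = ISO_DATE.sub("[DATE]", text)  # full dates first, so years inside them go too
    text = YEAR.sub("[YEAR]", text)
    # Longest tickers first so a short symbol cannot clip a longer one.
    for ticker in sorted(tickers, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(ticker)}\b", "[TICKER]", text)
    return text
```

For instance, `mask("AAPL rose on 2021-03-04; strongest since 2008", ["AAPL"])` gives `"[TICKER] rose on [DATE]; strongest since [YEAR]"`, which removes the cues a model could match against its training data.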

Agentic Claude processes reporting back from long-running OpenClaw workers

Giving Claude the ability to talk back: agentic long-running processes in OpenClaw

TL;DR Most AI tooling still treats an LLM like a search bar: you prompt, it answers, the loop ends. Useful, but not what I wanted. For my homelab’s ops + trading intelligence platform (OpenClaw), I needed agents that could run for hours, do real work against a real cluster, and then tap me on the shoulder when they found something I should see. Claude turned out to be the model I kept coming back to for the “thinking” layer: it’s both comfortable with long tool-use chains and happy to write structured output a human won’t need to decode. This is a tour of how I’ve actually wired that up: k3s CronJobs doing the heavy lifting, LiteLLM as the routing layer, Slack as the interrupt bus, and named cat-bot personas so I can tell at a glance who’s knocking. ...

Coordinating parallel Claude Code sessions

Three Claude tabs kept clobbering each other. So I built a guard.

TL;DR I run 3-5 parallel Claude Code sessions against the same homelab. One tab mid-refactor, one tab doing docs, one tab chasing a bug. They don’t know about each other, so every so often one tab “tidies up” a file another tab is actively editing — and Claude, being a dutiful little overwriter, just clobbers the work. I built a small Go binary that hooks into SessionStart / SessionEnd / PreToolUse, tracks file claims on disk, and injects a warning straight into the LLM’s context window when it’s about to step on another session’s toes. Optional Slack mirror so I can watch the timeline from my phone. MIT, single binary, no runtime deps. Repo: github.com/zolty-mat/claude-session-guard. ...
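The file-claim idea can be sketched in a few lines. The real tool is a Go binary; this is a Python illustration of the concept only, not its code. The JSON claims file layout and the function names are assumptions, and there is no file locking here, so it is not safe against two sessions writing at the same instant.

```python
import json
from pathlib import Path
from typing import Optional


def load_claims(path: Path) -> dict:
    """Read the on-disk claims map (file -> session), or start empty."""
    return json.loads(path.read_text()) if path.exists() else {}


def claim(path: Path, session: str, file: str) -> Optional[str]:
    """Record that `session` is editing `file`; return a warning if another session holds it."""
    claims = load_claims(path)
    owner = claims.get(file)
    if owner is not None and owner != session:
        # This string is what a pre-tool-use hook would inject into context.
        return f"WARNING: {file} is claimed by session {owner}"
    claims[file] = session
    path.write_text(json.dumps(claims))
    return None


def release_all(path: Path, session: str) -> None:
    """Drop every claim held by `session`, as a session-end hook might."""
    claims = {f: s for f, s in load_claims(path).items() if s != session}
    path.write_text(json.dumps(claims))
```

Returning the warning as text rather than raising keeps the decision with the model: it sees who holds the file and can back off or pick other work.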