TL;DR
I put a LiteLLM proxy gateway in front of every LLM I use: local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts put a hard ceiling on runaway costs. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? Roughly 50-100ms of latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust.
The old problem — and why it’s worse than it looks
Before the gateway, I had at least three model providers in active use:
- Local Ollama on my Mac Studio (Gemma 4, Qwen3 for cheap inference) — described in more depth in my post on scaling two Mac Studios for reasoning and image generation
- Anthropic API for actual reasoning (Claude Opus/Sonnet)
- OpenRouter as a backup for cloud models when I need something other than Claude
Each lived behind a different SDK and had its own auth setup: some API keys in environment variables, some in Vault, one in Bitwarden. Every agent and automation script I wrote had to know about all three and pick one at startup.
The real problem wasn’t complexity; it was invisible blast radius. A cron job using the Anthropic SDK would accidentally hit Claude Sonnet at $3/1M input tokens. A test script would hammer a local model and spike memory. An old agent would be wired to the wrong provider. I’d look at my OpenRouter invoice 30 days later and shrug.
But worse: cost visibility was terrible. I could see what I spent per vendor, but not per agent or per job. Was my stock trading bot expensive? My image generation pipeline? The vet academy tutoring system? I didn’t know. I had three separate dashboards and no model of where money was actually going. That turns every optimization attempt into guesswork.
And the most annoying part: fallback was a pain. If OpenRouter was down, I couldn’t just switch to a local model for a non-critical task. Every app had to hard-code its fallback chain. Most didn’t even try.
Enter the LiteLLM gateway
LiteLLM is an open-source proxy that wraps 100+ LLM providers behind a single OpenAI-compatible API. You point it at a config file with a model list, each model has a backend (Ollama, OpenRouter, Anthropic, Azure, whatever), and every app hits the same https://litellm.k3s.internal.zolty.systems/v1/chat/completions endpoint.
The gateway handles:
- Model routing — map a friendly name to a backend model + provider
- Fallback chains — if Model A fails, try Model B automatically
- Per-key budgets and spend caps (in dollars, tokens, or requests)
- Daily spend alerts when you hit 80% or 100% of a key’s budget
- Spend logging to PostgreSQL for later analysis
- Request/response logging and metrics (Prometheus-compatible)
- Context-window overflow handling — auto-truncate long inputs or reject
You define models once, version-control the config, and every agent/app/script in the cluster can call the same endpoint.
Setting up the gateway on k3s
I’m running this in the solar namespace — it’s an actual production app (the Solar Q&A chatbot), so the setup has to be repeatable and reliable.
The config file
Here’s the model list from my solar namespace. It’s the part that actually matters:
# kubernetes/apps/solar/solar.yaml — ConfigMap excerpt
model_list:
  # Local Q&A bot: fast, free, domain-expert system prompt
  - model_name: solar-expert
    litellm_params:
      model: ollama/gemma4:26b
      api_base: http://192.168.1.216:11435
      max_tokens: 2048
    model_info:
      rpm: 10      # rate limit: 10 requests/min
      tpm: 30000   # token limit: 30k tokens/min
      system_prompt: |
        You are the Solar Development Expert...
        [FULL SYSTEM PROMPT WITH SOLAR DOMAIN EXPERTISE]
  # Deep reasoning on hard questions: slow, local, but free
  - model_name: solar-deep
    litellm_params:
      model: ollama/qwen3:235b-a22b
      api_base: http://192.168.1.216:11435
      max_tokens: 4096
      extra_body:
        think: true
    model_info:
      rpm: 5
      tpm: 40000
      system_prompt: |
        You are the Solar Deep Analysis model...
        [EXTENDED REASONING SYSTEM PROMPT]
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  drop_params: true
litellm_settings:
  num_retries: 2
  request_timeout: 120
  set_verbose: false
The key patterns:
- Model name is what the app requests (e.g., model="solar-expert").
- litellm_params.model is the backend: ollama/gemma4:26b means “call Ollama’s /api/generate endpoint with model gemma4:26b”. For cloud vendors it’s openrouter/openai/gpt-4, etc.
- api_base is where Ollama is running. Mine is on the Mac Studio at 192.168.1.216:11435 (outside the cluster, on the home network; more on that under Gotchas).
- model_info.rpm / tpm are hard rate limits per model. I set these to prevent runaway requests.
- system_prompt is baked into the model definition, so every request using this model name gets the same expert context without passing it each time.
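To make the naming convention concrete, here is a minimal Python sketch of how a friendly model name maps to a provider and a backend model id. It only illustrates the "provider/model" string convention described above; it is not LiteLLM's internal code, and ALIASES, BackendTarget, split_backend, and resolve are hypothetical names invented for this example.

```python
from typing import NamedTuple


class BackendTarget(NamedTuple):
    provider: str  # e.g. "ollama" or "openrouter"
    model: str     # the model id handed to that provider


def split_backend(model_string: str) -> BackendTarget:
    """Split a 'provider/model' string on the first slash only."""
    provider, sep, model = model_string.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_string!r}")
    return BackendTarget(provider, model)


# Hypothetical alias table mirroring the model_list in the config above.
ALIASES = {
    "solar-expert": "ollama/gemma4:26b",
    "solar-deep": "ollama/qwen3:235b-a22b",
}


def resolve(alias: str) -> BackendTarget:
    """Look up a friendly model name and return its backend target."""
    return split_backend(ALIASES[alias])


if __name__ == "__main__":
    print(resolve("solar-expert"))
    print(split_backend("openrouter/openai/gpt-4"))
```

Splitting on the first slash only is what lets a string like openrouter/openai/gpt-4 keep openai/gpt-4 intact as the model id.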
The Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: solar-litellm
  namespace: solar
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: solar-litellm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: solar-litellm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "4000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: litellm
          image: harbor.k3s.internal.zolty.systems/berriai/litellm:main-v1.83.3-stable
          ports:
            - containerPort: 4000
              name: http
          env:
            - name: LITELLM_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: solar-litellm-secrets
                  key: master-key
            - name: DATABASE_URL
              value: "postgresql://litellm:...@solar-litellm-db:5432/litellm"
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
              readOnly: true
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          resources:
            requests:
              cpu: 50m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 2Gi
      volumes:
        - name: config
          configMap:
            name: solar-litellm-config
The deployment is basic: one LiteLLM container on port 4000, config injected from a ConfigMap, environment secrets for the master key and PostgreSQL credentials. Nothing exotic.
Database for spend tracking
LiteLLM can log every request to a PostgreSQL database. This is optional but critical if you actually want to understand where your money is going.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solar-litellm-db
  namespace: solar
spec:
  serviceName: solar-litellm-db
  replicas: 1
  # spec.selector is required and must match the pod template labels
  selector:
    matchLabels:
      app.kubernetes.io/name: solar-litellm-db
  template:
    metadata:
      labels:
        app.kubernetes.io/name: solar-litellm-db
    spec:
      containers:
        - name: postgres
          image: postgres:16.8-alpine
          env:
            - name: POSTGRES_USER
              value: litellm
            - name: POSTGRES_DB
              value: litellm
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: solar-litellm-db-secret
                  key: postgres-password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: longhorn
        resources:
          requests:
            storage: 2Gi
A single-replica Postgres on Longhorn for spend logs. Query it with something like:
SELECT model, count(*) as requests, sum(input_tokens) as input_toks, sum(output_tokens) as output_toks
FROM litellm_logs
WHERE created_at > now() - interval '7 days'
GROUP BY model
ORDER BY output_toks DESC;
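Token totals like the ones this query returns only become a spend figure once you multiply by a price. A minimal Python sketch of that arithmetic follows; estimate_cost_usd is a hypothetical helper, and the prices are parameters you would fill in from your provider's rate card. The only price taken from this post is the $3 per 1M input tokens mentioned earlier for Claude Sonnet; local Ollama models would simply use 0.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Convert token counts into dollars using per-million-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 2M input tokens at $3 per 1M input tokens; output price left at 0
    # here because this post does not quote one.
    print(estimate_cost_usd(2_000_000, 0, 3.0, 0.0))
    # A local model: every token is free.
    print(estimate_cost_usd(5_000_000, 1_000_000, 0.0, 0.0))
```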
Routing in your app
From inside the cluster, apps connect to the Service:
from openai import OpenAI

client = OpenAI(
    base_url="http://solar-litellm.solar.svc.cluster.local:4000/v1",
    api_key="sk-solar-<master-key-fragment>"
)

# Request a cheap, local model
response = client.chat.completions.create(
    model="solar-expert",
    messages=[{"role": "user", "content": "Solar financing costs?"}]
)

# Or ask for a different backend (if you've defined it)
response = client.chat.completions.create(
    model="solar-deep",
    messages=[...]
)
That’s it. Every request goes through the gateway. LiteLLM logs it, checks the key’s spend cap, routes to the backend, and returns the response.
Configuring per-key budgets
LiteLLM’s API allows you to create keys with budgets:
curl -X POST \
  http://litellm.k3s.internal.zolty.systems:4000/key/generate \
  -H "Authorization: Bearer sk-solar-<master-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "max_budget": 50.00,
    "budget_duration": "1d",
    "team_id": "solar-qa-bot"
  }'
This key can spend max $50/day. When it hits 80%, LiteLLM can fire a webhook (alert to Slack, PagerDuty, etc.). At 100%, the key is rejected.
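The 80% / 100% behaviour described above boils down to a two-threshold check. Here is a minimal Python sketch of those semantics, assuming a simple running spend total per key. It illustrates the policy only and is not LiteLLM's implementation; budget_status and its return values are hypothetical names.

```python
def budget_status(spend: float, max_budget: float,
                  alert_fraction: float = 0.8) -> str:
    """Classify a key's spend against its budget.

    Returns "reject" at or above 100% of the budget, "alert" at or above
    the alert fraction (80% by default), and "ok" otherwise.
    """
    if max_budget <= 0:
        raise ValueError("max_budget must be positive")
    used = spend / max_budget
    if used >= 1.0:
        return "reject"
    if used >= alert_fraction:
        return "alert"
    return "ok"


if __name__ == "__main__":
    # A $50/day key, as in the curl example above.
    for spent in (10.0, 40.0, 50.0):
        print(spent, budget_status(spent, 50.0))
```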
You can also set budgets in config.yaml:
key_management:
  keys:
    - key_alias: "solar-bot-prod"
      key_name: "sk-solar-..."
      max_budget: 25.00
      budget_duration: "1d"
      budget_reset_at: "00:00"
Gotchas and honest limitations
- Network egress requirement. Your LiteLLM pod needs to reach whatever backend you’re routing to. In my case, Ollama is on the Mac Studio at 192.168.1.216:11435, which is on the same home network as the cluster. That works. If Ollama were only exposed inside the cluster, I’d use a Service IP. If you’re using OpenRouter or Anthropic, the pod needs internet access (either direct egress or a proxy).
- Latency overhead. LiteLLM adds ~50-100ms per request just to log and route. For interactive UIs that’s fine. If you need sub-100ms end-to-end latency, it might matter. Profile it.
- No vendor-specific params for every vendor. If a vendor adds a new feature (e.g., Anthropic’s thinking mode), it might not be in LiteLLM yet, or the config syntax might not expose it. I’ve found extra_body: { think: true } works, but sometimes you have to wait for upstream LiteLLM to expose new params. Not a blocker, just something to know.
- Fallback chains are simple. You can define fallback models (fallback_models: [model-a, model-b]) but the logic is “try next on error”; there is no cost- or context-aware fallback like “if input > 200k tokens, use a longer-context model.”
- Single point of failure on the hot path. If the LiteLLM pod crashes, every app that depends on it is blocked. You can run multiple replicas and a load balancer, but most homelabs (including mine) don’t need that complexity yet.
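For reference, the “try next on error” fallback logic mentioned above amounts to a loop like the following Python sketch. This is an illustration of the policy, not the gateway's code; call_with_fallbacks and the fake backend are hypothetical, and the backend call is passed in as a plain function so the example runs without a network.

```python
from typing import Callable, List, Sequence, Tuple


def call_with_fallbacks(
    models: Sequence[str],
    call: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model, result) for the first success."""
    errors: List[str] = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # any error moves on to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_backend(model: str) -> str:
        # Stand-in for a real request: model-a is "down", model-b answers.
        if model == "model-a":
            raise ConnectionError("backend unreachable")
        return f"ok from {model}"

    print(call_with_fallbacks(["model-a", "model-b"], fake_backend))
```

Note that the loop never looks at input size or price; that is exactly the limitation described above.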
Why this beats the SDK-per-vendor approach
| Aspect | Old: Direct vendor SDKs | New: LiteLLM gateway |
|---|---|---|
| Auth management | 3+ keys in env vars / Vault | 1 master key + derived per-app keys |
| Model selection | Hard-coded per app | Declarative in YAML, zero-downtime reconfig |
| Cost visibility | 3 separate dashboards | Single Postgres DB, unified view |
| Fallback chains | Manual try/except per app | Built-in, defined once |
| Quota enforcement | Trust and hope | Hard spend caps + alerting |
| Latency | Direct call | +50-100ms proxy overhead |
| Operational overhead | Medium | Low, mostly YAML config |
The gateway pays for itself the first time you accidentally leave an expensive model wired into a non-critical job. Combined with Langfuse tracing and per-agent cost budgeting, you get visibility and safety that makes autonomous agents viable.
Lessons
- Route by intent, not by vendor. “Give me a solar expert for Q&A” beats “call ollama/gemma4”. If you ever want to swap backends, the intent-based naming survives the swap.
- Spend caps are non-negotiable. I don’t mean “strongly recommended”; I mean architecturally required. A single runaway agent without caps can wipe out a month’s budget in minutes. Caps are the insurance.
- Log to a real database. Email alerts are fine. CSV files are not. You want to query “what did my content-generation pipeline spend last month?” weeks later. A table is the right tool.
- The ~50-100ms latency overhead is worth it. For most homelab stuff (agents, scheduled jobs, background work), that is noise. You get unified routing and spend visibility for that cost.
- Version-control the config. LiteLLM config is YAML. It lives in git. When you rotate a key, change model backends, or adjust rate limits, it’s a PR + deploy, not a manual click-around.
Don’t have a homelab k3s cluster? The exact same pattern works on a standalone server or Docker Compose. LiteLLM is a single container; the Postgres is optional (logs go to disk if you want). DigitalOcean Kubernetes or a single Droplet can run this just as well. The principle is the same: one gateway, many backends, unified visibility.