A LiteLLM gateway for the homelab: one endpoint, many models, hard cost caps

TL;DR

I put a LiteLLM proxy gateway in front of every LLM I use: local Ollama models for bulk/cheap classification work, OpenRouter for frontier models when I need them, plus cloud vendors if needed. Every app and agent targets one OpenAI-compatible endpoint. Per-key budgets and daily spend alerts put a hard ceiling on runaway costs. I define model-to-backend mappings in YAML, let LiteLLM handle the routing, and route based on intent: ask for solar-expert when I need a domain-specific Q&A bot backed by a small local model, ask for claude-opus-4-8 when I need real reasoning. The gateway cost? Roughly 50-100ms of latency overhead and one Kubernetes Deployment. The gain? No more vendor SDK sprawl, no more guessing which model is wired into a cron job, and spend visibility that I actually trust.

The old problem — and why it’s worse than it looks

Before the gateway, I had at least three model providers in active use:

Local Ollama on my Mac Studio (Gemma 4, Qwen3 for cheap inference) — described in more depth in my post on scaling two Mac Studios for reasoning and image generation
Anthropic API for actual reasoning (Claude Opus/Sonnet)
OpenRouter as a backup for cloud models when I need something other than Claude

Each lived behind a different SDK. Each had a different auth scheme: some API keys in environment variables, some in Vault, one in Bitwarden. Every agent and automation script I wrote had to know about all three and pick one at startup.

The real problem wasn’t complexity; it was invisible blast radius. A cron job using the Anthropic SDK would accidentally hit Claude Sonnet at $3/1M input tokens. A test script would hammer a local model and spike memory. An old agent would be wired to the wrong provider. I’d look at my OpenRouter invoice 30 days later and shrug.

But worse: cost visibility was terrible. I could see what I spent per vendor, but not per agent or per job. Was my stock trading bot expensive? My image generation pipeline? The vet academy tutoring system? I didn’t know. I had three separate dashboards and no model of where money was actually going. That turns every optimization attempt into guesswork.

And the most annoying part: fallback was a pain. If OpenRouter was down, I couldn’t just switch to a local model for a non-critical task. Every app had to hard-code its fallback chain. Most didn’t even try.

Enter the LiteLLM gateway

LiteLLM is an open-source proxy that wraps 100+ LLM providers behind a single OpenAI-compatible API. You point it at a config file with a model list in which each model maps to a backend (Ollama, OpenRouter, Anthropic, Azure, whatever), and every app hits the same https://litellm.k3s.internal.zolty.systems/v1/chat/completions endpoint.

The gateway handles:

Model routing — map a friendly name to a backend model + provider
Fallback chains — if Model A fails, try Model B automatically
Per-key budgets and spend caps (in dollars, tokens, or requests)
Daily spend alerts when you hit 80% or 100% of a key’s budget
Spend logging to PostgreSQL for later analysis
Request/response logging and metrics (Prometheus-compatible)
Context-window overflow handling — auto-truncate long inputs or reject

You define models once, version-control the config, and every agent/app/script in the cluster can call the same endpoint.
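To make the "if Model A fails, try Model B" behaviour concrete, here is a minimal sketch of the idea in plain Python. It is not LiteLLM's implementation; call_with_fallback, BackendError, and the two stand-in backends are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


class BackendError(Exception):
    """Hypothetical error raised when a backend call fails."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, backend) pair in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, backend in chain:
        try:
            return name, backend(prompt)
        except BackendError as exc:
            # Remember why this backend failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise BackendError("all backends failed: " + "; ".join(errors))


def flaky_cloud(prompt: str) -> str:
    # Stand-in for a cloud provider that is currently down.
    raise BackendError("upstream unavailable")


def local_model(prompt: str) -> str:
    # Stand-in for a local model that always answers.
    return f"local answer to: {prompt}"


used, reply = call_with_fallback(
    "Solar financing costs?",
    [("cloud-model", flaky_cloud), ("local-model", local_model)],
)
print(used)
```

The point of the gateway is that this loop lives in one place instead of being copied into every app.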

Setting up the gateway on k3s

I’m running this in the solar namespace — it’s an actual production app (the Solar Q&A chatbot), so the setup has to be repeatable and reliable.

The config file

Here’s the model list from my solar namespace. It’s the part that actually matters:

# kubernetes/apps/solar/solar.yaml — ConfigMap excerpt
model_list:
  # Local Q&A bot: fast, free, domain-expert system prompt
  - model_name: solar-expert
    litellm_params:
      model: ollama/gemma4:26b
      api_base: http://192.168.1.216:11435
      max_tokens: 2048
    model_info:
      rpm: 10          # rate limit: 10 requests/min
      tpm: 30000       # token limit: 30k tokens/min
      system_prompt: |
        You are the Solar Development Expert...
        [FULL SYSTEM PROMPT WITH SOLAR DOMAIN EXPERTISE]

  # Deep reasoning on hard questions: slow, local, but free
  - model_name: solar-deep
    litellm_params:
      model: ollama/qwen3:235b-a22b
      api_base: http://192.168.1.216:11435
      max_tokens: 4096
      extra_body:
        think: true
    model_info:
      rpm: 5
      tpm: 40000
      system_prompt: |
        You are the Solar Deep Analysis model...
        [EXTENDED REASONING SYSTEM PROMPT]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

litellm_settings:
  drop_params: true      # drop params a backend doesn't support instead of erroring
  num_retries: 2
  request_timeout: 120   # seconds
  set_verbose: false

The key patterns:

Model name is what the app requests (e.g., model="solar-expert").
litellm_params.model is the backend: ollama/gemma4:26b means “call Ollama’s /api/generate endpoint with model gemma4:26b”. For cloud vendors the prefix names the provider, e.g. openrouter/openai/gpt-4.
api_base is where Ollama is running. Mine is on the Mac Studio at 192.168.1.216:11435, outside the cluster but reachable from it over the home network (see the gotchas below).
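model_info.rpm / tpm are per-model rate limits, which I set to prevent runaway requests. LiteLLM’s own examples put rpm and tpm under litellm_params, so check that your version enforces them where you place them.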
system_prompt is baked into the model definition, so every request using this model name gets the same expert context without passing it each time.
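The name-to-backend mapping can be pictured as a plain lookup. The sketch below mirrors the two entries from the config above as a Python dict and splits the provider prefix off litellm_params.model. resolve_backend is a hypothetical helper for illustration, not a LiteLLM function.

```python
MODEL_LIST = [
    {
        "model_name": "solar-expert",
        "litellm_params": {
            "model": "ollama/gemma4:26b",
            "api_base": "http://192.168.1.216:11435",
        },
    },
    {
        "model_name": "solar-deep",
        "litellm_params": {
            "model": "ollama/qwen3:235b-a22b",
            "api_base": "http://192.168.1.216:11435",
        },
    },
]


def resolve_backend(model_name: str) -> dict:
    """Map the friendly name an app requests to provider, backend model, and api_base."""
    for entry in MODEL_LIST:
        if entry["model_name"] == model_name:
            params = entry["litellm_params"]
            # Text before the first "/" names the provider; the rest is the backend model.
            provider, _, backend_model = params["model"].partition("/")
            return {
                "provider": provider,
                "backend_model": backend_model,
                "api_base": params.get("api_base"),
            }
    raise KeyError(f"unknown model name: {model_name}")


print(resolve_backend("solar-expert"))
```

Swapping a backend then means editing one litellm_params.model value; callers keep asking for solar-expert.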

The Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: solar-litellm
  namespace: solar
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: solar-litellm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: solar-litellm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "4000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: litellm
          image: harbor.k3s.internal.zolty.systems/berriai/litellm:main-v1.83.3-stable
          ports:
            - containerPort: 4000
              name: http
          env:
            - name: LITELLM_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: solar-litellm-secrets
                  key: master-key
            - name: DATABASE_URL
              value: "postgresql://litellm:...@solar-litellm-db:5432/litellm"
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
              readOnly: true
          args: ["--config", "/app/config.yaml", "--port", "4000"]
          resources:
            requests:
              cpu: 50m
              memory: 512Mi
            limits:
              cpu: 1
              memory: 2Gi
      volumes:
        - name: config
          configMap:
            name: solar-litellm-config

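The deployment is basic: one LiteLLM container on port 4000, config injected from a ConfigMap, the master key pulled from a Secret, and a DATABASE_URL pointing at Postgres. (The connection string embeds the database password, so it is better sourced from a Secret as well.) Nothing exotic.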

Database for spend tracking

LiteLLM can log every request to a PostgreSQL database. It’s optional for plain routing, but the per-key budgets described below are stored there, and it’s critical if you actually want to understand where your money is going.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solar-litellm-db
  namespace: solar
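spec:
  serviceName: solar-litellm-db
  replicas: 1
  selector:               # required for apps/v1 StatefulSets; must match the pod labels
    matchLabels:
      app.kubernetes.io/name: solar-litellm-db
  template: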
    metadata:
      labels:
        app.kubernetes.io/name: solar-litellm-db
    spec:
      containers:
        - name: postgres
          image: postgres:16.8-alpine
          env:
            - name: POSTGRES_USER
              value: litellm
            - name: POSTGRES_DB
              value: litellm
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: solar-litellm-db-secret
                  key: postgres-password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: longhorn
        resources:
          requests:
            storage: 2Gi

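A single-replica Postgres on Longhorn for spend logs. Query it with something like the following. The table and column names follow LiteLLM’s spend-log table as I understand it and can change between versions, so list the tables with \dt first:

SELECT model,
       count(*)               AS requests,
       sum(prompt_tokens)     AS input_toks,
       sum(completion_tokens) AS output_toks,
       sum(spend)             AS spend_usd
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > now() - interval '7 days'
GROUP BY model
ORDER BY output_toks DESC;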
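
If you want to try this kind of per-model aggregation without touching the real database, the sketch below runs a similar GROUP BY against an in-memory SQLite table. The table name spend_log, its columns, and the sample rows are all made up for illustration; they are not LiteLLM's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spend_log (model TEXT, input_tokens INTEGER, output_tokens INTEGER)"
)
# Made-up sample rows standing in for logged requests.
conn.executemany(
    "INSERT INTO spend_log VALUES (?, ?, ?)",
    [
        ("solar-expert", 120, 300),
        ("solar-expert", 80, 200),
        ("solar-deep", 400, 900),
    ],
)

# Same shape as the Postgres query: one row per model, heaviest output first.
rows = conn.execute(
    """
    SELECT model, count(*), sum(input_tokens), sum(output_tokens)
    FROM spend_log
    GROUP BY model
    ORDER BY sum(output_tokens) DESC
    """
).fetchall()

for model, requests, input_toks, output_toks in rows:
    print(f"{model}: {requests} requests, {input_toks} in, {output_toks} out")
```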

Routing in your app

From inside the cluster, apps connect to the Service:

from openai import OpenAI

client = OpenAI(
    base_url="http://solar-litellm.solar.svc.cluster.local:4000/v1",
    api_key="sk-solar-<master-key-fragment>"
)

# Request a cheap, local model
response = client.chat.completions.create(
    model="solar-expert",
    messages=[{"role": "user", "content": "Solar financing costs?"}]
)

# Or ask for a different backend (if you've defined it)
response = client.chat.completions.create(
    model="solar-deep",
    messages=[...]
)

That’s it. Every request goes through the gateway. LiteLLM logs it, checks the key’s spend cap, routes to the backend, and returns the response.
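
Because the gateway speaks the OpenAI wire format, a client without the SDK only needs to POST JSON. Here is a minimal sketch using the standard library that builds (but does not send) such a request. build_chat_request is a hypothetical helper and the key is a placeholder.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, user_text: str
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the gateway."""
    payload = {
        "model": model,  # the friendly name from model_list, not the backend model
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "http://solar-litellm.solar.svc.cluster.local:4000/v1",
    "sk-placeholder",  # placeholder, not a real key
    "solar-expert",
    "Solar financing costs?",
)
print(req.full_url)
# To actually send it from inside the cluster: urllib.request.urlopen(req, timeout=120)
```

This is handy for cron jobs and shell-adjacent scripts where pulling in the openai package is overkill.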

Configuring per-key budgets

LiteLLM’s API allows you to create keys with budgets:

curl -X POST \
  http://litellm.k3s.internal.zolty.systems:4000/key/generate \
  -H "Authorization: Bearer sk-solar-<master-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "max_budget": 50.00,
    "budget_duration": "1d",
    "team_id": "solar-qa-bot"
  }'

This key can spend at most $50 per day. When it hits 80%, LiteLLM can fire a webhook (alert to Slack, PagerDuty, etc.). At 100%, requests using the key are rejected until the budget resets.
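
The threshold behaviour boils down to a small comparison. Here is a sketch, assuming the 80% warning and 100% cutoff mentioned in this post. budget_status is a hypothetical function, not part of LiteLLM.

```python
def budget_status(
    spend_usd: float, max_budget_usd: float, warn_fraction: float = 0.8
) -> str:
    """Classify a key's spend for the current budget window."""
    if max_budget_usd <= 0:
        raise ValueError("max_budget_usd must be positive")
    used = spend_usd / max_budget_usd
    if used >= 1.0:
        return "blocked"  # budget exhausted: reject further requests
    if used >= warn_fraction:
        return "warn"     # time to fire the alert webhook
    return "ok"


for spend in (10.0, 40.0, 50.0):
    print(spend, budget_status(spend, max_budget_usd=50.0))
```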

You can also set budgets in config.yaml:

key_management:
  keys:
    - key_alias: "solar-bot-prod"
      key_name: "sk-solar-..."
      max_budget: 25.00
      budget_duration: "1d"
      budget_reset_at: "00:00"

Gotchas and honest limitations

Network egress requirement. Your LiteLLM pod needs to reach whatever backend you’re routing to. In my case, Ollama is on the Mac Studio at 192.168.1.216:11435, which is on the same home network as the cluster. That works. If Ollama were only exposed inside the cluster, I’d use a Service IP. If you’re using OpenRouter or Anthropic, the pod needs internet access (either direct egress or a proxy).
Latency overhead. LiteLLM adds ~50-100ms per request just to log and route. For interactive UIs that’s fine. If you need sub-100ms end-to-end latency, it might matter. Profile it.
Vendor-specific params can lag. If a vendor adds a new feature (e.g., Anthropic’s thinking mode), it might not be in LiteLLM yet, or the config syntax might not expose it. I’ve found extra_body: { think: true } works, but sometimes you have to wait for upstream LiteLLM to expose new params. Not a blocker, just something to know.
Fallback chains are simple. You can define fallback models (LiteLLM’s config key for this is fallbacks, a list mapping a model name to its alternates), but the logic is “try next on error”. There is no cost- or size-aware routing like “if input > 200k tokens, use a longer-context model.”
Single point of failure on the hot path. If the LiteLLM pod crashes, every app that depends on it is blocked. You can run multiple replicas and a load balancer, but most homelabs (including mine) don’t need that complexity yet.
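
For the "profile it" advice on latency, one rough approach is to time the same request directly against the backend and through the gateway, then compare medians. Here is a sketch with hypothetical names (median_latency_ms, plus two stand-in callables that only sleep). In real use you would replace the stand-ins with functions that make actual requests.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-ins for real requests; swap in functions that hit the backend and the gateway.
def direct_call() -> None:
    time.sleep(0.01)


def gateway_call() -> None:
    time.sleep(0.03)


direct_ms = median_latency_ms(direct_call)
gateway_ms = median_latency_ms(gateway_call)
print(f"estimated proxy overhead: {gateway_ms - direct_ms:.1f} ms")
```

Using the median rather than the mean keeps one slow cold-start request from skewing the estimate.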

Why this beats the SDK-per-vendor approach

Aspect	Old: Direct vendor SDKs	New: LiteLLM gateway
Auth management	3+ keys in env vars / Vault	1 master key + derived per-app keys
Model selection	Hard-coded per app	Declarative in YAML, zero-downtime reconfig
Cost visibility	3 separate dashboards	Single Postgres DB, unified view
Fallback chains	Manual try/except per app	Built-in, defined once
Quota enforcement	Trust and hope	Hard spend caps + alerting
Latency	Direct call	+50-100ms proxy overhead
Operational overhead	Medium	Low, mostly YAML config

The gateway pays for itself the first time you accidentally leave an expensive model wired into a non-critical job. Combined with Langfuse tracing and per-agent cost budgeting, you get the visibility and safety that make autonomous agents viable.

Lessons

Route by intent, not by vendor. “Give me a solar expert for Q&A” beats “call ollama/gemma4”. If you ever want to swap backends, the intent-based naming survives the swap.
Spend caps are non-negotiable. I don’t mean “nice to have”; I mean architecturally required. A single runaway agent without caps can wipe out a month’s budget in minutes. Caps are the insurance.
Log to a real database. Email alerts are fine. CSV files are not. You want to query “what did my content-generation pipeline spend last month?” weeks later. A table is the right tool.
The latency overhead is worth it. For most homelab stuff (agents, scheduled jobs, background work), an extra 50-100ms is noise. You get unified routing and spend visibility for that cost.
Version-control the config. LiteLLM config is YAML. It lives in git. When you rotate a key, change model backends, or adjust rate limits, it’s a PR + deploy, not a manual click-around.

Don’t have a homelab k3s cluster? The exact same pattern works on a standalone server or Docker Compose. LiteLLM is a single container; the Postgres is optional (logs go to disk if you want). DigitalOcean Kubernetes or a single Droplet can run this just as well. The principle is the same: one gateway, many backends, unified visibility.
