TL;DR

Two M3 Ultra Mac Studios — 256GB unified memory each — connected by a Thunderbolt 5 cable can run mixture-of-experts models in the half-trillion-parameter class that no single 256GB box can fit. The hot path stays on Box 1; Box 2 hosts heavier experts and gets called via a local nginx proxy on port 11436. Real-world power draw is nowhere near the spec sheet. Some models still don’t fit even with two boxes (Kimi K2.6 at native INT4), and that’s a genuinely useful constraint to know.

The setup

Two identical Mac Studios:

  • M3 Ultra, 28-core CPU, 256GB unified memory each
  • Thunderbolt 5 between them (one direct cable, no switch)
  • 10GbE from each box back to the homelab USW Aggregation switch (different cable, different purpose)
  • macOS with Ollama + MLX

Total combined memory: 512GB. Total combined GPU bandwidth: enough that I stopped caring about it.

Why two boxes

A single M3 Ultra at 256GB can handle a 70B dense model at decent quantization, or a 235B-class MoE if you’re willing to swap and accept first-token latency from disk pressure. The interesting frontier — Kimi K2 family, GLM-4.5/5, the larger DeepSeek configurations — needs more memory than one box has. Buying a third Mac Studio crossed a sanity threshold for me, but two boxes was tractable, and Thunderbolt 5 is fast enough that the second box can act as an expert host without the link becoming the bottleneck.

The physical math is also forgiving. Despite the 480W spec rating, an M3 Ultra under inference draws closer to 100-200W in practice. Even taking the spec at face value, two boxes is 960W against the 1,440W continuous rating of a 15A/120V circuit (1,800W peak, derated 80% for continuous loads), and the real combined draw sits closer to 400W. No new electrical work, no thermal panic.

Topology

                  ┌─ 10GbE ─────────► USW Aggregation switch
                  │                        (homelab cluster)
   ┌──────────────┴─┐
   │  Box 1         │
   │  M3 Ultra 256G │  hot path: small/medium models,
   │  Ollama :11434 │  default routing target
   │                │
   └────────┬───────┘
            │ Thunderbolt 5
            │ (point-to-point)
   ┌────────┴───────┐
   │  Box 2         │  heavy experts: 70B+ dense,
   │  M3 Ultra 256G │  large MoE shards
   │  Ollama :11434 │
   │                │
   └────────────────┘
   nginx on Box 1 :11436 ──► proxies to Box 2:11434

LiteLLM in the cluster knows about three things:

  • mac-studio-1:11434 — fast models, default
  • mac-studio-1:11436 — proxy to Box 2 for heavier models
  • specific aliases per model family

A request for gpt-oss:120b lands on Box 1. A request for kimi-k2:235b routes through 11436 to Box 2. Routing is just LiteLLM model_list config — no orchestration heroics.
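
The relevant slice of that config, sketched with the model names above (not the full file):

model_list:
  - model_name: gpt-oss:120b
    litellm_params:
      model: ollama/gpt-oss:120b
      api_base: http://mac-studio-1:11434      # Box 1 directly
  - model_name: kimi-k2:235b
    litellm_params:
      model: ollama/kimi-k2:235b
      api_base: http://mac-studio-1:11436      # nginx proxy to Box 2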

What actually fits

After enough trial-and-error I built a mental model of what works on this hardware:

Model                        Type           Where it runs   Notes
Gemma 4 (4B–27B)             dense          Box 1           Default for fast tasks
Qwen3 / Qwen3.5 30B-class    dense or MoE   Box 1           Solid generalist
Llama 3.x 70B                dense          Box 2           First-token slow, then fine
GLM-4.5 / GLM-5.1            MoE            Box 2           Excellent for long context
DeepSeek V3.2                MoE            Box 2           Routing-friendly
Kimi K2 (235B-class)         MoE            Box 2           Fits at q4
Kimi K2.6 (1T native INT4)   MoE            Doesn’t fit     Genuinely too big

The Kimi K2.6 result was the data point I most wanted, and the answer was “no”. One trillion parameters at native INT4 is roughly 500GB of weights before activations, KV cache, and overhead. Two M3 Ultras technically have 512GB between them, but you can’t actually use all of it — macOS reserves a meaningful slice, and the default GPU wired-memory ceiling sits below the unified total. Useful constraint to confirm.
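
That ceiling is adjustable. On recent macOS it is exposed as a sysctl; this is a hedged sketch, since the knob’s name and default have shifted across macOS versions and the value resets on reboot:

# Raise the GPU wired-memory limit (default is well below the unified total).
# Re-apply after every reboot, e.g. via a LaunchDaemon.
sudo sysctl iogpu.wired_limit_mb=245760   # allow ~240GB of the 256GB to be GPU-wired

Even maxed out on both boxes, that doesn’t reach the 500GB-plus that K2.6 needs once KV cache and overhead are in the picture.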

Thunderbolt 5 negotiates 80 Gbps in each direction (120 Gbps one way in burst mode) and runs over a passive copper cable. It’s no datacenter NVLink on raw bandwidth, but for request-level routing (shipping prompts and tokens between boxes rather than sharding tensors across them) it’s the only consumer-grade interconnect fast enough that the link stops mattering, and it’s plug-and-play.

I set up a static IP on the Thunderbolt interface on each box (192.168.99.1 and 192.168.99.2) and the nginx proxy on Box 1 talks to Box 2 over that link, never over the 10GbE LAN. This keeps latency down and avoids contention with the rest of the cluster traffic.
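
Setting those addresses is one command per box, assuming macOS named the link’s service “Thunderbolt Bridge”, which is the default (networksetup -listallnetworkservices will confirm):

networksetup -setmanual "Thunderbolt Bridge" 192.168.99.1 255.255.255.0   # on Box 1
networksetup -setmanual "Thunderbolt Bridge" 192.168.99.2 255.255.255.0   # on Box 2

The proxy config itself: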

# /usr/local/etc/nginx/servers/box2-proxy.conf
server {
    listen 11436;                              # the port LiteLLM targets for Box 2 models
    location / {
        proxy_pass http://192.168.99.2:11434;  # Ollama on Box 2, over the TB5 link
        proxy_buffering off;                   # pass tokens through as they arrive
        proxy_read_timeout 600s;               # long generations shouldn't time out
    }
}

Streaming responses work fine; proxy_buffering off matters for token-by-token output.
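
A quick way to confirm the proxy isn’t buffering (the model name is a placeholder for whatever is pulled on Box 2):

# -N turns off curl's own buffering; tokens should arrive as individual JSON lines
curl -N http://localhost:11436/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Say hi."}'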

Performance reality

For a 70B-class dense model, going through the TB5 proxy adds maybe 5-10ms per request setup vs. running it on Box 1 directly. Token throughput is identical because once the request is on Box 2, it’s just running locally there.

For a routing-heavy MoE like Kimi K2 235B, throughput on Box 2 is in the 25-40 tok/s range depending on sequence length and active-expert count. Better than I expected; I had been mentally pricing this against H100 numbers.

The thing I underestimated is first-token latency on cold load. Loading 100GB+ of weights into GPU memory takes 30-60 seconds. I keep the heavy models pinned via Ollama’s keep_alive=24h so they only cold-load once a day.
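
A model can also be warm-loaded on demand: a generate call with no prompt loads it and holds it for the keep_alive window without generating anything.

# warm-load the heavy model once (e.g. from a daily launchd job)
curl -s http://localhost:11436/api/generate \
  -d '{"model": "kimi-k2:235b", "keep_alive": "24h"}'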

Daemon environment

Ollama on macOS picks up its environment from launchd, so the variables are set with launchctl setenv (they don’t survive a reboot, and Ollama has to be restarted to see them):

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"          # listen on all interfaces, not just loopback
launchctl setenv OLLAMA_KEEP_ALIVE "24h"              # keep loaded models resident for a day
launchctl setenv OLLAMA_MAX_LOADED_MODELS "2"         # cap resident models per box

Two loaded models per box is the sweet spot — enough to swap between a router model and a worker model without thrashing, but not so many that they start evicting each other.

What I’d do differently

  • Get the Thunderbolt cable right the first time. I bought a TB4 cable initially because it was on sale; it negotiated TB4 speeds (40 Gbps) and inter-box latency was visibly worse on chunky requests. The Apple TB5 cable is overpriced; third-party TB5 cables are not. Buy the third-party one.
  • Don’t treat the boxes as a cluster. They’re not. There’s no shared filesystem, no automatic failover, no load balancer. The proxy is dumb. Treating them as “one big GPU” is a fast track to debugging things that don’t exist.
  • Benchmark before routing. I had a stretch where I’d see a model trending on Hugging Face, drop it into LiteLLM, and add an alias before I’d actually measured throughput or output quality. Several of those were anti-additions — slower than what I had, no better at the task. The rule now: tok/s, JSON-fidelity, and a domain-specific eval, every time, before it gets a production alias (the tok/s check is sketched after this list).
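
The tok/s check is one curl against Ollama’s generate endpoint: the final response carries eval_count and eval_duration (in nanoseconds), so jq can do the math. Model and prompt here are placeholders.

curl -s http://mac-studio-1:11436/api/generate -d '{
  "model": "kimi-k2:235b",
  "prompt": "Summarize the tradeoffs of MoE routing in three sentences.",
  "stream": false
}' | jq '{tok_per_s: (.eval_count / .eval_duration * 1e9),
          load_s:    (.load_duration / 1e9)}'
# decode throughput plus cold-load cost; run it a few times and take the median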

What’s next

The hot path / heavy path split is settling into something stable. The next step is wiring the cluster’s request-routing layer to evict aliases that fall behind newer models on the eval, not just add new ones — a graveyard of retired model aliases is its own kind of tech debt.