TL;DR
A Mac Studio M3 Ultra costs roughly what a 4090-based workstation does, but comes with 256 GB of unified memory and a 60-core GPU, all running at 100–200 W under inference. I stopped trying to pass MPS into containers and instead run Ollama and ComfyUI natively on macOS, then proxy them back into k3s as simple Kubernetes Services with manual Endpoints. Two Mac Studios connected via Thunderbolt 5 split the load: one handles hot-path LLM inference and embeddings, the other is the heavy forge for diffusion and long-horizon reasoning. Both are cheaper to run than a single A100 server and require no special driver stacks.
Why a Mac Studio is a sensible GPU host
The conventional homelab GPU story is NVIDIA: you buy a workstation, bolt on a 4090 or two, and everything just works because there are a million tutorials. But the hardware tax is brutal. A 4090 costs $1,600–$2,000, is rated at 450 W, and gives you 24 GB of discrete memory, so an LLM tops out around 13B parameters at 16-bit precision before you have to quantize it or shard it across cards.
An M3 Ultra Mac Studio starts at ~$4,000 (the 256 GB memory option adds to that), but you get:
- 256 GB unified memory across CPU + GPU, with no CPU-to-GPU copies. A 142 GB model and an active image generation queue coexist peacefully.
- 60-core GPU with MPS (Metal Performance Shaders) doing inference at near-CUDA parity — I see no meaningful difference in latency between Ollama on Mac and Ollama on a workstation GPU.
- Real idle power of 10–20 W, sustained load of 100–200 W. The 480 W spec is a peak that never appears in practice. Two Mac Studios running side-by-side draw less than a single A100.
- No special drivers or kernel modules. You install Ollama and ComfyUI, they see the GPU, they work. No nvidia-container-runtime complexity.
The tradeoff is that you cannot shove MPS into a container—there are no Docker bindings for Apple Silicon GPU pass-through. But that’s not a blocker; it just changes the architecture pattern.
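The 24 GB versus 256 GB comparison above comes down to simple arithmetic on weight size. A rough sketch, assuming memory is dominated by the weights and ignoring KV cache and runtime overhead (the function name is hypothetical):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real
    footprints are somewhat higher than this estimate.
    """
    return params_billion * bits_per_weight / 8


for label, params, bits in [
    ("13B at FP16", 13, 16),   # just over a 24 GB card
    ("70B at FP16", 70, 16),   # needs several discrete cards
    ("70B at 4-bit", 70, 4),   # fits easily in unified memory
]:
    print(f"{label}: ~{weights_gb(params, bits):.0f} GB")
# 13B at FP16: ~26 GB
# 70B at FP16: ~140 GB
# 70B at 4-bit: ~35 GB
```

By this estimate a 13B model at 16-bit precision already sits at the edge of a 24 GB card, while the same arithmetic leaves a 256 GB pool with room for much larger models.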
The pattern: native GPU + k3s proxy
Instead of trying to run GPU workloads inside Kubernetes, I run them natively on the Mac as launchd-managed services, then wire them into the cluster via Kubernetes Services + manual Endpoints. (This approach builds on what I covered in my post on running ComfyUI natively on the Mac Studio — here I extend that pattern to the cluster.)
┌─────────────────────────────────────────────────┐
│  Mac Studio M3 Ultra (macOS arm64)              │
│  ┌─────────────────────────────────────────┐    │
│  │ Ollama  :11435 (256GB split, ~150GB)    │    │
│  │ ComfyUI :8188  (256GB split, ~100GB)    │    │
│  └─────────────────────────────────────────┘    │
└────────────────────────┬────────────────────────┘
                         │ 10GbE
                         ▼
            ┌──────────────────────────┐
            │ k3s cluster              │
            │ Traefik + Authentik      │
            │ Service + manual EP      │
            └──────────────────────────┘
                         │
                         ▼
                ┌──────────────────┐
                │ Internal apps    │
                │ LLM reqs/imgs    │
                └──────────────────┘
The Service in k3s has no selector—I manually create Endpoints pointing at 192.168.1.216:11435 (Ollama) and 192.168.1.216:8188 (ComfyUI). Traefik routes HTTPS traffic into these Services. Authentik can gate access. The cluster doesn’t know or care that the backend is a Mac three feet away; to the apps, it’s just a remote service like any other.
This sidesteps the GPU-in-Docker problem entirely. Native performance, simpler ops, and I get to keep the cluster purely amd64/Linux.
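Because the backend is just an HTTP service, a quick way to confirm the wiring is to ask Ollama which models it has. A minimal sketch using only the standard library and Ollama's `/api/tags` endpoint; the helper names are hypothetical, and the address is the one used for the Endpoints above:

```python
import json
import urllib.request


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint."""
    return base_url.rstrip("/") + "/api/tags"


def model_names(payload: bytes) -> list[str]:
    """Extract model names from an /api/tags JSON response body."""
    data = json.loads(payload)
    return [m["name"] for m in data.get("models", [])]


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query a native Ollama instance; raises OSError if it is unreachable."""
    with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
        return model_names(resp.read())


# Example (needs network access to the Mac):
#   list_models("http://192.168.1.216:11435")
```

Running the same call from a pod against the cluster-internal Service name checks the Service and Endpoints objects as well as the Mac itself.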
Splitting 256 GB between Ollama and ComfyUI
A Mac Studio M3 Ultra has 256 GB of unified memory, meaning the CPU and GPU share the same physical RAM pool. There’s no discrete VRAM fence; if Ollama loads a model with a 142 GB footprint, that’s 142 GB of unified memory spoken for. ComfyUI can’t touch it until the model is unloaded.
The answer is to size both workloads and tune memory contention. Here’s what I settled on:
Ollama daemon config (com.ollama.serve.plist):
OLLAMA_MAX_VRAM=253403627520 # ~236 GiB (256 - 20 for OS overhead)
OLLAMA_KEEP_ALIVE=-1 # Keep loaded models warm indefinitely
OLLAMA_NUM_PARALLEL=10 # Serve up to 10 requests per loaded model in parallel
qwen3:235b-a22b (235B parameters, quantized to fit) takes ~142 GB; a pair of 70B models totals ~142 GB. One or the other (not both) can stay in memory while ComfyUI uses the rest.
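The sizing above can be checked with a few lines of arithmetic. A rough sketch, assuming the 20 GB OS reserve from the config comment and treating GB and GiB as interchangeable for the headroom estimate (function names are hypothetical):

```python
GIB = 1024**3
TOTAL_GB = 256       # unified memory on the box
OS_RESERVE_GB = 20   # headroom left for macOS, per the config comment


def ollama_cap_bytes(total_gb: int = TOTAL_GB, reserve_gb: int = OS_RESERVE_GB) -> int:
    """Byte value for a cap of (total - reserve) GiB."""
    return (total_gb - reserve_gb) * GIB


def headroom_gb(resident_models_gb, total_gb=TOTAL_GB, reserve_gb=OS_RESERVE_GB):
    """Memory left for ComfyUI after the OS reserve and resident models."""
    return total_gb - reserve_gb - sum(resident_models_gb)


print(ollama_cap_bytes())        # 253403070464
print(headroom_gb([142]))        # 94
print(headroom_gb([142, 142]))   # -48
```

Exactly 236 GiB is 253,403,070,464 bytes; the configured value differs from that by well under 1 MiB. The negative result for two 142 GB loads is the "one or the other, not both" constraint in numeric form.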
ComfyUI on Mac (via Ansible):
--force-fp16 # Force 16-bit floats, halve memory footprint
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 # Disable PyTorch's upper limit on MPS allocations
PYTORCH_ENABLE_MPS_FALLBACK=1 # CPU fallback for unsupported ops
That watermark-ratio setting is load-bearing. By default, PyTorch's MPS allocator enforces an upper limit and raises an out-of-memory error once it is reached, even when plenty of unified memory is still free. Setting the ratio to 0.0 disables that limit, with the tradeoff that nothing then stops ComfyUI from overcommitting the machine. Combined with --force-fp16, a 1024×1024 SDXL generation fits in ~16–20 GB, leaving Ollama room to run concurrent inference.
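The flag and environment variables listed above can be collected into a small launcher. A minimal sketch, assuming ComfyUI is started from its checkout via main.py; the `--listen 0.0.0.0` bind address, the checkout path, and the function names are assumptions for illustration and are not taken from the Ansible role:

```python
import os


def comfyui_env(base_env=None) -> dict:
    """Return a process environment with the MPS settings from the post."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def comfyui_argv(python: str = "python", port: int = 8188) -> list:
    """Command line for ComfyUI; listening on all interfaces is an assumption
    made so the k3s nodes can reach the service."""
    return [python, "main.py", "--listen", "0.0.0.0",
            "--port", str(port), "--force-fp16"]


# Example launch (hypothetical checkout path):
#   import subprocess
#   subprocess.Popen(comfyui_argv(), env=comfyui_env(), cwd="/path/to/ComfyUI")
```

Keeping the environment and arguments in one place makes it harder for a manual restart to silently drop one of the memory settings.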
In practice: if Ollama isn’t under load, ComfyUI can run full-resolution generations freely. If an Ollama request comes in, ComfyUI pauses its current generation (unloads its working memory), Ollama handles the inference, then ComfyUI resumes. No OOM crashes because memory is always available when needed.
Two Mac Studios: hot path + heavy forge
A single M3 Ultra can run both workloads, but contention kills latency. I have two:
Box 1 (192.168.1.216, 10GbE):
- Role: hot-path inference (signal workers, news eval, embeddings, code completion)
- Ollama models: gemma4:26b (MoE), deepseek-r1:32b, qwen2.5-coder:32b, embeddings
- Public face of the inference cluster; all cluster LiteLLM proxies point here
- Powers the signal workers that run every 30 seconds; latency matters
Box 2 (TB5 direct link to Box 1, no public 10GbE):
- Role: heavy forge (long-horizon reasoning, large image batches, MoE sampling)
- Ollama models: qwen3:235b-a22b (~142 GB loaded), qwen3.5:122b-a10b (122B parameters), palmyra-fin-70b
- Direct access via Box 1’s nginx proxy; never a direct public route
- Handles batch jobs and multi-hour reasoning tasks that don’t need sub-second response times
They’re connected via Thunderbolt 5 (10.10.0.1 ↔ 10.10.0.2), which measures ~47 Gbps bidirectional in practice. Box 2 has no separate LAN connection—all its egress and cluster callbacks NAT through Box 1 via pfctl. I’ve also explored this dual-Mac split in a deeper post on using two M3 Ultras for MoE and batching.
This split is less work than balancing both roles on a single M3 Ultra. The hot-path box stays responsive, and the forge box can hold 200+ GB of models hot without competing for serving latency.
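From a client's point of view, the hot-path/forge split amounts to a routing table keyed by model name. A minimal sketch, assuming the forge is reached through the nginx port on Box 1 (11436, as described in the wiring section); the function and constant names are hypothetical, and in the post this aliasing lives in LiteLLM rather than in application code:

```python
BOX1_OLLAMA = "http://192.168.1.216:11435"     # hot path, served directly
BOX2_VIA_NGINX = "http://192.168.1.216:11436"  # forwarded to Box 2 over TB5

# Models listed for the heavy forge box.
FORGE_MODELS = {"qwen3:235b-a22b", "qwen3.5:122b-a10b", "palmyra-fin-70b"}


def base_url_for(model: str) -> str:
    """Pick the backend URL for a model; anything unlisted stays on the hot path."""
    return BOX2_VIA_NGINX if model in FORGE_MODELS else BOX1_OLLAMA


print(base_url_for("gemma4:26b"))        # http://192.168.1.216:11435
print(base_url_for("qwen3:235b-a22b"))   # http://192.168.1.216:11436
```

Defaulting unknown models to the hot-path box keeps latency-sensitive callers away from the forge unless a model is explicitly assigned there.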
The Kubernetes wiring
The Service definition is trivial because I’m not running pods:
apiVersion: v1
kind: Service
metadata:
  name: comfyui
  namespace: comfyui
spec:
  type: ClusterIP
  sessionAffinity: ClientIP # Sticky sessions for job polling
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
  ports:
    - name: http
      port: 8188
      targetPort: 8188
---
apiVersion: v1
kind: Endpoints
metadata:
  name: comfyui
  namespace: comfyui
subsets:
  - addresses:
      - ip: 192.168.1.216
        nodeName: mac-studio
    ports:
      - name: http
        port: 8188
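That sessionAffinity: ClientIP is the only gotcha. ComfyUI returns job IDs from workflow submissions; if the next request round-robins to a different backend, the job won’t be found. Sticky sessions avoid that without needing a load balancer. With the single endpoint above the setting is a no-op, but it keeps polling correct as soon as a second backend address is added.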
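The failure mode is easy to reproduce with two in-memory stand-ins for ComfyUI. This is a toy simulation, not ComfyUI's or kube-proxy's actual code: the backend class is fake, and hashing the client IP is a simplification of how ClientIP affinity pins a client to one endpoint.

```python
import hashlib
import itertools
import uuid


class FakeComfyBackend:
    """Stand-in for one ComfyUI instance; jobs live only in its own memory."""

    def __init__(self, name):
        self.name = name
        self.jobs = {}

    def submit(self, workflow):
        prompt_id = uuid.uuid4().hex
        self.jobs[prompt_id] = workflow
        return prompt_id

    def has_job(self, prompt_id):
        return prompt_id in self.jobs


def round_robin(backends):
    """Each request goes to the next backend, regardless of client."""
    cycle = itertools.cycle(backends)
    return lambda client_ip: next(cycle)


def client_ip_affinity(backends):
    """Requests from the same client IP always reach the same backend."""
    def pick(client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        return backends[digest[0] % len(backends)]
    return pick


def submit_then_poll(pick, client_ip="10.42.0.7"):
    """Submit a job, then poll for it with a second request."""
    prompt_id = pick(client_ip).submit({"steps": 20})
    return pick(client_ip).has_job(prompt_id)


backends = [FakeComfyBackend("a"), FakeComfyBackend("b")]
print(submit_then_poll(round_robin(backends)))         # False
print(submit_then_poll(client_ip_affinity(backends)))  # True
```

With round-robin the poll lands on the backend that never saw the submission; with affinity both requests reach the same instance.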
Traefik routes HTTPS to this Service, Authentik gates access, and internal apps see it as http://comfyui.comfyui.svc.cluster.local:8188. The Mac running ComfyUI natively is completely invisible from the cluster’s perspective.
For Ollama, I run a second Service at ollama.ollama.svc.cluster.local:11435 the same way. LiteLLM proxies inside the cluster alias large models to Box 2 via Box 1’s nginx proxy (http://192.168.1.216:11436), which forwards to Box 2 over TB5.
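Since every native backend needs the same selector-less Service plus Endpoints pair, the manifests can be generated instead of copied. A minimal sketch that emits JSON, which `kubectl apply -f` accepts alongside YAML; the function name is hypothetical, and it omits the sessionAffinity block on the assumption that Ollama requests are stateless:

```python
import json


def external_service(name: str, namespace: str, ip: str, port: int) -> dict:
    """Build a selector-less Service and matching Endpoints for a native backend."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "ClusterIP",
            # No selector: Kubernetes will not manage the Endpoints for us.
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        # Must share the Service's name and namespace to be associated with it.
        "metadata": {"name": name, "namespace": namespace},
        "subsets": [{
            "addresses": [{"ip": ip}],
            "ports": [{"name": "http", "port": port}],
        }],
    }
    return {"apiVersion": "v1", "kind": "List", "items": [service, endpoints]}


manifest = external_service("ollama", "ollama", "192.168.1.216", 11435)
print(json.dumps(manifest, indent=2))
```

Piping the output to `kubectl apply -f -` would create both objects; the port names in the Service and Endpoints have to match for traffic to flow.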
Why not container-based GPU compute
Every time I’ve tried to containerize GPU workloads on this cluster, I’ve hit one of two problems:
- MPS doesn’t play with container runtimes. Docker and containerd have no device bindings for the Apple Silicon GPU. NVIDIA has been building CUDA GPU support in containers for 10 years; Apple hasn’t.
- You end up proxying anyway. I tried running Ollama inside k3s with elevated privileges and GPU access. It never worked reliably; the inference seemed to work but I’d get mysterious pauses and OOM kills. The “native service + k3s proxy” pattern Just Works™.
The native approach also gives me freedom: I can use Homebrew, pip packages, and macOS-specific workflows. No need to build ARM64 Docker images or beg for someone’s pre-built multi-arch manifest. Ollama is a single Homebrew install; ComfyUI is a Git clone + Python venv.
The honest caveats
- No CUDA ecosystem. Some nodes and specialized libraries expect NVIDIA hardware. I can’t use certain video encoders or compute-heavy post-processors without CPU fallback.
- MPS has sharp edges. Sometimes PyTorch or other libraries hit an unsupported operation and fall back to CPU, which tanks throughput to seconds-per-operation. Setting PYTORCH_ENABLE_MPS_FALLBACK=1 keeps it from crashing, but you notice the slowdown.
- Memory fragmentation. After hours of mixed inference + generation, unified memory can fragment badly and lead to spurious OOM even with 256 GB free. Restarting the services once weekly prevents this.
- Thunderbolt 5 is flaky. The Box 1 ↔ Box 2 TB5 link drops occasionally (weather, RF interference, I don’t know). I have pfctl NAT fallback routing to Box 2’s WiFi DHCP IP as a backup, but it’s slower. The link stays up for weeks at a time, then flakes twice in an hour.
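The TB5-then-Wi-Fi fallback can also be expressed as an ordered reachability check. A minimal sketch of the idea only: in the post the fallback is done with pfctl, the Wi-Fi address below is a made-up placeholder, and port 11434 (Ollama's default) is an assumption about what Box 2 listens on.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_route(routes, probe=tcp_reachable):
    """Return the name of the first reachable route, or None if all are down."""
    for name, host, port in routes:
        if probe(host, port):
            return name
    return None


ROUTES = [
    ("tb5", "10.10.0.2", 11434),       # direct Thunderbolt 5 link (port assumed)
    ("wifi", "192.168.1.250", 11434),  # hypothetical Wi-Fi DHCP address
]

# Example: pick_route(ROUTES) returns "tb5" while the link is up.
```

Passing the probe in as a parameter keeps the ordering logic testable without touching the network.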
Lessons
- Don’t force containers where the OS handles it better. Apple Silicon GPU is not container-friendly; native services + k3s proxy sidesteps the problem cleanly.
- Unified memory requires deliberate tuning. OLLAMA_MAX_VRAM + PYTORCH_MPS_HIGH_WATERMARK_RATIO + explicit model sizing prevents OOM thrashing.
- Sticky sessions matter for stateful backends. ComfyUI job polling without sessionAffinity is a footgun.
- Two specialized boxes beat one loaded box. Splitting hot-path and forge by machine keeps response latency predictable and model count high.
- macOS launchd is underrated. launchctl load/unload and plist-based config are simpler than systemd for single-machine workloads. No service file rewrites on every deployment.
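For readers who have not written a launchd job before, the plist can be generated with Python's standard plistlib. A minimal sketch, assuming a Homebrew install path for the ollama binary and a job label derived from the com.ollama.serve.plist file name; both are assumptions, as are the RunAtLoad and KeepAlive choices:

```python
import plistlib


def ollama_plist(binary: str = "/opt/homebrew/bin/ollama") -> bytes:
    """Render a launchd job definition for `ollama serve` as XML plist bytes."""
    job = {
        "Label": "com.ollama.serve",            # assumed from the file name
        "ProgramArguments": [binary, "serve"],
        "EnvironmentVariables": {               # values from the config above
            "OLLAMA_MAX_VRAM": "253403627520",
            "OLLAMA_KEEP_ALIVE": "-1",
            "OLLAMA_NUM_PARALLEL": "10",
        },
        "RunAtLoad": True,   # start when the job is loaded
        "KeepAlive": True,   # restart the daemon if it exits
    }
    return plistlib.dumps(job)


print(ollama_plist().decode())
```

Writing the output to a LaunchAgents or LaunchDaemons directory and loading it with launchctl would register the service; launchd environment values are strings, which is why the numbers are quoted.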
Don’t have a homelab? The exact pattern works on cloud Macs or any GPU-equipped instance: run inference natively, proxy it into your infrastructure, and skip the container GPU bind complexity. DigitalOcean’s cloud compute supports dedicated instances with GPU options; infrastructure-as-code your proxy layer and let the GPU box be a simple native service.