TL;DR

I loaded Z.ai’s GLM-5.1 — a 744B parameter MoE model with 40B active parameters — onto a Mac Studio M3 Ultra with 256GB unified memory using a 2-bit quantized GGUF via llama.cpp. It runs at 5.8 tok/s with a 120-second time to first token. The financial analysis quality is genuinely impressive, but it eats 222GB of the 256GB available, leaving room for literally nothing else. It’s a “clear the schedule” model, not an always-on one.

Why run a 744B model locally

I run a local inference stack on my Mac Studio for an AI trading agent called OpenClaw. The stack normally serves 10+ models through Ollama — everything from fast 26B MoE models for triage to a 235B Qwen3 for deep reasoning. All local, all free, all running concurrently.

GLM-5.1 caught my eye because of its published benchmarks: 58.4% on SWE-Bench Pro, 68.7% on CyberGym, and strong marks on agentic task completion. Z.ai built it specifically for “agentic engineering and long-horizon software development tasks.” I wanted to know two things:

  1. Can it actually run on Apple Silicon?
  2. Is the quality good enough to justify monopolizing the entire machine?

The hardware

  • Mac Studio M3 Ultra: 256GB unified memory, 28-core CPU (20P+8E), 60-core GPU, 2TB internal SSD
  • Unified memory bandwidth: 800 GB/s — this is what makes MoE models viable on Apple Silicon
  • Available for inference: ~222GB after macOS overhead (per Ollama’s recommendedMaxWorkingSetSize)

The M3 Ultra is uniquely positioned for this kind of test. Most consumer GPUs top out at 24-48GB VRAM. Even a dual-GPU workstation with 2x 48GB cards can’t fit a 220GB model without heavy CPU offloading. Unified memory sidesteps this entirely — the GPU and CPU share the same 256GB pool.

Getting the model

GLM-5.1 isn’t available as a local Ollama model — the only Ollama tag is cloud (API-only). I had to go through llama.cpp directly.

Unsloth publishes quantized GGUFs at unsloth/GLM-5.1-GGUF. The options:

| Quantization | Disk Size | Quality | Fits 256GB? |
|---|---|---|---|
| FP16 (full) | 1.65TB | Best | No |
| Q8_0 | 805GB | Great | No |
| UD-IQ2_M (2-bit dynamic) | 220GB | Good | Yes, barely |
| UD-IQ1_M (1-bit dynamic) | 200GB | Degraded | Yes, with headroom |

I went with UD-IQ2_M. Unsloth’s “Dynamic 2.0” quantization upcasts important layers to 8 or 16 bits while aggressively quantizing less critical ones. The theory is you get better quality than uniform 2-bit at a similar size.
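
A quick sanity check on what "dynamic 2-bit" means in practice: dividing each published file size by the 744B parameter count gives the effective bits per weight. These are rough figures (GGUF files include metadata), but they show UD-IQ2_M averaging well above a uniform 2 bits — consistent with some layers being kept at 8 or 16 bits:

```python
# Effective bits per weight implied by each GGUF's file size.
# Rough: published sizes include tokenizer/metadata overhead.
TOTAL_PARAMS = 744e9  # GLM-5.1 total parameter count

def bits_per_weight(size_gb: float) -> float:
    return size_gb * 1e9 * 8 / TOTAL_PARAMS

for name, gb in [("Q8_0", 805), ("UD-IQ2_M", 220), ("UD-IQ1_M", 200)]:
    print(f"{name}: ~{bits_per_weight(gb):.2f} bits/weight")
# Q8_0: ~8.66, UD-IQ2_M: ~2.37, UD-IQ1_M: ~2.15
```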

Building llama.cpp

Apple Silicon support comes from llama.cpp’s Metal backend. Build with:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build \
  -DGGML_METAL=ON \
  -DGGML_CUDA=OFF \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.physicalcpu)

The -DGGML_CUDA=OFF is explicit — Metal is the default on macOS, but I’ve seen builds get confused if CUDA detection picks up stale headers.

Downloading the model

The model is split across 6 GGUF shards totaling 220GB:

hf download unsloth/GLM-5.1-GGUF \
  --include "*UD-IQ2_M*" \
  --local-dir ~/llm-bench/models/glm-5.1

This took about 13 minutes on my connection. The HF CLI downloads shards in parallel, which helps.
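
For reference, 220GB in ~13 minutes implies sustained throughput of roughly:

```python
# Implied download throughput: 220GB of shards in ~13 minutes.
size_gb, minutes = 220, 13
mb_per_s = size_gb * 1000 / (minutes * 60)
print(f"~{mb_per_s:.0f} MB/s (~{mb_per_s * 8 / 1000:.1f} Gbit/s)")
# ~282 MB/s (~2.3 Gbit/s)
```

So you realistically need a multi-gigabit connection to match that time; on a 1 Gbit/s line, budget closer to half an hour.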

Running the benchmark

The critical step is stopping Ollama first. GLM-5.1 needs 222GB — there’s no room for it to coexist with Ollama’s model fleet. I wrote a script that:

  1. Unloads all three Ollama launchd services (backend, metrics proxy, model exporter)
  2. Waits 30 seconds for macOS to reclaim unified memory pages
  3. Starts llama-server on a separate port (11436)
  4. Runs the benchmark
  5. Kills llama-server and restores Ollama
The llama-server invocation at the heart of the script:

./llama-server \
  -m ~/llm-bench/models/glm-5.1/UD-IQ2_M/GLM-5.1-UD-IQ2_M-00001-of-00006.gguf \
  --host 0.0.0.0 --port 11436 \
  -c 8192 -ngl 99 -np 1 \
  --temp 0.7 --top-p 1.0 \
  --chat-template-kwargs '{"enable_thinking":false}'

Key flags:

  • -ngl 99: Offload all layers to Metal GPU
  • -c 8192: 8K context window (the model supports 200K, but I’m benchmarking quality, not context length)
  • --chat-template-kwargs '{"enable_thinking":false}': Disable GLM-5.1’s built-in chain-of-thought mode for cleaner benchmark timing
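
Once the server is up, it speaks the OpenAI-compatible chat completions protocol. A minimal standard-library client might look like this — the model name is a placeholder (llama-server serves whatever model it loaded), and thinking mode is already disabled server-side by the flag above:

```python
# Minimal client for the llama-server endpoint started above.
# Assumes the server is listening on localhost:11436; "glm-5.1" is a
# placeholder model name — llama-server ignores it and uses the loaded model.
import json
import urllib.request

payload = {
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Summarize the current VIX regime."}],
    "temperature": 0.7,
    "top_p": 1.0,
}
req = urllib.request.Request(
    "http://localhost:11436/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # blocks ~2 minutes at this model's TTFT
```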

The model took about 4 minutes to load from SSD into unified memory. Once loaded, the server reported:

recommendedMaxWorkingSetSize = 239143.78 MB
MTL model buffer size = 221832 MiB

222GB consumed. 5GB left for the OS.

Benchmark results

I ran 7 prompts across three categories: VIX/market analysis (3), financial reasoning (2), and code generation (2). These are the same prompts I use to evaluate every model in my local fleet.

| Prompt | tok/s | TTFT (s) | Total (s) | Tokens |
|---|---|---|---|---|
| VIX Regime Analysis | 6.7 | 82 | 149 | 997 |
| Yield Curve Inversion | 7.4 | 92 | 184 | 1,366 |
| Options Roll Strategy | 6.1 | 107 | 181 | 1,100 |
| Rate Hike FCF Impact | 6.2 | 127 | 220 | 1,358 |
| Duration & Convexity | 5.0 | 150 | 229 | 1,144 |
| Python Options Chain | 2.4 | 196 | 237 | 570 |
| K8s CronJob YAML | 6.6 | 84 | 152 | 1,009 |

Average: 5.8 tok/s, 120 seconds to first token.
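
Both summary numbers fall straight out of the per-prompt table:

```python
# Recomputing the summary line from the per-prompt results above.
tok_s = [6.7, 7.4, 6.1, 6.2, 5.0, 2.4, 6.6]
ttft_s = [82, 92, 107, 127, 150, 196, 84]
avg_rate = sum(tok_s) / len(tok_s)
avg_ttft = sum(ttft_s) / len(ttft_s)
print(f"avg: {avg_rate:.1f} tok/s, {avg_ttft:.0f}s TTFT")
# avg: 5.8 tok/s, 120s TTFT
```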

Context: how this compares to my existing fleet

| Model | Architecture | tok/s | TTFT | Memory |
|---|---|---|---|---|
| Gemma 4 26B (MoE) | 26B, Ollama | 76 | ~1s | 17GB |
| Llama 4 Scout (MoE) | 109B/17B active, Ollama | 41 | ~2s | 67GB |
| Qwen3 235B (MoE) | 235B/22B active, Ollama | 26 | ~3s | 142GB |
| Palmyra-Fin-70B | 70B dense, Ollama | 10.6 | ~5s | 42GB |
| GLM-5.1 | 744B/40B active, llama.cpp | 5.8 | 120s | 222GB |

GLM-5.1 is 13x slower than Gemma 4 and nearly 5x slower than Qwen3 235B. The TTFT is the real killer — two full minutes before you see the first token.

Quality assessment

The numbers don’t tell the whole story. Let me share what GLM-5.1 actually produced.

The VIX regime analysis

I asked about hedging costs when VIX jumps from 18.2 to 28.3. GLM-5.1 didn’t just say “puts are more expensive.” It broke down four specific mechanics — absolute cost increase, vega risk from vol crush, accelerating theta at high IV, and steepening skew pushing break-evens further out. Then it recommended a zero-cost collar with a clear explanation of why it works specifically in a VIX 28 environment (calls are also expensive, so the financing works). It closed with a put debit spread alternative for traders who need uncapped upside.

That’s CFA-level structured thinking, not generic LLM fluff.

The FCF modeling

When asked to model a 75bps rate hike impact on a company with $800M floating-rate debt, GLM-5.1 built a complete financial model with a before/after table, correctly computed the tax shield effect ($6M pre-tax hit becomes $4.5M after-tax), and then discussed second-order risks — SOFR repricing lag, EBITDA margin compression, refinancing maturity risk, and covenant proximity.

The models I typically run (Gemma 4, Scout) give decent answers to these prompts, but they rarely build structured tables or catch the tax shield nuance unprompted.

The code generation

Weakest category. The Python options chain function was correct and well-structured, but at 2.4 tok/s it took nearly 4 minutes to generate 570 tokens. My Qwen2.5-Coder 32B produces equivalent code at 26 tok/s.

The verdict

GLM-5.1 at 2-bit quantization on a Mac Studio is technically impressive and practically impractical for automated workflows.

The good:

  • Quality is genuinely frontier-tier on financial reasoning and structured analysis
  • The model fits in 256GB unified memory — barely, but it works
  • Metal acceleration means no CPU fallback needed
  • MoE architecture (40B active of 744B total) keeps inference somewhat manageable

The bad:

  • 120-second TTFT makes it unusable for any pipeline or CronJob-based workflow
  • 5.8 tok/s means a detailed response takes 2-4 minutes
  • Monopolizes the entire machine — zero room for other models
  • No Ollama support — requires llama.cpp server management
  • 220GB download for a model you’ll use occasionally
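
To put the response-time bullet in wall-clock terms: 2-4 minutes is the generation time alone for a 700-1,400 token answer; end-to-end you also pay the 120-second TTFT, so the real wait is closer to 4-6 minutes:

```python
# Wall-clock math behind the "2-4 minutes" bullet, using the measured
# averages (120s TTFT, 5.8 tok/s decode).
TTFT_S, RATE = 120, 5.8

def wall_clock_min(tokens: int) -> float:
    return (TTFT_S + tokens / RATE) / 60

for tokens in (700, 1100, 1400):
    print(f"{tokens} tokens: {wall_clock_min(tokens):.1f} min end-to-end")
# 700 tokens: 4.0 min, 1100 tokens: 5.2 min, 1400 tokens: 6.0 min
```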

Where it makes sense:

  • Deep research sessions where you want the best open-weight reasoning available
  • Complex financial modeling prompts where quality matters more than speed
  • Agentic coding sessions (its stated strength) where you’re willing to wait
  • Situations where you’d otherwise pay for a frontier API call

Where it doesn’t:

  • Automated trading signal pipelines (my VIX signal worker runs every 30 minutes — 120s TTFT is a non-starter)
  • Any multi-model workflow (it can’t coexist with Ollama)
  • Quick iterations during development

I’m keeping the model on disk for on-demand sessions but not wiring it into my always-on LiteLLM proxy. For production OpenClaw workloads, the MoE models running through Ollama (Gemma 4 at 76 tok/s, Scout at 41 tok/s, Qwen3 235B at 26 tok/s) deliver better value — and a separate A/B test I ran this week across 5,475 evaluations showed that general-purpose MoE models actually outperform domain-specialized models on my trading prompts.

The scripts

I built a reusable benchmark harness and a three-script workflow for running oversized models:

  • scripts/glm51-bench-setup.sh — builds llama.cpp and downloads the GGUF
  • scripts/glm51-bench-run.sh — stops Ollama, loads GLM-5.1, runs benchmark, stops server
  • scripts/glm51-bench-restore.sh — restores all Ollama services and verifies health
  • scripts/llm-benchmark.py — reusable harness that works with any OpenAI-compatible endpoint

The benchmark harness accepts --endpoint, --model, and --suite flags, so I can run the same prompts against any model for apples-to-apples comparison. All results go to JSON with a markdown summary table.
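
The timing core of a harness like this can be sketched as follows (a simplification of what mine does, not its exact code): wrap the streamed response from any OpenAI-compatible endpoint in an iterator of chunks, then time the first chunk (TTFT) separately from the rest (decode rate). One caveat baked into the sketch: streamed chunks only approximate tokens.

```python
import time
from typing import Iterable, Optional, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tok_per_s) for a stream of chunks.

    In a real harness the iterator wraps an SSE response from an
    OpenAI-compatible /v1/chat/completions call with stream=True.
    Chunks are treated as tokens, which is approximately true for
    llama-server's streaming output.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    n = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first chunk marks end of prefill
        n += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    # The first token is attributed to prefill, so the decode rate counts
    # the remaining n-1 tokens over the post-TTFT window.
    decode_time = (end - first) if first is not None else 0.0
    rate = (n - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, rate
```

Separating the two phases matters here: with a 120s TTFT and a 5.8 tok/s decode rate, a single blended tokens-per-second number would badly understate how fast the model generates once it finally starts.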

What’s next

I’m watching two models that might change the calculus:

  • DeepSeek R2 — if/when it lands on Ollama, its 32B size and reported 92.7% AIME score could replace my reasoner slot at a fraction of the memory cost
  • DragonLLM Qwen-Pro-Finance-R-32B — a finance-tuned 32B model that’s currently gated on HuggingFace, pending access approval

The broader lesson: unified memory on Apple Silicon is a legitimate platform for running frontier-class models locally. The M3 Ultra’s 256GB ceiling is the constraint — if Apple ships a 512GB variant, models like GLM-5.1 would run with comfortable headroom and could potentially serve as always-on endpoints. Until then, it’s a party trick with genuinely good output.