TL;DR
My Mac Studio M3 Ultra runs Ollama with 70B+ models but isn’t a k3s node. I needed it to show up in Grafana next to the cluster workloads. The solution: node_exporter for system metrics, a Go reverse proxy for per-model inference metrics, a custom Python exporter for model inventory and VRAM tracking, and Grafana Alloy for shipping logs to Loki. All four services managed by Ansible, all metrics scraped by the cluster’s Prometheus.
The problem
The Mac Studio (192.168.1.216) is the primary inference host — 256GB unified memory, 70B+ parameter models, handling all local LLM requests via Ollama. But it’s not a Kubernetes node. It runs macOS, not Linux. It doesn’t have a kubelet, a node_exporter DaemonSet, or any of the automatic observability that cluster nodes get.
Without monitoring, I was flying blind. Questions I couldn’t answer:
- How much VRAM is consumed right now?
- Which models are loaded and when do they get evicted?
- What’s the token throughput per model?
- Is the box running hot on CPU or memory?
- What’s in the Ollama logs when a request fails?
The architecture
Four services, each solving a different part of the problem:
```
Mac Studio (192.168.1.216)
├── node_exporter :9100         → CPU, memory, disk, network
├── ollama-metrics-proxy :9836  → per-model latency, throughput, token counts
├── ollama-model-exporter :9837 → loaded models, VRAM per model, disk usage
└── Grafana Alloy               → logs → Loki (HTTP push)

k3s Cluster
├── Prometheus scrapes :9100, :9836, :9837
├── Loki receives log push from Alloy
└── Grafana dashboard visualizes everything
```
node_exporter (system metrics)
The easiest piece. Homebrew has a node_exporter package that works on macOS:
```bash
brew install node_exporter
brew services start node_exporter
```
This gives standard system metrics on :9100/metrics — CPU usage, memory pressure, disk I/O, network throughput. Same metrics you’d get on a Linux node, minus the kernel-specific ones.
Prometheus scrapes it as a static target:
```yaml
- job_name: inference-hosts
  static_configs:
    - targets: ["192.168.1.216:9100"]
      labels:
        host: mac-studio
        environment: homelab
```
Nothing interesting here, but it’s the foundation for the system resource panels in the dashboard.
ollama-metrics-proxy (inference metrics)
Ollama (as of 0.20.x) doesn’t expose a native Prometheus /metrics endpoint. There’s an internal metrics flag (OLLAMA_METRICS=1), but it doesn’t produce Prometheus-format output yet.
The workaround is ollama-metrics-proxy — a Go reverse proxy that sits between clients and Ollama, observes every request/response, and exposes Prometheus metrics.
The port trick: Ollama normally listens on :11434. The proxy takes over :11434 and forwards to Ollama on :11435. Clients don’t know the proxy exists.
```
Client → :11434 (proxy) → :11435 (Ollama)
                 ↓
         :9836/metrics (Prometheus)
```
The proxy exposes per-model metrics:
- Token counts — prompt tokens ingested, completion tokens generated
- Request latency — p50, p95, p99 percentiles per model
- Throughput — tokens/second per model
- Request volume — requests/second per model
- Active requests — concurrent inference count
This is the data that answers “is the 70B model slower than expected” and “how many requests are queuing.”
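The proxy’s actual metric names depend on the build, so check :9836/metrics for what it really exposes; but the dashboard queries amount to standard PromQL over a latency histogram and token counters. A sketch with illustrative metric names:

```
# p95 request latency per model (metric names are hypothetical —
# substitute whatever appears on :9836/metrics)
histogram_quantile(0.95,
  sum by (model, le) (rate(ollama_request_duration_seconds_bucket[5m])))

# completion tokens per second, per model
sum by (model) (rate(ollama_completion_tokens_total[5m]))
```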
Installing with Ansible
The proxy is built from source since there’s no Homebrew package:
```yaml
- name: Clone ollama-metrics-proxy
  git:
    repo: https://github.com/elliotfehr/ollama-metrics-proxy.git
    dest: /tmp/ollama-metrics-proxy

- name: Build proxy binary
  command: go build -o /opt/homebrew/bin/ollama-metrics-proxy
  args:
    chdir: /tmp/ollama-metrics-proxy
```
It runs as a launchd service with KeepAlive: true — macOS will restart it automatically if it crashes.
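For reference, the launchd plist for the proxy looks roughly like this — the label and log path here are illustrative (the Ansible role templates the real one), but RunAtLoad and KeepAlive are the keys doing the work:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.homelab.ollama-metrics-proxy</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama-metrics-proxy</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardErrorPath</key>
  <string>/Users/mat/Library/Logs/ollama-metrics-proxy.log</string>
</dict>
</plist>
```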
ollama-model-exporter (model inventory)
The proxy handles request-level metrics, but I also wanted to know what’s loaded in VRAM right now — which models, how much memory each one uses, and when they’ll be evicted.
I wrote a 136-line Python exporter that polls Ollama’s API and exposes model inventory as Prometheus metrics:
```
# Polls /api/ps (running models) and /api/tags (available models)
# Exposes metrics on :9837/metrics

ollama_up                 # 1 if Ollama is reachable
ollama_loaded_models      # count of models in VRAM
ollama_available_models   # count of models on disk
ollama_total_vram_bytes   # aggregate VRAM consumed
ollama_models_disk_bytes  # total disk space used

# Per-model metrics with labels
ollama_model_vram_bytes{model, family, quantization}
ollama_model_size_bytes{model, family, quantization}
ollama_model_context_length{model, family, quantization}
ollama_model_parameters_billions{model, family, quantization}
ollama_model_expires_seconds{model, family, quantization}  # TTL until eviction
```
The expires_seconds metric is the interesting one. Ollama unloads models from VRAM after a TTL expires (default 5 minutes of no requests). The exporter parses the ISO 8601 expires_at timestamp from /api/ps and calculates seconds remaining. This shows up in Grafana as a countdown — you can see when a model is about to be evicted.
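The TTL calculation is a few lines of stdlib Python. A sketch, assuming an expires_at in Go’s RFC 3339 style (fractional seconds plus a timezone offset), which is what Ollama emits:

```python
import re
from datetime import datetime, timezone

def seconds_until_eviction(expires_at: str) -> float:
    """Parse an Ollama expires_at timestamp; return seconds remaining."""
    # Go can emit more fractional-second digits than fromisoformat()
    # tolerates, so trim the fraction to six digits first.
    trimmed = re.sub(r"\.(\d{6})\d+", r".\1", expires_at)
    expires = datetime.fromisoformat(trimmed)
    remaining = (expires - datetime.now(timezone.utc)).total_seconds()
    return max(remaining, 0.0)  # already-evicted models clamp to 0
```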
The labels include quantization (Q4_K_M, Q5_K_M, etc.) and family (llama, qwen, gemma) so you can filter and group in dashboards.
Why a custom exporter instead of an existing one?
I looked. The existing Ollama exporters either don’t expose per-model VRAM, don’t parse the eviction TTL, or require Go compilation. This is 136 lines of Python with zero external dependencies — it uses only http.server, urllib, and json from the standard library. It runs anywhere Python 3 runs.
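The whole pattern condenses to something like the sketch below — poll the API, render Prometheus text format, serve it with http.server. The real exporter adds the per-model labels and the TTL math; this stripped-down version only emits a few of the aggregate metrics (the size_vram field comes from /api/ps):

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OLLAMA = "http://127.0.0.1:11434"

def scrape() -> str:
    """Poll Ollama's API and render Prometheus text-format metrics."""
    lines = []
    try:
        with urllib.request.urlopen(f"{OLLAMA}/api/ps", timeout=5) as resp:
            running = json.load(resp).get("models", [])
        lines.append("ollama_up 1")
        lines.append(f"ollama_loaded_models {len(running)}")
        vram = sum(m.get("size_vram", 0) for m in running)
        lines.append(f"ollama_total_vram_bytes {vram}")
    except (OSError, ValueError):
        lines.append("ollama_up 0")  # unreachable or bad response
    return "\n".join(lines) + "\n"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = scrape().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 9837), Handler).serve_forever()
```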
Prometheus scrapes it on a tight 15-second interval (vs the cluster’s default 30s) because model load/unload events are fast and I want the dashboard to reflect current state:
```yaml
- job_name: ollama-models
  scrape_interval: 15s
  static_configs:
    - targets: ["192.168.1.216:9837"]
      labels:
        host: mac-studio
```
Grafana Alloy (log shipping)
Metrics tell you what happened. Logs tell you why. Grafana Alloy (the successor to Promtail) runs on the Mac Studio and pushes logs to Loki in the cluster.
The config is minimal:
```alloy
loki.write "default" {
  endpoint {
    url = "http://loki.k3s.internal.zolty.systems/loki/api/v1/push"
  }
}

local.file_match "system_log" {
  path_targets = [{
    __path__ = "/var/log/system.log",
    job      = "macos-system",
    host     = "mac-studio",
  }]
}

loki.source.file "system_log" {
  targets    = local.file_match.system_log.targets
  forward_to = [loki.write.default.receiver]
}

local.file_match "ollama_log" {
  path_targets = [{
    __path__ = "/Users/mat/Library/Logs/ollama-serve.log",
    job      = "ollama",
    host     = "mac-studio",
  }]
}

loki.source.file "ollama_log" {
  targets    = local.file_match.ollama_log.targets
  forward_to = [loki.write.default.receiver]
}
```
Two log sources: the macOS system log and Ollama’s serve log. Both get a host=mac-studio label so they’re filterable in Grafana.
The push endpoint is an HTTP Traefik IngressRoute that routes to the Loki service inside the cluster. No TLS for internal traffic — the Mac Studio is on the same physical network as the cluster nodes.
The Grafana dashboard
All of this feeds into a single dashboard with five sections:
Ollama Status — Is it up? How many models loaded? Total VRAM used? VRAM utilization gauge (256GB capacity, yellow at 70%, red at 90%).
Loaded Models — Table showing each loaded model with VRAM consumption, context window, parameter count, and eviction TTL. Sortable columns.
Inference Performance — Time series panels for tokens generated, tokens/sec, request latency p95, and request volume. All broken down by model.
System Resources — CPU, memory, network I/O, disk usage. Standard node_exporter panels but tuned for macOS specifics (256GB total memory, 28-core CPU).
Logs — Embedded Loki panel filtering for {host="mac-studio"} with a regex that surfaces interesting events: model loads, unloads, errors, and warnings.
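The Logs panel is an ordinary LogQL line filter; it amounts to something like this (the exact regex is illustrative):

```
{host="mac-studio"} |~ `(?i)(load|unload|error|warn)`
```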
The VRAM-over-time panel is particularly useful. It shows a stacked area chart with each loaded model’s VRAM consumption plus a 256GB capacity line. You can see exactly when models get loaded, how much memory they consume, and when they get evicted.
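Sketched, the VRAM panel’s queries come straight from the exporter’s metrics, with the capacity line as a constant expression:

```
# stacked area: VRAM per loaded model
sum by (model) (ollama_model_vram_bytes)

# capacity line: 256GB in bytes
256 * 1024 * 1024 * 1024
```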
Ansible playbook structure
The entire stack deploys with a single playbook using granular tags:
```bash
# Deploy everything
ansible-playbook playbooks/mac-studio-monitoring.yml

# Just update the model exporter
ansible-playbook playbooks/mac-studio-monitoring.yml --tags exporter

# Just reconfigure Alloy
ansible-playbook playbooks/mac-studio-monitoring.yml --tags alloy
```
Tags: node_exporter, proxy, exporter, alloy, verify. The verify tag runs HTTP checks against all four endpoints to confirm everything is healthy.
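The verify tasks are plain HTTP checks against the exporter ports; roughly this (task name illustrative — the Alloy check is analogous but hits Alloy’s own HTTP endpoint rather than a /metrics path):

```yaml
- name: Verify metrics endpoints respond
  uri:
    url: "http://192.168.1.216:{{ item }}/metrics"
    status_code: 200
  loop: [9100, 9836, 9837]
  tags: [verify]
```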
All services use macOS launchd plists with RunAtLoad: true and KeepAlive: true. They survive reboots and auto-restart on crashes. Logs go to ~/Library/Logs/ — the standard macOS location.
The playbook is idempotent. Running it twice changes nothing. Running it after a macOS update restores any services that got disrupted.
What I learned
Ollama’s metrics story is immature. The OLLAMA_METRICS=1 flag exists but doesn’t produce Prometheus output yet. The community has filled the gap with proxies and exporters, but it’s fragmented. I expect this to get better in future Ollama releases — for now, the proxy + custom exporter combo works.
launchd is fine. I was worried about managing services on macOS after years of systemd. launchd plists are more verbose but functionally equivalent for this use case. Ansible’s community.general.launchd module handles the service lifecycle.
15-second scrape interval for model state. The default 30s misses model load/unload events. When someone asks “which model is loaded right now,” 30-second-old data isn’t good enough. 15s keeps the dashboard responsive without adding meaningful load to a machine with 28 CPU cores.
Push-based log shipping is simpler for external hosts. Promtail/Alloy pushing to Loki avoids opening inbound ports on the Mac Studio. The Mac pushes; the cluster receives. No bidirectional connectivity needed.
Lessons
- Non-k8s hosts need explicit observability. Cluster nodes get monitoring for free via DaemonSets. External hosts get nothing unless you build it.
- Custom exporters beat generic ones when you need specific metrics. 136 lines of Python gives me exactly the model inventory metrics I need, with the labels I want. No configuration files, no Go compilation.
- Ansible makes macOS service management reproducible. Without it, these four services are “things I set up once and hope survive the next macOS update.” With it, they’re code.
- VRAM eviction TTL is the metric I didn’t know I needed. Knowing when a model will be unloaded changes how I think about capacity planning.
Don’t have a homelab? If you’re running Ollama on any macOS or Linux box, the same exporter and proxy pattern works. The Grafana dashboard JSON is in a ConfigMap — import it into any Grafana instance. A DigitalOcean Droplet with Grafana Cloud’s free tier handles the Prometheus and Loki endpoints.