Self-Hosted AI on a 24GB GPU: OpenClaw + Ollama Setup Guide for Windows

TL;DR

You have a 24GB VRAM GPU. You want a private, self-hosted AI assistant that rivals ChatGPT – no subscriptions, no data leaving your machine. This guide walks you through setting up Ollama (local model runtime) and OpenClaw (AI gateway with a web UI) on Windows using Docker Desktop.

But the real value here is the model recommendations. I ran 5,475 evaluations across 21 prompt variants and 6 models on real trading data. The results contradicted almost everything the community recommends. Finance-tuned models performed worse than a coin flip. Chain-of-thought reasoning models were anti-patterns. The winners were general-purpose MoE (Mixture-of-Experts) models that nobody talks about for specialized tasks.

Why Self-Host

The pitch is simple:

Privacy. Your conversations never leave your machine. No training on your data. No terms of service changes to worry about.
Cost. After the hardware investment, inference is free. If you are spending $20-200/month on API credits or ChatGPT Plus, a 24GB GPU pays for itself in months.
No rate limits. Generate as many tokens as your GPU can push. No “you’ve hit your limit” messages at 2am when you are in the zone.
Offline capability. Works on a plane, in a cabin, during an ISP outage. The models live on your disk.
Customization. System prompts, custom model presets, RAG pipelines, tool integrations – you control the entire stack.

The tradeoff is that local models are not as capable as frontier models like Claude Opus or GPT-5. But the gap has narrowed dramatically. A well-chosen 26B MoE model running locally in 2026 matches or exceeds what GPT-4 could do in 2024. For most day-to-day tasks – coding assistance, writing drafts, summarization, brainstorming – local models are more than good enough.

What You Need

Hardware

GPU: Any NVIDIA GPU with 24GB VRAM. The RTX 3090, RTX 4090, RTX A5000, and RTX 5090 all qualify. AMD GPUs work with Ollama too, but NVIDIA has better driver support and faster inference via CUDA.
RAM: 32GB system RAM minimum. 64GB recommended. When a model doesn’t fully fit in VRAM, the overflow spills to system RAM – and you want headroom for that.
Storage: At least 100GB free. Models range from 2GB (small 3B models) to 22GB (large 32B models at Q4 quantization). You will want several models downloaded.
CPU: Any modern CPU works. The GPU does the heavy lifting.

Software

Windows 10/11 with WSL2 enabled
Docker Desktop with WSL2 backend
Ollama (native Windows install)
NVIDIA drivers (latest Game Ready or Studio driver)

Step 1: Install NVIDIA Drivers

If you are gaming on this machine, you probably already have recent drivers. Verify:

Open NVIDIA Control Panel (right-click desktop)
Click Help > System Information
Check the driver version – anything from 2025 or later is fine

If you need to update, grab the latest from nvidia.com/drivers. The Studio Driver is slightly more stable for compute workloads, but Game Ready works fine too.

Verify CUDA is working by opening PowerShell:

nvidia-smi

You should see your GPU listed with 24GB (or close to it) of memory. If nvidia-smi is not found, the driver install didn’t add it to PATH – restart your terminal or reboot.

Step 2: Enable WSL2

Docker Desktop requires WSL2 (Windows Subsystem for Linux 2) as its backend. If you have never used WSL, you need to set it up first. If you already have WSL2 running, skip to Step 3.

Check if WSL is Already Installed

Open PowerShell as Administrator and run:

wsl --status

If you see a default distribution and “WSL version: 2”, you are good – skip to Step 3. If you get an error or see “WSL version: 1”, keep reading.

Install WSL2 from Scratch

Open PowerShell as Administrator and run the install command:

wsl --install

This enables the WSL2 feature, installs the Linux kernel, and downloads Ubuntu as the default distribution. It requires a reboot.

After rebooting, Ubuntu will launch automatically and ask you to create a Linux username and password. These are for the Linux environment only – pick anything you will remember.

Once that is done, verify WSL2 is active:

wsl --list --verbose

You should see Ubuntu listed with VERSION 2. If it shows VERSION 1, upgrade it:

wsl --set-version Ubuntu 2

If WSL Install Fails

On older Windows 10 builds or machines with virtualization disabled, wsl --install may fail. The fix:

First, enable virtualization in BIOS. Reboot into BIOS/UEFI (usually Del or F2 during boot). Look for “Intel VT-x”, “AMD-V”, or “SVM Mode” and enable it.

Then, manually enable the required Windows features:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Reboot, then set WSL2 as the default version and install a distro:

wsl --set-default-version 2
wsl --install -d Ubuntu

Update the WSL Kernel

Even if WSL2 is already installed, make sure the kernel is current. An outdated kernel can cause GPU passthrough issues:

wsl --update

Step 3: Install Docker Desktop

Download Docker Desktop from docker.com
Run the installer. When prompted, ensure WSL2 backend is selected (not Hyper-V)
After installation, open Docker Desktop and let it finish initializing
In Docker Desktop settings:
- General: Ensure “Use the WSL 2 based engine” is checked
- Resources > WSL Integration: Enable integration with your Ubuntu distro

Verify Docker is working:

docker run hello-world

If this fails with a permissions error, make sure Docker Desktop is running (check the system tray) and that your user is in the docker-users group (Docker Desktop usually handles this during install, but a sign-out/sign-in may be needed).

Enable GPU Access in Docker

Docker Desktop on Windows supports GPU passthrough via WSL2. In Docker Desktop settings:

Go to Resources > Advanced
Allocate at least 16GB of RAM to Docker (32GB if you have 64GB system RAM)
GPU passthrough should work automatically with recent Docker Desktop versions and an up-to-date WSL kernel

Test GPU access:

docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi

If you see your GPU listed, you are good. If you get an error about --gpus, make sure Docker Desktop is updated to the latest version and WSL2 integration is enabled.

Step 4: Install Ollama

Ollama is the model runtime. It downloads, manages, and serves LLMs locally. Install it natively on Windows (not in Docker) for the best GPU performance:

Download the installer from ollama.com/download
Run the installer – it sets up Ollama as a background service
Ollama starts automatically and listens on http://localhost:11434

Verify:

ollama --version

Pull Your First Model

Let’s start with the model that actually won our benchmarks:

ollama pull gemma4:26b

This downloads Google’s Gemma 4 26B – a Mixture-of-Experts model with only a fraction of its parameters active per token, which means it runs fast while punching well above its weight class. The download is about 17GB. Once it is done:

ollama run gemma4:26b

You should get a chat prompt. Type something. If you see tokens streaming back at 40-70+ tokens per second, your GPU is doing its job. Press Ctrl+D to exit.

Configure Ollama for Docker Access

By default, Ollama only listens on localhost. Since OpenClaw will run in Docker, it needs to reach Ollama on the host. Set the environment variable so Ollama listens on all interfaces:

Open System Environment Variables (search “environment variables” in Start)
Under System variables, click New
Variable name: OLLAMA_HOST
Variable value: 0.0.0.0
Click OK, then restart Ollama (quit from system tray and reopen, or restart the service)

Verify it is listening:

curl http://localhost:11434/api/tags

You should see a JSON response listing your downloaded models.

Step 5: Run OpenClaw in Docker

OpenClaw is the AI gateway – it provides a web chat interface, multi-model support, conversation history, and tool integrations. It connects to Ollama as a model backend.

OpenClaw does not publish an official Docker image, so we build one. Create a file called Dockerfile anywhere on your machine:

FROM node:22-bookworm-slim
RUN apt-get update && apt-get install -y python3 make g++ libopus-dev && rm -rf /var/lib/apt/lists/*
RUN npm install -g openclaw@latest
USER 1000
EXPOSE 18789
CMD ["openclaw", "gateway", "--port", "18789", "--bind", "lan", "--allow-unconfigured"]

Build and run it:

docker build -t openclaw .

docker run -d `
  --name openclaw `
  -p 18789:18789 `
  --add-host=host.docker.internal:host-gateway `
  --restart unless-stopped `
  openclaw

The --add-host=host.docker.internal:host-gateway flag is critical. It lets the container reach services running on your Windows host (like Ollama) via host.docker.internal.

Open your browser to http://localhost:18789. You should see the OpenClaw Control UI.

Step 6: Configure OpenClaw

On first launch, OpenClaw’s Control UI lets you configure model providers. Here is the setup for Ollama:

Navigate to the Models section
Add a new provider:
- Type: OpenAI-compatible
- Base URL: http://host.docker.internal:11434/v1
- API Key: ollama (Ollama doesn’t require a real key, but the field cannot be empty)
OpenClaw will auto-discover your Ollama models
Set your preferred default model (I recommend starting with gemma4:26b)

Optional: Add Cloud Providers as Fallback

OpenClaw can route to multiple providers. For tasks that exceed local model capabilities, you can add API keys for cloud providers:

Anthropic (Claude) – best for complex reasoning and code
OpenRouter – aggregates dozens of providers behind one API key
OpenAI – GPT models

This is optional. The point of this guide is running everything locally. But having a cloud fallback for the occasional hard problem is pragmatic.

Step 7: Pull the Right Models

This is where most guides get it wrong. They recommend models based on benchmark leaderboards and marketing copy. I’m going to recommend models based on what actually works, tested across 5,475 evaluations on real-world tasks.

What I Tested

I run an AI trading agent called OpenClaw that analyzes VIX/volatility data and generates trading signals. Over two weeks, I ran every promising model through the same battery: 21 prompt variants, 61 days of historical trading data, scoring each model on win rate, average profit per trade, and consistency.

The results were humbling. Everything the community consensus told me to do was wrong.

The Counter-Intuitive Results

Finance-tuned models are worse than a coin flip. Palmyra-Fin-70B – a model specifically trained on financial data – scored a 35.6% win rate. Below 50%. You’d be better off flipping a coin. The fine-tuning made it overfit to patterns in its training data that don’t generalize.

Chain-of-thought reasoning models are anti-patterns. DeepSeek R1 32B, the darling of the reasoning community, scored 35.5% on the same task. The chain-of-thought “thinking” process actually hurt performance by overcomplicating signals that should be straightforward.

General-purpose MoE models dominate. Gemma 4 26B – a model nobody recommends for financial analysis – scored 66% win rate with +$61 average profit per trade at 76 tokens/second. Qwen 3.5 122B hit 67% win rate at 29 tok/s for when you want higher accuracy and can wait a beat.

Data formatting beats model selection. The single biggest improvement came from labeling news headlines with VIX impact scores before feeding them to the model. This one formatting change outperformed switching models, tweaking temperatures, or adding elaborate system prompts. Temperature was completely irrelevant – 0.0, 0.3, and 0.5 all performed within noise.

Complexity has negative returns. Kitchen-sink prompts (throw every indicator at the model): 52% win rate. Regime-rule prompts (elaborate if/then logic): 45%. The best prompt was the simplest one that formatted the data well.

The Must-Have Starter Pack

Based on actual testing, these are the models I recommend:

# Production winner -- 66% win rate, 76 tok/s, fast and accurate
ollama pull gemma4:26b

# High-accuracy alternative -- 67% win rate, 29 tok/s
ollama pull qwen3.5:122b-a10b

# Long context champion -- handles 48K+ tokens without slowdown
ollama pull qwen3:30b-a3b

# Efficiency pick -- fast responses when you need speed over depth
ollama pull phi4

Why MoE Models Win

The key insight is Mixture-of-Experts architecture. MoE models have many total parameters but only activate a fraction per token. Gemma 4 26B is a 26 billion parameter model, but only a subset of those “experts” fire for each token. This means:

Speed. Fewer active parameters = fewer calculations per token = faster inference.
Diversity. The model has more total knowledge (spread across experts) than a dense model of the same speed class.
VRAM efficiency. The active parameter count determines speed, but the total parameter count determines knowledge. You get a better knowledge-to-speed ratio than dense models.

On a 24GB GPU, MoE models like Gemma 4 26B and Qwen 3 30B-A3B run 2-4x faster than dense models of comparable quality. This isn’t theoretical – it is measured.

Best Models by Category

General Intelligence / Default: Gemma 4 26B (MoE)

ollama pull gemma4:26b

VRAM: ~17GB at Q4
Speed: ~50-76 tokens/sec (hardware dependent)
Why: Won our A/B tests across multiple prompt variants. Fast, accurate, and leaves VRAM headroom. This should be your default model for everything unless you have a specific reason to use something else.

High Accuracy: Qwen 3.5 122B A10B (MoE)

ollama pull qwen3.5:122b-a10b

VRAM: Needs 48GB+ (won’t fit 24GB alone – use with CPU offload or on larger hardware)
Speed: ~25-29 tokens/sec on 256GB unified memory
Why: Highest accuracy in our tests (67% win rate). If you have a Mac with 64GB+ unified memory or are willing to accept CPU offloading on Windows, this is the quality pick.

Note: This model won’t fully fit in 24GB VRAM. On Windows, Ollama will automatically spill overflow to system RAM, which works but reduces speed significantly. If you have 64GB system RAM, it is usable. If you only have 32GB, skip this and stick with Gemma 4.

Long Context: Qwen 3 30B A3B (MoE)

ollama pull qwen3:30b-a3b

VRAM: ~20GB at 16K context, ~24GB at 48K context
Speed: ~54 tokens/sec at 16K, ~33 tokens/sec at 48K
Why: The only model in this class that maintains full GPU utilization at 48K tokens. Every other 30B+ model spills to CPU RAM at long contexts, dropping speed below 10 tokens/sec. If you are feeding in long documents, codebases, or conversation histories, this is the pick.

Coding: Qwen 2.5 Coder 32B

ollama pull qwen2.5-coder:32b

VRAM: ~22GB
Speed: ~25-35 tokens/sec
Why: Still the best dedicated coding model at this size. GPT-4o-level code generation, debugging, and refactoring. Note: our A/B tests focused on analysis tasks, not pure code generation – for coding specifically, a specialized model still makes sense.

Efficiency: Phi-4 14B

ollama pull phi4

VRAM: ~11GB
Speed: ~80+ tokens/sec
Why: Microsoft’s efficiency king. Uses half your VRAM, leaving room for long context windows. Despite its size, reasoning variants punch well above their weight. The model you run when you want instant responses or when you are multitasking.

Vision (Multimodal): Llama 3.2 Vision 11B

ollama pull llama3.2-vision:11b

VRAM: ~7.3GB
Speed: ~84 tokens/sec
Why: Lightweight, fast, and handles image analysis well. Good for screenshots, diagrams, charts. Leaves plenty of VRAM for concurrent use with other models.

Models to Avoid (And Why)

This section will save you hours of downloading models that underperform. These aren’t bad models – they are bad choices for a local AI assistant setup.

Model	Why It Underperforms	Test Result
Palmyra-Fin-70B	Finance fine-tuning overfits to training patterns	35.6% win rate (below coin flip)
DeepSeek R1 32B	Chain-of-thought overcomplicates simple decisions	35.5% win rate
Any “finance-tuned” model	Domain fine-tuning hurts generalization	Consistently below general models
Llama 4 Scout	109B total params, doesn’t fit 24GB despite “17B active” marketing	N/A – won’t load

The lesson: don’t chase specialization. A fast general-purpose MoE model with well-formatted input data beats a domain-specific model every time. The model is the least important variable. Your data formatting is the most important.

The Gotcha: Models That Don’t Fit

This is worth calling out because the marketing is misleading.

Llama 4 Scout has “17B active parameters” per token – sounds like it should fit in 24GB, right? Wrong. Scout uses a Mixture-of-Experts architecture with 109B total parameters across 16 experts. All 109B must be loaded into memory even though only 17B are active per token. At Q4 quantization, that is 55-65GB. It does not fit.

Model	Total Params	Q4 Size	Fits 24GB?
Llama 4 Scout	109B (17B active)	~60 GB	No
Llama 4 Maverick	400B (17B active)	~200 GB	No
Llama 3.3 70B	70B	~40 GB	No
Qwen 2.5 72B	72B	~42 GB	No
Qwen 3.5 122B A10B	122B (10B active)	~76 GB	No (CPU offload possible)
DeepSeek R1 (full)	671B	~350 GB	No

The 24GB sweet spot is 24-32B parameter dense models and small MoE models (like Gemma 4 26B and Qwen 3 30B A3B, which keep active parameter counts low).

VRAM Quick Reference

For planning which models to keep downloaded:

Model	Type	Params	VRAM (8K ctx)	Best For
Llama 3.2 3B	Dense	3B	3.6 GB	Quick tasks, testing
Llama 3.2 Vision 11B	Vision	11B	7.3 GB	Image analysis, fast
Phi-4 14B	Dense	14B	11.0 GB	Efficiency, leaves VRAM headroom
Qwen 3 14B	Dense	14B	10.7 GB	Mid-range all-rounder
Gemma 4 26B	MoE	26B	~17 GB	Production winner, default pick
Mistral Small 3.2	Dense	24B	~19 GB	Multilingual, instruction following
Qwen 3 30B A3B	MoE	30B	20.3 GB	Long context champion
Qwen 2.5 Coder 32B	Dense	32B	~22 GB	Code generation
Gemma 3 27B	Dense	27B	22.5 GB	Google QAT, 128K context
Gemma 4 31B	Dense	31B	18.5 GB	Heavier general-purpose
DeepSeek R1 32B	Dense	32B	~22 GB	Reasoning (but see caveats above)

Context length matters. These VRAM numbers are at 8K context. At 16K, add 1-3GB. At 48K, add 5-10GB. Only Qwen 3 30B A3B handles long contexts gracefully – everything else spills to system RAM and slows down dramatically.

Understanding Quantization

When Ollama downloads a model, it uses Q4_K_M quantization by default. This means each parameter is stored in ~4 bits instead of the original 16 or 32 bits. The practical effect:

Quantization	Quality Retention	VRAM Savings
Q8_0 (8-bit)	~99%	~50%
Q6_K (6-bit)	~98%	~62%
Q5_K_M (5-bit)	~97%	~69%
Q4_K_M (4-bit)	~95-99%	~75%
Q3_K_M (3-bit)	~90-93%	~81%
Q2_K (2-bit)	~85%	~88%

Q4_K_M is the consensus recommendation. The quality loss is imperceptible for most tasks. You can request specific quantizations from Ollama if needed:

ollama pull gemma3:27b-q3_K_M   # Tighter fit, slightly lower quality
ollama pull phi4:q8_0            # Higher quality, uses more VRAM

The Model Landscape in 2026

A quick overview of who is making what, since the ecosystem moves fast:

Google Gemma 4 is the sleeper hit of 2026. While everyone was watching Llama 4 and DeepSeek, Google quietly shipped MoE models that dominate on consumer hardware. Gemma 4 26B runs at 76 tok/s on Apple Silicon and won our production A/B tests. The QAT (Quantization-Aware Trained) variants of Gemma 3 are specifically optimized for consumer GPUs. Google is playing the efficiency game better than anyone right now.

Qwen (Alibaba) overtook Llama as the most downloaded open-source model family in late 2025. Qwen 3 and 3.5 offer models from 0.6B to 397B. Apache 2.0 license. The 30B-A3B MoE variant is the best long-context model at this size. Qwen 3.5 122B is the quality ceiling for local inference. Supports 119 languages.

Meta Llama 4 launched Scout and Maverick with MoE architectures and native multimodality. Impressive tech, but neither fits consumer GPUs. The older Llama 3.2 Vision 11B remains useful for multimodal tasks. The marketing around “17B active parameters” has confused a lot of people into thinking Scout fits their GPU – it doesn’t.

DeepSeek popularized chain-of-thought reasoning in open-source models. The full R1 at 671B is genuinely impressive. But the distilled 32B variant lost our A/B tests badly. The chain-of-thought reasoning that works well for math and logic puzzles actively hurts performance on real-world decision-making tasks where speed and directness matter more. Use it for math homework, not for production.

Microsoft Phi-4 is the efficiency champion. 14B parameters that punch above their weight. The best choice when you want to leave VRAM headroom or need sub-second responses.

Mistral continues shipping competitive models from Europe. Mistral Small 3.2 at 24B matches much larger models while keeping the VRAM footprint manageable. Vision-capable in newer versions.

The Hierarchy That Actually Matters

After running thousands of evaluations, here is what moves the needle, ranked by impact:

Data formatting – how you structure and label the input data before it reaches the model. This is 60% of the outcome.
Prompt design – what you ask the model to do and how. Keep it simple. Complexity has negative returns.
Model architecture – MoE vs dense, size class. Pick MoE for speed, dense for VRAM-constrained depth.
Specific model choice – within the same architecture class, differences are noise.
Temperature – completely irrelevant. 0.0, 0.3, 0.5 all within margin of error.
Domain fine-tuning – actively harmful. Don’t seek it out.

Most people obsess over #4 and #5 while ignoring #1 and #2. If your results aren’t good, the model isn’t the problem. Your data pipeline is.

Tips for Daily Use

Switch Models for Different Tasks

Don’t pick one model and use it for everything. OpenClaw makes it easy to switch:

General use and analysis: Gemma 4 26B (fast, accurate, production-tested)
Long document processing: Qwen 3 30B A3B (handles large context without slowdown)
Coding sessions: Qwen 2.5 Coder 32B (specialized, still the coding king)
Quick questions and brainstorming: Phi-4 14B (fast, leaves VRAM headroom)
Image analysis: Llama 3.2 Vision 11B (lightweight multimodal)

Manage Your Model Library

Models take disk space. Keep your active set small:

# List downloaded models and their sizes
ollama list

# Remove a model you are not using
ollama rm mistral-nemo

# Update a model to the latest version
ollama pull gemma4:26b

Monitor GPU Usage

Keep nvidia-smi handy to watch VRAM usage:

# One-shot check
nvidia-smi

# Continuous monitoring (updates every 2 seconds)
nvidia-smi -l 2

If VRAM usage is at 100% and inference is slow, the model is spilling to system RAM. Either use a smaller model or reduce the context length.

Create Custom Model Presets

Ollama lets you create custom model configurations with system prompts baked in:

# Create a Modelfile
@"
FROM gemma4:26b
SYSTEM You are a helpful assistant. Be concise and direct. Format data clearly.
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
"@ | Set-Content Modelfile

ollama create my-assistant -f Modelfile
ollama run my-assistant

Note: temperature doesn’t meaningfully affect output quality in our testing. Set it to whatever you prefer aesthetically – lower values give slightly more consistent formatting.

The Lazy Way: Let Claude Code Do It

If you have Claude Code installed, you can skip most of the manual steps above. Open a terminal, launch claude, and paste one of these prompts. Claude Code will run the commands, troubleshoot errors, and verify each step.

Full Setup (Everything At Once)

I have a Windows machine with an NVIDIA GPU (24GB VRAM) and Docker Desktop
installed with the WSL2 backend. Set up a local AI stack for me:

1. Verify my GPU is accessible (run nvidia-smi)
2. Install Ollama if not already installed (winget or download)
3. Set the OLLAMA_HOST environment variable to 0.0.0.0 so Docker
   containers can reach it, then restart the Ollama service
4. Pull these models: gemma4:26b, qwen3:30b-a3b,
   qwen2.5-coder:32b, phi4
5. Create a Dockerfile for OpenClaw (FROM node:22-bookworm-slim,
   npm install -g openclaw@latest, expose 18789, CMD with --bind lan
   and --allow-unconfigured)
6. Build the OpenClaw image and run it with --add-host for
   host.docker.internal and port 18789 mapped
7. Verify OpenClaw is reachable at http://localhost:18789
8. Verify Ollama models are visible from inside the container by
   curling http://host.docker.internal:11434/api/tags

After each step, verify it worked before moving on. If something fails,
diagnose and fix it.

Just Ollama + Models

Install Ollama on this Windows machine and pull a curated set of models
for my 24GB VRAM GPU. I want:

- gemma4:26b (general purpose, production-tested winner)
- qwen3:30b-a3b (long context, MoE speed)
- qwen2.5-coder:32b (coding)
- phi4 (fast/efficient)

After pulling, set OLLAMA_HOST=0.0.0.0 as a system environment variable
so other services (like Docker containers) can connect. Verify each model
loads by running a quick test prompt through it.

Just OpenClaw (Ollama Already Running)

Ollama is already running on this machine at localhost:11434. Set up
OpenClaw in Docker to connect to it:

1. Create a Dockerfile for OpenClaw (node:22-bookworm-slim base,
   install openclaw@latest globally via npm, expose port 18789,
   CMD: openclaw gateway --port 18789 --bind lan --allow-unconfigured)
2. Build the image tagged as "openclaw"
3. Run it with port 18789 mapped and --add-host=host.docker.internal:host-gateway
4. Verify the web UI is reachable at http://localhost:18789
5. Verify it can reach Ollama by curling the Ollama API from inside
   the container

Troubleshooting Prompt

My self-hosted AI setup is not working. Here is what I have:
- Windows with Docker Desktop (WSL2)
- Ollama installed natively
- OpenClaw running in a Docker container

Diagnose the issue:
1. Check if nvidia-smi shows the GPU
2. Check if Ollama is running and listening (curl localhost:11434)
3. Check if OLLAMA_HOST is set to 0.0.0.0
4. Check if the OpenClaw container is running (docker ps)
5. Check OpenClaw logs (docker logs openclaw)
6. From inside the container, check if it can reach Ollama
   (docker exec openclaw curl http://host.docker.internal:11434/api/tags)
7. Report what is broken and fix it

These prompts are self-contained – Claude Code has enough context in each one to execute the full workflow without follow-up questions. If something goes sideways, it will diagnose the error and try to fix it before asking you for help.

Updating OpenClaw

When a new version of OpenClaw is released, rebuild the Docker image:

docker stop openclaw
docker rm openclaw

# Rebuild with latest
docker build --no-cache -t openclaw .

# Run again
docker run -d `
  --name openclaw `
  -p 18789:18789 `
  --add-host=host.docker.internal:host-gateway `
  --restart unless-stopped `
  openclaw

What’s Next

Once you have this running, there are a few upgrades worth exploring:

RAG (Retrieval Augmented Generation): Feed your own documents, code repos, or notes into the model’s context. OpenClaw supports this through its memory and RAG features.
MCP Tools: Model Context Protocol lets your AI assistant call external tools – web search, file access, API calls. OpenClaw has built-in support.
Multiple channels: OpenClaw supports Telegram, Discord, and Slack in addition to the web UI. Set up a Telegram bot and chat with your AI from your phone.
Scheduled tasks: Use OpenClaw’s agent features to run automated workflows – code review, log analysis, daily summaries.
Data pipeline optimization: If you are using your local AI for any kind of analysis, invest time in formatting your input data, not in switching models. Label your inputs, structure them consistently, and keep your prompts simple. This is where the real gains are.

The self-hosted AI space is moving fast. Models that required a cluster a year ago now run on a single consumer GPU. With 24GB of VRAM, you are sitting at the sweet spot – capable enough for serious work, affordable enough for a personal setup. Just don’t fall for the hype. Pick MoE models, format your data well, and skip the domain-specific models. The generalists win.

TL;DR#

Why Self-Host#

What You Need#

Hardware#

Software#

Step 1: Install NVIDIA Drivers#

Step 2: Enable WSL2#

Check if WSL is Already Installed#

Install WSL2 from Scratch#

If WSL Install Fails#

Update the WSL Kernel#

Step 3: Install Docker Desktop#

Enable GPU Access in Docker#

Step 4: Install Ollama#

Pull Your First Model#

Configure Ollama for Docker Access#

Step 5: Run OpenClaw in Docker#

Step 6: Configure OpenClaw#

Optional: Add Cloud Providers as Fallback#

Step 7: Pull the Right Models#

What I Tested#

The Counter-Intuitive Results#

The Must-Have Starter Pack#

Why MoE Models Win#

Best Models by Category#

General Intelligence / Default: Gemma 4 26B (MoE)#

High Accuracy: Qwen 3.5 122B A10B (MoE)#

Long Context: Qwen 3 30B A3B (MoE)#

Coding: Qwen 2.5 Coder 32B#

Efficiency: Phi-4 14B#

Vision (Multimodal): Llama 3.2 Vision 11B#

Models to Avoid (And Why)#

The Gotcha: Models That Don’t Fit#

VRAM Quick Reference#

Understanding Quantization#

The Model Landscape in 2026#

The Hierarchy That Actually Matters#

Tips for Daily Use#

Switch Models for Different Tasks#

Manage Your Model Library#

Monitor GPU Usage#

Create Custom Model Presets#

The Lazy Way: Let Claude Code Do It#

Full Setup (Everything At Once)#

Just Ollama + Models#

Just OpenClaw (Ollama Already Running)#

Troubleshooting Prompt#

Updating OpenClaw#

What’s Next#