TL;DR
You have a 24GB VRAM GPU. You want a private, self-hosted AI assistant that rivals ChatGPT – no subscriptions, no data leaving your machine. This guide walks you through setting up Ollama (local model runtime) and OpenClaw (AI gateway with a web UI) on Windows using Docker Desktop.
But the real value here is the model recommendations. I ran 5,475 evaluations across 21 prompt variants and 6 models on real trading data. The results contradicted almost everything the community recommends. Finance-tuned models performed worse than a coin flip. Chain-of-thought reasoning models were anti-patterns. The winners were general-purpose MoE (Mixture-of-Experts) models that nobody talks about for specialized tasks.
Why Self-Host
The pitch is simple:
- Privacy. Your conversations never leave your machine. No training on your data. No terms of service changes to worry about.
- Cost. After the hardware investment, inference is free. If you are spending $20-200/month on API credits or ChatGPT Plus, a 24GB GPU pays for itself in months.
- No rate limits. Generate as many tokens as your GPU can push. No “you’ve hit your limit” messages at 2am when you are in the zone.
- Offline capability. Works on a plane, in a cabin, during an ISP outage. The models live on your disk.
- Customization. System prompts, custom model presets, RAG pipelines, tool integrations – you control the entire stack.
The tradeoff is that local models are not as capable as frontier models like Claude Opus or GPT-5. But the gap has narrowed dramatically. A well-chosen 26B MoE model running locally in 2026 matches or exceeds what GPT-4 could do in 2024. For most day-to-day tasks – coding assistance, writing drafts, summarization, brainstorming – local models are more than good enough.
What You Need
Hardware
- GPU: Any NVIDIA GPU with 24GB VRAM. The RTX 3090, RTX 4090, RTX A5000, and RTX 5090 all qualify. AMD GPUs work with Ollama too, but NVIDIA has better driver support and faster inference via CUDA.
- RAM: 32GB system RAM minimum. 64GB recommended. When a model doesn’t fully fit in VRAM, the overflow spills to system RAM – and you want headroom for that.
- Storage: At least 100GB free. Models range from 2GB (small 3B models) to 22GB (large 32B models at Q4 quantization). You will want several models downloaded.
- CPU: Any modern CPU works. The GPU does the heavy lifting.
Software
- Windows 10/11 with WSL2 enabled
- Docker Desktop with WSL2 backend
- Ollama (native Windows install)
- NVIDIA drivers (latest Game Ready or Studio driver)
Step 1: Install NVIDIA Drivers
If you are gaming on this machine, you probably already have recent drivers. Verify:
- Open NVIDIA Control Panel (right-click desktop)
- Click Help > System Information
- Check the driver version – anything from 2025 or later is fine
If you need to update, grab the latest from nvidia.com/drivers. The Studio Driver is slightly more stable for compute workloads, but Game Ready works fine too.
Verify CUDA is working by opening PowerShell:
nvidia-smi
You should see your GPU listed with 24GB (or close to it) of memory. If nvidia-smi is not found, the driver install didn’t add it to PATH – restart your terminal or reboot.
Step 2: Enable WSL2
Docker Desktop requires WSL2 (Windows Subsystem for Linux 2) as its backend. If you have never used WSL, you need to set it up first. If you already have WSL2 running, skip to Step 3.
Check if WSL is Already Installed
Open PowerShell as Administrator and run:
wsl --status
If you see a default distribution and “WSL version: 2”, you are good – skip to Step 3. If you get an error or see “WSL version: 1”, keep reading.
Install WSL2 from Scratch
Open PowerShell as Administrator and run the install command:
wsl --install
This enables the WSL2 feature, installs the Linux kernel, and downloads Ubuntu as the default distribution. It requires a reboot.
After rebooting, Ubuntu will launch automatically and ask you to create a Linux username and password. These are for the Linux environment only – pick anything you will remember.
Once that is done, verify WSL2 is active:
wsl --list --verbose
You should see Ubuntu listed with VERSION 2. If it shows VERSION 1, upgrade it:
wsl --set-version Ubuntu 2
If WSL Install Fails
On older Windows 10 builds or machines with virtualization disabled, wsl --install may fail. The fix:
First, enable virtualization in BIOS. Reboot into BIOS/UEFI (usually Del or F2 during boot). Look for “Intel VT-x”, “AMD-V”, or “SVM Mode” and enable it.
Then, manually enable the required Windows features:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
Reboot, then set WSL2 as the default version and install a distro:
wsl --set-default-version 2
wsl --install -d Ubuntu
Update the WSL Kernel
Even if WSL2 is already installed, make sure the kernel is current. An outdated kernel can cause GPU passthrough issues:
wsl --update
Step 3: Install Docker Desktop
- Download Docker Desktop from docker.com
- Run the installer. When prompted, ensure WSL2 backend is selected (not Hyper-V)
- After installation, open Docker Desktop and let it finish initializing
- In Docker Desktop settings:
- General: Ensure “Use the WSL 2 based engine” is checked
- Resources > WSL Integration: Enable integration with your Ubuntu distro
Verify Docker is working:
docker run hello-world
If this fails with a permissions error, make sure Docker Desktop is running (check the system tray) and that your user is in the docker-users group (Docker Desktop usually handles this during install, but a sign-out/sign-in may be needed).
Enable GPU Access in Docker
Docker Desktop on Windows supports GPU passthrough via WSL2. In Docker Desktop settings:
- Go to Resources > Advanced
- Allocate at least 16GB of RAM to Docker (32GB if you have 64GB system RAM)
- GPU passthrough should work automatically with recent Docker Desktop versions and an up-to-date WSL kernel
Test GPU access:
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
If you see your GPU listed, you are good. If you get an error about --gpus, make sure Docker Desktop is updated to the latest version and WSL2 integration is enabled.
Step 4: Install Ollama
Ollama is the model runtime. It downloads, manages, and serves LLMs locally. Install it natively on Windows (not in Docker) for the best GPU performance:
- Download the installer from ollama.com/download
- Run the installer – it sets up Ollama as a background service
- Ollama starts automatically and listens on
http://localhost:11434
Verify:
ollama --version
Pull Your First Model
Let’s start with the model that actually won our benchmarks:
ollama pull gemma4:26b
This downloads Google’s Gemma 4 26B – a Mixture-of-Experts model with only a fraction of its parameters active per token, which means it runs fast while punching well above its weight class. The download is about 17GB. Once it is done:
ollama run gemma4:26b
You should get a chat prompt. Type something. If you see tokens streaming back at 40-70+ tokens per second, your GPU is doing its job. Press Ctrl+D to exit.
Configure Ollama for Docker Access
By default, Ollama only listens on localhost. Since OpenClaw will run in Docker, it needs to reach Ollama on the host. Set the environment variable so Ollama listens on all interfaces:
- Open System Environment Variables (search “environment variables” in Start)
- Under System variables, click New
- Variable name:
OLLAMA_HOST - Variable value:
0.0.0.0 - Click OK, then restart Ollama (quit from system tray and reopen, or restart the service)
Verify it is listening:
curl http://localhost:11434/api/tags
You should see a JSON response listing your downloaded models.
Step 5: Run OpenClaw in Docker
OpenClaw is the AI gateway – it provides a web chat interface, multi-model support, conversation history, and tool integrations. It connects to Ollama as a model backend.
OpenClaw does not publish an official Docker image, so we build one. Create a file called Dockerfile anywhere on your machine:
FROM node:22-bookworm-slim
RUN apt-get update && apt-get install -y python3 make g++ libopus-dev && rm -rf /var/lib/apt/lists/*
RUN npm install -g openclaw@latest
USER 1000
EXPOSE 18789
CMD ["openclaw", "gateway", "--port", "18789", "--bind", "lan", "--allow-unconfigured"]
Build and run it:
docker build -t openclaw .
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
The --add-host=host.docker.internal:host-gateway flag is critical. It lets the container reach services running on your Windows host (like Ollama) via host.docker.internal.
Open your browser to http://localhost:18789. You should see the OpenClaw Control UI.
Step 6: Configure OpenClaw
On first launch, OpenClaw’s Control UI lets you configure model providers. Here is the setup for Ollama:
- Navigate to the Models section
- Add a new provider:
- Type: OpenAI-compatible
- Base URL:
http://host.docker.internal:11434/v1 - API Key:
ollama(Ollama doesn’t require a real key, but the field cannot be empty)
- OpenClaw will auto-discover your Ollama models
- Set your preferred default model (I recommend starting with
gemma4:26b)
Optional: Add Cloud Providers as Fallback
OpenClaw can route to multiple providers. For tasks that exceed local model capabilities, you can add API keys for cloud providers:
- Anthropic (Claude) – best for complex reasoning and code
- OpenRouter – aggregates dozens of providers behind one API key
- OpenAI – GPT models
This is optional. The point of this guide is running everything locally. But having a cloud fallback for the occasional hard problem is pragmatic.
Step 7: Pull the Right Models
This is where most guides get it wrong. They recommend models based on benchmark leaderboards and marketing copy. I’m going to recommend models based on what actually works, tested across 5,475 evaluations on real-world tasks.
What I Tested
I run an AI trading agent called OpenClaw that analyzes VIX/volatility data and generates trading signals. Over two weeks, I ran every promising model through the same battery: 21 prompt variants, 61 days of historical trading data, scoring each model on win rate, average profit per trade, and consistency.
The results were humbling. Everything the community consensus told me to do was wrong.
The Counter-Intuitive Results
Finance-tuned models are worse than a coin flip. Palmyra-Fin-70B – a model specifically trained on financial data – scored a 35.6% win rate. Below 50%. You’d be better off flipping a coin. The fine-tuning made it overfit to patterns in its training data that don’t generalize.
Chain-of-thought reasoning models are anti-patterns. DeepSeek R1 32B, the darling of the reasoning community, scored 35.5% on the same task. The chain-of-thought “thinking” process actually hurt performance by overcomplicating signals that should be straightforward.
General-purpose MoE models dominate. Gemma 4 26B – a model nobody recommends for financial analysis – scored 66% win rate with +$61 average profit per trade at 76 tokens/second. Qwen 3.5 122B hit 67% win rate at 29 tok/s for when you want higher accuracy and can wait a beat.
Data formatting beats model selection. The single biggest improvement came from labeling news headlines with VIX impact scores before feeding them to the model. This one formatting change outperformed switching models, tweaking temperatures, or adding elaborate system prompts. Temperature was completely irrelevant – 0.0, 0.3, and 0.5 all performed within noise.
Complexity has negative returns. Kitchen-sink prompts (throw every indicator at the model): 52% win rate. Regime-rule prompts (elaborate if/then logic): 45%. The best prompt was the simplest one that formatted the data well.
The Must-Have Starter Pack
Based on actual testing, these are the models I recommend:
# Production winner -- 66% win rate, 76 tok/s, fast and accurate
ollama pull gemma4:26b
# High-accuracy alternative -- 67% win rate, 29 tok/s
ollama pull qwen3.5:122b-a10b
# Long context champion -- handles 48K+ tokens without slowdown
ollama pull qwen3:30b-a3b
# Efficiency pick -- fast responses when you need speed over depth
ollama pull phi4
Why MoE Models Win
The key insight is Mixture-of-Experts architecture. MoE models have many total parameters but only activate a fraction per token. Gemma 4 26B is a 26 billion parameter model, but only a subset of those “experts” fire for each token. This means:
- Speed. Fewer active parameters = fewer calculations per token = faster inference.
- Diversity. The model has more total knowledge (spread across experts) than a dense model of the same speed class.
- VRAM efficiency. The active parameter count determines speed, but the total parameter count determines knowledge. You get a better knowledge-to-speed ratio than dense models.
On a 24GB GPU, MoE models like Gemma 4 26B and Qwen 3 30B-A3B run 2-4x faster than dense models of comparable quality. This isn’t theoretical – it is measured.
Best Models by Category
General Intelligence / Default: Gemma 4 26B (MoE)
ollama pull gemma4:26b
- VRAM: ~17GB at Q4
- Speed: ~50-76 tokens/sec (hardware dependent)
- Why: Won our A/B tests across multiple prompt variants. Fast, accurate, and leaves VRAM headroom. This should be your default model for everything unless you have a specific reason to use something else.
High Accuracy: Qwen 3.5 122B A10B (MoE)
ollama pull qwen3.5:122b-a10b
- VRAM: Needs 48GB+ (won’t fit 24GB alone – use with CPU offload or on larger hardware)
- Speed: ~25-29 tokens/sec on 256GB unified memory
- Why: Highest accuracy in our tests (67% win rate). If you have a Mac with 64GB+ unified memory or are willing to accept CPU offloading on Windows, this is the quality pick.
Note: This model won’t fully fit in 24GB VRAM. On Windows, Ollama will automatically spill overflow to system RAM, which works but reduces speed significantly. If you have 64GB system RAM, it is usable. If you only have 32GB, skip this and stick with Gemma 4.
Long Context: Qwen 3 30B A3B (MoE)
ollama pull qwen3:30b-a3b
- VRAM: ~20GB at 16K context, ~24GB at 48K context
- Speed: ~54 tokens/sec at 16K, ~33 tokens/sec at 48K
- Why: The only model in this class that maintains full GPU utilization at 48K tokens. Every other 30B+ model spills to CPU RAM at long contexts, dropping speed below 10 tokens/sec. If you are feeding in long documents, codebases, or conversation histories, this is the pick.
Coding: Qwen 2.5 Coder 32B
ollama pull qwen2.5-coder:32b
- VRAM: ~22GB
- Speed: ~25-35 tokens/sec
- Why: Still the best dedicated coding model at this size. GPT-4o-level code generation, debugging, and refactoring. Note: our A/B tests focused on analysis tasks, not pure code generation – for coding specifically, a specialized model still makes sense.
Efficiency: Phi-4 14B
ollama pull phi4
- VRAM: ~11GB
- Speed: ~80+ tokens/sec
- Why: Microsoft’s efficiency king. Uses half your VRAM, leaving room for long context windows. Despite its size, reasoning variants punch well above their weight. The model you run when you want instant responses or when you are multitasking.
Vision (Multimodal): Llama 3.2 Vision 11B
ollama pull llama3.2-vision:11b
- VRAM: ~7.3GB
- Speed: ~84 tokens/sec
- Why: Lightweight, fast, and handles image analysis well. Good for screenshots, diagrams, charts. Leaves plenty of VRAM for concurrent use with other models.
Models to Avoid (And Why)
This section will save you hours of downloading models that underperform. These aren’t bad models – they are bad choices for a local AI assistant setup.
| Model | Why It Underperforms | Test Result |
|---|---|---|
| Palmyra-Fin-70B | Finance fine-tuning overfits to training patterns | 35.6% win rate (below coin flip) |
| DeepSeek R1 32B | Chain-of-thought overcomplicates simple decisions | 35.5% win rate |
| Any “finance-tuned” model | Domain fine-tuning hurts generalization | Consistently below general models |
| Llama 4 Scout | 109B total params, doesn’t fit 24GB despite “17B active” marketing | N/A – won’t load |
The lesson: don’t chase specialization. A fast general-purpose MoE model with well-formatted input data beats a domain-specific model every time. The model is the least important variable. Your data formatting is the most important.
The Gotcha: Models That Don’t Fit
This is worth calling out because the marketing is misleading.
Llama 4 Scout has “17B active parameters” per token – sounds like it should fit in 24GB, right? Wrong. Scout uses a Mixture-of-Experts architecture with 109B total parameters across 16 experts. All 109B must be loaded into memory even though only 17B are active per token. At Q4 quantization, that is 55-65GB. It does not fit.
| Model | Total Params | Q4 Size | Fits 24GB? |
|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~60 GB | No |
| Llama 4 Maverick | 400B (17B active) | ~200 GB | No |
| Llama 3.3 70B | 70B | ~40 GB | No |
| Qwen 2.5 72B | 72B | ~42 GB | No |
| Qwen 3.5 122B A10B | 122B (10B active) | ~76 GB | No (CPU offload possible) |
| DeepSeek R1 (full) | 671B | ~350 GB | No |
The 24GB sweet spot is 24-32B parameter dense models and small MoE models (like Gemma 4 26B and Qwen 3 30B A3B, which keep active parameter counts low).
VRAM Quick Reference
For planning which models to keep downloaded:
| Model | Type | Params | VRAM (8K ctx) | Best For |
|---|---|---|---|---|
| Llama 3.2 3B | Dense | 3B | 3.6 GB | Quick tasks, testing |
| Llama 3.2 Vision 11B | Vision | 11B | 7.3 GB | Image analysis, fast |
| Phi-4 14B | Dense | 14B | 11.0 GB | Efficiency, leaves VRAM headroom |
| Qwen 3 14B | Dense | 14B | 10.7 GB | Mid-range all-rounder |
| Gemma 4 26B | MoE | 26B | ~17 GB | Production winner, default pick |
| Mistral Small 3.2 | Dense | 24B | ~19 GB | Multilingual, instruction following |
| Qwen 3 30B A3B | MoE | 30B | 20.3 GB | Long context champion |
| Qwen 2.5 Coder 32B | Dense | 32B | ~22 GB | Code generation |
| Gemma 3 27B | Dense | 27B | 22.5 GB | Google QAT, 128K context |
| Gemma 4 31B | Dense | 31B | 18.5 GB | Heavier general-purpose |
| DeepSeek R1 32B | Dense | 32B | ~22 GB | Reasoning (but see caveats above) |
Context length matters. These VRAM numbers are at 8K context. At 16K, add 1-3GB. At 48K, add 5-10GB. Only Qwen 3 30B A3B handles long contexts gracefully – everything else spills to system RAM and slows down dramatically.
Understanding Quantization
When Ollama downloads a model, it uses Q4_K_M quantization by default. This means each parameter is stored in ~4 bits instead of the original 16 or 32 bits. The practical effect:
| Quantization | Quality Retention | VRAM Savings |
|---|---|---|
| Q8_0 (8-bit) | ~99% | ~50% |
| Q6_K (6-bit) | ~98% | ~62% |
| Q5_K_M (5-bit) | ~97% | ~69% |
| Q4_K_M (4-bit) | ~95-99% | ~75% |
| Q3_K_M (3-bit) | ~90-93% | ~81% |
| Q2_K (2-bit) | ~85% | ~88% |
Q4_K_M is the consensus recommendation. The quality loss is imperceptible for most tasks. You can request specific quantizations from Ollama if needed:
ollama pull gemma3:27b-q3_K_M # Tighter fit, slightly lower quality
ollama pull phi4:q8_0 # Higher quality, uses more VRAM
The Model Landscape in 2026
A quick overview of who is making what, since the ecosystem moves fast:
Google Gemma 4 is the sleeper hit of 2026. While everyone was watching Llama 4 and DeepSeek, Google quietly shipped MoE models that dominate on consumer hardware. Gemma 4 26B runs at 76 tok/s on Apple Silicon and won our production A/B tests. The QAT (Quantization-Aware Trained) variants of Gemma 3 are specifically optimized for consumer GPUs. Google is playing the efficiency game better than anyone right now.
Qwen (Alibaba) overtook Llama as the most downloaded open-source model family in late 2025. Qwen 3 and 3.5 offer models from 0.6B to 397B. Apache 2.0 license. The 30B-A3B MoE variant is the best long-context model at this size. Qwen 3.5 122B is the quality ceiling for local inference. Supports 119 languages.
Meta Llama 4 launched Scout and Maverick with MoE architectures and native multimodality. Impressive tech, but neither fits consumer GPUs. The older Llama 3.2 Vision 11B remains useful for multimodal tasks. The marketing around “17B active parameters” has confused a lot of people into thinking Scout fits their GPU – it doesn’t.
DeepSeek popularized chain-of-thought reasoning in open-source models. The full R1 at 671B is genuinely impressive. But the distilled 32B variant lost our A/B tests badly. The chain-of-thought reasoning that works well for math and logic puzzles actively hurts performance on real-world decision-making tasks where speed and directness matter more. Use it for math homework, not for production.
Microsoft Phi-4 is the efficiency champion. 14B parameters that punch above their weight. The best choice when you want to leave VRAM headroom or need sub-second responses.
Mistral continues shipping competitive models from Europe. Mistral Small 3.2 at 24B matches much larger models while keeping the VRAM footprint manageable. Vision-capable in newer versions.
The Hierarchy That Actually Matters
After running thousands of evaluations, here is what moves the needle, ranked by impact:
- Data formatting – how you structure and label the input data before it reaches the model. This is 60% of the outcome.
- Prompt design – what you ask the model to do and how. Keep it simple. Complexity has negative returns.
- Model architecture – MoE vs dense, size class. Pick MoE for speed, dense for VRAM-constrained depth.
- Specific model choice – within the same architecture class, differences are noise.
- Temperature – completely irrelevant. 0.0, 0.3, 0.5 all within margin of error.
- Domain fine-tuning – actively harmful. Don’t seek it out.
Most people obsess over #4 and #5 while ignoring #1 and #2. If your results aren’t good, the model isn’t the problem. Your data pipeline is.
Tips for Daily Use
Switch Models for Different Tasks
Don’t pick one model and use it for everything. OpenClaw makes it easy to switch:
- General use and analysis: Gemma 4 26B (fast, accurate, production-tested)
- Long document processing: Qwen 3 30B A3B (handles large context without slowdown)
- Coding sessions: Qwen 2.5 Coder 32B (specialized, still the coding king)
- Quick questions and brainstorming: Phi-4 14B (fast, leaves VRAM headroom)
- Image analysis: Llama 3.2 Vision 11B (lightweight multimodal)
Manage Your Model Library
Models take disk space. Keep your active set small:
# List downloaded models and their sizes
ollama list
# Remove a model you are not using
ollama rm mistral-nemo
# Update a model to the latest version
ollama pull gemma4:26b
Monitor GPU Usage
Keep nvidia-smi handy to watch VRAM usage:
# One-shot check
nvidia-smi
# Continuous monitoring (updates every 2 seconds)
nvidia-smi -l 2
If VRAM usage is at 100% and inference is slow, the model is spilling to system RAM. Either use a smaller model or reduce the context length.
Create Custom Model Presets
Ollama lets you create custom model configurations with system prompts baked in:
# Create a Modelfile
@"
FROM gemma4:26b
SYSTEM You are a helpful assistant. Be concise and direct. Format data clearly.
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
"@ | Set-Content Modelfile
ollama create my-assistant -f Modelfile
ollama run my-assistant
Note: temperature doesn’t meaningfully affect output quality in our testing. Set it to whatever you prefer aesthetically – lower values give slightly more consistent formatting.
The Lazy Way: Let Claude Code Do It
If you have Claude Code installed, you can skip most of the manual steps above. Open a terminal, launch claude, and paste one of these prompts. Claude Code will run the commands, troubleshoot errors, and verify each step.
Full Setup (Everything At Once)
I have a Windows machine with an NVIDIA GPU (24GB VRAM) and Docker Desktop
installed with the WSL2 backend. Set up a local AI stack for me:
1. Verify my GPU is accessible (run nvidia-smi)
2. Install Ollama if not already installed (winget or download)
3. Set the OLLAMA_HOST environment variable to 0.0.0.0 so Docker
containers can reach it, then restart the Ollama service
4. Pull these models: gemma4:26b, qwen3:30b-a3b,
qwen2.5-coder:32b, phi4
5. Create a Dockerfile for OpenClaw (FROM node:22-bookworm-slim,
npm install -g openclaw@latest, expose 18789, CMD with --bind lan
and --allow-unconfigured)
6. Build the OpenClaw image and run it with --add-host for
host.docker.internal and port 18789 mapped
7. Verify OpenClaw is reachable at http://localhost:18789
8. Verify Ollama models are visible from inside the container by
curling http://host.docker.internal:11434/api/tags
After each step, verify it worked before moving on. If something fails,
diagnose and fix it.
Just Ollama + Models
Install Ollama on this Windows machine and pull a curated set of models
for my 24GB VRAM GPU. I want:
- gemma4:26b (general purpose, production-tested winner)
- qwen3:30b-a3b (long context, MoE speed)
- qwen2.5-coder:32b (coding)
- phi4 (fast/efficient)
After pulling, set OLLAMA_HOST=0.0.0.0 as a system environment variable
so other services (like Docker containers) can connect. Verify each model
loads by running a quick test prompt through it.
Just OpenClaw (Ollama Already Running)
Ollama is already running on this machine at localhost:11434. Set up
OpenClaw in Docker to connect to it:
1. Create a Dockerfile for OpenClaw (node:22-bookworm-slim base,
install openclaw@latest globally via npm, expose port 18789,
CMD: openclaw gateway --port 18789 --bind lan --allow-unconfigured)
2. Build the image tagged as "openclaw"
3. Run it with port 18789 mapped and --add-host=host.docker.internal:host-gateway
4. Verify the web UI is reachable at http://localhost:18789
5. Verify it can reach Ollama by curling the Ollama API from inside
the container
Troubleshooting Prompt
My self-hosted AI setup is not working. Here is what I have:
- Windows with Docker Desktop (WSL2)
- Ollama installed natively
- OpenClaw running in a Docker container
Diagnose the issue:
1. Check if nvidia-smi shows the GPU
2. Check if Ollama is running and listening (curl localhost:11434)
3. Check if OLLAMA_HOST is set to 0.0.0.0
4. Check if the OpenClaw container is running (docker ps)
5. Check OpenClaw logs (docker logs openclaw)
6. From inside the container, check if it can reach Ollama
(docker exec openclaw curl http://host.docker.internal:11434/api/tags)
7. Report what is broken and fix it
These prompts are self-contained – Claude Code has enough context in each one to execute the full workflow without follow-up questions. If something goes sideways, it will diagnose the error and try to fix it before asking you for help.
Updating OpenClaw
When a new version of OpenClaw is released, rebuild the Docker image:
docker stop openclaw
docker rm openclaw
# Rebuild with latest
docker build --no-cache -t openclaw .
# Run again
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
What’s Next
Once you have this running, there are a few upgrades worth exploring:
- RAG (Retrieval Augmented Generation): Feed your own documents, code repos, or notes into the model’s context. OpenClaw supports this through its memory and RAG features.
- MCP Tools: Model Context Protocol lets your AI assistant call external tools – web search, file access, API calls. OpenClaw has built-in support.
- Multiple channels: OpenClaw supports Telegram, Discord, and Slack in addition to the web UI. Set up a Telegram bot and chat with your AI from your phone.
- Scheduled tasks: Use OpenClaw’s agent features to run automated workflows – code review, log analysis, daily summaries.
- Data pipeline optimization: If you are using your local AI for any kind of analysis, invest time in formatting your input data, not in switching models. Label your inputs, structure them consistently, and keep your prompts simple. This is where the real gains are.
The self-hosted AI space is moving fast. Models that required a cluster a year ago now run on a single consumer GPU. With 24GB of VRAM, you are sitting at the sweet spot – capable enough for serious work, affordable enough for a personal setup. Just don’t fall for the hype. Pick MoE models, format your data well, and skip the domain-specific models. The generalists win.