TL;DR
You have a 24GB VRAM GPU. You want a private, self-hosted AI assistant that rivals ChatGPT – no subscriptions, no data leaving your machine. This guide walks you through setting up Ollama (local model runtime) and OpenClaw (AI gateway with a web UI) on Windows using Docker Desktop. I also cover which models actually fit in 24GB, which ones don’t despite the marketing, and how to pick models for coding, reasoning, creative writing, and general use.
Why Self-Host
The pitch is simple:
- Privacy. Your conversations never leave your machine. No training on your data. No terms of service changes to worry about.
- Cost. After the hardware investment, inference is free. If you are spending $20-200/month on API credits or ChatGPT Plus, a 24GB GPU pays for itself in months.
- No rate limits. Generate as many tokens as your GPU can push. No “you’ve hit your limit” messages at 2am when you are in the zone.
- Offline capability. Works on a plane, in a cabin, during an ISP outage. The models live on your disk.
- Customization. System prompts, custom model presets, RAG pipelines, tool integrations – you control the entire stack.
The tradeoff is that local models are not as capable as frontier models like Claude Opus or GPT-5. But the gap has narrowed dramatically. A 32B parameter model running locally in 2026 matches or exceeds what GPT-4 could do in 2024. For most day-to-day tasks – coding assistance, writing drafts, summarization, brainstorming – local models are more than good enough.
What You Need
Hardware
- GPU: Any NVIDIA GPU with 24GB VRAM. The RTX 3090, RTX 4090, and RTX A5000 all qualify; the RTX 5090's 32GB gives you even more headroom. AMD GPUs work with Ollama too, but NVIDIA has better driver support and faster inference via CUDA.
- RAM: 32GB system RAM minimum. 64GB recommended. When a model doesn’t fully fit in VRAM, the overflow spills to system RAM – and you want headroom for that.
- Storage: At least 100GB free. Models range from 2GB (small 3B models) to 22GB (large 32B models at Q4 quantization). You will want several models downloaded.
- CPU: Any modern CPU works. The GPU does the heavy lifting.
Software
- Windows 10/11 with WSL2 enabled
- Docker Desktop with WSL2 backend
- Ollama (native Windows install)
- NVIDIA drivers (latest Game Ready or Studio driver)
Step 1: Install NVIDIA Drivers
If you are gaming on this machine, you probably already have recent drivers. Verify:
- Open NVIDIA Control Panel (right-click desktop)
- Click Help > System Information
- Check the driver version – anything from 2025 or later is fine
If you need to update, grab the latest from nvidia.com/drivers. The Studio Driver is slightly more stable for compute workloads, but Game Ready works fine too.
Verify CUDA is working by opening PowerShell:
nvidia-smi
You should see your GPU listed with 24GB (or close to it) of memory. If nvidia-smi is not found, the driver install didn’t add it to PATH – restart your terminal or reboot.
Step 2: Enable WSL2
Docker Desktop requires WSL2 (Windows Subsystem for Linux 2) as its backend. If you have never used WSL, you need to set it up first. If you already have WSL2 running, skip to Step 3.
Check if WSL is Already Installed
Open PowerShell as Administrator and run:
wsl --status
If you see a default distribution and “WSL version: 2”, you are good – skip to Step 3. If you get an error or see “WSL version: 1”, keep reading.
Install WSL2 from Scratch
Open PowerShell as Administrator and run the install command:
wsl --install
This enables the WSL2 feature, installs the Linux kernel, and downloads Ubuntu as the default distribution. It requires a reboot.
After rebooting, Ubuntu will launch automatically and ask you to create a Linux username and password. These are for the Linux environment only – pick anything you will remember.
Once that is done, verify WSL2 is active:
wsl --list --verbose
You should see Ubuntu listed with VERSION 2. If it shows VERSION 1, upgrade it:
wsl --set-version Ubuntu 2
If WSL Install Fails
On older Windows 10 builds or machines with virtualization disabled, wsl --install may fail. The fix:
First, enable virtualization in BIOS. Reboot into BIOS/UEFI (usually Del or F2 during boot). Look for “Intel VT-x”, “AMD-V”, or “SVM Mode” and enable it.
Then, manually enable the required Windows features:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
Reboot, then set WSL2 as the default version and install a distro:
wsl --set-default-version 2
wsl --install -d Ubuntu
Update the WSL Kernel
Even if WSL2 is already installed, make sure the kernel is current. An outdated kernel can cause GPU passthrough issues:
wsl --update
Step 3: Install Docker Desktop
- Download Docker Desktop from docker.com
- Run the installer. When prompted, ensure WSL2 backend is selected (not Hyper-V)
- After installation, open Docker Desktop and let it finish initializing
- In Docker Desktop settings:
- General: Ensure “Use the WSL 2 based engine” is checked
- Resources > WSL Integration: Enable integration with your Ubuntu distro
Verify Docker is working:
docker run hello-world
If this fails with a permissions error, make sure Docker Desktop is running (check the system tray) and that your user is in the docker-users group (Docker Desktop usually handles this during install, but a sign-out/sign-in may be needed).
Enable GPU Access in Docker
Docker Desktop on Windows supports GPU passthrough via WSL2. Two things to check:
- Memory: with the WSL2 backend, the memory limit is governed by WSL itself, not by Docker Desktop's Resources sliders. Create or edit %UserProfile%\.wslconfig, add memory=16GB under [wsl2] (32GB if you have 64GB system RAM), then run wsl --shutdown and restart Docker Desktop
- GPU passthrough should work automatically with recent Docker Desktop versions and an up-to-date WSL kernel
Test GPU access:
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
If you see your GPU listed, you are good. If you get an error about --gpus, make sure Docker Desktop is updated to the latest version and WSL2 integration is enabled.
Step 4: Install Ollama
Ollama is the model runtime. It downloads, manages, and serves LLMs locally. Install it natively on Windows (not in Docker) for the best GPU performance:
- Download the installer from ollama.com/download
- Run the installer – it sets up Ollama as a background service
- Ollama starts automatically and listens on http://localhost:11434
Verify:
ollama --version
Pull Your First Model
Let’s start with something that demonstrates the power of 24GB VRAM:
ollama pull qwen3:30b-a3b
This downloads Qwen3 30B-A3B – a 30-billion-parameter Mixture-of-Experts model that activates only 3B parameters per token (the "a3b") and fits comfortably in 24GB at Q4 quantization. The download is about 18GB. Once it is done:
ollama run qwen3:30b-a3b
You should get a chat prompt. Type something. If you see tokens streaming back at 30-50 tokens per second, your GPU is doing its job. Press Ctrl+D to exit.
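If you want a real number instead of eyeballing the stream, Ollama's /api/generate responses include timing metadata you can turn into tokens/sec. A minimal Python sketch – the eval_count and eval_duration fields come from Ollama's API, while the sample payload below is fabricated for illustration:

```python
import json

def tokens_per_second(response_json: str) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds (per Ollama's API docs).
    """
    data = json.loads(response_json)
    return data["eval_count"] / data["eval_duration"] * 1e9

# A trimmed sample response -- 200 tokens generated in 5 seconds:
sample = '{"model": "qwen3:30b-a3b", "eval_count": 200, "eval_duration": 5000000000}'
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # -> 40.0 tokens/sec
```

Pass `"stream": false` when you call the endpoint yourself and the timing fields arrive in a single JSON object.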
Configure Ollama for Docker Access
By default, Ollama only listens on localhost. Since OpenClaw will run in Docker, it needs to reach Ollama on the host. Set the environment variable so Ollama listens on all interfaces (note this also exposes Ollama to your local network, so leave Windows Firewall enabled unless you trust every device on it):
- Open System Environment Variables (search “environment variables” in Start)
- Under System variables, click New
- Variable name: OLLAMA_HOST
- Variable value: 0.0.0.0
- Click OK, then restart Ollama (quit from system tray and reopen, or restart the service)
Verify it is listening:
curl http://localhost:11434/api/tags
You should see a JSON response listing your downloaded models.
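That same endpoint is handy to consume programmatically, e.g. for scripting VRAM planning later. A sketch of parsing the /api/tags response in Python – the models/name/size fields match Ollama's response shape, but the payload below is a made-up sample:

```python
import json

def list_models(tags_json: str) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from an Ollama /api/tags response."""
    data = json.loads(tags_json)
    # "size" is reported in bytes; convert to GB for readability
    return [(m["name"], m["size"] / 1e9) for m in data.get("models", [])]

# A sample payload in the shape /api/tags returns:
sample = '{"models": [{"name": "qwen3:30b-a3b", "size": 18600000000}]}'
for name, gb in list_models(sample):
    print(f"{name}: {gb:.1f} GB")  # -> qwen3:30b-a3b: 18.6 GB
```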
Step 5: Run OpenClaw in Docker
OpenClaw is the AI gateway – it provides a web chat interface, multi-model support, conversation history, and tool integrations. It connects to Ollama as a model backend.
OpenClaw does not publish an official Docker image, so we build one. Create a file called Dockerfile anywhere on your machine:
FROM node:22-bookworm-slim
# Build tools for native npm modules (node-gyp), plus Opus for voice features
RUN apt-get update && apt-get install -y python3 make g++ libopus-dev && rm -rf /var/lib/apt/lists/*
RUN npm install -g openclaw@latest
# Run as the non-root user (uid 1000) built into the node base image
USER 1000
EXPOSE 18789
CMD ["openclaw", "gateway", "--port", "18789", "--bind", "lan", "--allow-unconfigured"]
Build and run it:
docker build -t openclaw .
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
The --add-host=host.docker.internal:host-gateway flag is critical. It lets the container reach services running on your Windows host (like Ollama) via host.docker.internal.
Open your browser to http://localhost:18789. You should see the OpenClaw Control UI.
Step 6: Configure OpenClaw
On first launch, OpenClaw’s Control UI lets you configure model providers. Here is the setup for Ollama:
- Navigate to the Models section
- Add a new provider:
- Type: OpenAI-compatible
- Base URL: http://host.docker.internal:11434/v1
- API Key: ollama (Ollama doesn’t require a real key, but the field cannot be empty)
- OpenClaw will auto-discover your Ollama models
- Set your preferred default model (I recommend starting with qwen3:30b-a3b)
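Behind that provider entry, OpenClaw is speaking the standard OpenAI chat-completions protocol to Ollama's /v1 endpoint. If the connection misbehaves, it helps to know the wire format. A Python sketch of the request and response shapes – no server is contacted here, and the response payload is a fabricated example:

```python
import json

def build_chat_request(model: str, prompt: str) -> str:
    """Body for POST http://host.docker.internal:11434/v1/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

def extract_reply(response_json: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return json.loads(response_json)["choices"][0]["message"]["content"]

req = build_chat_request("qwen3:30b-a3b", "Say hello.")
# A response in the shape the /v1 endpoint returns:
resp = '{"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}'
print(extract_reply(resp))  # -> Hello!
```

Any HTTP client works for a manual check; the point is that "OpenAI-compatible" means exactly these request and response shapes.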
Optional: Add Cloud Providers as Fallback
OpenClaw can route to multiple providers. For tasks that exceed local model capabilities, you can add API keys for cloud providers:
- Anthropic (Claude) – best for complex reasoning and code
- OpenRouter – aggregates dozens of providers behind one API key
- OpenAI – GPT models
This is optional. The point of this guide is running everything locally. But having a cloud fallback for the occasional hard problem is pragmatic.
Step 7: Pull the Right Models
Here is where the 24GB VRAM really shines. You can run models that would have required a data center two years ago. Below are my recommendations organized by use case.
The Must-Have Starter Pack
These three models cover most use cases and all fit in 24GB:
# Best all-rounder -- strong across all tasks
ollama pull qwen3:30b-a3b
# Best for coding -- GPT-4o level code generation
ollama pull qwen2.5-coder:32b
# Best for reasoning -- chain-of-thought problem solving
ollama pull deepseek-r1:32b
Best Models by Category
General Intelligence: GLM-4.7-Flash (30B)
ollama pull glm-4.7-flash
- VRAM: ~21GB at 16K context
- Speed: ~42 tokens/sec
The top scorer on recent intelligence benchmarks for models that fit in 24GB. It won an agentic coding challenge by autonomously building a procedural rendering engine. If you want the single smartest model that fits your GPU, this is it.
Long Context: Qwen3 30B A3B
ollama pull qwen3:30b-a3b
- VRAM: ~20GB at 16K context, ~24GB at 48K context
- Speed: ~54 tokens/sec at 16K, ~33 tokens/sec at 48K
The only model in this class that maintains 100% GPU utilization at 48K tokens. Every other 30B+ model spills to CPU RAM at long contexts, dropping speed to under 10 tokens/sec. If you are feeding in long documents, codebases, or conversation histories, Qwen3 30B is the pick.
Coding: Qwen 2.5 Coder 32B
ollama pull qwen2.5-coder:32b
- VRAM: ~22GB
- Speed: ~35 tokens/sec
GPT-4o-level coding performance in a model that runs on your desk. Specialized for code generation, debugging, refactoring, and completion across dozens of languages. If you primarily want a coding copilot, this is your main model.
Reasoning: DeepSeek R1 32B
ollama pull deepseek-r1:32b
- VRAM: ~22GB
- Speed: ~40 tokens/sec
Distilled from the massive 671B DeepSeek R1. This model shows its thinking process – you can watch it reason through multi-step problems. Strong on math, logic puzzles, and complex analysis. The chain-of-thought output is longer (more tokens per response), but the final answers are noticeably better on hard problems.
Efficiency: Phi-4 14B
ollama pull phi4
- VRAM: ~11GB
- Speed: ~80+ tokens/sec
Microsoft’s efficiency king. At 14B parameters, it only uses half your VRAM, leaving room for long context windows or running other services. Despite its size, Phi-4-reasoning variants outperform models 5x larger on reasoning benchmarks. This is the model you run when you want fast responses or when you are multitasking and cannot dedicate the full GPU.
Vision (Multimodal): Qwen3 VL 32B
ollama pull qwen3-vl:32b
- VRAM: ~22GB
Can analyze images – screenshots, diagrams, documents, photos. Useful for debugging UI issues, understanding charts, or extracting text from images. The strongest dedicated vision model in this size class that fits 24GB (Gemma 3 and newer Mistral Small builds are also vision-capable, but less specialized).
Google’s Entry: Gemma 3 27B
ollama pull gemma3:27b
- VRAM: ~22GB (or ~17GB with QAT variant)
Google’s quantization-aware trained model. The QAT version is specifically optimized for consumer GPUs, fitting more comfortably in 24GB than standard quantization. Good all-rounder with a theoretical 128K context window.
Mistral’s Entry: Mistral Small 3.2 (24B)
ollama pull mistral-small3.2:24b
- VRAM: ~19GB
European-built, strong at instruction following and multilingual tasks. Matches Llama 3.3 70B quality at 3x the speed. Vision-capable in newer versions.
The Gotcha: Models That Don’t Fit
This is worth calling out because the marketing is misleading.
Llama 4 Scout has “17B active parameters” per token – sounds like it should fit in 24GB, right? Wrong. Scout uses a Mixture-of-Experts (MoE) architecture with 109B total parameters across 16 experts. All 109B parameters must be loaded into memory even though only 17B are active per token. At Q4 quantization, that is 55-65GB. It does not fit.
| Model | Total Params | Q4 Size | Fits 24GB? |
|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~60 GB | No |
| Llama 4 Maverick | 400B (17B active) | ~200 GB | No |
| Llama 3.3 70B | 70B | ~40 GB | No |
| Qwen 2.5 72B | 72B | ~42 GB | No |
| DeepSeek R1 (full) | 671B | ~350 GB | No |
The 24GB sweet spot is 24-32B parameter dense models and 30B MoE models with small active parameter counts (like Qwen3 30B A3B, which uses 3B active parameters).
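The fit check behind that table is back-of-envelope arithmetic: it is total parameters, not active parameters, that must load. A sketch assuming ~4.5 bits per weight for Q4 and a couple of GB reserved for context and runtime overhead – both figures are rough assumptions, not measurements:

```python
def fits_in_24gb(total_params_billion: float, bits_per_weight: float = 4.5,
                 overhead_gb: float = 2.0, vram_gb: float = 24.0) -> bool:
    """Rough Q4 fit check. For MoE models, count TOTAL params, not active."""
    # 1e9 params * (bits / 8) bytes, expressed directly in GB
    weights_gb = total_params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

for name, total in [("Qwen3 30B A3B", 30), ("Llama 3.3 70B", 70), ("Llama 4 Scout", 109)]:
    print(f"{name}: {'fits' if fits_in_24gb(total) else 'does not fit'}")
```

Run the numbers on Scout: 109B × 4.5 bits ÷ 8 ≈ 61GB of weights, which is why "17B active" never stood a chance.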
VRAM Quick Reference
For planning which models to keep downloaded:
| Model | Params | VRAM (8K ctx) | Best For |
|---|---|---|---|
| Llama 3.2 3B | 3B | 3.6 GB | Quick tasks, testing |
| Qwen 3 8B | 8B | ~6 GB | Light duty, fast responses |
| DeepSeek R1 7B | 7B | 3.3 GB | Light reasoning |
| Gemma 3 12B | 12B | 12.4 GB | Mid-range all-rounder |
| Phi-4 14B | 14B | 11.0 GB | Efficiency, leaves VRAM headroom |
| Qwen3 14B | 14B | 10.7 GB | Mid-range all-rounder |
| Mistral Small 3.2 | 24B | ~19 GB | Multilingual, instruction following |
| Gemma 3 27B | 27B | 22.5 GB | Google ecosystem, 128K context |
| Qwen3 30B A3B | 30B | 20.3 GB | Long context, speed |
| GLM-4.7-Flash | 30B | 20.9 GB | Raw intelligence |
| DeepSeek R1 32B | 32B | ~22 GB | Reasoning, math |
| Qwen 2.5 Coder 32B | 32B | ~22 GB | Code generation |
Context length matters. These VRAM numbers are at 8K context. At 16K, add 1-3GB. At 48K, add 5-10GB. Only Qwen3 30B A3B handles long contexts gracefully – everything else spills to system RAM and slows down dramatically.
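That context penalty is the KV cache, which grows linearly with context length: 2 tensors (K and V) × layers × KV heads × head dimension × tokens × bytes per element. A sketch with illustrative architecture numbers – the defaults below are typical of a 30B-class model with grouped-query attention, not the published specs of any model above, and quantized KV caches shrink these figures further:

```python
def kv_cache_gb(context_tokens: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

for ctx in (8_192, 16_384, 49_152):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.1f} GB")
```

With these assumptions, going from 8K to 16K costs about 2GB extra, and 48K costs roughly 13GB on top of the weights – which is exactly why most 32B dense models spill at long contexts.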
Understanding Quantization
When Ollama downloads a model, it uses Q4_K_M quantization by default. This means each parameter is stored in ~4 bits instead of the original 16 or 32 bits. The practical effect:
| Quantization | Quality Retention | VRAM Savings |
|---|---|---|
| Q8_0 (8-bit) | ~99% | ~50% |
| Q6_K (6-bit) | ~98% | ~62% |
| Q5_K_M (5-bit) | ~97% | ~69% |
| Q4_K_M (4-bit) | ~95% | ~75% |
| Q3_K_M (3-bit) | ~90-93% | ~81% |
| Q2_K (2-bit) | ~85% | ~88% |
Q4_K_M is the consensus recommendation. The quality loss is imperceptible for most tasks. You can request specific quantizations from Ollama if needed:
ollama pull gemma3:27b-q3_K_M # Tighter fit, slightly lower quality
ollama pull phi4:q8_0 # Higher quality, uses more VRAM
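The VRAM-savings column falls out of simple arithmetic: size ≈ parameters × bits per weight ÷ 8. A sketch – the bits-per-weight values are approximate averages for llama.cpp's K-quants, which mix precisions across tensors, so treat them as estimates:

```python
QUANT_BITS = {  # approximate average bits per weight (estimates, not exact)
    "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.8, "q3_K_M": 3.9, "q2_K": 3.35,
}

def weights_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone (context adds more on top)."""
    # 1e9 params * (bits / 8) bytes, expressed directly in GB
    return params_billion * QUANT_BITS[quant] / 8

print(f"32B at q4_K_M: ~{weights_size_gb(32, 'q4_K_M'):.1f} GB")
print(f"27B at q3_K_M: ~{weights_size_gb(27, 'q3_K_M'):.1f} GB")
```

This is how you sanity-check whether a tighter quant buys you enough room before downloading 15GB to find out.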
The Model Landscape in 2026
A quick overview of who is making what, since the ecosystem moves fast:
Qwen (Alibaba) overtook Llama as the most downloaded open-source model family in late 2025. Qwen 3 and 3.5 offer models from 0.6B to 397B. Apache 2.0 license. The 30B-A3B MoE variants are the sweet spot for 24GB cards. Supports 119 languages.
DeepSeek popularized chain-of-thought reasoning in open-source models. The distilled variants (7B, 14B, 32B) bring R1-level reasoning to consumer GPUs. The full R1 at 671B is cloud-only.
Meta Llama 4 launched Scout and Maverick with MoE architectures and native multimodality. Impressive tech, but neither fits consumer GPUs. Llama 3.1 8B and 3.2 3B remain useful smaller options.
Google Gemma 3 released QAT (Quantization-Aware Trained) models specifically optimized for consumer GPUs. Smart move by Google.
Microsoft Phi-4 is the efficiency champion. 14B parameters that punch above 70B on reasoning tasks. The best choice when you want to leave VRAM headroom.
Mistral continues shipping competitive models from Europe. Mistral Small 3.2 at 24B matches much larger models while keeping the VRAM footprint manageable.
Tips for Daily Use
Switch Models for Different Tasks
Don’t pick one model and use it for everything. OpenClaw makes it easy to switch:
- Quick questions and brainstorming: Phi-4 14B (fast, leaves VRAM headroom)
- Coding sessions: Qwen 2.5 Coder 32B (specialized, accurate)
- Deep analysis or math: DeepSeek R1 32B (chain-of-thought reasoning)
- Long document processing: Qwen3 30B A3B (handles large context without slowdown)
- Image analysis: Qwen3 VL 32B (multimodal)
Manage Your Model Library
Models take disk space. Keep your active set small:
# List downloaded models and their sizes
ollama list
# Remove a model you are not using
ollama rm mistral-nemo
# Update a model to the latest version
ollama pull qwen3:30b-a3b
Monitor GPU Usage
Keep nvidia-smi handy to watch VRAM usage:
# One-shot check
nvidia-smi
# Continuous monitoring (updates every 2 seconds)
nvidia-smi -l 2
If VRAM usage is at 100% and inference is slow, the model is spilling to system RAM. Either use a smaller model or reduce the context length.
Create Custom Model Presets
Ollama lets you create custom model configurations with system prompts baked in:
# Create a Modelfile
@"
FROM qwen2.5-coder:32b
SYSTEM You are a senior software engineer. Write clean, well-tested code. Prefer simple solutions. When you see a bug, explain the root cause before the fix.
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
"@ | Set-Content Modelfile
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
Updating OpenClaw
When a new version of OpenClaw is released, rebuild the Docker image:
docker stop openclaw
docker rm openclaw
# Rebuild with latest
docker build --no-cache -t openclaw .
# Run again
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
The Lazy Way: Let Claude Code Do It
If you have Claude Code installed, you can skip most of the manual steps above. Open a terminal, launch claude, and paste one of these prompts. Claude Code will run the commands, troubleshoot errors, and verify each step.
Full Setup (Everything At Once)
I have a Windows machine with an NVIDIA GPU (24GB VRAM) and Docker Desktop
installed with the WSL2 backend. Set up a local AI stack for me:
1. Verify my GPU is accessible (run nvidia-smi)
2. Install Ollama if not already installed (winget or download)
3. Set the OLLAMA_HOST environment variable to 0.0.0.0 so Docker
containers can reach it, then restart the Ollama service
4. Pull these starter models: qwen3:30b-a3b, qwen2.5-coder:32b,
deepseek-r1:32b, phi4
5. Create a Dockerfile for OpenClaw (FROM node:22-bookworm-slim,
npm install -g openclaw@latest, expose 18789, CMD with --bind lan
and --allow-unconfigured)
6. Build the OpenClaw image and run it with --add-host for
host.docker.internal and port 18789 mapped
7. Verify OpenClaw is reachable at http://localhost:18789
8. Verify Ollama models are visible from inside the container by
curling http://host.docker.internal:11434/api/tags
After each step, verify it worked before moving on. If something fails,
diagnose and fix it.
Just Ollama + Models
Install Ollama on this Windows machine and pull a curated set of models
for my 24GB VRAM GPU. I want:
- qwen3:30b-a3b (general purpose, long context)
- qwen2.5-coder:32b (coding)
- deepseek-r1:32b (reasoning)
- phi4 (fast/efficient)
- glm-4.7-flash (intelligence)
After pulling, set OLLAMA_HOST=0.0.0.0 as a system environment variable
so other services (like Docker containers) can connect. Verify each model
loads by running a quick test prompt through it.
Just OpenClaw (Ollama Already Running)
Ollama is already running on this machine at localhost:11434. Set up
OpenClaw in Docker to connect to it:
1. Create a Dockerfile for OpenClaw (node:22-bookworm-slim base,
install openclaw@latest globally via npm, expose port 18789,
CMD: openclaw gateway --port 18789 --bind lan --allow-unconfigured)
2. Build the image tagged as "openclaw"
3. Run it with port 18789 mapped and --add-host=host.docker.internal:host-gateway
4. Verify the web UI is reachable at http://localhost:18789
5. Verify it can reach Ollama by curling the Ollama API from inside
the container
Add More Models Later
I'm running Ollama with 24GB VRAM. Pull these additional models and
verify each one loads without running out of VRAM:
- gemma3:27b (Google, good all-rounder)
- mistral-small3.2:24b (Mistral, multilingual)
- qwen3-vl:32b (vision/multimodal)
After pulling, list all models with their sizes so I can see my
total disk usage.
Troubleshooting Prompt
My self-hosted AI setup is not working. Here's what I have:
- Windows with Docker Desktop (WSL2)
- Ollama installed natively
- OpenClaw running in a Docker container
Diagnose the issue:
1. Check if nvidia-smi shows the GPU
2. Check if Ollama is running and listening (curl localhost:11434)
3. Check if OLLAMA_HOST is set to 0.0.0.0
4. Check if the OpenClaw container is running (docker ps)
5. Check OpenClaw logs (docker logs openclaw)
6. From inside the container, check if it can reach Ollama
(docker exec openclaw curl http://host.docker.internal:11434/api/tags)
7. Report what's broken and fix it
These prompts are designed to be self-contained – Claude Code has enough context in each one to execute the full workflow without follow-up questions. If something goes sideways, it will diagnose the error and try to fix it before asking you for help.
What’s Next
Once you have this running, there are a few upgrades worth exploring:
- RAG (Retrieval Augmented Generation): Feed your own documents, code repos, or notes into the model’s context. OpenClaw supports this through its memory and RAG features.
- MCP Tools: Model Context Protocol lets your AI assistant call external tools – web search, file access, API calls. OpenClaw has built-in support.
- Multiple channels: OpenClaw supports Telegram, Discord, and Slack in addition to the web UI. Set up a Telegram bot and chat with your AI from your phone.
- Scheduled tasks: Use OpenClaw’s agent features to run automated workflows – code review, log analysis, daily summaries.
The self-hosted AI space is moving fast. Models that required a cluster a year ago now run on a single consumer GPU. With 24GB of VRAM, you are sitting at the sweet spot – capable enough for serious work, affordable enough for a personal setup. Welcome to the future of private AI.