TL;DR
You have a 24GB VRAM GPU. You want a private, self-hosted AI assistant that rivals ChatGPT – no subscriptions, no data leaving your machine. This guide walks you through setting up Ollama (local model runtime) and OpenClaw (AI gateway with a web UI) on Windows using Docker Desktop. I also cover which models actually fit in 24GB, which ones don’t despite the marketing, and how to pick models for coding, reasoning, creative writing, and general use.
Why Self-Host
The pitch is simple:
- Privacy. Your conversations never leave your machine. No training on your data. No terms of service changes to worry about.
- Cost. After the hardware investment, inference is free. If you are spending $20-200/month on API credits or ChatGPT Plus, a 24GB GPU pays for itself in months.
- No rate limits. Generate as many tokens as your GPU can push. No “you’ve hit your limit” messages at 2am when you are in the zone.
- Offline capability. Works on a plane, in a cabin, during an ISP outage. The models live on your disk.
- Customization. System prompts, custom model presets, RAG pipelines, tool integrations – you control the entire stack.
The tradeoff is that local models are not as capable as frontier models like Claude Opus or GPT-5. But the gap has narrowed dramatically. A 32B parameter model running locally in 2026 matches or exceeds what GPT-4 could do in 2024. For most day-to-day tasks – coding assistance, writing drafts, summarization, brainstorming – local models are more than good enough.
What You Need
Hardware
- GPU: Any NVIDIA GPU with 24GB VRAM. The RTX 3090, RTX 4090, and RTX A5000 all qualify; the RTX 5090's 32GB gives you even more headroom. AMD GPUs work with Ollama too, but NVIDIA has better driver support and faster inference via CUDA.
- RAM: 32GB system RAM minimum. 64GB recommended. When a model doesn’t fully fit in VRAM, the overflow spills to system RAM – and you want headroom for that.
- Storage: At least 100GB free. Models range from 2GB (small 3B models) to 22GB (large 32B models at Q4 quantization). You will want several models downloaded.
- CPU: Any modern CPU works. The GPU does the heavy lifting.
Software
- Windows 10/11 with WSL2 enabled
- Docker Desktop with WSL2 backend
- Ollama (native Windows install)
- NVIDIA drivers (latest Game Ready or Studio driver)
Step 1: Install NVIDIA Drivers
If you are gaming on this machine, you probably already have recent drivers. Verify:
- Open NVIDIA Control Panel (right-click desktop)
- Click Help > System Information
- Check the driver version – anything from 2025 or later is fine
If you need to update, grab the latest from nvidia.com/drivers. The Studio Driver is slightly more stable for compute workloads, but Game Ready works fine too.
Verify CUDA is working by opening PowerShell:
nvidia-smi
You should see your GPU listed with 24GB (or close to it) of memory. If nvidia-smi is not found, the driver install didn’t add it to PATH – restart your terminal or reboot.
Step 2: Enable WSL2
Docker Desktop requires WSL2 (Windows Subsystem for Linux 2) as its backend. If you have never used WSL, you need to set it up first. If you already have WSL2 running, skip to Step 3.
Check if WSL is Already Installed
Open PowerShell as Administrator and run:
wsl --status
If you see a default distribution and “WSL version: 2”, you are good – skip to Step 3. If you get an error or see “WSL version: 1”, keep reading.
Install WSL2 from Scratch
Open PowerShell as Administrator and run the install command:
wsl --install
This enables the WSL2 feature, installs the Linux kernel, and downloads Ubuntu as the default distribution. It requires a reboot.
After rebooting, Ubuntu will launch automatically and ask you to create a Linux username and password. These are for the Linux environment only – pick anything you will remember.
Once that is done, verify WSL2 is active:
wsl --list --verbose
You should see Ubuntu listed with VERSION 2. If it shows VERSION 1, upgrade it:
wsl --set-version Ubuntu 2
If WSL Install Fails
On older Windows 10 builds or machines with virtualization disabled, wsl --install may fail. The fix:
First, enable virtualization in BIOS. Reboot into BIOS/UEFI (usually Del or F2 during boot). Look for “Intel VT-x”, “AMD-V”, or “SVM Mode” and enable it.
Then, manually enable the required Windows features:
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
Reboot, then set WSL2 as the default version and install a distro:
wsl --set-default-version 2
wsl --install -d Ubuntu
Update the WSL Kernel
Even if WSL2 is already installed, make sure the kernel is current. An outdated kernel can cause GPU passthrough issues:
wsl --update
Step 3: Install Docker Desktop
- Download Docker Desktop from docker.com
- Run the installer. When prompted, ensure WSL2 backend is selected (not Hyper-V)
- After installation, open Docker Desktop and let it finish initializing
- In Docker Desktop settings:
- General: Ensure “Use the WSL 2 based engine” is checked
- Resources > WSL Integration: Enable integration with your Ubuntu distro
Verify Docker is working:
docker run hello-world
If this fails with a permissions error, make sure Docker Desktop is running (check the system tray) and that your user is in the docker-users group (Docker Desktop usually handles this during install, but a sign-out/sign-in may be needed).
Enable GPU Access in Docker
Docker Desktop on Windows supports GPU passthrough via WSL2. Two things to check:
- Memory: with the WSL2 backend, the memory limit is governed by WSL itself, not by Docker Desktop's Resources sliders. Create or edit %UserProfile%\.wslconfig, add memory=16GB under [wsl2] (32GB if you have 64GB system RAM), then run wsl --shutdown and restart Docker Desktop
- GPU passthrough should work automatically with recent Docker Desktop versions and an up-to-date WSL kernel
Test GPU access:
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
If you see your GPU listed, you are good. If you get an error about --gpus, make sure Docker Desktop is updated to the latest version and WSL2 integration is enabled.
Step 4: Install Ollama
Ollama is the model runtime. It downloads, manages, and serves LLMs locally. Install it natively on Windows (not in Docker) for the best GPU performance:
- Download the installer from ollama.com/download
- Run the installer – it sets up Ollama as a background service
- Ollama starts automatically and listens on http://localhost:11434
Verify:
ollama --version
Pull Your First Model
Let’s start with something that demonstrates the power of 24GB VRAM:
ollama pull qwen3:30b-a3b
This downloads Qwen3 30B-A3B – a 30-billion-parameter Mixture-of-Experts model that activates only 3B parameters per token (the "a3b") and fits comfortably in 24GB at Q4 quantization. The download is about 18GB. Once it is done:
ollama run qwen3:30b-a3b
You should get a chat prompt. Type something. If you see tokens streaming back at 30-50 tokens per second, your GPU is doing its job. Press Ctrl+D to exit.
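If you want a real number instead of eyeballing the stream, Ollama's /api/generate responses include timing metadata you can turn into tokens/sec. A minimal Python sketch – the eval_count and eval_duration fields come from Ollama's API, while the sample payload below is fabricated for illustration:

```python
import json

def tokens_per_second(response_json: str) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds (per Ollama's API docs).
    """
    data = json.loads(response_json)
    return data["eval_count"] / data["eval_duration"] * 1e9

# A trimmed sample response -- 200 tokens generated in 5 seconds:
sample = '{"model": "qwen3:30b-a3b", "eval_count": 200, "eval_duration": 5000000000}'
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # -> 40.0 tokens/sec
```

Pass `"stream": false` when you call the endpoint yourself and the timing fields arrive in a single JSON object.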
Configure Ollama for Docker Access
By default, Ollama only listens on localhost. Since OpenClaw will run in Docker, it needs to reach Ollama on the host. Set the environment variable so Ollama listens on all interfaces (note this also exposes Ollama to your local network, so leave Windows Firewall enabled unless you trust every device on it):
- Open System Environment Variables (search “environment variables” in Start)
- Under System variables, click New
- Variable name: OLLAMA_HOST
- Variable value: 0.0.0.0
- Click OK, then restart Ollama (quit from system tray and reopen, or restart the service)
Verify it is listening:
curl http://localhost:11434/api/tags
You should see a JSON response listing your downloaded models.
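That same endpoint is handy to consume programmatically, e.g. for scripting VRAM planning later. A sketch of parsing the /api/tags response in Python – the models/name/size fields match Ollama's response shape, but the payload below is a made-up sample:

```python
import json

def list_models(tags_json: str) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from an Ollama /api/tags response."""
    data = json.loads(tags_json)
    # "size" is reported in bytes; convert to GB for readability
    return [(m["name"], m["size"] / 1e9) for m in data.get("models", [])]

# A sample payload in the shape /api/tags returns:
sample = '{"models": [{"name": "qwen3:30b-a3b", "size": 18600000000}]}'
for name, gb in list_models(sample):
    print(f"{name}: {gb:.1f} GB")  # -> qwen3:30b-a3b: 18.6 GB
```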
Step 5: Run OpenClaw in Docker
OpenClaw is the AI gateway – it provides a web chat interface, multi-model support, conversation history, and tool integrations. It connects to Ollama as a model backend.
OpenClaw does not publish an official Docker image, so we build one. Create a file called Dockerfile anywhere on your machine:
FROM node:22-bookworm-slim
# Build tools for native npm modules (node-gyp), plus Opus for voice features
RUN apt-get update && apt-get install -y python3 make g++ libopus-dev && rm -rf /var/lib/apt/lists/*
RUN npm install -g openclaw@latest
# Run as the non-root user (uid 1000) built into the node base image
USER 1000
EXPOSE 18789
CMD ["openclaw", "gateway", "--port", "18789", "--bind", "lan", "--allow-unconfigured"]
Build and run it:
docker build -t openclaw .
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
The --add-host=host.docker.internal:host-gateway flag is critical. It lets the container reach services running on your Windows host (like Ollama) via host.docker.internal.
Open your browser to http://localhost:18789. You should see the OpenClaw Control UI.
Step 6: Configure OpenClaw
On first launch, OpenClaw’s Control UI lets you configure model providers. Here is the setup for Ollama:
- Navigate to the Models section
- Add a new provider:
- Type: OpenAI-compatible
- Base URL: http://host.docker.internal:11434/v1
- API Key: ollama (Ollama doesn’t require a real key, but the field cannot be empty)
- OpenClaw will auto-discover your Ollama models
- Set your preferred default model (I recommend starting with qwen3:30b-a3b)
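Behind that provider entry, OpenClaw is speaking the standard OpenAI chat-completions protocol to Ollama's /v1 endpoint. If the connection misbehaves, it helps to know the wire format. A Python sketch of the request and response shapes – no server is contacted here, and the response payload is a fabricated example:

```python
import json

def build_chat_request(model: str, prompt: str) -> str:
    """Body for POST http://host.docker.internal:11434/v1/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

def extract_reply(response_json: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return json.loads(response_json)["choices"][0]["message"]["content"]

req = build_chat_request("qwen3:30b-a3b", "Say hello.")
# A response in the shape the /v1 endpoint returns:
resp = '{"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}'
print(extract_reply(resp))  # -> Hello!
```

Any HTTP client works for a manual check; the point is that "OpenAI-compatible" means exactly these request and response shapes.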
Optional: Add Cloud Providers as Fallback
OpenClaw can route to multiple providers. For tasks that exceed local model capabilities, you can add API keys for cloud providers:
- Anthropic (Claude) – best for complex reasoning and code
- OpenRouter – aggregates dozens of providers behind one API key
- OpenAI – GPT models
This is optional. The point of this guide is running everything locally. But having a cloud fallback for the occasional hard problem is pragmatic.
Step 7: Pull the Right Models
Here is where the 24GB VRAM really shines. You can run models that would have required a data center two years ago. Below are my recommendations organized by use case.
The Must-Have Starter Pack
These three models cover most use cases and all fit in 24GB:
# Best all-rounder -- strong across all tasks
ollama pull qwen3:30b-a3b
# Best for coding -- GPT-4o level code generation
ollama pull qwen2.5-coder:32b
# Best for reasoning -- chain-of-thought problem solving
ollama pull deepseek-r1:32b
Best Models by Category
General Intelligence: GLM-4.7-Flash (30B)
ollama pull glm-4.7-flash
- VRAM: ~21GB at 16K context
- Speed: ~42 tokens/sec
The top scorer on recent intelligence benchmarks for models that fit in 24GB. It won an agentic coding challenge by autonomously building a procedural rendering engine. If you want the single smartest model that fits your GPU, this is it.
Long Context: Qwen3 30B A3B
ollama pull qwen3:30b-a3b
- VRAM: ~20GB at 16K context, ~24GB at 48K context
- Speed: ~54 tokens/sec at 16K, ~33 tokens/sec at 48K
The only model in this class that maintains 100% GPU utilization at 48K tokens. Every other 30B+ model spills to CPU RAM at long contexts, dropping speed to under 10 tokens/sec. If you are feeding in long documents, codebases, or conversation histories, Qwen3 30B is the pick.
Coding: Qwen 2.5 Coder 32B
ollama pull qwen2.5-coder:32b
- VRAM: ~22GB
- Speed: ~35 tokens/sec
GPT-4o-level coding performance in a model that runs on your desk. Specialized for code generation, debugging, refactoring, and completion across dozens of languages. If you primarily want a coding copilot, this is your main model.
Reasoning: DeepSeek R1 32B
ollama pull deepseek-r1:32b
- VRAM: ~22GB
- Speed: ~40 tokens/sec
Distilled from the massive 671B DeepSeek R1. This model shows its thinking process – you can watch it reason through multi-step problems. Strong on math, logic puzzles, and complex analysis. The chain-of-thought output is longer (more tokens per response), but the final answers are noticeably better on hard problems.
Efficiency: Phi-4 14B
ollama pull phi4
- VRAM: ~11GB
- Speed: ~80+ tokens/sec
Microsoft’s efficiency king. At 14B parameters, it only uses half your VRAM, leaving room for long context windows or running other services. Despite its size, Phi-4-reasoning variants outperform models 5x larger on reasoning benchmarks. This is the model you run when you want fast responses or when you are multitasking and cannot dedicate the full GPU.
Vision (Multimodal): Qwen3 VL 32B
ollama pull qwen3-vl:32b
- VRAM: ~22GB
Can analyze images – screenshots, diagrams, documents, photos. Useful for debugging UI issues, understanding charts, or extracting text from images. The strongest dedicated vision model in this size class that fits 24GB (Gemma 3 and newer Mistral Small builds are also vision-capable, but less specialized).
Google’s Entry: Gemma 3 27B
ollama pull gemma3:27b
- VRAM: ~22GB (or ~17GB with QAT variant)
Google’s quantization-aware trained model. The QAT version is specifically optimized for consumer GPUs, fitting more comfortably in 24GB than standard quantization. Good all-rounder with a theoretical 128K context window.
Mistral’s Entry: Mistral Small 3.2 (24B)
ollama pull mistral-small3.2:24b
- VRAM: ~19GB
European-built, strong at instruction following and multilingual tasks. Matches Llama 3.3 70B quality at 3x the speed. Vision-capable in newer versions.
The Gotcha: Models That Don’t Fit
This is worth calling out because the marketing is misleading.
Llama 4 Scout has “17B active parameters” per token – sounds like it should fit in 24GB, right? Wrong. Scout uses a Mixture-of-Experts (MoE) architecture with 109B total parameters across 16 experts. All 109B parameters must be loaded into memory even though only 17B are active per token. At Q4 quantization, that is 55-65GB. It does not fit.
| Model | Total Params | Q4 Size | Fits 24GB? |
|---|---|---|---|
| Llama 4 Scout | 109B (17B active) | ~60 GB | No |
| Llama 4 Maverick | 400B (17B active) | ~200 GB | No |
| Llama 3.3 70B | 70B | ~40 GB | No |
| Qwen 2.5 72B | 72B | ~42 GB | No |
| DeepSeek R1 (full) | 671B | ~350 GB | No |
The 24GB sweet spot is 24-32B parameter dense models and 30B MoE models with small active parameter counts (like Qwen3 30B A3B, which uses 3B active parameters).
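The fit check behind that table is back-of-envelope arithmetic: it is total parameters, not active parameters, that must load. A sketch assuming ~4.5 bits per weight for Q4 and a couple of GB reserved for context and runtime overhead – both figures are rough assumptions, not measurements:

```python
def fits_in_24gb(total_params_billion: float, bits_per_weight: float = 4.5,
                 overhead_gb: float = 2.0, vram_gb: float = 24.0) -> bool:
    """Rough Q4 fit check. For MoE models, count TOTAL params, not active."""
    # 1e9 params * (bits / 8) bytes, expressed directly in GB
    weights_gb = total_params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

for name, total in [("Qwen3 30B A3B", 30), ("Llama 3.3 70B", 70), ("Llama 4 Scout", 109)]:
    print(f"{name}: {'fits' if fits_in_24gb(total) else 'does not fit'}")
```

Run the numbers on Scout: 109B × 4.5 bits ÷ 8 ≈ 61GB of weights, which is why "17B active" never stood a chance.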
VRAM Quick Reference
For planning which models to keep downloaded:
| Model | Params | VRAM (8K ctx) | Best For |
|---|---|---|---|
| Llama 3.2 3B | 3B | 3.6 GB | Quick tasks, testing |
| Qwen 3 8B | 8B | ~6 GB | Light duty, fast responses |
| DeepSeek R1 7B | 7B | 3.3 GB | Light reasoning |
| Gemma 3 12B | 12B | 12.4 GB | Mid-range all-rounder |
| Phi-4 14B | 14B | 11.0 GB | Efficiency, leaves VRAM headroom |
| Qwen3 14B | 14B | 10.7 GB | Mid-range all-rounder |
| Mistral Small 3.2 | 24B | ~19 GB | Multilingual, instruction following |
| Gemma 3 27B | 27B | 22.5 GB | Google ecosystem, 128K context |
| Qwen3 30B A3B | 30B | 20.3 GB | Long context, speed |
| GLM-4.7-Flash | 30B | 20.9 GB | Raw intelligence |
| DeepSeek R1 32B | 32B | ~22 GB | Reasoning, math |
| Qwen 2.5 Coder 32B | 32B | ~22 GB | Code generation |
Context length matters. These VRAM numbers are at 8K context. At 16K, add 1-3GB. At 48K, add 5-10GB. Only Qwen3 30B A3B handles long contexts gracefully – everything else spills to system RAM and slows down dramatically.
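That context penalty is the KV cache, which grows linearly with context length: 2 tensors (K and V) × layers × KV heads × head dimension × tokens × bytes per element. A sketch with illustrative architecture numbers – the defaults below are typical of a 30B-class model with grouped-query attention, not the published specs of any model above, and quantized KV caches shrink these figures further:

```python
def kv_cache_gb(context_tokens: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

for ctx in (8_192, 16_384, 49_152):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.1f} GB")
```

With these assumptions, going from 8K to 16K costs about 2GB extra, and 48K costs roughly 13GB on top of the weights – which is exactly why most 32B dense models spill at long contexts.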
Understanding Quantization
When Ollama downloads a model, it uses Q4_K_M quantization by default. This means each parameter is stored in ~4 bits instead of the original 16 or 32 bits. The practical effect:
| Quantization | Quality Retention | VRAM Savings |
|---|---|---|
| Q8_0 (8-bit) | ~99% | ~50% |
| Q6_K (6-bit) | ~98% | ~62% |
| Q5_K_M (5-bit) | ~97% | ~69% |
| Q4_K_M (4-bit) | ~95% | ~75% |
| Q3_K_M (3-bit) | ~90-93% | ~81% |
| Q2_K (2-bit) | ~85% | ~88% |
Q4_K_M is the consensus recommendation. The quality loss is imperceptible for most tasks. You can request specific quantizations from Ollama if needed:
ollama pull gemma3:27b-q3_K_M # Tighter fit, slightly lower quality
ollama pull phi4:q8_0 # Higher quality, uses more VRAM
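The VRAM-savings column falls out of simple arithmetic: size ≈ parameters × bits per weight ÷ 8. A sketch – the bits-per-weight values are approximate averages for llama.cpp's K-quants, which mix precisions across tensors, so treat them as estimates:

```python
QUANT_BITS = {  # approximate average bits per weight (estimates, not exact)
    "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.8, "q3_K_M": 3.9, "q2_K": 3.35,
}

def weights_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone (context adds more on top)."""
    # 1e9 params * (bits / 8) bytes, expressed directly in GB
    return params_billion * QUANT_BITS[quant] / 8

print(f"32B at q4_K_M: ~{weights_size_gb(32, 'q4_K_M'):.1f} GB")
print(f"27B at q3_K_M: ~{weights_size_gb(27, 'q3_K_M'):.1f} GB")
```

This is how you sanity-check whether a tighter quant buys you enough room before downloading 15GB to find out.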
The Model Landscape in 2026
A quick overview of who is making what, since the ecosystem moves fast:
Qwen (Alibaba) overtook Llama as the most downloaded open-source model family in late 2025. Qwen 3 and 3.5 offer models from 0.6B to 397B. Apache 2.0 license. The 30B-A3B MoE variants are the sweet spot for 24GB cards. Supports 119 languages.
DeepSeek popularized chain-of-thought reasoning in open-source models. The distilled variants (7B, 14B, 32B) bring R1-level reasoning to consumer GPUs. The full R1 at 671B is cloud-only.
Meta Llama 4 launched Scout and Maverick with MoE architectures and native multimodality. Impressive tech, but neither fits consumer GPUs. Llama 3.1 8B and 3.2 3B remain useful smaller options.
Google Gemma 3 released QAT (Quantization-Aware Trained) models specifically optimized for consumer GPUs. Smart move by Google.
Microsoft Phi-4 is the efficiency champion. 14B parameters that punch above 70B on reasoning tasks. The best choice when you want to leave VRAM headroom.
Mistral continues shipping competitive models from Europe. Mistral Small 3.2 at 24B matches much larger models while keeping the VRAM footprint manageable.
Tips for Daily Use
Switch Models for Different Tasks
Don’t pick one model and use it for everything. OpenClaw makes it easy to switch:
- Quick questions and brainstorming: Phi-4 14B (fast, leaves VRAM headroom)
- Coding sessions: Qwen 2.5 Coder 32B (specialized, accurate)
- Deep analysis or math: DeepSeek R1 32B (chain-of-thought reasoning)
- Long document processing: Qwen3 30B A3B (handles large context without slowdown)
- Image analysis: Qwen3 VL 32B (multimodal)
Manage Your Model Library
Models take disk space. Keep your active set small:
# List downloaded models and their sizes
ollama list
# Remove a model you are not using
ollama rm mistral-nemo
# Update a model to the latest version
ollama pull qwen3:30b-a3b
Monitor GPU Usage
Keep nvidia-smi handy to watch VRAM usage:
# One-shot check
nvidia-smi
# Continuous monitoring (updates every 2 seconds)
nvidia-smi -l 2
If VRAM usage is at 100% and inference is slow, the model is spilling to system RAM. Either use a smaller model or reduce the context length.
Create Custom Model Presets
Ollama lets you create custom model configurations with system prompts baked in:
# Create a Modelfile
@"
FROM qwen2.5-coder:32b
SYSTEM You are a senior software engineer. Write clean, well-tested code. Prefer simple solutions. When you see a bug, explain the root cause before the fix.
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
"@ | Set-Content Modelfile
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
Updating OpenClaw
When a new version of OpenClaw is released, rebuild the Docker image:
docker stop openclaw
docker rm openclaw
# Rebuild with latest
docker build --no-cache -t openclaw .
# Run again
docker run -d `
--name openclaw `
-p 18789:18789 `
--add-host=host.docker.internal:host-gateway `
--restart unless-stopped `
openclaw
The Lazy Way: Let Claude Code Do It
If you have Claude Code installed, you can skip most of the manual steps above. Open a terminal, launch claude, and paste one of these prompts. Claude Code will run the commands, troubleshoot errors, and verify each step.
Full Setup (Everything At Once)
I have a Windows machine with an NVIDIA GPU (24GB VRAM) and Docker Desktop
installed with the WSL2 backend. Set up a local AI stack for me:
1. Verify my GPU is accessible (run nvidia-smi)
2. Install Ollama if not already installed (winget or download)
3. Set the OLLAMA_HOST environment variable to 0.0.0.0 so Docker
containers can reach it, then restart the Ollama service
4. Pull these starter models: qwen3:30b-a3b, qwen2.5-coder:32b,
deepseek-r1:32b, phi4
5. Create a Dockerfile for OpenClaw (FROM node:22-bookworm-slim,
npm install -g openclaw@latest, expose 18789, CMD with --bind lan
and --allow-unconfigured)
6. Build the OpenClaw image and run it with --add-host for
host.docker.internal and port 18789 mapped
7. Verify OpenClaw is reachable at http://localhost:18789
8. Verify Ollama models are visible from inside the container by
curling http://host.docker.internal:11434/api/tags
After each step, verify it worked before moving on. If something fails,
diagnose and fix it.
Just Ollama + Models
Install Ollama on this Windows machine and pull a curated set of models
for my 24GB VRAM GPU. I want:
- qwen3:30b-a3b (general purpose, long context)
- qwen2.5-coder:32b (coding)
- deepseek-r1:32b (reasoning)
- phi4 (fast/efficient)
- glm-4.7-flash (intelligence)
After pulling, set OLLAMA_HOST=0.0.0.0 as a system environment variable
so other services (like Docker containers) can connect. Verify each model
loads by running a quick test prompt through it.
Just OpenClaw (Ollama Already Running)
Ollama is already running on this machine at localhost:11434. Set up
OpenClaw in Docker to connect to it:
1. Create a Dockerfile for OpenClaw (node:22-bookworm-slim base,
install openclaw@latest globally via npm, expose port 18789,
CMD: openclaw gateway --port 18789 --bind lan --allow-unconfigured)
2. Build the image tagged as "openclaw"
3. Run it with port 18789 mapped and --add-host=host.docker.internal:host-gateway
4. Verify the web UI is reachable at http://localhost:18789
5. Verify it can reach Ollama by curling the Ollama API from inside
the container
Add More Models Later
I'm running Ollama with 24GB VRAM. Pull these additional models and
verify each one loads without running out of VRAM:
- gemma3:27b (Google, good all-rounder)
- mistral-small3.2:24b (Mistral, multilingual)
- qwen3-vl:32b (vision/multimodal)
After pulling, list all models with their sizes so I can see my
total disk usage.
Troubleshooting Prompt
My self-hosted AI setup is not working. Here's what I have:
- Windows with Docker Desktop (WSL2)
- Ollama installed natively
- OpenClaw running in a Docker container
Diagnose the issue:
1. Check if nvidia-smi shows the GPU
2. Check if Ollama is running and listening (curl localhost:11434)
3. Check if OLLAMA_HOST is set to 0.0.0.0
4. Check if the OpenClaw container is running (docker ps)
5. Check OpenClaw logs (docker logs openclaw)
6. From inside the container, check if it can reach Ollama
(docker exec openclaw curl http://host.docker.internal:11434/api/tags)
7. Report what's broken and fix it
These prompts are designed to be self-contained – Claude Code has enough context in each one to execute the full workflow without follow-up questions. If something goes sideways, it will diagnose the error and try to fix it before asking you for help.
What’s Next
Once you have this running, there are a few upgrades worth exploring:
- RAG (Retrieval Augmented Generation): Feed your own documents, code repos, or notes into the model’s context. OpenClaw supports this through its memory and RAG features.
- MCP Tools: Model Context Protocol lets your AI assistant call external tools – web search, file access, API calls. OpenClaw has built-in support.
- Multiple channels: OpenClaw supports Telegram, Discord, and Slack in addition to the web UI. Set up a Telegram bot and chat with your AI from your phone.
- Scheduled tasks: Use OpenClaw’s agent features to run automated workflows – code review, log analysis, daily summaries.
The self-hosted AI space is moving fast. Models that required a cluster a year ago now run on a single consumer GPU. With 24GB of VRAM, you are sitting at the sweet spot – capable enough for serious work, affordable enough for a personal setup. Welcome to the future of private AI.