TL;DR

I run Wiki.js on k3s as the cluster’s internal knowledge base. It is not a place I write documentation — it is a place the AI writes documentation after completing work. When Claude finishes deploying a service, debugging an incident, or refactoring infrastructure, it commits the results to the wiki with architecture diagrams, decision rationale, and operational notes. I am the primary reader. When I want to understand how something works, or why a specific decision was made three weeks ago, I go to the wiki instead of digging through git history or re-reading code.

Why Not Confluence, Notion, or a Git Repo

I evaluated the obvious alternatives before deploying Wiki.js.

Confluence / Notion: Both are cloud-hosted, force a context switch away from the terminal, require an account, and are not scriptable in any meaningful way. The AI can write to a REST API, but the authentication flows for both are annoying and the free tiers have storage limits. More importantly, I don’t want my infrastructure knowledge in someone else’s cloud.

Git repo (Markdown files): This is what most people would do, and I did consider it. The problem is that Markdown files in a repo are not searchable in a useful way. grep finds strings but not concepts. A wiki with full-text search, tags, and cross-page links is qualitatively different from a flat directory of files.

Wiki.js: GraphQL API, full-text search, Mermaid diagram support (more on this later), self-hosted, runs cleanly on k3s. The API is the key requirement — the AI needs to write pages programmatically.
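To make the requirement concrete: a page write is a single GraphQL mutation. A minimal sketch using only the Python standard library — the endpoint URL is the internal one from this post, the token is a placeholder, and the mutation fields follow the Wiki.js 2.x schema, so verify them against your version:

```python
import json
import urllib.request

WIKI_URL = "https://wiki.k3s.internal.zolty.systems/graphql"  # internal endpoint
API_TOKEN = "REPLACE_ME"  # a Wiki.js API key with write scope

# Mutation shape based on the Wiki.js 2.x schema; check your instance's schema.
CREATE_PAGE = """
mutation ($path: String!, $title: String!, $content: String!, $tags: [String]!) {
  pages {
    create(path: $path, title: $title, content: $content, tags: $tags,
           description: "", editor: "markdown", isPublished: true,
           isPrivate: false, locale: "en") {
      responseResult { succeeded message }
    }
  }
}
"""

def build_request(path: str, title: str, content: str, tags: list[str]) -> urllib.request.Request:
    """Assemble the authenticated GraphQL request without sending it."""
    payload = json.dumps({
        "query": CREATE_PAGE,
        "variables": {"path": path, "title": title, "content": content, "tags": tags},
    }).encode()
    return urllib.request.Request(
        WIKI_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )

req = build_request("cluster/services/cardboard", "Cardboard", "# Cardboard\n", ["service"])
# urllib.request.urlopen(req) performs the write from inside the cluster network.
```

Anything that can build an HTTP request can publish a page, which is exactly the property Confluence and Notion make painful.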

Deployment

Wiki.js runs as a Deployment in the wiki namespace with a PostgreSQL StatefulSet backend on a Longhorn PVC.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wiki
  namespace: wiki
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: wiki
      app.kubernetes.io/component: web
  template:
    metadata:
      labels:
        app: wiki
        app.kubernetes.io/component: web
    spec:
      containers:
      - name: wiki
        image: harbor.k3s.internal.zolty.systems/production/wiki:latest
        env:
        - name: DB_TYPE
          value: postgres
        - name: DB_HOST
          value: wiki-postgres
        - name: DB_PORT
          value: "5432"
        - name: DB_NAME
          value: wiki
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: wiki-postgres-credentials  # secret name assumed; not shown in this post
              key: username
        - name: DB_PASS
          valueFrom:
            secretKeyRef:
              name: wiki-postgres-credentials
              key: password
        ports:
        - containerPort: 3000

Internal URL: https://wiki.k3s.internal.zolty.systems. No external exposure — this is private infrastructure documentation, not a public wiki.

The AI Writing Loop

The key behavior is that Claude writes to the wiki as part of completing work. This is built into the system prompts and CLAUDE.md files in the repos.

When Claude finishes deploying a new service, the default behavior is:

  1. Verify the deployment is healthy
  2. Write (or update) a wiki page for the service at /cluster/services/<name>
  3. Include: architecture diagram (Mermaid), environment variables, secret names, ingress URL, monitoring endpoints, known issues, and deployment notes
  4. Update the service index page with a link

For debugging sessions:

  1. Resolve the issue
  2. Write an incident page at /cluster/incidents/<date>-<service>
  3. Include: what failed, how it was diagnosed, what fixed it, and what to check next time
  4. Update the relevant service page with a “known issues” entry

This is not something I have to request. It is the default completion behavior. The result is that every service in the cluster has a page, and every significant incident has a post-mortem, without me writing any of it.
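The path conventions above are simple enough to sketch. These helpers are hypothetical — the real behavior lives in prompts, not code — but they show the layout the AI follows:

```python
from datetime import date

def service_page_path(name: str) -> str:
    """Service docs live under /cluster/services/<name>."""
    return f"/cluster/services/{name}"

def incident_page_path(service: str, day: date) -> str:
    """Incident post-mortems live under /cluster/incidents/<date>-<service>."""
    return f"/cluster/incidents/{day.isoformat()}-{service}"

print(service_page_path("cardboard"))                      # /cluster/services/cardboard
print(incident_page_path("cardboard", date(2026, 2, 18)))  # /cluster/incidents/2026-02-18-cardboard
```

Deterministic paths matter more than they look: they let a later session find and update the same page instead of creating a near-duplicate.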

What the Wiki Actually Contains

After two weeks of operation, the wiki has:

  Category                 Pages
  Service documentation    18
  Incident post-mortems    7
  Architecture decisions   12
  Operational runbooks     9
  Network diagrams         4

The service pages are the most useful. A typical service page looks like:

# Cardboard — TCG Price Tracker

## Architecture
[Mermaid diagram: Ingress → webapp → scraper CronJob → Postgres → Chart.js]

## Configuration
| Variable | Value | Source |
|----------|-------|--------|
| DATABASE_URL | postgresql://... | cardboard-secrets |
| SCRAPE_SCHEDULE | 0 23 * * * | hardcoded |

## Endpoints
- Web: https://cardboard.k3s.internal.zolty.systems
- Metrics: :8080/metrics (scraped by ServiceMonitor)

## Known Issues
- 2026-02-18: Selenium ChromeDriver version mismatch after base image update. Fix: pin chromedriver version in Dockerfile.

This is more useful than reading the manifest, because it includes the operational context — what broke before, why the configuration is the way it is, what to check first when something goes wrong.

The MCP Server

To make the wiki writable from Claude Code sessions, I built a small MCP (Model Context Protocol) server that exposes the Wiki.js GraphQL API as tools:

  • wiki_search — full-text search across all pages
  • wiki_read_page — read a page by path
  • wiki_create_page — create a new page with content
  • wiki_update_page — update an existing page
  • wiki_list_pages — list pages by path prefix
  • wiki_list_tags — get all tags

The MCP server runs locally and connects to the wiki via the internal cluster URL. Claude Code loads it automatically from ~/.claude/mcp_servers.json. When Claude finishes a task that warrants documentation, it calls these tools directly.
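Under the hood each tool is a thin wrapper over one GraphQL operation. A sketch of what wiki_search wraps, using only the standard library — the query shape follows the Wiki.js 2.x pages.search API and the token handling is a placeholder, so treat the field names as assumptions to check against your schema:

```python
import json
import urllib.request

WIKI_GRAPHQL = "https://wiki.k3s.internal.zolty.systems/graphql"  # internal cluster URL

# Query shape based on the Wiki.js 2.x schema (pages.search); verify against your version.
SEARCH_QUERY = """
query ($query: String!) {
  pages {
    search(query: $query) {
      results { id title path description }
      totalHits
    }
  }
}
"""

def search_payload(text: str) -> bytes:
    """Serialize the wiki_search GraphQL request body."""
    return json.dumps({"query": SEARCH_QUERY, "variables": {"query": text}}).encode()

def wiki_search(text: str, token: str) -> dict:
    """What the wiki_search MCP tool does: POST the query, return parsed results."""
    req = urllib.request.Request(
        WIKI_GRAPHQL,
        data=search_payload(text),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:  # requires cluster network access
        return json.loads(resp.read())["data"]["pages"]["search"]
```

The MCP layer adds nothing clever; its value is that Claude can call these operations as first-class tools instead of composing raw HTTP requests mid-task.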

Reading the Wiki

The primary use case for me reading the wiki is: I come back to something after a few days and need context fast. Instead of reading the manifest, the git log, and the deployment script, I open the wiki page and get the operational picture in 30 seconds.

The Mermaid diagram support is not optional for this — architecture without diagrams is much harder to parse quickly. Which leads to the next post in this series.

What Doesn’t Work Well

Stale pages: If I update a service and Claude doesn’t run a wiki update (because it was a small change via a direct kubectl command), the page drifts. There is no automated freshness check yet. I rely on Claude running an update as part of any deployment workflow.
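A freshness check would not be hard to bolt on: the page list exposes an updatedAt timestamp, so a nightly pass could flag anything untouched for too long. A sketch of the reporting half, over the shape a pages.list query returns — the field names are assumed from the Wiki.js 2.x schema:

```python
from datetime import datetime, timedelta, timezone

def stale_pages(pages: list[dict], max_age_days: int = 14) -> list[str]:
    """Return paths of pages whose updatedAt is older than the threshold.

    `pages` is the shape a pages.list GraphQL query returns:
    [{"path": ..., "updatedAt": "2026-02-18T10:00:00.000Z"}, ...]
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    out = []
    for page in pages:
        # Wiki.js emits a trailing "Z"; normalize for fromisoformat.
        updated = datetime.fromisoformat(page["updatedAt"].replace("Z", "+00:00"))
        if updated < cutoff:
            out.append(page["path"])
    return out
```

Piping the output into a CronJob that posts the list to an existing wiki page would close the loop without any new infrastructure.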

Search quality: Wiki.js full-text search is decent but not great. It is better than grep on flat files, but it misses conceptual queries. “How do I debug image pull errors?” returns nothing useful even though the answer is spread across three incident pages.

Cross-page linking: Claude creates links when it knows about them, but the link graph is sparse. An automated pass that finds mentions of service names and converts them to links would help.
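That pass is mostly a regex sweep. A sketch, assuming the service-name-to-path map can be enumerated from the wiki itself; the names here are made up, and real markdown has more edge cases than this handles:

```python
import re

def autolink(markdown: str, services: dict[str, str]) -> str:
    """Replace bare mentions of known service names with wiki links.

    `services` maps a display name to its page path. Lookarounds skip
    text that is already part of a markdown link or a path segment.
    """
    for name, path in services.items():
        # Not preceded by "[" (link text) or "/" (path), not followed by "]".
        pattern = re.compile(rf"(?<!\[)(?<!/)\b{re.escape(name)}\b(?!\])")
        markdown = pattern.sub(f"[{name}]({path})", markdown)
    return markdown

doc = "The scraper feeds cardboard nightly."
linked = autolink(doc, {"cardboard": "/cluster/services/cardboard"})
# "The scraper feeds [cardboard](/cluster/services/cardboard) nightly."
```

Running it a second time is a no-op on already-linked text, which makes it safe to apply on every page update.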

These are quality-of-life problems. The core behavior — AI writes, human reads — works well enough that I would not go back to manual documentation or scattered Markdown files.