LLM-powered GitLab CI: auto-reviewing and auto-fixing merge requests

TL;DR

I’ve wired LLMs into my GitLab CI pipeline to auto-review merge requests, post findings as comments, and (on command) generate patches and commit fixes. The key insight: deterministic gates run first. Before the LLM ever sees a diff, regex-enforced checks block deleted tests, committed secrets, and destructive commands. Regex is certain; LLM judgment is probabilistic. Gate first, judge second. The bot reviews silently unless it finds something, posts to the MR with confidence levels, and can be leveled up from read-only observer to trusted committer as it proves itself — hence the “autonomy ladder” (Rungs 0–4) that gates who decides what. Infrastructure repos cap at Rung 2 (never auto-merge).

Why build a code reviewer that commits

Every merge request workflow faces the same bottleneck: humans review boilerplate. Format fixes, missing labels, typos in comments. The LLM is free (I run my own local inference stack), parallel (doesn’t block the human), and safe to be wrong: it confidently suggests wrong things, but those get caught during human review.

The pipeline isn’t meant to cut humans out. It’s meant to cut the time humans spend on obvious stuff — so they review the real decisions instead.

Three jobs live in the CI suite:

MR review — runs automatically on every MR. Fetches the diff, asks the LLM to look for 8 blocking rules + suggestions, posts a comment if anything lands. Quiet by default (no findings = no comment).
Pipeline medic — runs on job failure. Fetches logs, asks the LLM why it failed, posts diagnosis as a comment. Helps humans understand permission denied vs OOM vs credentials wrong without scrolling.
/llm fix ChatOps — humans comment /llm fix [instruction] on an MR. A scheduled job polls for these every 5 minutes, generates a patch, applies it locally, and pushes as a new branch with a follow-up MR. The human reviews the bot’s work; if it’s wrong, they close the MR and clear the reaction to retry.

The blocking rules that go before LLM judgment

This is non-negotiable. Every runner that can execute CI jobs has hard-coded regex checks for:

Deleted tests: grep -r 'test_' > /dev/null; if nothing matches, fail
Committed secrets: git diff for private keys, tokens, AWS access key patterns; fail immediately
Destructive commands: grep -E '(DROP|DELETE FROM|rm -rf|kubectl delete|terraform destroy)' in scripts; fail if any match, unless allowlisted per-repo
Manifest validity: kubeval + kustomize build on any kubernetes/** change; fail on any invalid manifest

These are never allow_failure: true. A blocking rule fails the job, blocks the merge, and requires a re-push. No exceptions. When a gate fails, the LLM never gets to see the diff.

Why? Because the LLM can be confidently wrong. It might say “this rm -rf looks fine to me,” misread the context, and suggest merging. The regex, by contrast, never has an opinion—it just matches or doesn’t.

# In the CI runner setup, before any job runs:
.safety_checks:
  script:
    # Hard blocks — no allow_failure
    - python3 /opt/ci/safety.py --block deleted-tests
    - python3 /opt/ci/safety.py --block committed-secrets
    - python3 /opt/ci/safety.py --block destructive-commands
    - kubeval kubernetes/**/*.yaml  # no "|| true" here: an invalid manifest must fail the job

If all checks pass, then the LLM is allowed to read the code.
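To make the gate concrete, here is a minimal sketch of what a check like this could look like in Python. It is not the actual safety.py: the function names, the secret patterns, and the allowlist shape are assumptions for illustration. Only the destructive-command pattern is taken from the list above.

```python
import re

# Destructive-command pattern as listed in the post.
DESTRUCTIVE = re.compile(r"(DROP|DELETE FROM|rm -rf|kubectl delete|terraform destroy)")
# Assumed secret patterns: PEM private key headers and AWS access key IDs.
SECRETS = re.compile(r"(-----BEGIN [A-Z ]*PRIVATE KEY-----|AKIA[0-9A-Z]{16})")


def added_lines(diff: str) -> list[str]:
    """Return the lines a unified diff adds, without the leading '+'."""
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def gate(diff: str, allowlist: frozenset = frozenset()) -> list[str]:
    """Return the names of violated rules; an empty list means the diff may proceed."""
    violations = []
    for line in added_lines(diff):
        if SECRETS.search(line):
            violations.append("committed-secrets")
        # Flag the line if any destructive match is not allowlisted for this repo.
        if any(m.group(1) not in allowlist for m in DESTRUCTIVE.finditer(line)):
            violations.append("destructive-commands")
    return violations
```

In CI, a wrapper would exit non-zero whenever gate() returns anything, so the job fails before the LLM step is reached. Only added lines are checked here, so removing an rm -rf does not trip the gate.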

The MR review job

When a merge request opens, the .review_mr job runs in the .pre stage with needs: [] — it doesn’t wait for anything. It fetches the MR metadata via the GitLab API, pulls the diff, and sends this to the LLM:

## Blocking Review Rules
1. registry-direct-pull — Images must go through harbor.k3s.internal.zolty.systems (not docker.io)
2. harbor-staging-push — New CI builds extend .build_registry, not .build_harbor
3. missing-prometheus-annotation — New Deployments have prometheus.io/scrape: "true"
4. imagepullpolicy-always-no-sha — No imagePullPolicy: Always without a pinned sha-... tag
5. longhorn-rolling-update — Deployments with Longhorn PVCs use strategy: Recreate
6. service-missing-component-label — Service selectors include app.kubernetes.io/component
7. plaintext-secret — No plaintext secrets in code
8. terraform-destructive-plan — terraform plan shows no destroy on stateful resources

## Files changed
- kubernetes/apps/blog/deployment.yaml
- terraform/rds.tf

## Diff
[actual diff here]

The LLM returns JSON:

{
  "blocking": [
    {
      "rule": "plaintext-secret",
      "file": "terraform/rds.tf",
      "line": "42",
      "message": "RDS password is hardcoded. Move to Bitwarden or a k8s Secret."
    }
  ],
  "suggestions": [
    {
      "rule": "harbor-staging-push",
      "file": "kubernetes/apps/blog/deployment.yaml",
      "line": "15",
      "message": "image: registry.gitlab.com/... should be built with .build_registry to land in Harbor staging first."
    }
  ]
}

The CI script formats these as a markdown comment and posts it to the MR. Each finding links to the exact line and explains the rule.
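As a rough illustration of that formatting step, the sketch below turns the JSON shape shown above into a markdown comment and returns nothing when there are no findings, which matches the quiet-by-default behavior. The function name and the exact markdown layout are assumptions, not the real script.

```python
from typing import Optional


def format_comment(review: dict) -> Optional[str]:
    """Render findings as markdown; return None when there is nothing to post."""
    blocking = review.get("blocking", [])
    suggestions = review.get("suggestions", [])
    if not blocking and not suggestions:
        return None  # quiet by default: no findings, no comment

    lines = []
    for title, findings in (("Blocking", blocking), ("Suggestions", suggestions)):
        if not findings:
            continue
        lines.append(f"### {title}")
        for f in findings:
            # One bullet per finding: rule name, location, explanation.
            lines.append(f"- **{f['rule']}** `{f['file']}:{f['line']}`: {f['message']}")
    return "\n".join(lines)
```

Returning None instead of an empty string makes the caller's "skip the API call" branch explicit.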

Idempotency via SHA marker: The comment includes a marker tied to the reviewed commit SHA. If that marker for the current commit SHA already exists on the MR, the script exits without calling the LLM again. Push a new commit, and the review runs on the new SHA.
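The SHA check can be sketched in a few lines. The marker format below (an HTML comment with an llm-review-sha prefix) is hypothetical, chosen only because HTML comments stay invisible in rendered markdown. The note bodies are assumed to have been fetched already, for example from GitLab's merge request notes API.

```python
MARKER_PREFIX = "llm-review-sha:"  # hypothetical marker format, not the post's actual one


def marker_for(sha: str) -> str:
    """Build the marker string embedded in each review comment."""
    return f"<!-- {MARKER_PREFIX}{sha} -->"


def already_reviewed(note_bodies: list[str], sha: str) -> bool:
    """True if any existing MR comment carries the marker for this commit."""
    marker = marker_for(sha)
    return any(marker in body for body in note_bodies)
```

The review job would call already_reviewed() before building the prompt, so a re-run on the same commit costs one API read and no LLM tokens.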

Model selection is context-aware. If the diff touches infrastructure files (.tf, kubernetes/**), the job uses Claude’s vision model for a harder look at the architecture. Otherwise, it routes to a fast coder model to save tokens.
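A minimal sketch of that routing decision, assuming the changed file paths are already known. The model identifiers are placeholders, and the glob patterns are my reading of "infrastructure files" from the sentence above.

```python
from fnmatch import fnmatchcase

# Placeholder model names; substitute whatever your proxy exposes.
INFRA_MODEL = "infra-review-model"
FAST_MODEL = "fast-coder-model"

# fnmatch's "*" also matches "/", so "kubernetes/*" covers nested paths.
INFRA_PATTERNS = ("*.tf", "kubernetes/*")


def pick_model(changed_files: list[str]) -> str:
    """Route to the heavier model if any changed file is infrastructure."""
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in INFRA_PATTERNS):
            return INFRA_MODEL
    return FAST_MODEL
```

A single infrastructure file is enough to escalate the whole review, which errs on the side of spending more tokens on risky diffs.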

The `/llm fix` ChatOps job

The workflow:

You comment /llm fix remove the hardcoded password on an MR.
A scheduled job (every 5 min) polls all open MRs for /llm fix comments.
If found, it adds a 👀 reaction (claims the work), fetches the MR diff + latest pipeline-medic diagnosis, and asks the LLM to generate a unified-diff patch.
The patch is extracted, validated with git apply --check --3way, applied locally in a temp repo, committed, and pushed as a new branch (claude/llm-fix-mr-{iid}-...).
A follow-up MR is opened from the new branch back to the source branch, with the patch in the body.
A 🤖 reaction marks it done. If anything fails, a ❌ reaction marks it failed, and a reply explains why.

Why a new branch instead of pushing to the source branch? Because the bot might not have push permission on the source branch, and a separate branch sidesteps that friction. The bot’s work is also transparent: it’s a separate MR the human can review, close, or merge.

Safety gates: The patch generator runs against the repo context (CLAUDE.md, architecture notes, recent commits) so it understands the codebase’s rules before it writes code. But the patch is always extracted and validated with git apply --check --3way before it’s pushed. If the check fails, the job replies with the error and marks it failed.
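The extract-then-validate step might look roughly like this. It assumes the LLM returns the patch inside a fenced block, which is a guess about the prompt contract. The function names are illustrative, and the git flags are the ones named above.

```python
import re
import subprocess
from typing import Optional

TICKS = "`" * 3  # built at runtime so this example contains no literal fence
FENCE = re.compile(TICKS + r"(?:diff|patch)?\n(.*?)" + TICKS, re.DOTALL)


def extract_patch(reply: str) -> Optional[str]:
    """Return the first fenced block that looks like a unified diff, if any."""
    for block in FENCE.findall(reply):
        if block.startswith(("diff --git", "--- ")):
            return block
    return None


def patch_applies(patch: str, repo_dir: str) -> tuple[bool, str]:
    """Dry-run the patch in repo_dir; returns (ok, stderr) without modifying files."""
    result = subprocess.run(
        ["git", "apply", "--check", "--3way", "-"],  # "-" reads the patch from stdin
        cwd=repo_dir,
        input=patch,
        text=True,
        capture_output=True,
    )
    return result.returncode == 0, result.stderr
```

When patch_applies() returns False, the stderr text is what the job would post back as the failure reply.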

Idempotency via reactions: 👀 = in-flight, 🤖 = done, ❌ = failed. Clear the reaction to retry.
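The reaction protocol amounts to a tiny state machine. In the sketch below the reaction names (eyes, robot, x) are assumed to be the emoji short names the API reports, and the helper names are made up for illustration.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    IN_FLIGHT = "in-flight"
    DONE = "done"
    FAILED = "failed"


def state_of(reactions: set) -> State:
    """Map the bot's reactions on a comment to a processing state."""
    if "robot" in reactions:
        return State.DONE
    if "x" in reactions:
        return State.FAILED
    if "eyes" in reactions:
        return State.IN_FLIGHT
    return State.NEW


def should_process(body: str, reactions: set) -> bool:
    """Only unclaimed /llm fix comments are picked up by the poller."""
    return body.strip().startswith("/llm fix") and state_of(reactions) is State.NEW
```

Terminal states are checked before the in-flight one, so a comment that still carries the claim reaction after failing is not mistaken for work in progress. Clearing all reactions returns it to NEW, which is the retry path.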

Autonomy ladder — Rungs 0–4

Not all repos trust the bot equally. Different rungs:

Rung 0: Read-only. The bot reviews code, posts findings, never pushes anything. Good for onboarding.
Rung 1: Draft MRs. The bot can push fixes to new branches and open follow-up MRs, but humans always decide whether to merge. (Current default.)
Rung 2: Commit on command. /llm fix commits patches. Humans still review the follow-up MR, but the flow is tighter.
Rung 3: Auto-merge patches. The bot commits fixes, runs the test suite, and auto-merges if tests pass. For boilerplate-heavy repos (formatting, dependency bumps).
Rung 4: Exceptions-only. The bot handles routine tasks autonomously; humans only step in if something strange happens.

Infrastructure repos cap at Rung 2. Terraform, Kubernetes manifests, Vault configs — humans review before any merge, always. The risk surface is too wide.

Each repo tracks its rung in Langfuse traces (the LLM observability backend). When the bot fails the criteria for its rung (deletes a test, leaks a secret, changes scope without asking), it is demoted immediately to Rung 0.
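The ladder rules (promotion one rung at a time, the infrastructure cap, and immediate demotion) can be modeled as below. This is an illustrative data model only. The class and method names are assumptions, and where the rung is actually stored is not shown.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    READ_ONLY = 0
    DRAFT_MRS = 1
    COMMIT_ON_COMMAND = 2
    AUTO_MERGE = 3
    EXCEPTIONS_ONLY = 4


INFRA_CAP = Rung.COMMIT_ON_COMMAND  # infrastructure repos never go past Rung 2


@dataclass
class RepoPolicy:
    is_infrastructure: bool
    rung: Rung = Rung.DRAFT_MRS  # Rung 1 is the current default

    def promote(self) -> Rung:
        """Move up one rung, respecting the infrastructure ceiling."""
        ceiling = INFRA_CAP if self.is_infrastructure else Rung.EXCEPTIONS_ONLY
        self.rung = Rung(min(self.rung + 1, ceiling))
        return self.rung

    def record_violation(self) -> Rung:
        """Any violation drops the repo straight to read-only."""
        self.rung = Rung.READ_ONLY
        return self.rung

    def may_auto_merge(self) -> bool:
        return self.rung >= Rung.AUTO_MERGE
```

Encoding the cap inside promote() means no caller can push an infrastructure repo into auto-merge by accident.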

Honest caveats

The LLM will suggest wrong fixes. It might read a conditional backwards, misunderstand a context switch, or hallucinate a missing import. Always review. The follow-up MR workflow exists because of this.
Cost scales with diff size. I cap user messages at 60k characters; larger diffs get truncated with a marker. Tokens are budgeted via Langfuse (I use a self-hosted LiteLLM proxy with a $10/day cap).
Hallucinated files. The LLM sometimes references files that don’t exist in the patch or misnames things. The git apply --check step catches some of this, but not all. A failed patch deploy still requires human diagnosis.
No sidecar observability yet. I log LLM calls to Langfuse (tokens, latency, model, feature tag) but don’t yet correlate failed patches to “which prompt triggered this” without manual lookup. That’s on the roadmap.
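For the 60k-character cap mentioned in the cost caveat, a sketch of the truncation could be as simple as this. The marker text is hypothetical. The only fixed number is the limit from the post.

```python
MAX_CHARS = 60_000  # cap on the user message, per the caveat above
MARKER = "\n[... diff truncated ...]"  # hypothetical marker text


def cap_message(text: str, limit: int = MAX_CHARS) -> str:
    """Truncate oversized prompts and append a marker so the model knows."""
    if len(text) <= limit:
        return text
    # Reserve room for the marker so the result never exceeds the limit.
    return text[: limit - len(MARKER)] + MARKER
```

Character counts are only a proxy for tokens, so the real token budget still has to be enforced separately.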

Lessons

Deterministic gates save the entire system. A single regex check that blocks destructive commands scales to any number of repos and LLM calls. Once it’s in place, you stop worrying about “what if the LLM approved something dangerous?”
The autonomy ladder is not trust — it’s gradualism. Rung 0 (read-only) teaches the bot the rules on your diffs. Rung 1 (draft MRs) proves it can generate valid patches. Rung 2 (commit on command) is only granted after weeks of clean runs. This beats “LLM goes full auto on day one.”
Idempotency markers are your friend. SHA-based markers for reviews, emoji reactions for ChatOps — these make re-runs safe. You can re-trigger a job without worrying about double-posts or double-commits.
LLMs excel at reading context but are terrible at following precise rules. Ask one to summarize a 10-file architecture migration and it nails it. Ask it to never output a DROP TABLE statement and it will, confidently, eventually. Gate the rules, use the LLM for judgment.

Running solo without a cluster? You can replicate this entire setup on a managed Kubernetes service like DigitalOcean, swapping the self-hosted inference stack for a cloud LLM API and using a cheaper runner. The blocking-rules-first approach works the same everywhere — the gates are stateless and fast, so they scale down happily to smaller clusters.
