The Saturday DR drill — burning the cluster down on purpose

TL;DR

Three weeks after accidentally wiping GitLab with a misdirected blkdiscard and rebuilding from S3, I scheduled a deliberate drill: wipe GitLab, Vault, Harbor’s proxy cache, Authentik’s database, and one Longhorn volume on a Saturday morning, then rebuild everything from Terraform + S3 with a stopwatch running. Total drill time: 4 hours 22 minutes, end to end. About 90 minutes of that was actual rebuild work; the rest was discovering pieces of state I’d accidentally left out of the IaC.

Why a deliberate drill

The accidental rebuild worked. I had backups, the GitLab Helm chart’s restore path is well-documented, and the cluster came back. But “it worked the one time I had to” is not the same as “it works”. Two specific worries:

The accidental rebuild touched one service. The blast radius of a real disaster — failed Longhorn node, ransomware, hardware fire — could touch every stateful service at once. Rebuilding one thing while the rest hold steady is different from rebuilding from a cold cluster.
The rebuild leaned on muscle memory that’s already fading. I want a documented runbook, validated, with a clock against it.

A scheduled drill fixes both. It also makes me confront the things I’ve quietly been hoping I won’t have to rebuild — the Vault recovery keys in Bitwarden, the Authentik admin password, the Harbor robot-account list. Knowing where they are isn’t the same as rebuilding from them.

The plan

Scope

Five things would be deliberately destroyed:

GitLab — Helm release plus all Longhorn PVCs.
Vault HA — Helm release plus all three Raft PVCs. Auto-unseal via KMS would prove itself on the way back up.
Harbor’s proxy-cache project layers — not the whole Harbor, just the cached blobs. Tests whether re-pulling from upstream actually works.
Authentik’s Postgres volume — the IdP that everything else SSOs against. The recovery order matters.
One application Longhorn volume — monitoring-prometheus-tsdb-0. Tests restore from snapshot, not just from backup.

Five things were explicitly out of scope:

The S3 backup bucket itself. The drill assumes the off-site backup is intact. A drill that tests “what if the backups are also gone” is a different drill.
The Route53 zone. DNS is rented, see the seam.
The Mac Studio inference rig. It’s not in scope for the cluster’s DR.
The k3s_bootstrap repo on GitHub. The escape hatch.
Anything network-layer: switches, router, VLANs. Network DR is its own drill, not this one.

Pre-drill checklist

The day before:

Verify last successful S3 sync for each backed-up service. aws s3 ls s3://k3s-backups/<service>/ --recursive | tail -5 should show a tarball from <24h ago.
Verify the recovery keys for Vault are in Bitwarden and readable.
Verify the Authentik admin recovery password is in Bitwarden.
Snapshot every Longhorn PVC in scope, just in case the drill itself goes wrong. (I’ll delete the snapshots at the end whether the drill succeeded or not — a snapshot kept “just in case” is a snapshot you’ll never delete.)
Tell the household nothing important is going to work for ~4 hours. Jellyfin runs out of a different namespace and is technically unaffected, but always announce the drill anyway.
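
The freshness check in the first item can be scripted instead of eyeballed. A minimal sketch, assuming GNU date (for -d) and the two leading date and time columns that aws s3 ls prints; backup_is_fresh is a hypothetical helper name, and the service list in the usage comment is only an example.

```shell
# Succeeds if an `aws s3 ls` listing line is newer than max_age_hours.
# Listing lines look like: "2024-05-10 03:00:12   1048576 gitlab/backup.tar"
# Assumes GNU date; `backup_is_fresh` is a hypothetical helper, not an AWS tool.
backup_is_fresh() {
  local line="$1" now_epoch="$2" max_age_hours="${3:-24}"
  local stamp backup_epoch
  # The first two whitespace-separated fields are the date and the time.
  stamp=$(printf '%s\n' "$line" | awk '{print $1 " " $2}')
  backup_epoch=$(date -d "$stamp" +%s) || return 2
  [ $(( now_epoch - backup_epoch )) -lt $(( max_age_hours * 3600 )) ]
}

# Possible use (service names are examples):
# for svc in gitlab vault authentik; do
#   last=$(aws s3 ls "s3://k3s-backups/$svc/" --recursive | sort | tail -n 1)
#   backup_is_fresh "$last" "$(date +%s)" || echo "STALE: $svc"
# done
```

Sorting the listing before taking the last line matters because aws s3 ls orders by key, not by timestamp.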

Order of rebuild

Documented in the runbook before the drill starts:

1. Authentik   ← every other service authenticates against it
2. Vault       ← Authentik's OIDC client_secret lives here
3. Harbor      ← every pod needs it back for image pulls
4. GitLab      ← runners can't register without it
5. Workloads   ← apps last, including the wiped Prometheus volume

Each step has a “verified up” criterion before the next step starts: an HTTP probe, a vault status, a successful test pull.
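
The “verified up” gate between steps can be a small retry loop around whatever check the runbook names. A sketch only: wait_until is a hypothetical helper, and the host name and health path in the comments are assumptions to adjust per service.

```shell
# Retry a check until it passes or the attempts run out.
# Usage: wait_until <attempts> <sleep_seconds> <command...>
wait_until() {
  local attempts="$1" pause="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # The check's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$pause"
  done
  return 1
}

# Example gates (placeholders, not the runbook's actual probes):
# wait_until 60 5 curl -fsS "https://<authentik-host>/-/health/ready/" || exit 1
# wait_until 60 5 kubectl exec -n vault vault-0 -- vault status      || exit 1
```

Stopping the script at the first failed gate is what enforces the ordering: step N+1 never starts against a half-restored step N.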

The drill itself

Saturday 09:00. Coffee. Stopwatch.

T+00:00 — wipe

# Authentik
helm uninstall authentik -n authentik
kubectl delete pvc -n authentik --all

# Vault
helm uninstall vault -n vault
kubectl delete pvc -n vault --all

# Harbor proxy-cache layers (just blobs, not the project config)
kubectl exec -n harbor harbor-trivy-0 -- \
  /bin/sh -c "rm -rf /storage/cache/docker/registry/v2/blobs/*"

# GitLab
helm uninstall gitlab -n gitlab
kubectl delete pvc -n gitlab --all

# One Prometheus volume
kubectl delete pvc -n monitoring monitoring-prometheus-tsdb-0

Most of the commands above could be terraform destroy -target=... if I’d built the Helm releases through the Terraform Helm provider. I haven’t; they’re applied directly via helm and tracked in Git. This is a thing the drill exposed: my IaC story is incomplete for in-cluster state.

T+00:14 — Authentik back up

Reinstall:

helm install authentik authentik/authentik -n authentik --create-namespace \
  --values clusters/prod/authentik/values.yaml

Wait for the pods. Then restore the Postgres backup:

kubectl exec -n authentik authentik-postgres-0 -- \
  pg_restore -U authentik -d authentik /tmp/authentik-backup-latest.dump

The dump file was pulled from S3 via an init container baked into the chart values. That’s a pattern I’m now copying to every service.
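
What such an init container runs can be very small. A sketch under assumptions: the bucket layout follows the s3://k3s-backups/<service>/ pattern from the checklist, object keys contain no spaces, and latest_key is a hypothetical helper.

```shell
# Pick the newest object key from `aws s3 ls --recursive` output on stdin.
# Lines look like: "2024-05-10 03:00:12   1048576 authentik/some-file.dump"
# Sorting on the date and time columns makes the last line the newest.
latest_key() {
  sort -k1,1 -k2,2 | tail -n 1 | awk '{print $4}'
}

# What the init container might run (paths are assumptions):
# key=$(aws s3 ls s3://k3s-backups/authentik/ --recursive | latest_key)
# aws s3 cp "s3://k3s-backups/$key" /tmp/authentik-backup-latest.dump
```

The target path has to sit on a volume shared with the Postgres container, otherwise the restore command will not see the file.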

Verified up by logging in with the recovery admin password (Bitwarden) and confirming the OIDC providers list is intact (yes: they’re rows in the Authentik DB, not separate state).

T+00:42 — Vault back up

Reinstall the Helm chart with the same values file used in the Vault behind Authentik post. Auto-unseal via AWS KMS does its job — the pods come up Sealed: false without any human intervention, which is the most satisfying 30 seconds of any DR drill.

Join the two new replicas to the Raft cluster, then restore from the most recent S3 snapshot:

kubectl exec -n vault vault-0 -- \
  vault operator raft snapshot restore /tmp/vault-snapshot-latest.snap

Verified up by reading a known secret. The OIDC config came back with the snapshot. The Authentik trust loop closes itself.
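
The vault status gate can be made explicit by parsing the status table instead of reading it by eye. A sketch, assuming the default table output of vault status; vault_field and vault_ready are hypothetical helper names.

```shell
# Print the value for an exact key from `vault status` table output on stdin.
# Keys may contain single spaces ("Seal Type"), so the key is taken as
# everything before the first run of two or more spaces.
vault_field() {
  awk -v key="$1" '
    { k = $0; sub(/  +.*$/, "", k) }
    k == key { print $NF }'
}

# Succeeds only if the status on stdin says initialized and unsealed.
vault_ready() {
  local status
  status=$(cat)
  [ "$(printf '%s\n' "$status" | vault_field Initialized)" = "true" ] &&
  [ "$(printf '%s\n' "$status" | vault_field Sealed)" = "false" ]
}

# Possible use:
# kubectl exec -n vault vault-0 -- vault status | vault_ready && echo "vault ok"
```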

T+01:10 — Harbor blobs back up by pulling

Harbor’s proxy-cache is self-healing. When a pod tries to pull library/postgres:16-alpine and the blob isn’t cached, Harbor fetches from upstream and re-caches. So Harbor’s “restore” is really “wait and watch the cache repopulate as cluster workloads come back up.”

I noted Harbor’s cache size before the drill (~180GB) and at the end (~22GB). It’ll grow back to ~180GB over the next few days as workloads cycle.

T+01:23 — GitLab back up

The procedure from the migration post, step by step. The painful part this time wasn’t the restore itself — it was re-registering every GitLab Runner. The runner registration tokens are server-side state and the new server doesn’t know the old ones.

# For each runner namespace:
helm uninstall gitlab-runner -n <ns>
helm install gitlab-runner gitlab/gitlab-runner -n <ns> \
  --values clusters/prod/gitlab-runner/<ns>-values.yaml

Verified up by manually triggering a pipeline on home_k3s_cluster and watching it succeed.

T+02:55 — Workloads + Prometheus volume

The wiped Prometheus volume restored from a Longhorn snapshot in ~3 minutes. Prometheus came up with about 4 hours of missing TSDB data, the gap between the most recent snapshot and the wipe, and simply resumed recording from the moment the restored volume came online. The gap itself stays empty.

This is the only thing that didn’t restore to a fully clean state. Acceptable for monitoring telemetry; would not be acceptable for billing or audit data.

T+04:22 — done

The full timeline:

Time	Event
T+00:00	Wipe starts
T+00:08	Wipe complete
T+00:14	Authentik restored
T+00:42	Vault restored, auto-unsealed
T+01:10	Harbor noted; cache will rebuild lazily
T+01:23	GitLab restored
T+02:35	All GitLab runners re-registered
T+02:55	Prometheus volume restored
T+03:40	All workloads back to healthy
T+04:22	Runbook updated, snapshots deleted, drill closed

About 90 minutes of pure rebuild work. The rest was discovery — pieces of state that weren’t in IaC, registration steps the runbook didn’t capture, one outright surprise.
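
The per-step durations fall out of the timeline by subtracting adjacent T+ marks. A small sketch; to_min and elapsed_min are hypothetical helper names.

```shell
# Convert "HH:MM" to minutes. A single leading zero is stripped so that
# values like "08" are not parsed as invalid octal numbers.
to_min() {
  local h="${1%%:*}" m="${1##*:}"
  h="${h#0}"; m="${m#0}"
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Minutes elapsed between two T+ marks.
elapsed_min() {
  echo $(( $(to_min "$2") - $(to_min "$1") ))
}

elapsed_min 01:23 02:35   # runner re-registration: prints 72
elapsed_min 00:00 04:22   # whole drill: prints 262
```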

What surprised me

Authentik’s brand-customization assets are stored as files, not DB rows. The branded login page wallpaper, the SVG logo — all live on a separate media PVC that wasn’t part of the Postgres dump. The brand came back stock until I restored the media volume separately. Now it’s in the backup script.

The Harbor proxy-cache “restore by pulling” path is slow for the first few minutes. Every pod restarting at once means every pod pulling at once. Harbor’s CPU spiked and upstream pulls queued. Not broken, but it was the loudest part of the rebuild.
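
One possible way to soften that pull storm is to warm the cache with a short list of images, one at a time, before workloads scale back up. This is not something the drill did. It is a sketch that assumes Docker Hub images only, and the host harbor.example.internal and project name proxy are placeholders.

```shell
# Map an upstream Docker Hub image to its path behind a Harbor proxy-cache
# project. Host and project are passed in; nothing here is Harbor-specific API.
proxy_ref() {
  local image="$1" host="$2" project="$3"
  case "$image" in
    */*) ;;                          # already namespaced, e.g. grafana/grafana:10.4.0
    *)   image="library/$image" ;;   # Docker Hub official images live under library/
  esac
  printf '%s/%s/%s\n' "$host" "$project" "$image"
}

# Possible warm-up loop, run sequentially on purpose:
# while read -r img; do
#   docker pull "$(proxy_ref "$img" harbor.example.internal proxy)"
# done < images.txt
```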

Vault auto-unseal via KMS just worked. The most-feared step took zero human input. AWS KMS being available is one of those dependencies I have to take on faith; the drill confirmed the faith is well-placed.

Re-registering GitLab runners was the longest single step. ~70 minutes for the full set. This is a chunk of the runbook that needs automation — a script that reads runner namespaces from a Helm-values manifest and re-installs each one.
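
A first cut of that script could look like the sketch below. It swaps one detail: instead of reading namespaces from a manifest, it derives them from the <ns>-values.yaml file names used earlier. It defaults to a dry run that only prints the helm commands; ns_from_values, run, and reinstall_runners are hypothetical names.

```shell
# Derive the namespace from a values file named "<ns>-values.yaml".
ns_from_values() {
  local base
  base=$(basename "$1")
  printf '%s\n' "${base%-values.yaml}"
}

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Reinstall the runner release for every values file in a directory.
reinstall_runners() {
  local dir="$1" f ns
  for f in "$dir"/*-values.yaml; do
    [ -e "$f" ] || continue   # the glob matched nothing
    ns=$(ns_from_values "$f")
    run helm uninstall gitlab-runner -n "$ns"
    run helm install gitlab-runner gitlab/gitlab-runner -n "$ns" --values "$f"
  done
}

# Possible use:
# reinstall_runners clusters/prod/gitlab-runner             # print only
# DRY_RUN=0 reinstall_runners clusters/prod/gitlab-runner   # actually run
```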

What broke / what worked

Worked first try:

Authentik Postgres restore
Vault auto-unseal
Vault Raft snapshot restore
GitLab restore (second time I’ve done this now)
Longhorn snapshot restore for Prometheus

Broke or needed intervention:

Authentik brand assets (missing PVC in backups)
One GitLab runner whose values file referenced an obsolete tag — fixed inline
Two ServiceMonitor resources that came back before Prometheus was scraping and generated stale-target alerts for ~10 minutes

Cleanup

After the drill closed:

Deleted the pre-drill snapshots (no excuses kept).
Updated the runbook with the three things the drill exposed.
Added the Authentik media PVC to the backup CronJob.
Filed an MR to draft a gitlab-runner re-registration script.
Posted a Slack note to myself documenting the drill date and duration.

Lessons

A drill is a test. Tests find bugs. Every drill I’ve run has surfaced at least one piece of state not in the backup. Schedule them often enough that the next drill finds the next bug.
The order of rebuild matters and should be written down. Authentik → Vault → Harbor → GitLab → workloads. Restoring out of order means SSO-protected services come up before SSO, secret-needing services come up before secrets, and you spend the next hour fixing it.
Auto-unseal is the line between “Vault as a tool” and “Vault as a thing you avoid”. A 4-hour rebuild that includes 30 seconds of “wait for the seal to lift” is fine. A 4-hour rebuild that includes 20 minutes of typing key shares is not.
In-cluster state is a gap in IaC. Helm releases I install by hand are not in Terraform. They should be — that’s the next refactor.
Schedule the next drill before you forget how this one felt. I have one on the calendar for late August. The next one wipes the network layer too.

What’s next

The next drill (Q3) adds the network layer to the blast radius: I’ll force-fail the primary k3s control plane, simulate a switch failure, and time how long it takes to recover with the runbook. The current drill tests data recovery; the next one tests availability recovery. Different muscle.

Longer-term, the goal is making each drill shorter than the last, even as the cluster grows, because more state moves into IaC and more steps move into scripts. A drill that takes 4 hours today should take 90 minutes in six months.
