TL;DR

A routine cluster health check surfaced seven simultaneous issues. Most were transient — Longhorn self-healed its replica fault, Prometheus recovered behind it, a stale manually-created Job was deleted in one command, and a liveness probe blip fixed itself. The real work was dnd-backend, which had been in CrashLoopBackOff and turned out to contain three separate bugs layered on top of each other. The AI identified all three during a single debugging session, authored the fixes across three PRs, and the service came up 1/1 Running with all 18 database tables created on the first boot after the final merge.


The Initial Alert Triage

The session started with a kubectl get pods --all-namespaces sweep that returned seven issues worth investigating:

| Namespace | Resource | Status | Initial Assessment |
|---|---|---|---|
| longhorn-system | instance-manager-* | Timeout | Transient — retried successfully |
| monitoring | prometheus-* | 1/2 Running | Longhorn-dependent |
| cardboard | postgres-backup (Job) | Completed/stuck | Stale manual Job |
| livekit | livekit-* | Liveness probe fail | Transient |
| longhorn-system | Replica fault | Degraded | Storage event |
| media | plex-* | Readiness probe | Transient |
| dnd | dnd-backend-* | CrashLoopBackOff | Unknown root cause |

The three transient issues (livekit liveness, Longhorn instance-manager timeout, Plex readiness probe) all self-resolved during the investigation. The cardboard stale Job was a one-liner delete. The Longhorn replica fault and Prometheus recovery followed each other automatically — Longhorn self-healed to three replicas, and Prometheus came back 2/2 Running behind it.

That left dnd-backend.


The ECR Secret Red Herring

The first hypothesis for the dnd-backend crash was the ECR pull secret. Our ECR secrets expire every 12 hours, and the one in the dnd namespace was 44 hours old. The in-cluster ECR refresh CronJob in kube-system handles most namespaces but hadn’t included dnd.

After refreshing the secret, the pod transitioned from ImagePullBackOff to CrashLoopBackOff almost immediately. Progress — image is pulling — but the crash moved inside the container.


Bug #1: asyncio.run() Inside a Running Event Loop

The first crash log was:

RuntimeError: This event loop is already running.

The FastAPI lifespan handler calls init_db(), which calls Alembic’s command.upgrade(). What the original code didn’t account for: command.upgrade() is a synchronous function that internally calls asyncio.run(run_async_migrations()). Calling asyncio.run() from inside an already-running async event loop raises RuntimeError.

The exception was caught somewhere up the call stack, the coroutine was garbage-collected unawaited, and Uvicorn exited with code 3.
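The failure mode is easy to reproduce in isolation. This is a minimal sketch, not the actual dnd-backend code: sync_migration stands in for Alembic's command.upgrade(), and lifespan stands in for the FastAPI lifespan handler running on Uvicorn's loop:

```python
import asyncio

def sync_migration():
    # Stand-in for Alembic's command.upgrade(): a synchronous function
    # that internally calls asyncio.run() on its async migration runner.
    # The coroutine passed here is created but never awaited when
    # asyncio.run() refuses to start, mirroring the "garbage-collected
    # unawaited" symptom in the real logs.
    asyncio.run(asyncio.sleep(0))

async def lifespan():
    # Simulates calling init_db() from inside an already-running loop.
    try:
        sync_migration()
        return "ok"
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"

result = asyncio.run(lifespan())
print(result)  # a RuntimeError complaining about the running event loop
```

The exact wording of the RuntimeError varies by Python version, but the class and the cause are the same as in the crash log above.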

Fix (PR #42): Wrap the blocking command.upgrade() call in asyncio.to_thread(), which runs it in a thread pool executor outside the event loop:

# Before
def init_db():
    command.upgrade(alembic_cfg, "head")

# After
async def init_db():
    await asyncio.to_thread(command.upgrade, alembic_cfg, "head")
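A minimal sketch of why this works, again using a trivial stand-in for command.upgrade(): asyncio.to_thread() hands the blocking call to a worker thread, and since no event loop is running in that thread, the internal asyncio.run() is free to create its own:

```python
import asyncio

def sync_migration():
    # Stand-in for command.upgrade(). In a worker thread there is no
    # running loop, so its internal asyncio.run() can start a new one.
    asyncio.run(asyncio.sleep(0))
    return "migrated"

async def init_db():
    # The fixed pattern: run the blocking sync call in the default
    # thread pool executor instead of on the event loop thread.
    return await asyncio.to_thread(sync_migration)

print(asyncio.run(init_db()))  # prints: migrated
```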

PR merged, new image built, pod restarted. Crash continued. Bug #1 was the outermost layer — fixing it exposed bug #2.


Bug #2: Vector(1536) vs. ARRAY(Float) Type Mismatch

With the event loop fix in place, Alembic now actually ran migrations. The logs showed:

Running upgrade  -> 000_initial_schema
Running upgrade 000_initial_schema -> 001_world_invite

Then the pod crashed again, exit code 3.

The migration chain uses PostgreSQL’s transactional DDL — all migrations run inside a single transaction. To see where it was failing, a debug pod was spawned with a sleep command overriding the container entrypoint, then alembic upgrade head was run manually inside it to capture the full error.

The error was:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedObject)
operator class "vector_cosine_ops" does not exist for access method "hnsw"

Migration 002_hnsw_lore_index tried to create an HNSW index on lore_entries.embedding:

op.create_index(
    "ix_lore_entries_embedding_hnsw",
    "lore_entries",
    ["embedding"],
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"embedding": "vector_cosine_ops"},
)

The vector_cosine_ops operator class is only valid when the column is type vector. But migration 000_initial_schema had defined the column as:

sa.Column("embedding", postgresql.ARRAY(sa.Float), nullable=True),

That creates a plain PostgreSQL float array, not a pgvector vector(1536). The ORM model in app/models.py correctly used Vector(1536) from pgvector, but the migration that created the table was using the wrong type.

Since the HNSW index creation fails, PostgreSQL rolls back the entire migration transaction. The database stays empty. On every subsequent start, Alembic sees no migrations applied and tries again. Same failure, every time.
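The rollback behavior can be sketched with sqlite3, which, like PostgreSQL, supports transactional DDL. This is an illustration of the mechanism only, not the actual Alembic flow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage the transaction manually
cur = conn.cursor()

cur.execute("BEGIN")
cur.execute("CREATE TABLE lore_entries (id INTEGER)")  # "migration 000"
try:
    # "migration 002": an index the column can't support
    cur.execute("CREATE INDEX ix_bad ON lore_entries (no_such_column)")
except sqlite3.OperationalError:
    cur.execute("ROLLBACK")  # the whole chain rolls back

# The table created earlier in the same transaction is gone too:
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)  # prints: []
```

This is exactly the "database stays empty, Alembic retries from zero" loop: a failure anywhere in the chain undoes every migration before it.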

Fix (PR #43): Change the embedding column definitions in 000_initial_schema to match the ORM:

from pgvector.sqlalchemy import Vector

# Before
sa.Column("embedding", postgresql.ARRAY(sa.Float), nullable=True),

# After
sa.Column("embedding", Vector(1536), nullable=True),

PR merged, new image deployed, pod restarted. Still crashing, but further along: the logs showed migration 000 running cleanly and 001 starting before the exit (Alembic prints each "Running upgrade" line before executing it). Something new was breaking.


Bug #3: server_default Double-Quoting and PostgreSQL Enum Failure

The third iteration of manual Alembic execution in the debug pod revealed the actual crash:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidTextRepresentation)
invalid input value for enum sessionstatus: "'idle'"
LINE 1: ...DEFAULT '''idle''', created_at timestamptz DEFAULT now(),...

The migration 001_world_invite defined a column with:

sa.Column(
    "status",
    sa.Enum("idle", "active", "paused", "ended", name="sessionstatus"),
    server_default="'idle'",
    nullable=False,
),

SQLAlchemy renders a string server_default as a quoted SQL literal in the DEFAULT clause of the CREATE TABLE statement. When you pass "'idle'" (a Python string that already contains single quotes), SQLAlchemy emits:

DEFAULT '''idle'''

Three single quotes on each side. The outermost quote is SQLAlchemy's own SQL string quoting; the doubled quotes inside are the literal quote characters from the Python string, escaped per SQL rules (an embedded quote is written as two). PostgreSQL therefore receives the string value 'idle', actual quote characters included, which doesn't match the enum value idle.
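The quoting rule can be verified in a few lines of Python. sql_quote here is a simplified stand-in for SQLAlchemy's literal renderer, not its actual implementation:

```python
def sql_quote(value: str) -> str:
    # SQL string literals double any embedded single quote, then wrap
    # the result in quotes -- a simplified version of what SQLAlchemy
    # does when rendering a string server_default.
    return "'" + value.replace("'", "''") + "'"

print(sql_quote("idle"))    # prints: 'idle'      -> DEFAULT 'idle'
print(sql_quote("'idle'"))  # prints: '''idle'''  -> DEFAULT '''idle'''
```

Passing the already-quoted string produces exactly the DEFAULT '''idle''' seen in the error.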

This broke the CREATE TABLE and caused PostgreSQL to roll back the whole migration transaction. Bug #3 was the one keeping the database completely empty on every boot.

The affected migrations were 000_initial_schema, 001_world_invite, and 008_session_planning. All had pre-quoted server_default values like "'idle'", "'lore'", "'pending'".

Fix (PR #44): Remove the pre-quoting — SQLAlchemy handles it:

# Before (broken)
server_default="'idle'",

# After (correct)
server_default="idle",

Applied across all three affected migrations. Committed, PR opened, CI green, merged.


The Recovery

After PR #44 merged and the build+deploy pipeline completed, the pod came up:

NAME                           READY   STATUS    RESTARTS   AGE
dnd-backend-7f9d4b8c6-xk2p9   1/1     Running   0          2m

The startup logs showed all nine migrations running to completion in order:

Running upgrade  -> 000_initial_schema
Running upgrade 000_initial_schema -> 001_world_invite
Running upgrade 001_world_invite -> 002_hnsw_lore_index
Running upgrade 002_hnsw_lore_index -> 003_session_tracking
Running upgrade 003_session_tracking -> 004_world_enhancements
Running upgrade 004_world_enhancements -> 005_lore_system
Running upgrade 005_lore_system -> 006_npcs
Running upgrade 006_npcs -> 007_quests
Running upgrade 007_quests -> 008_session_planning
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Running \dt in psql directly against the database confirmed all 18 tables were present.


Why All Three Bugs Coexisted

Each bug masked the next one:

  1. Bug #1 (asyncio.run() inside running loop) — The exception was thrown before Alembic ran a single migration. From the outside it looked like a generic startup failure.

  2. Bug #2 (wrong column type in migration 000) — Only visible once bug #1 was fixed and Alembic actually started. The migration transaction rolled back at migration 002 — but the error message referenced an HNSW index, not the initial schema, making the root cause non-obvious.

  3. Bug #3 (double-quoted server_default) — Only visible after fixing bug #2 pushed the failure point into migration 001. PostgreSQL’s invalid input value for enum error is unusually clear, but it only appears in Alembic’s stdout, which is hard to capture with kubectl logs when a CrashLoopBackOff pod exits within about 3 seconds.

The 3-second exit window made each iteration painful. Getting the error message required either running a debug pod with a sleep override, or catching the logs at exactly the right moment. A startup probe with a longer initial delay would have helped with visibility during the investigation.
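A sketch of what such a probe could look like on the dnd-backend container. Every value here is illustrative, including the /healthz path, which assumes the app exposes a health endpoint:

```yaml
# Illustrative only: gives the container up to ~5 minutes
# (30 failures x 10s) to finish migrations before liveness
# checks are allowed to kill it.
startupProbe:
  httpGet:
    path: /healthz   # assumed health endpoint; adjust to the app
    port: 8000
  failureThreshold: 30
  periodSeconds: 10
```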


Rollback Attempt and Why It Failed

Before finding bug #2, a rollback to the previous image (sha-6c9c12b, pushed 7 minutes before the current image) was attempted. It crashed with the same error. Rolling back further to the last pre-March-2 image (sha-74dfd42, from February 28) failed differently — an ImportError from a restructured User model that the older code didn’t know about.

This is the category of migration bug where rollbacks don’t save you: the bug is in the migration files themselves, not in the application code, and the migration files only go back to the beginning. Fixing the path forward was the only option.


Kubectl Commands That Surfaced the Most Information

A few commands that cut through the noise during this session:

# Get the full Alembic error without the 3-second race
kubectl patch deployment dnd-backend -n dnd \
  --patch '{"spec":{"template":{"spec":{"containers":[{"name":"dnd-backend","command":["sleep","infinity"]}]}}}}'
kubectl exec -n dnd <pod> -- bash -c "cd /app && alembic upgrade head 2>&1"

# Check migration state in the DB directly
kubectl exec -n dnd postgres-0 -- psql -U dndapp -d dndmulti \
  -c "SELECT version_num FROM alembic_version;"

# Confirm all tables after recovery
kubectl exec -n dnd postgres-0 -- psql -U dndapp -d dndmulti \
  -c "SELECT tablename FROM pg_tables WHERE schemaname='public' ORDER BY tablename;"

Patching the deployment to run sleep infinity instead of the actual entrypoint was the most valuable technique during this investigation. It lets you run commands inside a pod that has all the correct environment variables, secrets, and network access — without the race against the container exit timer.


Final Outcome

All seven issues resolved:

| Issue | Resolution |
|---|---|
| Longhorn replica fault | Self-healed to 3 replicas |
| Prometheus 1/2 Running | Recovered behind Longhorn |
| Cardboard stale Job | kubectl delete job postgres-backup -n cardboard |
| livekit liveness probe | Transient — self-resolved |
| Longhorn instance-manager timeout | Transient — self-resolved |
| Plex readiness probe | Transient — self-resolved |
| dnd-backend CrashLoopBackOff | 3 PRs: #42, #43, #44 — 9 migrations, 18 tables ✓ |

Three PRs, four hours, zero manual code written by the human. The AI identified the bugs, authored the fixes, monitored CI, and knew when a rollback wasn’t going to work. The human’s job was to approve the PRs and confirm the pod came up healthy.


Lessons for the Docs

A few things worth adding to the cluster runbook:

On server_default in Alembic migrations: Never pre-quote string values. server_default="idle" not server_default="'idle'". SQLAlchemy adds the SQL quoting. This applies to any string: enum values, literal strings, anything that isn’t a SQL function call.

On overlapping bugs causing CrashLoopBackOff: When kubectl logs shows a crash at a different point on each deploy, suspect the fix exposed a new bug rather than being incomplete. Layer the fixes, don’t iterate on the same hypothesis.

On debug pods vs. log tailing: For any container that exits in under 5 seconds, patch the deployment to sleep infinity and exec in. Log tailing at CrashLoopBackOff timing windows is unreliable.

On ECR namespace coverage: The ECR refresh CronJob in kube-system must be updated whenever a new application namespace is created. Add the namespace check to the new-service-checklist.