TL;DR

A 109-day project plan. One day of actual work. Eight hours of active pipeline time. The key was treating planning and implementation as two separate AI-driven phases: spend an evening getting the plan right by routing it through multiple models, then let Claude Sonnet 4.6 implement it autonomously overnight via GitHub Copilot’s cloud agent while you sleep.

This is the full playbook — planning phase included.

The Project

This came out of building dnd-multi, a full-stack AI Dungeon Master platform: FastAPI backend, Next.js 15 frontend, a Discord bot, LiveKit voice, and AWS Bedrock integration. Seven feature phases, a plan projected to take until June 19.

It shipped March 2. All seven phases. In a single calendar day.

Phase 0: Multi-Model Planning

Before any code is written, the plan has to be right. A bad plan handed to an autonomous agent produces a lot of very confident wrong code. The planning phase is where you spend the most time — and where multiple AI models are more valuable than one.

The process I used:

Step 1 — Human outline. I wrote the initial project outline: vision, feature list, subsystems, rough phase structure. Nothing a model generated. This is the artifact that everything else bounces off of.

Step 2 — ChatGPT pass. Uploaded the outline and the existing codebase structure to ChatGPT. Asked it to evaluate the plan against the repo and identify gaps: missing acceptance criteria, underspecified features, dependencies I hadn’t called out. ChatGPT is good at this kind of structured gap analysis — it thinks in checklists and surfaces missing preconditions.

Step 3 — Multi-model review panel. Routed the ChatGPT-refined plan through three more models, each with the same prompt: “Evaluate this project plan and the existing repo. What’s missing, what’s wrong, what’s underspecified?”

| Model | Contribution |
|---|---|
| GitHub Copilot (Claude Sonnet 4.6) | Caught import path and Pydantic v2 migration gaps; identified the `asyncio.get_event_loop()` deprecation as a blocking issue |
| Gemini Pro 2.5 | Identified missing CI path filters — the `on.push.paths` issue that would cause CI to not gate some PRs |
| GPT o3 (Codex) | Flagged the `--workers 2` Uvicorn config as incompatible with in-process WebSocket state; caught the missing WorldState seed on campaign create |

Each model saw the plan and the repo. Each found different things. The overlap was small — which means running all three was worth it.

Step 4 — Claude Opus 4.6 plan finalization. With all the gap analysis in hand, I handed the full accumulated context to Claude Opus 4.6 and asked it to produce the final execution plan: phase-by-phase feature breakdown, acceptance criteria per issue, dependency chain, migration strategy, and phase exit gates. Opus is the right model for this — it’s slower and more expensive, but it synthesizes across a large context better than any of the others and produces more structured output.

The output was `docs/project-plan.md` — a 506-line document specifying every phase, every gate check, every file to touch, and every dependency. This is what the implementation agent would execute against.

Step 5 — Sleep. Claude Sonnet 4.6 executed the plan overnight.


The Implementation Workflow

With the plan finalized, the implementation phase is a mechanical execution loop. The actors:

| Role | Tool | What it does |
|---|---|---|
| Implementer | GitHub Copilot cloud agent (`@copilot`) | Reads issues, writes code, pushes `copilot/<branch>` PRs |
| Director + Reviewer | Claude Sonnet 4.6 in VS Code | Writes issues, reviews diffs, corrects mistakes, commits, merges |
| Human | Me | Reviewed the plan the night before; woke up to 24 merged PRs |

The key insight: Copilot’s cloud agent is good at bounded, well-specified tasks. It has no memory of your project’s conventions, API shapes, or past decisions. Claude in VS Code knows all of that from `copilot-instructions.md` and the full codebase context. The workflow transfers that context to the cloud agent through surgical GitHub issues.
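As a concrete illustration, the conventions that live in `copilot-instructions.md` might look like this — a hypothetical excerpt assembled from the conventions mentioned later in this post, not the repo's actual file:

```markdown
<!-- .github/copilot-instructions.md (illustrative excerpt) -->
# Project conventions
- Pydantic v2 only: use `.model_dump()`, never `.dict()`
- All database calls are async: `await db.execute(...)`
- Use relative imports inside `backend/app/`
- Tests follow the shared mock patterns in `backend/tests/`
```

The cloud agent never sees this file directly; its contents get smuggled into each issue body as explicit field names and file paths.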

Step 1: Write a Surgical Issue

The GitHub Copilot agent picks up issues assigned to @copilot. What it produces is directly proportional to how specific the issue is.

Issue that produces bad code:

```text
Add turn tracking to the game session
```

Issue that produces usable code:

```markdown
## Goal
Parse `[TURN: <user_id>]` directives from Claude's DM response and set
`GameSession.active_player_id` accordingly, enabling the WebSocket hub to
enforce whose turn it is to act.

## Acceptance Criteria
- [ ] `DMResponse` Pydantic model gains `next_player_id: Optional[str]`
- [ ] `dm_engine.py` regex parser extracts `[TURN: <id>]` and strips it from narration text
- [ ] After a DM turn, backend updates `game_session.active_player_id` via
      `await db.execute(update(GameSession)...)`
- [ ] Unit test added asserting directive is parsed and stripped correctly

## Files to Touch
- `backend/app/ai/dm_engine.py` — directive parsing + session update
- `backend/tests/test_dm_engine.py` — new unit test

## Depends On
#6 (merged — `active_player_id` column added to `game_sessions`)
```

The Depends On line is critical. It tells Copilot what API shapes already exist. Without it, Copilot may re-implement a model that already exists, or write code against a DB column it doesn’t know was just added.
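As a sanity check on what those acceptance criteria actually ask for, here is a minimal sketch of the parse-and-strip behavior. The names (`DIRECTIVE_RE`, `ParsedResponse`, `parse_dm_response`) are illustrative, not the repo's actual API:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches [TURN: <user_id>] anywhere in the DM's response text.
DIRECTIVE_RE = re.compile(r"\[TURN:\s*(?P<user_id>[^\]]+)\]")

@dataclass
class ParsedResponse:
    narration: str                 # DM text with the directive stripped
    next_player_id: Optional[str]  # user id from [TURN: ...], if present

def parse_dm_response(raw: str) -> ParsedResponse:
    """Extract a [TURN: <user_id>] directive and strip it from the narration."""
    match = DIRECTIVE_RE.search(raw)
    next_player_id = match.group("user_id").strip() if match else None
    narration = DIRECTIVE_RE.sub("", raw).strip()
    return ParsedResponse(narration=narration, next_player_id=next_player_id)
```

The real `dm_engine.py` would also persist `active_player_id` via the async UPDATE; this sketch covers only the parsing criteria.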

Creating the issue from Claude in VS Code:

```bash
gh issue create \
  --repo owner/repo \
  --title "Phase 1.1b: Add [TURN: player_id] directive parser to DM engine" \
  --body "$(cat /tmp/issue_body.md)" \
  --assignee "@copilot" \
  --label enhancement
```

After creation, GitHub Copilot typically spins up a branch named `copilot/<kebab-title>` within a few minutes and auto-opens a draft PR.

Step 2: Inspect the Diff

Don’t merge the Copilot branch directly. Review it:

```bash
git fetch origin copilot/<branch-name>
git diff main..origin/copilot/<branch-name> --stat
git diff main..origin/copilot/<branch-name> -- backend/app/ai/dm_engine.py
```

Copilot will be ~80% correct on well-specified issues. Common problems I caught:

- Pydantic v1 `.dict()` calls — project uses Pydantic v2, needs `.model_dump()`
- Missing `await` — forgot to make a DB call async
- Import paths wrong — used absolute import where the project uses relative
- Directive not stripped from narration — regex extracted but didn’t remove the tag, so players see raw `[TURN: abc123]` in chat
- Tests not importing the right modules — test file used a different mock pattern than the rest of the test suite

The review step is where the local LLM earns its keep. Claude sees the full codebase context, knows exactly what Pydantic version is in use, knows the existing test patterns, and can evaluate the diff against all of that.

Step 3: Implement Correctly on a Feature Branch

Whether you take Copilot’s version verbatim or rewrite it, the output lives on your own branch — never on the copilot/ branch directly:

```bash
git checkout main && git pull
git checkout -b feat/phase1-turn-directive
```

For the cases where Copilot’s implementation is clean, it’s just a file copy from the fetched remote branch. For the cases where it needed correction, Claude makes the fixes directly. Either way, the commit references the issue:

```bash
git add backend/app/ai/dm_engine.py backend/tests/test_dm_engine.py
git commit -m "feat: implement [TURN:] directive parser in DM engine

- DMResponse gains next_player_id: Optional[str]
- _parse_response handles [TURN: <user_id>], strips from narration
- process_action updates GameSession.active_player_id via async UPDATE
- System prompt updated with TURN MANAGEMENT instructions
- 6 unit tests: parse, strip, absent, UUID, mixed directives

Closes #8"
git push -u origin feat/phase1-turn-directive
```

Step 4: Open a PR and Gate on CI

```bash
gh pr create \
  --repo owner/repo \
  --title "feat: Phase 1.1b — [TURN:] directive parser in DM engine" \
  --body "$(cat /tmp/pr_body.md)" \
  --base main \
  --head feat/phase1-turn-directive
```

Then wait for CI. The key rule: not all PR types need a CI gate.

| Change type | Wait for CI? | Why |
|---|---|---|
| Backend Python | Yes — always | pytest runs on every backend PR |
| Frontend TypeScript/TSX | No — merge immediately | Frontend CI is build-only, no test gate |
| Discord bot Python | No — merge immediately | Bot CI is build-only |
| Docs-only | No | No workflow |
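The gating rule above reduces to a one-line predicate. This sketch makes it explicit — the function name and path prefixes are assumptions about the repo layout, not confirmed:

```python
def needs_ci_gate(changed_paths: list[str]) -> bool:
    """Wait for CI only when backend Python changed; everything else merges immediately."""
    return any(
        path.startswith("backend/") and path.endswith(".py")
        for path in changed_paths
    )
```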

Poll from Claude:

```bash
sleep 45 && gh run list --repo owner/repo \
  --branch feat/phase1-turn-directive \
  --limit 1 --json databaseId,status,conclusion,name 2>&1 | cat

# If still running:
gh run view <run_id> --repo owner/repo --json status,conclusion 2>&1 | cat
```

Step 5: Squash Merge and Move Immediately

Once CI is green:

```bash
gh pr merge <pr_number> --repo owner/repo --squash --delete-branch 2>&1 | cat
git checkout main && git pull
git checkout -b feat/phase1-next-feature
```

Don’t pause between PRs. The moment main is updated, start the next branch. Claude tracks what’s been merged and what Copilot issues are open — the next issue body can reference the just-merged PR number immediately.

The gh CLI Non-Negotiables

These two habits prevent hours of lost time:

1. Always append `2>&1 | cat`

```bash
# This hangs in VS Code's terminal — opens a pager that never clears
gh pr list --repo owner/repo

# This works every time
gh pr list --repo owner/repo --json number,title,state 2>&1 | cat
```

VS Code’s integrated terminal looks enough like a TTY that `gh` hands its output to an interactive pager, which an automated session can never dismiss. Without `| cat`, every `gh` command that produces output blocks indefinitely.

2. `export GH_PAGER=cat` at session start

Set this at the top of a long session and stop thinking about it:

```bash
export GH_PAGER=cat
```

What Copilot Is Good At vs. Not

After 41 PRs with this workflow, the pattern is clear:

Use the Copilot agent for:

- Adding a field to a Pydantic model + wiring it through an API endpoint
- Implementing a new parser or regex extraction pattern
- Writing unit tests for a well-defined function
- Adding a new async route handler following existing patterns
- CRUD operations on a new DB model

Implement directly (skip the agent) for:

- Cross-cutting refactors touching more than 5 files
- Infrastructure manifests (K8s YAML, Terraform, Ansible)
- Anything that requires understanding the full system state
- Frontend components with complex state — Copilot’s Next.js 15 App Router instincts are patchy
- Fixes that need to go out fast — time spent reviewing the agent’s work can exceed direct implementation

In practice, Copilot handled ~60% of backend PRs and 0% of frontend, Discord bot, and infrastructure PRs.

Dependency Chaining

This is where the workflow is fragile if you’re not careful. Features build on each other. If two issues both touch `dm_engine.py`, and the second one is created before the first is merged, Copilot implements against the old API shape.

The rule: one open Copilot issue per logical feature area at a time. Never create issue N+1 in a file until issue N is merged.

The `Depends On: #<N> (merged)` line in every issue body signals to Claude (when writing the next issue) what API shapes have definitively landed. Claude uses this to write accurate file references and exact field names in the next issue — which directly improves Copilot’s output quality.
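The dependency half of this rule can be expressed as a tiny gate. This helper is illustrative only — the issue dicts and merged-set shapes are invented for the example, not a real tracker API — and it doesn't enforce the separate one-issue-per-feature-area constraint:

```python
def ready_for_copilot(issues: list[dict], merged: set[int]) -> list[dict]:
    """An issue may be assigned to the agent only once every dependency has merged."""
    return [
        issue for issue in issues
        if all(dep in merged for dep in issue.get("depends_on", []))
    ]
```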

Velocity

From dnd-multi v1.0 (7 phases, all shipped 2026-03-02):

| Metric | Projected | Actual |
|---|---|---|
| Calendar days | 109 days | 1 day |
| Completion date | 2026-06-19 | 2026-03-02 |
| PRs merged | | 24 |
| Lines of code added | | 4,654 |
| Files changed | | 104 |
| Estimated human engineer cost | $175,490 | |
| Actual cost (Copilot sub + AI usage) | | $1,439 |
| Cost per PR | $7,312 | $60 |

Active pipeline window (first PR merged → last): 8 hours, 1 minute. The largest bottleneck was CI/CD wait time — ~5.5 of those 8 hours was GitHub Actions running backend builds.

The bottleneck was never implementation speed. It was plan quality and CI wait times. Every minute spent on the planning phase the night before translated to significantly less rework during implementation.

Common Pitfalls

Copilot branch appears, no PR auto-opens. Sometimes the agent pushes the branch but doesn’t open a PR. Check with `gh pr list --state open` — open the PR manually against the `copilot/<branch>` head if it’s missing.

Copilot uses a stale API shape. If you merged a schema change 10 minutes ago, Copilot may not have indexed it. Always specify new field names explicitly in the issue body — don’t say “use the new field”, say “use `GameSession.active_player_id: Optional[str]`”.

Two issues race on the same file. If both get picked up simultaneously (rare but happens), the second commit will conflict with or silently overwrite the first. Catch this in the diff review step and resolve before opening your feature branch PR.

CI run never appears. GitHub Actions only triggers against path filters in `on.push.paths`. If a PR branch has no matching paths, no run fires at all — and that’s fine: it means no backend code changed and you can merge immediately.
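For reference, a path-filtered trigger has this general shape — illustrative only, since the repo's actual workflow file isn't shown in this post:

```yaml
# Only backend changes fire this workflow; PRs touching nothing
# under backend/ produce no run at all.
on:
  push:
    paths:
      - "backend/**"
  pull_request:
    paths:
      - "backend/**"
```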

`gh pr merge` returns “already merged” on a valid PR. Usually means your local view of the remote is stale. Run `git fetch --prune` and recheck.

The Bigger Pattern

There are actually two handoff protocols here, not one.

The first is between models during planning: human outline → gap analysis across multiple models → a single model synthesizing into an execution plan. Each model contributes something different. No single model catches everything. The output isn’t a conversation — it’s a document, a structured artifact that the next step in the chain consumes.

The second is between the local LLM and the cloud agent during implementation: Claude in VS Code writes issues that encode enough project context for the cloud agent to produce correct code. The PR diff is the verification layer. The CI gate is the safety net.

Neither handoff requires special tooling. The planning phase uses any chat interface that accepts file uploads. The implementation phase uses gh CLI and standard GitHub features. The “workflow” is issue structure and commit convention.

The thing that makes it work is the plan. A vague plan handed to an autonomous implementation agent produces autonomous chaos. A plan with exact file paths, exact function signatures, exact acceptance criteria, and explicit dependency ordering produces 24 merged PRs in 8 hours.

Spend the evening on the plan. Sleep through the implementation.