Drive watch — end-to-end test loop runbook

Why this runbook exists. Every Phase 1–3.1 PR shipped with unit tests against a mocked GoogleDriveClient. Each one was "green" in CI. Two of them shipped Drive-API-assumption bugs that only surfaced in production:

PR before V015: assumed resource_id == file_id. Bug surfaced on every push.

PR #125 fixed one instance: the per-file workflow dispatch was missing its task_runs insert. Bug surfaced on the first reconciliation dispatch.

The bug drive-watch-test-loop was created to catch: comments.list returns top-level comments only, not thread replies. Bug surfaced on the second user comment ever.

Mocked Drive clients cannot catch "I made a wrong assumption about Drive's API." The only test that catches it is one that hits real Drive. This runbook is mandatory for any PR that touches the files listed in Hard Rule 14 (.context/AI_RULES.md).

When to run this loop

Run before opening a PR (or before declaring a fix verified) for any change that touches:

internal/backend/.../shared/clients/GoogleDriveClient.kt — listFileComments, listChanges, getStartPageToken, createChangesWatchChannel, createWatchChannel
internal/backend/.../agents/spec/temporal/DriveChangesProcessActivityImpl.kt
internal/backend/.../agents/spec/temporal/DriveReconciliationActivityImpl.kt
internal/backend/.../agents/spec/temporal/DriveWatchRenewalActivityImpl.kt
internal/backend/.../agents/spec/DriveWatchBootstrap.kt
internal/backend/.../shared/clients/DriveWatchRepository.kt — query methods
internal/backend/.../shared/clients/ProcessedCommentsRepository.kt
internal/dispatcher/.../webhooks/DriveWebhookHandler.kt
internal/dispatcher/.../db/HikariDatabaseAccess.kt — findFileIdByChannelId, findDriveWatchByChannelId
internal/backend/.../agents/common-tools/temporal/RequestModelResponseTool.kt — secondary-workflow spawn (Variant C)
internal/backend/.../agents/common-tools/docs/workflow/ExecuteInlineEditsWorkflowTool.kt — /inline edit-scope behaviour (Variant C)
internal/backend/.../agents/shared/executor/SystemPromptAssembler.kt — commentRoutingSection, COMMENT_THREAD_ETIQUETTE (Variant C)
internal/backend/.../agents/spec/temporal/SpecAgentWorkflow.kt — SpecAgentRequest, DriveContext (Variant C)
internal/backend/.../agents/spec/temporal/SpecAgentActivityImpl.kt — runAgent modelOverride / driveContext threading (Variant C)

Pre-conditions

Requirement	How to verify
Chrome with Claude in Chrome extension	`chrome://extensions` → "Claude in Chrome" present and enabled
Extension paired with this Claude session	Run `mcp__Claude_in_Chrome__list_connected_browsers` — must return ≥1 browser
Site access for `docs.google.com`	Extension → Details → Site access → "On all sites" OR "On specific sites" with `https://docs.google.com/` listed. "On click" is NOT enough: each automated nav re-prompts.
`vivek.soneja@aucert.ai` (or another `aucert.ai` member) signed into Chrome	Page title bar shows the right account
`atlas-agent@aucert.dev` has Edit access to the test doc	Open Share dialog on the doc; confirm `atlas-agent@aucert.dev` listed as Editor
Only ONE tab has the test doc open (the MCP-driven tab)	Close any other tab with the doc URL — Google enforces single-session-per-doc and the LAST opener wins; if your primary Chrome window has the doc tab, the MCP tab will be signed out mid-session with the "You have been signed out. Sign into your account from another tab" overlay. Symptoms: doc title bar shows "Trying to connect…", then a sign-out modal blocks interaction. Recovery: close every other tab with this doc, then have the MCP nav-reload.
`kubectl` configured for `aucert-aks`	`kubectl config current-context` returns `aucert-aks`
Worker is on the code under test	`kubectl get pod -n internal-platform -l app=spec-agent-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'` matches the image you just deployed

The test doc

Use exactly this doc; do not pick a different one without a strong reason.

URL: https://docs.google.com/document/d/1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58/edit
Drive file ID: 1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58
Title: "Test Agent Document"
Maintenance: this doc is the canonical test fixture. Don't repurpose it for unrelated work; don't rename it. If it gets cluttered with old test comments, resolve them in bulk rather than deleting the doc.

The loop

Variant A — top-level comment (the headline path)

This is the path that worked even before the listFileComments fix. It catches every other Drive-watch bug class (resource_id translation, page-token advance, task_run insertion, workflow dispatch).

Open the doc (Chrome MCP):

tabs_context_mcp { createIfEmpty: true }
navigate { url: "https://docs.google.com/document/d/1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58/edit", tabId: <fresh tab> }

Map the UI (read_page with filter: "interactive"). Capture:
- The toolbar "Add comment (⌘+Option+M)" button — typically labelled exactly that.
- The top-right "Show all comments N new comments" button.
Post a top-level comment with a unique marker:
```
[trigger-test] ts=<unix epoch seconds at the moment of posting>
```
The timestamp lets you tell "atlas's reply to THIS test" apart from "atlas's reply to a previous test." If you want a specific model label, append [gpt] or similar — but the test should NOT depend on operator-tag routing working (that's a separate code path; default-model [S46] reply is the success criterion).

To post: click in the doc body → click the toolbar "Add comment" button → type the marker → click "Comment" submit button. Or use ⌘+Option+M as a fallback.

Force-trigger reconciliation (skips the up-to-15-min wait for the next cron tick):

kubectl exec -n temporal $(kubectl get pod -n temporal -l app.kubernetes.io/component=admintools -o jsonpath='{.items[0].metadata.name}') -- \
  temporal schedule trigger \
    --address temporal-frontend.temporal.svc.cluster.local:7233 \
    --namespace aucert-default \
    --schedule-id drive-reconciliation-cron

Wait ~60 seconds for the full chain to complete: the activity walks changes.list, lists comments, calls markProcessed, and dispatches the per-file workflow; atlas then reads the doc and posts a reply. The 60s budget covers the activity (~1-3s) plus the SpecAgentWorkflow's full run (LLM call + Drive write).
Verify: re-read_page the test doc, find the comments-sidebar count badge. Healthy outcome:
- Badge label changed from "Show all comments 0 new comments" to "Show all comments 1 new comments" (or +N).
- Click the sidebar; read its body. Expect at least one comment whose body contains both the timestamp marker AND a model-label prefix like [S46], [O47], [K26], or [G54]. That's atlas's reply.
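The marker and verification steps above can be sketched as a small helper. This is a minimal illustration, not part of the tooling: `make_marker` and `is_atlas_reply` are hypothetical names, and the label list is just the four prefixes named in this runbook.

```python
import time

# Model-label prefixes named in this runbook; extend if new labels appear.
MODEL_LABELS = ("[S46]", "[O47]", "[K26]", "[G54]")


def make_marker(now=None, kind="trigger-test"):
    """Build a unique marker such as '[trigger-test] ts=1700000000'."""
    ts = int(time.time() if now is None else now)
    return f"[{kind}] ts={ts}"


def is_atlas_reply(comment_body, marker):
    """True when the body carries both this test's marker and a model label."""
    has_label = any(label in comment_body for label in MODEL_LABELS)
    return marker in comment_body and has_label
```

Generate the marker once, post it, and reuse the same string when scanning the sidebar text, so a reply to an earlier test run is never mistaken for a reply to this one.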

Failure recovery — if no atlas reply within 60s, check:

kubectl logs -n internal-platform -l app=spec-agent-worker --since=5m | grep "drive-changes-process\|drive-reconciliation"

Expect lines like:

drive-reconciliation: found 1 active changes-mode watch(es); starting per-watch process workflows
drive-changes-process: channel=<uuid> changes_seen=N candidate_files=M comments_seen=K comments_deduped=L dispatched=P ...

Diagnostic decision tree:

What the log shows	What's broken
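No `drive-reconciliation` log line at all	Schedule isn't ticking; check `temporal schedule describe --schedule-id drive-reconciliation-cron` (run it from the same admintools pod, with the same `--address` and `--namespace` as the trigger command)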
`drive-reconciliation: found 0 active changes-mode watch(es)`	The bootstrap watch is missing; check `drive_watches` table for a `scope='changes'` row
`drive-changes-process: ... changes_seen=0`	Drive's changes feed has nothing new — page token may have already advanced past your test comment. Roll back `start_page_token` in `drive_watches` and re-trigger.
`drive-changes-process: ... comments_seen>0 dispatched=0 skipped_no_new_comments>0`	Either every comment was already dedup'd (delete `processed_comments` rows for the test file_id and re-trigger), OR the comment-walking logic is wrong.
`drive-changes-process: ... dispatched>0` but no atlas reply	The per-file `SpecAgentWorkflow` failed. Check Temporal UI for `spec-agent-drive-1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58`.
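The decision tree above can be applied mechanically to the grep output. The sketch below assumes the log lines look like the examples in this runbook (space-separated `key=value` counters); `parse_counters` and `diagnose` are hypothetical helper names, and the returned strings only paraphrase the table.

```python
import re


def parse_counters(line):
    """Extract integer key=value counters; non-numeric values (channel=<uuid>) are skipped."""
    return {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)(?=\s|$)", line)}


def diagnose(log_text):
    """Map worker log output onto the rows of the diagnostic table."""
    lines = log_text.splitlines()
    recon = [l for l in lines if "drive-reconciliation" in l]
    process = [l for l in lines if "drive-changes-process" in l]
    if not recon:
        return "schedule not ticking"
    if any("found 0 active changes-mode watch" in l for l in recon):
        return "bootstrap watch missing"
    if not process:
        return "no drive-changes-process line yet"
    c = parse_counters(process[-1])  # the most recent run is the relevant one
    if c.get("changes_seen", 0) == 0:
        return "page token may have advanced past the test comment"
    if c.get("dispatched", 0) > 0:
        return "dispatched: inspect the per-file SpecAgentWorkflow"
    if c.get("comments_seen", 0) > 0 and c.get("skipped_no_new_comments", 0) > 0:
        return "all comments deduplicated, or comment walking is wrong"
    return "unclassified"
```

Pipe the output of the `kubectl logs ... | grep` command into a file and pass its contents to `diagnose`. An "unclassified" result means the log shape is not one the table covers, which is itself worth recording here.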

Variant B — thread reply (the bug `drive-watch-test-loop` was created for)

This is the path that was broken until the listFileComments fix in PR #136. Drive distinguishes top-level Comments from Replies-within-threads; comments.list only returns top-level. The fix walks each comment's replies[] array and emits one synthetic dedup ID per reply.
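To make the shape of that fix concrete, here is a minimal sketch of the walk over a comments.list response. It assumes the response dicts follow the Drive v3 comment resource (`id`, with an optional nested `replies` list whose items also have `id`). The `<comment_id>:reply:<reply_id>` format is a made-up illustration; the real synthetic ID format lives in GoogleDriveClient.listFileComments and may differ.

```python
def dedup_ids(comments):
    """Flatten top-level comments and their replies into one list of dedup IDs.

    The reply ID format used here is hypothetical, chosen only so that a reply
    can never collide with a top-level comment ID.
    """
    ids = []
    for comment in comments:
        ids.append(comment["id"])
        # 'replies' is absent unless the request's fields mask asks for it.
        for reply in comment.get("replies", []):
            ids.append(f"{comment['id']}:reply:{reply['id']}")
    return ids
```

If the fields mask omits `replies`, this walk silently degrades to top-level comments only, which is exactly the pre-fix behaviour this variant is designed to detect.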

Pre-condition: Variant A has run successfully and atlas posted a reply to your top-level comment. That created a thread.
Open the comments sidebar — click the top-right "Show all comments N new comments" button (refs vary per session; use find { query: "Show all comments" }).
Reply within the thread — click on the comment thread atlas just replied to, click the "Reply" input, type a NEW unique marker:
```
[trigger-test-reply] ts=<unix epoch seconds at the moment of posting>
```
Submit.
Force-trigger reconciliation (same command as Variant A step 4).
Wait ~60 seconds.
Verify:
- The thread now contains 4 entries: your original top-level comment, atlas's first reply, your reply, and atlas's reply to your reply. The fourth entry (atlas replying to a reply) is the success signal.
- The fourth entry must contain the new timestamp marker AND a model-label prefix.
Failure recovery for Variant B specifically:
- If drive-changes-process log shows comments_seen=N but it's the same N as before posting the reply → listFileComments is not seeing the reply → the replies field in the API request is missing. Check the setFields(...) mask in GoogleDriveClient.listFileComments.
- If comments_seen increased but comments_deduped ate everything → the synthetic dedup ID format may have changed and old processed_comments rows are colliding. Inspect rows with WHERE file_id = '1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58'.

Variant C — multi-model routing (covers PR #141 / #142 / #144)

This variant validates the secondary-workflow spawn path that was broken before PR #141: /models <other-model> and /models <self>,<other> should produce real Drive replies from the secondary, not silent completions in Temporal. It also exercises the /inline scope-match fix (PR #142) and the tertiary-spawn guard (PR #144).

Pre-conditions on top of Variants A/B: agent token vault must have anthropic-direct-api-key populated if [opus-direct]/[sonnet-direct] are tested, AND Bedrock opus access must be working on the AWS account (separate infra fix — skip /models opus if still failing on Bedrock).

C-1 — Case C (operator picks one non-self model)

Post a top-level comment via Chrome MCP: @atlas-agent /models kimi /readonly_qa what is this doc about?
Wait ≤ 90 s. Expect TWO replies in the thread:
- First reply (primary sonnet, label [S46]): a brief ack like Spawned kimi; the [K26] response will follow shortly. — and nothing substantive.
- Second reply (secondary kimi, label [K26]): a substantive answer to the operator's question.
Pass criteria (each one independent — if any fails, the secondary path is regressing):
- The [K26] reply must arrive within 90 s of the [S46] ack — silent completion of the secondary is the symptom of the bug PR #141 fixed.
- The [K26] reply must be substantive (not another ack) — operator's question was answered.
- The [S46] ack must NOT include any analysis or workflow tool output — Case C primary should only route.
- The secondary's worker log line Spawned secondary workflow must appear exactly once for kimi; no tertiary spawn attempt.

SQL spot-check:

SELECT id, run_type, input_payload->>'source' AS source, outcome
FROM task_runs
WHERE input_payload->>'comment_id' = '<comment_id>'
ORDER BY started_at DESC;

Expect exactly one secondary row with source='secondary_model_response' and outcome='success'. Pre-fix this row was missing the comment_id field entirely.
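If you export the query result (for example as a list of dicts), the expectation above reduces to a short check. This is a sketch under that assumption; `secondary_row_ok` is a hypothetical helper, and the column names mirror the SELECT list in the query.

```python
def secondary_row_ok(rows):
    """True when exactly one secondary row exists and it succeeded."""
    secondary = [r for r in rows if r.get("source") == "secondary_model_response"]
    return len(secondary) == 1 and secondary[0].get("outcome") == "success"
```

Zero secondary rows points at the pre-fix symptom (the row not matching on comment_id); two or more suggests a duplicate spawn.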

C-2 — Case B (self + one other model)

Post: @atlas-agent /models sonnet,opus /readonly_qa <question>
Expect TWO substantive replies — [S46] and [O47] — both answering the question (Case B, both YOUR MODELs are in the tagged set).
Pass: both labels post real answers; neither is silent.
Fail mode to look for: only one of the two replies arrives → either the primary mishandled Case B, or the secondary regressed back to silent completion.

C-3 — `/inline` scope-match (covers PR #142)

Identify a short doc passage with a known typo (one or two characters).
Post: @atlas-agent /inline fix the typo "<word with typo>" → "<corrected word>"
After atlas applies the edit, read the tracking comment. It MUST contain:
- approx N chars changed (body now X chars, was Y) — N should be close to the actual typo length (a few chars), NOT the body length.
- (atlas-inline-session: atlas-inline-<uuid>; pre_edit_revision: <revision>) — parseable for /revert.
Open Drive Version History on the doc. Diff the live revision against the pre-edit revision. Pass: only the typo region differs; surrounding prose is byte-for-byte identical. Fail: surrounding sentences were re-flowed or punctuation was normalised — system prompt rule is not biting.
Revert via the tracking comment: post /revert as a reply to the [ATLAS] tracking comment. Doc body must restore exactly to the pre-edit revision.
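The two required fragments of the tracking comment can be checked with a pair of regular expressions. The patterns below are written against the wording quoted in this section and assume the session suffix is a hex UUID; `parse_tracking_comment` is a hypothetical helper, not an existing tool.

```python
import re

CHANGE_RE = re.compile(r"approx (\d+) chars changed \(body now (\d+) chars, was (\d+)\)")
SESSION_RE = re.compile(
    r"\(atlas-inline-session: (atlas-inline-[0-9a-fA-F-]+); pre_edit_revision: ([^)\s]+)\)"
)


def parse_tracking_comment(body):
    """Return the parsed fields, or None if either required fragment is missing."""
    change = CHANGE_RE.search(body)
    session = SESSION_RE.search(body)
    if not change or not session:
        return None
    return {
        "chars_changed": int(change.group(1)),
        "body_now": int(change.group(2)),
        "body_was": int(change.group(3)),
        "session": session.group(1),
        "pre_edit_revision": session.group(2),
    }
```

A None result means /revert would have nothing to parse. A `chars_changed` value close to `body_now` rather than to the typo length is the scope-match regression PR #142 addressed.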

C-4 — Tertiary spawn guard (covers PR #144)

This one can't be exercised by a real operator comment because the secondary never tries to spawn another secondary in practice. Instead, the guard is checked at deploy time via log inspection:

After Variant C-1 runs, search the worker log for any request_model_response rejected: parent task_run is itself a secondary warning. It should not appear in the happy path.
If you want to actively probe: temporarily craft a comment that the secondary's LLM is likely to interpret as needing a second opinion (e.g. @atlas-agent /models kimi /review compare your answer with sonnet). If kimi hallucinates a request_model_response call, the guard rejection MUST be logged and no tertiary workflow may appear in Temporal.
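The log inspection for C-1 and C-4 can be sketched as a scan over saved worker log text. The two message fragments are the ones quoted in this runbook; matching a spawn line by the presence of the model name on the same line is an assumption about the log format, and `spawn_guard_report` is a hypothetical helper.

```python
GUARD_WARNING = "request_model_response rejected: parent task_run is itself a secondary"
SPAWN_MESSAGE = "Spawned secondary workflow"


def spawn_guard_report(log_text, model="kimi"):
    """Count secondary spawns for a model and guard rejections in worker logs."""
    lines = log_text.splitlines()
    spawns = [l for l in lines if SPAWN_MESSAGE in l and model in l]
    rejections = [l for l in lines if GUARD_WARNING in l]
    return {"spawns": len(spawns), "guard_rejections": len(rejections)}
```

On the happy path expect `{"spawns": 1, "guard_rejections": 0}`. During the active probe, a non-zero `guard_rejections` with no tertiary workflow in Temporal is the passing outcome.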

Cleanup after a successful test

The test doc accumulates comments over time. After verifying that every variant you ran passes:

Don't delete test comments — they're audit trail. Resolve them via the comment menu (⋮ → Resolve) so they collapse out of the active sidebar but remain queryable in the doc's revision history.
Don't delete processed_comments rows for the test file_id — they're audit trail too. They will age out via Phase 3.2's deleteProcessedOlderThan cleanup once that's wired (currently manual).

Fallback — webhook mock test

Use this if Chrome MCP is broken or you can't get site access for docs.google.com. Less faithful but bulletproof.

Compose a synthetic Drive notification:

CHANNEL_ID=$(kubectl exec -n temporal admintools -- psql ... -c "SELECT channel_id FROM drive_watches WHERE scope='changes' LIMIT 1" -t)
RESOURCE_ID=$(kubectl exec ... -c "SELECT resource_id FROM drive_watches WHERE channel_id='$CHANNEL_ID'" -t)

POST to the dispatcher's webhook endpoint (note: dispatcher is internal; use a port-forward):

kubectl port-forward -n internal-platform svc/dispatcher 8080:8080 &
curl -X POST http://localhost:8080/webhooks/drive \
  -H "X-Goog-Channel-ID: $CHANNEL_ID" \
  -H "X-Goog-Resource-ID: $RESOURCE_ID" \
  -H "X-Goog-Resource-State: update" \
  -H "X-Goog-Changed: comments"

Verify a DriveChangesProcessWorkflow appears in Temporal and reaches Completed.

This validates the dispatcher → workflow → activity → token-advance chain. It does NOT validate the comments.list walk against real Drive; if your bug is in that layer (like the thread-reply bug), the webhook mock will pass while real-world is broken.
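For scripted runs, the same synthetic notification can be built with the Python standard library instead of curl. This is a sketch that mirrors the headers and path from the curl command above; `drive_headers`, `build_request`, and `send` are hypothetical names, and it assumes the port-forward to localhost:8080 is already running.

```python
import urllib.request


def drive_headers(channel_id, resource_id):
    """Headers of the synthetic Drive push notification used in the curl example."""
    return {
        "X-Goog-Channel-ID": channel_id,
        "X-Goog-Resource-ID": resource_id,
        "X-Goog-Resource-State": "update",
        "X-Goog-Changed": "comments",
    }


def build_request(base_url, channel_id, resource_id):
    """Build (but do not send) the POST to the dispatcher's webhook endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/webhooks/drive",
        headers=drive_headers(channel_id, resource_id),
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Call `send(build_request("http://localhost:8080", channel_id, resource_id))` with the values read from `drive_watches`. Like the curl version, this exercises only the dispatcher path, not the comments.list walk against real Drive.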

When you make a change to the loop

If the test doc URL changes, the Chrome MCP behaviour changes, or you discover a new failure mode worth recording: edit this doc in the same PR so the next agent inherits the improvement. The runbook is the source of truth; out-of-band tribal knowledge defeats the point.
