Drive watch — end-to-end test loop runbook
Why this runbook exists. Every Phase 1–3.1 PR shipped with unit tests against a mocked GoogleDriveClient. Each one was "green" in CI. Two of them shipped Drive-API-assumption bugs that only surfaced in production:
- PR before V015: assumed resource_id == file_id. Bug surfaced on every push.
- PR #125 fixed an instance: per-file workflow dispatch missing task_runs insert. Bug surfaced on first reconciliation dispatch.
- The bug drive-watch-test-loop was created to catch: comments.list returns top-level comments only, not thread replies. Bug surfaced on the second user comment ever.

Mocked Drive clients cannot catch "I made a wrong assumption about Drive's API." The only test that catches it is one that hits real Drive. This runbook is mandatory for any PR that touches the files listed in Hard Rule 14 (.context/AI_RULES.md).
When to run this loop
Run before opening a PR (or before declaring a fix verified) for any change that touches:
- internal/backend/.../shared/clients/GoogleDriveClient.kt: listFileComments, listChanges, getStartPageToken, createChangesWatchChannel, createWatchChannel
- internal/backend/.../agents/spec/temporal/DriveChangesProcessActivityImpl.kt
- internal/backend/.../agents/spec/temporal/DriveReconciliationActivityImpl.kt
- internal/backend/.../agents/spec/temporal/DriveWatchRenewalActivityImpl.kt
- internal/backend/.../agents/spec/DriveWatchBootstrap.kt
- internal/backend/.../shared/clients/DriveWatchRepository.kt: query methods
- internal/backend/.../shared/clients/ProcessedCommentsRepository.kt
- internal/dispatcher/.../webhooks/DriveWebhookHandler.kt
- internal/dispatcher/.../db/HikariDatabaseAccess.kt: findFileIdByChannelId, findDriveWatchByChannelId
- internal/backend/.../agents/common-tools/temporal/RequestModelResponseTool.kt: secondary-workflow spawn (Variant C)
- internal/backend/.../agents/common-tools/docs/workflow/ExecuteInlineEditsWorkflowTool.kt: /inline edit-scope behaviour (Variant C)
- internal/backend/.../agents/shared/executor/SystemPromptAssembler.kt: commentRoutingSection, COMMENT_THREAD_ETIQUETTE (Variant C)
- internal/backend/.../agents/spec/temporal/SpecAgentWorkflow.kt: SpecAgentRequest, DriveContext (Variant C)
- internal/backend/.../agents/spec/temporal/SpecAgentActivityImpl.kt: runAgent modelOverride / driveContext threading (Variant C)
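One way to decide mechanically whether a change needs the loop is to match changed paths against the file names listed above. The helper below is a sketch, not something that exists in the repo: the name touches_drive_watch is hypothetical, and it matches on base names only because the paths above elide their middle segments. It is also coarser than the list, since it ignores the per-method qualifiers (for example "query methods"), so it may ask for the loop more often than strictly required.

```shell
# Succeeds (exit 0) when any path read from stdin ends in one of the file
# names listed in "When to run this loop". Base-name matching is an
# assumption made for this sketch; the real paths are elided in the runbook.
touches_drive_watch() {
  grep -Eq '(^|/)(GoogleDriveClient|DriveChangesProcessActivityImpl|DriveReconciliationActivityImpl|DriveWatchRenewalActivityImpl|DriveWatchBootstrap|DriveWatchRepository|ProcessedCommentsRepository|DriveWebhookHandler|HikariDatabaseAccess|RequestModelResponseTool|ExecuteInlineEditsWorkflowTool|SystemPromptAssembler|SpecAgentWorkflow|SpecAgentActivityImpl)\.kt$'
}
```

A possible invocation is `git diff --name-only <base>...HEAD | touches_drive_watch && echo "run the loop"`, where `<base>` stands for whatever branch the PR targets.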
Pre-conditions
| Requirement | How to verify |
|---|---|
| Chrome with Claude in Chrome extension | chrome://extensions → "Claude in Chrome" present and enabled |
| Extension paired with this Claude session | Run mcp__Claude_in_Chrome__list_connected_browsers — must return ≥1 browser |
| Site access for docs.google.com | Extension → Details → Site access → "On all sites" OR "On specific sites" with https://docs.google.com/* listed. "On click" is NOT enough: each automated nav re-prompts. |
| vivek.soneja@aucert.ai (or another aucert.ai member) signed into Chrome | Page title bar shows the right account |
| atlas-agent@aucert.dev has Edit access to the test doc | Open Share dialog on the doc; confirm atlas-agent@aucert.dev listed as Editor |
| Only ONE tab has the test doc open (the MCP-driven tab) | Close any other tab with the doc URL — Google enforces single-session-per-doc and the LAST opener wins; if your primary Chrome window has the doc tab, the MCP tab will be signed out mid-session with the "You have been signed out. Sign into your account from another tab" overlay. Symptoms: doc title bar shows "Trying to connect…", then a sign-out modal blocks interaction. Recovery: close every other tab with this doc, then have the MCP nav-reload. |
| kubectl configured for aucert-aks | kubectl config current-context returns aucert-aks |
| Worker is on the code under test | kubectl get pod -n internal-platform -l app=spec-agent-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}' matches the image you just deployed |
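The two kubectl rows in the table can be checked from a script before starting the loop. This is a minimal sketch: the helper names are hypothetical, and identifying "the image you just deployed" by a trailing sha256 digest is an assumption; substitute whatever identifier your deploy step reports.

```shell
# Succeeds when the kube context ($1) is the cluster this runbook targets.
context_ok() {
  [ "$1" = "aucert-aks" ]
}

# Succeeds when the running imageID ($1) ends with the expected digest ($2).
# An empty expected digest never matches.
image_ok() {
  case "$1" in
    *"$2") [ -n "$2" ] ;;
    *) return 1 ;;
  esac
}

# Usage (commands copied from the table; DEPLOYED_DIGEST is hypothetical):
#   context_ok "$(kubectl config current-context)" || echo "wrong context"
#   image_ok "$(kubectl get pod -n internal-platform -l app=spec-agent-worker \
#     -o jsonpath='{.items[0].status.containerStatuses[0].imageID}')" \
#     "$DEPLOYED_DIGEST" || echo "worker is not on the code under test"
```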
The test doc
Use exactly this doc; do not pick a different one without a strong reason.
- URL: https://docs.google.com/document/d/1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58/edit
- Drive file ID: 1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58
- Title: "Test Agent Document"
- Maintenance: this doc is the canonical test fixture. Don't repurpose it for unrelated work; don't rename it. If it gets cluttered with old test comments, resolve them in bulk rather than deleting the doc.
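The Drive file ID is the path segment after /document/d/ in the URL, and the Temporal workflow ID quoted later in this runbook (spec-agent-drive-1jxno...) appears to embed the same ID. The sketch below derives both from a doc URL. The function names are hypothetical, and the spec-agent-drive-<file id> pattern is inferred from that single example rather than confirmed from the code.

```shell
# Print the Drive file ID embedded in a Google Docs URL ($1).
# Prints nothing when the URL is not a docs.google.com document URL.
doc_file_id() {
  printf '%s\n' "$1" |
    sed -n -E 's#^https://docs\.google\.com/document/d/([A-Za-z0-9_-]+).*#\1#p'
}

# Print the per-file workflow ID for a doc URL ($1); fails on a bad URL.
# The body runs in a subshell so its variable does not leak.
spec_workflow_id() (
  id=$(doc_file_id "$1")
  [ -n "$id" ] && printf 'spec-agent-drive-%s\n' "$id"
)
```

If the test doc ever changes, running both helpers against the new URL gives the values to substitute throughout the runbook.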
The loop
Variant A — top-level comment (the headline path)
This is the path that worked even before the listFileComments fix. It catches every other Drive-watch bug class (resource_id translation, page-token advance, task_run insertion, workflow dispatch).
1. Open the doc (Chrome MCP):

       tabs_context_mcp { createIfEmpty: true }
       navigate { url: "https://docs.google.com/document/d/1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58/edit", tabId: <fresh tab> }

2. Map the UI (read_page with filter: "interactive"). Capture:
   - The toolbar "Add comment (⌘+Option+M)" button, typically labelled exactly that.
   - The top-right "Show all comments N new comments" button.

3. Post a top-level comment with a unique marker:

       [trigger-test] ts=<unix epoch seconds at the moment of posting>

   The timestamp lets you tell "atlas's reply to THIS test" apart from "atlas's reply to a previous test." If you want a specific model label, append [gpt] or similar, but the test should NOT depend on operator-tag routing working (that's a separate code path; a default-model [S46] reply is the success criterion).

   To post: click in the doc body → click the toolbar "Add comment" button → type the marker → click the "Comment" submit button. Or use ⌘+Option+M as a fallback.

4. Force-trigger reconciliation (skips the up-to-15-min wait for the next cron tick):

       kubectl exec -n temporal $(kubectl get pod -n temporal -l app.kubernetes.io/component=admintools -o jsonpath='{.items[0].metadata.name}') -- \
         temporal schedule trigger \
         --address temporal-frontend.temporal.svc.cluster.local:7233 \
         --namespace aucert-default \
         --schedule-id drive-reconciliation-cron

5. Wait ~60 seconds for the full chain to run: the activity walks changes.list → lists comments → markProcessed → dispatches the per-file workflow → atlas reads the doc → atlas posts a reply. The 60s budget covers the activity (~1-3s) plus the SpecAgentWorkflow's full run (LLM call + Drive write).

6. Verify: run read_page on the test doc again and find the comments-sidebar count badge. Healthy outcome:
   - Badge label changed from "Show all comments 0 new comments" to "Show all comments 1 new comments" (or +N).
   - Click the sidebar; read its body. Expect at least one comment whose body contains both the timestamp marker AND a model-label prefix like [S46], [O47], [K26], or [G54]. That's atlas's reply.

7. Failure recovery: if there is no atlas reply within 60s, check:

       kubectl logs -n internal-platform -l app=spec-agent-worker --since=5m | grep "drive-changes-process\|drive-reconciliation"

   Expect lines like:

       drive-reconciliation: found 1 active changes-mode watch(es); starting per-watch process workflows
       drive-changes-process: channel=<uuid> changes_seen=N candidate_files=M comments_seen=K comments_deduped=L dispatched=P ...

   Diagnostic decision tree:

   | What the log shows | What's broken |
   |---|---|
   | No drive-reconciliation log line at all | Schedule isn't ticking; check temporal schedule describe drive-reconciliation-cron |
   | drive-reconciliation: found 0 active changes-mode watch(es) | The bootstrap watch is missing; check drive_watches table for a scope='changes' row |
   | drive-changes-process: ... changes_seen=0 | Drive's changes feed has nothing new; the page token may have already advanced past your test comment. Roll back start_page_token in drive_watches and re-trigger. |
   | drive-changes-process: ... comments_seen>0 dispatched=0 skipped_no_new_comments>0 | Either every comment was already dedup'd (delete processed_comments rows for the test file_id and re-trigger), OR the comment-walking logic is wrong. |
   | drive-changes-process: ... dispatched>0 but no atlas reply | The per-file SpecAgentWorkflow failed. Check Temporal UI for spec-agent-drive-1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58. |
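The diagnostic decision tree can be sketched as a small log classifier. This is an illustration under stated assumptions, not tooling that ships with the repo: the function names and the short diagnosis labels are hypothetical, and the parser assumes the counters appear as space-separated name=value pairs exactly as in the sample log lines of the failure-recovery step.

```shell
# Print the numeric value of counter $2 (for example changes_seen) found in
# log line $1; prints nothing when the counter is absent.
counter() {
  printf '%s\n' "$1" | sed -n -E "s/.*[[:space:]]$2=([0-9]+).*/\1/p"
}

# Read worker log text on stdin and print the first matching diagnosis,
# following the decision tree top to bottom. Runs in a subshell so its
# variables do not leak into the caller.
diagnose() (
  log=$(cat)
  if ! printf '%s\n' "$log" | grep -q 'drive-reconciliation:'; then
    echo 'schedule-not-ticking'; exit 0
  fi
  if printf '%s\n' "$log" | grep -q 'found 0 active changes-mode watch'; then
    echo 'bootstrap-watch-missing'; exit 0
  fi
  # Classify on the most recent drive-changes-process line only.
  line=$(printf '%s\n' "$log" | grep 'drive-changes-process:' | tail -n 1)
  if [ -z "$line" ]; then
    echo 'no-process-line'; exit 0
  fi
  changes=$(counter "$line" changes_seen)
  comments=$(counter "$line" comments_seen)
  dispatched=$(counter "$line" dispatched)
  skipped=$(counter "$line" skipped_no_new_comments)
  if [ -n "$changes" ] && [ "$changes" -eq 0 ]; then
    echo 'changes-feed-empty'
  elif [ "${comments:-0}" -gt 0 ] && [ "${dispatched:-0}" -eq 0 ] &&
       [ "${skipped:-0}" -gt 0 ]; then
    echo 'all-deduped-or-walk-bug'
  elif [ "${dispatched:-0}" -gt 0 ]; then
    echo 'dispatched-check-temporal'
  else
    echo 'unclassified'
  fi
)
```

Piping the kubectl logs command from the failure-recovery step into diagnose would print one label; each label corresponds to one row of the table, plus two extra labels (no-process-line, unclassified) for logs the table does not cover.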
Variant B — thread reply (the bug drive-watch-test-loop was created for)
This is the path that was broken until the listFileComments fix in PR #136. Drive distinguishes top-level Comments from Replies-within-threads; comments.list only returns top-level. The fix walks each comment's replies[] array and emits one synthetic dedup ID per reply.
1. Pre-condition: Variant A has run successfully and atlas posted a reply to your top-level comment. That created a thread.

2. Open the comments sidebar: click the top-right "Show all comments N new comments" button (refs vary per session; use find { query: "Show all comments" }).

3. Reply within the thread: click on the comment thread atlas just replied to, click the "Reply" input, and type a NEW unique marker:

       [trigger-test-reply] ts=<unix epoch seconds at the moment of posting>

   Submit.

4. Force-trigger reconciliation (same command as Variant A step 4).

5. Wait ~60 seconds.

6. Verify:
   - The thread now contains 4 entries: your original top-level comment, atlas's first reply, your reply, and atlas's reply to YOUR reply. The fourth entry (atlas replying to a reply) is the success signal.
   - The fourth entry must contain the new timestamp marker AND a model-label prefix.

7. Failure recovery for Variant B specifically:
   - If the drive-changes-process log shows comments_seen=N but it's the same N as before posting the reply → listFileComments is not seeing the reply → the replies field in the API request is missing. Check the setFields(...) mask in GoogleDriveClient.listFileComments.
   - If comments_seen increased but comments_deduped ate everything → the synthetic dedup ID format may have changed and old processed_comments rows are colliding. Inspect rows with WHERE file_id = '1jxnoGsGFJJKhiBq-WCaLC1q1YLfvkVGJpezQkfvnq58'.
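The verify steps in Variants A and B both look for the same two things in a comment body: the timestamp marker you posted and a model-label prefix. A minimal sketch of generating the marker and checking a reply follows. The function names are hypothetical, the label list is limited to the four labels this runbook names (the text says "like", so others may exist), and treating "prefix" as "at the start of the first line" is an assumption.

```shell
# Print a unique marker. $1 is the tag (defaults to trigger-test); pass
# trigger-test-reply for the Variant B thread reply.
make_marker() {
  printf '[%s] ts=%s\n' "${1:-trigger-test}" "$(date +%s)"
}

# Succeeds when comment body $1 contains timestamp $2 and its first line
# starts with one of the model labels named in this runbook.
is_atlas_reply() {
  printf '%s\n' "$1" | grep -F -q -- "ts=$2" &&
    printf '%s\n' "$1" | head -n 1 |
      grep -E -q '^[[:space:]]*\[(S46|O47|K26|G54)\]'
}
```

Keeping the timestamp in a shell variable when posting makes the later check unambiguous: only a reply that echoes that exact ts value counts as the reply to this run.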
Variant C — multi-model routing (covers PR #141 / #142 / #144)
This variant validates the secondary-workflow spawn path that was broken before PR #141: /models <other-model> and /models <self>,<other> should produce real Drive replies from the secondary, not silent completions in Temporal. It also exercises the /inline scope-match fix (PR #142) and the tertiary-spawn guard (PR #144).
Pre-conditions on top of Variants A/B: agent token vault must have anthropic-direct-api-key populated if [opus-direct]/[sonnet-direct] are tested, AND Bedrock opus access must be working on the AWS account (separate infra fix — skip /models opus if still failing on Bedrock).
C-1 — Case C (operator picks one non-self model)
1. Post a top-level comment via Chrome MCP:

       @atlas-agent /models kimi /readonly_qa what is this doc about?

2. Wait ≤ 90 s. Expect TWO replies in the thread:
   - First reply (primary sonnet, label [S46]): a brief ack like "Spawned kimi; the [K26] response will follow shortly." and nothing substantive.
   - Second reply (secondary kimi, label [K26]): a substantive answer to the operator's question.

3. Pass criteria (each one independent; if any fails, the secondary path is regressing):
   - The [K26] reply must arrive within 90 s of the [S46] ack. Silent completion of the secondary is the symptom of the bug PR #141 fixed.
   - The [K26] reply must be substantive (not another ack): the operator's question was answered.
   - The [S46] ack must NOT include any analysis or workflow tool output. The Case C primary should only route.
   - The secondary's worker log line Spawned secondary workflow must appear exactly once for kimi; no tertiary spawn attempt.

4. SQL spot-check:

       SELECT id, run_type, input_payload->>'source' AS source, outcome
       FROM task_runs
       WHERE input_payload->>'comment_id' = '<comment_id>'
       ORDER BY started_at DESC;

   Expect exactly one secondary row with source='secondary_model_response' and outcome='success'. Pre-fix, this row was missing the comment_id field entirely.
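The SQL spot-check can be turned into a pass/fail test by feeding the query output to a small filter. This sketch assumes the query is run with psql in unaligned, tuples-only mode (-At), which prints one row per line with | between columns in the SELECT order (id, run_type, source, outcome). The function name is hypothetical, and it deliberately checks only source and outcome because the exact run_type value is not spelled out here.

```shell
# Read `psql -At` rows (id|run_type|source|outcome) on stdin and succeed
# only when exactly one row has source=secondary_model_response and
# outcome=success. Runs in a subshell so its variable does not leak.
one_secondary_success() (
  n=$(awk -F'|' '$3 == "secondary_model_response" && $4 == "success" { c++ }
                 END { print c + 0 }')
  [ "$n" -eq 1 ]
)
```

Zero matching rows corresponds to the pre-fix symptom (the row could not be found by comment_id); more than one suggests a duplicate spawn.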
C-2 — Case B (self + one other model)
- Post:

      @atlas-agent /models sonnet,opus /readonly_qa <question>

- Expect TWO substantive replies, [S46] and [O47], both answering the question (Case B: both YOUR MODELs are in the tagged set).
- Pass: both labels post real answers; neither is silent.
- Fail mode to look for: only one of the two replies arrives → either the primary mishandled Case B, or the secondary regressed back to silent completion.
C-3 — /inline scope-match (covers PR #142)
- Identify a short doc passage with a known typo (one or two characters).
- Post:

      @atlas-agent /inline fix the typo "<word with typo>" → "<corrected word>"

- After atlas applies the edit, read the tracking comment. It MUST contain:
  - approx N chars changed (body now X chars, was Y): N should be close to the actual typo length (a few chars), NOT the body length.
  - (atlas-inline-session: atlas-inline-<uuid>; pre_edit_revision: <revision>): parseable for /revert.
- Open Drive Version History on the doc. Diff the live revision against the pre-edit revision. Pass: only the typo region differs; surrounding prose is byte-for-byte identical. Fail: surrounding sentences were re-flowed or punctuation was normalised, which means the system prompt rule is not biting.
- Revert via the tracking comment: post /revert as a reply to the [ATLAS] tracking comment. Doc body must restore exactly to the pre-edit revision.
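The two required fragments of the tracking comment can be checked mechanically. The sketch below parses them with sed, assuming the comment text uses exactly the wording shown in this section. The function names are hypothetical, and the "a few chars" threshold is passed in by the caller because no exact limit is stated.

```shell
# Print "changed now was" from the size summary in tracking comment $1.
edit_sizes() {
  printf '%s\n' "$1" | sed -n -E \
    's/.*approx ([0-9]+) chars changed \(body now ([0-9]+) chars, was ([0-9]+)\).*/\1 \2 \3/p'
}

# Print "session_id revision" from the session footer in tracking comment $1.
session_info() {
  printf '%s\n' "$1" | sed -n -E \
    's/.*\(atlas-inline-session: (atlas-inline-[^;]+); pre_edit_revision: ([^)]+)\).*/\1 \2/p'
}

# Succeeds when the changed-char count in comment $1 is at most $2.
# Fails when the summary is missing altogether.
scoped_edit() (
  changed=$(edit_sizes "$1" | cut -d' ' -f1)
  [ -n "$changed" ] && [ "$changed" -le "$2" ]
)
```

For a one- or two-character typo, something like `scoped_edit "$comment" 5` would flag the failure case where N equals the body length.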
C-4 — Tertiary spawn guard (covers PR #144)
This one can't be exercised by a real operator comment because the secondary never tries to spawn another secondary in practice. Instead, the guard is checked at deploy time via log inspection:
- After Variant C-1 runs, search the worker log for any "request_model_response rejected: parent task_run is itself a secondary" warning. It should not appear in the happy path.
- If you want to actively probe: temporarily craft a comment that the secondary's LLM is likely to interpret as needing a second opinion (e.g. @atlas-agent /models kimi /review compare your answer with sonnet). If kimi hallucinates a request_model_response call, the guard rejection MUST log and no tertiary workflow may appear in Temporal.
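The happy-path log inspection for C-1 and C-4 can be combined into one check: exactly one spawn line for the requested model and no guard rejection. This is a sketch only. The function name is hypothetical, and it assumes the Spawned secondary workflow log line mentions the model name somewhere on the same line, which is not confirmed here.

```shell
# Read worker log text on stdin. Succeeds when a secondary workflow was
# spawned exactly once for model $1 and the tertiary-spawn guard never
# fired. `|| true` keeps grep's "no match" exit status from aborting
# callers that run under `set -e`; the printed count is still 0.
happy_path_spawn() (
  log=$(cat)
  spawns=$(printf '%s\n' "$log" | grep -F 'Spawned secondary workflow' |
    grep -F -c -- "$1" || true)
  rejects=$(printf '%s\n' "$log" | grep -F -c \
    'request_model_response rejected: parent task_run is itself a secondary' || true)
  [ "$spawns" -eq 1 ] && [ "$rejects" -eq 0 ]
)
```

For the active probe, the expectation flips: the rejection line should be present, so a failing happy_path_spawn alongside no tertiary workflow in Temporal is the desired result there.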
Cleanup after a successful test
The test doc accumulates comments over time. After verifying that all the variants you ran pass:
- Don't delete test comments; they're audit trail. Resolve them via the comment menu (⋮ → Resolve) so they collapse out of the active sidebar but remain queryable in the doc's revision history.
- Don't delete processed_comments rows for the test file_id; they're audit trail too. They will age out via Phase 3.2's deleteProcessedOlderThan cleanup once that's wired (currently manual).
Fallback — webhook mock test
Use this if Chrome MCP is broken or you can't get site access for docs.google.com. Less faithful but bulletproof.
1. Compose a synthetic Drive notification. Use -At rather than -t alone: psql's default aligned output pads each value with a leading space, which would end up inside the quoted channel_id in the second query.

       CHANNEL_ID=$(kubectl exec -n temporal admintools -- psql ... -At -c "SELECT channel_id FROM drive_watches WHERE scope='changes' LIMIT 1")
       RESOURCE_ID=$(kubectl exec ... -At -c "SELECT resource_id FROM drive_watches WHERE channel_id='$CHANNEL_ID'")

2. POST to the dispatcher's webhook endpoint (note: dispatcher is internal; use a port-forward):

       kubectl port-forward -n internal-platform svc/dispatcher 8080:8080 &
       sleep 2  # give the port-forward a moment to start listening
       curl -X POST http://localhost:8080/webhooks/drive \
         -H "X-Goog-Channel-ID: $CHANNEL_ID" \
         -H "X-Goog-Resource-ID: $RESOURCE_ID" \
         -H "X-Goog-Resource-State: update" \
         -H "X-Goog-Changed: comments"

3. Verify a DriveChangesProcessWorkflow appears in Temporal and reaches Completed.
This validates the dispatcher → workflow → activity → token-advance chain. It does NOT validate the comments.list walk against real Drive; if your bug is in that layer (like the thread-reply bug), the webhook mock will pass while real-world is broken.
When you make a change to the loop
If the test doc URL changes, the Chrome MCP behaviour changes, or you discover a new failure mode worth recording: edit this doc in the same PR so the next agent inherits the improvement. The runbook is the source of truth; out-of-band tribal knowledge defeats the point.