Drive watch migration — historical record (2026-05-06, complete)

Status: COMPLETE (2026-05-08). This document is preserved as the historical record of the Drive watch migration — incident context, durability invariants, and the architecture decisions that informed each phase's implementation. Active work tracking has moved to drive-watch-followups-handover-2026-05-08.md.

Code in the codebase still references this doc by phase number ("Phase 2.1", "durability principle #3") — those references intentionally stay valid here. Don't delete; do consult.

This handover documented the work to migrate Atlas's Drive comment integration from per-file files.watch to a single drive.changes.watch per agent identity, plus the durability hardening that followed. All in-scope phases shipped between 2026-05-06 and 2026-05-08; the system in production today is the one this doc describes.

What got built

Atlas's Drive comment integration was technically working but operationally fragile. The 2026-05-06 incident demonstrated that the agent could hallucinate the dispatcher webhook URL when registering a watch, and the per-file files.watch model required explicit registration for every doc — neither failure mode produced a signal. The migration replaced per-file watches with a single drive.changes.watch per agent identity, hardened the tool surface so URLs can't be hallucinated, and added the durability guarantees needed to make the system self-healing.

| Phase | What it delivered | PR |
| --- | --- | --- |
| Phase 1 | URL hardcoded from DRIVE_WEBHOOK_URL env; agent can no longer supply infrastructure facts | #105 |
| Phase 2.1 | V016 schema (scope, start_page_token, agent_drive_email); registration tool | #108 |
| Phase 2.2 | Scope-aware dispatcher; per-channel DriveChangesProcessWorkflow | #110 |
| Phase 2.3 | Bootstrap-by-invariant + scope-aware renewal cron | #113 |
| Phase 3.1 | Reconciliation cron + processed_comments dedup table | #116 |
| Docs | Operator runbook corrected (Schedule API, not legacy --cron-schedule) | #120 |

The remaining followups (Phase 3.2a/3.2b alerting and Phase 2.4 legacy cleanup) are tracked in the followups handover.

Durability principles (still load-bearing)

Five invariants this system must maintain forever, listed in priority order. Every PR in the migration sequence below should preserve these; any change that violates one is a regression. Phase 1 establishes invariant #2; Phase 2 sets up the substrate for #1, #3, and #4; Phase 3 closes #3, #4, and #5. Each is meaningless without the others — they are a coupled set.

  1. Watches are self-bootstrapped, not human-bootstrapped. The system creates and maintains a changes watch for each active agent identity automatically, on worker startup and on every renewal-cron tick. There is no manual "first registration" step. A fresh agent activation provisions its own watch within minutes.

  2. No agent tool exposes infrastructure facts as input. URLs, hostnames, channel IDs, table names, secret IDs come from worker-side env/config, never from model-supplied tool arguments. Phase 1 demonstrates this on webhook_url; the principle applies to every tool we add. Code-review checklist item: when adding a tool, audit inputSchema for any field the agent shouldn't be inventing.

  3. The page token advances atomically with successful dispatch — never on failure. start_page_token is the system's watermark. Updating it without dispatching the underlying changes loses events; dispatching without updating it duplicates them. Updates must happen in the same transaction as workflow start (or with a write-after-read pattern that is recoverable on crash).

  4. Dispatch is idempotent on (file_id, comment_id). Push delivery and poll reconciliation will overlap. Without dedup, every reconciliation tick risks double-replies. A unique constraint on task_runs(file_id, comment_id) (or a separate processed_comments table) is the single mechanism that makes push and poll safe to interleave.

  5. Silent failure is the worst failure mode. Every "system should be doing X but isn't" condition has at least one alert. The 2026-05-06 incident's deeper bug wasn't URL synthesis — it was that the broken state produced no signal: no error, no log, no alarm. The system looked healthy. Phase 3.2 wires the minimum set of detectors so this category of failure can never recur silently.

These principles are the lens through which Phase 2 and Phase 3 should be reviewed. Any PR is "done" only when its invariants hold by construction, not by accident.

Context (read these before starting)

In this order:

| File | Why |
| --- | --- |
| docs/internal/docs/agents/model-routing-and-labels.mdx | Multi-provider abstraction context (you'll be touching tools that interact with model routing) |
| docs/internal/docs/agents/spec-agent-v01-handover-2026-05-01.md | Prior spec-agent handover; explains overall agent architecture |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/common-tools/drive/DriveWatchTool.kt | The tool that has the hallucination-prone webhook_url parameter |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/clients/DriveWatchRepository.kt | How drive_watches rows are read and written |
| internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/routing/WebhookRoutes.kt | Dispatcher's /webhooks/drive endpoint (current per-file webhook landing) |
| internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/webhooks/DriveWebhookHandler.kt | The handler that resolves channel_id → file_id from drive_watches |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/temporal/DriveWatchRenewalActivityImpl.kt | The 12h cron that renews watches before TTL expiry |
| infra/migrations/astra/V015__create_drive_watches.sql | drive_watches schema |
| .context/decisions/ADR-011-agent-harness-strategy.md | Why we have an agent-tool abstraction and how to extend it |

Background — what's broken and why

Diagnosis from the 2026-05-06 incident

Vivek shared a new Google Doc (1HYbKksSlMM1DphACAfLBcinq0CNHf5fN) with atlas-agent@aucert.dev and tried to comment to trigger atlas. Nothing happened. The investigation found two distinct bugs:

Bug 1 (high-severity, blocking): atlas hallucinated the webhook URL.

Atlas was asked via comment to register a watch on the new file. It called the drive_watch tool with:

{"folder_id":"1HYbKksSlMM1DphACAfLBcinq0CNHf5fN","webhook_url":"https://dispatcher.aucert.dev/webhooks/google-drive","ttl_hours":168}

But the dispatcher's actual route is /webhooks/drive (singular, no google- prefix), defined in WebhookRoutes.kt:63. Drive happily accepted the registration and started sending notifications to the wrong URL, getting 404 Not Found from the dispatcher. The drive_watches row exists with the wrong URL persisted; Vivek's comments arrive at Drive, Drive forwards to a non-existent path, dispatcher 404s, no workflow runs.

The four pre-existing watches (all on 1BSLKqiL42H_…) were registered with the correct URL (/webhooks/drive) — likely by an earlier hardcoded path or a different agent reasoning step. There is no SHARED canonical source of the URL; atlas chose /webhooks/google-drive because most Google APIs use that pattern, and the tool gave it no signal that it was wrong.

Root cause: the drive_watch tool exposes webhook_url as an agent-supplied input. This is an anti-pattern. URLs and credentials are infrastructure facts the agent has no way to know correctly; they should be hardcoded from worker-side env vars, not synthesised by the model.

Bug 2 (medium-severity, design-level): the per-file watch model is wrong for our use case.

Each file requires an explicit files.watch registration. There is no auto-discovery — when Vivek shares a new doc with atlas-agent@aucert.dev, atlas does not know the file exists until someone explicitly tells it to register a watch. This means the operator-facing UX for "I want atlas to respond on this new doc" is "remember to comment on the OLD doc asking atlas to register a watch on the NEW doc." Brittle; doesn't scale.

The right primitive is drive.changes.watch — a single push channel per Drive scope that fires on any change in any file the agent has access to. Slack's official Google Drive integration uses essentially this approach (combined with Gmail watch for some signals). We should too.

What is already correct (don't disturb)

  • Webhook receipt path through Cloudflare Tunnel → dispatcher pod → Ktor route → handler → Temporal workflow start. This pipeline is solid and validated.
  • The drive_watches table schema (channel_id, resource_id, file_id, webhook_url, expiration_time). It needs a small addition for changes-watch mode (see below) but the existing columns stay.
  • The 12h Temporal cron that renews watches before TTL. It does have a known "renewal window too narrow" bug (separate follow-up; flagged in our backlog). Don't fix as part of this work unless it materially blocks.
  • Atlas's tool registry, multi-provider routing, comment-tag override semantics. All separate from this work.

Phase 1 — Immediate URL fix (deploy today)

Goal: prevent agents from being able to hallucinate the webhook URL ever again.

Changes required

  1. DriveWatchTool.kt: remove webhook_url from the input schema entirely. The tool now resolves the URL from a worker-side env var, equivalent to private val webhookUrl: String = System.getenv("DRIVE_WEBHOOK_URL") ?: error("DRIVE_WEBHOOK_URL env var is not set"). For reference, the value in the aucert-dev environment is https://dispatcher.aucert.dev/webhooks/drive.

  2. Helm chart values: add DRIVE_WEBHOOK_URL env var to the spec-agent worker deployment. File: infra/k8s/internal-platform/spec-agent-worker/configmap.yaml (the spec-agent-worker-config ConfigMap; backend pods consume via envFrom).

  3. Tool description: update so atlas knows it doesn't supply a URL anymore — the description should say "Webhook URL is configured at deploy time; the agent has no control over it."

  4. Test: DriveWatchToolTest should assert the tool's input schema does not include webhook_url.

  5. Re-register fix for existing broken watch: after deploy, the broken watch row for 1HYbKksSlMM1… (channel_id 021a490d-360d-4dd6-96c7-fed9917e9891, registered 2026-05-06 20:34:53 UTC) should be replaced. A second corrected watch was registered manually by Vivek + Claude shortly after (channel_id d5fd409c-f772-4c59-85cb-24463e1ac547, webhook_url /webhooks/drive). The broken row can be deleted in a follow-up cleanup PR; both are harmless to leave in place since the broken one will simply expire in 24h.
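
The shape of items 1 and 4 can be sketched as follows. This is a minimal illustration, not the shipped DriveWatchTool: the function and property names are hypothetical, and the two remaining input fields (folder_id, ttl_hours) are assumed from the tool call quoted in the incident diagnosis. Only DRIVE_WEBHOOK_URL and the removal of webhook_url come from this document.

```kotlin
// Hypothetical sketch: names other than DRIVE_WEBHOOK_URL and webhook_url are illustrative.

/** Resolves the webhook URL from worker-side configuration, failing fast when it is absent. */
fun resolveWebhookUrl(env: Map<String, String> = System.getenv()): String =
    env["DRIVE_WEBHOOK_URL"]?.takeIf { it.isNotBlank() }
        ?: error("DRIVE_WEBHOOK_URL env var is not set")

/** Fields the agent may supply (assumed). Infrastructure facts such as webhook_url are absent. */
val driveWatchInputFields: Set<String> = setOf("folder_id", "ttl_hours")

/** Rejects any tool call that carries a field outside the declared schema. */
fun validateToolArguments(args: Map<String, Any?>): Map<String, Any?> {
    val unexpected = args.keys - driveWatchInputFields
    require(unexpected.isEmpty()) { "Unexpected tool arguments: $unexpected" }
    return args
}
```

Passing the environment in as a map keeps the resolver testable without touching the real process environment. A schema test in the spirit of item 4 then reduces to asserting that "webhook_url" is not in the declared field set.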

Verification

  • New drive_watches rows, regardless of which agent or which model registered them, MUST all have webhook_url = https://dispatcher.aucert.dev/webhooks/drive (or whatever env var dictates).
  • DriveWatchTool's JSON schema must not contain webhook_url.
  • Re-running the registration flow (post a comment on the existing watched doc asking atlas to watch a new file) lands a row with the correct URL — without any agent reasoning being involved in URL selection.

Phase 2 — Migrate to drive.changes.watch

Goal: one watch per agent identity, covering every file the agent has access to. New files require zero registration; the watch automatically reports any comment on any accessible file.

Architectural sketch

          [Drive: atlas-agent's Drive scope]
                         |
                         |  (single watch, push)
                         v
          POST /webhooks/drive (existing path)
                         |
                         v
                 DriveWebhookHandler
      (NEW: detects scope=changes vs scope=file)
                         |
           ┌─────────────┴─────────────┐
           v                           v
  (legacy file mode)          (NEW: changes mode)
  file_id                     changes.list?pageToken=X
  direct dispatch             walk diff, find new comments
                              one workflow per comment
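
The branch in the diagram can be expressed as a small pure function. The sketch below is illustrative only: the types and the route function are hypothetical and do not reflect the real DriveWebhookHandler. Only the two scope values and the channel_id lookup come from this design.

```kotlin
// Hypothetical types; only the scope values 'file' and 'changes' come from the design.
enum class WatchScope { FILE, CHANGES }

data class WatchRow(
    val channelId: String,
    val scope: WatchScope,
    val fileId: String? = null,         // set for legacy per-file watches
    val startPageToken: String? = null  // set for changes watches
)

sealed class Route
data class DirectDispatch(val fileId: String) : Route()
data class WalkChanges(val pageToken: String) : Route()
object UnknownChannel : Route()

/** Looks up the channel's row and picks the legacy or changes path based on its scope. */
fun route(channelId: String, watches: Map<String, WatchRow>): Route {
    val row = watches[channelId] ?: return UnknownChannel
    return when (row.scope) {
        WatchScope.FILE ->
            DirectDispatch(requireNotNull(row.fileId) { "file watch without file_id" })
        WatchScope.CHANGES ->
            WalkChanges(requireNotNull(row.startPageToken) { "changes watch without start_page_token" })
    }
}
```

Keeping the decision separate from the Drive calls means the routing can be unit-tested without a Drive client.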

Changes required

  1. Schema (drive_watches table):

    • Add scope VARCHAR(16) NOT NULL DEFAULT 'file' — values 'file' (legacy) | 'changes' (new).
    • Add start_page_token VARCHAR(64) NULL — required for changes.watch; null for file scope.
    • Add agent_drive_email VARCHAR(255) NULL — the Drive identity this watch covers (e.g. atlas-agent@aucert.dev); useful for multi-agent scenarios.
    • File: new migration infra/migrations/astra/V016__add_changes_watch_columns.sql.
  2. New tool: drive_changes_watch_register (or extend drive_watch with a mode parameter). Inputs:

    • agent_email (the Drive identity to watch on behalf of). The webhook URL is hardcoded as in Phase 1.
    • Internally: call drive.changes.getStartPageToken() first; then drive.changes.watch(pageToken, channel) with the start token; persist the row with scope='changes' and the start token.
  3. Update DriveWebhookHandler:

    • On incoming notification, look up the channel_id row.
    • If scope='file': behave as today.
    • If scope='changes': call drive.changes.list?pageToken=X, get the diff, walk changed files, for each file query drive.comments.list and find new comments since last poll, dispatch one workflow per new comment.
    • Update start_page_token to the nextPageToken returned by changes.list (or newStartPageToken on the last page) so the next notification picks up where this one left off. This is the key correctness invariant: a token that is not advanced replays changes already handled, and a token advanced past undispatched changes loses them.
  4. Renewal cron — extended to enforce invariant #1: update DriveWatchRenewalActivityImpl to handle both scopes. For changes watches, renewal is identical to file watches (re-call changes.watch with the same page token before TTL). The 12h cron cadence is fine for both.

    Refinement (2026-05-07, durability principle #1): on every tick, before processing renewals, ensure each active agent identity has at least one scope='changes' row with expiration > now + 1h; if missing, register a fresh watch. This collapses "bootstrap a watch" and "renew a watch" into a single self-healing loop — there is no separate manual bootstrap step.

  5. DriveWatchBootstrap activity (worker-side backstop): a one-shot activity that runs from SpecAgentWorker startup. Idempotent against concurrent invocations (INSERT … ON CONFLICT DO NOTHING on the bootstrap row). Exists so a freshly deployed agent has a watch within minutes of pod startup, rather than waiting up to 12h for the next renewal-cron tick. The renewal cron (item 4) is the primary mechanism; this activity is a latency optimization for cold starts.

  6. Deprecate files.watch for new files: keep the existing drive_watch tool functional (some files might still want per-file watches for reliability tests), but make drive_changes_watch_register the default for new agent setup. Add a runbook step in docs/internal/docs/agents/adding-new-agent.mdx for registering the changes watch on agent activation.

  7. Backfill: with invariant #1 above, no explicit backfill job is needed. The renewal cron's invariant enforcement creates the initial changes watch on its first tick after Phase 2.3 lands. The existing 1BSLKqiL42H_… file watch can be deprecated (deleted after a soak period).
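
The invariant check from the 2026-05-07 refinement (every active identity has a scope='changes' row with expiration > now + 1h) can be sketched as a pure function. The names below are hypothetical, and the real activity would read rows through DriveWatchRepository rather than take a list.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical projection of a scope='changes' row; field names are illustrative.
data class ChangesWatch(val agentDriveEmail: String, val expiration: Instant)

/** Returns the identities that have no changes watch valid for at least [margin] beyond [now]. */
fun identitiesNeedingWatch(
    activeIdentities: Set<String>,
    watches: List<ChangesWatch>,
    now: Instant,
    margin: Duration = Duration.ofHours(1)
): Set<String> {
    val covered = watches
        .filter { it.expiration.isAfter(now.plus(margin)) }
        .map { it.agentDriveEmail }
        .toSet()
    return activeIdentities - covered
}
```

Under this sketch the cron tick and the startup bootstrap can share the same check: both register a fresh watch for every identity the function returns, which is what merges "bootstrap" and "renew" into one loop.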

Decisions made during implementation

  • Comment-only filtering: implemented as Phase 3.1's per-file comments.list walk gated by the processed_comments dedup table (approach (a) from the original options). drive.activity.v2 was rejected to avoid adding a new OAuth scope. Filtering happens in the worker, not the dispatcher; the dispatcher just routes by scope.
  • Scope of "Drive access": confirmed the existing https://www.googleapis.com/auth/drive scope covers files the agent identity has access to, including files in other users' Drives shared with atlas-agent@aucert.dev.
  • Notification volume: each agent stayed at a single-digit number of accessible files. Reconciliation costs ~1 changes.list + N comments.list calls per tick; well within Drive's quota at this scale. Revisit beyond 5 agents and hundreds of accessible files.
  • Multi-agent isolation: agent_drive_email is the partition key. Each future agent gets its own changes watch, its own dedup namespace, its own webhook channel — Phase 2.3's invariant-enforcement loop will handle multi-agent enumeration once it switches from the env-driven single identity to a SELECT … FROM agents enumeration.

Verification (post-deploy signals — useful for regression testing)

  • A drive_watches row exists with scope='changes' and agent_drive_email='atlas-agent@aucert.dev'.
  • Sharing a NEW file with atlas-agent (no per-file registration) → commenting on it → atlas responds. This is the headline win.
  • start_page_token advances on every push or reconciliation tick that finds non-zero changes; stays put when there are zero changes (the healthy steady state).
  • Force-expiring expiration_time to NOW() triggers re-registration on the next renewal-cron tick.
  • Killing the dispatcher pod for 15 min then bringing it back: any comment posted during the outage is picked up within 15 min by the next reconciliation tick.

Out of scope for the migration (deferred / rejected)

  • Slack-style Gmail Pub/Sub integration. Discussed and rejected as unnecessary at current scale; note the option in a follow-up.
  • Auto-revoking watches when access is removed. Drive surfaces this via changes.list with removed=true; handle it as a next iteration.
  • Cross-organisation file watches. Drive does support these but adds OAuth complexity; defer.
  • Building an Astra UI page for "files atlas is currently watching." Useful, but separate work.

Phase 3 — Durability hardening

Phase 2 makes drive.changes.watch work; Phase 3 makes it stay working. Two PRs: a reconciliation tick that catches missed events, and a minimal alerting layer that surfaces silent failures. Without Phase 3, the system is functional but fragile — push delivery is best-effort and any single dropped notification is invisible.

Phase 3.1 — Reconciliation poll loop

Goal: catch any comment that push delivery missed, without doubling up on comments push already delivered.

Drive's push notifications are best-effort: a TTL expiry, a dispatcher outage, a Cloudflare blip, or a transient 5xx will silently drop a notification. The system needs a poll-based correctness layer running in parallel with push. Slack, Stripe, and GitHub all recommend the same pattern in their respective integration docs — push for latency, poll for correctness.

Mechanism

A 15-min Temporal cron workflow (DriveReconciliationWorkflow) that, for each active changes watch:

  1. Calls drive.changes.list(pageToken=row.start_page_token).
  2. For each change in the response, dispatches the comment-handler workflow — idempotent on (file_id, comment_id) per durability principle #4.
  3. Updates row.start_page_token = response.nextPageToken (or newStartPageToken when the page chain terminates) — atomically with successful dispatch (durability principle #3).

The token is the same one the push handler updates. Push and poll are simply two consumers of the same watermark; either or both can advance it. When push is healthy, reconciliation finds nothing and is essentially free.
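
A minimal sketch of the watermark rule in step 3, using the "write after successful dispatch" variant rather than a shared transaction. All names are hypothetical, and fetch and dispatch stand in for the Drive call and the workflow start.

```kotlin
// Hypothetical page shape: the changed file ids plus the token to persist afterwards.
data class ChangesPage(val changedFileIds: List<String>, val nextToken: String)

/**
 * Processes one page and returns the token to persist.
 * The watermark moves only if every dispatch succeeded. On any failure the old token
 * is returned, so the next tick re-reads the same changes (dedup makes the replay safe).
 */
fun reconcilePage(
    currentToken: String,
    fetch: (String) -> ChangesPage,
    dispatch: (String) -> Unit
): String {
    val page = fetch(currentToken)
    return try {
        page.changedFileIds.forEach(dispatch)
        page.nextToken
    } catch (e: Exception) {
        currentToken
    }
}
```

The sketch trades a possible replay for never skipping a change, which is only safe because dispatch is idempotent on (file_id, comment_id).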

New schema additions

  • processed_comments(file_id, comment_id, processed_at) — primary key (file_id, comment_id). Inserted at workflow-start time; INSERT … ON CONFLICT DO NOTHING lets push and poll race safely without producing duplicate workflows. This is the mechanism that satisfies durability principle #4.

    Alternative considered: unique constraint on task_runs(file_id, comment_id). Rejected because dedup needs to be independent of the task_runs row's lifecycle — task_runs rows can be archived or deleted as a maintenance operation, but dedup must persist for the lifetime of the file.
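
The claim-then-dispatch behaviour can be illustrated with an in-memory stand-in for the table. This is a sketch under stated assumptions: the real mechanism is the SQL INSERT … ON CONFLICT DO NOTHING described above, and every class and function name here is hypothetical.

```kotlin
import java.time.Instant
import java.util.concurrent.ConcurrentHashMap

data class CommentKey(val fileId: String, val commentId: String)

/** In-memory stand-in for the processed_comments table (assumption: real storage is SQL). */
class ProcessedComments {
    private val rows = ConcurrentHashMap<CommentKey, Instant>()

    /** Returns true only for the first caller to claim the key, mirroring ON CONFLICT DO NOTHING. */
    fun claim(key: CommentKey, now: Instant = Instant.now()): Boolean =
        rows.putIfAbsent(key, now) == null
}

/** Starts the workflow only if this caller won the claim; returns whether it dispatched. */
fun dispatchOnce(
    store: ProcessedComments,
    key: CommentKey,
    startWorkflow: (CommentKey) -> Unit
): Boolean {
    if (!store.claim(key)) return false
    startWorkflow(key)
    return true
}
```

Whichever of push or poll reaches a comment first wins the claim. The loser sees a conflict and does nothing, so interleaving the two consumers cannot produce a second workflow.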

Cadence rationale

15 min is the smallest interval that gives meaningful safety while staying cheap. Drive's quota is generous; changes.list for our scope returns small responses; the cron runs ~96 times per day. If push is healthy, reconciliation finds nothing on every tick. If push has been broken for 30 min, reconciliation catches up within two ticks. Tunable; document the rationale wherever the cadence is configured.

Operator runbook (post-merge)

The reconciliation cron uses Temporal's Schedule API (temporal schedule create), not the legacy workflow-level --cron-schedule flag. The legacy flag was removed from temporal workflow start in CLI 1.6+; tctl and old workflow-level cron still work server-side, but new crons go through the Schedule API.

There is no local temporal binary required — the temporal-admintools pod in ns:temporal ships with the CLI baked in. Run via kubectl exec:

kubectl exec -n temporal $(kubectl get pod -n temporal -l app.kubernetes.io/name=admintools -o jsonpath='{.items[0].metadata.name}') -- \
  temporal schedule create \
    --address temporal-frontend.temporal.svc.cluster.local:7233 \
    --namespace aucert-default \
    --schedule-id drive-reconciliation-cron \
    --workflow-id drive-reconciliation-cron \
    --task-queue spec-agent-queue \
    --type DriveReconciliationWorkflow \
    --cron '*/15 * * * *' \
    --execution-timeout 14m

The --schedule-id and --workflow-id are intentionally the same string, to mirror the legacy "one cron, one stable workflow ID" semantics. The Schedule API otherwise treats them as separate concepts: schedule-id identifies the cron rule; workflow-id is the base name Temporal uses for the workflow ID of each scheduled run.

To force a tick on demand (debugging or first-deploy verification): temporal schedule trigger --schedule-id drive-reconciliation-cron. To pause/resume: temporal schedule toggle with --pause or --unpause. To remove: temporal schedule delete.

Verification

  • Force a reconciliation tick (temporal schedule trigger --schedule-id drive-reconciliation-cron); confirm zero duplicate workflows when push has been healthy.
  • Disable the dispatcher pod for 15 min; comment on a watched file; bring the dispatcher back; confirm the next reconciliation tick (within 15-30 min) picks up the missed comment and atlas responds.
  • Manually wind back start_page_token by one event; confirm reconciliation re-queries that change but processed_comments blocks duplicate dispatch (the ON CONFLICT log line should fire exactly once).

Phase 3.2 — Health invariants & alerting

Goal: every "should be doing X but isn't" condition emits a signal. This is the answer to durability principle #5 — silent failure is the worst failure mode, and the 2026-05-06 incident was exactly such a failure.

Status (2026-05-08): blocked on observability infrastructure. This phase splits cleanly into two halves:

  • Phase 3.2a — Code-side log-shape audit is doable today: standardize a stable event=drive_watch.<name> discriminator and a fixed set of fields (agent=, file_id=, channel_id=, latency_ms=) across the existing log lines. The DriveChangesProcessActivityImpl and DriveReconciliationActivityImpl counters from Phase 3.1 are already alert-shaped; this phase formalises the field names so future Grafana queries are stable.
  • Phase 3.2b — Alert wiring is blocked on Month-2 observability work: the original Phase 3.2 design assumed Grafana + Prometheus + Loki were already deployed in ns:internal-platform, but per infra/.context/INTERNAL_PLATFORM.md they are still planned (Month 2). When that infra lands, the alert rules below become a one-PR reading job against the structured-log fields Phase 3.2a established.

Minimum set of alerts

Once Loki + Prometheus + Grafana are deployed, these are the alert rules to wire (each is a single Loki/Prom query against the structured-log fields):

| Alert | Condition | Severity |
| --- | --- | --- |
| No-event-while-active | No webhook hit in 6h AND changes.list shows ≥1 change in the same window | page |
| Watch-near-expiry-no-renewal | expiration < now + 6h AND no successful renewal logged in last 12h | page |
| Reconciliation-backlog | A single reconciliation tick processed > 50 changes (suggests push has been silently broken) | warn |
| Bootstrap-failed | DriveWatchBootstrap activity returned error on worker start | warn |
| Dispatch-dedup-rate | processed_comments ON CONFLICT rate > 10% over 1h (push/poll race tuning signal, not a bug per se) | warn |
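
The real rules are meant to be Loki/Prometheus queries. Until that infrastructure exists, three of the conditions can be pinned down as plain predicates, which makes the thresholds unit-testable. The function names below are hypothetical; the thresholds (6h, 12h, 50 changes, 10%) are the ones in the table.

```kotlin
import java.time.Duration
import java.time.Instant

/** Watch-near-expiry-no-renewal: expires within 6h and no successful renewal in the last 12h. */
fun watchNearExpiryNoRenewal(
    expiration: Instant,
    lastSuccessfulRenewal: Instant?,
    now: Instant
): Boolean {
    val nearExpiry = expiration.isBefore(now.plus(Duration.ofHours(6)))
    val renewedRecently = lastSuccessfulRenewal != null &&
        lastSuccessfulRenewal.isAfter(now.minus(Duration.ofHours(12)))
    return nearExpiry && !renewedRecently
}

/** Reconciliation-backlog: a single tick processed more than 50 changes. */
fun reconciliationBacklog(changesInTick: Int): Boolean = changesInTick > 50

/** Dispatch-dedup-rate: more than 10% of dispatch attempts hit ON CONFLICT in the window. */
fun dedupRateTooHigh(conflicts: Int, attempts: Int): Boolean =
    attempts > 0 && conflicts.toDouble() / attempts > 0.10
```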

Implementation

  • Backend emits structured log lines with stable fields (event=drive_watch.{registered,renewed,push_received,reconciled,deduped}, agent=, file_id=, channel_id=, latency_ms=). Audit current log lines and add missing fields.
  • New Grafana dashboard Drive watch health with one panel per alert + alert rules wired to PagerDuty (page severity) and #alerts-internal Slack (warn severity).
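
One way to keep the field names stable is to route every Drive-watch log line through a single formatter. The sketch below is hypothetical: the enum, the function name, and the key=value layout are assumptions, while the event names and field names are the ones listed above.

```kotlin
// Event names come from the list above; the enum itself is an illustrative wrapper.
enum class DriveWatchEvent(val wireName: String) {
    REGISTERED("registered"),
    RENEWED("renewed"),
    PUSH_RECEIVED("push_received"),
    RECONCILED("reconciled"),
    DEDUPED("deduped")
}

/** Builds one key=value log line; optional fields are omitted rather than logged as null. */
fun driveWatchLogLine(
    event: DriveWatchEvent,
    agent: String,
    fileId: String? = null,
    channelId: String? = null,
    latencyMs: Long? = null
): String = listOfNotNull(
    "event=drive_watch.${event.wireName}",
    "agent=$agent",
    fileId?.let { "file_id=$it" },
    channelId?.let { "channel_id=$it" },
    latencyMs?.let { "latency_ms=$it" }
).joinToString(" ")
```

With this shape, a dedup hit for file f1 would be logged as event=drive_watch.deduped agent=atlas-agent@aucert.dev file_id=f1, and future queries can match on the event= discriminator alone.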

Verification

  • Synthetic test: pause renewal cron in a staging environment, fast-forward expiration to now+5h, confirm Watch-near-expiry-no-renewal fires within one minute.
  • Synthetic test: post 100 comments simultaneously on a watched file, confirm Reconciliation-backlog warn fires.
  • Synthetic test: temporarily set DRIVE_WEBHOOK_URL to an invalid host in staging, confirm No-event-while-active fires within 6h.

Out of scope for Phase 3

  • Distributed tracing through the push → handler → workflow path. Useful for debugging but not required for durability.
  • Auto-recovery (e.g. automatically re-registering a broken watch). Add only after we've watched alerts fire for a while and understand which classes of failure auto-recovery is safe for.
  • Full chaos-engineering harness for the Drive integration. Defer until the platform has a chaos framework in general.

References

  • 2026-05-06 incident investigation (this conversation; transcript in claude.ai)
  • ADR-011 — Agent harness strategy (amendment 2026-05-06)
  • SPEC-005 — Spec agent v0.1 (§ tools, drive_watch is one of them)
  • Google Drive API: changes.watch
  • Google Drive API: changes.list