Drive watch — pending followups (2026-05-08)

The Drive watch migration (2026-05-06 handover) is complete. Phases 1–3.1 shipped between 2026-05-06 and 2026-05-08; the system maintains a drive.changes.watch per agent identity automatically, the dispatcher routes both push and reconciliation through the same activity, and a processed_comments dedup table keeps push and poll race-safe.

This document captures the three remaining followups. Each is independently shippable; none blocks the others.

TL;DR

| # | Item | Status | Trigger |
|---|------|--------|---------|
| 1 | Phase 3.2a — log-shape audit | Doable now | Whenever stable event= field names become useful (e.g. before Month-2 Grafana). |
| 2 | Phase 3.2b — Grafana alert rules | Blocked | Lands when Month-2 observability infra (Grafana + Prometheus + Loki) is rolled out. |
| 3 | Phase 2.4 — legacy scope='file' cleanup | Soak-period gated | After 1-2 weeks of stable changes-mode operation; 2026-05-22 at earliest. |

1. Phase 3.2a — Structured-log audit

Goal: standardise the log-line shape across all Drive-watch code paths so future Grafana queries (Phase 3.2b) read against stable, machine-extractable fields.

What exists today

The Phase 3.1 activity log line already carries the right counters as key=value pairs, but under an ad-hoc drive-changes-process: prefix with no event= discriminator:

drive-changes-process: channel=52df69d5-… changes_seen=0 candidate_files=0
comments_seen=0 comments_deduped=0 dispatched=0 skipped_removed=0
skipped_no_file_id=0 skipped_no_new_comments=0 advanced_to=338 chain_terminated=true

The renewal cron, bootstrap, reconciliation, and dispatcher webhook handler are in the same state: each logs in its own ad-hoc shape.

What's missing

Three things, in priority order:

  1. A stable event=drive_watch.<name> discriminator on every log line. Names from the durability invariants:
    • event=drive_watch.bootstrap_skipped / bootstrap_created / bootstrap_failed
    • event=drive_watch.renewed (with mode=file or mode=changes field)
    • event=drive_watch.push_received
    • event=drive_watch.reconciliation_tick
    • event=drive_watch.changes_processed
    • event=drive_watch.comment_deduped
    • event=drive_watch.workflow_dispatched
  2. Standard fields across all events: agent=, agent_drive_email=, channel_id=, file_id= (where applicable), comment_id= (where applicable), latency_ms= (where applicable). Not all events need all fields, but when an event has the data, it goes in the same field name.
  3. JSON encoding wrapper (optional but nice) — wraps the field bag in a single JSON object so Loki's json parser extracts everything in one pass without per-line regex.
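A minimal Kotlin sketch of the three items above, assuming a hypothetical DriveWatchEvent helper (neither the class nor its function names exist in the codebase today). It shows one field bag rendered either as a key=value line or as a single JSON object, with absent fields omitted rather than logged empty.

```kotlin
// Hypothetical helper for illustration; not taken from the codebase.
data class DriveWatchEvent(
    val name: String,                          // e.g. "changes_processed"
    val fields: Map<String, Any?> = emptyMap() // standard fields; null means "not applicable"
)

// Renders "event=drive_watch.<name> k=v ..." and skips fields that have no data.
fun DriveWatchEvent.toKeyValueLine(): String =
    (listOf("event" to "drive_watch.$name") +
        fields.filterValues { it != null }.map { (k, v) -> k to v.toString() })
        .joinToString(" ") { (k, v) -> "$k=$v" }

// Minimal JSON value encoder: numbers and booleans stay bare, everything else is quoted.
// Only backslash and double quote are escaped; a real implementation would use a JSON library.
fun jsonValue(v: Any): String = when (v) {
    is Number, is Boolean -> v.toString()
    else -> "\"" + v.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\""
}

// Renders the same field bag as one JSON object, event discriminator first.
fun DriveWatchEvent.toJsonLine(): String {
    val all = linkedMapOf<String, Any>("event" to "drive_watch.$name")
    fields.forEach { (k, v) -> if (v != null) all[k] = v }
    return all.entries.joinToString(",", "{", "}") { (k, v) -> "\"$k\":${jsonValue(v)}" }
}
```

For an event named changes_processed with channel_id=abc, changes_seen=0, and a null file_id, toKeyValueLine() yields event=drive_watch.changes_processed channel_id=abc changes_seen=0, and toJsonLine() yields the same three fields as one JSON object. Keeping the field bag separate from the encoding means the switch to JSON would not touch call sites.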

Files to touch

  • internal/backend/.../shared/clients/GoogleDriveClient.kt (watch-create / watch-stop log lines)
  • internal/backend/.../agents/spec/DriveWatchBootstrap.kt
  • internal/backend/.../agents/spec/temporal/DriveWatchRenewalActivityImpl.kt
  • internal/backend/.../agents/spec/temporal/DriveChangesProcessActivityImpl.kt
  • internal/backend/.../agents/spec/temporal/DriveReconciliationActivityImpl.kt
  • internal/dispatcher/.../webhooks/DriveWebhookHandler.kt

Out of scope

  • Adding new metrics (Prometheus counters/gauges) — that's Phase 3.2b territory.
  • Changing the content of any log line — only its shape. Counters stay.
  • Changing log levels.

Why doable now (no infra dependency)

The log-shape change writes to whatever stdout collector is in place: today's kubectl logs, tomorrow's Loki ingester. Standardising the shape ahead of Loki's arrival means alert rules become a one-PR job once Loki is live, instead of an "audit + standardise + wire" job.

2. Phase 3.2b — Grafana alert rules

Goal: every "should be doing X but isn't" condition emits a signal to PagerDuty (page) or Slack (warn).

Status: blocked on Month-2 observability infra rollout. Per infra/.context/INTERNAL_PLATFORM.md, Grafana, Prometheus, and Loki are listed as planned (Month 2) in ns:internal-platform. Until they're deployed, alert rules have nothing to read against.

Alert set (unchanged from the original handover)

| Alert | Condition | Severity |
|-------|-----------|----------|
| No-event-while-active | No event=drive_watch.push_received in 6h AND event=drive_watch.changes_processed shows changes_seen>0 in same window | page |
| Watch-near-expiry-no-renewal | Most-recent event=drive_watch.renewed for an agent identity older than 12h AND expiration_time < now+6h (DB query) | page |
| Reconciliation-backlog | event=drive_watch.changes_processed with changes_seen>50 in a single tick | warn |
| Bootstrap-failed | event=drive_watch.bootstrap_failed on worker start | warn |
| Dispatch-dedup-rate | Ratio of event=drive_watch.comment_deduped to total comments processed > 10% over 1h | warn (push/poll race tuning, not a bug) |
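To pin down the thresholds before any LogQL exists, three of the conditions can be written as plain predicates. This is only a sketch: WindowStats and the function names are hypothetical, and in practice the inputs would come from log queries, not from application code.

```kotlin
// Hypothetical per-window summary; the counts would be derived from the structured log events.
data class WindowStats(
    val pushReceived: Int,      // count of event=drive_watch.push_received
    val changesSeen: Int,       // sum of changes_seen across event=drive_watch.changes_processed
    val commentsDeduped: Int,   // count of event=drive_watch.comment_deduped
    val commentsProcessed: Int  // total comments processed
)

// No-event-while-active: no push arrived in the 6h window although reconciliation saw changes.
fun noEventWhileActive(sixHours: WindowStats): Boolean =
    sixHours.pushReceived == 0 && sixHours.changesSeen > 0

// Reconciliation-backlog: more than 50 changes in a single tick.
fun reconciliationBacklog(changesSeenInTick: Int): Boolean =
    changesSeenInTick > 50

// Dispatch-dedup-rate: deduped share strictly above the threshold (10% by default) over 1h.
// An empty window never fires, which avoids dividing by zero.
fun dedupRateTooHigh(oneHour: WindowStats, threshold: Double = 0.10): Boolean =
    oneHour.commentsProcessed > 0 &&
        oneHour.commentsDeduped.toDouble() / oneHour.commentsProcessed > threshold
```

Both numeric thresholds are strict: exactly 50 changes in a tick, or exactly 10% deduped, does not fire.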

Pre-conditions before this PR can land

  • Loki + Prometheus deployed in ns:internal-platform.
  • Grafana deployed with a datasource for each.
  • Phase 3.2a merged and deployed (so the events exist in extractable form).

What this PR will contain

A docs/internal/docs/observability/drive-watch-alerts.md doc + Helm values (or Grafana provisioning YAML) checked into infra/k8s/internal-platform/grafana/ (path TBD when Grafana lands). PagerDuty routing wires through Grafana's contact-point system; Slack uses a webhook.

Why deferred, not "do something simple now"

Two reasons:

  1. Half-measures invite false alerts. Without Loki, the only "alerting" we could do is shell scripts cron'd against kubectl logs. That's brittle, has no debounce, no de-dup, and produces weekend pages for transient blips.
  2. Phase 3.2a does the load-bearing work anyway. Once Loki lands, the alert rules are 5 LogQL queries against the structured fields — 1 day of work. Doing structured logging now buys 90% of the value.

3. Phase 2.4 — Legacy scope='file' row cleanup

Goal: deprecate per-file watches now that changes-mode covers everything.

Status: gated on a soak period for the new mode. Recommend 1-2 weeks of stable changes-mode operation before deletion. Earliest sensible date: 2026-05-22.

What exists today

A handful of legacy scope='file' rows for atlas — registered before Phase 2.1 landed. They continue to renew via the file-mode path of DriveWatchRenewalActivityImpl (the changes-mode rewrite preserved file-mode behaviour byte-for-byte). They cost one Drive API call per renewal cycle and one drive_watches row each — small but unnecessary.

What this PR will do

  1. Stop the underlying Drive channels for each legacy scope='file' row via drive.channels.stop() so Drive stops delivering notifications for those file-scoped channels. Without this step, Drive will keep firing webhooks at the dispatcher until the channel TTL expires (~24h after the last renewal).
  2. Delete the rows from drive_watches so the renewal cron stops touching them.
  3. Confirm via reconciliation that atlas still responds to comments on the previously-file-scoped files — they should be covered by the changes-mode watch.

Order matters

Delete-then-stop produces orphaned channels: Drive keeps notifying, and the dispatcher no longer has a matching row (the equivalent of a 404). Stop-then-delete is the correct order. The cleanup activity must call stopWatchChannel first and only delete the row on success.
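The ordering rule can be sketched as follows. Only the name stopWatchChannel comes from this document; LegacyWatch, DriveChannels, WatchRows, and cleanupLegacyWatches are hypothetical stand-ins, and the sketch assumes stopWatchChannel signals failure by throwing.

```kotlin
// Hypothetical row shape: Drive's channels.stop needs both the channel id and the resource id.
data class LegacyWatch(val channelId: String, val resourceId: String)

// Hypothetical seams around the Drive client and the drive_watches table.
interface DriveChannels { fun stopWatchChannel(watch: LegacyWatch) } // assumed to throw on failure
interface WatchRows { fun delete(channelId: String) }

// Stops each channel first and deletes its row only if the stop succeeded.
// Returns the watches whose stop failed; their rows are left in place for a retry.
fun cleanupLegacyWatches(
    watches: List<LegacyWatch>,
    drive: DriveChannels,
    rows: WatchRows
): List<LegacyWatch> {
    val failed = mutableListOf<LegacyWatch>()
    for (w in watches) {
        try {
            drive.stopWatchChannel(w)
        } catch (e: Exception) {
            failed += w   // keep the row: the renewal cron still owns this channel
            continue
        }
        rows.delete(w.channelId)
    }
    return failed
}
```

Leaving the row in place on failure keeps the channel tracked, so a later run can retry the stop instead of leaving an orphan.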

Why gated, not done now

The risk is that we discover during the soak that changes-mode misses some change type that file-mode caught. That has not happened in the limited testing so far — but 2 weeks of production operation is cheap insurance for a one-way migration. If something does surface, the legacy rows are still doing their job.

Trigger checklist

Before scheduling this PR:

  • At least 14 days since 2026-05-08 (the Phase 3.1 schedule-create date).
  • Zero Failed runs on drive-reconciliation-cron in the schedule's ActionCounts history.
  • At least one comment on a previously-file-scoped file received an atlas reply via the changes-mode path (manual check).
  • No new event=drive_watch.changes_processed log lines with non-zero skipped_no_file_id in the past 7 days.

If any check fails, defer another week and re-run.
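The checklist can also be expressed as a small gate function, which makes the 14-day arithmetic explicit (2026-05-08 plus 14 days is 2026-05-22). SoakChecks and failedChecks are hypothetical names, and the inputs are assumed to be gathered by hand or by a script; nothing like this exists in the codebase today.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical snapshot of the four checklist inputs.
data class SoakChecks(
    val today: LocalDate,
    val failedReconciliationRuns: Int,      // Failed runs on drive-reconciliation-cron
    val changesModeReplyObserved: Boolean,  // manual check on a previously-file-scoped file
    val skippedNoFileIdLinesLast7Days: Int  // changes_processed lines with non-zero skipped_no_file_id
)

// The Phase 3.1 schedule-create date.
val SCHEDULE_CREATED: LocalDate = LocalDate.of(2026, 5, 8)

// Returns the checks that fail; an empty list means the cleanup PR can be scheduled.
fun failedChecks(c: SoakChecks): List<String> = buildList {
    if (ChronoUnit.DAYS.between(SCHEDULE_CREATED, c.today) < 14) add("soak shorter than 14 days")
    if (c.failedReconciliationRuns > 0) add("failed reconciliation runs")
    if (!c.changesModeReplyObserved) add("no changes-mode reply observed")
    if (c.skippedNoFileIdLinesLast7Days > 0) add("skipped_no_file_id seen in the past 7 days")
}
```

With all other checks passing, the gate still reports a short soak on 2026-05-21 and returns an empty list from 2026-05-22 onward.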

Cross-references

  • Original migration handover (incident context, durability invariants, architecture decisions): drive-watch-migration-handover-2026-05-06.md
  • End-to-end test loop runbook (mandatory before declaring any drive-watch change done; AI_RULES.md hard rule 14): drive-watch-test-loop.md
  • Schema: infra/migrations/astra/V015..V017
  • Code: internal/backend/.../agents/spec/temporal/Drive* and internal/dispatcher/.../webhooks/DriveWebhookHandler.kt
  • Operator runbook (post-merge schedule create): see Phase 3.1 section of the original handover (corrected by PR #120)