Drive watch — pending followups (2026-05-08)

The Drive watch migration (2026-05-06 handover) is complete. Phases 1–3.1 shipped between 2026-05-06 and 2026-05-08; the system maintains a drive.changes.watch per agent identity automatically, the dispatcher routes both push and reconciliation through the same activity, and a processed_comments dedup table keeps push and poll race-safe.

This document captures the three remaining followups. Each is independently shippable; none blocks the others.

TL;DR

#	Item	Status	Trigger
1	Phase 3.2a — log-shape audit	Doable now	Whenever stable `event=` field names become useful (e.g. before Month-2 Grafana).
2	Phase 3.2b — Grafana alert rules	Blocked	Lands when Month-2 observability infra (Grafana + Prometheus + Loki) is rolled out.
3	Phase 2.4 — legacy `scope='file'` cleanup	Soak-period gated	After 1-2 weeks of stable changes-mode operation; 2026-05-22 at earliest.

1. Phase 3.2a — Structured-log audit

Goal: standardise the log-line shape across all Drive-watch code paths so future Grafana queries (Phase 3.2b) read against stable, machine-extractable fields.

What exists today

The Phase 3.1 activity log line already carries the right counters as key=value pairs, but behind a free-text prefix and without a stable event= discriminator:

drive-changes-process: channel=52df69d5-… changes_seen=0 candidate_files=0
  comments_seen=0 comments_deduped=0 dispatched=0 skipped_removed=0
  skipped_no_file_id=0 skipped_no_new_comments=0 advanced_to=338 chain_terminated=true

The renewal cron, bootstrap, reconciliation, and dispatcher webhook handler are in the same state: each has its own ad-hoc shape.

What's missing

Three things, in priority order:

1. A stable event=drive_watch.<name> discriminator on every log line. Names from the durability invariants:
   - event=drive_watch.bootstrap_skipped / bootstrap_created / bootstrap_failed
   - event=drive_watch.renewed (with mode=file or mode=changes field)
   - event=drive_watch.push_received
   - event=drive_watch.reconciliation_tick
   - event=drive_watch.changes_processed
   - event=drive_watch.comment_deduped
   - event=drive_watch.workflow_dispatched
2. Standard fields across all events: agent=, agent_drive_email=, channel_id=, file_id= (where applicable), comment_id= (where applicable), latency_ms= (where applicable). Not all events need all fields, but when an event has the data, it goes in the same field name.
3. JSON encoding wrapper (optional but nice): wraps the field bag in a single JSON object so Loki's json parser extracts everything in one pass without per-line regex.
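
The three items above could be sketched as a small formatter. This is a minimal illustration under assumptions, not the project's logging code: DriveWatchEvent, toLogfmt, toJson, and jsonString are hypothetical names, and the hand-rolled JSON escaping stands in for whatever JSON library the codebase already uses.

```kotlin
// Hypothetical sketch: one event shape shared by every Drive-watch log line.
// All names here are illustrative assumptions, not existing project code.
data class DriveWatchEvent(
    val name: String,                        // e.g. "changes_processed"
    val fields: Map<String, Any?> = emptyMap(),
)

// Null-valued fields are dropped, so "where applicable" fields simply vanish.
fun DriveWatchEvent.present(): List<Pair<String, Any>> =
    fields.entries.mapNotNull { (k, v) -> v?.let { k to it } }

// key=value rendering, always led by the event= discriminator.
fun DriveWatchEvent.toLogfmt(): String =
    (listOf("event" to "drive_watch.$name") + present())
        .joinToString(" ") { (k, v) -> "$k=$v" }

// Minimal JSON string escaping; a real implementation would use a JSON library.
fun jsonString(s: String): String = buildString {
    append('"')
    for (c in s) when (c) {
        '"' -> append("\\\"")
        '\\' -> append("\\\\")
        '\n' -> append("\\n")
        else -> if (c < ' ') append("\\u%04x".format(c.code)) else append(c)
    }
    append('"')
}

// Optional JSON wrapper: the same field bag as a single JSON object.
fun DriveWatchEvent.toJson(): String =
    (listOf("event" to "drive_watch.$name") + present())
        .joinToString(",", "{", "}") { (k, v) ->
            val rendered = when (v) {
                is Number, is Boolean -> v.toString()   // emitted unquoted
                else -> jsonString(v.toString())
            }
            "${jsonString(k)}:$rendered"
        }
```

With fields agent=atlas, changes_seen=0, and a null file_id, toLogfmt() yields `event=drive_watch.changes_processed agent=atlas changes_seen=0`. Keeping both renderers over one field bag means the JSON wrapper can be switched on later without touching call sites.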

Files to touch

internal/backend/.../shared/clients/GoogleDriveClient.kt (watch-create / watch-stop log lines)
internal/backend/.../agents/spec/DriveWatchBootstrap.kt
internal/backend/.../agents/spec/temporal/DriveWatchRenewalActivityImpl.kt
internal/backend/.../agents/spec/temporal/DriveChangesProcessActivityImpl.kt
internal/backend/.../agents/spec/temporal/DriveReconciliationActivityImpl.kt
internal/dispatcher/.../webhooks/DriveWebhookHandler.kt

Out of scope

Adding new metrics (Prometheus counters/gauges) — that's Phase 3.2b territory.
Changing the content of any log line — only its shape. Counters stay.
Changing log levels.

Why doable now (no infra dependency)

The log-shape change writes to whatever stdout collector is in place: today's kubectl logs, tomorrow's Loki ingester. Standardising the shape ahead of Loki's arrival means alert rules become a one-PR job once Loki is live, instead of an "audit + standardise + wire" job.

2. Phase 3.2b — Grafana alert rules

Goal: every "should be doing X but isn't" condition emits a signal to PagerDuty (page) or Slack (warn).

Status: blocked on Month-2 observability infra rollout. Per infra/.context/INTERNAL_PLATFORM.md, Grafana, Prometheus, and Loki are listed as planned (Month 2) in ns:internal-platform. Until they are deployed, alert rules have nothing to read against.

Alert set (unchanged from the original handover)

Alert	Condition	Severity
No-event-while-active	No `event=drive_watch.push_received` in 6h AND `event=drive_watch.changes_processed` shows `changes_seen>0` in same window	page
Watch-near-expiry-no-renewal	Most-recent `event=drive_watch.renewed` for an agent identity older than 12h AND `expiration_time < now+6h` (DB query)	page
Reconciliation-backlog	`event=drive_watch.changes_processed` with `changes_seen>50` in a single tick	warn
Bootstrap-failed	`event=drive_watch.bootstrap_failed` on worker start	warn
Dispatch-dedup-rate	Ratio of `event=drive_watch.comment_deduped` to total comments processed > 10% over 1h	warn (push/poll race tuning, not a bug)
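
The real rules will be LogQL queries once Loki exists, but the boolean logic and thresholds of three of the conditions above can be pinned down independently. The sketch below is only that: the function names are hypothetical, and it assumes "total comments processed" includes the deduped ones.

```kotlin
import java.time.Duration
import java.time.Instant

// Watch-near-expiry-no-renewal: the most recent renewal is older than 12h
// AND the watch expires less than 6h from now.
fun watchNearExpiryNoRenewal(lastRenewed: Instant, expiration: Instant, now: Instant): Boolean =
    lastRenewed.isBefore(now.minus(Duration.ofHours(12))) &&
        expiration.isBefore(now.plus(Duration.ofHours(6)))

// Dispatch-dedup-rate: deduped / total > 10% over the window.
// Assumption: `total` counts every comment seen, deduped ones included.
// An empty window never alerts.
fun dedupRateExceeded(deduped: Long, total: Long): Boolean =
    total > 0 && deduped.toDouble() / total > 0.10

// Reconciliation-backlog: more than 50 changes seen in a single tick.
fun reconciliationBacklog(changesSeen: Int): Boolean = changesSeen > 50
```

Both thresholds are strict inequalities, matching the table: exactly 10% deduped or exactly 50 changes does not fire.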

Pre-conditions before this PR can land

Loki + Prometheus deployed in ns:internal-platform.
Grafana deployed with a datasource for each.
Phase 3.2a merged and deployed (so the events exist in extractable form).

What this PR will contain

A docs/internal/docs/observability/drive-watch-alerts.md doc + Helm values (or Grafana provisioning YAML) checked into infra/k8s/internal-platform/grafana/ (path TBD when Grafana lands). PagerDuty routing wires through Grafana's contact-point system; Slack uses a webhook.

Why deferred, not "do something simple now"

Two reasons:

Half-measures invite false alerts. Without Loki, the only "alerting" we could do is shell scripts cron'd against kubectl logs. That's brittle, has no debounce, no de-dup, and produces weekend pages for transient blips.
Phase 3.2a does the load-bearing work anyway. Once Loki lands, the alert rules are 5 LogQL queries against the structured fields — 1 day of work. Doing structured logging now buys 90% of the value.

3. Phase 2.4 — Legacy `scope='file'` row cleanup

Goal: deprecate per-file watches now that changes-mode covers everything.

Status: gated on a soak period for the new mode. Recommend 1-2 weeks of stable changes-mode operation before deletion. Earliest sensible date: 2026-05-22.

What exists today

A handful of legacy scope='file' rows for atlas — registered before Phase 2.1 landed. They continue to renew via the file-mode path of DriveWatchRenewalActivityImpl (the changes-mode rewrite preserved file-mode behaviour byte-for-byte). They cost one Drive API call per renewal cycle and one drive_watches row each — small but unnecessary.

What this PR will do

1. Stop the underlying Drive channels for each legacy scope='file' row via drive.channels.stop() so Drive stops delivering notifications for those file-scoped channels. Without this step, Drive will keep firing webhooks at the dispatcher until the channel TTL expires (~24h after the last renewal).
2. Delete the rows from drive_watches so the renewal cron stops touching them.
3. Confirm via reconciliation that atlas still responds to comments on the previously-file-scoped files; they should be covered by the changes-mode watch.

Order matters

Delete-then-stop produces orphaned channels: Drive keeps notifying, and the dispatcher can only answer with a 404-equivalent. Stop-then-delete is the correct order. The cleanup activity must call stopWatchChannel first and only delete the row on success.
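
A minimal sketch of the stop-then-delete ordering follows. Only the stopWatchChannel name comes from this document; its signature, the LegacyWatch type, and both interfaces are assumptions made for illustration, and the real client and repository may look different.

```kotlin
// Hypothetical row shape; the actual drive_watches columns may differ.
data class LegacyWatch(val channelId: String, val resourceId: String)

// Assumed contracts: stopWatchChannel throws if Drive rejects the stop call.
fun interface ChannelStopper { fun stopWatchChannel(watch: LegacyWatch) }
fun interface WatchRowDeleter { fun delete(watch: LegacyWatch) }

// Returns the watches whose channel could not be stopped. Their rows are
// kept, so the renewal cron still owns them and a later run can retry.
fun cleanupLegacyWatches(
    watches: List<LegacyWatch>,
    stopper: ChannelStopper,
    deleter: WatchRowDeleter,
): List<LegacyWatch> {
    val failed = mutableListOf<LegacyWatch>()
    for (watch in watches) {
        try {
            stopper.stopWatchChannel(watch)   // 1. stop the Drive channel first
        } catch (e: Exception) {
            failed += watch                   // stop failed: do NOT delete the row
            continue
        }
        deleter.delete(watch)                 // 2. delete only after a successful stop
    }
    return failed
}
```

Processing each watch independently means one failed stop does not block the cleanup of the others, and a failed row stays exactly as it was before the run.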

Why gated, not done now

The risk is that we discover during the soak that changes-mode misses some change type that file-mode caught. That has not happened in the limited testing so far — but 2 weeks of production operation is cheap insurance for a one-way migration. If something does surface, the legacy rows are still doing their job.

Trigger checklist

Before scheduling this PR:

- At least 14 days since 2026-05-08 (the Phase 3.1 schedule-create date).
- Zero Failed runs on drive-reconciliation-cron in the schedule's ActionCounts history.
- At least one comment on a previously-file-scoped file received an atlas reply via the changes-mode path (manual check).
- No new event=drive_watch.changes_processed log lines with non-zero skipped_no_file_id in the past 7 days.

If any check fails, defer another week and re-run.
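
The checklist can also be read as a single gate. The sketch below is illustrative only: SoakStatus and failingChecks are hypothetical names, and every input would in practice come from a manual check or a query rather than from code.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical snapshot of the four checklist inputs.
data class SoakStatus(
    val today: LocalDate,
    val failedReconciliationRuns: Int,       // Failed runs on drive-reconciliation-cron
    val changesModeReplyObserved: Boolean,   // manual check
    val skippedNoFileIdLast7Days: Int,       // summed from changes_processed log lines
)

// Phase 3.1 schedule-create date, as stated in the checklist.
val SOAK_START: LocalDate = LocalDate.of(2026, 5, 8)

// Returns one entry per failing check; an empty list means the PR can be scheduled.
fun failingChecks(s: SoakStatus): List<String> = buildList<String> {
    if (ChronoUnit.DAYS.between(SOAK_START, s.today) < 14) add("soak shorter than 14 days")
    if (s.failedReconciliationRuns > 0) add("failed drive-reconciliation-cron runs")
    if (!s.changesModeReplyObserved) add("no changes-mode reply on a previously-file-scoped file")
    if (s.skippedNoFileIdLast7Days > 0) add("non-zero skipped_no_file_id in the past 7 days")
}
```

With SOAK_START at 2026-05-08, the 14-day condition is first satisfied on 2026-05-22.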

Cross-references

Original migration handover (incident context, durability invariants, architecture decisions): drive-watch-migration-handover-2026-05-06.md
End-to-end test loop runbook (mandatory before declaring any drive-watch change done; AI_RULES.md hard rule 14): drive-watch-test-loop.md
Schema: infra/migrations/astra/V015..V017
Code: internal/backend/.../agents/spec/temporal/Drive* and internal/dispatcher/.../webhooks/DriveWebhookHandler.kt
Operator runbook (post-merge schedule create): see Phase 3.1 section of the original handover (corrected by PR #120)
