Spec agent v0.1 — handover (2026-05-01)

TL;DR

The spec agent (atlas) runs end-to-end on production infrastructure. A smoke run completes in ~60 seconds, processes ~150K input tokens, calls real tools against the cloned codebase, and produces a publication-quality summary persisted to task_events.metadata.output_payload. The full plumbing chain (Bedrock LLM → 51 tools → Astra HTTP API → Postgres → Temporal workflow) is operational.

The next milestone is replacing the manual smoke trigger with a real event source — webhooks from Google Docs (and Slack / GitHub / Plane) flowing through the dispatcher service, into Temporal, into the spec agent, ending with the agent posting resolutions back to the originating doc.

Status snapshot

Component	Status	Notes
`astra-backend` (Ktor, port 8081)	✅ Live	Image: `aucertacr41e0x5.azurecr.io/astra-backend:latest`
`astra-frontend` (Next.js console)	✅ Live	Image: `aucertacr41e0x5.azurecr.io/astra-frontend:latest`
`astra-proxy` (nginx)	✅ Live	Routes `/api/` → backend, `/` → frontend
`dispatcher` (Ktor, webhook → Temporal starter)	✅ Live	43h+ uptime; manual + webhook paths exercised end-to-end through public Cloudflare tunnel (verified 2026-05-01); `/health` + `/webhooks/*` reachable, `/api/tasks` gated by CF Access
`spec-agent-worker` (Temporal worker)	✅ Live	Polls `spec-agent-queue`; ran iter 20 to completion
Temporal cluster (`temporal.aucert.dev`)	✅ Live	UI at `https://temporal.aucert.dev`
Plane (project tracker)	✅ Live	Various `plane-*` deployments
Cloudflare tunnel (`cloudflared`)	✅ Live	Public ingress for Astra, Temporal, Plane, dispatcher
`astra_db` schema (Postgres)	✅ Migrated	13 Flyway migrations applied (V001–V013)
Bedrock model access	✅ Granted	Sonnet 4.6, Opus 4.7, Kimi K2.5, GLM 4.7
AWS IAM admin user	✅ Created	`vivek.soneja` with `AdministratorAccess`, MFA, CLI keys

What's working today

Spec agent end-to-end

Atlas can be invoked via the smoke script and will:

Bootstrap — clone the repo into /workspace/aucert using a freshly-minted GitHub App installation token, register all 51 tools, fetch agent metadata + composable personalities from astra_db, build the system prompt.
Pick the right model per operation — Sonnet 4.6 (us.anthropic.claude-sonnet-4-6) for conversational tasks, Opus 4.7 (us.anthropic.claude-opus-4-7) for spec_finalize / spec_generate / spec_synthesize. Per-operation routing is in SpecAgentConfig.resolveModel(operation).
Tag every response with a model label — [S46], [O47], [K26] (and any future label registered in ModelLabels.kt) at the start of every assistant turn, enforced via system-prompt instruction with code-level fallback in doneOutputPayload. Operators can request a specific model via tags in Drive comments — [kimi], [opus], [opus-direct], multi-tag [kimi][opus] for parallel runs. Full reference: Model routing and operator labels.
Dispatch tools through the registry — local repo (read-only filesystem ops), GitHub, Slack, Plane, Google Docs/Drive, Brave web search, knowledge base vector search, embeddings.
Persist lifecycle to Astra — fetchTaskContext → updateTaskStatus(IN_PROGRESS) → logTaskEvent("agent_started") → run loop → completeTaskRun(outcome="success") → logTaskEvent("agent_completed", { output_payload, model_used, ... }).
Be Temporal-retry-safe — failures during the activity body do not push the row to a terminal state; Temporal can retry the activity any number of times without state-machine 422s.
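
The per-operation routing described above can be sketched as a simple lookup. This is a minimal shell illustration, not the Kotlin in SpecAgentConfig.resolveModel(operation): the function name `resolve_model` is hypothetical, and the fall-through of every unlisted operation to Sonnet is an assumption based on the "conversational tasks" wording. The model IDs and operation names are the ones listed above.

```shell
# Sketch of per-operation model routing. Mirrors the documented behaviour only;
# the real logic lives in SpecAgentConfig.resolveModel(operation).
resolve_model() {
  case "$1" in
    spec_finalize|spec_generate|spec_synthesize)
      # Heavyweight spec operations are routed to Opus 4.7.
      echo "us.anthropic.claude-opus-4-7" ;;
    *)
      # Assumption: anything else is conversational and goes to Sonnet 4.6.
      echo "us.anthropic.claude-sonnet-4-6" ;;
  esac
}

resolve_model spec_finalize   # prints us.anthropic.claude-opus-4-7
resolve_model chat            # prints us.anthropic.claude-sonnet-4-6
```

A helper like this can feed the `--model-id` argument of the `aws bedrock-runtime converse` check in the runbook when confirming that the model for a given operation is callable.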

Iter 20 verification

Smoke iter 20 produced this row in astra_db.task_runs:

id          = 028a4f80-648a-4f16-bd6b-2fa171d97bce
outcome     = success
duration_ms = 59,587
tokens_input  = 147,398
tokens_output = 2,590
output_type = inline

And atlas's actual output (in task_events.metadata.output_payload for the agent_completed event) was a publication-quality 12-spec summary including a real bug it caught — LocalRepoClient.listAdrs() scans docs/adrs/ instead of the actual .context/decisions/. Spawn-task chip queued for the fix.

The 8-PR journey

The smoke loop took 20 iterations and 8 PRs to close. Each PR fixed one structural bug surfaced by the previous iteration:

PR	Layer	Triggered by
#52	Server-side `GET /api/internal/astra/v1/task-runs/{id}/context` route was missing	iter 13
#53	`updateTaskStatus` URL/body/enum mismatched server contract	iter 14
#54	`AstraClient` was hitting `/api/...` not `/api/internal/astra/v1/...`	iter 15
#55	`task_events` table was never created (V012 migration)	iter 16
#56	Activity body marked terminal state, breaking Temporal retries	iter 17
#57	Bedrock model IDs wrong format + per-operation routing + `[S46]`/`[O47]` labels	iter 17
#58	`completeTaskRun` POST→PATCH + body shape	iter 18
#59	Bedrock SDK socket/attempt timeouts too tight for 88K-token calls	iter 19

Next milestone — end-to-end webhook flow

The smoke script is a manual trigger that bypasses the dispatcher. The real flow is:

Google Docs comment
   ↓ (webhook)
dispatcher.aucert.dev
   ↓ (Temporal workflow start)
SpecAgentWorkflow on spec-agent-queue
   ↓ (activity dispatch)
spec-agent-worker → SpecAgentExecutor → AgentLoop → Bedrock
   ↓ (tool calls)
GoogleDocsClient.postComment / GoogleDocsClient.replyToThread
   ↓
Comment resolution visible in the originating Google Doc

All four webhook handlers (Slack/Drive/GitHub/Plane) are shipped in internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/routing/WebhookRoutes.kt. SignatureValidator does HMAC-SHA256 for Slack and GitHub. The public webhook → DB → Temporal → agent → outcome chain was verified end-to-end on 2026-05-01: a probe POST to https://dispatcher.aucert.dev/webhooks/drive (empty body, no headers) produced a real workflow that ran to outcome=success in 19s (30K input tokens, 647 output, ~$0.10 Bedrock spend). Manual trigger endpoint is POST /api/tasks (auth: X-API-Key header); exercised 6+ times with 201 responses in the same window.

SPEC-021 documents the design (Wave 5: W5-A standalone Ktor service, W5-B webhook handlers for Slack/Google Drive/GitHub/Plane, W5-C Dockerfile + manifests + tunnel config).

Three known dispatcher issues discovered during the probe run (hardening pass pending — #62):

WebhookRoutes drops the classifier's initialMessage from trigger_event; the agent currently sees only raw headers/body rather than a synthesised description of the event.
DriveWebhookHandler dispatches on empty/missing resource headers — the empty-body probe accidentally created a real workflow.
DispatcherConfig treats empty-string K8s secrets as "validation enabled" — SLACK_SIGNING_SECRET and GITHUB_WEBHOOK_SECRET are currently 0-byte placeholders, so HMAC validation fails for those sources.
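
Until the third issue is fixed, it may help to check from the operator side whether a secret is actually populated. The sketch below classifies a value read from stdin; the function name `secret_state` is hypothetical, and treating whitespace-only values as empty is an assumption of this sketch rather than documented dispatcher behaviour.

```shell
# Classify a secret value read from stdin as "empty" or "populated".
# Whitespace-only values count as empty: a placeholder containing just a
# newline is no more useful than a 0-byte one.
secret_state() {
  bytes=$(tr -d '[:space:]' | wc -c | tr -d ' ')
  if [ "$bytes" -eq 0 ]; then
    echo "empty"
  else
    echo "populated"
  fi
}

# Local demonstration with literal values:
printf '' | secret_state          # prints empty
printf 's3cr3t' | secret_state    # prints populated
```

In the cluster, the same function could be fed from the existing secret, for example `kubectl get secret dispatcher-secrets -n internal-platform -o jsonpath='{.data.SLACK_SIGNING_SECRET}' | base64 -d | secret_state`.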

What needs to happen, in order

1. Webhook source configuration (per source, ~30 min each)

Each source needs to be told to POST to https://dispatcher.aucert.dev/webhooks/<source> with appropriate authentication. Handler code is already deployed; this is external configuration only.

Google Docs / Drive:

Create a Drive changes.watch subscription targeting the docs you want monitored (or set up at the folder level for a topic-folder).
Subscription POSTs to https://dispatcher.aucert.dev/webhooks/drive (the path exercised by the 2026-05-01 probe) with X-Goog-Resource-State and X-Goog-Channel-Token headers.
Drive subscriptions expire (max 7 days for change watches; the per-file watch in the 2026-04-30 incident expired after 24 hours). Renewal is handled by the DriveWatchRenewalWorkflow Temporal cron described under "Drive watch renewal (post-merge operator setup)"; renew manually only until that schedule has been started.
Reference: SPEC-021 §W5-B for handler details.
Prerequisite: apply the dispatcher hardening PR first so empty-header probe traffic no longer triggers real workflows.
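
The watch subscription in the first step is created by POSTing a channel description to the Drive API. The sketch below only builds that JSON body. The field names (`id`, `type`, `address`, `token`, `expiration`) are the Drive API channel fields; the function name `drive_watch_body` and all example values are hypothetical, and the values are interpolated without JSON escaping, so they must not contain quotes or backslashes.

```shell
# Build the JSON body for a Drive watch request.
drive_watch_body() {
  channel_id=$1      # unique per channel, e.g. a UUID
  address=$2         # the dispatcher's Drive webhook URL
  token=$3           # echoed back by Drive in X-Goog-Channel-Token
  expiration_ms=$4   # requested expiry, milliseconds since the epoch
  printf '{"id":"%s","type":"web_hook","address":"%s","token":"%s","expiration":%s}\n' \
    "$channel_id" "$address" "$token" "$expiration_ms"
}

# Example with placeholder values:
drive_watch_body "example-channel-1" "https://dispatcher.example/webhooks/drive" "example-token" 1777777777000

# Sending it (not executed here; ACCESS_TOKEN and PAGE_TOKEN are assumed to be
# an OAuth access token and a token from changes.getStartPageToken):
#   curl -sS -X POST \
#     -H "Authorization: Bearer $ACCESS_TOKEN" \
#     -H 'Content-Type: application/json' \
#     -d "$(drive_watch_body ...)" \
#     "https://www.googleapis.com/drive/v3/changes/watch?pageToken=$PAGE_TOKEN"
```

Keeping the channel ID returned by this call makes later cleanup and renewal easier to trace.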

Slack:

Slack app → Event Subscriptions → Request URL: https://dispatcher.aucert.dev/webhooks/slack. Slack POSTs a URL verification challenge first; the handler echoes challenge back.
Subscribe to app_mention, message.channels, message.im, reaction_added (for :atlas: reactions as a trigger).
Slack signs every webhook with HMAC-SHA256 using your signing secret. Handler validates X-Slack-Signature against v0:<timestamp>:<body>.
Prerequisite: populate SLACK_SIGNING_SECRET in dispatcher-secrets (currently a 0-byte placeholder).
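
Once the signing secret is populated, a signed probe can be built by hand to confirm the handler accepts it. The sketch below computes the signature over the `v0:<timestamp>:<body>` base string described above. The function name `slack_signature` and the example values are hypothetical; it assumes the `openssl` CLI is available, and note that passing a secret as a command-line argument exposes it to other local users via the process list.

```shell
# Compute a Slack-style signature: HMAC-SHA256 over "v0:<timestamp>:<body>",
# hex-encoded and prefixed with "v0=".
slack_signature() {
  secret=$1; timestamp=$2; body=$3
  digest=$(printf 'v0:%s:%s' "$timestamp" "$body" \
    | openssl dgst -sha256 -hmac "$secret" \
    | awk '{ print $NF }')   # keep only the hex digest, drop the "(stdin)= " label
  echo "v0=$digest"
}

# Example with placeholder values:
slack_signature "example-signing-secret" "1700000000" '{"type":"event_callback"}'
```

The result goes in the X-Slack-Signature header, with the same timestamp sent in X-Slack-Request-Timestamp and the body sent byte-for-byte as signed.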

GitHub:

For each watched repo: Settings → Webhooks → Add webhook → Payload URL https://dispatcher.aucert.dev/webhooks/github, secret = HMAC key, events = pull_request, pull_request_review_comment, issue_comment, issues.
The GitHub App you already provisioned can also push events org-wide via App-level webhooks if you want one-time setup vs per-repo.
Validate X-Hub-Signature-256 against the secret.
Prerequisite: populate GITHUB_WEBHOOK_SECRET in dispatcher-secrets (currently a 0-byte placeholder).
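
The X-Hub-Signature-256 check can be sketched as follows, mainly to make the empty-secret prerequisite concrete. The function names are hypothetical, the `openssl` CLI is assumed, and this is not the dispatcher's SignatureValidator. A plain string comparison is used here for readability; a production validator would normally use a constant-time comparison.

```shell
# Expected header value: "sha256=" followed by the hex HMAC-SHA256 of the raw body.
github_signature() {
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{ print "sha256=" $NF }'
}

# Succeeds only when the header matches the signature computed from the body.
verify_github_signature() {
  secret=$1; body=$2; header=$3
  # A blank secret cannot authenticate anything, so reject it outright.
  [ -n "$secret" ] || return 1
  [ "$(github_signature "$secret" "$body")" = "$header" ]
}

good=$(github_signature "example-secret" '{"action":"opened"}')
verify_github_signature "example-secret" '{"action":"opened"}' "$good" && echo "signature ok"
verify_github_signature "example-secret" '{"action":"closed"}' "$good" || echo "signature mismatch"
```

Rejecting a blank secret up front is the design choice worth noting: it makes a 0-byte placeholder fail loudly instead of being treated as a valid key.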

Plane:

Plane workspace settings → webhooks → URL https://dispatcher.aucert.dev/webhooks/plane, events = issue created/updated/commented.

2. Verify dispatcher → Temporal hop

Verified as of 2026-05-01: manual trigger exercised 6+ times with 201 responses, each kicking a real SpecAgentWorkflow on spec-agent-queue. A Drive probe confirmed the full DB-persistence and workflow-start path.

To re-run manually from inside the cluster:

# From inside the cluster (Cloudflare Access bypass):
kubectl run curl-test --rm -i --restart=Never -n internal-platform --image=curlimages/curl:8.11.1 -- \
  curl -sS -X POST http://dispatcher.internal-platform.svc.cluster.local/api/tasks \
    -H "X-API-Key: $(kubectl get secret dispatcher-secrets -n internal-platform -o jsonpath='{.data.API_KEY}' | base64 -d)" \
    -H 'Content-Type: application/json' \
    -d '{"message":"Manual dispatcher → workflow test"}'

Note: hitting https://dispatcher.aucert.dev/api/tasks from outside the cluster requires a Cloudflare Access service token — not yet configured. The /webhooks/* paths are CF Access bypassed for HMAC-authenticated callers.

3. End-to-end Google Docs round trip

Once the dispatcher is verified and Google Drive webhooks are configured:

Setup (one-time):

Pick a test doc — e.g. a draft spec in your Drive.
Create a Drive changes.watch subscription targeting that doc (or its parent folder). Save the subscription ID for cleanup.
Confirm the subscription is active. Drive has no direct analogue of aws sts get-caller-identity; the closest check is the watch call itself: a 200 response from https://www.googleapis.com/drive/v3/changes/watch means the channel was created.

Test:

Open the test doc in Google Docs as a real user.
Add a comment that mentions atlas — e.g. "@atlas can you summarise the open questions in this doc?"
Within seconds, the dispatcher should receive the webhook. Verify in dispatcher logs.
Within ~30 seconds, the workflow should be visible in Temporal UI under aucert-default namespace, ID matching the doc + comment ID.
Within ~60–120 seconds, atlas should:
- Read the doc via GoogleDocsClient.
- Generate a response keyed to the comment.
- Post the response as a reply on the comment via the GoogleDocsClient reply tool (the flow diagram above names it replyToThread; confirm the exact tool name in agents/spec/tools/).
- Mark task_runs.outcome = success.

What to look for in the doc:

A new comment reply from the atlas service account (verify the agent-email token in vault matches the service account writing comments).
The reply should start with [S46] (model label).
The reply text should be relevant to the comment, not a generic "I read the doc" response.
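
The label check can be scripted when reviewing replies or stored output. The sketch below tests whether a piece of text starts with one of the labels named in this document. The function name is hypothetical and the label list is hard-coded as an assumption; the real registry lives in ModelLabels.kt.

```shell
# Succeed when the text starts with a known model label.
starts_with_model_label() {
  case "$1" in
    "[S46]"*|"[O47]"*|"[K26]"*) return 0 ;;   # quoted so brackets match literally
    *) return 1 ;;
  esac
}

starts_with_model_label "[S46] Here are the open questions." && echo "label present"
starts_with_model_label "Here are the open questions." || echo "label missing"
```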

4. Resolve flow

The harder problem: atlas should be able to resolve a comment thread when the user signals they're satisfied (e.g. "thanks atlas, resolved" or a 👍 reaction). This requires:

A separate webhook event when a comment is resolved (Drive emits comment.resolved).
Dispatcher routes to a resolve_comment operation.
Atlas reads the thread, optionally posts a final summary, and calls GoogleDocsClient.resolveComment(commentId).

This is genuinely new agent behavior — not just "respond to comments" but "decide when a comment is fully addressed." Worth scoping carefully:

v0.1: explicit resolve trigger only (user reacts with ✅ or types /atlas resolve).
v0.2: atlas decides autonomously based on conversation state.

Recommend starting with v0.1 (explicit trigger).

Outstanding cleanup tasks

In rough priority order:

High value, small scope

Task	Why	Where
Dispatcher webhook hardening (`initialMessage` in `trigger_event`, Drive null-resourceId guard, empty-string secret foot-gun)	Probe traffic exposed three small issues blocking clean Drive E2E. #62	`internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/{routing,webhooks,config}/`
Fix `LocalRepoClient.listAdrs()` ADR path	Atlas caught a real bug — points at `docs/adrs/` but actual is `.context/decisions/`. Already spawned as a separate task.	`internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/clients/LocalRepoClient.kt`
Promote SPEC-017 to `approved/`	Atlas observed it's fully implemented but still in `drafts/`. Mechanical move.	`docs/specs/drafts/SPEC-017-...md` → `docs/specs/approved/`
Update SPEC-010 / SPEC-020 stale Google notes	Both have amendment notes about service-account auth superseded by ADR-014 (OAuth refresh tokens). Should be reconciled.	`docs/specs/drafts/SPEC-010...md`, `SPEC-020-...md`
Workflow-level final-failure marking	Today, when Temporal exhausts retries, the row stays in `running` forever. The workflow should mark `outcome = failure` after final retry.	`internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/temporal/SpecAgentWorkflowImpl.kt`

Quota / capacity

Task	Why
Request Bedrock Opus 4.7 quota increase	Iter 19 + 20 confirmed Sonnet 4.6 works fine on the default quota. Opus 4.7 throttled in earlier tests. Console → Service Quotas → AWS Bedrock → "Cross-region model invocation tokens per minute for Claude Opus 4.7" → request increase. Free, usually approved within hours.
~~Renewal job for Drive `changes.watch`~~	Shipped in PR #XX — see "Drive watch renewal" section below.

Deferred from PRs (not blocking)

Task	Deferred from
Per-call timeout override via `RuntimeConfig.timeoutMs`	#59 — today every Bedrock call gets the generous 5-min budget, regardless of expected duration
Streaming responses (`InvokeModelWithResponseStream` / `Converse Stream`)	#59 — would let us start processing tokens as they arrive instead of waiting for the full response
Widen `task_runs.output_ref` from `varchar(500)` to TEXT or jsonb	#58 — today the actual output lives in `task_events.metadata.output_payload` to avoid truncation
Per-token cost computation (`estimated_cost` column on `task_runs`)	#58 — reporting layer can backfill from `tokens_input * input_rate + tokens_output * output_rate` using `model_used` from `agent_completed` event
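
The backfill formula in the last row can be sketched as below. The function name is hypothetical, and the rates (USD per million tokens) are placeholders for illustration, not actual prices; a real rate card would be keyed by `model_used`.

```shell
# Estimate spend for one task run: (tokens_in * rate_in + tokens_out * rate_out),
# with rates expressed per million tokens.
estimate_cost() {
  LC_ALL=C awk -v tin="$1" -v tout="$2" -v rin="$3" -v rout="$4" \
    'BEGIN { printf "%.4f\n", (tin * rin + tout * rout) / 1000000 }'
}

# Iter 20 token counts with placeholder rates of 3 (input) and 15 (output):
estimate_cost 147398 2590 3 15   # prints 0.4810
```

LC_ALL=C is set so the decimal separator stays a dot regardless of the operator's locale.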

Future agents (Kimi / GLM)

Update 2026-05-06. Kimi shipped, but via a different path than the one originally sketched here. We did not add Kimi via Bedrock Converse — instead, we deployed Kimi K2.6 on Azure AI Foundry and wrote a new FoundryOpenAIAdapter (OpenAI-compatible chat completions) alongside the existing BedrockAdapter. Atlas now invokes Kimi via the [kimi] operator tag in Google Docs comments, and the route is validated end-to-end on a real Drive comment thread. See Model routing and operator labels for the full operator + developer reference, and ADR-011's 2026-05-06 amendment for the architectural framing.

GLM is still future. When it lands, the same one-line ModelRegistry + ModelLabels pattern applies — adapter choice depends on whether GLM is hosted on Bedrock (extend BedrockAdapter with a Converse-API variant), Foundry (reuse FoundryOpenAIAdapter), or somewhere else (new adapter).

The original analysis below is preserved for context but the conclusion has shifted: the simpler refactor turned out to be a third adapter (FoundryOpenAIAdapter), not extending BedrockAdapter. The provider-agnostic interface paid for itself.

Both models were provisioned and verified callable on Bedrock (moonshotai.kimi-k2.5, zai.glm-4.7). When a future non-spec agent needs them:

Add a BedrockMessageTranslator variant for non-Anthropic body shapes (Anthropic Messages API doesn't apply to Kimi/GLM).

Or use the Bedrock Converse API instead of InvokeModel — Converse is provider-agnostic and would let the existing BedrockAdapter work for any model.

The simpler refactor is the second one. SPEC for whichever future agent comes first should pick the path.

Architecture decisions reaffirmed

The smoke loop took 8 PRs to close, which prompted a re-examination of the harness strategy decided on 2026-04-15. The conclusion is to keep the existing layered architecture unchanged, with two operational addenda below.

Why the existing decision still holds

Categorising the 8 PRs (#52–#59) honestly against the four-layer model:

PR	Layer the bug lived in	Would a framework have prevented it?
#52	L2 — server-side route handler	No (API design, framework-agnostic)
#53	L2 — client/server contract drift	No (HTTP shape mismatch)
#54	L2 — mount-path prefix mistake	No (path constant typo)
#55	L2 — missing migration	No (schema bug)
#56	L3/4 — activity body marked terminal state	No (Temporal idiom we had to learn)
#57	Provider integration — Bedrock model IDs	No (provider-specific config)
#58	L2 — same class as #52/#53	No
#59	Provider integration — Bedrock SDK timeouts	No (provider-specific tuning)

Zero of the 8 PRs were in Layer 1 (the loop). The loop is ~50 lines and worked correctly throughout. Pain was concentrated in API contracts, schema, and provider-specific configuration — none of which a different harness choice would have addressed. Adopting an off-the-shelf framework (Anthropic Agent SDK, LangChain4j, etc.) would have introduced a polyglot service boundary or framework coupling without removing any of the bugs we actually hit.

The original decision document (2026-04-15) should be promoted to a canonical ADR, with the two addenda below incorporated.

Addendum 1 — provider adapters need real-credential integration tests

Three Bedrock-specific quirks only surfaced in production, despite passing unit tests:

Bare model IDs (anthropic.claude-sonnet-4-6) reject InvokeModel — must use cross-region inference profile (us.anthropic.claude-sonnet-4-6).
Default AWS SDK socket timeout (~30s) is tight for large-context LLM calls; needed bumping to 5 min.
agent_tokens vault path versus Astra Token Vault path needed reconciliation for the AWS credentials.

Rule: every new provider adapter (Foundry, Moonshot, Zhipu, etc.) ships with a real-credential integration test that exercises a multi-tool, multi-step interaction end-to-end — not just translation correctness. The BedrockAdapterTest today only covers BedrockMessageTranslator (pure JSON in / JSON out); future adapters need an integration counterpart that runs against the actual provider with @Tag("integration").

Addendum 2 — dispatcher line-count is a real trigger

The harness doc characterises the dispatcher as "thin, ~500 lines." The dispatcher pod has 43h+ uptime; the full public webhook path was exercised end-to-end on 2026-05-01 and the claim held — the handler code is thin and the volume of source-specific logic is still well below the trigger threshold.

Trigger to revisit the build-vs-buy calculation for the dispatcher: if per-source handler complexity (Drive subscription renewal + Slack rate limits + GitHub event filtering + Plane payload variants) pushes the dispatcher past 1500 lines of source-specific code, the cost calculus shifts toward a managed alternative (Trigger.dev, n8n, or similar webhook-to-workflow platform). Until then, keep custom.

This trigger should be added to the "When to reconsider" section of the canonical ADR.
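
The line-count trigger is easy to check mechanically. The sketch below counts Kotlin lines under a directory and compares the total with a limit that defaults to 1500. Function names are hypothetical, and which directories count as "source-specific" (for example the dispatcher's webhooks package) is an assumption left to the caller.

```shell
# Count lines across all .kt files under a directory; prints 0 when none exist.
count_kotlin_lines() {
  find "$1" -type f -name '*.kt' -exec cat {} + | awk 'END { print NR }'
}

# Succeed when the directory holds more Kotlin lines than the limit.
over_threshold() {
  dir=$1; limit=${2:-1500}
  [ "$(count_kotlin_lines "$dir")" -gt "$limit" ]
}

# Demonstration against a throwaway directory:
demo=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo/SlackHandler.kt"
printf 'd\ne\n' > "$demo/GitHubHandler.kt"
printf 'ignored\n' > "$demo/notes.md"
count_kotlin_lines "$demo"                       # prints 5
over_threshold "$demo" || echo "under threshold"
rm -r "$demo"
```

Run periodically or in CI against the chosen dispatcher directories, this turns the revisit trigger into a visible signal rather than something to remember.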

Multi-tenancy considerations

Today everything is single-tenant Aucert-internal. The system was deliberately built that way for v0.1 — getting one tenant working end-to-end is hard enough. But several design choices will reverberate when we onboard external customers, and starting the conversation now (before v0.5+ build) is cheaper than retrofitting.

What's tenant-coupled today

Surface	Coupling	Hard to change later?
`astra_db` schemas (agents, task_runs, task_events, agent_tokens, personalities)	Single shared database; no `tenant_id` column anywhere	Medium — add column + backfill + scope every query
`specs_db`, `shared_kb_db`, `internal_shared_db`	Single instance per database	Same as above
Workspace clones (`/workspace/aucert`)	Hardcoded repo, single GitHub App installation	High — workspace bootstrap assumes one repo
`SystemPromptAssembler`	Reads "Aucert core context" — proprietary IP in the prompt	High — context leakage risk if reused unchanged for tenants
Bedrock IAM principal	One `bedrock-access-key` shared across all agent runs	Low — easy to scope per-tenant once IAM is parameterised
Astra service token	Single `DISPATCHER_SERVICE_TOKEN` from `astra-secrets`	Low — token broker can mint per-tenant tokens
Cloudflare tunnel routes	`astra.aucert.dev`, `dispatcher.aucert.dev`, `temporal.aucert.dev` (single tenant)	Medium — needs subdomain or path scheme per tenant
K8s namespace	Single `internal-platform` namespace for all workloads	Medium — per-tenant namespace gives isolation but multiplies operational cost
Worker pool concurrency	`spec-agent-worker` polls one `spec-agent-queue`	Medium — Temporal task queues per tenant are cheap; worker pool sizing per tenant is the real question

Open design questions

These should be decided (not necessarily built) before v0.5+ work begins:

Database isolation model. Three options, in increasing strength:
- Shared schema, tenant_id column on every row + row-level security policies. Cheapest, weakest. Risk: missed WHERE tenant_id = ? clause leaks data across tenants.
- Per-tenant Postgres schema (astra_db.tenant_<id>.task_runs). Stronger isolation, requires application to set search_path. Migrations applied per schema.
- Per-tenant Postgres database (astra_db_tenant_<id>). Strongest isolation, highest operational cost. Backup/restore scoped per tenant.
- Recommendation to pre-decide: start with the second option (per-tenant schema). Strong-enough isolation; reasonable migration story; doesn't multiply infra cost. Only escalate to per-database if a customer's compliance requires it.
Workspace clone strategy. The spec agent clones aucert/aucert on activity start. For a customer running atlas on their own private repo:
- Per-tenant GitHub App installation (each customer installs the App on their own org).
- Per-tenant workspace volume (different /workspace/<tenant>/ paths).
- Per-tenant clone-on-start IAM (no cross-tenant credential leakage).
- Open question: is the workspace shared across activities for the same tenant (cache-friendly, locking risk) or fresh per activity (slower, isolated)?
Prompt isolation. SystemPromptAssembler today reads the "Aucert core context" — a section that bleeds Aucert-specific framing ("the Aucert spec workflow", "atlas the spec agent", etc.) into the system prompt. For multi-tenant:
- Move the per-Aucert framing into a personality fragment or a tenant config that's swappable.
- The fixed-section "core directive" / "tool discipline" can stay genuinely tenant-agnostic.
- Personalities themselves are already keyed per-agent in astra_db.agent_personalities — extending to per-tenant is a small change.
Audit and compliance. task_runs is the canonical audit log. Per-tenant access controls + retention policies need to exist before any compliance-sensitive customer onboards. Today everything is in one table queryable by any IAM principal with astra:task_runs.view.
Cost attribution. tokens_input / tokens_output per task_run gives us the raw data, but allocating to a tenant (for invoicing or capacity planning) needs tenant_id on the row. The agent_completed task_events row already carries model_used so per-model + per-tenant rate cards become tractable once tenant_id lands.
Worker pool sizing. A noisy-neighbour tenant could starve atlas of capacity. Options:
- Per-tenant Temporal task queues with per-queue worker pools (operational cost: more pods).
- Single shared queue with rate limiting at the activity level (cheaper, weaker isolation).
- Recommendation to pre-decide: start with shared queue + activity-level rate limiting; escalate to per-tenant queues if a customer's load justifies dedicated capacity.

What to do now

Don't build any of this yet. But:

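Add a tenant_id column to the next schema migration that touches a tenant-coupled table, with a default that backfills existing rows to the Aucert tenant. If the column is typed UUID, the default must be a fixed, reserved Aucert tenant UUID; the literal 'aucert' is only valid if the column is TEXT.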
Wrap the "Aucert core context" section of SystemPromptAssembler in a config-driven fragment the next time that file is touched.
When the next agent role's worker pod is being designed, make the IAM principal scoped per-role (not shared with atlas), so per-tenant scoping is a small extension later rather than a refactor.
Capture this section's open design questions in a new SPEC (e.g. "SPEC-NNN: Multi-tenancy strategy for the agent platform") before the first paying customer is committed to a launch date. The SPEC's job is to settle at least the first two questions above (database isolation model and workspace clone strategy) with concrete tradeoffs.

The handover author's view: the existing single-tenant build is correct for where Aucert is today. None of the choices above need to be made tonight. But picking them in the right order — schema strategy first, workspace second, prompt isolation third — saves weeks compared to retrofitting under deadline pressure.

Operational runbook

Deploy the spec agent worker

./tools/scripts/deploy-spec-agent-worker.sh

Builds internal/backend/Dockerfile.spec-agent, pushes to ACR, restarts the deployment, waits for rollout. Tag defaults to git rev-parse --short HEAD.

Deploy astra-backend

CI handles this automatically on push to main when internal/backend/**, internal/frontend/**, infra/migrations/**, or infra/k8s/internal-platform/astra/** change. Workflow: .github/workflows/deploy-astra.yml.

Manual deploy:

./tools/scripts/astra-deploy.sh                    # full deploy
./tools/scripts/astra-deploy.sh --backend-only     # backend only
./tools/scripts/astra-deploy.sh --skip-migrations  # skip Flyway

Run a smoke test

./tools/scripts/smoke-test-spec-agent.sh
# Or with a custom prompt:
./tools/scripts/smoke-test-spec-agent.sh "Read SPEC-005 and SPEC-013, identify any contradictions in the workspace prep section."

The script:

Resolves atlas's agent UUID via psql against astra_db.
Inserts a fresh task_runs row with the prompt as task_title.
Starts a SpecAgentWorkflow on spec-agent-queue via temporalio/admin-tools pod.
Tails worker logs (kubectl logs -f).

Query a task run's outcome

TASK_RUN_ID=028a4f80-648a-4f16-bd6b-2fa171d97bce  # from smoke output

kubectl run pg-probe-$(date +%s) --rm -i --restart=Never -n internal-platform \
  --image=postgres:18.3-alpine \
  --env="PGPASSWORD=$(kubectl get secret astra-db-credentials -n internal-platform -o jsonpath='{.data.ASTRA_DB_PASSWORD}' | base64 -d)" \
  --env="PGSSLMODE=require" \
  --command -- psql -h aucert-internal-pg.postgres.database.azure.com \
    -U internaladmin -d astra_db -X -A \
    -c "SELECT outcome, duration_ms, tokens_input, tokens_output FROM task_runs WHERE id = '$TASK_RUN_ID'::uuid;"

Read the actual agent output:

... -c "SELECT metadata->'output_payload'->>'text' FROM task_events WHERE task_run_id = '$TASK_RUN_ID'::uuid AND event_type = 'agent_completed';"

Check Bedrock model availability

aws bedrock list-foundation-models --region us-west-2 \
  --by-provider Anthropic \
  --query 'modelSummaries[].[modelId,modelName,modelLifecycle.status]' \
  --output table

aws bedrock list-inference-profiles --region us-west-2 \
  --query 'inferenceProfileSummaries[].[inferenceProfileId,status]' \
  --output table

Verify a model is callable:

aws bedrock-runtime converse \
  --region us-west-2 \
  --model-id us.anthropic.claude-sonnet-4-6 \
  --messages '[{"role":"user","content":[{"text":"hi"}]}]' \
  --inference-config '{"maxTokens":20}'

Inspect a workflow in Temporal

UI: https://temporal.aucert.dev → namespace aucert-default → workflow ID matches spec-agent-smoke-<short-uuid> for smoke runs.

CLI from inside the cluster:

kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
  --image=temporalio/admin-tools:1.31.0 \
  -n temporal \
  -- temporal workflow describe \
    --address temporal-frontend.temporal.svc.cluster.local:7233 \
    --namespace aucert-default \
    --workflow-id spec-agent-smoke-<short-uuid>

Drive watch renewal (post-merge operator setup)

Drive watch channels have a 24-hour TTL in Workspace tenants. The renewal job is a Temporal cron workflow (DriveWatchRenewalWorkflow) registered on the spec-agent-worker. It runs every 12 h, looks ahead 6 h, and re-registers any channel that would otherwise expire before the next tick.

Incident that motivated this: a watch registered at 06:55 on 2026-04-30 expired at 06:55 on 2026-05-01. A comment posted 14 minutes after expiry was never seen by atlas. The watch had to be re-registered manually.
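
The selection rule described above (renew anything expiring within the look-ahead window) can be sketched as a single predicate. This is an illustration under stated assumptions, not the DriveWatchRenewalWorkflow code: the function name is hypothetical and times are plain epoch seconds.

```shell
# Succeed when a channel should be renewed at this tick.
# expiry and now are epoch seconds; lookahead is a duration in seconds.
needs_renewal() {
  expiry=$1; now=$2; lookahead=$3
  [ $((expiry - now)) -le "$lookahead" ]
}

HOUR=3600
now=0   # treat the current tick as time zero for the demonstration
needs_renewal $((5 * HOUR)) "$now" $((6 * HOUR)) && echo "5h left: renew"
needs_renewal $((7 * HOUR)) "$now" $((6 * HOUR)) || echo "7h left: skip this tick"
```

Under this predicate, a channel with 7 hours left is skipped and, on a 12-hour cadence, would not be looked at again until 5 hours after it expired. For the "before the next tick" guarantee to hold with this rule, the look-ahead needs to be at least as long as the tick interval, so the real workflow's values are worth confirming against that.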

Start the cron schedule (one-time, after deploying the updated worker)

kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
  --image=temporalio/admin-tools:1.31.0 \
  -n temporal \
  -- temporal workflow start \
    --address temporal-frontend.temporal.svc.cluster.local:7233 \
    --namespace aucert-default \
    --type DriveWatchRenewalWorkflow \
    --task-queue spec-agent-queue \
    --cron "0 */12 * * *" \
    --workflow-id drive-watch-renewal-cron

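The --workflow-id is stable, so re-running this command after a future worker deployment does not create duplicates: with Temporal's default settings, a start request for a workflow ID that is already running is rejected with an "already started" error and the existing cron schedule is left in place.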

Verify the cron is running

In the Temporal UI: https://temporal.aucert.dev → namespace aucert-default → search workflow ID drive-watch-renewal-cron. You should see a Running cron workflow with completed runs on a 12-hour cadence.

CLI:

kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
  --image=temporalio/admin-tools:1.31.0 \
  -n temporal \
  -- temporal workflow describe \
    --address temporal-frontend.temporal.svc.cluster.local:7233 \
    --namespace aucert-default \
    --workflow-id drive-watch-renewal-cron

Check current watch expiry times

kubectl run pg-probe-$(date +%s) --rm -i --restart=Never -n internal-platform \
  --image=postgres:18.3-alpine \
  --env="PGPASSWORD=$(kubectl get secret astra-db-credentials -n internal-platform -o jsonpath='{.data.ASTRA_DB_PASSWORD}' | base64 -d)" \
  --env="PGSSLMODE=require" \
  --command -- psql -h aucert-internal-pg.postgres.database.azure.com \
    -U internaladmin -d astra_db -X -A \
    -c "SELECT channel_id, file_id, agent_id, expiration_time,
               expiration_time - now() AS time_left
        FROM drive_watches
        ORDER BY expiration_time ASC;"

How old-channel cleanup works

The renewal job creates a new channel for each near-expiring watch (Drive returns a new channel_id). The old channel row is not deleted immediately; it is left to expire naturally and is purged at the end of each renewal pass once it is more than 7 days past its expiration_time. This keeps the table tidy without a separate cleanup job.

References

Specs

SPEC-005 — Spec agent v0.1 definition (docs/specs/drafts/SPEC-005-spec-agent-v0.1.md)
SPEC-010 — Parallel execution plan (docs/specs/drafts/SPEC-010-spec-agent-v0.1-execution-plan.md)
SPEC-012, SPEC-013 — Codebase access amendments
SPEC-020 — Unified platform and token management (GitHub App auth, integration registry)
SPEC-021 — Wave 5 dispatcher design (the next milestone)

ADRs

ADR-002 — Bazel staged adoption (relevant to build-system context)
ADR-011 — Agent harness layering (Layer 1 = AgentExecutor, Layer 2 = AgentLoop)
ADR-014 — Google integration via OAuth 2.0 refresh tokens (supersedes earlier service-account approach)

All ADRs live in .context/decisions/ (canonical) and sync to docs/internal/docs/decisions/ via .github/workflows/docs-adr-sync.yml.

Source code entry points

File	Role
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/temporal/SpecAgentActivityImpl.kt`	Temporal activity that spins up the executor
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/SpecAgentExecutor.kt`	Atlas-specific executor — model routing, label injection, output wrapping
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/SpecAgentConfig.kt`	Atlas config — default runtime config, model constants, per-operation routing
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/executor/AgentExecutor.kt`	Base executor (Layer 1) — lifecycle, retry semantics, event logging
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/executor/AgentLoop.kt`	Loop (Layer 2) — LLM ↔ tool dispatch
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/clients/AstraClient.kt`	HTTP client for astra-backend (task_runs lifecycle) + direct-JDBC reads (agent metadata, tokens)
`internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/model/adapters/BedrockAdapter.kt`	Bedrock LLM provider (with the new 5-min timeouts)
`internal/backend/src/main/kotlin/dev/aucert/internal/astra/api/TaskRunApi.kt`	Server-side task_runs HTTP routes
`internal/backend/src/main/kotlin/dev/aucert/internal/astra/AstraModule.kt`	Mounts all Astra routes under `/api/internal/astra/v1`

Infrastructure

Path	Role
`infra/migrations/astra/`	Flyway migrations (V001–V013) for `astra_db`
`infra/k8s/internal-platform/spec-agent-worker/`	K8s deployment for the worker
`infra/k8s/internal-platform/astra/`	K8s manifests for backend / frontend / proxy
`infra/k8s/internal-platform/cloudflared/tunnel.yaml`	Cloudflare tunnel routing
`infra/terraform/foundation/`	UAMI, Federated Identity Credentials, Key Vault role assignments
`tools/scripts/smoke-test-spec-agent.sh`	The smoke loop trigger
`tools/scripts/deploy-spec-agent-worker.sh`	Worker build + deploy
`tools/scripts/astra-deploy.sh`	Astra-backend / frontend / proxy deploy
