
Spec agent v0.1 — handover (2026-05-01)

TL;DR

The spec agent (atlas) runs end-to-end on production infrastructure. A smoke run completes in ~60 seconds, processes ~150K input tokens, calls real tools against the cloned codebase, and produces a publication-quality summary persisted to task_events.metadata.output_payload. The full plumbing chain (Bedrock LLM → 51 tools → Astra HTTP API → Postgres → Temporal workflow) is operational.

The next milestone is replacing the manual smoke trigger with a real event source — webhooks from Google Docs (and Slack / GitHub / Plane) flowing through the dispatcher service, into Temporal, into the spec agent, ending with the agent posting resolutions back to the originating doc.

Status snapshot

| Component | Status | Notes |
| --- | --- | --- |
| astra-backend (Ktor, port 8081) | ✅ Live | Image: aucertacr41e0x5.azurecr.io/astra-backend:latest |
| astra-frontend (Next.js console) | ✅ Live | Image: aucertacr41e0x5.azurecr.io/astra-frontend:latest |
| astra-proxy (nginx) | ✅ Live | Routes /api/ → backend, / → frontend |
| dispatcher (Ktor, webhook → Temporal starter) | ✅ Live | 43h+ uptime; manual + webhook paths exercised end-to-end through public Cloudflare tunnel (verified 2026-05-01); /health + /webhooks/* reachable, /api/tasks gated by CF Access |
| spec-agent-worker (Temporal worker) | ✅ Live | Polls spec-agent-queue; ran iter 20 to completion |
| Temporal cluster (temporal.aucert.dev) | ✅ Live | UI at https://temporal.aucert.dev |
| Plane (project tracker) | ✅ Live | Various plane-* deployments |
| Cloudflare tunnel (cloudflared) | ✅ Live | Public ingress for Astra, Temporal, Plane, dispatcher |
| astra_db schema (Postgres) | ✅ Migrated | 13 Flyway migrations applied (V001–V013) |
| Bedrock model access | ✅ Granted | Sonnet 4.6, Opus 4.7, Kimi K2.5, GLM 4.7 |
| AWS IAM admin user | ✅ Created | vivek.soneja with AdministratorAccess, MFA, CLI keys |

What's working today

Spec agent end-to-end

Atlas can be invoked via the smoke script and will:

  1. Bootstrap — clone the repo into /workspace/aucert using a freshly-minted GitHub App installation token, register all 51 tools, fetch agent metadata + composable personalities from astra_db, build the system prompt.
  2. Pick the right model per operation — Sonnet 4.6 (us.anthropic.claude-sonnet-4-6) for conversational tasks, Opus 4.7 (us.anthropic.claude-opus-4-7) for spec_finalize / spec_generate / spec_synthesize. Per-operation routing is in SpecAgentConfig.resolveModel(operation).
  3. Tag every response with a model label: [S46], [O47], [K26] (and any future label registered in ModelLabels.kt) at the start of every assistant turn, enforced via system-prompt instruction with a code-level fallback in doneOutputPayload. Operators can request a specific model via tags in Drive comments: [kimi], [opus], [opus-direct], or multiple tags such as [kimi][opus] for parallel runs. Full reference: Model routing and operator labels.
  4. Dispatch tools through the registry — local repo (read-only filesystem ops), GitHub, Slack, Plane, Google Docs/Drive, Brave web search, knowledge base vector search, embeddings.
  5. Persist lifecycle to Astra: fetchTaskContext → updateTaskStatus(IN_PROGRESS) → logTaskEvent("agent_started") → run loop → completeTaskRun(outcome="success") → logTaskEvent("agent_completed", { output_payload, model_used, ... }).
  6. Be Temporal-retry-safe — failures during the activity body do not push the row to a terminal state; Temporal can retry the activity any number of times without state-machine 422s.
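
Steps 2 and 3 above can be illustrated with a small shell sketch. This is not the Kotlin code in SpecAgentConfig.resolveModel or doneOutputPayload. The function names resolve_model and ensure_label are hypothetical, and the default branch that routes every unlisted operation to Sonnet is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real routing lives in SpecAgentConfig.resolveModel.

# Map an operation name to a Bedrock inference-profile ID.
# Assumption: every operation not listed explicitly counts as conversational.
resolve_model() {
  case "$1" in
    spec_finalize|spec_generate|spec_synthesize)
      echo "us.anthropic.claude-opus-4-7" ;;
    *)
      echo "us.anthropic.claude-sonnet-4-6" ;;
  esac
}

# Prepend a model label (e.g. "[S46]") unless the text already starts with it.
# This mirrors the idea of a code-level fallback when the model omits the label.
ensure_label() {
  local label="$1" text="$2"
  case "$text" in
    "$label"*) printf '%s\n' "$text" ;;          # label already present
    *)         printf '%s %s\n' "$label" "$text" ;;
  esac
}
```

For example, `ensure_label "[S46]" "$(some_agent_reply)"` leaves an already-labelled reply untouched and prefixes an unlabelled one, so the label check is idempotent.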

Iter 20 verification

Smoke iter 20 produced this row in astra_db.task_runs:

id          = 028a4f80-648a-4f16-bd6b-2fa171d97bce
outcome = success
duration_ms = 59,587
tokens_input = 147,398
tokens_output = 2,590
output_type = inline

And atlas's actual output (in task_events.metadata.output_payload for the agent_completed event) was a publication-quality 12-spec summary including a real bug it caught — LocalRepoClient.listAdrs() scans docs/adrs/ instead of the actual .context/decisions/. Spawn-task chip queued for the fix.

The 8-PR journey

The smoke loop took 20 iterations and 8 PRs to close. Each PR fixed one structural bug surfaced by the previous iteration:

| PR | Bug fixed | Triggered by |
| --- | --- | --- |
| #52 | Server-side GET /api/internal/astra/v1/task-runs/{id}/context route was missing | iter 13 |
| #53 | updateTaskStatus URL/body/enum mismatched server contract | iter 14 |
| #54 | AstraClient was hitting /api/... not /api/internal/astra/v1/... | iter 15 |
| #55 | task_events table was never created (V012 migration) | iter 16 |
| #56 | Activity body marked terminal state, breaking Temporal retries | iter 17 |
| #57 | Bedrock model IDs wrong format + per-operation routing + [S46]/[O47] labels | iter 17 |
| #58 | completeTaskRun POST→PATCH + body shape | iter 18 |
| #59 | Bedrock SDK socket/attempt timeouts too tight for 88K-token calls | iter 19 |

Next milestone — end-to-end webhook flow

The smoke script is a manual trigger that bypasses the dispatcher. The real flow is:

Google Docs comment
↓ (webhook)
dispatcher.aucert.dev
↓ (Temporal workflow start)
SpecAgentWorkflow on spec-agent-queue
↓ (activity dispatch)
spec-agent-worker → SpecAgentExecutor → AgentLoop → Bedrock
↓ (tool calls)
GoogleDocsClient.postComment / GoogleDocsClient.replyToThread
↓
Comment resolution visible in the originating Google Doc

All four webhook handlers (Slack/Drive/GitHub/Plane) are shipped in internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/routing/WebhookRoutes.kt. SignatureValidator does HMAC-SHA256 for Slack and GitHub. The public webhook → DB → Temporal → agent → outcome chain was verified end-to-end on 2026-05-01: a probe POST to https://dispatcher.aucert.dev/webhooks/drive (empty body, no headers) produced a real workflow that ran to outcome=success in 19s (30K input tokens, 647 output, ~$0.10 Bedrock spend). Manual trigger endpoint is POST /api/tasks (auth: X-API-Key header); exercised 6+ times with 201 responses in the same window.

SPEC-021 documents the design (Wave 5: W5-A standalone Ktor service, W5-B webhook handlers for Slack/Google Drive/GitHub/Plane, W5-C Dockerfile + manifests + tunnel config).

Three known dispatcher issues discovered during the probe run (hardening pass pending — #62):

  1. WebhookRoutes drops the classifier's initialMessage from trigger_event; the agent currently sees only raw headers/body rather than a synthesised description of the event.
  2. DriveWebhookHandler dispatches on empty/missing resource headers — the empty-body probe accidentally created a real workflow.
  3. DispatcherConfig treats empty-string K8s secrets as "validation enabled" — SLACK_SIGNING_SECRET and GITHUB_WEBHOOK_SECRET are currently 0-byte placeholders, so HMAC validation fails for those sources.
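
Issues 2 and 3 both come down to treating "present but empty" as "absent". A minimal sketch of the two guards follows. The function names are hypothetical, and the use of X-Goog-Resource-Id alongside X-Goog-Resource-State is an assumption about which headers the handler inspects. Whether an unconfigured secret should disable validation or fail startup is left to the hardening PR.

```shell
#!/usr/bin/env bash
# Illustrative guards; not the dispatcher's Kotlin implementation.

# Issue 3: treat an empty or whitespace-only secret as "not configured".
secret_is_configured() {
  local value="${1:-}"
  # Strip all whitespace; a 0-byte placeholder leaves nothing behind.
  [ -n "$(printf '%s' "$value" | tr -d '[:space:]')" ]
}

# Issue 2: only dispatch a Drive notification that carries both headers.
# Arguments stand in for X-Goog-Resource-State and X-Goog-Resource-Id.
drive_event_is_dispatchable() {
  local resource_state="${1:-}" resource_id="${2:-}"
  [ -n "$resource_state" ] && [ -n "$resource_id" ]
}
```

With guards like these, the empty-body probe described above would be dropped before a workflow is started, and a 0-byte SLACK_SIGNING_SECRET would be reported as missing instead of silently failing every HMAC check.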

What needs to happen, in order

1. Webhook source configuration (per source, ~30 min each)

Each source needs to be told to POST to https://dispatcher.aucert.dev/webhooks/<source> with appropriate authentication. Handler code is already deployed; this is external configuration only.

Google Docs / Drive:

  • Create a Drive changes.watch subscription targeting the docs you want monitored (or set up at the folder level for a topic-folder).
  • Subscription POSTs to https://dispatcher.aucert.dev/webhooks/google-drive with X-Goog-Resource-State and X-Goog-Channel-Token headers. (The 2026-05-01 probe hit /webhooks/drive; confirm the mounted path in WebhookRoutes.kt before registering the watch.)
  • Drive subscriptions expire (max 7 days for change watches), so they must be re-created before expiry. Renewal is handled by the Temporal cron workflow described under "Drive watch renewal"; until that schedule has been started, renew manually.
  • Reference: SPEC-021 §W5-B for handler details.
  • Prerequisite: apply the dispatcher hardening PR first so empty-header probe traffic no longer triggers real workflows.

Slack:

  • Slack app → Event Subscriptions → Request URL: https://dispatcher.aucert.dev/webhooks/slack. Slack POSTs a URL verification challenge first; the handler echoes challenge back.
  • Subscribe to app_mention, message.channels, message.im, reaction_added (for :atlas: reactions as a trigger).
  • Slack signs every webhook with HMAC-SHA256 using your signing secret. Handler validates X-Slack-Signature against v0:<timestamp>:<body>.
  • Prerequisite: populate SLACK_SIGNING_SECRET in dispatcher-secrets (currently a 0-byte placeholder).

GitHub:

  • For each watched repo: Settings → Webhooks → Add webhook → Payload URL https://dispatcher.aucert.dev/webhooks/github, secret = HMAC key, events = pull_request, pull_request_review_comment, issue_comment, issues.
  • The GitHub App you already provisioned can also push events org-wide via App-level webhooks if you want one-time setup vs per-repo.
  • Validate X-Hub-Signature-256 against the secret.
  • Prerequisite: populate GITHUB_WEBHOOK_SECRET in dispatcher-secrets (currently a 0-byte placeholder).
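
Both the Slack and GitHub handlers above rely on HMAC-SHA256 over the raw request body. The sketch below shows the two signature formats using the openssl CLI, which is assumed to be installed. It is not the SignatureValidator code, and the function names are hypothetical. Shell string comparison is not constant-time, so a production handler should use a constant-time comparison instead.

```shell
#!/usr/bin/env bash
# Illustrative only. Requires the openssl CLI. Passing the key via -hmac
# exposes it in the process list, which is acceptable for a local check only.

# Hex-encoded HMAC-SHA256 of stdin, keyed by $1.
hmac_sha256_hex() {
  openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# Slack: sign "v0:<timestamp>:<body>" and prefix the digest with "v0=".
slack_signature() {
  local secret="$1" timestamp="$2" body="$3"
  printf 'v0=%s\n' "$(printf 'v0:%s:%s' "$timestamp" "$body" | hmac_sha256_hex "$secret")"
}

# GitHub: sign the raw body and prefix the digest with "sha256=".
github_signature() {
  local secret="$1" body="$2"
  printf 'sha256=%s\n' "$(printf '%s' "$body" | hmac_sha256_hex "$secret")"
}

# Accept only a non-empty expected signature that equals the received header.
signature_is_valid() {
  local expected="$1" received="$2"
  [ -n "$expected" ] && [ "$expected" = "$received" ]
}
```

A local check against a captured request might look like `signature_is_valid "$(slack_signature "$SECRET" "$TS" "$BODY")" "$X_SLACK_SIGNATURE"`, where the three variables hold the signing secret, the X-Slack-Request-Timestamp value, and the unmodified body.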

Plane:

  • Plane workspace settings → webhooks → URL https://dispatcher.aucert.dev/webhooks/plane, events = issue created/updated/commented.

2. Verify dispatcher → Temporal hop

Verified as of 2026-05-01: manual trigger exercised 6+ times with 201 responses, each kicking a real SpecAgentWorkflow on spec-agent-queue. A Drive probe confirmed the full DB-persistence and workflow-start path.

To re-run manually from inside the cluster:

# From inside the cluster (Cloudflare Access bypass):
kubectl run curl-test --rm -i --restart=Never -n internal-platform --image=curlimages/curl:8.11.1 -- \
curl -sS -X POST http://dispatcher.internal-platform.svc.cluster.local/api/tasks \
-H "X-API-Key: $(kubectl get secret dispatcher-secrets -n internal-platform -o jsonpath='{.data.API_KEY}' | base64 -d)" \
-H 'Content-Type: application/json' \
-d '{"message":"Manual dispatcher → workflow test"}'

Note: hitting https://dispatcher.aucert.dev/api/tasks from outside the cluster requires a Cloudflare Access service token — not yet configured. The /webhooks/* paths are CF Access bypassed for HMAC-authenticated callers.

3. End-to-end Google Docs round trip

Once the dispatcher is verified and Google Drive webhooks are configured:

Setup (one-time):

  • Pick a test doc — e.g. a draft spec in your Drive.
  • Create a Drive changes.watch subscription targeting that doc (or its parent folder). Save the subscription ID for cleanup.
  • Confirm the subscription is active: the Drive API analogue of aws sts get-caller-identity is the POST to https://www.googleapis.com/drive/v3/changes/watch, which should return 200.

Test:

  1. Open the test doc in Google Docs as a real user.
  2. Add a comment that mentions atlas — e.g. "@atlas can you summarise the open questions in this doc?"
  3. Within seconds, the dispatcher should receive the webhook. Verify in dispatcher logs.
  4. Within ~30 seconds, the workflow should be visible in Temporal UI under aucert-default namespace, ID matching the doc + comment ID.
  5. Within ~60–120 seconds, atlas should:
    • Read the doc via GoogleDocsClient.
    • Generate a response keyed to the comment.
    • Post the response as a reply on the comment via GoogleDocsClient.replyToComment (or whatever the tool is named in agents/spec/tools/).
    • Mark task_runs.outcome = success.

What to look for in the doc:

  • A new comment reply from the atlas service account (verify the agent-email token in vault matches the service account writing comments).
  • The reply should start with [S46] (model label).
  • The reply text should be relevant to the comment, not a generic "I read the doc" response.

4. Resolve flow

The harder problem: atlas should be able to resolve a comment thread when the user signals they're satisfied (e.g. "thanks atlas, resolved" or a 👍 reaction). This requires:

  • A separate webhook event when a comment is resolved (Drive emits comment.resolved).
  • Dispatcher routes to a resolve_comment operation.
  • Atlas reads the thread, optionally posts a final summary, and calls GoogleDocsClient.resolveComment(commentId).

This is genuinely new agent behavior — not just "respond to comments" but "decide when a comment is fully addressed." Worth scoping carefully:

  • v0.1: explicit resolve trigger only (user reacts with ✅ or types /atlas resolve).
  • v0.2: atlas decides autonomously based on conversation state.

Recommend starting with v0.1 (explicit trigger).
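
The v0.1 explicit trigger amounts to a literal match on the comment or reaction text. A minimal sketch, assuming the only two triggers are the /atlas resolve command and a ✅; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Illustrative classifier for the v0.1 explicit resolve trigger.

# Return 0 when the text is an explicit resolve request, 1 otherwise.
is_resolve_trigger() {
  local text="$1"
  case "$text" in
    *"/atlas resolve"*|*"✅"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A free-text message such as "thanks atlas, resolved" deliberately does not match here; interpreting that kind of signal is the v0.2 autonomous behaviour.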

Outstanding cleanup tasks

In rough priority order:

High value, small scope

| Task | Why | Where |
| --- | --- | --- |
| Dispatcher webhook hardening (initialMessage in trigger_event, Drive null-resourceId guard, empty-string secret foot-gun) | Probe traffic exposed three small issues blocking clean Drive E2E. #62 | internal/dispatcher/src/main/kotlin/dev/aucert/internal/dispatcher/{routing,webhooks,config}/ |
| Fix LocalRepoClient.listAdrs() ADR path | Atlas caught a real bug: the client points at docs/adrs/ but the actual path is .context/decisions/. Already spawned as a separate task. | internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/clients/LocalRepoClient.kt |
| Promote SPEC-017 to approved/ | Atlas observed it's fully implemented but still in drafts/. Mechanical move. | docs/specs/drafts/SPEC-017-...md → docs/specs/approved/ |
| Update SPEC-010 / SPEC-020 stale Google notes | Both have amendment notes about service-account auth superseded by ADR-014 (OAuth refresh tokens). Should be reconciled. | docs/specs/drafts/SPEC-010...md, SPEC-020-...md |
| Workflow-level final-failure marking | Today, when Temporal exhausts retries, the row stays in running forever. The workflow should mark outcome = failure after final retry. | internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/temporal/SpecAgentWorkflowImpl.kt |

Quota / capacity

| Task | Why |
| --- | --- |
| Request Bedrock Opus 4.7 quota increase | Iter 19 + 20 confirmed Sonnet 4.6 works fine on the default quota. Opus 4.7 throttled in earlier tests. Console → Service Quotas → AWS Bedrock → "Cross-region model invocation tokens per minute for Claude Opus 4.7" → request increase. Free, usually approved within hours. |
| Renewal job for Drive changes.watch | Shipped in PR #XX; see "Drive watch renewal" section below. |

Deferred from PRs (not blocking)

| Task | Deferred from |
| --- | --- |
| Per-call timeout override via RuntimeConfig.timeoutMs | #59: today every Bedrock call gets the generous 5-min budget, regardless of expected duration |
| Streaming responses (InvokeModelWithResponseStream / Converse Stream) | #59: would let us start processing tokens as they arrive instead of waiting for the full response |
| Widen task_runs.output_ref from varchar(500) to TEXT or jsonb | #58: today the actual output lives in task_events.metadata.output_payload to avoid truncation |
| Per-token cost computation (estimated_cost column on task_runs) | #58: reporting layer can backfill from tokens_input * input_rate + tokens_output * output_rate using model_used from agent_completed event |
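
The backfill formula in the last row can be sketched as follows. The rates are placeholders expressed per million tokens and chosen only to make the arithmetic visible; they are not Bedrock prices. estimate_cost is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Illustrative cost estimate: tokens_input * input_rate + tokens_output * output_rate.

# Usage: estimate_cost <tokens_input> <tokens_output> <input_rate> <output_rate>
# Rates are in USD per one million tokens (an assumption of this sketch).
estimate_cost() {
  awk -v ti="$1" -v to="$2" -v ri="$3" -v ro="$4" \
    'BEGIN { printf "%.4f\n", (ti * ri + to * ro) / 1000000 }'
}

# Iter 20 token counts with placeholder rates of 3 and 15 (NOT real pricing).
estimate_cost 147398 2590 3 15   # prints 0.4810
```

In a real backfill the two rates would be looked up from a rate card keyed by the model_used value on the agent_completed event.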

Future agents (Kimi / GLM)

Update 2026-05-06. Kimi shipped, but via a different path than the one originally sketched here. We did not add Kimi via Bedrock Converse — instead, we deployed Kimi K2.6 on Azure AI Foundry and wrote a new FoundryOpenAIAdapter (OpenAI-compatible chat completions) alongside the existing BedrockAdapter. Atlas now invokes Kimi via the [kimi] operator tag in Google Docs comments, and the route is validated end-to-end on a real Drive comment thread. See Model routing and operator labels for the full operator + developer reference, and ADR-011's 2026-05-06 amendment for the architectural framing.

GLM is still future. When it lands, the same one-line ModelRegistry + ModelLabels pattern applies — adapter choice depends on whether GLM is hosted on Bedrock (extend BedrockAdapter with a Converse-API variant), Foundry (reuse FoundryOpenAIAdapter), or somewhere else (new adapter).

The original analysis below is preserved for context but the conclusion has shifted: the simpler refactor turned out to be a third adapter (FoundryOpenAIAdapter), not extending BedrockAdapter. The provider-agnostic interface paid for itself.

Both models were provisioned and verified callable on Bedrock (moonshotai.kimi-k2.5, zai.glm-4.7). When a future non-spec agent needs them:

  • Add a BedrockMessageTranslator variant for non-Anthropic body shapes (Anthropic Messages API doesn't apply to Kimi/GLM).
  • Or use the Bedrock Converse API instead of InvokeModel — Converse is provider-agnostic and would let the existing BedrockAdapter work for any model.

The simpler refactor is the second one. SPEC for whichever future agent comes first should pick the path.

Architecture decisions reaffirmed

The smoke loop took 8 PRs to close, which prompted a re-examination of the harness strategy decided on 2026-04-15. The conclusion is to keep the existing layered architecture unchanged, with two operational addenda below.

Why the existing decision still holds

Categorising the 8 PRs (#52–#59) honestly against the four-layer model:

| PR | Layer the bug lived in | Would a framework have prevented it? |
| --- | --- | --- |
| #52 | L2: server-side route handler | No (API design, framework-agnostic) |
| #53 | L2: client/server contract drift | No (HTTP shape mismatch) |
| #54 | L2: mount-path prefix mistake | No (path constant typo) |
| #55 | L2: missing migration | No (schema bug) |
| #56 | L3/4: activity body marked terminal state | No (Temporal idiom we had to learn) |
| #57 | Provider integration: Bedrock model IDs | No (provider-specific config) |
| #58 | L2: same class as #52/#53 | No |
| #59 | Provider integration: Bedrock SDK timeouts | No (provider-specific tuning) |

Zero of the 8 PRs were in Layer 1 (the loop). The loop is ~50 lines and worked correctly throughout. Pain was concentrated in API contracts, schema, and provider-specific configuration — none of which a different harness choice would have addressed. Adopting an off-the-shelf framework (Anthropic Agent SDK, LangChain4j, etc.) would have introduced a polyglot service boundary or framework coupling without removing any of the bugs we actually hit.

The original decision document (2026-04-15) should be promoted to a canonical ADR, with the two addenda below incorporated.

Addendum 1 — provider adapters need real-credential integration tests

Three Bedrock-specific quirks only surfaced in production, despite passing unit tests:

  • Bare model IDs (anthropic.claude-sonnet-4-6) are rejected by InvokeModel; the cross-region inference profile ID (us.anthropic.claude-sonnet-4-6) must be used instead.
  • The default AWS SDK socket timeout (~30s) is too tight for large-context LLM calls; it needed bumping to 5 min.
  • The agent_tokens vault path and the Astra Token Vault path disagreed on where the AWS credentials live and had to be reconciled.

Rule: every new provider adapter (Foundry, Moonshot, Zhipu, etc.) ships with a real-credential integration test that exercises a multi-tool, multi-step interaction end-to-end — not just translation correctness. The BedrockAdapterTest today only covers BedrockMessageTranslator (pure JSON in / JSON out); future adapters need an integration counterpart that runs against the actual provider with @Tag("integration").

Addendum 2 — dispatcher line-count is a real trigger

The harness doc characterises the dispatcher as "thin, ~500 lines." The dispatcher pod has 43h+ uptime; the full public webhook path was exercised end-to-end on 2026-05-01 and the claim held — the handler code is thin and the volume of source-specific logic is still well below the trigger threshold.

Trigger to revisit the build-vs-buy calculation for the dispatcher: if per-source handler complexity (Drive subscription renewal + Slack rate limits + GitHub event filtering + Plane payload variants) pushes the dispatcher past 1500 lines of source-specific code, the cost calculus shifts toward a managed alternative (Trigger.dev, n8n, or similar webhook-to-workflow platform). Until then, keep custom.
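
The 1500-line trigger is easy to check mechanically. The sketch below counts lines in Kotlin files under a directory and compares the total with a threshold. Which directory counts as "source-specific code" is an assumption left to the caller, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative size check for the dispatcher build-vs-buy trigger.

# Count lines across all Kotlin files under a directory.
count_kotlin_lines() {
  find "$1" -type f -name '*.kt' -exec cat {} + | wc -l | tr -d ' '
}

# Print a verdict; return 1 when the count exceeds the threshold (default 1500).
check_dispatcher_size() {
  local dir="$1" threshold="${2:-1500}" lines
  lines="$(count_kotlin_lines "$dir")"
  if [ "$lines" -gt "$threshold" ]; then
    echo "REVISIT: $lines lines exceeds $threshold"
    return 1
  fi
  echo "OK: $lines lines (threshold $threshold)"
}
```

Run from the repository root against whichever dispatcher package holds the per-source handlers, this could be wired into CI so the trigger is noticed when it is crossed instead of being rediscovered later.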

This trigger should be added to the "When to reconsider" section of the canonical ADR.

Multi-tenancy considerations

Today everything is single-tenant Aucert-internal. The system was deliberately built that way for v0.1 — getting one tenant working end-to-end is hard enough. But several design choices will reverberate when we onboard external customers, and starting the conversation now (before v0.5+ build) is cheaper than retrofitting.

What's tenant-coupled today

| Surface | Coupling | Hard to change later? |
| --- | --- | --- |
| astra_db schemas (agents, task_runs, task_events, agent_tokens, personalities) | Single shared database; no tenant_id column anywhere | Medium: add column + backfill + scope every query |
| specs_db, shared_kb_db, internal_shared_db | Single instance per database | Same as above |
| Workspace clones (/workspace/aucert) | Hardcoded repo, single GitHub App installation | High: workspace bootstrap assumes one repo |
| SystemPromptAssembler | Reads "Aucert core context", proprietary IP in the prompt | High: context leakage risk if reused unchanged for tenants |
| Bedrock IAM principal | One bedrock-access-key shared across all agent runs | Low: easy to scope per-tenant once IAM is parameterised |
| Astra service token | Single DISPATCHER_SERVICE_TOKEN from astra-secrets | Low: token broker can mint per-tenant tokens |
| Cloudflare tunnel routes | astra.aucert.dev, dispatcher.aucert.dev, temporal.aucert.dev (single tenant) | Medium: needs subdomain or path scheme per tenant |
| K8s namespace | Single internal-platform namespace for all workloads | Medium: per-tenant namespace gives isolation but multiplies operational cost |
| Worker pool concurrency | spec-agent-worker polls one spec-agent-queue | Medium: Temporal task queues per tenant are cheap; worker pool sizing per tenant is the real question |

Open design questions

These should be decided (not necessarily built) before v0.5+ work begins:

  1. Database isolation model. Three options, in increasing strength:

    • Shared schema, tenant_id column on every row + row-level security policies. Cheapest, weakest. Risk: missed WHERE tenant_id = ? clause leaks data across tenants.
    • Per-tenant Postgres schema (astra_db.tenant_<id>.task_runs). Stronger isolation, requires application to set search_path. Migrations applied per schema.
    • Per-tenant Postgres database (astra_db_tenant_<id>). Strongest isolation, highest operational cost. Backup/restore scoped per tenant.
    • Recommendation to pre-decide: start with the second option (per-tenant schema). Strong-enough isolation; reasonable migration story; doesn't multiply infra cost. Only escalate to per-database if a customer's compliance requires it.
  2. Workspace clone strategy. The spec agent clones aucert/aucert on activity start. For a customer running atlas on their own private repo:

    • Per-tenant GitHub App installation (each customer installs the App on their own org).
    • Per-tenant workspace volume (different /workspace/<tenant>/ paths).
    • Per-tenant clone-on-start IAM (no cross-tenant credential leakage).
    • Open question: is the workspace shared across activities for the same tenant (cache-friendly, locking risk) or fresh per activity (slower, isolated)?
  3. Prompt isolation. SystemPromptAssembler today reads the "Aucert core context" — a section that bleeds Aucert-specific framing ("the Aucert spec workflow", "atlas the spec agent", etc.) into the system prompt. For multi-tenant:

    • Move the per-Aucert framing into a personality fragment or a tenant config that's swappable.
    • The fixed-section "core directive" / "tool discipline" can stay genuinely tenant-agnostic.
    • Personalities themselves are already keyed per-agent in astra_db.agent_personalities — extending to per-tenant is a small change.
  4. Audit and compliance. task_runs is the canonical audit log. Per-tenant access controls + retention policies need to exist before any compliance-sensitive customer onboards. Today everything is in one table queryable by any IAM principal with astra:task_runs.view.

  5. Cost attribution. tokens_input / tokens_output per task_run gives us the raw data, but allocating to a tenant (for invoicing or capacity planning) needs tenant_id on the row. The agent_completed task_events row already carries model_used so per-model + per-tenant rate cards become tractable once tenant_id lands.

  6. Worker pool sizing. A noisy-neighbour tenant could starve atlas of capacity. Options:

    • Per-tenant Temporal task queues with per-queue worker pools (operational cost: more pods).
    • Single shared queue with rate limiting at the activity level (cheaper, weaker isolation).
    • Recommendation to pre-decide: start with shared queue + activity-level rate limiting; escalate to per-tenant queues if a customer's load justifies dedicated capacity.
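
For the per-tenant schema option under question 1, the application has to set search_path on every session. A minimal sketch of building that statement safely is shown below. The tenant_<id> naming follows the example in question 1, while the function name and the allowed character set are assumptions.

```shell
#!/usr/bin/env bash
# Illustrative only: build the statement that scopes a session to one tenant.

# Emit the SQL for a tenant schema. Rejects anything but lowercase letters,
# digits, and underscores so the tenant id cannot inject SQL.
tenant_search_path_sql() {
  local tenant="$1"
  case "$tenant" in
    ''|*[!a-z0-9_]*) echo "invalid tenant id" >&2; return 1 ;;
  esac
  printf 'SET search_path TO tenant_%s;\n' "$tenant"
}
```

The validation step matters more than the statement itself: with per-tenant schemas, an unchecked tenant id in search_path is the equivalent of the missed WHERE tenant_id = ? clause in the shared-schema option.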

What to do now

Don't build any of this yet. But:

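  • Add a tenant_id column to the next schema migration that touches a tenant-coupled table, with a default that backfills existing rows to the Aucert tenant. If the column is typed UUID, that default has to be a fixed UUID assigned to Aucert; the literal 'aucert' only works for a TEXT column.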
  • Wrap the "Aucert core context" section of SystemPromptAssembler in a config-driven fragment the next time that file is touched.
  • When the next agent role's worker pod is being designed, make the IAM principal scoped per-role (not shared with atlas), so per-tenant scoping is a small extension later rather than a refactor.
  • Capture this section's open design questions in a new SPEC (e.g. "SPEC-NNN: Multi-tenancy strategy for the agent platform") before the first paying customer is committed to a launch date. The SPEC's job is to settle questions (1) and (2) above with concrete tradeoffs.

The handover author's view: the existing single-tenant build is correct for where Aucert is today. None of the choices above need to be made tonight. But picking them in the right order — schema strategy first, workspace second, prompt isolation third — saves weeks compared to retrofitting under deadline pressure.

Operational runbook

Deploy the spec agent worker

./tools/scripts/deploy-spec-agent-worker.sh

Builds internal/backend/Dockerfile.spec-agent, pushes to ACR, restarts the deployment, waits for rollout. Tag defaults to git rev-parse --short HEAD.

Deploy astra-backend

CI handles this automatically on push to main when internal/backend/**, internal/frontend/**, infra/migrations/**, or infra/k8s/internal-platform/astra/** change. Workflow: .github/workflows/deploy-astra.yml.

Manual deploy:

./tools/scripts/astra-deploy.sh                    # full deploy
./tools/scripts/astra-deploy.sh --backend-only # backend only
./tools/scripts/astra-deploy.sh --skip-migrations # skip Flyway

Run a smoke test

./tools/scripts/smoke-test-spec-agent.sh
# Or with a custom prompt:
./tools/scripts/smoke-test-spec-agent.sh "Read SPEC-005 and SPEC-013, identify any contradictions in the workspace prep section."

The script:

  1. Resolves atlas's agent UUID via psql against astra_db.
  2. Inserts a fresh task_runs row with the prompt as task_title.
  3. Starts a SpecAgentWorkflow on spec-agent-queue via temporalio/admin-tools pod.
  4. Tails worker logs (kubectl logs -f).

Query a task run's outcome

TASK_RUN_ID=028a4f80-648a-4f16-bd6b-2fa171d97bce  # from smoke output

kubectl run pg-probe-$(date +%s) --rm -i --restart=Never -n internal-platform \
--image=postgres:18.3-alpine \
--env="PGPASSWORD=$(kubectl get secret astra-db-credentials -n internal-platform -o jsonpath='{.data.ASTRA_DB_PASSWORD}' | base64 -d)" \
--env="PGSSLMODE=require" \
--command -- psql -h aucert-internal-pg.postgres.database.azure.com \
-U internaladmin -d astra_db -X -A \
-c "SELECT outcome, duration_ms, tokens_input, tokens_output FROM task_runs WHERE id = '$TASK_RUN_ID'::uuid;"

Read the actual agent output:

... -c "SELECT metadata->'output_payload'->>'text' FROM task_events WHERE task_run_id = '$TASK_RUN_ID'::uuid AND event_type = 'agent_completed';"

Check Bedrock model availability

aws bedrock list-foundation-models --region us-west-2 \
--by-provider Anthropic \
--query 'modelSummaries[].[modelId,modelName,modelLifecycle.status]' \
--output table

aws bedrock list-inference-profiles --region us-west-2 \
--query 'inferenceProfileSummaries[].[inferenceProfileId,status]' \
--output table

Verify a model is callable:

aws bedrock-runtime converse \
--region us-west-2 \
--model-id us.anthropic.claude-sonnet-4-6 \
--messages '[{"role":"user","content":[{"text":"hi"}]}]' \
--inference-config '{"maxTokens":20}'

Inspect a workflow in Temporal

UI: https://temporal.aucert.dev → namespace aucert-default → workflow ID matches spec-agent-smoke-<short-uuid> for smoke runs.

CLI from inside the cluster:

# kubectl run requires a pod name; temporal-cli-<timestamp> is an arbitrary choice.
kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
--image=temporalio/admin-tools:1.31.0 \
-n temporal \
-- temporal workflow describe \
--address temporal-frontend.temporal.svc.cluster.local:7233 \
--namespace aucert-default \
--workflow-id spec-agent-smoke-<short-uuid>

Drive watch renewal (post-merge operator setup)

Drive watch channels have a 24-hour TTL in Workspace tenants. The renewal job is a Temporal cron workflow (DriveWatchRenewalWorkflow) registered on the spec-agent-worker. It runs every 12 h, looks ahead 6 h, and re-registers any channel that would otherwise expire before the next tick.

Incident that motivated this: a watch registered at 06:55 on 2026-04-30 expired at 06:55 on 2026-05-01. A comment posted 14 minutes after expiry was never seen by atlas. The watch had to be re-registered manually.
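
The selection rule described above ("re-register any channel that expires inside the look-ahead window") can be sketched as below. This is an illustration of the rule, not the DriveWatchRenewalWorkflow code; the function names are hypothetical and times are plain epoch seconds.

```shell
#!/usr/bin/env bash
# Illustrative renewal-selection rule for Drive watch channels.

# Return 0 when a channel expiring at $1 falls inside the look-ahead window
# that starts at $2 (now) and lasts $3 seconds.
needs_renewal() {
  local expires_at="$1" now="$2" lookahead="$3"
  [ "$expires_at" -le $(( now + lookahead )) ]
}

# Return 0 when every channel is guaranteed a renewal before it expires,
# which requires the look-ahead to cover the gap until the next tick.
window_covers_interval() {
  local lookahead="$1" interval="$2"
  [ "$lookahead" -ge "$interval" ]
}
```

If the job selects channels exactly this way, a 6 h look-ahead on a 12 h cadence would skip a channel that expires 8 h after a tick and find it already expired at the next one. It is therefore worth checking the configured look-ahead against the cron interval, for example with `window_covers_interval $((6*3600)) $((12*3600))`.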

Start the cron schedule (one-time, after deploying the updated worker)

# kubectl run requires a pod name; temporal-cli-<timestamp> is an arbitrary choice.
kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
--image=temporalio/admin-tools:1.31.0 \
-n temporal \
-- temporal workflow start \
--address temporal-frontend.temporal.svc.cluster.local:7233 \
--namespace aucert-default \
--type DriveWatchRenewalWorkflow \
--task-queue spec-agent-queue \
--cron "0 */12 * * *" \
--workflow-id drive-watch-renewal-cron

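The --workflow-id is stable, so re-running this command after a future worker deployment cannot create a duplicate schedule. Note that under Temporal's default workflow-ID conflict behaviour, a second start against a still-running workflow ID is rejected; it does not replace the existing cron handle. To change the schedule, terminate the existing cron workflow first and then start it again.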

Verify the cron is running

In the Temporal UI: https://temporal.aucert.dev → namespace aucert-default → search workflow ID drive-watch-renewal-cron. You should see a Running cron workflow with completed runs on a 12-hour cadence.

CLI:

# kubectl run requires a pod name; temporal-cli-<timestamp> is an arbitrary choice.
kubectl run temporal-cli-$(date +%s) -it --rm --restart=Never \
--image=temporalio/admin-tools:1.31.0 \
-n temporal \
-- temporal workflow describe \
--address temporal-frontend.temporal.svc.cluster.local:7233 \
--namespace aucert-default \
--workflow-id drive-watch-renewal-cron

Check current watch expiry times

kubectl run pg-probe-$(date +%s) --rm -i --restart=Never -n internal-platform \
--image=postgres:18.3-alpine \
--env="PGPASSWORD=$(kubectl get secret astra-db-credentials -n internal-platform -o jsonpath='{.data.ASTRA_DB_PASSWORD}' | base64 -d)" \
--env="PGSSLMODE=require" \
--command -- psql -h aucert-internal-pg.postgres.database.azure.com \
-U internaladmin -d astra_db -X -A \
-c "SELECT channel_id, file_id, agent_id, expiration_time,
expiration_time - now() AS time_left
FROM drive_watches
ORDER BY expiration_time ASC;"

How old-channel cleanup works

The renewal job creates a new channel for each near-expiring watch (Drive returns a new channel_id). The old channel row is not deleted immediately; it is left to expire naturally and is purged at the end of each renewal pass once it is more than 7 days past its expiration_time. This keeps the table tidy without a separate cleanup job.

References

Specs

  • SPEC-005 — Spec agent v0.1 definition (docs/specs/drafts/SPEC-005-spec-agent-v0.1.md)
  • SPEC-010 — Parallel execution plan (docs/specs/drafts/SPEC-010-spec-agent-v0.1-execution-plan.md)
  • SPEC-012, SPEC-013 — Codebase access amendments
  • SPEC-020 — Unified platform and token management (GitHub App auth, integration registry)
  • SPEC-021 — Wave 5 dispatcher design (the next milestone)

ADRs

  • ADR-002 — Bazel staged adoption (relevant to build-system context)
  • ADR-011 — Agent harness layering (Layer 1 = AgentExecutor, Layer 2 = AgentLoop)
  • ADR-014 — Google integration via OAuth 2.0 refresh tokens (supersedes earlier service-account approach)

All ADRs live in .context/decisions/ (canonical) and sync to docs/internal/docs/decisions/ via .github/workflows/docs-adr-sync.yml.

Source code entry points

| File | Role |
| --- | --- |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/temporal/SpecAgentActivityImpl.kt | Temporal activity that spins up the executor |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/SpecAgentExecutor.kt | Atlas-specific executor: model routing, label injection, output wrapping |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/spec/SpecAgentConfig.kt | Atlas config: default runtime config, model constants, per-operation routing |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/executor/AgentExecutor.kt | Base executor (Layer 1): lifecycle, retry semantics, event logging |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/executor/AgentLoop.kt | Loop (Layer 2): LLM ↔ tool dispatch |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/clients/AstraClient.kt | HTTP client for astra-backend (task_runs lifecycle) + direct-JDBC reads (agent metadata, tokens) |
| internal/backend/src/main/kotlin/dev/aucert/internal/agents/shared/model/adapters/BedrockAdapter.kt | Bedrock LLM provider (with the new 5-min timeouts) |
| internal/backend/src/main/kotlin/dev/aucert/internal/astra/api/TaskRunApi.kt | Server-side task_runs HTTP routes |
| internal/backend/src/main/kotlin/dev/aucert/internal/astra/AstraModule.kt | Mounts all Astra routes under /api/internal/astra/v1 |

Infrastructure

| Path | Role |
| --- | --- |
| infra/migrations/astra/ | Flyway migrations (V001–V013) for astra_db |
| infra/k8s/internal-platform/spec-agent-worker/ | K8s deployment for the worker |
| infra/k8s/internal-platform/astra/ | K8s manifests for backend / frontend / proxy |
| infra/k8s/internal-platform/cloudflared/tunnel.yaml | Cloudflare tunnel routing |
| infra/terraform/foundation/ | UAMI, Federated Identity Credentials, Key Vault role assignments |
| tools/scripts/smoke-test-spec-agent.sh | The smoke loop trigger |
| tools/scripts/deploy-spec-agent-worker.sh | Worker build + deploy |
| tools/scripts/astra-deploy.sh | Astra-backend / frontend / proxy deploy |