Decision: Agent Harness Strategy for Aucert Agent Platform
Status: Decided
Date: 2026-04-15
Deciders: Vivek (CEO), Claude (consultation)
Supersedes: None
Related: ADR-002 (Bazel staged adoption), SPEC-010 (spec agent v0.1 design), Temporal deployment decision (2026-04-14)
TL;DR
The Aucert agent platform uses a layered architecture with different tools at each layer:
| Layer | What it handles | Choice |
|---|---|---|
| Layer 1 — Single-agent loop | LLM call → tool execution → repeat | Custom Kotlin |
| Layer 2 — Per-agent infrastructure | Memory, prompts, tools, telemetry | Custom Kotlin |
| Layer 3 — Cross-agent orchestration | Agent A delegates to Agent B, Pattern C | Temporal |
| Layer 4 — Durable workflow execution | Long-running workflows, retries, compensation | Temporal |
We are NOT using: LangChain, LangChain4j, LangGraph, AutoGen/AG2, CrewAI, Pydantic AI, Mastra, NeMo Agent Toolkit, Google ADK, Claude Code (as harness), OpenCode, Open SWE.
This decision is considered closed. Do not revisit without a concrete trigger (see "When to reconsider" at the end).
Context
During Wave 1 execution of the spec agent v0.1 build, the question arose: "Are we building too much ourselves? Should we use an agent framework?" This question resurfaced multiple times with different framings (LangChain for v0.1, then multi-agent frameworks for orchestration, then coding-agent products as harness). Each time, the evaluation arrived at the same answer. This document captures the final reasoning so it does not need to be re-derived.
The Aucert agent platform envisions 10-15 agents (spec, coder, reviewer, program manager, ops, research, QA, etc.) with cross-agent orchestration (Pattern C delegation), task-linked conversations, event telemetry, and end-to-end automation from spec to PR. This is a real agent platform, not a single script.
The question is: what tools do we take on vs. build ourselves at each layer?
The four layers
Agent platforms decompose into four layers. Conflating them leads to bad framework decisions.
Layer 1 — Single-agent loop
What it does: Take a task, call the LLM with system prompt + message history + tools, execute any tool calls, feed results back, repeat until the model produces a final answer.
Code size: ~50 lines of core logic.
Example (pseudocode):
    loop_count = 0
    loop:
        if loop_count >= max_steps:
            return max_steps_exceeded
        loop_count += 1
        response = model.call(system_prompt, messages, tools)
        if response.is_text:
            return response.text
        if response.is_tool_use:
            messages.append(response)
            for each tool_call in response.tool_calls:
                result = tools.execute(tool_call.name, tool_call.input)
                messages.append(tool_result(tool_call.id, result))
This is universal — every tool-using agent framework implements this loop.
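The same loop can be sketched in Kotlin. This is a minimal illustration, not the actual AgentLoop: every type below (Message, ModelResponse, the three-argument ModelClient.call, Tool, LoopOutcome) is a hypothetical stand-in, since the real interfaces are not shown in this document.

```kotlin
// Hypothetical types for illustration; the real ModelClient and Tool interfaces may differ.
data class ToolCall(val id: String, val name: String, val input: String)

sealed interface Message
data class UserMessage(val text: String) : Message
data class AssistantToolUse(val calls: List<ToolCall>) : Message
data class ToolResult(val callId: String, val output: String) : Message

sealed interface ModelResponse
data class TextResponse(val text: String) : ModelResponse
data class ToolUseResponse(val calls: List<ToolCall>) : ModelResponse

fun interface ModelClient {
    fun call(systemPrompt: String, messages: List<Message>, toolNames: Set<String>): ModelResponse
}

fun interface Tool {
    fun execute(input: String): String
}

sealed interface LoopOutcome
data class Completed(val text: String) : LoopOutcome
object MaxStepsExceeded : LoopOutcome

fun runAgentLoop(
    model: ModelClient,
    tools: Map<String, Tool>,
    systemPrompt: String,
    task: String,
    maxSteps: Int = 10,
): LoopOutcome {
    val messages = mutableListOf<Message>(UserMessage(task))
    for (step in 1..maxSteps) {
        when (val response = model.call(systemPrompt, messages, tools.keys)) {
            // A plain-text answer ends the loop.
            is TextResponse -> return Completed(response.text)
            is ToolUseResponse -> {
                // Record the model's tool request, then one result per call.
                messages += AssistantToolUse(response.calls)
                for (call in response.calls) {
                    val output = tools[call.name]?.execute(call.input)
                        ?: "error: unknown tool ${call.name}"
                    messages += ToolResult(call.id, output)
                }
            }
        }
    }
    return MaxStepsExceeded
}
```

Returning an error string for an unknown tool, rather than throwing, is one possible choice: it lets the model see the failure and correct itself on the next turn.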
Layer 2 — Per-agent infrastructure
What it does: Everything a single agent needs beyond the loop — memory, prompt assembly, tool implementations, client integrations (Slack, Postgres, GitHub), observability, secrets.
Code size: Depends on domain complexity. For Aucert's spec agent: ~5,000 lines across clients, tools, migrations, personalities, synthesis schema.
Aucert specifics:
- Postgres-backed topics, topic_inputs, topic_syntheses, topic_embeddings schema
- pgvector with HNSW indexes for semantic search
- Canonical vocabulary lookup from shared_kb_db
- Composable personalities from astra_db.personalities
- Codebase-grounded reading via LocalRepoClient
- Azure Key Vault secret integration
- Astra task_run lifecycle coordination
- 51 domain-specific tools
Layer 3 — Cross-agent orchestration
What it does: Agents talking to agents. Pattern C: "coder agent delegates to reviewer agent, waits synchronously from its perspective, resumes when done."
Code size: This is where complexity explodes. A naive custom implementation is 2,000-5,000 lines of coordination plumbing (event bus, state reconciliation, deadlock prevention, retry, compensation).
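To make the "delegate and wait" contract concrete, here is a plain-Kotlin, in-memory sketch. It is not Temporal code and has no durability; InMemoryOrchestrator, AgentRole, and the handler signature are hypothetical names invented for illustration.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

enum class AgentRole { CODER, REVIEWER }

// In-memory stand-in for an orchestrator: one single-threaded work queue per role.
// A handler receives the request plus the orchestrator, so it can delegate further.
class InMemoryOrchestrator(
    private val handlers: Map<AgentRole, (String, InMemoryOrchestrator) -> String>,
) {
    private val queues: Map<AgentRole, ExecutorService> =
        AgentRole.values().associateWith { Executors.newSingleThreadExecutor() }

    // Pattern C: submit work to the callee's queue and block until it replies.
    fun delegate(to: AgentRole, request: String): String {
        val handler = handlers.getValue(to)
        return queues.getValue(to).submit(Callable { handler(request, this) }).get()
    }

    fun shutdown() = queues.values.forEach { it.shutdown() }
}
```

Even this toy version shows one of the failure modes listed above: if the reviewer delegated back to the coder while the coder's only thread was blocked waiting, both would wait forever. A crash of the process would also lose every in-flight delegation.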
Layer 4 — Durable workflow execution
What it does: Long-running workflows survive container crashes, retries, timeouts. Wave-based dispatch with scatter-gather. Multi-day execution.
Code size: Building this from scratch is months of engineering. It is a well-known distributed-systems problem with mature solutions.
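Wave-based scatter-gather itself is simple to express; the hard part is surviving a crash partway through. The sketch below is a plain-JVM illustration with a hypothetical runWaves helper, assuming tasks are in-process lambdas. It keeps all state in memory, so a container restart would lose every completed wave.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Runs waves in order. Tasks inside one wave run in parallel (scatter);
// the next wave starts only after every result of the current wave is collected (gather).
fun <T> runWaves(waves: List<List<() -> T>>, parallelism: Int = 4): List<List<T>> {
    val pool = Executors.newFixedThreadPool(parallelism)
    try {
        return waves.map { wave ->
            // invokeAll blocks until all tasks in the wave have finished.
            val futures = pool.invokeAll(wave.map { task -> Callable { task() } })
            futures.map { it.get() }
        }
    } finally {
        pool.shutdown()
    }
}
```

Results come back in submission order within each wave, regardless of which task finished first.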
Decisions per layer
Layer 1 — Custom Kotlin
Decision: Write our own loop.
Reasoning:
- The loop is ~50 lines. Framework wrappers add 500+ lines of scaffolding for no meaningful savings.
- Our ModelClient interface cleanly abstracts providers (Bedrock now, Foundry later). Any framework would impose its own provider abstraction that does not know about our output token tiers (32k/16k/8k) or runtime config.
- We need specific behaviors (role-transparency signature, composable personality fragment injection, codebase-grounded reasoning) that don't map to framework templates.
Cost: ~200 lines of Kotlin across the AgentExecutor base class, AgentLoop, and per-agent subclasses, of which ~50 lines are the core loop.
Risk: Low. The loop pattern is well-understood and stable.
Layer 2 — Custom Kotlin
Decision: Write our own per-agent infrastructure.
Reasoning:
- Every domain abstraction we need (topics, personalities, canonical terms, codebase-grounded reading, Astra integration, event bus) is specific to Aucert. No framework provides these; we'd be building them anyway.
- Framework memory abstractions (conversation buffer, vector store) are too generic. Our Postgres schema is richer and purpose-built.
- The Tool interface + ToolRegistry we built is minimal and fits our needs exactly. LangChain4j's @Tool annotation + tool specification machinery solves the same problem with more indirection.
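For readers who have not seen the code, a minimal tool interface and registry might look like the following. The shapes are assumptions made for illustration: the real Tool and ToolRegistry signatures are not given in this document, and the in-memory map stands in for a real shared_kb_db query.

```kotlin
// Hypothetical shapes; the actual Tool and ToolRegistry in the codebase are not shown here.
interface Tool {
    val name: String
    val description: String
    fun execute(input: Map<String, String>): String
}

class ToolRegistry {
    private val tools = LinkedHashMap<String, Tool>()

    fun register(tool: Tool) {
        require(tool.name !in tools) { "duplicate tool: ${tool.name}" }
        tools[tool.name] = tool
    }

    // Names in registration order, e.g. for building the tool list sent to the model.
    fun names(): List<String> = tools.keys.toList()

    fun execute(name: String, input: Map<String, String>): String {
        val tool = tools[name] ?: return "error: unknown tool '$name'"
        return tool.execute(input)
    }
}

// Example tool; the lookup table stands in for a real database query.
class CanonicalTermsLookup(private val terms: Map<String, String>) : Tool {
    override val name = "canonical_terms_lookup"
    override val description = "Returns the canonical form of a term"
    override fun execute(input: Map<String, String>): String {
        val term = input["term"] ?: return "error: missing 'term'"
        return terms[term.lowercase()] ?: "no canonical term for '$term'"
    }
}
```

The point of the sketch is the size: registration, lookup, and dispatch fit in a few dozen lines with no reflection or annotation processing.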
Cost: ~5,000 lines for spec agent (already built through Wave 1). Amortizes across future agents — ~60% of this code is shared infrastructure (clients, tool registry, Postgres patterns).
Risk: Low. Our code is plain Kotlin with standard patterns; any Kotlin engineer can maintain it.
Layer 3 — Temporal
Decision: Use Temporal workflows for cross-agent orchestration.
Reasoning:
- Pattern C ("delegate and wait") maps directly to Temporal child workflows and signals.
- Building the equivalent ourselves (event bus coordination, state reconciliation, retry, compensation) is 2,000-5,000 lines of distributed-systems plumbing with known failure modes.
- Temporal's Java SDK ships a Kotlin support module (temporal-kotlin), so workflows and activities are written in Kotlin on the JVM. No language split.
- Temporal is not an agent framework — it wraps our agents. Zero coupling between Temporal and the agent's internal logic.
- Temporal server runs in our AKS cluster (~500MB RAM). Deployed April 2026.
Cost: Temporal server operational overhead, plus the deterministic workflow constraint: workflow code must be replay-safe, so side effects go through activities. This is a minor adjustment, since 99% of our code lives in activities unchanged.
Risk: Medium-low. Temporal is mature (Uber, Netflix, Stripe scale). The constraint to learn is deterministic workflow code.
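The determinism constraint is easier to reason about with a toy model of replay. The sketch below is not the Temporal SDK; WorkflowContext and reviewWorkflow are hypothetical names, and the history is a simple list of strings. It shows why side effects must be wrapped: on replay, recorded results are returned and the side effect is not executed again.

```kotlin
// Toy model of event-sourced replay; NOT the Temporal SDK.
class WorkflowContext(private val history: MutableList<String> = mutableListOf()) {
    private var cursor = 0
    var activityExecutions = 0
        private set

    // Returns the recorded result if this step already ran;
    // otherwise runs the side effect once and records its result.
    fun activity(sideEffect: () -> String): String {
        if (cursor < history.size) return history[cursor++]
        activityExecutions++
        val result = sideEffect()
        history += result
        cursor++
        return result
    }

    fun recordedHistory(): List<String> = history.toList()
}

// Workflow code: it must make the same sequence of activity calls on every run,
// so all nondeterminism (here, the model call) lives inside activities.
fun reviewWorkflow(ctx: WorkflowContext, callModel: () -> String): String {
    val draft = ctx.activity { callModel() }
    val verdict = ctx.activity { if (draft.length > 3) "approved" else "rejected" }
    return "$draft:$verdict"
}
```

If reviewWorkflow called callModel() directly instead of through ctx.activity, a replay after a crash would issue a second model call and could take a different branch than the original run.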
Layer 4 — Temporal
Decision: Use Temporal for all durable workflow execution.
Reasoning:
- Same rationale as Layer 3. Temporal is purpose-built for this.
- Native primitives: retries, timeouts, signals, queries, child workflows, continue-as-new for long-running work.
- Temporal Web UI gives us observability for free.
- Dispatcher role shrinks from 2,500 lines to ~500 lines (edge event ingestion, classification, workflow starter). The bulk is delegated to Temporal.
Cost: Same server as Layer 3.
Risk: Same as Layer 3.
What we evaluated and rejected
Coding agent products (Claude Code, OpenCode, Open SWE, Aider, Cline)
What they are: CLI/IDE tools for humans to delegate coding tasks.
Why rejected as harness:
- They are products, not libraries. No embedding API.
- Tool sets are hardcoded to coding (read/write files, bash, grep). No way to plug in topic_synthesize, canonical_terms_lookup, repo_list_adrs.
- State is session-local. No integration with our Postgres topics, Astra task_runs, event bus.
- They assume a human operator approves iterations.
Possible future use: These may be useful as TOOLS inside our coder agent workflows (the coder agent spawns Claude Code in an activity to execute complex edits). That is a different decision, not a harness replacement.
LangChain / LangGraph
What they are: Python-first agent framework + graph-based orchestration.
Why rejected:
- Python. Our backend is Kotlin. Polyglot split adds a service boundary and operational complexity for questionable benefit.
- LangGraph's graph routing solves orchestration inside one agent's conversation, not cross-agent orchestration. Temporal is a better fit for the latter.
- Memory abstractions are generic; our Postgres schema is purpose-built.
- No durable execution without additional infrastructure.
LangChain4j
What it is: A Java/JVM library for LLM applications, inspired by LangChain (an independent project rather than an official port).
Why rejected:
- The only framework in our language. Worth real evaluation. Verdict: marginal savings (~100 lines), real costs (framework coupling, unfamiliar debugging, abstractions that fight our domain).
- Our ModelClient interface already provides provider abstraction. LangChain4j's ChatLanguageModel is different and doesn't know about our output token tiers or runtime config.
- LangChain4j's pre-built agent types (ReAct, tool-calling) don't match our composable personality + role-transparency + codebase-grounded pattern.
- Framework dependency makes future migration harder (if we ever wanted to change something fundamental about the loop).
If we were starting with no code and no specific agent design: LangChain4j might be a reasonable default. We are not.
AutoGen (Microsoft) / AG2 (community fork)
Why rejected: Python-first. Its message-passing multi-agent pattern provides no durable execution, which Temporal workflows give us.
CrewAI
Why rejected: Python. Role-based abstraction is thinner than what our role enum + Temporal workflows provide.
Pydantic AI
Why rejected: Python. Type safety is already native in Kotlin.
Mastra
Why rejected: TypeScript. Wrong language for backend.
NeMo Agent Toolkit (Nvidia)
Why rejected: Solves multi-framework composition — a problem we don't have (we have ONE harness question).
Google ADK
Why rejected: Python-first. Newer, less mature. Good at multi-agent delegation semantics, but Temporal gives us that with first-class Kotlin support and durability.
The architecture this decision produces
┌────────────────────────────────────────────────────────────┐
│ Dispatcher (thin, ~500 lines) │
│ - Webhook ingestion (GitHub, Slack, Plane, manual API) │
│ - Agent role classification │
│ - Task_run audit logging to astra_db │
│ - Starts Temporal workflows │
└──────────────────┬─────────────────────────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────────────────────────┐
│ Temporal Server (Layer 3 + 4) │
│ - Workflow state storage (Postgres) │
│ - Signal routing, timeouts, retries │
│ - Web UI at temporal.aucert.dev │
│ - Child workflows for Pattern C delegation │
└──────────────────┬─────────────────────────────────────────┘
                   │ dispatches work
                   ▼
┌────────────────────────────────────────────────────────────┐
│ Temporal Worker Pods (one pool per agent role) │
│ │
│ Each pool runs: │
│ - Workflow definitions (Kotlin) │
│ - Activity definitions (Kotlin) │
│ - Our custom Layer 1 agent loop │
│ - Our custom Layer 2 tools/clients/memory │
│ │
│ Example agent roles: │
│ SpecAgent workers → aucert-default:spec-queue │
│ CoderAgent workers → aucert-default:coder-queue │
│ ReviewerAgent workers → aucert-default:reviewer-queue │
└────────────────────────────────────────────────────────────┘
Key properties:
- Framework dependency: Temporal only (orchestration, durable execution)
- All agent logic, tools, memory, prompts: our Kotlin code
- Zero coupling between agent internals and Temporal (agents can run outside Temporal if needed for testing)
- One Kotlin backend, no Python split
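The thin dispatcher's job (ingest, classify, start a workflow) can be sketched as a pure function. Everything here is illustrative: the event kinds and classification rules are invented, WorkflowStart is a hypothetical value object rather than a Temporal type, and reading "aucert-default:spec-queue" as namespace plus task queue is an assumption.

```kotlin
enum class Source { GITHUB, SLACK, PLANE, MANUAL_API }

// Queue names follow the diagram above.
enum class AgentRole(val taskQueue: String) {
    SPEC("spec-queue"),
    CODER("coder-queue"),
    REVIEWER("reviewer-queue"),
}

data class InboundEvent(val source: Source, val kind: String, val payload: String)

// Hypothetical description of a workflow start request, not a Temporal SDK type.
data class WorkflowStart(
    val namespace: String,
    val taskQueue: String,
    val workflowId: String,
    val input: String,
)

// Classification rules are invented for illustration.
fun classify(event: InboundEvent): AgentRole? = when {
    event.source == Source.GITHUB && event.kind == "pull_request.opened" -> AgentRole.REVIEWER
    event.source == Source.PLANE && event.kind == "issue.assigned" -> AgentRole.CODER
    event.source == Source.SLACK && event.kind == "spec.requested" -> AgentRole.SPEC
    else -> null
}

// Returns null for events no agent should handle.
fun toWorkflowStart(event: InboundEvent, eventId: String): WorkflowStart? {
    val role = classify(event) ?: return null
    return WorkflowStart(
        namespace = "aucert-default",
        taskQueue = role.taskQueue,
        // Deriving the workflow id from the event id keeps redelivered webhooks idempotent.
        workflowId = "${role.name.lowercase()}-$eventId",
        input = event.payload,
    )
}
```

Keeping classification pure makes it testable without a running Temporal server; only the final "start workflow" call would touch the SDK.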
Cost summary
| Category | Our custom code | Framework code |
|---|---|---|
| Agent loop (Layer 1) | ~200 lines (core loop ~50) | 0 |
| Per-agent infra (Layer 2) | ~5,000 lines (spec agent), amortized across future agents | 0 |
| Orchestration (Layer 3) | 0 (Temporal does it) | Temporal SDK + server |
| Durable execution (Layer 4) | 0 (Temporal does it) | Temporal SDK + server |
| Total custom code for spec agent v0.1 | ~5,200 lines | Temporal SDK + server |
Alternative (LangChain4j for Layers 1-2):
- ~5,100 lines still ours (Layer 2 is inherently custom)
- Additional LangChain4j dependency
- Temporal still needed for Layers 3-4
- Savings: ~100 lines. Cost: framework coupling.
Alternative (full custom for Layers 3-4):
- ~7,000+ additional lines for custom orchestration
- Ongoing distributed-systems bugs
- Worse observability
- No durability
When to reconsider
This decision is closed. Do not revisit unless one of these triggers fires:
- Multi-provider routing becomes complex. If we need fallback providers, cost-based routing, structured output validation across providers, or streaming coordination, LangChain4j's provider abstractions become attractive. Revisit only after we've hit real pain with our ModelClient interface.
- Temporal pain exceeds its value. If the deterministic workflow constraint becomes a regular blocker, or if operational cost of Temporal server exceeds its benefits, evaluate alternatives. Most likely response: keep Temporal, wrap the painful parts.
- A new framework emerges with specific fit to our stack. Unlikely but possible. Example: a Kotlin-native agent framework with first-class Temporal integration. Currently does not exist.
- Team size grows significantly. At 10+ engineers, framework standardization has higher value. Revisit when hiring past that threshold.
Do NOT revisit because:
- A blog post describes a new framework that looks cool
- Claude Code got a new feature
- Someone asks "have you considered X" without a concrete problem our current stack can't solve
- LangChain releases a major version
- You feel like we might be building too much
The thought process is captured here. Refer back to this document instead of re-deriving the answer.
References
- ADR-002: Bazel staged adoption (related decision on build-system tooling)
- SPEC-010: Spec agent v0.1 design (uses this layered architecture)
- SPEC-011: Spec agent v0.1 execution plan (implements this architecture)
- Temporal deployment plan: infra/terraform/internal-platform/temporal.tf
- Wave 1 codebase access amendment: docs/specs/drafts/SPEC-NNN-wave1-codebase-access-amendment.md
Amendment — 2026-05-06: Multi-provider model abstraction realized
What changed since the original decision: the ModelClient interface (described above as "Bedrock now, Foundry later") now has three working adapters in production. Atlas can route any given workflow to any of them, selected by operator tag in Google Docs comments.
Implemented adapters:
| Adapter | Models | Source |
|---|---|---|
| BedrockAdapter | Sonnet 4.6, Opus 4.7 (cross-region inference profiles us.anthropic.claude-*) | internal/backend/.../shared/model/adapters/BedrockAdapter.kt |
| AnthropicDirectAdapter | Sonnet 4.6, Opus 4.7 (via api.anthropic.com, separate cost stream from Bedrock) | .../adapters/AnthropicDirectAdapter.kt |
| FoundryOpenAIAdapter | Kimi K2.6 (and any Foundry deployment with an OpenAI-compatible interface) | .../adapters/FoundryOpenAIAdapter.kt |
The model selected for a workflow is resolved at activity start via ModelRegistry.resolveEntry(alias) → DefaultModelClientFactory instantiates the matching adapter. There is no per-request switching mid-run.
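The resolution path can be sketched as follows. Class names follow the text above, but every signature, field, alias, and model id string is an assumption, and the adapters are stubs that echo their input instead of calling a provider.

```kotlin
enum class Provider { BEDROCK, ANTHROPIC_DIRECT, FOUNDRY_OPENAI }

// Hypothetical registry entry; the real fields are not shown in this document.
data class ModelEntry(val alias: String, val provider: Provider, val modelId: String, val label: String)

interface ModelClient {
    val entry: ModelEntry
    fun complete(prompt: String): String
}

// Stub adapters: real ones would call the provider; these only echo.
class BedrockAdapter(override val entry: ModelEntry) : ModelClient {
    override fun complete(prompt: String) = "[${entry.label}] bedrock:$prompt"
}

class AnthropicDirectAdapter(override val entry: ModelEntry) : ModelClient {
    override fun complete(prompt: String) = "[${entry.label}] direct:$prompt"
}

class FoundryOpenAIAdapter(override val entry: ModelEntry) : ModelClient {
    override fun complete(prompt: String) = "[${entry.label}] foundry:$prompt"
}

class ModelRegistry(entries: List<ModelEntry>) {
    private val byAlias = entries.associateBy { it.alias }

    fun resolveEntry(alias: String): ModelEntry =
        byAlias[alias] ?: throw IllegalArgumentException("unknown model alias: $alias")
}

class DefaultModelClientFactory {
    // Exhaustive when: adding a Provider without an adapter branch fails to compile.
    fun create(entry: ModelEntry): ModelClient = when (entry.provider) {
        Provider.BEDROCK -> BedrockAdapter(entry)
        Provider.ANTHROPIC_DIRECT -> AnthropicDirectAdapter(entry)
        Provider.FOUNDRY_OPENAI -> FoundryOpenAIAdapter(entry)
    }
}
```

Because resolution happens once, at activity start, the client instance is fixed for the whole run, which matches the "no per-request switching" statement above.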
Operator-facing override: Atlas reads tags like [kimi], [opus], [opus-direct], [kimi][opus] from Drive comments and spawns secondary workflows via the request_model_response tool. Each secondary runs its own activity in its own workspace and posts its reply prefixed with a model label ([S46], [O47], [K26]).
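Tag extraction of this kind can be sketched with a small regex. This is an illustration under assumptions: the function name parseModelTags is hypothetical, the known-tag set contains only the tags named above, and the real parser may accept other tags or handle ordering differently.

```kotlin
// Tags named in the text; any other bracketed token is ignored.
val knownTags = setOf("kimi", "opus", "opus-direct")

// Matches lowercase bracketed tokens such as [kimi] or [opus-direct].
val tagPattern = Regex("""\[([a-z0-9-]+)\]""")

// Returns the recognized tags in order of first appearance, without duplicates.
fun parseModelTags(comment: String): List<String> =
    tagPattern.findAll(comment)
        .map { it.groupValues[1] }
        .filter { it in knownTags }
        .distinct()
        .toList()
```

Filtering against an allow-list means ordinary bracketed text in a comment, such as a footnote marker, does not spawn a secondary workflow. Uppercase reply labels like [S46] do not match the pattern, so a model's own reply cannot trigger another run.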
Why this is worth recording: the original ADR's prediction held — adding the second and third providers required no changes to the executor, the tools, or the personality prompt structure. Adding a new model (or even a new provider) is now a one-line registry entry plus one label branch, with the rest of the system inheriting support automatically. This validates the layered-abstraction call against the framework alternatives discussed above.
Reference docs:
- Model routing and operator labels: full operator + developer reference
- ADR-008 amendment 2026-05-03: Foundry deployment naming convention (case+dots) that FoundryOpenAIAdapter relies on
Changelog
| Date | Change | Author |
|---|---|---|
| 2026-04-15 | Initial decision captured | Vivek + Claude |
| 2026-05-06 | Amendment — multi-provider abstraction realized; three adapters + operator-tag override + reply labels operational | Vivek + Claude |