Decision: Agent Harness Strategy for Aucert Agent Platform

Status: Decided
Date: 2026-04-15
Deciders: Vivek (CEO), Claude (consultation)
Supersedes: None
Related: ADR-002 (Bazel staged adoption), SPEC-010 (spec agent v0.1 design), Temporal deployment decision (2026-04-14)

TL;DR

The Aucert agent platform uses a layered architecture with different tools at each layer:

Layer	What it handles	Choice
Layer 1 — Single-agent loop	LLM call → tool execution → repeat	Custom Kotlin
Layer 2 — Per-agent infrastructure	Memory, prompts, tools, telemetry	Custom Kotlin
Layer 3 — Cross-agent orchestration	Agent A delegates to Agent B, Pattern C	Temporal
Layer 4 — Durable workflow execution	Long-running workflows, retries, compensation	Temporal

We are NOT using: LangChain, LangChain4j, LangGraph, AutoGen/AG2, CrewAI, Pydantic AI, Mastra, NeMo Agent Toolkit, Google ADK, Claude Code (as harness), OpenCode, Open SWE, Aider, Cline.

This decision is considered closed. Do not revisit without a concrete trigger (see "When to reconsider" at the end).

Context

During Wave 1 execution of the spec agent v0.1 build, the question arose: "Are we building too much ourselves? Should we use an agent framework?" This question resurfaced multiple times with different framings (LangChain for v0.1, then multi-agent frameworks for orchestration, then coding-agent products as harness). Each time, the evaluation arrived at the same answer. This document captures the final reasoning so it does not need to be re-derived.

The Aucert agent platform envisions 10-15 agents (spec, coder, reviewer, program manager, ops, research, QA, etc.) with cross-agent orchestration (Pattern C delegation), task-linked conversations, event telemetry, and end-to-end automation from spec to PR. This is a real agent platform, not a single script.

The question is: what tools do we take on vs. build ourselves at each layer?

The four layers

Agent platforms decompose into four layers. Conflating them leads to bad framework decisions.

Layer 1 — Single-agent loop

What it does: Take a task, call the LLM with system prompt + message history + tools, execute any tool calls, feed results back, repeat until the model produces a final answer.

Code size: ~50 lines of core logic.

Example (pseudocode):

steps = 0
loop:
    if steps >= max_steps:
        return max_steps_exceeded
    steps = steps + 1
    response = model.call(system_prompt, messages, tools)
    if response.is_text:
        return response.text
    if response.is_tool_use:
        messages.append(response)
        for each tool_call in response.tool_calls:
            result = tools.execute(tool_call.name, tool_call.input)
            messages.append(tool_result(tool_call.id, result))

This is universal — every tool-using agent framework implements this loop.
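For concreteness, the same loop can be sketched in Kotlin. This is a minimal sketch, not the production AgentLoop: ModelResponse, Message, ToolCall, LoopResult, and the single-method ModelClient shown here are hypothetical shapes invented for illustration, and tools are reduced to plain functions from an input string to a result string.

```kotlin
// Hypothetical types for illustration; the real ModelClient and message types may differ.
data class ToolCall(val id: String, val name: String, val input: String)

sealed interface ModelResponse {
    data class Text(val text: String) : ModelResponse
    data class ToolUse(val calls: List<ToolCall>) : ModelResponse
}

sealed interface Message {
    data class User(val text: String) : Message
    data class AssistantToolUse(val calls: List<ToolCall>) : Message
    data class ToolResult(val callId: String, val content: String) : Message
}

fun interface ModelClient {
    fun call(systemPrompt: String, messages: List<Message>, toolNames: Set<String>): ModelResponse
}

sealed interface LoopResult {
    data class Answer(val text: String) : LoopResult
    object MaxStepsExceeded : LoopResult
}

fun runAgentLoop(
    model: ModelClient,
    systemPrompt: String,
    task: String,
    tools: Map<String, (String) -> String>,
    maxSteps: Int = 10,
): LoopResult {
    val messages = mutableListOf<Message>(Message.User(task))
    // At most maxSteps model calls; each iteration is one LLM turn.
    repeat(maxSteps) {
        when (val response = model.call(systemPrompt, messages, tools.keys)) {
            is ModelResponse.Text -> return LoopResult.Answer(response.text)
            is ModelResponse.ToolUse -> {
                // Record the assistant's tool-use turn before the tool results.
                messages += Message.AssistantToolUse(response.calls)
                for (call in response.calls) {
                    val result = tools[call.name]?.invoke(call.input)
                        ?: "error: unknown tool ${call.name}"
                    messages += Message.ToolResult(call.id, result)
                }
            }
        }
    }
    return LoopResult.MaxStepsExceeded
}
```

Because ModelClient is a single-method interface, a test can pass a scripted lambda that returns a tool call on the first turn and text on the second, which exercises both branches without a real provider.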

Layer 2 — Per-agent infrastructure

What it does: Everything a single agent needs beyond the loop — memory, prompt assembly, tool implementations, client integrations (Slack, Postgres, GitHub), observability, secrets.

Code size: Depends on domain complexity. For Aucert's spec agent: ~5,000 lines across clients, tools, migrations, personalities, synthesis schema.

Aucert specifics:

Postgres-backed topics, topic_inputs, topic_syntheses, topic_embeddings schema
pgvector with HNSW indexes for semantic search
Canonical vocabulary lookup from shared_kb_db
Composable personalities from astra_db.personalities
Codebase-grounded reading via LocalRepoClient
Azure Key Vault secret integration
Astra task_run lifecycle coordination
51 domain-specific tools

Layer 3 — Cross-agent orchestration

What it does: Agents talking to agents. Pattern C: "coder agent delegates to reviewer agent, waits synchronously from its perspective, resumes when done."

Code size: This is where complexity explodes. Naive custom implementation is 2,000-5,000 lines of coordination plumbing (event bus, state reconciliation, deadlock prevention, retry, compensation).

Layer 4 — Durable workflow execution

What it does: Long-running workflows survive container crashes, retries, timeouts. Wave-based dispatch with scatter-gather. Multi-day execution.

Code size: Building this from scratch is months of engineering. It is a well-known distributed-systems problem with mature solutions.

Decisions per layer

Layer 1 — Custom Kotlin

Decision: Write our own loop.

Reasoning:

The loop is ~50 lines. Framework wrappers add 500+ lines of scaffolding for no meaningful savings.
Our ModelClient interface cleanly abstracts providers (Bedrock now, Foundry later). Any framework would impose its own provider abstraction that does not know about our output token tiers (32k/16k/8k) or runtime config.
We need specific behaviors (role-transparency signature, composable personality fragment injection, codebase-grounded reasoning) that don't map to framework templates.

Cost: ~200 lines of Kotlin across AgentExecutor base class, AgentLoop, per-agent subclasses.

Risk: Low. The loop pattern is well-understood and stable.

Layer 2 — Custom Kotlin

Decision: Write our own per-agent infrastructure.

Reasoning:

Every domain abstraction we need (topics, personalities, canonical terms, codebase-grounded reading, Astra integration, event bus) is specific to Aucert. No framework provides these; we'd be building them anyway.
Framework memory abstractions (conversation buffer, vector store) are too generic. Our Postgres schema is richer and purpose-built.
The Tool interface + ToolRegistry we built is minimal and fits our needs exactly. LangChain4j's @Tool annotation + tool specification machinery solves the same problem with more indirection.

Cost: ~5,000 lines for spec agent (already built through Wave 1). Amortizes across future agents — ~60% of this code is shared infrastructure (clients, tool registry, Postgres patterns).

Risk: Low. Our code is plain Kotlin with standard patterns; any Kotlin engineer can maintain it.
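To show how little machinery a minimal tool layer needs, here is a hedged sketch of a Tool interface and ToolRegistry. The document names both types but does not show them, so the signatures below are assumptions; canonical_terms_lookup is a tool name taken from this document, but its behavior here (a lowercase map lookup) is invented for illustration.

```kotlin
// Hypothetical shapes; the real Tool and ToolRegistry may differ.
interface Tool {
    val name: String
    val description: String
    fun execute(input: Map<String, String>): String
}

class ToolRegistry {
    private val tools = LinkedHashMap<String, Tool>()

    fun register(tool: Tool) {
        require(tool.name !in tools) { "duplicate tool: ${tool.name}" }
        tools[tool.name] = tool
    }

    // Names are what gets advertised to the model on each call.
    fun names(): List<String> = tools.keys.toList()

    // Failures are returned as text so the model can see them and recover.
    fun execute(name: String, input: Map<String, String>): String {
        val tool = tools[name] ?: return "error: unknown tool '$name'"
        return try {
            tool.execute(input)
        } catch (e: Exception) {
            "error: ${e.message}"
        }
    }
}

// Illustrative tool; the real lookup reads from shared_kb_db rather than an in-memory map.
class CanonicalTermsLookup(private val vocabulary: Map<String, String>) : Tool {
    override val name = "canonical_terms_lookup"
    override val description = "Maps a term to its canonical form"

    override fun execute(input: Map<String, String>): String {
        val term = input["term"] ?: return "error: missing 'term'"
        return vocabulary[term.lowercase()] ?: "no canonical form for '$term'"
    }
}
```

Returning errors as strings rather than throwing is one possible design choice: it keeps the agent loop free of tool-specific exception handling.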

Layer 3 — Temporal

Decision: Use Temporal workflows for cross-agent orchestration.

Reasoning:

Pattern C ("delegate and wait") maps directly to Temporal child workflows and signals.
Building the equivalent custom (event bus coordination, state reconciliation, retry, compensation) is 2,000-5,000 lines of distributed-systems plumbing with known failure modes.
Temporal's Java SDK includes a Kotlin support module (temporal-kotlin), so workflows and activities are written in Kotlin on the JVM. No language split.
Temporal is not an agent framework — it wraps our agents. Zero coupling between Temporal and the agent's internal logic.
Temporal server runs in our AKS cluster (~500MB RAM). Deployed April 2026.

Cost: Operational overhead of running the Temporal server, plus the deterministic workflow constraint: side effects must go through activities. This is a minor adjustment, since nearly all of our code lives in activities and stays unchanged.

Risk: Medium-low. Temporal is mature (Uber, Netflix, Stripe scale). The main thing the team has to learn is how to write deterministic workflow code.
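Pattern C can be sketched with the Temporal Java SDK called from Kotlin. This is an illustrative sketch, not Aucert's workflow code: CoderWorkflow, ReviewerWorkflow, CoderActivities, the timeout and retry values, and the bare queue name "reviewer-queue" are all assumptions, and compiling it requires the io.temporal:temporal-sdk dependency.

```kotlin
import io.temporal.activity.ActivityInterface
import io.temporal.activity.ActivityOptions
import io.temporal.common.RetryOptions
import io.temporal.workflow.ChildWorkflowOptions
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod
import java.time.Duration

// Hypothetical activity: runs the Layer 1 loop. All LLM calls and tool
// side effects happen here, outside deterministic workflow code.
@ActivityInterface
interface CoderActivities {
    fun runCoderAgent(taskId: String): String
}

// Hypothetical reviewer workflow, served by the reviewer worker pool.
@WorkflowInterface
interface ReviewerWorkflow {
    @WorkflowMethod
    fun review(patchRef: String): String
}

@WorkflowInterface
interface CoderWorkflow {
    @WorkflowMethod
    fun implement(taskId: String): String
}

class CoderWorkflowImpl : CoderWorkflow {
    // Timeout and retry values are placeholders, not tuned settings.
    private val activities = Workflow.newActivityStub(
        CoderActivities::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofMinutes(30))
            .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
            .build(),
    )

    override fun implement(taskId: String): String {
        val patchRef = activities.runCoderAgent(taskId)

        // Pattern C: delegate to the reviewer as a child workflow and wait.
        val reviewer = Workflow.newChildWorkflowStub(
            ReviewerWorkflow::class.java,
            ChildWorkflowOptions.newBuilder().setTaskQueue("reviewer-queue").build(),
        )
        return reviewer.review(patchRef)
    }
}
```

The call to reviewer.review blocks only from the coder workflow's point of view. Temporal records the workflow's progress in its event history, so a worker restart during the review should not lose the delegation.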

Layer 4 — Temporal

Decision: Use Temporal for all durable workflow execution.

Reasoning:

Same rationale as Layer 3. Temporal is purpose-built for this.
Native primitives: retries, timeouts, signals, queries, child workflows, continue-as-new for long-running work.
Temporal Web UI gives us observability for free.
Dispatcher role shrinks from 2,500 lines to ~500 lines (edge event ingestion, classification, workflow starter). The bulk is delegated to Temporal.

Cost: Same server as Layer 3.

Risk: Same as Layer 3.

What we evaluated and rejected

Coding agent products (Claude Code, OpenCode, Open SWE, Aider, Cline)

What they are: CLI/IDE tools for humans to delegate coding tasks.

Why rejected as harness:

They are products, not libraries. No embedding API.
Tool sets are hardcoded to coding (read/write files, bash, grep). No way to plug in topic_synthesize, canonical_terms_lookup, repo_list_adrs.
State is session-local. No integration with our Postgres topics, Astra task_runs, event bus.
They assume a human operator approves iterations.

Possible future use: These may be useful as TOOLS inside our coder agent workflows (the coder agent spawns Claude Code in an activity to execute complex edits). That is a different decision, not a harness replacement.

LangChain / LangGraph

What they are: Python-first agent framework + graph-based orchestration.

Why rejected:

Python. Our backend is Kotlin. Polyglot split adds a service boundary and operational complexity for questionable benefit.
LangGraph's graph routing solves orchestration inside one agent's conversation, not cross-agent orchestration. Temporal is a better fit for the latter.
Memory abstractions are generic; our Postgres schema is purpose-built.
No durable execution without additional infrastructure.

LangChain4j

What it is: A Java library for the JVM inspired by LangChain (usable from Kotlin), not a direct port.

Why rejected:

The only evaluated framework that runs on the JVM, so it was worth a real evaluation. Verdict: marginal savings (~100 lines) against real costs (framework coupling, unfamiliar debugging, abstractions that fight our domain).
Our ModelClient interface already provides provider abstraction. LangChain4j's ChatLanguageModel is different and doesn't know about our output token tiers or runtime config.
LangChain4j's pre-built agent types (ReAct, tool-calling) don't match our composable personality + role-transparency + codebase-grounded pattern.
Framework dependency makes future migration harder (if we ever wanted to change something fundamental about the loop).

If we were starting with no code and no specific agent design: LangChain4j might be a reasonable default. We are not.

AutoGen (Microsoft) / AG2

Why rejected: Python-first. Its message-passing multi-agent pattern does not provide durable execution, which is what Temporal workflows give us.

CrewAI

Why rejected: Python. Role-based abstraction is thinner than what our role enum + Temporal workflows provide.

Pydantic AI

Why rejected: Python. Type safety is already native in Kotlin.

Mastra

Why rejected: TypeScript. Wrong language for backend.

NeMo Agent Toolkit (Nvidia)

Why rejected: Solves multi-framework composition — a problem we don't have (we have ONE harness question).

Google ADK

Why rejected: Python-first. Newer, less mature. Good at multi-agent delegation semantics, but Temporal gives us that with first-class Kotlin support and durability.

The architecture this decision produces

┌────────────────────────────────────────────────────────────┐
│  Dispatcher (thin, ~500 lines)                             │
│  - Webhook ingestion (GitHub, Slack, Plane, manual API)    │
│  - Agent role classification                               │
│  - Task_run audit logging to astra_db                      │
│  - Starts Temporal workflows                               │
└──────────────────┬─────────────────────────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────────────────────────┐
│  Temporal Server (Layer 3 + 4)                             │
│  - Workflow state storage (Postgres)                       │
│  - Signal routing, timeouts, retries                       │
│  - Web UI at temporal.aucert.dev                           │
│  - Child workflows for Pattern C delegation                │
└──────────────────┬─────────────────────────────────────────┘
                   │ dispatches work
                   ▼
┌────────────────────────────────────────────────────────────┐
│  Temporal Worker Pods (one pool per agent role)            │
│                                                            │
│  Each pool runs:                                           │
│  - Workflow definitions (Kotlin)                           │
│  - Activity definitions (Kotlin)                           │
│  - Our custom Layer 1 agent loop                           │
│  - Our custom Layer 2 tools/clients/memory                 │
│                                                            │
│  Example agent roles:                                      │
│    SpecAgent workers     → aucert-default:spec-queue       │
│    CoderAgent workers    → aucert-default:coder-queue      │
│    ReviewerAgent workers → aucert-default:reviewer-queue   │
└────────────────────────────────────────────────────────────┘

Key properties:

Framework dependency: Temporal only (orchestration, durable execution)
All agent logic, tools, memory, prompts: our Kotlin code
Zero coupling between agent internals and Temporal (agents can run outside Temporal if needed for testing)
One Kotlin backend, no Python split
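The thin dispatcher's job (classify an inbound event, pick a role, start a workflow on that role's queue) can be sketched as below. This is a hedged illustration: the event kinds and routing rules are invented, only the three queue names come from the diagram above (read here as queue names within the aucert-default namespace, which is an assumption), and the actual workflow start call on the Temporal client is left out.

```kotlin
// Hypothetical roles; queue names follow the diagram, minus the namespace prefix.
enum class AgentRole(val taskQueue: String) {
    SPEC("spec-queue"),
    CODER("coder-queue"),
    REVIEWER("reviewer-queue"),
}

// Minimal stand-in for a webhook payload after ingestion.
data class InboundEvent(val source: String, val kind: String, val eventId: String)

// What the dispatcher would hand to the Temporal client to start a workflow.
data class WorkflowStart(val role: AgentRole, val taskQueue: String, val workflowId: String)

// Illustrative classification rules; the real rules are not shown in this document.
fun classify(event: InboundEvent): AgentRole? = when {
    event.source == "slack" && event.kind == "spec_request" -> AgentRole.SPEC
    event.source == "plane" && event.kind == "issue_assigned" -> AgentRole.CODER
    event.source == "github" && event.kind == "pull_request_opened" -> AgentRole.REVIEWER
    else -> null
}

fun planStart(event: InboundEvent): WorkflowStart? {
    val role = classify(event) ?: return null
    // Deriving the workflow ID from the event ID maps a redelivered webhook
    // to the same workflow instead of a new one.
    val workflowId = "${role.name.lowercase()}-${event.eventId}"
    return WorkflowStart(role, role.taskQueue, workflowId)
}
```

Keeping classification a pure function of the event makes the dispatcher easy to unit test without a Temporal server.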

Cost summary

Category	Our custom code	Framework code
Agent loop (Layer 1)	~200 lines (core loop ~50)	0
Per-agent infra (Layer 2)	~5,000 lines (spec agent), amortized across future agents	0
Orchestration (Layer 3)	0 (Temporal does it)	Temporal SDK + server
Durable execution (Layer 4)	0 (Temporal does it)	Temporal SDK + server
Total custom code for spec agent v0.1	~5,200 lines	Temporal SDK + server

Alternative (LangChain4j for Layers 1-2):

~5,100 lines still ours (Layer 2 is inherently custom)
Additional LangChain4j dependency
Temporal still needed for Layers 3-4
Savings: ~100 lines. Cost: framework coupling.

Alternative (full custom for Layers 3-4):

~7,000+ additional lines for custom orchestration
Ongoing distributed-systems bugs
Worse observability
No durability

When to reconsider

This decision is closed. Do not revisit unless one of these triggers fires:

Multi-provider routing becomes complex. If we need fallback providers, cost-based routing, structured output validation across providers, streaming coordination — LangChain4j's provider abstractions become attractive. Revisit only after we've hit real pain with our ModelClient interface.
Temporal pain exceeds its value. If the deterministic workflow constraint becomes a regular blocker, or if operational cost of Temporal server exceeds its benefits, evaluate alternatives. Most likely response: keep Temporal, wrap the painful parts.
A new framework emerges with specific fit to our stack. Unlikely but possible. Example: a Kotlin-native agent framework with first-class Temporal integration. Currently does not exist.
Team size grows significantly. At 10+ engineers, framework standardization has higher value. Revisit when hiring past that threshold.

Do NOT revisit because:

A blog post describes a new framework that looks cool
Claude Code got a new feature
Someone asks "have you considered X" without a concrete problem our current stack can't solve
LangChain releases a major version
You feel like we might be building too much

The thought process is captured here. Refer back to this document instead of re-deriving the answer.

References

ADR-002 — Bazel staged adoption (related decision on build-system tooling)
SPEC-010 — Spec agent v0.1 design (uses this layered architecture)
SPEC-011 — Spec agent v0.1 execution plan (implements this architecture)
Temporal deployment plan — infra/terraform/internal-platform/temporal.tf
Wave 1 codebase access amendment — docs/specs/drafts/SPEC-NNN-wave1-codebase-access-amendment.md

Amendment — 2026-05-06: Multi-provider model abstraction realized

What changed since the original decision: the ModelClient interface (described above as "Bedrock now, Foundry later") now has three working adapters in production. Atlas can route any given workflow to any of them, selected by operator tag in Google Docs comments.

Implemented adapters:

Adapter	Models	Source
`BedrockAdapter`	Sonnet 4.6, Opus 4.7 (cross-region inference profiles `us.anthropic.claude-*`)	`internal/backend/.../shared/model/adapters/BedrockAdapter.kt`
`AnthropicDirectAdapter`	Sonnet 4.6, Opus 4.7 (via api.anthropic.com, separate cost stream from Bedrock)	`.../adapters/AnthropicDirectAdapter.kt`
`FoundryOpenAIAdapter`	Kimi K2.6 (and any Foundry deployment with an OpenAI-compatible interface)	`.../adapters/FoundryOpenAIAdapter.kt`

The model selected for a workflow is resolved at activity start via ModelRegistry.resolveEntry(alias) → DefaultModelClientFactory instantiates the matching adapter. There is no per-request switching mid-run.
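The resolution path described above can be sketched as follows. Only the names ModelRegistry.resolveEntry, DefaultModelClientFactory, and the three adapter classes come from this document; the ModelEntry and Provider types, the create method, the describe method, the aliases, and the placeholder model IDs are assumptions made for illustration.

```kotlin
// Hypothetical provider tag carried by each registry entry.
enum class Provider { BEDROCK, ANTHROPIC_DIRECT, FOUNDRY_OPENAI }

// Hypothetical registry entry; modelId values used with it are placeholders.
data class ModelEntry(val alias: String, val provider: Provider, val modelId: String)

// Stand-in for the real ModelClient interface, reduced to one method.
interface ModelClient {
    fun describe(): String
}

class BedrockAdapter(private val entry: ModelEntry) : ModelClient {
    override fun describe() = "bedrock:${entry.modelId}"
}

class AnthropicDirectAdapter(private val entry: ModelEntry) : ModelClient {
    override fun describe() = "anthropic:${entry.modelId}"
}

class FoundryOpenAIAdapter(private val entry: ModelEntry) : ModelClient {
    override fun describe() = "foundry:${entry.modelId}"
}

class ModelRegistry(entries: List<ModelEntry>) {
    private val byAlias = entries.associateBy { it.alias }

    // Fails fast on unknown aliases rather than silently picking a default.
    fun resolveEntry(alias: String): ModelEntry =
        byAlias[alias] ?: throw IllegalArgumentException("unknown model alias: $alias")
}

class DefaultModelClientFactory {
    // Called once at activity start; the chosen adapter is used for the whole run.
    fun create(entry: ModelEntry): ModelClient = when (entry.provider) {
        Provider.BEDROCK -> BedrockAdapter(entry)
        Provider.ANTHROPIC_DIRECT -> AnthropicDirectAdapter(entry)
        Provider.FOUNDRY_OPENAI -> FoundryOpenAIAdapter(entry)
    }
}
```

In a shape like this, adding a model on an existing provider is one more ModelEntry in the list, and the exhaustive when forces a compile error if a new Provider value is added without an adapter branch.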

Operator-facing override: atlas reads tags like [kimi], [opus], [opus-direct], [kimi][opus] from Drive comments and spawns secondary workflows via the request_model_response tool. Each secondary runs its own activity in its own workspace, posts its reply prefixed with a model label ([S46], [O47], [K26]).
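Tag extraction from a comment can be sketched with a small regex. The function name parseModelTags, the allow-list, and the choice to ignore unknown or uppercase bracketed text are assumptions; only the tag spellings [kimi], [opus], and [opus-direct] come from this document.

```kotlin
// Matches bracketed lowercase tags such as [kimi] or [opus-direct].
val TAG_PATTERN = Regex("""\[([a-z][a-z0-9-]*)\]""")

// Hypothetical allow-list; in practice this would come from the model registry.
val KNOWN_TAGS = setOf("kimi", "opus", "opus-direct")

// Returns the requested model tags in order of first appearance, without duplicates.
// Reply labels such as [S46] start with an uppercase letter, so they never match.
fun parseModelTags(comment: String): List<String> =
    TAG_PATTERN.findAll(comment)
        .map { it.groupValues[1] }
        .filter { it in KNOWN_TAGS }
        .distinct()
        .toList()
```

With this sketch, a comment containing [kimi][opus] yields two tags, which would correspond to two secondary workflows.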

Why this is worth recording: the original ADR's prediction held — adding the second and third providers required no changes to the executor, the tools, or the personality prompt structure. Adding a new model (or even a new provider) is now a one-line registry entry plus one label branch, with the rest of the system inheriting support automatically. This validates the layered-abstraction call against the framework alternatives discussed above.

Reference docs:

Model routing and operator labels — full operator + developer reference
ADR-008 amendment 2026-05-03 — Foundry deployment naming convention (case+dots) that FoundryOpenAIAdapter relies on

Changelog

Date	Change	Author
2026-04-15	Initial decision captured	Vivek + Claude
2026-05-06	Amendment — multi-provider abstraction realized; three adapters + operator-tag override + reply labels operational	Vivek + Claude
