ADR-009: LLM model selection strategy for the 5-layer testing pipeline

Context

Each layer of the Aucert pipeline has different LLM requirements:

Layer	Requirement	Key capability
Generation (L1)	Design test scenarios from app screenshots	Multimodal vision, agent swarm
Execution (L2)	Run tests on emulators	Not LLM-driven (engine)
Analysis (L3)	Bulk result interpretation	High throughput, low cost
Decision (L4)	Pass/fail with confidence scoring	Reasoning quality
Reporting (L5)	Bug reports, dashboards	Cheap natural language generation

We evaluated available "Direct from Azure" (credit-eligible) models across: benchmark performance (SWE-bench, HumanEval), cost per million tokens, multimodal capability, agentic tool-calling, and context window.

Model	Input $/1M	Output $/1M	Vision	Reasoning	Context	Registration
GPT-5.1-Codex	~$2.00	~$8.00	No	High	400K	None
Kimi K2.5	~$0.28	~$0.77	Yes	Medium	256K	None
DeepSeek V3.2	~$0.14	~$0.42	No	Medium	128K	None
GPT-4o-mini	$0.15	$0.60	Yes	Low-Med	128K	None
GPT-5.2-Codex	~$3.00	~$12.00	No	Very High	400K	Required
GPT-5.3-Codex	~$5.00	~$20.00	No	Highest	400K	Required; approved 2026-04-15
GPT-5.4	~$4.00	~$16.00	No	Very High	400K+	Required; approved 2026-04-15
GPT-5.4-Pro	~$8.00	~$32.00	No	Highest+	400K+	Required; approved 2026-04-15

Decision

Layer-to-model mapping with phased upgrades:

Pipeline layer	Day 1 model	Current model (as of 2026-04-20)	Why this model
Generation	Kimi K2.5	Kimi K2.5 (unchanged)	Native multimodal vision (screenshots to test cases), Agent Swarm for parallel generation, 256K context, lowest-cost vision model with Medium-tier reasoning (GPT-4o-mini is cheaper per token but rated Low-Med)
Execution	N/A	N/A	Engine-driven, not LLM
Analysis	DeepSeek V3.2	DeepSeek V3.2 (unchanged)	Cheapest capable model for bulk structured analysis, good at classification/extraction
Decision (standard)	GPT-5.1-Codex	GPT-5.4	Approved gated model; stronger reasoning than 5.1-Codex; incorporates Codex-class coding ability
Decision (complex)	N/A	GPT-5.4-Pro	Heavy multi-step reasoning; approved gated model; use sparingly (higher cost)
Decision (code-specific)	N/A	GPT-5.3-Codex	Pure code generation/analysis tasks; approved gated model
Reporting	GPT-4o-mini	GPT-4o-mini (unchanged)	Cheapest useful model, excellent at natural language generation

Model routing is via K8s ConfigMap (llm-config) environment variables, not code changes. Upgrading the Decision layer model is a one-line ConfigMap patch.
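A minimal sketch of how a backend might read this routing from the environment, assuming the ConfigMap keys are injected as environment variables. The variable names are the ones listed in the amendments below; the role keys and the resolve_deployment helper are hypothetical and not taken from the Aucert codebase.

```python
import os

# Env var names come from this ADR's amendments; the role keys on the left
# ("decision_standard", ...) are hypothetical labels chosen for illustration.
ROLE_TO_ENV_VAR = {
    "decision_standard": "FOUNDRY_MODEL_REASONING",
    "decision_complex": "FOUNDRY_MODEL_REASONING_PRO",
    "decision_code": "FOUNDRY_MODEL_CODEX",
    "reporting": "FOUNDRY_MODEL_REPORTING",
}


def resolve_deployment(role: str, env=os.environ) -> str:
    """Return the deployment name configured for a pipeline role."""
    var = ROLE_TO_ENV_VAR[role]
    value = env.get(var)
    if not value:
        # Fail loudly rather than silently falling back to a default model.
        raise RuntimeError(f"{var} is not set; check the llm-config ConfigMap")
    return value
```

Because the code only ever looks up a variable name, changing the model behind a role is a configuration change and needs no code change.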

Estimated monthly cost: $45-105. At this burn rate, the $1,000 in Founders Hub credits lasts roughly 9.5 to 22 months.
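The runway figure is the credit balance divided by the monthly spend at each end of the estimate. A small sketch of that arithmetic, using only the numbers stated above (the function name is illustrative):

```python
def runway_months(credits_usd: float, monthly_cost_usd: float) -> float:
    """Months of runway at a constant monthly spend."""
    return credits_usd / monthly_cost_usd


CREDITS_USD = 1_000
LOW_COST_USD, HIGH_COST_USD = 45, 105

shortest = runway_months(CREDITS_USD, HIGH_COST_USD)  # spend at the top of the range
longest = runway_months(CREDITS_USD, LOW_COST_USD)    # spend at the bottom of the range
print(f"{shortest:.1f} to {longest:.1f} months")  # prints "9.5 to 22.2 months"
```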

Alternatives considered

Option	Pros	Cons
Per-layer optimization (chosen)	Lowest cost, each layer gets best-fit model, all credit-eligible	Multiple models to monitor, different failure modes per layer
Single model for all layers (GPT-4.1-mini)	Simplest — one model, one cost profile	Sacrifices reasoning quality for Decision layer, misses Kimi's multimodal advantage for Generation
Claude Opus 4.6 for all layers	Best quality (80.8% SWE-bench)	NOT credit-eligible (marketplace item, $15/$25 per 1M tokens). Would drain credits in weeks.
Gemini via Vertex	Strong multimodal	Not available on Azure Foundry. Would require GCP infrastructure.

Consequences

What becomes easier

Cost optimization: each layer uses the most cost-effective model for its task
Model upgrades: change one ConfigMap value, zero downtime, zero code changes
Experimentation: deploy additional models (zero idle cost) and A/B test per layer

What becomes harder

Debugging: different models have different failure modes and output formats
Prompt engineering: prompts may need per-model tuning (token limits, instruction following)
Monitoring: need per-model metrics (latency, error rate, token usage)

Risks

Chinese model provenance: Kimi (Moonshot AI) and DeepSeek may be flagged by enterprise customers in security reviews. Mitigation: these models process app metadata, not customer PII. Can swap to GPT-4.1 variants if required by a customer's security policy.
GPT-5.1-Codex Responses API only: This model does NOT support /chat/completions (chatCompletion: false). It requires the Responses API (/responses). The backend LLM adapter must handle this different API format for the Decision layer.
Kimi K2.5 thinking model: Model output is returned in reasoning_content (not content). Backend must read reasoning_content and use higher token limits (100+). Known issues when using thinking + tool-calling simultaneously. Mitigation: monitor error rates, fall back to GPT-4.1 if needed.
DeepSeek V3.2 context degradation: Quality degrades past 32-64K context in practice despite 128K limit. Mitigation: chunk analysis inputs to stay under 32K.
Gated model wait: GPT-5.2/5.3-Codex require registration with 1-2 week approval. Mitigation: GPT-5.1-Codex is capable enough for Day 1; registration submitted in parallel.
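The adapter-level mitigations in the risks above (Responses-API-only models, reasoning_content, and input chunking) could be sketched as below. This is an illustration under stated assumptions, not the backend's actual adapter: the function names are hypothetical, the message is assumed to be a chat-style dict with content and reasoning_content keys, and token counts are approximated at four characters per token instead of using a real tokenizer.

```python
# Deployments that only accept the Responses API. Per this ADR that is
# gpt-5.1-codex; extend the set if other deployments behave the same way.
RESPONSES_ONLY = {"gpt-5.1-codex"}


def api_path(deployment: str) -> str:
    """Pick the request path for a deployment."""
    return "/responses" if deployment in RESPONSES_ONLY else "/chat/completions"


def message_text(message: dict) -> str:
    """Read the answer from a chat-style message (assumed shape).

    Falls back to reasoning_content when content is missing or empty,
    which is the Kimi behaviour described in the risk above.
    """
    return message.get("content") or message.get("reasoning_content") or ""


def chunk_text(text: str, max_tokens: int = 32_000, chars_per_token: int = 4) -> list[str]:
    """Split text into pieces estimated to stay under max_tokens.

    The estimate is len(text) / chars_per_token, a rough heuristic only.
    """
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Splitting on raw character offsets can cut a record in half; a real implementation would more likely split on result boundaries and count tokens with the model's tokenizer.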

Amendment — 2026-04-20

Decision: Upgrade the Decision layer from GPT-5.1-Codex to GPT-5.4 (standard), and introduce two new Decision sub-roles: GPT-5.4-Pro (complex multi-step) and GPT-5.3-Codex (code-specific). All three gated models were approved 2026-04-15.

Rationale: GPT-5.4 delivers stronger reasoning than GPT-5.1-Codex and incorporates Codex-class code understanding. Splitting the Decision layer into three sub-roles enables cost-optimal routing: standard tasks use GPT-5.4, heavy reasoning uses GPT-5.4-Pro (sparingly), and code analysis uses GPT-5.3-Codex.

Changes made:

infra/terraform/foundation/foundry.tf: Added gpt_54 and gpt_53_codex deployments (both in the westus primary account)
infra/terraform/foundation/foundry-east.tf (new): Secondary Foundry account aucert-ai-east in eastus2 hosting gpt_54_pro (the only region where that model is available as of 2026-04-20)
infra/k8s/aucert-dev/llm-config.yaml: FOUNDRY_MODEL_REASONING → gpt-5-4; added FOUNDRY_MODEL_REASONING_PRO: gpt-5-4-pro, FOUNDRY_MODEL_CODEX: gpt-5-3-codex, and FOUNDRY_ENDPOINT_REASONING_PRO pointing to the east endpoint
gpt-5-1-codex deployment retained in Terraform for rollback; no longer active in ConfigMap routing

Model → Foundry account routing (current):

Deployment	Foundry account	Region
Kimi-K2.6	aucert-ai	westus
DeepSeek-V3.2	aucert-ai	westus
gpt-5.1-codex (retired)	aucert-ai	westus
gpt-5.4	aucert-ai	westus
gpt-5.3-codex	aucert-ai	westus
gpt-5.4-pro	aucert-ai-east	eastus2

See ADR-008 (§Amendment 2026-04-20) for the region strategy and the dual-account rationale.

Deviation from original intent: the pipeline model mapping table earlier in this ADR treats the Decision layer as three sub-roles (standard, complex, code-specific) served by three deployments. Those deployments now live in two different Foundry accounts. Backend routing selects the endpoint per model; semantically nothing changes.
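Per-model endpoint selection can be sketched as a lookup from deployment to Foundry account. The mapping is copied from the routing table above; the endpoint_for helper is hypothetical, and the base URLs are supplied by the caller because the real endpoints are not part of this ADR.

```python
# Deployment -> Foundry account, copied from the routing table above.
DEPLOYMENT_ACCOUNT = {
    "Kimi-K2.6": "aucert-ai",
    "DeepSeek-V3.2": "aucert-ai",
    "gpt-5.1-codex": "aucert-ai",
    "gpt-5.4": "aucert-ai",
    "gpt-5.3-codex": "aucert-ai",
    "gpt-5.4-pro": "aucert-ai-east",
}


def endpoint_for(deployment: str, endpoints: dict[str, str]) -> str:
    """Return the base URL of the account that hosts a deployment.

    `endpoints` maps account name -> base URL (caller-provided).
    """
    return endpoints[DEPLOYMENT_ACCOUNT[deployment]]
```

For example, with placeholder URLs, endpoint_for("gpt-5.4-pro", {"aucert-ai": "https://west.example", "aucert-ai-east": "https://east.example"}) returns the east URL, while every other deployment in the table resolves to the west one.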

Amendment — 2026-05-03

Decision: Replace Kimi K2.5 with Kimi K2.6 for L1 Generation. Remove the dedicated gpt-4o-mini deployment for L5 Reporting; alias L5 onto the Kimi-K2.6 deployment as a placeholder. Rename all Foundry deployments so deployment name matches the underlying model name exactly (case + dots).

Rationale:

K2.6 (Moonshot version 2026-04-20) supersedes K2.5 in capability while remaining in the same cost band. Confirmed available in westus on 2026-05-03 via az cognitiveservices model list.
gpt-4o-mini retires 2026-09-30 (per Foundry portal). Removing it now and pointing L5 at the existing K2.6 deployment is simpler than introducing a new mini-tier deployment, and the cost delta is a rounding error at MVP volume.
Deployment-name-equals-model-name (e.g., Kimi-K2.6, gpt-5.1-codex) eliminates the dash-vs-dot footgun previously called out in the foundry usage guide. Required for opencode and portal triage.

Changes made:

infra/terraform/foundation/foundry.tf: Renamed gpt_51_codex/deepseek_v32/gpt_54/gpt_53_codex name fields to match their model names exactly. Removed gpt_4o_mini resource. Replaced kimi_k25 with kimi_k26 (Kimi-K2.6, version 2026-04-20).
infra/terraform/foundation/foundry-east.tf: Renamed gpt_54_pro deployment to gpt-5.4-pro.
infra/k8s/aucert-dev/llm-config.yaml: Updated all FOUNDRY_MODEL_* to the new exact deployment names. FOUNDRY_MODEL_REPORTING now points at Kimi-K2.6 (shared with L1).
infra/migrations/internal-shared/V006__seed_model_registry.sql: Rewritten to seed the new inventory for fresh DBs.
infra/migrations/internal-shared/V013__update_foundry_model_registry.sql: Idempotent forward migration to bring already-provisioned DBs into line — renames provider_model_id values, deprecates kimi-k2-5 and gpt-4o-mini, inserts kimi-k2-6.

Revisit triggers:

L1 + L5 contention on the shared Kimi-K2.6 deployment causes 429s — split L5 onto a dedicated deployment (gpt-5.4-mini or gpt-5-nano are westus-available candidates).
Kimi K2.6 deprecated/replaced — swap and update both env-var pointers in lockstep.

Amendment — 2026-06-26: Routing splits — selection policy in Tower, resolution in the engine (SPEC-042)

Status: introduced by SPEC-042 (draft, under review). The per-task model-selection strategy of this ADR is preserved, but where it lives is split in two, and the selection unit is re-cut from layers to call-intent purposes.

Decision: Per-task model selection is split across the Tower boundary and re-cut from layer/stage names into call-intent purposes (routing equivalence classes), not roles:

Selection policy → Tower (IP, Kotlin/config): purpose × tenant-policy × context → a capability tier (model group). Purposes are call-intents — SCREEN_UNDERSTANDING, TEST_SYNTHESIS, RESULT_JUDGMENT, VERDICT_REASONING, REPORTING — shared across roles (e.g. SCREEN_UNDERSTANDING is used by exploration, test-gen, and evaluation) and no longer 1:1 with the 5 layers. Tower picks the tier, not the raw deployment.
Group resolution + execution → the engine (config-as-code): tier/group → physical deployment(s) with fallback, load-balancing, version pins, dual-account endpoint selection (FOUNDRY_ENDPOINT_REASONING_PRO etc.), and per-model API quirks (Responses-API-only, reasoning_content).

The llm-config ConfigMap routing is retired: the backend names a Purpose, Tower selects a tier, the engine resolves and executes.
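The three-step flow (backend names a Purpose, Tower selects a tier, the engine resolves a deployment) might look roughly like the sketch below. It is written in Python for illustration only, although the ADR places the Tower policy in Kotlin. The Purpose names come from this amendment; the tier names, the purpose-to-tier assignments, the fallback orderings, and both function names are hypothetical placeholders, not the SPEC-042 mapping.

```python
from enum import Enum


class Purpose(Enum):
    SCREEN_UNDERSTANDING = "SCREEN_UNDERSTANDING"
    TEST_SYNTHESIS = "TEST_SYNTHESIS"
    RESULT_JUDGMENT = "RESULT_JUDGMENT"
    VERDICT_REASONING = "VERDICT_REASONING"
    REPORTING = "REPORTING"


# Tower side (hypothetical policy): purpose -> capability tier.
DEFAULT_TIER = {
    Purpose.SCREEN_UNDERSTANDING: "vision",
    Purpose.TEST_SYNTHESIS: "vision",
    Purpose.RESULT_JUDGMENT: "reasoning",
    Purpose.VERDICT_REASONING: "reasoning-pro",
    Purpose.REPORTING: "text",
}


def select_tier(purpose, tenant_overrides=None):
    """Apply the tenant-policy overlay; Tower never names a deployment."""
    overrides = tenant_overrides or {}
    return overrides.get(purpose, DEFAULT_TIER[purpose])


# Engine side (hypothetical config): tier -> deployments in fallback order.
TIER_DEPLOYMENTS = {
    "vision": ["Kimi-K2.6"],
    "reasoning": ["gpt-5.4", "gpt-5.1-codex"],
    "reasoning-pro": ["gpt-5.4-pro", "gpt-5.4"],
    "text": ["Kimi-K2.6"],
}


def resolve(tier, unavailable=frozenset()):
    """Return the first deployment in the tier not marked unavailable."""
    for deployment in TIER_DEPLOYMENTS[tier]:
        if deployment not in unavailable:
            return deployment
    raise LookupError(f"no available deployment for tier {tier!r}")
```

Because select_tier returns a tier and not a deployment name, the engine stays free to fall back within the group: resolve("reasoning-pro", unavailable={"gpt-5.4-pro"}) yields gpt-5.4 in this sketch without any change on the Tower side.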

Rationale: the selection decision (stakes × multimodality, entitlement tier, budget, experiments) is product IP needing Kotlin's expressiveness + tests + code-versioning; the resolution + resilience is undifferentiated plumbing the engine does well. Splitting keeps each where it belongs and keeps the engine swappable. Tower picking the tier (not the deployment) preserves the engine's fallback.

What this amendment changes:

Aspect	Before	After
Routing mechanism	`llm-config` ConfigMap env vars	Selection policy in Tower (code/config) + `group → deployment + fallback` in engine config (git)
Selection unit	Layer / deployment name	`Purpose` (call-intent) → capability tier; Tower picks the tier, engine picks the deployment
Purpose ↔ layer	Implicit 1:1 with layers	Many-to-many; call-intent purposes shared across roles
Endpoint selection (dual-account)	Backend adapter override logic	Engine (holds both accounts)
Model version pinning	Implicit (deployment name)	Explicit pinned versions in engine config

Revisit triggers:

Cost-tier mapping per Purpose finalized with data — likely inverts "cheap model for evaluation" (RESULT_JUDGMENT/VERDICT_REASONING want the strongest vision models; SCREEN_UNDERSTANDING during exploration tolerates cheaper). SPEC-042 Open questions.
A purpose needs different tiers per tenant → handled by the tenant-policy overlay in Tower, not a new purpose or a role axis.
A second provider added to a group → confirm fallback ordering and version pins.
