ADR-009: LLM model selection strategy for the 5-layer testing pipeline
Context
Each layer of the Aucert pipeline has different LLM requirements:
| Layer | Requirement | Key capability |
|---|---|---|
| Generation (L1) | Design test scenarios from app screenshots | Multimodal vision, agent swarm |
| Execution (L2) | Run tests on emulators | Not LLM-driven (engine) |
| Analysis (L3) | Bulk result interpretation | High throughput, low cost |
| Decision (L4) | Pass/fail with confidence scoring | Reasoning quality |
| Reporting (L5) | Bug reports, dashboards | Cheap natural language generation |
We evaluated available "Direct from Azure" (credit-eligible) models across: benchmark performance (SWE-bench, HumanEval), cost per million tokens, multimodal capability, agentic tool-calling, and context window.
| Model | Input $/1M | Output $/1M | Vision | Reasoning | Context | Registration |
|---|---|---|---|---|---|---|
| GPT-5.1-Codex | ~$2.00 | ~$8.00 | No | High | 400K | None |
| Kimi K2.5 | ~$0.28 | ~$0.77 | Yes | Medium | 256K | None |
| DeepSeek V3.2 | ~$0.14 | ~$0.42 | No | Medium | 128K | None |
| GPT-4o-mini | $0.15 | $0.60 | Yes | Low-Med | 128K | None |
| GPT-5.2-Codex | ~$3.00 | ~$12.00 | No | Very High | 400K | Required |
| GPT-5.3-Codex | ~$5.00 | ~$20.00 | No | Highest | 400K | Required; approved 2026-04-15 |
| GPT-5.4 | ~$4.00 | ~$16.00 | No | Very High | 400K+ | Required; approved 2026-04-15 |
| GPT-5.4-Pro | ~$8.00 | ~$32.00 | No | Highest+ | 400K+ | Required; approved 2026-04-15 |
Decision
Layer-to-model mapping with phased upgrades:
| Pipeline layer | Day 1 model | Current model (as of 2026-04-20) | Why this model |
|---|---|---|---|
| Generation | Kimi K2.5 | Kimi K2.5 (unchanged) | Native multimodal vision (screenshots to test cases), Agent Swarm for parallel generation, 256K context, cheapest vision model |
| Execution | N/A | N/A | Engine-driven, not LLM |
| Analysis | DeepSeek V3.2 | DeepSeek V3.2 (unchanged) | Cheapest capable model for bulk structured analysis, good at classification/extraction |
| Decision (standard) | GPT-5.1-Codex | GPT-5.4 | Approved gated model; stronger reasoning than 5.1-Codex; incorporates Codex-class coding ability |
| Decision (complex) | N/A | GPT-5.4-Pro | Heavy multi-step reasoning; approved gated model; use sparingly (higher cost) |
| Decision (code-specific) | N/A | GPT-5.3-Codex | Pure code generation/analysis tasks; approved gated model |
| Reporting | GPT-4o-mini | GPT-4o-mini (unchanged) | Cheapest useful model, excellent at natural language generation |
Model routing is via K8s ConfigMap (llm-config) environment variables, not code changes. Upgrading the Decision layer model is a one-line ConfigMap patch.
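A minimal sketch of how a backend might resolve the deployment for each layer from those environment variables. `FOUNDRY_MODEL_REASONING`, `FOUNDRY_MODEL_REASONING_PRO`, `FOUNDRY_MODEL_CODEX`, and `FOUNDRY_MODEL_REPORTING` are named in the amendments to this ADR. `FOUNDRY_MODEL_GENERATION`, `FOUNDRY_MODEL_ANALYSIS`, the layer keys, and the function name are assumptions made for illustration.

```python
import os

# Layer -> environment variable that holds the deployment name.
# The generation and analysis variable names are assumed, not taken from the ADR.
LAYER_ENV_VARS = {
    "generation": "FOUNDRY_MODEL_GENERATION",        # assumed name
    "analysis": "FOUNDRY_MODEL_ANALYSIS",            # assumed name
    "decision": "FOUNDRY_MODEL_REASONING",
    "decision_complex": "FOUNDRY_MODEL_REASONING_PRO",
    "decision_code": "FOUNDRY_MODEL_CODEX",
    "reporting": "FOUNDRY_MODEL_REPORTING",
}


def model_for_layer(layer, env=None):
    """Return the deployment name configured for a pipeline layer."""
    env = os.environ if env is None else env
    if layer not in LAYER_ENV_VARS:
        raise ValueError(f"unknown pipeline layer: {layer!r}")
    var = LAYER_ENV_VARS[layer]
    model = env.get(var)
    if not model:
        # Fail loudly instead of silently falling back to another model.
        raise RuntimeError(f"{var} is not set; check the llm-config ConfigMap")
    return model
```

Because the lookup happens at call time, a ConfigMap change takes effect once the pods pick up the new environment, with no code change in the backend.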
Estimated monthly cost: $45-105. At this burn rate, $1,000 Founders Hub credits last 10-22 months.
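The runway figure is a plain division of the credit balance by the monthly cost bounds. A small sketch to make the arithmetic reproducible (the function name is illustrative):

```python
def runway_months(credits_usd, monthly_cost_usd):
    """Months of runway for a credit balance at a constant monthly spend."""
    if monthly_cost_usd <= 0:
        raise ValueError("monthly cost must be positive")
    return credits_usd / monthly_cost_usd


CREDITS_USD = 1000.0
LOW_COST, HIGH_COST = 45.0, 105.0

shortest = runway_months(CREDITS_USD, HIGH_COST)  # worst case: highest spend
longest = runway_months(CREDITS_USD, LOW_COST)    # best case: lowest spend
print(f"{shortest:.1f} to {longest:.1f} months")  # prints "9.5 to 22.2 months"
```

Rounded to whole months, this matches the 10-22 month range quoted above.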
Alternatives considered
| Option | Pros | Cons |
|---|---|---|
| Per-layer optimization (chosen) | Lowest cost, each layer gets best-fit model, all credit-eligible | Multiple models to monitor, different failure modes per layer |
| Single model for all layers (GPT-4.1-mini) | Simplest — one model, one cost profile | Sacrifices reasoning quality for Decision layer, misses Kimi's multimodal advantage for Generation |
| Claude Opus 4.6 for all layers | Best quality (80.8% SWE-bench) | NOT credit-eligible (marketplace item, $15/$25 per 1M tokens). Would drain credits in weeks. |
| Gemini via Vertex | Strong multimodal | Not available on Azure Foundry. Would require GCP infrastructure. |
Consequences
What becomes easier
- Cost optimization: each layer uses the most cost-effective model for its task
- Model upgrades: change one ConfigMap value, zero downtime, zero code changes
- Experimentation: deploy additional models (zero idle cost) and A/B test per layer
What becomes harder
- Debugging: different models have different failure modes and output formats
- Prompt engineering: prompts may need per-model tuning (token limits, instruction following)
- Monitoring: need per-model metrics (latency, error rate, token usage)
Risks
- Chinese model provenance: Kimi (Moonshot AI) and DeepSeek may be flagged by enterprise customers in security reviews. Mitigation: these models process app metadata, not customer PII. Can swap to GPT-4.1 variants if required by a customer's security policy.
- GPT-5.1-Codex Responses API only: This model does NOT support `/chat/completions` (`chatCompletion: false`). It requires the Responses API (`/responses`). The backend LLM adapter must handle this different API format for the Decision layer.
- Kimi K2.5 thinking model: Responses go into `reasoning_content` (not `content`). Backend must read `reasoning_content` and use higher token limits (100+). Known issues when using thinking + tool-calling simultaneously. Mitigation: monitor error rates, fall back to GPT-4.1 if needed.
- DeepSeek V3.2 context degradation: Quality degrades past 32-64K context in practice despite 128K limit. Mitigation: chunk analysis inputs to stay under 32K.
- Gated model wait: GPT-5.2/5.3-Codex require registration with 1-2 week approval. Mitigation: GPT-5.1-Codex is capable enough for Day 1; registration submitted in parallel.
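The adapter-level mitigations above (API path selection, reading `reasoning_content`, and chunking inputs under 32K) could be sketched roughly as follows. This is an illustration under assumptions, not the actual adapter: every function name is hypothetical, the message is assumed to be a plain dict with optional `content` and `reasoning_content` keys, and the token estimate is a crude 4-characters-per-token heuristic standing in for a real tokenizer.

```python
def endpoint_path(supports_chat_completions):
    """Pick the API path; models without chat completions use the Responses API."""
    return "/chat/completions" if supports_chat_completions else "/responses"


def message_text(message):
    """Return the model's text, falling back to reasoning_content when content is empty.

    Assumes `message` is a dict; thinking models may leave `content` empty.
    """
    return message.get("content") or message.get("reasoning_content") or ""


def estimate_tokens(text):
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def chunk_records(records, budget_tokens=32_000, count_tokens=estimate_tokens):
    """Group records into chunks whose estimated token total stays within the budget."""
    chunks, current, used = [], [], 0
    for record in records:
        cost = count_tokens(record)
        if cost > budget_tokens:
            raise ValueError("a single record exceeds the token budget")
        if current and used + cost > budget_tokens:
            chunks.append(current)  # close the full chunk and start a new one
            current, used = [], 0
        current.append(record)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Keeping the three quirks in small, separate helpers makes it easier to drop one when a model is swapped out, for example if the Generation layer moves to a non-thinking model.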
Amendment — 2026-04-20
Decision: Upgrade the Decision layer from GPT-5.1-Codex to GPT-5.4 (standard), and introduce two new Decision sub-roles: GPT-5.4-Pro (complex multi-step) and GPT-5.3-Codex (code-specific). All three gated models were approved 2026-04-15.
Rationale: GPT-5.4 delivers stronger reasoning than GPT-5.1-Codex and incorporates Codex-class code understanding. Splitting the Decision layer into three sub-roles enables cost-optimal routing: standard tasks use GPT-5.4, heavy reasoning uses GPT-5.4-Pro (sparingly), and code analysis uses GPT-5.3-Codex.
Changes made:
- `infra/terraform/foundation/foundry.tf`: Added `gpt_54`, `gpt_53_codex` deployments (both in westus primary account)
- `infra/terraform/foundation/foundry-east.tf` (new): Secondary Foundry account `aucert-ai-east` in eastus2 hosting `gpt_54_pro` (the only region where that model is available as of 2026-04-20)
- `infra/k8s/aucert-dev/llm-config.yaml`: `FOUNDRY_MODEL_REASONING` → `gpt-5-4`; added `FOUNDRY_MODEL_REASONING_PRO: gpt-5-4-pro`, `FOUNDRY_MODEL_CODEX: gpt-5-3-codex`, and `FOUNDRY_ENDPOINT_REASONING_PRO` pointing to the east endpoint
- `gpt-5-1-codex` deployment retained in Terraform for rollback; no longer active in ConfigMap routing
Model → Foundry account routing (current):
| Deployment | Foundry account | Region |
|---|---|---|
| Kimi-K2.6 | aucert-ai | westus |
| DeepSeek-V3.2 | aucert-ai | westus |
| gpt-5.1-codex (retired) | aucert-ai | westus |
| gpt-5.4 | aucert-ai | westus |
| gpt-5.3-codex | aucert-ai | westus |
| gpt-5.4-pro | aucert-ai-east | eastus2 |
See ADR-008 (§Amendment 2026-04-20) for the region strategy and the dual-account rationale.
Deviation from original intent: the pipeline model mapping table earlier in this ADR treats the Decision layer as three sub-roles (standard, complex, code-specific) served by three deployments. Those deployments now live in two different Foundry accounts. Backend routing selects the endpoint per model; semantically nothing changes.
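One way the per-model endpoint selection could work is to look for a role-specific endpoint variable first and fall back to a shared default. In this sketch, `FOUNDRY_ENDPOINT_REASONING_PRO` and the `FOUNDRY_MODEL_*` variables come from this ADR. The default `FOUNDRY_ENDPOINT` variable, the function name, and the example URLs are assumptions.

```python
import os

MODEL_PREFIX = "FOUNDRY_MODEL_"
ENDPOINT_PREFIX = "FOUNDRY_ENDPOINT_"
DEFAULT_ENDPOINT_VAR = "FOUNDRY_ENDPOINT"  # assumed name for the primary account


def resolve_target(model_var, env=None):
    """Return (endpoint, deployment) for a FOUNDRY_MODEL_* variable.

    A role-specific FOUNDRY_ENDPOINT_<ROLE> wins over the default endpoint,
    which is how a model hosted in the secondary account would be reached.
    """
    env = os.environ if env is None else env
    if not model_var.startswith(MODEL_PREFIX):
        raise ValueError(f"not a model variable: {model_var!r}")
    role = model_var[len(MODEL_PREFIX):]
    deployment = env.get(model_var)
    endpoint = env.get(ENDPOINT_PREFIX + role) or env.get(DEFAULT_ENDPOINT_VAR)
    if not deployment or not endpoint:
        raise RuntimeError(f"incomplete configuration for {model_var}")
    return endpoint, deployment
```

With this shape, moving a deployment between accounts only means adding or removing one endpoint variable; callers keep asking for the same role.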
Amendment — 2026-05-03
Decision: Replace Kimi K2.5 with Kimi K2.6 for L1 Generation. Remove the dedicated gpt-4o-mini deployment for L5 Reporting; alias L5 onto the Kimi-K2.6 deployment as a placeholder. Rename all Foundry deployments so deployment name matches the underlying model name exactly (case + dots).
Rationale:
- K2.6 (Moonshot version `2026-04-20`) supersedes K2.5 in capability while remaining in the same cost band. Confirmed available in westus on 2026-05-03 via `az cognitiveservices model list`.
- gpt-4o-mini retires 2026-09-30 (per Foundry portal). Removing it now and pointing L5 at the existing K2.6 deployment is simpler than introducing a new mini-tier deployment, and the cost delta is rounding error at MVP volume.
- Deployment-name-equals-model-name (e.g., `Kimi-K2.6`, `gpt-5.1-codex`) eliminates the dash-vs-dot footgun previously called out in the foundry usage guide. Required for opencode and portal triage.
Changes made:
- `infra/terraform/foundation/foundry.tf`: Renamed `gpt_51_codex` / `deepseek_v32` / `gpt_54` / `gpt_53_codex` `name` fields to match their model names exactly. Removed `gpt_4o_mini` resource. Replaced `kimi_k25` with `kimi_k26` (`Kimi-K2.6`, version `2026-04-20`).
- `infra/terraform/foundation/foundry-east.tf`: Renamed `gpt_54_pro` deployment to `gpt-5.4-pro`.
- `infra/k8s/aucert-dev/llm-config.yaml`: Updated all `FOUNDRY_MODEL_*` to the new exact deployment names. `FOUNDRY_MODEL_REPORTING` now points at `Kimi-K2.6` (shared with L1).
- `infra/migrations/internal-shared/V006__seed_model_registry.sql`: Rewritten to seed the new inventory for fresh DBs.
- `infra/migrations/internal-shared/V013__update_foundry_model_registry.sql`: Idempotent forward migration to bring already-provisioned DBs into line: renames `provider_model_id` values, deprecates `kimi-k2-5` and `gpt-4o-mini`, inserts `kimi-k2-6`.
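To illustrate what "idempotent forward migration" means here, the sketch below applies the three kinds of change (rename, deprecate, insert) against an in-memory SQLite table and can be run repeatedly without changing the result. It is not the contents of V013: the table name `model_registry`, the `status` column, and the specific rename pairs are hypothetical, and the real migration is SQL against the project's own database. Only `provider_model_id`, `kimi-k2-5`, `gpt-4o-mini`, and `kimi-k2-6` come from this ADR.

```python
import sqlite3

# Hypothetical rename pairs, for illustration only.
RENAMES = {"gpt-5-4": "gpt-5.4", "gpt-5-3-codex": "gpt-5.3-codex"}
DEPRECATED = ("kimi-k2-5", "gpt-4o-mini")
NEW_MODELS = ("kimi-k2-6",)


def create_schema(conn):
    """Hypothetical minimal registry table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_registry ("
        "provider_model_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def migrate(conn):
    """Apply renames, deprecations, and inserts; safe to run more than once."""
    with conn:  # one transaction: commit on success, roll back on error
        for old, new in RENAMES.items():
            # No-op when the old id is already gone; IGNORE skips key conflicts.
            conn.execute(
                "UPDATE OR IGNORE model_registry SET provider_model_id = ? "
                "WHERE provider_model_id = ?",
                (new, old),
            )
        conn.executemany(
            "UPDATE model_registry SET status = 'deprecated' "
            "WHERE provider_model_id = ?",
            [(model,) for model in DEPRECATED],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO model_registry (provider_model_id, status) "
            "VALUES (?, 'active')",
            [(model,) for model in NEW_MODELS],
        )


def snapshot(conn):
    """Return the registry contents in a stable order for comparison."""
    return conn.execute(
        "SELECT provider_model_id, status FROM model_registry "
        "ORDER BY provider_model_id"
    ).fetchall()
```

Each statement is written so that a second run finds nothing left to change, which is the property that lets the same migration run safely against databases in different starting states.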
Revisit triggers:
- L1 + L5 contention on the shared Kimi-K2.6 deployment causes 429s: split L5 onto a dedicated deployment (`gpt-5.4-mini` or `gpt-5-nano` are westus-available candidates).
- Kimi K2.6 deprecated/replaced: swap and update both env-var pointers in lockstep.