ADR-009: LLM model selection strategy for the 5-layer testing pipeline

Context

Each layer of the Aucert pipeline has different LLM requirements:

Layer | Requirement | Key capability
Generation (L1) | Design test scenarios from app screenshots | Multimodal vision, agent swarm
Execution (L2) | Run tests on emulators | Not LLM-driven (engine)
Analysis (L3) | Bulk result interpretation | High throughput, low cost
Decision (L4) | Pass/fail with confidence scoring | Reasoning quality
Reporting (L5) | Bug reports, dashboards | Cheap natural language generation

We evaluated available "Direct from Azure" (credit-eligible) models across: benchmark performance (SWE-bench, HumanEval), cost per million tokens, multimodal capability, agentic tool-calling, and context window.

Model | Input $/1M | Output $/1M | Vision | Reasoning | Context | Registration
GPT-5.1-Codex | ~$2.00 | ~$8.00 | No | High | 400K | None
Kimi K2.5 | ~$0.28 | ~$0.77 | Yes | Medium | 256K | None
DeepSeek V3.2 | ~$0.14 | ~$0.42 | No | Medium | 128K | None
GPT-4o-mini | $0.15 | $0.60 | Yes | Low-Med | 128K | None
GPT-5.2-Codex | ~$3.00 | ~$12.00 | No | Very High | 400K | Required
GPT-5.3-Codex | ~$5.00 | ~$20.00 | No | Highest | 400K | Required; approved 2026-04-15
GPT-5.4 | ~$4.00 | ~$16.00 | No | Very High | 400K+ | Required; approved 2026-04-15
GPT-5.4-Pro | ~$8.00 | ~$32.00 | No | Highest+ | 400K+ | Required; approved 2026-04-15

Decision

Layer-to-model mapping with phased upgrades:

Pipeline layer | Day 1 model | Current model (as of 2026-04-20) | Why this model
Generation | Kimi K2.5 | Kimi K2.5 (unchanged) | Native multimodal vision (screenshots to test cases), Agent Swarm for parallel generation, 256K context, cheapest vision model
Execution | N/A | N/A | Engine-driven, not LLM
Analysis | DeepSeek V3.2 | DeepSeek V3.2 (unchanged) | Cheapest capable model for bulk structured analysis, good at classification/extraction
Decision (standard) | GPT-5.1-Codex | GPT-5.4 | Approved gated model; stronger reasoning than 5.1-Codex; incorporates Codex-class coding ability
Decision (complex) | N/A | GPT-5.4-Pro | Heavy multi-step reasoning; approved gated model; use sparingly (higher cost)
Decision (code-specific) | N/A | GPT-5.3-Codex | Pure code generation/analysis tasks; approved gated model
Reporting | GPT-4o-mini | GPT-4o-mini (unchanged) | Cheapest useful model, excellent at natural language generation

Model routing is via K8s ConfigMap (llm-config) environment variables, not code changes. Upgrading the Decision layer model is a one-line ConfigMap patch.
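
As an illustration of what environment-driven routing could look like on the backend side, here is a minimal sketch, not the actual adapter code. FOUNDRY_MODEL_REASONING and FOUNDRY_MODEL_REPORTING are named elsewhere in this ADR; the FOUNDRY_MODEL_GENERATION and FOUNDRY_MODEL_ANALYSIS variable names, the layer keys, and the resolve_model function are assumptions made for the example.

```python
import os
from typing import Mapping, Optional

# Layer -> environment variable holding the deployment name.
# FOUNDRY_MODEL_REASONING and FOUNDRY_MODEL_REPORTING appear in this ADR;
# the other two variable names are assumptions for illustration.
LAYER_ENV_VARS = {
    "generation": "FOUNDRY_MODEL_GENERATION",  # assumed name
    "analysis": "FOUNDRY_MODEL_ANALYSIS",      # assumed name
    "decision": "FOUNDRY_MODEL_REASONING",
    "reporting": "FOUNDRY_MODEL_REPORTING",
}


def resolve_model(layer: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the deployment name configured for a pipeline layer."""
    env = os.environ if env is None else env
    if layer not in LAYER_ENV_VARS:
        # Execution (L2) is engine-driven, so it has no model to resolve.
        raise ValueError(f"layer {layer!r} is unknown or not LLM-driven")
    var = LAYER_ENV_VARS[layer]
    model = env.get(var, "").strip()
    if not model:
        raise RuntimeError(f"{var} is not set; check the llm-config ConfigMap")
    return model
```

With this shape, a ConfigMap patch that changes FOUNDRY_MODEL_REASONING changes what resolve_model("decision") returns once the pods pick up the new environment, with no code change.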

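Estimated monthly cost: $45-105. At this burn rate, $1,000 in Founders Hub credits lasts roughly 10-22 months ($1,000 / $105 ≈ 9.5 months at the high end of spend; $1,000 / $45 ≈ 22.2 months at the low end).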
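
The runway figure is plain division of the credit balance by monthly spend. A small sketch makes the bounds easy to recompute if either number changes; the function name is hypothetical.

```python
def runway_months(credits_usd: float, monthly_cost_usd: float) -> float:
    """Months of runway for a credit balance at a constant monthly spend."""
    if monthly_cost_usd <= 0:
        raise ValueError("monthly cost must be positive")
    return credits_usd / monthly_cost_usd


# Bounds from the ADR's estimate of $45-105 per month against $1,000 of credits.
shortest = runway_months(1000, 105)  # high spend gives the shortest runway
longest = runway_months(1000, 45)    # low spend gives the longest runway
```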

Alternatives considered

Option | Pros | Cons
Per-layer optimization (chosen) | Lowest cost, each layer gets best-fit model, all credit-eligible | Multiple models to monitor, different failure modes per layer
Single model for all layers (GPT-4.1-mini) | Simplest: one model, one cost profile | Sacrifices reasoning quality for Decision layer, misses Kimi's multimodal advantage for Generation
Claude Opus 4.6 for all layers | Best quality (80.8% SWE-bench) | NOT credit-eligible (marketplace item, $15/$25 per 1M tokens). Would drain credits in weeks.
Gemini via Vertex | Strong multimodal | Not available on Azure Foundry. Would require GCP infrastructure.

Consequences

What becomes easier

  • Cost optimization: each layer uses the most cost-effective model for its task
  • Model upgrades: change one ConfigMap value, zero downtime, zero code changes
  • Experimentation: deploy additional models (zero idle cost) and A/B test per layer

What becomes harder

  • Debugging: different models have different failure modes and output formats
  • Prompt engineering: prompts may need per-model tuning (token limits, instruction following)
  • Monitoring: need per-model metrics (latency, error rate, token usage)

Risks

  • Chinese model provenance: Kimi (Moonshot AI) and DeepSeek may be flagged by enterprise customers in security reviews. Mitigation: these models process app metadata, not customer PII. Can swap to GPT-4.1 variants if required by a customer's security policy.
  • GPT-5.1-Codex Responses API only: This model does NOT support /chat/completions (chatCompletion: false). It requires the Responses API (/responses). The backend LLM adapter must handle this different API format for the Decision layer.
  • Kimi K2.5 thinking model: Model output is returned in the reasoning_content field (not content). The backend must read reasoning_content and use higher token limits (100+). There are known issues when thinking and tool-calling are used simultaneously. Mitigation: monitor error rates, fall back to GPT-4.1 if needed.
  • DeepSeek V3.2 context degradation: Quality degrades past 32-64K context in practice despite 128K limit. Mitigation: chunk analysis inputs to stay under 32K.
  • Gated model wait: GPT-5.2/5.3-Codex require registration with 1-2 week approval. Mitigation: GPT-5.1-Codex is capable enough for Day 1; registration submitted in parallel.
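
Three of the risks above translate into small adapter-level rules: pick the API route per model, read reasoning_content when content is empty, and chunk analysis inputs under a token budget. The sketch below shows one way these could look. All function names are hypothetical, the set of Responses-only deployments contains only the model this ADR confirms, and the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from typing import Any, Dict, List

# Deployments treated as Responses-API-only. This ADR confirms it for
# GPT-5.1-Codex only; extend the set after checking each model.
RESPONSES_ONLY = {"gpt-5.1-codex"}


def api_path(deployment: str) -> str:
    """Pick the API route for a deployment name."""
    if deployment.lower() in RESPONSES_ONLY:
        return "/responses"
    return "/chat/completions"


def extract_text(message: Dict[str, Any]) -> str:
    """Return content if present, otherwise fall back to reasoning_content."""
    content = message.get("content") or ""
    if content.strip():
        return content
    return message.get("reasoning_content") or ""


def chunk_by_budget(
    items: List[str], max_tokens: int = 32_000, chars_per_token: int = 4
) -> List[List[str]]:
    """Greedily group items so each chunk stays under an estimated budget."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        cost = len(item) // chars_per_token + 1  # crude token estimate (assumption)
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Note that a single item larger than the budget still ends up in its own chunk; splitting inside an item would need a real tokenizer and is left out of this sketch.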

Amendment — 2026-04-20

Decision: Upgrade the Decision layer from GPT-5.1-Codex to GPT-5.4 (standard), and introduce two new Decision sub-roles: GPT-5.4-Pro (complex multi-step) and GPT-5.3-Codex (code-specific). All three gated models were approved 2026-04-15.

Rationale: GPT-5.4 delivers stronger reasoning than GPT-5.1-Codex and incorporates Codex-class code understanding. Splitting the Decision layer into three sub-roles enables cost-optimal routing: standard tasks use GPT-5.4, heavy reasoning uses GPT-5.4-Pro (sparingly), and code analysis uses GPT-5.3-Codex.

Changes made:

  • infra/terraform/foundation/foundry.tf: Added gpt_54, gpt_53_codex deployments (both in westus primary account)
  • infra/terraform/foundation/foundry-east.tf (new): Secondary Foundry account aucert-ai-east in eastus2 hosting gpt_54_pro (the only region where that model is available as of 2026-04-20)
  • infra/k8s/aucert-dev/llm-config.yaml: FOUNDRY_MODEL_REASONING set to gpt-5-4; added FOUNDRY_MODEL_REASONING_PRO: gpt-5-4-pro, FOUNDRY_MODEL_CODEX: gpt-5-3-codex, and FOUNDRY_ENDPOINT_REASONING_PRO pointing to the east endpoint
  • gpt-5-1-codex deployment retained in Terraform for rollback; no longer active in ConfigMap routing

Model → Foundry account routing (current):

Deployment | Foundry account | Region
Kimi-K2.6 | aucert-ai | westus
DeepSeek-V3.2 | aucert-ai | westus
gpt-5.1-codex (retired) | aucert-ai | westus
gpt-5.4 | aucert-ai | westus
gpt-5.3-codex | aucert-ai | westus
gpt-5.4-pro | aucert-ai-east | eastus2

See ADR-008 (§Amendment 2026-04-20) for the region strategy and the dual-account rationale.

Deviation from original intent: the pipeline model mapping table earlier in this ADR treats the Decision layer as three sub-roles (standard, complex, code-specific) served by three deployments. Those deployments now live in two different Foundry accounts. Backend routing selects the endpoint per model; semantically nothing changes.
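
One way to express per-model endpoint selection is a role-specific override with a shared default. The sketch below assumes that shape. FOUNDRY_ENDPOINT_REASONING_PRO is the variable named in this amendment; the FOUNDRY_ENDPOINT default variable, the role keys, and the function name are hypothetical, and the URLs in the usage note are placeholders.

```python
from typing import Mapping


def resolve_endpoint(role: str, env: Mapping[str, str]) -> str:
    """Return the endpoint for a role, preferring a role-specific override."""
    override = env.get(f"FOUNDRY_ENDPOINT_{role.upper()}", "").strip()
    if override:
        return override
    # Assumed name for the primary (westus) account's endpoint variable.
    default = env.get("FOUNDRY_ENDPOINT", "").strip()
    if not default:
        raise RuntimeError("no Foundry endpoint configured")
    return default
```

For example, with FOUNDRY_ENDPOINT set to a placeholder such as https://west.example.invalid and FOUNDRY_ENDPOINT_REASONING_PRO set to https://east.example.invalid, the role "reasoning_pro" resolves to the east URL and every other role resolves to the west URL.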

Amendment — 2026-05-03

Decision: Replace Kimi K2.5 with Kimi K2.6 for L1 Generation. Remove the dedicated gpt-4o-mini deployment for L5 Reporting; alias L5 onto the Kimi-K2.6 deployment as a placeholder. Rename all Foundry deployments so deployment name matches the underlying model name exactly (case + dots).

Rationale:

  • K2.6 (Moonshot version 2026-04-20) supersedes K2.5 in capability while remaining in the same cost band. Confirmed available in westus on 2026-05-03 via az cognitiveservices model list.
  • gpt-4o-mini retires 2026-09-30 (per Foundry portal). Removing it now and pointing L5 at the existing K2.6 deployment is simpler than introducing a new mini-tier deployment, and the cost delta is a rounding error at MVP volume.
  • Deployment-name-equals-model-name (e.g., Kimi-K2.6, gpt-5.1-codex) eliminates the dash-vs-dot footgun previously called out in the foundry usage guide. Required for opencode and portal triage.

Changes made:

  • infra/terraform/foundation/foundry.tf: Renamed gpt_51_codex/deepseek_v32/gpt_54/gpt_53_codex name fields to match their model names exactly. Removed gpt_4o_mini resource. Replaced kimi_k25 with kimi_k26 (Kimi-K2.6, version 2026-04-20).
  • infra/terraform/foundation/foundry-east.tf: Renamed gpt_54_pro deployment to gpt-5.4-pro.
  • infra/k8s/aucert-dev/llm-config.yaml: Updated all FOUNDRY_MODEL_* to the new exact deployment names. FOUNDRY_MODEL_REPORTING now points at Kimi-K2.6 (shared with L1).
  • infra/migrations/internal-shared/V006__seed_model_registry.sql: Rewritten to seed the new inventory for fresh DBs.
  • infra/migrations/internal-shared/V013__update_foundry_model_registry.sql: Idempotent forward migration to bring already-provisioned DBs into line — renames provider_model_id values, deprecates kimi-k2-5 and gpt-4o-mini, inserts kimi-k2-6.
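
To illustrate what "idempotent" means for the forward migration, the sketch below runs a deprecate-then-insert step twice against an in-memory SQLite database and leaves the same rows both times. It is not the contents of V013: the model_registry table name, the status column and its values, and the helper names are assumptions, SQLite stands in for the real database, and the rename step is omitted because the old-to-new ID mapping is not given here. Only provider_model_id and the three model IDs come from this ADR.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory registry seeded with two pre-migration models."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE model_registry ("
        "provider_model_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO model_registry VALUES (?, ?)",
        [("kimi-k2-5", "active"), ("gpt-4o-mini", "active")],
    )
    conn.commit()
    return conn


def apply_migration(conn: sqlite3.Connection) -> None:
    """Deprecate retired models and add the new one; safe to run repeatedly."""
    # Re-running this UPDATE writes the same value again, so it is harmless.
    conn.execute(
        "UPDATE model_registry SET status = 'deprecated' "
        "WHERE provider_model_id IN ('kimi-k2-5', 'gpt-4o-mini')"
    )
    # Insert only when the row is missing, so a second run adds nothing.
    conn.execute(
        "INSERT INTO model_registry (provider_model_id, status) "
        "SELECT 'kimi-k2-6', 'active' "
        "WHERE NOT EXISTS (SELECT 1 FROM model_registry "
        "WHERE provider_model_id = 'kimi-k2-6')"
    )
    conn.commit()


def snapshot(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return all registry rows in a stable order for comparison."""
    return conn.execute(
        "SELECT provider_model_id, status FROM model_registry "
        "ORDER BY provider_model_id"
    ).fetchall()
```

Applying apply_migration a second time and comparing snapshot results before and after is a cheap check that a migration of this shape is safe on already-provisioned databases.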

Revisit triggers:

  • L1 + L5 contention on the shared Kimi-K2.6 deployment causes 429s — split L5 onto a dedicated deployment (gpt-5.4-mini or gpt-5-nano are westus-available candidates).
  • Kimi K2.6 deprecated/replaced — swap and update both env-var pointers in lockstep.
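
Both triggers can be checked mechanically. The sketch below is one possible shape, under stated assumptions: the 5% threshold for 429 responses is a placeholder and not a value from this ADR, FOUNDRY_MODEL_GENERATION is an assumed variable name (FOUNDRY_MODEL_REPORTING is named above), and both function names are hypothetical.

```python
from typing import Iterable, Mapping


def should_split_reporting(status_codes: Iterable[int], threshold: float = 0.05) -> bool:
    """Flag the first revisit trigger when the share of 429s exceeds a threshold."""
    codes = list(status_codes)
    if not codes:
        return False
    rate = sum(1 for code in codes if code == 429) / len(codes)
    return rate > threshold  # threshold is a placeholder assumption


def shared_deployment(env: Mapping[str, str]) -> str:
    """Return the deployment shared by L1 and L5, or raise if they diverge."""
    generation = env.get("FOUNDRY_MODEL_GENERATION", "")  # assumed variable name
    reporting = env.get("FOUNDRY_MODEL_REPORTING", "")
    if not generation or generation != reporting:
        raise ValueError(
            f"L1 and L5 are out of lockstep: {generation!r} vs {reporting!r}"
        )
    return generation
```

The lockstep check only makes sense while L5 is aliased onto the L1 deployment; once L5 moves to a dedicated deployment it should be dropped.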