Model orchestration
Aucert uses a tiered model strategy to balance quality and cost across the 5-layer pipeline. The layers have different requirements (generation needs large context and vision, analysis needs cheap bulk processing, decision needs strong reasoning), so no single model is optimal for every layer.
Model tiers
Aucert separates AI workloads into three cost tiers, each serving a different purpose:
| Tier | Description | Cost/1M tokens | Use case | Provider |
|---|---|---|---|---|
| Tier S | Self-hosted fine-tuned models | $0.01–0.05 | High-volume, domain-specific tasks | Future (Month 12+) |
| Tier Foundry | Azure AI Foundry (serverless) | $0.14–8.00 | Product inference — per-layer optimization | Azure AI Foundry |
| Tier API | Direct API (Claude, GPT) | ~$3.00–15.00 | Coding agents, not product inference | Anthropic / Bedrock |
Why separate tiers? Product inference (testing customer apps) and developer tooling (coding agents) have different cost profiles, latency requirements, and billing streams. Keeping them separate means Founders Hub credits cover product inference while coding agents bill to a separate Anthropic/AWS account.
Current model assignments
Phase 1 uses static per-layer model assignment via Kubernetes ConfigMap (ADR-009):
| Pipeline layer | Model | Deployment | Cost/1M input | Why this model |
|---|---|---|---|---|
| L1 Generation | Kimi K2.6 | Kimi-K2.6 | ~$0.28 | Multimodal thinking model with 128K context window. Handles the full KG context snapshot plus UI screenshots for scenario design, and is optimized for agent workflows. |
| L2 Execution | N/A | N/A | N/A | Engine-driven (ADB commands), not LLM-powered |
| L3 Analysis | DeepSeek V3.2 | DeepSeek-V3.2 | ~$0.14 | Cheapest capable model for bulk visual analysis. L3 processes every screenshot in every test run — per-token cost dominates. Sufficient visual reasoning quality. |
| L4 Decision | GPT-5.4 | gpt-5.4 | ~$4.00 | Best reasoning quality for the Verification Cascade. Only invoked for ambiguous results (Stages 3–4), so high per-token cost is acceptable due to low volume. |
| L5 Reporting | Kimi K2.6 (shared) | Kimi-K2.6 | ~$0.28 | Placeholder — shares L1 deployment. Reporting is pure NLG (bug summaries, severity classification); no need for a separate model until L1+L5 contention emerges. |
Model capability comparison
| Capability | Kimi K2.6 | DeepSeek V3.2 | GPT-5.4 |
|---|---|---|---|
| Context window | 128K | 128K | 256K |
| Vision (multimodal) | Yes | Yes | Yes |
| Reasoning depth | Medium (thinking model) | Low-medium | High |
| Structured output | Good | Good | Excellent |
| Speed (tokens/sec) | Medium | Fast | Medium |
| Cost tier | Low | Low | Medium-high |
Configuration
Model routing is controlled entirely by the llm-config ConfigMap — no code changes needed to switch models:
# k8s/aucert-dev/llm-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
  namespace: aucert-dev
data:
  LLM_PROVIDER: "azure-foundry"
  FOUNDRY_ENDPOINT: "https://aucert-ai.cognitiveservices.azure.com/"
  FOUNDRY_MODEL_GENERATION: "Kimi-K2.6"   # L1
  FOUNDRY_MODEL_ANALYSIS: "DeepSeek-V3.2" # L3
  FOUNDRY_MODEL_REASONING: "gpt-5.4"      # L4
  FOUNDRY_MODEL_REPORTING: "Kimi-K2.6"    # L5, shares L1 deployment
Switching a model requires two steps:
- Deploy the new model in Foundry (foundry.tf → terraform apply)
- Update the ConfigMap deployment name → restart pod
The backend reads these values at startup through the LLMConfig class, which maps each pipeline layer to its Foundry deployment name.
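The startup mapping described above could look roughly like the following. This is a minimal sketch, not the actual LLMConfig implementation: it assumes the backend is written in Python and that the ConfigMap keys are exposed to the pod as environment variables. The method names `from_env` and `deployment_for` and the `LAYER_ENV_VARS` table are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# One env var per LLM-backed layer, using the key names from the llm-config
# ConfigMap. L2 is engine-driven (ADB commands), so it has no entry.
LAYER_ENV_VARS = {
    "L1": "FOUNDRY_MODEL_GENERATION",
    "L3": "FOUNDRY_MODEL_ANALYSIS",
    "L4": "FOUNDRY_MODEL_REASONING",
    "L5": "FOUNDRY_MODEL_REPORTING",
}


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical shape: maps each pipeline layer to a Foundry deployment name."""

    provider: str
    endpoint: str
    deployments: Mapping[str, str]

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "LLMConfig":
        # Assumption: ConfigMap keys arrive as environment variables.
        env = os.environ if env is None else env
        required = ["LLM_PROVIDER", "FOUNDRY_ENDPOINT", *LAYER_ENV_VARS.values()]
        missing = [key for key in required if not env.get(key)]
        if missing:
            # Fail at startup rather than on the first model call.
            raise RuntimeError(f"missing llm-config keys: {', '.join(missing)}")
        return cls(
            provider=env["LLM_PROVIDER"],
            endpoint=env["FOUNDRY_ENDPOINT"],
            deployments={layer: env[var] for layer, var in LAYER_ENV_VARS.items()},
        )

    def deployment_for(self, layer: str) -> str:
        try:
            return self.deployments[layer]
        except KeyError:
            raise ValueError(f"layer {layer!r} has no LLM deployment") from None
```

Because the values are read once at startup in this sketch, a ConfigMap change only takes effect after the pod restarts, which matches the second switching step above.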
Cost analysis
Per-test-run cost estimate
A typical test run generates ~20 test scenarios, each with ~5 steps:
| Layer | Invocations/run | Avg tokens/call | Cost/run |
|---|---|---|---|
| L1 Generation | 1 (full batch) | ~8,000 in + ~4,000 out | $0.002 |
| L3 Analysis | ~100 (20 scenarios × 5 screenshots) | ~2,000 in + ~500 out | $0.035 |
| L4 Decision | ~20 (one per scenario, Stage 1 only) | ~1,000 in + ~200 out | $0.010 |
| L5 Reporting | 1 (summary) | ~3,000 in + ~1,500 out | $0.001 |
| Total | | | ~$0.048/run |
Monthly cost projections
| Scale | Runs/day | Monthly cost | Notes |
|---|---|---|---|
| Development | 5–10 | $7–15 | Current stage |
| Early customers | 30–50 | $45–75 | Target for Month 3–6 |
| Growth | 100–200 | $150–300 | Revisit self-hosting at this point |
| Scale (Tier S needed) | 500+ | $750+ | Self-hosted models become cheaper |
All costs are covered by Founders Hub credits ($1,000) for the first 10–22 months at development scale.
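The arithmetic behind the two cost tables can be reproduced with a short sketch. The per-layer figures are copied from the per-test-run table. The 30-day month and the rounding of the per-run total to about $0.05 are assumptions of this sketch: they are not stated in the tables, but they match the monthly ranges shown.

```python
# Per-layer cost per run in USD, copied from the per-test-run table.
COST_PER_RUN = {"L1": 0.002, "L3": 0.035, "L4": 0.010, "L5": 0.001}

# Assumptions (not stated in the source): a 30-day month, and a per-run
# cost rounded up to $0.05 for the monthly projections.
DAYS_PER_MONTH = 30
ROUNDED_COST_PER_RUN = 0.05


def cost_per_run() -> float:
    """Sum the per-layer costs for a single test run."""
    return round(sum(COST_PER_RUN.values()), 3)


def monthly_cost(runs_per_day: int, per_run: float = ROUNDED_COST_PER_RUN) -> float:
    """Project the monthly LLM spend for a given daily run count."""
    return round(runs_per_day * DAYS_PER_MONTH * per_run, 2)


if __name__ == "__main__":
    print(f"per run: ${cost_per_run()}")
    for label, low, high in [
        ("Development", 5, 10),
        ("Early customers", 30, 50),
        ("Growth", 100, 200),
    ]:
        print(f"{label}: ${monthly_cost(low)} to ${monthly_cost(high)} per month")
```

Passing `per_run=cost_per_run()` instead of the rounded default gives slightly lower figures than the table.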
LLM API access
| Use case | Provider | Status | Notes |
|---|---|---|---|
| Product inference | Azure AI Foundry (aucert-ai) | Active — 4 models deployed | Founders Hub credits |
| Coding agents | Anthropic Direct API | Active (fallback) | Separate billing |
| Coding agents | AWS Bedrock | Pending Activate credits | Will be primary for coding agents |
Planned: dynamic routing (Phase 2+)
Dynamic routing is designed but not built. Phase 1 uses static ConfigMap routing. This section describes the Phase 2 architecture.
Phase 2 introduces a RouterPolicyEngine that dynamically selects the optimal model per request based on task characteristics:
Routing dimensions
| Dimension | Signal | Effect |
|---|---|---|
| Task complexity | Prompt length, required reasoning steps, prior-stage confidence | Simple tasks → cheapest model; complex tasks → reasoning model |
| Required capabilities | Image attachments, structured output schema, function calling | Routes to models with the needed capability |
| Cost budget | Per-customer monthly budget, current spend | Downgrades to cheaper model when approaching budget limit |
| Latency SLA | Customer tier, interactive vs batch | Fast models for interactive; slow-but-better for batch |
| Quality feedback | Historical accuracy for this task type + model | Learns which models perform best for specific task patterns |
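Since dynamic routing is not built yet, the following is only an illustrative sketch of how three of the dimensions above (required capabilities, task complexity, cost budget) might combine. Everything in it is hypothetical: the class shapes, the 1-to-3 complexity and reasoning scales, and the 0.9 budget threshold are assumptions, and latency SLA and quality feedback are left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ModelProfile:
    deployment: str
    cost_per_1m_input: float  # USD, from the model assignment table
    capabilities: FrozenSet[str]
    reasoning: int  # hypothetical scale: 1 (low) to 3 (high)


@dataclass(frozen=True)
class RoutingRequest:
    required_capabilities: FrozenSet[str]
    complexity: int  # hypothetical scale: 1 (simple) to 3 (complex)
    budget_used_fraction: float  # current spend / monthly budget


class RouterPolicyEngine:
    """Hypothetical sketch; the Phase 2 design may differ."""

    def __init__(self, models: List[ModelProfile], budget_threshold: float = 0.9):
        self.models = models
        self.budget_threshold = budget_threshold

    def select(self, request: RoutingRequest) -> ModelProfile:
        # 1. Required capabilities: drop models that lack any of them.
        capable = [
            m for m in self.models
            if request.required_capabilities <= m.capabilities
        ]
        if not capable:
            raise LookupError("no model offers the required capabilities")
        cheapest_first = sorted(capable, key=lambda m: m.cost_per_1m_input)
        # 2. Cost budget: near the limit, downgrade to the cheapest capable model.
        if request.budget_used_fraction >= self.budget_threshold:
            return cheapest_first[0]
        # 3. Task complexity: cheapest model whose reasoning level suffices.
        strong_enough = [m for m in cheapest_first if m.reasoning >= request.complexity]
        if strong_enough:
            return strong_enough[0]
        # Fall back to the strongest capable model if none meets the bar.
        return max(capable, key=lambda m: m.reasoning)


# Profiles loosely based on the tables above; reasoning levels are assumed.
MODELS = [
    ModelProfile("Kimi-K2.6", 0.28, frozenset({"vision", "structured_output"}), 2),
    ModelProfile("DeepSeek-V3.2", 0.14, frozenset({"vision", "structured_output"}), 1),
    ModelProfile("gpt-5.4", 4.00, frozenset({"vision", "structured_output"}), 3),
]
```

With these profiles, a simple vision task routes to the cheapest deployment, a complex one routes to the reasoning model, and a complex task from a customer close to their budget is downgraded to the cheapest capable model.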