
Model orchestration

Aucert uses a tiered model strategy to balance quality and cost across the 5-layer pipeline. Each pipeline layer has different requirements — generation needs large context and vision, analysis needs cheap bulk processing, decision needs strong reasoning — so a single model cannot optimize all layers.

Model tiers

Aucert separates AI workloads into three cost tiers, each serving a different purpose:

| Tier | Description | Cost/1M tokens | Use case | Provider |
| --- | --- | --- | --- | --- |
| Tier S | Self-hosted fine-tuned models | $0.01–0.05 | High-volume, domain-specific tasks | Future (Month 12+) |
| Tier Foundry | Azure AI Foundry (serverless) | $0.14–8.00 | Product inference (per-layer optimization) | Azure AI Foundry |
| Tier API | Direct API (Claude, GPT) | ~$3.00–15.00 | Coding agents, not product inference | Anthropic / Bedrock |
Info: Why separate tiers? Product inference (testing customer apps) and developer tooling (coding agents) have different cost profiles, latency requirements, and billing streams. Keeping them separate means Founders Hub credits cover product inference while coding agents bill to a separate Anthropic/AWS account.

Current model assignments

Phase 1 uses static per-layer model assignment via Kubernetes ConfigMap (ADR-009):

| Pipeline layer | Model | Deployment | Cost/1M input | Why this model |
| --- | --- | --- | --- | --- |
| L1 Generation | Kimi K2.6 | Kimi-K2.6 | ~$0.28 | Multimodal thinking model with 128K context window. Handles full KG context snapshot + UI screenshots for scenario design. Agent workflow optimization. |
| L2 Execution | N/A | N/A | N/A | Engine-driven (ADB commands), not LLM-powered |
| L3 Analysis | DeepSeek V3.2 | DeepSeek-V3.2 | ~$0.14 | Cheapest capable model for bulk visual analysis. L3 processes every screenshot in every test run, so per-token cost dominates. Sufficient visual reasoning quality. |
| L4 Decision | GPT-5.4 | gpt-5.4 | ~$4.00 | Best reasoning quality for the Verification Cascade. Only invoked for ambiguous results (Stages 3–4), so high per-token cost is acceptable due to low volume. |
| L5 Reporting | Kimi K2.6 (shared) | Kimi-K2.6 | ~$0.28 | Placeholder: shares L1 deployment. Reporting is pure NLG (bug summaries, severity classification); no need for a separate model until L1+L5 contention emerges. |

Model capability comparison

| Capability | Kimi K2.6 | DeepSeek V3.2 | GPT-5.4 |
| --- | --- | --- | --- |
| Context window | 128K | 128K | 256K |
| Vision (multimodal) | Yes | Yes | Yes |
| Reasoning depth | Medium (thinking model) | Low-medium | High |
| Structured output | Good | Good | Excellent |
| Speed (tokens/sec) | Medium | Fast | Medium |
| Cost tier | Low | Low | Medium-high |

Configuration

Model routing is controlled entirely by the llm-config ConfigMap — no code changes needed to switch models:

# k8s/aucert-dev/llm-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config
  namespace: aucert-dev
data:
  LLM_PROVIDER: "azure-foundry"
  FOUNDRY_ENDPOINT: "https://aucert-ai.cognitiveservices.azure.com/"
  FOUNDRY_MODEL_GENERATION: "Kimi-K2.6"     # L1
  FOUNDRY_MODEL_ANALYSIS: "DeepSeek-V3.2"   # L3
  FOUNDRY_MODEL_REASONING: "gpt-5.4"        # L4
  FOUNDRY_MODEL_REPORTING: "Kimi-K2.6"      # L5, shares L1 deployment

Switching a model requires two steps:

  1. Deploy the new model in Foundry (foundry.tf → terraform apply)
  2. Update the ConfigMap deployment name → restart pod

The backend reads these values at startup through the LLMConfig class, which maps each pipeline layer to its Foundry deployment name.
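The LLMConfig class itself is not shown on this page. As a rough illustration only, a startup-time loader might look like the sketch below. It assumes the ConfigMap keys are exposed to the pod as environment variables; the `PipelineLayer` enum values, the `deploymentFor` method, and the `requireKey` helper are hypothetical names, not the actual backend API.

```kotlin
// Hypothetical sketch: the real LLMConfig is not shown in this document.
enum class PipelineLayer { L1_GENERATION, L2_EXECUTION, L3_ANALYSIS, L4_DECISION, L5_REPORTING }

// Fails fast at startup if a key is missing or blank.
private fun Map<String, String>.requireKey(key: String): String =
    this[key]?.takeIf { it.isNotBlank() }
        ?: error("Missing required LLM config key: $key")

class LLMConfig(env: Map<String, String>) {
    val provider: String = env.requireKey("LLM_PROVIDER")
    val endpoint: String = env.requireKey("FOUNDRY_ENDPOINT")

    // L2 is engine-driven (ADB commands), so it has no deployment entry.
    private val deployments: Map<PipelineLayer, String> = mapOf(
        PipelineLayer.L1_GENERATION to env.requireKey("FOUNDRY_MODEL_GENERATION"),
        PipelineLayer.L3_ANALYSIS to env.requireKey("FOUNDRY_MODEL_ANALYSIS"),
        PipelineLayer.L4_DECISION to env.requireKey("FOUNDRY_MODEL_REASONING"),
        PipelineLayer.L5_REPORTING to env.requireKey("FOUNDRY_MODEL_REPORTING"),
    )

    /** Returns the Foundry deployment name for a layer, or null for non-LLM layers. */
    fun deploymentFor(layer: PipelineLayer): String? = deployments[layer]
}

fun main() {
    // In a pod this would typically be LLMConfig(System.getenv()) (assumption).
    val config = LLMConfig(
        mapOf(
            "LLM_PROVIDER" to "azure-foundry",
            "FOUNDRY_ENDPOINT" to "https://aucert-ai.cognitiveservices.azure.com/",
            "FOUNDRY_MODEL_GENERATION" to "Kimi-K2.6",
            "FOUNDRY_MODEL_ANALYSIS" to "DeepSeek-V3.2",
            "FOUNDRY_MODEL_REASONING" to "gpt-5.4",
            "FOUNDRY_MODEL_REPORTING" to "Kimi-K2.6",
        )
    )
    println(config.deploymentFor(PipelineLayer.L3_ANALYSIS)) // DeepSeek-V3.2
}
```

Because the values are read once in the constructor, a ConfigMap change only takes effect after the pod restarts, which matches the two-step switch described above.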

Cost analysis

Per-test-run cost estimate

A typical test run generates ~20 test scenarios, each with ~5 steps:

| Layer | Invocations/run | Avg tokens/call | Cost/run |
| --- | --- | --- | --- |
| L1 Generation | 1 (full batch) | ~8,000 in + ~4,000 out | $0.002 |
| L3 Analysis | ~100 (20 scenarios × 5 screenshots) | ~2,000 in + ~500 out | $0.035 |
| L4 Decision | ~20 (one per scenario, Stage 1 only) | ~1,000 in + ~200 out | $0.010 |
| L5 Reporting | 1 (summary) | ~3,000 in + ~1,500 out | $0.001 |
| Total | | | ~$0.048/run |
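A per-layer figure of this kind is invocations × tokens per call × price per token. The sketch below reproduces the L3 Analysis row under one simplifying assumption: the listed input price (~$0.14/1M) is applied to output tokens as well, since output pricing is not given on this page. The `LayerUsage` type and `costPerRun` function are illustrative names only.

```kotlin
import java.util.Locale

// Illustrative only: usage figures for one pipeline layer in one test run.
data class LayerUsage(
    val invocations: Int,
    val inputTokensPerCall: Int,
    val outputTokensPerCall: Int,
    val usdPerMillionTokens: Double, // assumption: same rate for input and output
)

// Cost in USD = calls × (input + output tokens) × price per token.
fun costPerRun(u: LayerUsage): Double =
    u.invocations.toLong() * (u.inputTokensPerCall + u.outputTokensPerCall) *
        u.usdPerMillionTokens / 1_000_000.0

fun main() {
    // L3 Analysis: ~100 calls, ~2,000 in + ~500 out, ~$0.14 per 1M tokens.
    val l3 = LayerUsage(100, 2_000, 500, 0.14)
    println("%.3f".format(Locale.US, costPerRun(l3))) // 0.035
}
```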

Monthly cost projections

| Scale | Runs/day | Monthly cost | Notes |
| --- | --- | --- | --- |
| Development | 5–10 | $7–15 | Current stage |
| Early customers | 30–50 | $45–75 | Target for Month 3–6 |
| Growth | 100–200 | $150–300 | Revisit self-hosting at this point |
| Scale (Tier S needed) | 500+ | $750+ | Self-hosted models become cheaper |

All costs are covered by Founders Hub credits ($1,000) for the first 10–22 months at development scale.
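The monthly figures in the projection table are consistent with roughly $0.05 per run (the ~$0.048 estimate rounded up) over a 30-day month. Both of those inputs are inferred from the table rather than stated in it, so treat the sketch below as a way to re-derive the projections under those assumptions; `monthlyCost` is a hypothetical helper.

```kotlin
import java.util.Locale

// Assumptions inferred from the table: ~$0.05 per run and a 30-day month.
fun monthlyCost(runsPerDay: Int, usdPerRun: Double = 0.05, daysPerMonth: Int = 30): Double =
    runsPerDay * usdPerRun * daysPerMonth

fun main() {
    // Lower bounds of the table rows: development, early customers, growth, scale.
    for (runs in listOf(5, 30, 100, 500)) {
        println("$runs runs/day -> $" + "%.2f".format(Locale.US, monthlyCost(runs)))
    }
}
```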

LLM API access

| Use case | Provider | Status | Notes |
| --- | --- | --- | --- |
| Product inference | Azure AI Foundry (aucert-ai) | Active (4 models deployed) | Founders Hub credits |
| Coding agents | Anthropic Direct API | Active (fallback) | Separate billing |
| Coding agents | AWS Bedrock | Pending AWS Activate credits | Will be primary for coding agents |

Planned: dynamic routing (Phase 2+)

Warning: Dynamic routing is designed but not built. Phase 1 uses static ConfigMap routing. This section describes the Phase 2 architecture.

Phase 2 introduces a RouterPolicyEngine that dynamically selects the optimal model per request based on task characteristics:

Routing dimensions

| Dimension | Signal | Effect |
| --- | --- | --- |
| Task complexity | Prompt length, required reasoning steps, prior-stage confidence | Simple tasks → cheapest model; complex tasks → reasoning model |
| Required capabilities | Image attachments, structured output schema, function calling | Routes to models with the needed capability |
| Cost budget | Per-customer monthly budget, current spend | Downgrades to cheaper model when approaching budget limit |
| Latency SLA | Customer tier, interactive vs batch | Fast models for interactive; slow-but-better for batch |
| Quality feedback | Historical accuracy for this task type + model | Learns which models perform best for specific task patterns |
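Since dynamic routing is not built yet, the following is only a rule-based sketch of how three of these dimensions (required capabilities, cost budget, task complexity) could combine. The `Candidate` and `Request` types, the `route` function, and the 90% budget threshold are all hypothetical. Latency SLA and quality feedback are left out to keep it short.

```kotlin
enum class Complexity { SIMPLE, MODERATE, COMPLEX }

// Hypothetical description of a deployable model.
data class Candidate(
    val deployment: String,
    val supportsVision: Boolean,
    val strongReasoning: Boolean,
    val usdPerMillionInput: Double,
)

// Hypothetical per-request signals.
data class Request(
    val complexity: Complexity,
    val requiresVision: Boolean,
    val budgetUsedFraction: Double, // current spend / monthly budget
)

fun route(request: Request, candidates: List<Candidate>, budgetWarnAt: Double = 0.9): Candidate {
    // 1. Required capabilities: drop models that cannot serve the request.
    val capable = candidates.filter { !request.requiresVision || it.supportsVision }
    val cheapest = capable.minByOrNull { it.usdPerMillionInput }
        ?: error("No candidate supports the required capabilities")

    // 2. Cost budget: near the limit, downgrade to the cheapest capable model.
    if (request.budgetUsedFraction >= budgetWarnAt) return cheapest

    // 3. Task complexity: complex tasks prefer a reasoning model if one exists.
    if (request.complexity == Complexity.COMPLEX) {
        return capable.filter { it.strongReasoning }
            .minByOrNull { it.usdPerMillionInput } ?: cheapest
    }
    return cheapest
}

fun main() {
    val models = listOf(
        Candidate("Kimi-K2.6", supportsVision = true, strongReasoning = false, usdPerMillionInput = 0.28),
        Candidate("DeepSeek-V3.2", supportsVision = true, strongReasoning = false, usdPerMillionInput = 0.14),
        Candidate("gpt-5.4", supportsVision = true, strongReasoning = true, usdPerMillionInput = 4.00),
    )
    println(route(Request(Complexity.COMPLEX, true, 0.20), models).deployment) // gpt-5.4
    println(route(Request(Complexity.COMPLEX, true, 0.95), models).deployment) // DeepSeek-V3.2
}
```

The rule order is a design choice: capability filtering must come first because it is a hard constraint, while the budget check is placed ahead of complexity so that a customer near their limit is never upgraded to an expensive model.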

RouterPolicyEngine interface

import java.math.BigDecimal

interface RouterPolicyEngine {
    suspend fun selectModel(
        layer: PipelineLayer,
        taskProfile: TaskProfile,
        customerBudget: BudgetConstraints
    ): ModelSelection
}

data class TaskProfile(
    val complexity: Complexity,       // SIMPLE, MODERATE, COMPLEX
    val requiresVision: Boolean,
    val requiresReasoning: Boolean,
    val estimatedInputTokens: Int,
    val estimatedOutputTokens: Int
)

data class ModelSelection(
    val deployment: String,           // Foundry deployment name
    val tier: ModelTier,              // S, FOUNDRY, API
    val estimatedCost: BigDecimal,
    val fallback: String?             // Fallback deployment if primary fails
)
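How a caller would consume the `fallback` field is not specified here. One plausible reading, sketched below, is a single retry against the fallback deployment when the primary call throws. The `callWithFallback` helper is hypothetical, and `ModelSelection` is trimmed to the two fields the example needs.

```kotlin
// Trimmed-down copy of ModelSelection, for illustration only.
data class ModelSelection(val deployment: String, val fallback: String?)

// Tries the primary deployment; on failure, retries once on the fallback if one is set.
fun <T> callWithFallback(selection: ModelSelection, call: (deployment: String) -> T): T =
    try {
        call(selection.deployment)
    } catch (primaryFailure: Exception) {
        // No fallback configured: surface the original failure unchanged.
        val fallback = selection.fallback ?: throw primaryFailure
        call(fallback)
    }

fun main() {
    val selection = ModelSelection(deployment = "gpt-5.4", fallback = "Kimi-K2.6")
    // Simulated client: the primary deployment is unavailable.
    val result = callWithFallback(selection) { deployment ->
        if (deployment == "gpt-5.4") throw RuntimeException("deployment unavailable")
        "answered by $deployment"
    }
    println(result) // answered by Kimi-K2.6
}
```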

When Tier S (self-hosting) makes sense

Self-hosting open models (Llama, DeepSeek, Mistral) on GPU VMs becomes cost-effective when:

| Condition | Threshold | Current |
| --- | --- | --- |
| Monthly API spend | > $10,000/month | ~$50/month |
| Request volume | > 500 runs/day sustained | ~5/day |
| Domain-specific quality | Fine-tuned model > general model | Not yet measured |
| Latency requirements | P99 < 500ms needed | Not a constraint |

We are nowhere near self-hosting thresholds. Revisit at Month 12+ if paying customers generate high inference volume.

What's next