5-layer pipeline deep dive

The Aucert platform processes mobile application testing through a 5-layer AI pipeline. Each layer is a distinct processing stage connected by the MCP (Model Context Protocol) message envelope. The Knowledge Graph feeds context into L1, and the Device Twin (Phase 2) will overlay L2.

MCP message envelope

All inter-layer communication uses the MCPMessage protobuf (defined in proto/pipeline.proto). This is the canonical contract between layers — every request and response is wrapped in this envelope.

message MCPMessage {
  string task_id = 1;                         // Unique per test run
  string source_layer = 2;                    // e.g. "generation"
  string target_layer = 3;                    // e.g. "execution"
  bytes payload = 4;                          // Layer-specific protobuf
  map<string, string> context_snapshot = 5;   // KG context subset
  double confidence_score = 6;                // 0.0-1.0, set by L3+
  string trace_id = 7;                        // Distributed tracing
  int64 timestamp_ms = 8;                     // Unix epoch millis
}

The context_snapshot field carries a subset of the Knowledge Graph relevant to the current test — this means each layer has the context it needs without querying the KG directly.
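As a rough illustration of how the envelope travels between layers, the sketch below mirrors the MCPMessage fields in a plain Kotlin data class. `McpEnvelope` and `forward` are hypothetical names introduced here; real code would use the classes generated from proto/pipeline.proto. Carrying `task_id`, `trace_id`, and the context snapshot forward unchanged is an assumption based on the field comments, not confirmed behavior.

```kotlin
// Hypothetical Kotlin mirror of the MCPMessage envelope (illustration only).
data class McpEnvelope(
    val taskId: String,
    val sourceLayer: String,
    val targetLayer: String,
    val payload: ByteArray,
    val contextSnapshot: Map<String, String>,
    val confidenceScore: Double,
    val traceId: String,
    val timestampMs: Long,
)

// Builds the envelope for the next hop. The task id, trace id, and KG context
// snapshot are copied from the incoming envelope (assumed behavior), so the
// receiving layer does not need to query the Knowledge Graph itself.
fun forward(
    incoming: McpEnvelope,
    nextLayer: String,
    payload: ByteArray,
    confidence: Double = incoming.confidenceScore,
    nowMs: Long = System.currentTimeMillis(),
): McpEnvelope {
    require(confidence in 0.0..1.0) { "confidence_score must be within 0.0-1.0" }
    return incoming.copy(
        sourceLayer = incoming.targetLayer, // the receiver becomes the sender
        targetLayer = nextLayer,
        payload = payload,
        confidenceScore = confidence,
        timestampMs = nowMs,
    )
}
```

Passing the timestamp as a parameter keeps the helper deterministic in tests; production code would rely on the default.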

Layer details

L1: Generation

| Aspect | Detail |
| --- | --- |
| Code path | backend/platform/src/main/kotlin/.../domain/generation/ |
| Model (current) | Kimi K2.6 via Azure AI Foundry (Kimi-K2.6 deployment) |
| Input | TestRunRequest + KG context snapshot |
| Output | Set of TestScenario messages, each with ordered action steps and expected outcomes |

The Generation layer takes input from the Knowledge Graph and produces test scenarios. It analyzes app structure (screens, navigation flows, API contracts) and generates comprehensive test plans covering:

  • Critical user paths — High-connectivity graph traversals (login → purchase → checkout)
  • Edge cases — Unusual state combinations derived from node property analysis
  • Regression scenarios — Weighted toward nodes with historical bug patterns
  • Accessibility checks — Screen reader compatibility, contrast ratios, touch target sizes

Why Kimi K2.6? It's a multimodal thinking model optimized for agent workflows. Its large context window handles the full KG context snapshot, and its vision capability enables UI-aware scenario design. It replaced K2.5 on 2026-05-03. See model orchestration for the full model selection rationale.

L2: Execution

| Aspect | Detail |
| --- | --- |
| Code path | backend/platform/src/main/kotlin/.../domain/execution/ |
| Model | None (engine-driven, not LLM) |
| Input | TestScenario messages from L1 |
| Output | ExecutionTrace with screenshots, timing, and action logs |

The Execution layer is the only layer that never calls an LLM (L4 is also model-free, but only in the MVP). It runs generated test scenarios on Android emulators:

  1. App installation — Installs the APK/AAB on the emulator
  2. Scenario playback — Executes each action step (tap, swipe, text input, scroll)
  3. Screenshot capture — Takes a screenshot after each action for L3 analysis
  4. Trace recording — Records timing data, network requests, and crash/ANR events
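
The four steps above can be sketched as a simple playback loop. Everything named here (`Action`, `StepRecord`, `DeviceDriver`, `runScenario`) is hypothetical and only illustrates the control flow; the real TestScenario and ExecutionTrace types come from the MCP protobufs, and network and crash/ANR recording are omitted.

```kotlin
// Hypothetical stand-ins for the protobuf-defined scenario and trace types.
data class Action(val kind: String, val target: String)
data class StepRecord(val action: Action, val screenshot: String, val durationMs: Long)

// Stand-in for whatever drives the emulator (an assumption, not the platform's API).
interface DeviceDriver {
    fun install(appPath: String)
    fun perform(action: Action)
    fun screenshot(): String // returns a path or handle to the captured image
}

// Installs the app once, then plays each action in order. One screenshot is
// captured after every action so L3 has a frame for each step, and the time
// spent performing the action is recorded alongside it.
fun runScenario(
    driver: DeviceDriver,
    appPath: String,
    actions: List<Action>,
    clockMs: () -> Long = { System.currentTimeMillis() },
): List<StepRecord> {
    driver.install(appPath)
    return actions.map { action ->
        val start = clockMs()
        driver.perform(action)
        val durationMs = clockMs() - start
        StepRecord(action, driver.screenshot(), durationMs)
    }
}
```

Injecting the clock and the driver keeps the loop testable without an emulator.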

Device Twin overlay (Phase 2): Will augment emulator results with predicted real-device behavior, adjusting for known divergences like touch sensitivity, GPU rendering differences, and sensor behavior.

L3: Analysis

| Aspect | Detail |
| --- | --- |
| Code path | backend/platform/src/main/kotlin/.../domain/analysis/ |
| Model (current) | DeepSeek V3.2 via Azure AI Foundry (DeepSeek-V3.2 deployment) |
| Input | ExecutionTrace with screenshots from L2 |
| Output | Per-step AnalysisResult with observations and confidence scores |

The Analysis layer applies visual reasoning to screenshots and execution logs. The multimodal model compares expected vs actual behavior, detecting:

  • UI regressions — Layout shifts, missing elements, color changes, broken fonts
  • Functional failures — Wrong screen, error states, crashes, data mismatch
  • Performance issues — Slow transitions, unresponsive UI, loading spinners that don't resolve

Why DeepSeek V3.2? It's the cheapest capable model for bulk visual analysis. L3 processes every screenshot in every test run, so per-token cost dominates. DeepSeek V3.2 provides sufficient visual reasoning at ~$0.14/1M input tokens. See model orchestration.
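
To make the cost argument concrete, here is a back-of-the-envelope estimator. It uses the approximate ~$0.14 per 1M input tokens quoted above; the function name and the tokens-per-screenshot parameter are illustrative assumptions, and the latter should be measured on real traffic rather than guessed.

```kotlin
// Estimated L3 input cost in USD for one test run.
// tokensPerScreenshot is a placeholder parameter, not a measured value.
fun analysisInputCostUsd(
    screenshots: Int,
    tokensPerScreenshot: Int,
    usdPerMillionInputTokens: Double = 0.14, // approximate rate quoted in the text
): Double {
    require(screenshots >= 0 && tokensPerScreenshot >= 0) { "counts must be non-negative" }
    // Use Long for the product so large runs do not overflow Int.
    val totalTokens = screenshots.toLong() * tokensPerScreenshot
    return totalTokens / 1_000_000.0 * usdPerMillionInputTokens
}
```

For example, a run with 200 screenshots at an assumed 1,500 input tokens each comes to 300,000 tokens, or roughly $0.042. Output tokens are not included.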

L4: Decision

| Aspect | Detail |
| --- | --- |
| Code path | backend/platform/src/main/kotlin/.../domain/pipeline/ |
| Model (MVP) | N/A (threshold-based only) |
| Model (Phase 2) | GPT-5.4 for Stages 3-4 reasoning |
| Input | AnalysisResult with confidence scores from L3 |
| Output | Pass/fail decision per test scenario |

MVP: Confidence threshold only. If the L3 confidence score exceeds the configured threshold (default 95%, i.e. 0.95 on the envelope's 0.0-1.0 confidence_score scale), the test passes. Below the threshold, it's flagged for human review.
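
A minimal sketch of the MVP rule, assuming the 95% default maps to 0.95 on the envelope's 0.0-1.0 `confidence_score` scale. `Verdict` and `decide` are hypothetical names, and treating a score exactly equal to the threshold as needing review is an assumption, since the text only defines the cases above and below it.

```kotlin
enum class Verdict { PASS, NEEDS_REVIEW }

// MVP rule: pass only when the L3 confidence strictly exceeds the threshold.
// A score equal to the threshold is sent to review (assumed, see lead-in).
fun decide(confidence: Double, threshold: Double = 0.95): Verdict {
    require(confidence in 0.0..1.0) { "confidence must be within 0.0-1.0" }
    return if (confidence > threshold) Verdict.PASS else Verdict.NEEDS_REVIEW
}
```

Keeping the rule in one pure function makes it straightforward to swap in the Phase 2 cascade behind the same call site.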

Full version (Phase 2+): 4-stage Verification Cascade with escalating cost and rigor. See Verification Cascade for the complete design.

L5: Reporting

| Aspect | Detail |
| --- | --- |
| Code path | backend/platform/src/main/kotlin/.../domain/pipeline/ |
| Model (current) | Kimi K2.6 via Azure AI Foundry (Kimi-K2.6 deployment, shared with L1) |
| Input | Pass/fail decisions + all upstream data |
| Output | BugReport messages + dashboard data |

MVP: Dashboard page showing test results with per-scenario pass/fail, confidence scores, and annotated screenshots.

Full version (Phase 2+): Structured bug reports with Jira/Linear integration, team notifications via Slack, and feedback loops that update the Knowledge Graph with test outcomes.

Why Kimi K2.6 (shared with L1)? Reporting is natural language generation: structuring bug reports, writing summaries, classifying severity. After gpt-4o-mini was removed on 2026-05-03, L5 was pointed at the existing Kimi-K2.6 deployment as a placeholder: it's already paid for, capable enough for prose, and avoids a separate mini-tier deployment. Revisit if contention between L1 and L5 on the shared deployment causes HTTP 429 (rate limit) responses.

Interface + adapter pattern

All layers communicate through interfaces, not concrete implementations. This is the core architectural pattern that enables future extraction to microservices.

// Each layer exposes an interface
interface GenerationService {
    suspend fun generate(request: TestRunRequest, context: KGContext): List<TestScenario>
}

// Local implementation (monolith mode)
class LocalGenerationService : GenerationService { ... }

// Remote implementation (microservice mode, Phase 2+)
class RemoteGenerationService(private val grpcClient: GenerationGrpc) : GenerationService { ... }

Rules:

  • Layers never import each other's internals — only the interface
  • All inter-layer data passes through MCP protobuf types
  • No shared mutable state between layers
  • Each layer owns its own error handling
  • Config flag aucert.pipeline.mode=local|remote switches implementations
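
One way the `aucert.pipeline.mode` flag could select an implementation is sketched below. `PipelineMode`, `parsePipelineMode`, and `selectImplementation` are hypothetical names; the actual wiring (for example through a dependency injection container) is not described in this document, so treat this as an illustration of the pattern only.

```kotlin
enum class PipelineMode { LOCAL, REMOTE }

// Parses the value of aucert.pipeline.mode; anything other than local|remote is rejected.
fun parsePipelineMode(raw: String): PipelineMode = when (raw.trim().lowercase()) {
    "local" -> PipelineMode.LOCAL
    "remote" -> PipelineMode.REMOTE
    else -> throw IllegalArgumentException("aucert.pipeline.mode must be local or remote, got '$raw'")
}

// Picks one implementation of a layer interface. The providers are lambdas, so
// the remote client is never constructed when running in monolith mode.
fun <T> selectImplementation(mode: PipelineMode, local: () -> T, remote: () -> T): T =
    when (mode) {
        PipelineMode.LOCAL -> local()
        PipelineMode.REMOTE -> remote()
    }
```

Callers depend only on the returned interface type, which is what keeps layers from importing each other's internals.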

Current model assignments

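| Layer | Model | Deployment | Cost/1M tokens | Provider |
| --- | --- | --- | --- | --- |
| L1 Generation | Kimi K2.6 | Kimi-K2.6 | ~$0.28 | Azure AI Foundry |
| L2 Execution | None | N/A | N/A | Engine-driven |
| L3 Analysis | DeepSeek V3.2 | DeepSeek-V3.2 | ~$0.14 | Azure AI Foundry |
| L4 Decision | GPT-5.4 (Phase 2; MVP is threshold-based only) | gpt-5.4 | ~$4.00 | Azure AI Foundry |
| L5 Reporting | Kimi K2.6 (shared) | Kimi-K2.6 | ~$0.28 | Azure AI Foundry |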

All models are "Direct from Azure" — covered by Founders Hub credits. Estimated total: $55-135/month at development scale.

What's next