5-layer pipeline deep dive
The Aucert platform processes mobile application testing through a 5-layer AI pipeline. Each layer is a distinct processing stage connected by the MCP (Model Context Protocol) message envelope. The Knowledge Graph feeds context into L1, and the Device Twin (Phase 2) will overlay L2.
MCP message envelope
All inter-layer communication uses the MCPMessage protobuf (defined in proto/pipeline.proto). This is the canonical contract between layers — every request and response is wrapped in this envelope.
message MCPMessage {
  string task_id = 1;                        // Unique per test run
  string source_layer = 2;                   // e.g. "generation"
  string target_layer = 3;                   // e.g. "execution"
  bytes payload = 4;                         // Layer-specific protobuf
  map<string, string> context_snapshot = 5;  // KG context subset
  double confidence_score = 6;               // 0.0-1.0, set by L3+
  string trace_id = 7;                       // Distributed tracing
  int64 timestamp_ms = 8;                    // Unix epoch millis
}
The context_snapshot field carries the subset of the Knowledge Graph relevant to the current test, so each layer has the context it needs without querying the KG directly.
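To make the envelope's role concrete, here is a minimal Kotlin sketch of how a layer might build the message it sends downstream. McpEnvelope and forward are hypothetical names: the real type is generated from proto/pipeline.proto and will look different, and carrying the context snapshot over unchanged is an assumption, not something this page specifies.

```kotlin
// Hypothetical plain-Kotlin mirror of the MCPMessage fields, for illustration
// only; the real class is generated from proto/pipeline.proto.
data class McpEnvelope(
    val taskId: String,
    val sourceLayer: String,
    val targetLayer: String,
    val payload: ByteArray,
    val contextSnapshot: Map<String, String>,
    val confidenceScore: Double,
    val traceId: String,
    val timestampMs: Long,
)

// Builds the envelope a layer would send to the next one. task_id, trace_id,
// the confidence score and the context snapshot are carried over unchanged
// (an assumption); routing fields, payload and timestamp are replaced.
fun forward(
    incoming: McpEnvelope,
    fromLayer: String,
    toLayer: String,
    newPayload: ByteArray,
    nowMs: Long = System.currentTimeMillis(),
): McpEnvelope = incoming.copy(
    sourceLayer = fromLayer,
    targetLayer = toLayer,
    payload = newPayload,
    timestampMs = nowMs,
)
```

Keeping task_id and trace_id stable across hops is what would let a single test run be followed end to end in distributed tracing.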
Layer details
L1: Generation
| Aspect | Detail |
|---|---|
| Code path | backend/platform/src/main/kotlin/.../domain/generation/ |
| Model (current) | Kimi K2.6 via Azure AI Foundry (Kimi-K2.6 deployment) |
| Input | TestRunRequest + KG context snapshot |
| Output | List of TestScenario messages, each with ordered action steps and expected outcomes |
The Generation layer takes input from the Knowledge Graph and produces test scenarios. It analyzes app structure (screens, navigation flows, API contracts) and generates comprehensive test plans covering:
- Critical user paths — High-connectivity graph traversals (login → purchase → checkout)
- Edge cases — Unusual state combinations derived from node property analysis
- Regression scenarios — Weighted toward nodes with historical bug patterns
- Accessibility checks — Screen reader compatibility, contrast ratios, touch target sizes
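As an illustration of the "weighted toward nodes with historical bug patterns" idea, the sketch below orders Knowledge Graph nodes before they are handed to the model. KgNode, its fields, and regressionPriority are hypothetical; this page does not describe how the weighting is actually implemented, so treat this as one possible reading.

```kotlin
// Hypothetical view of a Knowledge Graph node; field names are illustrative.
data class KgNode(val id: String, val degree: Int, val historicalBugs: Int)

// Orders nodes so that those with more recorded bugs come first. Ties are
// broken by connectivity (degree), then by id so the result is stable.
fun regressionPriority(nodes: List<KgNode>): List<KgNode> =
    nodes.sortedWith(
        compareByDescending<KgNode> { it.historicalBugs }
            .thenByDescending { it.degree }
            .thenBy { it.id }
    )
```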
Why Kimi K2.6? It's a multimodal thinking model optimized for agent workflows. Its large context window handles the full KG context snapshot, and its vision capability enables UI-aware scenario design. Replaced K2.5 on 2026-05-03. See model orchestration for the full model selection rationale.
L2: Execution
| Aspect | Detail |
|---|---|
| Code path | backend/platform/src/main/kotlin/.../domain/execution/ |
| Model | None — engine-driven, not LLM |
| Input | TestScenario messages from L1 |
| Output | ExecutionTrace with screenshots, timing, and action logs |
The Execution layer is the only layer that never uses an LLM (L4 is also model-free in the MVP but gains one in Phase 2). It runs generated test scenarios on Android emulators:
- App installation — Installs the APK/AAB on the emulator
- Scenario playback — Executes each action step (tap, swipe, text input, scroll)
- Screenshot capture — Takes a screenshot after each action for L3 analysis
- Trace recording — Records timing data, network requests, and crash/ANR events
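The playback and capture steps above can be sketched as a simple loop. ActionStep, StepRecord, Device, and playScenario are all hypothetical stand-ins: the real TestScenario and ExecutionTrace are protobuf messages whose fields are not shown on this page, and network or crash/ANR recording is left out.

```kotlin
// Hypothetical types for illustration only.
data class ActionStep(val kind: String, val target: String)
data class StepRecord(val step: ActionStep, val durationMs: Long, val screenshot: ByteArray)

// Hypothetical abstraction over an emulator session.
interface Device {
    fun perform(step: ActionStep)
    fun screenshot(): ByteArray
}

// Plays the steps in order, timing each action and taking a screenshot
// after it completes. The clock is injectable so the loop can be tested.
fun playScenario(
    device: Device,
    steps: List<ActionStep>,
    clockMs: () -> Long = { System.currentTimeMillis() },
): List<StepRecord> = steps.map { step ->
    val start = clockMs()
    device.perform(step)
    val elapsed = clockMs() - start
    StepRecord(step, elapsed, device.screenshot())
}
```

Capturing the screenshot after the timing window closes keeps capture overhead out of the recorded action duration, which is a design choice of this sketch rather than a documented behavior.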
Device Twin overlay (Phase 2): Will augment emulator results with predicted real-device behavior, adjusting for known divergences like touch sensitivity, GPU rendering differences, and sensor behavior.
L3: Analysis
| Aspect | Detail |
|---|---|
| Code path | backend/platform/src/main/kotlin/.../domain/analysis/ |
| Model (current) | DeepSeek V3.2 via Azure AI Foundry (DeepSeek-V3.2 deployment) |
| Input | ExecutionTrace with screenshots from L2 |
| Output | Per-step AnalysisResult with observations and confidence scores |
The Analysis layer applies visual reasoning to screenshots and execution logs. The multimodal model compares expected vs actual behavior, detecting:
- UI regressions — Layout shifts, missing elements, color changes, broken fonts
- Functional failures — Wrong screen, error states, crashes, data mismatch
- Performance issues — Slow transitions, unresponsive UI, loading spinners that don't resolve
Why DeepSeek V3.2? It's the cheapest capable model for bulk visual analysis. L3 processes every screenshot in every test run, so per-token cost dominates. DeepSeek V3.2 provides sufficient visual reasoning at ~$0.14/1M input tokens. See model orchestration.
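A quick way to see why per-token cost dominates at L3 is to multiply it out. The sketch below uses the ~$0.14 per 1M input tokens quoted above; the function name and the tokens-per-screenshot figure are assumptions for illustration, not measured values.

```kotlin
// Input-token price for L3 quoted above, in USD per one million tokens.
val l3UsdPerMillionInputTokens = 0.14

// Estimates L3 input cost for one test run. tokensPerScreenshot is an
// assumed figure; measure it against the real deployment before relying on it.
fun l3InputCostUsd(screenshots: Int, tokensPerScreenshot: Int): Double =
    screenshots.toLong() * tokensPerScreenshot / 1_000_000.0 * l3UsdPerMillionInputTokens
```

For example, a run with 500 screenshots at an assumed 2,000 input tokens each is 1,000,000 tokens, or about $0.14 of input cost.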
L4: Decision
| Aspect | Detail |
|---|---|
| Code path | backend/platform/src/main/kotlin/.../domain/pipeline/ |
| Model (MVP) | N/A — threshold-based only |
| Model (Phase 2) | GPT-5.4 for Stages 3-4 reasoning |
| Input | AnalysisResult with confidence scores from L3 |
| Output | Pass/fail decision per test scenario |
MVP: Confidence threshold only. If the L3 confidence score exceeds the configured threshold (default 0.95, i.e. 95%), the test passes; otherwise it's flagged for human review.
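The MVP rule is small enough to show in full. This is a sketch, not the platform's code: Verdict and decide are hypothetical names, and the range check on the score is an added assumption based on the 0.0-1.0 range in the MCPMessage comment.

```kotlin
enum class Verdict { PASS, HUMAN_REVIEW }

// Passes only when the L3 confidence strictly exceeds the threshold.
// A score exactly at the threshold does not exceed it, so it goes to review.
fun decide(confidence: Double, threshold: Double = 0.95): Verdict {
    require(confidence in 0.0..1.0) { "confidence must be within 0.0-1.0" }
    return if (confidence > threshold) Verdict.PASS else Verdict.HUMAN_REVIEW
}
```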
Full version (Phase 2+): 4-stage Verification Cascade with escalating cost and rigor. See Verification Cascade for the complete design.
L5: Reporting
| Aspect | Detail |
|---|---|
| Code path | backend/platform/src/main/kotlin/.../domain/pipeline/ |
| Model (current) | Kimi K2.6 via Azure AI Foundry (Kimi-K2.6 deployment, shared with L1) |
| Input | Pass/fail decisions + all upstream data |
| Output | BugReport messages + dashboard data |
MVP: Dashboard page showing test results with per-scenario pass/fail, confidence scores, and annotated screenshots.
Full version (Phase 2+): Structured bug reports with Jira/Linear integration, team notifications via Slack, and feedback loops that update the Knowledge Graph with test outcomes.
Why Kimi K2.6 (shared with L1)? Reporting is natural language generation: structuring bug reports, writing summaries, classifying severity. After gpt-4o-mini was removed on 2026-05-03, L5 was pointed at the existing Kimi-K2.6 deployment as a placeholder: it's already paid for, capable enough for prose, and avoids a separate mini-tier deployment. Revisit if contention between L1 and L5 on the shared deployment causes HTTP 429 (rate-limit) responses.
Interface + adapter pattern
All layers communicate through interfaces, not concrete implementations. This is the core architectural pattern that enables future extraction to microservices.
// Each layer exposes an interface
interface GenerationService {
    suspend fun generate(request: TestRunRequest, context: KGContext): List<TestScenario>
}
// Local implementation (monolith mode)
class LocalGenerationService : GenerationService { ... }
// Remote implementation (microservice mode, Phase 2+)
class RemoteGenerationService(private val grpcClient: GenerationGrpc) : GenerationService { ... }
Rules:
- Layers never import each other's internals — only the interface
- All inter-layer data passes through MCP protobuf types
- No shared mutable state between layers
- Each layer owns its own error handling
- Config flag aucert.pipeline.mode=local|remote switches implementations
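One way the mode flag could select an implementation behind an interface is sketched below. PipelineMode, parsePipelineMode, and selectImplementation are hypothetical names; how the platform actually reads aucert.pipeline.mode (and whether it tolerates case or whitespace differences, as this sketch does) is not stated here.

```kotlin
enum class PipelineMode { LOCAL, REMOTE }

// Parses the raw value of aucert.pipeline.mode. Trimming and lowercasing
// are assumptions of this sketch; any other value is rejected early.
fun parsePipelineMode(raw: String): PipelineMode = when (raw.trim().lowercase()) {
    "local" -> PipelineMode.LOCAL
    "remote" -> PipelineMode.REMOTE
    else -> throw IllegalArgumentException(
        "aucert.pipeline.mode must be 'local' or 'remote', got '$raw'"
    )
}

// Picks the implementation for a layer interface T. Factories are passed in
// lazily so only the selected implementation is constructed, and callers
// depend only on the interface type.
fun <T> selectImplementation(
    mode: PipelineMode,
    local: () -> T,
    remote: () -> T,
): T = when (mode) {
    PipelineMode.LOCAL -> local()
    PipelineMode.REMOTE -> remote()
}
```

With the interfaces shown earlier, a caller would pass factories for LocalGenerationService and RemoteGenerationService and receive a GenerationService, without importing either implementation's internals anywhere else.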
Current model assignments
| Layer | Model | Deployment | Cost/1M tokens | Provider |
|---|---|---|---|---|
| L1 Generation | Kimi K2.6 | Kimi-K2.6 | ~$0.28 | Azure AI Foundry |
| L2 Execution | None | N/A | N/A | Engine-driven |
| L3 Analysis | DeepSeek V3.2 | DeepSeek-V3.2 | ~$0.14 | Azure AI Foundry |
| L4 Decision | None in MVP (threshold-based); GPT-5.4 in Phase 2 | gpt-5.4 | ~$4.00 | Azure AI Foundry |
| L5 Reporting | Kimi K2.6 (shared) | Kimi-K2.6 | ~$0.28 | Azure AI Foundry |
All models are "Direct from Azure" — covered by Founders Hub credits. Estimated total: $55-135/month at development scale.
What's next
- Knowledge Graph internals — How the KG engine works
- Verification Cascade — Multi-stage decision making
- Model orchestration — Model tier routing and cost optimization