5-layer pipeline deep dive

The Aucert platform processes mobile application testing through a 5-layer AI pipeline. Each layer is a distinct processing stage connected by the MCP (Model Context Protocol) message envelope. The Knowledge Graph feeds context into L1, and the Device Twin (Phase 2) will overlay L2.

MCP message envelope

All inter-layer communication uses the `MCPMessage` protobuf (defined in `proto/pipeline.proto`). This is the canonical contract between layers: every request and response is wrapped in this envelope.

message MCPMessage {
  string task_id = 1;          // Unique per test run
  string source_layer = 2;     // e.g. "generation"
  string target_layer = 3;     // e.g. "execution"
  bytes payload = 4;           // Layer-specific protobuf
  map<string, string> context_snapshot = 5;  // KG context subset
  double confidence_score = 6; // 0.0-1.0, set by L3+
  string trace_id = 7;        // Distributed tracing
  int64 timestamp_ms = 8;     // Unix epoch millis
}

The `context_snapshot` field carries the subset of the Knowledge Graph relevant to the current test, so each layer has the context it needs without querying the KG directly.
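As a rough illustration of how an envelope might be passed from one layer to the next, the sketch below mirrors the `MCPMessage` fields in a plain Kotlin data class. `McpEnvelope` and `forward` are hypothetical names introduced here; real code would presumably use the class generated from `proto/pipeline.proto`.

```kotlin
// Hypothetical Kotlin mirror of the MCPMessage fields; not the generated protobuf class.
data class McpEnvelope(
    val taskId: String,
    val sourceLayer: String,
    val targetLayer: String,
    val payload: ByteArray,
    val contextSnapshot: Map<String, String>,
    val confidenceScore: Double,
    val traceId: String,
    val timestampMs: Long,
)

// Wraps a new layer-specific payload for the next layer. The task id, trace id,
// and KG context snapshot are carried over unchanged, so the receiving layer
// does not need to query the KG itself.
fun forward(
    previous: McpEnvelope,
    sourceLayer: String,
    targetLayer: String,
    payload: ByteArray,
    confidenceScore: Double = previous.confidenceScore,
    nowMs: Long = System.currentTimeMillis(),
): McpEnvelope {
    require(confidenceScore in 0.0..1.0) { "confidence_score must be within 0.0-1.0" }
    return previous.copy(
        sourceLayer = sourceLayer,
        targetLayer = targetLayer,
        payload = payload,
        confidenceScore = confidenceScore,
        timestampMs = nowMs,
    )
}
```

Copying the previous envelope rather than building a fresh one is a design choice of this sketch: it keeps `task_id`, `trace_id`, and `context_snapshot` stable across the whole run.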

Layer details

L1: Generation

Aspect	Detail
Code path	`backend/platform/src/main/kotlin/.../domain/generation/`
Model (current)	Kimi K2.6 via Azure AI Foundry (`Kimi-K2.6` deployment)
Input	`TestRunRequest` + KG context snapshot
Output	Set of `TestScenario` messages, each with ordered action steps and expected outcomes

The Generation layer takes input from the Knowledge Graph and produces test scenarios. It analyzes app structure (screens, navigation flows, API contracts) and generates comprehensive test plans covering:

Critical user paths — High-connectivity graph traversals (login → purchase → checkout)
Edge cases — Unusual state combinations derived from node property analysis
Regression scenarios — Weighted toward nodes with historical bug patterns
Accessibility checks — Screen reader compatibility, contrast ratios, touch target sizes
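One way to picture the connectivity and bug-history weighting described above is a simple scoring pass over KG nodes. This is only a sketch: `ScreenNode`, its fields, and the weights are hypothetical, and the document does not say how the Generation layer actually ranks nodes.

```kotlin
// Hypothetical summary of a KG node; names and fields are illustrative only.
data class ScreenNode(val name: String, val connections: Int, val historicalBugs: Int)

// Orders screens so that well-connected and historically buggy ones come first.
// The default weights are arbitrary placeholders, not tuned values.
fun prioritize(
    nodes: List<ScreenNode>,
    connectivityWeight: Double = 1.0,
    bugWeight: Double = 2.0,
): List<ScreenNode> =
    nodes.sortedByDescending { it.connections * connectivityWeight + it.historicalBugs * bugWeight }
```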

Why Kimi K2.6? It's a multimodal thinking model optimized for agent workflows. Its large context window handles the full KG context snapshot, and its vision capability enables UI-aware scenario design. Replaced K2.5 on 2026-05-03. See model orchestration for the full model selection rationale.

L2: Execution

Aspect	Detail
Code path	`backend/platform/src/main/kotlin/.../domain/execution/`
Model	None — engine-driven, not LLM
Input	`TestScenario` messages from L1
Output	`ExecutionTrace` with screenshots, timing, and action logs

The Execution layer is engine-driven and never calls an LLM (L4 is also model-free in the MVP but gains a model in Phase 2). It runs generated test scenarios on Android emulators:

App installation — Installs the APK/AAB on the emulator
Scenario playback — Executes each action step (tap, swipe, text input, scroll)
Screenshot capture — Takes a screenshot after each action for L3 analysis
Trace recording — Records timing data, network requests, and crash/ANR events
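The playback and capture steps above can be sketched as a single loop. All types here (`ActionStep`, `StepRecord`, `Device`) are hypothetical stand-ins; the real `TestScenario` and `ExecutionTrace` messages and the emulator driver are not shown in this document.

```kotlin
// Hypothetical stand-ins for the protobuf messages exchanged with L1 and L3.
data class ActionStep(val kind: String, val target: String)
data class StepRecord(val step: ActionStep, val screenshot: ByteArray, val durationMs: Long)

// Hypothetical abstraction over the emulator.
interface Device {
    fun perform(step: ActionStep)
    fun screenshot(): ByteArray
}

// Plays back each step in order, times it, and captures a screenshot
// immediately afterwards so that L3 receives one image per action.
fun playBack(device: Device, steps: List<ActionStep>): List<StepRecord> =
    steps.map { step ->
        val start = System.nanoTime()
        device.perform(step)
        val durationMs = (System.nanoTime() - start) / 1_000_000
        StepRecord(step, device.screenshot(), durationMs)
    }
```

Network requests and crash/ANR events are left out of this sketch; they would need additional hooks on the device abstraction.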

Device Twin overlay (Phase 2): Will augment emulator results with predicted real-device behavior, adjusting for known divergences like touch sensitivity, GPU rendering differences, and sensor behavior.

L3: Analysis

Aspect	Detail
Code path	`backend/platform/src/main/kotlin/.../domain/analysis/`
Model (current)	DeepSeek V3.2 via Azure AI Foundry (`DeepSeek-V3.2` deployment)
Input	`ExecutionTrace` with screenshots from L2
Output	Per-step `AnalysisResult` with observations and confidence scores

The Analysis layer applies visual reasoning to screenshots and execution logs. The multimodal model compares expected vs actual behavior, detecting:

UI regressions — Layout shifts, missing elements, color changes, broken fonts
Functional failures — Wrong screen, error states, crashes, data mismatch
Performance issues — Slow transitions, unresponsive UI, loading spinners that don't resolve

Why DeepSeek V3.2? It's the cheapest capable model for bulk visual analysis. L3 processes every screenshot in every test run, so per-token cost dominates. DeepSeek V3.2 provides sufficient visual reasoning at ~$0.14/1M input tokens. See model orchestration.

L4: Decision

Aspect	Detail
Code path	`backend/platform/src/main/kotlin/.../domain/pipeline/`
Model (MVP)	N/A — threshold-based only
Model (Phase 2)	GPT-5.4 for reasoning in Stages 3-4 of the Verification Cascade
Input	`AnalysisResult` with confidence scores from L3
Output	Pass/fail decision per test scenario

MVP: Confidence threshold only. If the L3 confidence score exceeds the configured threshold (default 95%, i.e. 0.95 on the 0.0-1.0 scale of `confidence_score`), the test passes. Otherwise it is flagged for human review.
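A minimal sketch of the MVP threshold rule follows. `Verdict`, `StepAnalysis`, and `decide` are hypothetical names. Because L3 emits per-step results while L4 decides per scenario, the sketch assumes a scenario is judged by its least confident step; the document does not state how steps are aggregated.

```kotlin
enum class Verdict { PASS, NEEDS_REVIEW }

// Hypothetical per-step result; only the confidence value matters here.
data class StepAnalysis(val stepIndex: Int, val confidence: Double)

// MVP rule from the text: pass only when confidence exceeds the threshold.
// Assumption: the scenario's confidence is that of its weakest step, and a
// scenario with no analyzed steps is sent to review rather than passed.
fun decide(steps: List<StepAnalysis>, threshold: Double = 0.95): Verdict {
    val weakest = steps.minOfOrNull { it.confidence } ?: return Verdict.NEEDS_REVIEW
    return if (weakest > threshold) Verdict.PASS else Verdict.NEEDS_REVIEW
}
```

Note the strict comparison: a score exactly equal to the threshold does not "exceed" it, so it is flagged for review.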

Full version (Phase 2+): 4-stage Verification Cascade with escalating cost and rigor. See Verification Cascade for the complete design.

L5: Reporting

Aspect	Detail
Code path	`backend/platform/src/main/kotlin/.../domain/pipeline/`
Model (current)	Kimi K2.6 via Azure AI Foundry (`Kimi-K2.6` deployment, shared with L1)
Input	Pass/fail decisions + all upstream data
Output	`BugReport` messages + dashboard data

MVP: Dashboard page showing test results with per-scenario pass/fail, confidence scores, and annotated screenshots.

Full version (Phase 2+): Structured bug reports with Jira/Linear integration, team notifications via Slack, and feedback loops that update the Knowledge Graph with test outcomes.

Why Kimi K2.6 (shared with L1)? Reporting is natural language generation: structuring bug reports, writing summaries, classifying severity. After gpt-4o-mini was removed on 2026-05-03, L5 was pointed at the existing Kimi-K2.6 deployment as a placeholder, because it is already paid for, capable enough for prose, and avoids a separate mini-tier deployment. Revisit if contention between L1 and L5 on the shared deployment causes HTTP 429 (rate-limit) responses.

Interface + adapter pattern

All layers communicate through interfaces, not concrete implementations. This is the core architectural pattern that enables future extraction to microservices.

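// Each layer exposes an interface
interface GenerationService {
    suspend fun generate(request: TestRunRequest, context: KGContext): List<TestScenario>
}

// Local implementation (monolith mode)
class LocalGenerationService : GenerationService {
    override suspend fun generate(request: TestRunRequest, context: KGContext): List<TestScenario> =
        TODO("in-process generation (body elided)")
}

// Remote implementation (microservice mode, Phase 2+)
class RemoteGenerationService(private val grpcClient: GenerationGrpc) : GenerationService {
    override suspend fun generate(request: TestRunRequest, context: KGContext): List<TestScenario> =
        TODO("delegate to grpcClient (body elided)")
}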

Rules:

Layers never import each other's internals — only the interface
All inter-layer data passes through MCP protobuf types
No shared mutable state between layers
Each layer owns its own error handling
Config flag `aucert.pipeline.mode=local|remote` switches implementations
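To illustrate how the mode flag might select an implementation, here is a small factory sketch. `LayerService`, its two implementations, `layerServiceFor`, and the `endpoint` parameter are simplified, hypothetical stand-ins for the real layer interfaces, which are suspend functions over protobuf types.

```kotlin
// Simplified stand-in for a layer interface.
interface LayerService { fun describe(): String }

class LocalLayerService : LayerService {
    override fun describe() = "local (in-process)"
}

class RemoteLayerService(private val endpoint: String) : LayerService {
    override fun describe() = "remote via $endpoint"
}

// Picks an implementation from the value of aucert.pipeline.mode.
// The endpoint parameter and its default are hypothetical additions.
fun layerServiceFor(mode: String, endpoint: String = "localhost:9090"): LayerService =
    when (mode.lowercase()) {
        "local" -> LocalLayerService()
        "remote" -> RemoteLayerService(endpoint)
        else -> throw IllegalArgumentException(
            "aucert.pipeline.mode must be local or remote, got '$mode'"
        )
    }
```

Callers depend only on `LayerService`, so switching from monolith to microservice mode changes the factory input and nothing else.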

Current model assignments

Layer	Model	Deployment	Cost/1M tokens	Provider
L1 Generation	Kimi K2.6	`Kimi-K2.6`	~$0.28	Azure AI Foundry
L2 Execution	None	N/A	N/A	Engine-driven
L3 Analysis	DeepSeek V3.2	`DeepSeek-V3.2`	~$0.14	Azure AI Foundry
L4 Decision	GPT-5.4 (Phase 2; MVP is threshold-only, no model)	`gpt-5.4`	~$4.00	Azure AI Foundry
L5 Reporting	Kimi K2.6 (shared)	`Kimi-K2.6`	~$0.28	Azure AI Foundry

All models are "Direct from Azure" — covered by Founders Hub credits. Estimated total: $55-135/month at development scale.
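For back-of-the-envelope budgeting, the per-1M-token rates in the table can be applied to a token volume per layer. This is a sketch only: the token volumes are inputs you supply, the function name is hypothetical, and it does not attempt to reproduce the $55-135/month estimate.

```kotlin
// Approximate per-1M-token rates in USD, copied from the table above.
// The L4 rate applies only once GPT-5.4 is in use (Phase 2).
val ratePerMillionTokens = mapOf(
    "L1 Generation" to 0.28,
    "L3 Analysis" to 0.14,
    "L4 Decision" to 4.00,
    "L5 Reporting" to 0.28,
)

// Estimates spend for a given token volume per layer. Layers without a rate
// (for example L2, which uses no model) contribute nothing.
fun estimateCostUsd(tokensByLayer: Map<String, Long>): Double =
    tokensByLayer.entries.sumOf { (layer, tokens) ->
        (ratePerMillionTokens[layer] ?: 0.0) * tokens / 1_000_000.0
    }
```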

What's next

Knowledge Graph internals — How the KG engine works
Verification Cascade — Multi-stage decision making
Model orchestration — Model tier routing and cost optimization
