Verification Cascade
The Verification Cascade is a 4-stage, confidence-gated system designed to keep Aucert's false positive rate below 5%. Each stage increases verification rigor (and cost): most results resolve cheaply at Stage 1, and expensive multi-model verification is reserved for ambiguous cases.
Stage 1: Self-confidence (~$0.001/test)
The L3 Analysis model reports its own confidence score alongside each observation. If the score is ≥95%, the result passes without further verification.
How it works: The analysis prompt includes an instruction to rate confidence on a 0-100 scale. The model evaluates: "How certain am I that this screenshot shows [expected behavior]?"
Resolution rate: ~80% of results. Most test steps produce clear pass/fail signals: after a login attempt, the app shows either the home screen or an error, leaving little ambiguity.
Example:
Test: Navigate to Settings → Toggle dark mode → Verify background changes
Confidence: 98.2%
Decision: PASS (above 95% threshold)
Cost: $0.001 (single model invocation)
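The threshold check above can be sketched as follows. This is a minimal illustration, not Aucert's implementation: the names `Stage1Result` and `stage1_gate` are hypothetical, and the sketch assumes the model's 0-100 confidence score has already been parsed from its response.

```python
from dataclasses import dataclass

# The 95% threshold described above, kept on the same 0-100 scale as the prompt.
CONFIDENCE_THRESHOLD_PCT = 95.0


@dataclass
class Stage1Result:
    verdict: str           # "pass" or "fail", as reported by the analysis model
    confidence_pct: float  # self-reported confidence on the 0-100 scale
    resolved: bool         # True if the result needs no further verification


def stage1_gate(verdict: str, confidence_pct: float) -> Stage1Result:
    """Accept the verdict when self-reported confidence meets the threshold."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError(f"confidence must be in 0-100, got {confidence_pct}")
    resolved = confidence_pct >= CONFIDENCE_THRESHOLD_PCT
    return Stage1Result(verdict, confidence_pct, resolved)
```

With the dark-mode example, `stage1_gate("pass", 98.2)` comes back resolved, while a score of 87 would be left unresolved for the next step.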
Stage 2: Self-consistency (~$0.003/test)
The same model re-evaluates the result 3 times, varying the temperature and rephrasing the instructions each time. If all 3 evaluations reach the same conclusion, the result is confirmed.
Why it works: Inconsistent reasoning indicates the model is uncertain. If the model says "pass" with one prompt but "fail" with a slightly different framing, the result is genuinely ambiguous.
Resolution rate: ~15% of all results, drawn from the ~20% that did not resolve at Stage 1.
Example:
Test: Verify "Add to Cart" button is disabled when item is out of stock
Evaluation 1: FAIL (confidence 87%) — "Button appears tappable"
Evaluation 2: FAIL (confidence 91%) — "Button lacks disabled styling"
Evaluation 3: FAIL (confidence 85%) — "Button does not have reduced opacity"
Verdict: 3/3 agree FAIL → confirmed FAIL
Cost: $0.003 (3x model invocation)
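A rough sketch of the unanimity rule, assuming a hypothetical `evaluate` callable that sends one prompt variant to the model and returns its verdict as a string. How prompts are rephrased and how temperature is varied is left outside the sketch.

```python
from typing import Callable, Optional, Sequence


def self_consistency(
    evaluate: Callable[[str], str],
    prompt_variants: Sequence[str],
) -> Optional[str]:
    """Return the shared verdict if every variant agrees, otherwise None.

    `evaluate` is a stand-in for a model call; None means the result stays
    unresolved and would move on to the next stage.
    """
    if not prompt_variants:
        raise ValueError("at least one prompt variant is required")
    verdicts = [evaluate(prompt) for prompt in prompt_variants]
    # Unanimous agreement means the set of verdicts has exactly one member.
    if len(set(verdicts)) == 1:
        return verdicts[0]
    return None
```

In the "Add to Cart" example, three `"fail"` verdicts would return `"fail"`; a single dissenting `"pass"` would return `None`.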
Stage 3: Cross-model vote (~$0.01/test)
Multiple model tiers evaluate the result independently. A majority vote decides. This catches model-specific biases — one model might consistently misinterpret certain UI patterns that another handles correctly.
Models used: The analysis model (DeepSeek V3.2) plus the reasoning model (GPT-5.1-Codex). Each evaluates independently without seeing the other's output.
Resolution rate: ~4% of results.
Example:
Test: Verify loading spinner disappears within 3 seconds
DeepSeek V3.2: FAIL — "Spinner still visible in frame 3"
GPT-5.1-Codex: PASS — "Spinner is transitioning out, partially visible due to animation"
GPT-5.1-Codex (second eval): PASS — "Frame timing shows spinner at 2.8s, within threshold"
Verdict: 2/3 vote PASS → PASS
Cost: $0.01 (3 model invocations across 2 tiers)
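The majority rule can be illustrated with a small helper. This is a sketch under the assumption that each evaluation has already been reduced to a verdict string; `majority_vote` is a hypothetical name, and returning `None` when there is no strict majority is a design choice of the sketch, not something the text above specifies.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[str]) -> Optional[str]:
    """Return the verdict held by a strict majority of votes, else None."""
    if not votes:
        return None
    verdict, count = Counter(votes).most_common(1)[0]
    # Require more than half of the votes so that ties stay unresolved.
    return verdict if count > len(votes) / 2 else None
```

For the loading-spinner example, `majority_vote(["fail", "pass", "pass"])` yields `"pass"`. Using an odd number of evaluations, as in that example, avoids ties between two verdicts.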
Stage 4: Structured debate (~$0.05/test)
Two model instances argue opposing conclusions with structured arguments. A judge model evaluates the quality of arguments and renders a final decision. The full debate transcript is preserved as an audit trail.
Format:
- Prosecutor argues the test failed, citing specific evidence from screenshots
- Defender argues the test passed, providing counter-evidence
- Each side gets 2 rounds of rebuttal
- Judge evaluates argument quality and renders a verdict with written reasoning
Resolution rate: ~1% of results — only the most ambiguous cases reach this stage.
Example:
Test: Verify search results update when filter is applied
Prosecutor: "Results list is identical before and after applying the 'Price: Low to High' filter"
Defender: "Results are the same items but the order has changed — note item 2 and item 3 swapped"
Judge: "Defender's argument is correct. Visual diff shows identical items in different order. The filter is a sort operation, not a filter. PASS."
Cost: $0.05 (7 model invocations: 2 sides × 2 rounds + judge × 3 evaluations)
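One way to picture the debate loop and its 7 invocations is the sketch below. It rests on several assumptions that the text does not confirm: `argue` and `judge` are hypothetical stand-ins for model calls, each side speaks once per round, and the "judge × 3 evaluations" in the cost line is read as three judge rulings combined by majority.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signatures: a debater sees its role and the transcript so far,
# and the judge sees the full transcript and returns "pass" or "fail".
Debater = Callable[[str, List[str]], str]
Judge = Callable[[List[str]], str]


@dataclass
class Debate:
    transcript: List[str] = field(default_factory=list)  # kept for the audit trail
    invocations: int = 0                                  # model calls made so far


def run_debate(argue: Debater, judge: Judge,
               rounds: int = 2, judge_evals: int = 3) -> Tuple[str, Debate]:
    """Alternate prosecutor and defender for `rounds`, then poll the judge."""
    debate = Debate()
    for round_no in range(1, rounds + 1):
        for role in ("prosecutor", "defender"):
            argument = argue(role, list(debate.transcript))
            debate.transcript.append(f"[round {round_no}] {role}: {argument}")
            debate.invocations += 1
    # Assumption: several judge rulings are combined by taking the most common one.
    rulings = [judge(list(debate.transcript)) for _ in range(judge_evals)]
    debate.invocations += judge_evals
    verdict = Counter(rulings).most_common(1)[0][0]
    debate.transcript.append(f"judge: {verdict}")
    return verdict, debate
```

With the defaults, the loop makes 2 × 2 debater calls plus 3 judge calls, which matches the 7 invocations in the cost line above.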
Cost analysis
| Stage | Cost/test | Cumulative | % of results | Monthly cost (1000 tests/day) |
|---|---|---|---|---|
| Stage 1 | $0.001 | $0.001 | 80% | $24 |
| Stage 2 | $0.003 | $0.004 | 15% | $13.50 |
| Stage 3 | $0.01 | $0.014 | 4% | $12 |
| Stage 4 | $0.05 | $0.064 | 1% | $15 |
| Weighted average | $0.00215 | | 100% | $64.50 |
The weighted average cost per verification is ~$0.002 — well under the $0.005 target.
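The arithmetic behind the table can be reproduced with a short script. It takes the per-stage costs and shares from the table and assumes a 30-day month, which is what the Stage 1 row implies (1000 × 30 × 80% × $0.001 = $24). The table charges each result only for the stage where it resolves. The second function shows the alternative, where an escalated result also pays for every earlier stage (the Cumulative column). That comes to about $0.0026 per verification, still under the $0.005 target.

```python
# (stage name, cost per test in USD, share of results resolved at this stage)
STAGES = [
    ("Stage 1", 0.001, 0.80),
    ("Stage 2", 0.003, 0.15),
    ("Stage 3", 0.010, 0.04),
    ("Stage 4", 0.050, 0.01),
]
TESTS_PER_DAY = 1000
DAYS_PER_MONTH = 30  # assumption inferred from the Stage 1 row of the table


def weighted_average(stages=STAGES) -> float:
    """Average cost when a result is charged only for its resolving stage."""
    return sum(cost * share for _, cost, share in stages)


def weighted_average_cumulative(stages=STAGES) -> float:
    """Average cost when a result also pays for every stage before it."""
    total, running_cost = 0.0, 0.0
    for _, cost, share in stages:
        running_cost += cost
        total += running_cost * share
    return total


def monthly_cost(stages=STAGES) -> float:
    """Monthly total under the table's per-stage convention."""
    return weighted_average(stages) * TESTS_PER_DAY * DAYS_PER_MONTH


if __name__ == "__main__":
    print(f"per-stage average:  ${weighted_average():.5f}")
    print(f"cumulative average: ${weighted_average_cumulative():.5f}")
    print(f"monthly total:      ${monthly_cost():.2f}")
```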
MVP status
Phase 1 uses Stage 1 only (confidence threshold). Results below 95% are flagged for human review rather than escalating through Stages 2-4.
What's built: The confidence scoring pipeline in L3/L4 and the threshold-based pass/fail logic.
What's planned: Stages 2-4 implementation, configurable thresholds per customer, and the audit trail storage for Stage 4 debates.
Target metrics
| Metric | Target | Current (Stage 1 only) |
|---|---|---|
| False Positive Rate (FPR) | < 5% | Measuring in dev |
| False Negative Rate (FNR) | < 10% | Measuring in dev |
| Stage 1 resolution rate | ~80% | Expected |
| Average cost per verification | < $0.005 | ~$0.001 (Stage 1 only) |
| P95 verification latency | < 5s | < 2s (Stage 1 only) |
Audit trail
Every verification decision is logged with:
{
  "test_id": "uuid",
  "scenario_id": "uuid",
  "stage_reached": 1,
  "confidence_scores": [0.982],
  "decision": "pass",
  "reasoning": "Self-confidence score 98.2% exceeds 95% threshold",
  "cost_usd": 0.001,
  "latency_ms": 1200,
  "timestamp": "2026-04-07T10:30:00Z"
}
For Stage 4, the audit trail includes the full debate transcript — prosecutor arguments, defender arguments, judge reasoning, and the final verdict.
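As an illustration, the log entry could be modeled as a typed record that serializes to the JSON shape shown above. The class name `AuditRecord` and the optional `debate_transcript` field are assumptions made for this sketch. The text says Stage 4 entries include the transcript but does not name the field that holds it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AuditRecord:
    test_id: str
    scenario_id: str
    stage_reached: int
    confidence_scores: List[float]
    decision: str
    reasoning: str
    cost_usd: float
    latency_ms: int
    timestamp: str  # ISO 8601, UTC
    # Hypothetical field: only populated when stage_reached is 4.
    debate_transcript: Optional[List[str]] = None

    def to_json(self) -> str:
        """Serialize the record, omitting the transcript when there is none."""
        record = asdict(self)
        if record["debate_transcript"] is None:
            del record["debate_transcript"]
        return json.dumps(record, indent=2)


example = AuditRecord(
    test_id="uuid",
    scenario_id="uuid",
    stage_reached=1,
    confidence_scores=[0.982],
    decision="pass",
    reasoning="Self-confidence score 98.2% exceeds 95% threshold",
    cost_usd=0.001,
    latency_ms=1200,
    timestamp="2026-04-07T10:30:00Z",
)
```

Calling `example.to_json()` produces the same keys as the entry above, with no transcript key because the result resolved at Stage 1.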
What's next
- 5-layer deep dive — Full pipeline architecture
- Model orchestration — Model tier routing and cost optimization