Verification Cascade

The Verification Cascade is a 4-stage, confidence-gated system designed to keep Aucert's false positive rate below 5%. Each stage increases verification rigor (and cost): most results resolve cheaply at Stage 1, and expensive multi-model verification is reserved for ambiguous cases.

Stage 1: Self-confidence (~$0.001/test)

The L3 Analysis model reports its own confidence score alongside each observation. If the score is ≥95%, the result passes without further verification.

How it works: The analysis prompt includes an instruction to rate confidence on a 0-100 scale. The model evaluates: "How certain am I that this screenshot shows [expected behavior]?"

Resolution rate: ~80% of results. Most test steps produce clear pass/fail signals: a login attempt either lands on the home screen or shows an error, leaving little ambiguity.

Example:

Test: Navigate to Settings → Toggle dark mode → Verify background changes
Confidence: 98.2%
Decision: PASS (above 95% threshold)
Cost: $0.001 (single model invocation)
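
The Stage 1 gate can be sketched in a few lines of Python. This is a minimal illustration, not Aucert's implementation: the `Observation` type, the `stage1_gate` name, and the `"escalate"` return value are hypothetical, and the sketch assumes the model reports a pass/fail verdict together with its confidence expressed as a fraction.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.95  # the 95% threshold from the text, as a fraction


@dataclass
class Observation:
    verdict: str       # "pass" or "fail", as reported by the analysis model (assumed shape)
    confidence: float  # self-reported confidence, 0.0 to 1.0


def stage1_gate(obs: Observation, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Accept the model's verdict if it is confident enough, otherwise escalate."""
    if obs.confidence >= threshold:
        return obs.verdict
    return "escalate"  # hypothetical marker for "needs a later stage or human review"


if __name__ == "__main__":
    print(stage1_gate(Observation("pass", 0.982)))  # confident result is accepted
    print(stage1_gate(Observation("fail", 0.87)))   # uncertain result is escalated
```

Keeping the threshold as a parameter leaves room for the per-customer thresholds mentioned under planned work.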

Stage 2: Self-consistency (~$0.003/test)

The same model re-evaluates the result 3 times with different prompts (temperature variation + rephrased instructions). If all 3 evaluations agree on the same conclusion, the result is confirmed.

Why it works: Inconsistent reasoning indicates the model is uncertain. If the model says "pass" with one prompt but "fail" with a slightly different framing, the result is genuinely ambiguous.

Resolution rate: ~15% of all results, i.e. about three quarters of the ~20% that did not resolve at Stage 1.

Example:

Test: Verify "Add to Cart" button is disabled when item is out of stock
Evaluation 1: FAIL (confidence 87%) — "Button appears tappable"
Evaluation 2: FAIL (confidence 91%) — "Button lacks disabled styling"
Evaluation 3: FAIL (confidence 85%) — "Button does not have reduced opacity"
Verdict: 3/3 agree FAIL → confirmed FAIL
Cost: $0.003 (3x model invocation)
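
A rough sketch of the unanimity check follows. The `evaluate` callable stands in for a model invocation and is hypothetical, as are the function name and the `"escalate"` marker; the only rule taken from the text is that all evaluations must agree.

```python
from typing import Callable, Sequence


def stage2_self_consistency(
    evaluate: Callable[[str, float], str],
    prompts: Sequence[str],
    temperatures: Sequence[float],
) -> str:
    """Run one evaluation per (prompt, temperature) pair and require unanimity."""
    verdicts = [evaluate(p, t) for p, t in zip(prompts, temperatures)]
    if verdicts and len(set(verdicts)) == 1:
        return verdicts[0]  # every evaluation agreed: confirm the verdict
    return "escalate"       # any disagreement means the result is ambiguous


if __name__ == "__main__":
    # Stub model that always answers "fail", mirroring the example above.
    always_fail = lambda prompt, temperature: "fail"
    prompts = ["Is the button disabled?", "Can the button be tapped?", "Describe the button state."]
    print(stage2_self_consistency(always_fail, prompts, [0.2, 0.5, 0.8]))
```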

Stage 3: Cross-model vote (~$0.01/test)

Multiple model tiers evaluate the result independently. A majority vote decides. This catches model-specific biases — one model might consistently misinterpret certain UI patterns that another handles correctly.

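Models used: The analysis model (DeepSeek V3.2) plus the reasoning model (GPT-5.1-Codex). Each evaluation runs independently, without seeing the others' output. In the example below, the reasoning model evaluates twice, giving three votes across the two tiers.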

Resolution rate: ~4% of results.

Example:

Test: Verify loading spinner disappears within 3 seconds
DeepSeek V3.2: FAIL — "Spinner still visible in frame 3"
GPT-5.1-Codex: PASS — "Spinner is transitioning out, partially visible due to animation"
GPT-5.1-Codex (second eval): PASS — "Frame timing shows spinner at 2.8s, within threshold"
Verdict: 2/3 vote PASS → PASS
Cost: $0.01 (3 model invocations across 2 tiers)
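
The vote itself reduces to a strict-majority count. The sketch below assumes each evaluation has already been reduced to a `"pass"` or `"fail"` string; the function name and the `"escalate"` fallback for ties are illustrative choices, not documented behavior.

```python
from collections import Counter
from typing import Sequence


def stage3_majority_vote(votes: Sequence[str]) -> str:
    """Return the verdict held by a strict majority of votes, else escalate."""
    if not votes:
        raise ValueError("at least one vote is required")
    verdict, count = Counter(votes).most_common(1)[0]
    if count * 2 > len(votes):
        return verdict
    return "escalate"  # no strict majority (possible with an even number of votes)


if __name__ == "__main__":
    # One vote from the analysis model, two from the reasoning model, as in the example.
    print(stage3_majority_vote(["fail", "pass", "pass"]))
```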

Stage 4: Structured debate (~$0.05/test)

Two model instances argue opposing conclusions with structured arguments. A judge model evaluates the quality of arguments and renders a final decision. The full debate transcript is preserved as an audit trail.

Format:

  1. Prosecutor argues the test failed, citing specific evidence from screenshots
  2. Defender argues the test passed, providing counter-evidence
  3. Each side gets 2 rounds of rebuttal
  4. Judge evaluates argument quality and renders a verdict with written reasoning

Resolution rate: ~1% of results — only the most ambiguous cases reach this stage.

Example:

Test: Verify search results update when filter is applied
Prosecutor: "Results list is identical before and after applying the 'Price: Low to High' filter"
Defender: "Results are the same items but the order has changed — note item 2 and item 3 swapped"
Judge: "Defender's argument is correct. Visual diff shows identical items in different order.
The filter is a sort operation, not a filter. PASS."
Cost: $0.05 (7 model invocations: 2 sides × 2 rounds + judge × 3 evaluations)
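
One way to orchestrate such a debate is sketched below. The `argue` and `judge` callables are hypothetical stand-ins for model invocations, and combining the three judge evaluations by majority is an assumption; the text only states that the judge evaluates three times. The call count matches the cost line above: 2 sides × 2 rounds plus 3 judge evaluations.

```python
from collections import Counter
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (role, statement) pairs, kept for the audit trail


def run_debate(
    argue: Callable[[str, Transcript], str],
    judge: Callable[[Transcript], str],
    rounds: int = 2,
    judge_evals: int = 3,
) -> Tuple[str, Transcript]:
    """Alternate prosecutor and defender turns, then let the judge rule."""
    transcript: Transcript = []
    for _ in range(rounds):
        for role in ("prosecutor", "defender"):
            # Each side sees a copy of everything said so far.
            statement = argue(role, list(transcript))
            transcript.append((role, statement))
    # Assumption: the judge's repeated rulings are combined by majority.
    rulings = [judge(list(transcript)) for _ in range(judge_evals)]
    verdict = Counter(rulings).most_common(1)[0][0]
    return verdict, transcript


if __name__ == "__main__":
    stub_argue = lambda role, history: f"{role} statement #{len(history) + 1}"
    stub_judge = lambda history: "pass"
    verdict, transcript = run_debate(stub_argue, stub_judge)
    print(verdict, len(transcript))
```

Returning the transcript alongside the verdict keeps the audit record and the decision in one place.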

Cost analysis

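| Stage | Cost/test | Cumulative | % of results | Monthly cost (1000 tests/day) |
| --- | --- | --- | --- | --- |
| Stage 1 | $0.001 | $0.001 | 80% | $24 |
| Stage 2 | $0.003 | $0.004 | 15% | $13.50 |
| Stage 3 | $0.01 | $0.014 | 4% | $12 |
| Stage 4 | $0.05 | $0.064 | 1% | $15 |
| Weighted average | $0.00215 | | | $64.50 |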

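The weighted average cost per verification is ~$0.002, well under the $0.005 target. This figure charges each result only for the stage at which it resolves; charging escalated results for the earlier stages as well (the Cumulative column) gives ~$0.0026, still under the target.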
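
The arithmetic behind these figures can be checked with a short script. It uses only the per-stage costs and resolution shares from the table; the 30-day month and all function names are assumptions made for illustration.

```python
# (stage name, cost per test in USD, share of results resolved at that stage)
STAGES = [
    ("Stage 1", 0.001, 0.80),
    ("Stage 2", 0.003, 0.15),
    ("Stage 3", 0.010, 0.04),
    ("Stage 4", 0.050, 0.01),
]
TESTS_PER_MONTH = 1000 * 30  # 1000 tests/day, assuming a 30-day month


def resolving_stage_average(stages):
    """Average cost when each result pays only for the stage that resolves it."""
    return sum(cost * share for _, cost, share in stages)


def cumulative_average(stages):
    """Average cost when each result also pays for every earlier stage."""
    running = 0.0
    total = 0.0
    for _, cost, share in stages:
        running += cost
        total += running * share
    return total


def monthly_costs(stages, tests_per_month=TESTS_PER_MONTH):
    """Monthly spend attributed to each stage (resolving-stage accounting)."""
    return {name: cost * share * tests_per_month for name, cost, share in stages}


if __name__ == "__main__":
    for name, usd in monthly_costs(STAGES).items():
        print(f"{name}: ${usd:.2f}/month")
    print(f"Resolving-stage average: ${resolving_stage_average(STAGES):.5f}/test")
    print(f"Cumulative average:      ${cumulative_average(STAGES):.5f}/test")
```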

MVP status

Phase 1 uses Stage 1 only (confidence threshold). Results below 95% are flagged for human review rather than escalating through Stages 2-4.

What's built: The confidence scoring pipeline in L3/L4 and the threshold-based pass/fail logic.

What's planned: Stages 2-4 implementation, configurable thresholds per customer, and the audit trail storage for Stage 4 debates.

Target metrics

| Metric | Target | Current (Stage 1 only) |
| --- | --- | --- |
| False Positive Rate (FPR) | < 5% | Measuring in dev |
| False Negative Rate (FNR) | < 10% | Measuring in dev |
| Stage 1 resolution rate | ~80% | Expected |
| Average cost per verification | < $0.005 | ~$0.001 (Stage 1 only) |
| P95 verification latency | < 5s | < 2s (Stage 1 only) |

Audit trail

Every verification decision is logged with:

{
  "test_id": "uuid",
  "scenario_id": "uuid",
  "stage_reached": 1,
  "confidence_scores": [0.982],
  "decision": "pass",
  "reasoning": "Self-confidence score 98.2% exceeds 95% threshold",
  "cost_usd": 0.001,
  "latency_ms": 1200,
  "timestamp": "2026-04-07T10:30:00Z"
}

For Stage 4, the audit trail includes the full debate transcript — prosecutor arguments, defender arguments, judge reasoning, and the final verdict.
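
A typed record makes the log format easier to produce consistently. The sketch below mirrors the fields of the JSON entry above; the class and function names are illustrative, and `debate_transcript` is a hypothetical field name for the Stage 4 transcript, which the text describes but does not name.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class VerificationRecord:
    test_id: str
    scenario_id: str
    stage_reached: int
    confidence_scores: List[float]
    decision: str
    reasoning: str
    cost_usd: float
    latency_ms: int
    timestamp: str
    # Hypothetical field: only populated when the cascade reaches Stage 4.
    debate_transcript: Optional[List[dict]] = None


def to_audit_json(record: VerificationRecord) -> str:
    """Serialize a record, omitting the transcript when there is none."""
    payload = asdict(record)
    if payload["debate_transcript"] is None:
        del payload["debate_transcript"]
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    record = VerificationRecord(
        test_id="uuid",
        scenario_id="uuid",
        stage_reached=1,
        confidence_scores=[0.982],
        decision="pass",
        reasoning="Self-confidence score 98.2% exceeds 95% threshold",
        cost_usd=0.001,
        latency_ms=1200,
        timestamp="2026-04-07T10:30:00Z",
    )
    print(to_audit_json(record))
```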

What's next