Verification Cascade

The Verification Cascade is a four-stage, confidence-gated system designed to keep Aucert's false positive rate below 5%. Each stage increases verification rigor (and cost): most results resolve cheaply at Stage 1, and expensive multi-model verification is reserved for ambiguous cases.

Stage 1: Self-confidence (~$0.001/test)

The L3 Analysis model reports its own confidence score alongside each observation. If the score is ≥95%, the result passes without further verification.

How it works: The analysis prompt includes an instruction to rate confidence on a 0-100 scale. The model evaluates: "How certain am I that this screenshot shows [expected behavior]?"

Resolution rate: ~80% of results. Most test steps produce clear pass/fail signals. A login attempt, for example, either lands on the home screen or shows an error, leaving little ambiguity.

Example:

Test: Navigate to Settings → Toggle dark mode → Verify background changes
Confidence: 98.2%
Decision: PASS (above 95% threshold)
Cost: $0.001 (single model invocation)
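
The threshold check described above can be sketched in a few lines. This is a minimal illustration, not Aucert's implementation: the names `Stage1Result`, `parse_confidence`, and `stage1_gate` are hypothetical, and the sketch assumes the model's 0-100 score has already been extracted from its response.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.95  # the 95% gate described in this section


@dataclass
class Stage1Result:
    verdict: str       # "pass" or "fail", as reported by the analysis model
    confidence: float  # self-reported confidence, scaled to 0.0-1.0


def parse_confidence(raw_score: float) -> float:
    """Clamp a 0-100 score from the model and scale it to 0.0-1.0."""
    return min(max(raw_score, 0.0), 100.0) / 100.0


def stage1_gate(result: Stage1Result, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return "resolved" when the score clears the threshold, else "escalate"."""
    return "resolved" if result.confidence >= threshold else "escalate"


# The dark-mode example: 98.2% confidence clears the 95% gate.
dark_mode = Stage1Result(verdict="pass", confidence=parse_confidence(98.2))
print(stage1_gate(dark_mode))
```

Keeping the threshold as a parameter rather than a hard-coded constant leaves room for the per-customer thresholds listed under planned work.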

Stage 2: Self-consistency (~$0.003/test)

The same model re-evaluates the result three times, varying the temperature and rephrasing the instructions each time. If all three evaluations reach the same conclusion, the result is confirmed.

Why it works: Inconsistent reasoning indicates the model is uncertain. If the model says "pass" with one prompt but "fail" with a slightly different framing, the result is genuinely ambiguous.

Resolution rate: ~15% of all results, drawn from the ~20% that did not resolve at Stage 1.

Example:

Test: Verify "Add to Cart" button is disabled when item is out of stock
Evaluation 1: FAIL (confidence 87%) — "Button appears tappable"
Evaluation 2: FAIL (confidence 91%) — "Button lacks disabled styling"
Evaluation 3: FAIL (confidence 85%) — "Button does not have reduced opacity"
Verdict: 3/3 agree FAIL → confirmed FAIL
Cost: $0.003 (3x model invocation)
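
The unanimity rule can be sketched as follows. The `Evaluator` signature and the stub model are assumptions made for illustration; a real evaluator would call the analysis model with a rephrased prompt and a different temperature.

```python
from typing import Callable, List, Optional

# Hypothetical evaluator: takes one prompt variant, returns "pass" or "fail".
Evaluator = Callable[[str], str]


def self_consistency(evaluate: Evaluator, prompt_variants: List[str]) -> Optional[str]:
    """Return the shared verdict if every variant agrees, else None (escalate)."""
    verdicts = [evaluate(prompt) for prompt in prompt_variants]
    if verdicts and len(set(verdicts)) == 1:
        return verdicts[0]
    return None


# Stub standing in for the model: always reports the button as enabled.
def stub_model(prompt: str) -> str:
    return "fail"


variants = [
    "Is the 'Add to Cart' button disabled?",
    "Does the button show disabled styling?",
    "Does the button have reduced opacity?",
]
print(self_consistency(stub_model, variants))
```

A `None` return is the signal to move on to Stage 3 rather than a verdict in its own right.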

Stage 3: Cross-model vote (~$0.01/test)

Multiple model tiers evaluate the result independently. A majority vote decides. This catches model-specific biases — one model might consistently misinterpret certain UI patterns that another handles correctly.

Models used: The analysis model (DeepSeek V3.2) plus the reasoning model (GPT-5.1-Codex). Each evaluates independently without seeing the other's output.

Resolution rate: ~4% of results.

Example:

Test: Verify loading spinner disappears within 3 seconds
DeepSeek V3.2: FAIL — "Spinner still visible in frame 3"
GPT-5.1-Codex: PASS — "Spinner is transitioning out, partially visible due to animation"
GPT-5.1-Codex (second eval): PASS — "Frame timing shows spinner at 2.8s, within threshold"
Verdict: 2/3 vote PASS → PASS
Cost: $0.01 (3 model invocations across 2 tiers)
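
A strict-majority vote over independent evaluations might look like the sketch below. The function name and the dictionary keys are illustrative; the source does not say how a tie is handled, so this sketch returns `None` on a tie, on the assumption that the case would then escalate to Stage 4.

```python
from collections import Counter
from typing import Dict, Optional


def majority_vote(votes: Dict[str, str]) -> Optional[str]:
    """Return the verdict held by a strict majority of voters, else None."""
    if not votes:
        return None
    verdict, count = Counter(votes.values()).most_common(1)[0]
    return verdict if count > len(votes) / 2 else None


# The loading-spinner example: one FAIL, two PASS.
spinner_votes = {
    "DeepSeek V3.2": "fail",
    "GPT-5.1-Codex": "pass",
    "GPT-5.1-Codex (second eval)": "pass",
}
print(majority_vote(spinner_votes))
```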

Stage 4: Structured debate (~$0.05/test)

Two model instances argue opposing conclusions with structured arguments. A judge model evaluates the quality of arguments and renders a final decision. The full debate transcript is preserved as an audit trail.

Format:

Prosecutor argues the test failed, citing specific evidence from screenshots
Defender argues the test passed, providing counter-evidence
Each side gets 2 rounds of rebuttal
Judge evaluates argument quality and renders a verdict with written reasoning

Resolution rate: ~1% of results — only the most ambiguous cases reach this stage.

Example:

Test: Verify search results update when filter is applied
Prosecutor: "Results list is identical before and after applying the 'Price: Low to High' filter"
Defender: "Results are the same items but the order has changed — note item 2 and item 3 swapped"
Judge: "Defender's argument is correct. Visual diff shows identical items in different order.
        The filter is a sort operation, not a filter. PASS."
Cost: $0.05 (7 model invocations: 2 sides × 2 rounds + judge × 3 evaluations)
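
The debate format can be sketched as a small orchestration loop. Everything here is an assumption made for illustration: the `Model` interface, the `Debate` record, and in particular the rule that the three judge evaluations are combined by majority, which the source does not specify. The judge stub returns only a verdict, whereas the real judge also writes out its reasoning.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical model interface: reads the transcript so far, returns text.
Model = Callable[[List[str]], str]


@dataclass
class Debate:
    transcript: List[str] = field(default_factory=list)  # kept as the audit trail
    invocations: int = 0


def run_debate(prosecutor: Model, defender: Model, judge: Model,
               rounds: int = 2, judge_evals: int = 3) -> Tuple[str, Debate]:
    debate = Debate()
    # Each side argues once per round, seeing everything said so far.
    for round_no in range(1, rounds + 1):
        for role, model in (("Prosecutor", prosecutor), ("Defender", defender)):
            argument = model(debate.transcript)
            debate.transcript.append(f"{role} (round {round_no}): {argument}")
            debate.invocations += 1
    # The judge evaluates the finished transcript several times.
    rulings = []
    for _ in range(judge_evals):
        rulings.append(judge(debate.transcript))
        debate.invocations += 1
    # Assumption: the most common ruling wins.
    verdict = Counter(rulings).most_common(1)[0][0]
    debate.transcript.append(f"Judge: {verdict}")
    return verdict, debate


verdict, debate = run_debate(
    prosecutor=lambda t: "Results list is identical before and after the filter",
    defender=lambda t: "Same items, but item 2 and item 3 swapped",
    judge=lambda t: "pass",
)
print(verdict, debate.invocations)
```

With the defaults, the loop makes 2 sides × 2 rounds + 3 judge evaluations = 7 invocations, matching the count in the example.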

Cost analysis

Stage	Cost/test	Cumulative	% of results	Monthly cost (1000 tests/day)
Stage 1	$0.001	$0.001	80%	$24
Stage 2	$0.003	$0.004	15%	$13.50
Stage 3	$0.01	$0.014	4%	$12
Stage 4	$0.05	$0.064	1%	$15
Weighted average		$0.0022		$64.50

The weighted average cost per verification is ~$0.002 — well under the $0.005 target.
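
The arithmetic behind the cost table can be reproduced with a short script. It takes the per-stage costs and resolution rates from the table and assumes a 30-day month, which is what the Stage 1 monthly figure implies. The `cumulative` option is an extra, hypothetical reading in which a result that resolves at a later stage also pays for every earlier stage.

```python
# (stage, cost per test in USD, share of results resolved at that stage)
STAGES = [
    ("Stage 1", 0.001, 0.80),
    ("Stage 2", 0.003, 0.15),
    ("Stage 3", 0.010, 0.04),
    ("Stage 4", 0.050, 0.01),
]
TESTS_PER_DAY = 1000
DAYS_PER_MONTH = 30  # assumption, implied by the $24 Stage 1 figure


def weighted_average(stages, cumulative=False):
    """Average cost per verification, weighted by each stage's share."""
    total = 0.0
    running = 0.0
    for _name, cost, share in stages:
        running += cost  # cost of this stage plus all earlier stages
        total += share * (running if cumulative else cost)
    return total


per_stage = weighted_average(STAGES)
with_earlier_stages = weighted_average(STAGES, cumulative=True)
monthly = per_stage * TESTS_PER_DAY * DAYS_PER_MONTH

print(f"per-stage weighting:  ${per_stage:.5f}/test, ${monthly:.2f}/month")
print(f"cumulative weighting: ${with_earlier_stages:.5f}/test")
```

Under either weighting the average stays below the $0.005 target.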

MVP status

Phase 1 uses Stage 1 only (confidence threshold). Results below 95% are flagged for human review rather than escalating through Stages 2-4.

What's built: The confidence scoring pipeline in L3/L4 and the threshold-based pass/fail logic.

What's planned: Stages 2-4 implementation, configurable thresholds per customer, and the audit trail storage for Stage 4 debates.

Target metrics

Metric	Target	Current (Stage 1 only)
False Positive Rate (FPR)	< 5%	Measuring in dev
False Negative Rate (FNR)	< 10%	Measuring in dev
Stage 1 resolution rate	~80%	Expected
Average cost per verification	< $0.005	~$0.001 (Stage 1 only)
P95 verification latency	< 5s	< 2s (Stage 1 only)

Audit trail

Every verification decision is logged with:

{
  "test_id": "uuid",
  "scenario_id": "uuid",
  "stage_reached": 1,
  "confidence_scores": [0.982],
  "decision": "pass",
  "reasoning": "Self-confidence score 98.2% exceeds 95% threshold",
  "cost_usd": 0.001,
  "latency_ms": 1200,
  "timestamp": "2026-04-07T10:30:00Z"
}

For Stage 4, the audit trail includes the full debate transcript — prosecutor arguments, defender arguments, judge reasoning, and the final verdict.
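
One way to model the audit record in code is a dataclass that serializes to the JSON shape shown above. This is a sketch only: the `AuditRecord` class, its validation rules, and the `debate_transcript` field name are assumptions, since the source does not name the field that carries the Stage 4 transcript.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AuditRecord:
    test_id: str
    scenario_id: str
    stage_reached: int
    confidence_scores: List[float]
    decision: str
    reasoning: str
    cost_usd: float
    latency_ms: int
    timestamp: str
    debate_transcript: Optional[List[str]] = None  # hypothetical field name

    def __post_init__(self) -> None:
        if self.stage_reached not in (1, 2, 3, 4):
            raise ValueError("stage_reached must be between 1 and 4")
        if self.debate_transcript is not None and self.stage_reached != 4:
            raise ValueError("only Stage 4 records carry a debate transcript")

    def to_json(self) -> str:
        data = asdict(self)
        # Omit the transcript key entirely for records from Stages 1-3.
        if data["debate_transcript"] is None:
            del data["debate_transcript"]
        return json.dumps(data, indent=2)


record = AuditRecord(
    test_id="uuid",
    scenario_id="uuid",
    stage_reached=1,
    confidence_scores=[0.982],
    decision="pass",
    reasoning="Self-confidence score 98.2% exceeds 95% threshold",
    cost_usd=0.001,
    latency_ms=1200,
    timestamp="2026-04-07T10:30:00Z",
)
print(record.to_json())
```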

What's next

5-layer deep dive — Full pipeline architecture
Model orchestration — Model tier routing and cost optimization
