0 Overview
This walkthrough explains the validation-graph design end-to-end: what it is, why each decision was made, and how it fits together. It is aimed at the engineering team, to build shared understanding before implementation begins.
The validation graph is the knowledge layer of the Aucert product. It comprises two kinds of graph (one Tenant Graph per customer plus one shared Ecosystem Graph), enforces multi-tenant isolation and ACLs, supports natural-language and structured queries, and is continuously enriched by autonomous agents (the rover swarm). It is the substrate against which everything else is validated.
For the full reasoning, alternatives, and worked examples, see the design notes and SPEC-035.
1 Premise: a two-graph knowledge layer
Aucert's product depends on a substrate that captures everything we know — about apps, policies, bugs, workflows, people, and the patterns that emerge across customers.
Two graphs, not one
The knowledge layer comprises two graphs:
- Tenant Graph (per-customer, private). Holds everything one customer owns or discovers — apps, bugs, policies, people, processes. Deployable in Aucert cloud OR customer-hosted on-prem. The thing the customer "takes with them."
- Ecosystem Graph (singular, Aucert-owned, shared). Holds curated ontology, abstracted patterns, public policies (Google Play, App Store), and learnings derived across the tenant base. The "holy grail" of mobile testing knowledge.
flowchart LR
subgraph Customers["Per-customer (cloud OR on-prem)"]
TG1["Tenant Graph
(Acme)"]
TG2["Tenant Graph
(Beta Co)"]
TG3["Tenant Graph
(...)"]
end
subgraph Aucert["Aucert (cloud-only)"]
EG["Ecosystem Graph
canonical ontology
+ patterns + policies"]
end
TG1 -->|references via eco: prefix| EG
TG2 -->|references via eco: prefix| EG
TG3 -->|references via eco: prefix| EG
TG1 -.promotion pipeline.-> EG
TG2 -.promotion pipeline.-> EG
Why two graphs?
Tenant data is private and customer-owned. Ecosystem data is shared abstracted learnings owned by Aucert. Mixing them would either leak customer data or restrict ecosystem flexibility. Separating them gives both proper governance.
Validation substrate
At end-state, both graphs act as a validation substrate. Downstream consumers (the 5-layer testing pipeline, the rover, agents) use the graph to validate test outcomes, decisions, and app behavior. The Tenant Graph also has an internal validation pipeline because raw rover observations arrive unverified.
2 Design principles
Thirteen user-stated principles shape every decision. If a decision violates a principle, the principle wins until amended.
3 Substrate: how facts are represented
Every fact about the world is a claim. The graph is the materialized current-truth view; the claim log is the append-only source of truth.
flowchart TB
R[Rover / Agent / Human / Integration] -->|asserts claim| CL[Append-only Claim Log
D3 + C1]
CL -->|materialized into| GE[Graph Entity
nodes + edges]
GE -->|fast hot reads| Q[Query Layer]
CL -->|cold path: provenance, audit, conflict| Q
R2[Rover bug discovered] -->|retraction claim D10| CL
CL -->|recompute current truth| GE
Key model decisions
- retracts; targets a prior claim. Append-only. Restoration = retract the retraction. Multi-party conflicts preserved (rover-A retracted, admin un-retracted).
- subject_context is a Map<variable_id, value>. tenant_id and partition_id stay fixed envelope fields outside the bindings (storage isolation primitives).
- asserted_at (mandatory system time) + optional valid_from / valid_to. Engine answers both "what's true now?" and "what did we believe at time T?"
- unverified → pending_review → verified | rejected | disputed. Transitions are themselves claims; full audit of how status evolved.
- creator (immutable audit) + owner (transferable accountable) + maintainer(s) (collaborators). Single-owner field is too coarse for an agent-driven graph.
- entity_id. Identity resolution at write is writer-side. Split via (predicate: split_into, targets: [...]); merge via (predicate: merged_into, target: canonical). Both reversible via standard retraction.
Worked example: the rover discovers a screen
{
"claim_id": "clm_01h3v2k...",
"tenant_id": "acme",
"partition_id": "payments-app",
"subject_context": {
"env": "prod",
"app_version": "3.5.0",
"rn_version": "0.71.4",
"enable_wallet": true,
"device_type": "android"
},
"target_type": "NODE",
"target_id": "wallet-screen",
"asserted_by": "agent:rover-v3",
"asserted_at": "2026-05-15T14:32:00Z",
"valid_from": "2026-05-15T14:32:00Z",
"confidence": 0.93,
"validation_status": "unverified",
"value_payload": { "exists": true, "screen_role": "Wallet" }
}
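The retraction and "what did we believe at time T?" behavior described in the key model decisions can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the `Claim` class, its `retracts` field, and the helper names are hypothetical simplifications of the claim envelope shown above, and the materialized view is reduced to "latest active claim wins".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    claim_id: str
    target_id: str
    asserted_at: str                 # ISO-8601 UTC in one fixed format, so string order == time order
    value_payload: Optional[dict] = None
    retracts: Optional[str] = None   # hypothetical field: claim_id of the claim being retracted


def is_active(claim_id, log, as_of):
    """A claim is active unless an *active* claim asserted by `as_of` retracts it."""
    for c in log:
        if c.retracts == claim_id and c.asserted_at <= as_of:
            if is_active(c.claim_id, log, as_of):
                return False
    return True


def current_truth(log, target_id, as_of):
    """Latest active, non-retraction claim about `target_id` as believed at `as_of`."""
    candidates = [
        c for c in log
        if c.target_id == target_id
        and c.retracts is None
        and c.asserted_at <= as_of
        and is_active(c.claim_id, log, as_of)
    ]
    return max(candidates, key=lambda c: c.asserted_at, default=None)


# The log is only ever appended to; nothing is updated or deleted.
log = [
    Claim("c1", "wallet-screen", "2026-05-15T14:32:00Z", {"exists": True}),
    Claim("c2", "wallet-screen", "2026-05-16T09:00:00Z", retracts="c1"),  # rover retracts
    Claim("c3", "wallet-screen", "2026-05-17T09:00:00Z", retracts="c2"),  # admin un-retracts
]
```

Querying with an `as_of` before, between, and after the two retraction claims returns `c1`, then nothing, then `c1` again, which is the "retract the retraction" restoration path with the full conflict history still in the log.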
4 Tenancy & partitions
One logical Tenant Graph per customer, internally organized by named partitions.
flowchart TB
subgraph TG["Tenant Graph (Acme)"]
P1["payments-app partition"]
P2["hr partition"]
P3["policies partition"]
P4["_acl reserved partition
(Roles, Policies, Tags, Mappings)"]
P5["_types reserved partition
(type definitions)"]
end
P1 -.tagged 'project-redesign-2026'.-> T1[Tag entity]
P3 -.tagged.-> T1
- _.payments-app, hr, etc. Partitions are first-class organizing scopes: natural ACL boundaries, natural promotion scopes, natural blast-radius limits.
- has_tag(target, tag_id) plugs into CEL filters. Solves "working group X gets permissions on a project's entities" without graph traversal; partitions are too structural for transient grouping.
5 Identity & ACL
Identity is externalized; permissions live in the graph; ACL is hybrid (ABAC + role-derived shortcuts).
Identity: externalized via WorkOS
flowchart LR
Okta["Customer IdP
(Okta / Azure AD / Google)"] -->|SCIM/SSO| WorkOS
WorkOS -->|unified API| IDS[Aucert Identity Service]
IDS -->|principal_id refs + events| VG[Validation Graph]
VG -.does NOT mirror.-x WorkOS
- principal_id; no mirroring.
- principal_type discriminator. AGENT principals (rover, atlas, future agents) live as graph entities; HUMAN/GROUP/ORG live in identity service.
- principal_id. Identity service emits canonical events (deactivation, group membership change, deletion); graph reacts via cache invalidation + auto-create/retract of GroupRoleMapping-derived assignments.
ACL evaluation pipeline
flowchart LR
Req["Request: principal P, action A, target T"] --> Eval[ACL Evaluator]
Eval --> R1{D7 role-derived?
creator/owner/maintainer/viewer}
R1 -->|implies| Allow1[ALLOW]
Eval --> R2{Custom role assignment?
D13 RoleAssignment}
R2 -->|matching| Allow2[ALLOW]
Eval --> R3{CEL policy match?
D13.7 _acl partition}
R3 -->|effect| Combine[Combine: deny-overrides D13.4]
R1 --> Combine
R2 --> Combine
Combine --> Result[ALLOW or DENY]
Combine --> Default{No match?}
Default -->|deny-by-default D13.5| DenyDefault[DENY]
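The combining step in the diagram (deny-overrides plus deny-by-default) is small enough to sketch directly. The following is an illustrative sketch only: the `Effect` enum, `combine`, and `evaluate` are hypothetical names, and each of the three rule sources (role-derived, custom role assignment, CEL policy) is modeled as a plain callable that returns an effect or `None` when it has no opinion.

```python
from enum import Enum


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"


def combine(effects):
    """Deny-overrides: any DENY wins. Deny-by-default: no match at all means DENY."""
    effects = list(effects)
    if Effect.DENY in effects:
        return Effect.DENY
    if Effect.ALLOW in effects:
        return Effect.ALLOW
    return Effect.DENY


def evaluate(principal, action, target, sources):
    """Ask every rule source, keep the ones that matched, then combine."""
    matched = [
        effect
        for effect in (source(principal, action, target) for source in sources)
        if effect is not None
    ]
    return combine(matched)


# Hypothetical rule sources standing in for the three branches of the diagram.
def role_derived(principal, action, target):
    return Effect.ALLOW if target.get("owner") == principal else None


def custom_role(principal, action, target):
    return None  # no custom assignment in this example


def cel_policy(principal, action, target):
    return Effect.DENY if target.get("frozen") and action == "entity:update" else None


SOURCES = [role_derived, custom_role, cel_policy]
```

With these sources, an owner can update their own entity, but the same request against a frozen entity is denied because the policy's DENY overrides the role-derived ALLOW; a principal with no matching rule is denied by default.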
D13.2 → D13.7: ACL sub-decisions
- object_filter
- <resource>:<verb>: 22 actions across 9 namespaces (entity / claim / role / permission / tag / variable / condition / validation / ownership). Wildcards via CEL string ops. Action implications explicit.
- traversal_visibility ∈ {visible, hidden, inherit_target}. Default visible returns redacted placeholders for unreadable targets. New entity:traverse action separable from entity:read.
- _acl reserved partition
- rules_evaluated_at_version for honest audit.
Worked example: custom role for a transient working group
# Transient team (D8 GROUP/TEAM principal in identity service)
team "wg-payment-redesign-2026" {
members: [user:alice, user:bob, team:design-leads]
expires_at: "2026-12-31"
}
# Tag relevant entities (D14)
tag "project-payment-redesign-2026" applied to:
- workflow:checkout-v2
- workflow:wallet-integration-v2
- bug:payment-flow-stale-state
# Custom role assignment with tag-based scope
assign role "maintainer"
to team:wg-payment-redesign-2026
on scope: object_filter: 'has_tag(target, "project-payment-redesign-2026")'
expires_at: "2026-12-31"
When the project ends → tag archived → permissions auto-stop applying → team dissolves. Clean.
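How such an assignment stops applying can be sketched as a single predicate. This is a simplified illustration under stated assumptions: the `RoleAssignment` class and `applies` function are hypothetical, the CEL `object_filter` is reduced to a single required tag, an archived tag is modeled as simply absent from the target's active tags, and `expires_at` is treated as inclusive.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RoleAssignment:
    role: str
    principal: str
    required_tag: str   # stands in for object_filter: has_tag(target, "<tag>")
    expires_at: date    # assumed inclusive


def applies(assignment, principals, active_tags, today):
    """True when the caller holds the principal, the target carries the tag, and the grant has not expired."""
    return (
        assignment.principal in principals
        and assignment.required_tag in active_tags
        and today <= assignment.expires_at
    )


assignment = RoleAssignment(
    role="maintainer",
    principal="team:wg-payment-redesign-2026",
    required_tag="project-payment-redesign-2026",
    expires_at=date(2026, 12, 31),
)

# Alice's effective principals: herself plus the team she belongs to.
alice = {"user:alice", "team:wg-payment-redesign-2026"}
checkout_tags = {"project-payment-redesign-2026"}
```

Any one of the three conditions failing is enough to stop the grant: the team membership ending, the tag being archived, or the expiry date passing.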
6 Conditions & variables
Entities carry conditions — structural validity rules expressed in CEL. Variables are first-class.
- subject_context (observation point); entities carry condition (structural validity rule). Clean separation between observation and rule.
- evidence_strength: inferred + derivation chain. Forward and gap-fill on by default for monotonic variables.
- unknown. Per-variable optional default_value. Security guard allows_default: false on tenant_id, partition_id, etc.
Worked example: condition on a Wallet entity
# CEL text (canonical storage form per C3)
version_gte(rn_version, "0.71") &&
version_in_range(app_version, "3.5", "4.0") &&
enable_wallet == true
Variables involved:
- rn_version: L2 ecosystem-shared, type Version, monotonicity mostly-incremental
- app_version: L2 ecosystem-shared, type Version, monotonicity mostly-incremental
- enable_wallet: L3 tenant-specific (acme), type Boolean, monotonicity none
Evaluating against the context {rn_version: "0.72", app_version: "3.7"}: enable_wallet is unbound and has no default, so the result is true && unknown → unknown. A consumer (e.g., L1 generation) sees this and automatically generates both the enable_wallet=true and enable_wallet=false test paths.
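The three-valued evaluation in this example can be sketched in plain Python. This is not CEL and not the engine's evaluator: `and3`, `wallet_condition`, and `expand_unbound` are hypothetical helpers, `None` stands in for `unknown`, and the upper bound of `version_in_range` is assumed to be exclusive.

```python
from itertools import product

UNKNOWN = None  # third truth value, alongside True and False


def and3(*values):
    """Three-valued AND: any False wins, otherwise any UNKNOWN makes the result UNKNOWN."""
    if any(v is False for v in values):
        return False
    if any(v is UNKNOWN for v in values):
        return UNKNOWN
    return True


def parse_version(text):
    return tuple(int(part) for part in text.split("."))


def wallet_condition(ctx):
    """Python rendering of the Wallet condition; unbound variables evaluate to UNKNOWN."""
    rn = ctx.get("rn_version")
    app = ctx.get("app_version")
    wallet = ctx.get("enable_wallet")
    return and3(
        UNKNOWN if rn is None else parse_version(rn) >= parse_version("0.71"),
        # Assumption: range is [3.5, 4.0), i.e. the upper bound is exclusive.
        UNKNOWN if app is None
        else parse_version("3.5") <= parse_version(app) < parse_version("4.0"),
        UNKNOWN if wallet is None else wallet is True,
    )


def expand_unbound(ctx, boolean_vars):
    """Enumerate one context per assignment of the unbound boolean variables."""
    unbound = [name for name in boolean_vars if name not in ctx]
    return [
        {**ctx, **dict(zip(unbound, combo))}
        for combo in product([True, False], repeat=len(unbound))
    ]


ctx = {"rn_version": "0.72", "app_version": "3.7"}
paths = expand_unbound(ctx, ["enable_wallet"])
```

Here `wallet_condition(ctx)` is `UNKNOWN`, and `paths` holds the two contexts a test generator would explore, one per value of `enable_wallet`.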
7 Type system
Three-layer hybrid type registry; required-core + open extensions; single inheritance + interfaces.
| Layer | What | Examples |
|---|---|---|
| L1 — Built-in | Aucert-defined; available to all tenants | Entity, Workflow, Tag, Role, Policy, Variable |
| L2 — Ecosystem-shared | Declared in Ecosystem Graph; canonical cross-tenant meaning | Screen, Component, Bug, TestCase, Document, WikiPage, Spec, GoogleDoc |
| L3 — Tenant-specific | Declared in tenant graph; tenant-prefixed names | acme:LoyaltyPoint, acme:RegionalPricingTier |
- property_schema declares required-core properties (validated at write). Instances may carry additional properties as extensions (accepted, not validated). Promotion to required-core via deliberate schema versioning. Matches how the rover discovers properties before formal declaration.
- extends parent per type, transitive. Multiple implements for cross-cutting capabilities (Searchable, Embedded, Spatial, Localized, Versioned). Polymorphic queries resolve type-set membership.
8 Embeddings & AI integration
AI consumers (rover, atlas, agents) need semantic retrieval, not just symbolic queries.
flowchart LR
subgraph Storage["Storage (D20)"]
V[Vector property on entity
source of truth in graph]
VI[(VectorIndex
derived data)]
V -.replicates.-> VI
end
Q[Query: by-vector / by-entity / by-NL] --> SS[search_semantic D22]
SS -->|filter via CEL| VI
VI --> Results[Top-K results + similarity scores
+ ACL filtering]
- embedding_source (which properties feed the vector).
- VectorIndex abstraction. Implementations: pgvector (default), Qdrant, Pinecone, Null (OSS). Vector dimension per-model; tenant isolation via tenant_id partitioning.
- embedding_model_id. Vector index partitions by model. New model → both partitions live → background re-embed → old decommissioned when coverage threshold met.
Worked NL → search example
# NL: "Find me Wallet-like screens in payments-app that I have access to."
search_semantic({
entity_id: "wallet",
filter: 'object_type(target) == "Screen"
&& entity_partition(target) == "payments-app"',
k: 10
})
# ACL applied automatically; results ranked by similarity.
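To make the by-entity flow concrete, here is a minimal in-memory sketch of what a call like the one above does conceptually: look up the query entity's vector, apply the filter and the ACL check, then rank by cosine similarity. Everything here is an assumption for illustration: the Python `search_semantic` signature, the toy two-dimensional vectors, the `predicate` and `can_read` callables standing in for the CEL filter and ACL evaluator, and the choice to exclude the query entity from its own results.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_semantic(entities, query_id, predicate, can_read, k):
    """Top-k entities most similar to `query_id`, after filter and ACL checks."""
    query_vec = entities[query_id]["vector"]
    scored = [
        (cosine(query_vec, entity["vector"]), entity_id)
        for entity_id, entity in entities.items()
        if entity_id != query_id and predicate(entity) and can_read(entity_id)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(entity_id, score) for score, entity_id in scored[:k]]


entities = {
    "wallet":       {"type": "Screen", "partition": "payments-app", "vector": [1.0, 0.0]},
    "wallet-topup": {"type": "Screen", "partition": "payments-app", "vector": [0.9, 0.1]},
    "checkout":     {"type": "Screen", "partition": "payments-app", "vector": [0.2, 0.8]},
    "payslip":      {"type": "Screen", "partition": "hr",           "vector": [0.95, 0.05]},
    "wallet-admin": {"type": "Screen", "partition": "payments-app", "vector": [1.0, 0.0]},
}

results = search_semantic(
    entities,
    query_id="wallet",
    predicate=lambda e: e["type"] == "Screen" and e["partition"] == "payments-app",
    can_read=lambda entity_id: entity_id != "wallet-admin",  # caller lacks access to this one
    k=10,
)
```

The `payslip` entity is dropped by the partition filter and `wallet-admin` by the ACL check, even though both are close to the query vector; in a real deployment the ranking would be delegated to the vector index rather than computed in application code.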
9 Cross-graph references
Tenant entities can reference ecosystem entities. Compact prefix scheme; one-way only.
eco:<entity_id>[@version] for ecosystem references. Direction restricted to one-way: tenant → ecosystem only. No tenant→tenant or ecosystem→tenant; engine rejects at write.
| Reference | Resolves to |
|---|---|
| scr_01h3... | Intra-tenant entity (current tenant graph) |
| eco:wallet-pattern-v3 | Ecosystem entity, current floating version |
| eco:gdpr-policy-v5@5 | Ecosystem entity, pinned to schema_version 5 |
Lifecycle states (deprecated / sunsetting / superseded / merged_into) returned in resolution metadata; consumer interprets. No "broken" refs.
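The reference syntax above is regular enough to parse with one pattern. The sketch below is illustrative only: `Ref`, `parse_ref`, and the assumption that a pinned version is a plain integer are not taken from the spec, and the identifier `scr_example` in the tests is a made-up intra-tenant ID.

```python
import re
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    graph: str                # "tenant" or "ecosystem"
    entity_id: str
    version: Optional[int]    # None means "current floating version"


# eco:<entity_id>[@version]; the version is assumed to be an integer schema_version.
_ECO_REF = re.compile(r"^eco:(?P<entity_id>[^@\s]+)(?:@(?P<version>\d+))?$")


def parse_ref(ref: str) -> Ref:
    """Classify a reference string as intra-tenant or ecosystem."""
    match = _ECO_REF.match(ref)
    if match:
        version = match.group("version")
        return Ref("ecosystem", match.group("entity_id"), int(version) if version else None)
    if ref.startswith("eco:"):
        raise ValueError(f"malformed ecosystem reference: {ref!r}")
    return Ref("tenant", ref, None)
```

A reference without the `eco:` prefix is treated as an entity in the current tenant graph, which matches the first row of the table; direction checks (rejecting ecosystem → tenant at write) would sit on top of this parser.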
10 API shape
gRPC + protobuf as canonical core; thin adapters for other surfaces.
flowchart TB
subgraph Core["Core API: gRPC + protobuf"]
Ops["Entity CRUD · Claims · Filter queries (CEL)
Traversal · Path queries · search_semantic
Subscriptions · Registry CRUD · Introspection"]
end
Core -.adapter.-> REST[REST adapter
auto-gen from proto]
Core -.adapter.-> NL[NL adapter
LLM-translated]
Core -.adapter.-> GQL["GraphQL adapter
F-graphql-api (deferred)"]
Core -->|direct| Internal[Internal services
L1-L5 pipeline, billing, identity]
REST --> Humans[Humans / scripts]
NL --> Agents[AI agents]
GQL --> UIs[Admin console / dashboards]
11 Deployment & tenancy tiers
Same code path across cloud, on-prem, and OSS — only the entitlement source differs.
flowchart LR
subgraph Cloud["Cloud"]
BS[Billing Service] -->|entitlements.changed events| EE1[Engine: BillingServiceEntitlementSource]
EE1 --> CFG1[ACL + feature gates]
end
subgraph OnPrem["On-prem (commercial license)"]
LK[Cryptographic License Key
RSA/ECDSA signed, verified locally] --> EE2[Engine: LicenseKeyEntitlementSource]
EE2 --> CFG2[ACL + feature gates]
end
subgraph OSS["OSS community (potential future)"]
NoLK[No license; basic features only] --> EE3[Engine: NullEntitlementSource]
EE3 --> CFG3[ACL + basic features only]
end
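The "same code path, different entitlement source" idea can be sketched as one interface with interchangeable implementations. This is a hedged illustration: the `features` method, the feature names, and `StaticEntitlementSource` (a stand-in for the billing-backed and license-key-backed sources, with no signature verification) are all hypothetical.

```python
from typing import Protocol

# Hypothetical feature names; the real feature gates are not listed in this walkthrough.
BASIC_FEATURES = frozenset({"entity_crud", "filter_queries"})


class EntitlementSource(Protocol):
    def features(self, tenant_id: str) -> frozenset: ...


class NullEntitlementSource:
    """OSS mode: no license, basic features only."""

    def features(self, tenant_id: str) -> frozenset:
        return BASIC_FEATURES


class StaticEntitlementSource:
    """Stand-in for a billing- or license-backed source: extras come from a lookup table."""

    def __init__(self, extras_by_tenant):
        self._extras_by_tenant = extras_by_tenant

    def features(self, tenant_id: str) -> frozenset:
        return BASIC_FEATURES | self._extras_by_tenant.get(tenant_id, frozenset())


def feature_enabled(source: EntitlementSource, tenant_id: str, feature: str) -> bool:
    """The engine's feature gate: identical regardless of which source is plugged in."""
    return feature in source.features(tenant_id)


oss = NullEntitlementSource()
licensed = StaticEntitlementSource({"acme": frozenset({"search_semantic"})})
```

The gate function never branches on deployment mode; swapping the source object is the only difference between the three deployments.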
Tiered tenancy + horizontal pod scaling
| Tier | Tenants | Isolation |
|---|---|---|
| Small / hobbyist | 100K+ | RLS-pooled across multiple Postgres pods (5K–10K per pod) |
| Mid-market | 1K–10K | Schema-per-tenant in shared cluster |
| Enterprise | 100–1K | Database-per-tenant or dedicated cluster |
flowchart TB
Router["Routing Layer
(tenant_id → pod_id)"]
Router --> P1["Pod 1
5-10K tenants
RLS"]
Router --> P2["Pod 2
5-10K tenants
RLS"]
Router --> P3["Pod 3
5-10K tenants
RLS"]
Router --> Pn["Pod N
5-10K tenants
RLS"]
tenant_id → pod via global metadata service. Salesforce-pod pattern.
12 Tech choice & migration paths
Postgres + pgvector for both cloud and on-prem on day 1; a pluggable abstraction keeps migration paths open.
flowchart LR
subgraph Day1["Day 1 (cloud + on-prem)"]
PG[PostgresGraphStorageEngine
Postgres 18 + pgvector + JSONB + CTEs]
end
subgraph CloudFuture["Cloud — when triggered"]
Mem[MemgraphGraphStorageEngine]
N4[Neo4jGraphStorageEngine]
Citus[Sharded Postgres / Citus]
end
subgraph OnPrem["On-prem — permanent"]
PGOnPrem[PostgresGraphStorageEngine
same code]
end
PG -->|migration triggers fire| Mem
PG -->|or| N4
PG -->|or| Citus
PG -.permanent.-> PGOnPrem
PostgresGraphStorageEngine for both cloud and on-prem day-1. Cloud migration triggers documented; on-prem stays permanently. Multi-SQL-backend support is custom-engagement only; Postgres bundle ships standalone for "we don't run Postgres" customers.
Migration triggers (any one fires)
| Trigger | Migration target |
|---|---|
| Cloud aggregate ≥ 100M entities | Memgraph / Neo4j / Citus (TBD by measurement) |
| p95 traversal latency >500ms at depth ≥3 sustained | Same |
| Shortest-path latency at depth ≥5 consistently >500ms p95 | Same (likely first to fire) |
| Cloud vector index ≥50M embeddings OR pgvector p95 search >500ms | Qdrant or Pinecone |
| Single-shard write throughput saturated | Citus OR distributed graph engine |
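The "any one fires" rule over the table above can be expressed as a list of threshold checks. This is a sketch under assumptions: the `Metrics` field names are hypothetical, and the "sustained" and "consistently" qualifiers are not modeled, so each value is assumed to already be a sustained measurement.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    # Hypothetical field names; values are assumed to be sustained measurements.
    cloud_entities: int
    traversal_p95_ms_depth3: float
    shortest_path_p95_ms_depth5: float
    vector_embeddings: int
    pgvector_search_p95_ms: float
    write_shard_saturated: bool


def fired_triggers(m: Metrics) -> list:
    """Names of every migration trigger whose threshold is crossed; any one is enough to act."""
    checks = [
        ("cloud aggregate >= 100M entities", m.cloud_entities >= 100_000_000),
        ("traversal p95 > 500ms at depth >= 3", m.traversal_p95_ms_depth3 > 500),
        ("shortest-path p95 > 500ms at depth >= 5", m.shortest_path_p95_ms_depth5 > 500),
        (
            "vector index >= 50M embeddings or pgvector p95 > 500ms",
            m.vector_embeddings >= 50_000_000 or m.pgvector_search_p95_ms > 500,
        ),
        ("single-shard write throughput saturated", m.write_shard_saturated),
    ]
    return [name for name, fired in checks if fired]


snapshot = Metrics(
    cloud_entities=40_000_000,
    traversal_p95_ms_depth3=320.0,
    shortest_path_p95_ms_depth5=640.0,
    vector_embeddings=12_000_000,
    pgvector_search_p95_ms=180.0,
    write_shard_saturated=False,
)
```

In this made-up snapshot only the shortest-path trigger fires, which is the case the table flags as likely to fire first.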
Replay-based per-tenant migration playbook
- Snapshot tenant X at time T from Postgres.
- Migrate snapshot to new engine.
- Replay claims since T from Postgres claim log into new engine.
- Run shadow-reads (queries hit both; results compared).
- Cut over reads to new engine when shadow-read confidence threshold met.
- Keep Postgres write path active for N days; can roll back.
- Decommission Postgres for that tenant.
Per-tenant migration means: small tenants migrate fast; large ones carefully; never "stuck" mid-migration.
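The snapshot, replay, and shadow-read steps of the playbook can be sketched against a toy claim log. This is an illustration only: claims are reduced to dictionaries with integer timestamps, "materializing" is simplified to last-write-wins per target, and `materialize` and `shadow_read_match_ratio` are hypothetical helpers, not part of the design.

```python
def materialize(claims, state=None):
    """Fold claims into a target_id -> value view, oldest first (last write wins)."""
    state = dict(state or {})
    for claim in sorted(claims, key=lambda c: c["asserted_at"]):
        state[claim["target_id"]] = claim["value"]
    return state


def shadow_read_match_ratio(old_state, new_state, keys):
    """Fraction of shadow reads where both engines return the same answer."""
    if not keys:
        return 1.0
    matches = sum(old_state.get(k) == new_state.get(k) for k in keys)
    return matches / len(keys)


claim_log = [
    {"target_id": "wallet-screen", "asserted_at": 1, "value": "v1"},
    {"target_id": "checkout", "asserted_at": 2, "value": "v1"},
    {"target_id": "wallet-screen", "asserted_at": 5, "value": "v2"},  # written after the snapshot
]
snapshot_time = 3

# Steps 1-2: snapshot the tenant at time T and load it into the new engine.
new_engine = materialize(c for c in claim_log if c["asserted_at"] <= snapshot_time)
# Step 3: replay claims asserted since T from the claim log.
new_engine = materialize(
    (c for c in claim_log if c["asserted_at"] > snapshot_time), new_engine
)
# Step 4: shadow reads compare the old engine's view with the new one.
old_engine = materialize(claim_log)
ratio = shadow_read_match_ratio(old_engine, new_engine, ["wallet-screen", "checkout"])
```

Because the claim log is the append-only source of truth, replay after the snapshot converges the new engine on the same view as the old one; cutover would proceed once the ratio meets whatever confidence threshold is chosen.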
13 The five plug-in abstractions
Locked from day 1. Migration paths flow through these.
| Abstraction | Day-1 implementation | Future implementations |
|---|---|---|
| GraphStorageEngine | PostgresGraphStorageEngine | Memgraph / Neo4j / Sharded Postgres |
| VectorIndex | PgvectorVectorIndex | Qdrant / Pinecone / Null |
| EntitlementSource | Billing (cloud) / License (on-prem) / Null (OSS) | (covered) |
| IdentityServiceClient | gRPC to Aucert identity service | (single; identity service swappable) |
| EventBus | Postgres LISTEN/NOTIFY | Kafka / Redpanda |
Example call site: graph.findShortestPath(from, to, maxDepth=10). The Postgres adapter implements this with CTEs internally; the eventual Neo4j adapter uses Cypher's shortestPath. Same call site; different backend. No SQL or Cypher above the abstraction.
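The shape of that boundary can be sketched with an interface and one toy backend. This is an assumption-laden illustration: the Python method is named `find_shortest_path` to follow Python naming conventions rather than the `findShortestPath` call shown above, and `InMemoryGraphStorageEngine` (breadth-first search over directed edges) is a stand-in for the Postgres and Neo4j adapters, not one of the planned implementations.

```python
from collections import deque
from typing import Optional, Protocol


class GraphStorageEngine(Protocol):
    def find_shortest_path(
        self, src: str, dst: str, max_depth: int = 10
    ) -> Optional[list]: ...


class InMemoryGraphStorageEngine:
    """Toy backend: breadth-first search over directed edges held in memory."""

    def __init__(self, edges):
        self._adjacent = {}
        for a, b in edges:
            self._adjacent.setdefault(a, []).append(b)

    def find_shortest_path(self, src, dst, max_depth=10):
        queue = deque([[src]])
        seen = {src}
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            if len(path) - 1 >= max_depth:   # path already uses max_depth hops
                continue
            for neighbor in self._adjacent.get(path[-1], []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(path + [neighbor])
        return None


def hops_between(graph: GraphStorageEngine, src: str, dst: str) -> Optional[int]:
    """Caller code depends only on the interface, never on SQL or Cypher."""
    path = graph.find_shortest_path(src, dst)
    return None if path is None else len(path) - 1


graph = InMemoryGraphStorageEngine([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

`hops_between` would work unchanged against any other class that satisfies the same protocol, which is the property the abstraction is meant to lock in.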
14 Scale envelope
D25 — illustrative ranges; tunes as we measure year-1 customer behavior.
Per-tenant ranges
| Dimension | Small early | Mid-market | Large enterprise |
|---|---|---|---|
| Apps in scope | 1–3 mobile | 5–10 mobile + web | 10–50 across mobile/web/backend |
| Entities | 3K–15K | 50K–200K | 500K–2M |
| Claims (full history) | 30K–750K | 5M–50M | 50M–500M |
| Storage | 100MB–2GB | 5GB–50GB | 50GB–500GB |
Cloud aggregate by year 3 (pessimistic)
~2B entities · ~500B claims · ~1PB storage · 100K–1M reads/sec · 10K–100K writes/sec
Latency targets
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Get entity by ID | <10ms | <50ms | <200ms |
| Filter query (CEL) | <50ms | <200ms | <1s |
| Semantic search | <100ms | <500ms | <2s |
| Traversal (3 hops) | <100ms | <500ms | <2s |
| ACL evaluation (per request) | <5ms cached / <50ms uncached | | |
15 Future enhancements
End-state goals with explicit migration paths. Today's design preserves them all.
16 What's next
From design to implementation.
Already done
- Design notes (`.tasks/drafts/validation-graph-design-notes.md`) — full reasoning, alternatives, worked examples, NL ↔ structured pairs. ~1500 lines.
- Architecture summary (`.tasks/drafts/validation-graph-architecture-summary.md`) — readable distillation with mermaid diagrams.
- SPEC-035 (`docs/specs/drafts/SPEC-035-validation-graph.md`) — formal spec ready for review.
- ADR-005 amended with supersession note pointing to SPEC-035.
- This walkthrough — for team explanation.
Coming up
- Team review of SPEC-035 before approval.
- Day-1 POC — small Postgres + pgvector + CEL POC validating the design before full implementation.
- Implementation roadmap — module structure, sequencing, team allocation.
- Operational sub-specs — Q-* backlog items get their own specs as their time comes (validation workflows, pod architecture, embedding rollout, migration playbook, etc.).
- Promotion pipeline design (Q1) — once Tenant Graph is built, design how facts flow tenant → ecosystem.
Open operational backlog (parked)
Not graph-shaping; will be designed when their time comes. See design notes' "Operational" section for the full list — Q-validation-workflows, Q-merge-split-deep-dive, Q-vector-replication, Q-embedding-rollout, Q-postgres-adapter-optimizations, Q-pod-architecture, Q-migration-playbook, Q-tenancy-tiering, Q-identity-resolution, Q-debug, Q-license-*, Q-entitlement-*, Q-hook-ip-framework, plus Q5.2–Q5.5 (rover internals).
17 References
Links to companion documents and external standards.
Aucert documents
- Design notes — full design with reasoning, alternatives, worked examples
- Architecture summary — readable distillation
- SPEC-035 — Validation Graph — formal spec (draft)
- ADR-005 — superseded by SPEC-035 (amendment)
- SPEC-020 — User Management (identity background)
- SPEC-021 — Identity Domain (interface this depends on)
External standards
- CEL — Common Expression Language (D12.2, D13, D22)
- pgvector (D20 default vector index)
- Citus — sharded Postgres (one D26 migration target)
- Memgraph, Neo4j — candidate cloud-migration targets
- WorkOS — identity provider federation (D13.8)
- OpenFGA / SpiceDB — Zanzibar implementations (F2 reference)
Generated from the design notes on 2026-05-15; updated as the SPEC and design notes evolve.