Rules, enforced — not memorised.
What the system checks against the typed graph, when it checks, the seven-stage pipeline it runs before any structural recommendation, and the six self-flags it surfaces inline so the user can intercept failure rather than discover it.
The alphabet
Nine planning principles. Roughly thirty predicates derived from them. Each predicate is a query against the graph — not a habit, not a prompt, not a checklist someone has to remember.
Principles 8 and 9 interact in a fixed order: citizenship gate first, then family-first. A family candidate who fails the citizenship check cannot fill the role regardless of preference. This ordering is itself a predicate — role_assignment_filters_p8_before_p9 — checked whenever the engine suggests a person for a role.
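The fixed ordering can be illustrated with a minimal sketch. It assumes a hypothetical `Candidate` record and a hypothetical `eligible_for_role` helper; neither name comes from the system itself, and the real predicate runs against the typed graph rather than a list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; field names are illustrative only.
    name: str
    citizenships: frozenset
    is_family: bool


def eligible_for_role(candidates, barred_citizenship="US"):
    """Apply the Principle 8 gate first, then the Principle 9 preference."""
    # Principle 8: drop anyone holding the barred citizenship, family or not.
    passed = [c for c in candidates if barred_citizenship not in c.citizenships]
    # Principle 9: family first among the survivors. sorted() is stable,
    # so the input order is preserved within each group.
    return sorted(passed, key=lambda c: not c.is_family)


candidates = [
    Candidate("advisor", frozenset({"MT"}), is_family=False),
    Candidate("us_relative", frozenset({"US"}), is_family=True),
    Candidate("relative", frozenset({"CN", "MT"}), is_family=True),
]
ranked = [c.name for c in eligible_for_role(candidates)]
```

Because the gate runs before the ranking, a family member who fails the citizenship check never reaches the preference step.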
When the engine fires
Five hook points across the session lifecycle. The same predicate alphabet runs at each — the difference is what's being judged and what the engine does when it finds a problem.
One rule, walked through
Principle 8 · no_us_person_in_bvi_role · firing when USLP submits the BVI HoldCo KYC pack for ingest, 2026-05-10.
What the engine reads
What the engine checks
no_us_person_in_bvi_role
For each party_of edge on a BVI vehicle, the party's citizenship must not contain US.
kyc_uses_compartment_passport
For BVI KYC of a Chinese-national principal, prefer Malta/Grenada passport. Chinese passport submission is a flag.
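The two predicates above can be read as plain queries over a vehicle's `party_of` edges. The sketch below is one possible shape, assuming hypothetical `Party` and `Vehicle` records, two-letter country codes (`US`, `CN`, `MT`), and a `kyc_passport` field that are not part of the original description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Party:
    # Hypothetical node; citizenships use two-letter country codes (assumption).
    name: str
    citizenships: set
    kyc_passport: Optional[str] = None  # passport submitted in the KYC pack


@dataclass
class Vehicle:
    name: str
    jurisdiction: str
    parties: list = field(default_factory=list)  # stands in for party_of edges


def no_us_person_in_bvi_role(vehicle):
    """Return names of parties on a BVI vehicle whose citizenship contains US."""
    if vehicle.jurisdiction != "BVI":
        return []
    return [p.name for p in vehicle.parties if "US" in p.citizenships]


def kyc_uses_compartment_passport(vehicle):
    """Return names of Chinese-national parties whose BVI KYC uses the Chinese passport."""
    if vehicle.jurisdiction != "BVI":
        return []
    return [
        p.name
        for p in vehicle.parties
        if "CN" in p.citizenships and p.kyc_passport == "CN"
    ]


holdco = Vehicle("BVI HoldCo", "BVI", [Party("principal", {"CN", "MT"}, kyc_passport="MT")])
violations = no_us_person_in_bvi_role(holdco) + kyc_uses_compartment_passport(holdco)
print(f"{len(violations)} rules violated")
```

Each predicate returns the offending parties rather than a bare boolean, so a failure can be surfaced with enough detail for the principal to acknowledge, override, or repair.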
What the principal sees
Briefing line: “14 entities read · 0 rules violated. Citizenship compartmentalisation satisfied.”
No prompt to the principal. The check ran silently. The pass-state is logged so a future audit can reconstruct which version of the predicate cleared which version of the graph.
If any predicate had failed, the engine would have paused the ingest, surfaced the failure, and asked: “Acknowledge, override with rationale, or repair?”
When the engine refuses to compute
Every fact carries a confidence score, a valid_from, and a last_verified. Load-bearing decisions (tax, KYC, deed) require fresh, high-confidence inputs. The engine knows what it's standing on.
| Fact | Confidence | Source & date | Engine behaviour |
|---|---|---|---|
| Dan Hu · malta_passport · MT251400 | 1.00 · load-bearing | Passport scan · 2025-11 | USE · Eligible for BVI KYC. No prompt. |
| Nate Hu · us_citizenship | 1.00 · load-bearing | US birth cert · 2011 | USE · PFIC predicate runs against this fact. |
| Pomelo · expected_birth_year | 0.50 · owner intent only | Conversation · 2026-03 | SHOW + TOOLTIP · Used for forward-look planning. Hover surfaces “owner intent only.” |
| Tao Qing · current_address | 0.30 · stale 2019 | Court file · 2019-02-27 | REFRESH REQUESTED · Not load-bearing for current tasks. Flagged in nightly sweep. |
| Nate Hu · mailing_address · for Form 3520 | 0.65 · below threshold | Tao Qing self-report · 2024 | REFUSE TO COMPUTE · Form 3520 amendment requires conf ≥ 0.95. Engine pauses; principal must confirm a current address before drafting. |
The threshold is not arbitrary. It comes from the cost of being wrong: a passport number used for KYC at the wrong confidence is a fileable mistake; an heir’s mailing address used for an IRS form at the wrong confidence is a fileable mistake. The engine refuses, not because it doesn’t know how, but because the source isn’t strong enough.
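The refusal behaviour in the table can be sketched as a gate that runs before any load-bearing computation. The 0.95 threshold and the two example facts come from the table above; the `Fact` record, the `RefuseToCompute` exception, and the `require_load_bearing` helper are hypothetical names used for illustration. Freshness checks on `last_verified` are left out because no freshness window is stated.

```python
from dataclasses import dataclass

LOAD_BEARING_THRESHOLD = 0.95  # threshold quoted for the Form 3520 amendment


@dataclass(frozen=True)
class Fact:
    # Hypothetical record; the real facts also carry valid_from and last_verified.
    subject: str
    key: str
    confidence: float


class RefuseToCompute(Exception):
    """Raised when an input is too weak for a load-bearing computation."""


def require_load_bearing(facts, threshold=LOAD_BEARING_THRESHOLD):
    """Return the facts unchanged, or refuse if any falls below the threshold."""
    weak = [f for f in facts if f.confidence < threshold]
    if weak:
        names = ", ".join(f"{f.subject} · {f.key} ({f.confidence:.2f})" for f in weak)
        raise RefuseToCompute(f"confirm before drafting: {names}")
    return list(facts)


passport = Fact("Dan Hu", "malta_passport", 1.00)
address = Fact("Nate Hu", "mailing_address", 0.65)

usable = require_load_bearing([passport])  # passes silently
try:
    require_load_bearing([passport, address])
    refused = False
except RefuseToCompute as exc:
    refused = True
    message = str(exc)
```

Raising instead of returning a default keeps the pause explicit: the caller cannot proceed to drafting without handling the refusal.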
Note · two distinct confidence systems. Per-fact confidence (above) gates whether the engine will compute on a fact. Per-recommendation confidence (next section) gates whether the principal should act on a conclusion. Same scale, different unit of analysis.
Recommendation confidence bands
Every structural recommendation carries a number. The number means something specific. Bands map to action permissions.
⚠ CONFIDENCE SWING WITHOUT FACTS in the next section.
The seven-stage pre-recommendation pipeline
Every structural recommendation runs through Stages A–G before surfacing. Skipping a stage is a bug. Each stage emits an artifact — the engine cannot show a recommendation that lacks any artifact. Plus a final Stage Z (adversarial review pass) before display.
[[bvi-structure-one-vs-two]], 3 open questions on [[firpta-tax-mitigation-to-do]], jurisdictional sweep across IRS / CA SBE / BVI Companies Act.
[[bvi-structure-one-vs-two]]. Same federal mechanic. Same rejection still applies. Two-step is dead.
All seven stages plus Stage Z ran on the FIRPTA recommendation v2 (see Recommendation). Stages A and C caught the partnership-window flaw that a human reviewer had missed across three turns. Stages D and E caught a fabricated $50K+/yr claim that had been inserted to balance the recommendation. The engine doesn’t promise correct answers; it promises that the failures will be the kind a reviewer can see.
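The rule that a recommendation cannot surface without every stage's artifact can be sketched as a small write-time gate. The stage letters come from the text; the `surface` function and the dictionary shapes are assumptions made for illustration.

```python
REQUIRED_STAGES = ("A", "B", "C", "D", "E", "F", "G", "Z")


def surface(recommendation, artifacts):
    """Display a recommendation only if every stage left a non-empty artifact.

    `artifacts` maps a stage letter to whatever that stage emitted
    (hypothetical shape). A missing or empty entry blocks display.
    """
    missing = [s for s in REQUIRED_STAGES if not artifacts.get(s)]
    if missing:
        # The engine returns an open question instead of the recommendation.
        return {"kind": "open_question", "missing_stages": missing}
    return {"kind": "recommendation", "text": recommendation}


complete = {s: f"artifact from stage {s}" for s in REQUIRED_STAGES}
partial = {s: v for s, v in complete.items() if s != "D"}

shown = surface("three-option matrix", complete)
blocked = surface("three-option matrix", partial)
```

Treating an empty artifact the same as a missing one matters here: a stage that ran but produced nothing should not count as having passed.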
The six self-flags — visible to the user, inline
Self-flagging is a feature, not a sign of unreliability. It’s the alternative to silent failure. These chips appear inline in the engine’s output, not in a hidden audit log. The user sees them; the next engine run sees them.
A number was produced without inline computation or citation
Fires when a dollar amount, percentage, time period, or comparative magnitude is asserted without one of: shown computation, citation to vault / external authority, or explicit “rough estimate” tag.
Source incident: phantom “$50K+/yr long-horizon cost” without backing math (reflection.md, T3).
The pro-recommendation case was written before Stage D ran
Fires when affirmative-case prose appears in the output and the Stage D counter-argument artifact is missing, empty, or generated post-hoc. The order matters — counter-argument first, affirmative case second.
Source incident: “Why two steps?” affirmative callout written before the partnership-window counter was considered (reflection.md, T1).
A load-bearing term with multiple meanings was used unqualified
Fires when terms with known multiple referents (“reassessment”, “contribution”, “step”, “transfer”, “DRE”, “NRA”, “principal”) appear without specifying which sub-concept. Stage F should have caught it; if it’s in the output anyway, the flag fires.
Source incident: “reassessment” conflated §62(a)(2) with §64(d), producing wrong-magnitude analysis (reflection.md, T4).
A confidence number moved by more than 15 points without a verified new fact
Fires when a recommendation’s confidence delta between turns exceeds 15 points and the change is not traceable to a Stage A retrieval result, a Stage E computation, or a professional sign-off.
Source incident: confidence in two-step swung from “working plan” (implicit ~80%) to “unambiguously worse” (implicit ~20%) on critique pressure alone (reflection.md, T2→T3).
The output is structured around the user’s last move rather than facts
Fires when the engine’s response shape mirrors the critique’s vector (e.g., critique pushes toward A ⇒ engine pivots toward A) without verified new facts that justify the pivot. Detected by comparing the output’s argument structure to the critique’s framing.
Source incident: four positions in five turns, each pivoting on user framing rather than independent verification (reflection.md, root cause B).
An untested authority is presented without the novel-position flag
Fires when a position depends on authority that lacks a published pronouncement on the specific factual pattern (e.g., Rev. Proc. 2002-69 for NRA spouses) but the output presents it as established practice rather than as a position with failure modes.
Source incident: Rev. Proc. 2002-69 hybrid initially proposed without flagging the NRA-application novelty; corrected in v2.
Each flag is rendered as the chip pattern shown above when fired. When cleared (the check ran and passed), it shows as ✓ CHECK CLEARED at the bottom of the recommendation. Visibility, not silence.
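Of the six flags, the confidence-swing check is the most mechanical, and a minimal sketch makes its rule concrete. The 15-point limit and the three accepted evidence sources come from the flag's description; the function name, the evidence labels, and the 0 to 100 scale are assumptions.

```python
SWING_LIMIT = 15  # points, per the flag's description

# Hypothetical labels for the three sources that can justify a large move.
VERIFIED_SOURCES = {
    "stage_a_retrieval",
    "stage_e_computation",
    "professional_sign_off",
}


def confidence_swing_flag(previous, current, evidence):
    """Return True when the flag should fire.

    `previous` and `current` are recommendation confidences on a 0-100 scale;
    `evidence` lists what changed between the two turns.
    """
    delta = abs(current - previous)
    traceable = any(item in VERIFIED_SOURCES for item in evidence)
    return delta > SWING_LIMIT and not traceable


# The source incident: roughly 80 down to roughly 20 on critique pressure alone.
fires = confidence_swing_flag(80, 20, ["user_critique"])
cleared = confidence_swing_flag(80, 20, ["stage_e_computation"])
```

The check does not forbid large moves. It only requires that a large move be tied to a verified new fact.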
One failure, with and without the pipeline
A four-turn correction cycle on the FIRPTA two-step proposal · what happened · what the seven-stage pipeline + self-flags would have shortened.
What actually happened · 2026-05-14 to 05-15
- T1 · Two-step §351 proposed. Affirmative case written: “cleaner for CA Prop-13 / DTT mechanics.”
- T2 · Principal critiques: same logic that rejected dual-BVI rejects two-step’s transitional window. Partnership classification at T_step-A missed.
- T3 · Overcorrection: “one-step is unambiguously better.” New phantom drawback fabricated (“$50K+/yr long-horizon cost”) with no underlying math.
- T4 · Principal: “think harder.” Math still not done. Phantom number repeated.
- T5 · Principal asks for the math. Year-by-year table is built. Phantom drawback is wrong; one-step is in fact cleaner.
- Five turns, four of them corrections. Confidence swung wildly. The principal did the analytical work.
What would have surfaced at T1
- A · Stage A retrieves [[bvi-structure-one-vs-two]] as a rejected pattern; multi-member-LLC partnership-treatment risk is on the working page.
- B · Stage B time-state walkthrough labels the LLC at T_step-A as “multi-member ⇒ partnership for federal tax.”
- C · Stage C rejected-pattern test fires: the rejection still applies. Two-step proposal is dead at T1.
- D · Counter-argument for one-step: §62(a)(2) look-through failure surfaced before affirmative case is written.
- E · Stage E demands year-by-year math. Phantom claim doesn’t pass. Real math: Scenario B cheaper at every horizon.
- F · Stage F separates “reassessment” into §62(a)(2) and §64(d) events with their own bases and timings.
- G · Stage G outputs 65% overall confidence with sign-off map: Boone for federal, Yiqi for CA.
- All stages pass. Recommendation lands as a three-option matrix at T1 — no four-turn cycle.
“You can be completely wrong if you get all logics right but missed one key point or key logic in the chain. Interrogate yourself.”
— Principal critique, 2026-05-15. Recorded in reflection.md and now compiled into the predicate set, the pipeline, and the self-flags above.
What the engine never does
Hard refusals. These aren’t preferences — they are structural commitments that mirror the vault’s “What I Never Do” protocol.
Silently overwrite a contradicting fact
A new source that contradicts an existing fact triggers a contradiction flag on both pages. Neither is silently replaced. The principal resolves; the resolution becomes a typed Analysis node.
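The no-silent-overwrite commitment can be sketched as an ingest step that flags both sides and leaves the stored value untouched. The `FactPage` record, the `ingest` function, and the `pending` list are hypothetical; the real system resolves contradictions into a typed Analysis node, which this sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class FactPage:
    # Hypothetical page shape; field names are illustrative only.
    key: str
    value: str
    source: str
    contradiction_flags: list = field(default_factory=list)


def ingest(store, pending, incoming):
    """Store a new fact, confirm a matching one, or flag a contradiction."""
    current = store.get(incoming.key)
    if current is None:
        store[incoming.key] = incoming
        return "stored"
    if current.value == incoming.value:
        return "confirmed"
    # Contradiction: flag both pages and keep the existing value in place.
    current.contradiction_flags.append(incoming.source)
    incoming.contradiction_flags.append(current.source)
    pending.append(incoming)  # held until the principal resolves it
    return "contradiction"


store, pending = {}, []
first = ingest(store, pending, FactPage("example.address", "address A", "source 1"))
second = ingest(store, pending, FactPage("example.address", "address B", "source 2"))
```

After the second call, the store still holds the original value, and both pages carry a flag naming the other's source.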
Compute on stale load-bearing facts
Any computation that drives a tax filing, KYC submission, or deed transfer requires conf ≥ 0.95 on every input. Below threshold, the engine pauses and asks for a refresh — not a guess.
Surface a recommendation that skipped a pipeline stage
Stages A–G + Z are write-time gates, not polish. A recommendation missing any stage’s artifact is not displayed — the engine returns an open question instead.
Cite a superseded page in active reasoning
Pages with status: superseded are still readable but not citable. The supersession edge is followed forward to the current version before any predicate evaluates against it.
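Following the supersession edge forward amounts to a short chain walk. The sketch below assumes each page is a dictionary with a `status` key and, when superseded, a hypothetical `superseded_by` key naming its successor; the page identifiers are made up for the example.

```python
def resolve_current(page_id, pages):
    """Follow supersession edges forward until a non-superseded page is reached."""
    seen = set()
    while pages[page_id].get("status") == "superseded":
        if page_id in seen:
            # Guard against a malformed graph where edges loop back.
            raise ValueError(f"supersession cycle at {page_id}")
        seen.add(page_id)
        page_id = pages[page_id]["superseded_by"]
    return page_id


pages = {
    "memo-v1": {"status": "superseded", "superseded_by": "memo-v2"},
    "memo-v2": {"status": "superseded", "superseded_by": "memo-v3"},
    "memo-v3": {"status": "active"},
}
citable = resolve_current("memo-v1", pages)
```

A predicate would then evaluate against `citable`, never against the page it was originally handed.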
Invent a counter-argument to appear thoughtful
When asked “are you sure?” the engine re-runs Stage D honestly. If no real counter surfaces, it says so. Fake nuance is worse than visible confidence — ⚠ NOVEL POSITION AS SAFE-HARBOR would fire on a fabricated counter.
Treat two distinct events as one because they share a name
“Reassessment”, “trust”, “NRA”, “principal” — load-bearing terms with multiple referents. Each appearance is resolved to one referent before a predicate fires.
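One rough way to detect an unresolved term is to look for a sentence that uses it without naming any of its known referents. This is a heuristic sketch only: the `REFERENTS` table holds just the one pair the document spells out (§62(a)(2) and §64(d) for “reassessment”), and the sentence-level matching rule is an assumption, not the engine's actual resolution method.

```python
import re

# Only the referent pair given in the text; other terms would need their own entries.
REFERENTS = {"reassessment": ("§62(a)(2)", "§64(d)")}


def unqualified_uses(text):
    """Return (term, sentence) pairs where a term appears with no referent named."""
    hits = []
    # Naive sentence split on whitespace that follows ., ! or ? (assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        for term, referents in REFERENTS.items():
            if term in lowered and not any(r in sentence for r in referents):
                hits.append((term, sentence.strip()))
    return hits


draft = "The transfer triggers reassessment. The §64(d) reassessment is separate."
hits = unqualified_uses(draft)
```

A real resolver would bind each appearance to a single referent node; this check only finds the places where that binding is visibly missing.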