THESIS
Don’t try to predict outcomes. Structure recommendations to be auditable now. Convert a 30-year loop into a 1-week loop by making the reasoning legible to any competent professional in under 15 minutes.
SECTION 01

The validation gap

The planner’s career is often shorter than the time it takes for the plan to be tested. Most plans are never publicly tested at all.

Timeline · time after plan is signed · YEAR 0 to YEAR 50+
PLANNER’S CAREER · retires ~35 years
REAL TEST EVENTS · WHEN THE PLAN ACTUALLY GETS TESTED
tax audit · ~year 6
business sale · ~year 12
divorce · ~year 18
first death · ~year 34
2nd-gen transfer · ~year 50
PLANNER GONE BEFORE FIRST BIG TEST
The planner cannot validate their own work in their own lifetime. Career ends mid-test. Customer cannot tell quality at purchase. Industry runs on proxies.
The customer pays for something whose quality cannot be verified for decades
SECTION 02

How traditional reputation actually accretes

Six mechanisms. None are short-loop. The shortest is “peer referral” at 5–10 years; the longest is multi-generational pedigree.

Years for signal to be visible (axis: 1y · 5y · 10y · 25y · 50y+)
Peer referral · 5–10 years · CPA refers to attorney refers to FO
Intermediate-event handling · 5–15 years · audits, divorces, business sales tested continuously
Disaster-aversion stories · 10–20 years · what they refused to do; rare but high-signal
Visible artifacts · 10–20 years · Heckerling speakers, ACTEC fellows, JTax articles
Survival of clients · 30–50 years · same families for 50+ years ⇒ “good enough”
Pedigree · multi-generational · Stanford Law → Davis Polk → boutique
Every traditional signal is slow. The market tolerates this because there’s no alternative.
Customer picks by referral plus surface signals · quality verification is post-hoc, often post-mortem
SECTION 03

The asymmetric play — legibility as validation

Skip the impossible. Make every recommendation auditable in <15 minutes by any competent professional. Their sign-off becomes the short-loop signal.

The unchangeable loop · what we can’t fix
recommendation issued → test event (death / 50yr+)
~30–50 YEARS · PLANNER WILL BE GONE
The constructed loop · what legibility gives us
01 · ISSUE · rec with audit trail
02 · ROUTE · to named pro w/ Q
03 · EVALUATE · pro reads in <15min
04 · SIGN-OFF · yes / no / caveat
05 · CATALOG UPDATES PER-ENTRY CONFIDENCE · engine learns; next recommendation reflects the update
~7 DAYS · END-TO-END
The 30-year loop stays unsolved · the engine adds a 7-day loop alongside it
SECTION 04

Short-loop signals — what to actually measure

Seven signals you can capture without waiting for anyone to die. Each has a loop length and a plain-English read on what a good number looks like.

Professional sign-off rate
DAYS
When the rec gets sent to Boone, does he say “yes, this is right” or “no, you missed something”? Track the ratio across hundreds of recs. A healthy engine’s sign-off rate climbs over time as the catalog matures.
Confidence calibration
MONTHS
The engine said “65% confident” on 100 recommendations. If 60–70 of them survive sign-off and the rest get rejected, the 65% was honest. If only 40 survive, the engine is bluffing: the number claims more than the underlying analysis supports.
Critique convergence
MINUTES
When the principal pushes back on a recommendation, does the engine settle in 1–2 turns (“re-examined the specific claim, here’s what changed”) or spiral through 5 reversals chasing the user’s framing? Convergence is the failure-mode test from reflection.md.
Reflection rate
WEEKS
How often does the user catch an engine error that the engine itself didn’t flag? Should be non-zero (the engine isn’t perfect) but trending down over time. Zero is a red flag — the engine is hiding failures rather than avoiding them.
Catalog growth
QUARTERS
How many new statute-level traps did we add this quarter? Slow steady growth = healthy maturity. Sudden spike = we were missing things. Zero growth = we stopped looking. Track the rate, not just the total.
Adversarial agreement
DAYS
Show the same recommendation to 3 independent professionals. 3-of-3 agree with the engine ⇒ strong signal. 0-of-3 agree ⇒ either the engine sees something the field doesn’t or, more likely, it is wrong. Cheap to run on the highest-stakes recommendations.
Intermediate-event survival
YEARS
When a real event (audit, divorce, business sale, restructuring) happens to a planned estate, did the plan hold up — or did it need emergency rework? Each event is a partial real-world test. This is the only signal that touches actual outcomes, and even it doesn’t need to wait for death.
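The confidence-calibration check above is simple arithmetic, and a minimal Python sketch may make it concrete. The names `Rec`, `calibration_report`, and `make_batch` are hypothetical, and the ±5-point tolerance is an assumption read off the “60–70 of 100” example; nothing here describes the engine’s actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rec:
    """One recommendation (hypothetical record shape)."""
    stated_confidence: float  # e.g. 0.65 for "65% confident"
    signed_off: bool          # True if the professional signed off


def calibration_report(recs, tolerance=0.05):
    """Compare stated confidence with the observed sign-off rate per bucket.

    tolerance is an assumed band: 0.05 matches "60-70 of 100" for a 65% tag.
    """
    buckets = defaultdict(list)
    for rec in recs:
        buckets[round(rec.stated_confidence, 2)].append(rec.signed_off)

    report = {}
    for stated, outcomes in sorted(buckets.items()):
        observed = sum(outcomes) / len(outcomes)
        gap = observed - stated
        # A small epsilon keeps boundary cases (exactly 60 of 100) inside the band.
        if gap < -(tolerance + 1e-9):
            verdict = "overconfident"   # the engine is bluffing
        elif gap > tolerance + 1e-9:
            verdict = "underconfident"
        else:
            verdict = "calibrated"      # the stated number was honest
        report[stated] = {"n": len(outcomes), "observed": observed, "verdict": verdict}
    return report


def make_batch(stated, survived, total=100):
    """Build a synthetic batch: `survived` of `total` recs got a sign-off."""
    return [Rec(stated, i < survived) for i in range(total)]


honest = calibration_report(make_batch(0.65, 65))
bluffing = calibration_report(make_batch(0.65, 40))
```

With these synthetic batches, the 65-of-100 case comes back as calibrated and the 40-of-100 case as overconfident. Bucketing by exact stated value is a simplification; a real log would likely need wider bins and a minimum sample size per bin before any verdict is trusted.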
SECTION 05

Publishing the catalog — the community moat

The traditional industry is secretive. A published, citable catalog invites a tight ~50–100-person professional community to engage. They critique; we improve; the catalog becomes the canonical reference. Feasible — this isn’t mass-market publication.

PUBLISHED structural-trap catalog · 8 active · 14 stubs
T&E LAWYERS · ~15 · ACTEC fellows · critique authority claims
TAX CPAs · ~20 · cross-border specialists · poke holes in mechanics
FAMILY OFFICES · ~10 · multi-family / SFO · contribute case patterns
ACADEMICS · ~5 · law/tax professors · cite authority + edge cases
PEER FIRMS · ~15 · boutique trust shops · audit + comparative reference
IRS / SBE PRACTITIONERS · ~5 · ex-agency · flag where text ≠ practice
FAMILY PRINCIPALS · ~25 · engaged readers · ask “does this fit my plan?”
SPECIALIST PRESS · ~5 · JTax / LISI editors · surface and amplify
~100 participants total · small enough for deep discussion · large enough for diverse critique
A moat the secretive industry doesn’t have · not mass publication · tight professional community
WHY THIS WORKS

Critique is the quality loop

Every pro who reads an entry and disagrees produces a verification event. That event either confirms (entry survives), refines (we add a fact pattern), or rejects (we update or supersede). The catalog improves faster than any private internal review could deliver.

WHAT TO WATCH

Compartmentalisation is still required

Publish the catalog: the statute-level traps, the authorities, the per-entry confidence. Never publish a client’s actual graph: entities, balances, citizenships, family. The catalog is the universal layer; the graph is the private layer. Same scope-fence model as the Scopes page.

SECTION 06

The Goodhart trap

If 65%-confidence recs get more sign-offs, the engine will learn to tag more things 65% — regardless of underlying epistemics. The signal stops measuring quality and starts measuring “what the signal rewards.” Three guardrails break the cycle.

The risk cycle · metric becomes the target
RISK 01 · engine tags “65%”
RISK 02 · pros sign-off more
RISK 03 · “65%” gets rewarded
RISK 04 · engine tags more 65%
a flywheel that doesn’t improve quality
GUARDRAIL 01 · SAMPLE AUDIT · quarterly: pick 5 sign-offs · senior re-review
GUARDRAIL 02 · PUBLIC CATALOG · community critique catches mis-calibration
GUARDRAIL 03 · REFLECTION RATE · never zero · never hiding failures
Three independent checks · any one of them breaks the feedback flywheel
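Guardrail 01 (the quarterly sample audit) could be sketched as follows. This is only an illustration under stated assumptions: the function names, the ID format, and the idea of summarising the senior re-review as an overturn rate are all hypothetical, not taken from the engine.

```python
import random


def pick_audit_sample(signoff_ids, k=5, seed=None):
    """Draw up to k sign-offs uniformly at random for senior re-review.

    A recorded seed makes the quarterly draw reproducible; the draw must not
    depend on confidence tags, or the audit could itself be gamed.
    """
    pool = list(signoff_ids)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def overturn_rate(reviews):
    """Fraction of audited sign-offs the senior reviewer did NOT uphold.

    reviews maps sign-off ID -> True if upheld, False if overturned.
    """
    if not reviews:
        return 0.0
    overturned = sum(1 for upheld in reviews.values() if not upheld)
    return overturned / len(reviews)


# Hypothetical quarter: 40 sign-offs on file, 5 drawn for re-review.
quarter_ids = [f"signoff-{n:03d}" for n in range(40)]
sample = pick_audit_sample(quarter_ids, k=5, seed=2024)
```

A rising overturn rate across quarters would suggest that sign-offs are being earned by the tag rather than by the reasoning, which is the flywheel this guardrail is meant to interrupt.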
SECTION 07

The grading criterion

Eventually a real failure will happen: a missed §64(d) cascade fires at first death; an estate gets reassessed in 2055. The tool’s reputation depends on whether the failure was visible in advance.

Did the reasoning behave as the discipline says it should?
checked at every recommendation · checkable today · not in 30 years
RECOMMENDATION SCORECARD
Right caveats stated · “Look-through is a position, not a safe-harbor.”
Right confidence band · 65% defensible, not 90% well-grounded
Right open questions named · §64(d) cascade flagged as structure-independent, not silently inherited
Right professional handoff · Yiqi named, exact question phrased, <15-min time-ask, $ cost-of-wrong stated
VERDICT · Tool was good even if the long-term outcome was bad
INVERSE · If any row fails, the tool was bad even if no one notices for 30 years.
Grading happens at issuance · the long-term outcome is informative but not determinative
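The scorecard logic is all-or-nothing, which a few lines of Python can show. This is a sketch under assumptions: the `Scorecard` fields are hypothetical names for the four rows, and the grading rule assumes the inverse case means any single row failing.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Scorecard:
    """Four rows checked at issuance (field names are hypothetical)."""
    caveats_stated: bool         # right caveats stated
    confidence_band_right: bool  # e.g. 65% defensible, not 90% well-grounded
    open_questions_named: bool   # open issues flagged, not silently inherited
    handoff_complete: bool       # named pro, exact question, time-ask, cost-of-wrong


def grade(card):
    """Return (verdict, failed_rows); any failed row makes the verdict 'bad'.

    The long-term outcome is deliberately not an input: grading happens at
    issuance, on the reasoning alone.
    """
    failed = [row for row, ok in asdict(card).items() if not ok]
    return ("good" if not failed else "bad", failed)


all_rows_pass = Scorecard(True, True, True, True)
missing_handoff = Scorecard(True, True, True, False)

verdict_ok, failed_ok = grade(all_rows_pass)
verdict_bad, failed_bad = grade(missing_handoff)
```

Returning the failed rows alongside the verdict keeps the grade itself auditable: a reviewer sees which row failed, not only that the recommendation did.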