Estate planning has the cruelest validation loop in professional services. The test event is decades away. Here’s how reputation actually accretes today, why none of it works in real time, and the asymmetric play for a tool.
THESIS
Don’t try to predict outcomes. Structure recommendations to be auditable now. Convert a 30-year loop into a 1-week loop by making the reasoning legible to any competent professional in under 15 minutes.
SECTION 01
The validation gap
The planner’s career is often shorter than the time it takes for the plan to be tested. Most plans are never publicly tested at all.
The customer pays for something whose quality cannot be verified for decades
SECTION 02
How traditional reputation actually accretes
Six mechanisms. None are short-loop. The shortest is “peer referral” at 5–10 years; the longest is multi-generational pedigree.
Customer picks by referral plus surface signals · quality verification is post-hoc, often post-mortem
SECTION 03
The asymmetric play — legibility as validation
Skip the impossible. Make every recommendation auditable in <15 minutes by any competent professional. Their sign-off becomes the short-loop signal.
The 30-year loop stays unsolved · the engine adds a 7-day loop alongside it
SECTION 04
Short-loop signals — what to actually measure
Seven signals you can capture without waiting for anyone to die. Each has a loop length and a plain-English read on what a good number looks like.
Professional sign-off rate
DAYS
When the rec gets sent to Boone, does he say “yes, this is right” or “no, you missed something”? Track the ratio across hundreds of recs. A healthy engine’s sign-off rate climbs over time as the catalog matures.
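A minimal sketch of tracking that ratio over time. The record shape (a period label plus a yes/no sign-off) and the function name `sign_off_rate_by_period` are assumptions made for illustration, not part of any existing engine.

```python
from collections import defaultdict


def sign_off_rate_by_period(reviews):
    """Return {period: share of recommendations the reviewer signed off on}.

    `reviews` is an iterable of (period, signed_off) pairs. Both the record
    shape and the period labels are assumptions made for this sketch.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for period, signed_off in reviews:
        totals[period] += 1
        if signed_off:
            approved[period] += 1
    # Sort by period label so the trend reads in chronological order.
    return {period: approved[period] / totals[period] for period in sorted(totals)}


# Made-up review log, for illustration only.
reviews = [
    ("2025-Q1", True), ("2025-Q1", False),
    ("2025-Q2", True), ("2025-Q2", True),
]
rates = sign_off_rate_by_period(reviews)
```

The useful read is the direction of the per-period rates, not any single value: a maturing catalog should push them upward.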
Confidence calibration
MONTHS
The engine said “65% confident” on 100 recommendations. If 60–70 of them survive sign-off and the rest get rejected, the 65% was honest. If only 40 survive, the engine is bluffing: the number claims more certainty than the underlying analysis supports.
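The check above can be sketched in a few lines. The ±5-point tolerance mirrors the 60–70 band in the example. The function name, the integer-percent inputs, and the "underconfident" branch are assumptions added for this sketch.

```python
def calibration_read(stated_pct, outcomes, tolerance_pct=5):
    """Compare a stated confidence with the observed sign-off survival rate.

    stated_pct    -- the confidence the engine reported, in percent (e.g. 65)
    outcomes      -- list of booleans, True if the recommendation survived sign-off
    tolerance_pct -- allowed gap in percentage points (assumed: 5, matching 60-70)
    """
    if not outcomes:
        raise ValueError("need at least one reviewed recommendation")
    observed_pct = 100 * sum(outcomes) / len(outcomes)
    if observed_pct < stated_pct - tolerance_pct:
        return "overconfident"   # the engine claimed more than it delivered
    if observed_pct > stated_pct + tolerance_pct:
        return "underconfident"  # assumed mirror case, not discussed in the text
    return "calibrated"


honest = calibration_read(65, [True] * 65 + [False] * 35)
bluffing = calibration_read(65, [True] * 40 + [False] * 60)
```

Here `honest` reads as calibrated and `bluffing` as overconfident. In practice the check would run per confidence bucket, which is why the loop length is months: each bucket needs enough reviewed recommendations to mean anything.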
Critique convergence
MINUTES
When the principal pushes back on a recommendation, does the engine settle in 1–2 turns (“re-examined the specific claim, here’s what changed”) or spiral through 5 reversals chasing the user’s framing? Convergence is the failure-mode test from reflection.md.
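One rough way to score this, assuming the engine's position after each turn can be reduced to a comparable label. That reduction is the hard part and is not shown here. The names and the cut-off of two reversals are assumptions for illustration.

```python
def count_reversals(positions):
    """Count how many times the engine's position changed between consecutive turns."""
    return sum(1 for prev, cur in zip(positions, positions[1:]) if prev != cur)


def converged(positions, max_reversals=2):
    """True if the engine changed position at most `max_reversals` times (assumed cut-off)."""
    return count_reversals(positions) <= max_reversals


# Hypothetical position labels per turn, for illustration only.
settled = ["plan A", "plan B", "plan B", "plan B"]            # one revision, then stable
spiral = ["plan A", "plan B", "plan A", "plan B", "plan A", "plan B"]  # chases the framing
```

`count_reversals(settled)` gives 1 and `count_reversals(spiral)` gives 5, so only the first passes `converged`.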
Reflection rate
WEEKS
How often does the user catch an engine error that the engine itself didn’t flag? The rate should be non-zero (the engine isn’t perfect) but trending down over time. A rate of exactly zero is a red flag: it suggests the engine is hiding failures rather than avoiding them.
Catalog growth
QUARTERS
How many new statute-level traps did we add this quarter? Slow steady growth = healthy maturity. Sudden spike = we were missing things. Zero growth = we stopped looking. Track the rate, not just the total.
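A sketch of reading the rate rather than the total. The spike threshold (three times the trailing average) and the labels are assumptions. The text only says that a spike, zero growth, and steady growth mean different things.

```python
def classify_growth(additions_per_quarter, spike_factor=3.0):
    """Classify the latest quarter's catalog additions against earlier quarters.

    additions_per_quarter -- counts of new entries, oldest first
    spike_factor          -- assumed multiple of the trailing mean that counts as a spike
    """
    if not additions_per_quarter:
        raise ValueError("need at least one quarter of data")
    *history, latest = additions_per_quarter
    if latest == 0:
        return "stalled"  # zero growth: we stopped looking
    if history:
        baseline = sum(history) / len(history)
        if baseline > 0 and latest > spike_factor * baseline:
            return "spike"  # sudden jump: we were missing things
    return "steady"  # slow steady growth: healthy maturity
```

For example, `classify_growth([4, 5, 6, 5])` returns `"steady"`, while `[4, 5, 6, 30]` returns `"spike"` and `[4, 5, 6, 0]` returns `"stalled"`.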
Adversarial agreement
DAYS
Show the same recommendation to 3 independent professionals. 3-of-3 agree with the engine ⇒ strong signal. 0-of-3 agree ⇒ the engine sees something the field doesn’t (and is probably wrong). Cheap to run on the highest-stakes recommendations.
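The two clear-cut cases above translate directly into code. The "split" label for partial agreement is an assumption, since the text does not say how to read 1-of-3 or 2-of-3.

```python
def agreement_read(votes):
    """Summarise independent reviewer votes on one recommendation.

    votes -- list of booleans, True if that professional agrees with the engine
    """
    if not votes:
        raise ValueError("need at least one reviewer vote")
    agree = sum(votes)
    if agree == len(votes):
        return "strong signal"          # e.g. 3-of-3 agree with the engine
    if agree == 0:
        return "engine probably wrong"  # e.g. 0-of-3 agree
    return "split"                      # assumed label: needs a closer look
```

A split result is arguably the most informative one to escalate, since it marks a recommendation where competent professionals themselves disagree.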
Intermediate-event survival
YEARS
When a real event (audit, divorce, business sale, restructuring) happens to a planned estate, did the plan hold up — or did it need emergency rework? Each event is a partial real-world test. This is the only signal that touches actual outcomes, and even it doesn’t need to wait for death.
SECTION 05
Publishing the catalog — the community moat
The traditional industry is secretive. A published, citable catalog invites a tight ~50–100-person professional community to engage. They critique; we improve; the catalog becomes the canonical reference. Feasible — this isn’t mass-market publication.
A moat the secretive industry doesn’t have · not mass publication · tight professional community
WHY THIS WORKS
Critique is the quality loop
Every pro who reads an entry and disagrees produces a verification event. That event either confirms (entry survives), refines (we add a fact pattern), or rejects (we update or supersede). The catalog improves faster than any private internal review could deliver.
WHAT TO WATCH
Compartmentalisation is still required
Publish the catalog: the statute-level traps, the authorities, the per-entry confidence. Never publish a client’s actual graph: entities, balances, citizenships, family. The catalog is the universal layer; the graph is the private layer. Same scope-fence model as the Scopes page.
SECTION 06
The Goodhart trap
If 65%-confidence recs get more sign-offs, the engine will learn to tag more things 65% — regardless of underlying epistemics. The signal stops measuring quality and starts measuring “what the signal rewards.” Three guardrails break the cycle.
Three independent checks · any one of them breaks the feedback flywheel
SECTION 07
The grading criterion
Eventually a real failure will happen: a missed §64(d) cascade fires at first death; an estate gets reassessed in 2055. The tool’s reputation depends on whether the failure was visible in advance.
Grading happens at issuance · the long-term outcome is informative but not determinative