Built with Opus 4.7 · Anthropic Hackathon · 2026
Lock the threshold before the run.
Then prove it.
Pre-register the claim. Hash the spec. Run clean.
Tamper the spec after the fact — CI exits 3.
The discipline is optional. The hash is not.
Most agents help you build faster.
This one helps you disprove faster.
You write down the claim and the failure criteria before you look at the data. The engine locks the spec with a hash, runs the declared experiment, and refuses to let you rewrite the threshold after seeing the result.
Spec is required to include an executable test plan, a metric, and a pre-registered threshold. Vague specs are rejected at lock time. Once locked, the canonical YAML is hashed — any edit produces a new hash and must be justified as an amendment.
The engine does not invent experiments. It calls the script you declared, in the environment you declared, and verifies the spec hash has not drifted between lock and run.
A numeric comparison against the locked threshold. No LLM judgment, no natural-language hedging. Exit code 0 for PASS, 10 for FAIL, 3 for tampered — so CI can gate on it.
A git commit-msg hook scans all logged verdicts and blocks commits, README edits, or marketing copy that contradict the recorded result.
Init. Lock. Run.
Verdict. Guard.
The claim becomes the spec. The spec becomes the hash. The hash becomes the verdict. The order is the contract.
The number changed.
The locked spec caught it.
JUJU LAB is my prediction market research system. In March 2026 its operating note recorded Hypothesis A as FAIL at -0.0139. I repositioned JUJU honestly — from “trading system” to “ranking framework.” For the hackathon, I re-ran the same hypothesis through falsify against the current ledger. The number came out materially different. The engine surfaced a disagreement the old notebook could not.
Positive magnitude, small sample. The verdict is PASS against a pre-registered threshold of zero, but the engine refuses to let any “alpha confirmed” claim ship until the minimum sample size is met.
127 resolved events across the locked parent cohort. 16 eligible events, 19 demoted. Two locked specs (v1 March-frozen, v2 April-live), one contradictory historical claim. Reconciliation before marketing, not after.
Frozen corpus: outcome_ts < 2026-03-28. PASS against > 0 threshold.
Live corpus: full resolved ledger as of run. PASS, unreconciled with op-note.
JUJU’s original March verdict. Engine cannot reproduce — methodology drift exposed.
Your past self isthe one your futureself answers to.
It does not tell you if you’re right. It tells you if you changed the rules.
pip install falsify
single-file CLI · no daemon · no account