"Every claim needs a metric, a counterfactual, a control, and a confidence level. If a conclusion sounds exciting, attack it harder before reporting it."
Phase 1–3 atlas · 39 SFT experiments · 6 format ablations · 3 model scales · 2 architectures
Fully reproducible: python scripts/run_*.py