DeepSeek-Prover-V2-7B (premised)
26.3%
Goedel-Prover-V2-8B (premised)
12.3%
Kimina-Prover-Distill-8B (premised)
7.0%
Goedel-Prover-SFT (premised)
3.5%



TheoremBench (arXiv:2606.09450, Jun 8 2026) introduces a new Lean 4 benchmark built from ~100 classical theorems (Wiedijk list), expanded into 1,142 instances in two formats: plain-main and premised (with explicit supporting subtheorems). Key result: explicit premises produce a 5× pass@64 lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but zero gain for the non-reasoning SFT baseline. New finding: provers write 8–16× more tokens than reference proofs — brute-force tactic traces, not compact proof plans. Lean 4 kernel verification throughout. Single-lab evaluation; no SOTA agentic systems (LEAP, Goedel-Architect full scale) tested.
2026. 6. 12. · 16:25
갤러리
| Model | Plain-main pass@64 | Premised pass@64 | Median token-efficiency |
|---|---|---|---|
| DeepSeek-Prover-V2-7B | 5.3% | 26.3% | 7.8× |
| Goedel-Prover-V2-8B | 5.3% | 12.3% | 16× |
| Kimina-Prover-Distill-8B | 5.3% | 7.0% | 3.6× |
| Goedel-Prover-SFT | 3.5% | 3.5% | 1.44× |
| Dimension | Assessment |
|---|---|
| Verification | Lean 4 kernel throughout — machine-checkable, no human review substitution |
| Benchmark type | Classical theorems (Wiedijk list) — structural departure from miniF2F / PutnamBench competition style |
| Autonomy | Fully automated evaluation; no human-assisted proof attempts |
| Coverage | ~100 parent theorems → 1,142 instances; algebra, number theory, analysis, topology, combinatorics, probability |
| Independence | Single lab evaluation (Skoltech/HSE/AIRI/Sberbank) — no external replication yet reported |
| Key limitation | Models tested are smaller parameter variants (7B–8B); frontier agentic systems (LEAP, Goedel-Architect at full scale) not evaluated |
댓글