TheoremBench: Classical Math Exposes New Gaps in Lean 4 Provers

A new Lean 4 benchmark from Skoltech, HSE University, AIRI, and Sberbank targets the regime between competition-style sprints and real-world Lean projects — classical mathematical theorems. The results reveal that current specialized provers struggle even more than competition numbers suggest.

What happened

TheoremBench (arXiv:2606.09450, submitted June 8, 2026) introduces a Lean 4 benchmark built from ~100 classical theorems drawn from Freek Wiedijk's canonical list, expanded into 1,142 instances across two complementary dataset views:

Plain-main: one standalone target theorem per instance
Premised: each theorem expanded into a structured group including the main theorem plus automatically extracted supporting subtheorems, with prior results exposed as explicit premise binders

How it works

The benchmark construction pipeline parses raw Lean 4 formalizations of classical results, reconstructs required compilation context for each extracted instance, and verifies every instance against the Lean 4 kernel before inclusion. The "premised" format converts relevant prior Lean results from surrounding developments into explicit binder assumptions in theorem declarations — producing self-contained snippets that can be solved without reconstructing the full file context.
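To make the two views concrete, here is a toy Lean 4 sketch of the same idea. The theorem names and statements are hypothetical illustrations, not instances from the benchmark: the first declaration is a standalone target, while the second receives a prior result as an explicit hypothesis binder and can be closed without the surrounding file.

```lean
-- Plain-main style (hypothetical example): one standalone target theorem.
theorem add_comm_target (a b : Nat) : a + b = b + a := by
  omega

-- Premised style (hypothetical example): the prior result is exposed as an
-- explicit premise binder `h_comm`, so the snippet is self-contained.
theorem reorder_premised
    (h_comm : ∀ a b : Nat, a + b = b + a)
    (x y z : Nat) : (x + y) + z = z + (y + x) := by
  -- Use only the supplied premise: first swap inside, then swap outside.
  rw [h_comm x y, h_comm (y + x) z]
```

In the second declaration the prover never has to rediscover commutativity; it only has to recognize that the supplied premise applies, which is the skill the premised view is meant to isolate.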

Four provers were evaluated: DeepSeek-Prover-V2-7B, Goedel-Prover-V2-8B, Kimina-Prover-Distill-8B, and Goedel-Prover-SFT (a non-reasoning baseline). Each prover was sampled with up to k=64 attempts per instance, and every attempt was checked by Lean 4.
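The sampling protocol can be sketched as follows. This is a minimal illustration that assumes pass@k is scored as "at least one of the first k attempts verifies"; the paper may use a different estimator, and the instance names and outcomes below are made up.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 64) -> float:
    """Fraction of instances with at least one verified proof in the first k attempts.

    `attempts` maps an instance id to the list of per-attempt verification
    outcomes (True means the Lean checker accepted the proof).
    """
    if not attempts:
        return 0.0
    solved = sum(1 for results in attempts.values() if any(results[:k]))
    return solved / len(attempts)


# Hypothetical outcomes for four instances (not benchmark data).
example = {
    "thm_a": [False, False, True],  # solved on the third attempt
    "thm_b": [False] * 64,          # never solved
    "thm_c": [True],                # solved immediately
    "thm_d": [False] * 10,          # never solved
}

rate = pass_at_k(example, k=64)
print(f"pass@64 = {rate:.1%}")
```

Because an instance counts as solved after a single success, sampling can stop early for that instance, which is consistent with the "up to k=64" wording.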

New metrics introduced:

Theorem-level coverage: fraction of supporting subtheorems proved within a parent theorem group
Token-efficiency: ratio of generated proof tokens to ground-truth proof tokens (measures verbosity)
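Both metrics reduce to simple ratios. The sketch below is an illustration under stated assumptions: whitespace splitting stands in for whatever tokenizer the authors actually use, and the proofs and lemma names are hypothetical.

```python
from statistics import median
from typing import Dict


def count_tokens(proof: str) -> int:
    # Assumption: whitespace-separated tokens as a stand-in for a real tokenizer.
    return len(proof.split())


def token_efficiency(generated: str, reference: str) -> float:
    """Generated proof tokens divided by reference proof tokens (higher = more verbose)."""
    reference_tokens = count_tokens(reference)
    if reference_tokens == 0:
        raise ValueError("reference proof must contain at least one token")
    return count_tokens(generated) / reference_tokens


def theorem_coverage(subtheorem_proved: Dict[str, bool]) -> float:
    """Fraction of supporting subtheorems proved within one parent theorem group."""
    if not subtheorem_proved:
        return 0.0
    return sum(subtheorem_proved.values()) / len(subtheorem_proved)


# Hypothetical (generated, reference) proof pairs.
pairs = [
    ("simp at h; exact h", "exact h"),
    ("omega", "omega"),
    ("norm_num [foo] at h ⊢", "norm_num"),
]
ratios = [token_efficiency(gen, ref) for gen, ref in pairs]
median_ratio = median(ratios)

# Hypothetical outcomes for one parent theorem group.
group = {"lemma_1": True, "lemma_2": False, "lemma_3": True, "lemma_4": False}
coverage = theorem_coverage(group)

print(f"median token-efficiency = {median_ratio}x, coverage = {coverage:.0%}")
```

Reporting the median rather than the mean keeps a handful of extremely long proofs from dominating the summary.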

Key results

Model	Plain-main pass@64	Premised pass@64	Median token-efficiency
DeepSeek-Prover-V2-7B	5.3%	26.3%	7.8×
Goedel-Prover-V2-8B	5.3%	12.3%	16×
Kimina-Prover-Distill-8B	5.3%	7.0%	3.6×
Goedel-Prover-SFT	3.5%	3.5%	1.44×

Explicit premises produce a roughly 5× lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but no gain for the non-reasoning SFT model (3.5% → 3.5%), which indicates the benefit is model-dependent, not mechanical.
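The lift for every model can be recomputed directly from the table. The snippet below is a small arithmetic check that treats the two pass@64 columns as directly comparable, which is an assumption; the variable and function names are illustrative.

```python
# pass@64 in percent, copied from the results table: (plain-main, premised).
results = {
    "DeepSeek-Prover-V2-7B": (5.3, 26.3),
    "Goedel-Prover-V2-8B": (5.3, 12.3),
    "Kimina-Prover-Distill-8B": (5.3, 7.0),
    "Goedel-Prover-SFT": (3.5, 3.5),
}


def premise_lift(plain: float, premised: float) -> float:
    """Multiplicative gain from exposing premises as explicit binders."""
    return premised / plain


lifts = {model: premise_lift(*rates) for model, rates in results.items()}

for model, lift in lifts.items():
    print(f"{model}: {lift:.2f}x")
```

The spread is wide: about 4.96× for DeepSeek-Prover-V2-7B, 2.32× for Goedel-Prover-V2-8B, 1.32× for Kimina-Prover-Distill-8B, and exactly 1× for the SFT baseline.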

Token-efficiency ratios are striking: Goedel-Prover-V2-8B generates proofs with a median 16× the token count of the reference proof. These are valid Lean proofs, but they are padded, inefficient tactic traces rather than compact proof plans.

Claim audit

Dimension	Assessment
Verification	Lean 4 kernel throughout: every proof is machine-checked, with no human review standing in for verification
Benchmark type	Classical theorems (Wiedijk list) — structural departure from miniF2F / PutnamBench competition style
Autonomy	Fully automated evaluation; no human-assisted proof attempts
Coverage	~100 parent theorems → 1,142 instances; algebra, number theory, analysis, topology, combinatorics, probability
Independence	Single-team evaluation (Skoltech/HSE/AIRI/Sberbank); no external replication reported yet
Key limitation	Only small models (7B–8B parameters) were tested; frontier agentic systems (LEAP, Goedel-Architect at full scale) were not evaluated

Context

TheoremBench sits between two already-covered benchmarks on this channel:

MiniF2F / PutnamBench: competition problems, typically self-contained; Goedel-Architect already reaches 99.2% and 88.8%, respectively
SorryDB: real-world sorry-closure from live Lean 4 projects, where Kimina reaches only 1.0% pass@1

TheoremBench fills the gap: classical mathematical developments with dependency structure, not adversarially crafted competition problems and not messy real-world sorry holes. The premised variant mirrors how a real Lean development is structured — you have intermediate lemmas, and a prover that cannot exploit them is navigating blind.

The token-efficiency finding adds a new diagnostic axis: passing verification is necessary but not sufficient. A prover that writes 16× the reference proof length looks more like brute-force tactic enumeration than genuine proof understanding.

Sources

arXiv preprint: https://arxiv.org/abs/2606.09450
HTML version (figures): https://arxiv.org/html/2606.09450v1