
ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

Source: arXiv:2510.15700 · Published 2025-10-17 · By Alex Gu, Bartosz Piotrowski, Fabian Gloeckle, Kaiyu Yang, Aram H. Markosyan

TL;DR

The paper addresses a critical bottleneck in neural theorem proving: proofs produced by state-of-the-art reinforcement-learning (RL) trained language models in systems like Lean are often excessively long and convoluted. Such proofs hinder human understanding, yield little mathematical insight, and slow down both verification and downstream learning. Prior approaches that apply off-the-shelf LLMs with agentic scaffolding fail to simplify these extremely long, RL-generated proofs, and supervised data for proof simplification is scarce. To overcome this, the authors present ProofOptimizer, the first language model explicitly trained to simplify Lean proofs without human-labeled simplification data. It combines a symbolic linter, a 7B-parameter model fine-tuned via expert iteration and RL with Lean verification as the reward signal, and an iterative inference-time algorithm that progressively shortens proofs. On the benchmarks miniF2F, PutnamBench, and Seed-Prover's IMO 2025 proofs, ProofOptimizer reduces proof length by 87%, 57%, and 49% respectively. Simplified proofs also check faster and improve downstream prover performance when used as training data. The paper demonstrates that training proof simplification models from scratch using formal-verification signals alone is viable and effective for scaling proof readability and utility.

Key findings

  • ProofOptimizer reduces proof length by 87% on miniF2F (average proof length 334 tokens), 57% on PutnamBench (avg. 1468 tokens), and 49% on Seed-Prover IMO 2025 proofs.
  • Expert iteration training steadily improves simplification quality across three rounds; final iterations achieve min@1 proof length of 241 tokens on miniF2F and 1328 tokens on PutnamBench.
  • ProofOptimizer-RL yields higher single-sample (min@1) reductions than expert iteration but suffers decreased diversity (smaller gap between min@1 and min@32).
  • Test-time RL further improves single-shot simplification on miniF2F to 72.5% relative length reduction.
  • A symbolic linter that removes redundant tactics shrinks proofs before the LM simplification step.
  • Iterative proof shortening (multiple rounds of sampling and keeping the shortest valid proof) substantially amplifies reductions, reaching 87.9% shrinkage on miniF2F after 8 iterations with up to 1024 samples in the final rounds.
  • Training a base prover on ProofOptimizer-simplified proofs improves miniF2F performance by 2% relative to training on original lengthy proofs.
  • Simplified proofs execute faster: 50 of 75 PutnamBench proofs speed up by at least 10%, and 22 of 75 speed up by over 50%; similar trends hold for miniF2F.
  • Optimizing for proof execution heartbeats (a runtime proxy) instead of length leads to fewer slowdowns while still achieving significant simplification.
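
To ground these numbers, here is an illustrative before/after pair in Lean 4 with Mathlib; the theorem and both proofs are our own minimal sketch, not an example taken from the paper:

```lean
import Mathlib

-- Illustrative only: a verbose proof that takes a detour through an
-- explicit rewrite...
theorem verbose_sq_nonneg (a b : ℝ) : 0 ≤ (a + b) ^ 2 := by
  have h : (a + b) ^ 2 = (a + b) * (a + b) := by ring
  rw [h]
  exact mul_self_nonneg (a + b)

-- ...and a one-tactic proof of the same statement, the kind of output a
-- simplifier should prefer.
theorem short_sq_nonneg (a b : ℝ) : 0 ≤ (a + b) ^ 2 := by positivity
```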

Threat model

Not applicable. The paper presents a constructive AI/ML method for proof simplification; there is no adversarial or security threat model.

Methodology — deep read

  1. Threat Model & Assumptions: No adversary is modeled; this is a constructive proof-simplification task rather than an adversarial setting. The system assumes access to formally verified Lean proofs generated by RL-trained provers and aims to improve brevity without sacrificing correctness. Invalid proofs are never emitted, since every candidate must pass Lean verification.

  2. Data: The training data consists of 145K verified Lean proofs generated by the Goedel-Prover-V2-32B model on extracted sub-theorems. Initial dataset creation involves collecting theorem problems (excluding trivial computational problems) from Goedel-Pset, generating rough proof sketches with the "sorry" tactic for partial steps, extracting those into individual theorems, and filtering trivial ones, yielding 307K sub-theorems. Goedel-Prover-V2-32B then generates full formal proofs for 145K of these, forming the simplification dataset. Evaluation is done on held-out sets from miniF2F (194 proofs, avg length 334 tokens), PutnamBench (75 proofs, avg length 1468 tokens), and four long proofs from Seed-Prover for IMO 2025.

  3. Architecture/Algorithm: ProofOptimizer is a 7B-parameter language model fine-tuned from a Lean base model (itself initialized from Qwen2.5-7B-Instruct and trained on tasks including formal proof generation and tactic prediction). It integrates: (i) a symbolic linter that uses Lean's linter.unusedTactic to strip extraneous tactics before the LM intervenes; (ii) the LM itself, which takes a tokenized Lean proof script as input and outputs candidate simplifications; and (iii) an iterative inference-time procedure that samples multiple candidates, keeps the shortest simplification verified by Lean, and repeats on the updated proof for multiple iterations.

  4. Training Regime: Two main recipes are used. Expert iteration (ExpIt) repeatedly collects simplified proofs from the current model, keeps those that reduce length by at least 20%, aggregates them into supervised (“teacher”) data, and fine-tunes the model; three rounds are reported. Online reinforcement learning (RL) uses the same dataset and optimizes a reward based on relative proof-length reduction, using an asynchronous GRPO variant with no KL regularization and advantage baselines averaged across 8 samples. Test-time RL fine-tunes the model weights directly on the evaluation proofs for further gains.

  5. Evaluation Protocol: Metrics are min@k (minimum proof length among k samples) and red@k (maximum relative reduction versus the original length); both are sketched in code after this list. k is typically 1 or 32, with up to 1024 samples used for iterative shortening. Comparisons are made against strong baselines, including Gemini-2.5-Pro with and without proof-state annotations. Ablations measure the effects of expert iteration vs. RL training, test-time RL, repair with execution feedback using Goedel-Prover, and iterative shortening. Gaussian smoothing is applied to RL curves to reduce noise. The evaluation sets are held out from training and include a distribution shift: PutnamBench proofs are on average 4x longer than the training proofs.

  6. Reproducibility: The paper describes datasets and training procedures in detail but does not mention public code or weights release. The data pipeline uses publicly available Lean codebases (Goedel-Pset), but the full 145K proof dataset generated by Goedel-Prover-V2-32B is likely proprietary.
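
The two headline metrics are simple to state precisely. A minimal sketch of min@k and red@k, assuming lengths are measured in tokens and that only Lean-verified samples are counted (the function and variable names are ours, not the paper's):

```python
def min_at_k(original_len: int, verified_lens: list[int]) -> int:
    """min@k: shortest verified proof among k samples, falling back to
    the original length if no sample verifies."""
    return min(verified_lens, default=original_len)

def red_at_k(original_len: int, verified_lens: list[int]) -> float:
    """red@k: largest relative length reduction among k samples."""
    return max(0.0, 1.0 - min_at_k(original_len, verified_lens) / original_len)

# Example: an original 334-token proof and three verified candidates.
print(red_at_k(334, [300, 120, 205]))  # -> 0.640...
```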

Example end-to-end: Given an original miniF2F proof of 334 tokens, the input first passes through the symbolic linter, which removes redundant tactics. The ProofOptimizer LM then samples 64 candidate simplifications, each verified in Lean; the shortest valid candidate is selected and, if it shortens the proof by at least 20%, becomes training data for expert iteration or RL. At inference time, the model runs up to 8 rounds of sampling k candidates (e.g., 64 or 1024), each round keeping the shortest valid proof and feeding it into the next. This progressive shortening ultimately cuts proof length by up to 87%.
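
A compact sketch of this loop, with the ≥20% expert-iteration filter included; `sample_fn` (the LM) and `verify_fn` (a call into the Lean checker) are hypothetical stand-ins, since the paper does not specify these components at the API level:

```python
from typing import Callable

def token_len(proof: str) -> int:
    # Whitespace split as a crude stand-in for Lean tokenization.
    return len(proof.split())

def shorten(
    proof: str,
    sample_fn: Callable[[str, int], list[str]],  # LM: (proof, k) -> candidates
    verify_fn: Callable[[str], bool],            # Lean verification
    rounds: int = 8,
    k: int = 64,
) -> str:
    """Iterative inference-time shortening: sample k candidates per round,
    keep the shortest one that verifies, and feed it into the next round."""
    best = proof
    for _ in range(rounds):
        valid = [c for c in sample_fn(best, k) if verify_fn(c)]
        if valid:
            shortest = min(valid, key=token_len)
            if token_len(shortest) < token_len(best):
                best = shortest
    return best

def keep_for_expert_iteration(original: str, simplified: str) -> bool:
    """ExpIt data filter: keep pairs whose simplification cuts length
    by at least 20%."""
    return token_len(simplified) <= 0.8 * token_len(original)
```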

Technical innovations

  • First language model trained to simplify formal proofs in Lean without requiring human-annotated simplification pairs, leveraging only Lean verification as supervision.
  • Integration of a symbolic linter removing redundant tactics as a preprocessing step to improve simplification efficiency.
  • Training via expert iteration combining LM inference with Lean verification to bootstrap supervised simplification data from synthetic proofs.
  • Use of online reinforcement learning optimizing a relative proof-length-reduction reward computed via Lean proof validation (see the sketch after this list).
  • An iterative inference-time simplification method sampling multiple candidates per iteration and progressively refining proofs.
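
The RL reward named above reduces to a short function. A sketch under our assumptions; the paper rewards relative proof-length reduction gated on Lean verification, but the exact functional form and clipping here are ours:

```python
def length_reward(original_tokens: int, candidate_tokens: int,
                  verified: bool) -> float:
    # Lean verification is the gate: invalid proofs earn zero reward.
    if not verified:
        return 0.0
    # Relative length reduction, clipped so that longer "simplifications"
    # are never rewarded.
    return max(0.0, 1.0 - candidate_tokens / original_tokens)

# The heartbeat variant from the key findings would swap token counts for
# Lean execution heartbeats, rewarding proofs that check faster.
```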

Datasets

  • Proof simplification training dataset — 145K verified Lean proofs — generated by Goedel-Prover-V2-32B on Goedel-Pset theorems
  • miniF2F — 194 proofs, average 334 tokens — publicly available formal math benchmark
  • PutnamBench — 75 proofs, average 1468 tokens — publicly available formal math benchmark
  • Seed-Prover IMO 2025 proofs — 4 proofs, ranging from 8.6K to 36.5K tokens — from a recent state-of-the-art RL prover release

Baselines vs proposed

  • Gemini-2.5-Pro (miniF2F): red@1 = 24.3% vs ProofOptimizer-RL: 63.6%
  • Gemini-2.5-Pro + states (miniF2F): red@1 = 26.4% vs ProofOptimizer-ExpIt: 49.0%
  • ProofOptimizer-ExpIt (miniF2F): min@1 = 241 tokens (down from 283 for the base model)
  • ProofOptimizer-RL + test-time RL (miniF2F): red@1 = 72.5% vs ExpIt alone: 49.0%
  • Gemini-2.5-Pro (PutnamBench): red@1 = 5.5% vs ProofOptimizer-RL + test-time RL: 23.8%
  • Seed-Prover IMO 2025 proofs: average reduction across 4 proofs = 49% (e.g., P1 reduced from 36478 to 20506 tokens)

Limitations

  • No public release of code or datasets limits reproducibility and external validation.
  • Evaluation focused primarily on length reduction and verification correctness; human interpretability or readability of simplified proofs not quantitatively measured.
  • RL training suffers from diversity collapse, limiting sample diversity and potentially overfitting to shorter but less varied simplifications.
  • Execution feedback repair methods produce longer proofs, indicating challenges in effectively leveraging repair without increasing proof size.
  • Experiments not conducted on proofs beyond Lean or different proof assistants, limiting generalizability.
  • Downstream benefits of proof simplification (better training, faster proof checking), while promising, are preliminary and need larger-scale validation.

Open questions / follow-ons

  • Can proof simplification models be extended to jointly optimize multiple metrics beyond length, such as readability, semantic clarity, or proof modularity?
  • How to effectively combine execution feedback repair with LM simplification to avoid producing longer, more complex proofs?
  • What are the limits of iterative proof shortening in terms of convergence speed and proof quality across increasingly complex formal mathematical domains?
  • Could training on simplified proofs systematically improve the capabilities of formal theorem provers beyond minor performance gains?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, ProofOptimizer demonstrates how language models, when tightly coupled with formal-verification feedback, can be trained end to end to perform complex source-code transformations without human-labeled data. Leveraging a strong automated verifier as an oracle for correctness-preserving model improvement could inspire analogous methods for generating or validating bot-detection challenges. The iterative inference strategy, combined with a symbolic preprocessor (the linter), exemplifies a workflow for safely bootstrapping complex logical transformations. Formal proof scripts are a highly specialized domain, so transferring these techniques to CAPTCHAs or bot-detection tokens would require significant adaptation. Still, the work highlights the value of deeply coupling symbolic verification with LLM-based generation to reduce complexity while preserving correctness, a principle central to reliable bot-defense systems.

Cite

```bibtex
@article{arxiv2510_15700,
  title={ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations},
  author={Alex Gu and Bartosz Piotrowski and Fabian Gloeckle and Kaiyu Yang and Aram H. Markosyan},
  journal={arXiv preprint arXiv:2510.15700},
  year={2025},
  url={https://arxiv.org/abs/2510.15700}
}
```

Read the full paper: https://arxiv.org/abs/2510.15700
