LeanTutor: Towards a Verified AI Mathematical Proof Tutor
Source: arXiv:2506.08321 · Published 2025-06-10 · By Manooshree Patel, Rayna Bhattacharyya, Thomas Lu, Arnav Mehta, Niels Voss, Narges Norouzi, and Gireeja Ranade
TL;DR
This paper presents LeanTutor, a proof-of-concept AI mathematical proof tutoring system that combines the language fluency of large language models (LLMs) with the formal correctness guarantees of the Lean theorem prover. While LLMs enable natural interaction, they are prone to errors and hallucinations, making them unreliable as standalone tutors; the Lean theorem prover provides formal verification but is difficult for students to use directly due to its complex syntax and backward reasoning style.

LeanTutor addresses this by integrating three core modules: an autoformalizer/proof checker that translates and verifies student-written natural language proofs into formal Lean tactics step-by-step; a next-step generator that proposes valid formal tactics continuing the student’s partial proof; and a natural language feedback generator that provides pedagogically useful feedback, hints, or guiding questions without revealing full solutions.

To evaluate these components rigorously, the authors introduce PeanoBench, a new dataset of 371 paired informal and formal proofs in Peano arithmetic derived from the Natural Numbers Game. Experiments show that including a reference staff solution in the context significantly improves autoformalization accuracy, especially for incorrect proofs, where step-by-step processing outperforms whole-proof translation. Human evaluation of feedback on incorrect proofs demonstrates LeanTutor’s advantage over a baseline in accuracy and relevance, although answer leakage remains a challenge in next-step hints. The work highlights fundamental challenges in faithful autoformalization and tutoring-oriented feedback generation while proposing a model-agnostic, modular framework linking LLM-based natural language understanding with mechanized proof verification.
Key findings
- Autoformalization tactic-level accuracy improves from 32.9% to 56.8% on correct proofs when incorporating a staff solution in context (Table 1).
- Proof-level autoformalization accuracy on incorrect proofs rises from 14.4% to 30.1% with staff solution context and step-by-step generation vs baseline.
- Step-by-step autoformalization outperforms whole-proof autoformalization on incorrect proofs by 8 percentage points in proof-level accuracy.
- 89% of correctly predicted tactics appear in the staff solution context, showing strong model reliance on the reference proof.
- Generated natural language feedback from LeanTutor scores higher in human evaluations for accuracy and relevance across error identification, hints, and next-step suggestions compared to baseline (Table 2).
- Readability and answer leakage scores are comparable between LeanTutor and the baseline; next-step hints show the expected higher answer leakage.
- A novel relaxed exact-matching accuracy metric for autoformalization is introduced, combining tactic string matching with semantic proof-state equivalence.
- The PeanoBench dataset includes 371 paired informal/formal proofs with 225 correct and 146 incorrect proofs, enabling evaluation of both correct and faulty student reasoning.
Threat model
The adversary consists of students submitting proofs in natural language for tutoring. They may provide partial, complete, correct, or incorrect proofs to the system. The system assumes a closed set of related Peano arithmetic theorems with at least one known formal staff solution. The adversary cannot submit proofs in arbitrary new domains or completely novel theorems. The system aims to verify or locate student errors and generate pedagogical feedback without leaking full solutions unless explicitly allowed.
Methodology — deep read
The authors define a tutoring-focused threat model where the adversary is a student submitting partial, complete, correct, or incorrect natural language mathematical proofs aiming to receive tutoring feedback. The system must autoformalize student proofs faithfully and provide meaningful next-step suggestions and feedback without giving away answers.
The PeanoBench dataset, derived from the Natural Numbers Game, consists of 371 proofs in Peano arithmetic: 225 correct proofs with three personas (staff solutions, equation-based, justification-based) and 146 incorrect proofs generated by programmatic step skipping to mimic plausible student errors. All natural language proofs were authored and proofread by paper authors to ensure formal-informal alignment; 1:1 correspondence between informal proof steps and formal Lean tactics was maintained.
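The programmatic step-skipping used to build PeanoBench's incorrect proofs can be sketched as a simple list corruption. This is a hypothetical reconstruction (the paper does not publish its generation script); `skip_step` and its seeding are assumptions for illustration:

```python
import random

def skip_step(proof_steps, rng=None):
    """Corrupt a correct proof by deleting one step, mimicking a
    student who skips a justification. Returns the corrupted proof
    and the index of the skipped step."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    if len(proof_steps) < 2:
        raise ValueError("need at least two steps to skip one")
    # Skip any step except the last, so the proof still "ends"
    # but no longer compiles at the gap.
    idx = rng.randrange(len(proof_steps) - 1)
    return proof_steps[:idx] + proof_steps[idx + 1:], idx

correct = ["induction n with k ih", "rfl", "rw [add_succ]", "rw [ih]"]
bad, skipped = skip_step(correct)
```

Pairing `bad` with the original NL steps (minus the skipped one) yields an aligned incorrect informal/formal proof of the kind the dataset contains.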
The autoformalizer/proof checker module translates the student’s natural language input into Lean tactics step-by-step using a general-purpose LLM prompted with the theorem statement, a dictionary of theorem and tactic descriptions, five-shot in-context examples, and a staff solution proof in both natural and formal language as contextual grounding. After generating each tactic, the extended partial proof is compiled with LeanInteract to verify correctness or identify errors. If compilation passes with remaining unsolved goals, the step is accepted; compiler errors mark a formalization failure or student error.
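The generate-then-compile loop above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `translate` stands in for the prompted LLM call, and `compiles` stands in for the LeanInteract compilation check (both interfaces are assumed for illustration):

```python
def autoformalize_stepwise(nl_steps, theorem, staff_solution,
                           translate, compiles):
    """Translate an NL proof step-by-step into Lean tactics,
    compiling the growing partial proof after every step.
    Returns (verified tactic prefix, index of failing step or None)."""
    tactics = []
    for i, step in enumerate(nl_steps):
        # The LLM sees the theorem, the staff solution, and the
        # tactics accepted so far (plus dictionaries and few-shot
        # examples in the real prompt).
        tactic = translate(step, theorem, staff_solution, tactics)
        candidate = tactics + [tactic]
        ok, unsolved_goals = compiles(theorem, candidate)
        if not ok:
            # A compiler error marks either a student error at this
            # step or a formalization failure; downstream modules
            # must disambiguate.
            return tactics, i
        tactics = candidate
    return tactics, None  # fully formalized

# Toy stand-ins: "translate" looks up a known tactic table, and
# "compiles" accepts any tactic other than the placeholder "sorry".
table = {"base case is trivial": "rfl"}
ok_proof, err = autoformalize_stepwise(
    ["base case is trivial"], "0 + 0 = 0", ["rfl"],
    lambda s, t, sol, pre: table.get(s, "sorry"),
    lambda t, cand: (all(x != "sorry" for x in cand), 0))
```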
The next-step generator (NSG), activated if the proof is incomplete or contains errors, performs an LLM-directed depth-first search over 12 candidate tactics, filtered by Lean compilation and a progress check that avoids forbidden theorems and cyclic tactics. Tree search depth is bounded at 8 steps. The NSG proposes formal next tactics consistent with a partial proof to guide student corrections or continuation.
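The bounded DFS over LLM-proposed tactics might look like the sketch below, under stated assumptions: `candidates`, `compiles`, `makes_progress`, and `is_closed` are hypothetical stand-ins for the paper's LLM proposer, Lean compilation, progress check (forbidden theorems, cycle detection), and goal-closure test:

```python
def next_step_dfs(partial, candidates, compiles, makes_progress,
                  is_closed, depth=8):
    """Depth-first search over candidate tactics, bounded at `depth`.
    Keeps a candidate only if the extended proof compiles and passes
    the progress check; returns the first tactic of a continuation
    that closes the goal, or None if none is found."""
    if depth == 0:
        return None
    for tac in candidates(partial):  # up to 12 LLM suggestions
        extended = partial + [tac]
        if not compiles(extended) or not makes_progress(extended):
            continue
        if is_closed(extended):
            return tac
        if next_step_dfs(extended, candidates, compiles,
                         makes_progress, is_closed, depth - 1):
            return tac
    return None

# Toy goal: reach the fixed tactic sequence `target`.
target = ["induction n", "rfl", "rw [ih]"]
step = next_step_dfs(
    ["induction n"],
    lambda p: ["rw [ih]", "rfl"],          # candidate proposer
    lambda e: e == target[:len(e)],        # "compiles" = valid prefix
    lambda e: True,                        # progress check (trivial)
    lambda e: e == target)                 # goal closed
```

In the toy run the search rejects `rw [ih]` at depth 1 (not a valid prefix) and returns `rfl` as the next step.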
The feedback generator ingests the partial formalized student proof, any compiler error messages, and the NSG’s best next tactic to produce three natural language feedback types: error identification (diagnosing common inductive proof mistakes), hints/questions guiding student reasoning, and explicit next-step suggestions. This leverages known error patterns and autoinformalization techniques.
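Assembling the feedback-generator input for the three feedback types could be sketched as below; the template strings and function are illustrative assumptions, not the paper's actual prompts:

```python
def build_feedback_prompt(student_proof, lean_error, next_tactic, mode):
    """Assemble a feedback prompt from the formalized partial proof,
    any compiler error, and the NSG's best next tactic. `mode`
    selects one of the three feedback types (hypothetical templates)."""
    templates = {
        "error": "Identify the mistake, citing common inductive-proof errors.",
        "hint": "Ask a guiding question; do not reveal the next step.",
        "next_step": "Describe the next step in natural language.",
    }
    return "\n".join([
        f"Student proof so far:\n{student_proof}",
        f"Lean compiler output:\n{lean_error or '(no error)'}",
        f"Suggested next tactic (internal; do not quote verbatim): {next_tactic}",
        templates[mode],
    ])

p = build_feedback_prompt("rfl", None, "rw [add_succ]", "hint")
```

Keeping the NSG tactic "internal" in the hint template is one plausible way to curb the answer leakage discussed above.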
For evaluation, the authors propose a novel relaxed exact-matching metric for faithful autoformalization. It first attempts an exact string match of the predicted vs. ground-truth tactic; failing that, it compares the compiled proof states syntactically, modulo variable renaming. Proof-level accuracy requires all tactics to be correct, in order. For incorrect proofs, successful formalization requires a fully correct formalization up to the first erroneous step, with the invalid step triggering a compiler error.
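The two-phase relaxed match can be sketched as a string comparison followed by proof-state canonicalization. This is a simplified sketch: the renaming scheme, the `KEYWORDS` set, and the `state_of` interface are assumptions, not the paper's exact procedure:

```python
import re

# Symbols that are global constants and must never be renamed
# (an assumed, illustrative set).
KEYWORDS = {"Nat", "succ", "zero"}

def normalize_state(state):
    """Canonicalize a pretty-printed proof state modulo variable
    renaming: each non-keyword identifier becomes v0, v1, ... in
    order of first appearance."""
    names, out = {}, []
    for tok in re.findall(r"[A-Za-z_]\w*|\S", state):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            tok = names.setdefault(tok, f"v{len(names)}")
        out.append(tok)
    return " ".join(out)

def tactic_matches(pred, gold, state_of):
    """Relaxed exact match: exact tactic string, else equivalence of
    the resulting compiled proof states modulo renaming."""
    if pred.strip() == gold.strip():
        return True
    return normalize_state(state_of(pred)) == normalize_state(state_of(gold))

def proof_level_accurate(preds, golds, state_of):
    """Proof-level accuracy: every tactic must match, in order."""
    return len(preds) == len(golds) and all(
        tactic_matches(p, g, state_of) for p, g in zip(preds, golds))
```

For example, `intro n` and `intro m` differ as strings but yield proof states that normalize identically, so they count as a match.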
Experiments compare the baseline prompt (theorem, dictionaries, examples) against the baseline plus staff-solution context, in both step-by-step and whole-proof formalization variants. Feedback quality is scored on accuracy, relevance, readability, and answer leakage by two human evaluators (graduate students in math teaching roles), blind to model source. A GPT-4 judge evaluation was also conducted but discounted due to disagreement with the human scores.
Reproducibility: The paper provides extensive prompt details, dictionary and staff-solution descriptions, and PeanoBench dataset provenance; core code and trained weights are not explicitly noted as released. The system operates purely by prompting the GPT-4o-mini-2024-07-18 LLM, so compute costs are low (~$4 per full evaluation). Staff-solution inclusion is required as a scaffold for successful autoformalization.
A concrete example illustrates stepwise translation of an inductive proof: a student’s NL base case and inductive step are autoformalized into corresponding Lean tactics and compiled for correctness, then the NSG proposes the next tactic if the proof is incomplete. The feedback generator analyzes divergences and compiler errors to produce targeted natural language hints for the student.
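The kind of NNG4-style inductive proof being translated might look like this in Lean 4 (an illustrative example written for this summary, not taken from the paper's dataset; lemma names follow the standard library):

```lean
-- NL proof being formalized, step by step:
--   "We induct on n. The base case holds by reflexivity.
--    In the inductive step, rewrite with add_succ and then
--    apply the inductive hypothesis."
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                          -- base case: 0 + 0 = 0
  | succ k ih => rw [Nat.add_succ, ih]   -- inductive step
```

Each `|` branch corresponds one-to-one with an NL proof step, matching the alignment PeanoBench enforces.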
Technical innovations
- A modular tutoring architecture combining LLM-based step-by-step autoformalization, Lean theorem proving verification, and natural language feedback generation.
- Introduction of PeanoBench, an aligned informal/formal proof dataset with 371 paired proofs incorporating multiple natural language personas and both correct and incorrect proof variants.
- A novel two-phase relaxed exact matching metric for faithful autoformalization accuracy measuring tactic string and compiled proof state equivalence modulo variable renaming.
- Use of staff solution proofs as in-context knowledge combined with a tactic/theorem dictionary to improve stepwise autoformalization in a tutoring setting.
- Adaptation of LLM-directed depth-first proof search with compilation progress checks (excluded forbidden theorems, cycle detection) for next-step tactic generation.
Datasets
- PeanoBench — 371 proofs (225 correct, 146 incorrect) — derived from the Natural Numbers Game (NNG4) with human-authored NL and Lean formal proofs.
Baselines vs proposed
- Baseline autoformalizer: tactic-level correct proofs accuracy = 32.9% ± 3.1% vs Baseline + Staff Solution: 56.8% ± 3.2%.
- Baseline autoformalizer: incorrect proofs proof-level accuracy = 14.4% ± 5.7% vs Baseline + Staff Solution: 30.1% ± 7.4%.
- Step-by-step autoformalization improves incorrect-proof accuracy by 8 percentage points over whole-proof autoformalization with staff-solution context.
- Feedback evaluation accuracy scores: Baseline next-step = 2.8 vs LeanTutor next-step = 3.7; hint/question accuracy: Baseline = 3.1 vs LeanTutor = 3.7 (Table 2).
- Feedback relevance scores: Baseline next-step = 2.9 vs LeanTutor = 3.9; readability scores comparable across models.
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2506.08321.

Fig 1: LeanTutor comprises three modules: an autoformalizer/proof checker that formalizes an NL student proof into Lean tactics, a next-step generator that proposes valid continuations of the partial proof, and a natural language feedback generator.
Limitations
- Assumes a strict one-to-one correspondence between natural language proof steps and Lean tactics, which may not generalize to more complex proofs.
- Relies on the availability of a formalized staff solution proof in both natural and formal language as context, which may pose scalability challenges for instructors.
- Incorrect proof dataset only models step-skipping errors; other common student logic or reasoning errors are not captured.
- Natural language proofs are authored by paper authors rather than real students, potentially limiting linguistic diversity and realism of errors.
- Compiler error messages sometimes conflate student errors and autoformalization errors; the system cannot always disambiguate.
- Feedback generation still suffers from answer leakage, especially in explicit next-step suggestions.
- No evaluation under distributional shift or with student-authored natural language inputs, limiting generalizability assessments.
Open questions / follow-ons
- How to scale autoformalization beyond one-to-one proof step correspondence, accommodating more intricate or multi-step natural language reasoning.
- Methods to reduce dependence on staff solution proofs or generate them automatically to ease instructor burden.
- Improved modeling and generation of diverse student errors beyond simple skipped steps for more realistic tutoring scenarios.
- Strategies to mitigate answer leakage in next-step feedback, balancing pedagogical guidance and challenge.
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners focused on automated reasoning and challenge generation, LeanTutor demonstrates an innovative approach to combining natural language understanding with formal verification via theorem provers. Its modular framework for stepwise autoformalization, error localization, and guided feedback can inspire robust challenge designs that require both linguistic and logical precision from users, potentially increasing resistance to automated solvers lacking formal reasoning capabilities. The relaxed exact matching metric offers a benchmark for validating semantic equivalence despite syntactic variations, which is useful for fidelity assessments in challenge-response systems. However, the system’s reliance on curated proofs and specific logical domains highlights the challenge of scaling to open-ended, diverse content typical in CAPTCHA or bot-defense tasks.
Cite
@article{arxiv2506_08321,
  title={LeanTutor: Towards a Verified AI Mathematical Proof Tutor},
  author={Manooshree Patel and Rayna Bhattacharyya and Thomas Lu and Arnav Mehta and Niels Voss and Narges Norouzi and Gireeja Ranade},
  journal={arXiv preprint arXiv:2506.08321},
  year={2025},
  url={https://arxiv.org/abs/2506.08321}
}