10.7.25

CriticLean makes the AI “grader” the hero of math formalization

Automating the translation of plain-English math into Lean code has felt like grading your own exam: a language model drafts the formal statement, a compiler checks syntax, and everyone hopes the semantics line up. CriticLean flips that script by training a dedicated critic model—dubbed CriticLeanGPT—that learns to catch logical slips the compiler can’t. Guided by reinforcement learning, that critic doesn’t just reject bad code; it drives an iterative rewrite loop that more than doubles end-to-end accuracy.
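To see the kind of slip a compiler happily waves through, here is a toy illustration (my own, not an example from the paper): both statements below elaborate without complaint, yet only the first says what the English prompt says.

```lean
import Mathlib

-- Prompt: "Prove that every positive real x satisfies x + 1/x ≥ 2."

-- Faithful translation: the positivity hypothesis is kept.
theorem faithful (x : ℝ) (hx : 0 < x) : x + 1 / x ≥ 2 := by
  sorry

-- This also compiles, but it silently drops `0 < x`, turning the claim
-- into a false statement (try x = -1). Only a semantic check catches it.
theorem unfaithful (x : ℝ) : x + 1 / x ≥ 2 := by
  sorry
```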

From passive judge to active coach

The team fine-tunes a lightweight Qwen backbone to score whether a Lean statement truly matches its natural-language prompt, then bakes those scores into a reward signal. Each failed attempt becomes a teaching moment, producing richer feedback than the usual “compiler error” one-liner. The critic also powers CriticLeanBench, a 500-item test set (half correct, half adversarially wrong) that shows CriticLeanGPT trouncing both open and closed-source baselines at spotting semantic mistakes.
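The paper doesn’t spell out this interface, but the core idea can be sketched in a few lines: compilation acts as a gate, and the critic’s semantic score carries the reward. Here `compiles` and `critic_score` are assumed callables standing in for the Lean compiler and CriticLeanGPT, not the authors’ actual API.

```python
# Minimal sketch (not the paper's implementation) of folding a critic's
# verdict into a reward signal for the formalization loop.
from typing import Callable


def rl_reward(
    prompt: str,
    lean_code: str,
    compiles: Callable[[str], bool],
    critic_score: Callable[[str, str], float],  # faithfulness score in [0, 1]
) -> float:
    if not compiles(lean_code):
        return 0.0                          # syntax failures earn nothing
    return critic_score(prompt, lean_code)  # semantics decide the payoff
```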

Hard numbers: 38% → 84% accuracy

On a 50-problem slice of the Omni-MATH benchmark, a 7B “Kimina-Autoformalizer” model alone solved just 38% of tasks. A traditional compiler-feedback loop nudged that to 54%. Swap in CriticLean’s RL-trained critic and the success rate soars to 84%—a 30-point leap even seasoned theorem-prover veterans will notice.

A broader 500-problem stress test tells the same story: the multi-attempt CriticLean pipeline verified 52.8% of statements under a 200-try cap, recovering roughly forty extra percentage points of yield that single-pass systems would toss out.
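Roughly, such a budgeted loop might look like the sketch below, where `generate`, `compile_ok`, and `critique` are assumed stand-ins for the autoformalizer, the Lean compiler, and the critic rather than the paper’s actual interfaces.

```python
# Hedged sketch of the multi-attempt idea: keep generating and revising a
# Lean statement until the critic accepts it or the try budget runs out.
from typing import Callable, Optional


def formalize(
    prompt: str,
    generate: Callable[[str, str], str],               # (prompt, feedback) -> Lean code
    compile_ok: Callable[[str], bool],                 # does the statement compile?
    critique: Callable[[str, str], tuple[bool, str]],  # -> (accepted?, feedback)
    max_tries: int = 200,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_tries):
        code = generate(prompt, feedback)
        if not compile_ok(code):
            feedback = "the statement failed to compile"
            continue
        accepted, feedback = critique(prompt, code)
        if accepted:
            return code        # critic-verified formalization
    return None                # still unverified after the full budget
```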

A new 285k-problem corpus (and 36k “diamond” stumpers)

Because the critic can certify semantic correctness without humans, the authors bootstrapped FineLeanCorpus, a 285,957-entry Lean dataset spanning 16 math domains with a flatter difficulty curve than the skewed Lean-Workbook previously used for fine-tuning. They also carved out a FineLeanCorpus-Diamond subset—36k brutal problems meant to push future models beyond textbook algebra.

Why this matters

  • Reliability over compilation. Syntax is easy; semantics are king. CriticLean proves that investing compute in the grading phase pays bigger dividends than ever-bigger generators.

  • Plug-and-play RL recipe. The critic-guided loop is model-agnostic and could supervise any auto-formalizer—Lean, Isabelle, even Coq.

  • Dataset flywheel. With FineLeanCorpus open-sourced, researchers finally have a large, semantically vetted playground instead of noisy web scrapes.

Whether you’re chasing fully automated theorem proving or just want ChatGPT to stop hallucinating Lean syntax, CriticLean’s message is clear: the smartest way forward is to teach your models how to critique themselves.

Paper link: arXiv 2507.06181 (PDF)
