Wandering Nomad

10.7.25

CriticLean makes the AI “grader” the hero of math formalization

Automating the translation of plain-English math into Lean code has felt like grading your own exam: language models write a proof, a compiler checks syntax, and everyone hopes the semantics line up. CriticLean flips that script by training a dedicated critic model—dubbed CriticLeanGPT—that learns to catch logical slips the compiler can’t. Guided by reinforcement learning, that critic doesn’t just reject bad code; it drives an iterative rewrite loop that more than doubles end-to-end accuracy.

From passive judge to active coach

The team fine-tunes a lightweight Qwen backbone to score whether a Lean statement truly matches its natural-language prompt, then bakes those scores into a reward signal. Each failed attempt becomes a teaching moment, producing richer feedback than the usual “compiler error” one-liner. The critic also powers CriticLeanBench, a 500-item test set (half correct, half adversarially wrong) that shows CriticLeanGPT trouncing both open and closed-source baselines at spotting semantic mistakes.

Hard numbers: 38 % → 84 % accuracy

On a 50-problem slice of the Omni-MATH benchmark, a 7 B “Kimina-Autoformalizer” model alone solved just 38 % of tasks. A traditional compiler-feedback loop nudged that to 54 %. Swap in CriticLean’s RL-trained critic and the success rate soars to 84 %—a 30-point leap even seasoned theorem-prover veterans will notice.

A broader 500-problem stress test tells the same story: the multi-attempt CriticLean pipeline verified 52.8 % of statements under a 200-try cap, recovering forty extra points of yield that single-pass systems would toss out.

A new 285 k-problem corpus (and 36 k “diamond” stumpers)

Because the critic can certify semantic correctness without humans, the authors bootstrapped FineLeanCorpus, a 285 ,957-entry Lean dataset spanning 16 math domains with a flatter difficulty curve than the skewed Lean-Workbook previously used for fine-tuning. They also carved out a FineLeanCorpus-Diamond subset—36 k brutal problems meant to push future models beyond textbook algebra.

Why this matters

Reliability over compilation. Syntax is easy; semantics are king. CriticLean proves that investing compute in the grading phase pays bigger dividends than ever-bigger generators.
Plug-and-play RL recipe. The critic-guided loop is model-agnostic and could supervise any auto-formalizer—Lean, Isabelle, even Coq.
Dataset flywheel. With FineLeanCorpus open-sourced, researchers finally have a large, semantically vetted playground instead of noisy web scrapes.

Whether you’re chasing fully automated theorem proving or just want ChatGPT to stop hallucinating Lean syntax, CriticLean’s message is clear: the smartest way forward is to teach your models how to critique themselves.

Paper link: arXiv 2507.06181 (PDF)

Phi-4-mini-flash-reasoning: Microsoft’s 3.8 B “Pocket” LLM that Delivers 10× Faster Math & Logic on Edge Devices

Why Another “Mini” Phi Model?

After a year of shipping tightly-focused small language models (SLMs) for reasoning, Microsoft’s Azure AI team has unveiled Phi-4-mini-flash-reasoning—a drop-in upgrade to the earlier Phi-4-mini that targets one pain point: speed. Where the original model excelled at step-by-step maths and logic, the new flash edition achieves up to 10 × higher token throughput and 2-3 × lower latency without sacrificing accuracy. It is purpose-built for resource-constrained hardware such as mobile handsets, single-GPU servers, classroom laptops, and IoT gateways.

Inside the New Architecture

Innovation	What It Does	Why It Matters
SambaY Self-Decoder	Blends state-space Mamba blocks with Sliding-Window Attention (SWA).	Provides linear-time prefilling and local context capture.
Gated Memory Units (GMU)	Tiny gating layers share representations between decoder blocks.	Slashes compute during generation without harming quality.
Decoder-Hybrid-Decoder Layout	One full-attention layer for KV cache, surrounded by lightweight Sambas and GMUs.	Maintains long-context power (64 K tokens) while accelerating every other step.

Together these tricks let Phi-4-mini-flash-reasoning outrun not only its mini predecessor but also larger 6-7 B dense models on vLLM in real-time tests.

Benchmark Snapshot

Metric (single A100-80 GB)	Phi-4-mini-flash	Phi-4-mini	Llama-3-8B-Instruct
Inference latency (256 tok)	≈ 40 ms	95 ms	120 ms
Throughput (tok/s)	> 1 000	110	240
AIME 24/25 (Math, Pass@1)	72 %	70 %	68 %
Math500	81 %	78 %	73 %
GPQA-Diamond	62 %	60 %	55 %

Microsoft internal numbers shown in the blog post graphs

Developer Access & Tooling

Open Weights: Download from Hugging Face or the NVIDIA API Catalog under a permissive MIT-style licence.
Azure AI Foundry: One-click deployment with managed GPUs, safety filters, and function-calling.
vLLM-Ready: Microsoft supplies a reference --flash config enabling the advertised latency on a single GPU.
Edge Builds: TensorRT-LLM and ONNX Runtime packages for Jetson Orin, Apple Silicon, and high-end Android phones.

Ideal Use-Cases

On-Device STEM Tutors – Real-time solution steps for maths homework without cloud calls.
Industrial Logic Controllers – Quick symbolic reasoning for quality-control or robotics arms.
AR/VR Headsets – Localised puzzle hints or game logic with < 50 ms response.
Classroom Labs – Affordable single-GPU servers hosting dozens of simultaneous reasoning sessions.

Looking Ahead

The Azure team hints that the SambaY + GMU blueprint will flow into Phi-4-multimodal-flash later this year, targeting low-latency image and audio reasoning on the same small-footprint devices. Meanwhile, Phi-4-mini-flash-reasoning is live today—ready for developers who need big-brain logic in a micro power envelope.

Whether you’re building an educational app, a smart sensor, or just trimming cloud compute bills, “flash” Phi brings full reasoning to the edge—no compromise required.

Meta AI’s grand blueprint for embodied agents: put a world model at the core

Move over “chatbots with arms.” Meta AI has published a sweeping manifesto that recasts embodied intelligence as a world-model problem. The 40-page paper, Embodied AI Agents: Modeling the World (July 7, 2025), is signed by a who’s-who of researchers from EPFL, Carnegie Mellon, NTU and Meta’s own labs, and argues that any meaningful agent—virtual, wearable or robotic—must learn a compact, predictive model of both the physical and the mental worlds it inhabits.

Three kinds of bodies, one cognitive engine

The authors sort today’s prototypes into three buckets:

Virtual agents (think emotionally intelligent avatars in games or therapy apps)
Wearable agents that live in smart glasses and coach you through daily tasks
Robotic agents capable of general-purpose manipulation and navigation

Despite wildly different form factors, all three need the same six ingredients: multimodal perception, a physical world model, a mental model of the user, action & control, short-/long-term memory, and a planner that ties them together.

What “world modeling” actually means

Meta’s framework breaks the catch-all term into concrete modules:

Multimodal perception – image, video, audio and even touch encoders deliver a unified scene graph.
Physical world model – predicts object dynamics and plans low- to high-level actions.
Mental world model – tracks user goals, emotions and social context for better collaboration.
Memory – fixed (weights), working and external stores that support life-long learning.

The paper contends that current generative LLMs waste compute by predicting every pixel or token. Instead, Meta is experimenting with transformer-based predictive models and JEPA-style latent learning to forecast just the state abstractions an agent needs to plan long-horizon tasks.

New benchmarks to keep them honest

To measure progress, the team proposes a suite of “world-model” stress tests—from Minimal Video Pairs for perceptual prediction to CausalVQA and the WorldPrediction benchmark that evaluates high-level procedural planning. Early results show humans near-perfect and SOTA multimodal models barely above chance, highlighting the gap Meta hopes to close.

Where they’re headed next

Two research directions top the agenda:

Embodied learning loops that pair System A (learning by passive observation) with System B (learning by physical action), each bootstrapping the other.
Multi-agent collaboration, where a family of specialized bodies—your glasses, a kitchen robot, and a home avatar—share a common world model and negotiate tasks.

Ethics is a running theme: privacy for always-on sensors and the risk of over-anthropomorphizing robots both get dedicated sections.

Why it matters

Meta isn’t open-sourcing code here; it’s setting the intellectual agenda. By declaring world models—not ever-larger GPTs—the “missing middle” of embodied AI, the company positions itself for a future where agents must act, not just talk. Expect the next iterations of Meta’s smart-glasses assistant (and perhaps its humanoid robot partners) to lean heavily on the blueprint sketched in this paper.

Paper link: arXiv 2506.22355 (PDF)