Showing posts with label Reinforcement Learning.

12.9.25

A Survey of Reinforcement Learning for Large Reasoning Models: mapping the promise and the gaps

 Reinforcement learning (RL) isn’t new—but as Large Language Models (LLMs) evolve into reasoning machines, RL is taking a central role not just in alignment, but in building reasoning itself. A new survey, “Reinforcement Learning for Large Reasoning Models (LRMs)” by a large group from Tsinghua, Shanghai AI Lab, SJTU, and others, lays out an exhaustive map of the nascent field: what’s working, what’s risky, and what future architects need to solve. 


What the survey covers

The paper dives into the core building blocks of using RL in reasoning-centered LLMs (often called LRMs): how to define rewards, what training algorithms are in play, how sampling strategies are evolving, and how infrastructure and task domains factor into the picture. It considers both alignment-adjacent RL (e.g. RLHF, preference learning) and RL whose goal is reasoning performance (accuracy, planning, reflection). 


Key themes and insights

  1. Reward design
    The survey classifies rewards into several types:

    • Verifiable rewards (e.g. test correctness, unit tests, exact checks) when tasks allow.

    • Generative / learned reward models for subjective or open domains.

    • Dense rewards vs outcome-only reward schemes—bringing signal into intermediate reasoning steps.

    • Unsupervised or weak rewards when neither full correctness metrics nor human feedback are feasible.
      The authors emphasize that tasks with strong verifiability tend to yield more reliable RL learning; a minimal sketch of such a verifiable reward follows this list.

  2. Policy optimization & sampling strategies
    There’s a broad sweep of algorithms: policy gradients, off-policy methods, regularized RL, hybrid approaches, critic-based vs critic-free methods. Sampling strategies—how you gather candidate outputs or intermediate chains—have big effects both on performance and on compute cost. Dynamic / structured sampling (e.g. adaptively adjusting paths, beam vs sampling) is becoming more common. 

  3. Foundational problems and gaps
    Several of these stand out:

    • Distinguishing when RL improves reasoning vs just memorization.

    • Balancing weak model priors: does your base LLM already encode reasoning bias, or do you need to train from scratch?

    • The trap of over-rewarding narrow achievements, i.e. reward hacking.

    • Challenges in reward specification in subjective domains.

    • Scaling issues: compute, infrastructure, verifying many candidates. 

  4. Training resources & infrastructure
    The survey catalogues the spectrum of environments and corpora used: from static datasets to dynamic environments (interactive tasks, tool usage), from single-task to multi-agent setups. It also considers RL frameworks and infrastructure tools (e.g. RL pipeline libraries) that enable reproducible LLM+RL research. 

  5. Applications
    RL for LRMs has been used in:

    • Coding: unit tests, code correctness, reflection.

    • Agentic tasks: agents using tools, web retrieval, planning.

    • Multimodal reasoning: vision-language tasks, code+images.

    • Robotics / medical / scientific domains. Each has its own reward/verification constraints. 
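
To make the verifiability point concrete, here is a minimal sketch—our illustration, not the survey’s code—of a unit-test-based binary reward for coding tasks; the harness and names are hypothetical:

```python
import subprocess
import sys
import tempfile

def verifiable_code_reward(candidate: str, test_code: str, timeout: float = 5.0) -> float:
    """Binary verifiable reward: 1.0 if the candidate passes the tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hangs count as failures

# Exact checks make the signal cheap and unambiguous:
candidate = "def add(a, b):\n    return a + b"
print(verifiable_code_reward(candidate, "assert add(2, 3) == 5"))  # -> 1.0
```

Signals like this are exactly why code and math dominate RL-for-reasoning work: the reward is exact, cheap to compute, and far harder to game than a learned judge.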


Why it matters & what to watch next

  • Reasoning as an explicit target. RL is being woven into models not just to be more “helpful” or “safe,” but to reason more deeply: plan, reflect, self-correct.

  • Verifiability is a powerful lever. Where tasks allow for exact or semi-exact verification, RL works well. When reward is fuzzy, progress is slower and riskier.

  • Cost and scalability are fundamental constraints. As LRMs become larger and used with more test-time compute (more chain-of-thought, more candidate generations), RL training and inference costs balloon; infrastructure and sampling strategy choices can make or break feasibility.

  • Hybrid and co-evolving reward models are growing. There’s increasing interest in reward models that both learn and evolve alongside the LLM, or in having the model itself critique or verify its own work.


Takeaways for researchers and builders

  • If you’re designing RL for reasoning tasks, aim for verifiable reward signals where possible—they give cleaner gradients and fewer surprises.

  • Pay attention to sampling strategy—generating more candidates or reasoning branches helps, but only when combined with selective reinforcement.

  • For subjective or “open” tasks (creative writing, alignment, etc.), you likely need sophisticated reward models, rubric-based or generative rewards, and strong regularization.

  • Infrastructure matters: your ability to scale RL—candidate generation, verifiers, tool-execution environments, caching, and the like—significantly affects what you can achieve.


Bottom line: This survey is a timely, comprehensive lookup table for anyone playing at the intersection of LLMs, RL, and reasoning. It confirms that reward design and verifiability are major levers, that RL is now essential for pushing reasoning as a capability, but also that many technical, infrastructural, and algorithmic challenges remain before “reasoning superintelligence.”

Paper link: arXiv 2509.08827 (PDF)

11.9.25

Parallel-R1: Teaching LLMs to reason from multiple angles—permanently

 Modern large language models (LLMs) often reason sequentially—one thought chain at a time. Parallel thinking, in contrast, involves spawning multiple reasoning paths (or perspectives), then merging the insights. While prompting tricks can induce this behavior at inference, they carry heavy overhead and brittle generalization. Parallel-R1, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a training-time RL framework for instilling parallel thinking as a native reasoning strategy. 


What is Parallel-R1

The key idea: don’t just prompt models to use parallel paths—train them to do so. Parallel-R1 has a progressive curriculum:

  1. Cold start (format learning via SFT) — teach the model the syntax/tags of parallel blocks (e.g. <Parallel>, <Path>...</Path>, <Summary>), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.

  2. Reinforcement learning (RL) on easy tasks, to explore the use of parallel thinking, with a reward that combines correctness with usage of parallel structure. 

  3. RL on more difficult problems (e.g. DAPO, AMC, AIME), so the model generalizes both performance and the parallel thinking style. 

The architecture has two variants: a causal (structure-agnostic) version and a structured version. The structured version modifies the attention mechanism (via path-window masking, separate position encodings) so paths are more isolated during reasoning. But structured variants show trade-offs—good for generalization in some settings, but less robust under distribution shift.
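
To make the reward design concrete, here is a hedged sketch of a correctness-plus-structure reward with the alternating ACC/PAR idea described below; the tag checks and the switch period are our assumptions, not the authors’ exact implementation:

```python
import re

def parallel_structure_used(response: str) -> bool:
    """True if the response contains a well-formed <Parallel> block with at
    least two <Path> branches and a <Summary> (tag names from the paper)."""
    block = re.search(r"<Parallel>(.*?)</Parallel>", response, re.DOTALL)
    if block is None:
        return False
    paths = re.findall(r"<Path>.*?</Path>", block.group(1), re.DOTALL)
    return len(paths) >= 2 and "<Summary>" in block.group(1)

def reward(response: str, is_correct: bool, step: int, period: int = 8) -> float:
    """Alternating schedule (sketch): most steps reward correctness (ACC);
    every `period`-th step rewards parallel structure instead (PAR)."""
    if step % period == 0:
        return 1.0 if parallel_structure_used(response) else 0.0
    return 1.0 if is_correct else 0.0
```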


Results & gains

On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:

  • The “Seen” variant (causal) achieves ~48.9% average across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks. 

  • In particular, on AIME’25, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly. 

  • The structured (Unseen) variant also performs well under certain reward schedules; the “alternating ACC/PAR” reward schedule (switching between rewarding correctness and parallel structure periodically) helps balance parallel usage and performance. 

Beyond numerical gains, the authors observe a behavioral shift: early in training, the model heavily uses parallel paths as an exploration tool, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for verification near the end of reasoning. This shift correlates with stronger final performance. 


Why this matters

  • Performance & efficiency trade-off: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost, since parallel paths are triggered only when needed.

  • Better than imitation: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing, but those often overfit to particular patterns. RL in Parallel-R1 helps models learn to decide when parallel paths help, not just how to mimic them.

  • Scaffolding exploration: The cold-start + easy tasks + alternating reward strategy functions as a scaffold, enabling RL to find a stronger policy space than direct RL on hard tasks.

  • Architecture choices matter: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.


Limitations & future directions

  • The gains, though significant, still leave substantial headroom below human-level performance on very hard math tasks.

  • The structured variants can struggle under domain shift; care is needed with architectural changes that assume particular path structures.

  • Triggering parallel thinking (using <Parallel> blocks) adds token and compute overhead, though the model learns to use it more sparingly over time.

  • There’s a tension between pushing for parallel structure (which encourages exploration) and maximizing accuracy (which sometimes pushes toward fewer divergences). Reward engineering is delicate.


Bottom line: Parallel-R1 is a breakthrough toward training LLMs that think in parallel, not just deeper. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness and robustness, methods like this will likely become a standard part of the toolkit.

Paper link: arXiv 2509.07980 (PDF)

The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting

 When logic is tricky, the most common answer isn’t always the correct one. A new Meta FAIR & CMU paper titled “The Majority is not always right: RL training for solution aggregation” challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method—AggLM—trains a dedicated aggregator model to review, correct, and synthesize among multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority voting and reward-model baselines. 


Solving it: learned reconciliation vs. counting

Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that’s most frequent (majority voting) or highest scored by some reward model. While effective in many settings, these methods have a blind spot—when correct answers exist only among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, analyzes them, spots mistakes or partial correctness, then combines ideas or corrects missing steps to produce a final solution. Importantly, it’s trained with verifiable rewards—the aggregator is rewarded only when its output matches a known correct solution. 


Key ingredients & experiments

  • Dataset & training: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions. 

  • Aggregation via RLVR: The model uses Group-Relative Policy Optimization (GRPO) with a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model and tuned via this RL signal; a minimal sketch of the group-relative advantage follows this list. 

  • Benchmarks: Evaluated on four math contest datasets: AIME24, AIME25, HMMT24, HMMT25. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes. 
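
For concreteness, here is a minimal sketch of the group-relative advantage GRPO computes over a group of sampled aggregations with binary verifiable rewards (the normalization is standard GRPO; the numbers are toy):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sample's reward by its group's mean and std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Eight aggregation attempts for one problem; three match the ground truth:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]))
# correct attempts get positive advantages, incorrect ones negative
```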


Results & token-efficiency

  • On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly; on AIME25, for example, it reaches 50.0% with eight candidates, well ahead of the majority-voting baseline over the same candidate set. More striking, when aggregating solutions from the stronger 8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines. 

  • In non-thinking modes (i.e. when the candidate-generating model is weaker or does not use chain-of-thought reasoning), AggLM retains its lead—showing that it generalizes beyond just cherry-picking strong or specifically-formatted inputs. 

  • Regarding cost, AggLM is more token efficient: instead of needing large numbers of candidate solutions (i.e. very large k) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute. 


Implications & what’s next

AggLM shifts thinking in several ways:

  1. Aggregation as reasoning. Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.

  2. Balancing examples is key. Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers. 

  3. Generalization beyond training generators. AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions. 

  4. Efficiency trade-off. Instead of scaling k (number of solutions) to very high values, a learned aggregator yields larger gains per additional candidate—more accuracy for fewer tokens and less time. 


Bottom line: AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to look across candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.

Paper link: arXiv 2509.06870 (PDF)

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


31.7.25

X-Omni proves RL can make token-based image generators great again

 Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7 B LLM that predicts visual tokens into a Sora-class image engine—while natively sharing weights with language generation.

Three moving parts

  • Semantic image tokenizer — converts 32 × 32 patch features into a 65 k-token vocabulary without vector-quantization blur. RL impact: supplies denser reward signals than pixel-level losses.

  • Unified AR backbone — one transformer handles both language and image tokens; no diffusion head during training. RL impact: after SFT it overfits, but RL fixes fidelity and instruction following.

  • Offline diffusion decoder — a lightweight “decompressor” turns token grids into crisp 1 K-px frames. RL impact: keeps inference < 2 s on a single A100.

Why reinforcement learning?

Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates—rewarded for CLIP aesthetics, OCR accuracy and prompt adherence—steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.
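
A toy sketch of how such a composite reward might be wired together—the 0.4/0.3/0.3 weights and the scorer values are illustrative assumptions, not the paper’s:

```python
def composite_reward(clip_score: float, ocr_score: float, adherence_score: float,
                     weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted mix of the three signals the paper rewards: CLIP aesthetics,
    OCR accuracy of rendered text, and prompt adherence (all in [0, 1])."""
    w_clip, w_ocr, w_prompt = weights
    return w_clip * clip_score + w_ocr * ocr_score + w_prompt * adherence_score

print(composite_reward(0.8, 1.0, 0.9))  # -> 0.89
```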

Early numbers worth noting

  • FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)

  • 99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)

  • 3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0

  • < 8.5 GB VRAM for a distilled 1.3 B variant (coming soon, according to the repo)

Why it matters

  1. Unified model, unified budget – No separate diffusion tower; language and image share the same 7 B weights, making deployment simpler and cheaper.

  2. Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.

  3. Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.

The bigger picture

Until now, researchers had mostly written off discrete AR image models as artifacts-prone hold-overs from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.

Paper link: arXiv 2507.22058 (PDF)

23.7.25

Qwen3‑Coder: Alibaba’s 480‑B Agentic Code Model Aims for One‑Million‑Token Repos

 When Alibaba’s Qwen research group dropped the link to “Qwen3‑Coder: Agentic Coding in the World,” AI Twitter lit up in minutes. The post introduces Qwen3‑Coder‑480B‑A35B‑Instruct, a gargantuan 480‑billion‑parameter Mixture‑of‑Experts (MoE) language model in which only 35 B parameters activate per token, making deployment far leaner than raw size suggests. Released on July 22, 2025 with permissive access points on GitHub, Hugging Face, and ModelScope, the model claims state‑of‑the‑art results in agent‑style coding and tool use—rivaling Anthropic’s Claude 4 Sonnet while remaining fully open‑weight. 

Architecture built for truly big code

The Qwen team doubled down on “scaling in three dimensions.” First, tokens: 7.5 T training tokens with a hefty 70 % code ratio to anchor programming skill while preserving math and general reasoning. Second, context: the model handles a native 256 K‑token window and can stretch to 1 M tokens using YaRN extrapolation, making whole‑repository prompts or week‑long chat traces finally practical. Third, synthetic data: Qwen2.5‑Coder was used to rewrite noisy corpora, boosting baseline cleanliness before fine‑tuning even starts. 

Reinforcement learning at industrial scale

Rather than stopping at supervised fine‑tune, Qwen3‑Coder undergoes two novel RL phases. “Scaling Code RL” turns automated unit‑test generation into millions of execution‑checked training rounds—improving code‑run accuracy and even general abilities. Then comes Agent RL, where 20 000 parallel cloud environments simulate real SWE‑Bench tickets. The model learns to plan, invoke tools, and iterate until tests pass, producing best‑in‑class scores on SWE‑Bench Verified without any test‑time tricks. 

Benchmarks and agentic chops

Early numbers show Qwen3‑Coder topping every open‑source competitor on Agentic Coding, Agentic Browser‑Use, and Agentic Tool‑Use tracks; Alibaba positions it as “comparable to Claude Sonnet 4” in practical autonomy. In short, it doesn’t just spit snippets—it reasons across multi‑file repos, calls compilers, and revises until green checks appear. For developers chasing fully automated pull‑request bots, that’s a milestone. 

Meet Qwen Code—your command‑line copilot

To make those agentic skills tangible, the team open‑sourced Qwen Code, a Node‑based CLI forked from Gemini CLI. With a one‑line npm i -g @qwen-code/qwen-code, users gain a prompt‑driven shell that speaks directly to Qwen3‑Coder via an OpenAI‑compatible endpoint. Prefer other tooling? The blog shows drop‑in guides for Claude Code, Cline, and generic REST calls, so the model can slot into VS Code, Git hooks, or CI pipelines in minutes. 
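
For instance, a minimal call against an OpenAI-compatible server hosting the model might look like this (the base URL, API key, and served model name are placeholders for your own deployment):

```python
from openai import OpenAI

# Point the client at whatever endpoint serves Qwen3-Coder for you.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-Coder-480B-A35B-Instruct",  # placeholder served-model name
    messages=[{"role": "user",
               "content": "Write a pytest suite for a slugify(title: str) function."}],
)
print(resp.choices[0].message.content)
```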

Why it matters

Qwen3‑Coder is more than another “bigger‑is‑better” headline. By combining MoE efficiency, million‑token context, and reinforcement learning tuned for agent workflows, Alibaba delivers a bridge between research hype and developer reality. Hobbyists with a single A100 can experiment with 256 K‑token coding agents, while enterprises get an Apache‑friendly alternative to closed, usage‑metered APIs. For AI enthusiasts, it’s an invitation: wire up Qwen3‑Coder to your build system, hand it a failing test, and watch an open model patch your codebase—all without leaving the command line. The age of end‑to‑end agentic coding just took a decisive step forward. 

22.7.25

Gemini “Deep Think” Hits Gold-Medal Performance at the International Mathematical Olympiad

 

From Silver to Gold in Twelve Months

Last year, DeepMind’s AlphaGeometry and AlphaProof systems collectively solved four of six IMO problems, earning a silver-medal equivalent. In July 2025 the research team leap-frogged that result: an advanced version of Gemini running in “Deep Think” mode solved five of six tasks for 35 points—crossing the 2025 gold-medal threshold and setting a new AI milestone.

International coordinators graded Gemini’s written solutions using the same rubric applied to student competitors. According to IMO President Gregor Dolinar, the proofs were “clear, precise, and, in several cases, easy to follow”.


What Makes Deep Think Different?

  • Parallel Thinking — explores multiple proof avenues simultaneously, then merges the strongest ideas. Impact: avoids dead-end, single-thread chains of thought.

  • Reinforcement-Learning Fine-Tune — trains on curated theorem-proving and problem-solving data with reward signals for conciseness and rigor. Impact: raises the success rate on multi-step reasoning challenges.

  • High-Quality Solution Corpus — ingests expertly written IMO proofs plus heuristic “tips & tricks.” Impact: gives the model stylistic and structural templates for clearer presentation.

These upgrades let Gemini run longer “scratch-pads” internally while staying within a feasible compute budget—no multi-day cluster runs were required, unlike earlier systems.
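
As a mental model only—nothing here reflects Gemini’s internals—parallel thinking can be pictured as a branch-and-merge loop over a generic model handle, where `model.sample` and `model.score` are hypothetical:

```python
def parallel_think(problem: str, model, k: int = 4) -> str:
    """Branch-and-merge sketch: sample k independent attempts, rank them,
    then ask the model to merge the strongest ideas into one proof."""
    branches = [model.sample(problem, temperature=1.0) for _ in range(k)]
    ranked = sorted(branches, key=model.score, reverse=True)
    merge_prompt = (
        f"Problem: {problem}\n"
        "Combine the strongest ideas from these attempts into one rigorous proof:\n\n"
        + "\n\n---\n\n".join(ranked[:2])
    )
    return model.sample(merge_prompt, temperature=0.2)
```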

Benchmark Significance

  • 35 / 42 points → comparable to a top-25-percent human gold medalist.

  • Perfect scores on five problems; only one combinatorics task eluded the model.

  • Order-of-magnitude speed-up vs. AlphaGeometry 2 + AlphaProof, which needed days of inference in 2024.

While specialized theorem solvers have mastered narrow domains, Gemini Deep Think is a general LLM—capable of chat, code, and multimodal tasks—now showing elite mathematical reasoning.


Broader Implications

  1. Curriculum Design for AI
    Gemini’s success underscores the value of domain-targeted reinforcement learning on top of large-scale pre-training.

  2. Parallel Thinking as a New Primitive
    Instead of a single “chain of thought,” future models may default to branch-and-merge reasoning, akin to how human teams brainstorm proofs.

  3. Human–AI Collaboration
    DeepMind notes the technique could become a “proof assistant” for mathematicians—surfacing lemmas or counter-examples at gold-medal quality within minutes.

  4. Educational Outreach
    Publishing the solutions provides a free study resource for aspiring IMO contestants and teachers, potentially leveling the global playing field.


Limitations & Next Steps

  • Interpretability: Despite clearer written proofs, the internal decision tree remains opaque—researchers are now probing why certain branches survive the merge.

  • Generalization: Performance on under-represented areas (e.g., functional equations) still lags; future training will widen topic coverage.

  • Trust & Verification: Formal proof checkers like Lean are being integrated to machine-verify each Gemini output before publication.

DeepMind plans to open selected Deep Think capabilities via its Gemini API later this year, with safeguards to prevent misuse in academic competitions.


Key Takeaway

Gemini Deep Think’s gold-medal performance doesn’t just raise the bar for AI mathematics—it redefines what general-purpose language models can achieve when armed with structured parallel reasoning and tailored RL training. The achievement brings researchers a step closer to AI systems that can tackle longstanding open problems and act as partner mathematicians rather than mere calculators.

10.7.25

CriticLean makes the AI “grader” the hero of math formalization

 Automating the translation of plain-English math into Lean code has felt like grading your own exam: language models write a proof, a compiler checks syntax, and everyone hopes the semantics line up. CriticLean flips that script by training a dedicated critic model—dubbed CriticLeanGPT—that learns to catch logical slips the compiler can’t. Guided by reinforcement learning, that critic doesn’t just reject bad code; it drives an iterative rewrite loop that more than doubles end-to-end accuracy.

From passive judge to active coach

The team fine-tunes a lightweight Qwen backbone to score whether a Lean statement truly matches its natural-language prompt, then bakes those scores into a reward signal. Each failed attempt becomes a teaching moment, producing richer feedback than the usual “compiler error” one-liner. The critic also powers CriticLeanBench, a 500-item test set (half correct, half adversarially wrong) that shows CriticLeanGPT trouncing both open and closed-source baselines at spotting semantic mistakes.
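
The loop the paper describes can be sketched roughly as follows; the generator, critic, and compiler handles are hypothetical stand-ins:

```python
def formalize_with_critic(statement: str, generate, critique, compile_lean,
                          max_attempts: int = 200) -> str | None:
    """Critic-guided rewrite loop (sketch): accept a Lean statement only if
    it compiles AND the critic judges it semantically faithful; otherwise
    feed the critique back as context for the next attempt."""
    feedback = ""
    for _ in range(max_attempts):
        lean_code = generate(statement, feedback)
        ok, msg = compile_lean(lean_code)          # syntax/type check only
        if not ok:
            feedback = f"Compiler error: {msg}"
            continue
        verdict = critique(statement, lean_code)   # semantic check
        if verdict.is_faithful:
            return lean_code
        feedback = f"Critic: {verdict.explanation}"  # richer than a one-liner
    return None
```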

Hard numbers: 38 % → 84 % accuracy

On a 50-problem slice of the Omni-MATH benchmark, a 7 B “Kimina-Autoformalizer” model alone solved just 38 % of tasks. A traditional compiler-feedback loop nudged that to 54 %. Swap in CriticLean’s RL-trained critic and the success rate soars to 84 %—a 30-point leap even seasoned theorem-prover veterans will notice.

A broader 500-problem stress test tells the same story: the multi-attempt CriticLean pipeline verified 52.8 % of statements under a 200-try cap, recovering forty extra points of yield that single-pass systems would toss out.

A new 285 k-problem corpus (and 36 k “diamond” stumpers)

Because the critic can certify semantic correctness without humans, the authors bootstrapped FineLeanCorpus, a 285,957-entry Lean dataset spanning 16 math domains with a flatter difficulty curve than the skewed Lean-Workbook previously used for fine-tuning. They also carved out a FineLeanCorpus-Diamond subset—36 k brutal problems meant to push future models beyond textbook algebra.

Why this matters

  • Reliability over compilation. Syntax is easy; semantics are king. CriticLean proves that investing compute in the grading phase pays bigger dividends than ever-bigger generators.

  • Plug-and-play RL recipe. The critic-guided loop is model-agnostic and could supervise any auto-formalizer—Lean, Isabelle, even Coq.

  • Dataset flywheel. With FineLeanCorpus open-sourced, researchers finally have a large, semantically vetted playground instead of noisy web scrapes.

Whether you’re chasing fully automated theorem proving or just want ChatGPT to stop hallucinating Lean syntax, CriticLean’s message is clear: the smartest way forward is to teach your models how to critique themselves.

Paper link: arXiv 2507.06181 (PDF)

8.7.25

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.


Two upgrades, one big leap

  • 72 % shorter sequences — a hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. Why it matters: cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.

  • Human-aligned RL — collected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. Why it matters: removes holes and stray faces while nudging topology toward “artist-grade” layouts.
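
For reference, the DPO objective behind that second pillar in a minimal PyTorch sketch; β and the toy batch are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1):
    """Standard DPO loss over preference pairs. Inputs are summed token
    log-probs per mesh under the trained policy and the frozen reference."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of two preference pairs:
lp_c = torch.tensor([-120.0, -98.0]);  lp_r = torch.tensor([-130.0, -97.0])
ref_c = torch.tensor([-121.0, -99.0]); ref_r = torch.tensor([-128.0, -96.0])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```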

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.

  • Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.

A public demo gallery already shows clean watertight dragons, furniture and stylised characters rendered straight from sparse point clouds. 


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

6.7.25

LangGraph Rollout: how VeRL leveled-up multi-turn Agent RL

 

Why this matters

If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code. 


1. Where they started

  • Native VeRL multi-turn – great for quick experiments. You enable multi_turn: True, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days. 

  • Pain points

    1. Double bookkeeping: every tool had to be declared twice (YAML + Python).

    2. Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones. 


2. A quick stop-gap: automatic tool wrapping

Yanbin added BaseTool.from_callable(), which introspects any plain Python function with transformers.utils.get_json_schema, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (tool_list = [multiply, add, …]) now powers both training and prod. 
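
The introspection half of that trick is plain `transformers`; here is a small sketch of the schema extraction `from_callable` can lean on (the VeRL wrapper itself lives in the PR):

```python
from transformers.utils import get_json_schema

def multiply(a: int, b: int) -> int:
    """Multiply two integers.

    Args:
        a: First factor.
        b: Second factor.
    """
    return a * b

schema = get_json_schema(multiply)
print(schema["function"]["name"])        # "multiply"
print(schema["function"]["parameters"])  # JSON schema built from the type hints
```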

My dev take: this is the same pattern I use in LangChain when I decorate business logic with @tool. Nice to see VeRL admit “if you can’t beat reflection, join it.”


3. The real blocker: orchestration power

Research quickly outgrew VeRL’s built-in rollout:

  • Dynamic branches & backtracking — the native graph was too rigid.

  • True multi-turn dialogue (user follow-ups) — any assistant message without tool calls ended the conversation.

  • Per-node sampling / chat-template tweaks — global settings only.

Enter LangGraph: a lightweight DAG engine already shipping in production.

4. Architectural insight: separation of concerns

“Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.” 

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

  1. VeRL hands the initial messages + model handle to the user’s LangGraph.

  2. The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.

  3. When the graph stops, VeRL collects the message history and rewards. 

The PR shows a seven-line YAML snippet that swaps the old rollout for:

```yaml
multi_turn:
  chat_template_kwargs: {enable_thinking: false}
  langgraph:
    path: /path/to/graph.py
    graph_config: {recursion_limit: 100}
```

…and a 60-line example graph that binds tools, counts turns, and lets you vary temperature node-by-node. 
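
A stripped-down version of such a graph might look like this; the `chat_model` stub stands in for the LangChain-compatible client VeRL binds to its SGLang server, and the tool node is elided:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, AnyMessage, HumanMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

def chat_model(messages: list[AnyMessage]) -> AIMessage:
    return AIMessage(content="(model reply)")  # stand-in for the real client

class State(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]
    turns: int

def agent(state: State) -> dict:
    reply = chat_model(state["messages"])
    return {"messages": [reply], "turns": state["turns"] + 1}

def route(state: State) -> str:
    last = state["messages"][-1]
    # Stop when the turn budget is spent or the model stops requesting tools;
    # a real graph would route to a ToolNode instead of looping to "agent".
    if state["turns"] >= 10 or not getattr(last, "tool_calls", None):
        return END
    return "agent"

builder = StateGraph(State)
builder.add_node("agent", agent)
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route)
graph = builder.compile()

print(graph.invoke({"messages": [HumanMessage(content="hi")], "turns": 0}))
```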


5. Why I’m excited

  • One graph to rule them all – deployment and training share code; no more “but it worked in prod!”

  • Easier ablations – want to test a new branch strategy? Edit the graph script; RL pipeline stays untouched.

  • Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.


My takeaway

VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.

Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/ 

WebSailor charts an open-source course to super-human web reasoning

 For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.” 

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning. 

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2 k expert traces primes the model so DUPO has something to optimize. 

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7 % pass@1 on BrowseComp-en, trouncing agents built on 32 B backbones that manage barely 2 – 3 %. The 32 B and 72 B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools. 

Why it matters

  • Democratizing deep search. BrowseComp-level tasks—ask a question, navigate dozen-plus pages, synthesize an answer—are what corporate knowledge-bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.

  • A recipe, not a model. The RFT cold start, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.

  • Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72 B model scores >90 % pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks. 

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers. 

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

4.7.25

DiffuCoder rewrites the code-LLM playbook with diffusion and smarter RL

 Autoregressive (AR) giants like GPT-4o and Qwen2.5 dominate today’s leaderboard-driven coding scene, but Apple’s research group thinks the next breakthrough may come from an entirely different generation paradigm. In a paper published late last week, the team unveiled DiffuCoder — a 7 B-parameter masked diffusion language model (dLLM) designed specifically for program synthesis and repair. Unlike AR models that predict the next token left-to-right, DiffuCoder iteratively denoises whole sequences, enabling global planning and out-of-order refinement.

What’s new under the hood

  • Scaled training for code. DiffuCoder is pretrained on 130 billion code tokens, then instruction-tuned and RL-fine-tuned on curated problem sets. That makes it one of the largest diffusion-first code models publicly documented.

  • Decoding insights. The authors introduce local and global AR-ness metrics to quantify how often a diffusion model falls back to sequential generation. They show that raising temperature not only diversifies token choice but also the order in which tokens are filled — a property AR models lack.

  • Coupled-GRPO. To tame the high-variance log-likelihood estimates that plague diffusion policy gradients, Apple proposes coupled Group Relative Policy Optimization, a two-pass masking strategy that evaluates complementary token subsets in one RL rollout. The technique drops noise without resorting to semi-AR “block decoding,” keeping the model fully diffusion-native.
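
The complementary-mask idea at the heart of coupled-GRPO can be sketched in a few lines (our illustration; the mask ratio and sequence length are toy values):

```python
import torch

def coupled_mask_pair(seq_len: int, mask_ratio: float = 0.5):
    """Two complementary masks: every position is masked in exactly one of
    the two passes, so each token's log-prob gets one clean estimate per
    rollout—the source of the variance reduction."""
    perm = torch.randperm(seq_len)
    cut = int(seq_len * mask_ratio)
    mask_a = torch.zeros(seq_len, dtype=torch.bool)
    mask_a[perm[:cut]] = True
    mask_b = ~mask_a  # the complement covers all remaining tokens
    return mask_a, mask_b

a, b = coupled_mask_pair(8)
print((a ^ b).all().item())  # True: disjoint masks that cover every position
```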

Benchmark scores that matter

DiffuCoder’s base model already lands in the same ballpark as leading 7/8 B AR coders. After instruction tuning and coupled-GRPO, it posts:

Model                 | HumanEval+ | MBPP+ | EvalPlus (avg.) | BigCodeBench C-Full
DiffuCoder-Instruct   | 72.0       | 65.2  | 75.1            | 61.9
+ coupled-GRPO        | 73.2       | 68.3  | 78.6            | 67.5

That 3.5-point jump on EvalPlus (75.1 → 78.6) brings the diffusion model within striking distance of Qwen2.5-Coder-SFT while comfortably outpacing earlier dLLMs like Dream-7B and LLaDA-Instruct.

Why it matters

Diffusion’s parallel denoising lets models “think in drafts,” revisiting earlier lines without paying the quadratic attention tax AR models incur for long contexts. For enterprise dev-ops teams staring down thousand-line files, a diffusion-native coder that no longer needs block-wise hacks could slash latency and memory. And because coupled-GRPO is plug-and-play, the method can in theory retrofit any masked diffusion LLM — not just Apple’s.

Early tooling and ecosystem

A DiffuCoder-7B-Instruct checkpoint is already live on Hugging Face, and the GitHub repo ships with sampling scripts, RL rewards and evaluation harnesses. That means startups building unit-test agents or code-review copilots can kick the tires today on a single A100.

The bigger question is whether diffusion LLMs can climb the performance ladder as fast as their image cousins did in 2022. Apple’s coupled-GRPO shows one path forward: make RL native to diffusion instead of forcing AR habits onto a fundamentally different beast. If follow-up work scales the idea to 34 B or 70 B parameters, AR incumbents may soon find themselves sharing the podium.

Paper link: arXiv 2506.20639 (PDF)

18.6.25

MiniMax-M1: A Breakthrough Open-Source LLM with a 1 Million Token Context & Cost-Efficient Reinforcement Learning

 MiniMax, a Chinese AI startup renowned for its Hailuo video model, has unveiled MiniMax-M1, a landmark open-source language model released under the Apache 2.0 license. Designed for long-context reasoning and agentic tool use, M1 supports a 1 million token input and 80,000 token output window—vastly exceeding most commercial LLMs and enabling it to process large documents, contracts, or codebases in one go.

Built on a hybrid Mixture-of-Experts (MoE) architecture with lightning attention, MiniMax-M1 optimizes performance and cost. The model spans 456 billion parameters, with 45.9 billion activated per token. Its training employed a custom CISPO reinforcement learning algorithm, resulting in substantial efficiency gains. Remarkably, M1 was trained for just $534,700, compared to over $5–6 million spent by DeepSeek‑R1 or over $100 million for GPT‑4.


⚙️ Key Architectural Innovations

  • 1M Token Context Window: Enables comprehensive reasoning across lengthy documents or multi-step workflows.

  • Hybrid MoE + Lightning Attention: Delivers high performance without excessive computational overhead.

  • CISPO RL Algorithm: Efficiently trains the model with clipped importance sampling, lowering cost and training time; a sketch follows this list.

  • Dual Variants: M1-40k and M1-80k versions support variable output lengths (40K and 80K “thinking budget”).
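
As referenced above, here is a rough sketch of a CISPO-style loss—the importance weight is clipped and detached so that, unlike PPO-style clipping, no token’s gradient contribution is dropped; the epsilon defaults are placeholders, not MiniMax’s settings:

```python
import torch

def cispo_loss(logp, logp_old, advantages,
               eps_low: float = 1.0, eps_high: float = 5.0):
    """REINFORCE term weighted by a clipped, gradient-stopped IS ratio."""
    ratio = (logp - logp_old.detach()).exp()
    weight = ratio.clamp(1.0 - eps_low, 1.0 + eps_high).detach()
    return -(weight * advantages * logp).mean()
```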


📊 Benchmark-Topping Performance

MiniMax-M1 excels in diverse reasoning and coding benchmarks:

  • AIME 2024 (Math): 86.0%

  • LiveCodeBench (Coding): 65.0%

  • SWE‑bench Verified: 56.0%

  • TAU‑bench: 62.8%

  • OpenAI MRCR (4-needle): 73.4%

These results surpass leading open-weight models like DeepSeek‑R1 and Qwen3‑235B‑A22B and narrow the gap with top-tier commercial LLMs such as OpenAI’s o3 and Google’s Gemini, thanks to the model’s architectural optimizations.


🚀 Developer-Friendly & Agent-Ready

MiniMax-M1 supports structured function calling and is packaged with an agent-capable API that includes search, multimedia generation, speech synthesis, and voice cloning. MiniMax recommends deployment via vLLM, which is optimized for efficient serving and batch handling; standard Transformers compatibility is also offered.

For enterprises, technical leads, and AI orchestration engineers—MiniMax-M1 provides:

  • Lower operational costs and compute footprint

  • Simplified integration into existing AI pipelines

  • Support for in-depth, long-document tasks

  • A self-hosted, secure alternative to cloud-bound models

  • Business-grade performance with full community access


🧩 Final Takeaway

MiniMax-M1 marks a milestone in open-source AI—combining extreme context length, reinforcement-learning efficiency, and high benchmark performance within a cost-effective, accessible framework. It opens new possibilities for developers, researchers, and enterprises tackling tasks requiring deep reasoning over extensive content—without the limitations or expense of closed-weight models.

10.6.25

Ether0: The 24B-Parameter Scientific Reasoning Model Accelerating Molecular Discovery

 FutureHouse has unveiled Ether0, a 24 billion-parameter open-source reasoning model specialized for chemistry tasks. Built on Mistral 24B and fine-tuned through chain-of-thought reinforcement learning, Ether0 accepts natural-language prompts and generates molecule structures in SMILES notation, excelling particularly in drug-like compound design.

Why Ether0 Matters

While general-purpose LLMs possess extensive chemical knowledge, they falter at molecule manipulation—incorrect atom counts, implausible rings, or inaccurate compound names. Ether0 addresses these deficiencies by learning from reinforcement signals grounded in chemical validity rather than mimicry, significantly boosting accuracy in molecule generation.
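
As a flavor of what “grounded in chemical validity” can mean in practice, here is a hedged RDKit sketch of a binary formula-match reward (the checks Ether0 actually uses are richer):

```python
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def formula_reward(smiles: str, target_formula: str) -> float:
    """1.0 iff the generated SMILES parses and matches the requested
    molecular formula; anything unparseable scores 0.0."""
    mol = Chem.MolFromSmiles(smiles)  # None if the SMILES is invalid
    if mol is None:
        return 0.0
    return 1.0 if CalcMolFormula(mol) == target_formula else 0.0

print(formula_reward("CCO", "C2H6O"))  # ethanol -> 1.0
```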

Training Methodology

  • Base Model & Datasets: Starts with Mistral 24B Instruct.

  • Fine-tuning: Supervised learning on chains of thought and correct answers, with separate specialists per task.

  • Reinforcement Learning: Specialized models trained on molecular tasks across ~50K examples each.

  • Distillation: Merges specialist reasoning into a generalized model, further refined with reinforcement over multiple tasks.

This modular workflow enables data efficiency, with Ether0 surpassing frontier models like GPT‑4.1 and DeepSeek‑R1 on chemistry problems while using substantially less data than traditional methods.

Capabilities and Limits

Ether0 accurately handles tasks such as:

  • Converting formulas (e.g., C₂₇H₃₇N₃O₄) to valid molecules.

  • Designing compounds by functional groups, solubility, pKa, smell, or receptor binding.

  • Proposing retrosynthesis steps and reaction outcomes.

However, it falters in:

  • Naming via IUPAC or common names.

  • Reasoning on molecular conformations.

  • General conversational chemistry outside strict molecule output.

The model develops unique behaviors—blending languages and inventing new terms (e.g., “reductamol”)—reflecting deeper reasoning at the cost of clarity in some reasoning traces.

Safety & Governance

Ether0 is released under an Apache 2.0 license and includes safeguards: refusals on controlled compounds, filters for toxins and weapons-adjacent chemistry, and rejection of explicitly malicious requests. This safety post-processing is critical given its open-weight deployment.

Community & Future Vision

Built by a FutureHouse team supported by Eric Schmidt and VoltagePark, Ether0 is part of a broader quest to automate scientific discovery via AI agents. The code, reward models, benchmarks, and model weights are available on GitHub and Hugging Face. Next steps include integrating Ether0 into Phoenix—FutureHouse’s chemistry agent—as a foundational block toward a generalized scientific reasoning engine.


Key Takeaways

  1. Domain-specific reasoning: Demonstrates how reinforcement-tuned LLMs can learn scientific tasks beyond pretraining.

  2. Data-efficient training: Delivers strong performance using ~50K task-specific examples, far fewer than traditional AI training regimes.

  3. Open-source advancement: Enables scientific and developer communities to build upon Ether0 in drug design and other chemistry domains.

  4. Transparent reasoning traces: Offers insight into LLM ‘thought processes’, facilitating interpretability in scientific AI.
