Wandering Nomad: Reinforcement Learning

Showing posts with label Reinforcement Learning. Show all posts

31.7.25

X-Omni proves RL can make token-based image generators great again

Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7 B LLM that predicts visual tokens into a Sora-class image engine—while natively sharing weights with language generation.

Three moving parts

Module	Job	RL impact
Semantic image tokenizer	Converts 32 × 32 patch features into a 65 k-token vocabulary without vector-quantization blur.	Supplies denser reward signals than pixel-level losses.
Unified AR backbone	One transformer handles both language and image tokens; no diffusion head during training.	After SFT it over-fits, but RL fixes fidelity & instruction following.
Offline diffusion decoder	A lightweight “decompressor” turns token grids into crisp 1 K-px frames.	Keeps inference < 2 s on a single A100.

Why reinforcement learning?

Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates—rewarded for CLIP aesthetics, OCR accuracy and prompt adherence—steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.

Early numbers worth noting

FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)
99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)
3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0
< 8.5 GB VRAM for a distilled 1.3 B variant (coming soon, according to the repo)

Why it matters

Unified model, unified budget – No separate diffusion tower; language and image share the same 7 B weights, making deployment simpler and cheaper.
Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.
Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.

The bigger picture

Until now, researchers had mostly written off discrete AR image models as artifacts-prone hold-overs from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.

Paper link: arXiv 2507.22058 (PDF)

23.7.25

Qwen3‑Coder: Alibaba’s 480‑B Agentic Code Model Aims for One‑Million‑Token Repos

When Alibaba’s Qwen research group dropped the link to “Qwen3‑Coder: Agentic Coding in the World,” AI Twitter lit up in minutes. The post introduces Qwen3‑Coder‑480B‑A35B‑Instruct, a gargantuan 480‑billion‑parameter Mixture‑of‑Experts (MoE) language model in which only 35 B parameters activate per token, making deployment far leaner than raw size suggests. Released on July 22, 2025 with permissive access points on GitHub, Hugging Face, and ModelScope, the model claims state‑of‑the‑art results in agent‑style coding and tool use—rivaling Anthropic’s Claude 4 Sonnet while remaining fully open‑weight.

Architecture built for truly big code

The Qwen team doubled down on “scaling in three dimensions.” First, tokens: 7.5 T training tokens with a hefty 70 % code ratio to anchor programming skill while preserving math and general reasoning. Second, context: the model handles a native 256 K‑token window and can stretch to 1 M tokens using YaRN extrapolation, making whole‑repository prompts or week‑long chat traces finally practical. Third, synthetic data: Qwen2.5‑Coder was used to rewrite noisy corpora, boosting baseline cleanliness before fine‑tuning even starts.

Reinforcement learning at industrial scale

Rather than stopping at supervised fine‑tune, Qwen3‑Coder undergoes two novel RL phases. “Scaling Code RL” turns automated unit‑test generation into millions of execution‑checked training rounds—improving code‑run accuracy and even general abilities. Then comes Agent RL, where 20 000 parallel cloud environments simulate real SWE‑Bench tickets. The model learns to plan, invoke tools, and iterate until tests pass, producing best‑in‑class scores on SWE‑Bench Verified without any test‑time tricks.

Benchmarks and agentic chops

Early numbers show Qwen3‑Coder topping every open‑source competitor on Agentic Coding, Agentic Browser‑Use, and Agentic Tool‑Use tracks; Alibaba positions it as “comparable to Claude Sonnet 4” in practical autonomy. In short, it doesn’t just spit snippets—it reasons across multi‑file repos, calls compilers, and revises until green checks appear. For developers chasing fully automated pull‑request bots, that’s a milestone.

Meet Qwen Code—your command‑line copilot

To make those agentic skills tangible, the team open‑sourced Qwen Code, a Node‑based CLI forked from Gemini CLI. With a one‑line npm i -g @qwen-code/qwen-code, users gain a prompt‑driven shell that speaks directly to Qwen3‑Coder via an OpenAI‑compatible endpoint. Prefer other tooling? The blog shows drop‑in guides for Claude Code, Cline, and generic REST calls, so the model can slot into VS Code, Git hooks, or CI pipelines in minutes.

Why it matters

Qwen3‑Coder is more than another “bigger‑is‑better” headline. By combining MoE efficiency, million‑token context, and reinforcement learning tuned for agent workflows, Alibaba delivers a bridge between research hype and developer reality. Hobbyists with a single A100 can experiment with 256 K‑token coding agents, while enterprises get an Apache‑friendly alternative to closed, usage‑metered APIs. For AI enthusiasts, it’s an invitation: wire up Qwen3‑Coder to your build system, hand it a failing test, and watch an open model patch your codebase—all without leaving the command line. The age of end‑to‑end agentic coding just took a decisive step forward.

22.7.25

Gemini “Deep Think” Hits Gold-Medal Performance at the International Mathematical Olympiad

From Silver to Gold in Twelve Months

Last year, DeepMind’s AlphaGeometry and AlphaProof systems collectively solved four of six IMO problems, earning a silver-medal equivalent. In July 2025 the research team leap-frogged that result: an advanced version of Gemini running in “Deep Think” mode solved five of six tasks for 35 points—crossing the 2025 gold-medal threshold and setting a new AI milestone.

International coordinators graded Gemini’s written solutions using the same rubric applied to student competitors. According to IMO President Gregor Dolinar, the proofs were “clear, precise, and, in several cases, easy to follow”.

What Makes Deep Think Different?

Technique	Purpose	Impact on Performance
Parallel Thinking	Explores multiple proof avenues simultaneously, then merges the strongest ideas.	Avoids dead-end, single-thread chains of thought.
Reinforcement-Learning Fine-Tune	Trains on curated theorem-proving and problem-solving data with reward signals for conciseness and rigor.	Raises success rate on multi-step reasoning challenges.
High-Quality Solution Corpus	Ingests expertly written IMO proofs plus heuristic “tips & tricks.”	Gives the model stylistic and structural templates for clearer presentation.

These upgrades let Gemini run longer “scratch-pads” internally while staying within a feasible compute budget—no multi-day cluster runs were required, unlike earlier systems.

Benchmark Significance

35 / 42 points → comparable to a top-25-percent human gold medalist.
Perfect scores on five problems; only one combinatorics task eluded the model.
Order-of-magnitude speed-up vs. AlphaGeometry 2 + AlphaProof, which needed days of inference in 2024.

While specialized theorem solvers have mastered narrow domains, Gemini Deep Think is a general LLM—capable of chat, code, and multimodal tasks—now showing elite mathematical reasoning.

Broader Implications

Curriculum Design for AI
Gemini’s success underscores the value of domain-targeted reinforcement learning on top of large-scale pre-training.
Parallel Thinking as a New Primitive
Instead of a single “chain of thought,” future models may default to branch-and-merge reasoning, akin to how human teams brainstorm proofs.
Human–AI Collaboration
DeepMind notes the technique could become a “proof assistant” for mathematicians—surfacing lemmas or counter-examples at gold-medal quality within minutes.
Educational Outreach
Publishing the solutions provides a free study resource for aspiring IMO contestants and teachers, potentially leveling the global playing field.

Limitations & Next Steps

Interpretability: Despite clearer written proofs, the internal decision tree remains opaque—researchers are now probing why certain branches survive the merge.
Generalization: Performance on under-represented areas (e.g., functional equations) still lags; future training will widen topic coverage.
Trust & Verification: Formal proof checkers like Lean are being integrated to machine-verify each Gemini output before publication.

DeepMind plans to open selected Deep Think capabilities via its Gemini API later this year, with safeguards to prevent misuse in academic competitions.

Key Takeaway

Gemini Deep Think’s gold-medal performance doesn’t just raise the bar for AI mathematics—it redefines what general-purpose language models can achieve when armed with structured parallel reasoning and tailored RL training. The achievement brings researchers a step closer to AI systems that can tackle longstanding open problems and act as partner mathematicians rather than mere calculators.

10.7.25

CriticLean makes the AI “grader” the hero of math formalization

Automating the translation of plain-English math into Lean code has felt like grading your own exam: language models write a proof, a compiler checks syntax, and everyone hopes the semantics line up. CriticLean flips that script by training a dedicated critic model—dubbed CriticLeanGPT—that learns to catch logical slips the compiler can’t. Guided by reinforcement learning, that critic doesn’t just reject bad code; it drives an iterative rewrite loop that more than doubles end-to-end accuracy.

From passive judge to active coach

The team fine-tunes a lightweight Qwen backbone to score whether a Lean statement truly matches its natural-language prompt, then bakes those scores into a reward signal. Each failed attempt becomes a teaching moment, producing richer feedback than the usual “compiler error” one-liner. The critic also powers CriticLeanBench, a 500-item test set (half correct, half adversarially wrong) that shows CriticLeanGPT trouncing both open and closed-source baselines at spotting semantic mistakes.

Hard numbers: 38 % → 84 % accuracy

On a 50-problem slice of the Omni-MATH benchmark, a 7 B “Kimina-Autoformalizer” model alone solved just 38 % of tasks. A traditional compiler-feedback loop nudged that to 54 %. Swap in CriticLean’s RL-trained critic and the success rate soars to 84 %—a 30-point leap even seasoned theorem-prover veterans will notice.

A broader 500-problem stress test tells the same story: the multi-attempt CriticLean pipeline verified 52.8 % of statements under a 200-try cap, recovering forty extra points of yield that single-pass systems would toss out.

A new 285 k-problem corpus (and 36 k “diamond” stumpers)

Because the critic can certify semantic correctness without humans, the authors bootstrapped FineLeanCorpus, a 285 ,957-entry Lean dataset spanning 16 math domains with a flatter difficulty curve than the skewed Lean-Workbook previously used for fine-tuning. They also carved out a FineLeanCorpus-Diamond subset—36 k brutal problems meant to push future models beyond textbook algebra.

Why this matters

Reliability over compilation. Syntax is easy; semantics are king. CriticLean proves that investing compute in the grading phase pays bigger dividends than ever-bigger generators.
Plug-and-play RL recipe. The critic-guided loop is model-agnostic and could supervise any auto-formalizer—Lean, Isabelle, even Coq.
Dataset flywheel. With FineLeanCorpus open-sourced, researchers finally have a large, semantically vetted playground instead of noisy web scrapes.

Whether you’re chasing fully automated theorem proving or just want ChatGPT to stop hallucinating Lean syntax, CriticLean’s message is clear: the smartest way forward is to teach your models how to critique themselves.

Paper link: arXiv 2507.06181 (PDF)

8.7.25

DeepMesh makes artist-quality 3D meshes a one-click affair

Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images.

Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.

Two upgrades, one big leap

Pillar	What they did	Why it matters
72 % shorter sequences	A hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail.	Cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.
Human-aligned RL	Collected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model.	Removes holes and stray faces while nudging topology toward “artist-grade” layouts.

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes.

Results: fewer faces, better faces

Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences.
Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.
Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).
Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.

A public demo gallery already shows clean Watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.

Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

6.7.25

LangGraph Rollout: how VeRL leveled-up multi-turn Agent RL

Why this matters

If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code.

1. Where they started

Native VeRL multi-turn – great for quick experiments. You enable multi_turn: True, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days.
Pain points
1. Double bookkeeping: every tool had to be declared twice (YAML + Python).
2. Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones.

2. A quick stop-gap: automatic tool wrapping

Yanbin added BaseTool.from_callable(), which introspects any plain Python function with transformers.utils.get_json_schema, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (tool_list = [multiply, add, …]) now powers both training and prod.

My dev take: this is the same pattern I use in LangChain when I decorate business logic with @tool. Nice to see VeRL admit “if you can’t beat reflection, join it.”

3. The real blocker: orchestration power

Research quickly outgrew VeRL’s built-in rollout:

Need	Why VeRL fell short
Dynamic branches & backtracking	Native graph was too rigid.
True multi-turn dialogue (user follow-ups)	Any assistant message without tool calls ended the convo.
Per-node sampling / chat-template tweaks	Global settings only.

Enter LangGraph: a lightweight DAG engine already shipping in production.

4. Architectural insight: separation of concerns

“Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.”

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

VeRL hands the initial messages + model handle to the user’s LangGraph.
The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.
When the graph stops, VeRL collects the message history and rewards.

The PR shows a seven-line YAML snippet that swaps the old rollout for:

yaml
multi_turn:
  chat_template_kwargs: {enable_thinking: false}
  langgraph:
    path: /path/to/graph.py
    graph_config: {recursion_limit: 100}

…and a 60-line example graph that binds tools, counts turns, and lets you vary temperature node-by-node.

5. Why I’m excited

One graph to rule them all – deployment and training share code; no more “but it worked in prod!”
Easier ablations – want to test a new branch strategy? Edit the graph script; RL pipeline stays untouched.
Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.

My takeaway

VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.

Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/

WebSailor charts an open-source course to super-human web reasoning

For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.”

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning.

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2 k expert traces primes the model so DUPO has something to optimize.

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7 % pass@1 on BrowseComp-en, trouncing agents built on 32 B backbones that manage barely 2 – 3 %. The 32 B and 72 B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools.

Why it matters

Democratizing deep search. BrowseComp-level tasks—ask a question, navigate dozen-plus pages, synthesize an answer—are what corporate knowledge-bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.
A recipe, not a model. The CPT + HCF routine, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.
Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72 B model scores >90 % pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks.

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers.

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

4.7.25

DiffuCoder rewrites the code-LLM playbook with diffusion and smarter RL

Autoregressive (AR) giants like GPT-4o and Qwen2.5 dominate today’s leaderboard-driven coding scene, but Apple’s research group thinks the next breakthrough may come from an entirely different generation paradigm. In a paper published late last week, the team unveiled DiffuCoder — a 7 B-parameter masked diffusion language model (dLLM) designed specifically for program synthesis and repair. Unlike AR models that predict the next token left-to-right, DiffuCoder iteratively denoises whole sequences, enabling global planning and out-of-order refinement.

What’s new under the hood

Scaled training for code. DiffuCoder is pretrained on 130 billion code tokens, then instruction-tuned and RL-fined on curated problem sets. That makes it one of the largest diffusion-first code models publicly documented.
Decoding insights. The authors introduce local and global AR-ness metrics to quantify how often a diffusion model falls back to sequential generation. They show that raising temperature not only diversifies token choice but also the order in which tokens are filled — a property AR models lack.
Coupled-GRPO. To tame the high-variance log-likelihood estimates that plague diffusion policy gradients, Apple proposes coupled Group Relative Policy Optimization, a two-pass masking strategy that evaluates complementary token subsets in one RL rollout. The technique drops noise without resorting to semi-AR “block decoding,” keeping the model fully diffusion-native.

Benchmark scores that matter

DiffuCoder’s base model already lands in the same ballpark as leading 7/8 B AR coders. After instruction tuning and coupled-GRPO, it posts:

Model	HumanEval+	MBPP+	EvalPlus (avg.)	BigCodeBench C-Full
DiffuCoder-Instruct	72.0	65.2	75.1	61.9
+ coupled-GRPO	73.2	68.3	78.6	67.5

That +4.4-point jump on EvalPlus brings the diffusion model within striking distance of Qwen2.5-Coder-SFT while comfortably outpacing earlier dLLMs like Dream-7B and LLaDA-Instruct.

Why it matters

Diffusion’s parallel denoising lets models “think in drafts,” revisiting earlier lines without paying the quadratic attention tax AR models incur for long contexts. For enterprise dev-ops teams staring down thousand-line files, a diffusion-native coder that no longer needs block-wise hacks could slash latency and memory. And because coupled-GRPO is plug-and-play, the method can in theory retrofit any masked diffusion LLM — not just Apple’s.

Early tooling and ecosystem

A DiffuCoder-7B-Instruct checkpoint is already live on Hugging Face, and the GitHub repo ships with sampling scripts, RL rewards and evaluation harnesses. That means startups building unit-test agents or code-review copilots can kick the tires today on a single A100.

The bigger question is whether diffusion LLMs can climb the performance ladder as fast as their image cousins did in 2022. Apple’s coupled-GRPO shows one path forward: make RL native to diffusion instead of forcing AR habits onto a fundamentally different beast. If follow-up work scales the idea to 34 B or 70 B parameters, AR incumbents may soon find themselves sharing the podium.

Paper link: arXiv 2506.20639 (PDF)

18.6.25

MiniMax-M1: A Breakthrough Open-Source LLM with a 1 Million Token Context & Cost-Efficient Reinforcement Learning

MiniMax, a Chinese AI startup renowned for its Hailuo video model, has unveiled MiniMax-M1, a landmark open-source language model released under the Apache 2.0 license. Designed for long-context reasoning and agentic tool use, M1 supports a 1 million token input and 80,000 token output window—vastly exceeding most commercial LLMs and enabling it to process large documents, contracts, or codebases in one go.

Built on a hybrid Mixture-of-Experts (MoE) architecture with lightning attention, MiniMax-M1 optimizes performance and cost. The model spans 456 billion parameters, with 45.9 billion activated per token. Its training employed a custom CISPO reinforcement learning algorithm, resulting in substantial efficiency gains. Remarkably, M1 was trained for just $534,700, compared to over $5–6 million spent by DeepSeek‑R1 or over $100 million for GPT‑4.

⚙️ Key Architectural Innovations

1M Token Context Window: Enables comprehensive reasoning across lengthy documents or multi-step workflows.
Hybrid MoE + Lightning Attention: Delivers high performance without excessive computational overhead.
CISPO RL Algorithm: Efficiently trains the model with clipped importance sampling, lowering cost and training time.
Dual Variants: M1-40k and M1-80k versions support variable output lengths (40K and 80K “thinking budget”).

📊 Benchmark-Topping Performance

MiniMax-M1 excels in diverse reasoning and coding benchmarks:

– AIME 2024 (Math): 86.0% accuracy
– LiveCodeBench (Coding): 65.0%
– SWE‑bench Verified: 56.0%
– TAU‑bench: 62.8%
– OpenAI MRCR (4-needle): 73.4%

These results surpass leading open-weight models like DeepSeek‑R1 and Qwen3‑235B‑A22B, narrowing the gap with top-tier commercial LLMs such as OpenAI’s o3 and Google’s Gemini due to its unique architectural optimizations.

🚀 Developer-Friendly & Agent-Ready

MiniMax-M1 supports structured function calling and is packaged with an agent-capable API that includes search, multimedia generation, speech synthesis, and voice cloning. Recommended for deployment via vLLM, optimized for efficient serving and batch handling, it also offers standard Transformers compatibility.

For enterprises, technical leads, and AI orchestration engineers—MiniMax-M1 provides:

Lower operational costs and compute footprint
Simplified integration into existing AI pipelines
Support for in-depth, long-document tasks
A self-hosted, secure alternative to cloud-bound models
Business-grade performance with full community access

🧩 Final Takeaway

MiniMax-M1 marks a milestone in open-source AI—combining extreme context length, reinforcement-learning efficiency, and high benchmark performance within a cost-effective, accessible framework. It opens new possibilities for developers, researchers, and enterprises tackling tasks requiring deep reasoning over extensive content—without the limitations or expense of closed-weight models.

10.6.25

Ether0: The 24B-Parameter Scientific Reasoning Model Accelerating Molecular Discovery

FutureHouse has unveiled Ether0, a 24 billion-parameter open-source reasoning model specialized for chemistry tasks. Built on Mistral 24B and fine-tuned through chain-of-thought reinforcement learning, Ether0 accepts natural-language prompts and generates molecule structures in SMILES notation, excelling particularly in drug-like compound design.

Why Ether0 Matters

While general-purpose LLMs possess extensive chemical knowledge, they falter at molecule manipulation—incorrect atom counts, implausible rings, or inaccurate compound names. Ether0 addresses these deficiencies by learning from reinforcement signals grounded in chemical validity rather than mimicry, significantly boosting accuracy in molecule generation.

Training Methodology

Base Model & Datasets: Starts with Mistral 24B Instruct.
Fine-tuning: Trains chains of thought and correct answers through supervised learning, separating specialists per task.
Reinforcement Learning: Specialized models trained on molecular tasks across ~50K examples each.
Distillation: Merges specialist reasoning into a generalized model, further refined with reinforcement over multiple tasks.

This modular workflow enables data efficiency, with Ether0 surpassing frontier models like GPT‑4.1 and DeepSeek‑R1 on chemistry problems while using substantially less data than traditional methods.

Capabilities and Limits

Ether0 accurately handles tasks such as:

Converting formulas (e.g., C₂₇H₃₇N₃O₄) to valid molecules.
Designing compounds by functional groups, solubility, pKa, smell, or receptor binding.
Proposing retrosynthesis steps and reaction outcomes.

However, it falters in:

Naming via IUPAC or common names.
Reasoning on molecular conformations.
General conversational chemistry outside strict molecule output.

The model develops unique behaviors—blending languages and inventing new terms (e.g., “reductamol”)—reflecting deeper reasoning at the cost of clarity in some reasoning traces.

Safety & Governance

Ether0 is released under an Apache 2.0 license and includes safeguards: refusal on controlled compounds, missiles-toxins filters, and rejection of explicit malicious content. This safety post-processing is critical given its open-weight deployment.

Community & Future Vision

Built by a FutureHouse team supported by Eric Schmidt and VoltagePark, Ether0 is part of a broader quest to automate scientific discovery via AI agents. The code, reward models, benchmarks, and model weights are available on GitHub and Hugging Face. Next steps include integrating Ether0 into Phoenix—FutureHouse’s chemistry agent—as a foundational block toward a generalized scientific reasoning engine

Key Takeaways

Domain-specific reasoning: Demonstrates how reinforcement-tuned LLMs can learn scientific tasks beyond pretraining.
Data-efficient training: Delivers strong performance using ~50K task-specific examples, far fewer than traditional AI training regimes.
Open-source advancement: Enables scientific and developer communities to build upon Ether0 in drug design and other chemistry domains.
Transparent reasoning traces: Offers insight into LLM ‘thought processes’, facilitating interpretability in scientific AI.

9.6.25

Google’s MASS Revolutionizes Multi-Agent AI by Automating Prompt and Topology Optimization

Designing multi-agent AI systems—where several AI "agents" collaborate—has traditionally depended on manual tuning of prompt instructions and agent communication structures (topologies). Google AI, in partnership with Cambridge researchers, is aiming to change that with their new Multi-Agent System Search (MASS) framework. MASS brings automation to the design process, ensuring consistent performance gains across complex domains.

🧠 What MASS Actually Does

MASS performs a three-stage automated optimization that iteratively refines:

Block-Level Prompt Tuning
Fine-tunes individual agent prompts via local search—sharpening their roles (think “questioner”, “solver”).
Topology Optimization
Identifies the best agent interaction structure. It prunes and evaluates possible communication workflows to find the most impactful design.
Workflow-Level Prompt Refinement
Final tuning of prompts once the best network topology is set.

By alternating prompt and topology adjustments, MASS achieves optimization that surpasses previous methods which tackled only one dimension

🏅 Why It Matters

Benchmarked Success: MASS-designed agent systems outperform AFlow and ADAS on challenging benchmarks like MATH, LiveCodeBench, and multi-hop question-answering
Reduced Manual Overhead: Designers no longer need to trial-and-error their way through thousands of prompt-topology combinations.
Extended to Real-World Tasks: Whether for reasoning, coding, or decision-making, this framework is broadly applicable across domains.

💬 Community Reactions

Reddit’s r/machinelearningnews highlighted MASS’s leap beyond isolated prompt or topology tuning:

“Multi-Agent System Search (MASS) … reduces manual effort while achieving state‑of‑the‑art performance on tasks like reasoning, multi‑hop QA, and code generation.” linkedin.com

📘 Technical Deep Dive

Originating from a February 2025 paper by Zhou et al., MASS represents a methodological advance in agentic AI

Agents are modular: designed for distinct roles through prompts.
Topology defines agent communication patterns: linear chain, tree, ring, etc.
MASS explores both prompt and topology spaces, sequentially optimizing them across three stages.
Final systems demonstrate robustness not just in benchmarks but as a repeatable design methodology.

🚀 Wider Implications

Democratizing Agent Design: Non-experts in prompt engineering can deploy effective agent systems from pre-designed searches.
Adaptability: Potential for expanding MASS to dynamic, real-world settings like real-time planning and adaptive workflows.
Innovation Accelerator: Encourages research into auto-tuned multi-agent frameworks for fields like robotics, data pipelines, and interactive assistants.

🧭 Looking Ahead

As Google moves deeper into its “agentic era”—with initiatives like Project Mariner and Gemini's Agent Mode—MASS offers a scalable blueprint for future AS/AI applications. Expect to see frameworks that not only generate prompts but also self-optimize their agent networks for performance and efficiency.

6.6.25

NVIDIA's ProRL: Advancing Reasoning in Language Models Through Prolonged Reinforcement Learning

NVIDIA has unveiled ProRL (Prolonged Reinforcement Learning), a groundbreaking training methodology designed to expand the reasoning boundaries of large language models (LLMs). By extending the duration and stability of reinforcement learning (RL) training, ProRL enables LLMs to develop novel reasoning strategies that surpass the capabilities of their base models.

Understanding ProRL

Traditional RL approaches often face challenges in enhancing the reasoning abilities of LLMs, sometimes merely amplifying existing patterns without fostering genuine innovation. ProRL addresses this by introducing:

KL Divergence Control: Maintains a balance between exploring new strategies and retaining learned knowledge.
Reference Policy Resetting: Periodically resets the policy to prevent convergence on suboptimal solutions.
Diverse Task Suite: Engages models in a wide array of tasks to promote generalization and adaptability.

These components collectively ensure that models not only learn more effectively but also develop unique reasoning pathways previously inaccessible through standard training methods.

Key Findings

Empirical evaluations demonstrate that ProRL-trained models consistently outperform their base counterparts across various benchmarks, including scenarios where base models fail entirely. Notably, improvements were observed in:

Pass@k Evaluations: Higher success rates in generating correct outputs within k attempts.
Creativity Index: Enhanced ability to produce novel solutions not present in the training data.

These results indicate that prolonged RL training can lead to the emergence of new reasoning capabilities, expanding the solution space beyond initial limitations.

Implications for AI Development

The introduction of ProRL signifies a pivotal shift in AI training paradigms. By demonstrating that extended and stable RL training can foster genuine reasoning advancements, NVIDIA paves the way for more sophisticated and adaptable AI systems. This has profound implications for applications requiring complex decision-making and problem-solving abilities.

Accessing ProRL Resources

To facilitate further research and development, NVIDIA has released the model weights associated with ProRL. Interested parties can access these resources here:

These resources provide valuable insights and tools for researchers aiming to explore the frontiers of AI reasoning capabilities.

3.6.25

MiMo-VL-7B: Xiaomi's Advanced Vision-Language Model Elevating Multimodal AI Reasoning

Xiaomi has unveiled MiMo-VL-7B, a cutting-edge vision-language model (VLM) that combines compact architecture with exceptional performance in multimodal reasoning tasks. Designed to process and understand both visual and textual data, MiMo-VL-7B sets a new benchmark in the field of AI.

Innovative Architecture and Training

MiMo-VL-7B comprises three key components:

A native-resolution Vision Transformer (ViT) encoder that preserves fine-grained visual details.
A Multi-Layer Perceptron (MLP) projector for efficient cross-modal alignment.
The MiMo-7B language model, specifically optimized for complex reasoning tasks.

The model undergoes a two-phase training process:

Four-Stage Pre-Training: This phase includes projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning (SFT), resulting in the MiMo-VL-7B-SFT model.
Mixed On-Policy Reinforcement Learning (MORL): In this phase, diverse reward signals—such as perception accuracy, visual grounding precision, logical reasoning capabilities, and human preferences—are integrated to produce the MiMo-VL-7B-RL model.

Performance Highlights

MiMo-VL-7B demonstrates state-of-the-art performance in various benchmarks:

Excels in general visual-language understanding tasks.
Outperforms existing open-source models in multimodal reasoning tasks.
Exhibits exceptional GUI understanding and grounding capabilities, rivaling specialized models.

Notably, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first across models ranging from 7B to 72B parameters.

Accessibility and Deployment

Xiaomi has open-sourced the MiMo-VL-7B series, including both the SFT and RL models, making them available for the research community and developers. The models are compatible with the Qwen2_5_VLForConditionalGeneration architecture, facilitating seamless deployment and inference.

Conclusion

MiMo-VL-7B represents a significant advancement in vision-language modeling, combining compact design with high performance. Through innovative training methodologies and open-source availability, Xiaomi contributes to the broader AI community's efforts in developing sophisticated multimodal systems.

1.6.25

QwenLong-L1: Alibaba's Breakthrough in Long-Context AI Reasoning

In a significant advancement for artificial intelligence, Alibaba Group has unveiled QwenLong-L1, a new framework designed to enhance large language models' (LLMs) ability to process and reason over exceptionally long textual inputs. This development addresses a longstanding challenge in AI: enabling models to understand and analyze extensive documents such as detailed corporate filings, comprehensive financial statements, and complex legal contracts.

The Challenge of Long-Form Reasoning

While recent advancements in large reasoning models (LRMs), particularly through reinforcement learning (RL), have improved problem-solving capabilities, these improvements have predominantly been observed with shorter texts, typically around 4,000 tokens. Scaling reasoning abilities to longer contexts, such as 120,000 tokens, remains a significant hurdle. Long-form reasoning necessitates a robust understanding of the entire context and the capacity for multi-step analysis. This limitation has posed a barrier to practical applications requiring interaction with extensive external knowledge.

Introducing QwenLong-L1

QwenLong-L1 addresses this challenge through a structured, multi-stage reinforcement learning framework:

Warm-up Supervised Fine-Tuning (SFT): The model undergoes initial training on examples of long-context reasoning, establishing a foundation for understanding context, generating logical reasoning chains, and extracting answers.
Curriculum-Guided Phased RL: Training progresses through multiple phases with gradually increasing input lengths, allowing the model to adapt its reasoning strategies from shorter to longer contexts systematically.
Difficulty-Aware Retrospective Sampling: Incorporating challenging examples from previous training phases ensures the model continues to learn from complex problems, encouraging exploration of diverse reasoning paths.

Additionally, QwenLong-L1 employs a hybrid reward mechanism combining rule-based verification with an "LLM-as-a-judge" approach, comparing the semantic similarity of generated answers with ground truth, allowing for more flexible and nuanced evaluations.

Performance and Implications

Evaluations using document question-answering benchmarks demonstrated QwenLong-L1's capabilities. Notably, the QwenLong-L1-32B model achieved performance comparable to leading models like Anthropic’s Claude-3.7 Sonnet Thinking and outperformed others such as OpenAI’s o3-mini. The model exhibited advanced reasoning behaviors, including grounding, subgoal setting, backtracking, and verification, essential for complex document analysis.

The introduction of QwenLong-L1 signifies a pivotal step in AI's ability to handle long-context reasoning tasks, opening avenues for applications in legal analysis, financial research, and beyond. By overcoming previous limitations, this framework enhances the practicality and reliability of AI in processing extensive and intricate documents.

29.5.25

Introducing s3: A Modular RAG Framework for Efficient Search Agent Training

Researchers at the University of Illinois Urbana-Champaign have developed s3, an open-source framework designed to streamline the training of search agents within Retrieval-Augmented Generation (RAG) systems. By decoupling the retrieval and generation components, s3 allows for efficient training using minimal data, addressing challenges faced by enterprises in deploying AI applications.

Evolution of RAG Systems

The effectiveness of RAG systems largely depends on the quality of their retrieval mechanisms. The researchers categorize the evolution of RAG approaches into three phases:

Classic RAG: Utilizes static retrieval methods with fixed queries, often resulting in a disconnect between retrieval quality and generation performance.
Pre-RL-Zero: Introduces multi-turn interactions between query generation, retrieval, and reasoning, but lacks trainable components to optimize retrieval based on outcomes.
RL-Zero: Employs reinforcement learning to train models as search agents, improving through feedback like answer correctness. However, these approaches often require fine-tuning the entire language model, which can be costly and limit compatibility with proprietary models.

The s3 Framework

s3 addresses these limitations by focusing solely on optimizing the retrieval component. It introduces a novel reward signal called Gain Beyond RAG (GBR), which measures the improvement in generation accuracy when using s3's retrieved documents compared to naive retrieval methods. This approach allows the generator model to remain untouched, facilitating integration with various off-the-shelf or proprietary large language models.

In evaluations across multiple question-answering benchmarks, s3 demonstrated strong performance using only 2.4k training examples, outperforming other methods that require significantly more data. Notably, s3 also showed the ability to generalize to domains it wasn't explicitly trained on, such as medical question-answering tasks.

Implications for Enterprises

For enterprises, s3 offers a practical solution to building efficient and adaptable search agents without the need for extensive data or computational resources. Its modular design ensures compatibility with existing language models and simplifies the deployment of AI-powered search applications.

Paper: "s3: You Don't Need That Much Data to Train a Search Agent via RL" – arXiv, May 20, 2025.

https://arxiv.org/abs/2505.14146

31.7.25

Three moving parts

Why reinforcement learning?

Early numbers worth noting

Why it matters

The bigger picture

23.7.25

Architecture built for truly big code

Reinforcement learning at industrial scale

Benchmarks and agentic chops

Meet Qwen Code—your command‑line copilot

Why it matters

22.7.25

From Silver to Gold in Twelve Months

What Makes Deep Think Different?

Benchmark Significance

Broader Implications

Limitations & Next Steps

Key Takeaway

10.7.25

From passive judge to active coach

Hard numbers: 38 % → 84 % accuracy

A new 285 k-problem corpus (and 36 k “diamond” stumpers)

Why this matters

8.7.25

Why previous mesh LLMs hit the wall

Two upgrades, one big leap

Results: fewer faces, better faces

Why this is bigger than 3D fan art

6.7.25

Why this matters

1. Where they started

2. A quick stop-gap: automatic tool wrapping

3. The real blocker: orchestration power

4. Architectural insight: separation of concerns

5. Why I’m excited

My takeaway

Turning the web into a maze on purpose

DUPO: reinforcement learning that isn’t painfully slow

Four sizes, one giant leap

Why it matters

Open weights, open benchmark

4.7.25

What’s new under the hood

Benchmark scores that matter

Why it matters

Early tooling and ecosystem

18.6.25

⚙️ Key Architectural Innovations

📊 Benchmark-Topping Performance

🚀 Developer-Friendly & Agent-Ready

🧩 Final Takeaway

10.6.25

Why Ether0 Matters

Training Methodology

Capabilities and Limits

Safety & Governance

Community & Future Vision

9.6.25

🧠 What MASS Actually Does

🏅 Why It Matters

💬 Community Reactions

📘 Technical Deep Dive

🚀 Wider Implications

🧭 Looking Ahead

6.6.25

Understanding ProRL

Key Findings

Implications for AI Development

Accessing ProRL Resources

3.6.25

1.6.25

29.5.25

Evolution of RAG Systems

The s3 Framework

Implications for Enterprises

Meet Qwen Code—your command‑line copilot