Showing posts with label Reinforcement Learning. Show all posts
Showing posts with label Reinforcement Learning. Show all posts

18.6.25

MiniMax-M1: A Breakthrough Open-Source LLM with a 1 Million Token Context & Cost-Efficient Reinforcement Learning

 MiniMax, a Chinese AI startup renowned for its Hailuo video model, has unveiled MiniMax-M1, a landmark open-source language model released under the Apache 2.0 license. Designed for long-context reasoning and agentic tool use, M1 supports a 1 million token input and 80,000 token output window—vastly exceeding most commercial LLMs and enabling it to process large documents, contracts, or codebases in one go.

Built on a hybrid Mixture-of-Experts (MoE) architecture with lightning attention, MiniMax-M1 optimizes performance and cost. The model spans 456 billion parameters, with 45.9 billion activated per token. Its training employed a custom CISPO reinforcement learning algorithm, resulting in substantial efficiency gains. Remarkably, M1 was trained for just $534,700, compared to over $5–6 million spent by DeepSeek‑R1 or over $100 million for GPT‑4.


⚙️ Key Architectural Innovations

  • 1M Token Context Window: Enables comprehensive reasoning across lengthy documents or multi-step workflows.

  • Hybrid MoE + Lightning Attention: Delivers high performance without excessive computational overhead.

  • CISPO RL Algorithm: Efficiently trains the model with clipped importance sampling, lowering cost and training time.

  • Dual Variants: M1-40k and M1-80k versions support variable output lengths (40K and 80K “thinking budget”).


📊 Benchmark-Topping Performance

MiniMax-M1 excels in diverse reasoning and coding benchmarks:

AIME 2024 (Math): 86.0% accuracy
LiveCodeBench (Coding): 65.0%
SWE‑bench Verified: 56.0%
TAU‑bench: 62.8%
OpenAI MRCR (4-needle): 73.4% 

These results surpass leading open-weight models like DeepSeek‑R1 and Qwen3‑235B‑A22B, narrowing the gap with top-tier commercial LLMs such as OpenAI’s o3 and Google’s Gemini due to its unique architectural optimizations.


🚀 Developer-Friendly & Agent-Ready

MiniMax-M1 supports structured function calling and is packaged with an agent-capable API that includes search, multimedia generation, speech synthesis, and voice cloning. Recommended for deployment via vLLM, optimized for efficient serving and batch handling, it also offers standard Transformers compatibility.

For enterprises, technical leads, and AI orchestration engineers—MiniMax-M1 provides:

  • Lower operational costs and compute footprint

  • Simplified integration into existing AI pipelines

  • Support for in-depth, long-document tasks

  • A self-hosted, secure alternative to cloud-bound models

  • Business-grade performance with full community access


🧩 Final Takeaway

MiniMax-M1 marks a milestone in open-source AI—combining extreme context length, reinforcement-learning efficiency, and high benchmark performance within a cost-effective, accessible framework. It opens new possibilities for developers, researchers, and enterprises tackling tasks requiring deep reasoning over extensive content—without the limitations or expense of closed-weight models.

10.6.25

Ether0: The 24B-Parameter Scientific Reasoning Model Accelerating Molecular Discovery

 FutureHouse has unveiled Ether0, a 24 billion-parameter open-source reasoning model specialized for chemistry tasks. Built on Mistral 24B and fine-tuned through chain-of-thought reinforcement learning, Ether0 accepts natural-language prompts and generates molecule structures in SMILES notation, excelling particularly in drug-like compound design.

Why Ether0 Matters

While general-purpose LLMs possess extensive chemical knowledge, they falter at molecule manipulation—incorrect atom counts, implausible rings, or inaccurate compound names. Ether0 addresses these deficiencies by learning from reinforcement signals grounded in chemical validity rather than mimicry, significantly boosting accuracy in molecule generation.

Training Methodology

  • Base Model & Datasets: Starts with Mistral 24B Instruct.

  • Fine-tuning: Trains chains of thought and correct answers through supervised learning, separating specialists per task.

  • Reinforcement Learning: Specialized models trained on molecular tasks across ~50K examples each.

  • Distillation: Merges specialist reasoning into a generalized model, further refined with reinforcement over multiple tasks.

This modular workflow enables data efficiency, with Ether0 surpassing frontier models like GPT‑4.1 and DeepSeek‑R1 on chemistry problems while using substantially less data than traditional methods.

Capabilities and Limits

Ether0 accurately handles tasks such as:

  • Converting formulas (e.g., C₂₇H₃₇N₃O₄) to valid molecules.

  • Designing compounds by functional groups, solubility, pKa, smell, or receptor binding.

  • Proposing retrosynthesis steps and reaction outcomes.

However, it falters in:

  • Naming via IUPAC or common names.

  • Reasoning on molecular conformations.

  • General conversational chemistry outside strict molecule output.

The model develops unique behaviors—blending languages and inventing new terms (e.g., “reductamol”)—reflecting deeper reasoning at the cost of clarity in some reasoning traces.

Safety & Governance

Ether0 is released under an Apache 2.0 license and includes safeguards: refusal on controlled compounds, missiles-toxins filters, and rejection of explicit malicious content. This safety post-processing is critical given its open-weight deployment.

Community & Future Vision

Built by a FutureHouse team supported by Eric Schmidt and VoltagePark, Ether0 is part of a broader quest to automate scientific discovery via AI agents. The code, reward models, benchmarks, and model weights are available on GitHub and Hugging Face. Next steps include integrating Ether0 into Phoenix—FutureHouse’s chemistry agent—as a foundational block toward a generalized scientific reasoning engine 


Key Takeaways

  1. Domain-specific reasoning: Demonstrates how reinforcement-tuned LLMs can learn scientific tasks beyond pretraining.

  2. Data-efficient training: Delivers strong performance using ~50K task-specific examples, far fewer than traditional AI training regimes.

  3. Open-source advancement: Enables scientific and developer communities to build upon Ether0 in drug design and other chemistry domains.

  4. Transparent reasoning traces: Offers insight into LLM ‘thought processes’, facilitating interpretability in scientific AI.

9.6.25

Google’s MASS Revolutionizes Multi-Agent AI by Automating Prompt and Topology Optimization

 Designing multi-agent AI systems—where several AI "agents" collaborate—has traditionally depended on manual tuning of prompt instructions and agent communication structures (topologies). Google AI, in partnership with Cambridge researchers, is aiming to change that with their new Multi-Agent System Search (MASS) framework. MASS brings automation to the design process, ensuring consistent performance gains across complex domains.


🧠 What MASS Actually Does

MASS performs a three-stage automated optimization that iteratively refines:

  1. Block-Level Prompt Tuning
    Fine-tunes individual agent prompts via local search—sharpening their roles (think “questioner”, “solver”).

  2. Topology Optimization
    Identifies the best agent interaction structure. It prunes and evaluates possible communication workflows to find the most impactful design.

  3. Workflow-Level Prompt Refinement
    Final tuning of prompts once the best network topology is set.

By alternating prompt and topology adjustments, MASS achieves optimization that surpasses previous methods which tackled only one dimension 


🏅 Why It Matters

  • Benchmarked Success: MASS-designed agent systems outperform AFlow and ADAS on challenging benchmarks like MATH, LiveCodeBench, and multi-hop question-answering 

  • Reduced Manual Overhead: Designers no longer need to trial-and-error their way through thousands of prompt-topology combinations.

  • Extended to Real-World Tasks: Whether for reasoning, coding, or decision-making, this framework is broadly applicable across domains.


💬 Community Reactions

Reddit’s r/machinelearningnews highlighted MASS’s leap beyond isolated prompt or topology tuning:

“Multi-Agent System Search (MASS) … reduces manual effort while achieving state‑of‑the‑art performance on tasks like reasoning, multi‑hop QA, and code generation.” linkedin.com

 


📘 Technical Deep Dive

Originating from a February 2025 paper by Zhou et al., MASS represents a methodological advance in agentic AI

  • Agents are modular: designed for distinct roles through prompts.

  • Topology defines agent communication patterns: linear chain, tree, ring, etc.

  • MASS explores both prompt and topology spaces, sequentially optimizing them across three stages.

  • Final systems demonstrate robustness not just in benchmarks but as a repeatable design methodology.


🚀 Wider Implications

  • Democratizing Agent Design: Non-experts in prompt engineering can deploy effective agent systems from pre-designed searches.

  • Adaptability: Potential for expanding MASS to dynamic, real-world settings like real-time planning and adaptive workflows.

  • Innovation Accelerator: Encourages research into auto-tuned multi-agent frameworks for fields like robotics, data pipelines, and interactive assistants.


🧭 Looking Ahead

As Google moves deeper into its “agentic era”—with initiatives like Project Mariner and Gemini's Agent Mode—MASS offers a scalable blueprint for future AS/AI applications. Expect to see frameworks that not only generate prompts but also self-optimize their agent networks for performance and efficiency.

6.6.25

NVIDIA's ProRL: Advancing Reasoning in Language Models Through Prolonged Reinforcement Learning

 NVIDIA has unveiled ProRL (Prolonged Reinforcement Learning), a groundbreaking training methodology designed to expand the reasoning boundaries of large language models (LLMs). By extending the duration and stability of reinforcement learning (RL) training, ProRL enables LLMs to develop novel reasoning strategies that surpass the capabilities of their base models.

Understanding ProRL

Traditional RL approaches often face challenges in enhancing the reasoning abilities of LLMs, sometimes merely amplifying existing patterns without fostering genuine innovation. ProRL addresses this by introducing:

  • KL Divergence Control: Maintains a balance between exploring new strategies and retaining learned knowledge.

  • Reference Policy Resetting: Periodically resets the policy to prevent convergence on suboptimal solutions.

  • Diverse Task Suite: Engages models in a wide array of tasks to promote generalization and adaptability.

These components collectively ensure that models not only learn more effectively but also develop unique reasoning pathways previously inaccessible through standard training methods.

Key Findings

Empirical evaluations demonstrate that ProRL-trained models consistently outperform their base counterparts across various benchmarks, including scenarios where base models fail entirely. Notably, improvements were observed in:

  • Pass@k Evaluations: Higher success rates in generating correct outputs within k attempts.

  • Creativity Index: Enhanced ability to produce novel solutions not present in the training data.

These results indicate that prolonged RL training can lead to the emergence of new reasoning capabilities, expanding the solution space beyond initial limitations.

Implications for AI Development

The introduction of ProRL signifies a pivotal shift in AI training paradigms. By demonstrating that extended and stable RL training can foster genuine reasoning advancements, NVIDIA paves the way for more sophisticated and adaptable AI systems. This has profound implications for applications requiring complex decision-making and problem-solving abilities.

Accessing ProRL Resources

To facilitate further research and development, NVIDIA has released the model weights associated with ProRL. Interested parties can access these resources here:

These resources provide valuable insights and tools for researchers aiming to explore the frontiers of AI reasoning capabilities.

3.6.25

MiMo-VL-7B: Xiaomi's Advanced Vision-Language Model Elevating Multimodal AI Reasoning

 Xiaomi has unveiled MiMo-VL-7B, a cutting-edge vision-language model (VLM) that combines compact architecture with exceptional performance in multimodal reasoning tasks. Designed to process and understand both visual and textual data, MiMo-VL-7B sets a new benchmark in the field of AI.

Innovative Architecture and Training

MiMo-VL-7B comprises three key components:

  • A native-resolution Vision Transformer (ViT) encoder that preserves fine-grained visual details.

  • A Multi-Layer Perceptron (MLP) projector for efficient cross-modal alignment.

  • The MiMo-7B language model, specifically optimized for complex reasoning tasks.

The model undergoes a two-phase training process:

  1. Four-Stage Pre-Training: This phase includes projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning (SFT), resulting in the MiMo-VL-7B-SFT model.

  2. Mixed On-Policy Reinforcement Learning (MORL): In this phase, diverse reward signals—such as perception accuracy, visual grounding precision, logical reasoning capabilities, and human preferences—are integrated to produce the MiMo-VL-7B-RL model.

Performance Highlights

MiMo-VL-7B demonstrates state-of-the-art performance in various benchmarks:

  • Excels in general visual-language understanding tasks.

  • Outperforms existing open-source models in multimodal reasoning tasks.

  • Exhibits exceptional GUI understanding and grounding capabilities, rivaling specialized models.

Notably, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first across models ranging from 7B to 72B parameters.

Accessibility and Deployment

Xiaomi has open-sourced the MiMo-VL-7B series, including both the SFT and RL models, making them available for the research community and developers. The models are compatible with the Qwen2_5_VLForConditionalGeneration architecture, facilitating seamless deployment and inference.

Conclusion

MiMo-VL-7B represents a significant advancement in vision-language modeling, combining compact design with high performance. Through innovative training methodologies and open-source availability, Xiaomi contributes to the broader AI community's efforts in developing sophisticated multimodal systems.

1.6.25

QwenLong-L1: Alibaba's Breakthrough in Long-Context AI Reasoning

 In a significant advancement for artificial intelligence, Alibaba Group has unveiled QwenLong-L1, a new framework designed to enhance large language models' (LLMs) ability to process and reason over exceptionally long textual inputs. This development addresses a longstanding challenge in AI: enabling models to understand and analyze extensive documents such as detailed corporate filings, comprehensive financial statements, and complex legal contracts.

The Challenge of Long-Form Reasoning

While recent advancements in large reasoning models (LRMs), particularly through reinforcement learning (RL), have improved problem-solving capabilities, these improvements have predominantly been observed with shorter texts, typically around 4,000 tokens. Scaling reasoning abilities to longer contexts, such as 120,000 tokens, remains a significant hurdle. Long-form reasoning necessitates a robust understanding of the entire context and the capacity for multi-step analysis. This limitation has posed a barrier to practical applications requiring interaction with extensive external knowledge.

Introducing QwenLong-L1

QwenLong-L1 addresses this challenge through a structured, multi-stage reinforcement learning framework:

  1. Warm-up Supervised Fine-Tuning (SFT): The model undergoes initial training on examples of long-context reasoning, establishing a foundation for understanding context, generating logical reasoning chains, and extracting answers.

  2. Curriculum-Guided Phased RL: Training progresses through multiple phases with gradually increasing input lengths, allowing the model to adapt its reasoning strategies from shorter to longer contexts systematically.

  3. Difficulty-Aware Retrospective Sampling: Incorporating challenging examples from previous training phases ensures the model continues to learn from complex problems, encouraging exploration of diverse reasoning paths.

Additionally, QwenLong-L1 employs a hybrid reward mechanism combining rule-based verification with an "LLM-as-a-judge" approach, comparing the semantic similarity of generated answers with ground truth, allowing for more flexible and nuanced evaluations.

Performance and Implications

Evaluations using document question-answering benchmarks demonstrated QwenLong-L1's capabilities. Notably, the QwenLong-L1-32B model achieved performance comparable to leading models like Anthropic’s Claude-3.7 Sonnet Thinking and outperformed others such as OpenAI’s o3-mini. The model exhibited advanced reasoning behaviors, including grounding, subgoal setting, backtracking, and verification, essential for complex document analysis.

The introduction of QwenLong-L1 signifies a pivotal step in AI's ability to handle long-context reasoning tasks, opening avenues for applications in legal analysis, financial research, and beyond. By overcoming previous limitations, this framework enhances the practicality and reliability of AI in processing extensive and intricate documents.

29.5.25

Introducing s3: A Modular RAG Framework for Efficient Search Agent Training

 Researchers at the University of Illinois Urbana-Champaign have developed s3, an open-source framework designed to streamline the training of search agents within Retrieval-Augmented Generation (RAG) systems. By decoupling the retrieval and generation components, s3 allows for efficient training using minimal data, addressing challenges faced by enterprises in deploying AI applications.

Evolution of RAG Systems

The effectiveness of RAG systems largely depends on the quality of their retrieval mechanisms. The researchers categorize the evolution of RAG approaches into three phases:

  1. Classic RAG: Utilizes static retrieval methods with fixed queries, often resulting in a disconnect between retrieval quality and generation performance.

  2. Pre-RL-Zero: Introduces multi-turn interactions between query generation, retrieval, and reasoning, but lacks trainable components to optimize retrieval based on outcomes.

  3. RL-Zero: Employs reinforcement learning to train models as search agents, improving through feedback like answer correctness. However, these approaches often require fine-tuning the entire language model, which can be costly and limit compatibility with proprietary models.

The s3 Framework

s3 addresses these limitations by focusing solely on optimizing the retrieval component. It introduces a novel reward signal called Gain Beyond RAG (GBR), which measures the improvement in generation accuracy when using s3's retrieved documents compared to naive retrieval methods. This approach allows the generator model to remain untouched, facilitating integration with various off-the-shelf or proprietary large language models.

In evaluations across multiple question-answering benchmarks, s3 demonstrated strong performance using only 2.4k training examples, outperforming other methods that require significantly more data. Notably, s3 also showed the ability to generalize to domains it wasn't explicitly trained on, such as medical question-answering tasks.

Implications for Enterprises

For enterprises, s3 offers a practical solution to building efficient and adaptable search agents without the need for extensive data or computational resources. Its modular design ensures compatibility with existing language models and simplifies the deployment of AI-powered search applications.

Paper: "s3: You Don't Need That Much Data to Train a Search Agent via RL" – arXiv, May 20, 2025.

https://arxiv.org/abs/2505.14146

27.5.25

NVIDIA Introduces AceReason-Nemotron: Enhancing Math and Code Reasoning through Reinforcement Learning

 NVIDIA has unveiled AceReason-Nemotron, a 14-billion-parameter open-source model designed to enhance mathematical and coding reasoning through large-scale reinforcement learning (RL). This model demonstrates that RL can significantly improve reasoning capabilities in small to mid-sized models, surpassing traditional distillation-based approaches.

Key Features and Innovations

  • Sequential RL Training Strategy: The model undergoes a two-phase RL training process—initially on math-only prompts, followed by code-only prompts. This approach not only boosts performance in respective domains but also ensures minimal degradation across tasks. 

  • Enhanced Benchmark Performance: AceReason-Nemotron-14B achieves notable improvements on various benchmarks:

    • AIME 2025: 67.4% (+17.4%)

    • LiveCodeBench v5: 61.1% (+8%)

    • LiveCodeBench v6: 54.9% (+7%) 

  • Robust Data Curation Pipeline: NVIDIA developed a comprehensive data curation system to collect challenging prompts with verifiable answers, facilitating effective verification-based RL across both math and code domains. 

  • Curriculum Learning and Stability: The training incorporates curriculum learning with progressively increasing response lengths and utilizes on-policy parameter updates to stabilize the RL process. 

Implications for AI Development

AceReason-Nemotron's success illustrates the potential of reinforcement learning in enhancing the reasoning abilities of AI models, particularly in mathematical and coding tasks. By releasing this model under the NVIDIA Open Model License, NVIDIA encourages further research and development in the AI community.

26.5.25

GRIT: Teaching Multimodal Large Language Models to Reason with Images by Interleaving Text and Visual Grounding

 A recent AI research paper introduces GRIT (Grounded Reasoning with Images and Text), a pioneering approach designed to enhance the reasoning capabilities of Multimodal Large Language Models (MLLMs). GRIT enables these models to interleave natural language reasoning with explicit visual references, such as bounding box coordinates, allowing for more transparent and grounded decision-making processes.

Key Innovations of GRIT

  • Interleaved Reasoning Chains: Unlike traditional models that rely solely on textual explanations, GRIT-trained MLLMs generate reasoning chains that combine natural language with explicit visual cues, pinpointing specific regions in images that inform their conclusions.

  • Reinforcement Learning with GRPO-GR: GRIT employs a reinforcement learning strategy named GRPO-GR, which rewards models for producing accurate answers and well-structured, grounded reasoning outputs. This approach eliminates the need for extensive annotated datasets, as it does not require detailed reasoning chain annotations or explicit bounding box labels.

  • Data Efficiency: Remarkably, GRIT achieves effective training using as few as 20 image-question-answer triplets from existing datasets, demonstrating its efficiency and practicality for real-world applications.

Implications for AI Development

The GRIT methodology represents a significant advancement in the development of interpretable and efficient AI systems. By integrating visual grounding directly into the reasoning process, MLLMs can provide more transparent and verifiable explanations for their outputs, which is crucial for applications requiring high levels of trust and accountability.

14.5.25

Nemotron-Tool-N1: Revolutionizing LLM Tool Use with Reinforcement Learning

 In the rapidly evolving field of artificial intelligence, enabling large language models (LLMs) to effectively utilize external tools has become a focal point. Traditional methods often rely on supervised fine-tuning, which can be resource-intensive and may not generalize well across diverse tasks. Addressing these challenges, researchers have introduced Nemotron-Tool-N1, a novel approach that employs reinforcement learning to train LLMs for tool use with minimal supervision.

Moving Beyond Supervised Fine-Tuning

Conventional approaches to teaching LLMs tool usage typically involve supervised fine-tuning (SFT), where models learn from annotated reasoning traces or outputs from more powerful models. While effective to an extent, these methods often result in models that mimic reasoning patterns without truly understanding them, limiting their adaptability.

Nemotron-Tool-N1 diverges from this path by utilizing a reinforcement learning framework inspired by DeepSeek-R1. Instead of relying on detailed annotations, the model receives binary rewards based on the structural validity and functional correctness of its tool invocations. This approach encourages the model to develop its own reasoning strategies, leading to better generalization across tasks.

Impressive Performance Benchmarks

Built upon the Qwen-2.5-7B and Qwen-2.5-14B architectures, Nemotron-Tool-N1 has demonstrated remarkable performance. In evaluations using the BFCL and API-Bank benchmarks, the model not only achieved state-of-the-art results but also outperformed GPT-4o, showcasing its superior capability in tool utilization tasks.

Implications for the Future of AI

The success of Nemotron-Tool-N1 underscores the potential of reinforcement learning in training LLMs for complex tasks with minimal supervision. By moving away from traditional fine-tuning methods, this approach offers a more scalable and adaptable solution for integrating tool use into AI systems.

As the demand for more versatile and efficient AI models grows, innovations like Nemotron-Tool-N1 pave the way for future advancements in the field.

10.5.25

ZEROSEARCH: Simulating Search to Train Retrieval-Augmented LLMs at Zero API Cost

Introduction

Retrieval-Augmented Generation (RAG) has become a cornerstone for grounding large language models (LLMs) in up-to-date information. Yet, existing approaches that integrate live search engines face two critical hurdles: unpredictable document quality and prohibitive API expenses during reinforcement learning (RL) training arXiv. ZEROSEARCH, introduced by Sun et al., offers an elegant solution—train LLMs’ internal “search” strategies without ever contacting a real search engine, slashing costs and stabilizing learning.


Methodology Deep Dive

1. Search Simulation via Supervised Fine-Tuning

Rather than querying Google or Bing, ZEROSEARCH first converts an LLM into a retrieval module (π_ψ) through lightweight supervised fine-tuning (SFT).

  • Data Collection: The authors collect interaction trajectories by prompting the base LLM to interact with a real search engine until a correct answer is produced (“positive”) or an incorrect one (“negative”).

  • Prompt Design: Query–document pairs from these trajectories are extracted. The fine-tuning prompt explicitly labels whether the generated document should be useful or noisy, enabling the model to simulate both high- and low-quality retrievals on demand (Table 2) arXiv.

2. Curriculum-Based Rollout Strategy

To progressively challenge the policy model (π_θ), ZEROSEARCH employs a curriculum that gradually increases the noise probability (pᵢ) of simulated documents over training steps:

pi=ps+(i/m1b1)×(peps)p_i = p_s + \bigg(\frac{i/m - 1}{b - 1}\bigg) \times (p_e - p_s)
  • Parameters:

    • ps, pe: initial and final noise probabilities

    • i/m: fraction of completed training steps

    • b: exponential base (default 4)

  • Effect: Early training relies on mostly useful documents, allowing π_θ to learn structured reasoning. Over time, noisy retrievals dominate, forcing robust search strategies arXiv.

3. Reinforcement Learning Objective

ZEROSEARCH frames the optimization as:

maxπθ    Ex,y[rϕ(x,y)    βDKL(πθπref)],\max_{\pi_\theta} \;\; \mathbb{E}_{x,y}\Big[\,r_\phi(x,y)\;-\;\beta\,D_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)\Big],

where:

  • rₚhi(x,y): F1-based reward (balances precision & recall, avoids “reward hacking” seen with Exact Match) arXiv.

  • π_ref: reference model (for KL-penalty regularization).

  • Compatible Algorithms: PPO, GRPO, Reinforce++.


Key Results Overview

  • A 3B-parameter simulation LLM effectively incentivizes π_θ’s search skills at zero API cost.

  • A 7B retrieval module matches real Google Search performance; a 14B model surpasses it on benchmark QA tasks.

  • Generalizes across both base and instruction-tuned LLMs, and under diverse RL algorithms arXiv.


Implications for the ML Industry

  1. Cost-Effective RAG Training
    Organizations can now sidestep expensive search-API fees during RL-based retrieval training, democratizing advanced RAG strategies for smaller teams.

  2. Controlled Noise Injection
    The curriculum approach offers principled noise scheduling—models become robust not only to clean retrievals but also to adversarial or low-quality documents, enhancing real-world resilience.

  3. Scalable, On-Premises Solutions
    By fully simulating search behaviors, enterprises can run end-to-end RAG pipelines in-house, preserving data privacy and reducing dependency on third-party services.

  4. Extensible Framework
    ZEROSEARCH’s modular design—plugging in any simulation LLM and RL algorithm—facilitates rapid experimentation. Researchers can explore new reward functions (e.g., retrieval diversity), fine-tune custom domains, or apply to multimodal search settings.

  5. Toward Autonomous Agents
    As LLMs evolve into general-purpose agents, ZEROSEARCH paves the way for self-sufficient information gathering, where agents learn to both seek and synthesize knowledge without external calls.


Conclusion
ZEROSEARCH represents a paradigm shift in training retrieval-augmented LLMs: by simulating instead of querying, it eliminates cost barriers, stabilizes learning through controlled noise, and scales from 3B to 14B models. For the ML industry, this means more accessible, robust, and private RAG solutions—setting the stage for truly autonomous, knowledge-seeking AI agents.

9.5.25

Alibaba’s ZeroSearch: Empowering AI to Self-Train and Slash Costs by 88%

 On May 8, 2025, Alibaba Group unveiled ZeroSearch, an innovative reinforcement learning framework designed to train large language models (LLMs) in information retrieval without relying on external search engines. This approach not only enhances the efficiency of AI training but also significantly reduces associated costs.

Revolutionizing AI Training Through Simulation

Traditional AI training methods for search capabilities depend heavily on real-time interactions with search engines, leading to substantial API expenses and unpredictable data quality. ZeroSearch addresses these challenges by enabling LLMs to simulate search engine interactions within a controlled environment. The process begins with a supervised fine-tuning phase, transforming an LLM into a retrieval module capable of generating both relevant and irrelevant documents in response to queries. Subsequently, a curriculum-based rollout strategy is employed during reinforcement learning to gradually degrade the quality of generated documents, enhancing the model's ability to discern and retrieve pertinent information. 

Achieving Superior Performance at Reduced Costs

In extensive evaluations across seven question-answering datasets, ZeroSearch demonstrated performance on par with, and in some cases surpassing, models trained using actual search engines. Notably, a 14-billion-parameter retrieval module trained with ZeroSearch outperformed Google Search in specific benchmarks. Financially, the benefits are substantial; training with approximately 64,000 search queries using Google Search via SerpAPI would cost about $586.70, whereas utilizing a 14B-parameter simulation LLM on four A100 GPUs incurs only $70.80—a remarkable 88% reduction in costs. 

Implications for the AI Industry

ZeroSearch's introduction marks a significant shift in AI development paradigms. By eliminating dependence on external search engines, developers gain greater control over training data quality and reduce operational costs. This advancement democratizes access to sophisticated AI training methodologies, particularly benefiting startups and organizations with limited resources. Furthermore, the open-source release of ZeroSearch's code, datasets, and pre-trained models on platforms like GitHub and Hugging Face fosters community engagement and collaborative innovation. 

Looking Ahead

As AI continues to evolve, frameworks like ZeroSearch exemplify the potential for self-sufficient learning models that minimize external dependencies. This development not only streamlines the training process but also paves the way for more resilient and adaptable AI systems in various applications.

  Anthropic Enhances Claude Code with Support for Remote MCP Servers Anthropic has announced a significant upgrade to Claude Code , enablin...