
11.9.25

Parallel-R1: Teaching LLMs to reason from multiple angles—permanently

Modern large language models (LLMs) often reason sequentially, following one thought chain at a time. Parallel thinking, in contrast, spawns multiple reasoning paths (or perspectives) and then merges their insights. Prompting tricks can induce this behavior at inference time, but they incur heavy overhead and generalize poorly. Parallel-R1, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a training-time RL framework for instilling parallel thinking as a native reasoning strategy.


What is Parallel-R1?

The key idea: don’t just prompt models to use parallel paths—train them to do so. Parallel-R1 has a progressive curriculum:

  1. Cold start (format learning via SFT) — teach the model the syntax/tags of parallel blocks (e.g. <Parallel>, <Path>...</Path>, <Summary>), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.

  2. Reinforcement learning (RL) on easy tasks, to explore the use of parallel thinking, with a reward that combines correctness with usage of the parallel structure (see the sketch after this list).

  3. RL on more difficult problems (e.g. DAPO, AMC, AIME), so that the model generalizes both its performance and its parallel-thinking style.
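
As a concrete illustration of steps 1–2, here is a minimal sketch of such a combined reward in Python. The tag layout follows the paper's <Parallel>/<Path>/<Summary> format, but the function names, answer extraction, and bonus weight are illustrative assumptions, not the paper's exact reward:

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final \\boxed{...} answer from a completion (simple illustrative extractor)."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return m.group(1).strip() if m else ""

def parallel_reward(completion: str, gold_answer: str, structure_bonus: float = 0.2) -> float:
    """Hypothetical reward: correctness plus a small bonus for a well-formed parallel block."""
    correct = float(extract_answer(completion) == gold_answer)
    pattern = r"<Parallel>.*?<Path>.*?</Path>.*?<Summary>.*?</Summary>.*?</Parallel>"
    has_parallel = re.search(pattern, completion, re.DOTALL) is not None
    return correct + (structure_bonus if has_parallel else 0.0)
```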

The architecture has two variants: a causal (structure-agnostic) version and a structured version. The structured version modifies the attention mechanism (via path-window masking and separate position encodings) so that paths stay more isolated during reasoning. But the structured variant comes with trade-offs: it helps generalization in some settings, yet is less robust under distribution shift.
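
To make path-window masking concrete, here is a minimal sketch of a block-structured attention mask, assuming each token carries a path id (0 for the shared prefix, k > 0 for path k); the shapes and conventions are illustrative assumptions, not the paper's implementation:

```python
import torch

def path_window_mask(path_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (n, n) attention mask. Each token attends causally to the
    shared prefix (path id 0) and to tokens in its own path; attention
    across different paths is blocked."""
    n = path_ids.numel()
    same_path = path_ids.unsqueeze(0) == path_ids.unsqueeze(1)   # (n, n)
    to_prefix = (path_ids == 0).unsqueeze(0).expand(n, n)        # (n, n)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return (same_path | to_prefix) & causal
```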


Results & gains

On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:

  • The “Seen” variant (causal) achieves ~48.9% average across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks. 

  • In particular, on AIME25, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly.

  • The structured (Unseen) variant also performs well under certain reward schedules; the "alternating ACC/PAR" schedule (periodically switching between rewarding correctness and rewarding parallel structure) helps balance parallel usage and performance, as sketched below.
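
A minimal sketch of what an alternating ACC/PAR reward schedule could look like; the switching rule, period, and bonus value are illustrative assumptions, not the paper's exact schedule:

```python
def alternating_reward(step: int, correct: float, has_parallel: bool,
                       period: int = 100, bonus: float = 0.2) -> float:
    """Alternate phases every `period` steps: an ACC phase rewards
    correctness only; a PAR phase adds a bonus for using the parallel
    structure (period and bonus are illustrative)."""
    in_par_phase = (step // period) % 2 == 1
    if in_par_phase:
        return correct + (bonus if has_parallel else 0.0)
    return correct
```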

Beyond numerical gains, the authors observe a behavioral shift: early in training, the model heavily uses parallel paths as an exploration tool, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for verification near the end of reasoning. This shift correlates with stronger final performance. 


Why this matters

  • Performance & efficiency trade-off: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost, since parallel paths are triggered only when needed.

  • Better than imitation: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing, but those often overfit to particular patterns. RL in Parallel-R1 helps models learn to decide when parallel paths help, not just how to mimic them.

  • Scaffolding exploration: The cold-start + easy-tasks + alternating-reward strategy functions as a scaffold, enabling RL to reach a stronger region of policy space than direct RL on hard tasks.

  • Architecture designs matter: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.


Limitations & future directions

  • The gains, though significant, still leave a clear gap to human-level performance on very hard math tasks.

  • The structured variants can struggle under domain shift; care is needed with architectural changes that assume particular path structures.

  • Triggering parallel thinking (via <Parallel> blocks) incurs some token and compute overhead, though the model learns to use it more sparingly over time.

  • There is a tension between pushing for parallel structure (which encourages exploration) and maximizing accuracy (which sometimes pushes toward fewer divergences); reward engineering here is delicate.


Bottom line: Parallel-R1 is a breakthrough toward training LLMs that think in parallel, not just deeper. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness and robustness, methods like this will likely become a standard part of the toolkit.

Paper link: arXiv 2509.07980 (PDF)

10.5.25

ZEROSEARCH: Simulating Search to Train Retrieval-Augmented LLMs at Zero API Cost

Introduction

Retrieval-Augmented Generation (RAG) has become a cornerstone for grounding large language models (LLMs) in up-to-date information. Yet existing approaches that integrate live search engines face two critical hurdles: unpredictable document quality and prohibitive API expenses during reinforcement learning (RL) training (arXiv). ZEROSEARCH, introduced by Sun et al., offers an elegant solution: train LLMs' internal "search" strategies without ever contacting a real search engine, slashing costs and stabilizing learning.


Methodology Deep Dive

1. Search Simulation via Supervised Fine-Tuning

Rather than querying Google or Bing, ZEROSEARCH first converts an LLM into a retrieval module (π_ψ) through lightweight supervised fine-tuning (SFT).

  • Data Collection: The authors collect interaction trajectories by prompting the base LLM to interact with a real search engine, labeling trajectories that end in a correct answer as "positive" and those that end in an incorrect one as "negative".

  • Prompt Design: Query–document pairs are extracted from these trajectories. The fine-tuning prompt explicitly labels whether the generated document should be useful or noisy, enabling the model to simulate both high- and low-quality retrievals on demand (Table 2; arXiv), as sketched below.
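
A minimal sketch of such a quality-conditioned prompt; the function name and wording are illustrative assumptions, not the paper's exact template:

```python
def build_simulation_prompt(query: str, useful: bool, n_docs: int = 5) -> str:
    """Condition the simulation LLM on a quality label so it can emit
    either useful or noisy documents for the same query."""
    quality = "useful" if useful else "noisy"
    return (
        f"You are a search engine. Generate {n_docs} {quality} documents "
        f"for the following query.\n"
        f"Query: {query}\n"
        f"Documents:"
    )
```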

2. Curriculum-Based Rollout Strategy

To progressively challenge the policy model (π_θ), ZEROSEARCH employs a curriculum that gradually increases the noise probability (pᵢ) of simulated documents over training steps:

p_i = p_s + \frac{b^{i/m} - 1}{b - 1} \times (p_e - p_s)
  • Parameters:

    • p_s, p_e: initial and final noise probabilities

    • i/m: fraction of completed training steps

    • b: exponential base (default 4)

  • Effect: Early training relies on mostly useful documents, allowing π_θ to learn structured reasoning. Over time, noisy retrievals dominate, forcing robust search strategies (arXiv). The sketch below implements this schedule.
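
A minimal sketch of the schedule in Python (the default p_s and p_e values here are illustrative, not the paper's):

```python
def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5,
                      base: float = 4.0) -> float:
    """Exponential curriculum: p_i = p_s + (b**(i/m) - 1) / (b - 1) * (p_e - p_s).
    Early steps stay near p_start (mostly useful documents); later steps
    approach p_end (mostly noisy documents)."""
    frac = step / total_steps
    return p_start + (base ** frac - 1.0) / (base - 1.0) * (p_end - p_start)
```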

3. Reinforcement Learning Objective

ZEROSEARCH frames the optimization as:

\max_{\pi_\theta} \; \mathbb{E}_{x,y}\Big[\, r_\phi(x, y) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \Big],

where:

  • r_φ(x, y): F1-based reward (balances precision & recall, avoiding the "reward hacking" seen with Exact Match; arXiv). A sketch follows this list.

  • π_ref: reference model (for KL-penalty regularization).

  • Compatible Algorithms: PPO, GRPO, Reinforce++.
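
A minimal sketch of a token-level F1 reward of the kind described above; tokenization and normalization details are assumptions:

```python
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers. Partial-overlap
    credit discourages the degenerate answers that can game Exact Match."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```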


Key Results Overview

  • A 3B-parameter simulation LLM effectively incentivizes π_θ’s search skills at zero API cost.

  • A 7B retrieval module matches real Google Search performance; a 14B model surpasses it on benchmark QA tasks.

  • Generalizes across both base and instruction-tuned LLMs, and under diverse RL algorithms (arXiv).


Implications for the ML Industry

  1. Cost-Effective RAG Training
    Organizations can now sidestep expensive search-API fees during RL-based retrieval training, democratizing advanced RAG strategies for smaller teams.

  2. Controlled Noise Injection
    The curriculum approach offers principled noise scheduling—models become robust not only to clean retrievals but also to adversarial or low-quality documents, enhancing real-world resilience.

  3. Scalable, On-Premises Solutions
    By fully simulating search behaviors, enterprises can run end-to-end RAG pipelines in-house, preserving data privacy and reducing dependency on third-party services.

  4. Extensible Framework
    ZEROSEARCH’s modular design—plugging in any simulation LLM and RL algorithm—facilitates rapid experimentation. Researchers can explore new reward functions (e.g., retrieval diversity), fine-tune custom domains, or apply to multimodal search settings.

  5. Toward Autonomous Agents
    As LLMs evolve into general-purpose agents, ZEROSEARCH paves the way for self-sufficient information gathering, where agents learn to both seek and synthesize knowledge without external calls.


Conclusion
ZEROSEARCH represents a paradigm shift in training retrieval-augmented LLMs: by simulating instead of querying, it eliminates cost barriers, stabilizes learning through controlled noise, and scales from 3B to 14B models. For the ML industry, this means more accessible, robust, and private RAG solutions—setting the stage for truly autonomous, knowledge-seeking AI agents.
