
10.9.25

Language Self-Play: training an LLM without adding data actually works

 LLMs keep getting better by eating more data—until the data well runs dry. A new paper from Meta Superintelligence Labs proposes Language Self-Play (LSP): turn training into a game where a single model plays both sides—a Challenger that generates tougher prompts and a Solver that answers them—so the system improves without ingesting new datasets. In tests on AlpacaEval using Llama-3.2-3B-Instruct, LSP matches a strong data-driven RL baseline and even pushes beyond it when used as a follow-on stage. 

How it works: one model, two roles

LSP frames training as a minimax game: Challenger tries to minimize reward by making hard queries; Solver tries to maximize reward by answering them. Crucially, both roles are instantiated by the same LLM via a role-selecting prompt (e.g., a special challenger prompt), avoiding the instability and memory overhead of training an external adversary. KL regularization keeps the Challenger from devolving into nonsense prompts. 
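A minimal sketch of the single-model, two-role setup (the prompt wording and the `policy.generate` interface are assumptions for illustration; the paper only requires that a special challenger prompt switches the same weights into query-generation mode):

```python
# One policy plays both roles via a role-selecting prompt.
# Prompt text below is illustrative, not the paper's wording.

CHALLENGER_PROMPT = (
    "You are the Challenger. Write a single, genuinely difficult "
    "instruction that probes the assistant's weaknesses."
)

def challenger_queries(policy, n: int) -> list[str]:
    """Challenger role: same backbone, adversarial prompt."""
    return [policy.generate(CHALLENGER_PROMPT) for _ in range(n)]

def solver_answers(policy, query: str, g: int) -> list[str]:
    """Solver role: ordinary instruction following on the generated query."""
    return [policy.generate(query) for _ in range(g)]
```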

Under the hood, LSP borrows group-relative baselines from GRPO: Challenger generates N queries, Solver samples G answers per query, and the average reward defines both a per-answer advantage (for Solver) and a “difficulty” signal (for Challenger). A practical variant, LSP-Zero, runs as a pure zero-sum game; the full LSP adds a quality self-reward scored by a reference model to prevent reward-hacking (e.g., answering everything in Python). 
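A rough sketch of those group-relative signals, assuming a reward matrix of shape (N queries × G answers); the exact normalization and the sign convention for the Challenger are assumptions, not the paper's implementation:

```python
import numpy as np

def group_relative_signals(rewards: np.ndarray):
    """Group-relative baselines in the spirit of GRPO, as LSP uses them.

    rewards: array of shape (N, G) -- reward-model scores for G sampled
    Solver answers to each of N Challenger queries.
    """
    query_mean = rewards.mean(axis=1, keepdims=True)        # per-query baseline
    query_std = rewards.std(axis=1, keepdims=True) + 1e-6

    # Solver advantage: how much each answer beats its group's average.
    solver_adv = (rewards - query_mean) / query_std

    # Challenger "difficulty" signal: queries with a low average reward are
    # hard, so the minimizing player prefers them. LSP-Zero keeps this purely
    # zero-sum; full LSP adds a quality self-reward on top to prevent
    # degenerate queries.
    difficulty = -(query_mean.squeeze(1) - query_mean.mean())

    return solver_adv, difficulty
```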

Results: data-free ≈ data-driven—and sometimes better

Using GPT-4o as judge on AlpacaEval, the team compares models trained from the same base:

  • From base (no data): Overall win rates vs. the base model—GRPO (with data) 40.9%, LSP-Zero 40.1%, LSP 40.6%. Translation: self-play without any RL data keeps pace with standard RL. 

  • From RL (as a next stage): Starting from the GRPO model and continuing with self-play, LSP lifts overall win rate to 43.1%, with large gains on Vicuna-style conversational tasks (28.7% → 46.3%). 

The setup uses Skywork-Reward-V2-Llama-3.2-3B as the reward model; the authors note that LSP (with the added quality reward) avoids the degradation seen with LSP-Zero in some splits, and acknowledge dips on “chatbot-y” Koala prompts—likely because Challenger skews toward structured, orderly instructions. 

Why this matters

  • Data bottleneck relief. If you can translate “more practice data” into a self-generated curriculum, you can keep improving without chasing new corpora. 

  • A clean follow-on stage. Even after data-based RL, self-play adds headroom—useful when further high-quality preference data is scarce. 

  • Single-model simplicity. One backbone serves both roles, avoiding adversary models and the instability they bring. 

Caveats and open questions

Self-play can degenerate without the quality self-reward; the reward model caps the ceiling (a weak reward model means a weak training signal); and Challenger diversity remains an open knob: the generated queries skew toward the structured style seen in the examples. Still, the authors argue the method should work even better on tasks with verifiable rewards (e.g., code tests), not just preferences.

If your roadmap hits a data wall, Language Self-Play is a compelling new leg in the post-training pipeline: spin up a Challenger inside your own model, let it stress-test itself, and learn—no fresh dataset required.

Paper link: arXiv 2509.07414 (PDF)

28.8.25

Vision-SR1: a self-rewarding recipe that makes VLMs “see” before they “think”

 Most reinforcement-learning recipes for vision-language models (VLMs) grade only the final answer—so models learn to lean on text priors and hallucinate what isn’t in the image. Vision-SR1 flips that: it decomposes reasoning into visual perception → language reasoning, and rewards the model for producing a self-contained visual description that alone suffices to solve the task. No external teacher, no human labels—just the model validating its own perception. 

How the self-reward works

Vision-SR1 runs two rollouts of the same policy per example:

  1. Standard pass: image + question → visual perception + CoT + answer → reward on answer (and format).

  2. Self-reward pass: question + the model’s visual perception (no image) → CoT + answer → reward if correct, signalling that the perception captured what mattered. Rewards are combined under GRPO for stable updates. 
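A rough sketch of one such training example under this scheme (the `policy.solve` interface, exact-match rewards, and the way the two signals are mixed are assumptions for illustration, not the paper's implementation):

```python
def vision_sr1_rollout(policy, image, question, answer_key):
    """One example, two passes; reward mixing here is illustrative."""
    # Pass 1: image + question -> visual perception + CoT + answer.
    perception, answer = policy.solve(image=image, question=question)
    r_answer = float(answer == answer_key)          # outcome/format reward

    # Pass 2 (self-reward): question + the model's own perception, no image.
    # If this blind pass still answers correctly, the perception was
    # self-contained enough to solve the task.
    _, blind_answer = policy.solve(image=None, question=question,
                                   context=perception)
    r_perception = float(blind_answer == answer_key)

    # Both signals feed a standard GRPO update.
    return r_answer + r_perception
```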

Training setup

The team builds a 47K-example RL set spanning math (30.5%), science/commonsense (30%), and general visual reasoning (39.5%). A 9K “cold-start” SFT subset teaches the output format before the short RL run (1 epoch). Backbones: Qwen-2.5-VL-3B and 7B. Code is public on GitHub. 

Benchmarks: fewer shortcuts, fewer hallucinations

On a broad suite—MMMU, MMMU-Pro, MM-Vet, RealWorldQA, VisNumBench, MathVerse, MATH-Vision, HallusionBench—Vision-SR1 consistently edges strong Vision-R1 baselines trained on the same 47K. With the 7B backbone, Vision-SR1 averages 58.8 vs. 57.4 for Vision-R1; at 3B it's 52.9 vs. 50.6.

The paper also introduces Language Shortcut Rate (LSR)—how often a model answers correctly with an insufficient perception. SR1 lowers LSR across datasets, indicating less “answering from priors.” 
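One way to operationalize the metric, assuming per-example flags for answer correctness and perception sufficiency (field names and the choice of denominator are assumptions; the paper's exact definition may differ):

```python
def language_shortcut_rate(records: list[dict]) -> float:
    """Fraction of correct answers reached despite an insufficient
    visual perception; lower means fewer language shortcuts."""
    correct = [r for r in records if r["answer_correct"]]
    shortcuts = [r for r in correct if not r["perception_sufficient"]]
    return len(shortcuts) / max(len(correct), 1)
```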

Not just vision: text-only reasoning stays solid

On textual suites (MMLU-Pro, SuperGPQA, GSM8K, MATH-500), SR1 keeps or improves accuracy relative to Vision-R1—evidence that strengthening perception doesn’t degrade language-side reasoning. 

Why it matters

  • Balances see vs. think. Adding a perception reward raises dependence on pixels, not just prompts—curbing hallucinations without expensive human labels or external teachers. 

  • Simple to adopt. The “see-think-answer” format and two-pass self-reward bolt onto standard GRPO pipelines. 

  • Open and reproducible. Data recipe, SFT cold-start, and code are released for quick replication. 

Paper link: arXiv 2508.19652 (PDF)
