
28.8.25

Vision-SR1: a self-rewarding recipe that makes VLMs “see” before they “think”

 Most reinforcement-learning recipes for vision-language models (VLMs) grade only the final answer—so models learn to lean on text priors and hallucinate what isn’t in the image. Vision-SR1 flips that: it decomposes reasoning into visual perception → language reasoning, and rewards the model for producing a self-contained visual description that alone suffices to solve the task. No external teacher, no human labels—just the model validating its own perception. 

How the self-reward works

Vision-SR1 runs two rollouts of the same policy per example:

  1. Standard pass: image + question → visual perception + CoT + answer → reward on answer (and format).

  2. Self-reward pass: question + the model’s visual perception (no image) → CoT + answer → reward if correct, signalling that the perception captured what mattered. Rewards from both passes are combined under GRPO for stable updates (see the sketch after this list).
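
To make the mechanics concrete, here is a minimal Python sketch of the two-pass reward, assuming a policy callable that returns a (perception, CoT, answer) triple. The helper names, reward weights, and format check are illustrative assumptions, not the released implementation.

import statistics

# Per-example reward for one rollout. `policy` is an assumed callable that
# returns (perception, cot, answer); weights and the format check are illustrative.
def vision_sr1_reward(policy, image, question, gold_answer,
                      w_answer=1.0, w_format=0.1, w_self=1.0):
    # Pass 1: standard rollout on image + question.
    perception, cot, answer = policy(question, image=image)
    r_answer = float(answer == gold_answer)                # final-answer reward
    r_format = float(bool(perception and cot and answer))  # crude format check

    # Pass 2: same policy, but only the question plus its own perception
    # (no image). A correct answer here means the perception was sufficient.
    _, _, answer_from_perception = policy(question, image=None, context=perception)
    r_self = float(answer_from_perception == gold_answer)

    return w_answer * r_answer + w_format * r_format + w_self * r_self

# GRPO then turns a group of such rewards (several rollouts of the same
# prompt) into group-relative advantages.
def grpo_advantages(rewards):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

The key point is that the second pass reuses the same policy, so a high self-reward certifies that the model’s own description, not the image or text priors, carried the information needed to answer.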

Training setup

The team builds a 47K-example RL set spanning math (30.5%), science/commonsense (30%), and general visual reasoning (39.5%). A 9K “cold-start” SFT subset teaches the output format before the short RL run (1 epoch). Backbones: Qwen-2.5-VL-3B and 7B. Code is public on GitHub. 
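
Summarized as a config sketch (the field names are illustrative; only the values come from the post):

RECIPE = {
    "rl_dataset_size": 47_000,
    "rl_data_mix": {                      # fractions of the 47K RL set
        "math": 0.305,
        "science_commonsense": 0.300,
        "general_visual_reasoning": 0.395,
    },
    "cold_start_sft_size": 9_000,         # teaches the see-think-answer format
    "rl_epochs": 1,
    "backbones": ["Qwen-2.5-VL-3B", "Qwen-2.5-VL-7B"],
}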

Benchmarks: fewer shortcuts, fewer hallucinations

On a broad suite (MMMU, MMMU-Pro, MM-Vet, RealWorldQA, VisNumBench, MathVerse, MATH-Vision, HallusionBench), Vision-SR1 consistently edges out strong Vision-R1 baselines trained on the same 47K examples. With the 7B backbone, Vision-SR1 averages 58.8 vs. 57.4 for Vision-R1; at 3B it's 52.9 vs. 50.6.

The paper also introduces Language Shortcut Rate (LSR)—how often a model answers correctly with an insufficient perception. SR1 lowers LSR across datasets, indicating less “answering from priors.” 
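
A rough sketch of how such a metric could be computed, assuming "insufficient perception" is judged the same way as the self-reward pass (the perception alone does not yield the right answer) and normalizing over correctly answered examples; the paper may define the denominator differently.

# records: one dict per evaluation example with two booleans:
#   "full_correct" - answer was right given image + perception + CoT
#   "perc_correct" - answer was right from the perception alone (no image)
def language_shortcut_rate(records):
    correct = [r for r in records if r["full_correct"]]
    shortcuts = sum(1 for r in correct if not r["perc_correct"])
    return shortcuts / len(correct) if correct else 0.0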

Not just vision: text-only reasoning stays solid

On textual suites (MMLU-Pro, SuperGPQA, GSM8K, MATH-500), SR1 keeps or improves accuracy relative to Vision-R1—evidence that strengthening perception doesn’t degrade language-side reasoning. 

Why it matters

  • Balances see vs. think. Adding a perception reward raises dependence on pixels, not just prompts—curbing hallucinations without expensive human labels or external teachers. 

  • Simple to adopt. The “see-think-answer” format and two-pass self-reward bolt onto standard GRPO pipelines. 

  • Open and reproducible. Data recipe, SFT cold-start, and code are released for quick replication. 

Paper link: arXiv 2508.19652 (PDF)
