Most reinforcement-learning recipes for vision-language models (VLMs) grade only the final answer—so models learn to lean on text priors and hallucinate what isn’t in the image. Vision-SR1 flips that: it decomposes reasoning into visual perception → language reasoning, and rewards the model for producing a self-contained visual description that alone suffices to solve the task. No external teacher, no human labels—just the model validating its own perception.
How the self-reward works
Vision-SR1 runs two rollouts of the same policy per example:
- Standard pass: image + question → visual perception + CoT + answer → reward on answer (and format).
- Self-reward pass: question + the model’s visual perception (no image) → CoT + answer → reward if correct, signalling that the perception captured what mattered. Rewards are combined under GRPO for stable updates (a reward sketch follows this list).
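A minimal sketch of how the two-pass reward could be scored, assuming a tagged see-think-answer output format. The tag names, exact-match answer check, and reward weights are illustrative assumptions, not the paper's exact implementation.

```python
import re

def parse_see_think_answer(text):
    """Split a rollout into (perception, reasoning, answer) fields.
    The <perception>/<think>/<answer> tag names are illustrative assumptions."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
        return m.group(1).strip() if m else ""
    return grab("perception"), grab("think"), grab("answer")

def sr1_reward(rollout_with_image, rollout_from_perception_only, gold_answer,
               w_answer=1.0, w_format=0.5, w_self=1.0):
    """Combine answer, format, and self-rewards into one scalar per example.
    rollout_with_image: pass-1 output (image + question).
    rollout_from_perception_only: pass-2 output (question + pass-1 perception, no image).
    Weights are placeholders, not values from the paper."""
    _, _, ans1 = parse_see_think_answer(rollout_with_image)
    _, _, ans2 = parse_see_think_answer(rollout_from_perception_only)
    r_answer = float(ans1 == gold_answer)        # pass 1: final-answer correctness
    r_format = float(all(f"<{t}>" in rollout_with_image
                         for t in ("perception", "think", "answer")))
    r_self = float(ans2 == gold_answer)          # pass 2: was the perception self-contained?
    return w_answer * r_answer + w_format * r_format + w_self * r_self
```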
Training setup
The team builds a 47K-example RL set spanning math (30.5%), science/commonsense (30%), and general visual reasoning (39.5%). A 9K “cold-start” SFT subset teaches the output format before the short RL run (1 epoch). Backbones: Qwen-2.5-VL-3B and 7B. Code is public on GitHub.
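For concreteness, the stated proportions translate to roughly the following split of the 47K examples; these counts are back-of-the-envelope approximations from the percentages above, not figures reported by the paper.

```python
RL_SET_SIZE = 47_000
DATA_MIX = {                      # proportions reported above
    "math": 0.305,
    "science_commonsense": 0.300,
    "general_visual_reasoning": 0.395,
}
approx_counts = {k: round(p * RL_SET_SIZE) for k, p in DATA_MIX.items()}
# -> {'math': 14335, 'science_commonsense': 14100, 'general_visual_reasoning': 18565}
```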
Benchmarks: fewer shortcuts, fewer hallucinations
On a broad suite—MMMU, MMMU-Pro, MM-Vet, RealWorldQA, VisNumBench, MathVerse, MATH-Vision, HallusionBench—Vision-SR1 consistently edges strong Vision-R1 baselines trained on the same 47K. With the 7B backbone, Vision-SR1 averages 58.8 vs 57.4 for Vision-R1; at 3B it’s 52.9 vs 50.6.
The paper also introduces Language Shortcut Rate (LSR)—how often a model answers correctly with an insufficient perception. SR1 lowers LSR across datasets, indicating less “answering from priors.”
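A hedged sketch of how such a shortcut rate could be tallied from per-example evaluation records: an example counts as a shortcut when the image-conditioned answer is correct but the perception-only pass fails. The record field names and the choice of denominator are assumptions, not the paper's exact definition.

```python
def language_shortcut_rate(records):
    """records: iterable of dicts with booleans
    'correct_with_image' and 'correct_from_perception_only'."""
    shortcuts = sum(r["correct_with_image"] and not r["correct_from_perception_only"]
                    for r in records)
    answered_correctly = sum(r["correct_with_image"] for r in records)
    return shortcuts / answered_correctly if answered_correctly else 0.0
```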
Not just vision: text-only reasoning stays solid
On textual suites (MMLU-Pro, SuperGPQA, GSM8K, MATH-500), SR1 keeps or improves accuracy relative to Vision-R1—evidence that strengthening perception doesn’t degrade language-side reasoning.
Why it matters
- Balances see vs. think. Adding a perception reward raises dependence on pixels, not just prompts—curbing hallucinations without expensive human labels or external teachers.
- Simple to adopt. The “see-think-answer” format and two-pass self-reward bolt onto standard GRPO pipelines (see the GRPO sketch after this list).
- Open and reproducible. Data recipe, SFT cold-start, and code are released for quick replication.
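To illustrate the “bolt onto GRPO” point, here is a minimal sketch of the group-relative advantage step that the combined reward would feed into; the group size and normalization follow the standard GRPO recipe rather than anything the paper specifies beyond “combined under GRPO.”

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's combined reward
    against the mean/std of its prompt's rollout group."""
    mu = statistics.fmean(group_rewards)
    sd = statistics.pstdev(group_rewards)
    return [(r - mu) / (sd + eps) for r in group_rewards]

# e.g. combined rewards (answer + format + self) for 8 rollouts of one prompt
advantages = grpo_advantages([2.5, 1.0, 2.5, 0.0, 1.5, 2.5, 1.0, 0.5])
```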
Paper link: arXiv 2508.19652 (PDF)