26.7.25

MegaScience formalizes science reasoning data—and smaller models suddenly look smarter

 Open-source LLMs can do math and code, but ask them to reason through a physics word problem or a cell-biology puzzle and they wobble. The GAIR Lab at Shanghai Jiao Tong University thinks the culprit is data, not architecture. Their new paper introduces TextbookReasoning (650 k Q&A pulled from 12 k university textbooks) and MegaScience (a 1.25 M‑sample mix of cleaned public science sets), then shows that models post‑trained on these datasets outperform their own official instruct variants—while using far shorter responses. 

The problem: bad science data, bad evals

Most “science” corpora rely on noisy web text, weak decontamination and multiple‑choice benchmarks that don’t probe true reasoning. The authors flag four pain points: unreliable benchmarks, flimsy leakage checks, low‑quality references and shallow CoT distillation. 

Two datasets, one pipeline

  • TextbookReasoning – 650 k verified questions across seven disciplines (physics → economics), built via textbook digitization, QA pair extraction, deduping, refinement and LLM‑assisted decontamination. 

  • MegaScience – 1.25 M high‑quality instances from NaturalReasoning, Nemotron‑Science and TextbookReasoning, curated with a three‑way selection scheme: response‑length, difficulty, and random sampling, plus solution annotation. 
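The paper doesn't ship pseudocode for the selection step, but the idea is simple enough to sketch. Below is a minimal Python illustration of the three-way scheme; the field names (`response_tokens`, `difficulty`) and the "keep the longest / hardest" ordering are my assumptions for illustration, not the authors' exact rules.

```python
import random

def three_way_select(records, k, seed=0):
    """Build three candidate subsets of size k from one source dataset.

    Illustrative only: field names and the 'keep the longest / hardest'
    direction are assumptions, not the paper's exact selection rules.
    """
    rng = random.Random(seed)

    # 1. Response-length selection: prefer longer (presumably more detailed) answers.
    by_length = sorted(records, key=lambda r: r["response_tokens"], reverse=True)[:k]

    # 2. Difficulty selection: prefer harder questions, e.g. as scored by an LLM judge.
    by_difficulty = sorted(records, key=lambda r: r["difficulty"], reverse=True)[:k]

    # 3. Random sampling: an unbiased baseline subset.
    at_random = rng.sample(records, min(k, len(records)))

    return {"length": by_length, "difficulty": by_difficulty, "random": at_random}

# In the paper's setup, each subset would be used to fine-tune a model and the
# best-performing strategy kept per source dataset (our reading of the pipeline).
```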

Notably, answers are short: 410 tokens (TextbookReasoning) and 721 tokens (MegaScience) on average—meaning cheaper training and inference than CoT-heavy rivals. 
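To see why that matters for cost, here's a back-of-envelope comparison using the paper's dataset sizes and average lengths; the 2,000-token "CoT-heavy" baseline is an assumed figure for illustration only.

```python
# Back-of-envelope: response-token volume for supervised fine-tuning.
# The 2,000-token CoT baseline is an assumption for comparison, not from the paper.
megascience = 1_250_000 * 721      # ~0.9B response tokens
textbook    =   650_000 * 410      # ~0.27B response tokens
cot_heavy   = 1_250_000 * 2_000    # ~2.5B tokens at an assumed 2k-token average

print(f"MegaScience:                {megascience / 1e9:.2f}B tokens")
print(f"TextbookReasoning:          {textbook / 1e9:.2f}B tokens")
print(f"Hypothetical CoT-heavy mix: {cot_heavy / 1e9:.2f}B tokens")
```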

Proof in the checkpoints

Fine‑tuning Llama3.1, Qwen2.5 and Qwen3 base models on MegaScience consistently beats their official instruct models across "general," "specific," and "math" categories. Example: Qwen3‑30B jumps from 55.66 → 61.12 average, with math rising to 89.33.

Ablations back the pipeline: drop refinement and performance collapses (58.33 % → 13.15 % overall); remove the extra CoT step and scores slide to 57.33 %. Decontamination matters too—without it, leakage inflates averages to 58.57 %. 
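The paper's decontamination is LLM-assisted; as a rough illustration of what such a leakage check guards against, here is a much simpler word-level n-gram overlap filter (a stand-in, not the authors' method) that flags training questions sharing long spans with benchmark questions.

```python
def ngrams(text, n=10):
    """Set of word-level n-grams for a simple overlap check."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(train_questions, benchmark_questions, n=10):
    """Return indices of training questions sharing any n-gram with a benchmark item.

    A crude stand-in for the paper's LLM-assisted decontamination, shown only
    to illustrate what the leakage ablation is measuring.
    """
    bench_grams = set()
    for q in benchmark_questions:
        bench_grams |= ngrams(q, n)
    return [i for i, q in enumerate(train_questions) if ngrams(q, n) & bench_grams]
```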

Why this matters

  • Science is more than math/code. The field lacked open, verifiable, long‑form reasoning sets; MegaScience fills that gap. 

  • Shorter CoT ≈ cheaper scaling. The datasets’ concise answers let bigger models benefit more from fine‑tuning—hinting at a “scaling law for data efficiency” in science domains. 

  • Open everything. The team releases the full curation pipeline, eval system, seven trained models and all datasets, inviting the community to iterate. 

If your lab is chasing AI scientists rather than chatty coders, MegaScience is a ready-made jumpstart—and a reminder that better questions and cleaner answers can beat another billion tokens of sludge.

Paper link: arXiv 2507.16812 (PDF)
