For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point.
A reflective twist on the policy-reward combo
Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding a second model’s worth of parameters and latency. MetaStone-S1 bundles both roles into one backbone and sprinkles in two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53 M parameters, roughly 99 % smaller than conventional PRMs.
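To make the shared-backbone idea concrete, here is a minimal PyTorch-style sketch: one trunk feeding a standard language-model head and a small step-scoring head. The class name, head shapes, and the assumption that the backbone maps token ids to last hidden states are illustrative choices, not the paper's implementation.

```python
# Illustrative sketch only -- names, sizes and the backbone interface are assumptions,
# not the authors' code.
import torch
import torch.nn as nn

class ReflectivePolicy(nn.Module):
    """One shared trunk, two task heads: next-token prediction and step-level scoring."""

    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                                        # shared transformer trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)   # policy head
        self.score_head = nn.Sequential(                                # lightweight reward head
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids: torch.Tensor):
        # Assumes the backbone maps token ids to hidden states of shape (batch, seq, hidden).
        hidden = self.backbone(input_ids)
        logits = self.lm_head(hidden)                                      # next-token logits
        step_scores = torch.sigmoid(self.score_head(hidden)).squeeze(-1)   # per-position score in [0, 1]
        return logits, step_scores
```

Because the scoring head sits on top of hidden states the policy already computes, step scores come almost for free at inference time, which is what makes the 53 M-parameter figure possible.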
Dial-a-brain at test time
Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:
| Mode | Avg. reasoning steps | Typical use |
| --- | --- | --- |
| Low | ~8 steps | latency-sensitive chat |
| Medium | ~24 steps | balanced Q&A |
| High | up to 64 steps | Olympiad math, code generation |
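One way to picture the dial, as a rough sketch rather than the paper's actual scheduler: sample a few candidate reasoning traces within the mode's step budget and keep the one the built-in scorer rates highest. The generate_steps and score_steps helpers below are hypothetical stand-ins, and the budgets simply echo the table.

```python
# Sketch of mode-controlled test-time scaling; generate_steps / score_steps are
# hypothetical stand-ins, and the per-mode budgets echo the table above.
from typing import Callable, List

STEP_BUDGET = {"low": 8, "medium": 24, "high": 64}

def reason(prompt: str,
           generate_steps: Callable[[str, int], List[str]],   # policy: prompt, budget -> reasoning steps
           score_steps: Callable[[List[str]], List[float]],   # scoring head: steps -> per-step scores
           mode: str = "medium",
           num_candidates: int = 4) -> List[str]:
    """Sample several chains of thought within the mode's step budget and
    return the one whose step-level scores are highest on average."""
    budget = STEP_BUDGET[mode]
    candidates = [generate_steps(prompt, budget) for _ in range(num_candidates)]
    return max(candidates,
               key=lambda steps: sum(score_steps(steps)) / max(len(steps), 1))
```

Averaging per-step scores is just one aggregation rule; the PRM literature also uses the minimum or the product of step scores, and which rule MetaStone-S1 adopts is a detail left to the paper.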
Benchmark bump without parameter bloat
Running in high mode, the 32 B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-EVAL, despite using roughly half the weights.
Why it matters
- Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.
- User-controllable latency. Products can trade speed for depth without model swaps.
- Open playground. All code, checkpoints (1.5 B→32 B) and the reasoning-length scheduler are on GitHub under an Apache 2.0 license.
MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect test-time-scaling (TTS) dials and reflective reward heads to surface quickly in next-gen open-source stacks.
Paper link: arXiv 2507.01951 (PDF)