LLMs have leaned heavily on test-time compute ("think longer" chains of thought), but the returns taper off as early tokens lock the model into a bad trajectory. Tsinghua's ParaThinker calls this Tunnel Vision and proposes native thought parallelism: generate several independent reasoning paths simultaneously, then fuse them into one answer.
Instead of external voting, ParaThinker trains the model itself to branch and merge: specialized control tokens (<think i>) trigger distinct trajectories, path-specific positional embeddings keep the streams separate, and a two-phase attention mask enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs.
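To make the two-phase scheme concrete, here is a minimal sketch of how such an attention mask could be built. This is not the authors' implementation; the layout (prompt, then K thought streams, then a summary segment) and the function name are my own assumptions.

```python
import torch

def parathinker_style_mask(P: int, K: int, T: int, S: int) -> torch.Tensor:
    """Boolean mask (True = may attend) over the flattened layout
    [prompt | stream_1 | ... | stream_K | summary]."""
    L = P + K * T + S
    allowed = torch.zeros(L, L, dtype=torch.bool)

    # Every position may look back at the shared prompt (whose KV cache is reused).
    allowed[:, :P] = True

    # Phase 1 (thinking): each thought stream attends only within itself,
    # so the K trajectories stay independent of one another.
    for k in range(K):
        s, e = P + k * T, P + (k + 1) * T
        allowed[s:e, s:e] = True

    # Phase 2 (summarization): summary tokens integrate the prompt,
    # all K streams, and earlier summary tokens.
    allowed[P + K * T:, :] = True

    # Keep standard causal ordering on top of the block structure.
    causal = torch.ones(L, L).tril().bool()
    return allowed & causal

if __name__ == "__main__":
    m = parathinker_style_mask(P=2, K=3, T=2, S=2)
    print(m.int())  # rows = queries, cols = keys; thought streams are block-diagonal
```

In this toy layout the thought streams form block-diagonal regions that cannot see each other, while the final summary rows can attend everywhere, mirroring the think-then-integrate split described above.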
On AIME-24/25, AMC-23 and MATH-500, ParaThinker with 8 parallel paths boosts accuracy by +12.3 pts (1.5B) and +7.5 pts (7B) over sequential baselines under the same token budget, and still beats majority voting by +4.3/+2.0 pts—with only ~7.1% latency overhead. Generating up to 16 paths costs <2× single-path latency, thanks to better arithmetic intensity on GPUs.
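A rough way to see why extra paths are nearly free during decoding: each decode step must stream the model weights from GPU memory once regardless of how many sequences are in the batch, so batching paths raises arithmetic intensity without adding much memory traffic. The roofline-style estimate below is my own simplification (it ignores KV-cache reads and attention FLOPs), and the hardware numbers are illustrative assumptions, not figures from the paper.

```python
def decode_step_time(batch: int,
                     params: float = 7e9,        # model size (assumed 7B)
                     bytes_per_param: float = 2, # fp16 weights
                     mem_bw: float = 2.0e12,     # assumed HBM bandwidth, bytes/s
                     peak_flops: float = 3.0e14  # assumed peak FLOP/s
                     ) -> float:
    """Roofline-style lower bound on one decode step: the larger of the time
    to stream the weights once and the time to do the matmuls for `batch` tokens."""
    t_memory = params * bytes_per_param / mem_bw  # batch-independent weight traffic
    t_compute = batch * 2 * params / peak_flops   # ~2 FLOPs per parameter per token
    return max(t_memory, t_compute)

if __name__ == "__main__":
    t1, t16 = decode_step_time(1), decode_step_time(16)
    print(f"16 paths vs 1 path per decode step: {t16 / t1:.2f}x")
    # Under these assumptions the step stays memory-bound, so 16 paths cost
    # about the same as 1; the ignored batch-dependent work (KV-cache reads,
    # attention) is why the reported figure is <2x rather than ~1x.
```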
The takeaway: scale width, not just depth. ParaThinker shows that orchestrating compute across diverse, parallel thoughts unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub.
Paper link: arXiv 2509.04475 (PDF)