Showing posts with label parallel thinking.

11.9.25

Parallel-R1: Teaching LLMs to reason from multiple angles—permanently

 Modern large language models (LLMs) often reason sequentially—one thought chain at a time. Parallel thinking, in contrast, involves spawning multiple reasoning paths (or perspectives), then merging the insights. While prompting tricks can induce this behavior at inference, they carry heavy overhead and brittle generalization. Parallel-R1, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a training-time RL framework for instilling parallel thinking as a native reasoning strategy. 


What is Parallel-R1?

The key idea: don’t just prompt models to use parallel paths—train them to do so. Parallel-R1 has a progressive curriculum:

  1. Cold start (format learning via SFT) — teach the model the syntax/tags of parallel blocks (e.g. <Parallel>, <Path>...</Path>, <Summary>), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.

  2. Reinforcement learning (RL) on easy tasks, to explore when parallel thinking pays off, with a reward that combines answer correctness and use of the parallel structure. 

  3. RL on more difficult problems (e.g. DAPO, AMC, AIME), so the model generalizes both performance and the parallel thinking style. 
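To make the reward in step 2 concrete, here is a minimal sketch of what such a combined signal could look like. The tag names (`<Parallel>`, `<Path>`, `<Summary>`) come from the paper; the weights and the exact well-formedness check are illustrative assumptions, not the authors' implementation.

```python
import re

def parallel_reward(completion: str, answer: str, gold: str,
                    acc_weight: float = 1.0, par_bonus: float = 0.1) -> float:
    """Toy reward mixing correctness with use of the parallel format.

    Tag names follow the paper; the weighting scheme and the
    well-formedness check are illustrative assumptions.
    """
    correct = 1.0 if answer.strip() == gold.strip() else 0.0
    # Count a completion as "parallel" only if it contains a <Parallel>
    # block with at least two <Path> branches and a <Summary>.
    has_block = "<Parallel>" in completion and "</Parallel>" in completion
    n_paths = len(re.findall(r"<Path>.*?</Path>", completion, re.DOTALL))
    has_summary = "<Summary>" in completion
    used_parallel = has_block and n_paths >= 2 and has_summary
    return acc_weight * correct + (par_bonus if used_parallel else 0.0)
```

Note that the structure bonus is deliberately small relative to correctness, so the model cannot farm reward by branching without solving the problem.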

The architecture has two variants: a causal (structure-agnostic) version and a structured version. The structured version modifies the attention mechanism (via path-window masking, separate position encodings) so paths are more isolated during reasoning. But structured variants show trade-offs—good for generalization in some settings, but less robust under distribution shift.


Results & gains

On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:

  • The “Seen” variant (causal) achieves ~48.9% average across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks. 

  • In particular, on AIME’25, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly. 

  • The structured (Unseen) variant also performs well under certain reward schedules; the “alternating ACC/PAR” reward schedule (switching between rewarding correctness and parallel structure periodically) helps balance parallel usage and performance. 
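An alternating ACC/PAR schedule can be sketched in a couple of lines. The period length and this implementation are my assumptions; the paper only describes the idea of periodically switching which signal is rewarded.

```python
def reward_mode(step: int, period: int = 100) -> str:
    """Return which signal to reward at a given RL step.

    'ACC' rewards answer correctness, 'PAR' rewards use of the parallel
    structure; the two alternate every `period` steps. The period length
    is an illustrative assumption, not a number from the paper.
    """
    return "ACC" if (step // period) % 2 == 0 else "PAR"
```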

Beyond numerical gains, the authors observe a behavioral shift: early in training, the model heavily uses parallel paths as an exploration tool, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for verification near the end of reasoning. This shift correlates with stronger final performance. 


Why this matters

  • Performance & efficiency trade-off: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost, since the model learns to trigger parallel paths only when they are needed.

  • Better than imitation: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing; but those often over-fit to particular patterns. RL in Parallel-R1 helps models learn to decide when parallel paths help, not just how to mimic them.

  • Scaffolding exploration: The cold-start + easy tasks + alternating reward strategy functions as a scaffold, enabling RL to find a stronger policy space than direct RL on hard tasks.

  • Architecture designs matter: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.


Limitations & future directions

  • The gains, though significant, still leave considerable room before human-level performance on the hardest math tasks.

  • The structured variants can struggle under domain shift; architectural changes that assume a particular path structure need careful handling.

  • Triggering parallel thinking (using <Parallel> blocks) costs some token and compute overhead, though the model learns to use it more sparsely over time.

  • There’s a tension between rewarding parallel structure (which encourages exploration) and maximizing accuracy (which can push the model toward fewer divergences). Reward engineering is delicate.


Bottom line: Parallel-R1 is a breakthrough toward training LLMs that think in parallel, not just deeper. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness and robustness, methods like this will likely become a standard part of the toolkit.

Paper link: arXiv 2509.07980 (PDF)

ParaThinker: parallel minds beat longer monologues

 LLMs have ridden test-time compute—“think longer” chains of thought—but returns taper as early tokens lock models into bad trajectories. Tsinghua’s ParaThinker calls this Tunnel Vision and proposes native thought parallelism: generate several independent reasoning paths simultaneously, then fuse them into one answer. 

Instead of external voting, ParaThinker trains the model itself to branch and merge: specialized control tokens (<think i>) trigger distinct trajectories, path-specific positional embeddings keep streams separate, and a two-phase attention mask enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs. 
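The two-phase mask is the interesting bit. Here is a rough sketch of the thinking-phase mask based on my reading of that description, not ParaThinker's actual code: each path attends causally to the shared prompt and to itself, never to sibling paths.

```python
import numpy as np

def thinking_phase_mask(prompt_len: int, n_paths: int, path_len: int) -> np.ndarray:
    """Boolean attention mask for the thinking phase (True = may attend).

    Each path attends causally to the shared prompt and to its own tokens,
    never to sibling paths; the summarization phase (not shown) would let
    summary tokens attend across all paths. A sketch of my reading of the
    paper, not ParaThinker's implementation.
    """
    total = prompt_len + n_paths * path_len
    causal = np.tril(np.ones((total, total), dtype=bool))
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :prompt_len] = causal[:, :prompt_len]  # every token sees the prompt
    for p in range(n_paths):                       # each path sees only itself
        s = prompt_len + p * path_len
        e = s + path_len
        mask[s:e, s:e] = causal[s:e, s:e]
    return mask
```

During summarization one would extend this mask so the final tokens attend to every path, which is what lets the reused KV cache pay off.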

On AIME-24/25, AMC-23 and MATH-500, ParaThinker with 8 parallel paths boosts accuracy by +12.3 pts (1.5B) and +7.5 pts (7B) over sequential baselines under the same token budget, and still beats majority voting by +4.3/+2.0 pts—with only ~7.1% latency overhead. Generating up to 16 paths costs <2× single-path latency, thanks to better arithmetic intensity on GPUs. 

The takeaway: scale width, not just depth. ParaThinker shows that orchestrating compute across diverse, parallel thoughts unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub. 

Paper link: arXiv 2509.04475 (PDF)

1.8.25

Inside Gemini Deep Think: Google’s Gold-Medal Reasoning Engine with a 16-Minute Brain-Cycle

When Google DeepMind quietly flipped the switch on Gemini 2.5 Deep Think, it wasn’t just another toggle in the Gemini app. The same enhanced-reasoning mode had already notched a gold-medal-level score at the 2025 International Mathematical Olympiad (IMO)—solving five of six notoriously brutal problems and tying the human cutoff for gold. That feat put DeepMind shoulder-to-shoulder with OpenAI’s own experimental “gold-IMO” model, announced the very same week.

What makes the IMO special?

Founded in 1959, the IMO pits six pre-university prodigies from each country against six problems spanning algebra, geometry, number theory, and combinatorics. Every question is worth seven points, so 42 is perfection; a score of 35 secured this year’s gold cutoff. DeepMind’s best 2024 system managed silver, but needed more time than the four-and-a-half hours allotted to humans. In 2025, Deep Think achieved the same result within the human time window, using only plain-language prompts instead of formal proof assistants.

Under the hood: parallel minds at work

Deep Think is Gemini 2.5 Pro running in a multi-agent “parallel thinking” mode. Instead of one chain-of-thought, it spins up dozens, scores them against intermediate goals, and fuses the strongest ideas into a final answer. Google says the approach boosts benchmark scores for math, logic, and coding, at the cost of far longer inference times.
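Google hasn't published the actual algorithm, but the described branch-score-merge loop can be caricatured in a few lines. `generate`, `score`, and `merge` are caller-supplied stand-ins, not real Gemini APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def deep_think(prompt, generate, score, merge, n_branches=8, keep=3):
    """Caricature of a branch-score-merge loop in the spirit of Deep Think.

    `generate`, `score`, and `merge` are caller-supplied stand-ins for
    model calls; nothing here is Google's published implementation.
    """
    # Branch: draft several independent chains of thought in parallel.
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        drafts = list(pool.map(lambda i: generate(prompt, seed=i),
                               range(n_branches)))
    # Score: rank the drafts against intermediate goals, best first.
    ranked = sorted(drafts, key=score, reverse=True)
    # Merge: fuse the strongest candidates into one final answer.
    return merge(prompt, ranked[:keep])
```

The latency profile described in the article falls out of this shape: nothing streams until the merge step, which is why users wait on a blank screen.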

A field test from the transcript

In the YouTube walkthrough, the host pastes a 2025 IMO geometry problem into Deep Think. The clock ticks 16 minutes before the first full token arrives—but the model nails the official solution, listing the only valid values of k as 0, 1, 3. A second experiment on an AIME-25 algebra question takes 13 minutes yet again lands the correct answer (204) with detailed derivations. The lesson: breakthroughs come after a coffee break, not in real time.

Beyond math: voxel temples and half-baked Angry Birds

Deep Think’s slow-burn genius extends to generative tasks. Asked to script a colorful 3D “Sala Thai” pavilion in Three.js, the model architected a fully navigable voxel scene—complete with stylized roof eaves—on the first pass. A tougher challenge—re-creating Angry Birds in Pygame—showed its iterative potential: the first build lacked obstacles, but a follow-up prompt produced pigs, wood, glass, and workable physics. Still, each refinement added another ten-plus minutes to the wait.

When speed matters more than brilliance

Because Deep Think withholds partial streams until it has weighed all candidate thoughts, users stare at a blank screen for up to ten minutes. Google engineers admit the mode “isn’t practical for everyday coding” unless you fire a prompt and walk away—then return to review the answer or receive a push notification. For everyday tasks, plain Gemini 2.5 Pro or Flash-Lite may offer better latency-to-value ratios.

How to try it—and what’s next

Deep Think is already live for Gemini Ultra subscribers inside the consumer app, and Google says an API endpoint will roll out in the “next few weeks” to AI Studio and Vertex AI. Once that lands, developers can add a “deep-think” flag to long-form reasoning jobs—think automated theorem proving, contract analysis, or multi-step coding agents.


Bottom line: Gemini Deep Think proves massive parallel reflection can push public models into Olympiad territory, but it also shows there’s no free lunch—each extra IQ point costs time and compute. The next frontier won’t just be smarter LLMs; it will be orchestration layers that decide when a 16-minute think-tank is worth the wait and when a quick, cheaper model will do.



22.7.25

Gemini “Deep Think” Hits Gold-Medal Performance at the International Mathematical Olympiad

 

From Silver to Gold in Twelve Months

Last year, DeepMind’s AlphaGeometry and AlphaProof systems collectively solved four of six IMO problems, earning a silver-medal equivalent. In July 2025 the research team leap-frogged that result: an advanced version of Gemini running in “Deep Think” mode solved five of six tasks for 35 points—crossing the 2025 gold-medal threshold and setting a new AI milestone.

International coordinators graded Gemini’s written solutions using the same rubric applied to student competitors. According to IMO President Gregor Dolinar, the proofs were “clear, precise, and, in several cases, easy to follow”.


What Makes Deep Think Different?

  • Parallel Thinking: explores multiple proof avenues simultaneously, then merges the strongest ideas. Impact: avoids dead-end, single-thread chains of thought.

  • Reinforcement-Learning Fine-Tune: trains on curated theorem-proving and problem-solving data with reward signals for conciseness and rigor. Impact: raises success rate on multi-step reasoning challenges.

  • High-Quality Solution Corpus: ingests expertly written IMO proofs plus heuristic “tips & tricks.” Impact: gives the model stylistic and structural templates for clearer presentation.

These upgrades let Gemini run longer “scratch-pads” internally while staying within a feasible compute budget—no multi-day cluster runs were required, unlike earlier systems.

Benchmark Significance

  • 35 / 42 points → comparable to a top-25-percent human gold medalist.

  • Perfect scores on five problems; only one combinatorics task eluded the model.

  • Order-of-magnitude speed-up vs. AlphaGeometry 2 + AlphaProof, which needed days of inference in 2024.

While specialized theorem solvers have mastered narrow domains, Gemini Deep Think is a general LLM—capable of chat, code, and multimodal tasks—now showing elite mathematical reasoning.


Broader Implications

  1. Curriculum Design for AI
    Gemini’s success underscores the value of domain-targeted reinforcement learning on top of large-scale pre-training.

  2. Parallel Thinking as a New Primitive
    Instead of a single “chain of thought,” future models may default to branch-and-merge reasoning, akin to how human teams brainstorm proofs.

  3. Human–AI Collaboration
    DeepMind notes the technique could become a “proof assistant” for mathematicians—surfacing lemmas or counter-examples at gold-medal quality within minutes.

  4. Educational Outreach
    Publishing the solutions provides a free study resource for aspiring IMO contestants and teachers, potentially leveling the global playing field.


Limitations & Next Steps

  • Interpretability: Despite clearer written proofs, the internal decision tree remains opaque—researchers are now probing why certain branches survive the merge.

  • Generalization: Performance on under-represented areas (e.g., functional equations) still lags; future training will widen topic coverage.

  • Trust & Verification: Formal proof checkers like Lean are being integrated to machine-verify each Gemini output before publication.
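For readers who haven't seen one, a formally checkable proof is just source code that a checker either compiles or rejects, with no partial credit and no reliance on prose. A trivial Lean 4 example (mine, not from DeepMind):

```lean
-- The checker either accepts this proof term or fails to compile;
-- persuasive-sounding but wrong reasoning simply doesn't build.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```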

DeepMind plans to open selected Deep Think capabilities via its Gemini API later this year, with safeguards to prevent misuse in academic competitions.


Key Takeaway

Gemini Deep Think’s gold-medal performance doesn’t just raise the bar for AI mathematics—it redefines what general-purpose language models can achieve when armed with structured parallel reasoning and tailored RL training. The achievement brings researchers a step closer to AI systems that can tackle longstanding open problems and act as partner mathematicians rather than mere calculators.
