2.8.25

Hierarchical Reasoning Model: a tiny, brain-inspired model that out-reasons giant CoT LLMs

Most frontier models “reason” by narrating token-by-token chains of thought. Sapient Intelligence’s Hierarchical Reasoning Model (HRM) argues you don’t need that narration (or billions of parameters) to solve hard puzzles. The 27M-parameter model runs two coupled recurrent modules at different timescales (a slow H-module for abstract planning and a fast L-module for detailed computation) to perform deep latent reasoning in a single forward pass. Trained from scratch with no pretraining and no CoT supervision, HRM hits standout scores across inductive-reasoning and search-heavy tasks.
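
To make the two-timescale loop concrete, here is a minimal PyTorch sketch of the nested recurrence. It uses GRU cells as stand-ins for the paper’s recurrent blocks; the module names, state shapes, and cycle counts are illustrative assumptions, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class TwoTimescaleCore(nn.Module):
    """Nested recurrence: a fast L-module iterating inside a slow H-module."""
    def __init__(self, dim: int, n_cycles: int = 4, l_steps: int = 8):
        super().__init__()
        self.n_cycles, self.l_steps = n_cycles, l_steps
        self.l_cell = nn.GRUCell(2 * dim, dim)  # fast, detailed computation
        self.h_cell = nn.GRUCell(dim, dim)      # slow, abstract planning

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_h = torch.zeros_like(x)  # slow planning state
        z_l = torch.zeros_like(x)  # fast computation state
        for _ in range(self.n_cycles):
            # L iterates toward a local equilibrium, conditioned on the plan.
            for _ in range(self.l_steps):
                z_l = self.l_cell(torch.cat([x, z_h], dim=-1), z_l)
            # H updates once per cycle, giving L fresh context next cycle.
            z_h = self.h_cell(z_l, z_h)
        return z_h  # feed to an output head / decoder
```

The point of the nesting is that effective computational depth scales with n_cycles × l_steps, not with how many layers you stack.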

Why it works: depth without the usual pain

HRM’s core trick is hierarchical convergence: the fast L-module iterates to a local equilibrium, then the slow H-module updates once and “resets” context for the next refinement cycle, stacking many effective computation steps without the recurrence prematurely collapsing to a single fixed point. To train it efficiently, the authors derive a one-step gradient approximation that avoids backpropagation-through-time, cutting memory from O(T) to O(1) per sequence.
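
In code, that approximation amounts to running the recurrence graph-free and attaching gradients only to the final update. A hedged sketch of the pattern, reusing the GRU stand-ins above (the paper’s exact approximation may differ in detail):

```python
import torch
import torch.nn as nn

def one_step_grad_forward(l_cell: nn.GRUCell, h_cell: nn.GRUCell,
                          x: torch.Tensor,
                          n_cycles: int = 4, l_steps: int = 8) -> torch.Tensor:
    z_h, z_l = torch.zeros_like(x), torch.zeros_like(x)
    with torch.no_grad():
        # Everything except the final L/H update runs without a graph,
        # so activation memory stays O(1) instead of O(T) as in full BPTT.
        for c in range(n_cycles):
            last_cycle = (c == n_cycles - 1)
            steps = l_steps - 1 if last_cycle else l_steps
            for _ in range(steps):
                z_l = l_cell(torch.cat([x, z_h], dim=-1), z_l)
            if not last_cycle:
                z_h = h_cell(z_l, z_h)
    # Only the final updates carry gradients: the one-step approximation.
    z_l = l_cell(torch.cat([x, z_h], dim=-1), z_l)
    z_h = h_cell(z_l, z_h)
    return z_h
```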

There’s also an adaptive halting head (a small Q-learner) that decides whether to stop or continue another reasoning segment, enabling “think-more-if-needed” behavior at inference time—useful when a problem demands longer planning. 
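
Schematically, the halting head can be read as a two-action Q-function over the slow state. A toy inference-time sketch; `segment_fn` is a hypothetical wrapper for resuming the core from the previous segment’s state, and the Q-learning training signal the paper describes is omitted:

```python
import torch
import torch.nn as nn

class HaltingHead(nn.Module):
    """Two-action Q-head over the slow state: scores halt vs. continue."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, 2)  # outputs [Q_halt, Q_continue]

    def forward(self, z_h: torch.Tensor) -> torch.Tensor:
        return self.q(z_h)

def reason_with_halting(segment_fn, halt_head: HaltingHead,
                        x: torch.Tensor, max_segments: int = 8) -> torch.Tensor:
    """Run reasoning segments until the Q-head prefers halting.

    segment_fn(x, z_h) -> new z_h is a hypothetical wrapper around the
    core that continues from the previous segment's slow state.
    """
    z_h = torch.zeros_like(x)
    for _ in range(max_segments):
        z_h = segment_fn(x, z_h)
        q_halt, q_cont = halt_head(z_h).unbind(dim=-1)
        # Simplification: halt only once the whole batch agrees.
        if bool((q_halt > q_cont).all()):
            break
    return z_h
```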

The receipts

With roughly 1,000 training examples per task, HRM posts numbers that would make far larger CoT systems blush:

  • ARC-AGI-1: 40.3%, beating o3-mini-high (34.5), Claude-3.7 8K (21.2), and DeepSeek-R1 (21.0); a Transformer trained directly on input-output pairs manages 15.8.

  • ARC-AGI-2: HRM reaches 5.0% where strong CoT baselines hover near zero, consistent with the benchmark’s step-up in compositional difficulty.

  • Sudoku-Extreme (9×9, 1k examples): 55.0% accuracy; on Sudoku-Extreme-Full (3.83M puzzles), HRM approaches near-perfect accuracy.

  • Maze-Hard (30×30, 1k examples): 74.5% optimal-path success, where CoT baselines flatline.

What this means for builders

  • Latent > linguistic reasoning: HRM shows you can get deep, backtracking-style reasoning inside hidden states—no verbose CoT, fewer tokens, lower latency. 

  • Tiny models, big compute depth: By recycling computation through nested recurrent cycles, HRM attains “depth” that standard Transformers don’t, even when you stack layers. 

  • Knob for “thinking time”: The halting mechanism effectively scales compute at inference—handy for tasks like Sudoku where a few extra cycles pay off more than on ARC-style transformations. 

Dataset & evaluation notes

Sudoku-Extreme combines easier Kaggle-style puzzles with community “forum-hard” sets; difficulty is measured by average backtracks (≈22 per puzzle on the new subset—much tougher than common datasets). Maze-Hard requires optimal 30×30 paths; ARC-AGI results follow the official challenge protocols with standard augmentations. 
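
For intuition about that difficulty metric, a backtrack count can be produced with a plain depth-first solver. This toy grader (an illustration, not the authors’ tooling) counts each undone placement as one backtrack:

```python
def solve_count_backtracks(grid):
    """grid: 9x9 list of lists, 0 = empty. Returns (solved, backtracks)."""
    backtracks = 0

    def ok(r, c, v):
        # Value v must not repeat in the row, column, or 3x3 box.
        if any(grid[r][j] == v for j in range(9)): return False
        if any(grid[i][c] == v for i in range(9)): return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != v
                   for i in range(3) for j in range(3))

    def dfs():
        nonlocal backtracks
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        if ok(r, c, v):
                            grid[r][c] = v
                            if dfs():
                                return True
                            grid[r][c] = 0
                            backtracks += 1  # one undo = one backtrack
                    return False  # dead end: no value fits this cell
        return True  # no empty cells left: solved

    return dfs(), backtracks
```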

If the open-sourced code (the paper links a GitHub repo) spurs replication, expect a wave of BPTT-free recurrent designs and “reason-more-on-demand” controls to show up in lightweight agents, especially where token budgets and latency matter more than eloquent chains of thought.

Paper link: arXiv:2506.21734 (PDF)
