22.7.25

Archer shows “smart” RL beats brute force for small-scale reasoning models

 Modern RLVR post-training treats every output token the same, even though factual snippets (“Euler’s number is …”) and logical connectors (“therefore …”) serve wildly different purposes. Enter Archer, short for Adaptive Entropy-Aware RLVR, a new technique that groups tokens by entropy and then trains them under dual constraints:

  • Knowledge tokens (low entropy): strong KL regularization + tight PPO clip to preserve facts.

  • Reasoning tokens (high entropy): weaker KL + looser clip to encourage exploration and richer chains of thought. 

Crucially, the update is synchronous—no gradient masking or asynchronous passes that risk breaking sentence-level dependencies.
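The paper's exact objective isn't reproduced here, but a minimal PyTorch sketch of the idea could look like the following: per-token clip ranges and KL weights are selected by an entropy mask, and every token still contributes to one synchronous update. All threshold and coefficient values below are illustrative assumptions, not the paper's settings.

```python
import torch

def dual_constraint_ppo_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                             entropies, entropy_threshold,
                             clip_knowledge=0.1, clip_reasoning=0.3,
                             kl_knowledge=1e-3, kl_reasoning=1e-4):
    """Token-level PPO objective with entropy-dependent clip ranges and KL weights.

    All tensors have shape (batch, seq_len). The thresholds and coefficients
    are placeholders for illustration only.
    """
    # High-entropy tokens are treated as "reasoning", low-entropy as "knowledge".
    is_reasoning = (entropies > entropy_threshold).float()

    # Per-token clip range: looser for reasoning tokens (more exploration),
    # tighter for knowledge tokens (preserve facts).
    clip = is_reasoning * clip_reasoning + (1 - is_reasoning) * clip_knowledge
    # Per-token KL weight: stronger pull toward the reference model on knowledge tokens.
    kl_coef = is_reasoning * kl_reasoning + (1 - is_reasoning) * kl_knowledge

    # Standard clipped policy-gradient term, evaluated per token.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    pg_loss = -torch.minimum(unclipped, clipped)

    # Simple per-token KL penalty against the frozen reference policy.
    kl_penalty = kl_coef * (logprobs - ref_logprobs)

    # One synchronous update over all tokens: neither group is masked out.
    return (pg_loss + kl_penalty).mean()
```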


Fewer GPUs, bigger gains

On a single H800 slice, Archer fine-tunes a 1.5B DeepSeek-R1-distilled model in one stage of 520 steps and roughly 1,900 GPU-hours, yet it leaps past multi-round rivals that burned 3–8× the compute.

Benchmark                    Base (DAPO)   Archer    Δ
AIME 2024 (Pass@1)           23.5 %        30.1 %    +6.6
AIME 2025 (Pass@1)           27.6 %        32.8 %    +5.2
LiveCodeBench v5 (Avg@8)     26.0 %        29.4 %    +3.4
LiveCodeBench v6 (Avg@16)    27.6 %        30.2 %    +2.6

The math-tuned variant also edges out specialist models like FastCuRL-1.5B and DeepScaleR-1.5B, while the code-tuned edition tops DeepCoder and Nemotron in head-to-head comparisons. 

Why it works

Analysis shows the dual-token policy stabilizes entropy and slashes n-gram repetition, avoiding both the collapse that sets in when KL is too weak and the under-training that occurs when it is too strong. The best-performing KL weight (0.001) and asymmetric clip thresholds kept first-token latency low and reasoning diversity high.
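One way to watch for the repetition collapse described above is to track n-gram duplication in sampled completions during training. A small monitoring helper, purely illustrative and not taken from the paper's code, could be:

```python
from collections import Counter

def ngram_repetition_rate(token_ids, n=4):
    """Fraction of n-grams in a completion that duplicate an earlier n-gram.

    A rate climbing toward 1.0 over training is a sign of entropy collapse;
    a rate near 0.0 is normal for diverse outputs. Illustrative monitor only.
    """
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

# Example: a heavily looping sequence scores high, a non-repeating one scores zero.
print(ngram_repetition_rate([1, 2, 3, 4] * 10))   # ~0.89
print(ngram_repetition_rate(list(range(40))))     # 0.0
```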


Why it matters

  • Smarter, not bigger: Archer turns a lightweight 1.5 B checkpoint into a math-and-code contender without billions of extra tokens or exotic reward models.

  • Template-free recipe: Any PPO-style RLVR loop can drop in the entropy classifier and dual constraints (a sketch of the classifier step follows this list).

  • Open & ready: Code and configs are live on GitHub (wizard-III/ArcherCodeR), so teams can replicate the gains on their own domains today. 
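The entropy classifier itself is simple to bolt on: compute each token's predictive entropy from the policy logits and split at a threshold. Below is a minimal PyTorch sketch; the per-sequence percentile split and the 20 % figure are assumptions for illustration, not the paper's actual rule.

```python
import torch
import torch.nn.functional as F

def split_tokens_by_entropy(logits, top_fraction=0.2):
    """Return a boolean mask marking high-entropy ("reasoning") tokens.

    logits: (batch, seq_len, vocab) policy logits for the sampled response.
    top_fraction: share of tokens labeled as reasoning tokens; the 20 % value
    is an illustrative assumption.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the next-token distribution at each position.
    entropies = -(probs * log_probs).sum(dim=-1)          # (batch, seq_len)

    # Per-sequence percentile threshold: the highest-entropy tokens within
    # each response are treated as reasoning tokens.
    threshold = torch.quantile(entropies, 1.0 - top_fraction, dim=-1, keepdim=True)
    return entropies >= threshold
```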

As LLM builders hunt for cheaper paths to robust reasoning, Archer’s “treat knowledge gently, push reasoning hard” mantra may become standard practice—especially for edge-sized models that can’t afford brute-force scaling.

Paper link: arXiv 2507.15778 (PDF)
