Modern RLVR (reinforcement learning with verifiable rewards) post-training treats every output token the same, even though factual snippets (“Euler’s number is …”) and logical connectors (“therefore …”) serve wildly different purposes. Enter Archer, short for Adaptive Entropy-Aware RLVR, a new technique that groups tokens by entropy and then trains each group under its own constraints:
- Knowledge tokens (low entropy): strong KL regularization + tight PPO clip to preserve facts.
- Reasoning tokens (high entropy): weaker KL + looser clip to encourage exploration and richer chains of thought.
Crucially, the update is synchronous: no gradient masking or asynchronous passes that risk breaking sentence-level dependencies.
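To make the mechanism concrete, here is a minimal PyTorch sketch of a dual-constraint token loss in this spirit. It is an illustration, not the authors’ implementation: the entropy quantile, KL weights, and clip ranges below are assumed placeholder values, and the function name `archer_token_loss` is ours.

```python
import torch

def archer_token_loss(
    logprobs,       # (B, T) log-probs of sampled tokens under the current policy
    old_logprobs,   # (B, T) log-probs under the rollout (behavior) policy
    ref_logprobs,   # (B, T) log-probs under the frozen reference policy
    entropies,      # (B, T) per-token policy entropy measured at rollout time
    advantages,     # (B, T) token-level advantage estimates
    mask,           # (B, T) 1 for response tokens, 0 for padding
    entropy_quantile=0.8,  # assumed: tokens above this batch quantile are "reasoning"
    kl_w=(0.01, 0.001),    # assumed (knowledge, reasoning) KL weights
    clip=(0.1, 0.3),       # assumed (knowledge, reasoning) PPO clip ranges
):
    # Group tokens by entropy: high entropy -> reasoning, low entropy -> knowledge.
    thresh = torch.quantile(entropies[mask.bool()], entropy_quantile)
    is_reasoning = (entropies > thresh).float()

    # Each token gets the KL weight and clip range of its group.
    kl_weight = is_reasoning * kl_w[1] + (1.0 - is_reasoning) * kl_w[0]
    eps = is_reasoning * clip[1] + (1.0 - is_reasoning) * clip[0]

    # Clipped PPO surrogate with a per-token clip range.
    ratio = torch.exp(logprobs - old_logprobs)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    )

    # Per-token KL penalty toward the reference model (simple k1 estimator).
    kl = logprobs - ref_logprobs

    # Every token contributes to the same synchronous update; only the
    # strength of its constraints differs -- no gradient masking.
    loss = -(surrogate - kl_weight * kl)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

The key design point the sketch preserves: all tokens are optimized together in one pass, and the knowledge/reasoning split changes only how strongly each token is constrained.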
Fewer GPUs, bigger gains
On a single H800 slice, Archer fine-tunes a 1.5B DeepSeek-R1-distilled model in one stage of 520 steps (about 1,900 GPU-hours), yet it leaps past multi-round rivals that burned 3–8× the compute.
| Benchmark | Base (DAPO) | Archer | Δ (pp) |
|---|---|---|---|
| AIME 2024 Pass@1 | 23.5 % | 30.1 % | +6.6 |
| AIME 2025 Pass@1 | 27.6 % | 32.8 % | +5.2 |
| LiveCodeBench v5 Avg@8 | 26.0 % | 29.4 % | +3.4 |
| LiveCodeBench v6 Avg@16 | 27.6 % | 30.2 % | +2.6 |
Why it works
Analysis shows the dual-token policy stabilizes entropy and slashes n-gram repetition, avoiding both the collapse that sets in when KL is too weak and the under-training that sets in when it is too strong. The chosen KL weight (0.001) and asymmetric clip thresholds kept first-token latency low and reasoning diversity high.
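Repetition diagnostics of this kind are cheap to track during training. Below is a small, plain-Python helper for one common variant, the fraction of duplicated n-grams in a generation; the exact metric the authors use may differ.

```python
from collections import Counter

def ngram_repetition_rate(token_ids, n=4):
    """Fraction of n-grams in `token_ids` that repeat an earlier one.

    0.0 means every n-gram is unique; values near 1.0 signal degenerate loops.
    """
    if len(token_ids) < n:
        return 0.0
    grams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(grams)
    # Count each occurrence beyond the first as a repetition.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(grams)
```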
Why it matters
- Smarter, not bigger: Archer turns a lightweight 1.5B checkpoint into a math-and-code contender without billions of extra tokens or exotic reward models.
- Template-free recipe: any PPO-style RLVR loop can drop in the entropy classifier and dual constraints (see the usage sketch after this list).
- Open & ready: code and configs are live on GitHub (wizard-III/ArcherCodeR), so teams can replicate the gains on their own domains today.
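As a usage sketch, wiring the loss from the first code block into an existing PPO-style step could look like the following; the batch field names and the `optimizer` are assumptions about your own trainer, not part of ArcherCodeR’s API.

```python
# One hypothetical training step; field names are illustrative.
loss = archer_token_loss(
    logprobs=batch["logprobs"],
    old_logprobs=batch["old_logprobs"],
    ref_logprobs=batch["ref_logprobs"],
    entropies=batch["entropies"],
    advantages=batch["advantages"],
    mask=batch["response_mask"],
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```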
As LLM builders hunt for cheaper paths to robust reasoning, Archer’s “treat knowledge gently, push reasoning hard” mantra may become standard practice, especially for edge-sized models that can’t afford brute-force scaling.
Paper link: arXiv 2507.15778 (PDF)