
14.7.25

MetaStone-S1 shows how to scale ‘thinking time’ instead of parameter count

 For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point. 

A reflective twist on the policy-reward combo

Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding hundreds of millions of parameters and extra inference latency. MetaStone-S1 bundles both roles into one backbone and sprinkles in two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53 M parameters—99 % smaller than conventional PRMs.
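
The paper describes this shared-backbone design only at a high level, but the idea is easy to picture: one transformer trunk feeding both a language-model head and a lightweight step-scoring head. The sketch below is a rough PyTorch illustration under that assumption; the class, layer sizes, and head shapes are invented for clarity and are not taken from the MetaStone-S1 code.

```python
import torch
import torch.nn as nn

class ReflectivePolicyWithSPRM(nn.Module):
    """Toy sketch: one shared backbone, two heads (next-token policy + step-level reward).

    Hypothetical names and shapes for illustration; the real MetaStone-S1
    implementation differs.
    """
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                                  # shared transformer trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size)         # next-token prediction head
        self.step_score_head = nn.Sequential(                     # small step-level scoring head
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)                         # (batch, seq, hidden) states
        logits = self.lm_head(hidden)                             # token distribution
        step_scores = self.step_score_head(hidden).squeeze(-1)    # per-position reward score
        return logits, step_scores
```

Because the scoring head is only a couple of small linear layers riding on the shared hidden states, its parameter count stays tiny compared with a stand-alone PRM, which is where the 99 % reduction the authors report comes from.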

Dial-a-brain at test time

Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:

Mode   | Avg. reasoning steps | Typical use
Low    | ~8 steps             | latency-sensitive chat
Medium | ~24 steps            | balanced Q&A
High   | up to 64 steps       | Olympiad math, code generation

The team coins this knob Test-Time Scaling (TTS) and backs it with an empirical scaling law linking “thinking FLOPs” to quality gains. 
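
The paper's exact selection procedure is more involved, but the basic loop can be sketched as sampling several chains of thought under a mode-dependent step budget and keeping the one the internal reward head scores highest. In the sketch below, `generate_chain` and `score_chain` are hypothetical stand-ins for the model's next-token head and its SPRM scoring head.

```python
# Minimal sketch of mode-dependent test-time scaling with an internal scorer.
# `generate_chain` and `score_chain` are hypothetical helpers, not MetaStone-S1 APIs.
STEP_BUDGET = {"low": 8, "medium": 24, "high": 64}

def reason_with_tts(model, prompt: str, mode: str = "medium", num_candidates: int = 4) -> str:
    max_steps = STEP_BUDGET[mode]
    candidates = []
    for _ in range(num_candidates):
        chain = generate_chain(model, prompt, max_steps=max_steps)  # sample one chain of thought
        score = score_chain(model, prompt, chain)                   # aggregate step-level scores
        candidates.append((score, chain))
    # Keep the chain the internal reward head rates highest.
    return max(candidates, key=lambda c: c[0])[1]
```

Spending more candidates and more steps per candidate is what the paper's "thinking FLOPs" scaling law measures: quality keeps improving, with diminishing returns, as that budget grows.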

Benchmark bump without parameter bloat

Running in high mode, the 32 B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-EVAL—despite using roughly half the weights. 

Why it matters

  • Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.

  • User-controllable latency. Products can trade speed for depth without model swaps.

  • Open playground. All code, checkpoints (1.5 B→32 B) and the reasoning-length scheduler are on GitHub under an Apache-2.0 license. 

MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect TTS dials and reflective reward heads to surface quickly in next-gen open-source stacks.

Paper link: arXiv 2507.01951 (PDF)

18.6.25

MiniMax-M1: A Breakthrough Open-Source LLM with a 1 Million Token Context & Cost-Efficient Reinforcement Learning

 MiniMax, a Chinese AI startup renowned for its Hailuo video model, has unveiled MiniMax-M1, a landmark open-source language model released under the Apache 2.0 license. Designed for long-context reasoning and agentic tool use, M1 supports a 1 million token input and 80,000 token output window—vastly exceeding most commercial LLMs and enabling it to process large documents, contracts, or codebases in one go.

Built on a hybrid Mixture-of-Experts (MoE) architecture with lightning attention, MiniMax-M1 optimizes performance and cost. The model spans 456 billion parameters, with 45.9 billion activated per token. Its training employed a custom CISPO reinforcement learning algorithm, resulting in substantial efficiency gains. Remarkably, M1 was trained for just $534,700, compared to over $5–6 million spent by DeepSeek‑R1 or over $100 million for GPT‑4.
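
The paper credits much of that cost saving to CISPO's handling of importance sampling: rather than dropping the update for clipped tokens, it clips and detaches the importance weight itself and keeps every token's gradient. The snippet below is a rough, simplified sketch of that idea, not the authors' implementation; the function name and clipping values are illustrative assumptions.

```python
import torch

def cispo_like_loss(logp_new, logp_old, advantages, eps=0.2):
    """Rough sketch of a clipped importance-sampling policy loss (CISPO-style).

    The per-token importance weight r = pi_new / pi_old is clipped and detached,
    then used to reweight a REINFORCE-style term, so no token is masked out of
    the gradient. Hypothetical simplification, not MiniMax's code.
    """
    ratio = torch.exp(logp_new - logp_old)                     # per-token importance weight
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps).detach()
    # Reweighted policy-gradient objective; negate because optimizers minimize.
    return -(clipped * advantages * logp_new).mean()
```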


⚙️ Key Architectural Innovations

  • 1M Token Context Window: Enables comprehensive reasoning across lengthy documents or multi-step workflows.

  • Hybrid MoE + Lightning Attention: Delivers high performance without excessive computational overhead.

  • CISPO RL Algorithm: Efficiently trains the model with clipped importance sampling, lowering cost and training time.

  • Dual Variants: M1-40k and M1-80k versions support variable output lengths (40K and 80K “thinking budget”).


📊 Benchmark-Topping Performance

MiniMax-M1 excels in diverse reasoning and coding benchmarks:

AIME 2024 (Math): 86.0% accuracy
LiveCodeBench (Coding): 65.0%
SWE‑bench Verified: 56.0%
TAU‑bench: 62.8%
OpenAI MRCR (4-needle): 73.4% 

These results surpass leading open-weight models such as DeepSeek‑R1 and Qwen3‑235B‑A22B and narrow the gap with top-tier commercial LLMs such as OpenAI's o3 and Google's Gemini.


🚀 Developer-Friendly & Agent-Ready

MiniMax-M1 supports structured function calling and ships with an agent-capable API that includes search, multimedia generation, speech synthesis, and voice cloning. MiniMax recommends serving the model with vLLM for efficient batching and memory handling, and it is also compatible with the standard Transformers library.
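
For readers who want to try it, a minimal vLLM serving sketch might look like the following. The Hugging Face repo id, parallelism setting, and sampling parameters are assumptions to check against MiniMax's own release notes, not verified values.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id -- confirm the exact name on MiniMax's release page.
MODEL_ID = "MiniMaxAI/MiniMax-M1-80k"

llm = LLM(model=MODEL_ID, trust_remote_code=True, tensor_parallel_size=8)
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=4096)

prompt = "Summarize the key obligations in the following contract:\n..."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Raising `max_tokens` is how you spend more of the model's 40K or 80K "thinking budget" on a single query; latency and cost grow accordingly.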

For enterprises, technical leads, and AI orchestration engineers, MiniMax-M1 provides:

  • Lower operational costs and compute footprint

  • Simplified integration into existing AI pipelines

  • Support for in-depth, long-document tasks

  • A self-hosted, secure alternative to cloud-bound models

  • Business-grade performance with full community access


🧩 Final Takeaway

MiniMax-M1 marks a milestone in open-source AI—combining extreme context length, reinforcement-learning efficiency, and high benchmark performance within a cost-effective, accessible framework. It opens new possibilities for developers, researchers, and enterprises tackling tasks requiring deep reasoning over extensive content—without the limitations or expense of closed-weight models.
