
14.7.25

MetaStone-S1 shows how to scale ‘thinking time’ instead of parameter count

 For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point. 

A reflective twist on the policy-reward combo

Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding hundreds of millions of parameters and extra inference latency. MetaStone-S1 bundles both roles into one backbone and sprinkles in two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53 M parameters—99 % smaller than conventional PRMs.
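
The paper describes this shared-backbone design only at a high level, but the idea is easy to picture: one transformer trunk feeding both a language-model head and a lightweight step-scoring head. The sketch below is a rough PyTorch illustration under that assumption; the class, layer sizes, and head shapes are invented for clarity and are not taken from the MetaStone-S1 code.

```python
import torch
import torch.nn as nn

class ReflectivePolicyWithSPRM(nn.Module):
    """Toy sketch: one shared backbone, two heads (next-token policy + step-level reward).

    Hypothetical names and shapes for illustration; the real MetaStone-S1
    implementation differs.
    """
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                                  # shared transformer trunk
        self.lm_head = nn.Linear(hidden_size, vocab_size)         # next-token prediction head
        self.step_score_head = nn.Sequential(                     # small step-level scoring head
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)                         # (batch, seq, hidden) states
        logits = self.lm_head(hidden)                             # token distribution
        step_scores = self.step_score_head(hidden).squeeze(-1)    # per-position reward score
        return logits, step_scores
```

Because the scoring head is only a couple of small linear layers riding on the shared hidden states, its parameter count stays tiny compared with a stand-alone PRM, which is where the 99 % reduction the authors report comes from.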

Dial-a-brain at test time

Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:

Mode   | Avg. reasoning steps | Typical use
Low    | ~8 steps             | latency-sensitive chat
Medium | ~24 steps            | balanced Q&A
High   | up to 64 steps       | Olympiad math, code generation

The team coins this knob Test-Time Scaling (TTS) and backs it with an empirical scaling law linking “thinking FLOPs” to quality gains. 
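
The paper's exact selection procedure is more involved, but the basic loop can be sketched as sampling several chains of thought under a mode-dependent step budget and keeping the one the internal reward head scores highest. In the sketch below, `generate_chain` and `score_chain` are hypothetical stand-ins for the model's next-token head and its SPRM scoring head.

```python
# Minimal sketch of mode-dependent test-time scaling with an internal scorer.
# `generate_chain` and `score_chain` are hypothetical helpers, not MetaStone-S1 APIs.
STEP_BUDGET = {"low": 8, "medium": 24, "high": 64}

def reason_with_tts(model, prompt: str, mode: str = "medium", num_candidates: int = 4) -> str:
    max_steps = STEP_BUDGET[mode]
    candidates = []
    for _ in range(num_candidates):
        chain = generate_chain(model, prompt, max_steps=max_steps)  # sample one chain of thought
        score = score_chain(model, prompt, chain)                   # aggregate step-level scores
        candidates.append((score, chain))
    # Keep the chain the internal reward head rates highest.
    return max(candidates, key=lambda c: c[0])[1]
```

Spending more candidates and more steps per candidate is what the paper's "thinking FLOPs" scaling law measures: quality keeps improving, with diminishing returns, as that budget grows.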

Benchmark bump without parameter bloat

Running in high mode, the 32 B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-EVAL—despite using roughly half the weights. 

Why it matters

  • Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.

  • User-controllable latency. Products can trade speed for depth without model swaps.

  • Open playground. All code, checkpoints (1.5 B→32 B) and the reasoning-length scheduler are on GitHub under an Apache-2.0 license. 

MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect TTS dials and reflective reward heads to surface quickly in next-gen open-source stacks.

Paper link: arXiv 2507.01951 (PDF)

18.6.25

MiniMax-M1: A Breakthrough Open-Source LLM with a 1 Million Token Context & Cost-Efficient Reinforcement Learning

 MiniMax, a Chinese AI startup renowned for its Hailuo video model, has unveiled MiniMax-M1, a landmark open-source language model released under the Apache 2.0 license. Designed for long-context reasoning and agentic tool use, M1 supports a 1 million token input and 80,000 token output window—vastly exceeding most commercial LLMs and enabling it to process large documents, contracts, or codebases in one go.

Built on a hybrid Mixture-of-Experts (MoE) architecture with lightning attention, MiniMax-M1 optimizes performance and cost. The model spans 456 billion parameters, with 45.9 billion activated per token. Its training employed a custom CISPO reinforcement learning algorithm, resulting in substantial efficiency gains. Remarkably, M1 was trained for just $534,700, compared to over $5–6 million spent by DeepSeek‑R1 or over $100 million for GPT‑4.
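
The paper credits much of that cost saving to CISPO's handling of importance sampling: rather than dropping the update for clipped tokens, it clips and detaches the importance weight itself and keeps every token's gradient. The snippet below is a rough, simplified sketch of that idea, not the authors' implementation; the function name and clipping values are illustrative assumptions.

```python
import torch

def cispo_like_loss(logp_new, logp_old, advantages, eps=0.2):
    """Rough sketch of a clipped importance-sampling policy loss (CISPO-style).

    The per-token importance weight r = pi_new / pi_old is clipped and detached,
    then used to reweight a REINFORCE-style term, so no token is masked out of
    the gradient. Hypothetical simplification, not MiniMax's code.
    """
    ratio = torch.exp(logp_new - logp_old)                     # per-token importance weight
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps).detach()
    # Reweighted policy-gradient objective; negate because optimizers minimize.
    return -(clipped * advantages * logp_new).mean()
```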


⚙️ Key Architectural Innovations

  • 1M Token Context Window: Enables comprehensive reasoning across lengthy documents or multi-step workflows.

  • Hybrid MoE + Lightning Attention: Delivers high performance without excessive computational overhead.

  • CISPO RL Algorithm: Efficiently trains the model with clipped importance sampling, lowering cost and training time.

  • Dual Variants: M1-40k and M1-80k versions support variable output lengths (40K and 80K “thinking budget”).


📊 Benchmark-Topping Performance

MiniMax-M1 excels in diverse reasoning and coding benchmarks:

AIME 2024 (Math): 86.0% accuracy
LiveCodeBench (Coding): 65.0%
SWE‑bench Verified: 56.0%
TAU‑bench: 62.8%
OpenAI MRCR (4-needle): 73.4% 

These results surpass leading open-weight models such as DeepSeek‑R1 and Qwen3‑235B‑A22B and narrow the gap with top-tier commercial LLMs such as OpenAI's o3 and Google's Gemini.


🚀 Developer-Friendly & Agent-Ready

MiniMax-M1 supports structured function calling and ships with an agent-capable API that includes search, multimedia generation, speech synthesis, and voice cloning. MiniMax recommends serving the model with vLLM for efficient batching and memory handling, and it is also compatible with the standard Transformers library.
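
For readers who want to try it, a minimal vLLM serving sketch might look like the following. The Hugging Face repo id, parallelism setting, and sampling parameters are assumptions to check against MiniMax's own release notes, not verified values.

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face repo id -- confirm the exact name on MiniMax's release page.
MODEL_ID = "MiniMaxAI/MiniMax-M1-80k"

llm = LLM(model=MODEL_ID, trust_remote_code=True, tensor_parallel_size=8)
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=4096)

prompt = "Summarize the key obligations in the following contract:\n..."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

Raising `max_tokens` is how you spend more of the model's 40K or 80K "thinking budget" on a single query; latency and cost grow accordingly.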

For enterprises, technical leads, and AI orchestration engineers, MiniMax-M1 provides:

  • Lower operational costs and compute footprint

  • Simplified integration into existing AI pipelines

  • Support for in-depth, long-document tasks

  • A self-hosted, secure alternative to cloud-bound models

  • Business-grade performance with full community access


🧩 Final Takeaway

MiniMax-M1 marks a milestone in open-source AI—combining extreme context length, reinforcement-learning efficiency, and high benchmark performance within a cost-effective, accessible framework. It opens new possibilities for developers, researchers, and enterprises tackling tasks requiring deep reasoning over extensive content—without the limitations or expense of closed-weight models.
