
2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
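To make the dial concrete, here is a minimal sketch of best-of-k selection, assuming a Hugging Face-style generate() API and a hypothetical score_head module attached to the shared backbone. The real SPRM aggregates per-step scores; this stand-in simply scores the final hidden state.

```python
# Minimal sketch of best-of-k trajectory selection with a shared scoring head.
# Assumptions (not from the paper): an HF-style generate() API and a
# hypothetical `score_head` module reading the backbone's hidden states.
import torch

@torch.no_grad()
def best_of_k(model, tokenizer, score_head, prompt, k=8, max_new_tokens=2048):
    """Sample k chains of thought, score each with the built-in head, keep the best."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    candidates, scores = [], []
    for _ in range(k):  # k = 2/8/32 maps to low/medium/high reasoning effort
        out = model.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens,
                             output_hidden_states=True, return_dict_in_generate=True)
        candidates.append(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
        # Score the trajectory from the same backbone's hidden states; the real
        # SPRM aggregates per-step scores, here we use the last state as a stand-in.
        h = out.hidden_states[-1][-1]        # last decoding step, last layer
        scores.append(score_head(h.mean(dim=1)).item())
    return candidates[int(torch.tensor(scores).argmax())]
```

Because the scorer shares the backbone, the extra cost per candidate is essentially one lightweight head call, not a second model's forward pass.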

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium), with the high mode leading math by a sizable margin. Example slice (Pass@1):

Benchmark       MetaStone-S1-32B-high   o3-mini-medium
AIME’24         85.2                    79.6
AIME’25         73.6                    74.8
LiveCodeBench   64.2                    67.4
C-Eval          89.7                    75.9

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals alone, sidestepping costly, noisy process annotations (see the sketch after this list). 

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 
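For the label-free training mentioned above, here is a hedged sketch of the core idea: per-step scores are supervised only by the trajectory's verified final outcome. The head width, step segmentation, and plain BCE loss are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch of label-free verifier training: every reasoning step inherits
# the trajectory's outcome reward as its target, so no human step labels are
# needed. Head size and loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

hidden_size = 4096                             # assumed backbone width
score_head = torch.nn.Linear(hidden_size, 1)   # the paper's ~53M head is richer

def sprm_loss(step_hidden: torch.Tensor, outcome: float) -> torch.Tensor:
    """step_hidden: (num_steps, hidden) backbone states at step boundaries.
    outcome: 1.0 if the chain's final answer verified correct, else 0.0."""
    logits = score_head(step_hidden).squeeze(-1)   # one scalar score per step
    targets = torch.full_like(logits, outcome)     # every step inherits the outcome
    return F.binary_cross_entropy_with_logits(logits, targets)
```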

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

22.7.25

Archer shows “smart” RL beats brute force for small-scale reasoning models

 Modern RLVR post-training treats every output token the same, even though factual snippets (“Euler’s number is …”) and logical connectors (“therefore …”) serve wildly different purposes. Enter Archer, short for Adaptive Entropy-Aware RLVR, a new technique that groups tokens by entropy and then trains them under dual constraints:

  • Knowledge tokens (low entropy): strong KL regularization + tight PPO clip to preserve facts.

  • Reasoning tokens (high entropy): weaker KL + looser clip to encourage exploration and richer chains of thought. 

Crucially, the update is synchronous—no gradient masking or asynchronous passes that risk breaking sentence-level dependencies.
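To make the dual-constraint idea concrete, here is a minimal PPO-style loss sketch. The entropy threshold, clip ranges, and the weaker reasoning-token KL weight are illustrative assumptions (only the 0.001 KL weight is quoted in the post), not the paper's exact recipe.

```python
# Sketch of Archer-style dual constraints: split tokens by entropy, then apply
# group-specific clip ranges and KL weights in one synchronous update.
import torch

def archer_loss(logp_new, logp_old, logp_ref, advantages, entropy,
                clip_knowledge=0.2, clip_reasoning=0.28,   # assumed asymmetric clips
                kl_knowledge=1e-3, kl_reasoning=1e-4):     # 0.001 is from the post
    """All inputs are per-token tensors of shape (T,)."""
    ratio = (logp_new - logp_old).exp()
    # High-entropy tokens are treated as "reasoning" tokens (assumed 80/20 split).
    reasoning = entropy > entropy.quantile(0.8)
    eps = torch.full_like(ratio, clip_knowledge)
    eps[reasoning] = clip_reasoning            # looser clip -> more exploration
    kl_w = torch.full_like(ratio, kl_knowledge)
    kl_w[reasoning] = kl_reasoning             # weaker KL -> less anchoring to facts
    # Standard PPO clipped surrogate, with group-specific clip ranges.
    pg = -torch.min(ratio * advantages,
                    ratio.clamp(1 - eps, 1 + eps) * advantages)
    # Simple per-token KL estimate to the reference policy, weighted per group.
    kl = logp_new - logp_ref
    # One synchronous update over all tokens -- no gradient masking.
    return (pg + kl_w * kl).mean()
```

The key design choice is that both token groups flow through the same backward pass, so sentence-level dependencies between facts and the connectives that chain them stay intact.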


Fewer GPUs, bigger gains

On a single H800 slice, Archer fine-tunes a 1.5B DeepSeek-R1-distilled model in one stage (520 steps, 1,900 GPU-hours), yet leaps past multi-round rivals that burned 3–8× the compute. 

Benchmark                    Base (DAPO)   Archer    Δ
AIME 2024 (Pass@1)           23.5 %        30.1 %    +6.6
AIME 2025 (Pass@1)           27.6 %        32.8 %    +5.2
LiveCodeBench v5 (Avg@8)     26.0 %        29.4 %    +3.4
LiveCodeBench v6 (Avg@16)    27.6 %        30.2 %    +2.6

The math-tuned variant also edges out specialist models like FastCuRL-1.5B and DeepScaleR-1.5B, while the code-tuned edition tops DeepCoder and Nemotron in head-to-head comparisons. 

Why it works

Analysis shows the dual-token policy stabilizes entropy and slashes n-gram repetition, avoiding collapse when the KL is too weak and under-training when it is too strong. An optimal KL weight (0.001) and asymmetric clip thresholds kept first-token latency low and reasoning diversity high. 


Why it matters

  • Smarter, not bigger: Archer turns a lightweight 1.5 B checkpoint into a math-and-code contender without billions of extra tokens or exotic reward models.

  • Template-free recipe: Any PPO-style RLVR loop can drop in the entropy classifier and dual constraints.

  • Open & ready: Code and configs are live on GitHub (wizard-III/ArcherCodeR), so teams can replicate the gains on their own domains today. 

As LLM builders hunt for cheaper paths to robust reasoning, Archer’s “treat knowledge gently, push reasoning hard” mantra may become standard practice—especially for edge-sized models that can’t afford brute-force scaling.

Paper link: arXiv 2507.15778 (PDF)

30.5.25

DeepSeek R1‑0528: The Open‑Source Challenger That Rivals GPT‑4o and Gemini 2.5 Pro

 Chinese startup DeepSeek has just released R1‑0528, a major update to its flagship reasoning model, positioning it as an affordable yet powerful open‑source alternative to OpenAI’s o3 and Google’s Gemini 2.5 Pro.

The new release, published on Hugging Face under the permissive MIT License, brings a host of enhancements to math, science, business, and coding reasoning—all while reinforcing its competitive edge.



🚀 What’s New in R1‑0528

  • Stronger Reasoning:
    On the AIME 2025 benchmark, accuracy surged from 70% to an impressive 87.5%, thanks to longer reasoning chains (an average of 23k tokens vs. 12k before). Code generation also jumped, with LiveCodeBench scores rising from 63.5% to 73.3%, and performance on the challenging “Humanity’s Last Exam” doubled.

  • Developer-Friendly Features:
    R1‑0528 now supports JSON output and function calling, streamlining integration into developer pipelines and automation workflows (see the sketch after this list).

  • New Model Variant:
    A distilled version—R1‑0528‑Qwen3‑8B—brings lightweight performance that's still on par with larger models in open benchmarks like AIME 2024.
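A quick sketch of the two developer features, using DeepSeek's OpenAI-compatible API. The endpoint and model name follow DeepSeek's public docs, but flag support for the reasoner model is an assumption to verify, and the get_weather tool is hypothetical.

```python
# Sketch of JSON output and function calling against DeepSeek's
# OpenAI-compatible API. Endpoint, model name, and flag support are
# assumptions from DeepSeek's docs -- check current docs before use.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# 1) JSON output: ask for a machine-readable answer.
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    response_format={"type": "json_object"},
    messages=[{"role": "user",
               "content": 'Return {"answer": <int>} for 17 * 23. Respond in JSON.'}],
)
print(resp.choices[0].message.content)

# 2) Function calling: let the model request a (hypothetical) tool invocation.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    },
}]
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```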

🏆 Why This Matters

DeepSeek continues to challenge the perception that high performance requires closed-source models and massive budgets. R1‑0528 delivers competitive strength on par with expensive proprietary systems, but under an MIT license and at significantly lower cost: R1’s API costs just $0.14 per 1M tokens (peak), with local runtime options detailed on GitHub.

This open-access approach puts serious pressure on dominant U.S. models and fosters global collaboration—developers worldwide can use, modify, and deploy R1‑0528 freely.


🌍 Open-Source Renaissance in AI

Since its initial R1 model launch in January, DeepSeek has quickly become a key player in the global AI landscape. R1‑0528 maintains the open-source ethos and stakes its claim as a champion of community-driven innovation in areas where cost and licensing are bottlenecks.


🗣️ Community Buzz

Feedback from enthusiasts is bullish: voices from Reddit’s LocalLLaMA community noted that “DeepSeek is now almost on par with OpenAI’s o3 High model on LiveCodeBench! Huge win for opensource!”

Analysts also see this release as a strategic “Sputnik moment” that could disrupt AI dominance—similar to earlier 2025 reports on DeepSeek’s initial release.


✅ Final Verdict

DeepSeek R1‑0528 marks a significant milestone in open-source AI: powerful reasoning, developer utility, and community support—all while costing a fraction of proprietary counterparts. As a truly accessible yet competitive model, it nudges the AI ecosystem toward openness and transparency—without sacrificing performance.

27.5.25

NVIDIA Introduces AceReason-Nemotron: Enhancing Math and Code Reasoning through Reinforcement Learning

 NVIDIA has unveiled AceReason-Nemotron, a 14-billion-parameter open-source model designed to enhance mathematical and coding reasoning through large-scale reinforcement learning (RL). This model demonstrates that RL can significantly improve reasoning capabilities in small to mid-sized models, surpassing traditional distillation-based approaches.

Key Features and Innovations

  • Sequential RL Training Strategy: The model undergoes a two-phase RL training process, first on math-only prompts and then on code-only prompts. This approach not only boosts performance in the respective domains but also ensures minimal cross-task degradation (a schematic follows this list). 

  • Enhanced Benchmark Performance: AceReason-Nemotron-14B achieves notable improvements on various benchmarks:

    • AIME 2025: 67.4% (+17.4%)

    • LiveCodeBench v5: 61.1% (+8%)

    • LiveCodeBench v6: 54.9% (+7%) 

  • Robust Data Curation Pipeline: NVIDIA developed a comprehensive data curation system to collect challenging prompts with verifiable answers, facilitating effective verification-based RL across both math and code domains. 

  • Curriculum Learning and Stability: The training incorporates curriculum learning with progressively increasing response lengths and utilizes on-policy parameter updates to stabilize the RL process. 
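A schematic of how the sequential, curriculum-based recipe fits together. The stage order follows the announcement; the length schedule, prompt sets, and the train_rl_stage() helper are illustrative placeholders, not NVIDIA's released training code.

```python
# Schematic of sequential math-then-code RL with a response-length curriculum.
# Length caps and the helper below are assumed placeholders for illustration.

def train_rl_stage(model, prompts, verifier, max_response_len):
    """One on-policy RL stage: sample responses up to max_response_len,
    score them with the verifier, and update the policy (e.g., PPO/GRPO)."""
    ...  # placeholder: an actual RLVR training loop would go here

def acereason_schedule(model, math_prompts, code_prompts,
                       math_verifier, code_verifier):
    # Stage 1: math-only RL, curriculum over growing response-length caps.
    for max_len in (8_000, 16_000, 24_000, 32_000):   # assumed schedule
        train_rl_stage(model, math_prompts, math_verifier, max_len)
    # Stage 2: code-only RL, starting from the math-trained checkpoint.
    for max_len in (24_000, 32_000):                  # assumed schedule
        train_rl_stage(model, code_prompts, code_verifier, max_len)
    return model
```

The ordering matters: math-first training builds general chain-of-thought habits that the code stage then specializes, which is how the recipe avoids the cross-task degradation noted above.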

Implications for AI Development

AceReason-Nemotron's success illustrates the potential of reinforcement learning in enhancing the reasoning abilities of AI models, particularly in mathematical and coding tasks. By releasing this model under the NVIDIA Open Model License, NVIDIA encourages further research and development in the AI community.
