Code benchmarks have a scaling problem: hand-written tasks don’t keep up with fast-improving models, and multilingual coverage is thin. Tencent Hunyuan’s new paper proposes a fix: AutoCodeGen, an automated workflow that inverts dataset creation—generate solutions and tests first, then ask the LLM to write the problem—validated by a multilingual execution sandbox. The result is AutoCodeBench (ACB), a 3,920-problem suite evenly spread over 20 languages, with ~9.6 tests per problem and a deliberate bias toward hard tasks. Even frontier “think-mode” models top out around ~52% Pass@1, signaling real headroom.
How they build hard, correct problems
AutoCodeGen runs in four steps: (1) LLMs evolve self-contained code solutions from real multilingual snippets; (2) LLMs propose public and private test inputs, which the sandbox executes to compute ground-truth outputs; (3) the LLM then writes the problem description constrained by strict specs (language, entry points, naming); (4) a three-stage filter (multi-sampling for difficulty, LLM-as-critic for quality, diversity tagging) trims the set. This “reverse-order” recipe yields correct, executable tests without humans in the loop.
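To make the reverse-order idea concrete, here is a minimal sketch of the loop, assuming hypothetical helpers `call_llm` and `sandbox_execute` that stand in for the model client and the multilingual sandbox (these names are not the paper's API):

```python
# Minimal sketch of the reverse-order AutoCodeGen loop described above.
# `call_llm` and `sandbox_execute` are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Problem:
    language: str
    solution: str                                 # step 1: LLM-evolved, self-contained solution
    tests: list = field(default_factory=list)     # step 2: (input, ground-truth output) pairs
    statement: str = ""                           # step 3: description written *after* solution + tests

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your model client here

def sandbox_execute(language: str, code: str, test_input: str) -> str:
    raise NotImplementedError  # plug in the multilingual execution sandbox here

def autocodegen(seed_snippet: str, language: str) -> Problem:
    # (1) evolve a self-contained solution from a real multilingual code snippet
    solution = call_llm(f"Evolve this {language} snippet into a self-contained solution:\n{seed_snippet}")

    # (2) propose public/private test inputs, then execute the solution to get ground-truth outputs
    raw_inputs = call_llm(f"Propose public and private test inputs for:\n{solution}")
    tests = [(inp, sandbox_execute(language, solution, inp))
             for inp in raw_inputs.splitlines() if inp.strip()]

    # (3) write the problem statement last, constrained by language/entry-point/naming specs
    statement = call_llm(
        f"Write a {language} problem description matching this solution and these tests. "
        f"Specify the entry point and naming exactly.\nSolution:\n{solution}"
    )
    return Problem(language=language, solution=solution, tests=tests, statement=statement)
```

The three-stage filter (difficulty via multi-sampling, quality via LLM-as-critic, diversity tagging) would then prune the resulting pool.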
What’s inside ACB
- Scale & spread: 3,920 problems, 37,777 tests, 20 languages (Python→TypeScript), 14 task categories from data structures to systems programming. >60% are “hard.”
- Sandbox: open-sourced, supports 20+ languages, high-concurrency, request-based calls; usable for both evaluation and data synthesis (see the sketch after this list).
- Lite & Complete: ACB-Lite (≈1,586 problems) for faster evals; ACB-Complete (1,000 completion-style tasks, 3-shot) targets base models rather than chat-tuned ones.
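Since the sandbox is called over requests, an evaluation harness can treat it as a simple service. The following is a hedged sketch of that call pattern; the endpoint URL, field names, and response schema are assumptions, not the released sandbox's actual API:

```python
# Hypothetical request-based call to a locally running execution sandbox; the real
# endpoint, payload fields, and response schema may differ from the open-sourced sandbox.
import requests

def run_in_sandbox(language: str, code: str, stdin: str = "", timeout_s: int = 10) -> dict:
    payload = {"language": language, "code": code, "stdin": stdin, "timeout": timeout_s}
    resp = requests.post("http://localhost:8080/run", json=payload, timeout=timeout_s + 5)
    resp.raise_for_status()
    return resp.json()  # assumed shape: {"stdout": ..., "stderr": ..., "exit_code": ...}

def check_candidate(language: str, code: str, test_input: str, expected: str) -> bool:
    # Verify one candidate solution against one stored test case.
    result = run_in_sandbox(language, code, stdin=test_input)
    return result.get("exit_code") == 0 and result.get("stdout", "").strip() == expected.strip()
```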
The scoreboard: even elites struggle
Averaged across 20 languages, the leaderboard’s top tier lands ~50–52% Pass@1, led by Claude Opus 4 (Think) at 52.4%, with o3-high, Grok-4, Claude Sonnet 4 (Think), and DeepSeek-R1-0528 clustered close behind. Mid-tier open models sit in the 30s–40s; smaller coders drop to the 20s. Translation: the multilingual + multi-logical mix is punishing.
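For readers unfamiliar with the metric, Pass@1 is the fraction of problems solved by a single sampled completion that passes all tests. The standard unbiased estimator from the HumanEval paper (Chen et al., 2021) is shown below for context; this assumes ACB follows the usual definition, which the post itself does not spell out:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for background.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples passing all tests, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample, Pass@1 reduces to the raw pass rate:
# pass_at_k(n=1, c=1, k=1) == 1.0 and pass_at_k(n=1, c=0, k=1) == 0.0
```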
Iterating with sandbox feedback helps
Across three refinement turns using execution error messages, models like DeepSeek-V3-0324 and Qwen2.5-Coder-32B-Instruct gain ~8–12 points, with the biggest jump on turn one—evidence that automated error signals materially improve code generation.
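A minimal sketch of such a repair loop, assuming hypothetical `call_llm` and `run_tests` helpers rather than the paper's exact harness:

```python
# Multi-turn repair loop: feed sandbox execution errors back to the model for up to
# three refinement turns. `call_llm` and `run_tests` are hypothetical helpers.
def refine_with_feedback(problem: str, call_llm, run_tests, max_turns: int = 3) -> str:
    code = call_llm(f"Solve the following problem:\n{problem}")
    for turn in range(max_turns):
        ok, error_log = run_tests(code)   # execute against the problem's test cases
        if ok:
            break                         # all tests pass; stop early
        # The paper observes the largest gain on the first refinement turn.
        code = call_llm(
            f"Problem:\n{problem}\n\nYour previous attempt:\n{code}\n\n"
            f"Execution errors:\n{error_log}\n\nFix the code."
        )
    return code
```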
Base-model check: ACB-Complete
On the 1,000-item, 3-shot ACB-Complete, Seed-Coder-8B-Base leads its class (≤8B) at 31.6% Pass@1, edging OpenCoder-8B-Base and Qwen2.5-Coder-7B—useful signal for pre-instruct comparisons that classic HumanEval/MBPP miss.
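Completion-style evaluation matters because base models have no chat template; they are prompted with a few worked examples and asked to continue. The sketch below shows the shape of such a 3-shot prompt; the exemplars and formatting are illustrative assumptions, not the benchmark's actual prompts:

```python
# Hedged sketch of a 3-shot completion-style prompt for base models.
# The exemplars and formatting here are illustrative, not ACB-Complete's actual prompts.
FEW_SHOT_EXEMPLARS = [
    ('def add(a, b):\n    """Return a + b."""\n', "    return a + b\n"),
    ('def is_even(n):\n    """Return True if n is even."""\n', "    return n % 2 == 0\n"),
    ('def head(xs):\n    """Return the first element of xs."""\n', "    return xs[0]\n"),
]

def build_3shot_prompt(task_prefix: str) -> str:
    # Concatenate three solved exemplars, then the unsolved task prefix;
    # the base model is expected to continue from the prefix.
    shots = "".join(prefix + completion + "\n" for prefix, completion in FEW_SHOT_EXEMPLARS)
    return shots + task_prefix
```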
Why it matters
- Human-free, multilingual, hard. ACB shows you can scale quality and coverage without armies of annotators.
- Better evals for code agents. Emphasis on multi-logical tasks (several core functions per problem) aligns with agent workflows like SWE-Bench.
- Sandbox as a lever. Open, concurrent execution infra doubles as a training-data factory and an iterative-repair oracle.
- Benchmarks drive progress. If your coding model cruises through Python puzzles but face-plants in Kotlin, Shell, or Elixir, AutoCodeBench will make that obvious—and give you a reproducible path to fix it.
Paper link: arXiv 2508.09101 (PDF)