
22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

 Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks. 

How it works

  • CoA paradigm: the model activates role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in a single pass (a minimal trace sketch follows this list).

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised fine-tuning so the AFM internalizes collaboration patterns (see the flattening sketch after this list).

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen the model's decisions about when to plan, call tools, and switch roles (a reward sketch follows the trace example below).
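
To make the distillation step concrete, here is a minimal sketch of how a successful multi-agent trajectory could be flattened into a single CoA-style supervised fine-tuning example. The tag names, role labels, and helper functions are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch: flattening a multi-agent trajectory into one
# Chain-of-Agents training sequence. Tags and role names are illustrative,
# not the schema used in the paper.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    role: str                          # e.g. "planner", "web_agent", "coder"
    content: str                       # the agent's reasoning or message
    tool: Optional[str] = None         # tool invoked at this step, if any
    tool_result: Optional[str] = None  # observation returned by the tool

def to_coa_sequence(question: str, steps: list, answer: str) -> dict:
    """Render a distilled multi-agent trajectory as one prompt/target pair,
    so a single model learns to 'play' every role end to end."""
    parts = []
    for s in steps:
        parts.append(f"<{s.role}>{s.content}</{s.role}>")
        if s.tool is not None:
            parts.append(f"<tool_call name={s.tool!r}/>")
            parts.append(f"<tool_result>{s.tool_result}</tool_result>")
    parts.append(f"<answer>{answer}</answer>")
    return {"prompt": question, "target": "\n".join(parts)}

# Example: one trajectory harvested from an external agent framework.
trajectory = [
    Step("planner", "Split the task into a web lookup and a final synthesis."),
    Step("web_agent", "Search for the release year.", tool="search",
         tool_result="First released in 2016."),
    Step("planner", "Enough evidence gathered; produce the answer."),
]
sft_example = to_coa_sequence("When was the product first released?",
                              trajectory, "2016")
print(sft_example["target"])
```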

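For the RL stage, verifiable tasks keep the reward cheap to compute. Below is a minimal sketch assuming simple exact-match and unit-test checkers with 0/1 rewards; the function names and reward shaping are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical sketch of verifiable rewards for agentic RL.
# The checkers and the 0/1 shaping are illustrative assumptions.
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, reference: str) -> float:
    """Exact-match check for math tasks (e.g. AIME-style integer answers)."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(program: str, tests: str) -> float:
    """Run hidden unit tests against generated code; all tests pass -> 1.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + tests)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0

# A policy-gradient loop would roll out full CoA trajectories (plans, tool
# calls, role switches) and optimize against these scalar rewards, sharpening
# when to plan, when to call a tool, and when to hand off between roles.
```
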
The scoreboard

A 32B AFM posts new highs on web and code agent benchmarks, plus strong math gains: GAIA 55.3%, BrowseComp 11.1%, HLE 18.0%, LiveCodeBench-v5 47.9%, CodeContests 32.7%, and AIME’25 59.8%, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR.

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)
