
2.9.25

Memento: teach agents to learn on the fly—no LLM fine-tune required

Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep the LLM frozen and adapt the agent with a memory that learns from every episode. The team formalizes this as a Memory-augmented MDP and shows it can lift real-world “deep research” performance—without gradient updates to the underlying model.

The recipe in one diagram

Memento is a planner–executor architecture wired to a growing Case Bank of episodic traces (state, action, reward). At each step, the planner retrieves similar past cases to guide the next action; after acting, the trajectory (success or failure) is written back—so the memory rewrites itself with environmental feedback. Retrieval can be non-parametric (Top-K by similarity) or parametric via a lightweight Q(s, c) scorer trained online to prefer high-utility cases. Tools are accessed through an MCP-style interface so the executor can browse, run code, or call APIs inside the same loop. 
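
For intuition, here is a minimal Python sketch of that episodic loop. The CaseBank class, its field names, and the toy embed() helper are illustrative assumptions, not the paper's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in any real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class CaseBank:
    """Stores (state, action, reward) episodes; retrieval is Top-K cosine."""
    def __init__(self):
        self.cases = []   # dicts with keys: state, action, reward
        self.keys = []    # one embedding per stored state

    def write(self, state: str, action: str, reward: float):
        # Successes and failures are both written back after each episode.
        self.cases.append({"state": state, "action": action, "reward": reward})
        self.keys.append(embed(state))

    def retrieve(self, state: str, k: int = 4):
        # Non-parametric retrieval: Top-K by similarity to the current state.
        if not self.cases:
            return []
        sims = np.stack(self.keys) @ embed(state)
        top = np.argsort(sims)[::-1][:k]
        return [self.cases[i] for i in top]

# Planner step: retrieved cases are folded into the frozen LLM's prompt;
# the new trajectory is written back, so the memory adapts, not the weights.
bank = CaseBank()
bank.write("find the 2023 GDP of France", "search('France GDP 2023')", 1.0)
neighbors = bank.retrieve("find the 2022 GDP of Spain")
```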

Why this beats “prompt more” and “train more”

Unlike static RAG or handcrafted reflections, case-based reasoning (CBR) selectively reuses successful and failed traces; unlike RL-fine-tuning, it avoids catastrophic forgetting and heavy compute. In ablations, adding CBR memory yields +4.7 to +9.6 absolute points on out-of-distribution QA sets (MuSiQue, Bamboogle, PopQA). 

The receipts

  • GAIA (long-horizon tool use): Top-1 on validation (87.88% Pass@3) and 79.40% on the private test leaderboard. 

  • DeepResearcher (live web research): 66.6 F1 / 80.4 PM, outperforming training-based systems under the paper’s setup. 

  • SimpleQA (single-hop factual): 95.0 PM, the highest among reported baselines. 

  • Humanity’s Last Exam (HLE): 24.4 PM, second overall and within 0.92 points of GPT-5 in the authors’ evaluation. 

What this means for builders

  • Ship updates without re-training. Treat memory as the learning substrate; leave your production LLM untouched. 

  • Choose your memory: start with non-parametric retrieval; add the parametric Q-head when you need sharper case selection (sketched after this list). 

  • Tooling that scales. MCP-based execution keeps multi-tool orchestration inside one protocol, making traces coherent and reusable. 
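
A hypothetical sketch of that parametric Q-head, assuming a simple linear logistic scorer updated online from episode rewards. The paper specifies only that Q(s, c) is trained online; the functional form below is an assumption.

```python
import numpy as np

class QScorer:
    """Tiny Q(s, c) head: P(case c is useful in state s), learned online."""
    def __init__(self, dim: int, lr: float = 0.05):
        self.w = np.zeros(2 * dim)   # weights over [state_emb; case_emb]
        self.lr = lr

    def score(self, s_emb: np.ndarray, c_emb: np.ndarray) -> float:
        x = np.concatenate([s_emb, c_emb])
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, s_emb: np.ndarray, c_emb: np.ndarray, reward: float):
        # Online logistic-regression step: pull the score toward the
        # observed utility (1.0 if the case helped, 0.0 if it did not).
        x = np.concatenate([s_emb, c_emb])
        self.w -= self.lr * (self.score(s_emb, c_emb) - reward) * x

rng = np.random.default_rng(0)
s, c = rng.standard_normal(384), rng.standard_normal(384)
q = QScorer(dim=384)
q.update(s, c, reward=1.0)   # this case helped; rank it higher next time
```

Retrieval then ranks candidate cases by Q instead of raw cosine similarity.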

The upshot: Memento reframes “agent improvement” as memory engineering. If your research agent gets better the more it works—without touching base weights—you’ve got a path to continual learning that’s practical outside the lab.

Paper link: arXiv 2508.16153 (PDF)

28.8.25

Anemoi: a semi-centralized agent system that lets bots talk to each other—literally

Most generalist multi-agent stacks still look like a relay race: a central planner prompts specialist workers, who pass back long context blobs for the planner to stitch together. It works—until you downsize the planner or hit token limits. Anemoi proposes a different wiring: keep a light planner, but let agents communicate directly over an Agent-to-Agent (A2A) MCP server so everyone can see progress, flag bottlenecks, and propose fixes in real time.

What’s actually new

Anemoi replaces unidirectional prompt passing with a threaded A2A server (built on the Model Context Protocol) that exposes primitives like list_agents, create_thread, send_message, and wait_for_mentions. Any agent can join a thread, address peers, and update plans mid-flight—reducing redundant context stuffing and information loss.
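
To make those primitives concrete, here is an in-memory stand-in for the four calls named above. The method signatures and local-object transport are assumptions for illustration; Anemoi exposes these over an MCP server.

```python
from dataclasses import dataclass, field

@dataclass
class A2AServer:
    """Toy, in-process stand-in for Anemoi's threaded A2A MCP server."""
    agents: dict = field(default_factory=dict)    # name -> description
    threads: dict = field(default_factory=dict)   # thread_id -> thread state

    def list_agents(self):
        return list(self.agents)

    def create_thread(self, thread_id: str, participants: list):
        self.threads[thread_id] = {"participants": participants, "messages": []}

    def send_message(self, thread_id, sender, text, mentions=None):
        self.threads[thread_id]["messages"].append(
            {"from": sender, "text": text, "mentions": mentions or []})

    def wait_for_mentions(self, thread_id, agent):
        # A blocking poll in the real system; here we just filter the backlog.
        return [m for m in self.threads[thread_id]["messages"]
                if agent in m["mentions"]]
```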

The cast of agents (and why it matters)

  • Planner: drafts the initial plan and spins up a thread.

  • Critique: continuously audits intermediate results.

  • Answer-Finder: compiles the final submission.

  • Workers: Web, Document Processing, and Reasoning & Coding—mirroring OWL’s tool set for a fair head-to-head. All are MCP-enabled so they can monitor progress and coordinate directly. 

This design reduces reliance on one overpowered planner, supports adaptive plan updates, and cuts token overhead from repeated context injection.

Numbers that move the needle (GAIA validation)

Framework           | Planner / Workers     | Avg. Acc.
OWL-rep (pass@3)    | GPT-4.1-mini / GPT-4o | 43.64%
OWL (paper, pass@3) | GPT-4o-mini / GPT-4o  | 47.27%
Anemoi (pass@3)     | GPT-4.1-mini / GPT-4o | 52.73%

With a small planner (GPT-4.1-mini), Anemoi tops a strong open-source baseline by +9.09 points under identical tools and models—and is competitive with several proprietary systems that rely on larger planners. 

How the A2A workflow runs

  1. Discover agents.
  2. Create a thread with the relevant participants.
  3. Workers execute subtasks; Critique labels outputs accept/uncertain while any agent can contribute revisions.
  4. Consensus vote before finalization.
  5. Answer-Finder submits.

All via MCP messaging in a single conversation context.
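
A toy run of those five steps, reusing the illustrative A2AServer sketched earlier (the consensus check here is a stand-in for the paper's inter-agent vote):

```python
server = A2AServer(agents={"planner": "drafts plans", "web": "browses",
                           "critique": "audits", "answer_finder": "compiles"})

# 1) discover agents, 2) create the thread
server.create_thread("task-1", participants=server.list_agents())

# 3) a worker posts a result; Critique labels it
server.send_message("task-1", "web", "Capital of Iceland: Reykjavik",
                    mentions=["critique"])
server.send_message("task-1", "critique", "accept", mentions=["planner"])

# 4) consensus before finalization, 5) Answer-Finder submits
votes = [m["text"] for m in server.threads["task-1"]["messages"]
         if m["from"] == "critique"]
if votes and all(v == "accept" for v in votes):
    server.send_message("task-1", "answer_finder", "FINAL: Reykjavik")
```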

Where it wins—and where it trips

  • Wins: Of the tasks Anemoi solved that OWL missed, 52% were due to collaborative refinement enabled by A2A; another 8% came from reduced context redundancy. 

  • Failures: Remaining errors skew to LLM/tool limits (≈46%/21%), incorrect plans (≈12%), and some communication latency (≈10%)—notably when the web agent is busy and can’t respond to peers. 

Why this matters

If your agent system juggles web search, file I/O, and coding, direct inter-agent communication can deliver better results without upgrading to an expensive planner. Anemoi shows a practical blueprint: keep the planner lightweight, move coordination into an A2A layer, and let specialists negotiate in-thread instead of bloating prompts. 

Paper link: arXiv 2508.17068 (PDF)

22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks.

How it works

  • CoA paradigm: the model can activate role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in one pass. 

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised tuning so the AFM internalizes collaboration patterns (a toy conversion is sketched after this list). 

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen when to plan, call tools, and switch roles. 
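
As promised above, a hypothetical sketch of the distillation step: flattening a trajectory produced by an external multi-agent framework into one role-tagged trace that a single model can be tuned on. The tag format and field names are invented for illustration, not the paper's schema.

```python
trajectory = [
    ("planner", "Split the task: locate the paper, then extract the metric."),
    ("web_agent", "search('Chain-of-Agents GAIA score') -> found PDF"),
    ("reasoner", "Cross-check the number against the results table."),
    ("answer", "55.3%"),
]

def to_coa_trace(traj):
    # One model later "plays" every role, so each turn becomes an inline
    # role-tagged block rather than a separate agent call.
    return "\n".join(f"<{role}>\n{content}\n</{role}>" for role, content in traj)

sft_example = {
    "prompt": "What GAIA score does the 32B AFM report?",
    "completion": to_coa_trace(trajectory),
}
print(sft_example["completion"])
```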

The scoreboard

A 32B AFM posts new highs across web and code agents—and strong math gains: 55.3% GAIA, 11.1% BrowseComp, 18.0% HLE, 47.9% LiveCodeBench-v5, 32.7% CodeContests, and 59.8% AIME’25, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR. 

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)

22.7.25

WebShaper turns data generation for web agents into a set-theory science

LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML.

From ad-hoc scraping to knowledge projections

At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation (bornIn, playsFor, etc.). Two operations—union and intersection—let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node.
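
A minimal sketch of KPs as plain Python sets, over an invented toy knowledge base; the relation names echo the examples above, and only the union/intersection composition is the point.

```python
# Toy knowledge base: relation -> set of (subject, object) pairs.
KB = {
    "bornIn":   {("Messi", "Argentina"), ("Mbappe", "France")},
    "playsFor": {("Messi", "Inter Miami"), ("Mbappe", "Real Madrid")},
}

def kp(relation: str, obj: str) -> set:
    """Knowledge Projection: all entities linked to `obj` by one relation."""
    return {s for (s, o) in KB[relation] if o == obj}

# Intersection composes a deeper query with a fully specified reasoning
# graph: "entities born in Argentina who play for Inter Miami".
print(kp("bornIn", "Argentina") & kp("playsFor", "Inter Miami"))  # {'Messi'}

# Union broadens coverage: "entities born in Argentina or France".
print(kp("bornIn", "Argentina") | kp("bornIn", "France"))
```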

A multi-step agent that grows harder questions

WebShaper starts with 18k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler.
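
The expansion loop, sketched with stub helpers. Every function body below is a placeholder for the paper's agentic components (web retrieval, validation, KP-tree rewriting); only the control flow is meant to be informative.

```python
def retrieve_evidence(kp_tree):          # stub: would fetch fresh web pages
    return [f"page about {node}" for node in kp_tree]

def propose_expansion(kp_tree, pages):   # stub: would propose a new KP node
    return f"relation_{len(kp_tree)}(x, ?)"

def validates(node, pages):              # stub: would check node vs. evidence
    return True

def rewrite_question(kp_tree):           # stub: would verbalize the KP tree
    return "Which entity satisfies " + " AND ".join(kp_tree) + "?"

def expand(seed_tree, n_steps=3):
    kp_tree = list(seed_tree)
    for _ in range(n_steps):             # difficulty grows one KP per step
        pages = retrieve_evidence(kp_tree)
        node = propose_expansion(kp_tree, pages)
        if not validates(node, pages):   # drop candidates evidence can't back
            break
        kp_tree.append(node)
    return rewrite_question(kp_tree), kp_tree

question, tree = expand(["bornIn(x, Argentina)"])
print(question)
```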

Why it matters

  • Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.

  • Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.

  • Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand. 

State-of-the-art results with leaner data

Training a 72B agent on the new dataset catapulted WebShaper-72B to 60.2% on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32B version tops WebDancer and SimpleDR.

Model           | GAIA ↑ | Notes
WebShaper-72B   | 60.2%  | new SOTA
Claude-Sonnet * | 58.3%  | proprietary
WebShaper-32B   | 55.4%  | open
WebSailor       | 55.3%  | open
GPT-4.1 *       | 48.5%  | proprietary

* scores reported using the same browsing APIs

Because the formal spec eliminates redundant retrieval, WebShaper needs ~42% of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA.

Open kits for builders

All resources are public:

  • Dataset: on Hugging Face and ModelScope

  • Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts

  • Checkpoints: 32B & 72B SFT models ready for RL fine-tuning

The bigger picture

WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.

Paper link: arXiv 2507.15061 (PDF)
