
22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

 Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks. 

How it works

  • CoA paradigm: the model can activate role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in one pass. 

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised tuning so the AFM internalizes collaboration patterns. 

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen when to plan, call tools, and switch roles. 
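
To make the single-model framing concrete, here is a minimal sketch of what a CoA-style rollout loop could look like. The `generate` stub, the tag format, and the tool registry are illustrative assumptions, not OPPO's actual interface; the point is that every plan, role switch, and tool call comes from one model, and the harness only executes tools.

```python
# Hypothetical CoA-style rollout: ONE model emits every role/tool step.
# `generate`, the tags, and TOOLS are illustrative stand-ins.
import re

TOOLS = {
    "search": lambda query: f"<stub results for {query!r}>",  # placeholder tool
}

def generate(context: str) -> str:
    """Stub for the AFM; a real run would decode the next tagged step."""
    return "<answer>stub answer</answer>"

def coa_rollout(question: str, max_steps: int = 16) -> str:
    context = f"<question>{question}</question>"
    for _ in range(max_steps):
        step = generate(context)   # model picks: plan, role switch, tool call, or answer
        context += step
        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', step, re.S)
        if call:                   # only tool execution leaves the model
            name, arg = call.groups()
            context += f"<observation>{TOOLS[name](arg)}</observation>"
        elif "<answer>" in step:   # terminal role: final report
            return step
    return context                 # step budget exhausted

print(coa_rollout("example question"))
```

Because there is no inter-agent message passing, the whole trajectory lives in a single context, which is exactly what makes it distillable and RL-trainable end to end.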

The scoreboard

A 32B AFM posts new highs across web and code agents—and strong math gains: 55.3% GAIA, 11.1% BrowseComp, 18.0% HLE, 47.9% LiveCodeBench-v5, 32.7% CodeContests, and 59.8% AIME’25, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR. 

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights; a sketch follows this list). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 
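
One way to picture those test-time scaling behaviors is best-of-n plan sampling with a verifier. A minimal sketch, assuming hypothetical `rollout` and `verify` helpers rather than anything from the paper:

```python
# Best-of-n test-time scaling: spend more inference compute, change no weights.
# `rollout` and `verify` are hypothetical stubs for an AFM trajectory sampler
# and a task-specific checker (exact match, unit tests, a reward model, ...).
import random

def rollout(question: str, temperature: float) -> str:
    return f"candidate-{random.randint(0, 3)}"        # stub sampled answer

def verify(question: str, answer: str) -> float:
    return 1.0 if answer == "candidate-0" else 0.0    # stub checker

def best_of_n(question: str, n: int = 8) -> str:
    candidates = [rollout(question, temperature=0.8) for _ in range(n)]
    return max(candidates, key=lambda ans: verify(question, ans))

print(best_of_n("example question"))  # larger n -> better odds of a verified hit
```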

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)

12.8.25

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

 Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model. 

Why it stands out

  • ARC trifecta focus. Across 12 agentic, reasoning, and coding (ARC) benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites, with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs. 

  • Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active parameters modest while stacking deeper layers, 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP layer for speculative decoding; the config sketch after this list collects these knobs. 

  • Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint. 
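
To keep that spec list straight, here are the reported knobs gathered into one hypothetical config object; the field names are mine, not Zhipu's code, and the values follow the description above:

```python
# Reported GLM-4.5 shape, gathered into an illustrative (not official) config.
from dataclasses import dataclass

@dataclass
class GLM45Config:
    total_params: str = "355B"      # full MoE parameter count
    active_params: str = "32B"      # parameters activated per token
    attention_heads: int = 96       # unusually many heads
    partial_rope: bool = True       # rotary embedding on part of each head
    qk_norm: bool = True            # normalized Q/K for stable attention logits
    mtp_layer: bool = True          # multi-token-prediction head -> speculative decoding
    hybrid_thinking: bool = True    # same checkpoint serves thinking + instant modes

air = GLM45Config(total_params="106B", active_params="12B")  # GLM-4.5-Air variant
```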

The training recipe (quick hits)

A two-stage pretraining + mid-training stack mixes high-quality web, multilingual, code, and math/science data, then adds repo-level code, synthetic reasoning, 128K-token long-context material, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass); the distillation step is sketched below. 
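
Read as plain supervised tuning on pooled expert traces, that distillation step might look roughly like this; `teacher.solve` and `student.train_step` are assumed helper names, not Zhipu's code:

```python
# Sketch: distill Reasoning / Agent / General experts into one hybrid student
# by supervising the student on trajectories sampled from each teacher.

def distill(student, teachers, prompt_sets):
    corpus = []
    for teacher, prompts in zip(teachers, prompt_sets):
        for prompt in prompts:
            corpus.append((prompt, teacher.solve(prompt)))  # expert trace as target
    for prompt, target in corpus:
        student.train_step(prompt, target)                  # plain SFT on the pool
    return student  # targeted RL then cleans up remaining pathologies
```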

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores. 

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

6.7.25

WebSailor charts an open-source course to super-human web reasoning

 For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.” 

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning. 
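
A toy rendering of that recipe, with made-up entities and masking rules (nothing here is Tongyi Lab's actual pipeline): random-walk a small graph, then swap anchor entities for vague paraphrases so a single lookup can't resolve the question.

```python
# Toy SailorFog-style synthesis: walk an entity graph, then blur the clues.
# Graph, entities, and paraphrases are all invented for illustration.
import random

GRAPH = {  # entity -> [(relation, neighbor), ...]
    "MusicianX": [("honored_in", "YearY")],
    "YearY":     [("same_year_as_death_of", "PoetZ")],
    "PoetZ":     [("wrote", "PoemW")],
}

VAGUE = {  # deliberate obfuscation of anchor facts
    "MusicianX": "a musician later honored in the early 21st century",
    "YearY":     "the year a late-antique poet died",
}

def synthesize(start: str, steps: int = 3):
    node, clues = start, []
    for _ in range(steps):
        if node not in GRAPH:
            break
        relation, nxt = random.choice(GRAPH[node])
        clues.append(f"{VAGUE.get(node, node)} ({relation})")
        node = nxt
    question = "Identify the entity reached via: " + " -> ".join(clues)
    return question, node  # (fog-bank question, gold answer)

q, a = synthesize("MusicianX")
print(q, "=>", a)
```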

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2k expert traces primes the model so DUPO has something to optimize. 
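
My reading of that description, sketched as batch construction (the real optimizer is specified in the paper; `sample_group` and the refill rule are assumptions): groups whose rewards are all identical carry no gradient signal, so they are dropped, and informative groups are duplicated rather than re-browsed.

```python
# Hedged sketch of duplicating sampling: refill the RL batch with copies of
# informative trajectory groups instead of paying for fresh browser rollouts.
# `sample_group` returns [(trajectory, reward), ...] and is a hypothetical stub.

def build_dupo_batch(tasks, sample_group, batch_size: int):
    informative = []
    for task in tasks:
        group = sample_group(task)
        rewards = [reward for _, reward in group]
        if len(set(rewards)) > 1:          # reward variance => learnable signal
            informative.append(group)
    batch, i = list(informative), 0
    while len(batch) < batch_size and informative:
        batch.append(informative[i % len(informative)])   # duplicate, don't re-browse
        i += 1
    return batch[:batch_size]
```

The payoff is fewer wasted rollouts per optimizer step, which is where tool-using RL loses most of its wall-clock time.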

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7% pass@1 on BrowseComp-en, trouncing agents built on 32B backbones that manage barely 2-3%. The 32B and 72B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools. 
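
For calibration when reading those numbers: pass@1 is simply the success rate, and the standard unbiased pass@k estimator (from Chen et al., 2021) generalizes it.

```python
# Unbiased pass@k (Chen et al., 2021): chance that at least one of k draws
# from n attempts, c of which are correct, succeeds. At k=1 it equals c/n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                       # too few failures to avoid a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=7, k=1))       # 0.07, i.e. a 7% pass@1-style score
```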

Why it matters

  • Democratizing deep search. BrowseComp-level tasks (ask a question, navigate a dozen-plus pages, synthesize an answer) are exactly what corporate knowledge bases and vertical-search startups need. WebSailor shows you no longer need a closed-source giant to play.

  • A recipe, not a model. The RFT cold start, uncertainty-first data, and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.

  • Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72B model scores >90% pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks. 

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers. 

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)
