22.7.25

WebShaper turns data generation for web agents into a set-theory science

 LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML. 

From ad-hoc scraping to knowledge projections

At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation (bornIn, playsFor, etc.). Two operations, union and intersection, let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node.
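
To make the KP idea concrete, here is a minimal Python sketch (my own illustration, not the paper's code): a KP as a relation plus the set of entities that satisfy it, with union and intersection as the two composition operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KP:
    """Knowledge Projection: the set of entities linked by a single relation."""
    relation: str          # e.g. "bornIn=Brazil" or "playsFor=ClubX" (illustrative encoding)
    entities: frozenset    # entities that satisfy the relation

    def union(self, other: "KP") -> frozenset:
        # Entities satisfying either relation.
        return self.entities | other.entities

    def intersect(self, other: "KP") -> frozenset:
        # Entities satisfying both relations -- the building block for
        # composing deeper, fully specified questions.
        return self.entities & other.entities


# Toy question: "Which player born in Brazil plays for Club X?"
born_in_brazil = KP("bornIn=Brazil", frozenset({"Alice", "Bruno", "Carla"}))
plays_for_x = KP("playsFor=ClubX", frozenset({"Bruno", "Dmitri"}))

print(born_in_brazil.intersect(plays_for_x))  # frozenset({'Bruno'})
```

Because the answer set is computed from the spec itself, every synthesized question carries a reasoning graph that is fully determined before any web page is fetched.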

A multi-step agent that grows harder questions

WebShaper starts with 18 k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler. 
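
A rough sketch of that loop, with hypothetical helper names standing in for the Expander's components (this is not the released code):

```python
# Hypothetical sketch of the n-step expansion described above; retrieve_pages,
# validate_candidates and rewrite_query are placeholders for the agentic Expander.

def expand(seed_qa, steps, retrieve_pages, validate_candidates, rewrite_query):
    """Grow a seed Q&A pair into a harder one over `steps` expansions."""
    qa = seed_qa
    for _ in range(steps):
        pages = retrieve_pages(qa)                   # fetch fresh evidence from the web
        candidates = validate_candidates(qa, pages)  # keep only entities consistent with the KP spec
        if not candidates:
            break                                    # no valid expansion found; stop early
        qa = rewrite_query(qa, candidates)           # fold the new constraints into the KP tree
    return qa
```

Each pass tightens the spec only with evidence it has already validated, which is what keeps complexity under curriculum-style control rather than drifting with whatever the crawler stumbles on.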

Why it matters

  • Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.

  • Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.

  • Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand (see the sketch after this list).
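
As a toy illustration of that last knob (the depth and branching parameters below are my own, not the paper's configuration schema), two numbers are enough to script very different workloads:

```python
import random

RELATIONS = ["bornIn", "playsFor", "locatedIn", "foundedBy", "directedBy"]

def compose_spec(depth, branching, rng=random):
    """Illustrative difficulty dial: a spec with `depth` levels, each intersecting
    `branching` randomly chosen relations. Deeper, bushier specs mean harder questions."""
    return [[rng.choice(RELATIONS) for _ in range(branching)] for _ in range(depth)]

easy = compose_spec(depth=2, branching=1)       # a couple of single-constraint hops
nightmare = compose_spec(depth=6, branching=3)  # a deep, heavily intersected KP tree
```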

State-of-the-art results with leaner data

Training a 72 B agent on the new dataset catapulted WebShaper-72B to 60.2 % on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32 B version tops WebDancer and SimpleDR. 

Model             GAIA ↑    Notes
WebShaper-72B     60.2 %    new SOTA
Claude-Sonnet *   58.3 %    proprietary
WebShaper-32B     55.4 %    open
WebSailor         55.3 %    open
GPT-4.1 *         48.5 %    proprietary

* scores reported using the same browsing APIs

Because the formal spec eliminates redundant retrieval, WebShaper needs ~42 % of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA. 

Open kits for builders

All resources are public:

  • Dataset: on Hugging Face and ModelScope (a hedged loading sketch follows this list)

  • Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts

  • Checkpoints: 32 B & 72 B SFT models ready for RL fine-tuning 
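
A quick way to poke at the data once it is downloaded; note that the Hugging Face dataset identifier below is a guess, so confirm the exact name in the WebAgent repo first:

```python
from datasets import load_dataset

# "Alibaba-NLP/WebShaper" is a hypothetical identifier -- check the
# Alibaba-NLP/WebAgent repository for the real dataset name before running.
ds = load_dataset("Alibaba-NLP/WebShaper", split="train")

# Inspect one record to see how questions, answers and trajectories are stored.
print(ds[0])
```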

The bigger picture

WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.

Paper link: arXiv 2507.15061 (PDF)
