LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML.
From ad-hoc scraping to knowledge projections
At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation (bornIn, playsFor, etc.). Two operations—union and intersection—let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node.
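To make the idea concrete, here is a toy sketch of a KP expressed as plain Python set operations; the triples, relation names, and helper function are illustrative stand-ins, not Tongyi Lab’s implementation.

```python
# Toy sketch of a Knowledge Projection (KP) as a set of entities tied to a
# single relation. The triples, relations, and function name below are
# illustrative stand-ins, not WebShaper's actual implementation.

def knowledge_projection(triples, relation, obj):
    """Return the set of entities e such that (e, relation, obj) holds."""
    return {subj for (subj, rel, tgt) in triples if rel == relation and tgt == obj}

# A miniature knowledge base of (subject, relation, object) triples.
triples = {
    ("Alice", "bornIn", "Paris"),
    ("Bob", "bornIn", "Paris"),
    ("Alice", "playsFor", "Lyon FC"),
    ("Carol", "playsFor", "Lyon FC"),
}

born_in_paris = knowledge_projection(triples, "bornIn", "Paris")       # {"Alice", "Bob"}
plays_for_lyon = knowledge_projection(triples, "playsFor", "Lyon FC")  # {"Alice", "Carol"}

# Composing KPs with set operations yields deeper, fully specified questions,
# e.g. "Which Paris-born player plays for Lyon FC?"
print(born_in_paris & plays_for_lyon)  # intersection -> {"Alice"}
print(born_in_paris | plays_for_lyon)  # union -> {"Alice", "Bob", "Carol"}
```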
A multi-step agent that grows harder questions
WebShaper starts with 18k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler.
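In rough pseudocode, the loop might look like the sketch below; retrieve_pages, validate_candidates, and rewrite_kp_tree are hypothetical placeholders for the retrieval, validation, and rewriting stages described in the paper, not the released Expander code.

```python
# Hypothetical sketch of an n-step expansion loop. retrieve_pages,
# validate_candidates, and rewrite_kp_tree are placeholders for the agentic
# stages described in the paper, not the released Expander implementation.

def expand(seed_qa, seed_kp_tree, n_steps,
           retrieve_pages, validate_candidates, rewrite_kp_tree):
    """Grow a seed Q&A pair into a harder one over n expansion steps."""
    qa, kp_tree = seed_qa, seed_kp_tree
    for _ in range(n_steps):
        # 1. Fetch fresh web evidence for the current KP nodes.
        pages = retrieve_pages(kp_tree)
        # 2. Keep only candidate facts consistent with the formal spec.
        candidates = validate_candidates(kp_tree, pages)
        if not candidates:
            break  # no valid expansion found; stop early
        # 3. Rewrite the KP tree (new unions/intersections) so the question
        #    requires one more verified reasoning hop.
        qa, kp_tree = rewrite_kp_tree(qa, kp_tree, candidates)
    return qa, kp_tree
```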
Why it matters
- Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.
- Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.
- Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand (a toy sketch follows this list).
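As an illustration of that last point, a KP composition tree can be generated with explicit depth and branching knobs; the relations and random composition policy below are invented for demonstration and are not WebShaper’s actual scheduler.

```python
import random

# Toy "difficulty dial": the size of a KP composition tree is controlled by
# explicit depth and branching parameters. Relations and the random policy
# here are invented for demonstration only.

RELATIONS = ["bornIn", "playsFor", "directedBy", "locatedIn"]

def random_kp_tree(depth, branching):
    """Build a nested KP spec whose complexity grows with depth and branching."""
    if depth == 0:
        return {"relation": random.choice(RELATIONS)}  # leaf KP
    return {
        "op": random.choice(["intersection", "union"]),
        "children": [random_kp_tree(depth - 1, branching) for _ in range(branching)],
    }

easy_spec = random_kp_tree(depth=1, branching=2)  # shallow: few constraints
hard_spec = random_kp_tree(depth=3, branching=3)  # deep: many nested constraints
```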
State-of-the-art results with leaner data
Training a 72B agent on the new dataset catapulted WebShaper-72B to 60.2% on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32B version tops WebDancer and SimpleDR.
| Model | GAIA ↑ | Notes |
|---|---|---|
| WebShaper-72B | 60.2% | new SOTA |
| Claude-Sonnet* | 58.3% | proprietary |
| WebShaper-32B | 55.4% | open |
| WebSailor | 55.3% | open |
| GPT-4.1* | 48.5% | proprietary |
Because the formal spec eliminates redundant retrieval, WebShaper needs ~42% of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA.
Open kits for builders
All resources are public:
- Dataset: on Hugging Face and ModelScope (see the loading sketch below)
- Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts
- Checkpoints: 32B & 72B SFT models ready for RL fine-tuning
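If the dataset follows the usual Hugging Face conventions, pulling it down could be as simple as the snippet below; the repository id is a placeholder, so check the Alibaba-NLP/WebAgent README for the exact name and splits.

```python
# Loading sketch only: "Alibaba-NLP/WebShaper" is a placeholder repository id.
# Check the Hugging Face / ModelScope links in the Alibaba-NLP/WebAgent README
# for the exact dataset name and available splits.
from datasets import load_dataset

dataset = load_dataset("Alibaba-NLP/WebShaper")  # placeholder id
print(dataset)
```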
The bigger picture
WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.
Paper link: arXiv 2507.15061 (PDF)