LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML.
From ad-hoc scraping to knowledge projections
At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation (bornIn, playsFor, etc.). Two operations—union and intersection—let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node.
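To make the idea concrete, here is a toy sketch of a KP expressed as plain Python set operations; the triples, relation names, and helper function are illustrative stand-ins, not Tongyi Lab’s implementation.

```python
# Toy sketch of a Knowledge Projection (KP) as a set of entities tied to a
# single relation. The triples, relations, and function name below are
# illustrative stand-ins, not WebShaper's actual implementation.

def knowledge_projection(triples, relation, obj):
    """Return the set of entities e such that (e, relation, obj) holds."""
    return {subj for (subj, rel, tgt) in triples if rel == relation and tgt == obj}

# A miniature knowledge base of (subject, relation, object) triples.
triples = {
    ("Alice", "bornIn", "Paris"),
    ("Bob", "bornIn", "Paris"),
    ("Alice", "playsFor", "Lyon FC"),
    ("Carol", "playsFor", "Lyon FC"),
}

born_in_paris = knowledge_projection(triples, "bornIn", "Paris")       # {"Alice", "Bob"}
plays_for_lyon = knowledge_projection(triples, "playsFor", "Lyon FC")  # {"Alice", "Carol"}

# Composing KPs with set operations yields deeper, fully specified questions,
# e.g. "Which Paris-born player plays for Lyon FC?"
print(born_in_paris & plays_for_lyon)  # intersection -> {"Alice"}
print(born_in_paris | plays_for_lyon)  # union -> {"Alice", "Bob", "Carol"}
```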
A multi-step agent that grows harder questions
WebShaper starts with 18k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler.
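In rough pseudocode, the loop might look like the sketch below; retrieve_pages, validate_candidates, and rewrite_kp_tree are hypothetical placeholders for the retrieval, validation, and rewriting stages described in the paper, not the released Expander code.

```python
# Hypothetical sketch of an n-step expansion loop. retrieve_pages,
# validate_candidates, and rewrite_kp_tree are placeholders for the agentic
# stages described in the paper, not the released Expander implementation.

def expand(seed_qa, seed_kp_tree, n_steps,
           retrieve_pages, validate_candidates, rewrite_kp_tree):
    """Grow a seed Q&A pair into a harder one over n expansion steps."""
    qa, kp_tree = seed_qa, seed_kp_tree
    for _ in range(n_steps):
        # 1. Fetch fresh web evidence for the current KP nodes.
        pages = retrieve_pages(kp_tree)
        # 2. Keep only candidate facts consistent with the formal spec.
        candidates = validate_candidates(kp_tree, pages)
        if not candidates:
            break  # no valid expansion found; stop early
        # 3. Rewrite the KP tree (new unions/intersections) so the question
        #    requires one more verified reasoning hop.
        qa, kp_tree = rewrite_kp_tree(qa, kp_tree, candidates)
    return qa, kp_tree
```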
Why it matters
- Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.
- Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.
- Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand (a toy sketch follows this list).
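As an illustration of that last point, a KP composition tree can be generated with explicit depth and branching knobs; the relations and random composition policy below are invented for demonstration and are not WebShaper’s actual scheduler.

```python
import random

# Toy "difficulty dial": the size of a KP composition tree is controlled by
# explicit depth and branching parameters. Relations and the random policy
# here are invented for demonstration only.

RELATIONS = ["bornIn", "playsFor", "directedBy", "locatedIn"]

def random_kp_tree(depth, branching):
    """Build a nested KP spec whose complexity grows with depth and branching."""
    if depth == 0:
        return {"relation": random.choice(RELATIONS)}  # leaf KP
    return {
        "op": random.choice(["intersection", "union"]),
        "children": [random_kp_tree(depth - 1, branching) for _ in range(branching)],
    }

easy_spec = random_kp_tree(depth=1, branching=2)  # shallow: few constraints
hard_spec = random_kp_tree(depth=3, branching=3)  # deep: many nested constraints
```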
State-of-the-art results with leaner data
Training a 72B agent on the new dataset catapulted WebShaper-72B to 60.2% on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32B version tops WebDancer and SimpleDR.
| Model | GAIA ↑ | Notes |
|---|---|---|
| WebShaper-72B | 60.2% | new SOTA |
| Claude-Sonnet* | 58.3% | proprietary |
| WebShaper-32B | 55.4% | open |
| WebSailor | 55.3% | open |
| GPT-4.1* | 48.5% | proprietary |
Because the formal spec eliminates redundant retrieval, WebShaper needs ~42% of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA.
Open kits for builders
All resources are public:
- Dataset: on Hugging Face and ModelScope (see the loading sketch below)
- Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts
- Checkpoints: 32B & 72B SFT models ready for RL fine-tuning
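If the dataset follows the usual Hugging Face conventions, pulling it down could be as simple as the snippet below; the repository id is a placeholder, so check the Alibaba-NLP/WebAgent README for the exact name and splits.

```python
# Loading sketch only: "Alibaba-NLP/WebShaper" is a placeholder repository id.
# Check the Hugging Face / ModelScope links in the Alibaba-NLP/WebAgent README
# for the exact dataset name and available splits.
from datasets import load_dataset

dataset = load_dataset("Alibaba-NLP/WebShaper")  # placeholder id
print(dataset)
```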
The bigger picture
WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.
Paper link: arXiv 2507.15061 (PDF)