22.7.25

ParaStudent teaches a 7-B LLM to “struggle” like a freshman coder

 Large language models ace coding contests, but they rarely mimic the process of bumbling through a CS-101 assignment. With ParaStudent, Mihran Miroyan and colleagues at UC Berkeley show how to make an LLM act less like Stack Overflow and more like a sleep-deprived undergrad. The team fine-tuned Qwen-2.5 Coder 7B on 60 000 timestamped submissions from four semesters of an introductory Python course, then built an evaluation suite that scores outputs on semantics, functional correctness and style

Why “student-like” code matters

Personalised tutoring agents, auto-graders and curriculum-design tools need more than perfect solutions; they must anticipate syntax errors, awkward variable names and half-fixed bugs so they can give pedagogically useful feedback. Synthetic data that faithfully captures those quirks could unblock privacy-constrained research or bootstrap new courses with thin enrolment.

Three pillars of ParaStudent

ComponentWhat it does
Fine-tuned model (qwen-student)Learns error patterns, verbose style and incremental edits by ingesting full submission streams.
Low- vs high-resolution testsSnapshot evaluation (first/middle/final attempt) and frame-by-frame trajectory tracking reveal where models drift from real learners.
Multi-dimensional metricsCombines code-embedding distance, unit-test pass rate, AST edit distance and style vectors to judge realism beyond “does it run?”.

Key results

  • Closer trajectories. In the shared feature space Φ, qwen-student’s path hugs the real-student curve; GPT-4.1 and instruction-tuned Qwen jump straight from buggy to perfect, skipping the messy middle.

  • More human errors. Fine-tuning boosts coverage of common novice mistakes (off-by-one, misuse of max, stray print) by 2-3× versus prompting alone.

  • Style diversity. Edit-distance plots show qwen-student makes smaller, more frequent fixes, mirroring midnight-crunch behaviour, while GPT-4.1 rewrites whole files in one sweep.

  • Open & lightweight. Training ran on a single A100; code and evaluation scripts are on GitHub.

Take-aways for ed-tech builders

  1. Fine-tune, don’t prompt. Prompt-only models default to polished, one-shot answers—great for Stack Overflow, bad for teaching loops.

  2. Grade more than tests. Functional pass rate alone misses stylistic growth; ParaStudent’s metrics catch whether a learner’s code looks like a novice even when it finally works.

  3. Synthetic data is feasible. A 7 B open model can generate realistic class-size corpora without enterprise GPUs or proprietary APIs.

The authors release all data processing pipelines under a permissive licence, inviting researchers to port the approach to other languages or higher-level courses. Next on the roadmap: privacy-preserving fine-tuning and fully autoregressive “semester simulators” that could stress-test tutoring agents before they ever meet a real student.

Paper link: arXiv 2507.12674 (PDF)

WebShaper turns data generation for web agents into a set-theory science

 LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML. 

From ad-hoc scraping to knowledge projections

At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation ( bornIn, playsFor, etc.). Two operations—union and intersection—let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node. 

A multi-step agent that grows harder questions

WebShaper starts with 18 k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler. 

Why it matters

  • Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.

  • Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.

  • Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand. 

State-of-the-art results with leaner data

Training a 72 B agent on the new dataset catapulted WebShaper-72B to 60.2 % on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32 B version tops WebDancer and SimpleDR. 

ModelGAIA ↑Notes
WebShaper-72B60.2 %new SOTA
Claude-Sonnet *58.3 %proprietary
WebShaper-32B55.4 %open
WebSailor55.3 %open
GPT-4.1 *48.5 %proprietary

* scores reported using the same browsing APIs

Because the formal spec eliminates redundant retrieval, WebShaper needs ~42 % of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA. 

Open kits for builders

All resources are public:

  • Dataset: on Hugging Face and ModelScope

  • Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts

  • Checkpoints: 32 B & 72 B SFT models ready for RL fine-tuning 

The bigger picture

WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.

Paper link: arXiv 2507.15061 (PDF)

Archer shows “smart” RL beats brute force for small-scale reasoning models

 Modern RLVR post-training treats every output token the same, even though factual snippets (“Euler’s number is …”) and logical connectors (“therefore …”) serve wildly different purposes. Enter Archer, short for Adaptive Entropy-Aware RLVR, a new technique that groups tokens by entropy and then trains them under dual constraints:

  • Knowledge tokens (low entropy): strong KL regularization + tight PPO clip to preserve facts.

  • Reasoning tokens (high entropy): weaker KL + looser clip to encourage exploration and richer chains of thought. 

Crucially, the update is synchronous—no gradient masking or asynchronous passes that risk breaking sentence-level dependencies.


Fewer GPUs, bigger gains

On a single H800 slice, Archer fine-tunes a 1.5 B DeepSeek-R1 distilled model in one stage, 520 steps, 1,900 GPU-hours, yet leaps past multi-round rivals that burned 3–8× the compute. 

BenchmarkBase (DAPO)ArcherΔ
AIME 2024 Pass@123.5 %30.1 %+6.6
AIME 2025 Pass@127.6 %32.8 %+5.2
LiveCodeBench v5 Avg@826.0 %29.4 %+3.4
LiveCodeBench v6 Avg@1627.6 %30.2 %+2.6

The math-tuned variant also edges out specialist models like FastCuRL-1.5B and DeepScaleR-1.5B, while the code-tuned edition tops DeepCoder and Nemotron in head-to-head comparisons. 

Why it works

Analysis shows the dual-token policy stabilizes entropy and slashes n-gram repetition—avoiding collapse when KL is too weak and under-training when it’s too strong. Optimal KL weight (0.001) and asymmetric clip thresholds kept first-token latency low and reasoning diversity high. 


Why it matters

  • Smarter, not bigger: Archer turns a lightweight 1.5 B checkpoint into a math-and-code contender without billions of extra tokens or exotic reward models.

  • Template-free recipe: Any PPO-style RLVR loop can drop in the entropy classifier and dual constraints.

  • Open & ready: Code and configs are live on GitHub (wizard-III/ArcherCodeR), so teams can replicate the gains on their own domains today. 

As LLM builders hunt for cheaper paths to robust reasoning, Archer’s “treat knowledge gently, push reasoning hard” mantra may become standard practice—especially for edge-sized models that can’t afford brute-force scaling.

Paper link: arXiv 2507.15778 (PDF)

 The flashy AI announcements get the headlines — new model, higher benchmark, longer context. But if you've ever tried to actually deplo...