
22.7.25

ParaStudent teaches a 7B LLM to “struggle” like a freshman coder

Large language models ace coding contests, but they rarely mimic the process of bumbling through a CS-101 assignment. With ParaStudent, Mihran Miroyan and colleagues at UC Berkeley show how to make an LLM act less like Stack Overflow and more like a sleep-deprived undergrad. The team fine-tuned Qwen 2.5 Coder 7B on 60,000 timestamped submissions from four semesters of an introductory Python course, then built an evaluation suite that scores outputs on semantics, functional correctness and style.
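
For readers who want to try something in the same spirit, here is a minimal LoRA fine-tuning sketch. It is not the authors' recipe: the base checkpoint name, the submissions.jsonl path, the problem/attempt/code field names and every hyperparameter are placeholders, and it assumes the Hugging Face transformers, peft and datasets libraries.

# Hypothetical LoRA fine-tuning sketch; names, paths and hyperparameters
# are illustrative placeholders, not the ParaStudent authors' setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "Qwen/Qwen2.5-Coder-7B"
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Low-rank adapters keep the 7B model trainable on a single GPU.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Assumed JSONL schema: one record per timestamped submission with the
# assignment text, the attempt index and the student's (often buggy) code.
ds = load_dataset("json", data_files="submissions.jsonl", split="train")

def to_text(ex):
    # Condition on the assignment and attempt number; train on the student code.
    prompt = f"# Assignment: {ex['problem']}\n# Attempt {ex['attempt']}\n"
    return {"text": prompt + ex["code"]}

ds = ds.map(to_text, remove_columns=ds.column_names)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments("qwen-student", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=2,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()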

Why “student-like” code matters

Personalised tutoring agents, auto-graders and curriculum-design tools need more than perfect solutions; they must anticipate syntax errors, awkward variable names and half-fixed bugs so they can give pedagogically useful feedback. Synthetic data that faithfully captures those quirks could unblock privacy-constrained research or bootstrap new courses with thin enrolment.

Three pillars of ParaStudent

  • Fine-tuned model (qwen-student): learns error patterns, verbose style and incremental edits by ingesting full submission streams.

  • Low- vs high-resolution tests: snapshot evaluation (first/middle/final attempt) and frame-by-frame trajectory tracking reveal where models drift from real learners.

  • Multi-dimensional metrics: combines code-embedding distance, unit-test pass rate, AST edit distance and style vectors to judge realism beyond “does it run?”.
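
The paper's evaluation code is not reproduced here, but two of those signals are easy to approximate with the standard library alone. The sketch below is a simplified proxy: a pass rate over toy (expression, expected) tests, and an AST-level edit distance based on node-type sequences rather than a true tree edit distance; the embedding and style components are omitted.

# Simplified proxies for two ParaStudent-style metrics; not the paper's code.
import ast
import difflib

def ast_edit_distance(code_a: str, code_b: str) -> float:
    """Approximate AST edit distance as 1 - similarity of node-type sequences."""
    def node_types(src):
        try:
            return [type(n).__name__ for n in ast.walk(ast.parse(src))]
        except SyntaxError:
            # Novice code often doesn't parse; fall back to raw lines.
            return src.splitlines()
    sm = difflib.SequenceMatcher(None, node_types(code_a), node_types(code_b))
    return 1.0 - sm.ratio()

def pass_rate(code: str, tests: list) -> float:
    """Fraction of (expression, expected) unit tests the submission passes."""
    ns = {}
    try:
        exec(code, ns)  # run the submission in a scratch namespace
    except Exception:
        return 0.0
    passed = 0
    for expr, expected in tests:
        try:
            passed += eval(expr, ns) == expected
        except Exception:
            pass
    return passed / len(tests)

# Toy example: an off-by-one attempt versus the fixed version.
buggy = "def total(xs):\n    return sum(xs[:-1])\n"
fixed = "def total(xs):\n    return sum(xs)\n"
tests = [("total([1, 2, 3])", 6), ("total([])", 0)]
print(pass_rate(buggy, tests), pass_rate(fixed, tests))  # 0.5 then 1.0
print(round(ast_edit_distance(buggy, fixed), 2))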

Key results

  • Closer trajectories. In the shared feature space Φ, qwen-student’s path hugs the real-student curve; GPT-4.1 and instruction-tuned Qwen jump straight from buggy to perfect, skipping the messy middle.

  • More human errors. Fine-tuning boosts coverage of common novice mistakes (off-by-one, misuse of max, stray print) by 2-3× versus prompting alone.

  • Style diversity. Edit-distance plots show qwen-student makes smaller, more frequent fixes, mirroring midnight-crunch behaviour, while GPT-4.1 rewrites whole files in one sweep (a quick way to measure this is sketched after the list).

  • Open & lightweight. Training ran on a single A100; code and evaluation scripts are on GitHub.
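
That edit-size observation can be checked on any submission stream. The sketch below is illustrative only, not the paper's feature space Φ or its plots: it scores each consecutive pair of submissions by line-level dissimilarity, so many small fixes show up as a string of modest values while a wholesale rewrite shows up as one large jump. The example streams are made up.

# Illustrative trajectory analysis, not the paper's method: measure how
# large each successive edit is along a stream of submissions.
import difflib
from statistics import mean

def edit_sizes(submissions: list) -> list:
    """Relative edit size between each pair of consecutive submissions."""
    sizes = []
    for prev, curr in zip(submissions, submissions[1:]):
        sm = difflib.SequenceMatcher(None, prev.splitlines(), curr.splitlines())
        sizes.append(1.0 - sm.ratio())
    return sizes

# Hypothetical streams: a student making incremental fixes versus a model
# that rewrites the whole file in one sweep.
student_stream = [
    "def mx(xs): return xs[0]",
    "def mx(xs):\n    m = xs[0]\n    for x in xs: m = x\n    return m",
    "def mx(xs):\n    m = xs[0]\n    for x in xs:\n        if x > m: m = x\n    return m",
]
llm_stream = [
    "def mx(xs): return xs[0]",
    "def mx(xs):\n    return max(xs)",
]

print([round(s, 2) for s in edit_sizes(student_stream)])
print([round(s, 2) for s in edit_sizes(llm_stream)])
# True here: the student-like stream has a smaller average edit size.
print(mean(edit_sizes(student_stream)) < mean(edit_sizes(llm_stream)))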

Take-aways for ed-tech builders

  1. Fine-tune, don’t prompt. Prompt-only models default to polished, one-shot answers—great for Stack Overflow, bad for teaching loops.

  2. Grade more than tests. Functional pass rate alone misses stylistic growth; ParaStudent’s metrics catch whether a learner’s code looks like a novice even when it finally works.

  3. Synthetic data is feasible. A 7B open model can generate realistic class-size corpora without enterprise GPUs or proprietary APIs (see the generation sketch below).
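
As a rough illustration of that last point, the following sketch samples a synthetic first attempt from a fine-tuned checkpoint. The qwen-student adapter path, the prompt format and the sampling settings are placeholders, not the released artifacts.

# Hypothetical generation sketch for synthetic student submissions; the
# adapter path and prompt format are assumptions, not the released code.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-7B"
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto",
                                            torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "qwen-student")  # fine-tuned adapters

prompt = "# Assignment: return the largest element of a list\n# Attempt 1\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)

# Sampling (rather than greedy decoding) keeps the characteristic novice
# variety: off-by-one slips, stray prints, awkward variable names.
out = model.generate(**inputs, max_new_tokens=200, do_sample=True,
                     temperature=0.9, top_p=0.95)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))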

The authors release all data processing pipelines under a permissive licence, inviting researchers to port the approach to other languages or higher-level courses. Next on the roadmap: privacy-preserving fine-tuning and fully autoregressive “semester simulators” that could stress-test tutoring agents before they ever meet a real student.

Paper link: arXiv 2507.12674 (PDF)
