
10.9.25

TraceRL puts diffusion LLMs on the reasoning map

 Autoregressive (AR) giants have dominated reasoning benchmarks, while diffusion language models (DLMs) were seen as “fast samplers” with limited logic chops. A new paper from Princeton and UChicago argues that’s mostly a training-objective problem—and offers TraceRL, a trajectory-aware reinforcement learning framework that aligns what a DLM learns with how it actually samples. The team also releases code and ready-to-run models under the TraDo banner. 

What’s new

  • Trajectory-aware RL for DLMs. Instead of scoring randomly masked sequences, TraceRL optimizes against the model’s intermediate inference traces, matching the left-to-right / blockwise behavior used at decode time. A diffusion-based value model stabilizes training by reducing variance. Crucially, the method works for both full-attention and block-attention DLMs (a rough sketch of the idea follows this list).

  • Open stack. The release includes a framework to build/train/deploy DLMs across architectures, with KV-cache acceleration, inference engines, SFT + RL recipes for math and code, and links to TraDo-4B/8B checkpoints. 
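To make the trajectory-alignment idea concrete, here is a minimal, hypothetical sketch of what a trace-aligned policy-gradient update could look like for a masked-diffusion LM. The module names (policy, value_model), the trace format, and the reward are illustrative assumptions, not the paper's actual implementation; the point is simply that the loss is computed over the states the sampler actually visits, with the value model as a baseline.

    # Hypothetical sketch of a trajectory-aware RL update for a masked-diffusion LM.
    # All names and shapes are illustrative, not the paper's actual API.
    import torch
    import torch.nn.functional as F

    vocab, hidden, seq_len = 32, 16, 8

    # Stand-ins for the diffusion policy and the diffusion-based value model.
    policy = torch.nn.Linear(hidden, vocab)       # predicts tokens at masked positions
    value_model = torch.nn.Linear(hidden, 1)      # scores intermediate states (variance reduction)

    # A fake sampling trajectory: three intermediate steps, each unmasking a few
    # positions in the same blockwise / left-to-right order used at inference time.
    # Each step records (hidden_state, unmasked_positions, chosen_tokens).
    trace = [
        (torch.randn(seq_len, hidden), torch.tensor([0, 1]),    torch.randint(0, vocab, (2,))),
        (torch.randn(seq_len, hidden), torch.tensor([2, 3, 4]), torch.randint(0, vocab, (3,))),
        (torch.randn(seq_len, hidden), torch.tensor([5, 6, 7]), torch.randint(0, vocab, (3,))),
    ]
    reward = torch.tensor(1.0)  # e.g. 1.0 if the final answer verifies, else 0.0

    policy_loss, value_loss = 0.0, 0.0
    for state, positions, tokens in trace:
        logits = policy(state[positions])                     # score only positions unmasked at this step
        logp = F.log_softmax(logits, dim=-1).gather(-1, tokens[:, None]).sum()
        baseline = value_model(state.mean(dim=0)).squeeze()   # value estimate for this intermediate state
        advantage = (reward - baseline).detach()              # value model as a variance-reducing baseline
        policy_loss = policy_loss - advantage * logp          # REINFORCE-style term, aligned with the trace
        value_loss = value_loss + F.mse_loss(baseline, reward)

    (policy_loss + value_loss).backward()                     # one trace-aligned update

Contrast this with the usual DLM objective, which scores tokens under random masking patterns the sampler may never actually encounter at decode time.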

The receipts

On headline benchmarks (dynamic vs. static sampling shown in the paper), the TraDo models post the strongest DLM numbers to date and overtake AR peers at similar scale on math:

  • TraDo-8B-Instruct: MATH500 78.5, AIME’24 13.3, LCB-V2 25.9—a +6.1% relative lift over Qwen2.5-7B-Instruct and +51.3% over Llama-3.1-8B-Instruct on math reasoning. 

  • TraDo-4B-Instruct: MATH500 75.6, AIME’24 10.3, LCB-V2 18.7, consistently edging 7B AR baselines on math. 

  • TraDo-8B-Thinking (long-CoT): first long chain-of-thought diffusion LLM, hitting MATH500 87.4, AIME’24 35.5, LCB-V2 34.6 with very long answers. 

The authors attribute gains to objective/trajectory alignment and show smoother curves with the value model vs. policy-only RL. They also document a speed/accuracy trade-off: dynamic sampling is faster; static top-1 decoding squeezes out extra points. 
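To make that trade-off concrete, here is a toy, hypothetical decoding loop contrasting the two modes; the decode function, logits_fn, and the thresholding logic are illustrative assumptions standing in for the released inference engine.

    # Toy contrast of the two decoding modes: "dynamic" unmasks every position whose
    # confidence clears a threshold (fewer steps, faster), while "static" top-1 commits
    # to only the single most confident position per step (more steps, often more accurate).
    import torch

    def decode(logits_fn, seq_len, dynamic=True, threshold=0.9, max_steps=64):
        tokens = torch.full((seq_len,), -1)                   # -1 marks a still-masked position
        steps = 0
        while (tokens == -1).any() and steps < max_steps:
            steps += 1
            masked = (tokens == -1).nonzero(as_tuple=True)[0]
            probs = torch.softmax(logits_fn(tokens), dim=-1)  # (seq_len, vocab_size)
            conf, pred = probs[masked].max(dim=-1)
            if dynamic:
                keep = conf >= threshold                      # unmask everything confident enough...
                if not keep.any():
                    keep[conf.argmax()] = True                # ...but always commit at least one position
            else:
                keep = torch.zeros_like(conf, dtype=torch.bool)
                keep[conf.argmax()] = True                    # static top-1: one position per step
            tokens[masked[keep]] = pred[keep]
        return tokens, steps

    # Toy usage with a random "model": the returned step counts show the speed gap.
    fake_logits = lambda toks: torch.randn(toks.shape[0], 10)
    print(decode(fake_logits, seq_len=8, dynamic=True, threshold=0.2))  # several positions per step
    print(decode(fake_logits, seq_len=8, dynamic=False))                # exactly one position per step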

Why it matters

  1. DLMs aren’t just “fast”; they can reason. With the right RL target, parallel generation stacks clear long-form math and coding hurdles previously ceded to AR.

  2. Unifies the zoo. One RL recipe spans full-attention and block-diffusion, and even helps enlarge block size for more flexible sampling.

  3. Practical path. The open framework + KV-cache tricks make DLM post-training and deployment feel product-ready, not just a lab exercise.

Setup notes

Math RL uses 8k hard MATH tasks; coding RL uses 6k verified problems from PrimeIntellect. Long-CoT training mixes TraceRL with long-form SFT as a curriculum. 
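For orientation only, a hypothetical recipe outline mirroring those setup notes is sketched below; the keys and labels are placeholders, and only the dataset sizes and the SFT-then-RL curriculum ordering come from the text above.

    # Hypothetical training-recipe outline based only on the setup notes above;
    # keys, dataset labels, and structure are placeholders, not the released configs.
    recipe = {
        "math_rl": {"dataset": "hard MATH subset", "num_tasks": 8_000, "algo": "TraceRL"},
        "code_rl": {"dataset": "PrimeIntellect (verified problems)", "num_tasks": 6_000, "algo": "TraceRL"},
        "long_cot": {"stages": ["long-form SFT", "TraceRL"]},  # curriculum: SFT first, then RL
    }
    print(recipe)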

Bottom line: TraceRL reframes diffusion LLMs as credible reasoners, not just fast generators—and TraDo-8B-Thinking plants the first long-CoT flag on the DLM side of the field. 

Paper link: arXiv 2509.06949 (PDF)
