12.9.25

Claude’s new file creation tools vs. ChatGPT and Gemini: who’s ahead on real productivity


What Claude offers now

From Anthropic’s announcements:

  • Creates and edits real files directly in chats or the desktop app: Excel (.xlsx), Word (.docx), PowerPoint (.pptx), PDFs. 

  • Users can upload data or provide source material, then ask Claude to build files from scratch (e.g. spreadsheets with formulas, documents, or slide decks). 

  • The outputs are downloadable, ready-to-use artifacts. Claude can also convert document formats (e.g. PDF → slides) and run statistical analysis within spreadsheets.

  • File size limits: up to 30 MB uploads/downloads. 

  • The feature is currently a preview for certain paid plans (Max, Team, Enterprise), with Pro plans getting access “soon.” 


What ChatGPT currently supports (vs. Claude)

Based on public info:

  • File uploads & summarization / extraction: ChatGPT can accept PDFs, presentations, plaintext documents, etc., and then respond to queries about their contents. 

  • Data analysis / code execution environment (“Code Interpreter” / “Advanced Data Analysis”): for spreadsheets or CSVs, you can upload, have it run code, do charts/visualizations, clean data, etc. 

  • File editing or direct file creation: ChatGPT does not yet offer a broadly marketed “create new file + edit” flow that produces Excel/Word/PPTX/PDF files as downloadable artifacts in chat. There are plugins and workflows that approximate this, but no core feature announced the way Claude’s was.

  • Canvas interface: ChatGPT introduced “Canvas,” allowing inline editing of text or code alongside chat—helpful for refining, rewriting, and collaborating. But Canvas is about editing text/code drafts in the interface, not generating formatted document files exported to PPTX, XLSX, etc. 


What we know less about Gemini (Google)

  • Public info is less detailed for Gemini’s ability to generate downloadable files like PowerPoints, spreadsheets with formulas, etc. Gemini can export Deep Research reports as Google Docs, which implies some document generation + formatting functionality. But whether it handles real .xlsx spreadsheets or retains formula logic is less clear. (This comes from secondary sources referencing export of reports as Google Docs.) 

  • Gemini does well with research-style reports, text generation, and multimodal input/output, but its direct file-editing workflow (upload a file, edit its content, download a formatted artifact) is not publicly documented at parity with Claude’s newly announced capability.


Side-by-side strengths & gaps

(Comparing Claude’s new file creation/editing, ChatGPT’s current capacities, and Google Gemini as publicly known.)

Create + download Word / PPTX / Excel / PDF from scratch via chat
  • Claude: ✅ Yes
  • ChatGPT: ❓ Mostly no / limited—chat drafts or upload → extract, but not full artifact creation with formatting and formulas
  • Gemini: ✅ Some document export (e.g. Google Docs), but file formats and formula support unclear

Edit existing files (spreadsheets, slide decks, PDFs) by specifying edits
  • Claude: ✅ Yes—can modify an uploaded file
  • ChatGPT: Partial—can suggest edits or produce updated content, but usually as text, not by editing the actual file artifact
  • Gemini: Less clear publicly

Formulas / spreadsheet logic, charts, data analysis within the file
  • Claude: ✅ Supports formulas and chart generation inside Excel sheets
  • ChatGPT: ✅ Advanced Data Analysis / Code Interpreter can run code and generate charts, but output is often an image or code rather than an Excel file with working formulas
  • Gemini: Unknown in detail

Format preservation / bulk edits (e.g. replace terms, style, layout)
  • Claude: Claims to preserve formatting and support direct edits without opening the file manually
  • ChatGPT: Can manipulate content, but does not always preserve formatting when exporting to external files; often conversion-based or re-rendered text
  • Gemini: Likely similar to its document export, with less file-format variety known

File size & limits
  • Claude: Uploads/downloads up to ~30 MB
  • ChatGPT: File uploads also have size limits, and support for editing artifacts with formulas or presentation layouts is more constrained
  • Gemini: Not fully disclosed; varies across features and tools

Availability / plan restrictions
  • Claude: Preview for paid tiers; not yet generally available on free plans
  • ChatGPT: Many advanced features gated to Plus / Pro / Team; Canvas is in beta
  • Gemini: Tiered and regionalized feature access; not universally confirmed for all features

Implications & what this means

  • Claude’s added file creation/editing increases its utility for document and presentation workflows, especially for business / enterprise use, where formatted deliverables (slides, reports, spreadsheets) are key.

  • If ChatGPT (or Gemini) wants to match this, they'd need to support not just text/coding, but full artifact generation + editing with retention of formula logic/formatting + download/export in common office file formats.

  • Users whose workflows involve formatting, layout, bulk edits, or converting between formats will benefit more from Claude’s new feature—less manual reformatting and fewer copy-paste hacks.

  • For many use cases, existing tools (ChatGPT + Code Interpreter) suffice, especially when output is data or charts. But for file artifacts that are meant to be “finished” or shared, Claude’s offering tightens the gap.

Claude’s Leap: From Chat to File Factory

 Anthropic just upgraded Claude to be more than a conversational assistant. A fresh feature preview lets users create and edit real files—Excel sheets, Word docs, PowerPoint decks, and PDFs—directly through Claude.ai and the desktop app. Rather than simply getting text output, you can describe what you need, upload data, and receive usable files already formatted and ready to share or export. 


What’s New

  • File types supported: .xlsx, .docx, .pptx, .pdf—spreadsheet, word-processing, presentation, and PDF formats. 

  • Complex workflows enabled: You can ask Claude to build financial models with formulas and multiple sheets, convert PDFs into slides, clean raw data, run statistical analyses, produce charts, or stitch together reports—all via natural instructions (a rough illustration follows this list). 

  • Sandboxed computing: Claude now operates in a restricted internal computing environment. It can run code (e.g. Python), load libraries, and generate artifacts without exposing your local machine. 
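Anthropic hasn’t published the internals of this sandbox, but as a rough mental model, here is how a spreadsheet with live formulas can be produced programmatically in Python. openpyxl and the column layout are my assumptions, not Anthropic’s disclosed tooling.

```python
# Minimal sketch, assuming openpyxl: build an .xlsx with working formulas,
# the kind of artifact the sandboxed file-creation flow hands back.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Forecast"

ws.append(["Month", "Revenue", "Cost", "Profit"])
for row, (month, rev, cost) in enumerate(
    [("Jan", 1200, 800), ("Feb", 1350, 820), ("Mar", 1500, 900)], start=2
):
    ws.append([month, rev, cost, f"=B{row}-C{row}"])  # profit stays a live formula

ws.append(["Total", "=SUM(B2:B4)", "=SUM(C2:C4)", "=SUM(D2:D4)"])  # summary row
wb.save("forecast.xlsx")  # downloadable artifact, formulas intact
```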


Availability & Plans

  • Already available now for those on Max, Team, and Enterprise plans. 

  • Pro users will get access soon. 

  • It’s currently a feature preview—opt-in required per user via the Claude settings (“Upgraded file creation and analysis”) and may still be tweaked. 


Use Cases: What You Can Do

  • Transform raw data into polished reports (CSV → charts → formatted Word or PDF); a minimal sketch of one such pipeline follows this list. 

  • Build project trackers, scenario models, dashboards in Excel with working formulas. 

  • Convert existing documents from one format to another: e.g., meeting notes → slide decks; PDF reports → editable docs. 
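As a concrete, purely illustrative sketch of the first use case, here is one way a CSV-to-report pipeline can look in Python; pandas, matplotlib, python-docx, and the file names are assumptions, not Claude’s actual implementation.

```python
# Hypothetical CSV -> chart -> formatted Word report pipeline.
import pandas as pd
import matplotlib
matplotlib.use("Agg")                      # render charts headlessly
import matplotlib.pyplot as plt
from docx import Document                  # python-docx

df = pd.read_csv("sales.csv")              # assumed columns: month, revenue
df.plot(x="month", y="revenue", kind="bar", legend=False)
plt.tight_layout()
plt.savefig("revenue.png")

doc = Document()
doc.add_heading("Monthly Revenue Report", level=1)
doc.add_paragraph(f"Total revenue: {df['revenue'].sum():,.0f}")
doc.add_picture("revenue.png")
doc.save("revenue_report.docx")            # shareable, formatted deliverable
```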


Risks & Safeguards

  • Security: Because Claude gets limited internet access in order to import packages or execute code, there is risk of malicious content or prompt injection. Users are encouraged to monitor outputs and disable the feature if suspicious behavior arises.

  • Admin and user controls: Enterprise admins can enable or disable file creation organization-wide; Team users must opt in; individuals can toggle the feature in settings. 


Why It Matters

This move shifts Claude (and similar models) further into hands-on productivity automation. Rather than merely advising, Claude can now execute parts of what used to require manual effort: formatting, data manipulation, cross-format conversion. That reduces friction for users who want to go from idea → usable artifact in fewer steps. It’s also a more natural way to blend AI into workflows: you stay in chat, give instructions, and get back files—not just text dumps you have to reformat. It’s a signal of what’s next: smarter agents embedded in the tools people use daily.

Blog Link

A Survey of Reinforcement Learning for Large Reasoning Models: mapping the promise and the gaps

 Reinforcement learning (RL) isn’t new—but as Large Language Models (LLMs) evolve into reasoning machines, RL is taking a central role not just in alignment, but in building reasoning itself. A new survey, “Reinforcement Learning for Large Reasoning Models (LRMs)” by a large group from Tsinghua, Shanghai AI Lab, SJTU, and others, lays out an exhaustive map of the nascent field: what’s working, what’s risky, and what future architects need to solve. 


What the survey covers

The paper dives into the core building blocks of using RL in reasoning-centered LLMs (often called LRMs): how to define rewards, what training algorithms are in play, how sampling strategies are evolving, and how infrastructure and task domains factor into the picture. It considers both alignment-adjacent RL (e.g. RLHF, preference learning) and RL whose goal is reasoning performance (accuracy, planning, reflection). 


Key themes and insights

  1. Reward design
    The survey classifies rewards into several types:

    • Verifiable rewards (e.g. test correctness, unit tests, exact checks) when tasks allow.

    • Generative / learned reward models for subjective or open domains.

    • Dense rewards vs outcome-only reward schemes—bringing signal into intermediate reasoning steps.

    • Unsupervised or weak rewards when neither full correctness metrics nor human feedback are feasible.
      The authors emphasize that tasks with strong verifiability tend to yield more reliable RL learning.

  2. Policy optimization & sampling strategies
    There’s a broad sweep of algorithms: policy gradients, off-policy methods, regularized RL, hybrid approaches, critic-based vs critic-free methods. Sampling strategies—how you gather candidate outputs or intermediate chains—have big effects both on performance and on compute cost. Dynamic / structured sampling (e.g. adaptively adjusting paths, beam vs sampling) is becoming more common. 

  3. Foundational problems and gaps
    Several of these stand out:

    • Distinguishing when RL improves reasoning vs just memorization.

    • Balancing model priors: does your base LLM already encode a useful reasoning prior, or does the behavior need to be trained from scratch?

    • The trap of over-rewarding narrow achievements (reward hacking).

    • Challenges in reward specification in subjective domains.

    • Scaling issues: compute, infrastructure, verifying many candidates. 

  4. Training resources & infrastructure
    The survey catalogues the spectrum of environments and corpora used: from static datasets to dynamic environments (interactive tasks, tool usage), from single-task to multi-agent setups. It also considers RL frameworks and infrastructure tools (e.g. RL pipeline libraries) that enable reproducible LLM+RL research. 

  5. Applications
    RL for LRMs has been used in:

    • Coding: unit tests, code correctness, reflection.

    • Agentic tasks: agents using tools, web retrieval, planning.

    • Multimodal reasoning: vision-language tasks, code+images.

    • Robotics / medical / scientific domains. Each has its own reward/verification constraints. 


Why it matters & what to watch next

  • Reasoning as an explicit target. RL is being woven into models not just to be more “helpful” or “safe,” but to reason more deeply: plan, reflect, self-correct.

  • Verifiability is a power lever. Where tasks allow for exact or semi-exact verification, RL works well. When reward is fuzzy, progress is slower and riskier.

  • Cost and scalability are fundamental constraints. As LRMs become larger and used with more test-time compute (more chain-of-thought, more candidate generations), RL training and inference costs balloon; infrastructure and sampling strategy choices can make or break feasibility.

  • Hybrid and co-evolving reward models are growing. There’s increasing interest in reward models that both learn and evolve alongside the LLM, or in having the model itself critique or verify its own work.


Takeaways for researchers and builders

  • If you’re designing RL for reasoning tasks, aim for verifiable reward signals where possible—they give cleaner gradients and fewer surprises (a minimal sketch follows this list).

  • Pay attention to sampling strategy—generating more candidates or reasoning branches helps, but only when combined with selective reinforcement.

  • For subjective or “open” tasks (creative writing, alignment, etc.), you likely need sophisticated reward models, rubric-based or generative rewards, and strong regularization.

  • Infrastructure matters: your ability to scale RL—from having candidate generation, verifiers, tool execution environments, caching, etc.—significantly affects what you can achieve.
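To make the “verifiable reward” point concrete, here is a minimal sketch of the two reward flavors the survey contrasts; the function names and the regex-based answer extraction are mine, not the paper’s.

```python
# Verifiable (outcome-checkable) reward vs. a learned-reward stub.
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Exact-match reward for math-style tasks: 1.0 if the last number matches."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    prediction = numbers[-1] if numbers else None
    return 1.0 if prediction == gold_answer.strip() else 0.0

def learned_reward(prompt: str, model_output: str, reward_model) -> float:
    """For subjective/open domains: defer to a (hypothetical) learned scorer."""
    return reward_model(prompt, model_output)  # scalar preference score
```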


Bottom line: This survey is a timely, comprehensive lookup table for anyone playing at the intersection of LLMs, RL, and reasoning. It confirms that reward design and verifiability are major levers, that RL is now essential for pushing reasoning as a capability, but also that many technical, infrastructural, and algorithmic challenges remain before “reasoning superintelligence.”

Paper link: arXiv 2509.08827 (PDF)

How to Build High-Quality Tools for LLM Agents — Lessons from Anthropic

 As agents become more central to AI workflows, what separates a good agent from a great one often comes down to the tools it has—and how well those tools are designed. In “Writing effective tools for agents — with agents,” Anthropic shares a practical roadmap for building better tools with the help of agents themselves, using Claude and the Model Context Protocol (MCP) as real-use labs.


What are “tools” in the agentic context?

Unlike conventional software APIs—deterministic functions that always give the same output for the same input—tools for agents must be built to coexist with non-deterministic systems. Agents like Claude must decide when to use tools, how to parse their output, and how to call them responsibly. A tool here is not just an API call; it's part of an interface contract between predictable software and unpredictable agent behavior. Tools are the mechanisms by which agents expand what they can reliably do. 


Key workflows: prototyping, evaluating, and iterating

Anthropic emphasizes an iterative workflow:

  1. Prototype early: Build simple versions of your tools. Use MCP servers or desktop extensions to connect your tool to Claude Code, allowing rapid experimentation and detection of rough edges. Include clear documentation that the agent can consume. 

  2. Run realistic evaluations: Create evaluation tasks that reflect real-world usage (multiple tool calls, complex chains, integration with other services). Use verifiable outcomes, not just “it seems right.” Capture metrics such as tool calls, token consumption, runtime, errors. Avoid toy tasks that underrepresent complexity. 

  3. Use agents to improve tools: Let Claude analyze transcripts and feedback to suggest refinements—maybe better prompt descriptions, more efficient tool outputs, clearer schemas. Anthropic reports improvements even for tools built by internal experts, purely by letting agents inspect tools’ performance. 


Best practices and guiding principles

Anthropic distills the lessons into a set of design principles; a schematic tool definition follows the list. Key among them:

  • Choosing tools selectively: Not every API needs to become a tool. Tools should cover high-impact, repeated workflows—not wrapping every possible existing endpoint. Also, consolidate when possible. 

  • Namespaces and naming clarity: Clear, consistent naming helps agents pick the right tool. Avoid ambiguous names or overlapping functionality. Group related tools under logical prefixes or categories. 

  • Return meaningful, concise context: Tools should return high-signal info. Avoid overwhelming the agent with technical IDs or long metadata unless necessary. Also allow “concise” vs “detailed” response modes. 

  • Optimize for token efficiency: Use truncation, filtering, pagination. Prompt agents to use fewer tool calls or more precise queries. Efficient context limits make downstream tasks more reliable. 

  • Clear tool specs and descriptions: Explicit parameter naming, clear input/output formats, good examples. Prompt engineering of tool descriptions can significantly impact performance. 
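Pulling several of these principles together, the snippet below shows what a well-scoped tool definition might look like; the schema shape loosely follows MCP-style specs, and every name and field here is illustrative rather than Anthropic’s actual API.

```python
# Hypothetical tool spec: namespaced name, tight description, concise vs.
# detailed response modes, and bounded result sizes for token efficiency.
orders_search_tool = {
    "name": "orders_search",
    "description": (
        "Search customer orders by status and date range. Returns at most "
        "`limit` results; use response_mode='concise' unless full line-item "
        "detail is required."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["open", "shipped", "refunded"]},
            "since": {"type": "string", "description": "ISO date, e.g. 2025-09-01"},
            "limit": {"type": "integer", "default": 20, "maximum": 100},
            "response_mode": {
                "type": "string", "enum": ["concise", "detailed"], "default": "concise"
            },
        },
        "required": ["status"],
    },
}
```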


Why this matters

Tools shape what agents can do. When tools are poorly described, overly broad, or return huge dumps of irrelevant context, agents waste resources, produce hallucinations, or fail to successfully orchestrate workflows. On the other hand, well-designed tools reduce ambiguity, reduce token use, reduce error, and let agents scale reliably across real-world tasks.

Especially as agents connect to many tools (hundreds via MCP servers), these design principles become the difference between brittle behavior and something that feels reliable and intuitive. Anthropic’s experience shows that many improvements come not from changing the LLM itself but refining the tools around it.


If you’re building agent tools or service/tool APIs for agents, following Anthropic’s workflow—prototype → evaluate → iterate—and using clear naming, context-efficient returns, and good documentation will set you up for tools agents actually use well.

Link: https://www.anthropic.com/engineering/writing-tools-for-agents

11.9.25

Parallel-R1: Teaching LLMs to reason from multiple angles—permanently

 Modern large language models (LLMs) often reason sequentially—one thought chain at a time. Parallel thinking, in contrast, involves spawning multiple reasoning paths (or perspectives), then merging the insights. While prompting tricks can induce this behavior at inference, they carry heavy overhead and brittle generalization. Parallel-R1, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a training-time RL framework for instilling parallel thinking as a native reasoning strategy. 


What is Parallel-R1

The key idea: don’t just prompt models to use parallel paths—train them to do so. Parallel-R1 has a progressive curriculum:

  1. Cold start (format learning via SFT) — teach the model the syntax/tags of parallel blocks (e.g. <Parallel>, <Path>...</Path>, <Summary>), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.

  2. Reinforcement learning (RL) on easy tasks, to explore the use of parallel thinking, with a reward that combines correctness and use of the parallel structure (a rough sketch of this reward appears after the list). 

  3. RL on more difficult problems (e.g. DAPO, AMC, AIME), so the model generalizes both performance and the parallel thinking style. 
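As a rough sketch of the stage-2 reward (my reading of the paper’s description, not its released code), correctness and parallel-structure usage can be combined on an alternating schedule—one of the reward schedules the paper explores.

```python
# Illustrative reward: alternate between rewarding accuracy (ACC) and
# rewarding use of the <Parallel>/<Path>/<Summary> format (PAR).
def parallel_r1_reward(output: str, is_correct: bool, step: int,
                       switch_every: int = 1) -> float:
    uses_parallel = "<Parallel>" in output and "<Path>" in output
    if (step // switch_every) % 2 == 0:
        return 1.0 if is_correct else 0.0      # ACC phase: correctness only
    return 1.0 if uses_parallel else 0.0       # PAR phase: structure only
```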

The architecture has two variants: a causal (structure-agnostic) version and a structured version. The structured version modifies the attention mechanism (via path-window masking, separate position encodings) so paths are more isolated during reasoning. But structured variants show trade-offs—good for generalization in some settings, but less robust under distribution shift.


Results & gains

On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:

  • The “Seen” variant (causal) achieves ~48.9% average across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks. 

  • In particular, on AIME’25, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly. 

  • The structured (Unseen) variant also performs well under certain reward schedules; the “alternating ACC/PAR” reward schedule (switching between rewarding correctness and parallel structure periodically) helps balance parallel usage and performance. 

Beyond numerical gains, the authors observe a behavioral shift: early in training, the model heavily uses parallel paths as an exploration tool, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for verification near the end of reasoning. This shift correlates with stronger final performance. 


Why this matters

  • Performance & efficiency trade-off: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost (since parallel paths are triggered only when needed).

  • Better than imitation: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing; but those often over-fit to particular patterns. RL in Parallel-R1 helps models learn to decide when parallel paths help, not just how to mimic them.

  • Scaffolding exploration: The cold-start + easy tasks + alternating reward strategy functions as a scaffold, enabling RL to find a stronger policy space than direct RL on hard tasks.

  • Architecture designs matter: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.


Limitations & future directions

  • The gains, though significant, still leave room before human-level performance in very hard math tasks.

  • The structured variants can struggle under domain shift; care needed in architectural changes that assume particular path structures.

  • Triggering parallel thinking (using <Parallel> blocks) costs some token and compute overhead, though the model learns to use it more sparsely over time.

  • There’s a balance tension between pushing for parallel structure (which encourages exploration) and maximizing accuracy (which sometimes pushes toward fewer divergences). Reward engineering is delicate.


Bottom line: Parallel-R1 is a breakthrough toward training LLMs that think in parallel, not just deeper. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness and robustness, methods like this will likely become a standard part of the toolkit.

Paper link: arXiv 2509.07980 (PDF)

The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting

 When logic is tricky, the most common answer isn’t always the correct one. A new Meta FAIR & CMU paper titled “The Majority is not always right: RL training for solution aggregation” challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method—AggLM—trains a dedicated aggregator model to review, correct, and synthesize among multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority voting and reward model baselines. 


Solving it: learned reconciliation vs. counting

Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that's most frequent (majority voting) or highest scored by some reward model. While effective in many settings, these methods have a blind spot—when correct answers exist only among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, analyzes them, spots mistakes or partial correctness, then combines ideas or corrects missing steps to produce a final solution. Importantly, it’s trained using verifiable rewards—i.e. only when the aggregated output matches a known correct solution. 
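A hedged sketch of that setup: the aggregator reads the k candidates inside a single prompt and is rewarded only when its final, reconciled answer matches the known solution. The prompt wording and helper names below are mine, not the paper’s.

```python
# Aggregation-as-reasoning: build the aggregation prompt, verify the result.
def build_aggregation_prompt(problem: str, candidates: list[str]) -> str:
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return (
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Review the candidates, identify mistakes or partially correct ideas, "
        "and write one corrected, complete solution ending with the final answer."
    )

def verifiable_reward(aggregated_answer: str, gold_answer: str) -> float:
    """Binary RLVR signal: 1.0 only if the aggregate matches the ground truth."""
    return 1.0 if aggregated_answer.strip() == gold_answer.strip() else 0.0
```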


Key ingredients & experiments

  • Dataset & training: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions. 

  • Aggregation via RLVR: The model uses Group-Relative Policy Optimization (GRPO), with a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model but is tuned via this RL signal. 

  • Benchmarks: Evaluated on four math contest datasets: AIME24, AIME25, HMMT24, HMMT25. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes. 


Results & token-efficiency

  • On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly over majority voting with the same 8 candidate solutions (exact margins vary by benchmark and evaluation variant). More strikingly, when aggregating solutions from the stronger Qwen3-8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines. 

  • In non-thinking modes (i.e. when the candidate-generating model is weaker or does not use chain-of-thought reasoning), AggLM retains its lead—showing that it generalizes beyond just cherry-picking strong or specifically-formatted inputs. 

  • Regarding cost, AggLM is more token efficient: instead of needing large numbers of candidate solutions (i.e. very large k) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute. 


Implications & what’s next

AggLM shifts thinking in several ways:

  1. Aggregation as reasoning. Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.

  2. Balancing examples is key. Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers. 

  3. Generalization beyond training generators. AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions. 

  4. Efficiency trade-off. Instead of scaling k (number of solutions) to very high values, using a learned aggregator yields larger gains per additional candidate, meaning better accuracy for a given token and time budget. 


Bottom line: AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to look across candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.

Paper link: arXiv 2509.06870 (PDF)

ParaThinker: parallel minds beat longer monologues

 LLMs have ridden test-time compute—“think longer” chains of thought—but returns taper as early tokens lock models into bad trajectories. Tsinghua’s ParaThinker calls this Tunnel Vision and proposes native thought parallelism: generate several independent reasoning paths simultaneously, then fuse them into one answer. 

Instead of external voting, ParaThinker trains the model itself to branch and merge: specialized control tokens (<think i>) trigger distinct trajectories, path-specific positional embeddings keep streams separate, and a two-phase attention mask enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs. 
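A toy illustration of that two-phase masking idea (my simplification, not the paper’s code): during thinking, each path attends only to the prompt and to itself; during summarization, the summary tokens can see everything.

```python
import numpy as np

def parathinker_mask(prompt_len: int, path_lens: list[int], summary_len: int):
    """Boolean attention mask (True = allowed), causal left-to-right overall."""
    total = prompt_len + sum(path_lens) + summary_len
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :prompt_len] = True                      # every token sees the prompt
    start = prompt_len
    for length in path_lens:                         # phase 1: paths stay isolated
        mask[start:start + length, start:start + length] = True
        start += length
    mask[start:, :] = True                           # phase 2: summary sees all paths
    mask &= np.tril(np.ones((total, total), dtype=bool))  # enforce causality
    return mask
```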

On AIME-24/25, AMC-23 and MATH-500, ParaThinker with 8 parallel paths boosts accuracy by +12.3 pts (1.5B) and +7.5 pts (7B) over sequential baselines under the same token budget, and still beats majority voting by +4.3/+2.0 pts—with only ~7.1% latency overhead. Generating up to 16 paths costs <2× single-path latency, thanks to better arithmetic intensity on GPUs. 

The takeaway: scale width, not just depth. ParaThinker shows that orchestrating compute across diverse, parallel thoughts unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub. 

Paper link: arXiv 2509.04475 (PDF)

10.9.25

TraceRL puts diffusion LLMs on the reasoning map

 Autoregressive (AR) giants have dominated reasoning benchmarks, while diffusion language models (DLMs) were seen as “fast samplers” with limited logic chops. A new paper from Princeton and UChicago argues that’s mostly a training-objective problem—and offers TraceRL, a trajectory-aware reinforcement learning framework that aligns what a DLM learns with how it actually samples. The team also releases code and ready-to-run models under the TraDo banner. 

What’s new

  • Trajectory-aware RL for DLMs. Instead of scoring randomly masked sequences, TraceRL optimizes against the model’s intermediate inference traces, matching the left-to-right / blockwise behavior used at decode time. A diffusion-based value model stabilizes training by reducing variance. Crucially, the method works for full-attention and block-attention DLMs. 

  • Open stack. The release includes a framework to build/train/deploy DLMs across architectures, with KV-cache acceleration, inference engines, SFT + RL recipes for math and code, and links to TraDo-4B/8B checkpoints. 

The receipts

On headline benchmarks (dynamic vs. static sampling shown in the paper), the TraDo models post the strongest DLM numbers to date and overtake AR peers at similar scale on math:

  • TraDo-8B-Instruct: MATH500 78.5, AIME’24 13.3, LCB-V2 25.9—a +6.1% relative lift over Qwen2.5-7B-Instruct and +51.3% over Llama-3.1-8B-Instruct on math reasoning. 

  • TraDo-4B-Instruct: MATH500 75.6, AIME’24 10.3, LCB-V2 18.7, consistently edging 7B AR baselines on math. 

  • TraDo-8B-Thinking (long-CoT): first long chain-of-thought diffusion LLM, hitting MATH500 87.4, AIME’24 35.5, LCB-V2 34.6 with very long answers. 

The authors attribute gains to objective/trajectory alignment and show smoother curves with the value model vs. policy-only RL. They also document a speed/accuracy trade-off: dynamic sampling is faster; static top-1 decoding squeezes out extra points. 

Why it matters

  1. DLMs aren’t just “fast”—they can reason. With the right RL target, parallel generation stacks clear long-form math and coding hurdles previously ceded to AR.

  2. Unifies the zoo. One RL recipe spans full-attention and block-diffusion, and even helps enlarge block size for more flexible sampling.

  3. Practical path. The open framework + KV-cache tricks make DLM post-training and deployment feel product-ready, not just a lab exercise. 

Setup notes

Math RL uses 8k hard MATH tasks; coding RL uses 6k verified problems from PrimeIntellect. Long-CoT training mixes TraceRL with long-form SFT as a curriculum. 

Bottom line: TraceRL reframes diffusion LLMs as credible reasoners, not just fast generators—and TraDo-8B-Thinking plants the first long-CoT flag on the DLM side of the field. 

Paper link: arXiv 2509.06949 (PDF)

Language Self-Play: training an LLM without adding data actually works

 LLMs keep getting better by eating more data—until the data well runs dry. A new paper from Meta Superintelligence Labs proposes Language Self-Play (LSP): turn training into a game where a single model plays both sides—a Challenger that generates tougher prompts and a Solver that answers them—so the system improves without ingesting new datasets. In tests on AlpacaEval using Llama-3.2-3B-Instruct, LSP matches a strong data-driven RL baseline and even pushes beyond it when used as a follow-on stage. 

How it works: one model, two roles

LSP frames training as a minimax game: Challenger tries to minimize reward by making hard queries; Solver tries to maximize reward by answering them. Crucially, both roles are instantiated by the same LLM via a role-selecting prompt (e.g., a special challenger prompt), avoiding the instability and memory overhead of training an external adversary. KL regularization keeps the Challenger from devolving into nonsense prompts. 

Under the hood, LSP borrows group-relative baselines from GRPO: Challenger generates N queries, Solver samples G answers per query, and the average reward defines both a per-answer advantage (for Solver) and a “difficulty” signal (for Challenger). A practical variant, LSP-Zero, runs as a pure zero-sum game; the full LSP adds a quality self-reward scored by a reference model to prevent reward-hacking (e.g., answering everything in Python). 
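A small sketch of that group-relative bookkeeping (a paraphrase of the description above, not Meta’s code): for each Challenger query, the mean reward over the Solver’s G sampled answers serves as the baseline, so per-answer advantages are deviations from that mean, and a low mean marks the query as hard.

```python
import numpy as np

def group_relative_signals(rewards_per_query):
    """rewards_per_query: N lists, each holding G per-answer rewards."""
    solver_advantages, challenger_difficulty = [], []
    for group in rewards_per_query:
        group = np.asarray(group, dtype=float)
        mean, std = group.mean(), group.std() + 1e-6
        solver_advantages.append((group - mean) / std)  # advantage for each answer
        challenger_difficulty.append(-mean)             # low mean reward = hard query
    return solver_advantages, challenger_difficulty

# e.g. two queries, three sampled answers each
adv, difficulty = group_relative_signals([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
```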

Results: data-free ≈ data-driven—and sometimes better

Using GPT-4o as judge on AlpacaEval, the team compares models trained from the same base:

  • From base (no data): Overall win rates vs. the base model—GRPO (with data) 40.9%, LSP-Zero 40.1%, LSP 40.6%. Translation: self-play without any RL data keeps pace with standard RL. 

  • From RL (as a next stage): Starting from the GRPO model and continuing with self-play, LSP lifts overall win rate to 43.1%, with large gains on Vicuna-style conversational tasks (28.7% → 46.3%). 

The setup uses Skywork-Reward-V2-Llama-3.2-3B as the reward model; the authors note that LSP (with the added quality reward) avoids the degradation seen with LSP-Zero in some splits, and acknowledge dips on “chatbot-y” Koala prompts—likely because Challenger skews toward structured, orderly instructions. 

Why this matters

  • Data bottleneck relief. If you can translate “more practice data” into a self-generated curriculum, you can keep improving without chasing new corpora. 

  • A clean follow-on stage. Even after data-based RL, self-play adds headroom—useful when further high-quality preference data is scarce. 

  • Single-model simplicity. One backbone serves both roles, avoiding adversary models and the instability they bring. 

Caveats and open questions

Self-play can degenerate without the quality self-reward; reward choice caps the ceiling (a weak reward model means weak training signal); and Challenger diversity remains an open knob to broaden beyond the structured style seen in examples. Still, the authors argue the method should work even better on tasks with verifiable rewards (e.g., code tests), not just preferences. 

If your roadmap hits a data wall, Language Self-Play is a compelling new leg in the post-training pipeline: spin up a Challenger inside your own model, let it stress-test itself, and learn—no fresh dataset required.

Paper link: arXiv 2509.07414 (PDF)
