12.9.25

How to Build High-Quality Tools for LLM Agents — Lessons from Anthropic

 As agents become more central to AI workflows, what separates a good agent from a great one often comes down to the tools it has—and how well those tools are designed. In “Writing effective tools for agents — with agents,” Anthropic shares a practical roadmap for building better tools with the help of agents themselves, using Claude and the Model Context Protocol (MCP) as real-world test beds.


What are “tools” in the agentic context?

Unlike conventional software APIs—deterministic functions that always give the same output for the same input—tools for agents must be built to coexist with non-deterministic systems. Agents like Claude must decide when to use tools, how to parse their output, and how to call them responsibly. A tool here is not just an API call; it's part of an interface contract between predictable software and unpredictable agent behavior. Tools are the mechanisms by which agents expand what they can reliably do. 


Key workflows: prototyping, evaluating, and iterating

Anthropic emphasizes an iterative workflow:

  1. Prototype early: Build simple versions of your tools. Use MCP servers or desktop extensions to connect your tool to Claude Code, allowing rapid experimentation and detection of rough edges. Include clear documentation that the agent can consume. 

  2. Run realistic evaluations: Create evaluation tasks that reflect real-world usage (multiple tool calls, complex chains, integration with other services). Use verifiable outcomes, not just “it seems right.” Capture metrics such as tool-call counts, token consumption, runtime, and errors (a minimal harness sketch follows this list). Avoid toy tasks that underrepresent complexity. 

  3. Use agents to improve tools: Let Claude analyze transcripts and feedback to suggest refinements—maybe better prompt descriptions, more efficient tool outputs, clearer schemas. Anthropic reports improvements even for tools built by internal experts, purely by letting agents inspect tools’ performance. 
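
To make the evaluation step concrete, here is a minimal harness sketch in the spirit of the post: run realistic tasks against an agent, check verifiable outcomes, and record tool calls, tokens, runtime, and errors. The run_agent callable and the task format are hypothetical placeholders, not Anthropic's tooling.

  import time
  from dataclasses import dataclass, field

  @dataclass
  class EvalResult:
      task_id: str
      passed: bool
      tool_calls: int
      tokens_used: int
      runtime_s: float
      errors: list = field(default_factory=list)

  def evaluate(tasks, run_agent):
      """Run each task through the agent and collect verifiable metrics.
      run_agent(prompt) is a hypothetical callable returning a transcript with
      .final_answer, .tool_calls, .total_tokens, and .errors attributes."""
      results = []
      for task in tasks:
          start = time.time()
          transcript = run_agent(task["prompt"])
          results.append(EvalResult(
              task_id=task["id"],
              # Verifiable outcome: compare against a known expected answer,
              # not "it seems right".
              passed=task["check"](transcript.final_answer),
              tool_calls=len(transcript.tool_calls),
              tokens_used=transcript.total_tokens,
              runtime_s=time.time() - start,
              errors=list(transcript.errors),
          ))
      return results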


Best practices and guiding principles

Anthropic distills the lessons into a set of design principles. Key among them:

  • Choose tools selectively: Not every API needs to become a tool. Tools should cover high-impact, repeated workflows, not wrap every possible existing endpoint. Consolidate overlapping functionality where possible. 

  • Namespaces and naming clarity: Clear, consistent naming helps agents pick the right tool. Avoid ambiguous names or overlapping functionality. Group related tools under logical prefixes or categories. 

  • Return meaningful, concise context: Tools should return high-signal information. Avoid overwhelming the agent with technical IDs or long metadata unless they’re needed. Consider offering “concise” and “detailed” response modes. 

  • Optimize for token efficiency: Use truncation, filtering, and pagination. Prompt agents to use fewer tool calls or more precise queries. Keeping context lean makes downstream tasks more reliable. 

  • Clear tool specs and descriptions: Explicit parameter naming, clear input/output formats, and good examples. Prompt engineering of tool descriptions can significantly impact performance (a minimal sketch follows this list). 
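
To ground these principles, here is a minimal sketch of a tool definition using the MCP Python SDK's FastMCP helper (the import path and decorator are assumed from the SDK; the “orders” service, parameters, and backend stub are invented for illustration). It shows prefix-based namespacing, a concise-by-default response mode, and pagination for token efficiency.

  from mcp.server.fastmcp import FastMCP

  mcp = FastMCP("orders")  # hypothetical service; related tools share the "orders_" prefix

  def _query_orders(email: str, status: str, page: int, page_size: int) -> list[dict]:
      # Stub standing in for the real backend query.
      return [{"id": "A-1001", "status": status, "total": "$42.00"}]

  @mcp.tool()
  def orders_search(
      customer_email: str,
      status: str = "open",
      response_format: str = "concise",  # "concise" or "detailed"
      page: int = 1,
      page_size: int = 20,               # pagination keeps responses token-efficient
  ) -> str:
      """Search a customer's orders. Returns a short, high-signal summary by
      default; pass response_format="detailed" for full metadata."""
      orders = _query_orders(customer_email, status, page, page_size)
      if response_format == "concise":
          return "\n".join(f"{o['id']}: {o['status']} ({o['total']})" for o in orders)
      return "\n".join(str(o) for o in orders)

The point is less the specific schema than the habits it encodes: one well-named tool for a high-impact workflow, defaults that keep responses lean, and an escape hatch for detail when the agent needs it.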


Why this matters

Tools shape what agents can do. When tools are poorly described, overly broad, or return huge dumps of irrelevant context, agents waste resources, produce hallucinations, or fail to orchestrate workflows successfully. On the other hand, well-designed tools reduce ambiguity, token use, and errors, and let agents scale reliably across real-world tasks.

Especially as agents connect to many tools (hundreds via MCP servers), these design principles become the difference between brittle behavior and something that feels reliable and intuitive. Anthropic’s experience shows that many improvements come not from changing the LLM itself but from refining the tools around it.


If you’re building agent tools or service/tool APIs for agents, following Anthropic’s workflow—prototype → evaluate → iterate—and using clear naming, context-efficient returns, and good documentation will set you up for tools agents actually use well.

Link: https://www.anthropic.com/engineering/writing-tools-for-agents

11.9.25

Parallel-R1: Teaching LLMs to reason from multiple angles—permanently

 Modern large language models (LLMs) often reason sequentially—one thought chain at a time. Parallel thinking, in contrast, involves spawning multiple reasoning paths (or perspectives), then merging the insights. While prompting tricks can induce this behavior at inference, they carry heavy overhead and brittle generalization. Parallel-R1, a new paper by Tencent AI Lab Seattle with collaborators, pioneers a training-time RL framework for instilling parallel thinking as a native reasoning strategy. 


What is Parallel-R1

The key idea: don’t just prompt models to use parallel paths—train them to do so. Parallel-R1 has a progressive curriculum:

  1. Cold start (format learning via SFT) — teach the model the syntax/tags of parallel blocks (e.g. <Parallel>, <Path>...</Path>, <Summary>), using easier math problems (GSM8K) where high-quality parallel traces are easy to generate.

  2. Reinforcement learning (RL) on easy tasks, to explore the use of parallel thinking, with a reward that combines correctness and use of the parallel structure (a rough sketch of such a reward follows this list). 

  3. RL on more difficult problems (e.g. DAPO, AMC, AIME), so the model generalizes both performance and the parallel thinking style. 
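
As a rough illustration of the stage-2 signal (an assumption-laden sketch, not the paper's code): reward correctness, check for the parallel tags, and alternate which component is emphasized on a fixed schedule (the “alternating ACC/PAR” idea discussed in the results below).

  def parallel_r1_reward(response: str, predicted: str, gold: str, step: int,
                         alternate_every: int = 50) -> float:
      """Illustrative reward for parallel-thinking RL (not the paper's exact code).
      Combines answer correctness with whether the <Parallel>/<Path>/<Summary>
      format was used; an alternating schedule switches between an accuracy-only
      (ACC) phase and a phase that also rewards structure (PAR)."""
      correct = float(predicted.strip() == gold.strip())
      tags = ("<Parallel>", "<Path>", "</Path>", "<Summary>")
      used_parallel = float(all(tag in response for tag in tags))
      if (step // alternate_every) % 2 == 0:
          return correct                              # ACC phase: correctness only
      return 0.5 * correct + 0.5 * used_parallel      # PAR phase: also reward structure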

The architecture has two variants: a causal (structure-agnostic) version and a structured version. The structured version modifies the attention mechanism (via path-window masking, separate position encodings) so paths are more isolated during reasoning. But structured variants show trade-offs—good for generalization in some settings, but less robust under distribution shift.


Results & gains

On a battery of math benchmarks (MATH, AMC23, AIME24, AIME25), Parallel-R1 shows consistent improvements:

  • The “Seen” variant (causal) achieves ~48.9% average across benchmarks (Mean@16 / Pass@16, etc.), beating baseline GRPO RL on general math tasks. 

  • In particular, on AIME’25, Parallel-R1 raises accuracy by ~8.4% over a purely sequential RL model trained on the harder tasks directly. 

  • The structured (Unseen) variant also performs well under certain reward schedules; the “alternating ACC/PAR” reward schedule (switching between rewarding correctness and parallel structure periodically) helps balance parallel usage and performance. 

Beyond numerical gains, the authors observe a behavioral shift: early in training, the model heavily uses parallel paths as an exploration tool, branching in many places; as the model becomes stronger, it shifts to using parallel paths more conservatively, mostly for verification near the end of reasoning. This shift correlates with stronger final performance. 


Why this matters

  • Performance & efficiency trade-off: Parallel-R1 shows that training models for parallel thinking can yield higher reasoning ability without ballooning inference cost (since only when needed are parallel paths triggered).

  • Better than imitation: Many earlier works used supervised fine-tuning on synthetic parallel reasoning traces under teacher forcing; but those often over-fit to particular patterns. RL in Parallel-R1 helps models learn to decide when parallel paths help, not just how to mimic them.

  • Scaffolding exploration: The cold-start + easy tasks + alternating reward strategy functions as a scaffold, enabling RL to find a stronger policy space than direct RL on hard tasks.

  • Architecture designs matter: The structured variant shows that attention masking and position encodings can help or hurt depending on how well training data matches deployment tasks.


Limitations & future directions

  • The gains, though significant, still leave room for improvement before reaching human-level performance on very hard math tasks.

  • The structured variants can struggle under domain shift; care is needed with architectural changes that assume particular path structures.

  • Triggering parallel thinking (using <Parallel> blocks) adds some token and compute overhead, though the model learns to use it more sparingly over time.

  • There’s a tension between pushing for parallel structure (which encourages exploration) and maximizing accuracy (which sometimes pushes toward fewer divergences). Reward engineering is delicate.


Bottom line: Parallel-R1 is a breakthrough toward training LLMs that think in parallel, not just deeper. By combining curriculum learning, structured or causal variants, and reinforcement learning with rewards for both correctness and reasoning style, it unlocks better performance on challenging math tasks. As reasoning benchmarks and applications demand both correctness and robustness, methods like this will likely become a standard part of the toolkit.

Paper link: arXiv 2509.07980 (PDF)

The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting

 When logic is tricky, the most common answer isn’t always the correct one. A new Meta FAIR & CMU paper titled “The Majority is not always right: RL training for solution aggregation” challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method—AggLM—trains a dedicated aggregator model to review, correct, and synthesize among multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority voting and reward-model baselines. 


Solving it: learned reconciliation vs. counting

Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that's most frequent (majority voting) or highest scored by some reward model. While effective in many settings, these methods have a blind spot—when correct answers exist only among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, analyzes them, spots mistakes or partial correctness, then combines ideas or fills in missing steps to produce a final solution. Importantly, it’s trained with verifiable rewards—i.e., it is rewarded only when the aggregated output matches a known correct solution. 


Key ingredients & experiments

  • Dataset & training: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions. 

  • Aggregation via RLVR: The model uses Group-Relative Policy Optimization (GRPO) with a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model and tuned with this RL signal (see the sketch after this list). 

  • Benchmarks: Evaluated on four math contest datasets: AIME24, AIME25, HMMT24, HMMT25. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes. 
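
To make the RLVR setup concrete, here is a rough sketch (hypothetical prompt wording and helper names, not the paper's code) of how candidate solutions could be packed into an aggregation prompt and scored with the binary verifiable reward:

  def build_aggregation_prompt(question: str, candidates: list[str]) -> str:
      """Pack candidate solutions into one prompt for the aggregator model."""
      numbered = "\n\n".join(
          f"Candidate solution {i + 1}:\n{sol}" for i, sol in enumerate(candidates)
      )
      return (
          f"Problem:\n{question}\n\n{numbered}\n\n"
          "Review these candidates, identify and correct any mistakes, and "
          "produce one final solution ending with the final answer."
      )

  def verifiable_reward(aggregated_answer: str, ground_truth: str) -> float:
      """Binary RLVR signal: 1 if the aggregated final answer matches the known
      correct answer, 0 otherwise."""
      return float(aggregated_answer.strip() == ground_truth.strip())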


Results & token-efficiency

  • On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly. For example, on AIME25 it reaches 50.0% when aggregating 8 candidates, ahead of majority voting over the same candidates (the ~67.9% figure for majority voting comes from a different evaluation variant). More striking, when aggregating solutions from the stronger 8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines. 

  • In non-thinking modes (i.e. when the candidate-generating model is weaker or does not use chain-of-thought reasoning), AggLM retains its lead—showing that it generalizes beyond just cherry-picking strong or specifically-formatted inputs. 

  • Regarding cost, AggLM is more token efficient: instead of needing large numbers of candidate solutions (i.e. very large k) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute. 


Implications & what’s next

AggLM shifts thinking in several ways:

  1. Aggregation as reasoning. Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.

  2. Balancing examples is key. Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers. 

  3. Generalization beyond training generators. AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions. 

  4. Efficiency trade-off. Instead of scaling k (the number of solutions) to very high values, using a learned aggregator yields larger gains per additional candidate, meaning a better accuracy ceiling for a given token/time budget. 


Bottom line: AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to look across candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.

Paper link: arXiv 2509.06870 (PDF)

ParaThinker: parallel minds beat longer monologues

 LLMs have ridden the test-time-compute wave—“think longer” chains of thought—but returns taper as early tokens lock models into bad trajectories. Tsinghua’s ParaThinker calls this Tunnel Vision and proposes native thought parallelism: generate several independent reasoning paths simultaneously, then fuse them into one answer. 

Instead of external voting, ParaThinker trains the model itself to branch and merge: specialized control tokens (<think i>) trigger distinct trajectories, path-specific positional embeddings keep streams separate, and a two-phase attention mask enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs. 
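
A rough sketch of the two-phase masking idea (illustrative only; the real implementation works inside the attention kernels with path-specific positional embeddings and KV-cache reuse rather than an explicit dense mask):

  import numpy as np

  def two_phase_mask(prompt_len: int, path_lens: list[int], summary_len: int) -> np.ndarray:
      """Boolean attention mask (True = may attend). Thinking phase: each path sees
      the shared prompt and itself only, keeping trajectories independent.
      Summarization phase: summary tokens see everything. Causal ordering within
      each region is omitted for brevity."""
      total = prompt_len + sum(path_lens) + summary_len
      mask = np.zeros((total, total), dtype=bool)
      mask[:, :prompt_len] = True                              # everyone attends to the prompt
      start = prompt_len
      for plen in path_lens:
          mask[start:start + plen, start:start + plen] = True  # a path attends only to itself
          start += plen
      mask[start:, :] = True                                   # summary tokens attend to all paths
      return mask

  # e.g. two_phase_mask(prompt_len=10, path_lens=[20, 20, 20], summary_len=15)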

On AIME-24/25, AMC-23 and MATH-500, ParaThinker with 8 parallel paths boosts accuracy by +12.3 pts (1.5B) and +7.5 pts (7B) over sequential baselines under the same token budget, and still beats majority voting by +4.3/+2.0 pts—with only ~7.1% latency overhead. Generating up to 16 paths costs <2× single-path latency, thanks to better arithmetic intensity on GPUs. 

The takeaway: scale width, not just depth. ParaThinker shows that orchestrating compute across diverse, parallel thoughts unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub. 

Paper link: arXiv 2509.04475 (PDF)

10.9.25

TraceRL puts diffusion LLMs on the reasoning map

 Autoregressive (AR) giants have dominated reasoning benchmarks, while diffusion language models (DLMs) were seen as “fast samplers” with limited logic chops. A new paper from Princeton and UChicago argues that’s mostly a training-objective problem—and offers TraceRL, a trajectory-aware reinforcement learning framework that aligns what a DLM learns with how it actually samples. The team also releases code and ready-to-run models under the TraDo banner. 

What’s new

  • Trajectory-aware RL for DLMs. Instead of scoring randomly masked sequences, TraceRL optimizes against the model’s intermediate inference traces, matching the left-to-right / blockwise behavior used at decode time. A diffusion-based value model stabilizes training by reducing variance (a simplified sketch appears after this list). Crucially, the method works for both full-attention and block-attention DLMs. 

  • Open stack. The release includes a framework to build/train/deploy DLMs across architectures, with KV-cache acceleration, inference engines, SFT + RL recipes for math and code, and links to TraDo-4B/8B checkpoints. 
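
A heavily simplified sketch of the trajectory-aware idea (placeholder interfaces, not the released framework's API): record which tokens the sampler actually unmasked at each step, score the final output, and weight each step's log-probabilities by an advantage computed against the value model's estimate.

  import torch

  def trace_rl_loss(policy, value_model, trace, final_reward):
      """Illustrative trajectory-aware policy-gradient loss for a diffusion LM.
      trace is a list of (masked_state, positions, tokens) tuples recording, for
      each decoding step, which positions were unmasked and with which tokens.
      policy(state) and value_model(state) are hypothetical callables returning
      per-position token logits and a scalar value estimate, respectively."""
      policy_terms, value_terms = [], []
      for state, positions, tokens in trace:
          logits = policy(state)                                   # [seq_len, vocab]
          logp = torch.log_softmax(logits[positions], dim=-1)
          chosen_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum()

          baseline = value_model(state)                            # variance-reducing baseline
          advantage = (final_reward - baseline).detach()

          policy_terms.append(-advantage * chosen_logp)            # reinforce the sampled trace
          value_terms.append((final_reward - baseline) ** 2)       # regress value toward reward

      return torch.stack(policy_terms).mean() + torch.stack(value_terms).mean()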

The receipts

On headline benchmarks (dynamic vs. static sampling shown in the paper), the TraDo models post the strongest DLM numbers to date and overtake AR peers at similar scale on math:

  • TraDo-8B-Instruct: MATH500 78.5, AIME’24 13.3, LCB-V2 25.9—a +6.1% relative lift over Qwen2.5-7B-Instruct and +51.3% over Llama-3.1-8B-Instruct on math reasoning. 

  • TraDo-4B-Instruct: MATH500 75.6, AIME’24 10.3, LCB-V2 18.7, consistently edging 7B AR baselines on math. 

  • TraDo-8B-Thinking (long-CoT): first long chain-of-thought diffusion LLM, hitting MATH500 87.4, AIME’24 35.5, LCB-V2 34.6 with very long answers. 

The authors attribute gains to objective/trajectory alignment and show smoother curves with the value model vs. policy-only RL. They also document a speed/accuracy trade-off: dynamic sampling is faster; static top-1 decoding squeezes out extra points. 

Why it matters

  1. DLMs aren’t just “fast”—they can reason. With the right RL target, parallel generation stacks clear long-form math and coding hurdles previously ceded to AR. 

  2. Unifies the zoo. One RL recipe spans full-attention and block-diffusion, and even helps enlarge block size for more flexible sampling. 

  3. Practical path. The open framework + KV-cache tricks make DLM post-training and deployment feel product-ready, not just a lab exercise. 

Setup notes

Math RL uses 8k hard MATH tasks; coding RL uses 6k verified problems from PrimeIntellect. Long-CoT training mixes TraceRL with long-form SFT as a curriculum. 

Bottom line: TraceRL reframes diffusion LLMs as credible reasoners, not just fast generators—and TraDo-8B-Thinking plants the first long-CoT flag on the DLM side of the field. 

Paper link: arXiv 2509.06949 (PDF)

Language Self-Play: training an LLM without adding data actually works

 LLMs keep getting better by eating more data—until the data well runs dry. A new paper from Meta Superintelligence Labs proposes Language Self-Play (LSP): turn training into a game where a single model plays both sides—a Challenger that generates tougher prompts and a Solver that answers them—so the system improves without ingesting new datasets. In tests on AlpacaEval using Llama-3.2-3B-Instruct, LSP matches a strong data-driven RL baseline and even pushes beyond it when used as a follow-on stage. 

How it works: one model, two roles

LSP frames training as a minimax game: Challenger tries to minimize reward by making hard queries; Solver tries to maximize reward by answering them. Crucially, both roles are instantiated by the same LLM via a role-selecting prompt (e.g., a special challenger prompt), avoiding the instability and memory overhead of training an external adversary. KL regularization keeps the Challenger from devolving into nonsense prompts. 

Under the hood, LSP borrows group-relative baselines from GRPO: Challenger generates N queries, Solver samples G answers per query, and the average reward defines both a per-answer advantage (for Solver) and a “difficulty” signal (for Challenger). A practical variant, LSP-Zero, runs as a pure zero-sum game; the full LSP adds a quality self-reward scored by a reference model to prevent reward-hacking (e.g., answering everything in Python). 
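
In rough code (an illustrative sketch, not Meta's implementation), the group-relative bookkeeping looks like this: the mean reward over the Solver's G answers to one Challenger query serves as the baseline for each answer's advantage, and its negation serves as the Challenger's difficulty signal.

  import statistics

  def group_relative_signals(rewards: list[float]) -> tuple[list[float], float]:
      """Given the rewards of G Solver answers to one Challenger query, return
      (per-answer advantages, Challenger difficulty signal). The Solver is pushed
      toward answers that beat the group mean; the Challenger is pushed toward
      queries with a low group mean, i.e. queries the Solver finds hard."""
      mean_r = statistics.mean(rewards)
      std_r = statistics.pstdev(rewards) or 1.0   # avoid dividing by zero
      advantages = [(r - mean_r) / std_r for r in rewards]
      difficulty = -mean_r                        # harder query => lower mean reward
      return advantages, difficulty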

Results: data-free ≈ data-driven—and sometimes better

Using GPT-4o as judge on AlpacaEval, the team compares models trained from the same base:

  • From base (no data): Overall win rates vs. the base model—GRPO (with data) 40.9%, LSP-Zero 40.1%, LSP 40.6%. Translation: self-play without any RL data keeps pace with standard RL. 

  • From RL (as a next stage): Starting from the GRPO model and continuing with self-play, LSP lifts overall win rate to 43.1%, with large gains on Vicuna-style conversational tasks (28.7% → 46.3%). 

The setup uses Skywork-Reward-V2-Llama-3.2-3B as the reward model; the authors note that LSP (with the added quality reward) avoids the degradation seen with LSP-Zero in some splits, and acknowledge dips on “chatbot-y” Koala prompts—likely because Challenger skews toward structured, orderly instructions. 

Why this matters

  • Data bottleneck relief. If you can translate “more practice data” into a self-generated curriculum, you can keep improving without chasing new corpora. 

  • A clean follow-on stage. Even after data-based RL, self-play adds headroom—useful when further high-quality preference data is scarce. 

  • Single-model simplicity. One backbone serves both roles, avoiding adversary models and the instability they bring. 

Caveats and open questions

Self-play can degenerate without the quality self-reward; reward choice caps the ceiling (a weak reward model means weak training signal); and Challenger diversity remains an open knob to broaden beyond the structured style seen in examples. Still, the authors argue the method should work even better on tasks with verifiable rewards (e.g., code tests), not just preferences. 

If your roadmap hits a data wall, Language Self-Play is a compelling new leg in the post-training pipeline: spin up a Challenger inside your own model, let it stress-test itself, and learn—no fresh dataset required.

Paper link: arXiv 2509.07414 (PDF)

An AI that writes expert-level scientific software—and often beats the leaderboard

 A large Google team is pushing past “chatty copilot” and into AI that authors working scientific code. Their system pairs a large language model with tree search to iteratively write, run, and score programs for scorable research problems—then learns to recombine ideas from papers and prior algorithms. In benchmarks, it discovered 40 new single-cell RNA-seq methods that outperformed the top human-made entries on OpenProblems, and produced 14 COVID-19 hospitalization forecasters that beat the CDC’s ensemble and every individual competitor during the study window. 

How it works. Researchers frame a scientific task as “maximize a quality metric,” let the LLM generate code variants, and use tree search to expand promising branches while pruning the rest. The agent can ingest research ideas from literature (summarized with Gemini 2.5 Pro) and also tries automatic recombinations of methods, plus proposals from Gemini Deep Research and AI co-scientist tools. In head-to-head tests on nine published algorithms, the system’s implementations beat eight of nine baselines; its best run—BBKNN(TS)—improved the bioinformatics leaderboard by 14% over the long-standing ComBat approach. 
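
The core loop can be pictured as a small best-first search over programs (hypothetical helper names; the real system layers literature-derived ideas, recombination, and sandboxed execution on top of this skeleton):

  import heapq
  import itertools

  def search_programs(seed_code, propose_variants, score, budget=500, branch=4):
      """Best-first tree search over candidate programs.
      propose_variants(code, k) asks an LLM for k rewrites of code;
      score(code) runs the program and returns the task's quality metric.
      Both are hypothetical stand-ins for the paper's infrastructure."""
      counter = itertools.count()                  # tie-breaker so the heap never compares code
      best_code, best_score = seed_code, score(seed_code)
      frontier = [(-best_score, next(counter), seed_code)]
      evaluated = 1

      while frontier and evaluated < budget:
          _, _, code = heapq.heappop(frontier)     # expand the most promising node
          for variant in propose_variants(code, branch):
              s = score(variant)
              evaluated += 1
              if s > best_score:
                  best_code, best_score = variant, s
              heapq.heappush(frontier, (-s, next(counter), variant))

      return best_code, best_score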

Bioinformatics at scale. The team evaluates on OpenProblems v2.0.0, spanning 1,747,937 cells and 13 metrics across six datasets. Beyond re-implementing published methods, recombination mattered: among 55 pairwise hybrids, 24 outperformed both parents and most others beat at least one—evidence that the search can synthesize competitive, novel ideas rather than just tune hyperparameters. 

Public-health forecasting. For U.S. COVID-19 hospitalization forecasting (the CDC’s Forecast Hub), the system generated models that were consistently lower-error (better WIS) than the official ensemble in most jurisdictions; in an aggregate comparison, 14 strategies (10 recombinations, plus two Deep Research, one AI co-scientist, and one replicated baseline) surpassed the ensemble across the three-week hold-out period. 

Not just biology. The abstract lists additional wins in geospatial image segmentation, zebrafish neural activity prediction, general time-series, and numerical integration, arguing the approach generalizes to diverse “empirical software” problems where code can be scored automatically. 

Engineering notes—and guardrails. To avoid overfitting, bio experiments hill-climb on a separate CELLxGENE dataset and report on the held-out OpenProblems benchmark; metrics that fail to compute are clamped to worst-case—making robustness part of the score. The team also ran multiple replicates to show stability, and reports practical budgets: ≈500 nodes (~7 hours) per scRNA-seq search and ≈2000 nodes per COVID run on their infra. 

Why it matters. Rather than waiting for domain-specific code to be hand-crafted over months, this “AI co-scientist” produces working software, tests it against public leaderboards, and composes new hybrids from the literature. If those patterns hold beyond the reported tasks, the future of scientific computing looks less like prompt engineering—and more like searching the space of programs.

Paper link: arXiv 2509.06503 (PDF)

Embedding retrievers hit a math wall—and DeepMind just mapped it

 Vector embeddings power everything from RAG to enterprise search. But a new DeepMind paper argues there’s a theoretical ceiling baked into single-vector retrieval: for any embedding dimension d, there exist query-document relevance patterns that no embedding model can represent—no matter the data or training tricks. The authors connect learning-theory and geometric results to IR and then build a deliberately simple dataset, LIMIT, where leading embedders struggle. 

The core result, in plain English

Treat each query’s relevant docs as a row in a binary matrix (“qrels”). The paper introduces row-wise thresholdable rank and lower-bounds it via sign-rank to show a fundamental limit: once the number of documents n crosses a critical threshold for a given d, there exist top-k sets that cannot be realized by any single-vector embedding retriever. That’s a property of geometry, not optimization. 

LIMIT: a toy task that breaks real systems

To make the math bite, the team instantiates LIMIT with natural-language facts (“Jon Durben likes quokkas and apples…”) that encode all combinations of relevance over a small doc pool. Despite its simplicity, SoTA MTEB models score <20 recall@100, while classic BM25 is near-perfect—underscoring that the failure is specific to single-vector embedding retrieval. 

In a “small” LIMIT (N≈46) sweep, ramping dimensions up to 4096 lifts recall but still doesn’t solve the task; BM25 cruises to 100% at @10/@20. Fine-tuning on in-domain LIMIT data barely helps, indicating intrinsic hardness, not domain shift. 

How this differs from usual benchmark talk

LIMIT’s structure—dense overlap of query relevances—looks nothing like BEIR or typical web QA. Compared across datasets, LIMIT shows far higher “graph density” and query-similarity strength than NQ, HotpotQA, or SciFact, approximating instruction-following IR where prompts combine unrelated items with logical operators. 

Numbers that sting

A table of critical document counts shows how quickly trouble arrives as d grows (e.g., d=4 ⇒ n≈10; d=16 ⇒ n≈79; d=32 ⇒ n≈296). Put differently: long before you reach enterprise-scale corpora, some seemingly trivial “return docs X and Y, not Z” requests fall outside what an embedder can express. 

What to do about it (and what not to)

  • Don’t only crank up dimension. Bigger d delays but doesn’t remove the wall. 

  • Consider alternative architectures. Multi-vector approaches (e.g., ColBERT-style), sparse methods, or hybrid stacks escape parts of the limit that bind single-vector embedders. The paper’s head-to-heads hint why BM25 and multi-vector models fare better. 

  • Test against LIMIT-style stressors. The team released datasets on Hugging Face and code on GitHub to reproduce results and probe your own models. 

Why this matters for RAG and instruction-following IR

Modern agents increasingly ask retrieval systems to honor combinational and logical constraints (“find papers that mention A and B but not C”). The paper shows there’s a mathematical point where single-vector embedders must fail such patterns—explaining why teams often paper over issues with rerankers and handcrafted filters. As instruction-following IR grows, expect more LIMIT-like cases in the wild. 

Bottom line: embedding-only retrieval won’t scale to every notion of relevance. If your roadmap leans on expressive, compositional queries, plan for hybrid retrieval and reranking—and add LIMIT to your eval suite.

Paper link: arXiv 2508.21038 (PDF)

9.9.25

UDR turns “deep research” into a programmable product feature

 Most “deep research” agents hard-code their plan and lock you into one LLM. Universal Deep Research (UDR) proposes a different deal: you supply the model and the method. UDR wraps around any LLM and lets users create, edit, and refine fully custom research strategies—no extra training required. Think of it as a general-purpose chassis for web-scale and enterprise research that you can rewire on the fly. 

Why this matters

Today’s tools (Gemini, Perplexity, OpenAI/Grok deep research, and enterprise stacks like NVIDIA AI-Q, SambaNova, ERP-AI) ship opinionated pipelines that work—but are hard to reshape, mix, or upgrade with a different backbone. UDR targets three pain points: (P1) limited control over sources/costs, (P2) no way to encode specialized industry workflows, and (P3) inability to swap in the newest model independently of the agent.


How UDR works (in plain English)

1) Strategy → code.
You write a numbered strategy in natural language. UDR compiles it into a single callable function that emits structured progress updates via yield and constrains tool use to what you allow. The paper found “one-shot, end-to-end” code generation—annotated step-by-step—was far more reliable than fragmentary orchestration. 

2) Isolated execution with small contexts.
Instead of stuffing a giant context window, UDR stores interim artifacts as named variables in the execution state. In experiments, 8k tokens was enough for full workflows, because the controller code (CPU-side) keeps state while the LLM is invoked only for local tasks (summarize, rank, extract). Tools are synchronous function calls for deterministic behavior. 

3) Transparent progress + auditable outputs.
Your strategy defines notifications (type, timestamp, description) that stream to the UI during the run, and a final “research report” built from the accumulated state—with citations and formatting under your control. 

4) Safety by design.
Because UDR executes generated code, it’s meant to run inside a sandbox (e.g., Piston) so strategies can’t touch the host system—mandatory for anything beyond a trusted demo. 
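
A toy sketch of what a compiled strategy might look like (invented notification shape, tool names, and state keys; UDR generates code of this kind from a numbered natural-language strategy):

  import time

  def research_strategy(topic, tools, llm):
      """Hypothetical compiled strategy: yields progress notifications and keeps
      interim artifacts in named state variables instead of one giant context."""
      state = {}

      # Step 1: gather sources through an explicitly allowed, synchronous tool call.
      yield {"type": "progress", "timestamp": time.time(), "description": f"Searching for: {topic}"}
      state["sources"] = tools["web_search"](topic, max_results=10)

      # Step 2: summarize each source with a small, focused LLM call (local task only).
      yield {"type": "progress", "timestamp": time.time(), "description": "Summarizing sources"}
      state["summaries"] = [llm(f"Summarize in 3 bullet points:\n{doc}") for doc in state["sources"]]

      # Step 3: build the final report from accumulated state, with citations under your control.
      yield {"type": "progress", "timestamp": time.time(), "description": "Drafting report"}
      report = llm("Write a cited research report from these summaries:\n" + "\n".join(state["summaries"]))
      yield {"type": "final_report", "timestamp": time.time(), "description": report}

The host application simply iterates over the generator, streaming each notification to the UI and rendering the final report when it arrives.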


What you can build with it

The authors ship minimal, expansive, and intensive example strategies plus a simple UI: search bar for prompts, a strategy picker, and an editor to tweak steps—handy for teams iterating on domain-specific research recipes (finance, legal, healthcare). 


The headline advantages

  • BYO model, BYO strategy. Pair the strongest available LLM with your best research recipe—no re-training loops. 

  • Latency & cost discipline. Orchestration runs as code on CPU; the LLM is called sparingly on focused snippets, reducing GPU churn and token spend. 

  • Deterministic tool use. Explicit, synchronous calls and stateful variables curb flaky agent behaviors like skipping steps or re-scraping needlessly. 


Big picture

Deep research tools are already popular, but strategy rigidity and model lock-in limit how far they go inside enterprises. UDR reframes the agent as a compiler/runtime: you specify the plan, the system turns it into constrained code, and any LLM can power the reasoning. For builders eyeing compliance-friendly, auditable research automation, that’s a compelling foundation. 

Paper link: arXiv 2509.00244 (PDF)

What Claude offers now

From Anthropic’s announcements: Creates and edits real files directly in chats or the desktop app: Excel (.xlsx)...