
2.9.25

Memento: teach agents to learn on the fly—no LLM fine-tune required

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep the LLM frozen and adapt the agent with a memory that learns from every episode. The team formalizes this as a Memory-augmented MDP and shows it can lift real-world “deep research” performance—without gradient updates to the underlying model. 

The recipe in one diagram

Memento is a planner–executor architecture wired to a growing Case Bank of episodic traces (state, action, reward). At each step, the planner retrieves similar past cases to guide the next action; after acting, the trajectory (success or failure) is written back—so the memory rewrites itself with environmental feedback. Retrieval can be non-parametric (Top-K by similarity) or parametric via a lightweight Q(s, c) scorer trained online to prefer high-utility cases. Tools are accessed through an MCP-style interface so the executor can browse, run code, or call APIs inside the same loop. 
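
To make the loop concrete, here is a minimal sketch of a Case Bank with non-parametric Top-K retrieval and write-back. This is illustrative Python, not the authors' implementation; the `embed` callable is a stand-in for whatever sentence-embedding model you already run.

```python
# Minimal sketch of a Memento-style Case Bank (illustrative, not the paper's code).
from dataclasses import dataclass
import math

@dataclass
class Case:
    state: str      # task / sub-goal description
    action: str     # what the planner decided to do
    reward: float   # environment feedback (e.g., 1.0 success, 0.0 failure)
    vec: list       # embedding of `state`, cached at write time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

class CaseBank:
    def __init__(self, embed):
        self.embed = embed          # hypothetical embedding callable: str -> list[float]
        self.cases: list[Case] = []

    def retrieve(self, state: str, k: int = 4) -> list[Case]:
        """Non-parametric Top-K retrieval by similarity to the current state."""
        q = self.embed(state)
        ranked = sorted(self.cases, key=lambda c: cosine(q, c.vec), reverse=True)
        return ranked[:k]

    def write(self, state: str, action: str, reward: float) -> None:
        """Write back the trajectory step, success or failure alike."""
        self.cases.append(Case(state, action, reward, self.embed(state)))
```

The planner prepends the retrieved cases to its prompt, the executor acts through tools, and the resulting (state, action, reward) tuple is written back, so the memory adapts while the LLM stays frozen.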

Why this beats “prompt more” and “train more”

Unlike static RAG or handcrafted reflections, case-based reasoning (CBR) selectively reuses successful and failed traces; unlike RL-fine-tuning, it avoids catastrophic forgetting and heavy compute. In ablations, adding CBR memory yields +4.7 to +9.6 absolute points on out-of-distribution QA sets (MuSiQue, Bamboogle, PopQA). 

The receipts

  • GAIA (long-horizon tool use): Top-1 on validation (87.88% Pass@3) and 79.40% on the private test leaderboard. 

  • DeepResearcher (live web research): 66.6 F1 / 80.4 PM, outperforming training-based systems under the paper’s setup. 

  • SimpleQA (single-hop factual): 95.0 PM, the highest among reported baselines. 

  • Humanity’s Last Exam (HLE): 24.4 PM, second overall and within 0.92 of GPT-5 in the authors’ evaluation. 

What this means for builders

  • Ship updates without re-training. Treat memory as the learning substrate; leave your production LLM untouched. 

  • Choose your memory: start with non-parametric retrieval; add the parametric Q-head when you need sharper case selection (a minimal sketch of such a scorer follows this list).

  • Tooling that scales. MCP-based execution keeps multi-tool orchestration inside one protocol, making traces coherent and reusable. 
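
For the second bullet above, here is a hedged sketch of what a lightweight parametric scorer Q(s, c) could look like: a single linear layer over concatenated state and case embeddings, updated online from episode reward. The paper describes a learned Q-head trained online; the exact architecture and loss below are my simplification, not the authors' code.

```python
# Illustrative parametric case scorer Q(s, c), updated online from reward (a sketch).
import numpy as np

class QScorer:
    def __init__(self, dim: int, lr: float = 0.01):
        self.w = np.zeros(2 * dim)   # weights over [state_vec ; case_vec]
        self.b = 0.0
        self.lr = lr

    def score(self, state_vec, case_vec) -> float:
        """Estimated utility of reusing this case in this state, in (0, 1)."""
        x = np.concatenate([state_vec, case_vec])
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, state_vec, case_vec, reward: float) -> None:
        """One SGD step: push the score toward the observed utility (reward in [0, 1])."""
        x = np.concatenate([state_vec, case_vec])
        grad = self.score(state_vec, case_vec) - reward
        self.w -= self.lr * grad * x
        self.b -= self.lr * grad
```

At retrieval time you rank candidate cases by `score` instead of raw similarity; after each episode you call `update` on the cases that were actually used, with the observed reward.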

The upshot: Memento reframes “agent improvement” as memory engineering. If your research agent gets better the more it works—without touching base weights—you’ve got a path to continual learning that’s practical outside the lab.

Paper link: arXiv 2508.16153 (PDF)

From AI for Science to Agentic Science: a blueprint for autonomous discovery

If the last decade was about AI as a tool for scientists, the next one may be about AI as a research partner. A sweeping, 74-page survey positions Agentic Science as that next stage: systems that generate hypotheses, design and execute experiments, analyze outcomes, and then refine theories with minimal human steering. The authors organize the field into a practical stack—and back it with domain-specific reviews across life sciences, chemistry, materials, and physics. 

The elevator pitch

The paper argues Agentic Science is Level 3 on a four-level evolution of “AI for Science,” moving from computational oracles (Level 1) and automated assistants (Level 2) toward autonomous partners—and, eventually, “generative architects” that proactively propose research programs (Level 4). It’s a unification of three fragmented lenses—process, autonomy, and mechanisms—into one working framework. 

Five core capabilities every scientific agent needs

  1. Reasoning & planning engines to structure goals, decompose tasks, and adapt plans;

  2. Tool use & integration to operate lab gear, simulators, search APIs, and code;

  3. Memory mechanisms to retain papers, traces, and intermediate results;

  4. Multi-agent collaboration for division of labor and peer review;

  5. Optimization & evolution (skills, data, and policies) to get better over time.

Each has open challenges—e.g., robust tool APIs and verifiable memories—that the survey catalogs with exemplars.

A four-stage scientific workflow, made agentic

The authors reframe the scientific method as a dynamic loop:
(1) Observation & hypothesis generation → (2) experimental planning & execution → (3) analysis → (4) synthesis, validation & evolution, with agents flexibly revisiting stages as evidence arrives. The survey also sketches a “fully autonomous research pipeline” that strings these together end-to-end. 
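
As a mental model, the loop can be sketched as a control flow in which failed validation feeds back into hypothesis generation. Everything below, the `agent` object and all of its methods, is a hypothetical interface for illustration, not something the survey specifies.

```python
# A minimal control-loop sketch of the four-stage workflow (illustrative only).
def run_discovery_loop(agent, max_iterations: int = 10):
    verdict = None
    hypotheses = agent.observe_and_hypothesize()               # stage 1
    for _ in range(max_iterations):
        plan = agent.plan_experiments(hypotheses)              # stage 2: design
        results = agent.execute(plan)                          # stage 2: run sims / lab gear
        findings = agent.analyze(results)                      # stage 3
        verdict = agent.synthesize_and_validate(findings)      # stage 4
        if verdict.supported:                                  # theory survives validation
            break
        # Evidence contradicts the hypothesis: revisit stage 1 with what was learned.
        hypotheses = agent.observe_and_hypothesize(prior=findings)
    return verdict
```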

What’s actually happening in the lab (and sim)

Beyond taxonomy, the paper tours concrete progress: automated multi-omics analysis and protein design in the life sciences; autonomous reaction optimization and molecular design in chemistry; closed-loop materials discovery platforms; and agentic workflows across physics, including cosmology, CFD, and quantum research. The thread tying them together: agents that operate tools (wet-lab robots, DFT solvers, telescopes, or HPC codes), capture traces, and use structured feedback to improve.

Why this survey matters now

  • It’s a build sheet, not just a reading list. By mapping capabilities to workflow stages—and then to domain-specific systems—the paper serves as a blueprint for teams trying to operationalize “AI co-scientists.” 

  • It pushes on verification. Sections on reproducibility, novelty validation, transparent reasoning, and ethics acknowledge the real blockers to trusting autonomous results. 

  • Ecosystem signal. A companion GitHub “Awesome-Agent-Scientists” catalog and project links indicate growing coordination around shared datasets, benchmarks, and platform plumbing. 

How it compares with adjacent work

Other recent efforts survey “agentic AI for science” at a higher altitude or via community workshops, but this paper leans hard into domain-oriented synthesis and a capabilities × workflow matrix, plus concrete exemplars in the natural sciences. Taken together, it helps standardize vocabulary across research and industry stacks now building agent platforms. 

The road ahead

The outlook section pulls no punches: making agents reproducible, auditable, and collaborative is as much socio-technical as it is algorithmic. The authors float big bets—a Global Cooperation Research Agent and even a tongue-in-cheek “Nobel-Turing Test”—to force clarity about what counts as scientific novelty and credit when agents contribute. 

Bottom line: If you’re building AI that does more than summarize papers—systems that plan, run, and iterate on experiments—this survey offers a pragmatic frame: start with the five capabilities, wire them into the four-stage loop, and measure progress with verifiable, domain-specific tasks.

Paper link: arXiv 2508.14111 (PDF)

4.8.25

The Agentic Web: when bots become the primary users of the internet

 Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end. 

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory. 
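
A toy sketch of that transition, from a recommender that ranks options to a planner that decomposes and delegates: the orchestrator and worker agents below are invented placeholders, purely to show the shape of the coordination.

```python
# Illustrative "recommender -> agent planner" sketch with hypothetical agents.
class FlightAgent:
    def run(self, request): return f"flight options for {request}"

class HotelAgent:
    def run(self, request): return f"hotel options for {request}"

class ItineraryOrchestrator:
    """Decomposes a travel goal, delegates sub-tasks, then merges the results."""
    def __init__(self):
        self.workers = {"flights": FlightAgent(), "hotels": HotelAgent()}

    def plan(self, goal: str) -> dict:
        subtasks = {"flights": f"find flights: {goal}",        # naive decomposition
                    "hotels": f"find lodging: {goal}"}
        results = {name: self.workers[name].run(task)          # delegation step
                   for name, task in subtasks.items()}
        return {"goal": goal, "plan": results}                 # merged itinerary draft

print(ItineraryOrchestrator().plan("Tokyo, 5 days in October"))
```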

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots. 
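
To show the shape of this plumbing, here is an illustrative snippet: an MCP-style JSON-RPC tool discovery and call, plus a hypothetical agent-to-agent delegation message. The field and method names are simplified sketches in MCP's JSON-RPC style; consult the MCP specification and the various A2A proposals for the real schemas.

```python
# Illustrative message shapes for the "new pipes"; not verbatim protocol schemas.
import json

# 1) An agent asks an MCP-style server what tools it exposes (capability discovery).
discover = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# 2) The agent invokes one of the discovered tools with structured arguments.
call = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "search_flights",              # hypothetical tool name
               "arguments": {"from": "SFO", "to": "NRT", "date": "2025-10-01"}},
}

# 3) A hypothetical A2A message: one agent delegates a sub-task to a peer and
#    states its constraints, so the peer can accept, counter, or decline.
delegate = {
    "type": "task.delegate",                           # invented for illustration
    "from": "trip-planner-agent",
    "to": "hotel-booking-agent",
    "task": "book 3 nights near Shibuya under $200/night",
    "constraints": {"budget_usd": 600, "deadline": "2025-09-20"},
}

print(json.dumps(call, indent=2))
```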

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows. 

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome? 

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.
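
One concrete flavor of an inference-time guardrail is a policy gate that every proposed tool call must pass before execution. The sketch below is a minimal illustration with an invented tool allowlist and spend cap, not a design taken from the paper; real deployments would layer it with injection scanning, rate limits, and human review.

```python
# Minimal policy gate for agent tool calls (illustrative; tool names are invented).
ALLOWED_TOOLS = {"web_search", "read_file", "book_reservation"}
MAX_SPEND_USD = 100

def approve_tool_call(name: str, args: dict) -> tuple[bool, str]:
    if name not in ALLOWED_TOOLS:
        return False, f"tool '{name}' is not on the allowlist"
    if name == "book_reservation" and args.get("amount_usd", 0) > MAX_SPEND_USD:
        return False, "spend above policy cap; escalate to a human"
    return True, "ok"

# The agent loop only executes calls that pass the gate.
ok, reason = approve_tool_call("book_reservation", {"amount_usd": 250})
print(ok, reason)   # False, with an escalation message
```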

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

15.5.25

MLE-Dojo: A Gym-Style Framework for Training and Evaluating Autonomous Machine Learning Engineering Agents

 In a significant advancement for AI research, Georgia Tech and Stanford University have introduced MLE-Dojo, a Gym-style framework aimed at training, evaluating, and benchmarking autonomous machine learning engineering (MLE) agents. This innovative platform provides a realistic, interactive environment for agents to develop and refine their skills across a wide array of machine learning tasks.


What is MLE-Dojo?

MLE-Dojo is designed to simulate the iterative workflows of human machine learning engineers. It offers an environment where large language model (LLM) agents can write, execute, and debug code, receiving structured feedback to improve their performance over time. The framework is built upon over 200 real-world Kaggle competitions, encompassing diverse domains such as tabular data analysis, computer vision, natural language processing, and time series forecasting. 
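
Conceptually, the interaction follows the familiar Gym loop: reset, act, observe structured feedback, repeat. The sketch below uses hypothetical class and method names to convey that loop; see the project's GitHub repository for MLE-Dojo's actual interface.

```python
# A hedged sketch of a Gym-style MLE interaction loop (not MLE-Dojo's real API).
def run_episode(env, agent, max_steps: int = 20):
    obs = env.reset()                     # task description, data schema, target metric
    reward = 0.0
    for _ in range(max_steps):
        code = agent.propose_solution(obs)          # the LLM writes or edits a script
        obs, reward, done, info = env.step(code)    # execute in a sandbox, score the result
        # obs carries structured feedback: stdout, tracebacks, validation scores, etc.
        if done:
            break
    return reward
```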


Key Features

  • Interactive Environment: Agents engage in a loop of experimentation, debugging, and refinement, closely mirroring real-world engineering processes.

  • Comprehensive Task Suite: With over 200 curated tasks, MLE-Dojo provides a broad spectrum of challenges to test and improve agent capabilities.

  • Modular Architecture: Each task operates within its own Docker container, ensuring safety, reproducibility, and ease of integration with various tools and datasets.

  • Structured Feedback: Agents receive detailed observations, including datasets, execution results, and error messages, facilitating step-by-step learning and improvement.

  • Training Flexibility: Supports both supervised fine-tuning and reinforcement learning, allowing for diverse training methodologies. 


Benchmarking and Evaluation

MLE-Dojo serves as a benchmark to assess the performance of autonomous MLE agents. In evaluations involving eight frontier LLMs, the framework highlighted both the capabilities and limitations of current models, particularly in handling complex, long-horizon tasks and error resolution. 




Implications for AI Research

By providing a realistic and comprehensive environment, MLE-Dojo enables researchers to systematically train and evaluate autonomous agents on machine learning engineering tasks. This framework paves the way for more robust, generalizable, and scalable AI agents capable of handling real-world engineering challenges.


Access and Community Involvement

MLE-Dojo is open-source, encouraging community collaboration and innovation. Researchers and developers can access the framework and contribute to its ongoing development through the official GitHub repository: https://github.com/MLE-Dojo/MLE-Dojo.


Takeaway

MLE-Dojo represents a significant step forward in the training and evaluation of autonomous machine learning engineering agents. By simulating real-world tasks and providing structured feedback, it offers a valuable tool for advancing AI research and developing agents capable of complex problem-solving in dynamic environments.
