
2.9.25

From AI for Science to Agentic Science: a blueprint for autonomous discovery

If the last decade was about AI as a tool for scientists, the next one may be about AI as a research partner. A sweeping, 74-page survey positions Agentic Science as that next stage: systems that generate hypotheses, design and execute experiments, analyze outcomes, and then refine theories with minimal human steering. The authors organize the field into a practical stack—and back it with domain-specific reviews across life sciences, chemistry, materials, and physics. 

The elevator pitch

The paper argues Agentic Science is Level 3 on a four-level evolution of “AI for Science,” moving from computational oracles (Level 1) and automated assistants (Level 2) toward autonomous partners—and, eventually, “generative architects” that proactively propose research programs (Level 4). It’s a unification of three fragmented lenses—process, autonomy, and mechanisms—into one working framework. 

Five core capabilities every scientific agent needs

  1. Reasoning & planning engines to structure goals, decompose tasks, and adapt plans;

  2. Tool use & integration to operate lab gear, simulators, search APIs, and code;

  3. Memory mechanisms to retain papers, traces, and intermediate results;

  4. Multi-agent collaboration for division of labor and peer review;

  5. Optimization & evolution (skills, data, and policies) to get better over time.

Each of these capabilities has open challenges—e.g., robust tool APIs and verifiable memories—that the survey catalogs with exemplars.
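To make the taxonomy concrete, here is a minimal sketch of how the five capabilities might hang together in a single agent object. The ScientificAgent class and its method names are illustrative assumptions, not interfaces defined by the survey.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: names and interfaces are assumptions,
# not APIs from the survey.

@dataclass
class ScientificAgent:
    llm: callable                                  # 1. reasoning & planning engine
    tools: dict = field(default_factory=dict)      # 2. tool use & integration
    memory: list = field(default_factory=list)     # 3. memory mechanisms
    peers: list = field(default_factory=list)      # 4. multi-agent collaboration

    def plan(self, goal: str) -> list[str]:
        # Decompose a research goal into subtasks via the reasoning engine.
        return self.llm(f"Decompose into steps: {goal}").splitlines()

    def call_tool(self, name: str, **kwargs):
        # Operate lab gear, simulators, search APIs, or code runners,
        # and keep a trace of every call in memory.
        result = self.tools[name](**kwargs)
        self.memory.append({"tool": name, "args": kwargs, "result": result})
        return result

    def evolve(self, feedback: str) -> None:
        # 5. Optimization & evolution: store structured feedback that
        # later runs can condition on (skills, data, policies).
        self.memory.append({"feedback": feedback})
```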

A four-stage scientific workflow, made agentic

The authors reframe the scientific method as a dynamic loop:
(1) Observation & hypothesis generation → (2) experimental planning & execution → (3) analysis → (4) synthesis, validation & evolution, with agents flexibly revisiting stages as evidence arrives. The survey also sketches a “fully autonomous research pipeline” that strings these together end-to-end. 
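A rough sketch of that loop, reusing the hypothetical ScientificAgent from the capability sketch above. The stage order follows the paper; the stopping criterion and data structures here are assumptions.

```python
# Hedged sketch of the four-stage agentic loop; the stopping rule and the
# "simulator" tool name are assumptions, not specified by the survey.

def research_loop(agent, observations, max_cycles: int = 5):
    theory = None
    for _ in range(max_cycles):
        # (1) Observation & hypothesis generation
        hypothesis = agent.llm(f"Propose a hypothesis for: {observations}")

        # (2) Experimental planning & execution
        plan = agent.plan(f"Test: {hypothesis}")
        results = [agent.call_tool("simulator", step=s) for s in plan]

        # (3) Analysis
        analysis = agent.llm(f"Analyze {results} w.r.t. {hypothesis}")

        # (4) Synthesis, validation & evolution
        theory = agent.llm(f"Refine theory given {analysis}")
        agent.evolve(analysis)

        # Agents may revisit earlier stages as evidence arrives; here we
        # simply stop once the analysis reports support for the hypothesis.
        if "supported" in analysis.lower():
            break
    return theory
```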

What’s actually happening in the lab (and sim)

Beyond taxonomy, the paper tours concrete progress: automated multi-omics analysis and protein design in the life sciences; autonomous reaction optimization and molecular design in chemistry; closed-loop materials discovery platforms; and agentic workflows across physics, including cosmology, CFD, and quantum research. The thread tying them together: agents that operate tools (wet-lab robots, DFT solvers, telescopes, or HPC codes), capture traces, and use structured feedback to improve.

Why this survey matters now

  • It’s a build sheet, not just a reading list. By mapping capabilities to workflow stages—and then to domain-specific systems—the paper serves as a blueprint for teams trying to operationalize “AI co-scientists.” 

  • It pushes on verification. Sections on reproducibility, novelty validation, transparent reasoning, and ethics acknowledge the real blockers to trusting autonomous results. 

  • Ecosystem signal. A companion GitHub “Awesome-Agent-Scientists” catalog and project links indicate growing coordination around shared datasets, benchmarks, and platform plumbing. 

How it compares with adjacent work

Other recent efforts survey “agentic AI for science” at a higher altitude or via community workshops, but this paper leans hard into domain-oriented synthesis and a capabilities × workflow matrix, plus concrete exemplars in the natural sciences. Taken together, it helps standardize vocabulary across research and industry stacks now building agent platforms. 

The road ahead

The outlook section pulls no punches: making agents reproducible, auditable, and collaborative is as much socio-technical as it is algorithmic. The authors float big bets—a Global Cooperation Research Agent and even a tongue-in-cheek “Nobel-Turing Test”—to force clarity about what counts as scientific novelty and credit when agents contribute. 

Bottom line: If you’re building AI that does more than summarize papers—systems that plan, run, and iterate on experiments—this survey offers a pragmatic frame: start with the five capabilities, wire them into the four-stage loop, and measure progress with verifiable, domain-specific tasks.

Paper link: arXiv 2508.14111 (PDF)

28.8.25

Anemoi: a semi-centralized agent system that lets bots talk to each other—literally

 Most generalist multi-agent stacks still look like a relay race: a central planner prompts specialist workers, who pass back long context blobs for the planner to stitch together. It works—until you downsize the planner or hit token limits. Anemoi proposes a different wiring: keep a light planner, but let agents communicate directly over an Agent-to-Agent (A2A) MCP server so everyone can see progress, flag bottlenecks, and propose fixes in real time. 

What’s actually new

Anemoi replaces unidirectional prompt passing with a threaded A2A server (built on the Model Context Protocol) that exposes primitives like list_agents, create_thread, send_message, and wait_for_mentions. Any agent can join a thread, address peers, and update plans mid-flight—reducing redundant context stuffing and information loss.
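The snippet below sketches how a worker agent might use those four primitives. The a2a client object, its argument names, and the do_subtask helper are assumptions for illustration, not Anemoi's actual interface.

```python
# Hypothetical illustration of the A2A primitives named in the paper
# (list_agents, create_thread, send_message, wait_for_mentions).
# The `a2a` client and its signatures are assumptions, not Anemoi's real API.

def do_subtask(task: str) -> str:
    # Placeholder for worker-specific tool use (web search, doc parsing, code).
    return f"result for: {task}"

def run_worker(a2a, my_name: str):
    agents = a2a.list_agents()                       # discover peers
    thread = a2a.create_thread(participants=agents)  # shared conversation

    while True:
        msg = a2a.wait_for_mentions(agent=my_name, thread=thread)
        if msg is None:
            break                                    # thread closed
        result = do_subtask(msg["content"])
        # Reply in-thread so planner, critique, and peers all see progress.
        a2a.send_message(thread=thread,
                         sender=my_name,
                         mentions=["critique"],
                         content=result)
```

Because every reply lands in the shared thread, the Critique agent can audit intermediate results without the planner re-injecting long context blobs into each prompt.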

The cast of agents (and why it matters)

  • Planner: drafts the initial plan and spins up a thread.

  • Critique: continuously audits intermediate results.

  • Answer-Finder: compiles the final submission.

  • Workers: Web, Document Processing, and Reasoning & Coding—mirroring OWL’s tool set for a fair head-to-head. All are MCP-enabled so they can monitor progress and coordinate directly. 

This design reduces reliance on one overpowered planner, supports adaptive plan updates, and cuts token overhead from repeated context injection.

Numbers that move the needle (GAIA validation)

Framework              Planner / Workers        Avg. Acc.
OWL-rep (pass@3)       GPT-4.1-mini / GPT-4o    43.64%
OWL (paper, pass@3)    GPT-4o-mini / GPT-4o     47.27%
Anemoi (pass@3)        GPT-4.1-mini / GPT-4o    52.73%

With a small planner (GPT-4.1-mini), Anemoi tops a strong open-source baseline by +9.09 points under identical tools and models—and is competitive with several proprietary systems that rely on larger planners. 

How the A2A workflow runs

  1) Discover agents → 2) Create thread with participants → 3) Workers execute subtasks; Critique labels outputs accept/uncertain while any agent can contribute revisions → 4) Consensus vote before finalization → 5) Answer-Finder submits. All via MCP messaging in a single conversation context.
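A compressed sketch of that orchestration, layered on the hypothetical a2a client above. The agent objects, the request_revision helper, and the majority-vote rule are illustrative assumptions; the paper only states that a consensus vote precedes finalization.

```python
# Hedged sketch of an Anemoi-style run: plan, execute, critique, vote, submit.
# All names, the subtask format, and the voting rule are assumptions.

def run_task(a2a, planner, workers, critique, answer_finder, task):
    thread = a2a.create_thread(participants=a2a.list_agents())

    drafts = []
    for sub in planner.plan(task):                   # e.g. {"kind": "web", "goal": ...}
        out = workers[sub["kind"]].execute(sub)      # web / document / coding worker
        if critique.review(out) == "uncertain":
            # Any agent may propose a revision in-thread before moving on.
            # request_revision is an assumed helper, not one of the four primitives.
            out = a2a.request_revision(thread, sub, out)
        drafts.append(out)

    candidate = answer_finder.compile(drafts)
    voters = [planner, critique, *workers.values()]
    if sum(v.vote(candidate) == "approve" for v in voters) > len(voters) / 2:
        return answer_finder.submit(candidate)       # simple majority as a stand-in
    # On disagreement, fall back to a fresh plan (an assumption; the paper
    # only says a consensus vote happens before finalization).
    return run_task(a2a, planner, workers, critique, answer_finder, task)
```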

Where it wins—and where it trips

  • Wins: Of the tasks Anemoi solved that OWL missed, 52% were due to collaborative refinement enabled by A2A; another 8% came from less context redundancy. 

  • Failures: Remaining errors skew to LLM/tool limits (≈46%/21%), incorrect plans (≈12%), and some communication latency (≈10%)—notably when the web agent is busy and can’t respond to peers. 

Why this matters

If your agent system juggles web search, file I/O, and coding, direct inter-agent communication can deliver better results without upgrading to an expensive planner. Anemoi shows a practical blueprint: keep the planner lightweight, move coordination into an A2A layer, and let specialists negotiate in-thread instead of bloating prompts. 

Paper link: arXiv 2508.17068 (PDF)

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


13.7.25

Moonshot AI’s Kimi K2: A Free, Open-Source Model that Tops GPT-4 on Coding & Agentic Benchmarks

 Moonshot AI, a Beijing-based startup backed by Alibaba, has thrown down the gauntlet to proprietary giants with the public release of Kimi K2—an open-source large language model that outperforms OpenAI’s GPT-4 in several high-stakes coding and reasoning benchmarks. 

What Makes Kimi K2 Different?

  • Massive—but Efficient—MoE Design
    Kimi K2 uses a mixture-of-experts (MoE) architecture: 1 trillion total parameters with only 32B active per token. That means GPT-4-level capability without GPT-4-level hardware (a toy routing sketch follows this list).

  • Agentic Skill Set
    The model is optimized for tool use: autonomously writing, executing and debugging code, then chaining those steps to solve end-to-end tasks—no external agent wrapper required. 

  • Benchmark Dominance

    • SWE-bench Verified: 65.8% (previous open-source best ≈ 59%)

    • Tau2 & AceBench (multi-step reasoning): tops all open models, matches some closed ones.

  • Totally Free & Open
    Weights, training scripts and eval harnesses are published on GitHub under an Apache-style license—a sharp contrast to the closed policies of OpenAI, Anthropic and Google.
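Back to the MoE point above: with roughly 32B of 1T parameters active, only about 3% of the weights participate in any given forward pass. The sparse-activation idea is easy to see in a toy top-k router. The sketch below is a generic mixture-of-experts layer in NumPy, not Kimi K2's actual architecture.

```python
import numpy as np

# Toy mixture-of-experts layer illustrating sparse activation: only the top-k
# experts run per token, so active parameters << total parameters.
# Generic sketch; dimensions and expert count are arbitrary, not Kimi K2's.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))            # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) representation of a single token."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                        # pick top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only the top-k expert matrices are multiplied: sparse compute per token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)   # (64,): same width, but only top_k of n_experts ran
```

Scaling the same routing idea to many more and much larger experts is what lets total capacity reach the trillion-parameter range while per-token compute stays near the quoted 32B figure.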

Why Moonshot Is Giving It Away

Moonshot’s strategy mirrors Meta’s Llama: open weights become a developer-acquisition flywheel. Every engineer who fine-tunes or embeds Kimi K2 is a prospect for Moonshot’s paid enterprise support and customized cloud instances. 

Early Use Cases

Domain                   How Kimi K2 Helps
Software Engineering     Generates minimal bug-fix diffs that pass repo test suites.
Data-Ops Automation      Uses built-in function calling to orchestrate pipelines without bespoke agents.
AI Research              Serves as an open baseline for tool-augmented reasoning experiments.

Limitations & Roadmap

Kimi K2 is text-only (for now) and lacks the multimodal chops of Gemini 2.5 or GPT-4o. Moonshot says an image-and-code variant and a quantized 8B edge model are slated for Q4 2025. 


Takeaway
Kimi K2 signals a tipping point: open models can now match—or beat—top proprietary LLMs in complex, real-world coding tasks. For developers and enterprises evaluating AI stacks, the question is no longer if open source can compete, but how quickly they can deploy it.

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...