
8.7.25

AIRA shows how better operators — not just bigger models — turbo-charge AI research agents

Large language models that write code have already stormed GitHub, but turning them into full-blown research agents (systems that iterate on entire ML pipelines until they medal on Kaggle) has proved trickier. The previous state of the art, AIDE, could grab a medal on roughly 40% of MLE-bench tasks. Now Meta AI and UCL push that rate to 47.7% with AIRA, a rethink arguing that the secret isn't a flashier LLM, it's the operators and search policy you wrap around it.

From one-shot “Draft, Debug, Improve” to a toolbox of surgical edits

AIRA introduces OAIRA, a new operator set that goes beyond AIDE’s three blunt actions. Scoped memory keeps prompts lean, “think tokens” force structured reasoning, and a prompt-adaptive complexity cue decides whether the agent should sketch a quick baseline or engineer a deep ensemble. The result: twice the reasoning tokens per call and far less mode collapse. 
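
To make the roles concrete, here is a minimal Python sketch of how scoped memory and a complexity cue might shape an operator's prompt. The names and structure are illustrative assumptions, not AIRA's actual implementation:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One candidate solution in the search tree."""
        code: str
        val_score: float | None = None
        memory: list[str] = field(default_factory=list)  # scoped: this branch only

    def complexity_cue(node: Node) -> str:
        # Prompt-adaptive cue: sketch a baseline first, engineer depth later.
        if node.val_score is None:
            return "Write the simplest baseline that produces a valid submission."
        return "A baseline works; engineer stronger features or an ensemble."

    def build_prompt(node: Node) -> str:
        # Scoped memory keeps the prompt lean: only this branch's recent notes,
        # plus a structured-reasoning ("think") preamble.
        notes = "\n".join(node.memory[-5:])
        return (
            "<think>Reason step by step about what to change and why.</think>\n"
            f"Branch notes:\n{notes}\n\n"
            f"Task: {complexity_cue(node)}\n"
            f"Current solution:\n{node.code}"
        )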

Search policies finally get room to shine

When AIDE’s original operators were plugged into greedy, MCTS, and evolutionary searches, the fancier algorithms gained zero ground; the operator bottleneck was that severe. Swap in OAIRA and those same policies leapfrog greedy search, showing that exploration muscle only pays off once edits are expressive enough.
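
The separation is easy to picture once an operator is just a function from parent program to child program; the policy merely decides which node to expand next. A toy sketch with illustrative signatures, not AIRA's real interfaces:

    import random
    from typing import Callable

    # An operator rewrites a candidate solution; the LLM call is abstracted away.
    Operator = Callable[[str], str]
    Score = Callable[[str], float]

    def greedy_search(root: str, ops: list[Operator], score: Score, steps: int) -> str:
        best, best_s = root, score(root)
        for _ in range(steps):
            child = random.choice(ops)(best)      # always expand the incumbent
            s = score(child)
            if s > best_s:
                best, best_s = child, s
        return best

    def evolutionary_search(root: str, ops: list[Operator], score: Score,
                            steps: int, pop_size: int = 4) -> str:
        pop = [root] * pop_size
        for _ in range(steps):
            parent = random.choice(pop)           # explore beyond the incumbent
            child = random.choice(ops)(parent)
            pop = sorted(pop + [child], key=score, reverse=True)[:pop_size]
        return max(pop, key=score)

If the operators can only make a few coarse moves, every policy explores the same cramped neighborhood; richer operators are what give the fancier policies distinct trajectories to exploit.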

The scoreboard (MLE-bench Lite, 22 Kaggle tasks)

  • AIDE (o1-preview, greedy): 39.6% medal rate

  • AIRA (greedy + OAIRA): 45.5%

  • AIRA (MCTS + OAIRA): 47.7%

  • AIRA (evolutionary + OAIRA): 47.3%

All agents ran under identical 24-hour, single-GPU budgets inside AIRA-dojo, a new sandbox that hands each run a root-privileged H200 container yet isolates filesystem side effects.

Mind the generalization gap

The study also spotlights a pitfall for auto-ML agents: validation scores routinely overestimate test-set gains, steering greedy searches into dead ends. By examining thousands of runs, the team quantifies that “proxy-test gap” and urges future benchmarks to track it explicitly.
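
Measuring the gap is simple once every run logs both a validation and a held-out test score. A sketch with illustrative field names (the paper's exact metric may differ):

    import statistics

    def proxy_test_gap(runs: list[dict]) -> float:
        # Mean amount by which validation overestimates test, across runs.
        return statistics.mean(r["val"] - r["test"] for r in runs)

    def selection_regret(nodes: list[dict]) -> float:
        # Test score lost by picking the best-validation node rather than the
        # best-test node: the cost of trusting the proxy greedily.
        by_val = max(nodes, key=lambda n: n["val"])
        by_test = max(nodes, key=lambda n: n["test"])
        return by_test["test"] - by_val["test"]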

Why it matters

  • Agent design ≠ model scale. The leap came without touching the underlying LLM (DeepSeek-R1 or GPT-4o). That’s good news for teams capped by API limits.

  • Composable recipe. OAIRA operators, MCTS search and the open-source aira-dojo testbed (GitHub link in the paper) can bolt onto any ReAct-style coding agent.

  • Toward autonomous ML ops. AIRA’s 24-hour, single-GPU constraint mirrors real-world hack-day budgets, making the findings immediately useful for startups chasing continuous Kaggle pipelines or internal model tuning bots.

Auto-ML agents are no longer judged solely by the size of their LLM brains; the tools they wield and the ways they explore the search space may count just as much. AIRA’s 8-point jump on MLE-bench suggests that the next frontier in agentic ML will be won with sharper scalpels, not bigger hammers.

Paper link: arXiv 2507.02554 (PDF)

17.5.25

How FutureHouse’s AI Agents Are Reshaping Scientific Discovery

In a major leap for scientific research, FutureHouse—a nonprofit backed by former Google CEO Eric Schmidt—has introduced a powerful lineup of AI research agents aimed at accelerating the pace of scientific discovery. Built to support scientists across disciplines, these agents automate key parts of the research workflow—from literature search to chemical synthesis planning—reducing bottlenecks and enhancing productivity.

This suite includes four primary agents: Crow, Falcon, Owl, and Phoenix, each specialized in a unique aspect of the research pipeline. Together, they form a comprehensive AI-powered infrastructure for modern science.


Meet the AI Agents Changing Science

1. Crow – The Concise Search Specialist

Crow acts as a rapid-response research assistant. It provides short, precise answers to technical queries by intelligently retrieving evidence from full-text scientific papers. Designed for speed and accuracy, it’s especially useful for API-based interactions, where precision and performance matter most. Crow is built on top of FutureHouse’s custom PaperQA2 architecture.
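
In practice, calling such an agent could be as simple as the sketch below. The endpoint, payload, and response fields here are placeholders for illustration; FutureHouse's actual platform API may differ:

    import requests

    API_URL = "https://platform.example/api/agents/crow"  # placeholder URL

    def ask_crow(question: str, api_key: str) -> str:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            json={"query": question},   # assumed payload shape
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["answer"]    # assumed response field

    print(ask_crow("What is the binding target of semaglutide?", "YOUR_KEY"))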

2. Falcon – Deep Research Assistant

Falcon takes things further by conducting expansive literature reviews. It produces full-length research reports in response to broader or more open-ended scientific questions. By analyzing papers, data sources, and context-rich materials, Falcon allows researchers to dive deep into topics without manually sorting through endless PDFs.

3. Owl – Precedent Investigator

Owl helps scientists find out whether an experiment or research idea has already been executed. This is crucial for grant applications, patent filings, and ensuring that researchers don’t waste time reinventing the wheel. By surfacing related studies and experiments, Owl enables more informed, original work.

4. Phoenix – The Chemistry Innovator

Phoenix is built for early-stage chemistry research. Leveraging cheminformatics tools, it assists in designing molecules, suggesting synthetic routes, and evaluating chemical feasibility. It builds upon an earlier FutureHouse prototype called ChemCrow and remains in active development as a sandbox tool for chemists to explore and provide feedback.
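
The primitives underneath a tool like Phoenix are standard cheminformatics calls. The sketch below is not Phoenix's code; it uses the open-source RDKit library to run the kind of quick feasibility profile such an agent automates:

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors

    def quick_profile(smiles: str) -> dict:
        # Parse a candidate molecule and compute basic drug-likeness numbers.
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Invalid SMILES: {smiles}")
        return {
            "mol_weight": Descriptors.MolWt(mol),
            "logp": Crippen.MolLogP(mol),
            "h_bond_donors": Descriptors.NumHDonors(mol),
            "h_bond_acceptors": Descriptors.NumHAcceptors(mol),
        }

    print(quick_profile("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin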


Performance and Potential

In benchmark tests, Crow, Falcon, and Owl outperformed PhD-level biologists on scientific retrieval and reasoning tasks. Unlike many AI tools that only read paper abstracts or summaries, these agents consume and analyze full-text documents, allowing them to detect nuanced issues like methodological flaws or statistical limitations.

Although Phoenix is still in its experimental phase and may sometimes produce errors, it represents an important step toward automating complex tasks in synthetic chemistry.


Why This Matters

The bottlenecks of modern science often lie not in experimentation, but in navigating the overwhelming volume of prior work. By offloading repetitive and time-consuming research tasks to AI, FutureHouse's agents free up scientists to focus on creativity, innovation, and critical thinking.

These tools are also being made openly available for scientists and research institutions, fostering a collaborative environment for AI-augmented science.


Final Takeaway

FutureHouse’s AI agents aren’t just productivity boosters—they’re a vision of a new research paradigm. By augmenting human researchers with scalable, intelligent assistants, we’re witnessing the early stages of a revolution in how science is done. As these tools evolve, they hold the potential to dramatically accelerate scientific discovery across disciplines.


References

  1. Automate Your Research Workflows Using AI Agents for Scientific Discovery by FutureHouse – MarkTechPost

  2. FutureHouse Official Website

  3. FutureHouse Research Agent Platform
