
10.9.25

An AI that writes expert-level scientific software—and often beats the leaderboard

 A large Google team is pushing past “chatty copilot” and into AI that authors working scientific code. Their system pairs a large language model with tree search to iteratively write, run, and score programs for scorable research problems—then learns to recombine ideas from papers and prior algorithms. In benchmarks, it discovered 40 new single-cell RNA-seq methods that outperformed the top human-made entries on OpenProblems, and produced 14 COVID-19 hospitalization forecasters that beat the CDC’s ensemble and every individual competitor during the study window. 

How it works. Researchers frame a scientific task as “maximize a quality metric,” let the LLM generate code variants, and use tree search to expand promising branches while pruning the rest. The agent can ingest research ideas from literature (summarized with Gemini 2.5 Pro) and also tries automatic recombinations of methods, plus proposals from Gemini Deep Research and AI co-scientist tools. In head-to-head tests on nine published algorithms, the system’s implementations beat eight of nine baselines; its best run—BBKNN(TS)—improved the bioinformatics leaderboard by 14% over the long-standing ComBat approach. 
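
In spirit, this is a best-first search over candidate programs. Below is a minimal sketch of that loop under my own simplifications: llm_propose_variants (the LLM call that mutates a program) and run_and_score (sandboxed execution returning a quality metric) are hypothetical stand-ins, not the paper's interfaces.

```python
import heapq
import itertools

def tree_search(seed_code: str, llm_propose_variants, run_and_score,
                branching: int = 4, max_nodes: int = 500):
    """Best-first search over candidate programs.

    llm_propose_variants(code, score) -> list[str]   # hypothetical LLM call
    run_and_score(code) -> float                     # sandboxed run; higher is better
    """
    counter = itertools.count()  # tie-breaker so the heap never compares code strings
    best_code = seed_code
    best_score = run_and_score(seed_code)
    frontier = [(-best_score, next(counter), seed_code)]  # max-heap via negated scores
    expanded = 1

    while frontier and expanded < max_nodes:
        neg_score, _, code = heapq.heappop(frontier)      # expand the most promising node
        for variant in llm_propose_variants(code, -neg_score)[:branching]:
            score = run_and_score(variant)
            expanded += 1
            if score > best_score:
                best_code, best_score = variant, score
            heapq.heappush(frontier, (-score, next(counter), variant))
            if expanded >= max_nodes:
                break
    return best_code, best_score
```

The max_nodes default mirrors the ≈500-node budget the paper reports for its scRNA-seq searches.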

Bioinformatics at scale. The team evaluates on OpenProblems v2.0.0, spanning 1,747,937 cells and 13 metrics across six datasets. Beyond re-implementing published methods, recombination mattered: among 55 pairwise hybrids, 24 outperformed both parent methods and most of the rest beat at least one parent—evidence that the search can synthesize competitive, novel ideas rather than just tune hyperparameters.

Public-health forecasting. For U.S. COVID-19 hospitalization forecasting (the CDC’s Forecast Hub), the system generated models with consistently lower error (better WIS) than the official ensemble in most jurisdictions; in an aggregate comparison, 14 strategies (ten recombinations, two from Deep Research, one from the AI co-scientist, and one replicated baseline) surpassed the ensemble across the three-week hold-out period.
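
For context, WIS (weighted interval score) is the Forecast Hub's standard accuracy metric; lower is better. Here is a small sketch of the usual formulation for a set of central prediction intervals (the standard definition, not code from the paper):

```python
def interval_score(y: float, lower: float, upper: float, alpha: float) -> float:
    """Interval score of a central (1 - alpha) prediction interval at observation y."""
    score = upper - lower                      # sharpness (interval width)
    if y < lower:
        score += (2.0 / alpha) * (lower - y)   # penalty for missing low
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)   # penalty for missing high
    return score

def weighted_interval_score(y: float, median: float,
                            intervals: dict[float, tuple[float, float]]) -> float:
    """WIS over K central intervals keyed by alpha, e.g. {0.2: (lo, hi), ...};
    lower WIS means a sharper, better-calibrated forecast."""
    K = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (K + 0.5)
```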

Not just biology. The abstract lists additional wins in geospatial image segmentation, zebrafish neural activity prediction, general time-series, and numerical integration, arguing the approach generalizes to diverse “empirical software” problems where code can be scored automatically. 

Engineering notes—and guardrails. To avoid overfitting, bio experiments hill-climb on a separate CELLxGENE dataset and report on the held-out OpenProblems benchmark; metrics that fail to compute are clamped to worst-case—making robustness part of the score. The team also ran multiple replicates to show stability, and reports practical budgets: ≈500 nodes (~7 hours) per scRNA-seq search and ≈2000 nodes per COVID run on their infra. 
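
The clamping rule is simple to picture: if a metric crashes or returns a non-finite value for a candidate program, that candidate gets the worst possible score for the metric instead of being skipped. A toy sketch, assuming metrics normalized to [0, 1] with higher-is-better (my assumption, not necessarily the paper's exact convention):

```python
import math

WORST_CASE = 0.0  # assumed floor for a metric normalized to [0, 1], higher is better

def robust_metric(metric_fn, candidate_output) -> float:
    """Clamp failures to the worst case so fragile candidates are penalized, not skipped."""
    try:
        value = metric_fn(candidate_output)
    except Exception:
        return WORST_CASE                        # a crash counts as the worst score
    if value is None or not math.isfinite(value):
        return WORST_CASE                        # NaN/inf are clamped too
    return float(value)

def aggregate_score(metric_fns, candidate_output) -> float:
    """Average of clamped metrics, so robustness is baked into the search objective."""
    return sum(robust_metric(m, candidate_output) for m in metric_fns) / len(metric_fns)
```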

Why it matters. Rather than waiting for domain-specific code to be hand-crafted over months, this “AI co-scientist” produces working software, tests it against public leaderboards, and composes new hybrids from the literature. If those patterns hold beyond the reported tasks, the future of scientific computing looks less like prompt engineering—and more like searching the space of programs.

Paper link: arXiv 2509.06503 (PDF)

5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

 Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks. 

At a high level, the system does three things I really like:

  1. Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code. 

  2. Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs. This targeted loop is far closer to how strong human MLEs work day-to-day; a minimal sketch of the idea follows this list.

  3. Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one. 
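
To make item 2 concrete, here is a hypothetical sketch of an ablation-then-refine loop. run_pipeline (returns a validation score) and llm_rewrite_block (the LLM call) are stand-ins I made up for illustration, not MLE-STAR's actual interfaces.

```python
def ablation_then_refine(blocks: dict[str, str], run_pipeline, llm_rewrite_block,
                         rounds: int = 3):
    """blocks maps a pipeline stage name ('features', 'model', 'ensemble', ...)
    to its code. run_pipeline(blocks) -> validation score, higher is better;
    llm_rewrite_block(name, code, feedback) -> new code for that stage."""
    baseline = run_pipeline(blocks)

    # Ablation: revert each block to a trivial stub and measure the score drop;
    # the biggest drop marks the most impactful block.
    def impact(name: str) -> float:
        ablated = dict(blocks, **{name: "pass  # trivial stub"})
        return baseline - run_pipeline(ablated)

    target = max(blocks, key=impact)

    # Refine only that block, keeping a rewrite when it improves the validation score.
    best = baseline
    for _ in range(rounds):
        candidate = llm_rewrite_block(target, blocks[target], feedback=f"score={best:.4f}")
        trial = run_pipeline(dict(blocks, **{target: candidate}))
        if trial > best:
            blocks[target], best = candidate, trial
    return blocks, best
```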

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines. 
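
For flavor only: MLE-STAR's checkers are LLM-based, but even a crude static heuristic conveys what a leakage check looks for. This toy version just flags fit-style calls whose arguments mention test or holdout data.

```python
import ast

FIT_METHODS = {"fit", "fit_transform"}  # estimator/preprocessor calls that learn parameters

def flag_possible_leakage(script_source: str) -> list[str]:
    """Flag fit()/fit_transform() calls that appear to touch test/holdout data,
    a common way test-time information contaminates preprocessing."""
    warnings = []
    for node in ast.walk(ast.parse(script_source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in FIT_METHODS):
            names = {n.id.lower() for n in ast.walk(node) if isinstance(n, ast.Name)}
            if any("test" in name or "holdout" in name for name in names):
                warnings.append(f"line {node.lineno}: {node.func.attr} appears to use test data")
    return warnings

# Example: this flags `scaler.fit_transform(pd.concat([X_train, X_test]))`.
```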

On benchmarks, the results are eye-catching. MLE-STAR won medals in roughly 63–64% of the Kaggle competitions in MLE-Bench-Lite, a big jump over prior agents: the blog cites 63.6% any-medal (with 36% gold), while the arXiv v2 reports 64%.

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it. 

Now, where I’m cautious:

  • Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete. 

  • Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land. 

  • Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands. 
