
10.9.25

An AI that writes expert-level scientific software—and often beats the leaderboard

 A large Google team is pushing past “chatty copilot” and into AI that authors working scientific code. Their system pairs a large language model with tree search to iteratively write, run, and score programs for scorable research problems—then learns to recombine ideas from papers and prior algorithms. In benchmarks, it discovered 40 new single-cell RNA-seq methods that outperformed the top human-made entries on OpenProblems, and produced 14 COVID-19 hospitalization forecasters that beat the CDC’s ensemble and every individual competitor during the study window. 

How it works. Researchers frame a scientific task as “maximize a quality metric,” let the LLM generate code variants, and use tree search to expand promising branches while pruning the rest. The agent can ingest research ideas from literature (summarized with Gemini 2.5 Pro) and also tries automatic recombinations of methods, plus proposals from Gemini Deep Research and AI co-scientist tools. In head-to-head tests on nine published algorithms, the system’s implementations beat eight of nine baselines; its best run—BBKNN(TS)—improved the bioinformatics leaderboard by 14% over the long-standing ComBat approach. 
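As a rough illustration of that loop, here is a minimal Python sketch; the helpers llm_propose_variants and run_and_score are hypothetical stand-ins for the paper's components, not its actual API, and the node budget mirrors the kind of cap described later in the post.

```python
# Hypothetical sketch of LLM-guided tree search over candidate programs.
# llm_propose_variants() and run_and_score() are illustrative stand-ins.
import heapq

def tree_search(seed_program, llm_propose_variants, run_and_score,
                max_nodes=500, branch_factor=4):
    """Expand promising code variants; prune the rest by score."""
    root_score = run_and_score(seed_program)
    # Max-heap via negated scores: always expand the best-scoring node next.
    frontier = [(-root_score, seed_program)]
    best_score, best_program = root_score, seed_program
    nodes = 1

    while frontier and nodes < max_nodes:
        _, program = heapq.heappop(frontier)
        # Ask the LLM for several rewrites of the current program,
        # optionally conditioned on summarized research ideas.
        for variant in llm_propose_variants(program, n=branch_factor):
            score = run_and_score(variant)  # execute the code, evaluate the metric
            nodes += 1
            if score > best_score:
                best_score, best_program = score, variant
            heapq.heappush(frontier, (-score, variant))
            if nodes >= max_nodes:
                break
    return best_program, best_score
```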

Bioinformatics at scale. The team evaluates on OpenProblems v2.0.0, spanning 1,747,937 cells and 13 metrics across six datasets. Beyond re-implementing published methods, recombination mattered: among 55 pairwise hybrids, 24 outperformed both parent methods and most of the rest beat at least one parent, evidence that the search can synthesize competitive, novel ideas rather than just tune hyperparameters. 
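To make the recombination step concrete, here is a hedged sketch (the helper names are invented, not the paper's interface) of proposing pairwise hybrids and keeping only those that beat both parents on the metric.

```python
# Illustrative sketch of pairwise recombination: ask the LLM to merge the
# ideas of two parent methods into one program, then keep hybrids that
# outscore both parents. llm_recombine() and run_and_score() are
# hypothetical helpers.
from itertools import combinations

def recombine_and_filter(parents, llm_recombine, run_and_score):
    """parents: dict mapping method name -> program text."""
    parent_scores = {name: run_and_score(code) for name, code in parents.items()}
    winners = []
    for (name_a, code_a), (name_b, code_b) in combinations(parents.items(), 2):
        hybrid = llm_recombine(code_a, code_b)  # e.g. 11 parents give 55 pairs
        score = run_and_score(hybrid)
        if score > parent_scores[name_a] and score > parent_scores[name_b]:
            winners.append((f"{name_a}+{name_b}", hybrid, score))
    return winners
```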

Public-health forecasting. For U.S. COVID-19 hospitalization forecasting (the CDC’s Forecast Hub), the system generated models with consistently lower error, i.e. better weighted interval score (WIS), than the official ensemble in most jurisdictions; in an aggregate comparison, 14 strategies (10 recombinations, 2 from Gemini Deep Research, 1 from the AI co-scientist, and 1 replicated baseline) surpassed the ensemble across the three-week hold-out period. 
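WIS, the Forecast Hub's headline accuracy metric, is simple to compute; here is a short, self-contained Python sketch of the standard definition for quantile forecasts, with made-up numbers for illustration (lower is better).

```python
# Weighted interval score (WIS) for a forecast given as a median plus
# central prediction intervals. Inputs below are illustrative only.
def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score

def weighted_interval_score(median, intervals, y):
    """intervals: list of (alpha, lower, upper) tuples for each central interval."""
    K = len(intervals)
    total = 0.5 * abs(y - median)  # median term
    for alpha, lower, upper in intervals:
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (K + 0.5)

# Example: a forecast of 120 hospitalizations with 50% and 90% intervals,
# scored against an observed value of 150 (numbers are made up).
wis = weighted_interval_score(median=120,
                              intervals=[(0.5, 100, 140), (0.1, 80, 180)],
                              y=150)
print(f"WIS = {wis:.2f}")
```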

Not just biology. The abstract lists additional wins in geospatial image segmentation, zebrafish neural activity prediction, general time-series, and numerical integration, arguing the approach generalizes to diverse “empirical software” problems where code can be scored automatically. 

Engineering notes—and guardrails. To avoid overfitting, bio experiments hill-climb on a separate CELLxGENE dataset and report on the held-out OpenProblems benchmark; metrics that fail to compute are clamped to worst-case—making robustness part of the score. The team also ran multiple replicates to show stability, and reports practical budgets: ≈500 nodes (~7 hours) per scRNA-seq search and ≈2000 nodes per COVID run on their infra. 
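As a concrete illustration of that clamping guardrail, here is a hedged sketch (metric names and worst-case values are invented) showing how a failed metric computation is scored as worst-case rather than silently skipped, so fragile code cannot win by crashing on the hard metrics.

```python
# Hedged sketch of the robustness guardrail: if a candidate program crashes
# or a metric fails to compute, that metric is clamped to its worst possible
# value instead of being dropped from the average.
WORST_CASE = {"asw": 0.0, "nmi": 0.0, "kbet": 0.0}  # hypothetical metric floors

def robust_score(program, metrics, run_metric):
    """Average metric values, substituting the worst case on any failure."""
    values = []
    for name in metrics:
        try:
            values.append(run_metric(program, name))
        except Exception:
            values.append(WORST_CASE[name])  # failure counts against the program
    return sum(values) / len(values)
```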

Why it matters. Rather than waiting months for domain-specific code to be hand-crafted, this “AI co-scientist” produces working software, tests it against public leaderboards, and composes new hybrids from the literature. If those patterns hold beyond the reported tasks, the future of scientific computing looks less like prompt engineering and more like searching the space of programs.

Paper link: arXiv 2509.06503 (PDF)
