Wandering Nomad: memory

If the last decade was about AI as a tool for scientists, the next one may be about AI as a research partner. A sweeping, 74-page survey positions Agentic Science as that next stage: systems that generate hypotheses, design and execute experiments, analyze outcomes, and then refine theories with minimal human steering. The authors organize the field into a practical stack—and back it with domain-specific reviews across life sciences, chemistry, materials, and physics.

The elevator pitch

The paper argues that Agentic Science is Level 3 in a four-level evolution of “AI for Science,” which moves from computational oracles (Level 1) and automated assistants (Level 2) toward autonomous partners (Level 3) and, eventually, “generative architects” that proactively propose research programs (Level 4). The survey unifies three previously fragmented lenses (process, autonomy, and mechanisms) into one working framework.

Five core capabilities every scientific agent needs

Reasoning & planning engines to structure goals, decompose tasks, and adapt plans;
Tool use & integration to operate lab gear, simulators, search APIs, and code;
Memory mechanisms to retain papers, traces, and intermediate results;
Multi-agent collaboration for division of labor and peer review;
Optimization & evolution (skills, data, and policies) to get better over time.

Each has open challenges, such as robust tool APIs and verifiable memories, which the survey catalogs with exemplars.

A four-stage scientific workflow, made agentic

The authors reframe the scientific method as a dynamic loop:
(1) Observation & hypothesis generation → (2) experimental planning & execution → (3) analysis → (4) synthesis, validation & evolution, with agents flexibly revisiting stages as evidence arrives. The survey also sketches a “fully autonomous research pipeline” that strings these together end-to-end.
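
To make the "revisit stages as evidence arrives" idea concrete, here is a toy sketch of the loop as a small state machine. It is not taken from the survey: the `Stage` enum, `ResearchState`, and `run_loop` are hypothetical names, and the "experiment" is a noiseless measurement of a single number, standing in for real lab or simulation work.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    HYPOTHESIS = auto()   # (1) observation & hypothesis generation
    EXPERIMENT = auto()   # (2) experimental planning & execution
    ANALYSIS = auto()     # (3) analysis
    SYNTHESIS = auto()    # (4) synthesis, validation & evolution


@dataclass
class ResearchState:
    hypothesis: float = 0.0                      # current guess for an unknown quantity
    observations: list = field(default_factory=list)
    error: float = float("inf")
    trace: list = field(default_factory=list)    # stages visited, in order


def run_loop(true_value, tolerance=0.01, max_steps=100):
    """Toy loop: 'discover' true_value by testing and revising a guess."""
    state = ResearchState()
    stage = Stage.HYPOTHESIS
    for _ in range(max_steps):
        state.trace.append(stage)
        if stage is Stage.HYPOTHESIS:
            # Revise the guess from the evidence so far (mean of observations).
            if state.observations:
                state.hypothesis = sum(state.observations) / len(state.observations)
            stage = Stage.EXPERIMENT
        elif stage is Stage.EXPERIMENT:
            # Stand-in for a real experiment: a noiseless measurement.
            state.observations.append(true_value)
            stage = Stage.ANALYSIS
        elif stage is Stage.ANALYSIS:
            state.error = abs(state.observations[-1] - state.hypothesis)
            stage = Stage.SYNTHESIS
        else:  # Stage.SYNTHESIS
            if state.error <= tolerance:
                break                     # hypothesis validated; stop
            stage = Stage.HYPOTHESIS      # evidence disagrees; revisit stage 1
    return state


state = run_loop(3.0)
print([s.name for s in state.trace])
```

The point of the sketch is only the control flow: the synthesis stage decides whether to stop or send the agent back to hypothesis generation, which is what distinguishes a loop from a one-way pipeline.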

What’s actually happening in the lab (and sim)

Beyond taxonomy, the paper tours concrete progress: automated multi-omics analysis and protein design in the life sciences; autonomous reaction optimization and molecular design in chemistry; closed-loop materials discovery platforms; and agentic workflows across physics, including cosmology, CFD and quantum. The thread tying them together: agents that operate tools (wet-lab robots, DFT solvers, telescopes, or HPC codes), capture traces, and use structured feedback to improve.

Why this survey matters now

It’s a build sheet, not just a reading list. By mapping capabilities to workflow stages—and then to domain-specific systems—the paper serves as a blueprint for teams trying to operationalize “AI co-scientists.”
It pushes on verification. Sections on reproducibility, novelty validation, transparent reasoning, and ethics acknowledge the real blockers to trusting autonomous results.
Ecosystem signal. A companion GitHub “Awesome-Agent-Scientists” catalog and project links indicate growing coordination around shared datasets, benchmarks, and platform plumbing.

How it compares with adjacent work

Other recent efforts survey “agentic AI for science” at a higher altitude or via community workshops, but this paper leans hard into domain-oriented synthesis and a capabilities × workflow matrix, plus concrete exemplars in the natural sciences. Together, these choices help standardize vocabulary across the research and industry stacks now building agent platforms.

The road ahead

The outlook section pulls no punches: making agents reproducible, auditable, and collaborative is as much socio-technical as it is algorithmic. The authors float big bets—a Global Cooperation Research Agent and even a tongue-in-cheek “Nobel-Turing Test”—to force clarity about what counts as scientific novelty and credit when agents contribute.

Bottom line: If you’re building AI that does more than summarize papers—systems that plan, run, and iterate on experiments—this survey offers a pragmatic frame: start with the five capabilities, wire them into the four-stage loop, and measure progress with verifiable, domain-specific tasks.

Paper link: arXiv 2508.14111 (PDF)

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large language model (LLM) at inference time: instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations.

Core Techniques

Technique	Goal	Typical Tools / Patterns
Prompt Design & Templates	Give the model clear role, task, format and constraints	System + user role prompts; XML / JSON schemas; function-calling specs
Retrieval-Augmented Generation (RAG)	Supply fresh, external knowledge just-in-time	Vector search, hybrid BM25+embedding, GraphRAG
Context Compression	Fit more signal into limited tokens	Summarisation, saliency ranking, LLM-powered “short-former” rewriters
Chunking & Windowing	Preserve locality in extra-long inputs	Hierarchical windows, sliding attention, FlashMask / Ring Attention
Scratchpads & CoT Scaffolds	Expose model reasoning for better accuracy and debuggability	Self-consistency, tree-of-thought, DST (Directed Self-Testing)
Memory & Profiles	Personalise without retraining	Vector memories, episodic caches, preference embeddings
Tool / API Context	Let models call and interpret external systems	Model Context Protocol (MCP), JSON-schema function calls, structured tool output
Policy & Guardrails	Enforce safety and brand style	Content filters, regex validators, policy adapters, YAML instruction blocks
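
As one illustration of the "hybrid BM25+embedding" pattern in the RAG row, the sketch below fuses two ranked lists with reciprocal rank fusion (RRF). RRF is one common fusion choice, not something the table prescribes. The document ids and both input rankings are made up, and the retrievers that would produce them are assumed to exist elsewhere.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; rank 1 is the best hit.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is a conventional default, used here as an assumption.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical (BM25) and a dense (embedding) retriever.
bm25_hits = ["refund-policy", "shipping-faq", "warranty"]
dense_hits = ["warranty", "refund-policy", "returns-howto"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives further down the list.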

Why It Matters

Accuracy & Trust – Fact-rich, well-structured context can reduce hallucinations and citation errors by grounding answers in supplied sources.
Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet obligations under GDPR, HIPAA and the EU AI Act.
Cost Efficiency – Compressing or caching context can cut token bills by as much as 50-80%.
Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.

High-Impact Use Cases

Sector	How Context Engineering Delivers Value
Customer Support	RAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies.
Coding Agents	Function-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously.
Healthcare Q&A	Context filters strip PHI before retrieval; clinically approved guidelines are injected to guide safe advice.
Legal Analysis	Long-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting.
Manufacturing IoT	Streaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents.
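
The Manufacturing IoT row can be sketched with a bounded queue: each minute of readings is reduced to a one-line summary, and only the most recent summaries stay in the context. The `RollingContext` class, the summary format, and the sample readings are all hypothetical. A real system might summarise with an LLM instead of simple statistics.

```python
from collections import deque
from statistics import mean


class RollingContext:
    """Keep only the most recent per-minute summaries for the agent's prompt."""

    def __init__(self, max_summaries=60):
        # deque(maxlen=...) drops the oldest summary automatically when full.
        self.window = deque(maxlen=max_summaries)

    def add_minute(self, minute_label, readings):
        # Reduce a minute of raw sensor readings to one compact line.
        summary = (f"{minute_label}: n={len(readings)} "
                   f"mean={mean(readings):.1f} max={max(readings):.1f}")
        self.window.append(summary)

    def render(self):
        # Text block to append to the agent's context.
        return "\n".join(self.window)


ctx = RollingContext(max_summaries=2)
ctx.add_minute("12:00", [70.0, 71.0])
ctx.add_minute("12:01", [72.0, 75.0])
ctx.add_minute("12:02", [80.0, 90.0])   # pushes the 12:00 summary out
print(ctx.render())
```

Bounding the window keeps the token cost of this layer constant no matter how long the sensor stream runs.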

Designing a Context Pipeline: Four Practical Steps

1. Map the Task Surface
• What knowledge is static vs. dynamic?
• Which external tools or databases are authoritative?
2. Define Context Layers
• Base prompt: role, format, policy
• Ephemeral layer: user query, tool results
• Memory layer: user or session history
• Safety layer: filters, refusal templates
3. Choose Retrieval & Compression Strategies
• Lexical keyword match (BM25) for short policies; dense vectors for semantic match
• Summaries or selective quoting for large PDFs
4. Instrument & Iterate
• Log token mixes, latency, cost
• A/B test different ordering, chunking, or reasoning scaffolds
• Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains
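
The layering and instrumentation steps above can be sketched together: assemble the prompt from named layers in a configurable order, and record how many tokens each layer contributes. This is a minimal sketch under stated assumptions. `build_context` and `count_tokens` are hypothetical helpers, the word-count tokenizer is a crude stand-in for a real one, and the default layer order is arbitrary (ordering is exactly the kind of thing worth A/B testing).

```python
def count_tokens(text):
    # Crude stand-in: whitespace-separated words. Swap in a real tokenizer.
    return len(text.split())


def build_context(layers, order=("base", "safety", "memory", "ephemeral")):
    """Join non-empty layers in the given order and report tokens per layer."""
    parts, token_mix = [], {}
    for name in order:
        text = layers.get(name, "").strip()
        if not text:
            continue                      # skip layers with nothing to say
        parts.append(f"[{name}]\n{text}")
        token_mix[name] = count_tokens(text)
    return "\n\n".join(parts), token_mix


layers = {
    "base": "You are a support agent. Answer in JSON.",
    "safety": "Refuse requests for personal data.",
    "memory": "",                          # no session history yet
    "ephemeral": "User: where is my order?",
}
prompt, mix = build_context(layers)
print(mix)      # per-layer token counts, suitable for logging
print(prompt)
```

Logging `mix` alongside latency and cost for every request shows which layer is growing and where compression would pay off first.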

Emerging Tools & Standards

MCP (Model Context Protocol) – an open protocol, built on JSON-RPC 2.0, for connecting LLM applications to external tools and data sources; supported by Claude Code, Gemini CLI and IBM MCP Gateway.
Context-Aware Runtimes – vLLM, FlashInfer and Infinity Lite serve contexts of 128K to 1M tokens with optimized KV caches.
Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.
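
For a feel of what MCP traffic looks like, the sketch below builds a `tools/call` request and reads the text out of a result. MCP messages are JSON-RPC 2.0, and the `tools/call` method with `name` and `arguments` parameters follows the published specification. The tool name `lookup_ticket`, its arguments, and the sample response are hypothetical, and no transport or server is involved here.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response):
    """Collect the text items from a tools/call result."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "\n".join(
        item["text"] for item in result["content"] if item.get("type") == "text"
    )


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(1, "lookup_ticket", {"ticket_id": "T-1"})
print(json.dumps(request, indent=2))

# A hand-written response in the shape a server would return.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "status: open"}],
        "isError": False,
    },
}
print(extract_text(response))
```

Because the result is structured, the text that `extract_text` returns can be placed into the ephemeral layer of the context without any scraping or ad hoc parsing.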

The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

Wandering Nomad

2.9.25

From AI for Science to Agentic Science: a blueprint for autonomous discovery