5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

 Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks. 

At a high level, the system does three things I really like:

  1. Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code. 

  2. Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs. This targeted loop is far closer to how strong human MLEs work day-to-day. 

  3. Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one. 

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines. 
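To make the loop concrete, here is a toy sketch of ablate-then-refine in plain Python. Everything here is illustrative: the numeric “blocks”, `evaluate`, and the variant proposer stand in for real pipeline code, validation scoring, and LLM-generated rewrites; this is not the released ADK implementation.

```python
def ablation_impact(blocks, evaluate):
    """Score the full pipeline, then re-score with each block removed;
    the drop in score attributes impact to that block."""
    base = evaluate(blocks)
    return {name: base - evaluate({k: v for k, v in blocks.items() if k != name})
            for name in blocks}

def targeted_refine(blocks, evaluate, propose_variants, rounds=3):
    """Refine only the highest-impact block each round, keeping variants
    that improve the validation score."""
    for _ in range(rounds):
        impact = ablation_impact(blocks, evaluate)
        target = max(impact, key=impact.get)        # most impactful block
        best = evaluate(blocks)
        for variant in propose_variants(target):    # stand-in for LLM rewrites
            candidate = dict(blocks, **{target: variant})
            if evaluate(candidate) > best:
                blocks, best = candidate, evaluate(candidate)
    return blocks

# Toy demo: each "block" contributes a fixed amount to the validation score.
pipeline = {"features": 0.30, "model": 0.50, "ensemble": 0.05}
evaluate = lambda b: sum(b.values())
variants = lambda name: [0.55, 0.60] if name == "model" else [0.32]
refined = targeted_refine(pipeline, evaluate, variants)
```

The design point worth copying: only the highest-impact block changes per round, which keeps the search focused instead of rewriting the whole script every iteration.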

On benchmarks, the results are eye-catching. MLE-STAR won medals in ~63–64% of Kaggle competitions in MLE-Bench-Lite, a massive jump over prior agents; the blog cites 63.6% any-medal (with 36% gold), and the arXiv v2 reports 64%. Either way, it’s a big leap. 

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it. 

Now, where I’m cautious:

  • Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete. 

  • Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land. 

  • Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.
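On that last point, the holdout discipline is cheap to enforce mechanically. A minimal sketch in plain Python (the function name and split fractions are my own): shuffle once, carve off a holdout the agent never touches, and let the refinement loop steer only on the validation slice.

```python
import random

def three_way_split(n_examples, val_frac=0.15, holdout_frac=0.15, seed=0):
    """Shuffle once, then carve out a validation set (which steers the
    agent's ablation loop) and a holdout the agent never sees (used only
    for a final, independent check)."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    n_hold = int(n_examples * holdout_frac)
    n_val = int(n_examples * val_frac)
    holdout = idx[:n_hold]
    val = idx[n_hold:n_hold + n_val]
    train = idx[n_hold + n_val:]
    return train, val, holdout

train_idx, val_idx, holdout_idx = three_way_split(100)
```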

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands. 

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

  • Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook. 

  • Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources. 

  • Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later. 

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML. 

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space. 
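A toy rendition of that loop, with the frozen-LLM planner and the in-context predictor replaced by trivial stand-ins (all class and function names here are mine, not ReaGAN’s API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NodeAgent:
    text: str
    neighbors: list = field(default_factory=list)   # graph-local context
    memory: list = field(default_factory=list)      # persisted snippets/exemplars
    label: Optional[str] = None

    def plan(self, layer):
        # Stand-in for the frozen-LLM planner: gather local context, then
        # global context, then predict. The real plan is drafted per node.
        return ("LocalAggregation", "GlobalAggregation", "Predict")[min(layer, 2)]

    def act(self, action, all_nodes):
        if action == "LocalAggregation":            # aggregate neighbor text
            self.memory += [n.text for n in self.neighbors]
        elif action == "GlobalAggregation":         # RAG: similar, distant nodes
            self.memory += [n.text for n in all_nodes
                            if n is not self and n not in self.neighbors
                            and set(n.text.split()) & set(self.text.split())]
        elif action == "Predict":                   # toy in-context prediction
            corpus = [self.text] + self.memory
            self.label = "ml" if any("model" in t for t in corpus) else "other"
        # "NoOp" falls through: the node simply waits this layer out

def run_layers(nodes, layers=3):
    for layer in range(layers):
        for node in nodes:          # every node plans and acts autonomously
            node.act(node.plan(layer), nodes)

a = NodeAgent("graph model paper")
b = NodeAgent("unrelated note")
a.neighbors = [b]                   # b is a's only graph neighbor
run_layers([a, b])
```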

Paper link: https://arxiv.org/pdf/2508.00429

4.8.25

The Agentic Web: when bots become the primary users of the internet

 Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end. 

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory. 

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots. 

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows. 

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome? 

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick. 
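A hedged sketch of what best-of-k selection with a shared verifier looks like in practice. The sampler and scoring head below are toy stand-ins, not MetaStone-S1’s actual networks; only the k = 2/8/32 mapping comes from the paper.

```python
import random

def sample_chain(prompt, rng):
    # Stand-in for the policy sampling one chain of thought plus an answer.
    answer = rng.gauss(0.0, 1.0)
    return {"prompt": prompt, "answer": answer}

def sprm_score(chain):
    # Stand-in for the lightweight scoring head that shares the policy
    # backbone; here, answers closer to 0 simply score higher.
    return -abs(chain["answer"])

def reflective_generate(prompt, mode="low", seed=0):
    k = {"low": 2, "medium": 8, "high": 32}[mode]   # candidate counts per paper
    rng = random.Random(seed)
    candidates = [sample_chain(prompt, rng) for _ in range(k)]
    return max(candidates, key=sprm_score)          # select, don't average

low_pick = reflective_generate("solve x", "low")
high_pick = reflective_generate("solve x", "high")
```

With a fixed seed, the high mode’s candidate pool contains the low mode’s pool as a prefix, so the selected chain can only get better as k grows; that monotonicity is the whole pitch of the dial.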

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example table slice (Pass@1): AIME’24 85.2, AIME’25 73.6, LiveCodeBench 64.2, C-Eval 89.7 for MetaStone-S1-32B-high vs. o3-mini-medium 79.6 / 74.8 / 67.4 / 75.9.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external test-time scaling (TTS) gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations. 

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

Computing Changes How We Think—But Creativity, Not Just GPUs, Will Decide AI’s Next Decade

 In a wide-ranging Bloomberg interview, Dr. Wang Jian (founder of Alibaba Cloud) makes a forceful case that the era of AI “toy problems” is over. I agree. The last two years moved us from brittle demos to systems that reliably draft code, analyze documents, and support human decision-making. His analogy that more compute is like upgrading from a bicycle to a rocket is compelling: when the cost and scale of computation change, the feasible solution space—and our mental models—change with it.

Where I especially align is his view that markets are not just places to sell, but living testbeds where technology matures under real constraints. This resonates with best practices in ML ops: no benchmark, however well chosen, substitutes for deployment feedback. China’s dense competitive landscape, as he notes, creates short iteration loops—startups push features, rivals answer, users vote—accelerating collective learning. In ML terms, it’s a virtuous cycle of data, gradient steps, and evaluation at production scale.

I also appreciate his skepticism about tidy labels like AI → AGI → ASI. In practice, capability is a continuum: larger context windows, better tool use, richer memory, and planning—these blur categorical boundaries. Treating progress as increasing capability across tasks avoids false thresholds and keeps builders focused on measurable gains.

That said, I diverge on several points.

First, Dr. Wang downplays compute as a long-term bottleneck. I’m not fully convinced. While creativity and product insight absolutely dominate value creation, frontier training remains capital- and energy-intensive. Export controls, supply chain variability, and power availability still shape who can train or serve the most advanced models. For many labs, clever data curation and distillation help—but they don’t erase the physics and economics of scaling laws.

Second, on robotics, he frames AI as a new “engine” for an existing vehicle. Conceptually useful—but today’s embodied intelligence also requires tight integration across perception, control, simulation, and safety, not just swapping motors. Progress is real (foundation models for vision and language transfer surprisingly well), yet reliable grasping, long-horizon autonomy, and recovery from edge cases remain research frontiers. The “AI engine” metaphor risks underestimating those system-level challenges.

Third, the notion that no current advantage forms a durable moat is directionally optimistic and healthy for competition; still, moats can emerge from datasets with verified provenance, reinforcement-learning pipelines at scale, distribution, and compliance. Even if individual components commoditize, the orchestration (agents, tools, retrieval, evals, and workflow integration) can compound into real defensibility.

Finally, I agree with his emphasis that creativity is the scarcest input. Where I’d extend the argument is execution discipline: teams need evaluation harnesses, safety checks, and shipping cadences so creativity feeds a measurable loop. In other words, pair inspired ideas with ruthless metrics.

The upshot: Dr. Wang’s thesis—compute reshapes thinking, markets mature tech, creativity drives breakthroughs—captures much of what’s powering AI right now. My caveats don’t negate his vision; they refine it. The winners will be those who marry inventive product design with pragmatic engineering and acknowledge that, even in a marathon, hardware, data, and distribution still set the course.

Hierarchical Reasoning Model: a tiny, brain-inspired model that out-reasons giant CoT LLMs

Most frontier models “reason” by narrating token-by-token chains of thought. Sapient Intelligence’s Hierarchical Reasoning Model (HRM) argues you don’t need that narration—or billions of parameters—to solve hard puzzles. The 27M-parameter model runs two coupled recurrent modules at different timescales (a slow H-module for abstract planning and a fast L-module for detailed computation) to perform deep latent reasoning in a single forward pass. Trained from scratch with no pretraining and no CoT supervision, HRM hits standout scores across inductive-reasoning and search-heavy tasks.

Why it works: depth without the usual pain

HRM’s core trick is hierarchical convergence: the fast L-module iterates to a local equilibrium, then the slow H-module updates once and “resets” context for the next refinement cycle—stacking many effective computation steps without vanishing into a fixed point. To train it efficiently, the authors derive a one-step gradient approximation that avoids backpropagation-through-time, cutting memory from O(T) to O(1) per sequence. 

There’s also an adaptive halting head (a small Q-learner) that decides whether to stop or continue another reasoning segment, enabling “think-more-if-needed” behavior at inference time—useful when a problem demands longer planning. 
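The two-timescale recurrence plus halting can be caricatured with scalars. This toy contraction is my own; HRM’s real modules are trained networks and its halting head is learned, but the control flow has the same shape:

```python
def hrm_forward(x, max_cycles=8, l_steps=8, halt_eps=1e-2, target=1.0):
    """Toy hierarchical convergence: the fast L-module iterates toward a
    local equilibrium, the slow H-module updates once per cycle, and a
    stand-in halting rule (the paper uses a small learned Q-head) decides
    whether to stop or spend another reasoning segment."""
    zH = 0.0
    cycles_used = 0
    for _ in range(max_cycles):
        cycles_used += 1
        zL = 0.0
        for _ in range(l_steps):          # fast timescale: local fixed point
            zL = 0.5 * (zL + x + zH)      # contraction toward x + zH
        zH = zH + 0.25 * (zL - zH)        # slow timescale: one update per cycle
        if abs(zH - target) < halt_eps:   # "think more if needed" halting
            break
    return zH, cycles_used

zH, used = hrm_forward(1.0)
```

Note how the L-state resets each cycle while the H-state accumulates: that is the “reset context, refine again” trick that stacks effective depth without running into a single fixed point.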

The receipts

With roughly 1,000 training examples per task, HRM posts numbers that would make far larger CoT systems blush:

  • ARC-AGI-1: 40.3 %, beating o3-mini-high (34.5), Claude-3.7 8K (21.2) and DeepSeek-R1 (21.0); a Transformer trained directly on IO pairs manages 15.8. 

  • ARC-AGI-2: HRM reaches 5.0 % where strong CoT baselines hover near zero—consistent with the benchmark’s step-up in compositional difficulty. 

  • Sudoku-Extreme (9×9, 1k ex.): 55.0 % accuracy; on the full Sudoku-Extreme-Full (3.83 M puzzles), HRM approaches near-perfect accuracy. 

  • Maze-Hard (30×30, 1k ex.): 74.5 % optimal-path success—where CoT baselines flatline. 

What this means for builders

  • Latent > linguistic reasoning: HRM shows you can get deep, backtracking-style reasoning inside hidden states—no verbose CoT, fewer tokens, lower latency. 

  • Tiny models, big compute depth: By recycling computation through nested recurrent cycles, HRM attains “depth” that standard Transformers don’t, even when you stack layers. 

  • Knob for “thinking time”: The halting mechanism effectively scales compute at inference—handy for tasks like Sudoku where a few extra cycles pay off more than on ARC-style transformations. 

Dataset & evaluation notes

Sudoku-Extreme combines easier Kaggle-style puzzles with community “forum-hard” sets; difficulty is measured by average backtracks (≈22 per puzzle on the new subset—much tougher than common datasets). Maze-Hard requires optimal 30×30 paths; ARC-AGI results follow the official challenge protocols with standard augmentations. 

If the open-sourced code (the paper links a GitHub repo) spurs replication, expect a wave of BPTT-free recurrent designs and “reason-more-on-demand” controls to show up in lightweight agents—especially where token budgets and latency matter more than eloquent chains of thought. 

Paper link: arXiv 2506.21734 (PDF)

Stargate Norway: OpenAI’s First European AI Data Center Bets Big on Clean Power and Local Ecosystems

OpenAI has announced Stargate Norway, its first AI data center initiative in Europe, marking a major step in the company’s plan to place world-class compute closer to the communities that use it. The project debuts under the OpenAI for Countries program, which aims to pair national priorities with frontier-grade AI infrastructure. The announcement was posted on July 31, 2025.

The site will rise in Narvik, Norway, chosen for its abundant hydropower, cool climate, and established industrial base—factors that make it a compelling home for sustainable, at-scale AI. OpenAI frames Stargate Norway as “one of the most ambitious AI infrastructure investments in Europe to date,” designed to boost productivity and growth for developers, researchers, startups, and public bodies across the region. 

Two heavyweight partners anchor the build: Nscale, an AI infrastructure provider with deployments across Europe and North America, and Aker, whose century-long industrial track record in energy makes it a natural fit. Nscale will design and build the facility, and ownership is expected to be a 50/50 joint venture between Nscale and Aker. OpenAI is positioned as an initial offtaker, with the option to scale usage over time through OpenAI for Countries. 

On capacity, the numbers are striking: 230 MW at launch, with ambitions to add another 290 MW as demand grows. The plan targets 100,000 NVIDIA GPUs by the end of 2026, with room to expand significantly thereafter. For a continent grappling with surging AI workloads, that’s meaningful headroom—and a signal that sovereign compute is moving from rhetoric to reality. 

Sustainability is built in, not bolted on. The facility will run entirely on renewable power and incorporate closed-loop, direct-to-chip liquid cooling for high thermal efficiency. Even better, waste heat from the GPU systems will be made available to local low-carbon enterprises, turning a by-product into regional value. This approach pairs performance with environmental responsibility in a way that European stakeholders have been demanding. 

Crucially, OpenAI stresses that priority access will flow to Norway’s AI ecosystem—supporting homegrown startups and scientific teams—while surplus capacity will be available to public and private users across the UK, Nordics, and Northern Europe. That regional framing aims to accelerate Europe’s AI development while strengthening resilience and choice for organizations seeking high-end compute. 

Stargate Norway follows Stargate UAE earlier this year and sits alongside OpenAI’s growing collaborations with European governments, including a recent MOU with the UK Government, partnerships in Estonia’s schools, and expressions of interest for the EU’s AI Gigafactories initiative. It’s part of a larger strategy to meet demand locally and support sovereign AI goals with credible infrastructure. 

As an AI enthusiast, I see Stargate Norway as more than a data center—it’s an ecosystem commitment. By blending renewable energy, advanced cooling, heat-reuse, and regional access policies, OpenAI is sketching a blueprint for how frontier compute can serve communities, not just workloads. If Europe wants AI’s benefits widely shared, this is the kind of build that makes it possible.

1.8.25

Inside Gemini Deep Think: Google’s Gold-Medal Reasoning Engine with a 16-Minute Brain-Cycle

When Google DeepMind quietly flipped the switch on Gemini 2.5 Deep Think, it wasn’t just another toggle in the Gemini app. The same enhanced-reasoning mode had already notched a gold-medal-level score at the 2025 International Mathematical Olympiad (IMO)—solving five of six notoriously brutal problems and tying the human cutoff for gold. That feat put DeepMind shoulder-to-shoulder with OpenAI’s own experimental “gold-IMO” model, announced the very same week.

What makes the IMO special?

Founded in 1959, the IMO pits six pre-university prodigies from each country against six problems spanning algebra, geometry, number theory, and combinatorics. Every question is worth seven points, so 42 is perfection; a score of 35 secured this year’s gold cutoff. DeepMind’s best 2024 system managed silver, but needed more time than the four-and-a-half hours allotted to humans. In 2025, Deep Think hit the gold cutoff within the human time window, using only plain-language prompts instead of formal proof assistants.

Under the hood: parallel minds at work

Deep Think is Gemini 2.5 Pro running in a multi-agent “parallel thinking” mode. Instead of one chain-of-thought, it spins up dozens, scores them against intermediate goals, and fuses the strongest ideas into a final answer. Google says the approach boosts benchmark scores for math, logic, and coding, at the cost of far longer inference times.

A field test from the transcript

In the YouTube walkthrough, the host pastes a 2025 IMO geometry problem into Deep Think. The clock ticks 16 minutes before the first full token arrives—but the model nails the official solution, listing the only valid values of k as 0, 1, 3. A second experiment on an AIME-25 algebra question takes 13 minutes yet again lands the correct answer (204) with detailed derivations. The lesson: breakthroughs come after a coffee break, not in real time.

Beyond math: voxel temples and half-baked Angry Birds

Deep Think’s slow-burn genius extends to generative tasks. Asked to script a colorful 3D “Sala Thai” pavilion in Three.js, the model architected a fully navigable voxel scene—complete with stylized roof eaves—on the first pass. A tougher challenge—re-creating Angry Birds in Pygame—showed its iterative potential: the first build lacked obstacles, but a follow-up prompt produced pigs, wood, glass, and workable physics. Still, each refinement added another ten-plus minutes to the wait.

When speed matters more than brilliance

Because Deep Think withholds partial streams until it has weighed all candidate thoughts, users stare at a blank screen for ten minutes or more. Google engineers admit the mode “isn’t practical for everyday coding” unless you fire a prompt and walk away—then return to review the answer or receive a push notification. For everyday tasks, plain Gemini 2.5 Pro or Flash-Lite may offer better latency-to-value ratios.

How to try it—and what’s next

Deep Think is already live for Gemini Ultra subscribers inside the consumer app, and Google says an API endpoint will roll out in the “next few weeks” to AI Studio and Vertex AI. Once that lands, developers can add a “deep-think” flag to long-form reasoning jobs—think automated theorem proving, contract analysis, or multi-step coding agents.


Bottom line: Gemini Deep Think proves massive parallel reflection can push public models into Olympiad territory, but it also shows there’s no free lunch—each extra IQ point costs time and compute. The next frontier won’t just be smarter LLMs; it will be orchestration layers that decide when a 16-minute think-tank is worth the wait and when a quick, cheaper model will do.



Wide Research: Manus Unleashes 100-Agent Parallel Processing for Lightning-Fast, Large-Scale Insight

 Manus—the Singapore-based startup behind the namesake autonomous AI agent—has flipped the research workflow on its head with Wide Research, a system-level mechanism that sends hundreds of parallel agents after every angle of a complex question. Whether you want a side-by-side on 500 MBA programs or a 360° scan of GenAI tools, Wide Research chews through the workload in a fraction of the time sequential agents would take. 


From Deep to Wide

Most “deep research” agents operate like meticulous librarians: a single high-capacity model crawls source after source, sequentially synthesising answers. It’s thorough—but agonisingly slow at scale. Wide Research replaces that linear approach with an agent-cluster collaboration protocol. Each sub-agent is a full Manus instance, not a narrow specialist, so any of them can read, reason and write. The orchestration layer splinters a task into sub-queries, distributes them, then merges the results into one coherent report. 

Why general-purpose sub-agents matter

Traditional multi-agent designs hard-code roles—“planner,” “coder,” “critic.” Those rigid templates break when a project veers off script. Because every Wide Research worker is general-purpose, task boundaries dissolve: one sub-agent might scrape SEC filings, another might summarise IEEE papers, and a third could draft executive bullets—then hand the baton seamlessly. 


Inside the Architecture

Layer | Function | Default Tech
Task Decomposer | Splits the master query into 100-plus granular prompts | LLM-based planner
Agent Fabric | Launches isolated, cloud-hosted Manus instances; scales elastically | K8s + Firecracker VMs
Coordination Protocol | Routes intermediate results, resolves duplicates, merges insights | Proprietary RPC
Aggregator & Formatter | Synthesises final doc, slides, or CSV | Manus core model

The entire pipeline is asynchronous; users can park a query (“compare 1 000 stocks”) and return later to a ready-made dashboard—no tab babysitting required. 
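The fan-out/merge pattern itself is easy to sketch with `asyncio`; the decomposer, sub-agents, and aggregator below are stand-ins for Manus’s proprietary layers, not its API:

```python
import asyncio

def decompose(master_query, n):
    # Stand-in for the LLM-based task decomposer.
    return [f"{master_query} [shard {i}]" for i in range(n)]

async def sub_agent(sub_query):
    # Stand-in for one full, general-purpose Manus instance.
    await asyncio.sleep(0)            # pretend tool / network latency
    return f"finding for {sub_query}"

async def wide_research(master_query, n_agents=100):
    shards = decompose(master_query, n_agents)
    # Fan-out: all sub-agents run concurrently, not one after another.
    findings = await asyncio.gather(*(sub_agent(s) for s in shards))
    # Merge: the aggregator would synthesise these into one report.
    return {"query": master_query, "findings": list(findings)}

report = asyncio.run(wide_research("compare 1,000 stocks", n_agents=8))
```

The speedup claim rests entirely on that `gather` step: wall-clock time tracks the slowest shard rather than the sum of all of them.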

Performance Snapshot

Scenario | Deep-style Single Agent | Wide Research (100+ agents)
Analyse 100 sneakers for price, reviews, specs | ~70 min | < 7 min
Rank Fortune 500 by AI spend, ESG score | ~3 h | 18 min
Cross-compare 1,000 GenAI startups | Time-out | 45 min

(Internal Manus demo data shown during launch.) 

Early Use Cases

  1. Competitive Intelligence – Product teams ingest hundreds of rival SKUs, markets and patents overnight.

  2. Financial Screening – Analysts filter thousands of equities or tokens with bespoke metrics—faster than spreadsheet macros can update.

  3. Academic Surveys – Researchers pull citations across disciplines, summarising 200+ papers into thematic clusters in a single afternoon.

Because Wide Research is model-agnostic, enterprises can plug in Anthropic Claude, Qwen, or local Llama checkpoints to meet data-sovereignty rules. 


Pricing & Roll-Out

  • Today: Wide Research is live for Pro subscribers (US $199/month).

  • Q3 2025: Gradual access for Plus and Basic tiers.

  • Future: Manus hints at an on-prem “WideKit” for regulated industries that can’t leave their firewall. 


Limitations & Trade-Offs

  • Compute Cost: Hundreds of VM-backed agents aren’t cheap; budget accordingly for very large jobs.

  • Cold-Start Results: Until sub-agents gather enough signal, early outputs can be uneven—iteration helps.

  • Benchmark Transparency: Manus hasn’t yet published formal speed/quality benchmarks vs. sequential baselines, though third-party analyses are emerging. 


The Bigger Picture

Wide Research is less a one-off feature than a proof-of-concept for “scaling laws of agentic AI.” Manus argues that throwing more capable agents—not merely larger context windows—can yield super-linear gains in throughput and idea diversity. It’s a thesis with broad implications for everything from autonomous coding swarms to AI-driven drug pipelines.

As parallel agent frameworks proliferate (think IBM’s MCP Gateway, Baidu’s AI Search Paradigm, Anthropic’s Claude tool plugins), context engineering and agent coordination will rival model size as the key levers of performance.


Key Takeaway

Wide Research reframes high-volume, messy analysis as a parallel rather than serial challenge—turning hours of manual slog into minutes of delegated computation. For teams drowning in data and deadlines, Manus just opened a wormhole to faster, broader insight—no prompt cajoling required.
