Wandering Nomad: Google DeepMind

10.9.25

Embedding retrievers hit a math wall—and DeepMind just mapped it

Vector embeddings power everything from RAG to enterprise search. But a new DeepMind paper argues there’s a theoretical ceiling baked into single-vector retrieval: for any embedding dimension $d$, there exist query-document relevance patterns that no embedding model can represent, no matter the data or training tricks. The authors connect learning-theory and geometric results to IR and then build a deliberately simple dataset, LIMIT, where leading embedders struggle.

The core result, in plain English

Treat each query’s relevant docs as a row in a binary matrix (“qrels”). The paper introduces row-wise thresholdable rank and lower-bounds it via sign-rank to show a fundamental limit: once the number of documents $n$ crosses a critical threshold for a given $d$, there exist top-k sets that cannot be realized by any single-vector embedding retriever. That’s a property of geometry, not optimization.
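To get a feel for the geometry, here is a toy sketch in one dimension. It is not the paper's proof or code, just a hand-rolled illustration: with $d=1$, a query can only rank documents in ascending or descending order of their single coordinate, so one of the three possible top-2 sets over three documents is unreachable. The function names and the example coordinates are made up for illustration.

```python
from itertools import combinations

def top_k_set(query, docs, k):
    """Return the indices of the k docs with the highest dot-product score."""
    scores = [sum(q * x for q, x in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])

def realizable_top_k(docs, queries, k):
    """Collect every distinct top-k set produced by the given queries."""
    return {top_k_set(q, docs, k) for q in queries}

# Three documents embedded in d = 1 (illustrative coordinates).
docs_1d = [(0.1,), (0.5,), (0.9,)]

# In one dimension only the sign of a nonzero query matters,
# so +1 and -1 cover every possible ranking.
queries_1d = [(1.0,), (-1.0,)]

achieved = realizable_top_k(docs_1d, queries_1d, k=2)
all_pairs = {frozenset(p) for p in combinations(range(3), 2)}
missing = all_pairs - achieved

print("realizable top-2 sets:", sorted(sorted(s) for s in achieved))
print("unrealizable top-2 sets:", sorted(sorted(s) for s in missing))
```

The pair made of the smallest and largest coordinate can never be the top 2, whatever the query. The paper's argument is about the same kind of obstruction in higher dimensions.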

LIMIT: a toy task that breaks real systems

To make the math bite, the team instantiates LIMIT with natural-language facts (“Jon Durben likes quokkas and apples…”) that encode all combinations of relevance over a small doc pool. Despite its simplicity, SoTA MTEB models score below 20% recall@100, while classic BM25 is near-perfect, underscoring that the failure is specific to single-vector embedding retrieval.

In a “small” LIMIT (N≈46) sweep, ramping dimensions up to 4096 lifts recall but still doesn’t solve the task; BM25 cruises to 100% at @10/@20. Fine-tuning on in-domain LIMIT data barely helps, indicating intrinsic hardness, not domain shift.
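A rough sketch of what "all combinations of relevance over a small doc pool" means in practice, plus the recall metric used above. This is not the released LIMIT code: the function names are invented here, and the choice of two relevant documents per query is an assumption made for illustration.

```python
from itertools import combinations
from math import comb

def build_dense_qrels(num_docs, k=2):
    """One query per k-subset of documents; each subset is that query's relevant set."""
    return [frozenset(c) for c in combinations(range(num_docs), k)]

def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the relevant docs that appear in the top k of a ranking."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# With a pool of 46 documents and pairs as relevance sets (assumed k=2),
# there are C(46, 2) distinct queries to satisfy.
qrels = build_dense_qrels(46, k=2)
print(len(qrels), comb(46, 2))

# A ranking that surfaces only one of the two relevant docs in its top 2.
example_recall = recall_at_k([3, 1, 2, 0], relevant={0, 1}, k=2)
print(example_recall)
```

Even a tiny pool produces over a thousand distinct relevance patterns, which is why a small dataset can still be hard for a fixed-dimension embedder.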

How this differs from usual benchmark talk

LIMIT’s structure—dense overlap of query relevances—looks nothing like BEIR or typical web QA. Compared across datasets, LIMIT shows far higher “graph density” and query-similarity strength than NQ, HotpotQA, or SciFact, approximating instruction-following IR where prompts combine unrelated items with logical operators.

Numbers that sting

A table of critical document counts shows how quickly trouble arrives as $d$ grows (e.g., $d=4 \Rightarrow n\approx10$; $d=16 \Rightarrow n\approx79$; $d=32 \Rightarrow n\approx296$). Put differently: long before you reach enterprise-scale corpora, some seemingly trivial “return docs X and Y, not Z” requests fall outside what an embedder can express.

What to do about it (and what not to)

Don’t only crank up dimension. Bigger $d$ delays but doesn’t remove the wall.
Consider alternative architectures. Multi-vector approaches (e.g., ColBERT-style), sparse methods, or hybrid stacks escape parts of the limit that bind single-vector embedders. The paper’s head-to-heads hint why BM25 and multi-vector models fare better.
Test against LIMIT-style stressors. The team released datasets on Hugging Face and code on GitHub to reproduce results and probe your own models.
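One common way to build the hybrid stack mentioned above is reciprocal rank fusion (RRF), which merges a sparse ranking and a dense ranking using only rank positions. The paper does not prescribe RRF; it is swapped in here as a well-known example, and the constant `k=60` is the conventional default rather than anything taken from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each doc earns 1 / (k + rank) from every list it appears in;
    ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs from a BM25 retriever and a dense retriever.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers like float to the top, while a document only one retriever found still makes the list.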

Why this matters for RAG and instruction-following IR

Modern agents increasingly ask retrieval systems to honor combinational and logical constraints (“find papers that mention A and B but not C”). The paper shows there’s a mathematical point where single-vector embedders must fail such patterns—explaining why teams often paper over issues with rerankers and handcrafted filters. As instruction-following IR grows, expect more LIMIT-like cases in the wild.
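The "handcrafted filters" teams bolt on often look something like the sketch below: retrieve broadly, then enforce the logical constraint outside the embedder. This is a deliberately naive whitespace-token version with invented names, meant only to show the shape of the workaround.

```python
def satisfies(doc_text, must_have, must_not_have):
    """True if the doc mentions every required term and none of the excluded ones."""
    words = set(doc_text.lower().split())
    has_all = all(term in words for term in must_have)
    has_none = not any(term in words for term in must_not_have)
    return has_all and has_none

def filter_candidates(candidates, must_have, must_not_have):
    """Keep only retrieved docs that honor the boolean constraint, preserving order."""
    return [doc for doc in candidates if satisfies(doc, must_have, must_not_have)]

# Hypothetical candidates returned by a first-stage retriever.
candidates = [
    "paper on alpha and beta methods",
    "paper on alpha beta and gamma methods",
    "paper on beta only",
]

kept = filter_candidates(candidates, must_have=["alpha", "beta"], must_not_have=["gamma"])
print(kept)
```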

Bottom line: embedding-only retrieval won’t scale to every notion of relevance. If your roadmap leans on expressive, compositional queries, plan for hybrid retrieval and reranking—and add LIMIT to your eval suite.

Paper link: arXiv 2508.21038 (PDF)

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.

1.8.25

Inside Gemini Deep Think: Google’s Gold-Medal Reasoning Engine with a 16-Minute Brain-Cycle

When Google DeepMind quietly flipped the switch on Gemini 2.5 Deep Think, it wasn’t just another toggle in the Gemini app. The same enhanced-reasoning mode had already notched a gold-medal-level score at the 2025 International Mathematical Olympiad (IMO), solving five of six notoriously brutal problems and tying the human cutoff for gold. That feat put DeepMind shoulder-to-shoulder with OpenAI’s own experimental “gold-IMO” model, announced the very same week.

What makes the IMO special?

Founded in 1959, the IMO pits six pre-university prodigies from each country against six problems spanning algebra, geometry, number theory, and combinatorics. Every question is worth seven points, so 42 is perfection; a score of 35 secured this year’s gold cutoff. DeepMind’s best 2024 system managed silver, but needed more time than the four-and-a-half hours allotted to humans. In 2025, Deep Think reached gold within the human time window, using only plain-language prompts instead of formal proof assistants.

Under the hood: parallel minds at work

Deep Think is Gemini 2.5 Pro running in a multi-agent “parallel thinking” mode. Instead of one chain-of-thought, it spins up dozens, scores them against intermediate goals, and fuses the strongest ideas into a final answer. Google says the approach boosts benchmark scores for math, logic, and coding, at the cost of far longer inference times.
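Google has not published how Deep Think is implemented, so the following is only a toy analogy for the "spin up many candidates, score, keep the best" loop. Each "chain" here just proposes an integer, the scorer checks how close it is to being a root of $x^2 - 5x + 6$, and the survivors are selected rather than fused. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def propose(seed):
    """Stand-in for one reasoning chain: it simply proposes its seed as a candidate."""
    return seed

def score(candidate):
    """Higher is better: negative absolute residual of x^2 - 5x + 6."""
    return -abs(candidate * candidate - 5 * candidate + 6)

def parallel_think(seeds, keep=2):
    """Run all proposals concurrently, then keep the best-scoring candidates."""
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(propose, seeds))
    ranked = sorted(candidates, key=lambda c: (-score(c), c))
    return ranked[:keep]

best = parallel_think(range(6))
print(best)
```

The cost structure is visible even in the toy: every extra candidate is extra work, and nothing can be returned until all of them have been scored.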

A field test from the transcript

In the YouTube walkthrough, the host pastes a 2025 IMO geometry problem into Deep Think. The clock ticks 16 minutes before the first full token arrives—but the model nails the official solution, listing the only valid values of k as 0, 1, 3. A second experiment on an AIME-25 algebra question takes 13 minutes yet again lands the correct answer (204) with detailed derivations. The lesson: breakthroughs come after a coffee break, not in real time.

Beyond math: voxel temples and half-baked Angry Birds

Deep Think’s slow-burn genius extends to generative tasks. Asked to script a colorful 3D “Sala Thai” pavilion in Three.js, the model architected a fully navigable voxel scene—complete with stylized roof eaves—on the first pass. A tougher challenge—re-creating Angry Birds in Pygame—showed its iterative potential: the first build lacked obstacles, but a follow-up prompt produced pigs, wood, glass, and workable physics. Still, each refinement added another ten-plus minutes to the wait.

When speed matters more than brilliance

Because Deep Think withholds partial streams until it has weighed all candidate thoughts, users stare at a blank screen for ten minutes or more. Google engineers admit the mode “isn’t practical for everyday coding” unless you fire a prompt and walk away, then return to review the answer or receive a push notification. For everyday tasks, plain Gemini 2.5 Pro or Flash-Lite may offer better latency-to-value ratios.

How to try it—and what’s next

Deep Think is already live for Google AI Ultra subscribers inside the consumer app, and Google says an API endpoint will roll out in the “next few weeks” to AI Studio and Vertex AI. Once that lands, developers should be able to route long-form reasoning jobs to it: think automated theorem proving, contract analysis, or multi-step coding agents.

Bottom line: Gemini Deep Think proves massive parallel reflection can push public models into Olympiad territory, but it also shows there’s no free lunch—each extra IQ point costs time and compute. The next frontier won’t just be smarter LLMs; it will be orchestration layers that decide when a 16-minute think-tank is worth the wait and when a quick, cheaper model will do.
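What such an orchestration layer might look like, in its simplest form: a router that sends a job to the slow mode only when the job is hard and the caller can afford the wait. This is a sketch under stated assumptions. The `difficulty` score would have to come from some upstream estimator that is not specified here, the 0.7 threshold is arbitrary, and the mode names are placeholders rather than real API values.

```python
from dataclasses import dataclass

# The 16-minute wait observed in the walkthrough, used as the deep-mode latency estimate.
DEEP_THINK_LATENCY_S = 16 * 60

@dataclass
class Job:
    prompt: str
    latency_budget_s: float  # how long the caller is willing to wait
    difficulty: float        # 0.0 to 1.0, from a hypothetical upstream estimator

def choose_mode(job, difficulty_threshold=0.7):
    """Return "deep" only when the job is hard AND the caller can wait long enough."""
    can_wait = job.latency_budget_s >= DEEP_THINK_LATENCY_S
    is_hard = job.difficulty >= difficulty_threshold
    return "deep" if (can_wait and is_hard) else "fast"

print(choose_mode(Job("prove this lemma", latency_budget_s=3600, difficulty=0.9)))
print(choose_mode(Job("rename a variable", latency_budget_s=3600, difficulty=0.1)))
print(choose_mode(Job("prove this lemma", latency_budget_s=30, difficulty=0.9)))
```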

31.7.25

AlphaEarth Foundations: Google DeepMind’s “Virtual Satellite” Sets a New Baseline for Planet-Scale Mapping

A virtual satellite built from data

On July 30, 2025, Google DeepMind unwrapped AlphaEarth Foundations, an AI model that ingests optical, radar, lidar and climate-simulation feeds and distills them into a single 64-dimensional “embedding field” for every 10 × 10 meter patch of terrestrial land and coastal waters. Think of it as a software satellite constellation: instead of waiting for the next orbital pass, analysts query a unified representation that already encodes land cover, surface materials and temporal change.

How it works

AlphaEarth tackles two long-standing headaches—data overload and inconsistency. First, it merges dozens of public observation streams, weaving them into time-aligned “video” frames of the planet. Second, it compresses those frames 16× more efficiently than previous AI pipelines, slashing storage and compute for downstream tasks. Each embedding becomes a compact, loss-aware summary that models can reason over without re-processing raw pixels.
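To make "reason over embeddings instead of raw pixels" concrete, here is a minimal similarity lookup over 64-dimensional vectors. The vectors are made-up stand-ins, not real AlphaEarth outputs, and cosine similarity is an assumed choice of metric for the illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(reference, candidates):
    """Return the name of the candidate embedding closest to the reference."""
    return max(candidates, key=lambda name: cosine_similarity(reference, candidates[name]))

# Hypothetical 64-dimensional embeddings for three 10 m patches.
forest_a = [1.0] * 32 + [0.0] * 32
forest_b = [0.9] * 32 + [0.1] * 32
water = [0.0] * 32 + [1.0] * 32

closest = most_similar(forest_a, {"forest_b": forest_b, "water": water})
print(closest)
```

A "find more places like this one" query then becomes a nearest-neighbour search over vectors rather than a fresh pass over imagery.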

A leap in accuracy and efficiency

In head-to-head evaluations spanning land-use, surface-property and seasonal-change tasks, AlphaEarth posted a 24% lower error rate than both classical remote-sensing methods and recent deep-learning baselines. Crucially, it excelled when label data was sparse, a sign that its self-supervised pre-training generalises. The accompanying research paper on arXiv highlights consistent out-performance across “diverse mapping evaluations” without fine-tuning.

From blog post to real-world maps

To jump-start adoption, DeepMind and Google Earth Engine released the Satellite Embedding dataset: annual global snapshots containing 1.4 trillion embedding footprints per year. More than 50 organisations—including the UN’s Food and Agriculture Organisation, MapBiomas, the Global Ecosystems Atlas and Stanford University—are already piloting projects that range from rainforest monitoring to precision agriculture. Users report faster map production and higher classification accuracy, even in cloudy tropics or sparsely imaged polar regions.
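A back-of-the-envelope look at what 1.4 trillion footprints per year implies for storage. The bytes-per-dimension values are assumptions for illustration (1 byte for a quantized value, 4 bytes for float32); the post does not say how the dataset is actually encoded.

```python
def embedding_storage_bytes(num_embeddings, dims, bytes_per_dim):
    """Raw size of an embedding collection, ignoring compression and metadata."""
    return num_embeddings * dims * bytes_per_dim

EMBEDDINGS_PER_YEAR = 1_400_000_000_000  # 1.4 trillion, from the release notes above
DIMS = 64

# Assumed encodings: 1 byte per dimension vs. 4 bytes per dimension.
quantized_tb = embedding_storage_bytes(EMBEDDINGS_PER_YEAR, DIMS, 1) / 1e12
float32_tb = embedding_storage_bytes(EMBEDDINGS_PER_YEAR, DIMS, 4) / 1e12

print(f"1 byte/dim: {quantized_tb:.1f} TB per year")
print(f"4 bytes/dim: {float32_tb:.1f} TB per year")
```

Under the 1-byte assumption that is roughly 90 TB per annual snapshot, which is why the precision of each dimension matters as much as the dimension count.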

Why it matters for climate and beyond

Accurate, up-to-date geospatial data underpins decisions on food security, infrastructure and conservation. Yet researchers often juggle incompatible satellite products or wrestle with GPU-hungry vision models. AlphaEarth shrinks that friction: a single API call retrieves embeddings that are both information-dense and provenance-rich, ready for plug-and-play into GIS tools, LLM agents or custom model fine-tunes. Cheaper storage and lower latency also mean national agencies with modest budgets can now run continent-scale analyses weekly instead of yearly.

The road ahead

DeepMind hints at extending the framework to real-time streams and coupling it with Gemini-class reasoning agents capable of answering open-ended “why” and “what-if” questions about Earth systems. For AI builders, the combination of long-context language models and AlphaEarth embeddings could enable chatbots that diagnose crop stress or forecast urban heat islands—all grounded in verifiable pixels.

Bottom line: AlphaEarth Foundations compresses the planet into a query-ready lattice of vectors, handing scientists, policymakers and hobbyist mappers a new lens on Earth’s shifting surface. With open data, documented gains and an Apache-style license, DeepMind has effectively democratized a planetary observatory—one 10-meter square at a time.

22.7.25

Gemini “Deep Think” Hits Gold-Medal Performance at the International Mathematical Olympiad

From Silver to Gold in Twelve Months

Last year, DeepMind’s AlphaGeometry and AlphaProof systems collectively solved four of six IMO problems, earning a silver-medal equivalent. In July 2025 the research team leap-frogged that result: an advanced version of Gemini running in “Deep Think” mode solved five of six tasks for 35 points—crossing the 2025 gold-medal threshold and setting a new AI milestone.

International coordinators graded Gemini’s written solutions using the same rubric applied to student competitors. According to IMO President Gregor Dolinar, graders found the solutions “clear, precise and most of them easy to follow.”

What Makes Deep Think Different?

Technique	Purpose	Impact on Performance
Parallel Thinking	Explores multiple proof avenues simultaneously, then merges the strongest ideas.	Avoids dead-end, single-thread chains of thought.
Reinforcement-Learning Fine-Tune	Trains on curated theorem-proving and problem-solving data with reward signals for conciseness and rigor.	Raises success rate on multi-step reasoning challenges.
High-Quality Solution Corpus	Ingests expertly written IMO proofs plus heuristic “tips & tricks.”	Gives the model stylistic and structural templates for clearer presentation.

These upgrades let Gemini run longer “scratch-pads” internally while staying within a feasible compute budget—no multi-day cluster runs were required, unlike earlier systems.

Benchmark Significance

35 / 42 points → comparable to a top-25-percent human gold medalist.
Perfect scores on five problems; only one combinatorics task eluded the model.
Order-of-magnitude speed-up vs. AlphaGeometry 2 + AlphaProof, which needed days of inference in 2024.

While specialized theorem solvers have mastered narrow domains, Gemini Deep Think is a general LLM—capable of chat, code, and multimodal tasks—now showing elite mathematical reasoning.

Broader Implications

Curriculum Design for AI
Gemini’s success underscores the value of domain-targeted reinforcement learning on top of large-scale pre-training.
Parallel Thinking as a New Primitive
Instead of a single “chain of thought,” future models may default to branch-and-merge reasoning, akin to how human teams brainstorm proofs.
Human–AI Collaboration
DeepMind notes the technique could become a “proof assistant” for mathematicians—surfacing lemmas or counter-examples at gold-medal quality within minutes.
Educational Outreach
Publishing the solutions provides a free study resource for aspiring IMO contestants and teachers, potentially leveling the global playing field.

Limitations & Next Steps

Interpretability: Despite clearer written proofs, the internal decision tree remains opaque—researchers are now probing why certain branches survive the merge.
Generalization: Performance on under-represented areas (e.g., functional equations) still lags; future training will widen topic coverage.
Trust & Verification: Formal proof checkers like Lean are being integrated to machine-verify each Gemini output before publication.
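For readers who have not seen a formal proof checker, this is what a machine-verified statement looks like in Lean 4. It is a deliberately trivial example chosen for illustration, not one of the IMO proofs: if the file compiles, the theorem is checked.

```lean
-- A tiny machine-checked statement: addition on natural numbers commutes.
-- `Nat.add_comm` is a lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The appeal for verification is that no human has to trust the prose: the checker either accepts the proof term or rejects it.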

DeepMind plans to open selected Deep Think capabilities via its Gemini API later this year, with safeguards to prevent misuse in academic competitions.

Key Takeaway

Gemini Deep Think’s gold-medal performance doesn’t just raise the bar for AI mathematics—it redefines what general-purpose language models can achieve when armed with structured parallel reasoning and tailored RL training. The achievement brings researchers a step closer to AI systems that can tackle longstanding open problems and act as partner mathematicians rather than mere calculators.

14.7.25

Google DeepMind Launches GenAI Processors — an Open-Source Python Library for Fast, Parallel, Multimodal Pipelines

Why Google Built GenAI Processors

Modern generative-AI apps juggle many stages: ingesting user data, chunking or pre-processing it, calling one or more models, post-processing the output and streaming results back to the user. Most teams wire these steps together ad-hoc, leading to brittle code and wasted compute.

DeepMind’s answer is GenAI Processors — a modular, async Python library that provides:

A single Processor abstraction – every step (transcription, retrieval, Gemini call, summarisation, etc.) reads an async stream of ProcessorParts and emits another stream, so components snap together like Unix pipes.
Built-in scheduling & back-pressure – the framework transparently parallelises independent steps while preventing slow stages from clogging memory.
First-class Gemini support – ready-made processors for gemini.generate_content, function calling and vision inputs make it easy to swap models or add tool use.
Multimodal parts out of the box – TextPart, ImagePart, AudioPart, VideoPart, plus arbitrary user-defined types enable true cross-media pipelines.
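The "stream in, stream out" idea can be sketched with nothing but the standard library. The code below does not use GenAI Processors at all: it is a hand-rolled analogy built on async generators, with every name invented for illustration, to show why stages that share one stream interface compose like Unix pipes.

```python
import asyncio

async def source(items):
    """Turn a plain list into an async stream of parts."""
    for item in items:
        yield item

async def upper(stream):
    """Stage 1: upper-case every part as it arrives."""
    async for part in stream:
        yield part.upper()

async def exclaim(stream):
    """Stage 2: append an exclamation mark to every part."""
    async for part in stream:
        yield part + "!"

def chain(*stages):
    """Compose stages left to right; each consumes the previous stage's stream."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

async def collect(stream):
    """Drain a stream into a list."""
    return [part async for part in stream]

pipeline = chain(upper, exclaim)
result = asyncio.run(collect(pipeline(source(["hi", "there"]))))
print(result)
```

Because every stage has the same shape, adding, removing or reordering a step is a one-line change to the `chain(...)` call.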

How It Works (A 10-Second Glimpse)

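import asyncio

from genai_processors import content_api, processors, streams

# Illustrative sketch: the processor names below are as given in this post
# and may not match the library's current API; check the repo before use.
pipeline = processors.Chain([
    processors.AudioTranscriber(model="gemini"),
    processors.ChunkText(max_tokens=4_000),
    processors.GeminiGenerator(model="gemini-2.5-pro"),
    processors.MarkdownSummariser()
])

async def main():
    # `async for` is only valid inside a coroutine, so wrap the loop.
    async for part in pipeline(streams.file("meeting.mp3")):
        print(part.as_text())

asyncio.run(main())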

One file → parallel transcription → chunking → long-context Gemini reasoning → markdown summary — all fully streamed.

Performance & Footprint

DeepMind benchmarks show 2-5× throughput improvements versus naïve, sequential asyncio code when processing long podcasts, PDFs or image batches, with negligible memory overhead on a single CPU core. Because each processor is an asyncio coroutine, the same pipeline scales horizontally across threads or micro-services without code changes.
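Back-pressure, listed among the features above, is easiest to see with a bounded queue: a fast producer is forced to wait whenever a slow consumer falls behind, so memory use stays capped. This is a generic asyncio sketch with invented names, not the library's scheduler.

```python
import asyncio

async def producer(queue, items):
    """Push items downstream; `put` suspends while the queue is full."""
    for item in items:
        await queue.put(item)
    await queue.put(None)  # sentinel: no more items

async def consumer(queue, seen_sizes, out):
    """Pull items, recording how full the queue was before each read."""
    while True:
        seen_sizes.append(queue.qsize())
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # yield control, standing in for slow work
        out.append(item * 2)

async def run(items, maxsize=2):
    """Run producer and consumer concurrently over a bounded queue."""
    queue = asyncio.Queue(maxsize=maxsize)
    seen_sizes, out = [], []
    await asyncio.gather(producer(queue, items), consumer(queue, seen_sizes, out))
    return out, max(seen_sizes)

out, peak = asyncio.run(run(list(range(5))))
print(out, "peak queue size:", peak)
```

However many items the producer has, the queue never holds more than `maxsize` of them at once.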

High-Impact Use-Cases

Domain	Pipeline Sketch
Real-time meeting assistant	`AudioStream → Transcribe → Gemini-Summarise → Sentiment → Stream to UI`
Video moderation	`VideoFrames → DetectObjects → UnsafeFilter → Gemini-Caption`
Multilingual customer support	`InboundChat → Translate(LLM) → RetrieveKB → Gemini-Answer → Back-translate`
Code-review bot	`PRDiff → Gemini-Critique → RiskClassifier → PostComment`

Developers can publish their own processors to PyPI; the library discovers and hot-loads them via entry points, encouraging an ecosystem of plug-ins similar to Hugging Face Datasets or LangChain tools.

Getting Started

pip install genai-processors

# then run the example notebooks

Requires Python 3.10+
Works locally, in Vertex AI Workbench or any serverless function

Documentation, Colab tutorials and a growing gallery of 20+ composable processors live in the GitHub repo.

Why It Matters

Developer Velocity – declarative pipelines mean less glue code, faster iteration and simpler reviews.
Efficiency – built-in parallelism squeezes more work out of each GPU minute or token budget.
Extensibility – swap a Gemini call for an open-weight model, add a safety filter, or branch to multiple generators with one line of code.
Open Governance – released under Apache 2.0, inviting community processors for speciality tasks (e.g., medical OCR, geospatial tiling).

Final Takeaway

With GenAI Processors, DeepMind is doing for generative-AI workflows what Pandas did for tabular data: standardising the building blocks so every team can focus on what they want to build, not how to wire it together. If your application touches multiple data types or requires real-time streaming, this library is poised to become an indispensable part of the Gen AI stack.

10.5.25

New Research Compares Fine-Tuning and In-Context Learning for LLM Customization

On May 9, 2025, VentureBeat reported on a collaborative study by Google DeepMind and Stanford University that evaluates two prevalent methods for customizing large language models (LLMs): fine-tuning and in-context learning (ICL). The research indicates that ICL generally provides better generalization capabilities compared to traditional fine-tuning, especially when adapting models to novel tasks.

Understanding Fine-Tuning and In-Context Learning

Fine-tuning involves further training a pre-trained LLM on a specialized dataset, adjusting its internal parameters to acquire new knowledge or skills. In contrast, ICL does not alter the model's parameters; instead, it guides the model by providing examples of the desired task within the input prompt, allowing the model to infer how to handle similar queries.
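In code, the ICL side of that contrast is just prompt assembly: no weights change, the examples simply travel with every request. A minimal sketch, with a made-up `Q:`/`A:` template and invented nonsense facts in the spirit of the study's synthetic data:

```python
def build_icl_prompt(examples, query):
    """Assemble a few-shot prompt: worked examples first, then the open question."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

# Hypothetical in-context examples using invented terms.
examples = [
    ("Who is the parent of glon?", "femp"),
    ("Who is the parent of zib?", "glon"),
]

prompt = build_icl_prompt(examples, "Who is the grandparent of zib?")
print(prompt)
```

The inference-time cost mentioned later follows directly: the prompt grows with every example, and that longer prompt is paid for on each call.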

Experimental Approach

The researchers designed controlled synthetic datasets featuring complex, self-consistent structures, such as imaginary family trees and hierarchies of fictional concepts. To ensure the novelty of the information, they replaced all nouns, adjectives, and verbs with invented terms, preventing any overlap with the models' pre-training data. The models were then tested on various generalization challenges, including logical deductions and reversals.

Key Findings

The study found that, in data-matched settings, ICL led to better generalization than standard fine-tuning. Models utilizing ICL were more adept at tasks like reversing relationships and making logical deductions from the provided context. However, ICL is generally more computationally expensive at inference time, as it requires providing additional context to the model for each use.

Introducing Augmented Fine-Tuning

To combine the strengths of both methods, the researchers proposed an augmented fine-tuning approach. This method involves using the LLM's own ICL capabilities to generate diverse and richly inferred examples, which are then added to the dataset used for fine-tuning. Two main data augmentation strategies were explored:

Local Strategy: Focusing on individual pieces of information, prompting the LLM to rephrase single sentences or draw direct inferences, such as generating reversals.
Global Strategy: Providing the full training dataset as context, then prompting the LLM to generate inferences by linking particular documents or facts with the rest of the information, leading to longer reasoning traces.

Models fine-tuned on these augmented datasets showed significant improvements in generalization, outperforming both standard fine-tuning and plain ICL.
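A rough sketch of the local strategy's "generate reversals" step. In the study the LLM itself writes the augmented examples; here a hand-written inverse-relation table stands in for the model so the example stays runnable, and all names and facts are invented.

```python
def augment_with_reversals(facts, inverse_relations):
    """Return the original (subject, relation, object) facts plus their reversals.

    A fact is reversed only if its relation has a known inverse.
    """
    augmented = list(facts)
    for subject, relation, obj in facts:
        if relation in inverse_relations:
            reversed_fact = (obj, inverse_relations[relation], subject)
            if reversed_fact not in augmented:
                augmented.append(reversed_fact)
    return augmented

# Stand-in for the model's knowledge of how relations invert.
inverse_relations = {
    "is the parent of": "is the child of",
    "is larger than": "is smaller than",
}

# Training facts built from invented terms.
facts = [
    ("femp", "is the parent of", "glon"),
    ("glon", "is larger than", "zib"),
]

training_set = augment_with_reversals(facts, inverse_relations)
for subject, relation, obj in training_set:
    print(subject, relation, obj)
```

Fine-tuning on the enlarged set means the reversed direction is seen during training instead of having to be inferred at test time.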

Implications for Enterprise AI Development

This research offers valuable insights for developers and enterprises aiming to adapt LLMs to specific domains or proprietary information. While ICL provides superior generalization, its computational cost at inference time can be high. Augmented fine-tuning presents a balanced approach, enhancing generalization capabilities while mitigating the continuous computational demands of ICL. By investing in creating ICL-augmented datasets, developers can build fine-tuned models that perform better on diverse, real-world inputs.

7.5.25

Google's Gemini 2.5 Pro I/O Edition: The New Benchmark in AI Coding

Ahead of Google I/O 2025, Google DeepMind introduced the Gemini 2.5 Pro I/O Edition, a new frontier in AI-assisted coding that is quickly becoming the preferred tool for developers. With its enhanced capabilities and interactive app-building features, this edition is now considered the most powerful publicly available AI coding model, outperforming previous leaders like Anthropic’s Claude 3.7 Sonnet.

A Leap Beyond Competitors

Gemini 2.5 Pro I/O Edition marks a significant upgrade in AI model performance and coding accuracy. Developers and testers have noted its consistent success in generating working software applications, notably interactive web apps and simulations, from a single user prompt. This functionality has brought it head-to-head—and even ahead—of OpenAI's GPT-4 and Anthropic’s Claude models.

Unlike its predecessors, the I/O Edition of Gemini 2.5 Pro is specifically optimized for coding tasks and integrated into Google’s developer platforms, offering seamless use with Google AI Studio and Vertex AI. This means developers now have access to an AI model that not only generates high-quality code but also helps visualize and simulate results interactively in-browser.

Tool Integration and Developer Experience

According to developers at companies like Cursor and Replit, Gemini 2.5 Pro I/O has proven especially effective for tool use, latency reduction, and improved response quality. Integration into Vertex AI also makes it enterprise-ready, allowing teams to deploy agents, analyze toolchain performance, and access telemetry for code reliability.

Gemini’s ability to reason across large codebases and update files with human-like comprehension offers a new level of productivity. Replit CEO Amjad Masad noted that Gemini was “the only model that gets close to replacing a junior engineer.”

Early Access and Performance Metrics

Currently available in Google AI Studio and Vertex AI, Gemini 2.5 Pro I/O Edition supports multimodal inputs and outputs, making it suitable for teams that rely on dynamic data and tool interactions. Benchmarks released by Google indicate fewer hallucinations, greater tool call reliability, and an overall better alignment with developer intent compared to its closest rivals.

Though it’s still in limited preview for some functions (such as full IDE integration), feedback from early access users has been overwhelmingly positive. Google plans broader integration across its ecosystem, including Android Studio and Colab.

Implications for the Future of Development

As AI becomes increasingly central to application development, tools like Gemini 2.5 Pro I/O Edition will play a vital role in software engineering workflows. Its ability to reduce the development cycle, automate debugging, and even collaborate with human developers through natural language interfaces positions it as an indispensable asset.

By simplifying complex coding tasks and allowing non-experts to create interactive software, Gemini is democratizing development and paving the way for a new era of AI-powered software engineering.

Conclusion

The launch of Gemini 2.5 Pro I/O Edition represents a pivotal moment in AI development. It signals Google's deep investment in generative AI, not just as a theoretical technology but as a practical, reliable tool for modern developers. As enterprises and individual developers adopt this new model, the boundaries between human and AI collaboration in coding will continue to blur—ushering in an era of faster, smarter, and more accessible software creation.