22.5.25

Google Unveils MedGemma: Advanced Open-Source AI Models for Medical Text and Image Comprehension

 At Google I/O 2025, Google announced the release of MedGemma, a collection of open-source AI models tailored for medical text and image comprehension. Built upon the Gemma 3 architecture, MedGemma aims to assist developers in creating advanced healthcare applications by providing robust tools for analyzing medical data. 

MedGemma Model Variants

MedGemma is available in two distinct versions, each catering to specific needs in medical AI development:

  • MedGemma 4B (Multimodal Model): This 4-billion parameter model integrates both text and image processing capabilities. It employs a SigLIP image encoder pre-trained on diverse de-identified medical images, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. This variant is suitable for tasks like medical image classification and interpretation.

  • MedGemma 27B (Text-Only Model): A larger, 27-billion parameter model focused exclusively on medical text comprehension. It's optimized for tasks requiring deep clinical reasoning and analysis of complex medical literature. 

Key Features and Use Cases

MedGemma offers several features that make it a valuable asset for medical AI development:

  • Medical Image Classification: The 4B model can be adapted for classifying various medical images, aiding in diagnostics and research. 

  • Text-Based Medical Question Answering: Both models can be utilized to develop systems that answer medical questions based on extensive medical literature and data.

  • Integration with Development Tools: MedGemma models are accessible through platforms like Google Cloud Model Garden and Hugging Face, and are supported by resources such as GitHub repositories and Colab notebooks for ease of use and customization. 
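
For developers going the Hugging Face route, a minimal sketch of querying MedGemma with the transformers library might look like the following. The model identifier and prompt are illustrative assumptions; consult the official model card for the exact IDs, prompt format, and usage terms.

```python
# Minimal sketch (not official sample code): querying a MedGemma checkpoint from
# Hugging Face with transformers. The model ID is an assumption; verify it on the
# MedGemma model card and review the Health AI Developer Foundations terms first.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/medgemma-27b-text-it",  # assumed ID for the 27B text-only variant
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": "List radiographic findings commonly associated with community-acquired pneumonia.",
}]
output = generator(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```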

Access and Licensing

Developers interested in leveraging MedGemma can access the models and related resources through Google Cloud Model Garden and Hugging Face, along with the accompanying GitHub repositories and Colab notebooks.

The use of MedGemma is governed by the Health AI Developer Foundations terms of use, ensuring responsible deployment in healthcare settings.

Google's Stitch: Transforming App Development with AI-Powered UI Design

 Google has introduced Stitch, an experimental AI tool from Google Labs designed to bridge the gap between conceptual app ideas and functional user interfaces. Powered by the multimodal Gemini 2.5 Pro model, Stitch enables users to generate UI designs and corresponding frontend code using natural language prompts or visual inputs like sketches and wireframes. 

Key Features of Stitch

  • Natural Language UI Generation: Users can describe their app concepts in plain English, specifying elements like color schemes or user experience goals, and Stitch will generate a corresponding UI design. 

  • Image-Based Design Input: By uploading images such as whiteboard sketches or screenshots, Stitch can interpret and transform them into digital UI designs, facilitating a smoother transition from concept to prototype.

  • Design Variations: Stitch allows for the generation of multiple design variants from a single prompt, enabling users to explore different layouts and styles quickly. 

  • Integration with Development Tools: Users can export designs directly to Figma for further refinement or obtain the frontend code (HTML/CSS) to integrate into their development workflow. 

Getting Started with Stitch

  1. Access Stitch: Visit stitch.withgoogle.com and sign in with your Google account.

  2. Choose Your Platform: Select whether you're designing for mobile or web applications.

  3. Input Your Prompt: Describe your app idea or upload a relevant image to guide the design process.

  4. Review and Iterate: Examine the generated UI designs, explore different variants, and make adjustments as needed.

  5. Export Your Design: Once satisfied, export the design to Figma or download the frontend code to integrate into your project.

Stitch is currently available for free as part of Google Labs, offering developers and designers a powerful tool to accelerate the UI design process and bring app ideas to life more efficiently.

Google Unveils Next-Gen AI Innovations: Veo 3, Gemini 2.5, and AI Mode

 At its annual I/O developer conference, Google announced a suite of advanced AI tools and models, signaling a major leap in artificial intelligence capabilities. Key highlights include the introduction of Veo 3, an AI-powered video generator; Gemini 2.5, featuring enhanced reasoning abilities; and the expansion of AI Mode in Search to all U.S. users. 

Veo 3: Advanced AI Video Generation

Developed by Google DeepMind, Veo 3 is the latest iteration of Google's AI video generation model. It enables users to create high-quality videos from text or image prompts, incorporating realistic motion, lip-syncing, ambient sounds, and dialogue. Veo 3 is accessible through the Gemini app for subscribers of the $249.99/month AI Ultra plan and is integrated with Google's Vertex AI platform for enterprise users.

Gemini 2.5: Enhanced Reasoning with Deep Think

The Gemini 2.5 model introduces "Deep Think," an advanced reasoning mode that allows the AI to consider multiple possibilities simultaneously, enhancing its performance on complex tasks. This capability has led to impressive scores on benchmarks like USAMO 2025 and LiveCodeBench. Deep Think is initially available in the Pro version of Gemini 2.5, with broader availability planned. 
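
For developers who want to experiment with Gemini 2.5 programmatically, a minimal request through the google-genai Python SDK might look like the sketch below. The model identifier is an assumption (check the current model list), and Deep Think itself is described as rolling out in the Gemini app rather than as a flag in this basic call.

```python
# Minimal sketch: a basic Gemini 2.5 request via the google-genai SDK.
# The model name is an assumption; Deep Think availability is separate from this call.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY / GOOGLE_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-pro",  # assumed identifier; consult the current model list
    contents="Sketch a proof that the square root of 2 is irrational.",
)
print(response.text)
```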

AI Mode in Search: Personalized and Agentic Features

Google's AI Mode in Search has been rolled out to all U.S. users, offering a more advanced search experience with features like Deep Search for comprehensive research reports, Live capabilities for real-time visual assistance, and personalization options that incorporate data from users' Google accounts. These enhancements aim to deliver more relevant and context-aware search results.

21.5.25

Google's Jules Aims to Out-Code Codex in the AI Developer Stack

 Google has unveiled Jules, its latest AI-driven coding agent, now available in public beta. Designed to assist developers by autonomously fixing bugs, generating tests, and consulting documentation, Jules operates asynchronously, allowing developers to delegate tasks while focusing on other aspects of their projects.

Key Features of Jules

  • Asynchronous Operation: Jules functions in the background, enabling developers to assign tasks without interrupting their workflow.

  • Integration with GitHub: Seamlessly integrates into GitHub workflows, enhancing code management and collaboration.

  • Powered by Gemini 2.5 Pro: Utilizes Google's advanced language model to understand and process complex coding tasks.

  • Virtual Machine Execution: Runs tasks within a secure virtual environment, ensuring safety and isolation during code execution.

  • Audio Summaries: Provides audio explanations of its processes, aiding in understanding and transparency.

Josh Woodward, Vice President of Google Labs, highlighted Jules' capability to assist developers by handling tasks they prefer to delegate, stating, "People are describing apps into existence." 

Competitive Landscape

Jules enters a competitive field alongside OpenAI's Codex and GitHub's Copilot Agent. While Codex has evolved from a coding model to an agent capable of writing and debugging code, GitHub's Copilot Agent offers similar asynchronous functionalities. Jules differentiates itself with its integration of audio summaries and task execution within virtual machines. 

Community Reception

The developer community has shown enthusiasm for Jules, with early users praising its planning capabilities and task management. One developer noted, "Jules plans first and creates its own tasks. Codex does not. That's major." 

Availability

Currently in public beta, Jules is accessible for free with usage limits. Developers interested in exploring its capabilities can integrate it into their GitHub workflows and experience its asynchronous coding assistance firsthand.

Google Launches NotebookLM Mobile App with Offline Audio and Seamless Source Integration

 Google has officially launched its NotebookLM mobile application for both Android and iOS platforms, bringing the capabilities of its AI-powered research assistant to users on the go. The mobile app mirrors the desktop version's core functionalities, including summarizing uploaded sources and generating AI-driven Audio Overviews, which can be played in the background or offline, catering to users' multitasking needs. 



Key Features of NotebookLM Mobile App

  • Offline Audio Overviews: Users can download AI-generated, podcast-style summaries of their documents for offline listening, making it convenient to stay informed without constant internet access. 

  • Interactive AI Hosts: The app introduces a "Join" feature, allowing users to engage with AI hosts during playback, ask questions, and steer the conversation, enhancing the interactivity of the learning experience. 

  • Seamless Content Sharing: NotebookLM integrates with the device's native share function, enabling users to add content from websites, PDFs, and YouTube videos directly to the app, streamlining the research process. 

  • Availability: The app is available on the Google Play Store for devices running Android 10 or higher, and on the App Store for devices running iOS 17 or later.

The release of the NotebookLM mobile app addresses a significant user demand for mobile accessibility, allowing users to engage with their research materials more flexibly and efficiently. With features tailored for mobile use, such as offline access and interactive summaries, NotebookLM continues to evolve as a versatile tool for students, professionals, and researchers alike.


Reference:
1. https://blog.google/technology/ai/notebooklm-app/

19.5.25

DeepSeek V3: High-Performance Language Modeling with Minimal Hardware Overhead

 DeepSeek-AI has unveiled DeepSeek V3, a large language model (LLM) that delivers high performance while minimizing hardware overhead and maximizing computational efficiency. This advancement positions DeepSeek V3 as a competitive alternative to leading models like GPT-4o and Claude 3.5 Sonnet, offering comparable capabilities with significantly reduced resource requirements. 

Innovative Architectural Design

DeepSeek V3 employs a Mixture-of-Experts (MoE) architecture, featuring 671 billion total parameters with 37 billion active per token. This design allows the model to activate only a subset of parameters during inference, reducing computational load without compromising performance. 
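
To make the "only a subset of parameters is active" point concrete, the toy top-k routing layer below shows the basic MoE mechanism in PyTorch. The sizes and the plain softmax router are illustrative only; they do not reproduce DeepSeek V3's actual design, which additionally uses shared experts and an auxiliary-loss-free load-balancing scheme.

```python
# Toy top-k Mixture-of-Experts layer: each token is routed to k of n experts,
# so only a fraction of the layer's parameters is used per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                              # which tokens chose expert e
            if mask.any():
                tok, slot = mask.nonzero(as_tuple=True)
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

x = torch.randn(4, 64)
print(ToyMoE()(x).shape)  # torch.Size([4, 64]); only 2 of 8 experts ran per token
```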

The model introduces Multi-Head Latent Attention (MLA), enhancing memory efficiency and enabling effective handling of long-context inputs. Additionally, DeepSeek V3 utilizes FP8 mixed-precision training, which balances computational speed and accuracy, further contributing to its efficiency. 

Efficient Training and Deployment

Trained on 14.8 trillion high-quality tokens, DeepSeek V3 underwent supervised fine-tuning and reinforcement learning stages to refine its capabilities. The training process was completed using 2,048 NVIDIA H800 GPUs over 55 days, incurring a total cost of approximately $5.58 million—a fraction of the expenditure associated with comparable models. 
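
As a back-of-the-envelope check on that figure, assuming a rental rate of roughly $2 per H800 GPU-hour (the rate used in the technical report's estimate; treat it as an assumption):

```python
# Rough sanity check of the reported training cost (illustrative only).
gpus, days, usd_per_gpu_hour = 2048, 55, 2.0   # assumed rental rate
gpu_hours = gpus * days * 24                   # ≈ 2.70 million GPU-hours
print(f"{gpu_hours / 1e6:.2f}M GPU-hours ≈ ${gpu_hours * usd_per_gpu_hour / 1e6:.2f}M")
# ≈ $5.4M, in line with the roughly $5.58M reported
```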

The model's training infrastructure was optimized to minimize communication latency and maximize throughput, employing strategies such as overlapping computation and communication, and dynamic load balancing across GPUs. 

Benchmark Performance

DeepSeek V3 demonstrates superior performance across various benchmarks, outperforming open-source models like LLaMA 3.1 and Qwen 2.5, and matching the capabilities of closed-source counterparts such as GPT-4o and Claude 3.5 Sonnet. 

Open-Source Accessibility

Committed to transparency and collaboration, DeepSeek-AI has released DeepSeek V3 under the MIT License, providing the research community with access to its architecture and training methodologies. The model's checkpoints and related resources are available on Hugging Face and GitHub.


References

  1. "This AI Paper from DeepSeek-AI Explores How DeepSeek V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency" – MarkTechPost MarkTechPost

  2. DeepSeek V3 Technical Report – arXiv 

  3. Insights into DeepSeek V3: Scaling Challenges and Reflections on Hardware for AI Architectures

AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications, and Challenges

 A recent study by researchers Ranjan Sapkota, Konstantinos I. Roumeliotis, and Manoj Karkee delves into the nuanced differences between AI Agents and Agentic AI, providing a structured taxonomy, application mapping, and an analysis of the challenges inherent to each paradigm. 

Defining AI Agents and Agentic AI

  • AI Agents: These are modular systems primarily driven by Large Language Models (LLMs) and Large Image Models (LIMs), designed for narrow, task-specific automation. They often rely on prompt engineering and tool integration to perform specific functions.

  • Agentic AI: Representing a paradigmatic shift, Agentic AI systems are characterized by multi-agent collaboration, dynamic task decomposition, persistent memory, and orchestrated autonomy. They move beyond isolated tasks to coordinated systems capable of complex decision-making processes.

Architectural Evolution

The transition from AI Agents to Agentic AI involves significant architectural enhancements:

  • AI Agents: Utilize core reasoning components like LLMs, augmented with tools to enhance functionality.

  • Agentic AI: Incorporate advanced architectural components that allow for higher levels of autonomy and coordination among multiple agents, enabling more sophisticated and context-aware operations.

Applications

  • AI Agents: Commonly applied in areas such as customer support, scheduling, and data summarization, where tasks are well-defined and require specific responses.

  • Agentic AI: Find applications in more complex domains like research automation, robotic coordination, and medical decision support, where tasks are dynamic and require adaptive, collaborative problem-solving.

Challenges and Proposed Solutions

Both paradigms face unique challenges:

  • AI Agents: Issues like hallucination and brittleness, where the system may produce inaccurate or nonsensical outputs.

  • Agentic AI: Challenges include emergent behavior and coordination failures among agents.

To address these, the study suggests solutions such as ReAct loops, Retrieval-Augmented Generation (RAG), orchestration layers, and causal modeling to enhance system robustness and explainability.
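
As a schematic of one of those mitigation patterns, the sketch below shows a bare-bones ReAct-style loop in which reasoning steps alternate with tool calls and observations. The call_llm and run_tool helpers are hypothetical placeholders, not any particular framework's API.

```python
# Schematic ReAct-style loop: the model interleaves Thought/Action steps with tool
# observations, grounding each next step in retrieved evidence.
def call_llm(transcript: str) -> str:
    # Hypothetical placeholder: would normally call an LLM with the transcript as context.
    return "Thought: I have enough information.\nFinal Answer: (model's answer here)"

def run_tool(action: str) -> str:
    # Hypothetical placeholder: would normally dispatch to a retrieval or calculator tool.
    return f"(result of running: {action})"

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:                        # the model decided it is done
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:                              # the model requested a tool
            observation = run_tool(step.split("Action:", 1)[1].strip())
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."

print(react_loop("Has this experiment been run before?"))
```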


References

  1. Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. arXiv preprint arXiv:2505.10468.

Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks

 Researchers from Tsinghua University and ModelBest have introduced Ultra-FineWeb, a large-scale, high-quality dataset comprising approximately 1 trillion English tokens and 120 billion Chinese tokens. This dataset aims to enhance the performance of large language models (LLMs) by providing cleaner and more efficient training data.

Efficient Data Filtering Pipeline

The creation of Ultra-FineWeb involved an efficient data filtering pipeline that addresses two main challenges in data preparation for LLMs:

  1. Lack of Efficient Data Verification Strategy:
    Traditional methods struggle to provide timely feedback on data quality. To overcome this, the researchers introduced a computationally efficient verification strategy that enables rapid evaluation of data impact on LLM training with minimal computational cost.

  2. Selection of Seed Data for Classifier Training:
    Selecting appropriate seed data often relies heavily on human expertise, introducing subjectivity. The team optimized the selection process by integrating the verification strategy, improving filtering efficiency and classifier robustness.

A lightweight classifier based on fastText was employed to efficiently filter high-quality data, significantly reducing inference costs compared to LLM-based classifiers.
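
A minimal sketch of that kind of fastText-based filter is shown below; the file name, labels, and threshold are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a fastText quality filter in the spirit of the Ultra-FineWeb
# pipeline: train on labeled seed data, then keep documents predicted high-quality.
import fasttext

# seed.txt: one document per line, prefixed with __label__hq or __label__lq (assumed format)
model = fasttext.train_supervised(input="seed.txt", lr=0.1, epoch=5, wordNgrams=2)

def keep(document: str, threshold: float = 0.9) -> bool:
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

corpus = [
    "A well-written explanation of protein folding and its role in disease ...",
    "click here buy now !!! free $$$",
]
filtered = [doc for doc in corpus if keep(doc)]
```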

Benchmark Performance

Empirical results demonstrate that LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, including MMLU, ARC, CommonSenseQA, and others. The dataset's quality contributes to enhanced training efficiency and model accuracy.

Availability

Ultra-FineWeb is available on Hugging Face, providing researchers and developers with access to this extensive dataset for training and evaluating LLMs.
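
A minimal sketch of streaming the dataset with the Hugging Face datasets library follows; the repository ID and split name are assumptions, so verify them on the dataset card.

```python
# Minimal sketch: stream Ultra-FineWeb from the Hugging Face Hub without downloading
# the full corpus. The repository ID and split are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("openbmb/Ultra-FineWeb", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)        # inspect the first few records
    if i == 2:
        break
```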


References

  1. Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks – MarkTechPost. 

  2. Ultra-FineWeb Dataset on Hugging Face. 

  3. Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

17.5.25

How FutureHouse’s AI Agents Are Reshaping Scientific Discovery

In a major leap for scientific research, FutureHouse—a nonprofit backed by former Google CEO Eric Schmidt—has introduced a powerful lineup of AI research agents aimed at accelerating the pace of scientific discovery. Built to support scientists across disciplines, these agents automate key parts of the research workflow—from literature search to chemical synthesis planning—reducing bottlenecks and enhancing productivity.

This suite includes four primary agents: Crow, Falcon, Owl, and Phoenix, each specialized in a unique aspect of the research pipeline. Together, they form a comprehensive AI-powered infrastructure for modern science.


Meet the AI Agents Changing Science

1. Crow – The Concise Search Specialist

Crow acts as a rapid-response research assistant. It provides short, precise answers to technical queries by intelligently retrieving evidence from full-text scientific papers. Designed for speed and accuracy, it’s especially useful for API-based interactions, where precision and performance matter most. Crow is built on top of FutureHouse’s custom PaperQA2 architecture.

2. Falcon – Deep Research Assistant

Falcon takes things further by conducting expansive literature reviews. It produces full-length research reports in response to broader or more open-ended scientific questions. By analyzing papers, data sources, and context-rich materials, Falcon allows researchers to dive deep into topics without manually sorting through endless PDFs.

3. Owl – Precedent Investigator

Owl helps scientists find out whether an experiment or research idea has already been executed. This is crucial for grant applications, patent filings, and ensuring that researchers don’t waste time reinventing the wheel. By surfacing related studies and experiments, Owl enables more informed, original work.

4. Phoenix – The Chemistry Innovator

Phoenix is built for early-stage chemistry research. Leveraging cheminformatics tools, it assists in designing molecules, suggesting synthetic routes, and evaluating chemical feasibility. It builds upon an earlier FutureHouse prototype called ChemCrow and remains in active development as a sandbox tool for chemists to explore and provide feedback.


Performance and Potential

In benchmark tests, Crow, Falcon, and Owl outperformed PhD-level biologists on scientific retrieval and reasoning tasks. Unlike many AI tools that only read paper abstracts or summaries, these agents consume and analyze full-text documents, allowing them to detect nuanced issues like methodological flaws or statistical limitations.

Although Phoenix is still in its experimental phase and may sometimes produce errors, it represents an important step toward automating complex tasks in synthetic chemistry.


Why This Matters

The bottlenecks of modern science often lie not in experimentation, but in navigating the overwhelming volume of prior work. By offloading repetitive and time-consuming research tasks to AI, FutureHouse's agents free up scientists to focus on creativity, innovation, and critical thinking.

These tools are also being made openly available for scientists and research institutions, fostering a collaborative environment for AI-augmented science.


Final Takeaway

FutureHouse’s AI agents aren’t just productivity boosters—they’re a vision of a new research paradigm. By augmenting human researchers with scalable, intelligent assistants, we’re witnessing the early stages of a revolution in how science is done. As these tools evolve, they hold the potential to dramatically accelerate scientific discovery across disciplines.


References

  1. Automate Your Research Workflows Using AI Agents for Scientific Discovery by FutureHouse – MarkTechPost

  2. FutureHouse Official Website

  3. FutureHouse Research Agent Platform