
9.6.25

Google Open‑Sources a Full‑Stack Agent Framework Powered by Gemini 2.5 & LangGraph

Google has unveiled an open-source, full-stack agent framework that combines Gemini 2.5 and LangGraph to create conversational agents capable of multi-step reasoning, iterative web search, self-reflection, and synthesis, all wrapped in a React frontend and Python backend.


🔧 Architecture & Workflow

The system integrates these components:

  • React frontend: User interface built with Vite, Tailwind CSS, and Shadcn UI.

  • LangGraph backend: Orchestrates the agent workflow, using FastAPI for API handling and Redis/PostgreSQL for state management (a minimal wiring sketch follows this list).

  • Gemini 2.5 models: Power each stage—dynamic query generation, reflection-based reasoning, and final answer synthesis.
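
To make the wiring concrete, here is a minimal Python sketch of how such a backend might expose the agent through FastAPI. The endpoint path, request shape, and the stubbed run_agent function are illustrative assumptions, not the quickstart's actual code:

    # Minimal sketch: a FastAPI endpoint fronting the agent workflow.
    # The run_agent stub stands in for the compiled LangGraph graph
    # (query generation -> search -> reflection -> synthesis).
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        question: str

    async def run_agent(question: str) -> dict:
        # Placeholder: the real backend would invoke the LangGraph
        # state machine here and persist state in Redis/PostgreSQL.
        return {"answer": f"(answer for: {question})", "citations": []}

    @app.post("/research")
    async def research(query: Query):
        return await run_agent(query.question)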


🧠 Agent Reasoning Pipeline

  1. Query Generation
    The agent kicks off by generating targeted web search queries via Gemini 2.5.

  2. Web Research
    Uses Google Search API to fetch relevant documents.

  3. Reflective Reasoning
    The agent analyzes results for "knowledge gaps" and decides whether to continue searching, which is essential for deep, accurate answers.

  4. Iterative Looping
    It refines queries and repeats the search-reflect cycle until satisfactory results are obtained (see the sketch after this list).

  5. Final Synthesis
    Gemini consolidates the collected information into a coherent, citation-supported answer.
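
Below is a minimal sketch of this pipeline as a LangGraph state machine. The node logic is stubbed (in the real quickstart, Gemini 2.5 generates the queries and the Google Search API fetches documents), so treat it as the shape of the loop rather than the actual implementation:

    # Sketch of the search-reflect-synthesize loop in LangGraph.
    # All model and search calls are replaced with stubs.
    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END

    class State(TypedDict):
        question: str
        docs: list
        loops: int
        answer: str

    def search(state: State) -> dict:
        # Stub: pretend each pass retrieves one more document.
        return {"docs": state["docs"] + [f"doc-{state['loops']}"],
                "loops": state["loops"] + 1}

    def reflect(state: State) -> str:
        # Stub gap check: stop after three passes.
        return "synthesize" if state["loops"] >= 3 else "search"

    def synthesize(state: State) -> dict:
        return {"answer": f"Answer grounded in {len(state['docs'])} sources."}

    graph = StateGraph(State)
    graph.add_node("search", search)
    graph.add_node("synthesize", synthesize)
    graph.add_edge(START, "search")
    graph.add_conditional_edges("search", reflect,
                                {"search": "search", "synthesize": "synthesize"})
    graph.add_edge("synthesize", END)

    agent = graph.compile()
    print(agent.invoke({"question": "example", "docs": [], "loops": 0, "answer": ""}))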


🚀 Developer-Friendly

  • Hot-reload support: Enables real-time updates during development for both frontend and backend.

  • Full-stack quickstart repo: Available on GitHub with a Docker Compose setup for local deployment using Gemini and LangGraph.

  • Robust infrastructure: Built with LangGraph, FastAPI, Redis, and PostgreSQL for scalable research applications.


🎯 Why It Matters

This framework provides a transparent, research-grade AI pipeline: query ➞ search ➞ reflect ➞ iterate ➞ synthesize. It serves as a foundation for building deeper, more reliable AI assistants capable of explainable and verifiable reasoning—ideal for academic, enterprise, or developer research tools.


⚙️ Getting Started

To get hands-on:

  • Clone the Gemini Fullstack LangGraph Quickstart from GitHub.

  • Create a .env file containing your GEMINI_API_KEY.

  • Run make dev to start the full-stack environment, or use docker-compose for a production-style setup (full commands below).
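
In full, the setup amounts to a few commands, assuming the repository lives under the google-gemini GitHub organization as gemini-fullstack-langgraph-quickstart (verify the exact path and .env location against the repo's README):

    git clone https://github.com/google-gemini/gemini-fullstack-langgraph-quickstart.git
    cd gemini-fullstack-langgraph-quickstart
    echo "GEMINI_API_KEY=your-key-here" > .env   # location per the repo README
    make dev            # hot-reloading dev servers for frontend + backend
    # or, for a production-style run:
    docker-compose up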

This tooling lowers the barrier to building research-first agents, making multi-agent workflows more practical for developers.


✅ Final Takeaway

Google’s open-source agent stack is a milestone: it enables anyone to deploy intelligent agents capable of deep research workflows with citation transparency. By combining Gemini's model strength, LangGraph orchestration, and a polished React UI, this stack empowers users to build powerful, self-improving research agents faster.

Google’s MASS Revolutionizes Multi-Agent AI by Automating Prompt and Topology Optimization

 Designing multi-agent AI systems—where several AI "agents" collaborate—has traditionally depended on manual tuning of prompt instructions and agent communication structures (topologies). Google AI, in partnership with Cambridge researchers, is aiming to change that with their new Multi-Agent System Search (MASS) framework. MASS brings automation to the design process, ensuring consistent performance gains across complex domains.


🧠 What MASS Actually Does

MASS performs a three-stage automated optimization that iteratively refines:

  1. Block-Level Prompt Tuning
    Fine-tunes individual agent prompts via local search—sharpening their roles (think “questioner”, “solver”).

  2. Topology Optimization
    Identifies the best agent interaction structure. It prunes and evaluates possible communication workflows to find the most impactful design.

  3. Workflow-Level Prompt Refinement
    Final tuning of prompts once the best network topology is set.

By alternating prompt and topology adjustments, MASS achieves optimization that surpasses previous methods that tackled only one dimension.
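
A toy Python sketch of the three-stage flow is below. This is a conceptual stand-in, not Google's implementation: evaluate is a random stub where a real system would score a (prompts, topology) candidate on validation tasks:

    # Conceptual sketch of MASS's three stages with stubbed evaluation.
    import random

    def evaluate(prompts: dict, topology: str) -> float:
        return random.random()  # stub: real MASS scores validation tasks

    def best(candidates, score):
        return max(candidates, key=score)

    agents = ["questioner", "solver", "verifier"]
    prompts = {a: f"You are the {a}." for a in agents}
    topologies = ["chain", "tree", "ring", "debate"]

    # Stage 1: block-level prompt tuning, one agent at a time.
    for a in agents:
        variants = [prompts[a] + s for s in ["", " Be concise.", " Think step by step."]]
        prompts[a] = best(variants, lambda p: evaluate({**prompts, a: p}, "chain"))

    # Stage 2: topology search with the tuned prompts held fixed.
    topology = best(topologies, lambda t: evaluate(prompts, t))

    # Stage 3: workflow-level prompt refinement on the winning topology.
    for a in agents:
        variants = [prompts[a], prompts[a] + " Coordinate with the other agents."]
        prompts[a] = best(variants, lambda p: evaluate({**prompts, a: p}, topology))

    print(topology, prompts)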


🏅 Why It Matters

  • Benchmarked Success: MASS-designed agent systems outperform AFlow and ADAS on challenging benchmarks like MATH, LiveCodeBench, and multi-hop question answering.

  • Reduced Manual Overhead: Designers no longer need to trial-and-error their way through thousands of prompt-topology combinations.

  • Extended to Real-World Tasks: Whether for reasoning, coding, or decision-making, this framework is broadly applicable across domains.


💬 Community Reactions

Reddit’s r/machinelearningnews highlighted MASS’s leap beyond isolated prompt or topology tuning:

“Multi-Agent System Search (MASS) … reduces manual effort while achieving state‑of‑the‑art performance on tasks like reasoning, multi‑hop QA, and code generation.”


📘 Technical Deep Dive

Originating from a February 2025 paper by Zhou et al., MASS represents a methodological advance in agentic AI:

  • Agents are modular: designed for distinct roles through prompts.

  • Topology defines agent communication patterns: linear chain, tree, ring, etc. (illustrated after this list).

  • MASS explores both prompt and topology spaces, sequentially optimizing them across three stages.

  • Final systems demonstrate robustness not just in benchmarks but as a repeatable design methodology.
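
As a toy illustration of the topology dimension (not MASS code), a communication pattern can be written as an adjacency list mapping each agent role to the roles it passes messages to:

    # Toy topologies as adjacency lists over agent roles.
    chain = {"questioner": ["solver"], "solver": ["verifier"], "verifier": []}
    ring = {"questioner": ["solver"], "solver": ["verifier"], "verifier": ["questioner"]}
    tree = {"planner": ["solver_a", "solver_b"], "solver_a": [], "solver_b": []}

MASS's second stage amounts to searching over such structures, pruning clearly weak ones, rather than committing to a hand-picked pattern.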


🚀 Wider Implications

  • Democratizing Agent Design: Non-experts in prompt engineering can deploy effective agent systems produced by automated search.

  • Adaptability: Potential for expanding MASS to dynamic, real-world settings like real-time planning and adaptive workflows.

  • Innovation Accelerator: Encourages research into auto-tuned multi-agent frameworks for fields like robotics, data pipelines, and interactive assistants.


🧭 Looking Ahead

As Google moves deeper into its “agentic era”—with initiatives like Project Mariner and Gemini's Agent Mode—MASS offers a scalable blueprint for future agentic AI applications. Expect to see frameworks that not only generate prompts but also self-optimize their agent networks for performance and efficiency.

22.5.25

Google Unveils MedGemma: Advanced Open-Source AI Models for Medical Text and Image Comprehension

 At Google I/O 2025, Google announced the release of MedGemma, a collection of open-source AI models tailored for medical text and image comprehension. Built upon the Gemma 3 architecture, MedGemma aims to assist developers in creating advanced healthcare applications by providing robust tools for analyzing medical data. 

MedGemma Model Variants

MedGemma is available in two distinct versions, each catering to specific needs in medical AI development:

  • MedGemma 4B (Multimodal Model): This 4-billion-parameter model integrates both text and image processing capabilities. It employs a SigLIP image encoder pre-trained on diverse de-identified medical images, including chest X-rays, dermatology and ophthalmology images, and histopathology slides. This variant is suitable for tasks like medical image classification and interpretation.

  • MedGemma 27B (Text-Only Model): A larger, 27-billion parameter model focused exclusively on medical text comprehension. It's optimized for tasks requiring deep clinical reasoning and analysis of complex medical literature. 

Key Features and Use Cases

MedGemma offers several features that make it a valuable asset for medical AI development:

  • Medical Image Classification: The 4B model can be adapted for classifying various medical images, aiding in diagnostics and research. 

  • Text-Based Medical Question Answering: Both models can be utilized to develop systems that answer medical questions based on extensive medical literature and data.

  • Integration with Development Tools: MedGemma models are accessible through platforms like Google Cloud Model Garden and Hugging Face, and are supported by resources such as GitHub repositories and Colab notebooks for ease of use and customization (a minimal usage sketch follows this list).
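
If the models are published on Hugging Face as the announcement indicates, usage can be sketched with the transformers image-text-to-text pipeline. The model id "google/medgemma-4b-it" and the placeholder image URL are assumptions to verify against the official model card:

    # Minimal sketch: querying the MedGemma 4B multimodal variant.
    # Model id and image URL are assumptions; check the model card, and
    # note that use is governed by the Health AI Developer Foundations terms.
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chest_xray.png"},  # placeholder
            {"type": "text", "text": "Describe notable findings in this X-ray."},
        ],
    }]

    result = pipe(text=messages, max_new_tokens=200)
    print(result[0]["generated_text"][-1]["content"])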

Access and Licensing

Developers interested in leveraging MedGemma can access the models and related resources through Google Cloud Model Garden and Hugging Face, along with the accompanying GitHub repositories and Colab notebooks.

The use of MedGemma is governed by the Health AI Developer Foundations terms of use, ensuring responsible deployment in healthcare settings.

Google's Stitch: Transforming App Development with AI-Powered UI Design

 Google has introduced Stitch, an experimental AI tool from Google Labs designed to bridge the gap between conceptual app ideas and functional user interfaces. Powered by the multimodal Gemini 2.5 Pro model, Stitch enables users to generate UI designs and corresponding frontend code using natural language prompts or visual inputs like sketches and wireframes. 

Key Features of Stitch

  • Natural Language UI Generation: Users can describe their app concepts in plain English, specifying elements like color schemes or user experience goals, and Stitch will generate a corresponding UI design. 

  • Image-Based Design Input: By uploading images such as whiteboard sketches or screenshots, Stitch can interpret and transform them into digital UI designs, facilitating a smoother transition from concept to prototype.

  • Design Variations: Stitch allows for the generation of multiple design variants from a single prompt, enabling users to explore different layouts and styles quickly. 

  • Integration with Development Tools: Users can export designs directly to Figma for further refinement or obtain the frontend code (HTML/CSS) to integrate into their development workflow. 

Getting Started with Stitch

  1. Access Stitch: Visit stitch.withgoogle.com and sign in with your Google account.

  2. Choose Your Platform: Select whether you're designing for mobile or web applications.

  3. Input Your Prompt: Describe your app idea or upload a relevant image to guide the design process.

  4. Review and Iterate: Examine the generated UI designs, explore different variants, and make adjustments as needed.

  5. Export Your Design: Once satisfied, export the design to Figma or download the frontend code to integrate into your project.

Stitch is currently available for free as part of Google Labs, offering developers and designers a powerful tool to accelerate the UI design process and bring app ideas to life more efficiently.

21.5.25

Google's Jules Aims to Out-Code Codex in the AI Developer Stack

 Google has unveiled Jules, its latest AI-driven coding agent, now available in public beta. Designed to assist developers by autonomously fixing bugs, generating tests, and consulting documentation, Jules operates asynchronously, allowing developers to delegate tasks while focusing on other aspects of their projects.

Key Features of Jules

  • Asynchronous Operation: Jules functions in the background, enabling developers to assign tasks without interrupting their workflow.

  • Integration with GitHub: Seamlessly integrates into GitHub workflows, enhancing code management and collaboration.

  • Powered by Gemini 2.5 Pro: Utilizes Google's advanced language model to understand and process complex coding tasks.

  • Virtual Machine Execution: Runs tasks within a secure virtual environment, ensuring safety and isolation during code execution.

  • Audio Summaries: Provides audio explanations of its processes, aiding in understanding and transparency.

Josh Woodward, Vice President of Google Labs, highlighted Jules' capability to assist developers by handling tasks they prefer to delegate, stating, "People are describing apps into existence." 

Competitive Landscape

Jules enters a competitive field alongside OpenAI's Codex and GitHub's Copilot Agent. While Codex has evolved from a coding model to an agent capable of writing and debugging code, GitHub's Copilot Agent offers similar asynchronous functionalities. Jules differentiates itself with its integration of audio summaries and task execution within virtual machines. 

Community Reception

The developer community has shown enthusiasm for Jules, with early users praising its planning capabilities and task management. One developer noted, "Jules plans first and creates its own tasks. Codex does not. That's major." 

Availability

Currently in public beta, Jules is accessible for free with usage limits. Developers interested in exploring its capabilities can integrate it into their GitHub workflows and experience its asynchronous coding assistance firsthand.

Google Launches NotebookLM Mobile App with Offline Audio and Seamless Source Integration

 Google has officially launched its NotebookLM mobile application for both Android and iOS platforms, bringing the capabilities of its AI-powered research assistant to users on the go. The mobile app mirrors the desktop version's core functionalities, including summarizing uploaded sources and generating AI-driven Audio Overviews, which can be played in the background or offline, catering to users' multitasking needs. 



Key Features of NotebookLM Mobile App

  • Offline Audio Overviews: Users can download AI-generated, podcast-style summaries of their documents for offline listening, making it convenient to stay informed without constant internet access. 

  • Interactive AI Hosts: The app introduces a "Join" feature, allowing users to engage with AI hosts during playback, ask questions, and steer the conversation, enhancing the interactivity of the learning experience. 

  • Seamless Content Sharing: NotebookLM integrates with the device's native share function, enabling users to add content from websites, PDFs, and YouTube videos directly to the app, streamlining the research process. 

  • Availability: The app is available for download on the Google Play Store for Android devices running version 10 or higher, and on the App Store for iOS devices running iOS 17 or later. 

The release of the NotebookLM mobile app addresses a significant user demand for mobile accessibility, allowing users to engage with their research materials more flexibly and efficiently. With features tailored for mobile use, such as offline access and interactive summaries, NotebookLM continues to evolve as a versatile tool for students, professionals, and researchers alike.


Reference:
1. https://blog.google/technology/ai/notebooklm-app/

5.5.25

Google’s AI Mode Gets Major Upgrade With New Features and Broader Availability

 Google is taking a big step forward with AI Mode, its experimental feature designed to answer complex, multi-part queries and support deep, follow-up-driven search conversations—directly inside Google Search.

Initially launched in March as a response to tools like Perplexity AI and ChatGPT Search, AI Mode is now available to all U.S. users over 18 who are enrolled in Google Labs. Even bigger: Google is removing the waitlist and beginning to test a dedicated AI Mode tab within Search, visible to a small group of U.S. users.

What’s New in AI Mode?

Along with expanded access, Google is rolling out several powerful new features designed to make AI Mode more practical for everyday searches:

🔍 Visual Place & Product Cards

You can now see tappable cards with key info when searching for restaurants, salons, or stores—like ratings, reviews, hours, and even how busy a place is in real time.

🛍️ Smarter Shopping

Product searches now include real-time pricing, promotions, images, shipping details, and local inventory. For example, if you ask for a “foldable camping chair under $100 that fits in a backpack,” you’ll get a tailored product list with links to buy.

🔁 Search Continuity

Users can pick up where they left off in ongoing searches. On desktop, a new left-side panel shows previous AI Mode interactions, letting you revisit answers and ask follow-ups—ideal for planning trips or managing research-heavy tasks.


Why It Matters

With these updates, Google is clearly positioning AI Mode as a serious contender in the AI-powered search space. From hyper-personalized recommendations to deep dive follow-ups, it’s bridging the gap between traditional search and AI assistants—right in the tool billions already use.
