
15.8.25

Gemma 3 270M: Google’s Tiny, Task-Tunable Model Built for On-Device Speed and Efficiency

 Google has introduced Gemma 3 270M, a compact 270-million-parameter model designed specifically for task-focused fine-tuning and on-device deployment. Unlike general chat models, this release emphasizes reliable instruction-following, tight text structuring, and extremely low power draw—ideal for teams that want small, specialized models they can train and ship quickly. 

What’s inside a “270M” Gemma

Gemma 3 270M splits its parameters into ~170M for embeddings and ~100M for transformer blocks. The unusually large 256k token vocabulary helps it handle rare and domain-specific tokens, making it a strong base for targeted tasks across languages and verticals. In Google’s IFEval tests, the model sets a new bar for instruction adherence in its size class. 
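
A quick back-of-envelope calculation makes that split plausible; the hidden width below is an assumption for illustration, not a published figure, but it shows how a 256k-entry vocabulary dominates the parameter budget at this scale, and roughly what the weights cost once quantized to INT4.

    # Back-of-envelope check on the parameter split (hidden width is an assumed,
    # illustrative value, not an official spec).
    vocab_size = 262_144                 # the ~256k-entry vocabulary
    hidden_dim = 640                     # assumed embedding width
    total_params = 270_000_000

    embedding_params = vocab_size * hidden_dim            # ~168M, close to the quoted ~170M
    transformer_params = total_params - embedding_params  # ~100M for the transformer blocks
    print(f"embeddings : ~{embedding_params / 1e6:.0f}M")
    print(f"transformer: ~{transformer_params / 1e6:.0f}M")

    # Rough INT4 weight footprint: ~0.5 bytes per parameter, ignoring overhead.
    print(f"INT4 weights: ~{total_params * 0.5 / 1e6:.0f} MB")   # ~135 MB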

Built for batteries, browsers, and bare-metal

Efficiency is the headline: Google reports that an INT4-quantized build on a Pixel 9 Pro used roughly 0.75% battery over 25 conversations, making this the most power-frugal Gemma yet. Production-ready Quantization-Aware Training (QAT) checkpoints are available at launch, so developers can serve INT4 with minimal quality loss on phones, laptops, or small servers. 

What it’s good at (and what it isn’t)

Out of the box, Google is shipping both a pre-trained and an instruction-tuned checkpoint. The tuned variant is not aimed at long, free-form conversations; instead, it excels at structured tasks—classification, entity extraction, routing, policy or compliance checks, and converting unstructured text into schema-bound outputs. This “right tool for the job” stance mirrors results seen when enterprises fine-tune larger Gemma models for narrow domains (e.g., Adaptive ML’s SK Telecom moderation project), but now at a fraction of the cost and latency. 

Developer on-ramp

Getting started is intentionally trivial. You can download weights from Hugging Face, Ollama, Kaggle, LM Studio, or Docker Hub, try the model on Vertex AI, and run locally with llama.cpp / Gemma.cpp / LiteRT / Keras / MLX. For tuning, Google documents full fine-tuning recipes and points to Hugging Face, Unsloth, and JAX toolchains. The model inherits Gemma 3’s architecture, so existing Gemma-based pipelines and guardrails transfer cleanly. 
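
As a minimal local quickstart, the sketch below loads the instruction-tuned checkpoint with the Hugging Face Transformers pipeline; the model id and generation settings are assumptions to check against the model card, and the other runtimes above (llama.cpp, Ollama, LiteRT, MLX) are equally valid paths.

    # Minimal local sketch: run the instruction-tuned checkpoint with Transformers.
    # The model id and settings are assumptions; adjust to the checkpoint you download.
    from transformers import pipeline

    generator = pipeline("text-generation", model="google/gemma-3-270m-it")

    messages = [{
        "role": "user",
        "content": "Classify this support ticket as BILLING, TECHNICAL, or OTHER: "
                   "'I was charged twice this month.' Answer with the label only.",
    }]
    result = generator(messages, max_new_tokens=10)
    print(result[0]["generated_text"][-1]["content"])     # expected: BILLING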

Where it fits in your stack

If you’ve been defaulting to big models for every job, 270M argues for fleet thinking: deploy multiple tiny experts—one for routing, one for extraction, one for compliance—each fine-tuned on a few thousand examples. You gain latency, privacy, and cost wins (especially on devices), and you reduce failure modes tied to long prompts and brittle few-shot scaffolds. For retrieval pipelines, 270M can act as the fast, deterministic head that classifies queries or validates outputs before a heavier model is invoked. 
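
In code, that routing head could be as simple as the sketch below; the route labels and the small_model / big_model callables are hypothetical placeholders for whatever models you actually deploy.

    # Illustrative "fleet" pattern: a tiny fine-tuned model routes queries,
    # and only some routes escalate to a heavier model. Names are hypothetical.
    ROUTES = {"FAQ", "EXTRACTION", "COMPLIANCE", "ESCALATE"}

    def classify_route(query: str, small_model) -> str:
        """Ask the fine-tuned small model for a single route label."""
        prompt = (
            "Route the query to one of: FAQ, EXTRACTION, COMPLIANCE, ESCALATE.\n"
            f"Query: {query}\nLabel:"
        )
        label = small_model(prompt).strip().upper()
        return label if label in ROUTES else "ESCALATE"   # fail closed on bad output

    def handle(query: str, small_model, big_model) -> str:
        route = classify_route(query, small_model)
        if route == "ESCALATE":
            return big_model(query)                 # heavier model only when needed
        return small_model(f"[{route}] {query}")    # stay on the small model otherwise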

Practical pointers

  • Quantize early. Start with the QAT INT4 checkpoint to match the power and memory profile you’ll ship with. 

  • Constrain formats. Lean into schema-first prompting (JSON schemas) so the model’s instruction-following strengths show up in production logs (a small sketch follows this list). 

  • Measure ROI. Compare a fine-tuned 270M against your current medium/large model on latency, accuracy for your narrow task, and unit cost per 1k requests. 
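
For the schema-first pointer above, a minimal sketch of the pattern might look like this; the schema, prompt, and validation logic are illustrative, not a prescribed format.

    # Illustrative schema-first prompting: ask for JSON matching a fixed schema,
    # then validate before anything downstream trusts it. The schema is made up.
    import json

    TICKET_SCHEMA = {
        "type": "object",
        "properties": {
            "category": {"enum": ["billing", "technical", "other"]},
            "urgent":   {"type": "boolean"},
            "summary":  {"type": "string"},
        },
        "required": ["category", "urgent", "summary"],
    }

    def build_prompt(ticket_text: str) -> str:
        return (
            "Extract the fields below from the ticket and answer with JSON only, "
            f"matching this schema:\n{json.dumps(TICKET_SCHEMA)}\n\nTicket: {ticket_text}"
        )

    def parse_or_reject(model_output: str):
        try:
            data = json.loads(model_output)
        except json.JSONDecodeError:
            return None        # log and retry, or route to a fallback model
        if not all(key in data for key in TICKET_SCHEMA["required"]):
            return None
        return data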

The bigger Gemma picture

Gemma 3 spans from nano-class on-device models like 3n to larger multimodal variants. The 270M release fills a clear gap: a production-oriented “smallest useful” text model with first-party quantization and batteries-included docs, distribution, and tooling. For many workflows, that’s the difference between a cool demo and a service you can afford to run 24/7. 

Takeaway: Gemma 3 270M is a pragmatic tool for shipping AI where efficiency, control, and privacy matter more than sheer breadth of capability. If your team needs fast, reliable, structured text handling on phones or low-cost servers—and wants to fine-tune in hours, not days—this tiny Gemma may be the new default.

28.6.25

Google AI’s Gemma 3n Brings Full Multimodal Intelligence to Low-Power Edge Devices

 

A Mobile-First Milestone

Google has released Gemma 3n, a compact multimodal language model engineered to run entirely offline on resource-constrained hardware. Unlike its larger Gemma 3 cousins, the 3n variant was rebuilt from the ground up for edge deployment, performing vision, audio, video, and text reasoning on devices with as little as 2 GB of RAM.

Two Ultra-Efficient Flavors

Variant | Activated Params*   | Typical RAM | Claimed Throughput | Target Hardware
E2B     | ≈ 2 B (per token)   | 2 GB        | 30 tokens / s      | Entry-level phones, micro-PCs
E4B     | ≈ 4 B               | 4 GB        | 50 tokens / s      | Laptops, Jetson-class boards

*Mixture-of-Experts routing keeps only a subset of the full network active, giving E2B speeds comparable to 5 B dense models and E4B performance near 8 B models.
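
The "activated parameters" idea can be illustrated with a generic top-k gating sketch; this is the standard mixture-of-experts pattern the footnote describes, with made-up sizes, not Gemma 3n's actual routing code.

    # Generic top-k expert gating (illustration only, not Gemma 3n's implementation).
    # Only k experts run per token, so activated parameters << total parameters.
    import numpy as np

    def moe_layer(x, gate_w, experts, k=2):
        """x: (hidden,) token vector; gate_w: (hidden, n_experts); experts: callables."""
        scores = x @ gate_w                           # router logit per expert
        top = np.argsort(scores)[-k:]                 # keep only the k best experts
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                      # softmax over the selected experts
        return sum(w * experts[i](x) for w, i in zip(weights, top))

    rng = np.random.default_rng(0)
    hidden, n_experts = 64, 8
    experts = [lambda v, W=rng.standard_normal((hidden, hidden)) / 8: np.tanh(v @ W)
               for _ in range(n_experts)]
    gate_w = rng.standard_normal((hidden, n_experts))
    out = moe_layer(rng.standard_normal(hidden), gate_w, experts)  # only 2 of 8 experts ran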

Key Technical Highlights

  • Native Multimodality – Single checkpoint accepts combined image, audio, video and text inputs and produces grounded text output.

  • Edge-Optimized Attention – A local–global attention pattern plus per-layer embedding (PLE) caching slashes KV-cache memory, sustaining a 128K-token context on-device (a back-of-envelope estimate follows this list).

  • Low-Precision Friendly – Ships with Q4_K_M quantization recipes and TensorFlow Lite / MediaPipe build targets for Android, iOS, and Linux SBCs.

  • Privacy & Latency – All computation stays on the device, eliminating round-trip delays and cloud-data exposure—critical for regulated or offline scenarios.

Early Benchmarks

Task                 | 3n-E2B | 3n-E4B | Gemma 3-4B-IT | Llama-3-8B-Instruct
MMLU (few-shot)      | 60.1   | 66.7   | 65.4          | 68.9
VQAv2 (zero-shot)    | 57.8   | 61.2   | 60.7          | 58.3
AudioQS (ASR, WER ↓) | 14.3   | 11.6   | 12.9          | 17.4

Despite the tiny footprint, Gemma 3n matches or outperforms many 4-8 B dense models across language, vision and audio tasks. 

Developer Experience

  • Open Weights (Apache 2.0) – Available on Hugging Face, Google AI Studio and Android AICore.

  • Gemma CLI & Vertex AI – Same tooling as larger Gemma 3 models; drop-in replacement for cloud calls when bandwidth or privacy is a concern.

  • Reference Apps – Google has published demos for offline voice assistants, real-time captioning, and hybrid AR experiences that blend live camera frames with text-based reasoning. 

Why It Matters

  1. Unlocks Edge-First Use Cases – Wearables, drones, smart-home hubs and industrial sensors can now run frontier-level AI without the cloud.

  2. Reduces Cost & Carbon – Fewer server cycles and no data egress fees make deployments cheaper and greener.

  3. Strengthens Privacy – Keeping raw sensor data on-device helps meet GDPR, HIPAA and other compliance regimes.

Looking Ahead

Google hints that Gemma 3n is just the first in a “nano-stack” of forthcoming sub-5 B multimodal releases built to scale from Raspberry Pi boards to flagship smartphones. With open weights, permissive licensing and robust tooling, Gemma 3n sets a new bar for AI everywhere—where power efficiency no longer has to compromise capability.

18.6.25

OpenBMB Launches MiniCPM4: Ultra-Efficient LLMs Tailored for Edge Devices

 OpenBMB recently announced the release of MiniCPM4, a suite of lightweight yet powerful language models designed for seamless deployment on edge devices. The series includes two configurations: a 0.5-billion and an 8-billion-parameter model. By combining innovations in model design, training methodology, and inference optimization, MiniCPM4 delivers unprecedented performance for on-device applications.


What Sets MiniCPM4 Apart

  • InfLLM v2: Sparse Attention Mechanism
    Uses trainable sparse attention in which each token attends to fewer than 5% of the other tokens when processing 128K-token sequences. This dramatically reduces computation without sacrificing context comprehension (a simplified sketch follows this list).

  • BitCPM Quantization:
    Implements ternary quantization across model weights, achieving up to 90% reduction in bit-width and enabling storage-efficient deployment on constrained devices.

  • Efficient Training Framework:
    Employs ultra-clean dataset filtering (UltraClean), instruction fine-tuning (UltraChat v2), and optimized hyperparameter tuning strategies (ModelTunnel v2), all trained on only ~8 trillion tokens.

  • Optimized Inference Stack:
    Addresses inference latency with CPM.cu, an efficient CUDA framework that integrates sparse attention, quantization, and speculative sampling. Cross-platform support is provided through ArkInfer.
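
The sparse-attention bullet above can be sketched generically: score coarse blocks of the key cache, keep only the top few, and run dense attention inside that small subset. This is a simplified illustration of block-wise sparse attention, not OpenBMB's InfLLM v2 implementation.

    # Simplified block-sparse attention (illustration only, not InfLLM v2 itself):
    # each query attends to the top-scoring key blocks, a small fraction of all tokens.
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def block_sparse_attention(q, K, V, block=64, top_blocks=3):
        """q: (d,) query; K, V: (n, d) cached keys/values."""
        n, d = K.shape
        n_blocks = n // block
        # Cheap relevance score per block: dot product with the block's mean key.
        block_means = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
        keep = np.argsort(block_means @ q)[-top_blocks:]           # best few blocks only
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
        attn = softmax((K[idx] @ q) / np.sqrt(d))                  # dense attention inside the subset
        return attn @ V[idx]

    rng = np.random.default_rng(0)
    K = rng.standard_normal((4096, 64))
    V = rng.standard_normal((4096, 64))
    out = block_sparse_attention(rng.standard_normal(64), K, V)    # uses 192 of 4096 keys (<5%)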


Performance Highlights

  • Speed:
    On devices like the Jetson AGX Orin, the 8B MiniCPM4 model processes long text (128K tokens) up to 7× faster than competing models like Qwen3‑8B.

  • Benchmark Results:
    Comprehensive evaluations show MiniCPM4 outperforming open-source peers in tasks across long-text comprehension and multi-step generation.


Deploying MiniCPM4

  • On CUDA Devices: Use the CPM.cu stack for optimized sparse attention and speculative decoding performance.

  • With Transformers API: Supports the Hugging Face Transformers API with bfloat16 weights and trust_remote_code=True (a loading sketch follows this list).

  • Server-ready Solutions: Includes support for serving frameworks such as SGLang and vLLM, enabling efficient batching and chat-style endpoints.
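
For the Transformers route mentioned above, a minimal loading sketch might look like the following; the repository id and generation settings are assumptions to verify against the official model card.

    # Minimal sketch of the Transformers path (model id and settings are assumptions).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "openbmb/MiniCPM4-8B"          # assumed Hugging Face repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,           # bfloat16 weights, as noted above
        trust_remote_code=True,               # custom modeling code ships with the repo
        device_map="auto",
    )

    prompt = "Summarize the key idea of sparse attention in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))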


Why It Matters

MiniCPM4 addresses critical industry pain points:

  • Local ML Capabilities: Brings powerful LLM performance to devices without relying on cloud infrastructure.

  • Performance & Efficiency Balance: Achieves desktop-grade reasoning on embedded devices thanks to sparse attention and quantization.

  • Open Access: Released under Apache 2.0 with documentation, model weights, and inference tooling available via Hugging Face.


Conclusion

MiniCPM4 marks a significant step forward in making advanced language models practical for edge environments. Its efficient attention mechanisms, model compression, and fast decoding pipeline offer developers and researchers powerful tools to embed AI capabilities directly within resource-constrained systems. For industries such as industrial IoT, robotics, and mobile assistants, MiniCPM4 opens doors to real-time, on-device intelligence without compromising performance or privacy.

3.6.25

Google Introduces AI Edge Gallery: Empowering Android Devices with Offline AI Capabilities

 In a significant move towards enhancing on-device artificial intelligence, Google has quietly released the AI Edge Gallery, an experimental Android application that allows users to run sophisticated AI models directly on their smartphones without the need for an internet connection. This development marks a pivotal step in Google's commitment to edge computing and privacy-centric AI solutions.

Empowering Offline AI Functionality

The AI Edge Gallery enables users to download and execute AI models from the Hugging Face platform entirely on their devices. This capability facilitates a range of tasks, including image analysis, text generation, coding assistance, and multi-turn conversations, all processed locally. By eliminating the reliance on cloud-based services, users can experience faster response times and enhanced data privacy.

Technical Foundations and Performance

Built upon Google's LiteRT platform (formerly TensorFlow Lite) and MediaPipe frameworks, the AI Edge Gallery is optimized for running AI models on resource-constrained mobile devices. The application supports models from various machine learning frameworks, such as JAX, Keras, PyTorch, and TensorFlow, ensuring broad compatibility.

Central to the app's performance is Google's Gemma 3 model, a compact 529-megabyte language model capable of processing up to 2,585 tokens per second during prefill inference on mobile GPUs. This efficiency translates to sub-second response times for tasks like text generation and image analysis, delivering a user experience comparable to cloud-based alternatives.
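
The sub-second claim follows directly from that prefill rate; a quick check with an assumed prompt length:

    # Quick check: at the quoted prefill rate, a typical prompt finishes in under a second.
    prefill_tokens_per_s = 2585      # figure quoted for mobile GPUs
    prompt_tokens = 1500             # assumed prompt length for illustration
    print(f"prefill time: ~{prompt_tokens / prefill_tokens_per_s:.2f} s")   # ~0.58 s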

Open-Source Accessibility

Released under an open-source Apache 2.0 license, the AI Edge Gallery is available through GitHub, reflecting Google's initiative to democratize access to advanced AI capabilities. By providing this tool outside of official app stores, Google encourages developers and enthusiasts to explore and contribute to the evolution of on-device AI applications.

Implications for Privacy and Performance

The introduction of the AI Edge Gallery underscores a growing trend towards processing data locally on devices, addressing concerns related to data privacy and latency. By enabling AI functionalities without internet connectivity, users can maintain greater control over their data while benefiting from the convenience and speed of on-device processing.

Conclusion

Google's AI Edge Gallery represents a significant advancement in bringing powerful AI capabilities directly to Android devices. By facilitating offline access to advanced models and promoting open-source collaboration, Google is paving the way for more private, efficient, and accessible AI experiences on mobile platforms.
