
18.6.25

Groq Supercharges Hugging Face Inference—Then Targets AWS & Google

 Groq, the AI inference startup, is making bold moves by integrating its custom Language Processing Unit (LPU) into Hugging Face and expanding toward AWS and Google platforms. The company now supports Alibaba’s Qwen3‑32B model with a groundbreaking full 131,000-token context window, unmatched by other providers.

🔋 Record-Breaking 131K Context Window

Groq's LPU hardware enables inference on extremely long sequences—essential for tasks like full-document analysis, comprehensive code reasoning, and extended conversational threads. Benchmarking firm Artificial Analysis measured 535 tokens per second, and Groq offers competitive pricing at $0.29 per million input tokens and $0.59 per million output tokens.
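
To put the pricing in concrete terms, here is a quick back-of-the-envelope calculation using only the rates quoted above; the token counts in the example are illustrative.

```python
# Rough cost estimate from Groq's published Qwen3-32B rates:
# $0.29 per million input tokens, $0.59 per million output tokens.
INPUT_RATE = 0.29 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.59 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a prompt filling most of the 131K context plus a 2K-token answer.
print(f"${request_cost(128_000, 2_000):.4f}")  # roughly $0.038
```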

🚀 Hugging Face Partnership

As an official inference provider on Hugging Face, Groq offers seamless access via the Playground and API. Developers can now select Groq as the execution backend, benefiting from high-speed, cost-efficient inference directly billed through Hugging Face. This integration extends to popular model families such as Meta LLaMA, Google Gemma, and Alibaba Qwen3-32B.
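
For developers, selecting Groq as the backend is a one-line change in the Hugging Face client. The sketch below assumes a recent huggingface_hub release that accepts Groq as an inference provider and that the model is served under the Qwen/Qwen3-32B repo id; treat it as a minimal illustration rather than official sample code.

```python
# Minimal sketch: routing a chat completion through Groq via Hugging Face
# Inference Providers. Assumes provider="groq" is supported by the installed
# huggingface_hub version and that the model id below is correct.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="groq", api_key="hf_...")  # billed through Hugging Face

response = client.chat_completion(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Summarize the attached contract in five bullet points."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```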

⚡ Future Plans: AWS & Google

Groq's strategy targets more than Hugging Face. The startup is challenging cloud giants by providing high-performance inference services with specialized hardware optimized for AI tasks. Though AWS Bedrock, Google Vertex AI, and Microsoft Azure currently dominate the market, Groq's unique performance and pricing offer a compelling alternative.

🌍 Scaling Infrastructure

Currently, Groq operates data centers across North America and the Middle East, handling over 20 million tokens per second. They plan further global expansion to support increasing demand from Hugging Face users and beyond.

📈 The Bigger Picture

The AI inference market—projected to hit $154.9 billion by 2030—is becoming the battleground for performance and cost supremacy. Groq’s emphasis on long-context support, fast token throughput, and competitive pricing positions it to capture a significant share of inference workloads. However, the challenge remains: maintaining performance at scale and competing with cloud giants’ infrastructure power.


✅ Key Takeaways

  • Unmatched Context Window: Full 131K tokens, ideal for extended documents and conversations
  • High-Speed Inference: 535 tokens/sec performance, surpassing typical GPU setups
  • Simplified Access: Direct integration via the Hugging Face platform
  • Cost-Effective Pricing: Token-based costs lower than many cloud providers
  • Scaling Ambitions: Expanding globally and targeting AWS/Google market share


Groq’s collaboration with Hugging Face marks a strategic shift toward democratizing high-performance AI inference. By focusing on specialized hardware, long context support, and seamless integration, Groq is positioning itself as a formidable challenger to established cloud providers in the fast-growing inference market.

4.6.25

SmolVLA: Hugging Face's Compact Vision-Language-Action Model for Affordable Robotics

 Hugging Face has introduced SmolVLA, a compact and efficient Vision-Language-Action (VLA) model designed to democratize robotics by enabling robust performance on consumer-grade hardware. With only 450 million parameters, SmolVLA achieves competitive results compared to larger models, thanks to its training on diverse, community-contributed datasets.

Bridging the Gap in Robotics AI

While large-scale Vision-Language Models (VLMs) have propelled advancements in AI, their application in robotics has been limited due to high computational demands and reliance on proprietary datasets. SmolVLA addresses these challenges by offering:

  • Compact Architecture: A 450M-parameter model that balances performance and efficiency.

  • Community-Driven Training Data: Utilization of 487 high-quality datasets from the LeRobot community, encompassing approximately 10 million frames.

  • Open-Source Accessibility: Availability of model weights and training data under the Apache 2.0 license, fostering transparency and collaboration.

Innovative Training and Annotation Techniques

To enhance the quality of training data, the team employed the Qwen2.5-VL-3B-Instruct model to generate concise, action-oriented task descriptions, replacing vague or missing annotations. This approach ensured consistent and informative labels across the diverse datasets.
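
The post does not publish the exact relabeling pipeline, but a plausible sketch of the idea, following the standard Qwen2.5-VL usage pattern in Transformers, looks like this (the prompt wording and frame path are illustrative):

```python
# Hypothetical relabeling sketch: ask Qwen2.5-VL-3B-Instruct to produce a
# concise, action-oriented task description for one episode frame.
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "episode_frame.jpg"},  # illustrative path
        {"type": "text", "text": "Describe the robot's task in one short, action-oriented sentence."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=32)
trimmed = generated[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```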

Performance and Efficiency

SmolVLA demonstrates impressive capabilities:

  • Improved Success Rates: Pretraining on community datasets increased task success on the SO100 benchmark from 51.7% to 78.3%.

  • Asynchronous Inference: Decoupling perception and action prediction from execution allows for faster response times and higher task throughput (a toy sketch of the idea follows this list).

  • Resource-Efficient Deployment: Designed for training on a single GPU and deployment on CPUs or consumer-grade GPUs, making advanced robotics more accessible.
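
SmolVLA's real asynchronous stack lives in the lerobot codebase; the toy sketch below only illustrates the general idea of decoupling prediction from execution with a background thread and a queue (all names and timings are illustrative):

```python
# Toy illustration of asynchronous inference: a background thread keeps
# predicting the next chunk of actions while the main loop executes the
# previous chunk. Names and timings are illustrative, not from lerobot.
import queue
import threading
import time

action_chunks: queue.Queue = queue.Queue(maxsize=2)

def predictor(stop: threading.Event) -> None:
    step = 0
    while not stop.is_set():
        # Stand-in for observation capture plus a SmolVLA forward pass.
        chunk = [0.01 * step + 0.001 * i for i in range(10)]
        action_chunks.put(chunk)  # blocks if the executor falls behind
        step += 1

stop = threading.Event()
threading.Thread(target=predictor, args=(stop,), daemon=True).start()

for _ in range(5):                # executor loop, e.g. robot control at 10 Hz
    chunk = action_chunks.get()   # the next chunk is usually ready already
    for action in chunk:
        time.sleep(0.01)          # stand-in for sending one motor command
stop.set()
```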

Getting Started with SmolVLA

Developers and researchers can access SmolVLA through the Hugging Face Hub, where the model weights and training datasets are published under the Apache 2.0 license.
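
As a minimal starting point, the weights can be pulled with huggingface_hub; this assumes the base checkpoint is published as lerobot/smolvla_base, while fine-tuning and robot control go through the lerobot library.

```python
# Download the SmolVLA weights locally. Assumes the base checkpoint is
# published as "lerobot/smolvla_base"; training and robot integration are
# handled by the lerobot library and not shown here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("lerobot/smolvla_base")
print(f"Model files downloaded to: {local_dir}")
```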

By offering a compact, efficient, and open-source VLA model, SmolVLA paves the way for broader participation in robotics research and development, fostering innovation and collaboration in the field.

3.6.25

Google Introduces AI Edge Gallery: Empowering Android Devices with Offline AI Capabilities

 In a significant move towards enhancing on-device artificial intelligence, Google has quietly released the AI Edge Gallery, an experimental Android application that allows users to run sophisticated AI models directly on their smartphones without the need for an internet connection. This development marks a pivotal step in Google's commitment to edge computing and privacy-centric AI solutions.

Empowering Offline AI Functionality

The AI Edge Gallery enables users to download and execute AI models from the Hugging Face platform entirely on their devices. This capability facilitates a range of tasks, including image analysis, text generation, coding assistance, and multi-turn conversations, all processed locally. By eliminating the reliance on cloud-based services, users can experience faster response times and enhanced data privacy.

Technical Foundations and Performance

Built upon Google's LiteRT platform (formerly TensorFlow Lite) and MediaPipe frameworks, the AI Edge Gallery is optimized for running AI models on resource-constrained mobile devices. The application supports models from various machine learning frameworks, such as JAX, Keras, PyTorch, and TensorFlow, ensuring broad compatibility.
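
The Gallery itself runs these models through Android APIs, but the underlying interpreter pattern is the familiar LiteRT (TensorFlow Lite) one. A minimal desktop-side sketch, assuming an arbitrary model.tflite file and the TensorFlow Python package, looks like this:

```python
# Minimal LiteRT (TensorFlow Lite) interpreter sketch using the Python
# bindings; the Gallery app performs the equivalent steps on-device.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # illustrative file
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]).shape)
```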

Central to the app's performance is Google's Gemma 3 1B model, a compact 529-megabyte language model capable of processing up to 2,585 tokens per second during prefill inference on mobile GPUs. This efficiency translates to sub-second response times for tasks like text generation and image analysis, delivering a user experience comparable to cloud-based alternatives.

Open-Source Accessibility

Released under an open-source Apache 2.0 license, the AI Edge Gallery is available through GitHub, reflecting Google's initiative to democratize access to advanced AI capabilities. By providing this tool outside of official app stores, Google encourages developers and enthusiasts to explore and contribute to the evolution of on-device AI applications.

Implications for Privacy and Performance

The introduction of the AI Edge Gallery underscores a growing trend towards processing data locally on devices, addressing concerns related to data privacy and latency. By enabling AI functionalities without internet connectivity, users can maintain greater control over their data while benefiting from the convenience and speed of on-device processing.

Conclusion

Google's AI Edge Gallery represents a significant advancement in bringing powerful AI capabilities directly to Android devices. By facilitating offline access to advanced models and promoting open-source collaboration, Google is paving the way for more private, efficient, and accessible AI experiences on mobile platforms.

9.5.25

Alibaba’s ZeroSearch: Empowering AI to Self-Train and Slash Costs by 88%

 On May 8, 2025, Alibaba Group unveiled ZeroSearch, an innovative reinforcement learning framework designed to train large language models (LLMs) in information retrieval without relying on external search engines. This approach not only enhances the efficiency of AI training but also significantly reduces associated costs.

Revolutionizing AI Training Through Simulation

Traditional AI training methods for search capabilities depend heavily on real-time interactions with search engines, leading to substantial API expenses and unpredictable data quality. ZeroSearch addresses these challenges by enabling LLMs to simulate search engine interactions within a controlled environment. The process begins with a supervised fine-tuning phase, transforming an LLM into a retrieval module capable of generating both relevant and irrelevant documents in response to queries. Subsequently, a curriculum-based rollout strategy is employed during reinforcement learning to gradually degrade the quality of generated documents, enhancing the model's ability to discern and retrieve pertinent information. 
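
The full training recipe is in Alibaba's paper and repository; the toy sketch below only illustrates the curriculum idea described above, where the share of deliberately irrelevant documents in each simulated search result grows as training progresses (the schedule and document generators are made up for illustration):

```python
# Toy illustration of the curriculum-based rollout: the simulated "search
# engine" returns a mix of relevant and irrelevant documents, and the share
# of irrelevant ones grows over training. The schedule below is illustrative.
import random

def noise_ratio(step: int, total_steps: int, start: float = 0.1, end: float = 0.6) -> float:
    """Linearly increase the share of irrelevant documents over training."""
    return start + (end - start) * min(step / total_steps, 1.0)

def simulated_search(query: str, step: int, total_steps: int, k: int = 5) -> list:
    docs = []
    for _ in range(k):
        if random.random() < noise_ratio(step, total_steps):
            docs.append(f"[irrelevant] distractor text unrelated to: {query}")
        else:
            docs.append(f"[relevant] evidence snippet answering: {query}")
    return docs

for step in (0, 5_000, 10_000):
    print(step, simulated_search("Who wrote Dream of the Red Chamber?", step, 10_000))
```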

Achieving Superior Performance at Reduced Costs

In extensive evaluations across seven question-answering datasets, ZeroSearch demonstrated performance on par with, and in some cases surpassing, models trained using actual search engines. Notably, a 14-billion-parameter retrieval module trained with ZeroSearch outperformed Google Search in specific benchmarks. Financially, the benefits are substantial; training with approximately 64,000 search queries using Google Search via SerpAPI would cost about $586.70, whereas utilizing a 14B-parameter simulation LLM on four A100 GPUs incurs only $70.80—a remarkable 88% reduction in costs. 

Implications for the AI Industry

ZeroSearch's introduction marks a significant shift in AI development paradigms. By eliminating dependence on external search engines, developers gain greater control over training data quality and reduce operational costs. This advancement democratizes access to sophisticated AI training methodologies, particularly benefiting startups and organizations with limited resources. Furthermore, the open-source release of ZeroSearch's code, datasets, and pre-trained models on platforms like GitHub and Hugging Face fosters community engagement and collaborative innovation. 

Looking Ahead

As AI continues to evolve, frameworks like ZeroSearch exemplify the potential for self-sufficient learning models that minimize external dependencies. This development not only streamlines the training process but also paves the way for more resilient and adaptable AI systems in various applications.

8.5.25

NVIDIA Unveils Parakeet-TDT-0.6B-v2: A Breakthrough in Open-Source Speech Recognition

 On May 1, 2025, NVIDIA released Parakeet-TDT-0.6B-v2, a state-of-the-art automatic speech recognition (ASR) model, now available on Hugging Face. This open-source model is designed to deliver high-speed, accurate transcriptions, setting a new benchmark in the field of speech-to-text technology.

Exceptional Performance and Speed

Parakeet-TDT-0.6B-v2 boasts 600 million parameters and utilizes a combination of the FastConformer encoder and TDT decoder architectures. When deployed on NVIDIA's GPU-accelerated hardware, the model can transcribe 60 minutes of audio in just one second, achieving a Real-Time Factor (RTFx) of 3386.02 with a batch size of 128. This performance places it at the top of current ASR benchmarks maintained by Hugging Face. 

Comprehensive Feature Set

The model supports:

  • Punctuation and Capitalization: Enhances readability of transcriptions.

  • Word-Level Timestamping: Facilitates precise alignment between audio and text.

  • Robustness to Noise: Maintains accuracy even in varied noise conditions and telephony-style audio formats.

These features make it suitable for applications such as transcription services, voice assistants, subtitle generation, and conversational AI platforms. 

Training Data and Methodology

Parakeet-TDT-0.6B-v2 was trained on the Granary dataset, comprising approximately 120,000 hours of English audio. This includes 10,000 hours of high-quality human-transcribed data and 110,000 hours of pseudo-labeled speech from sources like LibriSpeech, Mozilla Common Voice, YouTube-Commons, and Librilight. NVIDIA plans to make the Granary dataset publicly available following its presentation at Interspeech 2025. 

Accessibility and Deployment

Developers can deploy the model using NVIDIA’s NeMo toolkit, compatible with Python and PyTorch. The model is released under the Creative Commons CC-BY-4.0 license, permitting both commercial and non-commercial use. It is optimized for NVIDIA GPU environments, including A100, H100, T4, and V100 GPUs, but can also run on systems with as little as 2GB of RAM.
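
A minimal usage sketch with NeMo, assuming nemo_toolkit with the ASR extras is installed and a local speech recording named audio.wav, looks like this:

```python
# Sketch of loading Parakeet-TDT-0.6B-v2 with NVIDIA NeMo and transcribing
# one file. Assumes nemo_toolkit[asr] is installed and audio.wav exists.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
results = asr_model.transcribe(["audio.wav"])

# Depending on the NeMo version, entries are plain strings or Hypothesis
# objects exposing the transcript via .text.
first = results[0]
print(getattr(first, "text", first))
```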

Implications for the AI Community

The release of Parakeet-TDT-0.6B-v2 underscores NVIDIA's commitment to advancing open-source AI tools. By providing a high-performance, accessible ASR model, NVIDIA empowers developers, researchers, and enterprises to integrate cutting-edge speech recognition capabilities into their applications, fostering innovation across various industries.

4.5.25

Microsoft Launches Phi-4-Reasoning-Plus: Small Model, Big Reasoning Power

Microsoft has unveiled Phi-4-Reasoning-Plus, a compact yet highly capable open-weight language model built for deep, structured reasoning. With just 14 billion parameters, it punches far above its weight—outperforming much larger models on key benchmarks in logic, math, and science.

Phi-4-Reasoning-Plus is a refinement of Microsoft’s earlier Phi-4 model. It uses advanced supervised fine-tuning and reinforcement learning to deliver high reasoning accuracy in a lightweight format. Its roughly 16 billion training tokens (about half of them unique) combine synthetic prompts with carefully filtered web content, and training concludes with a dedicated reinforcement learning phase focused on roughly 6,400 math problems.

What makes this model especially valuable to developers and businesses is its MIT open-source license, allowing free use, modification, and commercial deployment. It's also designed to run efficiently on common AI frameworks like Hugging Face Transformers, vLLM, llama.cpp, and Ollama—making it easy to integrate across platforms.
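
A minimal local-inference sketch with Hugging Face Transformers, assuming the weights are published under the microsoft/Phi-4-reasoning-plus repo id and the machine has enough memory for a 14B-parameter model:

```python
# Minimal sketch: running Phi-4-Reasoning-Plus with Transformers. Assumes the
# checkpoint is published as "microsoft/Phi-4-reasoning-plus" and sufficient
# GPU memory is available for a 14B-parameter model.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-reasoning-plus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x? Reason step by step."}]
output = generator(messages, max_new_tokens=512)
print(output[0]["generated_text"][-1]["content"])
```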

Key Features of Phi-4-Reasoning-Plus:

  • ✅ 14B parameters with performance rivaling 70B+ models in reasoning tasks

  • ✅ Outperforms larger LLMs in math, coding, and logical reasoning

  • ✅ Uses special tokens to improve transparency in reasoning steps (see the parsing sketch after this list)

  • ✅ Trained with outcome-based reinforcement learning for better accuracy and brevity

  • ✅ Released under the MIT license for open commercial use

  • ✅ Compatible with lightweight inference frameworks
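
On the special-token point above: the model is described as emitting its chain of thought in a dedicated block before the final answer, so downstream code can separate the two. A small parsing sketch, assuming <think>...</think> delimiters in the raw output (adjust if the chat template differs):

```python
# Split a Phi-4-Reasoning-Plus response into its reasoning trace and final
# answer. Assumes the reasoning is delimited by <think>...</think> tags;
# adjust the pattern if the chat template uses different markers.
import re

def split_reasoning(raw: str):
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()  # everything after the closing tag
    return reasoning, answer

raw_output = "<think>22 - 7 = 15, and 15 / 3 = 5.</think> x = 5"
thoughts, answer = split_reasoning(raw_output)
print(answer)  # -> x = 5
```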

One of the standout results? Phi-4-Reasoning-Plus achieved a higher first-pass score on the AIME 2025 math exam than a 70B model—an impressive feat that showcases its reasoning efficiency despite a smaller model size.

Takeaway

Microsoft’s Phi-4-Reasoning-Plus marks a turning point in AI development: high performance no longer depends on massive scale. This small but mighty model proves that with smarter training and tuning, compact LLMs can rival giants in performance—while being easier to deploy, more cost-effective, and openly available. It’s a big leap forward for accessible AI, especially for startups, educators, researchers, and businesses that need powerful reasoning without the heavy compute demands.
