Hugging Face has introduced SmolVLA, a compact and efficient Vision-Language-Action (VLA) model designed to democratize robotics by enabling robust performance on consumer-grade hardware. With only 450 million parameters, SmolVLA achieves competitive results compared to larger models, thanks to its training on diverse, community-contributed datasets.
## Bridging the Gap in Robotics AI
While large-scale Vision-Language Models (VLMs) have propelled advancements in AI, their application in robotics has been limited due to high computational demands and reliance on proprietary datasets. SmolVLA addresses these challenges by offering:
- **Compact Architecture:** A 450M-parameter model that balances performance and efficiency.
- **Community-Driven Training Data:** Utilization of 487 high-quality datasets from the LeRobot community, encompassing approximately 10 million frames.
- **Open-Source Accessibility:** Availability of model weights and training data under the Apache 2.0 license, fostering transparency and collaboration.
## Innovative Training and Annotation Techniques
To enhance the quality of training data, the team employed the Qwen2.5-VL-3B-Instruct model to generate concise, action-oriented task descriptions, replacing vague or missing annotations. This approach ensured consistent and informative labels across the diverse datasets.
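The annotation-cleanup step described above can be sketched as follows. This is a hypothetical illustration, not the team's actual pipeline: the prompt wording and the `normalize` helper are assumptions, standing in for how a VLM annotator such as Qwen2.5-VL-3B-Instruct might be asked to replace vague or missing task labels with concise, action-oriented ones.

```python
# Hypothetical sketch of the annotation-cleanup step: a VLM annotator
# is prompted to rewrite vague or missing task labels as short,
# action-oriented instructions. Prompt text and normalize() are
# illustrative assumptions, not the team's exact code.

def build_annotation_prompt(original_label: str) -> str:
    """Compose an instruction asking the VLM to rewrite a task label."""
    return (
        "You see a short robot manipulation clip. "
        f"The current task label is: '{original_label or 'MISSING'}'. "
        "Rewrite it as one concise, action-oriented instruction "
        "(imperative verb first, under 10 words)."
    )

def normalize(label: str) -> str:
    """Light cleanup applied to the VLM's answer before storing it."""
    label = label.strip().strip('".').strip()  # drop stray quotes/periods
    return label[:1].upper() + label[1:] if label else label

# Example: a vague dataset label and the prompt an annotator would see.
print(build_annotation_prompt("task 12"))
print(normalize('  "pick up the red cube and place it in the box". '))
```

In a real pipeline, the prompt would be sent to the VLM together with sampled frames from the episode, and the normalized answer would overwrite the original label, giving consistent annotations across heterogeneous community datasets.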
## Performance and Efficiency
SmolVLA demonstrates impressive capabilities:
- **Improved Success Rates:** Pretraining on community datasets increased task success on the SO100 benchmark from 51.7% to 78.3%.
- **Asynchronous Inference:** Decoupling perception and action prediction from execution allows for faster response times and higher task throughput.
- **Resource-Efficient Deployment:** Designed for training on a single GPU and deployment on CPUs or consumer-grade GPUs, making advanced robotics more accessible.
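The asynchronous-inference idea above can be illustrated with a minimal producer/consumer sketch: prediction runs in a background thread and fills a small queue of action chunks, so execution never stalls waiting on the policy. This is an assumption-laden toy, not SmolVLA's actual runtime: `predict_chunk` stands in for the real policy call, and chunk size and latency are made up.

```python
# Minimal sketch of asynchronous inference: action-chunk prediction runs
# in a background thread feeding a queue, so the executor never idles
# while the policy computes. predict_chunk() is a stand-in for the real
# policy; chunk size and timings are illustrative assumptions.
import queue
import threading
import time

def predict_chunk(step: int) -> list[int]:
    """Stand-in for the policy: returns a chunk of actions (here, ints)."""
    time.sleep(0.01)  # simulate model latency
    return [step * 10 + i for i in range(4)]  # one 4-action chunk

def predictor(n_chunks: int, out: queue.Queue) -> None:
    """Producer: keeps predicting the next chunk ahead of execution."""
    for step in range(n_chunks):
        out.put(predict_chunk(step))
    out.put(None)  # sentinel: no more chunks

actions: queue.Queue = queue.Queue(maxsize=2)  # small lookahead buffer
threading.Thread(target=predictor, args=(3, actions), daemon=True).start()

executed = []
while (chunk := actions.get()) is not None:
    for action in chunk:         # the robot would execute each action here,
        executed.append(action)  # while the predictor prepares the next chunk

print(executed)  # → [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
```

Because the queue buffers a chunk or two of lookahead, perception and prediction overlap with execution instead of alternating with it, which is what yields the faster response times and higher throughput described above.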
## Getting Started with SmolVLA
Developers and researchers can access SmolVLA through the Hugging Face Hub:
- **Model Repository:** lerobot/smolvla_base
- **Technical Report:** SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
By offering a compact, efficient, and open-source VLA model, SmolVLA paves the way for broader participation in robotics research and development, fostering innovation and collaboration in the field.