Showing posts with label open-source robotics.

21.7.25

RoboBrain 2.0 aims to be the one brain your robot needs

When you send a service bot to restock a fridge or map a disaster zone, you usually stitch together half-a-dozen neural nets: one to segment objects, another to read instructions, a planner to plot a path. RoboBrain 2.0 wants to scrap that Franken-stack and replace it with a single vision-language foundation model that can see, read, think and act. Introduced this month by the Beijing Academy of Artificial Intelligence (BAAI), the system comes in two flavors—a resource-friendly 7B-parameter variant and a flagship 32B model—both built around a heterogeneous architecture that couples a powerful vision encoder to a large-language backbone.

What’s new under the hood

  • Unified spatial + temporal training: a multistage curriculum mixes affordance prediction, spatial referring, trajectory forecasting and real-time scene-graph updates so the model learns to reason and plan.

  • Dense perception head: adds point-, box- and mask-level outputs to the language decoder, letting the same network return precise coordinates without extra detectors.

  • Closed-loop interaction module: keeps a rolling memory of scene changes, enabling multi-step tasks like “pick the red mug you just washed and place it on the left shelf” (a minimal memory sketch follows this list).
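
The closed-loop module is only described at a high level, so purely as an illustration of the idea, here is a minimal, hypothetical rolling scene memory in Python. None of the names below come from the RoboBrain code base; it just shows how keeping the last N scene updates lets a planner resolve references like “the mug you just washed.”

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict


@dataclass
class SceneEvent:
    """One observed change, e.g. 'red mug moved from sink to counter'."""
    step: int
    description: str
    objects: Dict[str, tuple]  # object name -> (x, y, z) position


@dataclass
class RollingSceneMemory:
    """Keeps the most recent scene updates available to a planner."""
    max_events: int = 32
    events: Deque[SceneEvent] = field(default_factory=deque)

    def update(self, event: SceneEvent) -> None:
        self.events.append(event)
        while len(self.events) > self.max_events:
            self.events.popleft()  # drop the oldest update

    def context_prompt(self) -> str:
        """Serialize recent events into text a VLM planner can condition on."""
        return "\n".join(f"[t={e.step}] {e.description}" for e in self.events)


memory = RollingSceneMemory()
memory.update(SceneEvent(step=3,
                         description="red mug washed and placed by the sink",
                         objects={"red_mug": (0.42, -0.10, 0.91)}))
print(memory.context_prompt())
```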

Benchmark clean-sweep

According to the technical report and accompanying GitHub data, RoboBrain 2.0-32B posts state-of-the-art or near-SOTA scores on nine spatial-reasoning suites (BLINK-Spatial, CV-Bench, EmbSpatial, RoboSpatial, RefSpatial, SAT, VSI-Bench, Where2Place, ShareRobot-Bench) and three temporal/decision-making tests (Multi-Robot-Planning, Ego-Plan2, RoboBench-Planning). That’s enough to edge past open-source front-runners like Cosmos-Reason1 and Qwen2.5-VL and proprietary contenders such as Gemini 2.5 Pro, o4-mini and Claude Sonnet 4.

Why those results matter

  • From perception to action — in one pass. A single forward call yields language, bounding boxes and future trajectories, trimming latency for real-time robotics.

  • Scales down gracefully. The 7 B version, small enough for an RTX 6000, still cracks the top tier on most spatial tasks, making embodied AI workflows feasible outside big-tech labs.

  • Open weights, permissive license. Both checkpoints, the training code and a new embodied-reasoning benchmark suite are already public, inviting startups to fine-tune for warehouse picking, home assistance or search-and-rescue (a brief loading sketch follows this list).
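
Since the checkpoints are on the Hugging Face Hub, pulling the smaller one locally should be straightforward. The snippet below is a sketch only: the repo ID BAAI/RoboBrain2.0-7B is assumed from BAAI's usual naming, and the generic trust_remote_code load may differ from the inference wrapper the official repository documents.

```python
# Sketch only: the Hub repo ID is an assumption, and the trust_remote_code load
# is the generic transformers path; the official RoboBrain 2.0 repository
# documents its own inference wrapper.
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "BAAI/RoboBrain2.0-7B"        # assumed ID for the 7B checkpoint
local_dir = snapshot_download(repo_id)  # pulls weights + custom code locally
print("checkpoint downloaded to", local_dir)

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
```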

The road ahead

BAAI hints that RoboBrain’s next milestones include on-device distillation for humanoid form factors and a memory-augmented planner for week-long missions. Whether the project can keep pace with multi-modal heavyweights such as Google DeepMind’s RT-2 remains to be seen, but RoboBrain 2.0 shows that an all-in-one “robot brain” is no longer science fiction.

Paper link: arXiv 2507.02029 (PDF)

4.6.25

SmolVLA: Hugging Face's Compact Vision-Language-Action Model for Affordable Robotics

 Hugging Face has introduced SmolVLA, a compact and efficient Vision-Language-Action (VLA) model designed to democratize robotics by enabling robust performance on consumer-grade hardware. With only 450 million parameters, SmolVLA achieves competitive results compared to larger models, thanks to its training on diverse, community-contributed datasets.

Bridging the Gap in Robotics AI

While large-scale Vision-Language Models (VLMs) have propelled advancements in AI, their application in robotics has been limited due to high computational demands and reliance on proprietary datasets. SmolVLA addresses these challenges by offering:

  • Compact Architecture: A 450M-parameter model that balances performance and efficiency.

  • Community-Driven Training Data: Utilization of 487 high-quality datasets from the LeRobot community, encompassing approximately 10 million frames.

  • Open-Source Accessibility: Availability of model weights and training data under the Apache 2.0 license, fostering transparency and collaboration.

Innovative Training and Annotation Techniques

To enhance the quality of training data, the team employed the Qwen2.5-VL-3B-Instruct model to generate concise, action-oriented task descriptions, replacing vague or missing annotations. This approach ensured consistent and informative labels across the diverse datasets.
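
As a rough sketch of that re-annotation pass, the public Qwen2.5-VL-3B-Instruct chat API can be prompted with a sampled frame plus the original episode label and asked for a short, imperative description. The file path and prompt wording below are illustrative assumptions; the team's actual pipeline may differ.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One frame from an episode plus its (possibly vague) original label.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "episode_0007/frame_000.png"},  # illustrative path
        {"type": "text", "text": "Original label: 'task A'. Rewrite it as one short, "
                                 "action-oriented instruction for the robot in this scene."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=32)
new_label = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                   skip_special_tokens=True)[0]
print(new_label)  # e.g. a concise instruction such as "pick up the red block"
```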

Performance and Efficiency

SmolVLA demonstrates impressive capabilities:

  • Improved Success Rates: Pretraining on community datasets increased task success on the SO100 benchmark from 51.7% to 78.3%.

  • Asynchronous Inference: Decoupling perception and action prediction from execution allows for faster response times and higher task throughput (a toy sketch of the pattern follows this list).

  • Resource-Efficient Deployment: Designed for training on a single GPU and deployment on CPUs or consumer-grade GPUs, making advanced robotics more accessible.
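
The asynchronous-inference idea is easiest to see as a producer/consumer loop: the robot keeps executing the current action chunk while the next chunk is being predicted. Below is a toy Python sketch of that overlap; the function and queue names are made up for illustration and are not lerobot APIs.

```python
import queue
import threading
import time


def predict_chunk(observation):
    """Stand-in for a SmolVLA forward pass that returns a short action chunk."""
    time.sleep(0.15)                      # pretend inference latency
    return [f"obs{observation}_action{i}" for i in range(5)]


def inference_loop(obs_queue, action_queue):
    """Producer thread: turn the latest observation into the next action chunk."""
    while True:
        obs = obs_queue.get()
        if obs is None:                   # shutdown signal
            return
        action_queue.put(predict_chunk(obs))


def control_loop(obs_queue, action_queue, steps=3):
    """Consumer: execute the current chunk while the next one is being predicted."""
    obs_queue.put(0)                      # prime the first prediction
    for step in range(1, steps + 1):
        chunk = action_queue.get()        # chunk computed for the previous observation
        if step < steps:
            obs_queue.put(step)           # request the next chunk *before* executing
        for action in chunk:
            time.sleep(0.02)              # pretend to send the action to the robot
            print("executed", action)
    obs_queue.put(None)                   # stop the inference thread


obs_q, act_q = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
threading.Thread(target=inference_loop, args=(obs_q, act_q), daemon=True).start()
control_loop(obs_q, act_q)
```

In the real system the observation would be camera frames and joint states and the chunk would come from the policy network; the point here is only that prediction and execution overlap instead of alternating.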

Getting Started with SmolVLA

Developers and researchers can access SmolVLA through the Hugging Face Hub, where the model weights and the community training datasets are published.
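
As a minimal loading sketch, assuming the released checkpoint ID lerobot/smolvla_base and the lerobot module layout documented at release time (both may have moved in newer versions of the library):

```python
# Assumptions: the import path and checkpoint ID follow the release-time lerobot
# documentation; check the current lerobot README if they have been reorganized.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# A policy step maps camera frames, robot state and a language instruction to an
# action chunk; the observation keys below are illustrative, not a fixed schema.
# observation = {
#     "observation.images.top": image_tensor,   # (1, 3, H, W)
#     "observation.state": state_tensor,        # (1, state_dim)
#     "task": ["pick up the cube and place it in the bin"],
# }
# action = policy.select_action(observation)
```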

By offering a compact, efficient, and open-source VLA model, SmolVLA paves the way for broader participation in robotics research and development, fostering innovation and collaboration in the field.
