
12.8.25

MolmoAct brings editable spatial plans to robot foundation models

Most robot FMs still map pixels and instructions straight to torques, a shortcut that crumbles on long-horizon tasks. MolmoAct proposes a cleaner recipe: an Action Reasoning Model (ARM) that explicitly separates perception, planning, and control so robots can reason about where to act before deciding how.

A three-stage pipeline you can steer

MolmoAct encodes images and instructions into depth-aware perception tokens, then produces a mid-level spatial plan as editable trajectory traces, and finally emits precise low-level actions. Because the plan lives as a manipulable trajectory, behavior is explainable—and steerable—without retraining. 
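
To make the decomposition concrete, here is a minimal Python sketch of the three-stage flow with toy stand-ins for each stage. The names, signatures, and stub outputs are illustrative assumptions, not MolmoAct's released interface.

```python
from dataclasses import dataclass

@dataclass
class SpatialPlan:
    # Mid-level plan: a trajectory trace in image space, as (x, y) waypoints.
    waypoints: list

def perceive(image, instruction):
    # Stage 1: encode pixels + text into depth-aware perception tokens.
    # Stubbed here as a fixed token list.
    return [101, 7, 42]

def plan(tokens):
    # Stage 2: decode a spatial plan. Because it is plain data, a human
    # (or a higher-level policy) can inspect and edit it before execution.
    return SpatialPlan(waypoints=[(0.2, 0.8), (0.5, 0.5), (0.7, 0.3)])

def act(tokens, spatial_plan):
    # Stage 3: emit low-level actions conditioned on the tokens and the plan
    # (here, one dummy 7-DoF action per waypoint).
    return [[0.0] * 7 for _ in spatial_plan.waypoints]

def rollout(image, instruction, edit=None):
    tokens = perceive(image, instruction)
    trace = plan(tokens)
    if edit is not None:
        trace = edit(trace)  # steer behavior without retraining
    return act(tokens, trace)

# Usage: nudge the middle waypoint, e.g. to route around an obstacle.
actions = rollout(image=None, instruction="pick up the mug",
                  edit=lambda t: SpatialPlan([t.waypoints[0], (0.55, 0.6),
                                              t.waypoints[2]]))
```

The structural point is the middle stage: because the trajectory trace is plain data rather than hidden activations, editing it (as the lambda above does) changes behavior without touching model weights.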

Numbers that move the needle

  • SimplerEnv (Visual Matching, zero-shot): 70.5%, beating closed models such as Pi-0 and GR00T N1.

  • LIBERO (avg): 86.6% success, including a +6.3-point gain over ThinkAct on long-horizon tasks. 

  • Real-world fine-tuning: an additional +10% task progression on single-arm and +22.7% on bimanual setups vs. Pi-0-FAST.

  • OOD generalization: +23.3% over baselines; also top human-preference scores for instruction following and trajectory steering. 

An open blueprint, not just a model

The team releases MolmoAct-7B-D weights, training code, and—importantly—the MolmoAct Dataset, over 10,000 high-quality robot trajectories spanning diverse scenarios. Adding this mid-training set yields an average +5.5% performance lift over the base model, making it a practical plug-in for existing stacks. 
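
As a rough illustration of what "plug-in" could look like, here is a hypothetical mid-training data mix that interleaves the released trajectories with a lab's own episodes. The loader, paths, field names, and the 30% ratio are assumptions made for the sketch, not part of the official release.

```python
import random

def load_trajectories(path):
    # Stand-in loader; a real stack would parse episodes (images,
    # instructions, action sequences) from `path`.
    return [{"source": path, "episode": i} for i in range(100)]

def mixed_batches(own_data, molmoact_data, mix_ratio=0.3, batch_size=8,
                  seed=0):
    # Yield batches where roughly `mix_ratio` of samples come from the
    # MolmoAct trajectories and the rest from the lab's existing data.
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(molmoact_data) if rng.random() < mix_ratio
            else rng.choice(own_data)
            for _ in range(batch_size)
        ]

own = load_trajectories("data/our_robot_episodes")
molmo = load_trajectories("data/molmoact_dataset")
first_batch = next(mixed_batches(own, molmo))
```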

Why it matters

By promoting spatial plans to first-class citizens, MolmoAct bridges the gap between language-level intent and controller-level execution. For labs and startups, that means debuggable policies, few-shot steerability, and a realistic path to explainable manipulation at scale, without locking into a closed stack.

Paper link: arXiv 2508.07917 (PDF)
