
12.8.25

MolmoAct brings editable spatial plans to robot foundation models

Most robot FMs still map pixels and instructions straight to torques, a shortcut that crumbles on long-horizon tasks. MolmoAct proposes a cleaner recipe: an Action Reasoning Model (ARM) that explicitly separates perception, planning, and control so robots can reason about where to act before deciding how.

A three-stage pipeline you can steer

MolmoAct encodes images and instructions into depth-aware perception tokens, then produces a mid-level spatial plan as editable trajectory traces, and finally emits precise low-level actions. Because the plan lives as a manipulable trajectory, behavior is explainable—and steerable—without retraining. 
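
To make the decomposition concrete, here is a minimal Python sketch of the three-stage flow with toy stand-ins for each stage. The names, signatures, and stub outputs are illustrative assumptions, not MolmoAct's released interface.

```python
from dataclasses import dataclass

@dataclass
class SpatialPlan:
    # Mid-level plan: a trajectory trace in image space, as (x, y) waypoints.
    waypoints: list

def perceive(image, instruction):
    # Stage 1: encode pixels + text into depth-aware perception tokens.
    # Stubbed here as a fixed token list.
    return [101, 7, 42]

def plan(tokens):
    # Stage 2: decode a spatial plan. Because it is plain data, a human
    # (or a higher-level policy) can inspect and edit it before execution.
    return SpatialPlan(waypoints=[(0.2, 0.8), (0.5, 0.5), (0.7, 0.3)])

def act(tokens, spatial_plan):
    # Stage 3: emit low-level actions conditioned on the tokens and the plan
    # (here, one dummy 7-DoF action per waypoint).
    return [[0.0] * 7 for _ in spatial_plan.waypoints]

def rollout(image, instruction, edit=None):
    tokens = perceive(image, instruction)
    trace = plan(tokens)
    if edit is not None:
        trace = edit(trace)  # steer behavior without retraining
    return act(tokens, trace)

# Usage: nudge the middle waypoint, e.g. to route around an obstacle.
actions = rollout(image=None, instruction="pick up the mug",
                  edit=lambda t: SpatialPlan([t.waypoints[0], (0.55, 0.6),
                                              t.waypoints[2]]))
```

The structural point is the middle stage: because the trajectory trace is plain data rather than hidden activations, editing it (as the lambda above does) changes behavior without touching model weights.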

Numbers that move the needle

  • SimplerEnv (Visual Matching, zero-shot): 70.5%, beating closed models such as Pi-0 and GR00T N1.

  • LIBERO (avg): 86.6% success, including a +6.3-point gain over ThinkAct on long-horizon tasks. 

  • Real-world fine-tuning: an additional +10% task progression on single-arm and +22.7% on bimanual setups vs. Pi-0-FAST.

  • OOD generalization: +23.3% over baselines; also top human-preference scores for instruction following and trajectory steering. 

An open blueprint, not just a model

The team releases MolmoAct-7B-D weights, training code, and—importantly—the MolmoAct Dataset, over 10,000 high-quality robot trajectories spanning diverse scenarios. Adding this mid-training set yields an average +5.5% performance lift over the base model, making it a practical plug-in for existing stacks. 
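
As a rough illustration of what "plug-in" could look like, here is a hypothetical mid-training data mix that interleaves the released trajectories with a lab's own episodes. The loader, paths, field names, and the 30% ratio are assumptions made for the sketch, not part of the official release.

```python
import random

def load_trajectories(path):
    # Stand-in loader; a real stack would parse episodes (images,
    # instructions, action sequences) from `path`.
    return [{"source": path, "episode": i} for i in range(100)]

def mixed_batches(own_data, molmoact_data, mix_ratio=0.3, batch_size=8,
                  seed=0):
    # Yield batches where roughly `mix_ratio` of samples come from the
    # MolmoAct trajectories and the rest from the lab's existing data.
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(molmoact_data) if rng.random() < mix_ratio
            else rng.choice(own_data)
            for _ in range(batch_size)
        ]

own = load_trajectories("data/our_robot_episodes")
molmo = load_trajectories("data/molmoact_dataset")
first_batch = next(mixed_batches(own, molmo))
```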

Why it matters

By promoting spatial plans to first-class citizens, MolmoAct bridges the gap between language-level intent and controller-level execution. For labs and startups, that means debuggable policies, few-shot steerability, and a realistic path to explainable manipulation at scale, without locking into a closed stack.

Paper link: arXiv 2508.07917 (PDF)
