Showing posts with label Allen Institute for AI. Show all posts

12.8.25

MolmoAct brings editable spatial plans to robot foundation models

Most robot foundation models (FMs) still map pixels and instructions straight to torques, a shortcut that crumbles on long-horizon tasks. MolmoAct proposes a cleaner recipe: an Action Reasoning Model (ARM) that explicitly separates perception, planning, and control so robots can reason about where to act before deciding how.

A three-stage pipeline you can steer

MolmoAct encodes images and instructions into depth-aware perception tokens, then produces a mid-level spatial plan as editable trajectory traces, and finally emits precise low-level actions. Because the plan lives as a manipulable trajectory, behavior is explainable—and steerable—without retraining. 
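To make the "editable plan" idea concrete, here is a minimal toy sketch of a three-stage pipeline in Python. It is not MolmoAct's code or API: every name here (`perceive`, `plan_trace`, `decode_actions`, `steer`) is hypothetical, perception is a hard-coded stub, and the "plan" is just a straight line of 2D waypoints. The only point it illustrates is that when the plan is an explicit trajectory, you can edit a waypoint and re-decode actions without touching the model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in image coordinates; an assumption for this toy


@dataclass
class Perception:
    """Stand-in for depth-aware perception tokens (hypothetical structure)."""
    gripper_xy: Waypoint
    target_xy: Waypoint


def perceive(image, instruction: str) -> Perception:
    # Stage 1 (stub): a real model would encode the image and instruction.
    # Fixed values are returned here so the example is deterministic.
    return Perception(gripper_xy=(0.0, 0.0), target_xy=(8.0, 4.0))


def plan_trace(p: Perception, steps: int = 4) -> List[Waypoint]:
    # Stage 2 (toy): a mid-level spatial plan as a list of waypoints,
    # linearly interpolated from the gripper to the target.
    (x0, y0), (x1, y1) = p.gripper_xy, p.target_xy
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(steps + 1)
    ]


def decode_actions(trace: List[Waypoint]) -> List[Waypoint]:
    # Stage 3 (toy): low-level actions as displacements between waypoints.
    return [(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(trace, trace[1:])]


def steer(trace: List[Waypoint], index: int, new_point: Waypoint) -> List[Waypoint]:
    # Edit one waypoint and return a new trace; the original is left untouched.
    edited = list(trace)
    edited[index] = new_point
    return edited


trace = plan_trace(perceive(image=None, instruction="pick up the cup"))
actions = decode_actions(trace)

# A user drags the middle waypoint upward; only the plan changes, not the model.
steered = steer(trace, 2, (4.0, 5.0))
steered_actions = decode_actions(steered)
```

In this toy, the steered actions still sum to the same overall displacement, because the start and end waypoints are unchanged; only the path between them differs.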

Numbers that move the needle

  • SimplerEnv (Visual Matching, zero-shot): 70.5%, beating closed models like Pi-0 and GR00T N1.

  • LIBERO (avg): 86.6% success, including a +6.3-point gain over ThinkAct on long-horizon tasks. 

  • Real-world fine-tuning: additional +10% task progression on single-arm and +22.7% on bimanual setups vs Pi-0-FAST.

  • OOD generalization: +23.3% over baselines; also top human-preference scores for instruction following and trajectory steering. 

An open blueprint, not just a model

The team releases MolmoAct-7B-D weights, training code, and—importantly—the MolmoAct Dataset, over 10,000 high-quality robot trajectories spanning diverse scenarios. Adding this mid-training set yields an average +5.5% performance lift over the base model, making it a practical plug-in for existing stacks. 

Why it matters

By promoting spatial plans to first-class citizens, MolmoAct bridges the gap between language-level intent and controller-level execution. For labs and startups, that means debuggable policies, few-shot steerability, and a realistic path to explainable manipulation at scale, without being locked into a closed stack.

Paper link: arXiv 2508.07917 (PDF)

 Anthropic has expanded Claude Sonnet 4’s context window to a full 1,000,000 tokens, a five-fold jump that shifts what teams can do in a sin...