Anyone who has watched today’s end‑to‑end robot policies fail a complex kitchen task knows the weakness: they map pixels to motors with no explicit plan. ThinkAct flips that script. The NTU‑NVIDIA team behind the paper trains a multimodal LLM to write a high‑level reasoning plan, turns that plan into a compact visual‑plan latent, then hands it to a DiT‑based action model that executes at control‑loop speed. The result is an agent that deliberates like GPT‑4o yet moves with the reactivity of classic policies.
How ThinkAct pulls it off
| Component | What it does | Why it matters |
|---|---|---|
| Reinforced visual latent planning | Rewards the reasoning LLM with goal‑completion and trajectory‑consistency signals derived from vision, forcing plans that actually work in the scene. | Bridges abstract language plans to pixel‑level feedback. |
| Visual‑plan latent | Compresses the entire chain‑of‑thought into a fixed‑size latent that conditions a frozen DiT policy. | Keeps the policy lightweight and allows asynchronous slow‑think / fast‑act loops. |
| Dual‑system inference | The LLM thinks a few times per second; the action model ticks every 20 ms. | Yields real‑time control without sacrificing deliberation. |
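To make the two mechanisms in the table concrete, here is a minimal Python sketch, not the authors' code: the names (`plan_reward`, `ReasoningMLLM`, `DiTActionPolicy`, `get_camera_frame`), the 0.5 reward weight, the latent size, and the stubbed model outputs are all assumptions. It illustrates the shape of a reward that mixes goal‑completion and trajectory‑consistency signals, and the asynchronous slow‑think / fast‑act loop in which the LLM refreshes a visual‑plan latent a few times per second while the action head ticks every 20 ms.

```python
import threading
import time
import numpy as np


def plan_reward(goal_score: float, traj_consistency: float, w: float = 0.5) -> float:
    """Shaped RL reward for the reasoning LLM (the weighting is an assumption):
    goal_score       - does the plan's end state satisfy the instruction?
    traj_consistency - does the plan's implied visual trajectory match the scene?"""
    return w * goal_score + (1.0 - w) * traj_consistency


class ReasoningMLLM:
    """Slow system: a multimodal LLM that writes a reasoning plan and compresses
    it into a fixed-size visual-plan latent (stubbed with random numbers here)."""
    LATENT_DIM = 256

    def plan(self, rgb_frame: np.ndarray, instruction: str) -> np.ndarray:
        time.sleep(0.3)  # stands in for a few hundred ms of LLM reasoning
        return np.random.randn(self.LATENT_DIM).astype(np.float32)


class DiTActionPolicy:
    """Fast system: a frozen action head conditioned on the plan latent (dummy output)."""
    def act(self, rgb_frame: np.ndarray, plan_latent: np.ndarray) -> np.ndarray:
        return np.tanh(plan_latent[:7])  # placeholder 7-DoF action


def get_camera_frame() -> np.ndarray:
    return np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder observation


# Dual-rate loop: the LLM re-plans a few times per second, the policy ticks every 20 ms.
instruction = "put the bowl in the sink"
latest_latent = ReasoningMLLM().plan(get_camera_frame(), instruction)
latent_lock = threading.Lock()
stop = threading.Event()


def slow_think(llm: ReasoningMLLM) -> None:
    global latest_latent
    while not stop.is_set():
        latent = llm.plan(get_camera_frame(), instruction)  # roughly 2-3 Hz
        with latent_lock:
            latest_latent = latent


def fast_act(policy: DiTActionPolicy, steps: int = 250) -> None:
    for _ in range(steps):  # 50 Hz control loop
        with latent_lock:
            latent = latest_latent
        action = policy.act(get_camera_frame(), latent)
        # send `action` to the robot controller here
        time.sleep(0.02)


thinker = threading.Thread(target=slow_think, args=(ReasoningMLLM(),))
thinker.start()
fast_act(DiTActionPolicy())
stop.set()
thinker.join()
```

The design point the sketch captures is that the action model never waits on the LLM: it always reads the most recent latent, so deliberation and control run at their own rates.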
Benchmark sweep at two skill levels
| Suite | Metric | Prev SOTA | ThinkAct |
|---|---|---|---|
| EgoPlan‑Bench2 | Acc. ↑ | Qwen 2.5‑VL* 66.3 | 71.4 |
| RoboVQA | Acc. ↑ | Qwen 2.5‑VL* 63.5 | 69.2 |
| OpenEQA | Acc. ↑ | OpenVLA 52.1 | 57.8 |
| SimplerEnv (manip.) | Succ. % ↑ | DiT‑Policy 45.2 | 62.7 |
| LIBERO (manip.) | Succ. % ↑ | OpenVLA 48.9 | 60.3 |
Few‑shot powers
With just 5–10 demos per LIBERO task, ThinkAct’s policy finetunes to new objects and layouts, beating OpenVLA by 9–12 points.
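For a sense of what that few‑shot adaptation could look like in practice, here is a hedged sketch: it assumes the reasoning LLM stays frozen and only a small action head is behavior‑cloned on (plan latent, observation feature, action) triples extracted from the demos. The MLP head, the MSE loss, and every shape and hyperparameter are placeholders (the paper's action model is a DiT policy, which would use a diffusion objective), so read this as an illustration of the data volume, not the method.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sizes: 256-d visual-plan latent, 512-d visual feature, 7-DoF action.
LATENT_DIM, OBS_DIM, ACT_DIM = 256, 512, 7

# Stand-in action head; the real model is a DiT policy, this is just a placeholder MLP.
action_head = nn.Sequential(
    nn.Linear(LATENT_DIM + OBS_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, ACT_DIM),
)

# A handful of tele-op demos, pre-encoded offline by the frozen reasoning LLM and
# visual encoder into (latent, observation, action) triples (random stand-ins here).
num_steps = 8 * 120  # e.g. 8 demos of ~120 control steps each (numbers are illustrative)
latents = torch.randn(num_steps, LATENT_DIM)
obs_feats = torch.randn(num_steps, OBS_DIM)
actions = torch.randn(num_steps, ACT_DIM)
loader = DataLoader(TensorDataset(latents, obs_feats, actions), batch_size=64, shuffle=True)

# Few-shot behavior cloning: only the action head is updated, the reasoning LLM stays frozen.
opt = torch.optim.AdamW(action_head.parameters(), lr=1e-4)
for epoch in range(20):
    for z, o, a in loader:
        pred = action_head(torch.cat([z, o], dim=-1))
        loss = nn.functional.mse_loss(pred, a)
        opt.zero_grad()
        loss.backward()
        opt.step()
```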
Why this matters
- Plan‑centric embodied AI. ThinkAct shows that giving agents an explicit, reward‑aligned plan latent trumps opaque end‑to‑end policies for long‑horizon tasks.
- Self‑reflection in the loop. The reasoning LLM can detect a failure mid‑episode, revise its latent plan, and rescue the run, a first for open‑source VLA systems (see the sketch after this list).
- Few‑shot deployment. Labs can adapt to a new kitchen or warehouse with handfuls of tele‑op traces instead of days of retraining.
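The self‑reflection bullet maps naturally onto a replan‑on‑failure loop. The sketch below is an assumption‑laden illustration, not the paper's procedure: `detect_failure` is a made‑up stall heuristic, and `llm`, `policy`, and `env` are duck‑typed placeholders (the stubs from the dual‑system sketch above would serve for the first two; `env` is assumed to expose `observe`, `step`, and `done`).

```python
import numpy as np


def detect_failure(pose_history, threshold=0.05):
    """Hypothetical stall check: flag a failure if the end-effector has barely
    moved over the last ten control steps."""
    if len(pose_history) < 10:
        return False
    recent = np.stack(pose_history[-10:])
    return float(np.linalg.norm(recent[-1] - recent[0])) < threshold


def run_episode(llm, policy, env, instruction, max_steps=500):
    """Replan-on-failure loop: the reasoning LLM only re-thinks when the fast
    policy appears stuck, then execution continues under the revised latent."""
    latent = llm.plan(env.observe(), instruction)
    pose_history = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy.act(obs, latent)
        pose_history.append(env.step(action))   # assume step() returns the end-effector pose
        if detect_failure(pose_history):
            latent = llm.plan(obs, instruction)  # self-reflection: revise the plan latent
            pose_history.clear()
        if env.done():
            return True
    return False
```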
ThinkAct’s code is coming soon, but the project page already hosts videos of robots closing drawers, shifting condiments and answering environment‑specific questions after reasoning out loud. The message is clear: future embodied agents won’t just map images to torque — they’ll think, decide why, then act.
Paper link: arXiv 2507.16815 (PDF)