
23.7.25

ThinkAct lets robots “think, then act” — and the payoff is new SOTA across embodied AI benchmarks

 Anyone who has watched today’s end‑to‑end robot policies fail a complex kitchen task knows the weakness: they map pixels to motors with no explicit plan. ThinkAct flips that script. The NTU‑NVIDIA team behind the paper trains a multimodal LLM to write a high‑level reasoning plan, turns that plan into a compact visual‑plan latent, then hands it to a DiT‑based action model that executes at control‑loop speed. The result is an agent that deliberates like GPT‑4o yet moves with the reactivity of classic policies.
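The code is not out yet, so the sketch below is only a minimal illustration of that data flow under our own assumptions; every name in it (ReasoningMLLM, PlanCompressor, DiTActionPolicy, the 7-dim action) is a hypothetical placeholder, not the authors' API.

```python
# Minimal sketch of the think-then-act data flow described above.
# All class names and shapes are hypothetical placeholders, not the released API.
import numpy as np

class ReasoningMLLM:
    """Slow 'think' module: writes a high-level plan from an image and an instruction."""
    def plan(self, rgb_frame: np.ndarray, instruction: str) -> str:
        # In the real system a multimodal LLM would emit chain-of-thought text here.
        return f"1. locate target for '{instruction}' 2. approach 3. grasp 4. place"

class PlanCompressor:
    """Compresses the textual plan into a fixed-size visual-plan latent."""
    def __init__(self, latent_dim: int = 256):
        self.latent_dim = latent_dim
    def encode(self, plan_text: str) -> np.ndarray:
        # Placeholder: a learned encoder would map the chain-of-thought to a latent vector.
        rng = np.random.default_rng(abs(hash(plan_text)) % (2**32))
        return rng.standard_normal(self.latent_dim)

class DiTActionPolicy:
    """Fast 'act' module: a frozen DiT policy conditioned on the plan latent."""
    def act(self, rgb_frame: np.ndarray, proprio: np.ndarray,
            plan_latent: np.ndarray) -> np.ndarray:
        # Placeholder: the real policy would denoise an action chunk conditioned on the latent.
        return np.zeros(7)  # e.g. 6-DoF end-effector delta + gripper

# Toy wiring, with zero arrays standing in for camera and proprioception streams.
mllm, compressor, policy = ReasoningMLLM(), PlanCompressor(), DiTActionPolicy()
frame = np.zeros((224, 224, 3))
latent = compressor.encode(mllm.plan(frame, "put the mug on the shelf"))
action = policy.act(frame, np.zeros(7), latent)
```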


How ThinkAct pulls it off

Component | What it does | Why it matters
Reinforced visual latent planning | Rewards the reasoning LLM with goal‑completion and trajectory‑consistency signals derived from vision, forcing plans that actually work in the scene. | Bridges abstract language plans to pixel‑level feedback.
Visual‑plan latent | Compresses the entire chain‑of‑thought into a fixed‑size latent that conditions a frozen DiT policy. | Keeps the policy lightweight and allows asynchronous slow‑think / fast‑act loops.
Dual‑system inference | LLM thinks a few times per second; the action model ticks every 20 ms (see the loop sketch below). | Yields real‑time control without sacrificing deliberation.
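To make the "a few times per second vs. every 20 ms" split concrete, here is a minimal sequential sketch of that schedule, reusing the placeholder classes from the sketch above. In a real deployment the planner would run in its own thread or process, and the rates and helper callables (get_frame, get_proprio, send_action) are assumptions, not the paper's interface.

```python
import time

def control_loop(mllm, compressor, policy, get_frame, get_proprio, send_action,
                 instruction, replan_every_s=0.5, control_dt=0.02, max_steps=1000):
    """Sequential approximation of the slow-think / fast-act schedule: the reasoning
    LLM refreshes the plan latent a few times per second, while the frozen DiT
    policy emits an action every `control_dt` seconds (20 ms)."""
    plan_latent = None
    last_plan_time = float("-inf")
    for _ in range(max_steps):
        tick = time.monotonic()
        frame = get_frame()
        if tick - last_plan_time >= replan_every_s:
            # Slow path: re-run the reasoning LLM and refresh the cached plan latent.
            plan_latent = compressor.encode(mllm.plan(frame, instruction))
            last_plan_time = tick
        # Fast path: react at control rate using whichever latent is currently cached.
        send_action(policy.act(frame, get_proprio(), plan_latent))
        # Sleep off the remainder of the control period, if any.
        time.sleep(max(0.0, control_dt - (time.monotonic() - tick)))
```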

Benchmark sweep at two skill levels

Suite | Metric | Prev SOTA | ThinkAct
EgoPlan‑Bench2 | Acc. ↑ | Qwen 2.5‑VL* 66.3 | 71.4
RoboVQA | Acc. ↑ | Qwen 2.5‑VL* 63.5 | 69.2
OpenEQA | Acc. ↑ | OpenVLA 52.1 | 57.8
SimplerEnv (manip.) | Succ. % ↑ | DiT‑Policy 45.2 | 62.7
LIBERO (manip.) | Succ. % ↑ | OpenVLA 48.9 | 60.3

* Qwen 2.5‑VL numbers are the authors’ fine‑tuned baseline.

Few‑shot powers

With just 5–10 demos per LIBERO task, ThinkAct’s policy fine‑tunes to new objects and layouts, beating OpenVLA by 9–12 points.
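Operationally, that kind of adaptation usually amounts to a short supervised finetune of the action model on the demo trajectories while the reasoning MLLM stays fixed. The toy behavior-cloning loop below only illustrates the shape of that step under our own assumptions; it is not the paper's recipe (a DiT policy would normally be trained with a diffusion objective rather than plain MSE).

```python
import torch

def few_shot_finetune(policy: torch.nn.Module, demos, epochs: int = 50, lr: float = 1e-4):
    """Toy behavior-cloning adaptation on a handful of tele-op demos.
    `demos` is an iterable of (observation, action) tensor pairs; only the
    downstream action model is updated in this sketch."""
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    policy.train()
    for _ in range(epochs):
        for obs, action in demos:
            loss = torch.nn.functional.mse_loss(policy(obs), action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```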


Why this matters

  • Plan‑centric embodied AI. ThinkAct shows that giving agents an explicit, reward‑aligned plan latent trumps opaque end‑to‑end policies for long‑horizon tasks (a toy version of such a reward is sketched after this list).

  • Self‑reflection in the loop. The reasoning LLM can detect a failure mid‑episode, revise its latent plan, and rescue the run — a first for open‑source VLA systems.

  • Few‑shot deployment. Labs can adapt to a new kitchen or warehouse with handfuls of tele‑op traces instead of days of retraining.
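For readers wondering what a "reward‑aligned" plan signal can look like, here is a toy shaped planning reward mixing a sparse goal‑completion term with a dense trajectory‑consistency term. The weights, the distance metric and the function name are illustrative assumptions, not the paper's actual formulation, which derives both signals from visual feedback.

```python
import numpy as np

def plan_reward(pred_traj: np.ndarray, ref_traj: np.ndarray, goal_reached: bool,
                w_goal: float = 1.0, w_traj: float = 0.5) -> float:
    """Toy shaped reward for plan generation: sparse goal completion plus a dense
    trajectory-consistency penalty (mean point-wise distance between the
    plan-implied trajectory and a reference trajectory of the same shape)."""
    r_goal = 1.0 if goal_reached else 0.0
    r_traj = -float(np.mean(np.linalg.norm(pred_traj - ref_traj, axis=-1)))
    return w_goal * r_goal + w_traj * r_traj
```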


ThinkAct’s code is coming soon, but the project page already hosts videos of robots closing drawers, shifting condiments and answering environment‑specific questions after reasoning out loud. The message is clear: future embodied agents won’t just map images to torque — they’ll think, decide why, then act.

Paper link: arXiv 2507.16815 (PDF)
