
26.7.25

PhyWorldBench asks: can your video model obey gravity?

Text-to-video (T2V) generators can paint dazzling scenes, but do they respect momentum, energy conservation—or even keep objects from phasing through walls? PhyWorldBench says "not yet." The new 31-page study introduces a physics-first benchmark that pits 12 state-of-the-art models (five proprietary, seven open source) against 1,050 carefully curated prompts spanning real and deliberately impossible scenarios. The verdict: even the best models fumble basic mechanics, with the proprietary Pika 2.0 topping its class at a modest 0.262 success rate, while Wanx-2.1 leads the open-source pack.

A benchmark built like a physics textbook

Researchers defined 10 main physics categories, each split into 5 subcategories, then wrote 7 scenarios per subcategory—and for every scenario, three prompt styles (event, physics‑enhanced, detailed narrative). Multiply it out (10 × 5 × 7 × 3) and you get 1,050 prompts without redundancy.
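A minimal sketch of how that prompt grid multiplies out; the category and subcategory names below are invented placeholders, not the paper's actual taxonomy.

```python
from itertools import product

# Hypothetical slice of the taxonomy -- the real benchmark defines
# 10 categories x 5 subcategories (the names here are invented).
TAXONOMY = {
    "gravity": ["free fall", "projectile motion"],
    "collisions": ["rigid body impact", "elastic bounce"],
}
SCENARIOS_PER_SUBCATEGORY = 7
PROMPT_STYLES = ["event", "physics-enhanced", "detailed narrative"]

def count_prompts(n_categories=10, n_subcategories=5, n_scenarios=7, n_styles=3):
    """Total prompt count for the full grid: 10 * 5 * 7 * 3 = 1,050."""
    return n_categories * n_subcategories * n_scenarios * n_styles

def enumerate_prompts(taxonomy):
    """Yield (category, subcategory, scenario_id, style) for a toy taxonomy."""
    for category, subcategories in taxonomy.items():
        for subcategory, scenario_id, style in product(
            subcategories, range(SCENARIOS_PER_SUBCATEGORY), PROMPT_STYLES
        ):
            yield category, subcategory, scenario_id, style

print(count_prompts())                          # 1050
print(len(list(enumerate_prompts(TAXONOMY))))   # 2 * 2 * 7 * 3 = 84 for the toy slice
```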

Anti‑physics on purpose

One twist: an “Anti‑Physics” track where prompts violate real laws (e.g., objects accelerating upward). These gauge whether models blindly mimic training data or can intentionally break rules when asked. 
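For instance, a lawful prompt and its anti-physics twin might look like this (the pairing below is our illustration, not taken from the benchmark):

```python
# A lawful prompt and its deliberate anti-physics counterpart (illustrative, ours).
lawful = "A dropped apple accelerates toward the ground and lands."
anti_physics = "A dropped apple accelerates upward, away from the ground."
```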

Cheap(er) scoring with an MLLM judge

Instead of hand‑labeling 12,600 generated videos, the team devised a yes/no metric using modern multimodal LLMs (GPT‑4o, Gemini‑1.5‑Pro) to check “basic” and “key” physics standards. Large human studies back its reliability, making large‑scale physics eval feasible. 
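To make the scoring pipeline concrete, here is a minimal sketch of a zero-shot yes/no judge in the same spirit, using the OpenAI Python client with GPT‑4o on sampled frames. The frame sampling, the judging prompt's wording, and the example "basic"/"key" standards are our assumptions, not the paper's exact protocol.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_video(frame_paths, basic_standard, key_standard):
    """Ask GPT-4o a strict yes/no question about frames sampled from one video.

    The prompt below is an illustrative guess at the style of check
    PhyWorldBench describes, not the paper's exact wording.
    """
    images = []
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })

    question = (
        "These frames are sampled in order from one generated video.\n"
        f"Basic physics standard: {basic_standard}\n"
        f"Key physics standard: {key_standard}\n"
        "Does the video satisfy BOTH standards? Answer strictly 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": question}, *images]}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Example call with hypothetical standards for a falling-ball scenario:
# judge_video(["f0.jpg", "f1.jpg", "f2.jpg"],
#             basic_standard="the ball appears and persists across frames",
#             key_standard="the ball accelerates downward and never passes through the floor")
```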

What tripped models up

  • Temporal consistency & motion realism still break first.

  • Higher‑complexity composites (rigid body collisions, fluids, human/animal motion) expose bigger gaps.

  • Models often follow cinematic cues over physics, picking “cool” shots that contradict dynamics. 

Prompting matters (a lot)

Richer, physics‑aware prompts help—but only so much. The authors outline prompt‑crafting tips that nudge models toward lawful motion, yet many failures persist, hinting at architectural limits. 
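To show what that escalation from bare event to physics-aware phrasing can look like, here are three prompt styles for one hypothetical scenario (these example prompts are ours, not drawn from the benchmark):

```python
# Three prompt styles for one invented scenario (illustrative, not from the paper).
scenario = "a ball rolls off a table"

prompts = {
    "event": "A ball rolls off a table.",
    "physics-enhanced": (
        "A ball rolls off a table and falls under gravity, accelerating "
        "downward and bouncing lower on each impact."
    ),
    "detailed narrative": (
        "A red rubber ball rolls across a wooden table, reaches the edge, and "
        "tips over. It follows a parabolic arc as it falls, strikes the tile "
        "floor, and rebounds several times, each bounce lower than the last as "
        "energy dissipates, until it rolls to a stop."
    ),
}
```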

Why this matters

  • Reality is the next frontier. As T2V engines head for simulation, education, and robotics, looking right isn't enough—they must behave right.

  • Benchmarks drive progress. Prior suites (VBench, VideoPhy, PhyGenBench) touched pieces of the problem; PhyWorldBench widens coverage and difficulty, revealing headroom hidden by softer tests. 

  • MLLM evaluators scale oversight. A simple, zero‑shot judge could generalize to other “lawfulness” checks—chemistry, finance, safety—without armies of annotators. 

The authors release all prompts, annotations and a leaderboard, inviting labs to iterate on physical correctness—not just prettier pixels. Until models stop dropping balls through floors, PhyWorldBench is likely to be the scoreboard everyone cites.

Paper link: arXiv 2507.13428 (PDF)
