Showing posts with label text-to-video. Show all posts

26.7.25

PhyWorldBench asks: can your video model obey gravity?

Text-to-video (T2V) generators can paint dazzling scenes, but do they respect momentum, energy conservation, or even keep objects from phasing through walls? PhyWorldBench says "not yet." The new 31-page study introduces a physics-first benchmark that pits 12 state-of-the-art models (five proprietary, seven open source) against 1,050 carefully curated prompts spanning real and deliberately impossible scenarios. The verdict: even the best models fumble basic mechanics. The proprietary Pika 2.0 tops its class with a modest 0.262 success rate, while Wanx-2.1 leads the open-source field.

A benchmark built like a physics textbook

Researchers defined 10 main physics categories, each split into 5 subcategories, then wrote 7 scenarios per subcategory—and for every scenario, three prompt styles (event, physics‑enhanced, detailed narrative). That’s how you get to 1,050 prompts without redundancy. 
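The taxonomy arithmetic is easy to sanity-check. A minimal sketch (the category counts come from the paper; the style names are the three prompt styles it describes):

```python
# Prompt-taxonomy arithmetic for PhyWorldBench's benchmark design.
N_CATEGORIES = 10      # main physics categories
N_SUBCATEGORIES = 5    # subcategories per category
N_SCENARIOS = 7        # scenarios per subcategory
PROMPT_STYLES = ["event", "physics-enhanced", "detailed narrative"]

total_prompts = N_CATEGORIES * N_SUBCATEGORIES * N_SCENARIOS * len(PROMPT_STYLES)
print(total_prompts)  # 1050

# Each prompt is run through every model under test, so the pool to score is:
N_MODELS = 12
print(total_prompts * N_MODELS)  # 12600 generated videos
```

That second number is why hand-labeling every output is impractical, and why the authors lean on an automated judge.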

Anti‑physics on purpose

One twist: an “Anti‑Physics” track where prompts violate real laws (e.g., objects accelerating upward). These gauge whether models blindly mimic training data or can intentionally break rules when asked. 

Cheap(er) scoring with an MLLM judge

Instead of hand‑labeling 12,600 generated videos, the team devised a yes/no metric using modern multimodal LLMs (GPT‑4o, Gemini‑1.5‑Pro) to check “basic” and “key” physics standards. Large human studies back its reliability, making large‑scale physics eval feasible. 
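The judging scheme reduces to a binary check per standard, aggregated into a success rate. A hedged sketch of that aggregation logic, where `query_judge` is a hypothetical stand-in for a real multimodal API call (stubbed here so the logic is runnable; the actual prompt wording and standards are the paper's, not shown):

```python
def query_judge(video_path: str, standard: str) -> str:
    """Stub: a real implementation would send sampled video frames plus a
    physics standard to an MLLM (e.g. GPT-4o) and return its yes/no answer."""
    return "yes"  # placeholder verdict

def passes(video_path: str, basic_standard: str, key_standard: str) -> bool:
    # A video counts as a success only if the judge answers "yes" to
    # both the basic and the key physics standard.
    return all(
        query_judge(video_path, std).strip().lower().startswith("yes")
        for std in (basic_standard, key_standard)
    )

# Illustrative entries, not the benchmark's actual standards:
videos = [("clip_001.mp4", "objects remain solid", "the ball falls under gravity")]
success_rate = sum(passes(v, b, k) for v, b, k in videos) / len(videos)
print(success_rate)
```

The appeal of the yes/no framing is exactly this simplicity: no rubric scoring or frame-level annotation, just two binary checks per video, which is what makes 12,600-video evaluation tractable.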

What tripped models up

  • Temporal consistency & motion realism still break first.

  • Higher‑complexity composites (rigid body collisions, fluids, human/animal motion) expose bigger gaps.

  • Models often follow cinematic cues over physics, picking “cool” shots that contradict dynamics. 

Prompting matters (a lot)

Richer, physics‑aware prompts help—but only so much. The authors outline prompt‑crafting tips that nudge models toward lawful motion, yet many failures persist, hinting at architectural limits. 

Why this matters

  • Reality is the next frontier. As T2V engines head for simulation, education and robotics, looking right isn't enough; they must behave right.

  • Benchmarks drive progress. Prior suites (VBench, VideoPhy, PhyGenBench) touched pieces of the problem; PhyWorldBench widens coverage and difficulty, revealing headroom hidden by softer tests. 

  • MLLM evaluators scale oversight. A simple, zero‑shot judge could generalize to other “lawfulness” checks—chemistry, finance, safety—without armies of annotators. 

The authors release all prompts, annotations and a leaderboard, inviting labs to iterate on physical correctness—not just prettier pixels. Until models stop dropping balls through floors, PhyWorldBench is likely to be the scoreboard everyone cites.

Paper link: arXiv 2507.13428 (PDF)

3.6.25

OpenAI's Sora Now Free on Bing Mobile: Create AI Videos Without a Subscription

In a significant move to democratize AI video creation, Microsoft has integrated OpenAI's Sora into its Bing mobile app, enabling users to generate AI-powered videos from text prompts without any subscription fees. This development broadens access to capabilities previously available only to ChatGPT Plus or Pro subscribers.

Sora's Integration into Bing Mobile

Sora, OpenAI's text-to-video model, can now be accessed through the Bing Video Creator feature within the Bing mobile app, available on both iOS and Android platforms. Users can input descriptive prompts, such as "a hummingbird flapping its wings in ultra slow motion" or "a tiny astronaut exploring a giant mushroom planet," and receive five-second AI-generated video clips in response. 

How to Use Bing Video Creator

To utilize this feature:

  1. Open the Bing mobile app.

  2. Tap the menu icon in the bottom right corner.

  3. Select "Video Creator."

  4. Enter a text prompt describing the desired video.

Alternatively, users can type a prompt directly into the Bing search bar, beginning with "Create a video of..." 

Global Availability and Future Developments

The Bing Video Creator feature is now available worldwide, excluding China and Russia. While currently limited to five-second vertical videos, Microsoft has announced plans to support horizontal videos and expand the feature to desktop and Copilot Search platforms in the near future. 

Conclusion

By offering Sora's capabilities through the Bing mobile app at no cost, Microsoft and OpenAI are making AI-driven video creation more accessible to a global audience. This initiative not only enhances user engagement with AI technologies but also sets a precedent for future integrations of advanced AI tools into everyday applications.
