Text-to-video has gone cinematic, but most clips still sound like a silent movie. HunyuanVideo-Foley aims to fix that: it’s an end-to-end text-video-to-audio (TV2A) system that generates synchronized, high-quality Foley from pixels and prompts—no sound library or manual sound design required. The team marries a multimodal diffusion transformer with representation alignment and a large, purpose-built dataset, and reports state-of-the-art results on fidelity and sync.
What’s new
- 100k-hour TV2A dataset. A scalable pipeline filters web video into 8-second segments, drops silent/low-bandwidth clips, scores audio aesthetics/SNR, and checks both semantic (ImageBind) and temporal (AV-align) match before tagging and captioning with GenAU (a filtering sketch follows this list).
- Dual-phase multimodal attention. Video and audio are fused with joint self-attention (interleaved RoPE) for frame-level sync; text cues are injected later via cross-attention to avoid text dominating the mix (see the attention sketch below).
- REPA loss for audio. A Representation Alignment (REPA) objective pulls internal DiT features toward self-supervised audio embeddings (ATST-Frame), stabilizing training and improving timbre/semantics (a loss sketch follows the list).
- Continuous-latent DAC-VAE. Replaces RVQ with a VAE (128-dim latents at 48 kHz, 50 Hz latent rate) for cleaner reconstructions and fewer artifacts.
How it’s built
HunyuanVideo-Foley stacks N₁ multimodal (audio-video) DiT blocks followed by N₂ audio-only blocks, modulated by Synchformer-derived sync features. The model used 18 MMDiT + 36 audio DiT layers (1536 hidden, 12 heads) and was trained for 200k steps on the 100k-hour corpus; autoencoder pretraining ran 700k steps. The main run used 128 H20 GPUs with an effective batch size of 2048.
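
As a rough picture of that stack, the sketch below wires the published sizes together; the block internals are generic Transformer layers standing in for the paper's MMDiT/DiT modules, and the Synchformer-based modulation is omitted.

```python
import torch
import torch.nn as nn

class HunyuanFoleyStack(nn.Module):
    """Schematic only: 18 joint audio-video blocks followed by 36 audio-only
    blocks at hidden size 1536 with 12 heads. Real blocks are MMDiT/DiT layers
    with RoPE and sync-feature modulation, none of which appear here."""
    def __init__(self, hidden: int = 1536, heads: int = 12,
                 n_mm: int = 18, n_audio: int = 36):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
                batch_first=True, norm_first=True)
        self.mm_blocks = nn.ModuleList(layer() for _ in range(n_mm))        # audio + video tokens
        self.audio_blocks = nn.ModuleList(layer() for _ in range(n_audio))  # audio tokens only

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        av = torch.cat([audio, video], dim=1)   # joint audio-video phase
        for blk in self.mm_blocks:
            av = blk(av)
        audio = av[:, : audio.shape[1]]         # drop video tokens
        for blk in self.audio_blocks:           # audio-only refinement phase
            audio = blk(audio)
        return audio
```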
The receipts
Across three testbeds—Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench—the paper reports new SOTA on multiple axes, including audio quality (PQ), visual-semantic alignment (IB) and temporal sync (DeSync), plus higher human MOS scores on MovieGen-Audio-Bench. A sample from Kling-Audio-Eval: the model improves FD (PANNs) and KL vs. prior systems and lifts IB while keeping DeSync low.
Example objective results (Kling-Audio-Eval)
| Metric | Best prior (sample) | HunyuanVideo-Foley |
|---|---|---|
| FD (PANNs) ↓ | 9.01 (MMAudio) | 6.07 |
| PQ ↑ | 6.05 (FoleyCrafter) | 6.12 |
| IB ↑ | 0.30 (MMAudio) | 0.38 |
| DeSync ↓ | 0.56 (MMAudio) | 0.54 |
Why it matters
- Sound that matches the shot. By separating frame-sync (video↔audio) from semantic guidance (text↔audio), the model avoids the classic failure where captions drown out visual cues.
- Production-friendly fidelity. REPA and the continuous-latent DAC-VAE cut hiss, mushy transients, and texture mismatch, which matters for believable footsteps, doors, and crowd beds.
- Built to scale. A reproducible data pipeline and a demo page suggest this is more than a lab toy; it's an audio stack teams can evaluate today.
If generative video is to replace B-roll and animatics, it needs audio that lands. HunyuanVideo-Foley offers a blueprint: curate better multimodal data, align internal representations to robust audio features, and architect attention so text helps—without hijacking—the soundscape.
Paper link: arXiv 2508.16930 (PDF)