
28.8.25

HunyuanVideo-Foley brings studio-grade Foley to AI-generated video

 Text-to-video has gone cinematic, but most clips still sound like a silent movie. HunyuanVideo-Foley aims to fix that: it’s an end-to-end text-video-to-audio (TV2A) system that generates synchronized, high-quality Foley from pixels and prompts—no sound library or manual sound design required. The team marries a multimodal diffusion transformer with representation alignment and a large, purpose-built dataset, and reports state-of-the-art results on fidelity and sync. 

What’s new

  • 100k-hour TV2A dataset. A scalable pipeline filters web video into 8-second segments, drops silent/low-bandwidth clips, scores audio aesthetics/SNR, and checks both semantic (ImageBind) and temporal (AV-align) match before tagging and captioning with GenAU (a toy filtering gate is sketched after this list).

  • Dual-phase multimodal attention. Video and audio are fused with joint self-attention (interleaved RoPE) for frame-level sync; text cues are injected later via cross-attention to avoid text dominating the mix (see the attention sketch below).

  • REPA loss for audio. A Representation Alignment (REPA) objective pulls internal DiT features toward self-supervised audio embeddings (ATST-Frame), stabilizing training and improving timbre and semantics (see the loss sketch below).

  • Continuous-latent DAC-VAE. Replaces RVQ with a VAE (128-dim latents @ 48 kHz, 50 Hz latent rate) for cleaner reconstructions and fewer artifacts (see the latent-shape check below).
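
To make the first bullet concrete, here is a minimal Python sketch of the kind of per-clip gate such a pipeline implies. The helper keep_clip, the score fields, and every threshold value are illustrative assumptions, not numbers from the paper.

```python
from dataclasses import dataclass

@dataclass
class ClipScores:
    """Hypothetical per-segment scores produced by upstream filtering models."""
    duration_s: float        # segment length after splitting
    is_silent: bool          # silence / unusable-audio detector
    bandwidth_hz: float      # effective audio bandwidth
    aesthetic: float         # audio-aesthetics model score
    snr_db: float            # estimated signal-to-noise ratio
    imagebind_sim: float     # semantic video<->audio similarity (ImageBind)
    av_align: float          # temporal sync score (AV-align)

def keep_clip(s: ClipScores,
              min_bandwidth_hz: float = 8000.0,
              min_aesthetic: float = 0.5,
              min_snr_db: float = 10.0,
              min_imagebind: float = 0.2,
              min_av_align: float = 0.5) -> bool:
    """Return True if an 8-second segment passes every filtering stage.
    All thresholds here are placeholders."""
    return (
        abs(s.duration_s - 8.0) < 0.5           # fixed-length 8 s segments
        and not s.is_silent                     # drop silent clips
        and s.bandwidth_hz >= min_bandwidth_hz  # drop low-bandwidth audio
        and s.aesthetic >= min_aesthetic        # audio-aesthetics gate
        and s.snr_db >= min_snr_db              # noise gate
        and s.imagebind_sim >= min_imagebind    # semantic video<->audio match
        and s.av_align >= min_av_align          # temporal alignment gate
    )
```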
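
The dual-phase attention bullet can be read as one fusion block: joint self-attention over concatenated audio and video tokens first, then cross-attention from audio to text. Below is a hedged PyTorch sketch of that reading; DualPhaseBlock and its internals are a simplification (RoPE interleaving, modulation, and MLPs are omitted), not the released implementation.

```python
import torch
import torch.nn as nn

class DualPhaseBlock(nn.Module):
    """Phase 1: joint audio-video self-attention for frame-level sync.
    Phase 2: text injected via cross-attention, so captions guide semantics
    without overwhelming the visual sync signal."""
    def __init__(self, dim: int = 1536, heads: int = 12):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio, video, text):
        # Phase 1: self-attention over the concatenated audio/video sequence
        # (the paper interleaves positions with RoPE; omitted here).
        av = torch.cat([audio, video], dim=1)
        h = self.norm1(av)
        av = av + self.joint_attn(h, h, h)[0]
        audio, video = av[:, :audio.size(1)], av[:, audio.size(1):]
        # Phase 2: only the audio stream attends to the text tokens.
        audio = audio + self.text_xattn(self.norm2(audio), text, text)[0]
        return audio, video
```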
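
The REPA bullet amounts to one extra term in the training objective: project an intermediate DiT activation and push it toward frozen self-supervised audio features (ATST-Frame in the paper). A minimal sketch, assuming (batch, time, dim) tensors that are already time-aligned; the function name, the cosine form, and how the term is weighted against the diffusion loss are assumptions consistent with the REPA literature rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def repa_loss(dit_features: torch.Tensor,   # (B, T, D_model) intermediate DiT activations
              ssl_targets: torch.Tensor,    # (B, T, D_ssl) frozen ATST-Frame features
              projector: torch.nn.Module    # small MLP mapping D_model -> D_ssl
              ) -> torch.Tensor:
    """Negative mean cosine similarity between projected DiT features and
    self-supervised targets; add this (scaled by a weight) to the diffusion loss."""
    pred = projector(dit_features)
    cos = F.cosine_similarity(pred, ssl_targets, dim=-1)  # (B, T)
    return -cos.mean()
```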
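
The DAC-VAE numbers also pin down the latent geometry: 48 kHz waveforms compressed to 50 latent frames per second, 128 dimensions each, i.e. 960x temporal downsampling. A quick sanity check (constants from the bullet; the helper itself is illustrative):

```python
SAMPLE_RATE_HZ = 48_000   # waveform rate quoted in the bullet
LATENT_RATE_HZ = 50       # latent frames per second
LATENT_DIM = 128          # continuous latent channels (no RVQ codebooks)

def latent_shape(seconds: float) -> tuple[int, int]:
    """(frames, channels) the continuous-latent VAE would produce for a clip."""
    return int(seconds * LATENT_RATE_HZ), LATENT_DIM

print(latent_shape(8.0))                    # (400, 128) for an 8-second training segment
print(SAMPLE_RATE_HZ // LATENT_RATE_HZ)     # 960 waveform samples per latent frame
```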

How it’s built

HunyuanVideo-Foley stacks N₁ multimodal (audio-video) DiT blocks followed by N₂ audio-only blocks, modulated by Synchformer-derived sync features. The reported configuration uses 18 MMDiT and 36 audio-only DiT layers (hidden size 1536, 12 heads), trained for 200k steps on the 100k-hour corpus; autoencoder pretraining ran for 700k steps. The main run used 128 H20 GPUs with an effective batch size of 2048.
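
Putting those numbers together, the stack can be sketched as below. This reuses the DualPhaseBlock from the attention sketch above; the linear input/output projections, the additive use of the sync features (the paper modulates blocks with Synchformer-derived features), and the plain TransformerEncoderLayer standing in for an audio-only DiT block are all simplifying assumptions.

```python
import torch.nn as nn

class HunyuanFoleySketch(nn.Module):
    """Skeleton with the reported sizes: 18 multimodal blocks, then 36
    audio-only blocks, hidden size 1536, 12 heads. Block internals are
    placeholders, not the released architecture."""
    def __init__(self, n_mm: int = 18, n_audio: int = 36,
                 dim: int = 1536, heads: int = 12, latent_dim: int = 128):
        super().__init__()
        self.audio_in = nn.Linear(latent_dim, dim)   # 128-dim DAC-VAE latents -> model width
        self.mm_blocks = nn.ModuleList(
            [DualPhaseBlock(dim, heads) for _ in range(n_mm)])
        self.audio_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(n_audio)])
        self.out = nn.Linear(dim, latent_dim)        # back to latent space for the diffusion target

    def forward(self, audio_latents, video_tokens, text_tokens, sync_feats):
        x = self.audio_in(audio_latents)             # (B, 400, dim) for an 8 s clip
        v = video_tokens
        for blk in self.mm_blocks:                   # joint audio-video fusion + text cross-attention
            x, v = blk(x, v, text_tokens)
        x = x + sync_feats                           # additive stand-in for sync-feature modulation
        for blk in self.audio_blocks:                # audio-only refinement
            x = blk(x)
        return self.out(x)                           # predicted noise/velocity over audio latents
```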

The receipts

Across three testbeds (Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench), the paper reports new SOTA results on multiple axes, including audio quality (PQ), visual-semantic alignment (IB), and temporal sync (DeSync), plus higher human MOS scores on MovieGen-Audio-Bench. On Kling-Audio-Eval, for example, the model improves FD (PANNs) and KL divergence over prior systems and lifts IB while keeping DeSync low.

Example objective results (Kling-Audio-Eval)

| Metric | Best prior (sample) | HunyuanVideo-Foley |
| --- | --- | --- |
| FD (PANNs) ↓ | 9.01 (MMAudio) | 6.07 |
| PQ ↑ | 6.05 (FoleyCrafter) | 6.12 |
| IB ↑ | 0.30 (MMAudio) | 0.38 |
| DeSync ↓ | 0.56 (MMAudio) | 0.54 |

Why it matters

  • Sound that matches the shot. By separating frame-sync (video↔audio) from semantic guidance (text↔audio), the model avoids the classic failure where captions drown out visual cues. 

  • Production-friendly fidelity. REPA and the continuous-latent DAC-VAE cut hiss, mushy transients, and texture mismatch—key for believable footsteps, doors, and crowd beds. 

  • Built to scale. A reproducible data pipeline and a demo page suggest this is more than a lab toy; it’s an audio stack teams can evaluate today. 

If generative video is to replace B-roll and animatics, it needs audio that lands. HunyuanVideo-Foley offers a blueprint: curate better multimodal data, align internal representations to robust audio features, and architect attention so text helps—without hijacking—the soundscape.

Paper link: arXiv 2508.16930 (PDF)
