Showing posts with label Tencent Hunyuan.

28.8.25

HunyuanVideo-Foley brings studio-grade Foley to AI-generated video

 Text-to-video has gone cinematic, but most clips still sound like a silent movie. HunyuanVideo-Foley aims to fix that: it’s an end-to-end text-video-to-audio (TV2A) system that generates synchronized, high-quality Foley from pixels and prompts—no sound library or manual sound design required. The team marries a multimodal diffusion transformer with representation alignment and a large, purpose-built dataset, and reports state-of-the-art results on fidelity and sync. 

What’s new

  • 100k-hour TV2A dataset. A scalable pipeline filters web video into 8-second segments, drops silent/low-bandwidth clips, scores audio aesthetics/SNR, and checks both semantic (ImageBind) and temporal (AV-align) match before tagging and captioning with GenAU. 

  • Dual-phase multimodal attention. Video and audio are fused with joint self-attention (interleaved RoPE) for frame-level sync; text cues are injected later via cross-attention to avoid text dominating the mix. 

  • REPA loss for audio. A Representation Alignment (REPA) objective pulls internal DiT features toward self-supervised audio embeddings (ATST-Frame), stabilizing training and improving timbre/semantics; a minimal sketch of the idea follows this list. 

  • Continuous-latent DAC-VAE. Replaces RVQ with a VAE (128-dim latents @48 kHz, 50 Hz latent rate) for cleaner reconstructions and fewer artifacts. 
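
To make the REPA idea concrete, here is a minimal sketch of what such an alignment term could look like, assuming a small projection head on intermediate DiT features and precomputed ATST-Frame embeddings; the names, shapes, and weighting are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(dit_hidden: torch.Tensor,
                        ssl_embed: torch.Tensor,
                        proj: torch.nn.Module) -> torch.Tensor:
    """Illustrative REPA-style loss: pull projected DiT features toward frozen
    self-supervised audio embeddings (e.g. ATST-Frame) via cosine similarity.

    dit_hidden: (B, T, D_dit) intermediate DiT features      (assumed shape)
    ssl_embed:  (B, T, D_ssl) frozen SSL features, same clip  (assumed shape)
    proj:       small module mapping D_dit -> D_ssl           (our assumption)
    """
    pred = F.normalize(proj(dit_hidden), dim=-1)
    target = F.normalize(ssl_embed.detach(), dim=-1)   # no gradient into the SSL encoder
    return (1.0 - (pred * target).sum(dim=-1)).mean()  # 1 - cosine similarity

# Used alongside the diffusion objective, e.g.
#   total_loss = diffusion_loss + lambda_repa * repa_alignment_loss(h, a, proj)
# where lambda_repa is a weighting hyperparameter (value not taken from the paper).
```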

How it’s built

HunyuanVideo-Foley stacks N₁ multimodal (audio-video) DiT blocks followed by N₂ audio-only blocks, modulated by Synchformer-derived sync features. The released model uses 18 MMDiT + 36 audio-only DiT layers (hidden size 1536, 12 heads) and was trained for 200k steps on the 100k-hour corpus; autoencoder pretraining ran 700k steps. The main run used 128 H20 GPUs with an effective batch size of 2048. 
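
Read as a configuration, the reported stack looks roughly like the sketch below; the class and field names are ours, and only the numbers quoted above come from the paper.

```python
from dataclasses import dataclass

@dataclass
class FoleyStackConfig:
    """Reported hyperparameters; field names are our own shorthand."""
    n_mm_blocks: int = 18       # N1: joint audio-video MMDiT blocks (fused self-attention)
    n_audio_blocks: int = 36    # N2: audio-only DiT blocks stacked after the fusion stage
    hidden_dim: int = 1536
    num_heads: int = 12
    latent_channels: int = 128  # DAC-VAE continuous latent dimension
    latent_rate_hz: int = 50    # latent frames per second, decoded to 48 kHz audio

cfg = FoleyStackConfig()
# Block order: cfg.n_mm_blocks multimodal blocks first, then cfg.n_audio_blocks
# audio-only blocks, each modulated by Synchformer-derived sync features.
```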

The receipts

Across three testbeds—Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench—the paper reports new SOTA on multiple axes, including audio quality (PQ), visual-semantic alignment (IB) and temporal sync (DeSync), plus higher human MOS scores on MovieGen-Audio-Bench. A sample from Kling-Audio-Eval: the model improves FD (PANNs) and KL vs. prior systems and lifts IB while keeping DeSync low. 

Example objective results (Kling-Audio-Eval)

Metric        | Best prior (sample)  | HunyuanVideo-Foley
FD (PANNs) ↓  | 9.01 (MMAudio)       | 6.07
PQ ↑          | 6.05 (FoleyCrafter)  | 6.12
IB ↑          | 0.30 (MMAudio)       | 0.38
DeSync ↓      | 0.56 (MMAudio)       | 0.54

Why it matters

  • Sound that matches the shot. By separating frame-sync (video↔audio) from semantic guidance (text↔audio), the model avoids the classic failure where captions drown out visual cues. 

  • Production-friendly fidelity. REPA and the continuous-latent DAC-VAE cut hiss, mushy transients, and texture mismatch—key for believable footsteps, doors, and crowd beds. 

  • Built to scale. A reproducible data pipeline and a demo page suggest this is more than a lab toy; it’s an audio stack teams can evaluate today. 

If generative video is to replace B-roll and animatics, it needs audio that lands. HunyuanVideo-Foley offers a blueprint: curate better multimodal data, align internal representations to robust audio features, and architect attention so text helps—without hijacking—the soundscape.

Paper link: arXiv 2508.16930 (PDF)

19.8.25

AutoCodeBench turns LLMs into benchmark factories — and today’s coders sweat at ~52%

 Code benchmarks have a scaling problem: hand-written tasks don’t keep up with fast-improving models, and multilingual coverage is thin. Tencent Hunyuan’s new paper proposes a fix: AutoCodeGen, an automated workflow that inverts dataset creation—generate solutions and tests first, then ask the LLM to write the problem—validated by a multilingual execution sandbox. The result is AutoCodeBench (ACB), a 3,920-problem suite evenly spread over 20 languages, with ~9.6 tests per problem and a deliberate bias toward hard tasks. Even frontier “think-mode” models top out around 52% Pass@1, signaling real headroom. 
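
For reference, Pass@1 is the standard execution-based pass@k metric at k = 1: a problem counts as solved only if a sampled program passes all of its tests. Assuming ACB follows the usual unbiased estimator, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n sampled solutions per problem, c of them
    passing all tests. For k == 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark Pass@1 = mean over problems of pass_at_k(n_i, c_i, 1).
```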

How they build hard, correct problems

AutoCodeGen runs in four steps: (1) LLMs evolve self-contained code solutions from real multilingual snippets; (2) LLMs propose public and private test inputs, which the sandbox executes to compute ground-truth outputs; (3) the LLM then writes the problem description constrained by strict specs (language, entry points, naming); (4) a three-stage filter (multi-sampling for difficulty, LLM-as-critic for quality, diversity tagging) trims the set. This “reverse-order” recipe yields correct, executable tests without humans in the loop. 
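A schematic of that reverse-order recipe is sketched below; the llm, sandbox, and filter objects are hypothetical stand-ins for the components the paper describes, not a released API.

```python
def autocodegen_pipeline(snippet, llm, sandbox, passes_filters):
    """Sketch of AutoCodeGen's reverse-order workflow (helper names are ours)."""
    # 1. Evolve a self-contained solution from a real multilingual code snippet.
    solution = llm.evolve_solution(snippet)

    # 2. Propose public + private test inputs, then execute the solution in the
    #    sandbox to record ground-truth outputs, so tests are correct by construction.
    inputs = llm.propose_test_inputs(solution)
    tests = [(x, sandbox.run(solution, x)) for x in inputs]

    # 3. Only now write the problem statement, constrained by strict specs
    #    (language, entry point, naming) so it matches the solution and tests.
    problem = llm.write_problem(solution, tests)

    # 4. Three-stage filter: multi-sampling for difficulty, LLM-as-critic for
    #    quality, diversity tagging before the item enters the benchmark.
    return (problem, solution, tests) if passes_filters(problem, solution, tests) else None
```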

What’s inside ACB

  • Scale & spread: 3,920 problems, 37,777 tests, 20 languages (Python→TypeScript), 14 task categories from data structures to systems programming. >60% are “hard.” 

  • Sandbox: open-sourced, 20+ languages, high-concurrency, request-based calls—usable for eval and data synthesis. 

  • Lite & Complete: ACB-Lite (≈1,586 problems) for faster evals; ACB-Complete (1,000 completion-style tasks, 3-shot) targets base models rather than chat-tuned ones. 

The scoreboard: even elites struggle

Averaged across 20 languages, the leaderboard’s top tier lands ~50–52% Pass@1, led by Claude Opus 4 (Think) at 52.4%, with o3-high, Grok-4, Claude Sonnet 4 (Think), and DeepSeek-R1-0528 clustered close behind. Mid-tier open models sit in the 30s–40s; smaller coders drop to the 20s. Translation: the multilingual + multi-logical mix is punishing. 

Iterating with sandbox feedback helps

Across three refinement turns using execution error messages, models like DeepSeek-V3-0324 and Qwen2.5-Coder-32B-Instruct gain ~8–12 points, with the biggest jump on turn one—evidence that automated error signals materially improve code generation. 
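The refinement setup can be pictured as a simple loop over execution feedback; a sketch, with hypothetical generate/execute interfaces standing in for the model and sandbox:

```python
def refine_with_sandbox(problem, model, sandbox, max_turns: int = 3):
    """Sketch of multi-turn repair: feed execution errors back to the model.
    `model.generate`, `sandbox.execute`, and `problem.tests` are assumed interfaces."""
    code = model.generate(problem)
    for turn in range(max_turns):
        result = sandbox.execute(code, problem.tests)
        if result.passed:
            return code, turn
        # Most of the reported gain comes from the first repair turn.
        code = model.generate(problem, feedback=result.error_message)
    return code, max_turns
```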

Base-model check: ACB-Complete

On the 1,000-item, 3-shot ACB-Complete, Seed-Coder-8B-Base leads its class (≤8B) at 31.6% Pass@1, edging OpenCoder-8B-Base and Qwen2.5-Coder-7B—useful signal for pre-instruct comparisons that classic HumanEval/MBPP miss. 

Why it matters

  • Human-free, multilingual, hard. ACB shows you can scale quality and coverage without armies of annotators. 

  • Better evals for code agents. Emphasis on multi-logical tasks (several core functions per problem) aligns with agent workflows like SWE-Bench. 

  • Sandbox as a lever. Open, concurrent execution infra doubles as a training-data factory and an iterative-repair oracle. 

Benchmarks drive progress. If your coding model cruises through Python puzzles but face-plants in Kotlin, Shell, or Elixir, AutoCodeBench will make that obvious—and give you a reproducible path to fix it.

Paper link: arXiv 2508.09101 (PDF)

16.8.25

Hunyuan-GameCraft brings “playable” video gen to AAA-style worlds

 Text-to-video systems can paint beautiful clips, but making them playable—reacting smoothly to user inputs over long sequences—has been a brick wall. Tencent Hunyuan’s Hunyuan-GameCraft attacks the problem head-on with a recipe built for game dynamics: unify keyboard/mouse signals into camera-space controls, train with a history-aware objective, and distill the model for real-time latency. The result: long, action-controllable sequences that keep scenes coherent and respond like a game engine—minus the engine. 

The trick: turn WASD into camera math

Instead of treating keystrokes as ad-hoc tokens, GameCraft maps keyboard and mouse inputs to a shared, continuous camera representation (translation/rotation directions plus speeds). A lightweight action encoder injects these signals into an MM-DiT video backbone (HunyuanVideo), enabling fine-grained motion like smooth pans or faster strafes without hacking the generator. 
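
A minimal sketch of that idea, folding discrete key presses and mouse deltas into one continuous camera-space action (translation direction plus speed, rotation plus speed); the exact parameterization the paper feeds its action encoder may differ.

```python
import numpy as np

# Unit translation directions in camera space (x: right, y: up, z: forward).
KEY_TO_TRANSLATION = {
    "W": np.array([0.0, 0.0, 1.0]),   # forward
    "S": np.array([0.0, 0.0, -1.0]),  # backward
    "A": np.array([-1.0, 0.0, 0.0]),  # strafe left
    "D": np.array([1.0, 0.0, 0.0]),   # strafe right
}

def encode_action(pressed_keys, mouse_dx, mouse_dy, move_speed=1.0, turn_speed=0.1):
    """Map keyboard + mouse input to a continuous camera-space action vector:
    [translation direction (3), translation speed (1), yaw rate, pitch rate].
    The layout is illustrative, not the paper's encoder input."""
    trans = sum((KEY_TO_TRANSLATION[k] for k in pressed_keys if k in KEY_TO_TRANSLATION),
                start=np.zeros(3))
    norm = np.linalg.norm(trans)
    trans_dir = trans / norm if norm > 0 else trans
    rot = np.array([mouse_dx * turn_speed, mouse_dy * turn_speed])  # yaw, pitch rates
    return np.concatenate([trans_dir, [move_speed * float(norm > 0)], rot])
```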

Stay coherent over minutes, not seconds

To fight the usual “long-video drift,” the team proposes hybrid history-conditioned training: during autoregressive extension, new chunks are denoised while explicitly conditioning on denoised history with a mask indicator. Compared with training-free or streaming add-ons, this keeps geometry and layout stable across extended play. 
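
Conceptually, each training example mixes clean (already denoised) history latents with noised new-chunk latents, plus a mask telling the model which frames are conditioning context; a sketch under those assumptions, with an illustrative noising schedule:

```python
import torch

def make_history_conditioned_input(history_latents, new_latents, noise, t):
    """Sketch of hybrid history conditioning. Shapes: (B, T, C, H, W);
    t is a noise level in [0, 1] (scalar here for simplicity)."""
    noisy_new = (1 - t) * new_latents + t * noise      # illustrative interpolation
    x = torch.cat([history_latents, noisy_new], dim=1)
    mask = torch.cat([
        torch.ones_like(history_latents[:, :, :1]),    # 1 = clean history context
        torch.zeros_like(new_latents[:, :, :1]),       # 0 = frames to denoise
    ], dim=1)
    return x, mask  # the denoising loss is applied only to the new-chunk frames
```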

Fast enough to feel interactive

A distillation pass (Phased Consistency Model) accelerates inference by 10–20×, cutting latency to <5 s per action in their setup—crucial for anything that calls itself “interactive.” 

Trained on real gameplay, then sharpened in 3-D

The dataset is built from 1M+ gameplay clips across 100+ AAA titles (e.g., Assassin’s Creed, Red Dead Redemption, Cyberpunk 2077), segmented with PySceneDetect and annotated with 6-DoF camera trajectories (Monst3R). A synthetic set of rendered motion sequences adds precise camera priors and balances trajectory distributions. 

Why this matters

  • Input-to-motion fidelity. Unifying controls in camera space yields smoother, more physical responses to typical WASD/arrow inputs. 

  • Long-horizon stability. History conditioning curbs error accumulation that wrecks long, user-driven videos. 

  • Path to production. Distillation pushes latency toward “feels responsive,” a precondition for creator tools and AI-assisted level previews. 

Availability and what’s next

A project page is live, and the team has released inference code and weights under the Hunyuan-GameCraft-1.0 repository. The arXiv record also notes acceptance to RSS 2025, signaling interest from the robotics community. 

Paper link: arXiv 2506.17201 (PDF)
