
8.7.25

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.
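
For a rough sense of scale (illustrative arithmetic, not the paper's exact tokenizer): if every coordinate of every face becomes its own token, a 30 k-face mesh already needs hundreds of thousands of autoregressive steps.

```python
# Back-of-the-envelope: why per-coordinate tokenization explodes.
# Hypothetical scheme: each face = 3 vertices, each vertex = 3 quantized
# coordinates, each coordinate = 1 token (real tokenizers vary).
def naive_token_count(num_faces: int, coords_per_vertex: int = 3,
                      vertices_per_face: int = 3) -> int:
    return num_faces * vertices_per_face * coords_per_vertex

for faces in (1_000, 10_000, 30_000):
    print(f"{faces:>6} faces -> {naive_token_count(faces):>7} tokens")
# 30 k faces already means ~270 k autoregressive steps before any
# compression, which is what a hierarchical tokenization attacks.
```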


Two upgrades, one big leap

  • 72 % shorter sequences. What they did: a hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. Why it matters: it cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.

  • Human-aligned RL. What they did: collected 5 000 preference pairs scored with a hybrid of human ratings and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. Why it matters: it removes holes and stray faces while nudging topology toward "artist-grade" layouts.
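
The RL pillar is standard DPO applied to mesh token sequences. A minimal sketch of the pairwise loss, written generically rather than from the authors' code (`beta` and the log-probability inputs are assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """Direct Preference Optimization on (preferred, rejected) mesh pairs.

    Each argument is the summed token log-probability of one mesh sequence
    under the policy or the frozen reference model (shape: [batch]).
    """
    # Log-ratio of policy vs. reference for the preferred and rejected mesh.
    win_ratio = policy_logp_win - ref_logp_win
    lose_ratio = policy_logp_lose - ref_logp_lose
    # Standard DPO objective: widen the margin between the two ratios.
    return -F.logsigmoid(beta * (win_ratio - lose_ratio)).mean()

# Toy usage with random log-probs for a batch of 4 preference pairs.
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()).item())
```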

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric (a simple non-manifold-edge check is sketched after this list).

  • Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT 68 % of the time on identical point-cloud prompts (exact numbers in the paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.
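
The post doesn’t define the “topology-valid” metric, but a common proxy for broken topology is the non-manifold edge: an edge shared by more than two triangles. A minimal, illustrative check:

```python
from collections import Counter

def non_manifold_edges(faces):
    """Return edges shared by more than two triangles.

    `faces` is a list of (i, j, k) vertex-index triples; a watertight,
    manifold mesh should use every edge in exactly two faces.
    """
    edge_use = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_use[tuple(sorted((u, v)))] += 1
    return [e for e, n in edge_use.items() if n > 2]

# Two triangles sharing one edge -> no non-manifold edges.
print(non_manifold_edges([(0, 1, 2), (0, 1, 3)]))              # []
# A third triangle on the same edge makes (0, 1) non-manifold.
print(non_manifold_edges([(0, 1, 2), (0, 1, 3), (0, 1, 4)]))   # [(0, 1)]
```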

A public demo gallery already shows clean, watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without retraining the base model from scratch.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

6.7.25

FreeMorph turns Stable Diffusion into a one-click image-morphing engine

 Image morphing has been around since Michael Jackson’s Black or White video, but most modern AI pipelines still demand per-pair fine-tuning or laborious warping to keep shapes and textures coherent. A new paper from NTU, Nanjing University and CUHK drops that baggage. FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model repurposes an off-the-shelf Stable Diffusion 2.1 checkpoint to generate frame-perfect transitions between any two images—faces, cars, even cat-to-dog mash-ups—without touching a single weight. 

Two tricks make the magic happen

  1. Guidance-aware spherical interpolation (GaSI). Instead of naive latent mixing, FreeMorph blends the key-value pairs inside Stable Diffusion’s self-attention, injecting “identity anchors” from both source images so the morph stays on course. 

  2. Step-oriented variation trend (SoVT). A second module dials in how much of each image shows up at every denoising step, taming the non-linear chaos that usually derails tuning-free edits (a rough sketch of how the two pieces might compose follows this list).
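
Neither module’s exact formulation is given here, but the ingredients map onto familiar pieces: spherical interpolation (slerp) of attention features from the two source images, plus a denoising-step-dependent blend weight. A rough sketch under those assumptions (`sovt_weight` and its smoothstep schedule are illustrative, not the paper’s API):

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7):
    """Spherical interpolation between two flattened feature tensors."""
    a_n = a / (a.norm() + eps)
    b_n = b / (b.norm() + eps)
    omega = torch.acos((a_n * b_n).sum().clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def sovt_weight(step: int, total_steps: int) -> float:
    """Illustrative step-dependent blend: ease image A out, image B in."""
    s = step / max(total_steps - 1, 1)
    return 3 * s**2 - 2 * s**3  # smoothstep from 0 to 1 across denoising

# Toy usage: blend the self-attention keys of two images at step 20 of 50.
k_img_a, k_img_b = torch.randn(64), torch.randn(64)
t = sovt_weight(step=20, total_steps=50)
k_blended = slerp(k_img_a, k_img_b, t)
print(round(t, 3), k_blended.shape)
```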

Faster and smoother than the competition

Running on a single NVIDIA A100, FreeMorph spits out a full transition sequence in under 30 seconds, beating DiffMorpher and IMPUS—which both require minutes of LoRA fine-tuning—while delivering sharper edges and fewer identity slips.

A new benchmark to prove it

Because existing datasets skew toward near-identical pairs, the authors collected Morph4Data, four classes of image pairs ranging from “same layout, different semantics” to “totally unrelated.” On this tougher mix, FreeMorph tops every published method in quantitative metrics and user studies alike.

Why this matters

For creative-tool startups, FreeMorph means morphing features can ship as a call to Stable Diffusion rather than a 30-minute fine-tune. For researchers, GaSI + SoVT point to a broader lesson: you can co-opt diffusion attention layers for structural edits without sacrificing model generality.

The code, demo video and ready-to-run Colab notebook are already live on GitHub, so expect FreeMorph-powered GIF makers to surface on your timeline before summer’s out.

Paper link: arXiv 2507.01953 (PDF)
