8.7.25

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.
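To see why sequence length is the bottleneck, here is a back-of-the-envelope calculation in Python. The counting scheme (one token per coordinate, three vertices per triangle) and the 72 % reduction figure come from the article; everything else, including the function names, is illustrative and not the paper's exact tokenizer.

```python
def naive_token_count(num_faces: int,
                      verts_per_face: int = 3,
                      coords_per_vertex: int = 3) -> int:
    """Naive scheme: every vertex coordinate of every face is its own token."""
    return num_faces * verts_per_face * coords_per_vertex

def compressed_token_count(num_faces: int, reduction: float = 0.72) -> int:
    """Apply the ~72% sequence reduction DeepMesh reports (approximate)."""
    return round(naive_token_count(num_faces) * (1.0 - reduction))

high_poly = 30_000  # a 30 k-face mesh, the scale DeepMesh targets
print(naive_token_count(high_poly))       # 270000 tokens, naively
print(compressed_token_count(high_poly))  # 75600 tokens after compression
```

At 270 k tokens a single mesh blows past typical transformer context windows; at ~76 k it becomes trainable on one accelerator, which is the whole point of the compression trick.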


Two upgrades, one big leap

  • 72 % shorter sequences — What they did: a hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. Why it matters: it cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.

  • Human-aligned RL — What they did: collected 5 000 preference pairs scored with a hybrid of human ratings and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. Why it matters: it removes holes and stray faces while nudging topology toward “artist-grade” layouts.
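The DPO step above follows the standard preference-optimization recipe: for each pair, push the policy to assign relatively more likelihood to the preferred mesh than a frozen reference model does. A minimal per-pair loss sketch, with hypothetical log-likelihood inputs and an assumed beta value (the paper's hyperparameters aren't given in this post):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    logp_w / logp_l       : policy log-likelihoods of the preferred / rejected mesh
    ref_logp_w / ref_logp_l: frozen reference-model log-likelihoods
    beta                   : temperature controlling deviation from the reference
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy favors the chosen mesh more strongly than the reference does,
# the margin is positive and the loss shrinks toward zero.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
print(round(loss, 4))
```

In practice the log-likelihoods would be summed token log-probs over each mesh's token sequence, and the loss would be averaged over the 5 000 preference pairs.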

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.

  • Visual quality: crowd-sourced raters preferred DeepMesh over MeshGPT 68 % of the time on identical point-cloud prompts (exact numbers in the paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.
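The “non-manifold edges” being counted are edges shared by more than two faces, which is what makes a mesh un-editable for artists. A generic check is easy to sketch; note this is a standard manifoldness test, not the authors' exact “topology-valid” metric, which the post doesn't fully specify:

```python
from collections import Counter

def non_manifold_edges(faces):
    """Return edges shared by more than two triangles.

    faces: list of (i, j, k) vertex-index triples.
    An edge on a clean (manifold) mesh belongs to at most two faces.
    """
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[tuple(sorted((u, v)))] += 1
    return [e for e, n in edge_count.items() if n > 2]

# Two triangles sharing edge (0, 1) is fine; a third face on the
# same edge creates a non-manifold "fin".
good = [(0, 1, 2), (0, 1, 3)]
bad = good + [(0, 1, 4)]
print(non_manifold_edges(good))  # []
print(non_manifold_edges(bad))   # [(0, 1)]
```

A post-hoc filter like this catches the defect but can't fix it; DeepMesh's claim is that the RL-tuned generator avoids producing such edges in the first place.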

A public demo gallery already shows clean, watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)
