
8.7.25

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.


Two upgrades, one big leap

  • Pillar: 72 % shorter sequences. What they did: a hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. Why it matters: it cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100 (a simplified sketch of the compression idea follows this list).

  • Pillar: Human-aligned RL. What they did: the team collected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. Why it matters: it removes holes and stray faces while nudging topology toward "artist-grade" layouts.
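The paper's tokenizer is more involved than can be shown here, but a toy sketch makes the compression intuition concrete: if faces that share a pivot vertex are emitted as one patch and repeated coordinates are deduplicated, the token stream shrinks markedly. The grouping rule and token layout below are simplified assumptions for illustration, not DeepMesh's actual scheme.

```python
"""Illustrative sketch of why patch-style mesh tokenization shortens sequences.
This is NOT DeepMesh's tokenizer; the grouping and token layout are simplified
assumptions meant only to show how sharing a pivot vertex and deduplicating
repeated coordinates compresses the sequence."""
from collections import defaultdict

# A quantized toy mesh: 4 vertices, 4 triangular faces (a tetrahedron).
vertices = [(0, 0, 0), (128, 0, 0), (0, 128, 0), (0, 0, 128)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

def naive_tokens(vertices, faces):
    """One token per coordinate, per face corner: 9 tokens per triangle."""
    tokens = []
    for face in faces:
        for v in face:
            tokens.extend(vertices[v])
    return tokens

def patch_tokens(vertices, faces):
    """Group faces by a shared pivot vertex and emit each vertex's
    coordinates only once per patch, plus a small separator token."""
    SEP = -1  # marks the start of a new patch
    patches = defaultdict(list)
    for face in faces:
        patches[face[0]].append(face)  # pivot = lowest-index corner
    tokens = []
    for pivot, patch_faces in patches.items():
        tokens.append(SEP)
        emitted = set()
        for face in patch_faces:
            for v in face:
                if v not in emitted:
                    emitted.add(v)
                    tokens.extend(vertices[v])
    return tokens

n_naive = len(naive_tokens(vertices, faces))
n_patch = len(patch_tokens(vertices, faces))
print(f"naive: {n_naive} tokens, patch-style: {n_patch} tokens "
      f"({100 * (1 - n_patch / n_naive):.0f}% shorter)")
```

Even on a four-face tetrahedron the patch-style stream is roughly a third shorter; on dense meshes, where most vertices are shared by many faces, the savings grow well beyond this toy figure.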

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric (a minimal check of this kind is sketched after this list).

  • Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.
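As promised above, here is a minimal non-manifold-edge detector: an edge shared by more than two faces is flagged. The paper's "topology-valid" metric is richer than this, so treat the function below as an illustrative baseline rather than the authors' evaluation code.

```python
"""Minimal non-manifold-edge check, in the spirit of a "topology-valid" test.
The paper's exact metric may differ; this only flags edges that belong to
more than two faces."""
from collections import Counter

def non_manifold_edges(faces):
    """Return edges (vertex index pairs) shared by more than two faces."""
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[tuple(sorted((u, v)))] += 1
    return [edge for edge, n in edge_count.items() if n > 2]

# A tetrahedron is manifold; adding a fifth "fin" face on edge (0, 1) is not.
tetra = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(non_manifold_edges(tetra))                # []
print(non_manifold_edges(tetra + [(0, 1, 4)]))  # [(0, 1)]
```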

A public demo gallery already shows clean, watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.
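Since DPO sits at the heart of that loop, a minimal sketch of its loss is worth seeing. The version below operates on summed per-sequence log-probabilities for preferred and rejected generations under the trainable policy and a frozen reference model; the β value and toy tensors are illustrative assumptions, not DeepMesh's training code.

```python
"""Minimal sketch of the Direct Preference Optimization (DPO) loss over token
sequences, the kind of objective one would apply to preferred vs. rejected
mesh generations. Shapes and beta are illustrative assumptions."""
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a (batch,) tensor of summed token log-probabilities
    under the trainable policy or the frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The preferred sequence should beat the rejected one by a growing margin.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of 4 preference pairs (random log-probs stand in for real models).
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```

The appeal for mesh generation is the same as for chat models: there is no separate reward model to fit; preference pairs and a frozen reference model are enough.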

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

19.5.25

Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks

 Researchers from Tsinghua University and ModelBest have introduced Ultra-FineWeb, a large-scale, high-quality dataset comprising approximately 1 trillion English tokens and 120 billion Chinese tokens. This dataset aims to enhance the performance of large language models (LLMs) by providing cleaner and more efficient training data.

Efficient Data Filtering Pipeline

The creation of Ultra-FineWeb involved an efficient data filtering pipeline that addresses two main challenges in data preparation for LLMs:

  1. Lack of Efficient Data Verification Strategy:
    Traditional methods struggle to provide timely feedback on data quality. To overcome this, the researchers introduced a computationally efficient verification strategy that enables rapid evaluation of data impact on LLM training with minimal computational cost.

  2. Selection of Seed Data for Classifier Training:
    Selecting appropriate seed data often relies heavily on human expertise, introducing subjectivity. The team optimized the selection process by integrating the verification strategy, improving filtering efficiency and classifier robustness.

A lightweight classifier based on fastText was employed to efficiently filter high-quality data, significantly reducing inference costs compared to LLM-based classifiers.
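To make that concrete, here is a hedged sketch of what a fastText quality filter can look like: train a supervised classifier on labelled seed text, then keep only documents it scores as high quality. The labels, seed examples and confidence threshold below are invented for illustration; the authors' actual seed selection and configuration differ.

```python
"""Hedged sketch of fastText-based quality filtering. Labels, seed examples
and the threshold are illustrative assumptions, not the paper's settings."""
import fasttext

# Seed data: high-quality (hq) positives and low-quality (lq) negatives,
# one example per line in fastText's __label__ format.
with open("seed.txt", "w", encoding="utf-8") as f:
    f.write("__label__hq The mitochondrion converts nutrients into ATP.\n")
    f.write("__label__hq Newton's second law relates force to acceleration.\n")
    f.write("__label__lq click here buy cheap followers now!!!\n")
    f.write("__label__lq lorem ipsum dolor sit amet random filler text\n")

# A linear fastText classifier is orders of magnitude cheaper to run
# than an LLM-based judge, which is the point of this design choice.
model = fasttext.train_supervised(input="seed.txt", epoch=25, wordNgrams=2)

def keep(document: str, threshold: float = 0.9) -> bool:
    """Keep a document only if the classifier confidently labels it hq
    (the threshold is an illustrative choice)."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

print(keep("Photosynthesis stores solar energy in chemical bonds."))
print(keep("win big $$$ free prizes click click click"))
```

Because the classifier is a linear bag-of-n-grams model, it can score billions of documents on CPUs, which is what makes web-scale filtering tractable compared with an LLM judge.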

Benchmark Performance

Empirical results demonstrate that LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, including MMLU, ARC, CommonSenseQA, and others. The dataset's quality contributes to enhanced training efficiency and model accuracy.

Availability

Ultra-FineWeb is available on Hugging Face, providing researchers and developers with access to this extensive dataset for training and evaluating LLMs.
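For a quick look at the data, something like the following works with the `datasets` library in streaming mode; the repository id (`openbmb/Ultra-FineWeb`) and the subset layout are assumptions to double-check against the dataset card.

```python
"""Peek at Ultra-FineWeb from Hugging Face without downloading it in full.
The repository id below is an assumption; confirm it on the dataset card."""
from datasets import (get_dataset_config_names, get_dataset_split_names,
                      load_dataset)

REPO = "openbmb/Ultra-FineWeb"  # assumed repository id

configs = get_dataset_config_names(REPO)
print("configs:", configs)
splits = get_dataset_split_names(REPO, configs[0])
print("splits:", splits)

# Stream the first subset lazily rather than fetching ~1T tokens up front.
ds = load_dataset(REPO, configs[0], split=splits[0], streaming=True)
for example in ds:
    print(str(example)[:300])  # peek at the first record's fields
    break
```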


References

  1. Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks – MarkTechPost. 

  2. Ultra-FineWeb Dataset on Hugging Face. 

  3. Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

16.5.25

Ultra-FineWeb: A Trillion-Token Dataset Elevating LLM Performance Across Benchmarks

 In a groundbreaking development for artificial intelligence, researchers from Tsinghua University and ModelBest have unveiled Ultra-FineWeb, a massive, high-quality dataset designed to bolster the training of large language models (LLMs). Comprising approximately 1 trillion English tokens and 120 billion Chinese tokens, Ultra-FineWeb sets a new standard in dataset curation, emphasizing both scale and quality to enhance LLM performance across a spectrum of benchmarks.


Innovative Filtering Methodology

The creation of Ultra-FineWeb addresses two critical challenges in dataset preparation for LLMs: the need for efficient data verification and the selection of high-quality seed data for classifier training.

  1. Efficient Verification Strategy: To rapidly assess data quality, the researchers implemented a verification approach that evaluates the impact of data on LLM training with minimal computational overhead. This strategy enables timely feedback, facilitating the swift refinement of the dataset.

  2. Optimized Seed Selection: Recognizing the subjectivity in manual seed selection, the team developed a method to systematically choose positive and negative samples. By integrating the verification strategy, they enhanced the robustness and quality of the classifier used for data filtering.

A lightweight classifier based on fastText was employed to efficiently filter the dataset. This choice significantly reduced inference costs while maintaining high filtering precision, ensuring that only the most relevant and high-quality data were included in Ultra-FineWeb.


Benchmark Performance

LLMs trained on Ultra-FineWeb demonstrated remarkable improvements across various benchmarks:

  • English Benchmarks: Models exhibited substantial gains in tasks such as MMLU, ARC-C, ARC-E, and OpenbookQA, with average score increases of over 3% compared to those trained on previous datasets like FineWeb and FineWeb-Edu.

  • Chinese Benchmarks: On evaluations like C-Eval and CMMLU, models trained with Ultra-FineWeb-zh outperformed counterparts, indicating enhanced comprehension and reasoning in Chinese language tasks.

These improvements underscore the dataset's effectiveness in enhancing LLM capabilities across multiple languages and domains.


Implications for AI Development

Ultra-FineWeb's introduction marks a significant advancement in the field of AI, particularly in the training of LLMs. By addressing key challenges in data verification and seed selection, and by employing efficient filtering techniques, the dataset provides a robust foundation for developing more accurate and versatile language models.

The methodologies applied in creating Ultra-FineWeb offer a blueprint for future dataset curation efforts, emphasizing the importance of quality and efficiency in data preparation.


Access and Availability

Ultra-FineWeb is available for the research community through Hugging Face, promoting transparency and collaboration in AI development. Researchers and developers are encouraged to utilize this resource to further advance the capabilities of LLMs.


Takeaway

Ultra-FineWeb represents a pivotal resource in the evolution of large language models, combining extensive scale with meticulous quality control. Its innovative filtering methodologies and demonstrable performance enhancements across benchmarks position it as an essential tool for researchers and developers aiming to push the boundaries of AI language understanding.
