Researchers from Tsinghua University and ModelBest have introduced Ultra-FineWeb, a large-scale, high-quality dataset comprising approximately 1 trillion English tokens and 120 billion Chinese tokens. This dataset aims to enhance the performance of large language models (LLMs) by providing cleaner and more efficient training data.
Efficient Data Filtering Pipeline
The creation of Ultra-FineWeb involved an efficient data filtering pipeline that addresses two main challenges in data preparation for LLMs:
- Lack of an efficient data verification strategy: Traditional filtering methods struggle to provide timely feedback on how a given data pool actually affects training. To overcome this, the researchers introduced a low-cost verification strategy that rapidly estimates a candidate pool's impact on LLM performance with minimal compute (see the sketch after this list).
- Selection of seed data for classifier training: Choosing seed data for quality classifiers typically relies on human expertise, which introduces subjectivity. The team folded the verification strategy into the seed-selection process, improving both filtering efficiency and classifier robustness.
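The verification strategy is described here only at a high level. As a rough illustration of the idea, the sketch below ranks candidate data pools by training a small proxy model under a fixed budget and comparing downstream benchmark scores; every helper name (sample_tokens, train_proxy_model, evaluate_suite, rank_candidate_pools) is a hypothetical placeholder, not an API from the paper.

```python
# Hypothetical sketch: spend a small, fixed training budget per candidate
# data pool and compare downstream scores directly, instead of relying on
# heuristics alone. All helpers below are placeholders for real code.

def sample_tokens(pool: list[str], budget: int) -> list[str]:
    """Placeholder: draw a fixed document/token budget from a candidate pool."""
    return pool[:budget]

def train_proxy_model(documents: list[str]) -> dict:
    """Placeholder: briefly train (or anneal) a small proxy model on the sample."""
    return {"trained_on": len(documents)}

def evaluate_suite(model: dict) -> float:
    """Placeholder: score the proxy on a fixed benchmark suite (higher is better)."""
    return 0.0

def rank_candidate_pools(pools: dict[str, list[str]], budget: int = 100_000):
    """Rank candidate pools by proxy benchmark score under the same budget."""
    scores = {
        name: evaluate_suite(train_proxy_model(sample_tokens(pool, budget)))
        for name, pool in pools.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because each candidate pool gets only a short, fixed proxy run rather than a full training job, feedback on data quality arrives in hours rather than weeks.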
A lightweight classifier based on fastText was employed to efficiently filter high-quality data, significantly reducing inference costs compared to LLM-based classifiers.
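A minimal sketch of how such a fastText filter might be trained and applied, assuming a binary high/low-quality labeling scheme; the label names, file path, threshold, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
import fasttext  # pip install fasttext

# Train on a labeled seed file (hypothetical path). fastText expects one
# document per line, prefixed with its label, e.g.
#   __label__hq <high-quality seed text>
#   __label__lq <low-quality seed text>
model = fasttext.train_supervised(
    input="seed_data.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier rates it high quality with enough confidence."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

# Filtering then reduces to a cheap CPU pass over the raw corpus.
corpus = ["First candidate document ...", "Second candidate document ..."]
high_quality = [doc for doc in corpus if keep(doc)]
```

Because fastText scores documents with bag-of-n-gram features on CPU, this pass is far cheaper than prompting an LLM to judge each document.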
Benchmark Performance
Empirical results demonstrate that LLMs trained on Ultra-FineWeb achieve significant performance improvements across multiple benchmark tasks, including MMLU, ARC, and CommonSenseQA. The higher data quality translates into better training efficiency and downstream accuracy.
Availability
Ultra-FineWeb is available on Hugging Face, providing researchers and developers with access to this extensive dataset for training and evaluating LLMs.
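For example, the dataset can be streamed with the datasets library rather than downloaded in full. The repository id and config name below are assumptions based on the Hugging Face listing and may differ; check the dataset card if loading fails.

```python
from datasets import load_dataset  # pip install datasets

# Stream the English subset instead of materializing ~1T tokens locally.
# "openbmb/Ultra-FineWeb" and the "en" config name are assumptions here.
ds = load_dataset("openbmb/Ultra-FineWeb", name="en", split="train", streaming=True)

# Peek at the first few records; each holds the document text plus metadata.
for i, example in enumerate(ds):
    print(example)
    if i >= 2:
        break
```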