Automatic dataset construction (ADC): Sample collection, data curation, and beyond

Published in arXiv preprint arXiv:2408.11338, 2024

Building high-quality datasets is crucial for training machine learning models, but manual curation is expensive and time-consuming. This paper presents Automatic Dataset Construction (ADC), a framework for automating the process of sample collection, data curation, and quality control. We describe methods for automatic data filtering, deduplication, and annotation that can significantly reduce the human effort required to create large-scale training datasets while maintaining data quality.

Recommended citation:
@article{liu2024automatic,
  title={Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond},
  author={Liu, Minghao and Di, Zonglin and Wei, Jiahao and Wang, Zixin and Zhang, Haotian and Xiao, Runze and Wang, Hongru and Pang, Jiajun and Chen, Hao and Shah, Ankit Parag and others},
  journal={arXiv preprint arXiv:2408.11338},
  year={2024}
}