Framework for evaluation of sound event detection in web videos

Published in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, 2017

The largest source of sound events is web videos. Most videos lack sound event labels at segment level, however, a significant number of them do respond to text queries, from a match found to their metadata by the search engine. In this paper we explore the extent to which a search query could be used as the true label for the presence of sound events in the videos. For this, we developed a framework for large-scale sound event recognition on web videos. The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets. The datasets are used to train three classifiers, which were then run on 3.7 million video segments. We evaluated performance using the search query as the true label and compare it (on a subset) with human labeling. Both types exhibited close performance, to within 10%, and similar performance trends as the number of evaluated segments increased. Hence, our experiments show potential for using search query as a preliminary true label for sound events in web videos.

Citation: Badlani, Rohan, Ankit Shah, Benjamin Elizalde, Anurag Kumar, and Bhiksha Raj. “Framework for evaluation of sound event detection in web videos.” arXiv preprint arXiv:1711.00804 (2017).