Poster Presentation, Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA
A significant portion of the internet's multimedia data consists of videos, whose sounds often carry meaning. Hence, automatic analysis of audio content for sound events is crucial. Because annotating audio events is time-consuming, the current literature consists of small-scale, audio-only datasets, with no audio sourced from the web apart from AudioSet. Moreover, web videos carry no tags or labels for sound events at the segment level, which further complicates evaluating sound recognition at large scale. We introduce a framework for continuous large-scale sound event recognition on web videos consisting of three modules: Crawl, Hear, and Feedback. This modular design allows the framework to scale and evolve as required. The framework has processed 3.5 million video segments, and human annotators inspected a subset of these segments to evaluate recognition performance on web audio.
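The Crawl, Hear, and Feedback stages can be pictured as a simple processing loop. The sketch below is purely illustrative: all function names, the query-based crawling, and the confidence-threshold feedback rule are assumptions for exposition, not details from the poster.

```python
# Hypothetical sketch of a three-stage Crawl -> Hear -> Feedback pipeline.
# The real framework's interfaces are not specified in the abstract.

def crawl(query):
    """Crawl: fetch candidate video segments for a text query (stubbed)."""
    # Stand-in for a web crawler that returns segment identifiers.
    return [f"{query}_segment_{i}" for i in range(3)]

def hear(segment):
    """Hear: run a sound-event recognizer on one segment (stubbed)."""
    # Stand-in for an audio classifier; returns (label, confidence).
    return ("dog_bark", 0.9)

def feedback(results, threshold=0.95):
    """Feedback: flag segments below a confidence threshold for human review."""
    return [seg for seg, (_, conf) in results.items() if conf < threshold]

def run_pipeline(query):
    segments = crawl(query)
    results = {seg: hear(seg) for seg in segments}
    return results, feedback(results)

results, to_inspect = run_pipeline("dog")
print(len(results), len(to_inspect))  # prints "3 3"
```

In this sketch, the Feedback stage routes low-confidence predictions to human inspection, which mirrors the abstract's use of human annotators to evaluate a subset of the 3.5 million processed segments.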