Sounds are described in complex and intricate ways, which makes it hard to collect webly labeled data through text-based retrieval from video search engines.
We use a predefined set of 40 sound events from Audioset [1].
A few things were kept in mind while selecting classes:
Classes should be precise enough to obtain relevant examples. Hence, classes which are too vague, such as "Inside, large room or hall" or "Sound effect", were not considered.
We focused more on sound classes which contain a “noun phrase” in their name, e.g. dog, truck, flute. The source of the sound is easily identifiable in these cases, and a text query containing a noun phrase often leads to relatively relevant results.
We also considered "onomatopoeia" words and those cases where the phrase directly represents the sound, for example laughter, birdsong, siren.
We also included classes which are similar to each other (e.g. drum and bass drum) so that the vocabulary contains classes with "finer" differences.
Some classes are hierarchically related, e.g. Bus, Car, and Truck all belong to the broader Vehicle class.
Finally, classes with a higher number of examples in Audioset were preferred.
Vocabulary of Sounds
Vehicle
Guitar
Singing
Car
Animal
Violin, fiddle
Bird
Drum
Engine
Drum Kit
Dog
Boat, Water vehicle
Train
Piano
Truck
Keyboard (musical)
Crowd
Bass drum
Rail Transport
Water
Pigeon, dove
Rock Music
Siren
Railroad car, train wagon
Motorboat, speedboat
Tools
Motorcycle
Wind
Race car, auto racing
Chicken, rooster
Bird vocalization
Fowl
Laughter
Emergency Vehicle
Aircraft
Bus
Flute
Cymbal
Electric Piano
Chirp, tweet
Datasets
Webly-2k
Top 50 retrieved results from YouTube for each query corresponding to sound classes.
Total 1906 recordings.
Total 60 hours of audio data.
Average duration of recordings 111 seconds.
Each sound event is labeled in around 50 recordings. The dataset is multi-labeled: over 200 recordings have more than one label.
The labels are noisy. Both false positive and false negative label noise is present in the dataset.
False Positive noise in labels for different events in Webly-2k (event marked to be present based on retrieved results but not actually present on manual inspection)
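The labeling scheme described above can be sketched as follows: each recording inherits the label of every query that retrieved it, so a recording returned for more than one query becomes multi-labeled. This is only a minimal illustration under stated assumptions, not the pipeline actually used to build the dataset. The function name `build_weak_labels`, the class-to-results mapping, and the recording IDs are hypothetical, and the step that queries YouTube is left out.

```python
from collections import defaultdict


def build_weak_labels(retrieved, top_k=50):
    """Map each recording ID to the set of classes whose query retrieved it.

    `retrieved` maps a class name to its ranked list of recording IDs
    (hypothetical input; the retrieval step itself is not shown).
    """
    labels = defaultdict(set)
    for class_name, ranked_ids in retrieved.items():
        # Keep only the top-k results for this class query.
        for rec_id in ranked_ids[:top_k]:
            labels[rec_id].add(class_name)
    return dict(labels)


# Toy example with made-up recording IDs.
retrieved = {
    "Dog": ["vid_a", "vid_b", "vid_c"],
    "Animal": ["vid_b", "vid_d"],
}
labels = build_weak_labels(retrieved, top_k=2)

# Recordings retrieved for more than one query carry several labels.
multi_labeled = [rec for rec, classes in labels.items() if len(classes) > 1]
```

Because the labels come only from retrieval, nothing in this sketch checks whether the event is really audible in the recording, which is where the false positive and false negative noise comes from.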
Webly-4k
Top 100 retrieved results from YouTube for each query corresponding to sound classes.
Total 3807 recordings.
Around 108 hours of audio data.
Average duration of recordings 103 seconds.
Each sound event is labeled in around 100 recordings. The dataset is multi-labeled.
The labels are noisy. Both false positive and false negative label noise is present in the dataset.
False Positive noise in labels for different events in Webly-4k (event marked to be present based on retrieved results but not actually present on manual inspection)
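The false positive noise described above can be expressed per class as the fraction of recordings labeled with that class in which the event is absent on manual inspection. A minimal sketch, assuming two hypothetical mappings from recording ID to a set of class names: `weak` (labels from retrieval) and `verified` (labels confirmed by manual inspection). The function name and the toy data are illustrative only and are not taken from the dataset.

```python
def false_positive_rate(weak_labels, verified_labels, class_name):
    """Fraction of recordings weakly labeled `class_name` in which
    manual inspection did not find the event."""
    labeled = [rec for rec, classes in weak_labels.items() if class_name in classes]
    if not labeled:
        return 0.0
    wrong = sum(
        1 for rec in labeled if class_name not in verified_labels.get(rec, set())
    )
    return wrong / len(labeled)


# Toy data with made-up recording IDs.
weak = {"v1": {"Dog"}, "v2": {"Dog", "Animal"}, "v3": {"Dog"}, "v4": {"Animal"}}
verified = {"v1": {"Dog"}, "v2": {"Animal"}, "v3": {"Dog"}, "v4": {"Animal"}}

dog_fp = false_positive_rate(weak, verified, "Dog")        # 1 of 3 Dog labels is wrong
animal_fp = false_positive_rate(weak, verified, "Animal")  # both Animal labels hold
```

False negative noise would need the opposite check: events found on inspection that are missing from the retrieval-based labels.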