Deciphering Gun Type Hierarchy through Acoustic Analysis
of Gunshot Recordings
People
Abstract
The high gun murder rate in the United States is a serious societal problem. Gunshot recordings form a major source of evidence for identifying a perpetrator. Acoustic analysis of gunshot recordings can further aid crime-scene reconstruction and shooter-location identification, and can provide critical information to law enforcement officers. In this work, we focus on inferring critical information from the acoustic analysis of gunshot recordings, in particular providing accurate gun-type hierarchy information. We present the largest collection of gunshot recordings temporally annotated with gunshot locations: 3469 audio recordings with a total duration of 9 hours. We perform joint detection of the temporal location of each gunshot within a recording and of the gun-type hierarchy (e.g., rifle, machine gun). Mean average precision (mAP) is 0.89 for gunshot detection and 0.62 for gun-type hierarchy, an improvement of 0.2 mAP over an SVM baseline. On real-life incidents such as the Las Vegas shooting and the Florida school shooting, we detect the correct gun-type hierarchy within the top-2 predictions with 96 percent accuracy.
Video
Paper & Presentation
ICCV paper (pdf) (hosted on arXiv)
Citation
Code and Network
Code is available on Github.
Pre-trained Network
These trained networks can be used to extract fc6 features from 96-by-96 patches. Note that running them requires a version of Caffe later than August 26, 2015, or the normalization will be computed incorrectly. Images should be passed in RGB format after the per-color-channel means have been subtracted; no other preprocessing is necessary (the pixel range is scaled automatically). For instructions on using the networks to extract features for larger images, see the net surgery example in the Caffe repository.

python3 magic_init.py -cs -d '/path/to/VOC2007/JPEGImages/*.jpg' \
    --gpu 0 --load /path/to/inputmodel.caffemodel \
    /path/to/groupless_caffenet.prototxt /path/to/outputmodel.caffemodel

where groupless_caffenet refers to this modified version of the default CaffeNet model that has groups removed (for the VGG-style model, this becomes this vgg-style prototxt). You can then do the fine-tuning with fast-rcnn, using this prototxt for training, this solver, and this prototxt for testing with the CaffeNet model; use this, this, and this, respectively, for VGG. For magic_init, start with the .caffemodel files that don't have fc6. Also, make sure you use fast-rcnn's multiscale configuration (in experiments/cfgs/multiscale.yml). If you're using VGG, this configuration will most likely run out of memory, so use this configuration instead. Run 150K iterations of fine-tuning for the CaffeNet model, and 500K for VGG.
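The per-channel mean subtraction described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code; the mean values below are placeholders, not the ones shipped with the networks.

```python
import numpy as np

# Placeholder per-channel means (R, G, B) -- substitute the values
# distributed with the trained networks.
CHANNEL_MEANS = np.array([122.0, 116.0, 104.0])

def preprocess_patch(patch):
    """Subtract per-color-channel means from a 96x96 RGB patch.

    No further scaling is needed: per the notes above, the networks
    scale the pixel range automatically.
    """
    patch = np.asarray(patch, dtype=np.float32)
    assert patch.shape == (96, 96, 3), "expected a 96x96 RGB patch"
    return patch - CHANNEL_MEANS

# Example: a uniform mid-gray patch
patch = np.full((96, 96, 3), 128.0)
out = preprocess_patch(patch)
```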
Projection Network:
[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]
Color Dropping Network:
[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]
VGG-Style Network (Color Dropping):
[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]
Additional Materials
Full Results: Visual Data Mining on unlabeled PASCAL VOC 2011
This is the expanded version of Figure 7 in the paper. We have two versions, depending on the method used to avoid chromatic aberration at training time. WARNING: These pages contain about 10,000 patches; don't click the link unless your browser can handle it!
Nearest Neighbors on PASCAL VOC 2007
This is the expanded version of Figure 4 in the paper. We have a separate page for each feature, corresponding to the three columns in Figure 4. In each case, the leftmost column shows randomly selected patches that were used as queries; no hand-filtering was done.
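Nearest-neighbor retrieval over patch features, as used for figures like these, can be sketched as below. This is an illustrative sketch, assuming features are stored one row per patch and cosine similarity is the metric; the exact metric used for the figure is not specified here.

```python
import numpy as np

def nearest_neighbors(query, features, k=5):
    """Return indices of the k patches whose feature vectors are most
    similar to `query` under cosine similarity, most similar first."""
    q = query / np.linalg.norm(query)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ q                    # cosine similarity to every patch
    return np.argsort(-sims)[:k]   # top-k indices, descending similarity

# Toy example: 4 patches with 3-dimensional "features"
feats = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
idx = nearest_neighbors(feats[0], feats, k=2)
```

The query patch itself (index 0) comes back first, followed by its closest match.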
Maximal Activations of Network Units
We report here the patches that maximally activated the units in our networks. For each unit, we show a 3-by-3 grid of the top 9 patches. The units themselves are arranged in no particular order. We ran the networks on about 2.5M randomly-sampled patches from ImageNet. Note that most units have a receptive field that does not span the entire patch; hence, we only show the approximate region that is contained within the receptive field. Regions outside the patch are shown as black.
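Selecting the top-9 activating patches per unit amounts to a per-column top-k over a patches-by-units activation matrix. A minimal sketch, assuming the activations have already been computed and stacked into such a matrix:

```python
import numpy as np

def top_patches_per_unit(activations, k=9):
    """For each unit (column), return indices of the k patches with the
    highest activation, strongest first.

    `activations` has shape (n_patches, n_units); the result has
    shape (n_units, k).
    """
    order = np.argsort(-activations, axis=0)  # sort patches descending per unit
    return order[:k].T

# Toy example: 5 patches, 2 units
acts = np.array([[0.1, 9.0],
                 [5.0, 1.0],
                 [3.0, 2.0],
                 [8.0, 0.0],
                 [2.0, 4.0]])
top = top_patches_per_unit(acts, k=3)
```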
Deep Dream
Alexander Mordvintsev decided to visualize the contents of our VGG-style network by applying Deep Dream separately to each filter in our network, and he has kindly shared his results with us. Below are 8 of the filters in conv5_3 (the second-to-last layer before the representations are fused). Below each is my interpretation; mouse over the images to see it (I don't want to bias your interpretation!).
Related Papers
We first proposed context as supervision in:
C. Doersch, A. Gupta, and A. A. Efros. Context as Supervisory Signal: Discovering Objects with Predictable Context. European Conference on Computer Vision, September 2014.
A paper that helps us fine-tune our model:
P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent Initializations of Convolutional Neural Networks. International Conference on Learning Representations, May 2016.
An extension of this paper, which trains deep nets with jigsaw puzzles:
M. Noroozi and P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. arXiv:1603.09246, March 2016.
Other works on “Self-Supervised” Learning
P. Isola, D. Zoran, D. Krishnan, and E. Adelson. Learning visual groups from co-occurrences in space and time. arXiv:1511.06811, November 2015.
P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. ICCV 2015.
D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. ICCV 2015.
X. Wang and A. Gupta. Unsupervised Learning of Visual Representations using Videos. ICCV 2015.
Funding
This research was partially supported by:
Contact
For questions/comments, contact Ankit Shah