Deciphering GunType Hierarchy through Acoustic Analysis
of Gunshot Recordings



The high rate of gun murders in the United States is a serious problem for society. Gunshot recordings form a major source of evidence for identifying the perpetrator. Acoustic analysis of these recordings can also support crime scene reconstruction and shooter localization, and can give law enforcement officers critical information for curbing such crimes. In this work, we focus on inferring critical information from the acoustic analysis of gunshot recordings, in particular accurate gun type hierarchy information. We present the largest collection of gunshot recordings annotated temporally with gunshot locations: 3469 audio recordings with a total duration of 9 hours. We perform joint detection, locating each gunshot in time within a recording and identifying its gun type hierarchy (e.g., Rifle, Machine gun). Mean average precision (mAP) is 0.89 for gunshot detection and 0.62 for gun type hierarchy, an improvement of 0.2 mAP over an SVM baseline. Our results on real-life incidents such as the Las Vegas shooting and the Florida school shooting show that the correct gun hierarchy is among our top 2 predictions with 96 percent accuracy.
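For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how mean average precision and top-2 accuracy are commonly computed. It is not the evaluation code used for the reported numbers; the function names, the example scores, and the class labels other than Rifle and Machine gun are assumptions made for illustration.

```python
def average_precision(scores, labels):
    """Average precision for one class.

    scores: detection confidences, one per example.
    labels: 1 if the example truly belongs to the class, else 0.
    Returns the mean of the precision values at each rank where a
    positive example appears (0.0 if there are no positives).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """Unweighted mean of the per-class average precision values."""
    aps = [
        average_precision(scores_by_class[name], labels_by_class[name])
        for name in scores_by_class
    ]
    return sum(aps) / len(aps)


def top_k_accuracy(ranked_predictions, true_labels, k=2):
    """Fraction of examples whose true label is among the first k predictions."""
    correct = sum(
        1
        for predictions, truth in zip(ranked_predictions, true_labels)
        if truth in predictions[:k]
    )
    return correct / len(true_labels)


if __name__ == "__main__":
    # Hypothetical scores for two classes over three recordings.
    scores = {"rifle": [0.9, 0.8, 0.3], "machine_gun": [0.2, 0.7, 0.6]}
    labels = {"rifle": [1, 0, 1], "machine_gun": [0, 1, 1]}
    print(mean_average_precision(scores, labels))

    # Hypothetical ranked predictions; "other" is a placeholder class.
    ranked = [["rifle", "machine_gun", "other"], ["machine_gun", "rifle", "other"]]
    print(top_k_accuracy(ranked, ["machine_gun", "other"], k=2))
```

Under this definition, a top-2 accuracy of 96 percent means that the true gun type appears among the two highest-ranked predictions for 96 percent of the evaluated recordings.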


Paper & Presentation

ICCV paper (pdf) (hosted on arXiv)

Presented by Carl at ICCV

ICCV Slides (pptx, 8MB)

Ankit Shah and Alexander G. Hauptmann. Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings. In preparation for ICASSP 2019.

Code and Network


Code is available on GitHub.

Pre-trained Network

These trained networks can be used to extract fc6 features from 96-by-96 patches. Running them requires a version of Caffe later than August 26, 2015; otherwise the normalization will be computed incorrectly. Images should be passed in RGB format after the per-color-channel means have been subtracted; no other preprocessing is necessary (the pixel range is automatically scaled correctly). For instructions on using the networks to extract features for larger images, see the net surgery example in the Caffe repository.
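The preprocessing described above can be sketched as follows. This is only an illustration in plain Python: the channel mean values are hypothetical placeholders (the actual means are not given on this page), the function name is an assumption, and the channel-first output layout reflects the usual Caffe blob convention rather than anything stated here.

```python
# Hypothetical per-channel means in (R, G, B) order; replace with the real values.
CHANNEL_MEANS = (123.0, 117.0, 104.0)
PATCH_SIZE = 96


def preprocess_patch(patch_hwc, channel_means=CHANNEL_MEANS):
    """Mean-subtract a 96-by-96 RGB patch and reorder it to channel-first.

    patch_hwc: nested lists indexed as [row][column][channel], channels in RGB order.
    Returns nested lists indexed as [channel][row][column] holding floats.
    No further scaling is applied, matching the description above.
    """
    if len(patch_hwc) != PATCH_SIZE or any(len(row) != PATCH_SIZE for row in patch_hwc):
        raise ValueError("expected a 96-by-96 patch")
    return [
        [[float(pixel[c]) - channel_means[c] for pixel in row] for row in patch_hwc]
        for c in range(3)
    ]


if __name__ == "__main__":
    # A uniform patch whose colour equals the assumed means becomes all zeros.
    patch = [[(123, 117, 104)] * PATCH_SIZE for _ in range(PATCH_SIZE)]
    blob = preprocess_patch(patch)
    print(len(blob), len(blob[0]), len(blob[0][0]), blob[0][0][0])
```

In practice the same operation would normally be done with an array library; plain lists are used here only to keep the sketch dependency-free.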

Note that when fine-tuning a model for Pascal VOC (and probably for most datasets), significant performance boosts may be obtained by using our Magic Init (code here) to rescale the weights in the following networks. To obtain the results in the paper, I used:

python3 -cs -d '/path/to/VOC2007/JPEGImages/*.jpg' \
   --gpu 0 --load /path/to/inputmodel.caffemodel \
   /path/to/groupless_caffenet.prototxt /path/to/outputmodel.caffemodel

where groupless_caffenet refers to this modified version of the default CaffeNet model that has groups removed (for the VGG-style model, this becomes this vgg-style prototxt). You can then fine-tune with Fast R-CNN: for the CaffeNet model, use this prototxt for training, this solver, and this prototxt for testing; for VGG, use this, this, and this, respectively. For magic_init, start with the .caffemodel files that don't have fc6. Also, make sure you use Fast R-CNN's multiscale configuration (in experiments/cfgs/multiscale.yml). If you're using VGG, this configuration will most likely run out of memory, so use this configuration instead. Run 150K iterations of fine-tuning for the CaffeNet model, and 500K for VGG.

Projection Network:

[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]

Color Dropping Network:

[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]

VGG-Style Network (Color Dropping):

[definition prototxt] [caffemodel_with_fc6] [caffemodel_no_fc6]

Additional Materials

Full Results: Visual Data Mining on unlabeled PASCAL VOC 2011

This is the expanded version of Figure 7 in the paper. We have two versions, depending on the method used to avoid chromatic aberration at training time. WARNING: These pages contain about 10000 patches; don't click the link unless your browser can handle it!
[Color Dropping] [Color Projection]

Nearest Neighbors on PASCAL VOC 2007

This is the expanded version of Figure 4 in the paper. We have a separate page for each feature, corresponding to the three columns in Figure 4. In each case, the leftmost column shows randomly selected patches that were used as queries; no hand-filtering was done.
[Our architecture, random initialization] [AlexNet (ImageNet Labels)] [Ours (ImageNet Unlabeled)]

Maximal Activations of Network Units

We report here the patches that maximally activated the units in our networks. For each unit, we show a 3-by-3 grid of the top 9 patches. The units themselves are arranged in no particular order. We ran the networks on about 2.5M randomly sampled patches from ImageNet. Note that most units have a receptive field that does not span the entire patch; hence, we only show the approximate region that is contained within the receptive field. Regions outside the patch are shown as black.
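The selection step described above (keeping the 9 most strongly activating patches per unit and laying them out as a 3-by-3 grid) can be sketched roughly as below. This is an illustrative reconstruction, not the code that produced these pages; the function names and the input format (a list of patch identifiers paired with per-unit activations) are assumptions, and cropping to the receptive field is not shown.

```python
import heapq


def top_patches_per_unit(activations, k=9):
    """Pick the k patches with the highest activation for each unit.

    activations: list of (patch_id, [activation of unit 0, unit 1, ...]).
    Returns {unit_index: [patch_id, ...]} ordered from strongest to weakest.
    """
    if not activations:
        return {}
    num_units = len(activations[0][1])
    top = {}
    for unit in range(num_units):
        best = heapq.nlargest(k, activations, key=lambda item: item[1][unit])
        top[unit] = [patch_id for patch_id, _ in best]
    return top


def as_grid(patch_ids, columns=3):
    """Split a flat list of patch ids into rows of `columns` entries."""
    return [patch_ids[i:i + columns] for i in range(0, len(patch_ids), columns)]


if __name__ == "__main__":
    # Twelve hypothetical patches and two units with opposite preferences.
    acts = [("p%d" % i, [float(i), float(-i)]) for i in range(12)]
    grids = {unit: as_grid(ids) for unit, ids in top_patches_per_unit(acts).items()}
    for unit, grid in grids.items():
        print(unit, grid)
```

For millions of patches, one would more likely keep a bounded heap per unit while streaming through the data instead of holding every activation in memory.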

Projection Network:

[norm1] [norm2] [conv3] [conv4] [conv5] [fc6]

Color Dropping Network:

[norm1] [norm2] [conv3] [conv4] [conv5] [fc6]

Deep Dream

Alexander Mordvintsev decided to visualize the contents of our VGG-style network by applying Deep Dream separately to each filter in our network, and has kindly shared his results with us. Below are 8 of the filters in conv5_3 (the second-to-last layer before the representations are fused). Below each is my interpretation. Mouse over them to see it (I don't want to bias your interpretation!)



left side of dog's face

human face


circles (very useful for predicting relative position!)



Related Papers

We first proposed context as supervision in:
C. Doersch, A. Gupta, and A. A. Efros. Context as Supervisory Signal: Discovering Objects with Predictable Context. European Conference on Computer Vision, September 2014.

A paper that helps us fine-tune our model:
P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent Initializations of Convolutional Neural Networks. International Conference on Learning Representations, May 2016.

An extension of this paper, which trains deep nets with jigsaw puzzles:
M. Noroozi and P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. arXiv:1603.09246, March 2016.

Other works on “Self-Supervised” Learning

P. Isola, D. Zoran, D. Krishnan, and E. Adelson. Learning visual groups from co-occurrences in space and time. arXiv:1511.06811, November 2015.

P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. ICCV 2015.

D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. ICCV 2015.

X. Wang and A. Gupta. Unsupervised Learning of Visual Representations using Videos. ICCV 2015.


This research was partially supported by:


For questions/comments, contact Ankit Shah