Accident Forecasting in Traffic Camera CCTV Videos



This paper presents a novel dataset for traffic accident analysis. Our aim is to address the lack of public data for research on automatic spatio-temporal annotation for road traffic safety. Our Car Accident Detection and Prediction (CADP) dataset consists of 1,416 video segments collected from YouTube, of which 205 have full spatio-temporal annotations. Through analysis of the CADP dataset, we observed a significant degradation of object detection in the pedestrian category, due to the small object sizes and the complexity of the scenes. To address this, we propose integrating Augmented Context Mining (ACM) into the Faster R-CNN detector to improve accuracy on small pedestrians. Our experiments indicate a considerable improvement in object detection accuracy. Finally, we demonstrate the performance of accident forecasting on our dataset using our Improved Faster R-CNN and the Accident LSTM architectures. We achieve an average Time-To-Accident of 1.5 seconds, with a highest Average Precision of 61.63%. We expect our dataset to serve as the starting point for a new research direction, which can grow incrementally in the coming years.
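To make the Time-To-Accident measure concrete, here is a minimal sketch of how it could be computed for a single video. It assumes a common definition that the abstract does not spell out: the time between the first frame whose predicted accident probability reaches a threshold and the annotated accident frame. The function name, the 0.5 threshold, and the toy scores are illustrative assumptions, not values from the paper.

```python
from typing import Optional, Sequence


def time_to_accident(
    probs: Sequence[float],
    accident_frame: int,
    fps: float,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> Optional[float]:
    """Return how many seconds before the accident the alarm first fires.

    probs holds one predicted accident probability per frame.
    accident_frame is the index of the annotated accident frame.
    Returns None if no frame before the accident reaches the threshold.
    """
    for frame, p in enumerate(probs[:accident_frame]):
        if p >= threshold:
            return (accident_frame - frame) / fps
    return None


# Toy example: a 2 fps clip where the accident is annotated at frame 5.
scores = [0.1, 0.2, 0.4, 0.7, 0.9, 0.95]
tta = time_to_accident(scores, accident_frame=5, fps=2.0)
print(tta)
```

Averaging this value over all positive test videos gives a dataset-level figure of the kind quoted above. Lowering the threshold generally raises Time-To-Accident while lowering precision, which is why the two numbers are reported together.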

Network Architecture


Paper & Presentation

ICCV paper (pdf) (hosted on arXiv)

Presented by Carl at ICCV (

ICCV Slides (pptx, 8MB)

Ankit Shah*, Jean Baptiste Lamare*, Tuan Nguyen Anh*, Alexander Hauptmann. Accident Forecasting in Traffic Camera CCTV Videos. Submitted to AAAI 2018.

Code and Network


Code is available on GitHub


Additional Materials

Full Results: Visual Data Mining on unlabeled PASCAL VOC 2011

This is the expanded version of Figure 7 in the paper. We have two versions, depending on the method used to avoid chromatic aberration at training time. WARNING: These pages contain about 10000 patches; don't click the link unless your browser can handle it!
[Color Dropping] [Color Projection]

Nearest Neighbors on PASCAL VOC 2007

This is the expanded version of Figure 4 in the paper. We have a separate page for each feature, corresponding to the three columns in Figure 4. In each case, the leftmost column shows randomly selected patches that were used as queries; no hand-filtering was done.
[Our architecture, random initialization] [AlexNet (ImageNet Labels)] [Ours (ImageNet Unlabeled)]

Maximal Activations of Network Units

We report here the patches that maximally activated the units in our networks. For each unit, we show a 3-by-3 grid of the top 9 patches. The units themselves are arranged in no particular order. We ran the networks on about 2.5M randomly sampled patches from ImageNet. Note that most units have a receptive field that does not span the entire patch; hence, we only show the approximate region that is contained within the receptive field. Regions outside the patch are shown as black.
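The selection step described above can be sketched as follows. This is an illustrative sketch only, assuming the activation of one unit has already been recorded for every sampled patch; the helper names top_k_patches and as_grid are hypothetical and do not come from the released code.

```python
import heapq
from typing import List, Sequence


def top_k_patches(unit_activations: Sequence[float], k: int = 9) -> List[int]:
    """Return indices of the k patches with the highest activation, best first."""
    return heapq.nlargest(
        k, range(len(unit_activations)), key=unit_activations.__getitem__
    )


def as_grid(indices: Sequence[int], cols: int = 3) -> List[List[int]]:
    """Lay the selected patch indices out row by row, e.g. as a 3-by-3 grid."""
    return [list(indices[i:i + cols]) for i in range(0, len(indices), cols)]


# Toy activations for 20 patches on one unit (a permutation of 0..19).
acts = [(i * 7) % 20 for i in range(20)]
top = top_k_patches(acts)
grid = as_grid(top)
print(grid)
```

Using heapq.nlargest avoids sorting every patch, which matters when the pool holds millions of patches and only nine are kept per unit.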

Projection Network:

[norm1] [norm2] [conv3] [conv4] [conv5] [fc6]

Color Dropping Network:

[norm1] [norm2] [conv3] [conv4] [conv5] [fc6]

Deep Dream

Alexander Mordvintsev decided to visualize the contents of our VGG-style network by applying Deep Dream separately to each filter in our network, and has kindly shared his results with us. Below are 8 of the filters in conv5_3 (the second-to-last layer before the representations are fused). Below each is my interpretation. Mouse over them to see it (I don't want to bias your interpretation!)



left side of dog's face

human face


circles (very useful for predicting relative position!)



Related Papers

We first proposed context as supervision in:
C. Doersch, A. Gupta, and A. A. Efros. Context as Supervisory Signal: Discovering Objects with Predictable Context. European Conference on Computer Vision, September 2014.

A paper that helps us fine-tune our model:
P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent Initializations of Convolutional Neural Networks. International Conference on Learning Representations, May 2015.

An extension of this paper, which trains deep nets with jigsaw puzzles:
M. Noroozi and P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. arXiv:1603.09246, March 2016.

Other works on “Self-Supervised” Learning

P. Isola, D. Zoran, D. Krishnan, and E. Adelson. Learning visual groups from co-occurrences in space and time. arXiv:1511.06811, November 2015.

P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. ICCV 2015.

D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. ICCV 2015.

X. Wang and A. Gupta. Unsupervised Learning of Visual Representations using Videos. ICCV 2015.


This research was partially supported by:


For questions/comments, contact Ankit Shah, Jean Baptiste, Tuan Nguyen Anh