Research Project: Deformable Part-based Video Models
Multiple Instance Learning for Action Detection
Feature sampling for visual vocabulary generation
Space-time deformable part-based action models

We proposed an action classification framework in which discriminative action subvolumes are learned in a weakly supervised setting, motivated by the difficulty of manually labelling massive video datasets. The learned sub-action models are used simultaneously to classify video clips and to localise actions in space-time. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple instance learning (MIL) framework, which in turn is used to learn its class membership. We demonstrate quantitatively that the classification performance of the proposed algorithm is comparable, and in some cases superior, to the current state of the art on the most challenging video datasets, whilst additionally estimating space-time localisation information.
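The MIL view described above can be illustrated with a minimal sketch. It assumes a linear classifier over BoF histograms and the common MIL rule that a bag (video clip) takes the score of its highest-scoring instance (subvolume). The function names, the linear model and the max rule are illustrative assumptions, not necessarily the formulation used in the paper.

```python
from typing import List, Sequence, Tuple


def linear_score(weights: Sequence[float], bias: float,
                 hist: Sequence[float]) -> float:
    """Score one BoF histogram (one subvolume) with a linear model."""
    return sum(w * h for w, h in zip(weights, hist)) + bias


def classify_bag(weights: Sequence[float], bias: float,
                 bag: List[Sequence[float]]) -> Tuple[int, float, int]:
    """Classify a video clip represented as a bag of subvolume histograms.

    Assumed MIL rule: a bag is positive if at least one instance is
    positive, so the bag score is the maximum instance score.
    Returns (label, bag_score, index_of_best_instance).
    """
    scores = [linear_score(weights, bias, hist) for hist in bag]
    best = max(range(len(scores)), key=scores.__getitem__)
    label = 1 if scores[best] > 0 else -1
    return label, scores[best], best
```

The index of the best instance gives a coarse localisation for free: it identifies which subvolume was most responsible for the clip-level decision.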

The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual vocabulary are critical to, and often dominate, performance. In this work we provide a critical evaluation of various approaches to building a vocabulary and show that good practices do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on five major action recognition datasets using relatively small visual vocabularies.
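As a rough illustration of the bag-of-visual-words pipeline, the sketch below subsamples local features, clusters the sample with a few k-means iterations to obtain a vocabulary, and quantises a clip's features into a normalised histogram. Uniform random subsampling, plain k-means and all names and parameters here are assumptions chosen for brevity; they are not the specific sampling or partitioning strategies evaluated in this work.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centres: List[Vector], f: Vector) -> int:
    """Index of the visual word (cluster centre) closest to feature f."""
    return min(range(len(centres)), key=lambda k: sq_dist(centres[k], f))


def build_vocabulary(features: List[Vector], n_words: int, n_samples: int,
                     n_iters: int = 10, seed: int = 0) -> List[List[float]]:
    """Cluster a random subsample of the features into n_words centres.

    Requires n_words <= min(n_samples, len(features)).
    """
    rng = random.Random(seed)
    sample = rng.sample(list(features), min(n_samples, len(features)))
    centres = [list(c) for c in rng.sample(sample, n_words)]
    for _ in range(n_iters):
        clusters: List[List[Vector]] = [[] for _ in range(n_words)]
        for f in sample:
            clusters[nearest(centres, f)].append(f)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def bof_histogram(features: List[Vector],
                  centres: List[Vector]) -> List[float]:
    """Aggregate features into an L1-normalised visual-word histogram."""
    counts = [0.0] * len(centres)
    for f in features:
        counts[nearest(centres, f)] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]
```

Clustering only a subsample keeps vocabulary construction cheap, which is one reason the choice of sampling strategy can matter as much as the vocabulary size.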

We devised an action classification framework in which actions are modelled by discriminative subvolumes, learned using weakly supervised training. The learned action models are used to simultaneously classify video clips and to localise actions by aggregating the subvolume scores to form a dense space-time saliency map. Each subvolume gives rise to a bag-of-features (BoF) instance in a multiple-instance-learning framework. We show that by using general video subvolumes we are able to achieve better performance than the state-of-the-art BoF baseline, whilst being able to localise space-time actions even in the most challenging datasets.
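One simple way to turn subvolume scores into a dense space-time saliency map is sketched below. It assumes axis-aligned subvolumes given as half-open voxel ranges and aggregates by averaging the scores of all subvolumes covering each voxel. The box format and the averaging rule are illustrative assumptions, since the text above does not specify how the scores are aggregated.

```python
from typing import List, Sequence, Tuple

# Hypothetical subvolume format: (x0, x1, y0, y1, t0, t1), half-open ranges.
Box = Tuple[int, int, int, int, int, int]


def saliency_map(shape: Tuple[int, int, int], boxes: Sequence[Box],
                 scores: Sequence[float]) -> List[List[List[float]]]:
    """Build a dense map indexed as [t][y][x] for a video of shape (T, H, W).

    Each voxel receives the mean score of the subvolumes that cover it;
    voxels covered by no subvolume are set to 0.0.
    """
    T, H, W = shape
    total = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    count = [[[0] * W for _ in range(H)] for _ in range(T)]
    for (x0, x1, y0, y1, t0, t1), s in zip(boxes, scores):
        # Clip each box to the video bounds before accumulating.
        for t in range(max(t0, 0), min(t1, T)):
            for y in range(max(y0, 0), min(y1, H)):
                for x in range(max(x0, 0), min(x1, W)):
                    total[t][y][x] += s
                    count[t][y][x] += 1
    return [[[total[t][y][x] / count[t][y][x] if count[t][y][x] else 0.0
              for x in range(W)]
             for y in range(H)]
            for t in range(T)]
```

Thresholding such a map, or taking its maximum, gives a space-time localisation estimate without requiring any localisation labels at training time.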

Relevant papers:

 Lab Member(s): Michael Sapienza, Philip Torr