Research Theme: Deformable Part-based Models for Unconstrained Action Recognition
Learning discriminative space-time deformable part-based action models
Multiple Instance Learning for Action Localisation and Recognition
Feature sampling and partitioning for visual vocabulary generation on large action classification datasets
Current state-of-the-art action classification methods represent space-time features globally, from the entire video clip under consideration. However, the extracted features may in part arise from irrelevant scene context and from movements shared amongst multiple action classes. For example, a waving action may be performed whilst walking; if the walking movement also appears alongside other types of action, then features extracted from the walking part of the scene should not be used to learn a waving classifier.
In this work we propose an action classification framework in which actions are modelled by discriminative subvolumes, learned using weakly supervised training. The learned action models are used to simultaneously classify video clips and to localise actions by aggregating the subvolume scores into a dense space-time saliency map. Each subvolume gives rise to a bag-of-features (BoF) instance in a multiple instance learning (MIL) framework. We show that by using general video subvolumes we are able to achieve better performance than the state-of-the-art BoF baseline, whilst also being able to localise space-time actions even in the most challenging datasets.
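To make the pipeline concrete, the following minimal Python sketch illustrates the subvolume scoring and saliency-aggregation idea described above. The helper names (`extract_bof_histogram`, `score_clip`, `saliency_map`), the linear instance classifier and the subvolume representation are illustrative assumptions, not the implementation used in the work.

```python
import numpy as np

# Illustrative sketch only: subvolumes as MIL instances, scored with a
# linear classifier and aggregated into a space-time saliency volume.

def extract_bof_histogram(features, vocab):
    """Quantise local space-time descriptors against a visual vocabulary."""
    # features: (n, d) descriptors inside one subvolume; vocab: (k, d) word centres
    dists = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / max(hist.sum(), 1.0)

def score_clip(subvolumes, vocab, w, b):
    """Treat each subvolume as one instance in a bag (the clip).

    subvolumes: list of (features, (t0, t1, y0, y1, x0, x1)) tuples
    w, b:       a linear instance classifier (e.g. learned under MIL)
    Returns the clip score (max over instances) and the per-instance scores.
    """
    scores = []
    for feats, region in subvolumes:
        h = extract_bof_histogram(feats, vocab)
        scores.append((float(w @ h + b), region))
    clip_score = max(s for s, _ in scores)  # MIL: a clip is positive if any instance is
    return clip_score, scores

def saliency_map(scores, video_shape):
    """Aggregate instance scores into a dense space-time saliency map."""
    sal = np.zeros(video_shape, dtype=float)
    count = np.zeros(video_shape, dtype=float)
    for s, (t0, t1, y0, y1, x0, x1) in scores:
        sal[t0:t1, y0:y1, x0:x1] += s
        count[t0:t1, y0:y1, x0:x1] += 1
    return sal / np.maximum(count, 1.0)
```

The max-over-instances rule captures the weak supervision: only the clip label is known, and the clip is deemed positive if at least one of its subvolumes fires.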
Current state-of-the-art action classification methods derive action representations from the entire video clip in which the action unfolds, even though this representation may include parts of actions and scene context which are shared amongst multiple classes. For example, different actions involving hand movements may be performed whilst walking, against a common background. In this work, we propose an action classification framework in which discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned sub-action models are used to simultaneously classify video clips and to localise actions in space-time. Each subvolume is cast as a BoF instance in an MIL framework, which in turn is used to learn its class membership. We demonstrate quantitatively that the classification performance of the proposed algorithm is comparable to, and in some cases superior to, the current state-of-the-art on the most challenging video datasets, whilst additionally estimating space-time localisation information.
The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies. State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time features are aggregated globally into a single histogram. The strategies chosen to sample features and to construct the visual vocabulary are critical to performance; indeed, they often dominate it. In this work we provide a critical evaluation of various approaches to building a vocabulary and show that good practices have a significant impact. By subsampling and partitioning features strategically, we are able to achieve state-of-the-art results on five major action recognition datasets using relatively small visual vocabularies.
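As a rough illustration of the vocabulary-generation step, the Python sketch below subsamples local descriptors, clusters each feature partition (e.g. one per descriptor channel) separately, and concatenates the resulting centres into one visual vocabulary. The function names, parameter values and the use of scikit-learn's standard k-means are assumptions for illustration, not the exact procedure evaluated in the work.

```python
import numpy as np
from sklearn.cluster import KMeans

def subsample_features(features, n_samples, seed=0):
    """Randomly subsample local descriptors before clustering."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=min(n_samples, len(features)), replace=False)
    return features[idx]

def build_vocabulary(features_by_partition, words_per_partition=256, n_samples=100_000):
    """Cluster each feature partition separately and concatenate the visual words."""
    centres = []
    for feats in features_by_partition:
        sample = subsample_features(feats, n_samples)
        km = KMeans(n_clusters=words_per_partition, n_init=4, random_state=0).fit(sample)
        centres.append(km.cluster_centers_)
    return np.vstack(centres)  # final vocabulary: (n_partitions * words_per_partition, d)

def bovw_histogram(features, vocab):
    """Global bag-of-visual-words histogram for one video clip."""
    dists = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Subsampling keeps clustering tractable on large datasets, while partitioning the features before clustering yields a compact concatenated vocabulary rather than one monolithic codebook.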
Funded by: Intelligent Transport Systems Doctoral Training Programme

Lab Member(s): Michael Sapienza, Philip Torr