We proposed an action classification framework in which discriminative action subvolumes are learned in
a weakly supervised setting, owing to the difficulty of manually labelling massive video
datasets. The learned sub-action models are used to simultaneously classify video clips
and to localise actions in space-time. Each subvolume is cast as a bag-of-features (BoF) instance in a
multiple-instance-learning (MIL) framework, which in turn is used to learn its class membership. We demonstrate
quantitatively that the classification performance of the proposed algorithm is comparable to,
and in some cases superior to, the current state-of-the-art on the most challenging
video datasets, whilst additionally estimating space-time localisation information.
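The idea of casting each subvolume as a BoF instance within a bag can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the codebook, descriptors, and the linear instance scorer `w` are all synthetic placeholders, and the max-over-instances rule is the standard MIL aggregation used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bof_histogram(features, codebook):
    """Quantise local space-time descriptors against a codebook and
    return an L1-normalised bag-of-features histogram."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy setup: a codebook of 8 visual words in a 5-D descriptor space,
# and one video clip split into 4 candidate subvolumes.
codebook = rng.normal(size=(8, 5))
subvolumes = [rng.normal(size=(30, 5)) for _ in range(4)]

# Each subvolume becomes one instance in the bag representing the clip;
# under MIL, a positive clip needs only one positive instance.
bag = np.stack([bof_histogram(f, codebook) for f in subvolumes])

# With a (hypothetical) linear instance scorer w, the clip-level score
# is the max over instances; the arg-max doubles as a localisation cue.
w = rng.normal(size=8)
instance_scores = bag @ w
clip_score = instance_scores.max()
best_subvolume = instance_scores.argmax()
```

The max-aggregation is what lets a clip-level label supervise instance-level (subvolume) scores, which is the weakly supervised aspect referred to above.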
The recent trend in action recognition is towards larger datasets, an increasing number of action classes and larger visual vocabularies.
State-of-the-art human action classification in challenging video data is currently based on a bag-of-visual-words pipeline in which space-time
features are aggregated globally to form a histogram. The strategies chosen to sample features and construct a visual vocabulary are
critical to, and often dominate, overall performance. In this work we provide a critical evaluation of various approaches to building a vocabulary
and show that good practices do have a significant impact. By subsampling and partitioning features strategically, we are able to achieve
state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.
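A common way to build such a vocabulary is to pool descriptors across the training clips, subsample them, and cluster the subsample. The sketch below shows this with plain k-means (Lloyd's algorithm); the function name, sample size, and vocabulary size are illustrative assumptions, not values from the evaluation above.

```python
import numpy as np

rng = np.random.default_rng(1)

def build_vocabulary(features, k=32, n_sample=500, n_iter=10):
    """Build a small visual vocabulary by (1) randomly subsampling the
    pooled descriptors and (2) clustering the subsample with k-means."""
    # Subsampling keeps clustering tractable on large video datasets.
    idx = rng.choice(len(features), size=min(n_sample, len(features)), replace=False)
    sample = features[idx]
    # Initialise centres from random sample points, then run Lloyd iterations.
    centres = sample[rng.choice(len(sample), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(sample[:, None] - centres[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = sample[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres

# Toy pooled descriptors (stand-ins for HOG/HOF-style local features).
features = rng.normal(size=(2000, 16))
vocab = build_vocabulary(features, k=32)  # one row per visual word
```

The point made in the abstract is that choices at exactly this stage (how many descriptors to subsample, how to partition them before clustering, and how large `k` is) can matter more than the downstream classifier.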
We devised an action classification framework in which actions are modelled by discriminative
subvolumes, learned using weakly supervised training. The learned action models are used to simultaneously
classify video clips and to localise actions by aggregating the subvolume scores to form a dense space-time
saliency map. Each subvolume gives rise to a bag-of-features (BoF) instance in a multiple-instance-learning
(MIL) framework. We show that by using general video subvolumes we are able to achieve better performance than
the state-of-the-art BoF baseline, whilst being able to localise space-time actions even in the most challenging