Learning discriminative space-time actions from weakly labelled videos

Michael Sapienza, Fabio Cuzzolin and Philip H.S. Torr
Proceedings of the British Machine Vision Conference (BMVC'12), Guildford, UK, September 2012


Current state-of-the-art action classification methods extract feature representations from the entire video clip in which the action unfolds. However, this representation may include irrelevant scene context and movements which are shared amongst multiple action classes. For example, a waving action may be performed whilst walking; if the walking movement and scene context also appear in other action classes, then they should not be included in a waving movement classifier. In this work, we propose an action classification framework in which more discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple-instance-learning framework, which in turn is used to learn its class membership. We demonstrate quantitatively that, even with single fixed-sized subvolumes, the classification performance of our proposed algorithm is superior to the state-of-the-art BoF baseline on the majority of performance measures, and shows promise for space-time action localisation on the most challenging video datasets.
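The multiple-instance view described above can be illustrated with a toy sketch. This is a hypothetical stand-in, not the paper's actual learner: where the paper uses BoF histograms of space-time features with an SVM-based MIL formulation, the sketch below uses a simple mi-perceptron over synthetic 4-bin histograms. Each video is a bag of subvolume descriptors carrying only a bag-level label, and the bag score is taken as the score of its best instance, so the learner is pushed towards the discriminative subvolume rather than the shared context.

```python
import numpy as np

def bag_score(w, bag):
    """Score a video bag: `bag` is an (n_subvolumes, dim) array of BoF
    histograms; under the MIL assumption the bag score is the score of
    its best (most discriminative) instance."""
    return float(np.max(bag @ w))

def train_mil_perceptron(bags, labels, epochs=50, lr=0.1):
    """Toy mi-perceptron (illustrative only): each update is driven by
    the highest-scoring 'witness' instance in the bag, which stands in
    for the discriminative action subvolume. Labels are +1 / -1 at the
    bag level only; individual subvolumes are never labelled."""
    w = np.zeros(bags[0].shape[1])
    for _ in range(epochs):
        for bag, y in zip(bags, labels):
            scores = bag @ w
            i = int(np.argmax(scores))   # witness instance
            if y * scores[i] <= 0:       # bag misclassified -> update
                w += lr * y * bag[i]
    return w
```

After training, `np.argmax(bag @ w)` also localises the winning subvolume inside each clip, mirroring the simultaneous classification and localisation described in the abstract.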

BibTeX entry

@INPROCEEDINGS{Sapienza12,
  AUTHOR = "Michael Sapienza and Fabio Cuzzolin and Philip H.S. Torr",
  TITLE = "Learning discriminative space-time actions from weakly labelled videos",
  BOOKTITLE = "Proceedings of the British Machine Vision Conference (BMVC)",
  YEAR = "2012"
}