Research Theme: Computer Vision
Live Projects
Deep learning for action detection
Deep prediction of future actions
Deep video captioning
Deep modelling of complex activities

We developed a deep learning framework able to localise multiple action instances in real time in the form of 'action tubes'. It has outperformed all competitors on accuracy, while demonstrating better than real-time performance (up to 52 fps).
For a summary of current action detection research, cf.
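As an illustration of the 'action tube' idea, the sketch below greedily links per-frame detections into spatio-temporal tubes by matching each live tube's last box to the highest-overlap detection in the next frame. This is a minimal, generic linking scheme for illustration only, not the lab's actual algorithm; the `iou_thresh` value and the additive tube score are assumptions.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_tubes(frame_dets, iou_thresh=0.3):
    """Greedily link per-frame detections into action tubes.

    frame_dets: list over frames; each frame is a list of (box, score) pairs.
    Returns a list of tubes, each a dict with start/end frames, boxes, score.
    (Hypothetical linking rule: extend the tube with the best-overlapping
    unmatched detection, otherwise start a new tube.)
    """
    tubes = []
    for t, dets in enumerate(frame_dets):
        unmatched = list(dets)
        for tube in tubes:
            if tube['end'] != t - 1:      # tube already terminated
                continue
            best, best_iou = None, iou_thresh
            for d in unmatched:
                o = iou(tube['boxes'][-1], d[0])
                if o >= best_iou:
                    best, best_iou = d, o
            if best is not None:          # extend the tube
                tube['boxes'].append(best[0])
                tube['score'] += best[1]
                tube['end'] = t
                unmatched.remove(best)
        for box, score in unmatched:      # leftover detections seed new tubes
            tubes.append({'start': t, 'end': t, 'boxes': [box], 'score': score})
    return tubes
```

Because each frame is processed as it arrives, a linker of this kind can run online, which is what makes real-time tube detection feasible.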

Emerging applications of artificial intelligence are bringing about important paradigm shifts in machine learning and computer vision. Machines need a comprehensive awareness of what takes place in complex environments, and must be able to use this understanding to make predictions about other machines' and humans' future behaviour.

We are working in collaboration with AI Labs, Bologna, and Prof Thomas Lukasiewicz, Oxford University, on a neuro-symbolic approach to video captioning able to combine our leading action detection technology and the latest advances in symbolic and ontological reasoning to deliver realistic natural language descriptions.

This is a wide-reaching effort, separately involving University Federico II and Huawei Technologies, which aims to extend deep learning approaches to complex activities formed by a number of coordinated 'atomic' actions. We seek in particular a novel deep learning formulation of part-based models, tailored to the spatio-temporal structure of videos.

Causal 3D CNNs
Sports footage analysis
Scene understanding and video inpainting

Recently, three-dimensional (3D) convolutional neural networks (CNNs) have emerged as the dominant method for capturing spatiotemporal representations. Such 3D CNNs, however, are non-causal: they exploit information from both the past and the future, whereas causal processing is a central requirement in many real-world, online applications. To address this limitation, we have devised a new architecture for the causal/online spatiotemporal representation of videos, which we call the Recurrent Convolutional Network (RCN).
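The causality constraint can be illustrated on a single temporal dimension: a causal convolution pads only on the past side, so the output at time t depends solely on frames 0..t. This is a minimal 1-D NumPy sketch of the general principle, not the RCN architecture itself.

```python
import numpy as np

def causal_temporal_conv(x, w):
    """1-D causal convolution along time: y[t] = sum_k w[k] * x[t - k].

    x: (T,) temporal signal; w: (K,) kernel, w[0] weighting the current frame.
    Zero-padding is applied on the past side only, so no future frame is used.
    """
    K = len(w)
    xp = np.concatenate([np.zeros(K - 1), x])   # pad the past, never the future
    return np.array([np.dot(w, xp[t:t + K][::-1]) for t in range(len(x))])
```

Feeding an impulse through such a filter shows the response appearing at and after the impulse, never before it, which is exactly what an online video model requires.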

We are exploring collaborations with innovative startups in the Oxford-London area to develop deep learning-based systems capable of automatically annotating sports footage, in terms of single-player actions and overall team manoeuvres, in both batch and real-time settings.

As part of both our KTP with Supponor and of our SARAS EU project, we are looking at pixel-level scene understanding in videos, with a focus on temporal consistency across video frames and quality of the boundaries. We are interested in modelling scene structures using graphs to aid the labelling. In the sports context, the final goal is the inpainting of virtual content in real time.

Past Projects
Part-based video deformable models
Identity recognition from gait
Laplacian methods for 3D human motion analysis
Automated visual weld inspection

We proposed an action classification framework in which actions are modelled by discriminative subvolumes, learned using weakly supervised training. The learned action models are used to simultaneously classify video clips and to localise actions by aggregating the subvolume scores to form a dense space-time saliency map.
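The aggregation step can be pictured as follows: each discriminative subvolume votes its score into the space-time region it covers, and the accumulated votes form the dense saliency map. This is a hedged, simplified sketch (additive accumulation over axis-aligned subvolumes is an assumption), not the published method.

```python
import numpy as np

def saliency_map(shape, subvolumes):
    """Accumulate subvolume scores into a dense space-time saliency map.

    shape: (T, H, W) of the video volume.
    subvolumes: list of ((t0, t1, y0, y1, x0, x1), score) pairs, with
    half-open extents; each subvolume adds its score to the voxels it covers.
    """
    sal = np.zeros(shape)
    for (t0, t1, y0, y1, x0, x1), score in subvolumes:
        sal[t0:t1, y0:y1, x0:x1] += score
    return sal
```

Thresholding or taking connected components of the resulting map then yields a space-time localisation of the action.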

In this EPSRC-funded project we studied the design of a novel class of multilinear/tensorial classifiers able to linearly model the influence of several covariate factors, in a robust approach to identity recognition from gait.

While at INRIA Rhône-Alpes, Prof Cuzzolin studied the use of Laplacian embeddings in the analysis of human motion, both in terms of automated body-part segmentation and tracking, and of robust matching of deformable bodies.

In this Knowledge Transfer Partnership, the lab collaborated with Meta Vision LTD on the design and implementation of an inspection framework for the detection and localisation of weld defects from reconstructed 3D surfaces.

Example-based pose estimation
Gesture recognition from depth cameras
Unsupervised action localisation

In example-based pose estimation, the configuration or "pose" of an evolving object is sought given visual evidence, relying solely on a set of examples. A sensible approach consists in learning maps from features to poses using the information provided by the training set, and fusing features expressed as belief functions. We call this approach Belief Modelling Regression.
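The fusion of evidence expressed as belief functions classically relies on Dempster's rule of combination, which multiplies the masses of compatible focal elements and normalises away the conflicting mass. The sketch below is a textbook implementation over a small finite frame, shown for illustration; it is not the Belief Modelling Regression pipeline itself.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    m1, m2: dicts mapping focal elements (frozensets) to masses summing to 1.
    Compatible focal elements contribute m1(A) * m2(B) to their intersection;
    the total conflicting mass is normalised away.
    """
    combined, conflict = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + a * b
        else:
            conflict += a * b
    norm = 1.0 - conflict
    return {A: v / norm for A, v in combined.items()}
```

In the pose-estimation setting, each feature map contributes one mass function over candidate poses, and their combination concentrates belief on poses supported by all features.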

Thanks to visiting students De Rosa and Jetley, we also worked on action and gesture recognition from depth cameras, in particular via a tessellation of the feature space into local classifiers that jointly approximate the optimal Bayes classifier.
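A simple way to picture such a tessellation: partition the feature space into cells around a set of prototypes, and let each cell carry its own local decision. The sketch below uses nearest-prototype cells with a per-cell majority label, which locally approximates the Bayes decision; the prototypes and the majority rule are illustrative assumptions, not the published classifier.

```python
import numpy as np

def fit_tessellated(X, y, prototypes):
    """Assign training points to nearest-prototype cells; store each cell's
    majority label as its local decision (cells with no data get label -1)."""
    cell = np.argmin(((X[:, None] - prototypes[None]) ** 2).sum(-1), axis=1)
    labels = {}
    for c in range(len(prototypes)):
        ys = y[cell == c]
        labels[c] = int(np.bincount(ys).argmax()) if len(ys) else -1
    return labels

def predict_tessellated(X, prototypes, labels):
    """Classify each point with the local decision of its nearest cell."""
    cell = np.argmin(((X[:, None] - prototypes[None]) ** 2).sum(-1), axis=1)
    return np.array([labels[c] for c in cell])
```

As the tessellation is refined, each cell's local decision tracks the locally dominant class, which is the sense in which the ensemble approximates the Bayes classifier.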

When training annotation on the location of actions is not available, weakly supervised learning can be employed, in combination with unsupervised video segmentation, to locate actions of interest within an input video.
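One common weakly supervised pattern, shown here only as an illustration of the general idea rather than of this project's method, is multiple-instance aggregation: score every candidate segment, pool the scores (e.g. by max) into a video-level prediction that the video-level label can supervise, and read the localisation off the highest-scoring segment.

```python
import numpy as np

def video_score_from_segments(segment_scores):
    """Multiple-instance aggregation for weak supervision.

    segment_scores: (num_segments, num_classes) array of per-segment class
    scores. Max-pooling over segments yields a video-level score that a
    video-level label can supervise; the argmax segment localises each class.
    """
    video_score = segment_scores.max(axis=0)        # (num_classes,)
    localisation = segment_scores.argmax(axis=0)    # best segment per class
    return video_score, localisation
```

Combined with unsupervised segmentation to propose the candidate segments, this lets location emerge from video-level labels alone.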