Action and gesture recognition by imprecise-probabilistic graphical
  models (ActionIGM)

EU FP7 - STREP : Small and Medium Scale Research Project
Submitted January 18, 2011

Section 1: Scientific and/or technical quality, relevant to the topics addressed by the call

1.1 Concept and objectives

1.1.1 Outline of the main project ideas

Action and gesture recognition. Developing robust tools for the recognition of human actions and gestures appears to be a crucial step towards the full development of more natural ways of interacting between humans and machines. The range of scenarios that such methodologies can open up is impressive: ergonomic human-machine interfaces for computers and game controllers; smart rooms in which people are assisted in their everyday activities by distributed intelligence in their own homes; assistance to non-autonomous elderly people; surveillance scenarios in which large public spaces need to be monitored at a distance; and novel non-cooperative behavioral biometric techniques, such as gait identification, in which people are identified from the way they walk. As an example, the newest generation of controllers (like the movement-sensing Kinect controller by Microsoft) is already an important achievement in this direction. Yet, these tools are only intended to track the movements of the user, without any real elaboration and interpretation of their actions. Action recognition represents a further sophistication in the interaction with computers. As an analogy (still well within the field of interfaces), what we want to do for contact-less interfaces would bring the same kind of improvement that the new generation of mice (like Apple’s Magic Mouse) has brought over traditional point-and-click mice.

Figure 1. Some of the many applications of action recognition: virtual reality, human-computer interaction, surveillance, entertainment.

Crucial issues with action recognition. While the formulation of the problem is simple and intuitive, activity recognition is a hard problem. Motions inherently possess an extremely high degree of variability. Movements quite different from each other can carry the same meaning, or represent the same gesture. In addition, motions are subject to a large number of nuisance or “covariate” factors [37], such as illumination, moving background, and camera(s) viewpoint; the list goes on and on. If we remove the assumption that a single motion of interest is present in our video(s), locality emerges as a critical factor too: for instance, a person can walk and wave at the same time, and often does. If we also remove the assumption of having a single body moving in the field of interest, problems like occlusion assume a critical importance: self-occlusion is always a problem. On the other hand, the presence of one or more (static) objects in the vicinity of the motion can effectively help to disambiguate the activity class (recognition “in context”). Sometimes, very few instances of each action category are available. Video data are subject to a number of sources of uncertainty: frames can be missing or misprocessed; there is inherent ambiguity associated with the perspective projection model; and occlusions lead to missing or partial data. Finally, learning a larger number of action categories requires the collection of huge training sets, which poses the problem of how to classify large amounts of complex data; conversely, being able to train a working system from a limited training set would greatly improve user experience in entertainment or human-machine interface scenarios.

Importance of dynamics. Recently, recognition methods which neglect action dynamics (typically extracting spatio-temporal features from the 3D volume associated with a video) have proven very effective. However, encoding the dynamics of videos or image sequences by means of some sort of dynamical model can be useful in situations in which the dynamics is critically discriminative. The actions of interest have to be temporally segmented from a video sequence. Furthermore, in situations in which a significant number of people ambulate in the field of view we need to shift from tracking single objects/bodies to approaches in which the monitored crowd is considered as some sort of fluid: dynamical models are well equipped to deal with such scenarios. In these scenarios, action (or identity) recognition reduces to classifying dynamical models. Various classes of dynamical models have been proposed in the past for gesture or action recognition, ranging from hidden Markov models to linear, nonlinear, stochastic or even chaotic dynamical systems, later classified by measuring distances in their space. Sophisticated graphical models can be useful to learn in a bottom-up fashion the temporal structure or plot of a footage, or to describe causal relationships in complex activity patterns.
Unfortunately, classical dynamical models have sometimes been found too rigid to describe complex activities, or prone to overfitting to the existing training examples. This is because they typically have to estimate single, “precise” probability distributions describing, for instance, the transitions between the states of a Markov model (even a hierarchical HMM), or the state-output distribution associated with each state in the space of observations.

Figure 2. Issues with action recognition, from top left to bottom right: viewpoint, illumination, occlusion, presence of objects, missing data.

Imprecise probabilities and imprecise-probabilistic graphical models. The key to outperforming existing algorithms for action recognition is a more sophisticated modelling technique. The theory of imprecise probabilities [82] and the extension of graphical models to allow the manipulation of whole convex sets of probabilities [81] provide a way of overcoming overfitting issues, while allowing the resulting classifier to handle missing or partial data in a natural way, producing a system which is more robust to the sources of uncertainty and the various issues mentioned above.
In decision making and estimation, the state of the world is assumed to be described by a probability distribution over a set of alternative hypotheses. Decision and estimation then require this distribution to be estimated from the data. Sometimes, however, as in the case of extremely rare events (e.g., a volcanic eruption), few statistics are available to drive the estimation. Observed data can be missing. Furthermore, under the law of large numbers, probability distributions are the outcome of an infinite process of evidence accumulation, while in all practical cases the available evidence only provides some sort of constraint on the unknown, “true” probability governing the process. Different kinds of constraints are associated with different generalisations of probabilities, such as interval probabilities [80], convex sets of probabilities or credal sets [81], Shafer's theory of belief functions [83,84], coherent lower previsions [86], and others. We focus on graphical models which manipulate convex sets of probabilities as the most suitable generalisation of classical probabilistic graphical models.
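To make the notion of a credal set concrete, consider the simplest case: interval constraints on a finite probability mass function. The sketch below (illustrative Python; the function names are ours, not part of the proposal) computes the lower and upper expectation of a function f over the credal set of all mass functions satisfying the interval bounds, using the greedy mass-allocation argument valid for box constraints on the probability simplex.

```python
def lower_expectation(f, lo, hi):
    """Lower expectation of f over {p : lo[i] <= p[i] <= hi[i], sum(p) = 1}.

    Greedy scheme: start every coordinate at its lower bound, then pour
    the remaining mass into the coordinates with the smallest f-values.
    """
    p = list(lo)
    mass = 1.0 - sum(lo)
    for i in sorted(range(len(f)), key=lambda i: f[i]):
        extra = min(hi[i] - lo[i], mass)
        p[i] += extra
        mass -= extra
    return sum(fi * pi for fi, pi in zip(f, p))

def upper_expectation(f, lo, hi):
    # By conjugacy, the upper expectation of f is -E_lower[-f].
    return -lower_expectation([-fi for fi in f], lo, hi)
```

For instance, with f = [0, 1, 2] and every probability constrained to [0.2, 0.6], the expectation of f is only known to lie in the interval [0.6, 1.4]; a precise model would have to commit to a single value.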

Proposed breakthrough. We intend to adopt a novel class of dynamic probabilistic models, which are intended to capture the abstract nature of what different sequences depicting a particular action have in common, thus allowing for more accurate recognition performances. With respect to the other related approaches mentioned above, the distinctive feature of our approach is to move from the standard probabilistic framework to an imprecise-probabilistic framework in which the model probabilities are not forced to be quantified by sharp, “precise” numerical values, but can instead be allowed to vary in a set (an interval, for instance, or an entire convex set). This has crucial repercussions when learning the model from the video sequence, as this often consists in learning probabilities from a relatively small and incomplete dataset: a situation where imprecise-probabilistic approaches have been shown to provide much more robust and reliable quantifications that rely much less heavily on prior assumptions. Our plan is to develop the theoretical tools required to allow inference of imprecise-probabilistic graphical models from training data, to develop suitable recognition algorithms based on the classification of such imprecise-probabilistic models, and finally to implement them as efficient software tools to be later embedded as firmware in commercial cameras.

Figure 3. A classical hidden Markov model is defined by a transition matrix between its finite states, and an emission probability in the observation space Y for each state. Its imprecise-probabilistic version allows entire convex sets of transition and emission probabilities.
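As an illustration of how such interval-valued transition (or emission) probabilities can be learnt from scarce data, the sketch below applies the imprecise Dirichlet model (IDM), a standard tool in imprecise probability: each probability estimated from counts n out of a total N is bracketed between n/(N+s) and (n+s)/(N+s), where s is a prior-strength parameter. This is a generic technique, not the proposal's specific learning algorithm, and the function name is ours.

```python
def idm_intervals(counts, s=2.0):
    """Interval estimates [lower, upper] for each outcome probability
    under the imprecise Dirichlet model with prior strength s."""
    total = sum(counts)
    return [(n / (total + s), (n + s) / (total + s)) for n in counts]

# Hypothetical transition counts out of one hidden state: 3 self-loops,
# 1 transition to state 2, and no transitions to state 3 observed.
intervals = idm_intervals([3, 1, 0])
```

With only four observations, the model refuses to rule out a transition it has never seen: the third probability is bracketed in [0, 1/3] rather than being estimated as exactly zero, which is precisely the robustness to small training sets discussed above.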

1.1.2 Scientific objectives

In a nutshell, we plan to achieve the following objectives:

  • bringing to full maturity the theory of inference on imprecise-probabilistic graphical models, with a special focus on imprecise hidden Markov models, due to the relevance of such models to gesture and action recognition, but later extended to arbitrary imprecise-probabilistic graphical models;

  • proving that modelling videos with imprecise-probabilistic models makes it possible to cope with some of the major issues of action and gesture recognition, such as the presence of numerous covariate factors, occlusions and self-occlusions, missing data within a video sequence, and the more general lack of sufficient data in terms of datasets large enough to allow sensible recognition, resulting in a dramatic increase in the robustness of the overall system;

  • developing a related theory of classification of large amounts of video sequences, represented as imprecise-probabilistic dynamical graphical models, by means of metric learning techniques, the structured learning approach to SVM classification, and possibly the application of the randomised projections developed in the context of compressed sensing to deal with classification of large datasets;

  • studying new ways of selecting and extracting robust features from video data coming in multiple modalities, with a focus on the new range cameras that are flooding the market at this very moment (e.g., Microsoft's Kinect);

  • producing a working prototype of an overall robust, all-purpose action and gesture (and identity) recognition system, tested on both internally gathered and publicly available datasets, and validated under the manifold scenarios illustrated above: human-machine interaction, surveillance, gaming, smart rooms, gait identification; and

  • proceeding to a firmware implementation of the algorithms generated by the proposed research, as a first step towards the possible commercialisation/patenting of the techniques developed in the course of the project.
In Section 1.3 we will further detail how these goals are to be reached in terms of a number of partial scientific objectives and tasks.

1.1.3 Relation to the topics addressed by the call

The scientific objectives we set for our project are clearly reflected in the ICT work programme. In particular, they relate to Challenge 3, “Alternative Paths to Components and Systems”, Objective ICT- 2011.3.2, “Smart components and smart systems”, with a focus on target outcome a), “Future smart components and smart systems”.
As stated there,
  • Challenge 3 "covers ... cooperating complex systems".

    This is the case of our proposed integrated system for robust action and gesture recognition from range and conventional cameras;
More specifically:
  • "The integration of new functionalities for the next generation of application-specific components and smart systems through the convergence of microelectronics, nanomaterials, biochemistry, measurement technology and ICT."

    The use of the new range camera technology, in integration with advances in the theoretical understanding of imprecise-probabilistic graphical models, can provide new functionalities for next-generation computers via human-machine interaction, for surveillance systems via semi-automatic alert systems, and for biometrics via automatic identity recognition from gait;
In particular we seek to target Objective ICT-2011.3.2: Smart components and smart systems integration.
The objective is about:
  • "Smart (miniaturised) systems [that] have the ability to sense, describe, and qualify a given situation, as well as to mutually address and identify each other. They are able to predict, decide or help to decide, and to interact with their environment by using highly sophisticated interfaces between systems and users."

    All these functionalities are addressed in this project. Through the use of conventional and range cameras, the proposed integrated system will be able to sense motion in the external world, describe or represent such motion in terms of sophisticated imprecise-probabilistic graphical models, decide (or help to decide) what action, gesture or identity is involved in the motion(s), and interact with humans as a means for human-computer interaction, or as an aid for surveillance monitoring.
Target outcome a), Future smart components and smart systems, is about
  • "technologies, processes, manufacturing techniques and design methods for: integrated smart systems with advanced functionality, autonomously operating ... smart systems, robust systems, compatible and adaptive to environment".

    Clearly, what we are proposing here is a system able to operate autonomously, but also to provide assistance to human agents; it is a robust methodology designed to cope with high uncertainty, missing or partial data, and insufficient information due to training sets too small to ordinarily allow sensible decisions.
Finally, as explicitly mentioned in the work programme:
  • "Advanced Functionalities include ... Human-Machine Interfacing using gesture, tactile and motion detection"; "Basic Methodologies include: New architectures for devices and smart components that can fulfil the complexity and the very advanced, very high performance requirements"

    Human-machine interfacing is the single most important application of the proposed theoretical framework and the associated prototype; tackling complexity in a smart way is the trademark of the present project.
1.1.4 Feasibility and timing

The proposed research requires a broad range of skills: a strong theoretical background in AI, and in particular in uncertain reasoning; abilities in signal processing, and in particular in image analysis, in order to transform the frames of a video sequence into quantitative information; and experience in the software (and firmware) implementation of such tools. Nevertheless, we maintain that the objectives we have set are well within reach on the proposed time scale of three years.
The consortium (with IDSIA and SYSTeMS, in particular) features partners who are among the pioneers of the study of inference via imprecise Markov models and imprecise graphical models in general. The aim of bringing to full development learning and inference algorithms such as Expectation-Maximisation (EM) for imprecise-probabilistic models is very ambitious, but within reach for two of the top groups in the area. The timescale and resources we request are, we believe, adequate to reach such goals.
The consortium (via OBU) also possesses world class expertise in feature selection and extraction, fast classification of large amounts of images and video via graph cut methods and SVMs, gesture and identity recognition from gait. The group has a wealth of industrial partnerships with top companies such as Sony and Microsoft research which will be extremely valuable to maximise impact, and can provide existing infrastructure in terms of range camera acquisition and motion capture validation.
Finally, the presence of a valuable industrial partner such as MARCH ensures that the consortium will be able to deliver a fully functional prototype, which will be used to disseminate the outcomes of the project and kick-start follow-up research activities, novel industrial partnerships, and possibly patenting. For all these reasons we are confident that we have set ourselves ambitious but realistic objectives which will be achieved within the project, while a host of follow-up activities is likely to be generated.

1.1.5 Milestones

The scientific output of the project will be measured primarily (though not exclusively) in terms of high-impact publications. Feasible targets are Computer Vision and Pattern Recognition (CVPR) 2012, 2013 and 2014 (deadline normally in November) and Neural Information Processing Systems (NIPS) 2012, 2013 and 2014 (deadline each year in June) for publication of the preliminary results. Towards the end of the project we intend to organise workshops and special sessions on the project and related topics at major conferences such as ICCV and ECCV. We of course intend to consolidate the project’s outcome in articles to be submitted to high-impact journals in computer vision such as IJCV or PAMI.
At project completion a toolbox collecting the routines implementing the different stages of the action and gesture recognition framework will be made available on the internet to all scientists working in either behavioral biometrics, activity recognition or video indexing, in order to disseminate our results. We will conduct thorough tests on all the available public databases with the goal of attaining improvements on the current state of the art on these benchmark test-beds. We will as well run tests on surveillance and gait identification test-beds.
As a result of the research proposed here, additional extensive databases in different modalities such as synchronised cameras and range cameras will later be made public. This will be done through a dedicated dissemination and management web site.
In case the results are in line with our expectations we will explore the possibility of patenting our results in part (especially their firmware implementation). All these aspects are described in much more detail later in the proposal.

1.2 Progress beyond the state-of-the-art

1.2.1 Action, activity and gesture recognition

Recognising human activities from video is a natural application for computer vision. Since the classic experiment of Johansson showing that moving light displays were enough for people to recognise motions or even identities, this topic has been of increasing interest in the vision community. The problem consists in telling, given one or more image sequences capturing one or more people performing an activity, what category (among those previously learned) this activity belongs to. A useful even though quite vague distinction is that between “actions”, meant as simple (usually stationary) motion patterns, and “activities” considered to be more complex sequences of actions, sometimes overlapping in time as they are performed by different parts of the body or different agents in the scene. The activity recognition problem involves, as we have seen, numerous challenges at different levels of representation.

Major issues with action recognition. Though the formulation of the problem is simple and intuitive, activity recognition is a much harder problem than it may look. Motions inherently possess an extremely high degree of variability. Movements quite different from each other can in fact carry the same meaning, or represent the same gesture. Even in perfectly controlled situations (constant illumination, static background) this inherent variability makes recognition hard (e.g., the trajectory followed by a walking person is irrelevant to the classification of the action they perform). In addition, motions are subject to a large number of nuisance or “covariate” factors [37], such as: illumination, background, viewpoint. The list goes on and on. The combination of inherent variability and nuisance factors explains why experiments have been so far conducted in small, controlled environments to reduce their complexity.
Recently, attempts have been made to go beyond such controlled environments. Liu, Luo and Shah [37] have taken on the challenge of recognising actions “in the wild” and coping with the tremendous nuisance variations of unconstrained videos. Motion statistics are used to prune motion and static features. Adaboost is chosen to integrate all those single-feature classifiers. The recognition algorithms have been tested on the KTH dataset and on YouTube videos. Hu et al [59] have also investigated action detection in complex, cluttered scenes, representing candidate regions as bags of instances. In their SMILE-SVM framework, human action detectors are learnt based on imprecise action locations. They test their approach on both the CMU dataset and on videos shot at a shopping mall. Yao and Zhu [58] learn deformable action templates from cluttered real-world videos. Such action templates are sequences of image templates consisting of a set of shape and motion primitives. Templates are deformed to match KTH and CMU videos by space-time warping.
If we remove the assumption that a single motion of interest is present in our video, locality emerges too as a critical factor. For instance, a person can walk and wave at the same time. Different actions performed by different agents can go on in different regions of the image sequence, without being necessarily synchronised or coordinated. Gilbert et al [27], for instance, perform multi-action recognition using very dense spatiotemporal corner features in a hierarchical classification framework, testing it on the multi-KTH dataset and the Real-World Movie dataset. Reddy et al [45] propose to cope with incremental recognition using feature trees that can handle multiple actions without intensive training. Local features are recognised by neural networks, while actions are later labelled using a simple voting strategy. They use the KTH and IXMAS datasets.
If we also remove the assumption of having a single body moving in the field of interest, problems like occlusion assume a critical importance. On the other hand, the presence of one or more (static) objects in the vicinity of the motion can effectively help to disambiguate the activity class (recognition “in context”). Marszalek, Laptev and Schmid [39] apply a joint SVM classifier of action and scene to bags of features, using movie scripts as means of automatic supervision for training (as proposed in [21]) on the newly proposed “Hollywood movies” database. Han et al [29], for instance, introduce bag-of-detectors scene descriptors to encode such contextual presence, and use Gaussian processes to select and weight multiple features for classification on the KTH and Hollywood datasets. Sun et al [53] use the Hollywood and LSCOM databases to model and test spatio-temporal context in a hierarchical way, using SIFT for point-level context and Markov processes to model trajectory proximity and transition descriptors, on top of which they apply multi-channel non-linear SVM.
Conversely, sometimes very few instances of each action category are available. Seo and Milanfar [49] detect actions from single examples by computing dense space-time local regression kernels, which are used as descriptors and passed through PCA to extract salient features. A matrix generalisation of cosine similarity is then used to compare videos on the Irani database.

Bag-of-features methods on spatio-temporal volumes. Recently, recognition methods that neglect action dynamics (typically extracting spatio-temporal [6] features from the 3D volume associated with a video [32]) have proven very effective. Kim and Cipolla [32] do spatio-temporal pattern matching, representing volumes as tensors; they propose an extension of canonical correlation analysis to tensor analysis to detect actions via a 3D window search with exemplars on the KTH dataset. Bregonzio et al [7] also use a global spatio-temporal distribution of interest points, extracting and selecting “holistic” features from clouds of interest points over multiple temporal scales. Rapantzikos et al [44] also use dense spatio-temporal features detected using saliency measures, which incorporate color, motion and intensity, in a multi-scale volumetric representation. Yuan et al [60] describe actions as collections of spatio-temporal invariant features, and propose a naive Bayes mutual information maximisation method for multi-class recognition. A search algorithm is also proposed to locate the optimal 3D subvolume in which the action can be detected. Experiments are run on the KTH and CMU databases. Some effort has been put into recognition from single images too (e.g. [30], where actions are learnt from static images taken from the web).

Dynamical models in action recognition. However, encoding the dynamics of videos or image sequences by means of some sort of dynamical model can be useful in situations in which the dynamics is critically discriminative. Furthermore, the actions of interest have to be temporally segmented from a video sequence: we need to know when an action/activity starts or stops. Actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared (“time warping”). Dynamical representations are very effective in coping with time warping or action segmentation [51]. Furthermore, in extreme situations where a significant number of people move or ambulate in the field of view (as is common in surveillance scenarios), the attention needs to shift from single objects/bodies to approaches that consider the monitored crowd as some sort of fluid, and describe its behaviour in a way similar to physical fluid modelling. Dynamical models are well equipped to deal with such scenarios.

In these scenarios, action (or identity) recognition reduces to classifying dynamical models. Hidden Markov models [23] have indeed been widely employed in action recognition [43, 51] and gait identification [54, 9].
HMM classification can happen either by evaluating the likelihood of a new sequence with respect to the learnt models, or by learning a new model for the test sequence through the Expectation-Maximisation (EM) algorithm [67], measuring its distance from the old models, and attributing to it the label of the closest one. Indeed, many researchers have explored the idea of encoding motions via linear [5], nonlinear [25], stochastic [42, 24] or chaotic [1] dynamical systems, and classifying them by measuring distances in their space. Chaudhry et al [10], for instance, have used nonlinear dynamical systems (NLDS) to model time series of histograms of oriented optical flow, measuring distances between NLDS by means of Cauchy kernels, while Wang and Mori [56] have proposed sophisticated max-margin conditional random fields to address locality by recognising actions as constellations of local motion patterns. Sophisticated graphical models can be useful to learn in a bottom-up fashion the temporal structure or plot of a footage, or to describe causal relationships in complex activity patterns [38]. Gupta et al [28] work on determining the plot of a video by discovering causal relationships between actions, represented as an AND/OR graph whose edges are associated with spatio-temporal constraints.
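In the precise case, the likelihood-based classification scheme just described can be sketched in a few lines. The code below is an illustrative Python sketch (the parameter layout and function names are ours): it scores a discrete observation sequence against each learnt HMM with the scaled forward algorithm and returns the label of the most likely model.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log p(obs | HMM), where pi is the initial
    state distribution, A the transition matrix and B the (discrete)
    emission matrix. Scaling at each step avoids numerical underflow."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(1, len(obs)):
        c = sum(alpha)                      # scaling factor
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return loglik + math.log(sum(alpha))

def classify(obs, models):
    """models: dict mapping an action label to its (pi, A, B) triple."""
    return max(models, key=lambda lbl: forward_loglik(obs, *models[lbl]))
```

The alternative scheme (learning a new model for the test sequence via EM and comparing it to the stored models) replaces the `forward_loglik` score with a model-to-model dissimilarity.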

Issues with “precise” dynamical graphical models. Unfortunately, classical dynamical models have sometimes been found too rigid to describe complex activities, or prone to over-fitting to the existing training examples. This is because they typically have to estimate single, “precise” probability distributions describing, for instance, the transitions between the states of a hidden Markov model (even a hierarchical HMM), or the state-output or emission distribution associated with each state in the space of observations. Both transitions and emissions are described using probabilistic models: conditional mass functions. In order to use such models effectively, we need to be able to learn the transition and emission models from data and prior assumptions, and to efficiently reconstruct the hidden state sequence from the observation sequence that it emitted. There are efficient ways of dealing with this, involving, respectively, the EM method and the Viterbi algorithm [65]. A problem with this approach is that, especially when little data is available for learning the model (as is the case in action or gesture recognition), the results depend quite strongly on the prior assumptions (probabilities) about the behaviour of the dynamical model. Such priors are, by the nature of a model involving often abstract hidden states, very hard to elicit and/or justify by themselves.
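For completeness, the Viterbi recursion mentioned above can be sketched as follows (illustrative Python, not the project's implementation; log(0) is mapped to -inf so zero-probability transitions are simply never chosen).

```python
import math

def viterbi(obs, pi, A, B):
    """Most probable hidden-state sequence for a discrete HMM (pi, A, B)."""
    logp = lambda x: math.log(x) if x > 0 else float("-inf")
    n = len(pi)
    delta = [logp(pi[i]) + logp(B[i][obs[0]]) for i in range(n)]
    backpointers = []
    for t in range(1, len(obs)):
        new_delta, back = [], []
        for j in range(n):
            # Best predecessor state for landing in state j at time t.
            i_best = max(range(n), key=lambda i: delta[i] + logp(A[i][j]))
            back.append(i_best)
            new_delta.append(delta[i_best] + logp(A[i_best][j])
                             + logp(B[j][obs[t]]))
        delta = new_delta
        backpointers.append(back)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    return path[::-1]
```

On a two-state model with sticky transitions and mostly reliable emissions, the decoder recovers the intuitive segmentation: for instance, observations [0, 0, 1, 1] decode to states [0, 0, 1, 1].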

1.2.2 Imprecise probability theory

Imprecise-probabilistic graphical models. This points to the need for methods that lead to inferences that are more robust with respect to the choice of prior information, such as imprecise-probability models, which replace single probability distributions by closed convex sets of them, also called credal sets. When applied to probabilistic graphical models, a special case of which are HMMs, these credal sets lead to imprecise-probabilistic graphical models, more commonly called credal networks [66]. Despite a number of early successes [67], progress in this field has been hampered by the computational complexity of algorithms for doing inference in such networks. This has made finding computationally efficient or even feasible counterparts of the EM method and Viterbi algorithm very difficult. Quite recently, however, significant progress was made towards efficient exact inference algorithms in credal trees by looking at the problem from a different angle [62,63]. This research, initiated and actively pursued by members of the present consortium (SYSTeMS and IDSIA), opens up exciting new avenues towards the development of efficient EM and Viterbi-like algorithms for hidden Markov models with imprecise probabilities [64].

Advantages of imprecise-probabilistic graphical models for action recognition. The main indicator of the performance of an action recognition algorithm with respect to these benchmarks is clearly the accuracy rate. On the basis of this indicator, we intend to outperform the state-of-the-art algorithms by means of our more sophisticated probabilistic approach. Furthermore, an important feature of imprecise-probabilistic classification tools is their variable number of outputs. In practice, our recognition algorithm may return more than a single candidate action to explain the content of a video sequence. This could appear to be a weakness of our approach, when compared to “precise” classifiers always returning a single explanation for the action to be recognised. On the contrary, empirical comparisons generally show that the imprecise-probabilistic algorithm returns more than a single output precisely in those cases where the precise algorithm returns a wrong output. In other words, the imprecise-probabilistic algorithm is able to distinguish hard-to-recognise sequences from those for which the recognition is very likely to be correct. This appears to be a very desirable feature, especially for security applications.
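The set-valued behaviour described above corresponds to the interval-dominance criterion commonly used in credal classification: a candidate action is discarded only if some other action's lower score exceeds its upper score. A minimal sketch (illustrative Python; the interval scores are hypothetical and would come from the imprecise model):

```python
def undominated(scores):
    """Keep every label whose score interval is not strictly below the
    lower bound of some other label's interval (interval dominance).

    scores: dict mapping a label to a (lower, upper) score pair."""
    return sorted(
        c for c, (lo_c, hi_c) in scores.items()
        if not any(lo_d > hi_c for d, (lo_d, _) in scores.items() if d != c)
    )

# Hypothetical interval scores for one test sequence:
candidates = undominated({"walk": (0.60, 0.90),
                          "jump": (0.55, 0.80),
                          "run":  (0.20, 0.50)})
```

Here "run" is ruled out (its upper score 0.50 is below the lower score 0.60 of "walk"), while "walk" and "jump" are both returned: the classifier flags the sequence as hard to disambiguate instead of committing to a possibly wrong single answer.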

Classifying imprecise-probabilistic dynamical graphical models. In this framework, action classification reduces to the classification of imprecise-probabilistic (dynamical) graphical models. In the “precise” case, a number of distance functions have been introduced in the past (e.g., [52]), and a vast literature on dissimilarity measures for Markov models also exists [22], mostly concerning variants of the Kullback-Leibler divergence [33]. However, as models (or sequences) can be endowed with different labels (e.g., action, ID) while maintaining the same geometry, no single distance function can possibly outperform all the others in every classification problem.
A reasonable approach when possessing some a-priori information is therefore trying to learn in a supervised fashion the “best” distance function for a specific classification problem [3,4,48], for instance by employing differential-geometric methods [2,13,34]. The same approach can be proposed for imprecise-probabilistic graphical models. Dissimilarity measures for imprecise-probabilistic models can be derived from distances between convex sets of distributions. Kernel-based methods [69] or approaches based on the “structured learning” approach to SVM classification can be considered.
We will describe the proposed methodology in more detail in the next Section.

1.3 S/T methodology and associated work plan

1.3.1 Motivations of methodological choices

In order to describe our S/T methodology, let us first outline the particular features of the problem under consideration that motivate our methodological choices.

Dynamics and abstraction issues. Consider a number of video sequences depicting simple human actions like single men moving in a uniform environment. Say that some of them show people walking and some others jumping. Although any person could easily distinguish walks from jumps without any hesitation, let us consider the problem from an AI perspective. Implementing ad hoc algorithms to automatically perform this kind of identification seems relatively simple. We could for instance track the silhouette of the person in each frame of the sequence and then decide that he is walking if the trajectory of his center of mass has some regularity, and otherwise decide that he is jumping. Yet, this solution to the identification problem rests on a quantitative model based on our expert knowledge about what distinguishes a walk from a jump. An approach of this kind can produce very accurate identifications, but it will probably be lacking in terms of generality. To make our tool able to distinguish other kinds of actions as well, the expert knowledge modelling would typically require a revision (or a complete reformulation). Note for example that the information associated with the trajectory of the center of mass could be useless when trying to distinguish a walk from a run.
Following a pure data mining approach to this identification problem, we can instead try to face the problem without the support of any expert knowledge, and implement algorithms that extract the information needed to distinguish one action from another only from the data (i.e., the available sequences). Approaches of this kind are clearly more general: when considering different identification problems, the algorithm is the same, and only the input data are different. Apart from the quality of the algorithm, the accuracy of the recognition will depend on the amount of data and on their ability to describe the information about the different actions. In particular, we can focus on a supervised situation, where the available sequences are labelled by their corresponding action. This leads us to a standard classification problem, where the class variable denotes the particular kind of action depicted in the sequence.
Concerning the attributes, we typically extract information from the video sequence by some image analysis technique, to be applied separately to each frame of the sequence. Notably, when putting all the attributes together, the temporal nature of these data is neglected. This does not prevent approaches of this kind from achieving good performances. Yet, we believe that taking into account the temporal aspects by means of a dynamic (instead of a static) model could be the key to achieving better identification performances. Besides this emphasis on the dynamics, let us consider some aspects related to the level of abstraction we may require of our description of the problem. When deciding which kind of action is depicted in a particular video sequence, what we are trying to capture is a kind of abstract essence of the action. A very detailed description can capture lots of details which are not crucial for the identification, affecting both the accuracy and the computational efficiency of the recognition tools. For example, when distinguishing the action of a man in a blue t-shirt slowly walking from right to left from that of a woman in a green t-shirt quickly jumping from left to right, there are many unnecessary (and potentially misleading) details we had better ignore.

Hidden Markov models for action recognition. The above discussion suggests moving from a refined, low-level description of the sequence, like the one provided by the features we extract from each frame, to a more abstract description obtained by (somehow) processing this information, and then using such higher-level information as an input for the classification task. A possible approach in this direction is modelling actions as stochastic processes. Considering that the video is a (finite) sequence of frames, we can focus on discrete-time processes. We also assume the process to be Markovian, which means that, at each time, the configuration of a particular frame is only affected by that at the previous frame. Although clearly not true, especially for standard frame rates, this assumption seems to be a reasonable approximation. Note also that, during the second part of the project, we intend to consider models where the Markovian assumption is relaxed (see WP6). Furthermore, we describe the evolution of the process by means of a sequence of hidden (i.e., directly unobservable) variables, one for each frame. Similarly, we can model the information to be extracted from each frame (see WP7 on feature extraction) as a collection of manifest (i.e., observable in a perfectly reliable way) variables. For each frame, the correlation between the hidden variable and the relative manifest variables corresponds to the abstraction we put in our description. More specifically, the number of states we set for the hidden variables can be regarded as an indicator of the level of abstraction characterising the model. Overall, this is exactly the modelling provided by hidden Markov models (HMMs).
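As a minimal illustration of the modelling just described, the following sketch computes the likelihood of a discretised observation sequence under a toy two-state HMM via the standard forward recursion. The function name and all numerical values are purely illustrative, not learnt from video data:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Likelihood P(o_1, ..., o_T) of an observation sequence under a
    discrete HMM, computed by the forward recursion.

    pi : (n,) initial distribution over hidden states
    A  : (n, n) transition matrix, A[i, j] = P(s_{t+1} = j | s_t = i)
    B  : (n, m) emission matrix,   B[i, k] = P(o_t = k | s_t = i)
    obs: observation symbols in {0, ..., m-1}
    """
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # alpha_t = (alpha_{t-1} A) .* b(o_t)
    return float(alpha.sum())

# Toy two-state model (numbers are illustrative only)
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])
print(forward_likelihood(pi, A, B, [0, 1, 0]))  # ≈ 0.1048
```

In the imprecise-probabilistic setting sketched below, it is exactly this kind of quantity that is replaced by a lower and an upper value.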
One of the key concepts we want to emphasise in this proposal is that HMMs can describe the temporal nature of the data associated with a video sequence and, at the same time, provide the necessary level of abstraction to distinguish one action from another. More specifically, for each sequence, we first extract the features (see WP7) from each frame and we use these data to learn a HMM. The HMM associated with a particular sequence therefore represents a summary of the information we need to perform the identification. To perform the identification, we need a measure of similarity for HMMs, in order to decide which one(s) among the available tagged sequences is the most similar to the one under consideration. To this end, we should therefore explore the possibility of defining metrics (like, for instance, the KL divergence) in probabilistic spaces. Once this further manipulation of the data has been done, we can finally use standard algorithms for supervised classification, like for instance k-nearest neighbors or support vector machines, to detect which action is depicted in a particular video sequence. Overall, this defines a clear work plan to perform automatic action recognition by means of HMMs. Some of these techniques have been preliminarily tested (and described in [X]), leading to very promising results. Yet, we believe that the performances of these approaches can be further improved by employing more reliable learning algorithms.
The data we use to learn a HMM from a video sequence refer only to the manifest variables (and not to the hidden variables, which, by definition, are not observable). Thus, learning algorithms for incomplete data, like for example the EM algorithm (in its specialisation to the HMM structure, called the Baum-Welch algorithm), should be used. At standard frame rates, for short actions the amount of available data can be relatively small, which negatively affects the quality of the learning. In these situations, the EM algorithm (and analogous procedures) generally provides not very reliable estimates of the model probabilities.

Moving to the imprecise-probabilistic framework. As noted in Section 1.2, the framework of imprecise probabilities can offer a particularly robust approach to the learning of a probabilistic model from small and incomplete datasets. Basically, this is achieved by learning a collection of models, instead of a single, say “precise”, model, from the data. The idea is to define an analogue, in the imprecise-probabilistic framework, of the standard EM algorithm for learning from incomplete data.
An imprecise EM algorithm could be used to learn from the video sequence an imprecise HMM (iHMM), i.e., a HMM with an imprecise quantification of the model probabilities (e.g., with interval-valued probabilities). As a standard HMM is a particular Bayesian network, an imprecise HMM represents a (kind of) Bayesian network with an imprecise quantification of the probabilities. Models of this kind have been studied for more than a decade, and are called credal networks [66]. Roughly speaking, a credal network is equivalent to a number of Bayesian networks. Accordingly, we can regard an iHMM as a collection of HMMs. Yet, the number of HMMs associated with the single iHMM we can learn from a sequence is generally exponentially large. This makes it impossible to move trivially from the precise to the imprecise-probabilistic framework by just iterating the operations to be performed in the precise case.
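To give a concrete flavour of why an iHMM corresponds to exponentially many HMMs, and of how this enumeration can sometimes be avoided, the sketch below uses a toy two-state model whose transition rows are interval-valued. The bounds are illustrative and chosen to be coherent (each bound is attainable while its row still sums to one); with separately specified rows, one-step lower/upper state probabilities can be obtained row by row, and this is checked here against brute-force enumeration of the extreme transition matrices:

```python
import numpy as np
from itertools import product

# Interval-valued transition model: each row of the transition matrix is
# only known to lie between L[i] and U[i] (illustrative, coherent bounds).
L = np.array([[0.6, 0.2],
              [0.1, 0.7]])
U = np.array([[0.8, 0.4],
              [0.3, 0.9]])

p = np.array([0.5, 0.5])  # precise belief about the current state

# Direct bounds: with separately specified rows, lower/upper one-step
# probabilities follow row by row, with no enumeration of extreme matrices.
lower = p @ L
upper = p @ U
print(lower, upper)

def row_vertices(lo, up):
    # For a coherent 2-state interval row, the extreme points are obtained
    # by pushing the first entry to its lower or upper bound.
    return [np.array([lo[0], 1 - lo[0]]), np.array([up[0], 1 - up[0]])]

# Brute-force check over all combinations of row extreme points
# (2^n matrices in general -- exponential in the number of states).
best_lo = np.full(2, np.inf)
best_hi = np.full(2, -np.inf)
for r0, r1 in product(row_vertices(L[0], U[0]), row_vertices(L[1], U[1])):
    q = p @ np.vstack([r0, r1])
    best_lo = np.minimum(best_lo, q)
    best_hi = np.maximum(best_hi, q)
```

The agreement between the direct bounds and the brute-force ones illustrates the kind of structure the efficient algorithms of WP2 are meant to exploit.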

1.3.2 Preliminary results by the partners

One of the main challenges of this project is to develop a complete theory (including learning, inference and classification algorithms) for imprecise hidden Markov models. Some work in this direction has already been done by the groups at SYSTeMS and IDSIA. An inference algorithm for imprecise HMMs has already been proposed in a joint work of SYSTeMS and IDSIA [62]. The algorithm focuses on predictive inference, but some of its ideas can be extended to other inference problems. This algorithm has already been applied to character recognition. This was possible by means of an already formalised learning procedure for obtaining an imprecise HMM from complete data. Of course, doing the same with incomplete data can be much more difficult, but some of the ideas carry over. In another paper, again a joint work by SYSTeMS and IDSIA [63], the same algorithm was applied to tracking problems. Some of the ideas of that paper should certainly be considered when coping with continuous features without discretisation. Concerning the EM algorithm, some papers by IDSIA should be regarded as important examples of how to learn imprecise-probabilistic graphical models from (complete) datasets, while there are a number of papers by both SYSTeMS and IDSIA about general approaches to learning imprecise-probabilistic models from missing data [95,96]. Regarding imprecise-probabilistic graphical models, many papers from IDSIA and SYSTeMS [62-64,97-100], as well as a recent joint work by IDSIA and OBU, should be regarded as a strong background for developing the theoretical tools required by this project.
OBU has a significant record of research in human motion analysis and recognition [18]. Most relevant to the current proposal, the Coordinator has recently explored the use of bilinear and multilinear models for identity recognition from gait [11,14], a relatively new but promising branch of biometrics. In a strictly related topic, he is currently exploring manifold learning techniques for dynamical models representing (human) motions [13,16], in order to learn the optimal metric of the space they live in and maximise classification performance. In the wider field of articulated motion analysis he has published several papers on spectral motion capture techniques [17], focusing in particular on the crucial issue of how to select and map eigenspaces generated by two different shapes in order to track 3D points on their surfaces or consistently segment body parts along sequences of voxelsets, as a pre-processing step to action recognition.
Finally, regarding the practical implementation of the recognition tools, the R&D team at MARCH has a long tradition in both software and firmware realisation of video-analysis tools to be embedded into the products they manufacture.

1.3.3 Overview of the S/T Methodology

We have already described how the general ideas behind this proposal define a clear work plan to improve the state of the art in action and gesture recognition. This ambitious research project has been split into eleven work packages, with a clear inter-dependence structure which we describe in this paragraph. Next, a detailed description of each package is reported. The key to robust action/gesture modelling by means of (imprecise) hidden Markov models is a generalisation of the EM algorithm to the imprecise-probabilistic framework. This task is split into two different work packages, as a deep investigation of the procedure for general probabilistic models is required (WP1) before considering its specialisation to HMMs (WP4). To identify a new action we can then compare the imprecise HMM we learn by the EM with those we have already obtained from the other (labelled) sequences. This requires the development of dissimilarity measures for imprecise-probabilistic models, to be specialised to graphical models. This represents a unitary field of investigation we want to address in a single work package (WP3). Of course, as we intend to achieve real-time analysis tools, these measures should be computed efficiently. As in the precise case, this can be done by means of computationally efficient inference algorithms to be developed for imprecise HMMs. This is what we intend to do in another work package (WP2). Remarkably, these two latter work packages do not need the imprecise EM to be already implemented; these research activities can therefore be accomplished independently during the same period. These first four work packages define the theoretical investigations to be carried out during the first half of the project (18 months). During the second part, the learning and inference algorithms already developed should be applied to classification tasks. This is a very important work package (WP5).
Furthermore, all these tools, originally developed for the HMM structure, should be extended to more general dynamic probabilistic graphical models with imprecision. This could give more realism to the description of the system dynamics (e.g., by relaxing the Markovian assumption). This work package represents the conclusion of the theoretical research required by this project (WP6). Besides the theoretical investigation, there are a number of practical issues to be considered. First of all, the feature extraction process, required to extract from the video sequences the quantitative data to be processed, is a crucial step. This corresponds to an autonomous work package (WP7), which requires a substantial amount of empirical work to evaluate which features are the most suitable to consider, together with some theoretical investigation to make the feature extraction procedure computationally efficient. Furthermore, gathering and creating a large corpus of (labelled) video sequences to be treated as benchmarks for the tools we intend to develop represents a necessary, although maybe not very challenging, effort, which we regard as a single work package (WP8). Finally, the practical implementation of the action recognition algorithm as a stand-alone software tool represents the (almost) final work package in this research project (WP9). The true conclusion of our implementation efforts is in fact another work package (WP10), where the software implementation is embedded into the firmware of commercial cameras, allowing for higher performance and possible patenting and commercialisation of the technology we develop. Finally, the (very) last work package of the project concerns the necessary management of the consortium (WP11).

1.3.4 Detailed description of the Work Packages

WP1 – Imprecise EM (general theory)

The distinctive feature of our approach to the modelling of action/gesture dynamics is that we learn imprecise-probabilistic models from the data associated with the video sequence. A number of procedures have already been proposed, and are currently employed, to learn imprecise-probabilistic models from complete datasets. Yet, the way these approaches can be extended to cope with missing data is still an open field of investigation. In particular, we want to learn an imprecise-probabilistic graphical model by means of a generalisation of the EM algorithm, which is commonly used to learn (precise) models from datasets that are incomplete because of the presence of latent variables. An imprecise-probabilistic reformulation of both the maximum likelihood (frequentist) approach and the maximum a posteriori (Bayesian) approach to the EM algorithm can be considered. These algorithms have to be specialised to the particular independence relations characterising HMMs. This will be considered in a separate work package. In fact, before considering this issue, it is important to empirically compare the performances of the two learning algorithms on synthetic data, in order to decide which one provides a more reliable estimate of the model probabilities. This workload will be split into the following three tasks.
Task 1.1 Bayesian approach to imprecise EM
A possible approach to the development of an imprecise-probabilistic version of the EM algorithm could be based on a direct generalisation of the Bayesian formulation of the algorithm. The idea is simply to replace the single uniform prior distribution adopted by such a precise-probabilistic learning algorithm by means of a set of priors, modelling a condition of prior-ignorance about the process generating the data. By analogy with the strategy already considered for complete data, we want to employ a collection of Dirichlet distributions and exploit the fact that the distribution is the conjugate prior of the multinomial likelihood.
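As a hypothetical illustration of the kind of set-valued estimate that such a set of Dirichlet priors produces in the complete-data case, the sketch below computes the interval estimates of the imprecise Dirichlet model from observed counts. The function name is ours, and the hyperparameter value s = 2 is a common but not mandatory choice:

```python
def idm_intervals(counts, s=2.0):
    """Lower/upper posterior expectations of the category probabilities
    under the imprecise Dirichlet model: for category i with count n_i
    out of N observations, the interval is [n_i/(N+s), (n_i+s)/(N+s)]."""
    N = sum(counts)
    return [(n / (N + s), (n + s) / (N + s)) for n in counts]

# 10 observations of a 3-valued variable: sparse counts give wide intervals
print(idm_intervals([6, 3, 1]))
```

Note how the width of every interval, s/(N+s), shrinks as more data become available, which is exactly the robust small-sample behaviour we are after.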
Task 1.2 Frequentist approach to the imprecise EM
As another possible approach to the development of an imprecise-probabilistic version of the EM algorithm, we could consider a likelihood-based (i.e., frequentist) approach. The idea is simply to evaluate the likelihood of the available (incomplete) data according to any possible (precise) probabilistic model we can define over the variables under consideration. Then, we consider the imprecise-probabilistic model including all the precise models whose likelihood is above a certain (“alpha-cut”) level. Clearly, an analytical identification of these models is typically unfeasible, but an approximate estimate can be obtained by a recursive approach.
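The alpha-cut idea can be illustrated on a one-parameter toy problem: below we keep every Bernoulli parameter whose likelihood is at least a fraction alpha of the maximum, which yields an interval estimate. In the project the same cut would be taken over the full HMM parameter space; the grid search used here is only viable in this toy setting:

```python
import numpy as np

# Relative-likelihood alpha-cut for a Bernoulli parameter theta,
# given k successes out of n trials (illustrative numbers).
k, n, alpha = 7, 10, 0.5
thetas = np.linspace(0.001, 0.999, 999)
loglik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
rel = np.exp(loglik - loglik.max())   # likelihood relative to the MLE
cut = thetas[rel >= alpha]            # the retained set of precise models
print(cut.min(), cut.max())           # interval estimate around the MLE 0.7
```

The retained set is an interval around the maximum-likelihood estimate; lowering alpha widens it, trading precision for robustness.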
Task 1.3 Empirical comparison based on synthetic data
It is not obvious which of the Bayesian-like and the frequentist-like approaches to the imprecise EM is going to provide the most reliable estimates. Leaving aside philosophical issues, we leave the choice to an empirical comparison between the two approaches: in order to test the performances of the two algorithms, we plan to generate synthetic data from randomly generated probabilistic models, and then check the accuracy of the probability estimates of the two imprecise-probabilistic models obtained by learning from the incomplete data.

WP2 - Inference on Imprecise HMMs (iHMMs)

Hidden Markov models are a very natural way to describe dynamical processes with uncertainty. Yet, their popularity is not only related to the proper modelling they can offer, but is definitely also caused by the availability of efficient algorithms for doing inference on models of this kind: both EM and typical inferences such as calculation of probabilities and likelihoods, and most probable explanation (using Viterbi-like algorithms) can be performed efficiently. The goal of this work package is to promote the state of the art of the algorithms for imprecise HMMs to the same level as those for the precise ones. This amounts to developing and implementing computationally efficient algorithms for the following basic inference tasks:
Task 2.1. Computation of lower and upper probabilities of state and state-observation sequences
Based on the general ideas and methods developed by the SYSTeMS-IDSIA teams and described recently in [62,63], it should be possible to derive formulae for the probabilities of such sequences in imprecise HMMs that lend themselves to backwards recursion, and therefore to efficient generalisations of the backward algorithms for doing similar calculations in precise HMMs. We expect the complexity of these algorithms to be essentially linear in the length of the Markov chain.
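To make the expected linear complexity concrete, consider the simplest such inference: bounding the probability of a given state sequence. Under the assumption of separately specified (and reachable) interval-valued initial and transition probabilities, each factor can be optimised independently, so the backward-style recursion reduces to a running product. The function name and all bounds below are illustrative:

```python
import numpy as np

def sequence_bounds(pi_lo, pi_hi, T_lo, T_hi, states):
    """Lower/upper probability of a given state sequence in an iHMM with
    separately specified interval rows: a product of lower (resp. upper)
    factors, computed in time linear in the chain length."""
    lo, hi = pi_lo[states[0]], pi_hi[states[0]]
    for s, t in zip(states, states[1:]):
        lo *= T_lo[s, t]
        hi *= T_hi[s, t]
    return lo, hi

# Toy two-state iHMM (illustrative bounds)
pi_lo, pi_hi = np.array([0.4, 0.3]), np.array([0.7, 0.6])
T_lo = np.array([[0.6, 0.2], [0.1, 0.7]])
T_hi = np.array([[0.8, 0.4], [0.3, 0.9]])
print(sequence_bounds(pi_lo, pi_hi, T_lo, T_hi, [0, 0, 1]))
```

Bounding the probabilities of state-observation sequences, and the likelihoods of Task 2.2, requires the extra marginalisation machinery of [62,63], but the same linear-time recursive structure is expected to survive.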
Task 2.2. Computation of lower and upper likelihoods
These are the lower and upper probabilities of the observation sequences. Backwards recursion calculations for these entities should still be possible based on the ideas initiated in [62,63], but will be more involved, due to an extra marginalisation step. Nevertheless, we expect the complexity to be still essentially linear in the length of the Markov chain.
Task 2.3. Most probable explanation
Most probable explanation for precise HMMs consists in finding the state sequence (i.e. the explanation) with the highest probability, conditional on the observation sequence (i.e. the data). In an imprecise probabilities context, there are two important ways this problem can be reformulated [94]. The first, called the maximin method, is a pessimistic approach that consists in finding the state sequence with the highest lower probability, conditional on the observations. The second is a robust approach, and called the maximality method: one state sequence is considered to be a better explanation than another if it has a higher posterior probability in all the precise models that are compatible (or consistent) with the imprecise-probabilistic model. This leads to a partial ordering of state sequences, for which we want to identify the maximal, or undominated, ones. Preliminary explorations of this problem [64] indicate that the second approach will most likely be more amenable to an efficient implementation, and may give rise to a complexity that is comparable to that of the Viterbi algorithm. Since all maximin solutions are in particular also maximal, we can also hope to use this maximality approach to dramatically reduce the search space for the maximin approach.
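The maximality criterion itself is easy to state in code. The sketch below filters the undominated explanations given a finite set of precise models, used here as a finite stand-in for the credal set (names and numbers are illustrative; the project's algorithms would of course avoid this explicit enumeration):

```python
def maximal_explanations(post):
    """post[m][e] is the posterior probability of explanation e under
    precise model m. Explanation a dominates b if it is strictly more
    probable under *every* compatible model; the maximal explanations
    are the undominated ones."""
    n = len(post[0])
    dominated = set()
    for a in range(n):
        for b in range(n):
            if a != b and all(row[a] > row[b] for row in post):
                dominated.add(b)
    return [e for e in range(n) if e not in dominated]

# Two compatible precise models over three candidate explanations
post = [[0.50, 0.30, 0.20],
        [0.40, 0.45, 0.15]]
print(maximal_explanations(post))  # explanation 2 is dominated
```

Explanations 0 and 1 survive because neither beats the other under both models, which is precisely the partial-order behaviour described above.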

WP3 – Dissimilarity measures for imprecise-probabilistic graphical models

Once video sequences are represented as imprecise-probabilistic graphical models, for instance from the class of imprecise hidden Markov models, gesture and action recognition reduce to classifying such imprecise-probabilistic models. Different competing approaches to this issue can be foreseen, and we propose to pursue several such alternatives in the course of this project. Two of these approaches are based on the idea of finding appropriate ways of measuring the distance or dissimilarity between imprecise-probabilistic models, and using such measures to cluster the available set of imprecise-probabilistic models (each representing a video sequence) in an unsupervised classification approach. Finding such measures is the purpose of work package 3.
A first sensible option is to consider imprecise-probabilistic graphical models, and iHMMs in particular, as convex sets of probability distributions themselves. The distance/dissimilarity between two imprecise-probabilistic models is then just a special case of the distance between two sets of distributions (Tasks 3.1, 3.2).
A second option is to learn in a supervised way the “best” metric for a given training set of models, in order to maximise classification performance (Tasks 3.3, 3.4).
Task 3.1 Modelling similarity between sets of probability distributions
A simple but reasonable approach to the problem of describing the dissimilarity between two sets of probability distributions can be based on a direct extension of the techniques adopted for single distributions. The distance between the two sets would therefore be a multiple-valued object containing the distances between all the possible pairs of distributions in the two sets. Accordingly, we could characterise the distance between the two sets by means of the minimum and the maximum distance. Other approaches could be based on the identification of a single representative element in each set (e.g., the maximum entropy element or the center of mass), to which distance measures for single distributions can be applied.
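The minimum/maximum construction can be sketched directly when each credal set is represented by a finite list of extreme points, a common finite representation. Here the base dissimilarity is the KL divergence (which, as noted above, is a divergence rather than a metric); names and distributions are illustrative:

```python
import numpy as np
from itertools import product

def kl(p, q):
    """Kullback-Leibler divergence between two strictly positive
    discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def credal_dissimilarity(K1, K2):
    """Min/max KL divergence over all pairs of extreme points of two
    credal sets -- the interval-valued dissimilarity of Task 3.1."""
    vals = [kl(p, q) for p, q in product(K1, K2)]
    return min(vals), max(vals)

# Two overlapping credal sets on a binary variable (extreme points)
K1 = [np.array([0.6, 0.4]), np.array([0.7, 0.3])]
K2 = [np.array([0.5, 0.5]), np.array([0.6, 0.4])]
print(credal_dissimilarity(K1, K2))
```

Because the two sets share an extreme point, the minimum is zero while the maximum stays positive: the interval as a whole carries more information than either endpoint alone.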
Task 3.2 Specialisation to the class of imprecise hidden Markov models (iHMMs)
The results achieved in the previous task should then be specialised to the sets of (joint) probability distributions associated with an iHMM. The challenge, here, is to perform the distance evaluation efficiently. This should be possible by exploiting the algorithms developed for WP2.
Task 3.3 Supervised learning of similarity measures for precise HMMs
We will first tackle the case of learning similarities between precise stochastic models, with a focus on hidden Markov models. A number of distance functions have been introduced in the past (e.g., [52]), and a vast literature about dissimilarity measures for Markov models also exists [22], mostly concerning variants of the Kullback-Leibler divergence [33]. However, as models (or sequences) can be endowed with different labels (e.g., action, ID) while maintaining the same geometrical structure, no single distance function can possibly outperform all the others in every classification problem.
A reasonable approach when possessing some a-priori information is therefore trying to learn in a supervised fashion the “best” distance function for a specific classification problem [3,4,48]. A natural optimisation criterion consists in maximising the classification performance achieved by the learnt metric, a problem which has elegant solutions in the case of linear mappings [50,57]. However, as even the simplest linear dynamical models live in a nonlinear space, the need for a principled way of learning Riemannian metrics from such data naturally arises. An interesting tool is provided by the formalism of pullback metrics. If the models belong to a Riemannian manifold M, any diffeomorphism of M onto itself, or “automorphism”, induces such a metric on M. By designing a suitable family of automorphisms depending on a parameter, we obtain a family of pullback metrics on M over which we can optimise. OBU already has relevant experience in this sense [13]. As for important classes of dynamical models used in action recognition, such as HMMs or variable-length Markov models (VLMMs) [26], a proper metric has not yet been identified, we necessarily need to relax the constraint of having a proper manifold structure, extending the pullback learning technique to mere distance functions or divergences. This will be the third goal of WP3.
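The pullback construction just described can be stated compactly in standard Riemannian notation; the parameter λ indexing the family of automorphisms is generic:

```latex
g^{F_\lambda}_x(u,v) \;=\; g_{F_\lambda(x)}\!\big(DF_\lambda|_x\, u,\; DF_\lambda|_x\, v\big),
\qquad u,v \in T_xM,
```

where $F_\lambda : M \to M$ is a diffeomorphism and $DF_\lambda|_x$ its differential at $x$. Metric learning then amounts to optimising the classification performance induced by $g^{F_\lambda}$ over $\lambda$.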
Task 3.4 Supervised learning of similarity measures for imprecise HMMs
An obvious development of the pullback approach of Task 3.3 is the extension of the presented methodology to other, more powerful classes of dynamical models (such as, for instance, semi-Markov [51], hierarchical or variable length [26] Markov models) able to describe complex activities instead of elementary actions. The other natural development, most relevant to the current project, is the extension of metric learning approaches (and the pullback one in particular) to imprecise HMMs first, and other classes of imprecise-probabilistic graphical models later. We will be able to use the results of Tasks 3.1 and 3.2 to define a base metric/similarity for iHMMs, on which to apply our metric optimisation approach.

WP4 - Imprecise EM for HMMs

The first work package was intended to develop an imprecise-probabilistic formulation of the EM algorithm. This algorithm should be employed for the learning of imprecise HMMs from video sequences. This step is not trivial, as the number of operations to be performed at each iteration of the algorithm can be exponentially large in the input size. For this reason, we allocate a separate work package to the specialisation of the algorithm to the independence structure of the HMM.
Task 4.1 Specialisation of the EM algorithms to the HMM topology
The imprecise EM developed in WP1 should first be reformulated for the special case of HMMs. This should be based on the inference algorithms developed in WP2. Yet, the partial overlap in the scheduling of WP2 and WP4 is not a real problem, as this task can be achieved without explicitly invoking the necessary inference algorithms.
Task 4.2 Implementation of recursive formulae for the parameters updating
To achieve this second task, the recursive formulae describing the revision of the model parameters at each iteration of the algorithm should be determined.
Task 4.3 Convergence and initialisation empirical study
The goal of this task is to evaluate the convergence properties (and speed) of the EM algorithm, when specialised to the HMM topology. The robustness with respect to the initialisation of the parameters should also be considered.

WP5 - Classification of imprecise-probabilistic graphical models

As we mentioned in WP3, in our proposed framework the classification of video sequences (actions, gaits, gesture) reduces to the classification of imprecise-probabilistic graphical models, with an emphasis on iHMMs. Work package 3 deals with the approach to classification based on finding the nearest neighbor to the model (data-point) to classify, and attaching to the new model the label of such nearest neighbor (or the label of the majority of the k nearest ones).
Work package 5 is designed to use the a-priori or learnt distance measures of WP3 to classify imprecise-probabilistic models (Task 5.1), and imprecise HMMs in particular (Task 5.2), but also to pursue alternative classification strategies based on kernel SVMs in an approach called “structured learning” (Task 5.3), or to adopt large-scale classification techniques based on randomised projections onto a reduced measurement space (Task 5.4).
Task 5.1 Distance-based clustering and classification of imprecise-probabilistic models
Distance functions between convex sets of probabilities (WP3, Task 3.1) can be used in k-nearest-neighbor or other distance-based classifiers to classify arbitrary imprecise-probabilistic models representing credal sets. We will formalise and test such an approach as the first step in our classification strategy, as it can be extended to the more general classes of imprecise-probabilistic graphical models designed in work package 6.
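The distance-based classification rule itself is agnostic to the dissimilarity used, so it can be sketched generically: any dissimilarity between models (for instance an endpoint of the interval-valued credal-set distance of Task 3.1) can be plugged in. The function name and the toy one-dimensional "models" below are purely illustrative:

```python
from collections import Counter

def knn_classify(x, train, dist, k=3):
    """k-nearest-neighbor rule for an arbitrary dissimilarity `dist`
    between models: find the k training models closest to x and return
    the majority label among them."""
    neighbours = sorted(train, key=lambda item: dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 1-D "models" with absolute difference as the dissimilarity
train = [(0.1, "walk"), (0.2, "walk"), (0.9, "jump"), (0.8, "jump")]
print(knn_classify(0.15, train, lambda a, b: abs(a - b)))
```

Because the rule only queries `dist`, moving from precise to imprecise models changes nothing here: all the work is delegated to the dissimilarities of WP3.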
Task 5.2 Efficient specialisation to the case of imprecise HMMs
The results of WP3, Task 3.2 can in the same way be used to deliver and test distance-based classifiers for imprecise Markov models as a special but very relevant case, given the importance of the Markovian hypothesis in action and gesture recognition (see Section 1.2). These techniques need to be specialised to the HMM topology in a computationally efficient way.
Task 5.3 Kernel SVM classification of imprecise-probabilistic models: a structured learning approach
A second approach we will pursue for imprecise-probabilistic model classification is the use of kernel classification techniques in a structured learning framework. Kernel SVM classification is a powerful tool, which allows us to find optimal implicit linear classifiers in sufficiently large feature spaces without having to compute feature maps explicitly, thanks to the so-called “kernel trick”. “Structured learning” [89] is the sub-field of machine learning that deals with the set of techniques devoted to classifying complex objects, rather than simple collections of numbers (vectors), by learning how to map inputs to arbitrarily complex outputs which possess a possibly complex internal structure [88]. This is clearly the case for graphical models, so structured learning approaches can be proposed to classify stochastic models.
Task 5.4 Compressed sensing applied to imprecise-probabilistic models
“Compressed sensing” [76] denotes a recent family of techniques based on projecting the available data onto a randomly picked subspace [71,72,73] of the original space the data live in (the measurement space): provided the subspace has a sufficiently large dimension, with bounds determined by the restricted isometry property [74,75], the projection preserves the information needed for reconstruction and classification. OBU has recently started to explore the use of compressed sensing for large-scale classification tasks, such as those involved in image segmentation [85].
The complexity of most available kernel SVM [78] solvers, however, in terms of both storage and classification time, is so large as to prevent the application of this methodology to truly large-scale problems, such as those arising in most real-world tasks (e.g., video segmentation).
Results proving that SVM classification can be performed efficiently after random projection, greatly reducing computational requirements, have recently appeared. While these straightforwardly apply to linear SVMs, their naive application to the more useful kernel SVM classifiers remains elusive. Oxford Brookes has recently proposed (by making use of specific random projectors developed in the PSH context to kernelise hash functions [77]) a framework in which linear SVM classification, equivalent to kernel SVM classification, is performed in an implicitly defined compressed feature space, extending to kernel methods recently proven results on “compressed” classifiers. The dramatically reduced complexity of the resulting linear classifier in a compressed feature space opens the way to the solution of truly large-scale classification problems [79]. We plan to bring these techniques to full development and apply them to the classification of imprecise-probabilistic graphical models.
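The core intuition behind compressed classification can be conveyed by a toy experiment: a random Gaussian projection from a 1000-dimensional space to a 200-dimensional one approximately preserves pairwise distances (and hence classification margins), as guaranteed by Johnson–Lindenstrauss-type results. This is illustrative code, not the framework of [79].

```python
# Toy illustration of the compressed-classification idea: project
# high-dimensional points onto a random Gaussian subspace and check
# that pairwise distances (hence margins) are roughly preserved.

import math
import random

def random_projection(dim_in, dim_out, seed=0):
    """Gaussian projection matrix, scaled so distances are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0, scale) for _ in range(dim_in)] for _ in range(dim_out)]

def project(R, x):
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

dim_in, dim_out = 1000, 200
R = random_projection(dim_in, dim_out)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(dim_in)]
y = [rng.gauss(0, 1) for _ in range(dim_in)]
# original and compressed distances agree up to a small distortion
print(dist(x, y), dist(project(R, x), project(R, y)))
```

A linear SVM trained on the 200-dimensional projections then enjoys a 5x reduction in storage and classification cost, which is the effect exploited at much larger scale in [79].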
Task 5.5 Empirical comparison of different classification strategies
The last but crucial task of this work package is the empirical comparison of the different classification strategies we plan to pursue. Feedback will be received from the work package on “Integration and Testing” in order to select one or possibly two winning approaches, which we will eventually distribute and implement as firmware (WP10).

WP6 - Imprecise-probabilistic Dynamical Graphical Models

One of the main goals of this research project is to develop a theory of imprecise hidden Markov models, including a formalisation of the models, appropriate and efficient inference algorithms, and a procedure to learn these models from (either complete or incomplete) data. Such HMMs are an example of probabilistic graphical models with a dynamic component (reducing to Bayesian networks when quantified by precise probabilistic assessments). In a sense they should be regarded as the simplest example of dynamic Bayesian networks. A general theory for dynamic BNs has already been formalised, but something similar is missing for credal networks (i.e., the imprecise-probabilistic version of Bayesian networks). The goal of this work package is mostly to start filling this gap, paying special attention to higher-order HMM topologies able to provide a better model of the correlation between the frames of a video sequence depicting an action. We also intend to explore the possibility of modelling these relations by means of undirected graphs, providing a generalisation of Markov random fields (which are already widely used in computer vision) to the imprecise-probabilistic framework.
Task 6.1 Dynamic (and object-oriented) credal networks
A crucial step in dealing with robustness and imprecision in dynamic BNs, and higher-order HMMs in particular, will consist in investigating whether the ideas pioneered in [62,63] for inference in credal trees can be applied, or extended, to these more general types of probabilistic graphical models. Methods will then have to be devised for (recursively) constructing joint models from the local imprecise-probabilistic models attached to the nodes of such credal networks.
Task 6.2 Inference algorithms and complexity results for dynamic credal networks
Based on those (recursive) methods for constructing joints, recursive or message-passing algorithms need to be developed for updating these networks with data, computing lower and upper probabilities and likelihoods, and finding the most probable explanation.
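The flavour of such algorithms can be conveyed by a naive interval-arithmetic forward pass for a toy imprecise HMM. Note that this sketch ignores the constraint that each transition row sums to one, so it only brackets the exact credal lower and upper likelihoods with conservative (wider) bounds; all numbers are invented.

```python
# Naive interval-arithmetic forward pass for a 2-state imprecise HMM.
# Initial, transition and emission probabilities are (lower, upper)
# intervals; propagating endpoint products brackets the sequence
# likelihood. Exact credal inference, the subject of this task, would
# additionally enforce that each transition row sums to one, yielding
# tighter bounds.

def interval_forward(pi, A, B, obs):
    """pi[s], A[s][t], B[s][o] are (lo, hi) probability intervals."""
    n = len(pi)
    alpha = [(pi[s][0] * B[s][obs[0]][0], pi[s][1] * B[s][obs[0]][1])
             for s in range(n)]
    for o in obs[1:]:
        alpha = [(sum(alpha[s][0] * A[s][t][0] for s in range(n)) * B[t][o][0],
                  sum(alpha[s][1] * A[s][t][1] for s in range(n)) * B[t][o][1])
                 for t in range(n)]
    return sum(a[0] for a in alpha), sum(a[1] for a in alpha)

pi = [(0.5, 0.6), (0.4, 0.5)]
A = [[(0.6, 0.7), (0.3, 0.4)], [(0.2, 0.3), (0.7, 0.8)]]
B = [[(0.8, 0.9), (0.1, 0.2)], [(0.3, 0.4), (0.6, 0.7)]]
lo, hi = interval_forward(pi, A, B, obs=[0, 0, 1])
print(lo, hi)  # lower and upper likelihood bounds, lo < hi
```

The gap between the two bounds is exactly the kind of quantity a robust classifier can exploit: a large gap signals that the available data do not support a confident decision.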
Task 6.3 Imprecise Markov random fields
Hidden Markov models are not the only example of probabilistic graphical models adopted in computer vision. Markov random fields are graphical models based on undirected graphs, particularly suited to modelling pixel-to-pixel correlation in image analysis. For the same reasons we have provided to justify our research on imprecise HMMs, it is important to explore the possibility of a similar extension to imprecise Markov random fields.

WP7 - Benchmark datasets gathering

A crucial step in reaching the overall goals of the proposed project consists in validating the theoretical advances in the theory of imprecise-probabilistic models, classification and feature selection (in the form of the produced algorithms) on challenging, real-world data. The objective of this work package is therefore to define and acquire a representative set of sensor data and set up a repository that allows for data organisation, upload and download by all partners. We believe the data should stem from multiple and redundant sensors, in order to make sure the obtained performances do not crucially depend on the chosen setup.
Given the current state of the art in video analysis, we selected synchronised video cameras and range (time-of-flight) cameras, by virtue of the remarkable impact the latter are having at this very moment, as the sources of our video data.
Task 7.1 Acquisition of video testbeds from multiple cameras
A large number of video sequences depicting different people performing different actions and gestures, with or without missing data and occlusions, under different environmental conditions, in different scenarios (human-computer interaction, smart room, surveillance, identity recognition) will be collected. The technologies already available at the MARCH headquarters will be used for this purpose.
Task 7.2 Acquisition of data from range cameras
Range cameras are changing the landscape of computer vision at this very moment, as they provide several advantages over stereo configurations for depth estimation.
A time-of-flight (TOF) camera is a camera system that creates distance data with the help of the time-of-flight principle. The principle is similar to that of LIDAR scanners, with the advantage that the whole scene is captured at the same time. Time-of-flight cameras are relatively new devices, as semiconductor processes have only recently become fast enough to support them. Such systems cover ranges from a few metres up to about 60 m, with a distance resolution of about 1 cm. In contrast to stereo vision or triangulation systems, the whole system is very compact. Extracting the distance information from the output signals of a TOF sensor is also very easy, so that this task requires only a small amount of processing power, again in contrast to stereo vision, where complex correlation algorithms have to be implemented. Once the distance data has been extracted, object detection, for example, is also easy to carry out, because the algorithms are not disturbed by patterns on the object. However, if several time-of-flight cameras run at the same time, they may disturb each other's measurements. Besides, time-of-flight cameras illuminate the whole scene: due to multiple reflections, the light may reach an object along several paths, and the measured distance may therefore be greater than the true distance. The most striking application of range cameras is perhaps Microsoft's “Project Natal”, which eventually led to the recent commercialisation of the Kinect sensor. With its controller-free gaming experience, Kinect is already revolutionising the whole field of interactive video games and consoles.
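The underlying geometry is elementary: the sensor measures the round-trip time of an emitted light pulse, giving distance = c·t/2, as in the following sketch.

```python
# Basic time-of-flight geometry: the sensor measures the round-trip
# time of an emitted light pulse, so distance = c * t / 2. For the
# ~60 m range quoted above, round-trip times stay below half a
# microsecond, which is why fast semiconductor shutters are required.

C = 299792458.0  # speed of light, m/s

def tof_distance(round_trip_seconds):
    return C * round_trip_seconds / 2.0

# a target at 10 m returns the pulse after roughly 66.7 nanoseconds
t = 2 * 10.0 / C
print(tof_distance(t))  # -> 10.0 (metres)
```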
Oxford Brookes Vision Group enjoys continuing strong links with Microsoft through its head, Philip Torr, and has recently acquired a range camera to kick-start cutting-edge research in motion analysis. OBU can also provide a VICON motion capture system, able to track in real time a number of markers on moving persons via multiple synchronised IR cameras, yielding ground-truth 3D locations of the markers.
Task 7.3 Data repository and dissemination among partners
Large amounts of data of different modalities (synchronised video cameras and range cameras) will be acquired during the project in order to test the developed prototype. Therefore, it is important for all participants, and for the public, to have a means to store and efficiently access this data. This will involve the development of a secure web site, or an intranet section of the project website developed for wider dissemination purposes. To what extent the data can be made public will be discussed by the consortium, as outlined in Section 4.

WP8 - Feature extraction

In order to represent actions or gestures as dynamical models, and in particular robust, imprecise-probabilistic ones, the associated video streams have to be processed in order to extract the most relevant information from each individual frame: relevant image measurements are typically referred to as “features”. Feature extraction is therefore a crucial stage in any human action recognition process. Features come in many categories, such as local gradient histograms (HOGs), silhouettes, and motion features based on the optical flow. Selecting an appropriate class of features is extremely important, but it is not all.
Task 8.1 Feature extraction from synchronised video frames
Historically, silhouettes have often (but by no means always [35]) been used to encode the shape of the walking person in the context of gait identification, but also in the wider action recognition context. However, they are nowadays widely criticised for their sensitivity to noise, and for the fact that they require (at least when working with conventional 2D cameras) solving the inherently ill-defined background subtraction problem. To achieve the goal of developing a robust action and gesture recognition system, it is therefore essential to move beyond silhouette-based representations. An interesting feature descriptor, for instance, called “action snippets” [43], is based on motion and shape extraction within rectangular bounding boxes which, contrary to silhouettes, can be reliably obtained in most scenarios by using person detectors [6] or trackers [19]. Our final goal is to adopt a discriminative feature selection stage, such as the one proposed in [42], where discriminative features are selected from an initial bag of HOG-based descriptors. In this sense, the expertise of OBU in this area will be extremely valuable to the final success of the project.
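To make the notion concrete, the following toy function computes a HOG-style descriptor (a magnitude-weighted histogram of gradient orientations) over a grayscale patch; production implementations, such as those in OpenCV or scikit-image, add cell/block pooling and block normalisation on top of this basic step.

```python
# Minimal sketch of a histogram-of-oriented-gradients (HOG) style
# descriptor on a grayscale patch, illustrating the kind of feature
# referred to above. Gradients are estimated by central differences,
# votes are weighted by gradient magnitude, and orientations are
# unsigned (folded into [0, pi)).

import math

def hog_descriptor(patch, n_bins=9):
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi        # unsigned orientation
            b = min(int(ang / math.pi * n_bins), n_bins - 1)
            hist[b] += mag                            # magnitude-weighted vote
    total = sum(hist) or 1.0
    return [v / total for v in hist]                  # L1-normalised

patch = [[x for x in range(8)] for _ in range(8)]     # pure horizontal ramp
print(hog_descriptor(patch))  # all mass falls in the 0-degree bin
```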
Task 8.2 Feature extraction from range cameras
As noted for WP7, multi-view stereo methods frequently fail to properly reconstruct 3D scene geometry if visible texture is sparse or the scene exhibits difficult self-occlusions. Time-of-flight (ToF) depth sensors can provide 3D information regardless of texture, but with only limited resolution and accuracy [90]. In the case of range cameras, the type of data requires a tailored feature extraction process. For instance, [91] discusses methods for extracting stable features from camera and laser range finder data for use in SLAM: the proposed methods include a dot-product feature extractor and a line-segment feature extractor for the laser range finder, and a corner feature extractor for camera images. In [93] an algorithm is presented which extracts features from a time series of Flash LADAR images without a-priori knowledge of the surrounding environment; the extracted features are tested to identify motion artifacts and to separate entities into static and non-static states. [92] provides a thorough overview of the current status of range imaging and its use in computer graphics and vision. The study of feature extraction from range images, or from a fusion of range and traditional images, will be explored in this task.
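As a concrete toy example of how range data simplifies feature extraction, jump edges (depth discontinuities) can be found by direct thresholding of neighbouring depth values, with no texture-dependent correlation stage; this is an illustrative stand-in, not a reimplementation of the extractors in [91,93].

```python
# Illustrative depth-feature sketch: in a range image, jump edges
# (depth discontinuities separating an object from the background)
# can be detected by simple thresholding of adjacent depth values,
# in contrast to stereo, where disparities must first be recovered
# by correlation.

def jump_edges(depth, threshold=0.5):
    """Return (row, col) pixels where depth jumps by more than threshold metres."""
    edges = []
    for y in range(len(depth)):
        for x in range(1, len(depth[0])):
            if abs(depth[y][x] - depth[y][x - 1]) > threshold:
                edges.append((y, x))
    return edges

depth = [[1.0, 1.0, 3.0, 3.0],   # object at 1 m in front of a wall at 3 m
         [1.0, 1.0, 3.0, 3.0]]
print(jump_edges(depth))  # -> [(0, 2), (1, 2)]
```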

WP9 – Integration and testing

Work Package 9 is in some sense the core of the project. As the parallel lines of research concerning feature extraction, imprecise Markov models, and inference by generalised Expectation-Maximisation deliver their expected outcomes, and pieces of code are generated and tested within each individual subject area, the need arises for integrating all the code into a single coherent prototype, to be tested on the data gathered by OBU and MARCH in WP7.
Task 9.1 Integration in a single prototype
The algorithms for imprecise-probabilistic model parameter identification from partial data, inference on imprecise-probabilistic graphical models, and feature extraction from multiple views and range cameras will be assembled in a coherent prototype, in order to be tested on the internally generated datasets and on the available public testbeds for action, gesture, and identity recognition from gait.

Figure 4. Sample images from the Weizmann, UCF Sports, YouTube and Hollywood action recognition datasets, in this order.
Task 9.2 Validation on gathered and publicly available datasets
As we mentioned above, all three parallel pipelines in which the project is subdivided (feature extraction and selection, inference of imprecise-probabilistic models from partial data, and classification of imprecise-probabilistic models) contain individual test stages in which the generated specific algorithms are tested and validated to help the development process. Of course, however, a comprehensive validation stage in which the entire prototype in all its components is tested on state-of-the-art action, gesture, and identity recognition databases is necessary to validate the interaction between the single components and the overall performance in terms of robustness and classification rates.
Major action recognition datasets include the following.
  • The Weizmann database includes videos of 9 different actions performed by 16 subjects, and is considered easier as the scene is quite static and all sequences have similar characteristics.

  • The KTH database includes 6 different actions performed by 25 subjects; this dataset is more challenging, as four different scenarios are considered, and subjects wear different clothes and act at different positions and distances from the camera.

  • The YouTube action database [37] contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horse back riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. The dataset is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etcetera.

  • The HOHA (Hollywood) dataset was collected in 2008 [87] and 2009 [39]. The second batch, in particular, is intended to provide a comprehensive benchmark for human action recognition in realistic and challenging settings. The dataset is composed of video clips extracted from 69 movies, and contains approximately 150 samples per action class and 130 samples per scene class in the training and test subsets.
Other action recognition datasets include the ViHasi, the LSCOM, the IRISA, the CMU RADAR dataset, and the Daily Living Activities dataset.

Several gait identification datasets are also available. On top of that, our prototype will be tested on the data gathered by the consortium, with a focus on range data for its potentially dramatic contribution to the success of the project and its wider impact on both academic and industrial communities.
Task 9.3 Feedback to feature extraction, classification, and inference pipeline
As can be observed in the GANTT chart on page 24, work package 9 runs for two entire years. The reason is that preliminary validation results have to be continually fed back to the theoretical pipelines, in order to adjust approaches and strategies to the evidence coming from real-world experiments. We expect several such cycles of algorithm development, testing, and feedback throughout the entire second half of the project.

WP10 - Firmware implementation

The software tools implemented in WP9 can be employed as stand-alone tools for action recognition, to be run on machines connected to video sources. Yet, more sophisticated video cameras can embed analysis algorithms in their firmware. This can drastically improve the efficiency of the algorithms and promote the dissemination of our technology. This work package is split into two tasks.
Task 10.1 Choosing the camera architecture
The choice of the hardware architecture of the camera should be based on the complexity and structure of the identification algorithms, and on the kind and size of the data structures to be manipulated. Of course, especially in case of a commercialisation of these devices, other aspects such as the price and size of the different hardware components can affect the choice. This first task consists in the identification of the optimal architecture. As the R&D team at MARCH has extensive experience with problems of this kind, we expect to address this task without particular complications.
Task 10.2 Firmware integration of the algorithms
The biggest challenge posed by this work package is the integration of the identification algorithms into the firmware of the camera. Once the camera architecture has been identified on the basis of the principles described in the previous task, the identification algorithms will be imported and integrated into the firmware. The development tools to be used, and the skills required of the developer in charge of the firmware implementation, are strongly related to the choice of the hardware. The firmware team at MARCH has specific expertise in this field.

WP11 - Management and dissemination

The overall coordination of the project will be carried out by Fabio Cuzzolin of Oxford Brookes. He will be assisted by an administrative manager, who will be hired at 25% of his/her time to assist with project management services and dissemination activities. The second objective of this work package is to ensure dissemination of the results at a scientific level, as well as dissemination and exploitation in the industrial domain.
Task 11.1 Coordination
Coordination comprises the organisation of a kick-off meeting with clear instructions on the administrative and financial responsibilities of partners, reporting formats and timetables. In subsequent Steering Committee meetings this will be a recurrent item on the agenda, next to the monitoring of technical progress and results (deliverables, milestones). Coordination further includes the preparation of annual progress reports and final reporting to the EC, including financial statements and audit certificates. The reports will be presented as drafts at the Steering Committee meetings, where they will be approved before submission to the Commission. The reporting is done according to FP7 rules. Although the reports are confidential in nature, they will contain an executive summary to be made public through the website.
Task 11.2 Dissemination
The project foresees a wide spectrum of dissemination activities, in particular:
  • targeted seminars to be held at companies active in the fields of gesture and identity recognition, human-machine interfacing, and biometrics;

  • organisation of special sessions at major computer vision and imprecise probability conferences;

  • media press conferences to present the results of the project to the wider public. An example, at the national level, is the dissemination effort of the Laboratorio di Intelligenza Artificiale e Robotica of Politecnico di Milano in the context of their project on an autonomous wheelchair. As a result, that project has enjoyed several appearances on national television networks, with a significant positive impact on the widespread diffusion of its outcomes;

  • engaging government agencies potentially interested in security/surveillance applications of the developed robust recognition techniques;

  • exploiting present and future academic links of the partners to kick-start, should the results be as good as expected, follow-up research projects at the national or European level, possibly on a large scale.

Consortium management. Consortium management is the first goal of Work Package 11. A more detailed discussion of the management structure of the project is available in Section 2.1.

(i) Overall strategy (of the work plan)

Our work plan is articulated into three parallel pipelines (see the PERT diagram of Section (v)). This is a natural consequence of the different tasks involved in delivering a robust action and gesture recognition framework and prototype in the context described above: (1) extracting robust features from video sequences in different modalities; (2) developing a coherent inference theory for imprecise-probabilistic graphical models; (3) classifying video sequences by classifying such complex objects as imprecise-probabilistic graphical models.
The first area concerns the analysis of the frames forming a video sequence, required to convert the video sequences into numerical information to be processed by the algorithms (WPs 7,8).
The second pipeline encompasses theoretical research focused on the development of the necessary learning and inference algorithms for the imprecise-probabilistic dynamic graphical models we employ to describe actions and gestures (WPs 1, 2, 4, 6). The third pipeline is designed to study the problem of classifying such complex objects in order to recognise actions, gestures or identities.
Each area is further broken down into a number of work packages, corresponding to specific (and measurable) tasks to be achieved. There is an obvious temporal relationship between work packages in the same pipeline, while the results of each pipeline are to be fed to the core of the project, the Integration work package WP9. The level of modularisation of the workload is designed to allow the four different partners to work efficiently on the project during the same period, with no risk of overlap or downtime. The three pipelines are supposed to start from month 1 and run in parallel, managed by different groups. In fact, the third pipeline (classification, WPs 3 and 5) partially depends on the results of the second (inference on imprecise-probabilistic graphical models), but it is also autonomous, in the sense that it will start by studying the problem of classifying more “classical” dynamical models, to later move to the case of imprecise-probabilistic ones.
A number of feedback loops can also be noted, notably between Integration and Testing (WP9) and each of the three pipelines: initial approaches and angles on the separate tasks will have to be revised in the light of the experimental results obtained. This feedback loop will run for the entire second half of the project. The number of work packages (11) is relatively high for a project of this scale, but we believe it to be appropriate to the complexity of the proposed project. It is made possible by the fact that the consortium has extensive expertise in all the tasks involved: this in turn allows the partners to plan the proposed effort in considerable detail, facilitating monitoring by the Commission.

(ii) Timing of the work packages (WPs)

The project is intended to cover a period of three years. The Gantt chart below depicts the overall schedule, including the timing of the different work packages.

(iii) Detailed work description broken into work packages

Work package list

List of Deliverables

List of Milestones

Summary of effort

(iv) PERT diagram of the interdependencies between individual work packages

As can be observed in the PERT diagram, there are a lot of interdependencies between the individual work packages.
We would nevertheless like to stress two points:
  • the individual work packages are very specific, and this is reflected in their relatively large number; however, they can also be grouped into larger, more or less independent units of research, namely “computer vision” (WPs 7, 8), “model classification” (WPs 3, 4), “imprecise EM algorithms” (WPs 1, 4) and “imprecise hidden Markov models” (WPs 2, 3, 5), within which dependencies are more pronounced;

  • there are several feedback links between the integration/testing stage of the project and the more theoretical/preparatory work packages: as algorithms are assembled and tested, strategies and approaches will likely have to undergo one or more cycles of revision.

(v) Risk assessment and contingency plan

Some of the objectives of the proposal, reflected in the above work packages, admittedly involve a certain degree of risk.
In particular, the development of inference algorithms for imprecise hidden Markov models is cutting-edge research at this time, while the formulation of a general theory for arbitrary imprecise-probabilistic graphical models poses a beautiful but serious challenge. The classification of “precise” graphical models is also the subject of ongoing, deep research: its extension to imprecise-probabilistic graphical models is similarly ambitious, while crucial to the success of the project. These risks are balanced by the strong skills of the three academic partners in their respective areas of expertise, and by the specific profile of the Coordinator, with his interdisciplinary expertise in both the applications targeted by this project (action, gesture and identity recognition) and its theoretical side (theory of imprecise probabilities and classification of dynamical models).
From a general point of view, as mentioned above, the project is articulated into three different pipelines. While a coherent formulation of inference and classification algorithms for imprecise-probabilistic graphical models is a challenge that cannot be met by simply building on existing results, activities in the first pipeline (database gathering and feature extraction) are characterised by a lower level of risk; moreover, given OBU's expertise in computer vision applications and software development, we do not envisage any insurmountable difficulty with integration and testing (WP9). Overall, the project is designed as a pincer in which several arms attack the problem from different angles in a complementary and cooperative manner. Concerning WPs 9 and 10, we believe our industrial partner has the necessary background to guarantee a proper software and firmware implementation of these tools. Finally, concerning the management of the consortium, the three academic partners have smoothly and fruitfully worked together in the past without any particular issue: it seems therefore reasonable to expect the same in the course of this project as well.
To moderate risk, we have foreseen in work package 5 (classification) the pursuit of several alternative options for the classification of imprecise Markov and graphical models, based on clustering, structured kernel learning, and possibly compressed sensing. Further actions will be assessed by the Steering Committee in its periodic meetings, and through informal contacts and video conferences between the partners. All actions deemed necessary will be proactively considered and enacted.

[1] S. Ali, A. Basharat, and M. Shah, Chaotic invariants for human action recognition, ICCV’07.
[2] S.-I. Amari, Differential geometric methods in statistics, Springer-Verlag, 1985.
[3] A. Bar-Hillel, T. Hertz, N. Shental, and D.Weinshall, Learning distance functions using equivalence relations, ICML03, 2003, pp. 11–18.
[4] M. Bilenko, S. Basu, and R. J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, Proc. of ICML’04, 2004.
[5] A. Bissacco, A. Chiuso, and S. Soatto, Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport, IEEE Trans. PAMI 29 (2007), no. 11, 1958– 1972.
[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, IEEE Conference on Computer Vision (2005), 1395–1402.
[7] M. Bregonzio, S. Gong, and T. Xiang, Recognising action as clouds of space-time interest points, CVPR’09, pp. 1948–1955.
[8] P. Burman, A comparative study of ordinary crossvalidation, v-fold cross-validation and the repeated learning-testing methods, Biometrika 76(3) (1989), 503–514.
[9] N. L. Carter, D. Young, and J. M. Ferryman, Supplementing Markov chains with additional features for behavioural analysis, 2006, pp. 65–65.
[10] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, Histograms of oriented optical flow and Binet- Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, CVPR’09, pp. 1932– 1939.
[11] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, CVPR’06, vol. 1, pp. 1701–1708.
[12] F. Cuzzolin, A geometric approach to the theory of evidence, IEEE Transactions on Systems, Man, and Cybernetics – Part C 38 (2008), no. 4, 522–534.
[13] F. Cuzzolin, Learning pullback metrics for linear models, Workshop on Machine Learning for Vision-based Motion Analysis MLVMA, 2008.
[14] F. Cuzzolin, Multilinear modeling for robust identity recognition from gait, Behavioral Biometrics for Human Identification: Intelligent Applications (Liang Wang and Xin Geng, eds.), IGI, 2009.
[15] F. Cuzzolin, Three alternative combinatorial formulations of the theory of evidence, Intelligent Decision Analysis (2010).
[16] F. Cuzzolin, Manifold learning for multi-dimensional autoregressive dynamical models, Machine Learning for Vision-based Motion Analysis (L. Wang, G. Zhao, L. Cheng, and M. Pietikäinen, eds.), Springer, 2010.
[17] F. Cuzzolin, D. Mateus, D. Knossow, E. Boyer, and R. Horaud, Coherent Laplacian protrusion segmentation, CVPR’08.
[18] F. Cuzzolin, A. Sarti, and S. Tubaro, Action modeling with volumetric data, ICIP’04, vol. 2, pp. 881– 884.
[19] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR’05, pp. 886–893.
[20] M. N. Do, Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models, IEEE Signal Processing Letters 10 (2003), no. 4, 115 – 118.
[21] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, ICCV’09.
[22] C. F. Eick, A. Rouhana, A. Bagherjeiran, and R. Vilalta, Using clustering to learn distance functions for supervised similarity assessment, ICML and Data Mining, 2005.
[23] R. Elliot, L. Aggoun, and J. Moore, Hidden Markov models: estimation and control, Springer Verlag, 1995.
[24] J. M. Wang et. al., Gaussian process dynamical model, NIPS’05.
[25] L. Ralaivola et al., Dynamical modeling with kernels for nonlinear time series prediction, NIPS’04.
[26] A. Galata, N. Johnson, and D. Hogg, Learning variable length Markov models of behavior, CVIU 81 (2001), no. 3, 398–413.
[27] A. Gilbert, J. Illingworth, and R. Bowden, Scale invariant action recognition using compound features mined from dense spatio-temporal corners, 2008, pp. I: 222–233.
[28] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis, Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos., CVPR’09, pp. 2012–2019.
[29] D. Han, L. Bo, and C. Sminchisescu, Selection and context for action recognition, ICCV’09.
[30] N. Ikizler-Cinbis, R. G. Cinbis, and S. Sclaroff, Learning actions from the web, ICCV’09.
[31] M. Itoh and Y. Shishido, Fisher information metric and Poisson kernels, Differential Geometry and its Applications 26 (2008), no. 4, 347–356.
[32] T. K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Tr. PAMI 31 (2009), no. 8, 1415–1428.
[33] S. Kullback and R. A. Leibler, On information and sufficiency, Annals of Math. Stat. 22 (1951), 79–86.
[34] G. Lebanon, Metric learning for text documents, IEEE Tr. PAMI 28 (2006), no. 4, 497–508.
[35] R. N. Li, R. Chellappa, and S. H. K. Zhou, Learning multimodal densities on discriminative temporal interaction manifold for group activity recognition, CVPR'09.
[36] Z. Lin, Z. Jiang, and L. S. Davis, Recognizing actions by shape-motion prototype trees, ICCV’09, pp. 444–451.
[37] J. G. Liu, J. B. Luo, and M. Shah, Recognizing realistic actions from videos ’in the wild’, CVPR’09, pp. 1996–2003.
[38] C. C. Loy, T. Xiang, and S. Gong, Modelling activity global temporal dependencies using time delayed probabilistic graphical model, ICCV’09.
[39] M. Marszalek, I. Laptev, and C. Schmid, Actions in context, CVPR’09.
[40] D. Mateus, R. Horaud, D. Knossow, F. Cuzzolin, and E. Boyer, Articulated shape matching using Laplacian eigenfunctions and unsupervised point registration, CVPR’08.
[41] C. Nandini and C. N. Ravi Kumar, Comprehensive framework to gait recognition, Int. J. Biometrics 1 (2008), no. 1, 129–137.
[42] B. North, A. Blake, M. Isard, and J. Rittscher, Learning and classification of complex dynamics, PAMI 22 (2000), no. 9.
[43] M. Piccardi and O. Perez, Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition, VS’07, pp. 1–8.
[44] K. Rapantzikos, Y. Avrithis, and S. Kollias, Dense saliency-based spatiotemporal feature points for action recognition, CVPR’09, pp. 1454–1461.
[45] K. K. Reddy, J. Liu, and M. Shah, Incremental action recognition using feature-tree, ICCV’09.
[46] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. H. S. Torr, Randomized trees for human pose detection, CVPR’08.
[47] K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, CVPR’08.
[48] M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NIPS’04.
[49] H. J. Seo and P. Milanfar, Detection of human actions from a single example, ICCV’09.
[50] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, Adjustment learning and relevant component analysis, ECCV'02.
[51] Q. F. Shi, L. Wang, L. Cheng, and A. Smola, Discriminative human action segmentation and recognition using semi-Markov model, CVPR’08.
[52] A. J. Smola and S. V. N. Vishwanathan, Hilbert space embeddings in dynamical systems, IFAC'03, pp. 760–767.
[53] J. Sun, X. Wu, S. C. Yan, L. F. Cheong, T. S. Chua, and J. T. Li, Hierarchical spatio-temporal context modeling for action recognition, CVPR’09, pp. 2004–2011.
[54] A. Sundaresan, A. K. Roy Chowdhury, and R. Chellappa, A hidden Markov model based framework for recognition of humans from gait sequences, ICIP’03, pp. II: 93–96.
[55] I. W. Tsang and J. T. Kwok, Distance metric learning with kernels, ICANN'03.
[56] Y. Wang and G. Mori, Max-margin hidden conditional random fields for human action recognition, CVPR’09, pp. 872–879.
[57] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, Distance metric learning with applications to clustering with side information, NIPS'03.
[58] B. Yao and S. C. Zhu, Learning deformable action templates from cluttered videos, ICCV'09.
[59] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang, Action detection in complex scenes with spatial and temporal ambiguities, ICCV’09.
[60] J. S. Yuan, Z. C. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR’09, pp. 2442–2449.
[61] Z. Zhang, Learning metrics via discriminant kernels and multidimensional scaling: Toward expected euclidean representation, ICML’03.
[62] G. de Cooman, F. Hermans, M. Zaffalon, and A. Antonucci, Epistemic irrelevance in credal nets: the case of imprecise Markov trees, International Journal of Approximate Reasoning 51 (2010) 1029-1052.
[63] A. Antonucci, A. Benavoli, M. Zaffalon, G. de Cooman, and F. Hermans, Multiple model tracking by imprecise Markov trees, Fusion 2009, pp. 1767-1774.
[64] J. De Bock and G. de Cooman, State sequence prediction in imprecise hidden Markov models, ISIPTA 2011 (submitted).
[65] L. M. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (1989) 257-286.
[66] F. G. Cozman, Credal networks, Artificial Intelligence 120 (2000) 199-233.
[67] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. of the Royal Statistical Society B 39 (1977) 1-38.
[68] J. M. Wang et al., Gaussian process dynamical model, NIPS'06, 18:1441-1448, 2006.
[69] L. Ralaivola et al., Dynamical modeling with kernels for nonlinear time series prediction, NIPS'03, 16, 2003.
[70] I. Laptev and T. Lindeberg, Local descriptors for spatio-temporal recognition, ICCV'03.
[71] R. G. Baraniuk and M. B. Wakin, Random projections for manifold learning, NIPS'07.
[72] R. G. Baraniuk and M. B. Wakin, Random projections of smooth manifolds, Found. Comput. Math. 9 (2009) 51-77.
[73] E. Candès and T. Tao, Near optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. Inform. Theory 52 (2006) 5406-5425.
[74] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, A simple proof of the restricted isometry property for random matrices, Constructive Approximation (2007).
[75] E. Candès, The restricted isometry property and its implications for compressed sensing, Comptes Rendus de l'Académie des Sciences (2008).
[76] R. Calderbank, S. Jafarpour, and R. Schapire, Compressed learning: universal sparse dimensionality reduction and learning in the measurement domain, Technical report, 2009.
[77] B. Kulis and K. Grauman, Kernelized locality-sensitive hashing for scalable image search, ICCV'09.
[78] B. Schölkopf, A. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299-1319.
[79] A. Rahimi and B. Recht, Random features for large-scale kernel machines, NIPS'07.
[80] L. de Campos, J. Huete, and S. Moral, Probability intervals: a tool for uncertain reasoning, Int. J. Uncertainty Fuzziness Knowledge-Based Syst., 167–196, 1994
[81] I. Levi, The enterprise of knowledge, MIT Press, 1980
[82] P. Walley, Statistical reasoning with imprecise probabilities, Chapman and Hall, 1991
[83] G. Shafer, A mathematical theory of evidence, Princeton University Press, 1976
[84] A. P. Dempster, Upper and lower probabilities generated by a random closed interval, Annals of Mathematical Statistics, Vol. 39, 957-966, 1968
[85] P. Kohli and Ph. Torr, Efficiently solving dynamic Markov random fields using graph cuts, Proc. of ICCV'05, Vol. 2, pp. 922-929, 2005
[86] E. Miranda and G. de Cooman, Marginal extension in the theory of coherent lower previsions, International Journal of Approximate Reasoning, Vol. 46(1), pp. 188-225, 2007
[87] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, Proc. CVPR'08.
[88] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large margin methods for structured and interdependent output variables, JMLR 6 (2005) 1453-1484.
[89] G. Bakır, B. Taskar, T. Hofmann, B. Schölkopf, A. Smola, and S. V. N. Vishwanathan, Predicting Structured Data, MIT Press, 2007.
[90] Y.M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, S. Thrun, Multi-view Image and ToF Sensor Fusion for Dense 3D Reconstruction, Proc. of 3DIM 2009.
[91] C. Toth, N. Markiel, and D. Brzezinska, An algorithm for automated feature extraction from laser ranging data, ASPRS PECORA Conference, Denver, CO, November 19-21, 2008.
[92] A. Kolb, E. Barth, R. Koch and R. Larsen, Time-of-flight cameras in computer graphics, Computer Graphics Forum, Vol. 29(1), pages 141-159, 2010.
[93] Y.-W. Chee, T.-L. Yang, Vision and LIDAR Feature Extraction, Tech report, Cornell University
[94] M.C.M. Troffaes, Decision making under uncertainty using imprecise probabilities, International Journal of Approximate Reasoning 45 (2007) 17-29.
[95] G. de Cooman and M. Zaffalon, Updating beliefs with incomplete observations, Artificial Intelligence 159 (2004) 75-125.
[96] M. Zaffalon and E. Miranda, Conservative Inference Rule for Uncertain Reasoning under Incompleteness, Journal Of Artificial Intelligence Research 34 (2009) 757-821.
[97] G. de Cooman and E. Miranda, Forward irrelevance, Journal Of Statistical Planning And Inference 139 (2009) 256-276.
[98] E. Miranda and G. de Cooman. Marginal extension in the theory of coherent lower previsions, International Journal Of Approximate Reasoning 46 (2007) 188-225.
[99] G. de Cooman and F. Hermans, Imprecise probability trees: Bridging two theories of imprecise probability, Artificial Intelligence 172 (2008) 1400-1427.
[100] G. de Cooman, F. Hermans and E. Quaeghebeur, Imprecise Markov chains and their limit behaviour, Probability in the Engineering and Informational Sciences 23 (2009) 597-635.
[101] M. Zaffalon, The naive credal classifier, Journal of Statistical Planning and Inference 105 (2002) 5-21.
[102] G. Corani and M. Zaffalon, JNCC2: An extension of naive Bayes classifier suited for small and incomplete data sets, Environmental Modelling & Software 23 (2008) 960-961.
[103] G. Corani and M. Zaffalon, JNCC2: The Java Implementation of Naive Credal Classifier 2, Journal of Machine Learning Research 9 (2008) 2695-2698.

List of project proposals