Tensorial modeling of dynamical systems for Gait and Activity Recognition
1 - Previous Research Track Record
The Proposer: Dr Fabio Cuzzolin graduated in 1997 from the University of Padua (Universitas Studii Paduani, founded in 1222 and the seventh oldest university in the world) with a laurea magna cum laude in Computer Engineering and a Master’s thesis on “Automatic gesture recognition”. He received the Ph.D. degree from the same institution in 2001, for a thesis entitled “Visions of a generalized probability theory”. He was first a Visiting Scholar at the ESSRL laboratory at Washington University in St. Louis, currently 12th in the US universities ranking. He was later appointed fixed-term Assistant Professor at the Politecnico di Milano, Milan, Italy (consistently recognized as the best Italian university), then moved as a Postdoctoral Fellow to the UCLA Vision Lab, University of California at Los Angeles, led by Professor Stefano Soatto. He later received a Marie Curie Fellowship in partnership with INRIA Rhone-Alpes, Grenoble, France. Since September 2008, he has been a Lecturer and an Early Career Fellow with the Department of Computing, School of Technology, Oxford Brookes University, Oxford, U.K.
In addition, Dr Cuzzolin ranked second in the 2007 Senior Researcher national recruitment at INRIA, and has had interviews with or offers from Oxford University, EPFL, Universitat Pompeu Fabra, UCSD, GeorgiaTech, U. Houston, Honeywell Labs, Riya.
Dr Cuzzolin’s research interests span both machine learning applications of computer vision, including gesture and action recognition and identity recognition from gait, and uncertainty modeling via generalized and imprecise probabilities, to which he has contributed by developing a systematic analysis of the geometry of random sets and other uncertainty measures. His scientific activity is therefore strongly interdisciplinary. His scientific productivity is extremely high, as the thirty papers he has published in the last three years attest. Dr Cuzzolin is the author of 53 peer-reviewed scientific publications, 44 of them as first or single author, including two book chapters and 8 journal papers. Several more journal papers are currently under review or revision. His work recently won a best paper award at the Pacific Rim International Conference on Artificial Intelligence (PRICAI’08).
Dr Cuzzolin has recently been elected a member of the Board of Directors of the “Belief Functions and Applications Society”. He has been Guest Editor for “Information Fusion”, and collaborates with several other international journals in both computer vision and probability, such as: the IEEE Tr. on Systems, Man, and Cybernetics, the Int. J. on Approximate Reasoning, Computer Vision and Image Understanding, the IEEE Trans. on Fuzzy Systems, Information Sciences, the Journal of Risk and Reliability, the International Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems, Image and Vision Computing. He has served on the program committees of some 15 international conferences in both imprecise probabilities (e.g. ISIPTA, ECSQARU, BELIEF) and computer vision (e.g. VISAPP). He is a reviewer for BMVC and ECCV. He has co-supervised two Ph.D. and two M.Sc. students and is involved in the supervision of the Oxford Brookes vision group’s cohort of students.
1.1 Proposer’s Related Work.
The proposer has a significant record of research in human motion analysis and recognition. After some early work on gesture recognition, he moved on to study marker-less motion capture in 3D. Using a system of 12 synchronized cameras available at the Politecnico di Milano, silhouettes of the moving person were extracted in all cameras to recover the volume occupied by the body. The final purpose was to recognize actions and behaviors from sequences of volumes using Markov models [19]. Most relevant to the current proposal, he has recently explored the use of bilinear and multilinear models for identity recognition from gait [13, 15], a relatively new but promising branch of biometrics. In this context, performance is influenced by factors as diverse as viewpoint, emotional state, illumination, presence of clothes/occlusions, etcetera, whose effect can be modeled through tensor algebra. He has recently published a book chapter [15] with IGI on this topic. In a distinct but closely related line of work, he is currently exploring manifold learning techniques for dynamical models representing (human) motions, in order to learn the optimal metric of the space they live in and maximize classification performance. Another book chapter [16] collecting his preliminary results on the topic has recently been accepted by Springer. In the wider field of articulated motion analysis he has published several papers on spectral motion capture techniques [36], focusing in particular on the crucial issue of how to select and map eigenspaces generated by two different shapes in order to track 3D points on their surfaces or consistently segment body parts along sequences of voxel sets [18], as a preprocessing step to action recognition.
1.2 Proposer’s Other Scientific Contributions.
Dr Cuzzolin is also recognized as one of the most prominent experts in the field of non-additive probabilities and belief functions. He has been recently elected member of the Board of Directors of the newly founded “Belief Functions and Applications Society”, and is member of the “Society for Imprecise Probabilities and Their Applications”.
His most important contribution in the field of uncertainty theory and imprecise probabilities is a general geometric approach to uncertainty measures, in which probabilities, possibilities and belief functions can all be represented as points of a Cartesian space and analyzed there [14]. Evidence aggregation operators (the analogues of Bayes’ rule in the Bayesian formalism) can also be seen as geometric operators. The issues of how to approximate a belief function with an additive probability or a possibility measure, or of which probability transformation is appropriate for decision making, can all be solved by geometric means.
In his recent award-winning paper [17], Dr Cuzzolin has also investigated alternative combinatorial foundations for the theory of belief functions, and their algebraic properties. He is currently working on the generalization of the total probability theorem to finite random sets, as a key contribution to the field of non-additive probabilities. He is in the process of finalizing a book entitled “The geometry of uncertainty” which will collect all his contributions to the mathematics of uncertainty.
1.3 Past Collaborations and Professional Links.
Dr Cuzzolin has acquired considerable international experience by working for some of the most prominent research laboratories in both the US and Europe. He has given seminars and invited talks at several world-leading institutions such as MIT, EPFL, GeorgiaTech, Microsoft Research Europe, INRIA. His network of collaborations with groups of researchers in both Europe and the United States is large and expanding.
Dr Cuzzolin is currently arranging meetings with several groups of researchers all around Europe to set up active collaborations, in support of his goal of establishing a group of five to ten people within a few years, with a view to reaching a professorial position in the medium term. He is in talks with M. Zaffalon (IDSIA, Switzerland) for a STREP on imprecise Markov chains for gesture recognition. He is setting up with INRIA’s Radu Horaud, Alejandro Frangi (Pompeu Fabra) and Technion (R. Kimmel, M. Bronstein) an interdisciplinary Future Emerging Technology (FET) EU proposal on large scale manifold learning, with applications to scene understanding. He is discussing a collaborative project on uncertainty theory at UK level with J. Lawry (Head of Department of Bristol’s Engineering Mathematics) and F. Coolen (Durham’s Dept of Statistics), and exploring the opportunity of a European Network of Excellence in the same field. He plans to apply for the European Research Council Starting Grant in October 2010. Dr Cuzzolin enjoys personal links with several world class companies (many of them with research divisions in the UK) such as Microsoft Research, Honeywell Labs (I. Cohen), Boston’s MERL (M. Brand, S. Ramalingam), GE (G. Doretto), Google (A. Bissacco), Riya.
1.4 The host organization: Oxford Brookes University, School of Technology, (OBU).
The department has around 30 academic staff. These include, in computer graphics, Prof. David Duce (co-chair of the Eurographics 2003 and 2006 conferences) and Bob Hopgood OBE; Prof. M.K. Pidcock, a world leader in Electrical Impedance Tomography; and, in AI and image processing, Prof. William Clocksin. The Computer Science department had the following breakdown in the recent RAE: 4* 15%, 3* 35%, 2* 35% and 1* 15%, which means that 85% of output was rated 2* or above (internationally recognised or better) and that no research output was considered unclassified. The School of Technology has recently established a new doctoral training programme with the theme of intelligent transport systems (http://tech.brookes.ac.uk/research/), which includes many computer vision problems, with a set of courses and associated infrastructure which will be directly beneficial to this project.
Dr Fabio Cuzzolin belongs to the Oxford Brookes vision group founded by Professor Philip Torr (http://cms.brookes.ac.uk/research/visiongroup/), which comprises some seventeen staff, students and post-docs who will add value to this project. Professor Torr was awarded the Marr Prize, the most prestigious prize in computer vision, in 1998. Members of the group have recently received awards at 4 other conferences, including best paper at CVPR 08 and an honourable mention at NIPS, the top machine learning conference. The group was mentioned four times in the UKCRC Submission to the EPSRC International Review, Sep. 2006. It enjoys ongoing collaborations with companies such as 2d3, Vicon Life, Yotta, Microsoft Research Europe, Sharp Laboratories Europe, Sony Entertainments Europe. The group’s work with the Oxford Metrics Group in a Knowledge Transfer Partnership 2005-9 won the National Best Knowledge Transfer Partnership of the year at the 2009 awards, sponsored by the Technology Strategy Board, selected out of several hundred projects. Oxford Brookes also has close links with Oxford University, with both the Active Vision Group and Prof. Zisserman’s Visual Geometry group, including a joint EPSRC grant and EU collaboration as well as co-supervision. Members of all the groups regularly attend each other’s reading groups and seminars.
Dr Cuzzolin holds an Early Career Fellow position with minimal undergraduate teaching duties and hence has sufficient time to conduct the research listed in this proposal.
2 - Proposed research and its context
2.1 Background
Topic of research.
Biometrics such as face, iris, or fingerprint recognition have received growing attention in the last decade, as automatic identification systems for surveillance and security have started to enjoy widespread diffusion. They suffer, however, from two major limitations: they cannot be used at a distance, and require user cooperation, assumptions impractical in real-world scenarios.
Interestingly, psychological studies show that people are capable of recognizing their friends just from the way they walk, even when their “gait” is poorly represented by point light display [11]. Gait has several advantages over other biometrics, as it can be measured at a distance, is difficult to disguise or occlude, can be identified even in low-resolution images, and is non-cooperative in nature. Furthermore, gait and face biometrics can be easily integrated for human identity recognition [58, 44].
Despite its attractive features, though, gait identification is still far from being ready to be deployed in practice. What limits its adoption in real-world scenarios is the influence of a large number of nuisance or “covariate” factors [31] which affect appearance and dynamics of the gait. These include: walking surface, lighting, camera setup (viewpoint), but also footwear and clothing, objects carried, time of execution, walking speed (see Figure 1). These issues are shared by other applications of motion classification, such as action and activity recognition [50].
Figure 1: An illustration of the different covariate factors affecting identity recognition from gait. Clockwise from top-left: presence of carried objects, clothing, slope, dynamic background, speed, viewpoint (courtesy http://www.gait.ecs.soton.ac.uk/).
State of the art.
Human motion analysis. Two different but complementary sub-problems have been identified since the early days of machine vision analysis of human behavior: estimating the pose and motion of the person (tracking), and classifying this very motion from a sequence of images (recognition). Most attention has been given to the task of discriminating different actions or activities in smart room or human-machine interaction scenarios. Only more recently has human identification from gait started to receive growing attention from the vision community. While action and identity recognition are both instances of motion classification, they differ when it comes to the assumptions on the underlying motion, as identification is traditionally associated with the walking gait.
Vast literature on gait ID. Although it is a quite recent branch of machine vision, the literature on gait identification is already too extensive to be covered here in its entirety [36, 22, 53]. Gait analysis, generally speaking, involves two separate issues: how to represent the shape or appearance of the moving person, and if and how to encode the dynamics of the motion itself. A variety of image or gait signatures have been studied, most of them based on silhouette analysis [9], even though many other approaches have been explored [35]. The issue of tackling the numerous nuisance factors such as clothing or illumination has recently been recognized as central.
Viewpoint as main covariate factor. The most important of those covariate factors is probably viewpoint variation, as in a realistic setup the person to identify steps into the surveyed area from an arbitrary direction [51, 56, 2, 44, 27, 28]. This problem has been studied in the gait ID context by several groups [25]. If a 3D articulated model of the moving person is available, tracking can be used as a pre-processing stage to drive recognition [10, 2, 56]. Model-based 3D tracking, however, is a difficult task. Manual initialization of the model is often required, while optimization in a high-dimensional parameter space suffers from convergence issues. Alternative methods for generating synthetic side [28] or frontal [41] views, view-normalization techniques in a multiple camera framework [44], and view transformation models in the frequency domain [32] have been proposed. Multiple camera views have been used to extract static body parameters [27] and limb lengths [57].
A principled way of tackling covariates is still lacking. Viewpoint is just one of the many covariates which make gait ID a difficult problem. The effects of all such nuisances have not yet been thoroughly investigated, even though some effort has recently been made in this direction. Bouchrika and Nixon [4] have conducted a comparative study of the influence of covariates in gait analysis. Veres et al. [53] have proposed a remarkable predictive model of the time of execution covariate to improve recognition performance. The issue has been approached so far on an empirical basis, i.e., by trying to measure the influence of individual covariate factors. A principled strategy for their treatment has not yet been brought forward.
Tensor modeling. A mathematical formalism general enough to potentially address the fundamental issue of covariate factors in a principled way, however, exists under the name of multilinear or tensorial analysis. Its fundamental assumption is that the various factors mix multilinearly (that is, linearly in each factor when the others are held fixed) to generate the measurements which we observe, in our case the different walking gaits. The problem of recovering those factors given the observations is referred to in the literature as tensor factorization, of which “nonnegative tensor factorization” or NTF [48] is a widely studied variant. Different approaches to tensor factorization [30, 45] have been brought forward, ranging from the PARAFAC model [34] to multi-layer methods [8].
Bilinear models [49], in particular, are among the best studied tensorial models. They can be seen as tools for separating the “style” (expressing nuisance or intrinsic style variation, a covariate factor in our case) and the “content” (the label to classify, identity or action) of the objects to recognize. As a natural generalization, De Lathauwer et al. [29] have proposed to disentangle the different factors in a multilinear mixture or tensor through an extension of conventional singular value decomposition, or Higher-Order (HO) SVD.
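As an illustration of the two models just mentioned, they can be written compactly as follows. The notation is ours and is only meant as a sketch: y^{sc} is an observation with style s and content c, A^s a style-specific matrix, b^c a content vector, D the data tensor with N modes, Z the core tensor, and U_n the orthogonal matrix of left singular vectors of the mode-n flattening of D.

```latex
\begin{align}
  \mathbf{y}^{sc} &= \mathbf{A}^{s}\,\mathbf{b}^{c}
    && \text{(asymmetric bilinear model)} \\
  \mathcal{D} &= \mathcal{Z} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \cdots \times_N \mathbf{U}_N
    && \text{(higher-order SVD)} \\
  \mathcal{Z} &= \mathcal{D} \times_1 \mathbf{U}_1^{\top} \times_2 \mathbf{U}_2^{\top} \cdots \times_N \mathbf{U}_N^{\top}
    && \text{(core tensor)}
\end{align}
```

Here \times_n denotes the mode-n product between a tensor and a matrix. For N = 2 the second line reduces to the ordinary matrix SVD, with the diagonal matrix of singular values playing the role of the core.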
Industrial and societal context.
Government policy and guidelines. With its implications for crime prevention and security, biometrics is obviously a growing business area. This is reflected by the increasing number of government-sponsored initiatives in the area in most advanced economies. As terrorists often hide their faces, gait ID has a clear edge in security scenarios.
The humanID initiative was originally launched by the US defence agency DARPA with a 50 million dollar project of enormous impact: http://infowar.net/tia/www.darpa.mil/iao/HID.htm. The US even funded related research at the University of Southampton from 2000 to 2004, while the EU has recently funded a STREP (http://www.humabio-eu.org/) on combining novel biometrics to enhance security. Similar initiatives have been launched by the Natural Science Foundation of China (NSFC): http://www.dcs.qmul.ac.uk/˜sgg/NSFC RSL/index.html.
EPSRC itself has recognized the growing importance of behavioral biometrics by supporting an International Centre for Advanced Research in Identification Science (ICARIS - Reference: GR/S66671/01). The government at large is supportive of biometric identification and semi-automatic surveillance for security purposes. After the July 2005 bombings the police needed to search all the data from CCTV cameras and mobile phones for terrorists, and much of this was done by human eye. Automating such a process in some sort of video mining scenario would be invaluable. The UK Biometrics Working Group (BWG) is a cross-government group focused on the use of biometric technology across government and Critical National Infrastructure.
Target industries in biometrics and surveillance. Natural target industries for a project on gait identification are companies whose core business is directly in automatic or semi-automatic surveillance. Other companies active in the wider field of biometrics are also likely to be interested in developing a partnership, possibly in the form of a Knowledge Transfer Partnership. Most such companies focus at the moment on cooperative biometrics such as face or iris recognition: investing in behavioral, non-cooperative biometrics before the rest of the market could provide them with a significant competitive advantage. “In both the identity management and security arenas, the use of biometric technology is increasing apace ... the world biometrics market has expanded exponentially. Annual growth is forecast at 33% between the years 2000 and 2010. Europe is expected to have the fastest growing biometrics market by 2010 ... the UK body that represents companies developing these technologies ... has fostered close ties with the UK Border Agency and Home Office.” (Biometrics, 2008). Physicians specialized in gait analysis and sports team labs (e.g., http://www.sportlink.co.uk/) are also examples of commercial entities which could benefit as “end users” of the gait classification techniques proposed here, as walking disorders and other anomalies can in principle be identified as classes of variation of the walking gait. A list of such companies can be found at http://www.univie.ac.at/cga/links.html#Companies, including Vicon (http://www.vicon.com/), with which our group already has a strong link.
Target industries in action and identity recognition. The growing market for action and gesture recognition applications, activity recognition and human-computer interfaces is too big to be described extensively here. People collect a huge amount of data on their smart phones, which raises the issue of organizing and retrieving video data from online or personal repositories. Motion-based video game interfaces constitute another fast growing sector. Microsoft has recently launched its Project Natal, which with its controller-free gaming experience is probably destined to revolutionize the whole field of interactive video games and consoles (see http://www.xbox.com/en-us/live/projectnatal/ for some amazing demos). The Oxford Brookes Vision Group enjoys continuing strong links with Microsoft through its founder Philip Torr, and has recently acquired a range camera (http://en.wikipedia.org/wiki/Range imaging) to kick-start cutting-edge research in motion analysis.
2.2 Research hypotheses and objectives
Research idea. Gait identification has demonstrated potential as a behavioral biometric tool, as the approach is strongly supported by both psychological experiments and a growing wealth of empirical evidence. However, its deployment in real-world scenarios is hindered by the presence of numerous nuisance, covariate factors. Tensor models have been proven in the recent past to be able to describe the influence of such factors, for instance in the context of face recognition [52]. However, video sequences are complex objects. In order to apply tensorial methods to them we need to find a compact representation for image sequences.
Encoding the dynamics of videos or image sequences by means of some sort of dynamical model (such as, for instance, hidden Markov models [38, 46]) has been proven useful in both action recognition [7, 55] and gait identification [47, 5], in situations in which the dynamics is critically discriminative. Furthermore, the actions of interest have to be temporally segmented from a video sequence, while actions of sometimes very different lengths have to be encoded in a homogeneous fashion in order to be compared (“time warping”). Dynamical representations are very effective in coping with time warping or action segmentation [46]. Many researchers have explored the idea of encoding motions via linear [3], nonlinear [40], stochastic [37, 54] or chaotic [1] dynamical systems. Hierarchical models [21] are more suitable when describing complex activities.
In this project, therefore, we propose to develop a novel, general framework for the classification of video sequences (with a focus on the walking gait), based on the application of tensorial decomposition techniques to image sequences represented as realizations of suitable dynamical models. We will build on encouraging results recently obtained by the P.I. [12, 14, 15].
Novelty and contributions.
- A contribution to real-world deployment of gait ID.
The proposed framework will allow us to describe in a principled way the influence of the covariate factors which greatly affect identification from gait, and push towards a more widespread diffusion of gait identification as a commercially viable biometric, in this way contributing to enhance security levels in this country.
- A first comprehensive tensorial framework for dynamical models.
While preliminary results have been published in recent years by Dr Cuzzolin and a few other researchers [48], a coherent framework based on tensor classification of video sequences represented as dynamical models has not yet been brought forward. The impact of this contribution goes beyond the original application to gait identification.
- Natural extension to action and activity recognition.
As they are concerned with the classification of video sequences affected by a number of nuisance factors, the techniques devised in this proposal are in principle extendable to action and activity recognition, with immense potential for commercial exploitation, ranging from content-based video retrieval from repositories such as YouTube, to security and surveillance scenarios, to HMI, to interactive video games, etcetera.
- Advances in tensor modeling.
A number of issues, such as complexity and data representation, arise when applying tensor modeling to difficult problems such as gait identification. Significant contributions to the solution of these problems will potentially be stimulated by the proposed application (see Section 2.3).
Timeliness.
An application close to commercial maturity. The potential for a commercial exploitation of a functioning behavioral biometric system is enormous, as gait identification can be performed at a distance, with low resolution cameras, requiring neither collaboration nor awareness on the subject’s side. The field is mature enough to move from simplified settings to more realistic outdoor tests.
Uncertain times. In the current climate terror threats rank high in the public’s perception and priorities. Robust video classification and mining for semi-automatic surveillance would give an invaluable contribution to national security.
Commercial applications of gesture and action recognition. The Natal project is just one example of the potentially enormous impact of action recognition and markerless motion capture on people’s lives. In this as in many other examples, computer vision finally looks close to moving from research to reliable technology and commercial exploitation.
UK competitiveness may be at risk in behavioral biometrics. More disturbingly, as we argued above, the governments of most advanced economies (including all of the UK’s competitors) are now recognizing the potential of behavioral biometrics in the new context of serious threats to national and public security. Besides some pilot projects funded by EPSRC, more efforts may be necessary in the near future to ensure UK competitiveness in this area.
Goals of the project.
The goal of the project is to design and test a general framework for tensorial modeling of dynamical models. This involves two kinds of challenges. Theoretical issues arise, such as: what classes of dynamical models to use to encode sequences, how to tackle the complexity induced by a realistically large number of factors, how to cope with the dimensionality and sparsity of the data. In parallel, as this framework is designed to push towards the real-world deployment of gait identification, it is paramount to outperform current state-of-the-art algorithms on all the existing public datasets. Moreover, as people can be identified from the way they perform actions other than the classical walking gait, we aim at testing the framework’s ability to support non-gait based identity recognition, following the recognition that upper-body motions actually predominate in the process [39].
As their societal and commercial impact is simply enormous, we intend the framework to be applicable to activity recognition and video mining, and cope with classes of dynamical models able to represent complex activities (e.g., variable length [23] or hierarchical [21] Markov models).
Milestones. The scientific output of the project will be measured primarily in terms of high-impact publications. Feasible targets are Computer Vision and Pattern Recognition (CVPR) 2011 (deadline November 2010) and Neural Information Processing Systems (NIPS) 2011 (deadline June 2011) for publication of the preliminary results. We plan to consolidate the project’s outcome in articles to submit to high-impact journals in computer vision such as IJCV or PAMI.
At project completion a toolbox collecting the routines implementing the different stages of the gait recognition framework will be made available on the internet to all scientists working in either behavioral biometrics, activity recognition or video indexing, in order to disseminate our results. We will conduct thorough tests on all the available public databases with the goal of attaining improvements on the current state of the art on these benchmark test-beds.
A non-exhaustive list includes:
- the Southampton database (http://www.gait.ecs.soton.ac.uk);
- the CMU MoBo database;
- the USF GaitID challenge dataset (http://marathon.csee.usf.edu/GaitBaseline/);
- the CASIA database (http://www.cbsr.ia.ac.cn/english/Gait%20Databases.asp);
- the GeorgiaTech dataset (http://www.cc.gatech.edu/cpl/projects/hid/index.html).
We will explore the opportunity of acquiring our own database later to be made public, or run tests on surveillance test-beds as well. In case the results are in line with our expectations we will apply for a “proof of concept” scheme in order to test the commercial viability of our algorithms.
2.3 Programme and methodology
2.3.1 Methodology
Preliminary results on bilinear modeling. The proposer [12] has recently investigated bilinear models [49] as tools for view-invariant gait ID, as they make it possible, for instance, to recognize the identity of a known person from a walking gait captured from an unknown viewpoint. In [14] he has proposed a three-layer model in which each motion sequence (observation) depends on three factors: identity, action type, and view. A bilinear model was trained from those observations by considering two such factors at a time. Hidden Markov models of fixed dynamics were used to cluster each sequence into a fixed number of poses, whose stacked vector would eventually represent the input motion as a whole. A bilinear model for such a set of observation vectors was learnt, and new sequences with a different style (e.g., viewpoint) were classified in terms of identity or action.
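To make the style/content separation concrete, the following is a minimal sketch of how an asymmetric bilinear model can be fitted with a single SVD, on synthetic data. It is an illustration under our own assumptions, not the implementation used in [12, 14]: the function names, the toy dimensions and the random data are all hypothetical, and the classification step assumes the style (e.g., the viewpoint) of the test observation is already known. Handling a genuinely unseen style requires an additional iterative estimation step that is not shown here.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, n_styles, J):
    """Fit Y ~ A @ B by truncated SVD.

    Y stacks the K-dimensional observations of each style on top of each
    other, one column per content class: shape (n_styles * K, n_contents).
    Returns style matrices A of shape (n_styles, K, J) and content
    vectors B of shape (J, n_contents).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :J] * s[:J]          # style-specific linear maps, stacked
    B = Vt[:J, :]                 # one content vector per column
    K = Y.shape[0] // n_styles
    return A.reshape(n_styles, K, J), B

def classify_content(y, A_s, B):
    """Content label of observation y, assuming its style matrix A_s is known."""
    b, *_ = np.linalg.lstsq(A_s, y, rcond=None)   # least-squares content vector
    return int(np.argmin(np.linalg.norm(B - b[:, None], axis=0)))

# Hypothetical toy problem: 3 styles (views), 4 contents (identities).
rng = np.random.default_rng(0)
S, C, K, J = 3, 4, 10, 4
A_true = rng.normal(size=(S, K, J))
B_true = rng.normal(size=(J, C))
Y = np.concatenate([A_true[s] @ B_true for s in range(S)], axis=0)

A, B = fit_asymmetric_bilinear(Y, S, J)
recon_error = np.linalg.norm(A.reshape(S * K, J) @ B - Y)

# A slightly perturbed observation of content 2 seen in style 1.
y_test = Y[K:2 * K, 2] + 1e-6 * rng.normal(size=K)
pred = classify_content(y_test, A[1], B)
```

In this sketch each column of Y would correspond, in the gait setting, to one identity, and each block of K rows to one viewpoint; the observation vectors themselves would be the stacked pose vectors described above.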
Tensor modeling. Bilinear modeling can be naturally extended in order to address the issue of nuisance factors in gait or action recognition by means of multilinear or “tensorial” models, appealing mathematical descriptions of the way such factors interact in a mixed training set. Most tensorial approaches share a common structure. Any training set of video sequences containing walking gaits performed by different people, under different conditions such as viewpoint, illumination, clothing, etcetera, is represented as a multi-dimensional array or “tensor” D. Just as matrices can be decomposed into orthogonal column and row subspaces by means of singular-value decomposition (SVD), tensors can be decomposed into a core tensor multiplied by N orthogonal matrices, one for each dimension of the tensor itself. This can be done, for instance, by “flattening” [52] the tensor along its i-th dimension to get a matrix, and subsequently applying standard SVD to such a matrix. In the case of interest to us, one of the tensor’s dimensions will be the label (identity/action) to classify, while each of the others will be related to one of the covariate factors. When a new observation is available (e.g., a test video), it is possible to project it onto the identity/action subspace and classify it there by means of any off-the-shelf classification algorithm.
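The flatten-then-SVD procedure described above can be sketched in a few lines of NumPy. This is a hedged illustration on random data, not the project’s implementation: the tensor layout (people x views x features), the helper names and the nearest-row classifier are our own assumptions, and the projection step follows the general multilinear projection idea rather than any specific cited algorithm.

```python
import numpy as np

def unfold(T, mode):
    """Flatten tensor T along `mode` into a matrix with T.shape[mode] rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Mode-n product: multiply every mode-`mode` fibre of T by matrix M."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hosvd(D):
    """Higher-order SVD: one orthogonal matrix per mode, plus a core tensor."""
    U = [np.linalg.svd(unfold(D, n), full_matrices=False)[0]
         for n in range(D.ndim)]
    Z = D
    for n, Un in enumerate(U):
        Z = mode_dot(Z, Un.T, n)
    return Z, U

def classify_identity(d, Z, U):
    """Project feature vector d onto the identity subspace under each
    candidate view and return the best-matching training identity."""
    T = mode_dot(mode_dot(Z, U[1], 1), U[2], 2)   # (identity coeffs, views, features)
    best = (np.inf, -1)
    for v in range(T.shape[1]):
        c = d @ np.linalg.pinv(T[:, v, :])        # identity coefficients for view v
        dists = np.linalg.norm(U[0] - c, axis=1)  # compare with each person's row
        p = int(np.argmin(dists))
        best = min(best, (float(dists[p]), p))
    return best[1]

# Hypothetical training tensor: 5 people, 3 views, 8-dimensional features.
rng = np.random.default_rng(1)
D = rng.normal(size=(5, 3, 8))
Z, U = hosvd(D)

# The decomposition is exact when no mode is truncated.
recon = Z
for n, Un in enumerate(U):
    recon = mode_dot(recon, Un, n)
recon_error = np.linalg.norm(recon - D)

# Noisy observation of person 3 under view 1; the view is not given to the classifier.
person = classify_identity(D[3, 1] + 1e-3 * rng.normal(size=8), Z, U)
```

With more covariate factors the tensor simply gains one mode per factor; in practice the mode matrices would be truncated to keep the decomposition tractable, which is one of the complexity issues the proposal sets out to study.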
Tensor modeling of dynamical models. In order to be described by a single tensorial model, however, all observations have to form vectors of the same size. Therefore, in order to apply tensor modeling to image sequences, we need to encode such sequences as vectors of the same length. An effective way to do this is to use parameter identification algorithms to represent each sequence of feature measurements extracted from the images as a dynamical model. An option widely explored in the past to represent video is to use hidden Markov models (see Figure 2) to encode motions in a compact way and cope with issues such as time warping and temporal segmentation.
Figure 2: A finite state hidden Markov model can be used to effectively encode a gait sequence as an observation vector.
Hidden Markov models represent actions such as the walking gait as a finite state dynamical model, whose parameters can be collected in a single observation vector of a fixed size for each sequence. However, even simple linear dynamical models (LDS) such as ARMA models have delivered impressive results in representing dynamic textures [20]. Indeed, a variety of different classes of models have been recently used for motion recognition [3, 40, 37, 54, 1]. More sophisticated ones, such as hierarchical [21] or variable length Markov models [23], need to be considered when representing complex activities. In any case, encoding videos by means of dynamical models provides a solution to the complexity/convergence issues that arise in tensor decomposition when representing images or videos as raw collections of pixels [52].
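To make the fixed-size encoding concrete, the sketch below fits a plain vector autoregressive (AR) model to a feature sequence by least squares and stacks its coefficients into one vector. The AR model is used here only as the simplest stand-in for the HMM, LDS or ARMA models discussed above; the function name ar_encoding, the model order and the random toy sequences are assumptions made for illustration.

```python
import numpy as np

def ar_encoding(features, order=2):
    """Fit x_t ~ sum_k A_k x_{t-k} by least squares and return the stacked coefficients."""
    features = np.asarray(features, dtype=float)
    n_frames, dim = features.shape
    # Each regressor row holds the `order` previous frames, most recent first.
    past = np.hstack([features[order - k:n_frames - k] for k in range(1, order + 1)])
    future = features[order:]
    coeffs, *_ = np.linalg.lstsq(past, future, rcond=None)
    # The vector length is order * dim * dim, whatever the sequence length.
    return coeffs.ravel()

# Two hypothetical feature sequences of different duration (40 and 65 frames).
rng = np.random.default_rng(0)
short_seq = rng.standard_normal((40, 3))
long_seq = rng.standard_normal((65, 3))
vec_short = ar_encoding(short_seq)
vec_long = ar_encoding(long_seq)
print(vec_short.shape, vec_long.shape)  # both (18,)
```

Because the two vectors have the same length regardless of the number of frames, sequences encoded in this way can be stacked into a single training tensor.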
Image feature representation. Historically, silhouettes have often (but by no means always [35]) been used to encode the shape of the walking person along the sequence, but they are widely criticized for their sensitivity to noise and for requiring a solution to the (inherently ill-defined) background subtraction problem. With a view to the real-world deployment of behavioral biometrics, it is essential to move beyond silhouette-based representations, as a crucial step to improve the robustness of the recognition process. One interesting feature descriptor, called “action snippets” [43], is based on motion and shape extraction within rectangular bounding boxes which, unlike silhouettes, can be reliably obtained in most scenarios by using person detectors [6] or trackers [19]. Our final goal is to adopt a discriminative feature selection stage, such as the one proposed in [42], where discriminative features are selected from an initial bag of HOG-based descriptors. In this respect, the expertise of the Oxford Brookes vision group in this area will be extremely valuable to the final success of the project.
A general framework. The overall framework for tensor classification of videos (as dynamical models) is illustrated in Figure 3.
Figure 3: The proposed framework.
Again, it has two key elements: the representation of observations depending on multiple factors as tensors, and the encoding of video sequences as sufficiently sophisticated dynamical models. Tensorial models are not an off-the-shelf tool that can be mindlessly thrown at a problem. Several issues, but also potentially very interesting developments, arise from their application, and are essential to the final success of this project.
Theoretical advances on tensor modeling.
Convergence of the algorithms. Most algorithms for multilinear modeling [29, 49] are based on repeated SVD or EM optimization, whose numerical convergence issues may hinder the reliability of the resulting recognition system.
Number of factors. In identity or activity recognition the number of factors involved can be very large. Trying to describe them all can lead to model overfitting (too large a core tensor to estimate). A selection scheme to design a few “most relevant” factors will be desirable.
Size and distribution of the training set. In bilinear modeling [49] it is assumed that the data are equally distributed w.r.t. style and content. This is often not the case in gait identification, stimulating the search for an extension of such methods to sparse training sets.
Incremental learning. In surveillance scenarios the training set is not fixed, but evolves in time as new videos become available. It is undesirable to learn a new tensorial model from scratch every time new observations are available. Online SVD is available for standard SVD: its extension to HOSVD will be a clear target of this project.
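For reference, the matrix case mentioned above can be sketched as follows: given a thin SVD of the current data matrix, a newly arrived observation (one extra column) is absorbed by decomposing only a small (r+1)×(r+1) matrix rather than the whole training set. This is a standard additive-update construction shown under our own assumptions; the function name svd_append_column and the toy sizes are hypothetical, and the extension to HOSVD targeted by the project is not covered by this sketch.

```python
import numpy as np

def svd_append_column(u, s, vt, column, tol=1e-12):
    """Update a thin SVD  M = u @ diag(s) @ vt  to that of [M, column]."""
    proj = u.T @ column              # part of the new column inside the current subspace
    residual = column - u @ proj     # part orthogonal to the current subspace
    rho = np.linalg.norm(residual)
    new_dir = residual / rho if rho > tol else np.zeros_like(residual)

    # Small matrix whose SVD rotates the old factors into the new ones.
    rank = s.size
    k = np.zeros((rank + 1, rank + 1))
    k[:rank, :rank] = np.diag(s)
    k[:rank, -1] = proj
    k[-1, -1] = rho
    uk, sk, vkt = np.linalg.svd(k)

    u_new = np.hstack([u, new_dir[:, None]]) @ uk
    vt_ext = np.zeros((rank + 1, vt.shape[1] + 1))
    vt_ext[:rank, :-1] = vt
    vt_ext[-1, -1] = 1.0
    return u_new, sk, vkt @ vt_ext

# Hypothetical example: 3 stored observations of size 8, then one new observation.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 3))
new_col = rng.standard_normal(8)
u, s, vt = np.linalg.svd(M, full_matrices=False)
u2, s2, vt2 = svd_append_column(u, s, vt, new_col)
updated = u2 @ np.diag(s2) @ vt2
print(np.allclose(updated, np.hstack([M, new_col[:, None]])))
```

The sketch keeps every singular direction; in a long-running surveillance setting one would presumably truncate the smallest singular values after each update to bound the cost.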
Extension to tensorial observations. Tensor modeling assumes that observations come in the form of vectors, which is why we need to encode video sequences as stacked vectors. An alternative is provided by tensorial extensions of linear decomposition algorithms such as SVD [24], allowing us to work with image sequences as rank 3 tensors. Such a bi-tensorial model would probably represent both a more appropriate description of video training sets and an extremely significant theoretical contribution.
2.3.2 Programme of work and milestones
The overall outcomes of the project consist of: state-of-the-art results in at least gait identification; a wealth of dedicated code (in the form of a ready-to-use Matlab toolbox) available from a web site; scientific output in the form of submissions to top journals in machine learning and vision, such as the IEEE Transactions on PAMI, the International Journal of Computer Vision, or Machine Learning. More details on the work programme are given below.
Stage 1: first year. The target of the first year of the project is to deliver a first working version of the general framework depicted in Figure 3, and obtain significant preliminary results to be submitted to major vision and machine learning conferences such as NIPS or ICCV/CVPR towards the end of 2011. The results will have to demonstrate the potential of the framework. This will be the task of the PI, who will work on his own for the entire first year of the project. In detail:
- the different existing techniques for positive tensor factorization will be analyzed and tested;
- the representation of video sequences as dynamical models of different classes (e.g. LDS, NLDS, HMM, AR, ARMA [55]) will be explored, and different options concerning the vector encoding of models considered;
- the framework will be tested on all public datasets for gait identification: the results will be fed back to the activities at 1.1 and 1.2 to redirect the research if necessary.
Stage 2: second year. In the second year the PI will be complemented by a Research Assistant (RA) for eleven months. The RA will take care of the bulk of the experimentation, the coding, the design of the dissemination web site, and data collection. Project management is expected to be relatively simple for a short-term project with just two people involved, and will be taken care of by the PI himself. The PI has been involved in the past in the co-supervision of two PhD students and the supervision of many MSc ones. For the first nine months of this second year the PI will mostly dedicate himself to the supervision of the RA’s work. In parallel, he will tackle the theoretical aspects of tensor modeling that are crucial to a further development/more extensive application of the framework. In the last three months the PI will allocate more of his time to writing up the final reports and making sure that the overall goals of the project are met. In detail:
- in order to extend the scope of the framework, the issues discussed in Section 2.3.1 need to be tackled: this will be the main task of the PI during the second year;
- crucial to the applicability of the framework to activity recognition and video retrieval, for instance in surveillance scenarios, is the selection of a flexible enough class of dynamical models (such as Hierarchical LDS and HMMs or VLMM): non-gait based ID recognition will be explored;
- in parallel, the RA will extend the tests to action/activity recognition datasets and put to the test the theoretical advances achieved in the second year.
2.4 Relevance to academic beneficiaries
As the proposed project spans both theory and application, the proposed research is likely to affect not only the growing gait ID community and the already large action recognition one, but also the way people treat classification in the presence of nuisance factors in vision and elsewhere.
Impact on gait ID. While the usefulness of tensor modeling has been explored in other fields of computer vision, such as face recognition [52] or medical imaging [26], its potential for addressing the covariate issue in gait ID has not yet been recognized. This is probably due to the fact that the problem requires much more than a straightforward application of off-the-shelf algorithms. If the results match our expectations, this could be a significant step towards the deployment of behavioral biometrics in real-world scenarios.
Impact on activity recognition. The proposed methodology naturally applies to contiguous fields involving classification of video sequences, e.g., action and activity recognition [50] and video retrieval for security purposes, as the very same nuisance factors which affect gait identification are present there too.
Impact on theory of tensor modeling. The algorithmic and theoretical developments generated by the proposed research are likely to have consequences for tensor modeling itself. Issues such as algorithmic convergence, data dimensionality, reliability of training observations, extension to non-vectorial observations are all areas on which the proposed research could deliver a significant contribution.
References
[1] S. Ali, A. Basharat, and M. Shah, Chaotic invariants for human action recognition, ICCV’07, pp. 1–8.
[2] B. Bhanu and J. Han, Individual recognition by kinematic-based gait analysis, ICPR’02, vol. III, pp. 343–346.
[3] A. Bissacco, A. Chiuso, and S. Soatto, Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport, IEEE Trans. PAMI 29 (2007), no. 11, 1958–1972.
[4] I. Bouchrika and M. Nixon, Exploratory factor analysis of gait recognition, AFGR’08, pp. 1–6.
[5] N.L. Carter, D. Young, and J.M. Ferryman, Supplementing Markov chains with additional features for behavioural analysis, AVSBS’06, pp. 65–65.
[6] A. Casile and M.A. Giese, Critical features for the recognition of biological motion, J Vision 5, 348–360.
[7] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, CVPR’09, pp. 1932–1939.
[8] A. Cichocki, R. Zdunek, R. Plemmons, and S. Amari, Novel multi-layer nonnegative tensor factorization with sparsity constraints, LNCS, vol. 4432, 2007, pp. 271–280.
[9] R.T. Collins, R. Gross, and J.B. Shi, Silhouette-based human identification from body shape and gait, AFGR’02, pp. 351–356.
[10] D. Cunado, J.M. Nash, M.S. Nixon, and J.N. Carter, Gait extraction and description by evidence-gathering, AVBPA’99, pp. 43–48.
[11] J. Cutting and L. Kozlowski, Recognizing friends by their walk: Gait perception without familiarity cues, Bull. Psychon. Soc. 9 (1977), 353–356.
[12] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, CVPR’06, vol. 2, pp. 1701–1708.
[13] F. Cuzzolin, A geometric approach to the theory of evidence, IEEE Trans. SMC-C 38 (2008), no. 4, 522–534.
[14] F. Cuzzolin, Multilinear modeling for robust identity recognition from gait, Behavioral Biometrics for Human Identification: Intelligent Applications, IGI, 2009.
[15] F. Cuzzolin, Manifold learning for multi-dimensional autoregressive dynamical models, Machine Learning for Vision-based Motion Analysis, Springer, 2010.
[16] F. Cuzzolin, Three alternative combinatorial formulations of the theory of evidence, Intelligent Decision Analysis (2010).
[17] F. Cuzzolin, D. Mateus, D. Knossow, E. Boyer, and R. Horaud, Coherent laplacian protrusion segmentation, CVPR’08, pp. 1–8.
[18] F. Cuzzolin, A. Sarti, and S. Tubaro, Action modeling with volumetric data, ICIP’04, vol. 2, pp. 881–884.
[19] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR’05, pp. 886–893.
[20] G. Doretto, A. Chiuso, Y.-N. Wu, and S. Soatto, Dynamic textures, IJCV 51 (2003), no. 2, 91–109.
[21] S. Fine, Y. Singer, and N. Tishby, The hierarchical hidden Markov model: Analysis and applications, Mach. Learn. 32 (1998), no. 1, 41–62.
[22] D. Gafurov, A survey of biometric gait recognition: Approaches, security and challenges, NIK 2007.
[23] A. Galata, N. Johnson, and D. Hogg, Learning variable-length Markov models of behavior, CVIU 81 (2001), no. 3, 398–413.
[24] L. Grasedyck, Hierarchical singular value decomposition of tensors, SIAM J. Matrix Anal. & Appl. 31 (2010), no. 4, 2029–2054.
[25] J. Han, B. Bhanu, and A.K.R. Chowdhury, A study on view-insensitive gait recognition, ICIP’05, vol. III, pp. 297–300.
[26] C. Hoogendoorn, F.M. Sukno, S. Ordas, and A.F. Frangi, Bilinear models for spatio-temporal point distribution analysis, IJCV 85 (2009), no. 3, 237–252.
[27] A.Y. Johnson and A.F. Bobick, A multi-view method for gait recognition using static body parameters, AVBPA’01, pp. 301–311.
[28] A. Kale, A.K. Roy Chowdhury, and R. Chellappa, Towards a view invariant gait recognition algorithm, AVSBS’03, pp. 143–150.
[29] L. De Lathauwer, B. De Moor, and J. Vandewalle, Multilinear singular value decomposition, SIAM Journal of Matrix Analysis and Applications 21 (2000), no. 4, 1253–1278.
[30] H. Lee, Y.-D. Kim, A. Cichocki, and S. Choi, Nonnegative tensor factorization for continuous EEG classification, Int. J. of Neural Systems 17 (2007), no. 4, 305–317.
[31] X.L. Li, S.J. Maybank, S.J. Yan, D.C. Tao, and D.J. Xu, Gait components and their application to gender recognition, IEEE Trans. on SMC - C 38 (2008), no. 2, 145–155.
[32] Y.S. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, and Y.S. Yagi, Gait recognition using a view transformation model in the frequency domain, ECCV’06, pp. 151–163.
[33] D. Mateus, R. Horaud, D. Knossow, F. Cuzzolin, and E. Boyer, Articulated shape matching using laplacian eigenfunctions and unsupervised point registration, CVPR’08.
[34] M. Morup, L.K. Hansen, C.S. Herrmann, J. Parnas, and S.M. Arnfred, Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG, NeuroImage 29 (2006), no. 3, 938–947.
[35] C. Nandini and C.N. Ravi Kumar, Comprehensive framework to gait recognition, Int. J. Biometrics 1 (2008), no. 1, 129–137.
[36] M.S. Nixon and J.N. Carter, Automatic recognition by gait, Proceedings of IEEE 94 (2006), no. 11, 2013–2024.
[37] B. North, A. Blake, M. Isard, and J. Rittscher, Learning and classification of complex dynamics, IEEE Trans. PAMI 22 (2000), no. 9, 1016–1034.
[38] M. Piccardi and O. Perez, Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition, VS’07, pp. 1–8.
[39] F.E. Pollick, Y. Ma, J. Tsao, and M.S. Nixon, Attitudinal and biometric contributions to the recognition of identity from point-light walkers, J Vis 5 (2005), no. 8, 938.
[40] L. Ralaivola and F. d’Alche Buc, Dynamical modeling with kernels for nonlinear time series prediction, NIPS’04, vol. 16, pp. 129–136.
[41] G. Rogez, J.J. Guerrero, J.M. del Rincon, and C. Orrite-Uranela, Viewpoint independent human motion analysis in man-made environments, BMVC’06, vol. II, p. 659.
[42] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P.H.S. Torr, Randomized trees for human pose detection, CVPR’08.
[43] K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, CVPR’08.
[44] G. Shakhnarovich, L. Lee, and T.J. Darrell, Integrated face and gait recognition from multiple views, CVPR’01, vol. I, pp. 439–446.
[45] A. Shashua and T. Hazan, Non-negative tensor factorization with applications to statistics and computer vision, ICML’05, pp. 792–799.
[46] Q.F. Shi, L. Wang, L. Cheng, and A. Smola, Discriminative human action segmentation and recognition using semi-Markov model, CVPR’08, pp. 1–8.
[47] A. Sundaresan, A.K. Roy Chowdhury, and R. Chellappa, A hidden Markov model based framework for recognition of humans from gait sequences, 2003, pp. II: 93–96.
[48] D. Tao, X. Li, X. Wu, and S.J. Maybank, General tensor discriminant analysis and Gabor features for gait recognition, IEEE Trans. PAMI 29 (2007), no. 10, 1700–1715.
[49] J.B. Tenenbaum and W.T. Freeman, Separating style and content with bilinear models, Neural Computation 12 (2000), 1247–1283.
[50] P.K. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea, Machine recognition of human activities: A survey, CirSysVideo 18 (2008), no. 11, 1473–1488.
[51] R. Urtasun and P. Fua, 3D tracking for gait characterization and recognition, AFGR’04, pp. 17–22.
[52] M.A.O. Vasilescu and D. Terzopoulos, Multilinear image analysis for facial recognition, ICPR’02, pp. 511–514.
[53] G.V. Veres, M.S. Nixon, and J.N. Carter, Modelling the time-variant covariates for gait recognition, AVBPA’05, pp. 597–606.
[54] J.M. Wang, D.J. Fleet, and A. Hertzmann, Gaussian process dynamical model, NIPS’06, vol. 18, pp. 1441–1448.
[55] Y. Wang, K.Q. Huang, and T.N. Tan, Group activity recognition based on ARMA shape sequence modeling, ICIP’07, vol. III, pp. 209–212.
[56] C.Y. Yam, M.S. Nixon, and J.N. Carter, Automated person recognition by walking and running via model-based approaches, Pattern Recognition 37 (2004), no. 5, 1057–1072.
[57] G.Y. Zhao, G.Y. Liu, H. Li, and M. Pietikainen, 3D gait recognition using multiple cameras, FGR’06, pp. 529–534.
[58] X.L. Zhou and B. Bhanu, Integrating face and gait for human recognition at a distance in video, IEEE Trans. on SMC - B 37 (2007), no. 5, 1119–1137.
Lab Member(s): Fabio Cuzzolin, Wenjuan Gong, Michael Sapienza