Action recognition is a fast-growing area of research in computer vision. Given a video captured by one or more cameras, the task is to detect and recognise the category of the action performed by the person or people appearing in the video.
The problem is very challenging for a number of reasons: labelling videos is an inherently ambiguous task, as the same sequence can be given different verbal descriptions by different human observers; different motions can carry the same meaning (inherent variability); and nuisance factors such as viewpoint, illumination changes and occlusion (parts of the moving person may be hidden behind objects or other people) further complicate recognition.
In addition, traditional action recognition benchmarks are based on a 'batch' philosophy: each video is assumed to contain a single action, and videos are processed as a whole, typically by algorithms that can take entire days to run. This may be acceptable for tasks such as video browsing and retrieval over the internet (although speed is a serious issue there too), but it is entirely unsuitable for a number of real-world applications that require a prompt, real-time interpretation of what is going on.
Examples include: human-robot and human-machine interaction (using gestures to send commands to a computer or a robot), surveillance (detecting potentially dangerous actions or events in live feeds), driver monitoring (assessing the driver's level of attention, or responding to gestural commands), gaming (interpreting the body language of a video game player), and intelligent vehicles (understanding the behaviour of pedestrians and other vehicles in the vicinity of a car).
Consequently, a new paradigm of 'online', 'real-time' action recognition is rapidly emerging, and is likely to shape the field in the coming years.
The AI and Vision group is already building on its multi-year experience in batch action recognition to expand towards online recognition, following two distinct approaches: one applies novel 'deep learning' neural networks to automatically segmented video regions; the other rests on continually updating an approximation of the space of 'feature' measurements extracted from images, using a set of balls whose radii depend on how difficult classification is within that region of the space.
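To make the second approach concrete, the following is a minimal, purely illustrative sketch of such a ball-based online classifier. The class name, parameters and update rule are assumptions for exposition, not the group's actual method: each labelled feature vector either falls inside an existing ball (in which case a ball of the wrong label is shrunk, marking that region as difficult) or spawns a new ball around itself.

```python
import numpy as np

class BallCoverClassifier:
    """Hypothetical sketch: cover the feature space with labelled balls
    whose radii shrink in regions where classification is difficult."""

    def __init__(self, init_radius=1.0, shrink=0.5):
        self.centres = []            # ball centres (feature vectors)
        self.radii = []              # per-ball radius
        self.labels = []             # action label of each ball
        self.init_radius = init_radius
        self.shrink = shrink         # radius shrink factor on conflict

    def predict(self, x):
        """Return the label of the nearest covering ball, or None."""
        best_label, best_dist = None, np.inf
        for c, r, y in zip(self.centres, self.radii, self.labels):
            d = np.linalg.norm(x - c)
            if d <= r and d < best_dist:
                best_label, best_dist = y, d
        return best_label

    def update(self, x, y):
        """Online update with one labelled feature vector."""
        x = np.asarray(x, dtype=float)
        if self.predict(x) == y:
            return                   # already covered correctly
        conflict = False
        for i, (c, r) in enumerate(zip(self.centres, self.radii)):
            # shrink wrongly-labelled balls that cover this point:
            # the region is "difficult", so its balls get smaller
            if np.linalg.norm(x - c) <= r and self.labels[i] != y:
                self.radii[i] *= self.shrink
                conflict = True
        # spawn a new ball for the misclassified point, smaller if
        # it sits in a contested region of the feature space
        r0 = self.init_radius * (self.shrink if conflict else 1.0)
        self.centres.append(x)
        self.radii.append(r0)
        self.labels.append(y)
```

A short usage run: after one update per class at well-separated feature points, queries near each centre fall inside the corresponding ball and return its label, while far-away points remain uncovered and return `None`.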