Current video search engines on the Web are text-based, i.e., they match queries against the text surrounding each video on its host Web page.
Any search method based on such textual 'labels' is bound to perform poorly on all but the simplest queries (Figure 1).
Being able to retrieve videos based on their actual content, instead, would be extremely useful when browsing online movie/TV/music/news/sports
video databases, searching for private videos in social networks, advertising products, handling educational videos in online university courses,
mining information on companies and high-profile people for big data corporations, or even analysing live surveillance video streams to detect
potential crimes (say, in a parking lot) or dangerous situations for children, the elderly, or the mentally or physically disabled.
One could instead tag each video with the search strings that led users to click on it after a query: this, however, leads to intractable
multi-label classification problems, since the label space grows with the query log itself.
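The scale problem can be seen in a minimal sketch: the click log below is hypothetical, but it shows how treating every distinct query string as a label makes the label set grow with the log rather than remain fixed, which is what makes the resulting multi-label problem intractable.

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked_video_id) pairs. Every distinct
# query string becomes a candidate label, so the label space is unbounded.
click_log = [
    ("funny cat", "v1"),
    ("cat falls off table", "v1"),
    ("kitten video", "v1"),
    ("funny cat", "v2"),
    ("soccer goal", "v3"),
]

# Aggregate queries per video: each video ends up with a *set* of labels.
tags = defaultdict(set)
for query, video_id in click_log:
    tags[video_id].add(query)

print(sorted(tags["v1"]))              # three distinct labels for one video
print(len({q for q, _ in click_log}))  # label-space size tracks the log: 4
```

Even in this toy log, a single video accumulates several labels, and the set of possible labels is as large as the set of distinct queries ever issued.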
To date, traditional video retrieval has focussed on 'summarizing' videos via significant key-frames, or on clustering frames from the same scene,
exploiting manually generated meta-data (actors, scenes, etc.) and low-level video descriptors. Companies such as MadVideo
(http://www.themadvideo.com/) sell online tools for meta-data video tagging.
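The key-frame idea mentioned above can be sketched with a minimal example. The following is an illustrative implementation under simple assumptions (frames as grayscale arrays, a low-level intensity histogram as the descriptor, an L1 threshold chosen arbitrarily), not the method of any particular system: a frame becomes a key-frame whenever its histogram differs enough from the previous key-frame's.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Normalized grayscale intensity histogram: a low-level descriptor."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def select_key_frames(frames, threshold=0.3):
    """Pick a frame as a key-frame whenever its histogram differs from the
    last selected key-frame's by more than `threshold` in L1 distance."""
    key_indices = [0]  # the first frame always opens a new 'scene'
    last_hist = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        hist = frame_histogram(frame)
        if np.abs(hist - last_hist).sum() > threshold:
            key_indices.append(i)
            last_hist = hist
    return key_indices

# Synthetic 'video': 10 dark frames followed by 10 bright frames,
# i.e. two visually distinct scenes.
dark = [np.full((32, 32), 30, dtype=np.uint8) for _ in range(10)]
bright = [np.full((32, 32), 220, dtype=np.uint8) for _ in range(10)]

print(select_key_frames(dark + bright))  # [0, 10]: one key-frame per scene
```

Note what this sketch captures and what it misses: it compresses the video to one representative frame per visually homogeneous segment, but the descriptor is purely low-level, so two different scenes with similar lighting would collapse into one, which is exactly the limitation the semantic notions discussed below are meant to address.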
But this is not good enough: manual annotation simply does not scale to millions of videos.
Figure 1's example suggests that references to semantic notions such as places and
activities may be crucial for performance.