Datasets and Code


March 2021

ROAD - The ROad event Awareness Dataset for Autonomous Driving

We are proud to announce the release of the new ROad event Awareness Dataset for Autonomous Driving (ROAD). The dataset is publicly available on GitHub at

ROAD is the first benchmark of its kind, designed to allow the autonomous vehicle community to investigate the use of semantically meaningful representations of dynamic road scenes to facilitate situation awareness and decision making for autonomous driving.

ROAD is a multilabel dataset containing 22 long-duration videos (ca. 8 minutes each) comprising 122K frames annotated in terms of *road events*, defined as triplets E = (Agent, Action, Location) and represented as ‘tubes’, i.e., a series of frame-wise bounding box detections.
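As an illustration of this representation, the sketch below shows one way an event triplet and its tube of frame-wise boxes could be modelled in code. The class and field names are hypothetical, not ROAD's actual annotation schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a ROAD-style road event: a triplet
# E = (Agent, Action, Location) plus a 'tube' of frame-wise boxes.
@dataclass
class EventTube:
    agent: str                      # e.g. "Pedestrian"
    action: str                     # e.g. "Crossing"
    location: str                   # e.g. "On zebra crossing"
    # frame index -> (x1, y1, x2, y2) bounding box
    boxes: dict = field(default_factory=dict)

    def add_detection(self, frame: int, box: tuple) -> None:
        self.boxes[frame] = box

    @property
    def triplet(self):
        return (self.agent, self.action, self.location)

tube = EventTube("Pedestrian", "Crossing", "On zebra crossing")
tube.add_detection(100, (120, 80, 180, 240))
tube.add_detection(101, (122, 81, 182, 241))
```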

ROAD has the ambition to become the reference benchmark for agent and event detection, intention and trajectory prediction, future event anticipation, modelling of complex road activities, instance- and class-incremental continual learning, machine theory of mind and automated decision making.

An original 3D-RetinaNet baseline model is available at:

Please cite our arXiv preprint when using ROAD in your work.

March 2021

The SARAS-MESAD Multi-domain Endoscopic Surgical Action Dataset

In our SARAS work, we have captured endoscopic video data during radical prostatectomy under two different settings ('domains'): real procedures on real patients, and simplified procedures on artificial anatomies ('phantoms'). As shown in our MIDL 2020 challenge (over real data only), variations due to patient anatomy, surgeon style and so on dramatically reduce the performance of even state-of-the-art detectors compared to nonsurgical benchmark datasets. Videos captured in an artificial setting can provide more data, but are characterised by significant differences in appearance compared to real videos and are subject to variations in the appearance of the phantoms over time. Inspired by these all-too-real issues, this challenge's goal is to test the possibility of learning more robust models across domains (e.g. across different procedures which, however, share some types of tools or surgeon actions; or, in the SARAS case, learning from both real and artificial settings whose lists of actions overlap but do not coincide).

The challenge provides two datasets for surgeon action detection: the first dataset (Dataset-R) is composed of 4 annotated videos of real surgeries on human patients, while the second dataset (Dataset-A) contains 6 annotated videos of surgical procedures on artificial human anatomies. All videos capture instances of the same procedure, Robotic Assisted Radical Prostatectomy (RARP), but with some differences in the set of classes: the two datasets share a subset of 10 action classes, while they differ in the remaining classes (because of the requirements of the SARAS demonstrators). These two datasets provide a perfect opportunity to explore exploiting multi-domain datasets, designed for similar objectives, to improve performance on each individual task.

Link to full challenge proposal description


SARAS-MESAD MICCAI 2021 challenge
July 2020

The SARAS-ESAD Endoscopic Surgical Action Dataset

Minimally Invasive Surgery (MIS) is a very sensitive medical procedure, whose success depends on the competence of the human surgeons and the effectiveness of their coordination. The SARAS (Smart Autonomous Robotic Assistant Surgeon) EU consortium is working towards replacing the assistant surgeon in MIS with two assistive robotic arms. To accomplish this, an artificial-intelligence-based system is required which can not only understand the complete surgical scene, but also detect the actions being performed by the main surgeon. This information can later be used to infer the response required from the autonomous assistant surgeon. Correctly detecting and localising the surgeon's actions is critical to designing the trajectories of the robotic arms. For this challenge, we recorded four sessions of a complete prostatectomy procedure performed by expert surgeons on real patients with prostate cancer. Expert AI and medical professionals then annotated these complete surgical procedures with the actions performed. Multiple action instances may be present at any point during the procedure (as, e.g., the right arm and the left arm of the da Vinci robot operated by the main surgeon might perform different coordinated actions). Hence, each frame is labelled with multiple actions, and these actions can have overlapping bounding boxes.

The bounding boxes in the training data are selected to cover both the ‘tool performing the action’ and the ‘organ under operation’. A set of 21 actions was selected for the challenge after consultation with expert medical professionals. From a technical point of view, then, a suitable online surgeon action detection system must be able to: (1) locate and classify multiple action instances in real time; (2) associate the detected bounding boxes over time.
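To make requirement (2) concrete, the sketch below shows one common approach, greedy IoU-based linking of per-frame detections into tubes. This is an illustrative example, not the challenge's reference implementation; the function names and threshold are assumptions.

```python
# Illustrative greedy linking of per-frame detections into tubes by
# intersection-over-union (IoU) with each tube's most recent box.

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_frame(tubes, detections, thresh=0.5):
    """Extend existing tubes with the current frame's detections;
    unmatched detections start new tubes."""
    for det in detections:
        best = max(tubes, key=lambda t: iou(t[-1], det), default=None)
        if best is not None and iou(best[-1], det) >= thresh:
            best.append(det)
        else:
            tubes.append([det])
    return tubes

tubes = link_frame([], [(0, 0, 10, 10)])     # first frame: one new tube
tubes = link_frame(tubes, [(1, 1, 11, 11)])  # overlapping box extends the tube
```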

To the best of our knowledge, this challenge presents the first benchmark dataset for action detection in the surgical domain, and paves the way for the introduction, for the first time, of partial/full autonomy in surgical robotics. Within computer vision, other datasets for action detection exist, but are of limited size.

Link to the ESAD Grand Challenge website.

August 2021

CAR - The Continual Activity Recognition Dataset

The Continual Activity Recognition (CAR) dataset was released as part of the Continual Semi-Supervised Learning workshop co-hosted by IJCAI 2021 (CSSL @ IJCAI 2021).

The MEVA dataset

Our CAR benchmark is built on top of the MEVA (Multiview Extended Video with Activities) activity detection dataset, for continual, long-duration semi-supervised learning in a classification setting: the task is to classify the input video frames in terms of activity classes. MEVA is part of the NIST ActEV (Activities in Extended Video) challenge. As of December 2019, 328 hours of ground-camera data and 4.2 hours of Unmanned Aerial Vehicle video had been released, broken down into 4,304 video clips, each 5 minutes long.
The original annotations are available on GitLab.


To create a suitable continual learning dataset for action/activity recognition, we provide a modified set of annotations, with a reduced set of action classes for the purpose of temporal activity detection. Each video frame is annotated in terms of 8 activity classes selected from the 37 original classes (e.g. person_enters_scene_through_structure, person_exits_vehicle, vehicle_starts). Each activity instance is annotated with the start and end frames of the activity of interest. Frames that do not have any associated activity label are assigned to a “background” activity class.
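The annotation scheme above can be sketched as a small routine that expands interval annotations (start frame, end frame, class) into the per-frame labels used for classification, with unannotated frames falling into the background class. The function name and toy intervals are illustrative, not part of the released scripts.

```python
# Expand (start, end, class) activity intervals into per-frame labels;
# frames outside every interval get the "background" class.

def frame_labels(num_frames, intervals, background="background"):
    labels = [background] * num_frames
    for start, end, cls in intervals:      # inclusive frame range
        for f in range(start, end + 1):
            labels[f] = cls
    return labels

intervals = [(2, 4, "person_exits_vehicle"), (7, 8, "vehicle_starts")]
labels = frame_labels(10, intervals)
```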

CAR's 15 long-duration sequences

MEVA lends itself well to the CSSL problem, for it comprises 4,304 video clips, each five minutes long, captured by 19 cameras across the Muscatatuck Urban Training Center facility. Several of those videos are contiguous, while others are separated by short (5-15 minutes) or long (hours or days) intervals of time. We thus selected sets of videos of the three kinds to compose longer sequences that can model learning in different settings (continually, with short gaps, episodically).
As a result, CAR is composed of 15 sequences, each 15 minutes long, broken down into three categories:
  1. Five 15-minute sequences from sites G326, G331, G341, G420, and G638, each formed by three original videos which are contiguous.
  2. Five 15-minute sequences from sites G326, G331, G341, G420, and G638, each formed by three original videos separated by a short gap (5-15 minutes).
  3. Five 15-minute sequences from sites G420, G421, G424, G506, and G638, each formed by three original videos separated by a long gap (hours or days).


Training fold (supervised with labels): For each composite sequence, the video frames from the first five minutes (5 x 60 x 25 = 7,500 samples from the first original video) are selected to form the initial supervised training set T0.
Validation fold (unlabeled data stream): The next five minutes from each of the sequences (second original video) are considered as a validation fold that can be used for tuning the model continually in an unsupervised manner.
Test fold (unlabeled data stream): The last five minutes from each of the sequences (third original video) are also unsupervised and can be used for testing the proposed continual learning approach and evaluating the resulting series of models.
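The three-way split above can be sketched as follows, using the constants from the text (25 fps, five minutes per original video, so 5 x 60 x 25 = 7,500 frames per fold). The function name is illustrative, not part of the released scripts.

```python
# Split a 15-minute composite CAR sequence into the supervised training
# fold T0, the unlabeled validation stream, and the unlabeled test stream.

FPS = 25
MINUTES_PER_VIDEO = 5
FRAMES_PER_VIDEO = FPS * 60 * MINUTES_PER_VIDEO   # 7,500 frames

def split_sequence(frames):
    t0 = frames[:FRAMES_PER_VIDEO]                 # first original video
    val = frames[FRAMES_PER_VIDEO:2 * FRAMES_PER_VIDEO]   # second video
    test = frames[2 * FRAMES_PER_VIDEO:]           # third video
    return t0, val, test

frames = list(range(3 * FRAMES_PER_VIDEO))  # stand-in for 15 minutes of video
t0, val, test = split_sequence(frames)
```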

The CAR dataset, including annotations and scripts, is also available on GitHub at


Please cite our CSSL @ IJCAI 2021 paper when using CAR in your work:

August 2021

CAR - The Continual Crowd Counting Dataset

Crowd counting

Our Continual Crowd Counting (CCC) dataset is designed to test continual semi-supervised learning for crowd counting in video frames. We pose crowd counting as a regression task, with the goal of predicting the density map associated with a test video frame, as is standard in crowd counting. To the best of our knowledge, continual crowd counting has never been posed as a problem, not even in the fully supervised context – thus, there are no standard benchmarks one can adopt in this domain.

The dataset

The CCC benchmark combines components from three existing crowd counting datasets (Mall, UCSD and FDST), augmented with the relevant ground truth in terms of density maps:
  • The Mall dataset consists of a single 2,000-frame video sequence captured in a shopping mall via a publicly accessible camera.
  • The UCSD dataset is composed of another single sequence, 2,000 frames long, captured by a stationary digital camcorder mounted for one hour to overlook a pedestrian walkway at the University of California, San Diego.
  • A 750-frame sequence from the Fudan-ShanghaiTech (FDST) dataset, composed of five clips, each 150 frames in duration, portraying the same scene.

The ground truth for the CCC sequences (in the form of a density map for each frame) was generated by us for all three datasets following standard annotation protocols.
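For readers unfamiliar with this annotation protocol, the sketch below illustrates the usual fixed-kernel recipe: a normalised Gaussian is placed at each annotated head position, so the resulting density map integrates to the crowd count. The sigma value and function name are illustrative assumptions, not the exact parameters used for CCC.

```python
import numpy as np

# Build a density map from point annotations: one normalised Gaussian
# per head, so the map sums to the number of annotated people.

def density_map(shape, points, sigma=4.0):
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g /= g.sum()                 # each person contributes exactly 1
        dmap += g
    return dmap

heads = [(30, 40), (70, 20), (50, 50)]   # (x, y) head annotations
dmap = density_map((100, 100), heads)
```

Regressing this map rather than a single count preserves the spatial distribution of the crowd, which is why it is the standard target in crowd counting.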


Training fold (supervised with labels): For both the Mall and UCSD datasets the first 400 frames, and for FDST the first 150 frames, are selected as the training set T0.
Validation fold (unlabeled data stream): The next 800 frames of the Mall and UCSD datasets and 300 frames of FDST are considered as the validation fold.
Test fold (unlabeled data stream): The last 800 frames of the Mall and UCSD datasets and 300 frames of FDST are also unsupervised, and are used for continual learning and the final evaluation of the model.

The CCC dataset, including annotations and scripts, is also available on GitHub at:


Please cite our CSSL @ IJCAI 2021 paper when using CCC in your work: