Workshops & Challenges

The ROAD challenge: Event detection for situation awareness in autonomous driving
An ICCV 2021 Workshop

Aim of the Workshop

The accurate detection and anticipation of actions performed by multiple road agents (pedestrians, vehicles, cyclists and so on) is a crucial problem to tackle if we wish to endow autonomous vehicles with the capability to support reliable and safe autonomous decision making. While the task of teaching an autonomous vehicle how to drive can be approached in a brute-force fashion using direct reinforcement learning, a sensible and attractive alternative is to first provide the vehicle with situation awareness capabilities, to then feed the resulting semantically-meaningful representations of road scenarios (in terms of agents, events and scene configuration) to a suitable decision-making strategy. This approach has several advantages, from being more explicable to humans to its potential to allow the modelling of the reasoning process of road agents in a theory-of-mind approach, inspired by the behaviour of the human mind in similar contexts.

Accordingly, the goal of this workshop is to put to the forefront of the research in autonomous driving the topic of situation awareness, intended as the ability to create semantically useful representations of dynamic road scenes in terms of the notion of road event, itself inspired by the central computer vision notion of ‘action’.

We proposed to define a road event as a triplet E = (Ag, Ac, Loc) composed of a moving agent Ag, the action Ac it performs, and the location Loc in which this takes place (on the image plane if only video data is available, but potentially on a depth map if 3D information is at hand). Inspired by standard practice in action detection, we propose to represent road events as 'tubes', i.e., time series of frame-wise bounding box detections, as the building blocks of an intermediate semantic representation of a dynamic road scene.
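As an illustrative sketch of this representation (the class and field names below are ours, not part of the ROAD specification), a road event tube could be encoded as follows:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameDetection:
    """A frame-level detection: frame index plus a bounding box
    (x1, y1, x2, y2) on the image plane."""
    frame: int
    box: Tuple[float, float, float, float]

@dataclass
class RoadEvent:
    """A road event E = (Ag, Ac, Loc): the agent label, the action label,
    and the location expressed as a 'tube', i.e. a time series of
    frame-wise bounding boxes."""
    agent: str                   # e.g. 'Pedestrian' (hypothetical label)
    action: str                  # e.g. 'Crossing' (hypothetical label)
    tube: List[FrameDetection]

# A pedestrian crossing over three consecutive frames, with the box
# drifting rightwards by one pixel per frame.
event = RoadEvent(
    agent='Pedestrian',
    action='Crossing',
    tube=[FrameDetection(t, (10.0 + t, 20.0, 40.0 + t, 80.0)) for t in range(3)],
)
print(len(event.tube))  # 3: number of frames spanned by the event
```

The same tube structure serves all three challenge tasks below: agent tubes, action tubes, and full event tubes differ only in which labels are attached.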

As a side effect of this proposal, the workshop also aimed to stimulate a change of paradigm in the field of action detection, by shifting the focus from the objects/actors themselves and their appearance to what they do and the meaning of their behaviour, as the concept of action is here extended to apply to human-operated machinery as an extension of the human mind.

Link to full challenge proposal description.

Call for Papers

We invited contributions on the following topics:
  • Detecting and modelling 'atomic' events, intended as simple actions performed by a single agent.
  • Detecting and modelling complex activities, contributed to by several agents over an extended period of time.
  • Predicting agent intentions.
  • Dynamic scene understanding from streaming videos.
  • Predicting the trajectory of pedestrians, vehicles and other road users.
  • Forecasting future road events (both atomic and complex).
  • Decision making, both via reinforcement/imitation learning and via intermediate representations, and a critical/empirical comparison between the two approaches.
  • Explicability of both perception and decision making components of autonomous driving.
  • Modelling road scenarios in a multi-agent framework.
  • Modelling the reasoning processes of road agents in terms of goals or mental states.
  • Machine theory of mind for autonomous vehicles.
  • The role of incremental, life-long and continual learning in autonomous driving, with a focus on situation awareness.
  • The use of realistic simulations to generate training data for semantic scene understanding.
  • Testing and certification of AI algorithms for autonomous driving.
  • The ethical implications of situation awareness and automated decision making.
We invited both paper contributions on these topics and entries to a challenge specifically designed to test situation awareness capabilities in autonomous vehicles, described in detail below.


As part of the Workshop we organised the following 3 Challenges, making use of the new ROAD dataset:
  1. Spatiotemporal agent detection: the output is in the form of agent tubes collecting the bounding boxes associated with an active road agent in consecutive frames (in an object tube formulation).
  2. Action detection: the output is in the form of action tubes formed by bounding boxes around an action of interest in each video frame.
  3. Spatiotemporal road event detection: by road event we mean the triplet (Agent, Action, Location) as explained above. Each road event is once again represented as a tube of frame-level detections. As an autonomous vehicle's decisions make use of all three types of information provided by ROAD, this task is highly significant for autonomous driving applications.

Paper track:
  • Paper submission: July 10 2021
  • Notification: August 10 2021
  • Camera-ready: August 17 2021
Challenge track:
  • Challenges open for registration: April 15 2021
  • Training and validation data release: April 30 2021
  • Test fold release: July 20 2021
  • Submission of results: August 10 2021
  • Announcement of results: August 12 2021
  • Challenge event @ workshop: October 10-17 2021
Day of the event

The programme was a blend of invited talks, oral presentations from accepted papers, spotlight presentations from the winners of the 3 Challenges, and a discussion panel on the future of perception for autonomous driving.
The invited speakers were Fisher Yu (ETH), Raquel Urtasun (Waabi), Deva Ramanan (CMU), Alexander Amini (MIT) and Adrien Gaidon (TRI).
The event saw the participation of around 25-30 people over 8 hours.

Challenge winners and Leaderboards

The winners of the three Challenges on Agent Detection, Action Detection and Event Detection were:

  Chenghui Li, Yi Cheng and Shuhan Wang
YOLOv5 for autonomous vehicles
Winner: Agent Detection challenge
  Lijun Yu and Xiwen Chen (team CMU-INF)
ArgusRoad: Road Activity Detection with Connectionist Spatiotemporal Proposals
Winner: Action Detection challenge
  Yujie Hou and Fengyan Wang (team IFLY)
The IFLY Submission to the ROAD Challenge
Winner: Road Event Detection Challenge
PDF presentation

Some sample results from the winners of the Agent Detection challenge, using a modified YOLOv5:

The complete leaderboards for the three Challenges can be found here, but are summarised below.

The complete recordings of ROAD @ ICCV 2021 are accessible on YouTube:
The First International Workshop on Continual Semi-Supervised Learning @ IJCAI 2021

Aim of the Workshop

Whereas the continual learning problem has recently been the object of much attention in the machine learning community, it has mainly been approached from the point of view of preventing the model, as it is updated in the light of new data, from ‘catastrophically forgetting’ its initial, useful knowledge and abilities. A typical example is that of an object detector which needs to be extended to include classes not originally in its list (e.g., ‘donkey’ in a farm setting), while retaining its ability to correctly detect, say, a ‘horse’. The unspoken assumption is that we are quite satisfied with the model we have, and simply wish to extend its capabilities to new settings and classes. An example of this focus is the best paper award assigned at the ICML 2020 workshop on the topic.

This way of posing the continual learning problem, however, is in rather stark contrast with common real-world situations in which an initial model is trained using limited data, only for it to then be deployed without any additional supervision. Think of a person detector used for traffic safety purposes on a busy street. Even after having been trained extensively on the many available public datasets, experience shows that its performance in its target setting will likely be less than optimal. In this scenario, the objective is for the model to be incrementally updated using the new (unlabelled) data, in order to adapt to a target domain that is continually shifting with time (think of night/day and weekly/yearly cycles in the data captured by a camera outside an office block entrance).

The aim of this workshop is to formalise this form of continual learning, which we term continual semi-supervised learning (CSSL), and introduce it to the wider machine learning community, in order to mobilise the effort in this original direction. Secondly, it aims at providing clarity as to how training and testing should be designed in a continual setting.

Link to full challenge proposal description.


We propose to organise as part of the Workshop the following four CSSL Challenges, making use of both the new MEVA-CL dataset (C1 and C2) and the new CCCL continual learning benchmark for crowd counting (C3 and C4):
  1. Continual Semi-supervised Activity Detection (ConSAD) – Absolute performance. The goal is to achieve the best average performance across all the unlabelled portions of the sequences in the MEVA-CL dataset in a CSSL setting, leaving the choice of the base detector model to the participants.
  2. Continual Semi-supervised Activity Detection – Incremental performance. The goal here is to achieve the best performance improvement over time, measured from start to end of each unlabelled data stream and taking the average over all the available data streams in the dataset, for a baseline detector model chosen by us (see Baselines).
  3. Continual Semi-supervised Crowd Counting (ConSeCC) – Absolute performance. As in C1, the goal is to achieve the best average performance across the unlabelled data streams in the CCCL dataset, leaving the choice of the base crowd counting model to the participants.
  4. Continual Semi-supervised Crowd Counting – Incremental performance. The goal here is to achieve the best performance improvement over time, measured from start to end of the unlabelled data stream and taking the average over the available data streams in the dataset, for a base crowd counting model chosen by us (see Baselines).
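As a rough sketch of the incremental-performance criterion used in C2 and C4 (our own illustrative formulation, not the official evaluation code), the improvement can be computed as the difference between a model's score at the end and at the start of each unlabelled stream, averaged over streams:

```python
from typing import List

def incremental_score(per_stream_scores: List[List[float]]) -> float:
    """Average end-minus-start performance improvement over all streams.

    Each inner list holds a model's evaluation score (e.g. detection mAP
    or crowd-counting accuracy) measured at successive points along one
    unlabelled data stream.
    """
    gains = [scores[-1] - scores[0] for scores in per_stream_scores]
    return sum(gains) / len(gains)

# Two hypothetical streams: the model improves by 0.10 on the first
# and by 0.05 on the second, for an average gain of about 0.075.
print(incremental_score([[0.50, 0.55, 0.60], [0.40, 0.45]]))
```

Note that under this criterion a weaker base model that adapts well can outrank a stronger model that stays flat, which is exactly the behaviour the incremental-performance challenges are designed to reward.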
Call for Papers: follow this link.


Paper track:
  • Paper submission: May 31 2021
  • Notification: June 30 2021
  • Camera-ready: July 31 2021
Challenge track:
  • Challenges open for registration: April 1 2021
  • Training and validation fold release: April 15 2021
  • Test fold release: May 15 2021
  • Submission of results: June 15 2021
  • Announcement of results: June 30 2021
  • Challenge event @ workshop: August 21-23 2021

Footage of the event

Part 1

Part 2

The SARAS challenge on Multi-domain Endoscopic Surgeon Action Detection @ MICCAI 2021

Aim of the Challenge

Minimally Invasive Surgery (MIS) involves very sensitive procedures, whose success depends on the individual competence of the surgeons and the degree of coordination between them. The SARAS (Smart Autonomous Robotic Assistant Surgeon) EU consortium is working on methods to assist surgeons in MIS procedures by devising deep learning models able to automatically detect surgeon actions from streaming endoscopic video. This challenge builds on our previous MIDL 2020 challenge on surgeon action detection, and aims to attract attention to this research problem and mobilise the medical computer vision community around it. In particular, informed by the challenges encountered in our SARAS work, we decided to focus this year’s challenge on the issue of learning static action detection models across multiple domains (e.g., types of data, distinct surgical procedures).

Despite its huge success, deep learning suffers from two major limitations. Firstly, addressing a task (e.g., action detection in radical prostatectomy, as in SARAS) requires one to collect and annotate a large, dedicated dataset to achieve an acceptable level of performance. Consequently, each new task requires us to build a new model, often from scratch, leading to a linear relationship between the number of tasks and the number of models/datasets, with significant resource implications. Collecting large annotated datasets for every single MIS-based procedure is inefficient, very time consuming and financially expensive.

In our SARAS work, we have captured endoscopic video data during radical prostatectomy under two different settings ('domains'): real procedures on real patients, and simplified procedures on artificial anatomies ('phantoms'). As shown in our MIDL 2020 challenge (over real data only), variations due to patient anatomy, surgeon style and so on dramatically reduce the performance of even state-of-the-art detectors compared to non-surgical benchmark datasets. Videos captured in an artificial setting can provide more data, but are characterised by significant differences in appearance compared to real videos, and are subject to variations in the appearance of the phantoms over time. Inspired by these all-too-real issues, this challenge's goal is to test the possibility of learning more robust models across domains (e.g. across different procedures which, however, share some types of tools or surgeon actions; or, in the SARAS case, learning from both real and artificial settings whose lists of actions overlap, but do not coincide).

In particular, this challenge aims to explore the opportunity of utilising cross-domain knowledge to boost model performance on each individual task whenever two or more such tasks share some objectives (e.g., some action categories). This is a common scenario in real-world MIS procedures, as different surgeries often have some core actions in common, or contemplate variations of the same movement (e.g. 'pulling up the bladder' vs 'pulling up a gland'). Hence, each time a new surgical procedure is considered, only a smaller percentage of new classes need to be added to the existing ones.

The challenge provides two datasets for surgeon action detection: the first (Dataset-R) is composed of 4 annotated videos of real surgeries on human patients, while the second (Dataset-A) contains 6 annotated videos of surgical procedures on artificial human anatomies. All videos capture instances of the same procedure, Robotic Assisted Radical Prostatectomy (RARP), but with some differences in the set of classes. The two datasets share a subset of 10 action classes, while they differ in the remaining classes (because of the requirements of the SARAS demonstrators). These two datasets provide a perfect opportunity to explore the possibility of exploiting multi-domain datasets designed for similar objectives to improve performance in each individual task.

Link to full challenge proposal description.

Zenodo entry for the challenge, with DOI.


The challenge has the objective of learning to accomplish two related static action detection tasks, defined on video data generated in two different domains with overlapping objectives.

We will provide two datasets: Dataset-R (real) and Dataset-A (artificial).

Dataset-R is a set of video frames annotated for static action detection, with bounding boxes around actions of interest and a class label for each bounding box, taken from our MIDL 2020 challenge (https://sarasesad.). This dataset comprises 21 action classes, and is composed of 4 real-life videos captured from real patients during a RARP procedure. In total, the dataset contains approx. 45,000 labelled static action instances (bounding boxes with labels).
Each action instance is represented as a bounding box of the form (x, y, w, h), where x and y are the coordinates of the centre of the bounding box, while w and h are its width and height, respectively.
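Note that this centre-based (x, y, w, h) convention differs from the corner format (x1, y1, x2, y2) expected by many detection toolkits; a minimal conversion sketch (function name ours):

```python
def center_to_corners(x, y, w, h):
    """Convert a centre-based box (x, y, w, h), as used in the challenge
    annotations, to corner format (x1, y1, x2, y2)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

# A box centred at (50, 40) with width 20 and height 10.
print(center_to_corners(50, 40, 20, 10))  # (40.0, 35.0, 60.0, 45.0)
```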

Dataset-A is also designed for surgeon action detection during prostatectomy, but is composed of videos captured during procedures on artificial anatomies ('phantoms') used for the training of surgeons, and specifically designed for the SARAS demonstrators. As in Dataset-R, surgeries were performed by expert clinicians. Dataset-A contemplates a smaller list of 15 action classes, 10 of which are also in the list for Dataset-R. The reason for this is that many of the actions that occur in a real RARP procedure, as modelled by our SARAS partners Ospedale San Raffaele, are not present in the procedures on phantoms because of their shorter duration or phantom limitations (for instance, actions like 'bleeding' and 'suction' are not contemplated). Dataset-A also contains more than 45,000 action instances, and comprises 6 fully annotated videos.

The challenge organisers will split both datasets into training, validation and test sets. Participant teams will aim to improve performance by learning a single model from both datasets (domains). Joint mean Average Precision (mAP) is the evaluation metric, and will also be used to rank submissions.

  • Challenge website launch: April 1 2021
  • Challenge opens for registration: April 1 2021
  • Training/Validation data release: April 15 2021
  • Test data release: July 16 2021
  • Result submission deadline: July 31 2021
  • Final result announcement: August 5 2021
  • Challenge date: September 27 or October 1 2021
The SARAS challenge on Endoscopic Surgeon Action Detection (ESAD) @ MIDL 2020

Aim of the Challenge

Minimally Invasive Surgery (MIS) is a very sensitive medical procedure, whose success depends on the competence of the human surgeons and the degree of effectiveness of their coordination. The SARAS (Smart Autonomous Robotic Assistant Surgeon) EU consortium is working towards replacing the assistant surgeon in MIS with two assistive robotic arms. To accomplish that, the core AI of the system needs to be able to recognise what the main surgeon is doing in real time, based on the streaming endoscopic video. This can be posed as an online action detection problem, where both the class (type) of the action performed and a bounding box localising the action in each video frame are sought. Action instances are represented as 'action tubes', i.e., series of bounding boxes related to the same action in the series of consecutive video frames spanned by the action instance. Multiple action instances might be present at any point during the procedure (as, e.g., the right arm and the left arm of the da Vinci robot operated by the main surgeon might perform different coordinated actions).

From a technical point of view, then, a suitable online surgeon action detection system must be able to: (1) locate and classify multiple action instances in real time; (2) connect the detection bounding boxes associated with a single action instance in time to create an action tube. To the best of our knowledge, this challenge presents the first benchmark dataset for action detection in the surgical domain, and paves the way for the introduction, for the first time, of partial/full autonomy in surgical robotics. Within computer vision, other datasets for action detection exist, but are of limited size.

Link to full challenge proposal description.


The dataset contains digital recordings from the da Vinci Xi robotic system, which integrates a binocular endoscope with a diameter of 8 mm (Intuitive Surgical Inc.). Two lenses (0° and 30°) were used; during different stages of the operation, the 30° lens can be pointed either up or down to improve visualisation. The videos used for this challenge are monocular.

The dataset was created from four sessions of a complete prostatectomy procedure performed by expert surgeons on real patients. The patients' consent was obtained for both the recording and the distribution of the data. More details can be found in the SARAS ethics and data compliance document available at:

The dataset is divided into three sets: train, validation and test. The training and validation sets will be released at the start of the challenge, while the test data will only be released during a short test period for final result submission.

The training data contains a total of 22,601 annotated frames with 28,055 action instances. There are 21 different action classes in the dataset, and each frame can contain more than one action instance. These action instances can also have overlapping bounding boxes.

The validation fold contains 4,574 frames with 7,133 action instances in total. The test fold (only released in a second stage) will contain 6,223 annotated frames with 11,565 action instances.


The challenge is assessed on four real-life videos portraying complete radical prostatectomy procedures, performed by prominent surgeons on real patients at San Raffaele hospital, Milan, Italy. Each video is around four hours long and was captured at 25 frames per second. With assistance from medical experts, surgeon actions have been categorised into 21 different classes. The extent of the bounding boxes around action instances was decided after discussion with both medical and computer vision experts.

The task for this challenge is to detect the actions performed by the main surgeon or the assistant surgeon in the current frame. There are 21 action classes.

Evaluation Metrics: The challenge will use mean Average Precision (mAP) as the evaluation metric, a standard metric across detection tasks. As this is the first task of its kind, and correctly detecting actions in a surgical environment is difficult, we will use a slightly relaxed metric for the evaluation: performance is evaluated at three different IoU thresholds (0.1, 0.3 and 0.5), and the final score is the mean of the three resulting Average Precision values.
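As a sketch of this relaxed scoring rule (illustrative only; the official evaluation code may differ in its matching details), detections are matched to ground truth by Intersection over Union, and the final score averages the mAP values obtained at the three thresholds:

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def final_score(map_at_threshold):
    """Mean of the mAP values obtained at IoU thresholds 0.1, 0.3 and 0.5."""
    return sum(map_at_threshold.values()) / len(map_at_threshold)

# Two unit-offset boxes overlap with IoU 1/7: a match at threshold 0.1,
# but a miss at 0.3 and 0.5.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
# Hypothetical mAP values at the three thresholds.
print(final_score({0.1: 0.75, 0.3: 0.5, 0.5: 0.25}))  # 0.5
```

The low 0.1 threshold means a detection roughly in the right place still counts, which reflects how hard precise localisation is in endoscopic footage.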

  • Challenge opens for registration: 1 March 2020
  • Training/Validation data release: 1 April 2020
  • Test data release: 10 June 2020
  • Start of evaluation phase for test data: 11 June 2020
  • End of evaluation phase for test data: 25 June 2020
  • Final result announcement: 30 June 2020
  • Virtual challenge event: 9 July 2020