The Visual Question Answering task is simple for humans but difficult for machines. Given an image and a question, your
mission is to build a system that draws a bounding box to answer the question.
There is only one correct answer per image.
We use intersection over union (IoU) as the evaluation score. On average,
machine learning methods get an IoU score of 0.20 on the test set,
while non-expert humans score 0.87.
Prizes will be awarded for the highest rate of correct answers,
which is quantified as the highest average IoU score on the hidden test set.
Prizes go to the winner and two runner-ups.
Each team of finalists also gets one complimentary registration at the WSDM conference. Finalists must present their solutions at the WSDM Cup workshop
to qualify for the prizes. If the contestant is a team, the prize is divided into
equal parts among the team members.
Our dataset consists of images associated with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, https://toloka.ai/.
The dataset has 45,199 instances split among three subsets: train (38,990 instances), public test (1,705 instances), and private test (4,504 instances). The entire train dataset will be available for everyone from the start of the challenge. The public test dataset will be available during the evaluation phase of the competition but without any ground truth labels. The private test dataset will not be available until the challenge ends.
|image||string||URL of an image on a public content delivery network|
|question||string||question in English|
|left||integer||bounding box coordinate: left|
|top||integer||bounding box coordinate: top|
|right||integer||bounding box coordinate: right|
|bottom||integer||bounding box coordinate: bottom|
We accept submissions via CodaLab:
Get a starter pack with the competition dataset
and a baseline solution on GitHub:
Competition terms are also available on GitHub.
The final score will be evaluated on the private test dataset during Reproduction phase. We kindly ask you to create a docker image and share it with us. We put an instruction how to create a docker image on github and we will provide more details about sharing it with us later.
We will run your solution on a machine with one Nvidia A100 80 GB GPU, 16 CPU cores, and 200 GB of RAM. Your Docker image must perform the inference in at most 3 hours on this machine. In other words, the docker run command must finish in 3 hours.
Don't hesitate to contact us at firstname.lastname@example.org if you have any questions or suggestions.