Toloka Visual Question 
Answering Challenge

$6,000 in cash prizes and the opportunity to present at WSDM 2023 in Singapore

Task description

The Visual Question Answering task is simple for humans but difficult for machines. Given an image and a question, your 
mission is to build a system that draws a bounding box to answer the question.


Results of the Reproduction phase are ready. We'd like to thank you all for participating. Very well done everyone!

We obtained the final scores by running the inference on the private test set that we are going to release soon. We used the models that the participants sent to us during Reproduction stage.

First placeShengyi Gao, Zhe Chen, Guo Chen, Tong Lu
Second placeXiangyu Wu, Zhouyang Chi, Hengda Bao, Jianfeng Lu, Yang Yang
Third placeEvgenia Komleva


  • Image
    What do you use to hit the ball?
  • Image
    What do people use for cutting?
  • Image
    What do we use to support the immune system and get vitamin C?
The challenge is over. You can find more info on CodaLab.
View on CodaLab


There is only one correct answer per image.

We use intersection over union (IoU) as the evaluation score. On average, 
machine learning methods get an IoU score of 0.20 on the test set, 
while non-expert humans score 0.87.

Prizes will be awarded for the highest rate of correct answers,
which is quantified as the highest average IoU score on the hidden test set.


Prizes go to the winner and two runner-ups.

1st place: $3000
2nd place: $2000
3rd place: $1000

Each team of finalists also gets one complimentary registration at the WSDM conference. Finalists must present their solutions at the WSDM Cup workshop 
to qualify for the prizes. If the contestant is a team, the prize is divided into 
equal parts among the team members.

Data description

Our dataset consists of images associated with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform,

The dataset has 45,199 instances split among three subsets: train (38,990 instances), public test (1,705 instances), and private test (4,504 instances). The entire train dataset will be available for everyone from the start of the challenge. The public test dataset will be available during the evaluation phase of the competition but without any ground truth labels. The private test dataset will not be available until the challenge ends.

imagestringURL of an image on a public content delivery network
questionstringquestion in English
widthintegerimage width
heightintegerimage height
leftintegerbounding box coordinate: left
topintegerbounding box coordinate: top
rightintegerbounding box coordinate: right
bottomintegerbounding box coordinate: bottom

How to participate

We accept submissions via CodaLab:

Get a starter pack with the competition dataset 
and a baseline solution on GitHub:

Competition terms are also available on GitHub.


  • Practice phase: Ongoing after September 16
  • Evaluation phase: September 30 – December 16
  • Reproduction phase: December 19 – January 16
  • WSDM Cup Workshops: February 27, 2023


The final score will be evaluated on the private test dataset during Reproduction phase. We kindly ask you to create a docker image and share it with us. We put an instruction how to create a docker image on github and we will provide more details about sharing it with us later.

We will run your solution on a machine with one Nvidia A100 80 GB GPU, 16 CPU cores, and 200 GB of RAM. Your Docker image must perform the inference in at most 3 hours on this machine. In other words, the docker run command must finish in 3 hours.

Don't hesitate to contact us at if you have any questions or suggestions.

Toloka Visual Question 
Answering Challenge