Public datasets

With thousands of performers making millions of evaluations across hundreds of tasks every day, Yandex.Toloka is a major source of human-labeled training data. Yandex.Toloka supports academic research and innovation by sharing large amounts of accurate data applicable to machine learning in a variety of areas.
Please note: These public datasets are only available for non-commercial use with a clear reference to Yandex.Toloka as the source of data. If you plan to use any of these datasets for commercial purposes, please contact us for our consent.
Toloka Users & Tasks
This dataset was collected for the KDD'2020 paper «Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform». It contains over 18 million tasks performed by 161,377 users in Yandex.Toloka. Data was collected over a 3-month period (September–November 2018) and covers all sessions on the main page of the web platform where a user chose a task, as well as all tasks completed by users. The session data includes the timestamp, user ID, ID of the project chosen by the user, ID of the task assigned to the user, and additional information. The task data includes start and completion timestamps, the project ID, the ID of the assigned user, the reward for completing the task, the number of microtasks involved in the task, the amount of input and output data, and additional information. Some static characteristics of users and projects are also provided, such as user registration dates and the length of project instructions. The data has been anonymized: no real internal IDs, personal user data, proprietary requester data (including task results), or textual information (including project names and descriptions) are included.
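The record layouts described above can be sketched as two simple data classes. This is only an illustration of the fields listed in the description; the actual column names and types in the released files may differ.

```python
from dataclasses import dataclass

# Hypothetical layouts mirroring the fields described above;
# the released files may use different names and encodings.

@dataclass
class Session:
    timestamp: int     # Unix time of the main-page session
    user_id: str       # anonymized user ID
    project_id: str    # project chosen by the user
    task_id: str       # task assigned to the user

@dataclass
class Task:
    started_at: int    # start timestamp
    completed_at: int  # completion timestamp
    project_id: str
    user_id: str       # anonymized assigned user
    reward: float      # reward for completing the task
    n_microtasks: int  # number of microtasks in the task
    input_size: int    # amount of input data
    output_size: int   # amount of output data

    def duration(self) -> int:
        """Completion time in seconds, as studied in the paper."""
        return self.completed_at - self.started_at
```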
Toloka Business ID Recognition
For this dataset, commissioned by the Yandex Business Directory, we prepared 10,000 photos of organizations’ information signs and a text file with the INN (Taxpayer ID) and OGRN (Main State Registration Number) codes shown on these signs. A computer vision model can learn from this data to recognize these number sequences in images. First, we launched the task in the Yandex.Toloka mobile application, asking performers to go to the address marked on the map, find the specified organization, and take a photo of its information sign. Field tasks like this help us keep the Yandex Business Directory up to date. The quality of completed tasks was then checked by other performers. Photos containing INN and OGRN codes were sent for recognition: Yandex.Toloka performers typed the numbers from the photos, and we then processed the results to form the dataset.
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)
This dataset consists of 9,515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The files contain the tweet ID, class number, and a script for collecting the source text. The dataset was created as part of The Social Media Mining for Health Applications (#SMM4H) Shared Tasks in a competition for automatically extracting information about the side effects of drugs from tweets. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University. To learn more about the dataset and how it was created, see the paper.
Toloka Persona Chat Rus
This dataset of 10,000 dialogues will help dialogue systems researchers develop approaches for training chatbots. It was created in collaboration with iPavlov, the conversational AI research project of MIPT’s Neural Networks and Deep Learning Lab (which also develops DeepPavlov, an open-source library for chatbot technology). The dataset contains profiles of imaginary personalities with descriptions, along with dialogues between participants who were each given a random profile and instructed to mimic the described personality. A chatbot trained on this dataset will be able to communicate on behalf of a certain persona and get to know people by chatting with them on general topics.
Toloka Aggregation Relevance 2

Researchers can use this dataset to explore different methods of quality control in crowdsourcing. The dataset contains around 0.5 million anonymized crowdsourced labels that were collected in the Relevance 2 Gradations project in 2016. It includes the labels from individual performers and golden labels that help to measure the quality of their answers. The dataset contains anonymized information about how performers evaluated a particular document, and in some cases, whether their evaluation was correct. By studying this dataset, you can find out how the opinion of individual performers affects the quality of the final assessment, what aggregation model is most effective, and how many opinions you need in order to get an accurate evaluation.

The key quality metric is accuracy of aggregated labels, which is estimated as the percentage of the aggregated labels that match the golden labels for the golden set.
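As an illustrative sketch (not the platform's own aggregation code), the simplest aggregation model, majority voting, and this accuracy metric can be computed like so; the label values and task IDs below are made up:

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Aggregate crowdsourced labels per task by majority vote.

    `labels` is an iterable of (task_id, label) pairs.
    """
    per_task = defaultdict(list)
    for task_id, label in labels:
        per_task[task_id].append(label)
    return {t: Counter(ls).most_common(1)[0][0] for t, ls in per_task.items()}

def accuracy(aggregated, golden):
    """Share of golden-set tasks whose aggregated label matches the golden one."""
    hits = sum(aggregated.get(t) == g for t, g in golden.items())
    return hits / len(golden)

votes = [("t1", "relevant"), ("t1", "relevant"), ("t1", "irrelevant"),
         ("t2", "irrelevant"), ("t2", "irrelevant"), ("t2", "relevant")]
agg = majority_vote(votes)
print(accuracy(agg, {"t1": "relevant", "t2": "relevant"}))  # 0.5
```

More sophisticated aggregation models weight each performer's vote by an estimate of their skill, which is exactly what the golden labels in this dataset let you estimate.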

Toloka Aggregation Relevance 5

This dataset is similar to the previous one, but rather than a binary choice for rating label relevance, it uses a five-point scale in the Relevance 5 Gradations project. The task was to assess the relevance of a document for a query on a 5-point scale. Some tasks in this dataset have more than one golden label. In these cases, all the golden labels are considered equally correct.

The key quality metric is accuracy of aggregated labels, which is estimated as the percentage of the aggregated labels that match one of the golden labels for a given task from the golden set. In addition to the crowdsourced labels, there is also information about performers who were banned for a certain reason. For each banned performer, the reason for banning is provided as one out of four ban types (details about each ban type are not given). The dataset contains more than 1 million labels.
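Since a task here may carry several equally correct golden labels, the accuracy computation counts a hit if the aggregated label matches any of them. A minimal sketch (with made-up task IDs and labels):

```python
def accuracy_multi(aggregated, golden_sets):
    """Accuracy when a task may have several equally correct golden labels:
    an aggregated label counts as a hit if it matches ANY of them."""
    hits = sum(aggregated.get(t) in gs for t, gs in golden_sets.items())
    return hits / len(golden_sets)

# Aggregated 5-point relevance labels for three golden-set tasks.
agg = {"t1": 4, "t2": 2, "t3": 5}
golden = {"t1": {4, 5}, "t2": {1}, "t3": {5}}
print(accuracy_multi(agg, golden))  # hits on t1 and t3 -> 2/3
```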

Lexical Relations from the Wisdom of the Crowd (LRWC)
This dataset, assembled by Dmitry Ustalov in 2017, contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of this term (hyponym). A set of the 300 most frequent nouns was extracted from the Russian National Corpus. Then each method or resource (including RuThes and RuWordNet) produced at most five hypernyms, if possible. This resulted in 10,600 unique non-empty subsumption pairs, which were annotated by seven different performers whose mother tongue was Russian and who were at least 20 years old as of February 1, 2017. As a result, 4,576 out of 10,600 pairs were annotated as positive, while the remaining 6,024 were annotated as negative. Interestingly, the performers were more confident in the negative answers than in the positive ones.
Human-Annotated Sense-Disambiguated Word Contexts for Russian
This dataset, assembled by Dmitry Ustalov in 2017, contains human-annotated sense identifiers for 2,562 contexts of 20 words used in the RUSSE’2018 shared task on Word Sense Induction and Disambiguation for the Russian language. The contexts were annotated by humans trained on 80 pre-annotated contexts, with each context annotated by nine different annotators. After the annotation, every context was additionally inspected (curated) by the organizers of the shared task.
RuBQ: A Russian Dataset for Question Answering over Wikidata
RuBQ (Russian Knowledge Base Questions, pronounced [‘rubik]) is the first Russian dataset for Knowledge Base Question Answering (KBQA). It consists of 1,500 questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, and a subset of Wikidata covering entities with Russian labels. To learn more about the dataset and how it was created, see the paper. The latest version of the dataset along with an evaluation script can be found in the repo.
Toloka Aggregation Features
This dataset contains about 60,000 crowdsourced labels for 1,000 tasks and ground truth labels for almost all the tasks. The task was to classify websites into five categories based on the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict the category. The key quality metric is accuracy of aggregated labels, which is estimated as the percentage of aggregated labels that match the golden labels for the golden set.
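As a sketch of how the per-task feature vectors could be used to predict the category, here is a nearest-centroid classifier in pure Python. The two-dimensional toy features and the "safe"/"adult" labels below are illustrative stand-ins, not values from the dataset (which has 52 features and five categories):

```python
def fit_centroids(X, y):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], x))
    return min(centroids, key=dist2)

# Toy training data: 2 features per task instead of the dataset's 52.
X = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]]
y = ["safe", "safe", "adult", "adult"]
cents = fit_centroids(X, y)
print(predict(cents, [0.95, 0.75]))  # "adult"
```

In practice the features could feed any multi-class model; the interesting research question this dataset supports is how such feature-based predictions combine with the crowdsourced labels during aggregation.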
Toloka Water Meters
This dataset contains 1,244 images of hot and cold water meters, as well as their readings and the coordinates of the displays showing those readings. The dataset is most suitable for computer vision research. Each image contains exactly one water meter. The archive also includes segmentation results: the masks and the collages.
The dataset was assembled by Roman Kutsev using Yandex.Toloka. Performers were first asked to take a picture of their water meter; the result was then given to another group of performers to assess its relevance and type the readings into a form. Relevant pictures were then used for manual segmentation by Toloka performers.

Collect and annotate your dataset

Use the Yandex.Toloka platform to prepare a dataset that meets your needs. 
Have a dataset that you are ready to share? Submit it for publication on this page.