Public Datasets

With thousands of performers making millions of evaluations in hundreds of tasks every day, Toloka is a major source of human-labeled training data. Toloka supports academic research and innovation by sharing large amounts of accurate data applicable to machine learning in a variety of areas.
Please note: These public datasets are available for non-commercial use only, with a clear reference to Toloka as the source of data. If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.

Toloka Business ID Recognition

ZIP archive, 19.5 GB
Labels: data.tsv
Photos: photos/

This dataset, commissioned by the Yandex Business Directory, contains 10,000 photos of organization information signs shot in the Russian Federation along with the INN (taxpayer ID) and OGRN (Primary State Registration Number) codes shown on these signs. Toloka was used for both capturing photos and recognizing INN and OGRN codes.

Toloka WaterMeters

This dataset, collected by Roman Kucev from TrainingData.ru, contains 1,244 images of hot and cold water meters as well as their readings and the coordinates of the displays showing those readings. Each image contains exactly one water meter. The archive also includes the segmentation results in the form of masks and collages. Toloka was used for capturing photos, segmentation, and recognizing the readings.

ZIP archive, 981 MB
Photos: images/
Masks: masks/
Collages: collage/

RuBQ 2.0: An Innovated Russian Question Answering Dataset

Development set: RuBQ_2.0_dev.json
Test set: RuBQ_2.0_test.json
Paragraphs: RuBQ_2.0_paragraphs.json

RuBQ 2.0 is the second version of RuBQ. It contains 2,910 questions along with the answers and SPARQL queries. The dataset can be used for the evaluation of KBQA and machine reading comprehension, paragraph retrieval, end-to-end open-domain question answering and experiments in hybrid QA, where KBQA and text-based QA can enrich and complement each other.

RuBQ: A Russian Dataset for Question Answering over Wikidata

Development set: RuBQ_dev.json
Test set: RuBQ_test.json

RuBQ (Russian Knowledge Base Questions, pronounced [‘rubik]) is the first Russian dataset for Knowledge Base Question Answering (KBQA). It consists of 1,500 questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, and a subset of Wikidata covering entities with Russian labels.

Toloka Persona Chat Rus

This dataset of 10,000 dialogues was gathered by MIPT's Neural Networks and Deep Learning Lab for conversational AI research. The dataset contains profiles of imaginary personalities with descriptions, and dialogues between participants who were each given a random profile and instructed to mimic the described personality.

ZIP archive, 8.19 MB
Profiles: profile.tsv
Dialogues: dialogues.tsv

The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)

ZIP archive, 95.6 KB
Training data: task2_ru_train.tsv
Validation data: task2_ru_validation.tsv
Testing data: task2_ru_test.tsv
Script for downloading tweets: download_tweets.py
Description and script instructions: Readme.md

Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9,515 tweets describing health issues. Each tweet is labeled according to whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.

Lexical Relations from the Wisdom of the Crowd (LRWC)

ZIP archive, 2.01 MB
Input data: lrwc-1.1-assignments.tsv
Training tasks: toloka-isa-50-skip-300-train-hit.tsv
Aggregated results: lrwc-1.1-aggregated.tsv

This dataset, assembled by Dmitry Ustalov in 2017 for the Watlink method, contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of this term (hyponym) in 10,600 word pairs. It is based on the nouns from the Russian National Corpus and relationships from the RuThes and RuWordNet lexical ontologies.

Toloka Aggregation Features

ZIP archive, 0.45 MB
Ground truth: golden_labels.tsv
Features: features.tsv
Crowd labels: crowd_labels.tsv

This dataset contains about 60,000 crowdsourced labels gathered on Toloka for 1,000 tasks and ground truth labels for almost all of them. The task was to classify websites into five categories based on the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict the category.

Human-Annotated Sense-Disambiguated Word Contexts for Russian

ZIP archive, 2.23 MB
Crowd labels: assignments_01-12-2017.tsv
Ground truth: report-curated.tsv.xz
Aggregated results: bts-rnc-crowd.tsv

This dataset, assembled by Dmitry Ustalov in 2017, contains human-annotated sense identifiers for 2,562 contexts of 20 words used in the RUSSE’2018 shared task on Word Sense Induction and Disambiguation for Russian. After labeling, every context was additionally inspected and curated by the organizers of the shared task.

Toloka Aggregation Relevance 2

ZIP archive, 3.08 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv

This dataset, designed for evaluating answer aggregation methods in crowdsourcing, contains around 0.5 million anonymized crowdsourced labels collected in the Relevance 2 Gradations project in 2016 at Yandex. In this project, query-document pairs are provided with binary labels: relevant or non-relevant. The dataset also contains gold labels for comparing aggregation methods.
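
As an illustration of what evaluating an aggregation method against gold labels can look like, here is a minimal sketch of majority voting, a common baseline. It is not the method used by the dataset authors. The column layout (worker, task, label for crowd labels; task, label for gold labels) and the headerless TSV format are assumptions made for this example; check the actual files before relying on them. The sample data is invented.

```python
import csv
import io
from collections import Counter, defaultdict


def read_tsv(handle, fieldnames):
    """Read a headerless TSV stream into a list of dicts (layout is assumed)."""
    return list(csv.DictReader(handle, fieldnames=fieldnames, delimiter="\t"))


def majority_vote(crowd_rows):
    """Pick the most frequent label per task; ties go to the label seen first."""
    votes = defaultdict(Counter)
    for row in crowd_rows:
        votes[row["task"]][row["label"]] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predicted, gold_rows):
    """Share of gold-labeled tasks whose aggregated label matches the gold label."""
    gold = {row["task"]: row["label"] for row in gold_rows}
    scored = [task for task in gold if task in predicted]
    if not scored:
        return 0.0
    return sum(predicted[task] == gold[task] for task in scored) / len(scored)


# Tiny invented stand-ins for crowd_labels.tsv and golden_labels.tsv.
crowd_sample = io.StringIO("w1\tt1\t1\nw2\tt1\t1\nw3\tt1\t0\nw1\tt2\t0\nw2\tt2\t0\n")
gold_sample = io.StringIO("t1\t1\nt2\t1\n")

crowd_rows = read_tsv(crowd_sample, ["worker", "task", "label"])
gold_rows = read_tsv(gold_sample, ["task", "label"])
aggregated = majority_vote(crowd_rows)
print(aggregated, accuracy(aggregated, gold_rows))
```

To run this on the real archive, replace the `io.StringIO` samples with opened file handles. A more elaborate aggregation method can be compared by swapping `majority_vote` for another function that returns the same task-to-label mapping.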

Toloka Aggregation Relevance 5

ZIP archive, 7.17 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv
Banned users: bans.tsv

This dataset was designed for evaluating answer aggregation methods in crowdsourcing. It contains around 1 million anonymized crowdsourced labels collected in the Relevance 5 Gradations project in 2016 at Yandex. In this project, query-document pairs are labeled on a scale of 1 to 5, from most relevant to least relevant. The dataset also contains gold labels for comparing aggregation methods.

Toloka Users & Tasks

ZIP archive, 1.07 GB
Completed tasks: assignments.tsv
Project data: projects.tsv
Anonymized user data: users.tsv
Task selection sessions: visits.tsv

Collected for the KDD '20 paper "Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform", this dataset contains user activity sessions recorded in 18 million tasks performed by 161,377 users in Toloka over a three-month period (September-November 2018). It includes timestamps, anonymized project and user identifiers, reward information, number of microtasks, instructions, data schema description, responses, and various descriptive task properties.
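
Since the dataset combines timestamps with reward information, one quantity it lends itself to is hourly earnings. The sketch below uses one plausible definition (total reward divided by total working time) and is not necessarily the definition used in the paper. The field names `reward`, `started`, and `submitted` are hypothetical and do not reflect the actual schema of assignments.tsv. The sample records are invented.

```python
from datetime import datetime


def hourly_earnings(assignments):
    """Total reward divided by total working time in hours (assumed definition)."""
    total_reward = sum(a["reward"] for a in assignments)
    total_seconds = sum(
        (a["submitted"] - a["started"]).total_seconds() for a in assignments
    )
    if total_seconds <= 0:
        return 0.0
    return total_reward / (total_seconds / 3600)


# Two invented assignments of 15 minutes each; field names are hypothetical.
sample = [
    {
        "reward": 0.25,
        "started": datetime(2018, 9, 1, 10, 0, 0),
        "submitted": datetime(2018, 9, 1, 10, 15, 0),
    },
    {
        "reward": 0.5,
        "started": datetime(2018, 9, 1, 11, 0, 0),
        "submitted": datetime(2018, 9, 1, 11, 15, 0),
    },
]

rate = hourly_earnings(sample)
print(rate)
```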

Collect and annotate your dataset

Use the Toloka platform to prepare a dataset that meets your needs. 
Have a dataset that you are ready to share? Submit it for publication on this page.
Tue Sep 07 2021 15:42:06 GMT+0300 (Moscow Standard Time)