With thousands of annotators making millions of evaluations in hundreds of tasks every day, Toloka is a major source
of human-marked training data. Toloka supports academic research and innovation by sharing large amounts of
accurate data applicable to machine learning in a variety of areas.
Please note: These public datasets are only available for non-commercial use with a clear reference to Toloka as the source of data.
If you plan to use any of these datasets for commercial purposes, please contact us for our consent.
This dataset contains over 7,000 images of handwritten text from more than 100 unique contributors in 3 languages: Spanish, French, and Arabic. The dataset works well for training and testing recognition models for handwritten text. The images contain text with punctuation and characters unique to each language that don't exist in the Latin alphabet, which makes the recognition task more challenging compared to other open-source benchmark datasets available for text recognition.
ZIP archive, 10.8 GB
Labels: texts.tsv
Photos: images/
This dataset, commissioned by the Yandex Business Directory, contains 10,000 photos of organization information signs shot in the Russian Federation along with the INN (taxpayer ID) and OGRN (Primary State Registration Number) codes shown on these signs. Toloka was used for both capturing photos and recognizing INN and OGRN codes.
ZIP archive, 19.5 GB
Labels: data.tsv
Photos: photos/
This dataset, collected by Roman Kucev from TrainingData.ru, contains 1,244 images of hot and cold water meters, together with their readings and the coordinates of the displays showing those readings. Each image contains exactly one water meter. The archive also includes the segmentation results as masks and collages. Toloka was used for capturing the photos, segmenting them, and recognizing the readings.
ZIP archive, 981 MB
Photos: images/
Masks: masks/
Collages: collage/
Collected for the paper “Crowdsourced Human Evaluation Data in Plot Writing From Pre-Trained Language Models”, this dataset evaluates generated stories from various baselines on multiple aspects: naturalness, interestingness, cohesiveness, and story ending. Separate evaluation tasks were run for each aspect of naturalness, interestingness, and cohesiveness in 50 generated stories. An additional task evaluated story endings in 50 randomly selected pairs (story, ending) as pairwise comparisons.
Raw data: general-new.tsv
In each folder:
Labels: full-annotation-result-new.tsv
Demonstrative examples with their expected labels: train.csv
RuBQ 2.0 is the second version of RuBQ. It contains 2,910 questions along with the answers and SPARQL queries. The dataset can be used for the evaluation of KBQA and machine reading comprehension, paragraph retrieval, end-to-end open-domain question answering and experiments in hybrid QA, where KBQA and text-based QA can enrich and complement each other.
Development set: RuBQ_2.0_dev.json
Test set: RuBQ_2.0_test.json
Paragraphs: RuBQ_2.0_paragraphs.json
RuBQ 1.0 (Russian Knowledge Base Questions, pronounced [‘rubik]) is the first Russian dataset for Knowledge Base Question Answering (KBQA). It consists of 1,500 questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, and a subset of Wikidata covering entities with Russian labels. The dataset is intended to be used as development and test sets in cross-lingual transfer, few-shot learning, or learning with synthetic data scenarios.
Development set: RuBQ_1.0_dev.json
Test set: RuBQ_1.0_test.json
This dataset of 10,000 dialogues was gathered by MIPT's Neural Networks and Deep Learning Lab for conversational AI research. It contains profiles of imaginary personalities with descriptions, along with dialogues between participants who were each given a random profile and instructed to mimic the described personality.
ZIP archive, 8.19 MB
Profiles: profile.tsv
Dialogues: dialogues.tsv
Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.
ZIP archive, 95.6 KB
Training data: task2_ru_train.tsv
Validation data: task2_ru_validation.tsv
Testing data: task2_ru_test.tsv
Script for downloading tweets: download_tweets.py
Description and script instructions: Readme.md
This dataset, assembled by Dmitry Ustalov in 2017 for the Watlink method, contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of this term (hyponym) in 10,600 word pairs. It is based on the nouns from the Russian National Corpus and relationships from the RuThes and RuWordNet lexical ontologies.
ZIP archive, 2.01 MB
Input data: lrwc-1.1-assignments.tsv
Training tasks: toloka-isa-50-skip-300-train-hit.tsv
Aggregated results: lrwc-1.1-aggregated.tsv
This dataset contains about 60,000 crowdsourced labels gathered on Toloka for 1,000 tasks and ground truth labels for almost all of them. The task was to classify websites into five categories based on the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict the category.
ZIP archive, 0.45 MB
Ground truth: golden_labels.tsv
Features: features.tsv
Crowd labels: crowd_labels.tsv
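A common baseline for turning overlapping crowd labels into one label per task is majority voting. The sketch below assumes, hypothetically, that the crowd labels are stored as three tab-separated columns (worker, task, label) without a header row. The sample rows and label names are invented for illustration, so check the actual file layout before reusing it.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows in an ASSUMED layout: worker <TAB> task <TAB> label.
SAMPLE_TSV = (
    "w1\tt1\tcat1\n"
    "w2\tt1\tcat1\n"
    "w3\tt1\tcat2\n"
    "w1\tt2\tcat2\n"
    "w2\tt2\tcat2\n"
)


def majority_vote(tsv_file):
    """Return {task: most frequent label}; ties go to the label seen first."""
    votes = defaultdict(Counter)
    for worker, task, label in csv.reader(tsv_file, delimiter="\t"):
        votes[task][label] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


aggregated = majority_vote(io.StringIO(SAMPLE_TSV))
print(aggregated)
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` object, after adjusting the column order if it differs from the assumption above.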
This dataset, assembled by Dmitry Ustalov in 2017, contains human-annotated sense identifiers for 2,562 contexts of 20 words used in the RUSSE’2018 shared task on Word Sense Induction and Disambiguation for Russian. After labeling, every context was additionally inspected and curated by the organizers of the shared task.
ZIP archive, 2.23 MB
Crowd labels: assignments_01-12-2017.tsv
Ground truth: report-curated.tsv.xz
Aggregated results: bts-rnc-crowd.tsv
This dataset contains transcriptions of audio recordings from LibriSpeech obtained on Toloka. The process is described in the NeurIPS '21 Datasets and Benchmarks paper entitled "CrowdSpeech and VoxDIY: Benchmark Datasets for Crowdsourced Audio Transcription".
ZIP archive, 2.6 MB
crowdspeech-dev-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-dev-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
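Transcription quality is commonly measured with word error rate (WER), the word-level edit distance between a transcription and the ground truth, divided by the length of the ground truth. The sketch below works on plain strings because the column layout of crowd_labels.csv and gt.csv is not described here. The function name and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("sat" -> "sit") and one deleted "the".
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

The same function can score either an individual crowd transcription or an aggregated one against the ground truth text.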
This dataset, designed for evaluating answer aggregation methods in crowdsourcing, contains around 0.5 million anonymized crowdsourced labels collected in the Relevance 2 Gradations project in 2016 at Yandex. In this project, query-document pairs are provided with binary labels: relevant or non-relevant. The dataset also contains gold labels for comparing aggregation methods.
ZIP archive, 3.08 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv
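One simple way to compare aggregation methods against gold labels is accuracy on the tasks that have a gold label. The sketch below assumes the aggregated and gold labels have already been loaded into dictionaries keyed by task identifier. The task names and label values are invented for illustration.

```python
def accuracy(predicted, gold):
    """Share of gold-labeled tasks whose predicted label matches the gold label."""
    scored = [task for task in gold if task in predicted]
    if not scored:
        raise ValueError("no overlap between predicted and gold tasks")
    correct = sum(predicted[task] == gold[task] for task in scored)
    return correct / len(scored)


# Invented toy data: 1 = relevant, 0 = non-relevant.
gold = {"t1": 1, "t2": 0, "t3": 1}
method_a = {"t1": 1, "t2": 0, "t3": 0, "t4": 1}  # t4 has no gold label and is ignored
method_b = {"t1": 1, "t2": 0, "t3": 1}

score_a = accuracy(method_a, gold)
score_b = accuracy(method_b, gold)
print(f"method A: {score_a:.2f}, method B: {score_b:.2f}")
```

Tasks without a gold label are skipped rather than counted as errors, so both methods are scored on the same subset.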
This dataset was designed for evaluating answer aggregation methods in crowdsourcing. It contains around 1 million anonymized crowdsourced labels collected in the Relevance 5 Gradations project in 2016 at Yandex. In this project, query-document pairs are labeled on a scale of 1 to 5, from most relevant to least relevant. The dataset also contains gold labels for comparing aggregation methods.
ZIP archive, 7.17 MB
Crowd labels: crowd_labels.tsv
Ground truth: golden_labels.tsv
Banned users: bans.tsv
Collected for the KDD '20 paper "Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform", this dataset contains user activity sessions recorded in 18 million tasks performed by 161,377 users in Toloka over a three-month period (September-November 2018). It includes timestamps, anonymized project and user identifiers, reward information, number of microtasks, instructions, data schema description, responses, and various descriptive task properties.
ZIP archive, 1.07 GB
Completed tasks: assignments.tsv
Project data: projects.tsv
Anonymized user data: users.tsv
Task selection sessions: visits.tsv
This dataset, as described in the NeurIPS '20 Data-Centric AI Workshop paper entitled "IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons", contains 9,150 images appearing in 250,249 paired comparisons annotated on the Toloka crowdsourcing platform. It has balanced distributions of age and gender using the well-known IMDB-WIKI dataset as ground truth.
ZIP archive, 9 MB
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
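Pairwise comparisons are usually aggregated into a ranking before being checked against ground truth. The sketch below uses the simplest possible scheme, the share of comparisons each item wins. It is a stand-in baseline, not the method from the paper, and the `(left, right, winner)` tuple layout and image names are assumptions made for illustration.

```python
from collections import Counter


def win_rates(comparisons):
    """Map each item to wins / appearances over (left, right, winner) tuples."""
    wins, appearances = Counter(), Counter()
    for left, right, winner in comparisons:
        if winner not in (left, right):
            raise ValueError(f"winner {winner!r} is not part of the pair")
        appearances[left] += 1
        appearances[right] += 1
        wins[winner] += 1
    return {item: wins[item] / appearances[item] for item in appearances}


# Invented comparisons between three hypothetical images.
sample = [
    ("img_a", "img_b", "img_a"),
    ("img_a", "img_c", "img_a"),
    ("img_b", "img_c", "img_b"),
    ("img_b", "img_a", "img_a"),
]
rates = win_rates(sample)
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

Win rate ignores how strong each opponent was. Probabilistic models such as Bradley-Terry account for that and are a common next step.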