Public datasets

With thousands of annotators making millions of evaluations in hundreds of tasks every day, Toloka is a major source 
of human-marked training data. Toloka supports academic research and innovation by sharing large amounts of 
accurate data applicable to machine learning in a variety of areas.

Please note: These public datasets are only available for non-commercial use with a clear reference to Toloka as the source of data.
If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.

    • Human Evaluation of Generated Stories
      Collected for the paper “Crowdsourced Human Evaluation Data in Plot Writing From Pre-Trained Language Models”, this dataset evaluates generated stories from various baselines on multiple aspects: naturalness, interestingness, cohesiveness, and story ending. A separate evaluation task was run for each of naturalness, interestingness, and cohesiveness on 50 generated stories. An additional task evaluated story endings as pairwise comparisons on 50 randomly selected (story, ending) pairs.
    Learn More

    Raw data: general-new.tsv
    In each folder:
    Labels: full-annotation-result-new.tsv
    Demonstrative examples with their expected labels: train.csv

    • RuBQ 2.0: An Innovated Russian Question Answering Dataset
      RuBQ 2.0 is the second version of RuBQ. It contains 2,910 questions along with the answers and SPARQL queries. The dataset can be used for the evaluation of KBQA and machine reading comprehension, paragraph retrieval, end-to-end open-domain question answering and experiments in hybrid QA, where KBQA and text-based QA can enrich and complement each other.
    Learn More

    Development set: RuBQ_2.0_dev.json
    Test set: RuBQ_2.0_test.json
    Paragraphs: RuBQ_2.0_paragraphs.json

    • RuBQ 1.0: A Russian Dataset for Question Answering over Wikidata
      RuBQ 1.0 (Russian Knowledge Base Questions, pronounced [‘rubik]) is the first Russian dataset for Knowledge Base Question Answering (KBQA). It consists of 1,500 questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, and a subset of Wikidata covering entities with Russian labels. The dataset is intended for use as development and test sets in cross-lingual transfer, few-shot learning, or learning with synthetic data scenarios.
    Learn More

    Development set:
    Test set: RuBQ_1.0_test.json

    • Toloka Persona Chat Rus
      This dataset of 10,000 dialogues was gathered by MIPT's Neural Networks and Deep Learning Lab for conversational AI research. The dataset contains profiles of imaginary personalities with descriptions, and dialogues between participants who were each given a random profile and instructed to mimic the described personality.

    ZIP archive, 8.19 MB
    Profiles: profile.tsv
    Dialogues: dialogues.tsv

    • The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)
      Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9,515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.

    ZIP archive, 95.6 KB
    Training data: task2_ru_train.tsv
    Validation data: task2_ru_validation.tsv
    Testing data: task2_ru_test.tsv
    Script for downloading tweets:
    Description and script instructions:

    • Lexical Relations from the Wisdom of the Crowd (LRWC)
      This dataset, assembled by Dmitry Ustalov in 2017 for the Watlink method, contains the opinions of Russian native speakers about the relationship between a generic term (hypernym) and a specific instance of this term (hyponym) in 10,600 word pairs. It is based on the nouns from the Russian National Corpus and relationships from the RuThes and RuWordNet lexical ontologies.

    ZIP archive, 2.01 MB
    Input data: lrwc-1.1-assignments.tsv
    Training tasks: toloka-isa-50-skip-300-train-hit.tsv
    Aggregated results: lrwc-1.1-aggregated.tsv

    • Toloka Aggregation Features
      This dataset contains about 60,000 crowdsourced labels gathered on Toloka for 1,000 tasks and ground truth labels for almost all of them. The task was to classify websites into five categories based on the presence of adult content. Additionally, each task has 52 real-valued features that can be used to predict the category.

    ZIP archive, 0.45 MB
    Ground truth: golden_labels.tsv
    Features: features.tsv
    Crowd labels: crowd_labels.tsv
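
A dataset like this pairs several crowd labels per task with a ground truth label, which makes it a natural testbed for label aggregation. The following is a minimal sketch of a majority-vote baseline, not the method used by Toloka. It assumes that each crowd label row holds (worker, task, label) and each ground truth row holds (task, label), tab-separated and without a header; the actual column layout of crowd_labels.tsv and golden_labels.tsv is not stated on this page and should be checked before use. The in-memory strings are hypothetical stand-ins for the real files.

```python
import csv
import io
from collections import Counter, defaultdict


def read_tsv(handle):
    """Read tab-separated rows from a file-like object, skipping blank lines."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


def majority_vote(rows):
    """Return {task: most frequent label} from (worker, task, label) rows.

    Assumed column order; ties are not handled specially.
    """
    votes = defaultdict(Counter)
    for _worker, task, label in rows:
        votes[task][label] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predicted, truth):
    """Share of tasks that have a ground truth label and a matching prediction."""
    scored = [task for task in truth if task in predicted]
    if not scored:
        return 0.0
    return sum(predicted[t] == truth[t] for t in scored) / len(scored)


# Hypothetical stand-ins for crowd_labels.tsv and golden_labels.tsv.
crowd = read_tsv(io.StringIO("w1\tt1\tA\nw2\tt1\tA\nw3\tt1\tB\nw1\tt2\tC\nw2\tt2\tC\n"))
gold = dict(read_tsv(io.StringIO("t1\tA\nt2\tB\n")))

labels = majority_vote(crowd)
print(labels, accuracy(labels, gold))
```

To run this on the downloaded files, replace the io.StringIO objects with open(path, newline="", encoding="utf-8") handles. Tasks without a ground truth label are skipped when computing accuracy, since the description says ground truth covers almost all tasks rather than all of them.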

  • Download

    ZIP archive, 2.23 MB
    Crowd labels:
    Ground truth: report-curated.tsv.xz
    Aggregated results: bts-rnc-crowd.tsv

  • Download

    ZIP archive, 2.6 MB
    Crowd labels: crowd_labels.csv
    Ground truth: gt.csv

Have a dataset that you are ready to share? Submit it for publication on this page.

Collect and annotate your dataset

Use the Toloka platform to prepare a dataset that meets your needs.