With thousands of annotators making millions of evaluations across hundreds of tasks every day, Toloka is a major source
of human-labeled training data. Toloka supports academic research and innovation by sharing large amounts of
accurate data applicable to machine learning in a variety of areas.
Please note: These public datasets are available for non-commercial use only, with a clear reference to Toloka as the source of the data.
If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.
Raw data: general-new.tsv
In each folder:
Labels: full-annotation-result-new.tsv
Demonstrative examples with their expected labels: train.csv
Development set: RuBQ_2.0_dev.json
Test set: RuBQ_2.0_test.json
Paragraphs: RuBQ_2.0_paragraphs.json
Development set: RuBQ_1.0_dev.json
Test set: RuBQ_1.0_test.json
ZIP archive, 8.19 MB
Profiles: profile.tsv
Dialogues: dialogues.tsv
ZIP archive, 95.6 KB
Training data: task2_ru_train.tsv
Validation data: task2_ru_validation.tsv
Testing data: task2_ru_test.tsv
Script for downloading tweets: download_tweets.py
Description and script instructions: Readme.md
ZIP archive, 2.01 MB
Input data: lrwc-1.1-assignments.tsv
Training tasks: toloka-isa-50-skip-300-train-hit.tsv
Aggregated results: lrwc-1.1-aggregated.tsv
ZIP archive, 0.45 MB
Ground truth: golden_labels.tsv
Features: features.tsv
Crowd labels: crowd_labels.tsv
ZIP archive, 2.23 MB
Crowd labels: assignments_01-12-2017.tsv
Ground truth: report-curated.tsv.xz
Aggregated results: bts-rnc-crowd.tsv
ZIP archive, 2.6 MB
crowdspeech-dev-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-dev-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-clean:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
crowdspeech-test-other:
Crowd labels: crowd_labels.csv
Ground truth: gt.csv
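Several of the archives above pair a crowd labels file with a ground truth file. A common first step with such a pair is to aggregate the crowd labels per task and compare the result with the ground truth. The sketch below shows simple majority voting on categorical labels, using small inline sample data. The column layout (task, worker, label for crowd labels; task, label for ground truth), the sample values, and all function names are assumptions made for illustration. The actual files may use different columns, headers, or label types, so check each archive before adapting this.

```python
import csv
import io
from collections import Counter, defaultdict


def read_tsv(text):
    """Parse tab-separated text into a list of tuples, skipping blank lines."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def majority_vote(rows):
    """Return {task: most frequent label} from (task, worker, label) rows."""
    votes = defaultdict(Counter)
    for task, _worker, label in rows:
        votes[task][label] += 1
    # On a tie, the label seen first for that task wins.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predicted, truth):
    """Share of tasks present in both mappings where the labels agree."""
    shared = [task for task in truth if task in predicted]
    if not shared:
        return 0.0
    return sum(predicted[task] == truth[task] for task in shared) / len(shared)


# Hypothetical sample data; assumed layout: task, worker, label.
CROWD = "t1\tw1\tcat\nt1\tw2\tcat\nt1\tw3\tdog\nt2\tw1\tdog\nt2\tw2\tdog\n"
# Hypothetical sample data; assumed layout: task, label.
GOLD = "t1\tcat\nt2\tcat\n"

aggregated = majority_vote(read_tsv(CROWD))
gold = dict(read_tsv(GOLD))
score = accuracy(aggregated, gold)
print(aggregated, score)
```

Majority voting is only a baseline and applies to categorical labels. Free-text annotations such as transcriptions need a different aggregation method.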
Have a dataset that you are ready to share? Submit it for publication on this page.