Public datasets

With thousands of annotators making millions of evaluations across hundreds of tasks every day, Toloka is a major source
of human-labeled training data. Toloka supports academic research and innovation by sharing large volumes of
accurate data that can be used for machine learning in a variety of areas.

Please note: These public datasets are available for non-commercial use only and must be clearly credited to Toloka as the source of the data.
If you plan to use any of these datasets for commercial purposes, please contact us to obtain our consent.

    • Handwritten Text Datasets
      This dataset contains over 7,000 images of handwritten text from more than 100 unique contributors in three languages: Spanish, French, and Arabic. It is suited to training and testing handwritten text recognition models. The images contain punctuation and characters that are unique to each language and don't exist in the Latin alphabet, which makes the recognition task more challenging than with other open-source benchmark datasets for text recognition.

    ZIP archive, 10.8 GB
    Labels: texts.tsv
    Photos: images/

    • Toloka Business ID Recognition
      This dataset, commissioned by the Yandex Business Directory, contains 10,000 photos of organization information signs taken in the Russian Federation, along with the INN (taxpayer ID) and OGRN (Primary State Registration Number) codes shown on these signs. Toloka was used both to capture the photos and to recognize the INN and OGRN codes.

    ZIP archive, 19.5 GB
    Labels: data.tsv
    Photos: photos/

    • Toloka WaterMeters
      This dataset, collected by Roman Kucev from, contains 1,244 images of hot and cold water meters, along with their readings and the coordinates of the displays showing those readings. Each image contains exactly one water meter. The archive also includes the segmentation results in the form of masks and collages. Toloka was used for capturing the photos, segmenting them, and recognizing the readings.

    ZIP archive, 981 MB
    Photos: images/
    Masks: masks/
    Collages: collage/
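
After an archive is extracted, the label file and image folders listed above can be read with the Python standard library. The following is a minimal sketch, not an official loader. The column layout of texts.tsv and data.tsv is not described on this page, so rows are returned as raw lists of strings. The assumption that a mask shares its file name (ignoring the extension) with its photo is also hypothetical, as are the helper names read_label_rows and pair_by_stem.

```python
import csv
from pathlib import Path


def read_label_rows(tsv_path):
    """Read a tab-separated label file into a list of rows (lists of strings).

    The column meaning is an assumption left to the caller: inspect the
    first rows of the real file before relying on any column order.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in the text as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


def pair_by_stem(first_dir, second_dir):
    """Pair files from two folders whose names match, ignoring the extension.

    Assumption (hypothetical): a photo such as images/x.jpg corresponds to
    a mask such as masks/x.png. Files without a counterpart are skipped.
    """
    second = {p.stem: p for p in Path(second_dir).iterdir() if p.is_file()}
    pairs = []
    for path in sorted(Path(first_dir).iterdir()):
        if path.is_file() and path.stem in second:
            pairs.append((path, second[path.stem]))
    return pairs
```

For example, read_label_rows("texts.tsv") would return every non-empty row of the handwritten-text labels, and pair_by_stem("images", "masks") would list the water meter photos that have a matching mask under the naming assumption above.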

Have a dataset that you are ready to share? Submit it for publication on this page.

Collect and annotate your dataset

Use the Toloka platform to prepare a dataset that meets your needs.