At Toloka, we are committed to unlocking AI opportunities. Every day, our researchers tackle pressing AI and ML challenges, make appearances at prominent global events, and publish their findings in scientific journals. Scroll down to learn more.
Research papers
Browse through some of our latest work.
CrowdSpeech & Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. Crowdsourcing has become one of the standard tools for cheap and time-efficient data collection for simple problems such as image classification, thanks in large part to advances in research on aggregation methods.

NeurIPS 2021
A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python

In this paper, we demonstrate Crowd-Kit, a general-purpose crowdsourcing computational quality control toolkit. It provides efficient implementations in Python of computational quality control algorithms for crowdsourcing, including uncertainty measures and crowd consensus methods.

HCOMP 2021
VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions

This paper reviews the crowdsourced audio transcription shared task devoted to this problem and co-organized with the Crowd Science Workshop at VLDB 2021. The competition attracted 18 participants, 8 of whom beat our non-trivial baselines.

VLDB 2021
Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform

We study the problem of predicting future hourly earnings and task completion time for a crowdsourcing platform user who sees the list of available tasks and wants to select one of them to execute.

KDD 2020
Text Recognition Using Anonymous CAPTCHA Answers

In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA.

WSDM 2020


We regularly hold tutorials and lead workshops at some of the biggest AI conferences around the globe.
Crowd Science Workshop at VLDB '21
Crowd Science: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale. We focused on the best practices for efficient and trustworthy crowdsourcing.
We shared some of the unique insights we have gained from six years of industry experience in efficient natural language data annotation via crowdsourcing.
TheWebConf '21
We presented a systematic view on using Human-in-the-Loop to obtain scalable offline evaluation processes and, in particular, high-quality relevance judgements.
Toloka Workshop at ICML '21
We provided a comprehensive picture of how crowdsourcing can be applied to real-life AI production.
NeurIPS '20 Crowd Science Workshop
We discussed key issues in preparing labeled data for machine learning, with a focus on remoteness, fairness, and mechanisms in the context of crowdsourcing for data collection and labeling.
CVPR '20
We presented a data processing pipeline used for training self-driving cars. Participants gained practical experience launching an annotation project in Toloka.
We explored the practical aspects of how crowdsourcing can be applied to information retrieval. Participants learnt how to create a dataset with relevant products.
WSDM '20
We explored the practice of efficient data collection via crowdsourcing: aggregation, incremental relabeling, and pricing.
KDD '19
We introduced data labeling via public crowdsourcing marketplaces and presented the key components of efficient label collection.

News and features

Stay up to date with our latest stories.

Best Text Aggregation Methods: VLDB 2021 Crowd Science Challenge Revisited
September 15, 2021

Aggregating Pairwise Comparison (Inspired by a Demo at ICML 2021)
August 13, 2021

How to Aggregate Categorical Replies via Crowdsourcing
July 29, 2021

Discussing Trust, Ethics, and Responsibility in ML at ICML, VLDB, and ICLR
June 21, 2021

NAACL-HLT 2021: 6 Tutorials You Can’t Afford to Miss
June 1, 2021


Explore our public datasets, the Crowd-Kit Python library, and grants.
Public datasets
With thousands of performers making millions of evaluations in hundreds of tasks every day, Toloka is a major source of human-marked training data. We support research and innovation by sharing large amounts of accurate data applicable to machine learning in a variety of areas.
Crowd-Kit, our powerful Python library, implements commonly used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets. We strive to implement functionality that simplifies working with crowdsourced data.
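To illustrate what aggregating crowdsourced annotations means in practice, here is a minimal sketch of majority voting, one of the simplest aggregation methods. It is plain Python and does not use Crowd-Kit's actual API; the function name, the (task, worker, label) triple layout, and the sample data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Return the most frequent label for each task.

    `annotations` is an iterable of (task, worker, label) triples.
    This layout is a hypothetical one chosen for the sketch.
    """
    votes = defaultdict(Counter)
    for task, _worker, label in annotations:
        votes[task][label] += 1
    # Pick the label with the highest vote count per task.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


# Made-up sample data: three workers label two images.
annotations = [
    ("img1", "w1", "cat"),
    ("img1", "w2", "cat"),
    ("img1", "w3", "dog"),
    ("img2", "w1", "dog"),
    ("img2", "w2", "dog"),
    ("img2", "w3", "cat"),
]

result = majority_vote(annotations)
print(result)
```

Majority voting treats every worker as equally reliable. More advanced aggregation methods typically weight workers by estimated skill, which is where a dedicated library becomes useful.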
Grants for research
Our grant program aims to support the academic community and encourage the use of crowdsourcing in research. Take advantage of the chance to obtain high-quality data that will enhance your research by applying for a grant.
Meet our research team
We thrive on continuous improvement and international cooperation. Contact us on LinkedIn if you’d like to collaborate.
Dmitry Ustalov
Head of Research
Google Scholar
Nikita Pavlichenko
Machine Learning Researcher
Google Scholar
Boris Tseytlin
Machine Learning Researcher
Daniil Likhobaba
Junior Analyst
Our collaborators
Alexey Drutsa
Head of Efficiency & Growth
Google Scholar
Vladimir Losev
Head of Crowdsourcing Tools Development
Evgeny Tulin

Join us

We’re always on the lookout for new talent. Want to join our team and tackle the most pressing challenges of the AI industry? Get in touch with us today!
Reach out
Write to us if you have any questions or ideas you’d like to share.