At Toloka, we are committed to unlocking AI opportunities. Every day, our researchers tackle pressing AI and ML challenges, make appearances at prominent global events, and publish their findings in scientific journals. Scroll down to learn more.
Browse through some of our latest work.
CrowdSpeech & Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription
Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life.
Crowdsourcing has become one of the standard tools for cheap and time-efficient data collection for simple problems such as image classification, thanks in large part to advances in research on aggregation methods.
A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python
In this paper, we demonstrate Crowd-Kit, a general-purpose computational quality control toolkit for crowdsourcing. It provides efficient Python implementations of computational quality control algorithms, including uncertainty measures and crowd consensus methods.
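Crowd consensus methods reduce several noisy worker labels for the same task to a single answer. As a minimal illustration of the idea (plain Python, not Crowd-Kit's actual API; the task and worker names below are invented), here is simple majority voting:

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate (task, worker, label) triples into one label per task
    by picking the most frequent label; ties break arbitrarily."""
    by_task = {}
    for task, worker, label in annotations:
        by_task.setdefault(task, []).append(label)
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in by_task.items()}

# Hypothetical sample data: three workers label two images.
votes = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
print(majority_vote(votes))  # {'img1': 'cat', 'img2': 'dog'}
```

Library implementations go further than this sketch by also modeling per-worker reliability (e.g., the Dawid-Skene family of methods).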
VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions
This paper reviews the crowdsourced audio transcription shared task devoted to this problem, co-organized with the Crowd Science Workshop at VLDB 2021. The competition attracted 18 participants, 8 of whom beat our non-trivial baselines.
In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA.
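Aggregating CAPTCHA answers means combining several independent text guesses for the same image into one transcription. As a simplified sketch of the idea (not the paper's algorithm, which must also handle guesses of different lengths and anonymous users; the sample guesses are invented), here is per-position character-level majority voting over equal-length guesses:

```python
from collections import Counter

def aggregate_guesses(guesses):
    """Combine equal-length CAPTCHA guesses by taking the most frequent
    character at each position (ties break arbitrarily)."""
    length = len(guesses[0])
    if not all(len(g) == length for g in guesses):
        raise ValueError("this sketch assumes equal-length guesses")
    return "".join(Counter(g[i] for g in guesses).most_common(1)[0][0]
                   for i in range(length))

# Hypothetical guesses collected for one CAPTCHA image.
print(aggregate_guesses(["7xk2q", "7xkzq", "txk2q"]))  # 7xk2q
```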
Best Text Aggregation Methods: VLDB 2021 Crowd Science Challenge Revisited
September 15, 2021
Aggregating Pairwise Comparison (Inspired by a Demo at ICML 2021)
August 13, 2021
How to Aggregate Categorical Replies via Crowdsourcing
July 29, 2021
Discussing Trust, Ethics, and Responsibility in ML at ICML, VLDB, and ICLR
June 21, 2021
NAACL-HLT 2021: 6 Tutorials You Can’t Afford to Miss
June 1, 2021
Explore our public datasets, Crowd-Kit Python library and grants.
With thousands of performers making millions of evaluations in hundreds of tasks every day, Toloka is a major source of human-marked training data. We support research and innovation by sharing large amounts of accurate data applicable to machine learning in a variety of areas.
Our powerful Python library implements commonly used aggregation methods for crowdsourced annotation and offers the relevant metrics and datasets. We strive to implement functionality that simplifies working with crowdsourced data.
Our grant program aims to support the academic community and encourage the use of crowdsourcing in research. Apply for a grant for the chance to obtain high-quality data that will enhance your research.