Toloka documentation

More usage examples

The Toloka-Kit recipes cover data collection, labeling, aggregation, and other use cases.

Toloka-Kit allows you to:

  • Easily reuse projects by just copying and pasting code. No need to configure parameters in the interface over and over again.

  • Train your ML models and run your data labeling projects in the same environment.

  • Take advantage of open-source code that anyone can use and contribute to.

List of recipes

We recommend that you start with our sample project recipe. It describes the typical Toloka-Kit workflow and explains the main entities (classes and methods) that Toloka-Kit uses to create and set up projects, manage pools, and upload tasks. It also shows how to download results and aggregate them using our Crowd-Kit library.
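To make the aggregation step concrete, here is a minimal sketch of majority-vote aggregation in plain Python. It is only an illustration of the idea and does not use the Crowd-Kit API; the function name `majority_vote` and the `(task_id, performer_id, label)` tuple layout are assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most frequent label for each task.

    `answers` is an iterable of (task_id, performer_id, label) tuples.
    This layout is hypothetical and chosen only for this sketch.
    """
    votes = defaultdict(Counter)
    for task_id, _performer_id, label in answers:
        votes[task_id][label] += 1
    # Ties are not handled specially here; a real project would need a policy for them.
    return {task_id: counts.most_common(1)[0][0] for task_id, counts in votes.items()}


# Three performers label two images.
answers = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
aggregated = majority_vote(answers)
print(aggregated)
```

In a real project you would typically rely on the aggregation methods that Crowd-Kit provides rather than a hand-written function like this one.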

Recipe Description and tags
Computer vision
Image collection

Open in Colab
The goal of this project is to collect a dataset of dog and cat images. Performers will be asked to take a photo of their pet and specify its species.

CV, Classification, Collecting, Dataset
Image classification

Open in Colab
An example of binary image classification, made on a dataset with cats and dogs. We ask performers to look at the pictures and decide what animal is in the picture.

CV, Classification
Object detection

Open in Colab
Example of solving the classic problem of annotating images for training detection algorithms. In real-world tasks, annotation is usually done with a polygon. We chose to use a rectangular outline to simplify the task so that we can reduce costs and speed things up.

CV, Segmentation, Detection, Bounding boxes, Street, Traffic sign, Verification Project
HTR image gathering

Open in Colab
This is an example of a simple pipeline for gathering handwriting images. The resulting dataset can be used to train and evaluate HTR models.

CV, HTR, Texts, Verification project, Collecting, Dataset
Blood cells classification

Open in Colab
In this project, we will show Toloka performers an image of a blood cell along with brief instructions. Then, we will ask them to choose which type of white blood cell they see in the image.

CV, Classification, Medicine, Benchmark
Video collection

Open in Colab
The goal is to collect a set of video recordings where people show certain gestures, similar to popular emojis. There are several emoji combinations and we ask Tolokers to record a video similar to those emojis, meeting certain criteria about recording quality.

CV, Video, Collecting, Dataset
Text Recognition

Open in Colab
We have a set of water meter images. We need to get each water meter’s readings. We ask performers to look at the images and write down the digits on each water meter.

CV, OCR
NLP
Text classification

Open in Colab
We have a set of news article headlines. We need to get these classified according to whether they are clickbait or not.

NLP, Classification, Texts
Question answering on SQuAD

Open in Colab
This recipe addresses question answering on the SQuAD 2.0 dataset, one of the most popular tasks in natural language processing. Answers to the questions are collected from and validated by human performers.

NLP, Question Answering, Texts, Benchmark, Verification Project
Sentiment analysis

Open in Colab
We have a set of customer reviews, and we need to classify them as “Positive” or “Negative”. We ask performers to read a review and decide which category it belongs to.

NLP, Classification, Text
Intent classification

Open in Colab
We need to define which class the search query belongs to and distribute the queries between several categories inside the class. There’s a list of queries (related to travel and dining), each with an unknown class and category.

NLP, Intent, Classification, Texts
Audio analysis
Audio collection

Open in Colab
We have a set of texts, and we need to get voice recordings of these texts. We ask performers to read the texts aloud and record themselves. Recordings like these are used for training voice assistants.

ASR, TTS, Collecting, Dataset
Audio classification

Open in Colab
We have a set of voice recordings from different people. We need to get these classified according to the speaker’s gender. We ask performers to listen to the recordings and decide whether it is a man or a woman speaking.

ASR, TTS, Classification
ASR/TTS based on Wikipedia articles

Open in Colab
This example contains a full speech data collection pipeline, from extracting raw texts to labeling and validating speech recordings.

ASR, TTS, Texts, Verification project, Audio samples collection
Audio transcription

Open in Colab
We have a set of audio recordings. We need to obtain a transcription of each recording. We ask performers to listen to the recordings and type what they hear.

ASR, Transcription, Pipeline, Post-acceptance
Ranking
Side-by-side image comparison

Open in Colab
We have a set of 6 icons. We need to find out which icon people prefer and determine the top icon out of the set. We show performers two icons each and ask them to choose the one they prefer. Then we aggregate these results to obtain the top icon.

Ranking, Side-by-side
Spatial Crowdsourcing
Simplest Spatial Crowdsourcing

Open in Colab
In this example, we will collect pictures of metro entrances. This example can also be reused for production tasks such as monitoring the state of objects or checking the presence of an organization or another physical object.

Spatial Crowdsourcing, Outdoor monitoring, Collecting
Survey
Simplest survey

Open in Colab
The goal is to collect some information about how people manage stress and if they are ready to get a meditation app to do that. There is a survey where we ask some questions about stress level and management, meditation practices and users' habits concerning paid apps.

Survey, Collecting
Pipelines
Simple Toloka+ML pipeline on Prefect

Open in Colab
This example illustrates how crowdsourcing using Toloka can be made easier and cheaper by integrating an ML model. Furthermore, it shows how to run the whole project in the cloud using Prefect, which makes workflow orchestration much simpler.

Prefect, ML, Autohelper
Building streaming pipelines in Toloka

Open in Colab
Let's solve the following task: find goods in an online store that match a given image and arrange the results by relevance. In this example, we combine three Toloka projects into one pipeline.

Pipeline, Collecting, Dataset
Relevance
Search relevance

Open in Colab
We have a set of search queries and products on a website. We need to determine the extent to which each query is relevant to the corresponding product on the website. We ask performers to look at the search query and the product image from the website and rate the relevance level.

Relevance
Ad relevance

Open in Colab
In this example, we explore webpages containing ads and their descriptions. We will run the project using the new Toloka Ready-to-go solutions (App Services).

Relevance
Benchmarks
Image classification

Open in Colab
Image classification on CINIC-10. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 88%

Benchmark, CV, Classification
Text classification

Open in Colab
Text classification on IMDB movie reviews. Minimal configuration to achieve the described levels of quality. Accuracy on Test = 89%

Benchmark, NLP, Classification
Metrics
Jupyter dashboard

Open in Colab
An example of using a Jupyter dashboard to collect and display Toloka metrics inside a Jupyter notebook.

Metrics, Visualization
Graphite

Open in Colab
A MetricCollector usage example. In this notebook, you will learn how to collect Toloka metrics and simultaneously send them to a Graphite metrics server.

Metrics, Logging, Graphite
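As background for the Graphite recipe, the sketch below shows the Graphite plaintext protocol, where each metric is one line of the form `path value timestamp`. It does not use MetricCollector or any Toloka-Kit API. The helper names and the metric paths are hypothetical, and port 2003 is assumed because it is the usual default for Carbon's plaintext listener.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in the Graphite plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def graphite_payload(metrics, timestamp=None):
    """Join several (path, value) pairs into one payload string."""
    return "".join(graphite_line(path, value, timestamp) for path, value in metrics)


def send_to_graphite(payload, host, port=2003):
    """Send a payload over TCP. Not called here; it needs a reachable server."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


# Hypothetical metric paths, used only for illustration.
payload = graphite_payload(
    [("toloka.pool.submitted", 42), ("toloka.pool.accepted", 40)],
    timestamp=1700000000,
)
print(payload, end="")
```

To actually deliver the metrics, you would call `send_to_graphite(payload, host)` with the address of your own Graphite server.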

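The side-by-side image comparison recipe in the list above aggregates pairwise choices into a top item. One simple way to do that is to rank items by win rate, as in the sketch below. This is a deliberately simplified stand-in and not necessarily the method the recipe uses; the function name and the `(left, right, winner)` tuple layout are assumptions made for this example.

```python
from collections import Counter


def rank_by_win_rate(comparisons):
    """Rank items from most to least preferred by their share of wins.

    `comparisons` is an iterable of (left, right, winner) tuples,
    where `winner` must be either `left` or `right`.
    """
    wins, shown = Counter(), Counter()
    for left, right, winner in comparisons:
        if winner not in (left, right):
            raise ValueError(f"winner {winner!r} is not one of the compared items")
        shown[left] += 1
        shown[right] += 1
        wins[winner] += 1
    # Sort by descending win rate; break ties by item name for a stable result.
    return sorted(shown, key=lambda item: (-wins[item] / shown[item], item))


comparisons = [
    ("A", "B", "A"), ("A", "C", "A"), ("B", "C", "B"),
    ("A", "B", "A"), ("B", "C", "C"), ("A", "C", "C"),
]
ranking = rank_by_win_rate(comparisons)
print(ranking)  # ['A', 'C', 'B']
```

Win rate ignores how strong each opponent was, so a model-based method is usually a better choice when items are not compared equally often.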
Need more examples?

If you have an example of data labeling using Toloka-Kit, do not hesitate to send it. Add a link to your GitHub repository and a description to the table via a pull request.

Ideally, a great example should contain the following aspects:

  • Problem statement
  • How to set up a project
  • Where to get the data for the example
  • What to pay attention to when writing instructions
  • How to set up quality control
  • What is the final quality
  • Visualization of the obtained results
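For the final quality item in the list above, a common way to report quality is the accuracy of aggregated labels against a set of known answers. The sketch below is one possible way to compute it; the function name and the dictionary layout are assumptions made for this example.

```python
def accuracy(predicted, expected):
    """Share of tasks whose aggregated label matches the known answer.

    Both arguments map task IDs to labels. Only tasks present in both
    dictionaries are scored.
    """
    shared = predicted.keys() & expected.keys()
    if not shared:
        raise ValueError("no tasks in common between predictions and known answers")
    correct = sum(predicted[task] == expected[task] for task in shared)
    return correct / len(shared)


predicted = {"t1": "cat", "t2": "dog", "t3": "cat", "t4": "dog"}
expected = {"t1": "cat", "t2": "dog", "t3": "dog", "t4": "dog"}
quality = accuracy(predicted, expected)
print(f"Accuracy: {quality:.0%}")  # Accuracy: 75%
```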

You can also ask questions or request a new example in the Toloka-Kit issues.