Toloka Research

Our mission

Our team strives to enhance the capabilities and safety of frontier models with valuable data and advanced training and evaluation methods.

New data collection methods for SFT and RLHF that leverage synthetic data, AI feedback, and expert human-generated data.

Improved approaches to model training and alignment that enhance model capabilities in long-horizon reasoning and autonomous behavior.

High-quality evaluation metrics & benchmarks to measure performance in coding, math, reasoning, multilingualism, multimodality, and other complex tasks.

Red-teaming methods for identifying model vulnerabilities and developing safety metrics such as harmfulness, security and CBRN risks, social bias, and more.

Our projects

Tendem: The first Hybrid AI + Human agent

Beemo: Benchmark of Expert-edited Machine-generated Outputs

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

BigCode: Open-scientific collaboration working on the responsible development of Large Language Models for Code

Reinforcement Learning from Human Feedback: A Tutorial

Tutorial: Aligning Large Language Models to Low-Resource Languages

NTIRE 2023 Challenge on Night Photography Rendering

Large-Scale Machine Translation Evaluation for African Languages

Machine Learning for Planetary Science

NASA

AI for Good: Framework to Empower Digital Workers

CLEF. Shared task: Preference Prediction

JEEM: Vision-Language Understanding in Four Arabic Dialects

Publications

JEEM: Vision-Language Understanding in Four Arabic Dialects

arXiv 2025

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

arXiv 2025

Surveying Professional Writers on AI: Limitations, Expectations, and Fears

arXiv 2025

LLMs Simulate Big5 Personality Traits: Further Evidence

EACL 2024

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

arXiv 2024

Beemo: Benchmark of Expert-edited Machine-generated Outputs

arXiv 2024

StarCoder: may the source be with you!

arXiv 2023

Reinforcement Learning from Human Feedback

ICML 2023

Best Prompts for Text-to-Image Models and How to Find Them

SIGIR 2023

Clustering Without Knowing How To: Application and Evaluation

ECIR 2023

Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions

EAAI 2023

WSDM Cup 2023 Challenge on Visual Question Answering

WSDM 2023

Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages

WMT 2022

IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons

NeurIPS 2021

CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

NeurIPS 2021

A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python

HCOMP 2021

VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions

VLDB 2021

Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform

KDD 2020

Text Recognition Using Anonymous CAPTCHA Answers

WSDM 2020

Conferences and events

We regularly hold tutorials and lead workshops at some of the biggest AI conferences around the globe.

NeurIPS 2026

December 8–13, 2026

Ai4 2026

August 4–6, 2026

We are Developers

July 9–10, 2026

Blog

Introducing JEEM: A new benchmark for evaluating low-resource Arabic dialects

U-MATH & μ-MATH: new university-level math benchmarks challenge LLMs

Toloka and top universities launch innovative benchmark for detecting AI-generated texts

Applied ML at Toloka

We use ML technologies to enhance data production for better data quality, faster data collection, and lower costs.

AI copilots

In-task tools help experts focus on quality: accurate fact checks, grammar checks, suggestions and more

Antifraud algorithms

Fraud prevention built into every data pipeline from start to finish to guarantee authentic human effort and expertise

Matching algorithms

Task distribution system matches tasks to the best qualified annotators and experts

Automated metrics

Our data quality metrics correlate with model performance gains, giving you confidence in your training data

Let's collaborate!
Our Research team would love to hear from you
