Toloka Research

Our mission

Our team strives to enhance the capabilities and safety of frontier models with valuable data and advanced training and evaluation methods.

New data collection methods for SFT and RLHF that leverage synthetic data, AI feedback, and expert human-generated data.

Improved approaches to model training and alignment that enhance model capabilities in long-horizon reasoning and autonomous behavior.

High-quality evaluation metrics & benchmarks to measure performance in coding, math, reasoning, multilingualism, multimodality, and other complex tasks.

Red-teaming methods for identifying model vulnerabilities and developing safety metrics such as harmfulness, security and CBRN risks, social bias, and more.

Our projects

Beemo: Benchmark of Expert-edited Machine-generated Outputs

2024

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

2024

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

2024

BigCode: Open-scientific collaboration working on the responsible development of Large Language Models for Code

2023

Reinforcement Learning from Human Feedback: A Tutorial

2023

Tutorial: Aligning Large Language Models to Low-Resource Languages

2023

NTIRE 2023 Challenge on Night Photography Rendering

2023

Large-Scale Machine Translation Evaluation for African Languages

2022

Machine Learning for Planetary Science

2022

AI for Good: Framework to Empower Digital Workers

2021

CLEF. Shared task: Preference Prediction

2025

JEEM: Vision-Language Understanding in Four Arabic Dialects

2025

Publications

JEEM: Vision-Language Understanding in Four Arabic Dialects

arXiv 2025

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

arXiv 2025

Surveying Professional Writers on AI: Limitations, Expectations, and Fears

arXiv 2025

LLMs Simulate Big5 Personality Traits: Further Evidence

EACL 2024

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

arXiv 2024

Beemo: Benchmark of Expert-edited Machine-generated Outputs

arXiv 2024

StarCoder: may the source be with you!

arXiv 2023

Reinforcement Learning from Human Feedback

ICML 2023

Best Prompts for Text-to-Image Models and How to Find Them

SIGIR 2023

Clustering Without Knowing How To: Application and Evaluation

ECIR 2023

Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions

EAAI 2023

WSDM Cup 2023 Challenge on Visual Question Answering

WSDM 2023

Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages

WMT 2023

IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons

NeurIPS 2021

CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

NeurIPS 2021

A General-Purpose Crowdsourcing Computational Quality Control Toolkit for Python

HCOMP 2021

VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions

VLDB 2021

Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform

KDD 2020

Text Recognition Using Anonymous CAPTCHA Answers

WSDM 2020

Conferences and events

We regularly hold tutorials and lead workshops at some of the biggest AI conferences around the globe.

Blog

Applied ML at Toloka

We use ML technologies to enhance data production for better data quality, faster data collection, and lower costs.

AI copilots

In-task tools help experts focus on quality with accurate fact checks, grammar checks, suggestions, and more.

Antifraud algorithms

Fraud prevention is built into every data pipeline from start to finish to guarantee authentic human effort and expertise.

Matching algorithms

The task distribution system matches tasks to the best-qualified annotators and experts.

Automated metrics

Our data quality metrics correlate with model performance gains, giving you confidence in your training data.

Open job positions:

Research Fellowship Program

Let's collaborate!
Our Research team would love to hear from you

Get in touch
