New Multimodal Benchmark Reveals Best LLMs for University-Level Math Challenges
December 5, 2024 — Today, Toloka AI and Gradarius announced the launch of U-MATH, a new multimodal benchmark designed specifically to evaluate the capabilities of large language models (LLMs) on complex university-level math problems. The benchmark stands out for its size and complexity, including a visual reasoning component, and provides a comprehensive evaluation framework for developers, researchers, and educators working with generative AI. The team also released μ-MATH, an accompanying meta-benchmark that measures how well models perform in the role of judge.
The benchmark reveals surprising outcomes, with emerging competitors like Gemini outperforming established models such as GPT-4o at solving university-level math problems. Gemini's edge is particularly pronounced on visual tasks, demonstrating its versatility. Moreover, even smaller models, such as the Chinese-developed 7B Qwen2.5, show remarkable capabilities, narrowing the performance gap on text-only tasks. Notably, Qwen2.5-Math-72B significantly surpasses GPT-4o, leveraging its strengths in the text domain.
U-MATH is the largest university-level, multimodal math benchmark available, delivering a robust test dataset for assessing LLM performance on intricate university-level prompts. It includes a wide array of prompts and reference solutions, ensuring a rigorous evaluation of models' reasoning abilities and problem-solving skills.
Key Features of the Benchmark:
Unmatched Size: With more than 1,000 complex prompts, U-MATH dwarfs other available assessments, providing a rich dataset for rigorous testing.
Visual Component: 20% of the problems involve visual reasoning, testing models' ability to comprehend and reason about mathematical charts, schematics, and similar visual representations.
Complexity Level: Unlike existing benchmarks such as GSM8K and MATH, which focus on school-level problems, our benchmark targets university-level challenges, pushing the boundaries of what LLMs can achieve.
Meta-Benchmarking Framework: Alongside U-MATH, we offer μ-MATH, a meta-benchmark for assessing models’ ability to judge math solutions, enabling developers to thoroughly evaluate their systems.
The U-MATH benchmark provides essential insights for developers, students, and educators alike. As demand for advanced AI in academic settings grows, the benchmark serves as a vital tool for ensuring that models can meet the rigorous standards of higher education as well as real-world industrial applications. LLM developers can use it to evaluate their models' ability to solve a wide range of university-level math problems and to perform step-by-step reasoning, for example with an evaluation loop like the sketch below.
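For illustration only, here is a minimal sketch of such an evaluation loop, assuming the benchmark is distributed through the Hugging Face Hub. The dataset identifier, split name, field names, and the query_model helper are hypothetical placeholders rather than the official interface; consult the released dataset card for the actual schema.

```python
# A minimal evaluation-loop sketch, assuming the benchmark is published on the
# Hugging Face Hub. The dataset identifier "toloka/u-math", the split name, and
# the field names "problem_statement" / "golden_answer" are illustrative
# assumptions, not confirmed by this announcement.
from datasets import load_dataset

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for your own LLM inference call.
    return "model answer goes here"

umath = load_dataset("toloka/u-math", split="test")  # assumed identifier and split

records = []
for item in umath:
    prompt = item["problem_statement"]        # assumed field name
    answer = query_model(prompt)
    records.append({
        "prompt": prompt,
        "model_answer": answer,
        "reference": item["golden_answer"],   # assumed field name
    })

# Free-form answers are usually graded by an LLM judge, which is precisely the
# step the accompanying μ-MATH meta-benchmark is designed to stress-test.
print(f"Collected {len(records)} answers for judging.")
```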
“We are confident this comprehensive benchmark will elevate the standards of LLM performance evaluation,” said Olga Megorskaya, CEO at Toloka AI. “Our benchmarking approach can be applied to any niche topic, ensuring high-quality performance and accountability among LLM developers and fostering trust and integrity in AI technologies.”
“We’ve all lived through the initial wave of AI excitement and are now beginning to understand its true nature—not as a magic pill, but as a powerful tool with specific strengths and limitations,” said Prof. Alexei Miasnikov, Co-Founder and Scientific Advisor at Gradarius. “This benchmark underscores those nuances in the context of academic mathematics education. Its multimodal approach is especially valuable, and the inclusion of a meta-benchmark highlights that even LLMs evaluating LLMs remains an unresolved challenge.”
The launch of this benchmark comes at a time when reliable assessment tools for generative AI are more important than ever. Toloka AI and Gradarius invite LLM developers, researchers, and educators to engage with the benchmark and explore its capabilities.
For more information about the new benchmarks and to access the open-source datasets, click here.
About Toloka
Toloka AI is a trusted data partner for all stages of AI development, from training to evaluation. With over a decade of experience, Toloka empowers businesses to build high-quality, safe, and responsible AI systems through a unique methodology that combines machine learning technology with human expertise, delivering the highest quality and scalability in the market.
About Gradarius
Gradarius is an advanced educational platform dedicated to transforming math education through interactive, step-by-step guidance. Known for its unique Math Engine that ensures precise math comprehension and immediate feedback, Gradarius reduces teacher workload while boosting student performance and engagement. With a foundation in higher education and extensive experience across U.S. high schools, Gradarius empowers educators and students alike, helping schools deliver rigorous math education and foster STEM proficiency at scale.
Contact
Carolina Escobar
PR Manager, Toloka AI
carolinaesco@toloka.ai
Article written by:
Toloka Team
Updated:
Dec 5, 2024