U-MATH & μ-MATH
Assessing LLMs on university-level math
Why U-MATH?
U-MATH is the largest university-level math benchmark, designed to address the gap in existing math benchmarks, which are limited in scope and size.
1,100 problems from math courses across 6 key subject areas
20% of tasks include images to challenge LLMs
Practical applications for industry and education
What is µ-MATH?
The Meta-Evaluation Benchmark is a set of 1084 meta-evaluation solutions designed to rigorously assess the quality of LLM judges, based on U-MATH problems.
Together, these benchmarks support comprehensive evaluation of LLM proficiency on university-level math.
LLM leaderboard on U-MATH
Performance on U-MATH (chart; model families shown: Gemini, Llama, Qwen2, Claude, OpenAI, Mistral AI, DeepSeek)
Both the U-MATH and μ-MATH datasets were collected with the help of Gradarius, a learning platform that helps students master calculus through a step-by-step approach, providing immediate feedback to guide them through problem-solving.
Insights into LLM performance on U-MATH
University-level problems pose a challenge to standard-inference models, but reasoning systems boast breakthrough performance.
Integrating vision proves tough: U-MATHv scores lag significantly behind text-only performance, and adding visual capabilities to a model typically leads to degradation.
Continuous training pushes the models forward: the Athene fine-tune of Qwen 2.5 and Nemotron fine-tune of Llama 3.1 deliver improvements across the board.
Open-weight models are rapidly closing the gap on text-only tasks, but continue to lag on multi-modal problems.
Gemini models are most adept at visual reasoning, consistently leading in U-MATHv score within all the model groups.
Specialization trumps size: domain-specific models such as Qwen Math beat models an order of magnitude larger.
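| Group | Model | U-MATH Full (1100) | U-MATH T (900) | U-MATH V (200) | U-MATH THard (600) | Algebra T (150) | Algebra V (30) | Diff. Calc T (150) | Diff. Calc V (70) | Integr. Calc T (150) | Integr. Calc V (58) | Multiv. Calc T (150) | Multiv. Calc V (28) | Precalculus T (150) | Precalculus V* (10) | Seq. & Series T (150) | Seq. & Series V* (4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Text-only, small | Ministral 8B 2410 | 23.1 | 26.9 | 6.0 | 13.5 | 60.0 | 6.7 | 13.3 | 8.6 | 10.0 | 5.2 | 12.7 | 3.6 | 47.3 | 0.0 | 18.0 | 0.0 |
| Text-only, small | Llama-3.1 8B | 29.5 | 33.7 | 11.0 | 22.8 | 60.0 | 3.3 | 17.3 | 10.0 | 22.7 | 19.0 | 23.3 | 3.6 | 50.7 | 20.0 | 28.0 | 0.0 |
| Text-only, small | Qwen2.5 7B | 43.3 | 50.4 | 11.0 | 34.5 | 86.0 | 20.0 | 30.7 | 4.3 | 32.0 | 19.0 | 36.7 | 3.6 | 78.7 | 10.0 | 38.7 | 0.0 |
| Text-only, small | Qwen2.5-Math 7B | 45.5 | 53.0 | 11.5 | 38.0 | 84.7 | 6.7 | 32.0 | 8.6 | 24.0 | 17.2 | 44.0 | 10.7 | 81.3 | 0.0 | 52.0 | 50.0 |
| Text-only, medium | Mistral Small 2501 (24B) | 34.8 | 39.9 | 12.0 | 22.0 | 80.7 | 13.3 | 13.3 | 10.0 | 13.3 | 15.5 | 25.3 | 14.3 | 70.7 | 0.0 | 36.0 | 0.0 |
| Text-only, medium | Qwen2.5 32B | 52.4 | 60.4 | 16.0 | 46.3 | 92.7 | 13.3 | 42.7 | 11.4 | 34.7 | 25.9 | 50.0 | 17.9 | 85.3 | 0.0 | 58.0 | 0.0 |
| Text-only, large | Llama-3.1 70B | 35.2 | 40.4 | 11.5 | 23.8 | 79.3 | 3.3 | 17.3 | 17.1 | 16.0 | 10.3 | 26.7 | 7.1 | 68.0 | 0.0 | 35.3 | 50.0 |
| Text-only, large | Llama-3.1 Nemotron 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Text-only, large | Llama 3.3 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Text-only, large | Mistral Large 2411 (123B) | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Text-only, large | Qwen2.5 72B | 51.2 | 58.9 | 16.5 | 44.7 | 90.7 | 16.7 | 36.7 | 15.7 | 35.3 | 17.2 | 52.0 | 14.3 | 84.0 | 10.0 | 54.7 | 50.0 |
| Text-only, large | Athene-V2 Chat (72B) | 54.9 | 62.9 | 19.0 | 49.8 | 87.3 | 10.0 | 43.3 | 22.9 | 36.7 | 17.2 | 62.0 | 21.4 | 90.7 | 0.0 | 57.3 | 75.0 |
| Text-only, large | Qwen2.5-Math 72B | 59.5 | 68.7 | 18.0 | 57.0 | 94.7 | 6.7 | 46.0 | 12.9 | 44.0 | 25.9 | 69.3 | 21.4 | 89.3 | 10.0 | 68.7 | 75.0 |
| Text-only, large | DeepSeek-V3 (MoE 37/671B) | 62.6 | 69.3 | 32.5 | 57.5 | 96.0 | 10.0 | 49.3 | 30.0 | 38.7 | 39.7 | 69.3 | 42.9 | 90.0 | 40.0 | 72.7 | 50.0 |
| Multimodal, small | Pixtral 12B | 17.5 | 17.9 | 16.0 | 8.8 | 40.0 | 23.3 | 10.7 | 30.0 | 4.7 | 3.4 | 6.7 | 7.1 | 32.0 | 0.0 | 13.3 | 0.0 |
| Multimodal, small | Llama-3.2 11B Vision | 20.4 | 22.9 | 9.0 | 10.3 | 52.0 | 3.3 | 7.3 | 20.0 | 1.3 | 3.4 | 13.3 | 0.0 | 44.0 | 10.0 | 19.3 | 0.0 |
| Multimodal, small | Qwen2-VL 7B | 26.3 | 27.1 | 22.5 | 15.3 | 58.7 | 10.0 | 18.7 | 37.1 | 11.3 | 17.2 | 14.0 | 17.9 | 42.7 | 0.0 | 17.3 | 0.0 |
| Multimodal, large | Llama-3.2 90B Vision | 37.2 | 41.8 | 16.5 | 24.7 | 82.0 | 23.3 | 21.3 | 27.1 | 11.3 | 5.2 | 30.0 | 10.7 | 70.0 | 0.0 | 36.0 | 25.0 |
| Multimodal, large | Qwen2-VL 72B | 41.8 | 43.9 | 32.5 | 29.3 | 80.0 | 26.7 | 29.3 | 44.3 | 22.0 | 27.6 | 32.0 | 28.6 | 66.0 | 10.0 | 34.0 | 25.0 |
| Multimodal, large | Pixtral Large 2411 (124B) | 47.8 | 51.4 | 31.5 | 38.2 | 82.7 | 33.3 | 30.0 | 32.9 | 24.7 | 32.8 | 46.7 | 28.6 | 73.3 | 30.0 | 51.3 | 0.0 |
| Multimodal, proprietary | Claude Sonnet 3.5 (new) | 38.7 | 40.7 | 30.0 | 26.2 | 75.3 | 30.0 | 20.7 | 41.4 | 12.0 | 15.5 | 33.3 | 39.3 | 64.0 | 20.0 | 38.7 | 0.0 |
| Multimodal, proprietary | GPT-4o-mini | 43.4 | 47.2 | 26.0 | 30.0 | 87.3 | 13.3 | 26.0 | 32.9 | 16.7 | 17.2 | 37.3 | 39.3 | 76.0 | 20.0 | 40.0 | 50.0 |
| Multimodal, proprietary | GPT-4o | 50.2 | 53.9 | 33.5 | 38.3 | 90.0 | 33.3 | 30.0 | 37.1 | 27.3 | 27.6 | 49.3 | 42.9 | 80.0 | 30.0 | 46.7 | 0.0 |
| Multimodal, proprietary | Gemini 1.5 Flash | 57.8 | 61.2 | 42.5 | 48.5 | 90.7 | 46.7 | 47.3 | 47.1 | 30.7 | 31.0 | 55.3 | 53.6 | 82.7 | 30.0 | 60.7 | 50.0 |
| Multimodal, proprietary | Gemini 1.5 Pro | 67.2 | 71.7 | 47.0 | 62.0 | 92.0 | 60.0 | 62.0 | 50.0 | 47.3 | 27.6 | 65.3 | 60.7 | 90.0 | 50.0 | 73.3 | 75.0 |
| Reasoning, open | QVQ-72B-Preview | 65.0 | 69.7 | 44.0 | 57.2 | 94.0 | 33.3 | 54.0 | 41.4 | 41.3 | 55.2 | 65.3 | 50.0 | 95.3 | 30.0 | 68.0 | 0.0 |
| Reasoning, open | QwQ-32B-Preview | 73.1 | 82.7 | 30.0 | 75.8 | 95.3 | 3.3 | 70.0 | 24.3 | 67.3 | 50.0 | 80.7 | 32.1 | 97.3 | 20.0 | 85.3 | 50.0 |
| Reasoning, open | DeepSeek-R1 (MoE 37/671B) | 80.7 | 91.3 | 33.0 | 88.2 | 96.7 | 16.7 | 85.3 | 22.9 | 87.3 | 50.0 | 86.7 | 42.9 | 98.3 | 10.0 | 93.3 | 75.0 |
| Reasoning, proprietary | o1-mini | 76.3 | 82.9 | 46.5 | 75.8 | 97.3 | 40.0 | 75.3 | 52.9 | 72.0 | 46.6 | 78.7 | 42.9 | 96.7 | 30.0 | 77.3 | 50.0 |
| Reasoning, proprietary | Gemini 2.0 Flash Thinking | 83.2 | 89.2 | 58.5 | 86.2 | 95.3 | 60.0 | 80.7 | 48.6 | 88.7 | 65.5 | 85.3 | 75.0 | 95.3 | 50.0 | 90.0 | 25.0 |
| Reasoning, proprietary | o3-mini | 82.2 | 92.8 | 34.5 | 89.5 | 99.3 | 10.0 | 88.0 | 17.1 | 90.7 | 60.3 | 85.3 | 50.0 | 99.3 | 20.0 | 94.0 | 75.0 |
| Reasoning, proprietary | o1 | 86.8 | 93.1 | 58.5 | 90.5 | 97.3 | 50.0 | 86.0 | 57.1 | 90.7 | 63.8 | 92.0 | 60.7 | 99.3 | 50.0 | 93.3 | 75.0 |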
Accuracy scores (%) on our U-MATH benchmark and its constituent subject splits. For each category (overall and subject-specific), two numbers are provided: one for text-only (T) and one for visual (V) problems. Additionally, an overall score across all the subjects excluding Algebra and Precalculus is shown under U-MATH THard. Asterisks denote small sample sizes (<15). Bold indicates the best result within each group.
All the models are used with their latest versions available as of 2025-03-15. Greedy decoding is employed for all the models (except for OpenAI o-series), with ablations performed to ensure no performance degradation occurs compared to sampling-based inference. Images are not included in the prompt for text-only models, only the problem statements. Free-form solutions are verified against golden labels by an ensemble of reasoning models.
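The aggregate columns appear consistent with problem-count-weighted averages of the per-subject scores. The sketch below illustrates that reading using the Ministral 8B 2410 row; the weighting scheme is an assumption inferred from the reported numbers rather than a documented formula, and the function names are illustrative.

```python
# Text-only problem counts per subject, taken from the table header.
TEXT_COUNTS = {
    "Algebra": 150,
    "Diff. Calc": 150,
    "Integr. Calc": 150,
    "Multiv. Calc": 150,
    "Precalculus": 150,
    "Seq. & Series": 150,
}
# THard excludes Algebra and Precalculus, per the table caption.
HARD_SUBJECTS = ("Diff. Calc", "Integr. Calc", "Multiv. Calc", "Seq. & Series")


def weighted_accuracy(scores_and_counts):
    """Problem-count-weighted mean of accuracy percentages."""
    total = sum(count for _, count in scores_and_counts)
    return sum(score * count for score, count in scores_and_counts) / total


def text_aggregates(text_scores):
    """Return (T, THard) from per-subject text-only accuracies in percent."""
    t_all = weighted_accuracy([(text_scores[s], TEXT_COUNTS[s]) for s in TEXT_COUNTS])
    t_hard = weighted_accuracy([(text_scores[s], TEXT_COUNTS[s]) for s in HARD_SUBJECTS])
    return t_all, t_hard


# Per-subject text-only scores for Ministral 8B 2410, copied from the table.
ministral_text = {
    "Algebra": 60.0,
    "Diff. Calc": 13.3,
    "Integr. Calc": 10.0,
    "Multiv. Calc": 12.7,
    "Precalculus": 47.3,
    "Seq. & Series": 18.0,
}
t_all, t_hard = text_aggregates(ministral_text)
# Full score: combine text-only (900 problems) and visual (200 problems, 6.0%).
full = weighted_accuracy([(t_all, 900), (6.0, 200)])
print(round(t_all, 1), round(t_hard, 1), round(full, 1))
```

Under this assumption the recomputed values round to the table's 26.9 (T), 13.5 (THard), and 23.1 (Full) for this model.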
What makes U-MATH stand out
Challenging problems to test deep understanding and advanced reasoning skills.
Covers 6 subjects: Algebra, Precalculus, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series
Problems and solutions are sourced from real coursework and checked by experts to meet academic standards.
| Math Subject | Textual # | Visual # |
|---|---|---|
| Algebra | 150 | 30 |
| Differential Calculus | 150 | 70 |
| Integral Calculus | 150 | 58 |
| Multivariable Calculus | 150 | 28 |
| Precalculus | 150 | 10 |
| Sequences and Series | 150 | 4 |
| All | 900 | 200 |
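| Dataset | Uni. Level % | #Test | %Visual | % Free Form Answer |
|---|---|---|---|---|
| MMLUMath | 0 | 1.3k | 0 | 0 |
| GSM8k | 0 | 1k | 0 | 0 |
| MATH | 0 | 5k | 0 | 100 |
| MiniF2F | 0 | 244 | 0 | 100 |
| OCWCourses | 100 | 272 | 0 | 100 |
| ProofNet | ? | 371 | 0 | 100 |
| CHAMP | 0 | 270 | 0 | 100 |
| MathOdyssey | 26 | 387 | 0 | 100 |
| MMMUMath | 0 | 505 | 100 | 0 |
| MathVista | 0 | 5k | 100 | 46 |
| MATH-V | 0 | 3k | 100 | 50 |
| We-Math | 20 | 1.7k | 100 | 0 |
| MathVerse | 0 | 4.7k | 83.3 | 45 |
| U-MATH (Toloka) | 100 | 1.1k | 20 | 100 |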
Test your LLM's math capabilities with U-MATH
U-MATH Dataset Collection Process
To identify the most challenging problems among the tens of thousands of examples available for our benchmark, we used a multi-stage selection process:
Filter out unsuitable problems
We exclude easy problems, those requiring extensive calculations, and multiple-choice problems.
Test LLMs
We solve the selected problems using a number of popular small language models (SLMs).
Analyze results
We choose the most challenging problems in each subject area.
Expert validation
The final set of problems is verified by domain experts from Stevens Institute of Technology.
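The automated stages of this process (filtering, testing with small models, and picking the hardest problems per subject) can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Candidate` fields, the `solve_rate` difficulty signal, and the `per_subject` cutoff are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    problem_id: str
    subject: str
    is_easy: bool
    is_multiple_choice: bool
    needs_heavy_computation: bool
    # Hypothetical difficulty signal: share of small models that solved it.
    solve_rate: float


def filter_unsuitable(candidates):
    """Drop easy, multiple-choice, and calculation-heavy problems."""
    return [
        c for c in candidates
        if not (c.is_easy or c.is_multiple_choice or c.needs_heavy_computation)
    ]


def hardest_per_subject(candidates, per_subject):
    """Keep the problems with the lowest solve rate in each subject."""
    by_subject = {}
    for c in candidates:
        by_subject.setdefault(c.subject, []).append(c)
    return {
        subject: sorted(items, key=lambda c: c.solve_rate)[:per_subject]
        for subject, items in by_subject.items()
    }


pool = [
    Candidate("a1", "Algebra", False, False, False, 0.75),
    Candidate("a2", "Algebra", False, False, False, 0.25),
    Candidate("a3", "Algebra", True, False, False, 1.0),
    Candidate("s1", "Seq. & Series", False, True, False, 0.0),
    Candidate("s2", "Seq. & Series", False, False, False, 0.5),
]
shortlist = hardest_per_subject(filter_unsuitable(pool), per_subject=1)
for subject, items in shortlist.items():
    print(subject, [c.problem_id for c in items])
```

The shortlist produced this way would then go to human experts for the final validation step.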
μ-MATH Meta-Evaluation Benchmark insights
1. Problem-solving performance ≠ judgment performance. A tradeoff between these skills emerges in non-reasoners.
2. The tradeoff yields distinctive judgment styles: proprietary models are more conservative, minimizing false positives, while Qwens tend to be more lenient.
3. Reasoning models push through the Pareto frontier, typically inducing substantial imbalances. o1 pushes even further and exhibits a far more balanced performance.
4. Judging is a non-trivial skill. The maximum attainable performance is imperfect, even with reasoners, so we need to account for errors when using auto-evaluation.
5. A balanced mix of training data leads to well-rounded performance, as evidenced by Qwen2.5 and Qwen2.5-Math.
6. Reducing the model size, on the contrary, exacerbates bias. Balanced performance requires a model capable enough to properly generalize over the training mixture.
Which LLMs excel at judging math solutions?
U-MATH shows accuracy in solving math problems; μ-MATH reflects accuracy in judging solutions.
(Chart: U-MATH TextHard accuracy and μ-MATH F1-score per model)
μ-MATH Dataset Collection Process
Robust test for LLM judges
Judgment errors and biases in evaluations are often overlooked, creating uncertainty and unreliability. Meta-evaluations are essential to identify, quantify, and address these issues, yet they remain scarce, especially in math.
Dataset construction
Hand-picked ~25% of U-MATH problems (271 in total), selected for their assessment complexity and overall representativeness of university-level math problems
Generated 4 solutions for each, using Qwen2.5 72B, Llama-3.1 70B, GPT-4o and Gemini 1.5 Pro, resulting in 1084 problem-solution pairs
Supplied each pair with a correct judgment verdict via a combination of labeling by Toloka's math experts and the Gradarius formal autoverification API
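As a quick illustration of the construction arithmetic, the sketch below pairs each of the 271 selected problems with one solution slot per generator model, which yields the 1084 pairs mentioned above. The record layout (`problem_id`, `generator`, `solution`, `verdict`) is a hypothetical schema for illustration, not the dataset's actual format.

```python
from itertools import product

# The four solution generators named in the construction steps.
GENERATORS = ["Qwen2.5 72B", "Llama-3.1 70B", "GPT-4o", "Gemini 1.5 Pro"]


def build_pairs(problem_ids):
    """Create one empty problem-solution record per (problem, generator)."""
    return [
        {
            "problem_id": pid,
            "generator": generator,
            "solution": None,  # to be filled with the generated solution text
            "verdict": None,   # to be filled with the gold correct/incorrect label
        }
        for pid, generator in product(problem_ids, GENERATORS)
    ]


pairs = build_pairs(range(271))
print(len(pairs))  # 271 problems x 4 generators
```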
Testing and metrics
During testing, a model is provided with a problem statement, a reference answer, and a solution to evaluate, and tasked to determine whether the solution is correct or not. We treat this as a binary classification task.
Primary metric:
Macro-averaged F1 score, so as to minimize the effect of class imbalance
Fine-grained metrics:
Positive Predictive Value (PPV, or Precision) and True Positive Rate (TPR, or Recall) for the positive class
Negative Predictive Value (NPV) and True Negative Rate (TNR) for the negative class
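The metrics listed above can be computed directly from the confusion counts of a judge's verdicts. The following is a minimal sketch in plain Python, assuming verdicts and gold labels are given as booleans (`True` meaning "the solution is correct"); the function name and the zero-division convention are choices made for this example.

```python
def judge_metrics(labels, verdicts):
    """Compute PPV, TPR, NPV, TNR and macro-averaged F1 for binary verdicts."""
    pairs = list(zip(labels, verdicts))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)

    def ratio(num, den):
        # Convention for this sketch: an undefined ratio counts as 0.
        return num / den if den else 0.0

    ppv = ratio(tp, tp + fp)  # precision of the positive class
    tpr = ratio(tp, tp + fn)  # recall of the positive class
    npv = ratio(tn, tn + fn)  # precision of the negative class
    tnr = ratio(tn, tn + fp)  # recall of the negative class
    f1_pos = ratio(2 * ppv * tpr, ppv + tpr)
    f1_neg = ratio(2 * npv * tnr, npv + tnr)
    return {
        "PPV": ppv,
        "TPR": tpr,
        "NPV": npv,
        "TNR": tnr,
        "macro_F1": (f1_pos + f1_neg) / 2,
    }


gold = [True, True, True, False, False, False, False, True]
judged = [True, True, False, False, False, True, False, True]
metrics = judge_metrics(gold, judged)
print(metrics)
```

Macro-averaging gives the positive and negative classes equal weight regardless of how many examples each contains, which is why it is less sensitive to class imbalance than plain accuracy.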
Toloka’s U-MATH dataset is tailored for working with advanced mathematical problems. It provides structured data to simplify problem-solving processes and analyze complex mathematical ideas, supporting a wide range of applications in academic research, education, and AI training with an emphasis on measuring mathematical problem-solving skills for both models and students.
Here are answers to some common questions about math datasets in general and Toloka's U-MATH dataset in particular.
What is a mathematics dataset?
A mathematical dataset is a structured collection of problems, solutions, and often additional metadata, ranging from elementary exercises to advanced, multi-step challenges. Such datasets support training and evaluation of AI models and can also be used in educational settings to teach students or assess their proficiency. Typically, mathematical datasets are organized by topic (such as algebra, calculus, trigonometry, or geometry) and vary in complexity to match different learning or testing requirements. Advanced mathematical benchmarks, such as U-MATH by Toloka, take things a step further by focusing on more difficult material such as university-level content, or by introducing additional challenges such as visual elements, allowing for evaluation of higher-level problem-solving abilities.
What types of problems are included in mathematics datasets?
Mathematics datasets usually cover a wide range of problem types across various topics and difficulty levels. These may include single-step exercises, proofs, and multi-step challenges that require advanced logic.
Some datasets span broad fields like algebra, calculus, and geometry, while others focus narrowly on areas such as number theory, differential equations, combinatorics, or linear algebra.
Additionally, problems in a dataset can be designed to test different cognitive skills, from straightforward calculations to open-ended logical reasoning. This variety makes it possible to teach models multiple mathematical skills while measuring mathematical problem-solving capabilities in depth.
Unlike many datasets that deal with either elementary to intermediate-level math or highly specific contest-style challenges, Toloka’s collection is specifically dedicated to problems aligned with university curricula. It features the most challenging problems drawn from real coursework and academic research, ensuring both complexity and practicality and making them relevant to real-world applications.
Our U-MATH dataset covers a broad spectrum of problem types across six key math subjects:
Precalculus
Algebra
Differential Calculus
Integral Calculus
Multivariable Calculus
Sequences & Series
Approximately 20% of the problems include visual elements, demanding a deep comprehension of both mathematical imagery and textual reasoning.
Each problem in Toloka’s math dataset is paired with detailed, academically crafted solutions and explanations verified by academic scholars.
Do mathematics datasets vary by difficulty level?
Yes, math datasets can significantly vary in difficulty, ranging from elementary mathematics exercises to highly complex, multi-step proofs. For example, entry-level datasets often focus on foundational skills like simple algebraic equations or geometry concepts.
Intermediate datasets usually include more complicated topics like calculus problems, trigonometric identities, or statistical analyses. These types of school-level collections form the majority of thematic math datasets available.
Higher-level datasets are designed for complex reasoning tasks in fields like differential equations, abstract algebra, and advanced calculus. Their creation involves processing data derived from academic research and requires rigorous verification and additional quality control, measuring mathematical problem-solving at advanced levels.
Many available datasets focus on school-level math problems, which are sufficient for basic tasks but limit their use for more complex applications. Toloka’s U-MATH dataset was developed to bridge this gap, ensuring that learning models are evaluated for the multi-step reasoning required in higher education or research.
The 1,100 problems in Toloka’s math dataset were selected for their complexity. Each requires deep understanding and sophisticated problem-solving abilities. We excluded simpler problem types, such as short-answer questions and multiple-choice formats, ensuring that the dataset objectively assesses university-level math skills.
Where can I download mathematics datasets for free?
If you're looking for a university-level mathematics benchmark dataset, U-MATH is a must-have. It provides high-quality, complex problems drawn from real academic coursework and research.
However, a general-purpose math dataset can be sufficient for simpler tasks. You can search for them in online repositories that offer free access, such as:
Kaggle: A popular platform with a variety of datasets, including those for machine learning and AI training, some of which cover topics like algebra, calculus, and statistics.
Mathematical Data Sets (MATLAB Central): A collection of data sets provided by users, often containing problems from different areas of mathematics.
UCI Machine Learning Repository: While primarily focusing on machine learning data, it also hosts mathematical and statistical analysis datasets.
OpenML: A platform for sharing datasets and machine learning experiments, which includes math-related data for modeling and analysis.
For more specialized, high-level datasets tailored to sophisticated mathematical reasoning tasks, platforms like Toloka can provide relevant, advanced problems.
What LLMs perform best on university-level math problems?
Recent evaluations indicate that several large language models (LLMs) perform strongly on university-level math problems, with results varying by parameter count and by specialization.
For instance, Gemini has demonstrated superior capabilities in solving complex mathematical problems, particularly in the visual domain, such as interpreting graphs or geometric diagrams. Its advantage over GPT-4o is most noticeable on tasks where visual reasoning is critical.
Qwen2.5-Math 72B has also shown impressive results in mathematical reasoning, especially on textual tasks. It outperforms GPT-4o in accuracy on text-only problems, particularly those involving word problems or symbolic reasoning.
However, despite these advancements, even the best-performing large language models still encounter challenges when dealing with tasks that combine images and text, such as problems involving detailed visual data (e.g., graphs or complex diagrams). This suggests that while LLMs have made substantial progress in handling university-level math, there is still considerable room for improvement, particularly in multi-modal tasks where integration of visual and textual information is crucial.
How are problems for U-MATH selected?
U-MATH includes 1,100 university-level problems, 271 of which also form the basis of the μ-MATH meta-evaluation benchmark. Each problem is rigorously selected based on its alignment with real-world academic coursework and advanced research challenges.
The selection process focuses on complexity. Simpler problems, such as those with very short solutions or allowing calculator use, were filtered out and removed from the set. The rest passed through a few small-scale large language models to check performance, with only the most challenging examples reserved for inclusion in the dataset.
Finally, experts from the Stevens Institute of Technology reviewed and evaluated all automatically selected problems.
Can Toloka's math dataset be used for AI model evaluation?
Yes, Toloka's U-MATH dataset is ideal for evaluating AI models on complex mathematical problem-solving. It serves as a high-quality benchmark for assessing reasoning, multi-step solution capabilities, and accuracy in handling university-level math challenges.
We develop larger, customized datasets for clients interested in fine-tuning. Please contact us to discuss a dataset that supports your training goals.
What is the role of visual elements in Toloka's math dataset?
Visual elements are crucial in many mathematical problems, especially in fields like geometry, calculus, and data analysis. About 20% of the problems in Toloka’s U-MATH dataset incorporate visual elements, requiring models to interpret and analyze graphs, charts, and geometric figures.
The visual component adds a layer of complexity, as solving such problems involves both textual reasoning and visual interpretation. These visual elements are essential for testing how a model handles real-world scenarios, where problems often involve both images and text.