U-MATH & μ-MATH
Assessing LLMs on university-level math
Why U-MATH?
U-MATH is the largest university-level math benchmark, designed to address the gap in existing math benchmarks, which are limited in scope and size.
What is µ-MATH?
µ-MATH is a meta-evaluation benchmark: a set of 1,084 solutions to U-MATH problems, designed to rigorously assess the quality of LLM judges.
Together, these benchmarks support comprehensive evaluation of LLM proficiency on university-level math.
LLM leaderboard on U-MATH
Learn More
Both the U-MATH and μ-MATH datasets were collected with the help of Gradarius, a learning platform that helps students master calculus through a step-by-step approach, providing immediate feedback to guide them through problem solving.
Insights into LLM performance on U-MATH
| Model | U-MATH Full (1100) | U-MATH T (900) | U-MATH V (200) | U-MATH THard (600) | Algebra T (150) | Algebra V (30) | Diff. Calc T (150) | Diff. Calc V (70) | Integr. Calc T (150) | Integr. Calc V (58) | Multiv. Calc T (150) | Multiv. Calc V (28) | Precalculus T (150) | Precalculus V* (10) | Seq. & Series T (150) | Seq. & Series V* (4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Text-only models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| *Small* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Ministral 8B 2410 | 23.1 | 26.9 | 6.0 | 13.5 | 60.0 | 6.7 | 13.3 | 8.6 | 10.0 | 5.2 | 12.7 | 3.6 | 47.3 | 0.0 | 18.0 | 0.0 |
| Llama-3.1 8B | 29.5 | 33.7 | 11.0 | 22.8 | 60.0 | 3.3 | 17.3 | 10.0 | 22.7 | 19.0 | 23.3 | 3.6 | 50.7 | 20.0 | 28.0 | 0.0 |
| Qwen2.5 7B | 43.3 | 50.4 | 11.0 | 34.5 | 86.0 | 20.0 | 30.7 | 4.3 | 32.0 | 19.0 | 36.7 | 3.6 | 78.7 | 10.0 | 38.7 | 0.0 |
| Qwen2.5-Math 7B | 45.5 | 53.0 | 11.5 | 38.0 | 84.7 | 6.7 | 32.0 | 8.6 | 24.0 | 17.2 | 44.0 | 10.7 | 81.3 | 0.0 | 52.0 | 50.0 |
| *Medium* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mistral Small 2501 (24B) | 34.8 | 39.9 | 12.0 | 22.0 | 80.7 | 13.3 | 13.3 | 10.0 | 13.3 | 15.5 | 25.3 | 14.3 | 70.7 | 0.0 | 36.0 | 0.0 |
| Qwen2.5 32B | 52.4 | 60.4 | 16.0 | 46.3 | 92.7 | 13.3 | 42.7 | 11.4 | 34.7 | 25.9 | 50.0 | 17.9 | 85.3 | 0.0 | 58.0 | 0.0 |
| *Large* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llama-3.1 70B | 35.2 | 40.4 | 11.5 | 23.8 | 79.3 | 3.3 | 17.3 | 17.1 | 16.0 | 10.3 | 26.7 | 7.1 | 68.0 | 0.0 | 35.3 | 50.0 |
| Llama-3.1 Nemotron 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Llama 3.3 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Mistral Large 2411 (123B) | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Qwen2.5 72B | 51.2 | 58.9 | 16.5 | 44.7 | 90.7 | 16.7 | 36.7 | 15.7 | 35.3 | 17.2 | 52.0 | 14.3 | 84.0 | 10.0 | 54.7 | 50.0 |
| Athene-V2 Chat (72B) | 54.9 | 62.9 | 19.0 | 49.8 | 87.3 | 10.0 | 43.3 | 22.9 | 36.7 | 17.2 | 62.0 | 21.4 | 90.7 | 0.0 | 57.3 | 75.0 |
| Qwen2.5-Math 72B | 59.5 | 68.7 | 18.0 | 57.0 | 94.7 | 6.7 | 46.0 | 12.9 | 44.0 | 25.9 | 69.3 | 21.4 | 89.3 | 10.0 | 68.7 | 75.0 |
| DeepSeek-V3 (MoE 37/671B) | 62.6 | 69.3 | 32.5 | 57.5 | 96.0 | 10.0 | 49.3 | 30.0 | 38.7 | 39.7 | 69.3 | 42.9 | 90.0 | 40.0 | 72.7 | 50.0 |
| **Multimodal models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| *Small* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Pixtral 12B | 17.5 | 17.9 | 16.0 | 8.8 | 40.0 | 23.3 | 10.7 | 30.0 | 4.7 | 3.4 | 6.7 | 7.1 | 32.0 | 0.0 | 13.3 | 0.0 |
| Llama-3.2 11B Vision | 20.4 | 22.9 | 9.0 | 10.3 | 52.0 | 3.3 | 7.3 | 20.0 | 1.3 | 3.4 | 13.3 | 0.0 | 44.0 | 10.0 | 19.3 | 0.0 |
| Qwen2-VL 7B | 26.3 | 27.1 | 22.5 | 15.3 | 58.7 | 10.0 | 18.7 | 37.1 | 11.3 | 17.2 | 14.0 | 17.9 | 42.7 | 0.0 | 17.3 | 0.0 |
| *Large* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llama-3.2 90B Vision | 37.2 | 41.8 | 16.5 | 24.7 | 82.0 | 23.3 | 21.3 | 27.1 | 11.3 | 5.2 | 30.0 | 10.7 | 70.0 | 0.0 | 36.0 | 25.0 |
| Qwen2-VL 72B | 41.8 | 43.9 | 32.5 | 29.3 | 80.0 | 26.7 | 29.3 | 44.3 | 22.0 | 27.6 | 32.0 | 28.6 | 66.0 | 10.0 | 34.0 | 25.0 |
| Pixtral Large 2411 (124B) | 47.8 | 51.4 | 31.5 | 38.2 | 82.7 | 33.3 | 30.0 | 32.9 | 24.7 | 32.8 | 46.7 | 28.6 | 73.3 | 30.0 | 51.3 | 0.0 |
| *Proprietary* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Claude Sonnet 3.5 (new) | 38.7 | 40.7 | 30.0 | 26.2 | 75.3 | 30.0 | 20.7 | 41.4 | 12.0 | 15.5 | 33.3 | 39.3 | 64.0 | 20.0 | 38.7 | 0.0 |
| GPT-4o-mini | 43.4 | 47.2 | 26.0 | 30.0 | 87.3 | 13.3 | 26.0 | 32.9 | 16.7 | 17.2 | 37.3 | 39.3 | 76.0 | 20.0 | 40.0 | 50.0 |
| GPT-4o | 50.2 | 53.9 | 33.5 | 38.3 | 90.0 | 33.3 | 30.0 | 37.1 | 27.3 | 27.6 | 49.3 | 42.9 | 80.0 | 30.0 | 46.7 | 0.0 |
| Gemini 1.5 Flash | 57.8 | 61.2 | 42.5 | 48.5 | 90.7 | 46.7 | 47.3 | 47.1 | 30.7 | 31.0 | 55.3 | 53.6 | 82.7 | 30.0 | 60.7 | 50.0 |
| Gemini 1.5 Pro | 67.2 | 71.7 | 47.0 | 62.0 | 92.0 | 60.0 | 62.0 | 50.0 | 47.3 | 27.6 | 65.3 | 60.7 | 90.0 | 50.0 | 73.3 | 75.0 |
| **Reasoning models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| *Open* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| QVQ-72B-Preview | 65.0 | 69.7 | 44.0 | 57.2 | 94.0 | 33.3 | 54.0 | 41.4 | 41.3 | 55.2 | 65.3 | 50.0 | 95.3 | 30.0 | 68.0 | 0.0 |
| QwQ-32B-Preview | 73.1 | 82.7 | 30.0 | 75.8 | 95.3 | 3.3 | 70.0 | 24.3 | 67.3 | 50.0 | 80.7 | 32.1 | 97.3 | 20.0 | 85.3 | 50.0 |
| DeepSeek-R1 (MoE 37/671B) | 80.7 | 91.3 | 33.0 | 88.2 | 96.7 | 16.7 | 85.3 | 22.9 | 87.3 | 50.0 | 86.7 | 42.9 | 98.3 | 10.0 | 93.3 | 75.0 |
| *Proprietary* |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| o1-mini | 76.3 | 82.9 | 46.5 | 75.8 | 97.3 | 40.0 | 75.3 | 52.9 | 72.0 | 46.6 | 78.7 | 42.9 | 96.7 | 30.0 | 77.3 | 50.0 |
| Gemini 2.0 Flash Thinking | 83.2 | 89.2 | 58.5 | 86.2 | 95.3 | 60.0 | 80.7 | 48.6 | 88.7 | 65.5 | 85.3 | 75.0 | 95.3 | 50.0 | 90.0 | 25.0 |
| o3-mini | 82.2 | 92.8 | 34.5 | 89.5 | 99.3 | 10.0 | 88.0 | 17.1 | 90.7 | 60.3 | 85.3 | 50.0 | 99.3 | 20.0 | 94.0 | 75.0 |
| o1 | 86.8 | 93.1 | 58.5 | 90.5 | 97.3 | 50.0 | 86.0 | 57.1 | 90.7 | 63.8 | 92.0 | 60.7 | 99.3 | 50.0 | 93.3 | 75.0 |
Accuracy scores on our U-MATH benchmark and its constituent subject splits. For each category (overall and subject-specific), two scores are reported, separately for text-only (T) and visual (V) problems. Additionally, an overall score across all subjects excluding Algebra and Precalculus is shown under U-MATH THard. Asterisks denote small sample sizes (<15).
All models are evaluated with their latest versions available as of 2025-03-15. Greedy decoding is used for all models (except the OpenAI o-series), with ablations confirming no performance degradation compared to sampling-based inference. For text-only models, images are omitted from the prompt and only the problem statements are provided. Free-form solutions are verified against the golden labels by an ensemble of reasoning models.
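The exact judging setup is not spelled out on this page, so the snippet below is only a minimal sketch of how a single free-form answer check against a golden label might look. The OpenAI-compatible client, the model name, and the prompt wording are illustrative assumptions; the actual pipeline ensembles several reasoning-model judges, which a majority vote over multiple such calls would approximate.

```python
# Minimal sketch of one LLM-judge call for free-form answer verification.
# Assumptions (not from the original page): an OpenAI-compatible client,
# an illustrative model name, and illustrative prompt wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a university-level math problem.
Problem: {problem}
Reference (golden) answer: {golden}
Candidate solution: {solution}

Decide whether the candidate's final answer is mathematically equivalent
to the reference answer. Reply with exactly one word: CORRECT or INCORRECT."""


def judge_solution(problem: str, golden: str, solution: str, model: str = "o3-mini") -> bool:
    """Return True if the judge deems the candidate solution correct."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            problem=problem, golden=golden, solution=solution)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")  # "INCORRECT" does not match
```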
What makes
U-MATH stand out
Challenging problems to test deep understanding and advanced reasoning skills.
Covers 6 subjects: Algebra, Precalculus, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series
Problems and solutions are sourced from real coursework and checked by experts to meet academic standards.
| Math Subject | Textual # | Visual # |
|---|---|---|
| Algebra | 150 | 30 |
| Differential Calculus | 150 | 70 |
| Integral Calculus | 150 | 58 |
| Multivariable Calculus | 150 | 28 |
| Precalculus | 150 | 10 |
| Sequences and Series | 150 | 4 |
| All | 900 | 200 |
| Dataset | Has Uni. Level | Uni. Level % | #Test | %Visual | %Free-Form Answer |
|---|---|---|---|---|---|
| MMLU-Math | ✗ | 0 | 1.3k | 0 | 0 |
| GSM8k | ✗ | 0 | 1k | 0 | 0 |
| MATH | ✗ | 0 | 5k | 0 | 100 |
| MiniF2F | ✗ | 0 | 244 | 0 | 100 |
| OCWCourses | ✓ | 100 | 272 | 0 | 100 |
| ProofNet | ? | ? | 371 | 0 | 100 |
| CHAMP | ✗ | 0 | 270 | 0 | 100 |
| MathOdyssey | ✓ | 26 | 387 | 0 | 100 |
| MMMU-Math | ✗ | 0 | 505 | 100 | 0 |
| MathVista | ✗ | 0 | 5k | 100 | 46 |
| MATH-V | ✗ | 0 | 3k | 100 | 50 |
| We-Math | ✓ | 20 | 1.7k | 100 | 0 |
| MathVerse | ✗ | 0 | 4.7k | 83.3 | 45 |
| U-MATH (Toloka) | ✓ | 100 | 1.1k | 20 | 100 |
Comparison of U-MATH with existing math benchmarks: presence and share of university-level problems, test set size, percentage of visual problems, and percentage of free-form answers.
Test your LLM's math capabilities with U-MATH
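To run your own model, the data can be pulled with the Hugging Face datasets library; a minimal sketch is shown below. The hub IDs, split name, and field layout are assumptions here, so check the actual dataset cards.

```python
# Minimal sketch of loading the benchmarks with Hugging Face `datasets`.
# The hub IDs, split name, and field names are assumptions -- check the dataset cards.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")    # assumed hub ID and split
mu_math = load_dataset("toloka/mu-math", split="test")  # assumed hub ID and split

print(len(u_math), "U-MATH problems")
print(u_math[0])  # inspect the available fields (problem statement, subject, golden answer, ...)
```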
U-MATH Dataset Collection Process
To identify the most challenging problems among the tens of thousands of examples available for our benchmark, we used a multi-stage selection process (a sketch of the difficulty-ranking step follows the list):
1. Filter out unsuitable problems. We exclude easy problems, problems requiring extensive calculations, and multiple-choice problems.
2. Test LLMs. Solve the remaining problems with a number of popular LLMs.
3. Analyze results. Choose the most challenging problems in each subject area.
4. Expert validation. The final set of problems is verified by domain experts from Stevens Institute of Technology.
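The page does not give the exact thresholds used, so the snippet below only illustrates the "analyze results" step under the assumption that difficulty is measured by how many of the tested models solve each problem; the data layout and the per-subject quota are illustrative.

```python
# Illustrative sketch of the "analyze results" step: rank problems by how many
# tested models solved them and keep the hardest ones per subject.
# The data layout and quota are assumptions, not the published procedure.
from collections import defaultdict


def select_hardest(problems, solved_by, keep_per_subject=150):
    """problems: list of dicts with 'id' and 'subject';
    solved_by: dict mapping problem_id -> number of models that solved it."""
    by_subject = defaultdict(list)
    for p in problems:
        by_subject[p["subject"]].append(p)

    selected = []
    for subject, items in by_subject.items():
        # Fewest successful models first, i.e. hardest problems first.
        items.sort(key=lambda p: solved_by.get(p["id"], 0))
        selected.extend(items[:keep_per_subject])
    return selected
```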
U-MATH Data Samples
μ-MATH Meta-Evaluation Benchmark insights
1. Problem-solving performance ≠ judgment performance. A tradeoff between these skills emerges in non-reasoning models.
2. The tradeoff yields distinctive judgment styles: proprietary models are more conservative, minimizing false positives, while Qwen models tend to be more lenient.
3. Reasoning models push through the Pareto frontier, though typically with substantial imbalances; o1 pushes even further and exhibits far more balanced performance.
4. Judging is a non-trivial skill. The maximum attainable performance is imperfect even for reasoning models, so errors need to be accounted for when using auto-evaluation.
5. A balanced mix of training data leads to well-rounded performance, as evidenced by Qwen2.5 and Qwen2.5-Math.
6. Reducing model size, by contrast, exacerbates bias: balanced performance requires a model capable enough to generalize properly over the training mixture.
Which LLMs excel at judging math solutions?
U-MATH shows accuracy in solving math problems; μ-MATH reflects accuracy in judging solutions.
[Scatter plot comparing models by U-MATH Text-Hard accuracy and μ-MATH F1 score.]
μ-MATH Dataset Collection Process
Data Sample
Robust test for LLM judges
Judgment errors and biases in evaluations are often overlooked, creating uncertainty and unreliability. Meta-evaluations are essential to identify, quantify, and address these issues, yet they remain scarce, especially in math.
Dataset construction
Testing and metrics
During testing, a model is given a problem statement, a reference answer, and a solution to evaluate, and is tasked with determining whether the solution is correct. We treat this as a binary classification task.
Primary metric:
Macro-averaged F1 score, to minimize the effect of class imbalance
Fine-grained metrics:
Positive Predictive Value (PPV, or Precision) and True Positive Rate (TPR, or Recall) for the positive class
Negative Predictive Value (NPV) and True Negative Rate (TNR) for the negative class
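As a concrete reference, here is a small sketch of how these metrics follow from a judge's binary verdicts. It is plain confusion-matrix arithmetic, not the official μ-MATH evaluation code.

```python
# Sketch of the µ-MATH metrics from binary judge verdicts (1 = judged correct).
# Plain confusion-matrix arithmetic; not the official evaluation code.
def mu_math_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision for the positive class
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall for the positive class
    npv = tn / (tn + fn) if tn + fn else 0.0  # "precision" for the negative class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall for the negative class

    f1_pos = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    f1_neg = 2 * npv * tnr / (npv + tnr) if npv + tnr else 0.0
    macro_f1 = (f1_pos + f1_neg) / 2  # primary metric: macro-averaged F1

    return {"macro_f1": macro_f1, "ppv": ppv, "tpr": tpr, "npv": npv, "tnr": tnr}


# Example: four judged solutions, one false negative by the judge.
print(mu_math_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```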
Toloka’s U-MATH dataset is tailored for working with advanced mathematical problems. It provides structured data to support problem solving and the analysis of complex mathematical ideas, serving a wide range of applications in academic research, education, and AI training, with an emphasis on measuring mathematical problem-solving skills for both models and students.
Here are answers to some common questions about the math dataset concept and Toloka's U-MATH dataset.