U-MATH & μ-MATH: Assessing LLMs on university-level math

Why U-MATH?

U-MATH is the largest university-level math benchmark to date, designed to fill a gap left by existing math benchmarks, which are limited in scope and size.

— 1,100 problems from math courses across 6 key subject areas

— 20% of tasks include images to challenge LLMs
— Practical applications for industry and education

What is µ-MATH?

µ-MATH is a meta-evaluation benchmark of 1,084 solutions to U-MATH problems, designed to rigorously assess the quality of LLM judges.

Together, these benchmarks support comprehensive evaluation of LLM proficiency on university-level math.

LLM leaderboard on U-MATH

Both the U-MATH and μ-MATH datasets were collected with the help of Gradarius, a learning platform that helps students master calculus through a step-by-step approach, providing immediate feedback to guide them through problem-solving.

Insights into LLM performance on U-MATH

Reasoning models boast breakthrough performance, but university-level problems are still a challenge.

Gemini emerges as the overall winner across the board — from smaller models to reasoner systems.

Open-weight models are rapidly closing the gap on text-only tasks, but continue to lag on multi-modal problems.

Specialization trumps size: domain-specific models such as Qwen Math beat models an order of magnitude larger.

Integrating visual reasoning proves tough: U-MATHv scores lag significantly behind text-only performance, and adding visual capabilities to a model typically leads to degradation.

Gemini models are consistently the most adept at visual reasoning, exhibiting a large U-MATHv margin across all model groups.

| Model | U-MATH | U-MATH T (900) | U-MATH V (200) | Algebra T (150) | Algebra V (30) | Diff. Calc T (150) | Diff. Calc V (70) | Integr. Calc T (150) | Integr. Calc V (58) | Multiv. Calc T (150) | Multiv. Calc V (28) | Precalculus T (150) | Precalculus V* (10) | Seq. & Series T (150) | Seq. & Series V* (4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Text-only models (small)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Ministral 8B 2410 | 18.3 | 21.4 | 4.0 | 62.0 | 3.3 | 3.3 | 8.6 | 1.3 | 0.0 | 6.0 | 3.6 | 46.7 | 0.0 | 9.3 | 0.0 |
| LFM-7B | 21.1 | 24.6 | 5.5 | 68.0 | 3.3 | 4.0 | 11.4 | 0.7 | 1.7 | 7.3 | 3.6 | 56.0 | 0.0 | 11.3 | 0.0 |
| Llama-3.1 8B | 22.3 | 26.1 | 5.0 | 59.3 | 3.3 | 6.7 | 5.7 | 9.3 | 3.4 | 11.3 | 3.6 | 54.7 | 10.0 | 15.3 | 25.0 |
| Qwen2.5 7B | 33.8 | 40.0 | 6.0 | 86.0 | 10.0 | 12.7 | 1.4 | 10.0 | 12.1 | 26.7 | 3.6 | 75.3 | 0.0 | 29.3 | 0.0 |
| Qwen2.5-Math 7B | **38.4** | 45.2 | 7.5 | 87.3 | 6.7 | 18.7 | 5.7 | 8.0 | 10.3 | 36.0 | 10.7 | 80.7 | 0.0 | 40.7 | 0.0 |
| **Text-only models (medium)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mistral-Small-2501 (24B) | 29.5 | 35.0 | 4.5 | 82.0 | 3.3 | 6.0 | 4.3 | 6.0 | 5.2 | 16.0 | 3.6 | 69.3 | 10.0 | 30.7 | 0.0 |
| Qwen2.5-32B | 43.8 | 51.4 | 9.5 | 92.7 | 3.3 | 30.0 | 7.1 | 12.0 | 12.1 | 45.3 | 17.9 | 83.3 | 10.0 | 45.3 | 0.0 |
| Phi-4 (14B) | **44.1** | 51.3 | 11.5 | 90.7 | 0.0 | 32.7 | 8.6 | 9.3 | 19.0 | 46.0 | 10.7 | 88.7 | 20.0 | 40.7 | 25.0 |
| **Text-only models (large)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llama-3.1 70B | 28.5 | 33.7 | 5.0 | 82.0 | 3.3 | 10.7 | 5.7 | 4.0 | 5.2 | 14.0 | 3.6 | 64.0 | 0.0 | 27.3 | 25.0 |
| Llama-3.1 Nemotron 70B | 31.4 | 37.4 | 4.0 | 84.0 | 0.0 | 14.7 | 2.9 | 4.0 | 3.4 | 25.3 | 7.1 | 64.0 | 20.0 | 32.7 | 0.0 |
| Llama 3.3 70B | 37.3 | 43.4 | 9.5 | 87.3 | 3.3 | 20.0 | 12.9 | 11.3 | 12.1 | 38.0 | 7.1 | 67.3 | 0.0 | 36.7 | 0.0 |
| Mistral Large 2411 (123B) | 40.4 | 48.1 | 5.5 | 86.7 | 6.7 | 23.3 | 2.9 | 15.3 | 5.2 | 37.3 | 10.7 | 80.7 | 0.0 | 45.3 | 25.0 |
| Qwen2.5 72B | 41.0 | 48.6 | 7.0 | 88.7 | 6.7 | 22.7 | 4.3 | 12.0 | 6.9 | 40.0 | 17.9 | 83.3 | 0.0 | 44.7 | 0.0 |
| Athene-V2 72B Chat | 46.2 | 54.6 | 8.5 | 88.7 | 3.3 | 34.0 | 4.3 | 16.0 | 6.9 | 50.7 | 21.4 | 88.7 | 10.0 | 49.3 | 50.0 |
| Qwen2.5-Math 72B | 50.2 | 59.0 | 10.5 | 92.7 | 6.7 | 35.3 | 7.1 | 20.7 | 17.2 | 58.0 | 7.1 | 90.0 | 0.0 | 57.3 | 50.0 |
| DeepSeek-V3 (685B) | **51.9** | 60.4 | 13.5 | 98.0 | 3.3 | 35.3 | 5.7 | 20.7 | 24.1 | 57.3 | 17.9 | 90.0 | 10.0 | 61.3 | 50.0 |
| **Multimodal models (small)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Pixtral 12B | 15.5 | 15.6 | 15.5 | 44.7 | 23.3 | 1.3 | 34.3 | 0.7 | 0.0 | 3.3 | 0.0 | 32.0 | 0.0 | 11.3 | 0.0 |
| Llama-3.2 11B Vision | 17.0 | 18.6 | 10.0 | 54.0 | 10.0 | 1.3 | 20.0 | 1.3 | 1.7 | 4.7 | 3.6 | 43.3 | 10.0 | 6.7 | 0.0 |
| Qwen2-VL 7B | **20.4** | 21.4 | 15.5 | 62.7 | 10.0 | 4.7 | 32.9 | 0.7 | 5.2 | 6.7 | 7.1 | 45.3 | 0.0 | 8.7 | 0.0 |
| **Multimodal models (large)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen2-VL 72B | 31.2 | 32.2 | 26.5 | 80.7 | 26.7 | 9.3 | 40.0 | 2.0 | 13.8 | 14.7 | 28.6 | 65.3 | 10.0 | 21.3 | 0.0 |
| Llama-3.2 90B Vision | 32.6 | 36.3 | 16.0 | 85.3 | 26.7 | 10.7 | 25.7 | 2.7 | 1.7 | 22.7 | 7.1 | 65.3 | 20.0 | 31.3 | 25.0 |
| Pixtral Large 2411 (124B) | **39.7** | 42.9 | 25.5 | 86.0 | 33.3 | 15.3 | 31.4 | 9.3 | 15.5 | 32.0 | 25.0 | 72.7 | 20.0 | 42.0 | 25.0 |
| **Multimodal models (proprietary)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Claude Sonnet 3.5 | 35.1 | 36.1 | 30.5 | 76.0 | 33.3 | 12.0 | 41.4 | 7.3 | 17.2 | 21.3 | 28.6 | 65.3 | 30.0 | 34.7 | 25.0 |
| GPT-4o-mini | 37.2 | 40.3 | 23.0 | 88.0 | 16.7 | 16.7 | 31.4 | 4.0 | 10.3 | 24.0 | 35.7 | 77.3 | 20.0 | 32.0 | 25.0 |
| GPT-4o | 43.5 | 46.4 | 30.0 | 91.3 | 30.0 | 18.7 | 32.9 | 10.0 | 20.7 | 41.3 | 42.9 | 79.3 | 30.0 | 38.0 | 25.0 |
| Gemini 1.5 Flash | 51.3 | 53.8 | 40.0 | 91.3 | 50.0 | 36.0 | 45.7 | 14.0 | 24.1 | 44.0 | 50.0 | 80.7 | 30.0 | 56.7 | 50.0 |
| Gemini 1.5 Pro | **60.1** | 63.4 | 45.0 | 91.3 | 60.0 | 50.7 | 47.1 | 27.3 | 24.1 | 60.7 | 57.1 | 87.3 | 70.0 | 63.3 | 50.0 |
| **Reasoning models (large)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| QVQ-72B-Preview | 55.1 | 59.3 | 36.0 | 92.7 | 33.3 | 44.7 | 34.3 | 19.3 | 39.7 | 53.3 | 42.9 | 91.3 | 30.0 | 54.7 | 0.0 |
| QwQ-32B-Preview | 61.5 | 71.8 | 15.0 | 94.0 | 3.3 | 60.7 | 7.1 | 39.3 | 24.1 | 65.3 | 25.0 | 92.7 | 10.0 | 78.7 | 50.0 |
| DeepSeek-R1 | **68.4** | 79.0 | 20.5 | 96.7 | 0.0 | 70.7 | 10.0 | 48.0 | 36.2 | 76.7 | 32.1 | 98.0 | 10.0 | 84.0 | 75.0 |
| **Reasoning models (proprietary)** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| o1-mini-2024-09-12 | 63.5 | 73.3 | 19.0 | 96.0 | 0.0 | 66.0 | 11.4 | 42.0 | 27.6 | 68.7 | 39.3 | 93.0 | 20.0 | 74.0 | 25.0 |
| o3-mini-2025-01-31 high | 68.9 | 79.7 | 20.5 | 94.0 | 6.7 | 76.7 | 7.1 | 52.7 | 39.7 | 75.3 | 32.1 | 96.0 | 0.0 | 83.3 | 50.0 |
| o1-2024-12-17 | 70.7 | 77.2 | 45.2 | 38.7 | 20.0 | 31.3 | 21.4 | 27.3 | 25.9 | 32.0 | 32.1 | 41.3 | 20.0 | 39.3 | 0.0 |
| Gemini-2.0-Flash Thinking Exp 01-21 | **73.6** | 77.7 | 55.0 | 93.3 | 56.7 | 69.3 | 47.1 | 54.7 | 56.9 | 74.0 | 75.0 | 95.3 | 50.0 | 80.0 | 25.0 |

Comparison of model accuracy on the U-MATH benchmark and its subjects. For each subject, two scores are reported: accuracy on text-only (T) and visual (V) problems. An asterisk denotes a small number of samples (<15). Free-form solutions are judged by gpt-4o-2024-08-06. For text-only models, images are not included in the prompt, only the problem statement. Bold marks the best overall U-MATH score within each model group.
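For context on how this kind of auto-evaluation can be wired up, here is a minimal sketch that asks an OpenAI judge model whether one free-form solution matches the reference answer. The prompt wording, the yes/no parsing, and the `judge_solution` helper are illustrative assumptions rather than the exact U-MATH judging setup; only the judge model name comes from the caption above.

```python
# Minimal sketch of LLM-as-judge scoring for free-form math answers.
# Assumptions: the prompt text and parsing are illustrative, not the official
# U-MATH judge prompt; requires the `openai` package and an API key.
from openai import OpenAI

client = OpenAI()

JUDGE_MODEL = "gpt-4o-2024-08-06"  # judge used for the leaderboard above


def judge_solution(problem: str, reference_answer: str, candidate_solution: str) -> bool:
    """Ask the judge whether the candidate's final answer matches the reference."""
    prompt = (
        "You are grading a university-level math solution.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference answer:\n{reference_answer}\n\n"
        f"Candidate solution:\n{candidate_solution}\n\n"
        "Does the candidate's final answer agree with the reference answer? "
        "Reply with exactly one word: Yes or No."
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")


# Example usage with a made-up problem (not taken from the benchmark):
# judge_solution("Compute d/dx sin(x^2).", "2x cos(x^2)", "... = 2x*cos(x^2)")
```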

What makes U-MATH stand out

— Challenging problems to test deep understanding and advanced reasoning skills

— Covers 6 subjects: Algebra, Precalculus, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series

— Problems and solutions are sourced from real coursework and checked by experts to meet academic standards

| Math Subject | #Textual | #Visual |
|---|---|---|
| Algebra | 150 | 30 |
| Differential Calculus | 150 | 70 |
| Integral Calculus | 150 | 58 |
| Multivariable Calculus | 150 | 28 |
| Precalculus | 150 | 10 |
| Sequences and Series | 150 | 4 |
| All | 900 | 200 |

| Dataset | % Uni. Level | #Test | % Visual | % Free-Form Answer |
|---|---|---|---|---|
| MMLU Math | 0 | 1.3k | 0 | 0 |
| GSM8k | 0 | 1k | 0 | 0 |
| MATH | 0 | 5k | 0 | 100 |
| MiniF2F | 0 | 244 | 0 | 100 |
| OCWCourses | 100 | 272 | 0 | 100 |
| ProofNet | ? | 371 | 0 | 100 |
| CHAMP | 0 | 270 | 0 | 100 |
| MathOdyssey | 26 | 387 | 0 | 100 |
| MMMU Math | 0 | 505 | 100 | 0 |
| MathVista | 0 | 5k | 100 | 46 |
| MATH-V | 0 | 3k | 100 | 50 |
| We-Math | 20 | 1.7k | 100 | 0 |
| MathVerse | 0 | 4.7k | 83.3 | 45 |
| U-MATH (Toloka) | 100 | 1.1k | 20 | 100 |
Existing auto-evaluation math benchmarks, with the percentage of university-level problems, the number of published test samples, the percentage of visual samples, and the percentage of free-form answers. Levels range from lower educational levels to university level. Some of the problems in U-MATH and μ-MATH are sourced from OpenStax under CC BY 4.0.

Test your LLM's math capabilities with U-MATH

U-MATH Dataset Collection Process

To identify the most challenging problems among the tens of thousands of examples available for our benchmark, we used a multi-stage selection process (a simplified sketch of the selection step follows the list):

1. Filter out unsuitable problems. We exclude easy problems, problems requiring extensive calculations, and multiple-choice problems.

2. Test LLMs. Solve the remaining problems using popular small LLMs.

3. Analyze results. Choose the most challenging problems in each subject area.

4. Expert validation. The final set of problems is checked by experts from the Stevens Institute of Technology.
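Below is a simplified sketch of the "analyze results" step, assuming solve-rate statistics collected from the small LLMs in step 2. The record fields and the per-subject cut-off are illustrative assumptions, not the exact U-MATH procedure.

```python
# Simplified sketch: rank candidate problems by how often small open models
# solved them, and keep the hardest ones in each subject.
# The record structure and the per_subject cut-off are illustrative assumptions.
from collections import defaultdict


def select_hardest(records, per_subject=25):
    """records: iterable of dicts like
    {"problem_id": "p1", "subject": "Algebra", "solved_by": 2, "attempted_by": 8}."""
    by_subject = defaultdict(list)
    for r in records:
        solve_rate = r["solved_by"] / r["attempted_by"]
        by_subject[r["subject"]].append((solve_rate, r["problem_id"]))

    selected = {}
    for subject, scored in by_subject.items():
        scored.sort()  # lowest solve rate first = hardest problems
        selected[subject] = [pid for _, pid in scored[:per_subject]]
    return selected
```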

U-MATH Data Samples

μ-MATH Meta-Evaluation Benchmark insights

Problem-solving performance ≠ judgment performance. A tradeoff between these skills emerges in non-reasoners.

The tradeoff yields distinctive judgment styles: proprietary models are more conservative, minimizing false positives, while Qwens tend to be more lenient.

Reasoning models push through the Pareto frontier, typically at the cost of substantial imbalances. o1 pushes even further while exhibiting far more balanced performance.

Judging is a non-trivial skill. The maximum attainable performance is imperfect, even with reasoners, so we need to account for errors when using auto-evaluation.

A balanced mix of training data leads to well-rounded performance, as evidenced by Qwen2.5 and Qwen2.5-Math.

Reducing model size, by contrast, exacerbates bias: balanced performance requires the model to generalize over the training mixture.

Which LLMs excel at judging math solutions?

U-MATH measures accuracy in solving math problems; μ-MATH measures accuracy in judging solutions.

μ-MATH Dataset Collection Process 

Robust test for LLM judges

Judgment errors and biases in evaluations are often overlooked, creating uncertainty and unreliability. Meta-evaluations are essential to identify, quantify, and address these issues, yet they remain scarce, especially in math.

Dataset construction

— Hand-picked ~25% of U-MATH problems (271 in total), selected for their judgment complexity while remaining representative of university-level math

— Generated 4 solutions for each problem using Qwen2.5-72B, Llama 3.1 70B, GPT-4o, and Gemini 1.5 Pro, resulting in 1,084 problem-solution pairs

— Supplied each pair with a gold judgment verdict via a combination of labeling by Toloka's math experts and the Gradarius formal autoverification API

— Framed judgment as a binary classification task and computed standard binary metrics, with macro F1 as the main one (so that positive and negative labels contribute equally)

Testing and metrics

During testing, a model is provided with a problem statement, a reference answer, and a solution to evaluate. We treat this as a binary classification task; a minimal sketch of the metric computation follows the list below.

Primary metric:

  • Macro-averaged F1 score to minimize the effect of class imbalance

Fine-grained metrics:

  • Positive Predictive Value (PPV, or Precision) and True Positive Rate (TPR, or Recall) for the positive class

  • Negative Predictive Value (NPV) and True Negative Rate (TNR) for the negative class
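Here is a minimal sketch of computing these metrics from a list of judge verdicts. The helper name and the toy labels at the bottom are made up for illustration and are not benchmark data.

```python
# Sketch of the µ-MATH metric computation for a set of judge verdicts.
# y_true: gold labels (True = solution is correct); y_pred: the judge's verdicts.
def judge_metrics(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))

    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision on the positive class
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on the positive class
    npv = tn / (tn + fn) if tn + fn else 0.0  # "precision" on the negative class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # "recall" on the negative class

    f1_pos = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    f1_neg = 2 * npv * tnr / (npv + tnr) if npv + tnr else 0.0
    macro_f1 = (f1_pos + f1_neg) / 2  # both classes contribute equally

    return {"macro_f1": macro_f1, "PPV": ppv, "TPR": tpr, "NPV": npv, "TNR": tnr}


# Toy example: six judged solutions (not real benchmark data).
gold = [True, True, False, False, True, False]
verdict = [True, False, False, True, True, False]
print(judge_metrics(gold, verdict))
```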

Data Sample

An example problem from the µ-MATH meta-evaluation benchmark, illustrating the comparison between the golden (reference) answer and the answer generated by an LLM.

How smart is your LLM?
Test performance on complex math problems and step-by-step reasoning

Frequently Asked Questions

Toloka's U-MATH dataset is tailored for working with advanced mathematical problems. It provides structured data for studying problem-solving processes and complex mathematical ideas, supporting a wide range of applications in academic research, education, and AI training, with an emphasis on measuring mathematical problem-solving skills for both models and students.

Here are answers to some common questions about math datasets in general and Toloka's U-MATH dataset in particular.

What is a mathematics dataset?
What types of problems are included in mathematics datasets?
Do mathematics datasets vary by difficulty level?
Where can I download mathematics datasets for free?
What LLMs perform best on university-level math problems?
How are problems for U-MATH selected?
Can Toloka's math dataset be used for AI model evaluation?
What is the role of visual elements in Toloka's math dataset?