U-MATH & μ-MATH

Assessing LLMs on university-level math

Why U-MATH?

U-MATH is the largest university-level math benchmark, designed to address the gap in existing math benchmarks, which are limited in scope and size.

1,100 problems from math courses across 6 key subject areas

20% of tasks include images to challenge LLMs

Practical applications for industry and education

What is μ-MATH?

μ-MATH is a meta-evaluation benchmark: a set of 1,084 solutions to U-MATH problems, designed to rigorously assess the quality of LLM judges.

Together, these benchmarks support comprehensive evaluation of LLM proficiency on university-level math.

LLM leaderboard on U-MATH

(Chart: Performance on U-MATH by model family, covering Gemini, Llama, Qwen2, Claude, OpenAI, Mistral AI, and DeepSeek.)

Learn More

Both the U-MATH and μ-MATH datasets were collected with the help of Gradarius, a learning platform that helps students master calculus through a step-by-step approach, providing immediate feedback to guide them through problem solving.

Insights into LLM performance on U-MATH

University-level problems pose a challenge to standard-inference models, but reasoning systems boast breakthrough performance.

Integrating vision proves tough: U-MATHv scores lag significantly behind text-only performance, and adding visual capabilities to a model typically leads to degradation.

Continuous training pushes the models forward: the Athene fine-tune of Qwen 2.5 and the Nemotron fine-tune of Llama 3.1 deliver improvements across the board.

Open-weight models are rapidly closing the gap on text-only tasks, but continue to lag on multi-modal problems.

Gemini models are most adept at visual reasoning, consistently leading in U-MATHv score across all model groups.

Specialization trumps size: domain-specific models such as Qwen Math beat models an order of magnitude larger.

| Model | U-MATH Full (1100) | U-MATH T (900) | U-MATH V (200) | U-MATH THard (600) | Algebra T (150) | Algebra V (30) | Diff. Calc T (150) | Diff. Calc V (70) | Integr. Calc T (150) | Integr. Calc V (58) | Multiv. Calc T (150) | Multiv. Calc V (28) | Precalculus T (150) | Precalculus V* (10) | Seq. & Series T (150) | Seq. & Series V* (4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Text-only models: Small | | | | | | | | | | | | | | | | |
| Ministral 8B 2410 | 23.1 | 26.9 | 6.0 | 13.5 | 60.0 | 6.7 | 13.3 | 8.6 | 10.0 | 5.2 | 12.7 | 3.6 | 47.3 | 0.0 | 18.0 | 0.0 |
| Llama-3.1 8B | 29.5 | 33.7 | 11.0 | 22.8 | 60.0 | 3.3 | 17.3 | 10.0 | 22.7 | 19.0 | 23.3 | 3.6 | 50.7 | 20.0 | 28.0 | 0.0 |
| Qwen2.5 7B | 43.3 | 50.4 | 11.0 | 34.5 | 86.0 | 20.0 | 30.7 | 4.3 | 32.0 | 19.0 | 36.7 | 3.6 | 78.7 | 10.0 | 38.7 | 0.0 |
| Qwen2.5-Math 7B | 45.5 | 53.0 | 11.5 | 38.0 | 84.7 | 6.7 | 32.0 | 8.6 | 24.0 | 17.2 | 44.0 | 10.7 | 81.3 | 0.0 | 52.0 | 50.0 |
| Text-only models: Medium | | | | | | | | | | | | | | | | |
| Mistral Small 2501 (24B) | 34.8 | 39.9 | 12.0 | 22.0 | 80.7 | 13.3 | 13.3 | 10.0 | 13.3 | 15.5 | 25.3 | 14.3 | 70.7 | 0.0 | 36.0 | 0.0 |
| Qwen2.5 32B | 52.4 | 60.4 | 16.0 | 46.3 | 92.7 | 13.3 | 42.7 | 11.4 | 34.7 | 25.9 | 50.0 | 17.9 | 85.3 | 0.0 | 58.0 | 0.0 |
| Text-only models: Large | | | | | | | | | | | | | | | | |
| Llama-3.1 70B | 35.2 | 40.4 | 11.5 | 23.8 | 79.3 | 3.3 | 17.3 | 17.1 | 16.0 | 10.3 | 26.7 | 7.1 | 68.0 | 0.0 | 35.3 | 50.0 |
| Llama-3.1 Nemotron 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Llama 3.3 70B | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Mistral Large 2411 (123B) | 42.5 | 47.7 | 19.5 | 33.7 | 84.0 | 23.3 | 29.3 | 21.4 | 21.3 | 19.0 | 40.7 | 14.3 | 67.3 | 20.0 | 43.3 | 0.0 |
| Qwen2.5 72B | 51.2 | 58.9 | 16.5 | 44.7 | 90.7 | 16.7 | 36.7 | 15.7 | 35.3 | 17.2 | 52.0 | 14.3 | 84.0 | 10.0 | 54.7 | 50.0 |
| Athene-V2 Chat (72B) | 54.9 | 62.9 | 19.0 | 49.8 | 87.3 | 10.0 | 43.3 | 22.9 | 36.7 | 17.2 | 62.0 | 21.4 | 90.7 | 0.0 | 57.3 | 75.0 |
| Qwen2.5-Math 72B | 59.5 | 68.7 | 18.0 | 57.0 | 94.7 | 6.7 | 46.0 | 12.9 | 44.0 | 25.9 | 69.3 | 21.4 | 89.3 | 10.0 | 68.7 | 75.0 |
| DeepSeek-V3 (MoE 37/671B) | 62.6 | 69.3 | 32.5 | 57.5 | 96.0 | 10.0 | 49.3 | 30.0 | 38.7 | 39.7 | 69.3 | 42.9 | 90.0 | 40.0 | 72.7 | 50.0 |
| Multimodal models: Small | | | | | | | | | | | | | | | | |
| Pixtral 12B | 17.5 | 17.9 | 16.0 | 8.8 | 40.0 | 23.3 | 10.7 | 30.0 | 4.7 | 3.4 | 6.7 | 71.1 | 32.0 | 0.0 | 13.3 | 0.0 |
| Llama-3.2 11B Vision | 20.4 | 22.9 | 9.0 | 10.3 | 52.0 | 3.3 | 7.3 | 20.0 | 1.3 | 3.4 | 13.3 | 0.0 | 44.0 | 10.0 | 19.3 | 0.0 |
| Qwen2-VL 7B | 26.3 | 27.1 | 22.5 | 15.3 | 58.7 | 10.0 | 18.7 | 37.1 | 11.3 | 17.2 | 14.0 | 17.9 | 42.7 | 0.0 | 17.3 | 0.0 |
| Multimodal models: Large | | | | | | | | | | | | | | | | |
| Llama-3.2 90B Vision | 37.2 | 41.8 | 16.5 | 24.7 | 82.0 | 23.3 | 21.3 | 27.1 | 11.3 | 5.2 | 30.0 | 10.7 | 70.0 | 0.0 | 36.0 | 25.0 |
| Qwen2-VL 72B | 41.8 | 43.9 | 32.5 | 29.3 | 80.0 | 26.7 | 29.3 | 44.3 | 22.0 | 27.6 | 32.0 | 28.6 | 66.0 | 10.0 | 34.0 | 25.0 |
| Pixtral Large 2411 (124B) | 47.8 | 51.4 | 31.5 | 38.2 | 82.7 | 33.3 | 30.0 | 32.9 | 24.7 | 32.8 | 46.7 | 28.6 | 73.3 | 30.0 | 51.3 | 0.0 |
| Multimodal models: Proprietary | | | | | | | | | | | | | | | | |
| Claude Sonnet 3.5 (new) | 38.7 | 40.7 | 30.0 | 26.2 | 75.3 | 30.0 | 20.7 | 41.4 | 12.0 | 15.5 | 33.3 | 39.3 | 64.0 | 20.0 | 38.7 | 0.0 |
| GPT-4o-mini | 43.4 | 47.2 | 26.0 | 30.0 | 87.3 | 13.3 | 26.0 | 32.9 | 16.7 | 17.2 | 37.3 | 39.3 | 76.0 | 20.0 | 40.0 | 50.0 |
| GPT-4o | 50.2 | 53.9 | 33.5 | 38.3 | 90.0 | 33.3 | 30.0 | 37.1 | 27.3 | 27.6 | 49.3 | 42.9 | 80.0 | 30.0 | 46.7 | 0.0 |
| Gemini 1.5 Flash | 57.8 | 61.2 | 42.5 | 48.5 | 90.7 | 46.7 | 47.3 | 47.1 | 30.7 | 31.0 | 55.3 | 53.6 | 82.7 | 30.0 | 60.7 | 50.0 |
| Gemini 1.5 Pro | 67.2 | 71.7 | 47.0 | 62.0 | 92.0 | 60.0 | 62.0 | 50.0 | 47.3 | 27.6 | 65.3 | 60.7 | 90.0 | 50.0 | 73.3 | 75.0 |
| Reasoning models: Open | | | | | | | | | | | | | | | | |
| QVQ-72B-Preview | 65.0 | 69.7 | 44.0 | 57.2 | 94.0 | 33.3 | 54.0 | 41.4 | 41.3 | 55.2 | 65.3 | 50.0 | 95.3 | 30.0 | 68.0 | 0.0 |
| QwQ-32B-Preview | 73.1 | 82.7 | 30.0 | 75.8 | 95.3 | 3.3 | 70.0 | 24.3 | 67.3 | 50.0 | 80.7 | 32.1 | 97.3 | 20.0 | 85.3 | 50.0 |
| DeepSeek-R1 (MoE 37/671B) | 80.7 | 91.3 | 33.0 | 88.2 | 96.7 | 16.7 | 85.3 | 22.9 | 87.3 | 50.0 | 86.7 | 42.9 | 98.3 | 10.0 | 93.3 | 75.0 |
| Reasoning models: Proprietary | | | | | | | | | | | | | | | | |
| o1-mini | 76.3 | 82.9 | 46.5 | 75.8 | 97.3 | 40.0 | 75.3 | 52.9 | 72.0 | 46.6 | 78.7 | 42.9 | 96.7 | 30.0 | 77.3 | 50.0 |
| Gemini 2.0 Flash Thinking | 83.2 | 89.2 | 58.5 | 86.2 | 95.3 | 60.0 | 80.7 | 48.6 | 88.7 | 65.5 | 85.3 | 75.0 | 95.3 | 50.0 | 90.0 | 25.0 |
| o3-mini | 82.2 | 92.8 | 34.5 | 89.5 | 99.3 | 10.0 | 88.0 | 17.1 | 90.7 | 60.3 | 85.3 | 50.0 | 99.3 | 20.0 | 94.0 | 75.0 |
| o1 | 86.8 | 93.1 | 58.5 | 90.5 | 97.3 | 50.0 | 86.0 | 57.1 | 90.7 | 63.8 | 92.0 | 60.7 | 99.3 | 50.0 | 93.3 | 75.0 |

Accuracy scores on the U-MATH benchmark and its constituent subject splits. For each category (overall and subject-specific), two numbers are provided, separately for text-only (T) and visual (V) problems. Additionally, an overall score across all subjects excluding Algebra and Precalculus is shown under U-MATH THard. Asterisks denote small sample sizes (<15).

All models are used in their latest versions available as of 2025-03-15. Greedy decoding is employed for all models (except the OpenAI o-series), with ablations performed to ensure no performance degradation compared to sampling-based inference. For text-only models, only the problem statements are included in the prompt, without images. Free-form solutions are verified against golden labels by an ensemble of reasoning models.
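For reference, here is a minimal sketch of what the generation side of such an evaluation loop can look like. It assumes an OpenAI-compatible endpoint and a Hugging Face copy of the benchmark; the "toloka/u-math" dataset id, the "test" split, the prompt wording, and the field names are assumptions for illustration, not the exact harness behind this leaderboard.

```python
# Illustrative U-MATH generation loop (not the exact leaderboard harness).
# Assumptions: an OpenAI-compatible endpoint, a Hugging Face dataset id such as
# "toloka/u-math", and the field names "problem_statement" / "golden_answer".
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # point base_url at any OpenAI-compatible server to test other models

problems = load_dataset("toloka/u-math", split="test")

def solve(problem_statement: str, model: str = "gpt-4o") -> str:
    """Request a free-form solution with greedy decoding (temperature 0)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # greedy decoding, matching the setup described above
        messages=[{
            "role": "user",
            "content": (
                "Solve the following university-level math problem. "
                "Show your reasoning and clearly state the final answer.\n\n"
                + problem_statement
            ),
        }],
    )
    return response.choices[0].message.content

solutions = [solve(row["problem_statement"]) for row in problems]
# Each free-form solution is then graded against the golden answer by LLM judges;
# the μ-MATH section below describes how that judgment step is set up and scored.
```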

What makes U-MATH stand out

Challenging problems to test deep understanding and advanced reasoning skills.

Covers 6 subjects: Algebra, Precalculus, Differential Calculus, Integral Calculus, Multivariable Calculus, Sequences & Series

Problems and solutions are sourced from real coursework and checked by experts to meet academic standards.

| Math Subject | # Textual | # Visual |
|---|---|---|
| Algebra | 150 | 30 |
| Differential Calculus | 150 | 70 |
| Integral Calculus | 150 | 58 |
| Multivariable Calculus | 150 | 28 |
| Precalculus | 150 | 10 |
| Sequences and Series | 150 | 4 |
| All | 900 | 200 |

| Dataset | % Uni. Level | # Test | % Visual | % Free-Form Answer |
|---|---|---|---|---|
| MMLU-Math | 0 | 1.3k | 0 | 0 |
| GSM8k | 0 | 1k | 0 | 0 |
| MATH | 0 | 5k | 0 | 100 |
| MiniF2F | 0 | 244 | 0 | 100 |
| OCWCourses | 100 | 272 | 0 | 100 |
| ProofNet | ? | 371 | 0 | 100 |
| CHAMP | 0 | 270 | 0 | 100 |
| MathOdyssey | 26 | 387 | 0 | 100 |
| MMMU-Math | 0 | 505 | 100 | 0 |
| MathVista | 0 | 5k | 100 | 46 |
| MATH-V | 0 | 3k | 100 | 50 |
| We-Math | 20 | 1.7k | 100 | 0 |
| MathVerse | 0 | 4.7k | 83.3 | 45 |
| U-MATH (Toloka) | 100 | 1.1k | 20 | 100 |

Comparison of U-MATH with existing math benchmarks: share of university-level problems, test set size, share of visual problems, and share of free-form answers.

Test your LLM's math
capabilities with U-MATH

U-MATH Dataset Collection Process

To identify the most challenging problems from the tens of thousands of examples available for our benchmark, we used a multi-stage selection process:

Filter out unsuitable problems

We exclude easy problems, those requiring extensive calculations, and multiple-choice problems.

Test LLMs

Solve the selected problems using a number of popular LLMs.

Analyze results

Choose the most challenging problems in each subject area (see the sketch below for the idea).

Expert validation

The final set of problems is verified by domain experts from Stevens Institute of Technology.
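The sketch below illustrates the difficulty-filtering idea behind the "Analyze results" step: rank the pre-filtered candidates by how often a set of probe models solves them, then keep the hardest problems per subject. The probe-model list, the per-subject quota, and the solve/is_correct helpers are placeholders, not the exact procedure used for U-MATH.

```python
# Difficulty-filtering sketch: keep, per subject, the problems that the probe
# models solve least often. All names below are illustrative placeholders.
from collections import defaultdict

PROBE_MODELS = ["model-a", "model-b", "model-c"]   # placeholder model ids
PER_SUBJECT_QUOTA = 150                            # target text problems per subject

def solve(problem: dict, model: str) -> str:
    """Placeholder: query `model` for a solution to the problem (e.g. via an LLM API)."""
    raise NotImplementedError

def is_correct(problem: dict, solution: str) -> bool:
    """Placeholder: check the solution against the problem's golden answer."""
    raise NotImplementedError

def solve_rate(problem: dict) -> float:
    """Fraction of probe models that solve the problem correctly."""
    verdicts = [is_correct(problem, solve(problem, model)) for model in PROBE_MODELS]
    return sum(verdicts) / len(verdicts)

def select_hardest(candidates: list[dict]) -> list[dict]:
    """Pick the lowest-solve-rate problems in each subject from pre-filtered candidates."""
    by_subject = defaultdict(list)
    for problem in candidates:
        by_subject[problem["subject"]].append(problem)
    selected = []
    for subject, pool in by_subject.items():
        pool.sort(key=solve_rate)                  # hardest (lowest solve rate) first
        selected.extend(pool[:PER_SUBJECT_QUOTA])
    return selected                                # then handed to experts for validation
```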

U-MATH Data Samples

μ-MATH Meta-Evaluation 

Benchmark insights

1. Problem-solving performance ≠ judgment performance. A tradeoff between these skills emerges in non-reasoners.

2. The tradeoff yields distinctive judgment styles: proprietary models are more conservative, minimizing false positives, while Qwen models tend to be more lenient.

3. Reasoning models push past the Pareto frontier, though typically with substantial imbalances; o1 goes further and exhibits far more balanced performance.

4. Judging is a non-trivial skill. The maximum attainable performance is imperfect, even with reasoners, so errors need to be accounted for when using auto-evaluation.

5. A balanced mix of training data leads to well-rounded performance, as evidenced by Qwen2.5 and Qwen2.5-Math.

6. Reducing the model size, on the contrary, exacerbates bias. Balanced performance requires a model capable enough to properly generalize over the training mixture.

Which LLMs excel at judging math solutions?

U-MATH measures accuracy in solving math problems; μ-MATH measures accuracy in judging solutions.

(Chart: U-MATH THard accuracy plotted against μ-MATH macro-F1 score for each model.)

μ-MATH Dataset Collection Process 

Data Sample

Robust test for LLM judges

Judgment errors and biases in evaluations are often overlooked, creating uncertainty and unreliability. Meta-evaluations are essential to identify, quantify, and address these issues, yet they remain scarce, especially in math.

Dataset construction

Hand-picked ~25% of U-MATH problems (271 in total), selected for their assessment complexity and overall representativeness of university-level math problems.

Generated 4 solutions for each, using Qwen2.5 72B, Llama-3.1 70B, GPT-4o, and Gemini 1.5 Pro, resulting in 1,084 problem-solution pairs.

Supplied each pair with a correct judgment verdict via a combination of labeling by Toloka's math experts and the Gradarius formal autoverification API.

Testing and metrics

During testing, a model is provided with a problem statement, a reference answer, and a solution to evaluate, and is tasked with determining whether the solution is correct. We treat this as a binary classification task.
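As a concrete illustration of this setup, the sketch below shows one way such a judge call can be framed; the prompt wording, the model id, and the yes/no parsing are our assumptions, not the exact μ-MATH judge prompt.

```python
# Illustrative μ-MATH-style judging call: the judge sees the problem, the golden
# answer, and a candidate solution, and must emit a binary verdict.
# The prompt wording and the default model id are assumptions for this sketch.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a university-level math solution.

Problem:
{problem}

Reference (golden) answer:
{golden_answer}

Candidate solution:
{solution}

Is the candidate solution's final answer mathematically equivalent to the
reference answer? Answer with a single word: Yes or No."""

def judge_verdict(problem: str, golden_answer: str, solution: str,
                  model: str = "gpt-4o") -> bool:
    """Return True if the judge model deems the candidate solution correct."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            problem=problem, golden_answer=golden_answer, solution=solution)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```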

Primary metric:

  • Macro-averaged F1 score, to minimize the effect of class imbalance

Fine-grained metrics:

  • Positive Predictive Value (PPV, or Precision) and True Positive Rate (TPR, or Recall) for the positive class

  • Negative Predictive Value (NPV) and True Negative Rate (TNR) for the negative class
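A short sketch of how these metrics can be computed from judge verdicts against the expert labels, using scikit-learn; the tiny arrays are dummy data for illustration only.

```python
# Compute the μ-MATH-style metrics from expert labels vs. judge verdicts.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [True, True, False, False, True, False]   # expert (gold) verdicts (dummy data)
y_pred = [True, False, False, True, True, False]   # LLM-judge verdicts (dummy data)

macro_f1 = f1_score(y_true, y_pred, average="macro")   # primary metric

# Confusion matrix with labels ordered [False, True] unpacks as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()
ppv = tp / (tp + fp)   # Positive Predictive Value (precision, positive class)
tpr = tp / (tp + fn)   # True Positive Rate (recall, positive class)
npv = tn / (tn + fn)   # Negative Predictive Value
tnr = tn / (tn + fp)   # True Negative Rate

print(f"Macro-F1={macro_f1:.3f}  PPV={ppv:.3f}  TPR={tpr:.3f}  NPV={npv:.3f}  TNR={tnr:.3f}")
```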

Frequently Asked Questions

Toloka's U-MATH dataset is tailored for working with advanced mathematical problems. It provides structured data to simplify problem-solving processes and analyze complex mathematical ideas, supporting a wide range of applications in academic research, education, and AI training, with an emphasis on measuring mathematical problem-solving skills for both models and students.
Here are answers to some common questions about math datasets and Toloka's U-MATH dataset.

What is a mathematics dataset?

What types of problems are included in mathematics datasets?

Do mathematics datasets vary by difficulty level?

Where can I download mathematics datasets for free?

What LLMs perform best on university-level math problems?

How are problems for U-MATH selected?

Can Toloka's math dataset be used for AI model evaluation?

What is the role of visual elements in Toloka's math dataset?

Trusted by Leading AI Teams

How smart is your LLM?
Test performance on complex math problems and step-by-step reasoning