U-MATH & μ-MATH: New university-level math benchmarks challenge LLMs
How well do LLMs tackle university-level math? Our comprehensive benchmark puts them to the test. Read on to find out where top LLMs rank.
Toloka and Gradarius, a calculus learning platform for students, have joined forces to create a pivotal benchmark in a challenging area for LLMs — mathematics. Introducing U-MATH, a robust dataset designed to assess an LLM's capacity for mathematical thinking at the university level. We also present μ-MATH, a meta-benchmark for evaluating the judging capabilities of LLMs on free-form mathematical solutions.
Our approach stands out from other benchmarks in several ways:
Multimodality. The U-MATH benchmark comprises 1,100 problems across six math subject areas, with 20% of them requiring visual comprehension.
Complexity. To create U-MATH, academic experts designed university-level math problems based on curriculum from top US universities.
Subset to test judging skills. We analyze how models perform in the role of a judge using our meta-benchmark, μ-MATH.
The U-MATH and μ-MATH benchmarks are available on HuggingFace — we encourage you to download and use them to test the performance of your LLMs.
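As a quick start, the snippet below sketches how you might pull both datasets with the Hugging Face datasets library; the repository IDs and split name are assumptions on our part, so check the dataset cards for the exact identifiers.

```python
# Minimal sketch: load U-MATH and mu-MATH from the Hugging Face Hub.
# The repository IDs and split name are assumptions -- verify them on the dataset cards.
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")    # 1,100 university-level problems
mu_math = load_dataset("toloka/mu-math", split="test")  # meta-evaluation samples with golden labels

print(u_math)     # list the available columns and the number of rows
print(u_math[0])  # inspect a single problem
```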
The Toloka research team tested several small and large LLMs on these datasets and compared their performance to reveal unexpected insights (for detailed results, read the paper on ArXiv).
Leaderboard surprises: Are top LLMs taking a hit?
We provide results on the complete U-MATH benchmark, its text-only subset (U-MATH-Text), and its text-and-visual subset (U-MATH-Visual). The free-form answers in the benchmark are judged by GPT-4o; please refer to the μ-MATH section for judge selection details.
Accuracy comparison on U-MATH.
The benchmark proved challenging for current LLMs.
A notable finding: GPT-4o solved only 43% of the problems and was outperformed by the open-weight Qwen2.5 model family, which reached 50%. Both were surpassed by Gemini 1.5 Pro, which solved 63% of the text-only tasks and 45% of the tasks involving image processing, for an overall U-MATH score of 60%.
In an interesting twist, smaller specialized text-only models like Qwen2.5-Math-7B managed to keep up with, or even outperform, models ten times their size such as Llama, as well as proprietary models like GPT-4o. While larger models usually have the upper hand, this comparison shows that a targeted approach can deliver competitive accuracy at a fraction of the size.
There is a sizable gap in performance between text and visual tasks, indicating lots of room for improvement for LLMs in handling visual information, even for top-tier models. It is especially noticeable for open-weight models. For a detailed comparison of model performance across visual and text tasks, visit our U-MATH benchmark page.
How U-MATH was designed to hit the sweet spot in math benchmarking
Complex math problems demand step-by-step logic and memory retention, making math benchmarking an excellent way to test and improve the overall reasoning abilities of LLMs.
Math is particularly tough for LLMs; many current benchmarks can't reliably assess their true proficiency. Existing benchmarks are often limited in size, target school-level math problems, or don't cover enough topics, and they don’t test visual reasoning thoroughly enough. Benchmark saturation is also a concern — GPT-4o achieved 80% and 94% success rates on two of the most popular math benchmarks, MATH and GSM8K, indicating the need for new, improved standards.
U-MATH is designed to fill this gap. The dataset consists of 1,100 open-ended problems with solutions and final answers. As a hybrid benchmark, 20% of the tasks in the dataset have a visual component. The distribution of topics is shown in the table.
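If you want to reproduce this breakdown yourself, a rough inspection along the lines below should work once the dataset is loaded; the column names ("subject", "has_image") are assumptions about the schema, so adjust them to match the dataset card.

```python
# Sketch: inspect the topic distribution and the visual share of U-MATH.
# Column names ("subject", "has_image") are assumptions -- check the dataset card.
from collections import Counter
from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")

for subject, count in Counter(u_math["subject"]).most_common():
    print(f"{subject:30s} {count}")

visual = sum(1 for row in u_math if row["has_image"])
print(f"Visual problems: {visual} / {len(u_math)} ({100 * visual / len(u_math):.0f}%)")
```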
The benchmark’s strength lies in the rigorous preparation of math problems from the Gradarius platform. The problems in the dataset were previously unpublished, written by experts, and designed to assess mathematical reasoning rather than arithmetic skills. Final selections were reviewed by professors at the Stevens Institute of Technology, ensuring each task met university-level standards. For more details about the dataset, read the paper.
Examples of math problems from the U-MATH dataset
What about LLMs judging AI-generated solutions? Enter μ-MATH
Since the U-MATH benchmark is based on free-form answers, a suitable LLM performs the role of judge. However, LLMs tend to introduce biases — a fact that is often overlooked, with judgment biases largely remaining unmeasured. To address the “elephant in the room”, we created μ-MATH as a unique meta-evaluation benchmark to study and quantify the behaviors, biases, and performance of LLMs as judges of math solutions.
The μ-MATH benchmark is built from a subset of U-MATH problems specifically selected for their assessment difficulty. Each problem is supplied with four solutions generated by different LLMs to increase the diversity of the meta-evaluation. Based on these 1,084 solutions, we assigned each sample a golden label (correct or incorrect).
Using this benchmark, we tested several LLMs on their ability to judge open-ended math problems. Let's illustrate the process with an example.
We show each judge a problem statement, a golden answer, and a solution to grade. The judge produces a binary label indicating whether the graded solution agrees with the golden answer.
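A minimal sketch of that judging step is shown below. The prompt wording, the query_llm helper, and the verdict-parsing convention are illustrative assumptions rather than the exact setup used in the paper.

```python
# Sketch of an LLM-as-judge call for a single sample.
# The prompt template, the query_llm helper, and the label parsing are illustrative assumptions.

JUDGE_PROMPT = """You are grading a university-level math solution.

Problem:
{problem}

Reference (golden) answer:
{golden_answer}

Candidate solution:
{solution}

Does the candidate's final answer agree with the golden answer?
Reply with a single word: CORRECT or INCORRECT."""


def judge_sample(problem: str, golden_answer: str, solution: str, query_llm) -> bool:
    """Return True if the judge accepts the solution, False otherwise."""
    prompt = JUDGE_PROMPT.format(problem=problem, golden_answer=golden_answer, solution=solution)
    reply = query_llm(prompt)  # query_llm: your own wrapper around the judge model's API
    return reply.strip().upper().startswith("CORRECT")
```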
The best judges are not the best problem solvers
Our meta-benchmark reveals an important point: models that excel at solving math problems aren’t always the best at judging solutions. Judgment is a challenging skill in its own right, requiring meta-evaluations for proper study.
Comparison of models' scores on the text subset of U-MATH (accuracy) and μ-MATH (F1)
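For reference, the μ-MATH numbers above are F1 scores obtained by comparing each judge's binary verdicts against the golden labels. A rough scoring sketch is given below; the macro averaging and the placeholder lists are our assumptions for illustration.

```python
# Sketch: score a judge's verdicts against the mu-MATH golden labels.
# The macro averaging and the placeholder values are assumptions for illustration.
from sklearn.metrics import f1_score

golden_labels = [True, False, True, True]    # golden labels from the meta-benchmark (placeholders)
judge_verdicts = [True, False, False, True]  # the judge's binary verdicts on the same samples

print("Judge F1:", f1_score(golden_labels, judge_verdicts, average="macro"))
```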
Many models show a mismatch between solving and evaluation abilities. For example, Qwen2.5-Math excels at solving but performs poorly as a judge, frequently solving problems instead of evaluating solutions. In contrast, Claude ranks higher as a judge despite its weaker problem-solving skills.
Gemini 1.5 Pro stands out by excelling in both solving and evaluating math problems, thanks to its strong math skills and effective instruction following.
GPT-4o's judgments were overly conservative: the model often dismissed correct solutions because it could not derive the complex answers itself, which left it in second place. In general, however, it performs well as a judge, with fewer hallucinations. Its reliability, ease of use, and wide availability made it our choice for judging U-MATH.
To learn about the behaviors and properties of different LLMs tested as judges, including their biases and prompt sensitivity, read the full research paper.
The difference our benchmarks can make
U-MATH and μ-MATH introduce new insights into LLM problem-solving and judging abilities when challenged with university-level math. With this comprehensive evaluation framework, we hope to lead the way towards further development of LLM reasoning skills, shaping robust models capable of handling real-world mathematical problems. To learn more, visit our U-MATH page.
Building on these achievements in math, future work will address the evaluation gap in other specialized domains such as Physics, Law, Natural Sciences, Medicine, IT, Finance, and Engineering. Multifaceted benchmarking is invaluable for assessing LLM performance in these fields, where complex, non-trivial benchmarks are required.
Download U-MATH, μ-MATH, and the evaluation code to test the math abilities of whatever model you are using or building. Feel free to reach out to us for an in-depth evaluation and help with fine-tuning your model using our complex math reasoning dataset.
Updated: Dec 5, 2024