Toloka's advanced benchmarking reveals limitations and opportunities for LLMs
As AI models become increasingly sophisticated, it's getting harder to identify where they still fail. AI answers sound convincing to ordinary users because of their fluency, and often only an expert will notice when the information is inaccurate or misleading.
But that doesn’t mean we should stop trying.
Following our commitment to responsible AI, the Toloka Research team is putting a spotlight on areas that need improvement, so LLMs can serve the best interests of both GenAI producers and end users.
Focusing on specialized domains
In recent months, Toloka's research team selected several critical domains to help LLM developers assess their models. Our research efforts have resulted in three benchmarks that allow LLM producers to find performance gaps in these areas:
University-level mathematics
Complex questions in natural sciences
Detecting AI-generated text
To collect the data for these benchmarks, we applied our expertise in generating data for various domains, knowledge of customer needs, and extensive network of domain experts. After running experiments and testing the latest LLMs on our benchmarks, we’ve shown that even the most advanced models still require alignment when addressing specialized domains. Read on to learn about each of the benchmarks and our findings.
The approach
To tackle the challenges the most popular models currently face, we use benchmarking: an evaluation method that compares how models perform on specific tasks using defined datasets. With unique datasets developed by Toloka, LLM producers can see whether they need to fine-tune their models with supplementary data.
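As a rough illustration, the core of benchmark-style evaluation is simple: run every model on the same fixed set of prompts and score the outputs against reference answers. The sketch below is a generic Python outline under that assumption; the `query_model` helper and the record fields are hypothetical placeholders, not Toloka's evaluation code.

```python
# Minimal sketch of benchmark-style evaluation: run each model on the same
# fixed dataset of (prompt, reference answer) pairs and compare accuracy.
# `query_model` and the dataset record fields are hypothetical placeholders.

def evaluate(model_name: str, dataset: list[dict], query_model) -> float:
    """Return the fraction of benchmark items the model answers correctly."""
    correct = 0
    for item in dataset:
        prediction = query_model(model_name, item["prompt"])
        if prediction.strip() == item["reference_answer"].strip():
            correct += 1
    return correct / len(dataset)

# Running several models on the same dataset makes performance gaps visible:
# scores = {m: evaluate(m, benchmark_items, query_model)
#           for m in ["model-a", "model-b"]}
```

Because every model sees the same items and the same scoring rule, differences in the resulting scores reflect model capabilities rather than differences in the test itself.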
The biggest university-level math benchmark
LLMs' reasoning capabilities are still limited. As the industry focuses on developing step-by-step reasoning skills, there is a high demand for data in areas like coding, mathematics, and logical reasoning.
There are plenty of existing benchmarks to assess problem-solving abilities in mathematics, but most of them focus on school-level math problems. A few others go to the opposite extreme and assess higher mathematics with theoretical problems that are exceedingly difficult. This leaves a significant gap in understanding how LLMs perform on more practical, university-level problems.
To close this gap, Toloka collaborated with Gradarius, a calculus learning platform for students. Together, we developed a new benchmark of 1,100 real-world math problems sourced from the curricula of top US universities and reviewed by academic experts. It includes complex prompts across six subject areas (algebra, differential calculus, integral calculus, multivariable calculus, precalculus, and sequences and series) and surpasses any other assessment of advanced math capabilities in these subjects.
About 20% of the problems in the dataset incorporate visual elements, requiring models to interpret and analyze graphs, charts, and geometric figures. Visual elements are crucial in many mathematical problems, especially in fields like geometry, calculus, and data analysis.
The U-MATH problems have free-form answers, so solutions are judged by an LLM. To find out how accurately LLMs can judge answers during benchmarking, we created μ-MATH, a meta-evaluation dataset: a subset of U-MATH with 340 problems, paired with LLM-generated solutions, used to assess an LLM's ability to judge solutions.
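To give a sense of what LLM-as-judge grading looks like in practice, here is a hypothetical Python sketch. The judge prompt, the `call_llm` helper, and the binary CORRECT/INCORRECT protocol are illustrative assumptions rather than the actual μ-MATH judging setup.

```python
# Hypothetical sketch of LLM-as-judge grading for free-form math answers,
# in the spirit of U-MATH / mu-MATH. The judge prompt wording and the
# `call_llm` helper are assumptions, not Toloka's actual evaluation code.

JUDGE_PROMPT = """You are grading a math solution.
Problem: {problem}
Reference answer: {reference}
Model's solution: {solution}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_solution(problem: str, reference: str, solution: str, call_llm) -> bool:
    """Ask a judge LLM whether the free-form solution matches the reference."""
    verdict = call_llm(JUDGE_PROMPT.format(
        problem=problem, reference=reference, solution=solution
    ))
    return verdict.strip().upper().startswith("CORRECT")
```

A meta-evaluation dataset like μ-MATH checks this judging step itself: the judge's verdicts are compared against expert labels to see how often the judge gets the grading right.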
We tested a variety of small, large, and proprietary LLMs on our benchmark, and the experiments produced some surprises. The biggest was that GPT-4o wasn't the best performer: it solved only 43% of the problems, while the open-weight Qwen2.5 model family reached 50%. Both were surpassed by Gemini 1.5 Pro, which solved 63% of the text-only tasks and 45% of the tasks requiring image processing, for 60% on U-MATH overall. Discover more findings in our blog post about U-MATH.
To test your LLM's math capabilities on this dataset, download it here.
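If you work in Python, the dataset can be pulled with the Hugging Face `datasets` library. The repository ID and split name below are assumptions; check the download page above for the exact values.

```python
# One way to pull the benchmark for local testing, assuming it is published
# on Hugging Face under an ID like "toloka/u-math". Verify the repository ID
# and available splits on the dataset page before running.

from datasets import load_dataset

u_math = load_dataset("toloka/u-math", split="test")  # ID and split assumed
print(len(u_math), "problems")
print(u_math[0])  # inspect one problem record
```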
Benchmark dataset for improving LLMs in the natural sciences
Existing STEM benchmarks don't accurately assess the depth of subject knowledge or represent real-world problems in research and industry. Toloka's research team created a specialized benchmark dataset covering several domains to evaluate how well LLMs handle complex science-related questions.
For this project, we collaborated with a team of domain experts who are active researchers in fields such as high-energy physics, immunology, and cell biology. The result is a dataset of 180 questions spanning ten subdomains.
We tested several popular models on this benchmark. Unsurprisingly, GPT-4 beat the other models in all domains except bioinformatics, where Llama 3 outperformed all of them. We identified significant issues in the responses from every LLM we tested, indicating severe limitations in their reliability as information sources for natural science topics.
If you are interested in conducting similar evaluations of your LLM for a domain of your choice, please contact us to discuss.
Read more about Toloka's specialized benchmark dataset for natural sciences here.
A boost for AI detection – Beemo
The detection of AI-generated content is a problem relevant to all domains, regardless of complexity. Reliable AI-generated text detection is vital for robust model development: it helps address ethical and legal concerns about training datasets, maintains their quality, and safeguards the performance of the final model.
To contribute to stronger benchmarking for AI detection, we developed Beemo in collaboration with NLP researchers from the University of Oslo and Penn State University. The main difference from existing benchmarks is that we not only compared LLM-generated texts to human-written texts but also included human-edited versions of AI-generated text for greater nuance in AI detection.
There are several ways you can use Beemo for your own experiments:
Benchmark AI detection systems. If you feed machine-generated and human-edited texts to an AI detector, in many cases it will classify the human-edited texts as human-written.
Explore the robustness of AI detectors with respect to a diverse set of LLMs and prompt categories.
Train your own nuanced AI detectors that can recognize edits made to machine-generated texts.
Beemo is available on Hugging Face.
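As a starting point, here is a minimal Python sketch for probing a detector with Beemo via the Hugging Face `datasets` library. The repository ID, split name, and the stand-in detector are assumptions; replace them with the details from the dataset card and your own detector.

```python
# Sketch of benchmarking an AI detector against Beemo, assuming the dataset is
# hosted on Hugging Face under an ID like "toloka/beemo" with both the raw
# machine-generated text and its human-edited version per record. Repository
# ID, split, and column names are assumptions to verify on the dataset card.

from datasets import load_dataset

beemo = load_dataset("toloka/beemo", split="train")  # ID and split assumed

def my_detector(text: str) -> bool:
    """Stand-in for the AI detector under test: True means 'flagged as AI'."""
    return False  # plug in a real detector here

example = beemo[0]
print(example.keys())  # inspect which text variants the record provides
# A detector that is robust to light human editing should still flag the
# human-edited variant, not just the raw machine-generated text.
```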
What’s next
Even the most advanced LLMs have limitations, and benchmarks offer an effective method for identifying them. Uncovering these challenging areas highlights the need for targeted fine-tuning.
If you want to discover more areas for LLM improvement and spotlight them in the market, our research team is open to collaboration. Currently, we are working on several benchmarking projects, including other exciting domains, such as Arabic languages. Reach out with your research proposals.
Updated: Dec 6, 2024