LLM Evaluation In Action: Should You Trust Automated Metrics or Human Judgment?
As LLMs (Large Language Models) move toward widespread adoption, we expect they will be integrated into many consumer products in 2025. Yet one question is paramount: How do we measure their true capabilities? Evaluating these models isn’t just about ranking performance—it’s about uncovering their strengths and limitations, comparing models to guide improvements, and ensuring they deliver reliable, trustworthy results in real-world scenarios.
LLM evaluation remains a complex challenge. In this article, we break down current evaluation methods, discuss when and how to apply different approaches, and share practical examples from our evaluation projects at Toloka, where we tackle these challenges daily to help build smarter AI.
Computational evaluation
The first type of evaluation that can be performed is computational evaluation, which originates from traditional machine-learning evaluation methods. It relies on a gold standard sample and a statistical scoring method to compare a model’s output against this reference. The result is a single score that quantifies how closely the model's output aligns with the reference standard. In other words, each task has a correct answer and a score that shows if the model “got it right”.
Commonly used metrics include ROUGE, BLEU, BERTScore, BARTScore, and others. This approach is fast, scalable, and effective for simple tasks with clear reference outputs. However, computational metrics often correlate weakly with human judgment on complex, open-ended tasks. It's best to reserve this type of evaluation for simple tasks and quick sanity checks while tuning models, and to augment it with human evaluation or other automated rating approaches for anything more open-ended.
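As a quick illustration, here is a minimal sketch of what a computational check might look like, assuming the open-source rouge-score and sacrebleu Python packages are installed; the strings are toy examples rather than data from a real evaluation.

```python
# Minimal sketch: scoring one model output against a gold reference.
from rouge_score import rouge_scorer
import sacrebleu

reference = "The cat sat on the mat."
model_output = "A cat was sitting on the mat."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, model_output)

# BLEU: modified n-gram precision against one or more references
bleu = sacrebleu.sentence_bleu(model_output, [reference])

print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
print(f"BLEU: {bleu.score:.1f}")
```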
Human evaluation
Human evaluation is the holy grail for evaluating LLMs, especially when the raters are trained specialists. The process works by prompting the model with a question, then asking a human expert to evaluate the answer against pre-specified criteria such as correctness, fluency, clarity, or coverage of specific concepts. Evaluation can be done pointwise (assessing a single response) or side-by-side (comparing responses from different models). This is the most reliable type of evaluation, and unlike computational evaluation, which only works for text output, human evaluation is suitable for all output modalities, including video and audio.
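To make the two formats concrete, here is a minimal sketch of what the resulting rating records might look like; the field names and score scale are illustrative, not a fixed schema from our pipeline.

```python
from dataclasses import dataclass

@dataclass
class PointwiseRating:
    """One expert scores a single model response against each criterion."""
    prompt: str
    response: str
    scores: dict[str, int]  # e.g. {"correctness": 4, "fluency": 5, "clarity": 4}

@dataclass
class SideBySideRating:
    """One expert compares two models' responses to the same prompt."""
    prompt: str
    response_a: str
    response_b: str
    preference: str      # "A", "B", or "tie"
    rationale: str = ""  # optional free-text justification
```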
The downside is that human evaluation can be expensive and time-consuming. For complex tasks, it can also be challenging to find skilled experts who are qualified to perform high-quality evaluations. At Toloka, we rely on the Mindrift platform to access an extensive network of domain experts and skilled annotators with specialized training and sophisticated quality control processes.
Automated evaluation
Automated evaluation is similar to the human evaluation described above, except that the human judge is replaced with a strong LLM. The judge model scores the answer against several criteria or performs a side-by-side comparison, just like a human expert would. This method is suitable for complex, open-ended tasks, yet it is far less time-consuming than human evaluation and works well for iterative measurements once an evaluation dataset has been created. Its main weakness is that it inherits the judge model's biases, so the LLM used for evaluation must be chosen carefully. Interestingly, research has shown that LLMs that solve problems exceptionally well are not always the best at judging solutions. Judging capabilities should be evaluated separately before relying on an LLM for automated evaluation.
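As an illustration, here is a minimal LLM-as-judge sketch, assuming the openai Python client; the judge model name, rubric list, and 1–5 scale are placeholder choices, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an expert evaluator. Score the answer below against
each rubric on a 1-5 scale and return JSON only, for example:
{{"correctness": 4, "clarity": 5, "mentions_key_concept": 3}}

Question: {question}
Answer: {answer}
Rubrics: {rubrics}
"""

def judge(question: str, answer: str, rubrics: list[str], model: str = "gpt-4o") -> dict:
    """Ask the judge model to score one answer against the given rubrics."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer=answer, rubrics="; ".join(rubrics)
            ),
        }],
    )
    # A production pipeline would validate or repair the JSON; this sketch
    # assumes the judge complies with the format instruction.
    return json.loads(response.choices[0].message.content)
```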
Building datasets for rubric-based auto evaluation
At Toloka, we support LLM producers with several types of evaluations, including automated evaluation, which is best suited for text and reasoning problems. In addition to the standard LLM-based evaluation, we implement a human-in-the-loop approach to ensure that LLM judgments align with human evaluations, providing a final score for prompts across different criteria.

Visual: Prompts in the dataset are judged by the LLM and validated by human experts to get a final score
Building datasets for this approach is challenging, but it is highly effective for complex and niche-domain problems. To create a comprehensive dataset, we involve human domain experts in a four-step process (a sketch of a resulting data point follows the list):
Creating prompts – Experts write prompts relevant to specific niche domains.
Defining evaluation rubrics – Experts establish criteria to assess responses. Some rubrics may apply to the entire dataset (e.g., language fluency), while others may be question-specific (e.g., "Does the answer explain a specific concept?").
Generating a gold-standard answer – Experts write a reference answer that model outputs are compared against.
Validating data points – Experts review the dataset to ensure that prompts, reference answers, and rubrics align. If they encounter issues, they revise and refine the dataset.
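For illustration, a single data point produced by this process might be represented roughly as follows; the field names are hypothetical, not our internal schema.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    question: str  # what the judge checks, e.g. "Does the answer cite the correct statute?"
    scope: str     # "dataset" for shared rubrics, "item" for question-specific ones

@dataclass
class EvalItem:
    prompt: str              # step 1: expert-written, domain-specific prompt
    rubrics: list[Rubric]    # step 2: criteria responses are assessed against
    gold_answer: str         # step 3: expert-written reference answer
    validated: bool = False  # step 4: set to True after expert review of the full item
```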
We have used this evaluation approach to test models across various domains, such as Law, Acoustics, Engineering, Mathematics, Natural Sciences, and Medicine. To achieve the highest quality of evaluation data, we collaborate with highly qualified experts—most of our experts have Master’s or PhD degrees and industry experience in relevant fields.
Human evaluation for video and audio domains
For creative fields, automated evaluation is not yet effective. We strongly recommend human evaluation for video and audio content, with an evaluation process designed specifically for each modality and use case.

Visual: Aggregated results of human evaluation of a video.
To create datasets for creative fields, we developed a specialized toolkit in collaboration with experts from the film industry, including producers and graphic designers. This toolkit enables the rapid creation of complex prompts and their corresponding evaluation criteria. Experts can customize prompts by selecting factors such as prompt complexity, desired features (e.g., the main focus of the video, character traits like age or species, and background details), and camera settings (e.g., location, angle, and movements).

Visual: Dataset Toolkit for evaluating creatives.
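To give a sense of the idea, a prompt specification built with such a toolkit might look roughly like this; the keys and values below are illustrative, not the toolkit's actual schema.

```python
# Hypothetical prompt specification for a video-generation evaluation item.
video_prompt_spec = {
    "complexity": "high",
    "features": {
        "main_focus": "a street musician playing violin",
        "character": {"age": "elderly", "species": "human"},
        "background": "rainy city square at dusk",
    },
    "camera": {
        "location": "street level",
        "angle": "low angle",
        "movement": "slow dolly-in",
    },
    # Each selected factor can seed a matching evaluation criterion.
    "evaluation_criteria": [
        "Does the video focus on a street musician playing violin?",
        "Is the character an elderly human?",
        "Is the scene set in a rainy city square at dusk?",
        "Is the camera movement a slow dolly-in?",
    ],
}
```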
Our approach is a scalable and effective way to create evaluation datasets for creative models. When used in conjunction with human evaluation, it gives model producers a detailed picture of the strengths and weaknesses of their models.
Do you need an in-depth evaluation of your model?
Before pushing a model to market, you may want to run multiple types of evaluations to understand how it will perform in real-life scenarios. We can support you with any approach to evaluation, whether it’s evaluating complex text and reasoning capabilities or creative outputs. Custom evaluation datasets can align model performance with your expectations for specific domains, skills, and scenarios.
Connect with our team to get a tailored solution.
Updated:
Mar 3, 2025