An Annotator's Perspective: Building a Dataset to Challenge LLM Evaluation
Toloka is supporting new research at CLEF 2025 through the ELOQUENT lab, which focuses on evaluating the real-world usefulness of generative language models. As part of the lab’s work on modelling human preferences, Zoia Butenko – a researcher at the University of Oslo’s Language Technology Group – led the annotation effort for a new dataset.
Zoia shares how the dataset came together and what the team learned along the way.
Why modelling human preference is harder than it sounds
Most of us naturally feel good when our ideas resonate with others. There's something deeply human about wanting our thoughts to be understood and valued. We appreciate it when people connect with our perspectives and when our contributions feel meaningful to a conversation.
This idea of validation, however, was turned completely on its head with a recent infamous ChatGPT update. The update in question made the model so agreeable that users described it as sycophantic. Since the model would agree with virtually anything the user stated, and praise any idea they presented, it was almost completely useless. The experience was so extreme that many found it more amusing than helpful.
That’s just one example of the unpredictable nature of human preference. Sometimes we have no idea how to train AI agents because we don’t know what the user actually wants. Moreover, users themselves may not know what they like until they see it. This is why the Human Preference Prediction dataset was created, and why I led the annotation team behind it.
What follows is the process and my own perspective on building such an important resource.
Preparing the methodology
The process began by collecting responses from several LLMs to prompts such as ‘How do I take care of a wooden table?’ These responses were then paired and sent to experienced Mindrift annotators for evaluation. Although presented in pairs, the setup allowed five LLMs from the Llama, Claude and GPT families to be assessed against one another for broader comparison. Evaluating one pair at a time made the task far more manageable than trying to assess all models simultaneously.
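As an illustration, here is a minimal sketch (in Python, not the project’s actual pipeline) of how pairwise tasks might be assembled from several models’ answers to a single prompt; the model names and response texts below are placeholders.

```python
# Minimal sketch of building pairwise annotation tasks from model responses.
# Model names and response texts are placeholders, not the dataset's contents.
from itertools import combinations

prompt = "How do I take care of a wooden table?"

responses = {
    "model_1": "Dust it regularly and wipe spills with a soft, damp cloth...",
    "model_2": "Keep it away from direct sunlight and use coasters...",
    "model_3": "Apply a wood polish or oil every few months...",
}

# Every unordered pair of models becomes one comparison task for annotators.
tasks = [
    {"prompt": prompt, "response_A": responses[a], "response_B": responses[b]}
    for a, b in combinations(responses, 2)
]

print(len(tasks))  # 3 pairs here; 5 models would yield 10 pairs per prompt
```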

Surprisingly, even deciding between two responses is a complicated process. The responses might have different strengths and problems. How do the annotators even begin to choose which one is better?
We came up with specific criteria to guide the annotators.
The 5 criteria we used to judge responses
Relevance: how well does the response answer the question or request in the prompt?
Naturalness: how human-like is the response? Does it look AI-generated?
Truthfulness: how factually correct is the response?
Safety: is the response safe for users? Does it contain potentially harmful or offensive material?
Overall quality: which response is better overall?
The criteria had to be evaluated completely independently of each other to highlight potential trade-offs between them. For example, a response can be safe but low in relevance if it avoids providing information on potentially disturbing topics. Similarly, a hallucinated (i.e., untruthful) response can still read naturally and appear convincing. In addition, the annotators needed to provide a justification for their choice under each criterion – usually two to three sentences.
For instance, ‘I prefer B because it answered all parts of the prompt, while A only focused on the first part’.
Some criteria were more demanding than others: ‘Truthfulness’ required fact-checking any claims made by the models and providing sources, but it was at least objective.
The ‘overall quality’ criterion, on the other hand, didn’t require any knowledge or research, but it was almost entirely subjective and left up to annotators to decide what is most important.
Having criteria helps, but I know from experience that the choice can still be difficult. Fortunately, in addition to choosing one of the two responses, there were three other options: ‘both are good’, ‘both are bad’, and ‘I don’t know’.
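To make the output of the task concrete, here is one way a completed annotation could be represented; the field names are my own, but the verdict options and the short written justification mirror what annotators actually provided.

```python
# Sketch of a single completed annotation; field names are illustrative.
from dataclasses import dataclass
from typing import Literal

# For each criterion, annotators picked A, B, or one of the three extra options.
Verdict = Literal["A", "B", "both_good", "both_bad", "dont_know"]

@dataclass
class CriterionJudgement:
    verdict: Verdict
    justification: str  # usually two to three sentences explaining the choice

@dataclass
class Annotation:
    prompt: str
    response_A: str
    response_B: str
    relevance: CriterionJudgement
    naturalness: CriterionJudgement
    truthfulness: CriterionJudgement
    safety: CriterionJudgement
    overall_quality: CriterionJudgement
```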
The early stages of the project brought a steady stream of questions and sparked discussion, both within our team and among annotators. Where is the line between ‘human-like’ and ‘too human-like’? What can be deemed offensive or dangerous?
Some of the discussions even challenged my own understanding and expectations of AI and AI alignment. I found myself pondering more and more what I was actually looking for in a response, and hearing other perspectives from my colleagues seemed only to confuse me further. Whatever the question was, there were simply too many valid points of view and preferences in the team. I began to think that we need to develop LLMs that are more equitable and accountable to different opinions, needs, and values – and that there is no one-size-fits-all AI technology or ‘objective’ set of guidelines.
The annotation stage
First, the annotators were given the annotation guidelines, which included:
A detailed definition of each criterion
Key indicators to watch for during evaluation
Specific guidance on what to apply or avoid
Clarity was especially important to make sure annotators shared a consistent understanding of the task. The guidelines were also refined as the project progressed. For instance, the ‘Naturalness’ criterion was difficult for some annotators because LLMs nowadays produce extremely human-like responses. We added instructions to look out for ‘AI-sounding’ phrases, such as ‘Certainly! Here’s…’.
After familiarizing themselves with the guidelines, annotators completed five tasks as their training stage. To make sure the task was clear, we gave detailed feedback on every submission during the training stage.
Once we were confident the annotators were ready, the main annotation phase began. Ongoing checks and occasional corrections helped keep standards high, though annotators were now working independently.
Although the task might seem simple, annotation often took a long time. This was especially true when models generated lengthy responses. Some responses were long enough to take several minutes just to read. Annotators then had to apply all five criteria, often revisiting the text, comparing the responses, making a decision, and writing a concise justification.
On average, one task took about 10 to 15 minutes. Annotators were also instructed to skip tasks that would take over 30 minutes, or that required university-level knowledge in a specific field.
The number of tasks given to each annotator was capped at 30 a day to prevent bias, since some people completed many more tasks than others.
What we learned about human preference, and how to model it
Understanding human preferences goes beyond the technical challenge. We worked to strike the right balance between objective structure and the unpredictability of human preference. As a result, we collected preferences for 1,242 response pairs for five different LLMs.
The dataset will support one of the evaluation tasks at ELOQUENT Lab 2025, focused on assessing the quality of generative language models. Specifically, the Preference Prediction task will focus on building systems that can anticipate human choices between different LLM outputs.
Baseline evaluation: How well do LLMs understand human preference?
To test the value of the dataset – and provide a benchmark for future systems – Toloka ran a baseline evaluation using GPT‑4.1‑nano (April 2025 release). The goal was to see how closely an existing LLM could match human judgements across the five preference criteria.
The model was asked to do two things:
Predict which of two LLM-generated responses (A vs B) a human would prefer, across each criterion
Generate a natural-language explanation for its prediction
The baseline serves as a reproducible, minimal performance floor. Its predictions were compared against the “golden” labels collected from human annotators to assess how close the model came to human-level judgement.
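The exact prompt used for the baseline isn’t published here, but a minimal sketch of how such a query could look with the OpenAI Python client is shown below; the instruction wording and function name are illustrative assumptions.

```python
# Sketch of querying the baseline model; the prompt wording is illustrative,
# not the exact instruction used in the evaluation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CRITERIA = ["relevance", "naturalness", "truthfulness", "safety", "overall quality"]

def predict_preference(prompt: str, response_a: str, response_b: str) -> str:
    instruction = (
        "You will see a user prompt and two candidate responses, A and B.\n"
        f"For each criterion ({', '.join(CRITERIA)}), state which response a human "
        "would prefer and explain your choice in two to three sentences.\n\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
    )
    completion = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": instruction}],
    )
    return completion.choices[0].message.content
```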
The model achieved an average accuracy of 29.7% across all criteria. Performance was highest for overall quality (49.6%), but much lower for truthfulness and safety (both 17.95%), well below human-level reliability.
For the explanation task, the model’s outputs were evaluated using multiple metrics: accuracy, ROUGE‑L, BERTScore, and GPT‑4o as a judge. It scored 29.7% (accuracy), 8.40 (ROUGE‑L), 83.13 (BERTScore), and 24.27 (LLM-as-a-judge). While the language was relatively fluent, the explanations lacked factual accuracy and persuasive clarity.
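As a rough illustration of how the explanation metrics can be computed, the sketch below uses the rouge-score and bert-score Python packages; whether these exact implementations were used for the baseline is an assumption, and the GPT-4o judging step is omitted.

```python
# Sketch of scoring a predicted explanation against a human-written one.
# Assumes the rouge-score and bert-score packages; example texts are made up.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "I prefer B because it answered all parts of the prompt."
prediction = "B is better since it covers every part of the question."

# ROUGE-L: longest-common-subsequence overlap with the human justification.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# BERTScore: embedding-based semantic similarity between the two explanations.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"ROUGE-L: {rouge_l:.2f}  BERTScore F1: {f1.mean().item():.2f}")
```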
These results underline how difficult it remains for even strong models to predict human preferences, especially across fine-grained, subjective criteria.
We hope this dataset will help create AI that’s more aligned with human judgement.
Looking to evaluate your LLM with real human insight? Get in touch with Toloka.