Evaluating Model Reasoning with Rubrics: Building a Domain-Specific Evaluation Dataset

May 27, 2025

News

When your AI model aces every standard benchmark but struggles with real-world tasks, you know something's missing in your evaluation approach.

Even strong foundation models need to be pushed to their limits. A top AI producer asked us to take the first step toward evaluating their model’s reasoning capabilities: create a high-quality dataset of complex prompts across sophisticated domains like medicine and linguistics. 

Precise, well-scoped prompts form the foundation of a good benchmark: they determine which skills we measure, which domains we cover, and how fairly we treat different styles of reasoning. Traditional evaluation methods built on generic criteria like "helpfulness" simply fell short for prompts like these. We needed something more nuanced and granular.

Our solution: a multi-step evaluation framework built around domain-specific grading rubrics that dissect model responses with surgical precision. Rather than asking "Is this answer correct?", we ask questions like "Does this answer mention the following key concepts?" along with dozens of other targeted criteria that expose exactly where the model performs well and where it falls short. 

In this article, we'll walk you through our rubric-based methodology, how it forms the backbone of our evaluation process, and why it might be the missing piece in your own model assessment toolkit.

The challenge: Building prompts with depth and nuance

The model developer shared the specific topics where they had focused their improvement efforts, including medicine and linguistics. Our evaluation dataset targeted these domains while testing the model's reasoning skills.

Since large language models can solve routine problems and perform well on most benchmarks, we needed a way to push their limits with more complex tasks. The goal was to create a set of problematic prompts that break the model in its current state. This is conceptually similar to red-teaming, but instead of crafting adversarial prompts, we used scoring criteria to probe the model's weaknesses.

Standard evaluation methods typically compare a model's outputs to those of a competitor or to a human-written reference, using either human judgment or an LLM as the evaluator. However, as the questions become increasingly complex, comparing answers directly with these methods becomes almost impossible. Two proofs of the same theorem can vary significantly, making it difficult to find common ground for comparison. Fixed criteria such as truthfulness and helpfulness are insufficient because a single answer may contain numerous facts that each need to be verified.

For meaningful evaluations, complex topics require more granular criteria that consider the depth, context, and nuanced details of the answers.

Our solution: A comprehensive rubric-based evaluation system 

Toloka's rubric-based evaluation system solves the challenges arising from standard evaluation techniques. Our grading rubrics set clear expectations for each answer, providing a straightforward way to judge the quality and accuracy of model output. Notably, this approach minimizes biases by not relying on a single "gold answer." 

Another key feature is the iterative refinement of prompts to challenge the model and identify its weaknesses. The process starts with posing a question and generating answers, which an expert then evaluates for accuracy. If the answer is correct, we modify the prompt, adding complexity or requesting more detail, until we get an inadequate answer. In short, we ask a question, receive an answer, and keep refining the prompt until the model fails to answer correctly.
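
In pseudocode, the refinement loop might look like the sketch below. The three placeholder functions are hypothetical stand-ins for the model under test, the expert review, and the expert's prompt edits; none of them are part of Toloka's actual tooling.

```python
# Illustrative sketch of the iterative prompt-hardening loop.

def generate_answer(prompt: str) -> str:
    """Placeholder: query the model being evaluated."""
    raise NotImplementedError

def expert_judges_correct(prompt: str, answer: str) -> bool:
    """Placeholder: expert verdict on whether the answer is adequate."""
    raise NotImplementedError

def add_complexity(prompt: str) -> str:
    """Placeholder: expert rewrites the prompt to demand more depth or detail."""
    raise NotImplementedError

def harden_prompt(prompt: str, max_rounds: int = 5) -> str | None:
    """Refine a prompt until the model fails; return None if it never does."""
    for _ in range(max_rounds):
        answer = generate_answer(prompt)
        if not expert_judges_correct(prompt, answer):
            return prompt                    # found a prompt the model can't handle
        prompt = add_complexity(prompt)      # model still copes, so raise the bar
    return None
```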

We can use this approach to generate evaluation datasets for any knowledge domain or model skill.

The data generation process

Step 1: Creating the prompts

Experts from selected domain fields create the prompts. To ensure that the questions cover a range of topics, we create a detailed taxonomy of narrower topics within a broad subject and collaborate with experts in these specific fields. For example, within linguistics, we would focus on specific areas like morphology or semantics, and in medicine, we would look at pediatrics or orthopedics.
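
As a rough illustration, the taxonomy can be kept as a simple mapping that routes prompt assignments to experts in each subfield. The subtopic lists below are examples for illustration, not the actual taxonomy.

```python
# Example taxonomy snippet; the subtopic lists are illustrative only.
TAXONOMY = {
    "linguistics": ["morphology", "semantics", "syntax", "pragmatics"],
    "medicine": ["pediatrics", "orthopedics", "cardiology", "infectious diseases"],
}

def assign_subtopics(domain: str, n_prompts: int) -> list[str]:
    """Spread prompt assignments evenly across a domain's subtopics."""
    subtopics = TAXONOMY[domain]
    return [subtopics[i % len(subtopics)] for i in range(n_prompts)]

print(assign_subtopics("linguistics", 6))
# ['morphology', 'semantics', 'syntax', 'pragmatics', 'morphology', 'semantics']
```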

The prompts should sound natural, include adequate context, and not leave room for ambiguous answers. The dataset can consist of tricky reasoning questions, but not tricky retrieval questions—the answers must be clearly contained within the context.

Step 2: Collecting the answers

In this step, we generate answers using a selected LLM. Since the model can be inconsistent and respond differently to the same prompt, collecting multiple answers helps us confirm that the prompt is consistently challenging and the model is not producing errors by chance.
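
A minimal sketch of this check is shown below, assuming a hypothetical `generate_answer` call to the model under test and an `answer_is_adequate` verdict from the expert or rubric.

```python
# Sample several answers per prompt to confirm failures are systematic,
# not one-off sampling noise. Both helpers below are hypothetical placeholders.

def generate_answer(prompt: str) -> str:
    """Placeholder: query the model under test (with non-zero temperature)."""
    raise NotImplementedError

def answer_is_adequate(prompt: str, answer: str) -> bool:
    """Placeholder: expert or rubric-based verdict on a single answer."""
    raise NotImplementedError

def failure_rate(prompt: str, n_samples: int = 5) -> float:
    """Fraction of sampled answers that fail; values close to 1.0 mean the
    prompt is consistently challenging rather than occasionally unlucky."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    failures = sum(not answer_is_adequate(prompt, a) for a in answers)
    return failures / n_samples
```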

After that, we evaluate the prompts against formal criteria, such as objectivity, clarity of context, question specificity, reasoning, and difficulty level. We also include automated checks that explain the grading decisions in accordance with these standards.

Step 3: Establishing rubric criteria for evaluation

We are now ready to develop the rubric criteria, which consist of yes/no questions to evaluate if a response meets specific requirements. Correct answers should pass these criteria, while incorrect answers should fail. The primary goal of the rubric is to automatically verify the validity of an answer.

The criteria are currently created by human experts and focus on factual accuracy, content quality, and reasoning. To automate the process, we are introducing an additional step in the pipeline where an LLM generates a preliminary draft of the rubric based on the context of the prompt.
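
One plausible way to represent a rubric in code is sketched below; the dataclass fields and the drafting prompt are assumptions for illustration, not Toloka's internal schema.

```python
# Illustrative rubric schema: each criterion is a yes/no question that a
# correct answer is expected to pass.
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    question: str   # e.g. "Does the answer mention concept X?"
    category: str   # "factual accuracy", "content quality", or "reasoning"

@dataclass
class Rubric:
    prompt_id: str
    criteria: list[RubricCriterion] = field(default_factory=list)

# A hypothetical drafting prompt for the LLM that proposes a first rubric
# draft; domain experts then review, correct, and extend it.
DRAFT_RUBRIC_PROMPT = (
    "You are an expert in {domain}. Given the question below, draft a list of "
    "yes/no criteria that a complete, correct answer must satisfy. Cover "
    "factual accuracy, content quality, and reasoning.\n\nQuestion:\n{question}"
)
```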

The rubrics we develop at this stage will later be given to an LLM judge during automated evaluations. They will guide the model in assessing answers in a consistent and structured way. So, how do we automatically evaluate these criteria? Each evaluation starts with a prompt, such as:  "You are an expert evaluator in linguistic domains. Here's a question, along with the criteria. Please assess the provided answer against each criterion." The judging model then gives a verdict of either pass or fail for each criterion.

The judging model's role is to compare the answer against the criteria and determine whether it meets the expectations. We do not require the model to create an original response from scratch; the key skill for our evaluator model is the ability to extract and comprehend text.
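
Putting the pieces together, the judging step might look like the sketch below, where `call_llm` is a hypothetical wrapper around whichever chat-completion endpoint is used and the template paraphrases the instructions quoted above.

```python
# Illustrative LLM-judge step: one pass/fail verdict per rubric criterion.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to the judging model and return its reply."""
    raise NotImplementedError

JUDGE_TEMPLATE = (
    "You are an expert evaluator in the {domain} domain.\n"
    "Here is a question, a candidate answer, and one evaluation criterion.\n"
    "Reply with exactly PASS or FAIL.\n\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Criterion: {criterion}\n"
    "Verdict:"
)

def judge_answer(domain: str, question: str, answer: str,
                 criteria: list[str]) -> dict[str, bool]:
    """Return a pass/fail verdict for each rubric criterion."""
    verdicts: dict[str, bool] = {}
    for criterion in criteria:
        reply = call_llm(JUDGE_TEMPLATE.format(
            domain=domain, question=question, answer=answer, criterion=criterion))
        verdicts[criterion] = reply.strip().upper().startswith("PASS")
    return verdicts
```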

Step 4: Writing the correct answer

In this step, the experts write the correct answer for each prompt while ensuring it meets all the rubric criteria. While we don't use this answer as a standalone reference, we include it in the rubrics to assist the LLM judge in making more accurate assessments.

Step 5: Rating prompt difficulty

Experts rate the difficulty of each prompt on a scale from 1 to 5, where 1 represents the easiest and 5 the most challenging level. This is necessary for understanding the overall coverage and complexity of the set. The distribution of these difficulty labels shows how challenging the test set is as a whole, and the development team can score the model on specific parts of the dataset depending on the complexity they need.
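
For example, with difficulty stored as a field on each prompt record (the field names below are assumptions), the distribution and a hard-only slice are easy to compute.

```python
# Summarize the difficulty distribution and select a slice for targeted scoring.
from collections import Counter

prompts = [
    {"id": "p1", "domain": "medicine", "difficulty": 4},
    {"id": "p2", "domain": "linguistics", "difficulty": 5},
    {"id": "p3", "domain": "medicine", "difficulty": 3},
]

distribution = Counter(p["difficulty"] for p in prompts)
print(dict(distribution))              # {4: 1, 5: 1, 3: 1}

hard_subset = [p for p in prompts if p["difficulty"] >= 4]
print([p["id"] for p in hard_subset])  # ['p1', 'p2']: score only the hard prompts
```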

Step 6: Adding contextually irrelevant documents

In the last stage, we add supplementary documents that are intentionally irrelevant to the question. For instance, if a user is analyzing reports on a single company, the context might also include monthly reports from five other companies. The goal is to test whether the model stays focused on the relevant material instead of getting sidetracked by the other reports in context.
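
A minimal sketch of this step, assuming each example keeps its relevant documents separate from a pool of unrelated ones (the names are illustrative):

```python
# Mix relevant documents with irrelevant ones so the model has to stay focused.
import random

def build_context(relevant_docs: list[str],
                  distractor_pool: list[str],
                  n_distractors: int = 5,
                  seed: int = 0) -> list[str]:
    """Return a shuffled context containing the relevant documents plus a
    sample of off-topic distractors."""
    rng = random.Random(seed)
    k = min(n_distractors, len(distractor_pool))
    context = relevant_docs + rng.sample(distractor_pool, k)
    rng.shuffle(context)
    return context
```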

The outcome: An efficient prompt pipeline

To gain a better understanding of the entire process, let's explore the data generation pipeline.

In the main generation stage, we create prompts by following the steps outlined above. After passing the initial automated checks, these prompts are forwarded to our editors for review. If the task requires minor changes, editors can make the necessary adjustments and send the prompt to the next step. They can also send it back for revision if they find bigger issues or disagree with the content. Once editing is complete, we conduct a final check to ensure the language quality is suitable; however, this step can be omitted since the data isn't used for training.

Using the outlined approach, we generate SFT-like data that includes prompts, answers, and rubrics. We collaborate with domain experts who write the prompts and use automated tools to scale and expedite the process. Our focus is not on evaluating the model here, but on building a comprehensive dataset for future evaluations.
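
Concretely, each record in the resulting dataset might be serialized along the lines of the JSON sketch below; the field names are illustrative, not the actual delivery format.

```python
# Illustrative dataset record combining prompt, context, reference answer,
# rubric, and metadata.
import json

record = {
    "domain": "linguistics",
    "subtopic": "semantics",
    "prompt": "<expert-written, domain-specific question>",
    "context_docs": ["<relevant document>", "<distractor document>"],
    "reference_answer": "<expert answer that satisfies every criterion>",
    "rubric": [
        {"question": "Does the answer mention concept X?", "category": "factual accuracy"},
        {"question": "Is the reasoning free of contradictions?", "category": "reasoning"},
    ],
    "difficulty": 4,   # expert rating on the 1-5 scale
}

print(json.dumps(record, indent=2))
```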


Business impact: A real-world evaluation framework with long-term benefits 

We created a custom evaluation dataset focused on real-world, domain-specific tasks to stress-test the model in use cases relevant to our customer’s needs. Automated iterative evaluation is more effective than standard expert assessments because it provides a clearer picture of the model's performance and helps us track progress over time. 

Rubric-based evals offer detailed control for tweaking model skills in complex domains. Ultimately, this methodology helped the customer understand the performance of their model and provided a foundation for ongoing improvements.

Want to build more reliable benchmarks for your models?

Reach out to explore how custom datasets and rubric-based scoring can improve your LLM evaluation process.
