
Human-powered evaluation: Actionable feedback for next‑gen video diffusion models

May 20, 2025

Insights

AI judges are the modern heroes of automated evaluation, but can they handle videos?

Large language models (LLMs) are increasingly used to evaluate other AI systems, scoring and critiquing the text their cousins generate. It's a practical solution with clear benefits: automated evaluation is faster, more cost-effective, and easier to scale than human evaluation.

But not when it comes to video.

While LLM judges excel at assessing text, they struggle with moving images. To evaluate an AI-generated video, the model must understand temporal coherence, motion realism, visual quality, contextual consistency, and overall aesthetics. These are areas where human perception continues to outperform current automated methods.

Picture a scene: a dog chases a ball across a beach as waves roll in and paw prints appear in the sand. A human can instantly tell whether the scene looks natural and believable, noticing subtle details like whether the dog's movements match its size and breed, or whether the paw prints align with the dog's path. In contrast, a language model acting as a judge doesn't reason about motion, physics, or visual aesthetics in the same integrated way.

Beyond the limitations of LLM judges, there are practical challenges to automated video evaluation. Comprehensive visual benchmarks like MMStar and MMT-Bench are known to include mislabeled data, which casts doubt on their results. More importantly, automated metrics offer limited insight into what is wrong with a video or how it can be improved, which limits their value for comparing models and guiding development.

In response to these challenges, we developed a human-centered evaluation framework designed specifically for video diffusion models, and used it to evaluate four top video models to reveal their strengths and weaknesses.

Introducing Toloka’s video evaluation toolkit 

Toloka built the Mainstream Movies video eval toolkit to bridge the gap between automated and human evaluation of video diffusion models. The toolkit offers a set of 500 text-to-video prompts of varying complexity that cover about 85 percent of typical movie scenes, along with guidelines and structure for evaluating the videos generated for each prompt.

The toolkit was developed with leading industry experts, production designers, and visual‑effects and computer‑graphics supervisors from major studios such as Lionsgate and the BBC, who worked on blockbusters like Dawn of the Planet of the Apes, Wonder Woman, and Kingsman: The Golden Circle. Their expertise helped us design a robust prompt creation pipeline, pictured below, and ensured the prompts met real-world cinematic standards.

Using the toolkit, a trained team of experts evaluates the videos generated by the model for each prompt. They score each video on complexity, realism, and alignment, broken down into detailed subcategories as shown in the image.
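For concreteness, here is a minimal sketch of how that rubric could be represented in code. The category and subcategory names are taken from the results tables below, and the 1–5 Realism and 1–3 Complexity scales are stated later in the post; the Alignment scale and the field layout itself are illustrative assumptions, not Toloka's actual schema.

```python
# A minimal sketch of the evaluation rubric as a nested structure.
# Category and subcategory names come from the results tables below.
RUBRIC = {
    "Complexity": {
        "scale": (1, 3),  # 1-3 scale, as described in the post
        "subcategories": ["Details", "Movement"],
    },
    "Realism": {
        "scale": (1, 5),  # 1-5 scale, as described in the post
        "subcategories": ["Look", "Movement", "Consistency"],
        # Realism is also broken down by the type of object in the scene:
        "object_breakdown": [
            "Main object", "Person", "Building", "Vehicle",
            "Animal", "Devices", "Vegetation", "Furniture",
        ],
    },
    "Alignment": {
        "scale": (1, 5),  # assumed; the post does not state Alignment's scale
        "subcategories": [],
    },
}
```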

What makes our approach stand out?

The Mainstream Movies toolkit offers several enhancements over current industry standards and best practices.

  • It focuses on value to the end user. Higher scores align with viewers’ perception of video quality. 

  • It provides detailed and actionable insights so that ML engineers know which areas are underperforming and can take action.

  • Videos are evaluated by humans, who excel at understanding temporal coherence, context, motion realism, visual quality, and aesthetics in ways automated methods cannot easily replicate.

How top video models performed on Mainstream Movies evaluation 

We used the toolkit to assess four state-of-the-art video diffusion models: Luma AI, Pika, Runway, and Sora. To keep the comparison fair, we scored videos generated from the same prompts on the same day against the same set of criteria. The image below compares the output of these models for a single prompt in the Realism category.

Trained evaluators rated the videos from 1 to 5 on Realism and from 1 to 3 on Complexity, and the scores were converted to percentages in the results tables. To augment the numeric scores, evaluators also described what was wrong with each video in their own words.
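The post doesn't spell out the score-to-percentage conversion, but a simple mean-over-maximum mapping produces values in the ranges shown below. Treat this sketch as one plausible implementation, with the vote lists entirely hypothetical:

```python
def to_percentage(scores: list[int], max_score: int) -> int:
    """Convert raw rubric scores to a percentage.

    Assumes a mean-over-maximum mapping; the post does not specify
    the exact conversion, so this is one plausible choice.
    """
    mean = sum(scores) / len(scores)
    return round(100 * mean / max_score)

# Hypothetical ratings from five evaluators on the 1-5 Realism scale.
realism_votes = [4, 3, 4, 4, 3]
print(to_percentage(realism_votes, max_score=5))  # -> 72

# Complexity uses a 1-3 scale.
complexity_votes = [2, 2, 3]
print(to_percentage(complexity_votes, max_score=3))  # -> 78
```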

After analyzing three main criteria and thirteen subcategories, we found that Luma AI, Pika, and Runway perform similarly. The model that stands out is Sora, which significantly outperforms the others in complexity and realism. On the other hand, Sora falls short in prompt alignment. The results are shown in the tables below.


| Criterion | Luma AI | Pika | Runway | Sora |
| --- | --- | --- | --- | --- |
| Complexity | 70% | 70% | 70% | 80% |
| • Details | 73% | 77% | 70% | 80% |
| • Movement | 67% | 67% | 73% | 77% |
| Realism | 66% | 70% | 70% | 76% |
| • Look | 66% | 72% | 70% | 74% |
| • Movement | 62% | 60% | 66% | 72% |
| • Consistency | 68% | 72% | 72% | 80% |
| Alignment | 76% | 77% | 72% | 73% |

Detailed analysis revealed that each model has areas of strength and weakness. For example, when we zoomed in on the Realism category, Pika excelled in Furniture and Vegetation, matching or even beating Sora. The table below shows more granular Realism results for each model.


| Realism breakdown | Luma AI | Pika | Runway | Sora |
| --- | --- | --- | --- | --- |
| Realism (overall) | 66% | 70% | 70% | 76% |
| Look | 66% | 72% | 70% | 74% |
| Main object | 66% | 68% | 64% | 72% |
| Person | 60% | 50% | 66% | 72% |
| Building | 72% | 66% | 64% | 74% |
| Vehicle | 74% | 72% | 72% | 72% |
| Animal | 58% | 52% | 42% | 66% |
| Devices | 58% | 72% | 70% | 70% |
| Vegetation | 72% | 76% | 70% | 76% |
| Furniture | 74% | 88% | 68% | 82% |

Analyzing further, we also noticed distinctions between easier and more challenging categories: models tend to score higher on Furniture and lower on Animal. Overall, Sora leads in many areas but is closely matched or outperformed in specific categories.
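To make that concrete, here is a small snippet that takes a few subcategory scores from the table above and reports the per-subcategory leader. The dictionary layout is ours for illustration; the numbers are copied from the table.

```python
# Realism subcategory scores (percent) copied from the table above.
scores = {
    "Luma AI": {"Person": 60, "Animal": 58, "Vegetation": 72, "Furniture": 74},
    "Pika":    {"Person": 50, "Animal": 52, "Vegetation": 76, "Furniture": 88},
    "Runway":  {"Person": 66, "Animal": 42, "Vegetation": 70, "Furniture": 68},
    "Sora":    {"Person": 72, "Animal": 66, "Vegetation": 76, "Furniture": 82},
}

# Report the leading model per subcategory (ties go to the first model listed).
for cat in ["Person", "Animal", "Vegetation", "Furniture"]:
    best = max(scores, key=lambda model: scores[model][cat])
    print(f"{cat}: {best} ({scores[best][cat]}%)")
# Person: Sora (72%)
# Animal: Sora (66%)
# Vegetation: Pika (76%)   <- tied with Sora at 76%
# Furniture: Pika (88%)
```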

In addition to the numeric scores, free-form reviewer notes reveal several recurring issues that affect video quality. For example, in the Person subcategory under Realism, the problems experts flagged most often were morphing body parts, blurring or sudden disappearance, incorrect anatomy, and unnatural object behavior. These insights give model developers actionable feedback for the next iteration of their models.
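As an illustration of how such free-form notes can be turned into quantitative signals, the sketch below tags comments with the issue themes named above. The comments and keyword rules are invented for the example; a production pipeline would use more robust tagging.

```python
from collections import Counter

# Hypothetical reviewer notes for the Person subcategory; in practice these
# are free-form comments written by trained evaluators.
notes = [
    "left hand morphs into the coat around frame 40",
    "face blurs and the person suddenly disappears",
    "six fingers on the right hand - incorrect anatomy",
    "arm morphs through the railing",
]

# Simple keyword tagging to surface recurring issue themes. The tags mirror
# the problems named in the post; the matching rules are illustrative.
ISSUE_KEYWORDS = {
    "morphing body parts": ["morph"],
    "blurring or disappearance": ["blur", "disappear"],
    "incorrect anatomy": ["anatomy", "finger"],
}

counts = Counter()
for note in notes:
    for tag, keywords in ISSUE_KEYWORDS.items():
        if any(kw in note.lower() for kw in keywords):
            counts[tag] += 1

print(counts.most_common())
# [('morphing body parts', 2), ('blurring or disappearance', 1), ('incorrect anatomy', 1)]
```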

Ready to evaluate your video generation model?

If you’d like to evaluate your own video generation model and gain insights for performance improvements, connect with our team to use our toolkit or develop a customized version for your needs.

