
VOX-DUB: a new benchmark that puts AI dubbing to the test

September 9, 2025

Insights

How well do today's AI systems handle dubbing? We introduce VOX-DUB, a new human-evaluation benchmark designed to find out. It's built to evaluate the nuances of AI dubbing, complete with a test dataset, clear annotation guidelines, and an initial validation comparing four commercial solutions.

While text-to-speech (TTS) synthesis has nearly reached human levels of quality, dubbing remains an open challenge. It's not enough for a voice to sound natural—what truly matters is whether the system can preserve the actor’s character, intonation, and timbre when switching languages. This is what shapes how an audience perceives a scene.

Popular leaderboards for TTS already exist, like the Hugging Face TTS Arena and the artificialanalysis.ai Speech Arena, but there's no open benchmark dedicated specifically to dubbing. VOX-DUB aims to fill this gap. Our approach uses pairwise A/B comparisons with access to the original audio, forcing annotators to look beyond "pleasantness of sound" and consider emotion, pauses, and voice similarity. We've validated this benchmark offline with native speakers, and it's ready to be used to test the performance of your models.

How VOX-DUB Works

Here’s a look at the methodology that makes our benchmark a robust tool for assessing AI dubbing.

The Setup

We compare systems pairwise. Annotators listen to the original clip and both dubbed versions, then mark which is “better,” “worse,” or the “same” on several aspects. The “same” option captures cases where the systems are indistinguishable without penalizing either. To ensure impartiality, all evaluations are blind—provider names and metadata are hidden.

While Mean Opinion Score (MOS) is common in TTS research, it's less practical for dubbing, where subtle differences are harder to capture and calibration across languages is more complex. Our A/B approach is designed to pinpoint these fine-grained distinctions.
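To make the setup concrete, here is a minimal sketch of what a single blind judgment record could look like. The field names and values below are illustrative and simplified; the full schema is defined in our annotation guidelines.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PairwiseJudgment:
    utterance_id: str  # which source clip was dubbed (hypothetical ID format)
    aspect: str        # one of the five aspects listed in the next section
    # Systems are anonymized as "A" and "B"; the mapping back to provider
    # names is stored separately and hidden from annotators to keep
    # the evaluation blind.
    verdict: Literal["A_better", "B_better", "same"]
    annotator_id: str

judgment = PairwiseJudgment(
    utterance_id="clip_042",
    aspect="emotional_accuracy",
    verdict="same",
    annotator_id="rater_7",
)
```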

The Five Core Aspects

There is no single universal indicator of AI dubbing quality. To make comparisons fair and interpretable, we evaluate five distinct aspects that shape perception:

  • Pronunciation — Correctness in the target language and absence of phonetic errors.

  • Naturalness — Does it sound like a real human, not a synthesizer?

  • Audio quality — Clarity and the absence of artifacts or noise.

  • Emotional accuracy — How well the original emotions are conveyed.

  • Voice similarity — How close the timbre is to the source actor.

Details and examples for edge cases are included in our annotation guidelines, available in the VOX-DUB repository on Hugging Face.

Data 

We chose English (en-US) and Spanish (es-MX) as target languages. The source speech comes from French, German, Russian, Hindi, Chinese, and Japanese, as well as English and Spanish. The dataset is drawn from Creative Commons-licensed YouTube videos with a focus on acting performances. The final release consists of 168 text–audio pairs, available in the VOX-DUB repository on Hugging Face.
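As an illustration, a record could be loaded with the Hugging Face datasets library as sketched below. The repository ID and column names here are placeholders, so check the dataset card for the exact ones.

```python
# Hypothetical loading sketch: the repo ID "toloka/vox-dub" and the column
# names below are placeholders, not confirmed identifiers from the release.
from datasets import load_dataset

ds = load_dataset("toloka/vox-dub", split="test")
for example in ds.select(range(3)):
    print(example["source_lang"], "->", example["target_lang"], example["text"][:60])
```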

Systems

For the experiment, we tested four commercial systems: TTS Arena leaders ElevenLabs and Minimax, alongside dubbing-focused solutions Deepdub and Dubformer. All systems were run via their APIs using default settings.

Annotators and Scale

Annotations were conducted by native speakers with strong experience in speech synthesis evaluation. A strict requirement was the use of headphones (no laptop speakers). To ensure reliability, each comparison was evaluated by three independent annotators. In total, this produced 30,240 judgment instances (168 utterances × 6 system pairs × 5 aspects × 3 annotators × 2 languages). We then applied the Bradley–Terry–Davidson (BTD) model to aggregate these judgments into a final ranking that reflects the relative quality of each system.
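As a sketch of the aggregation step: under BTD, system i beats system j with probability π_i / (π_i + π_j + ν√(π_i π_j)), and a tie occurs with probability ν√(π_i π_j) / (π_i + π_j + ν√(π_i π_j)), where π are system strengths and ν controls the tie rate. The snippet below fits these parameters by maximum likelihood on toy data; it is a simplified illustration rather than our exact leaderboard pipeline, and the system names and judgments in it are made up.

```python
import numpy as np
from scipy.optimize import minimize

def fit_btd(judgments, systems):
    """judgments: (system_a, system_b, outcome) triples, outcome in {"a", "b", "tie"}."""
    idx = {s: i for i, s in enumerate(systems)}
    n = len(systems)

    def neg_log_likelihood(params):
        # params[:n] are log-strengths; params[n] is the log tie parameter nu.
        pi = np.exp(params[:n])
        nu = np.exp(params[n])
        ll = 0.0
        for a, b, outcome in judgments:
            pa, pb = pi[idx[a]], pi[idx[b]]
            denom = pa + pb + nu * np.sqrt(pa * pb)
            if outcome == "a":
                ll += np.log(pa / denom)
            elif outcome == "b":
                ll += np.log(pb / denom)
            else:  # tie
                ll += np.log(nu * np.sqrt(pa * pb) / denom)
        return -ll

    # Strengths are identified only up to a common scale factor,
    # so we re-normalize after fitting.
    result = minimize(neg_log_likelihood, np.zeros(n + 1), method="L-BFGS-B")
    strengths = np.exp(result.x[:n])
    return dict(zip(systems, strengths / strengths.sum()))

# Toy example with two hypothetical systems and three judgments.
ranking = fit_btd(
    [("sysA", "sysB", "a"), ("sysA", "sysB", "tie"), ("sysB", "sysA", "b")],
    ["sysA", "sysB"],
)
print(ranking)  # sysA ends up with the higher normalized strength
```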

The Results: A look at the dubbing leaderboard

The benchmark proved challenging for current systems. The figures below show the results for the two target languages, English (en-US) and Spanish (es-MX), with bar height showing system scores and vertical lines marking 95% confidence intervals.

English

[Figure: leaderboard scores for the English (en-US) target language]

Spanish

[Figure: leaderboard scores for the Spanish (es-MX) target language]

System quality profiles show only minor variation between the two target languages and generally remain within the confidence intervals.

Annotators selected the “same” option in about one-third of cases, which aligns with expectations for subjective audio evaluations where subtle differences are often hard to distinguish. 

Individual raters matched the aggregated label in 70% to 80% of cases, showing a strong level of agreement overall.
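As an illustration, the sketch below computes this kind of agreement figure under the simplifying assumption that the aggregated label is a majority vote over the three raters; the data layout is hypothetical.

```python
from collections import Counter

def rater_agreement(ratings_per_item):
    """ratings_per_item: per-comparison rating triples, e.g. ["A", "A", "same"]."""
    matches, total = 0, 0
    for ratings in ratings_per_item:
        # Take the majority label as the aggregated label for this comparison.
        majority = Counter(ratings).most_common(1)[0][0]
        matches += sum(r == majority for r in ratings)
        total += len(ratings)
    return matches / total

# Toy data: yields 7/9 ~ 0.78, i.e. within the 70-80% band described above.
print(rater_agreement([["A", "A", "same"], ["B", "B", "B"], ["A", "same", "same"]]))
```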

Key Findings: The trade-offs in modern dubbing AI

Our analysis revealed several recurring patterns and challenges for current dubbing systems.

The emotion vs. audio quality trade-off

In highly expressive scenes (like shouting), systems that scored higher on emotion often lost points on audio quality. It seems boosting expressiveness frequently introduces artifacts and noise across all tested systems.

[Audio examples: the original clip, a dubbed version with better sound quality, and a dubbed version that conveys the emotion better]

Two dominant types of pronunciation errors

Not all systems remain stable when transferring to other languages. The most common errors are:

Phonetic “hallucinations” — Missing or inserted sounds and distorted pronunciation.

[Audio example: unclear pronunciation]

Strong foreign accent — Speech that sounds distinctly “non-native” to listeners.

[Audio example: strong foreign accent]

Prosody: The big picture is there, but the details are lost

Most systems successfully captured the overall mood of a scene. However, finer details like pauses, accents, and nuanced prosody were often missing, which can reduce the clarity of the intended meaning.

[Audio examples: the original audio and the synthesized dub]

What's next for dubbing evaluation?

Future versions of VOX-DUB will extend beyond audio into video fragments. Adding visual context will create a closer match to real-world dubbing and open the door to new lines of assessment:

  • Visual alignment: measuring lip-sync, rhythm, and pauses within the shot

  • Voice stability: checking whether the voice remains stable across different lines

  • Acoustic setting: reflecting the scene's acoustic environment, such as the actor's relative position and reverberation

With video included, VOX-DUB can better reflect the challenges of real productions and drive progress toward models that handle full performances rather than isolated audio clips.

About Toloka 

Toloka develops high-quality, multimodal datasets for demanding domains such as dubbing, video, coding, and more. If your team needs a bespoke benchmark or deeper evaluation support, our experts can build a framework tailored to your model and use case. Contact us today.
