VOX-DUB: a new benchmark that puts AI dubbing to the test
How well do today's AI systems handle dubbing? We introduce VOX-DUB, a new human-evaluation benchmark designed to find out. It's built to evaluate the nuances of AI dubbing, complete with a test dataset, clear annotation guidelines, and an initial validation comparing four commercial solutions.
While text-to-speech (TTS) synthesis has nearly reached human levels of quality, dubbing remains an open challenge. It's not enough for a voice to sound natural—what truly matters is whether the system can preserve the actor’s character, intonation, and timbre when switching languages. This is what shapes how an audience perceives a scene.
Popular leaderboards for TTS already exist, like the Hugging Face TTS Arena and artificialanalysis.ai Speech Arena, but there's no open benchmark dedicated specifically to dubbing. VOX-DUB aims to fill this gap. Our approach uses pairwise A/B comparisons with access to the original audio, forcing annotators to look beyond "pleasantness of sound" and consider emotion, pauses, and voice similarity. We've validated this benchmark offline with native speakers, and it’s ready to be used to test the performance of your models.
How VOX-DUB Works
Here’s a look at the methodology that makes our benchmark a robust tool for assessing AI dubbing.
The Setup
We compare systems pairwise. Annotators listen to the original clip and both dubbed versions, then mark which is “better,” “worse,” or the “same” on several aspects. The “same” option captures cases where the systems are indistinguishable without penalizing either. To ensure impartiality, all evaluations are blind—provider names and metadata are hidden.
While Mean Opinion Score (MOS) is common in TTS research, it's less practical for dubbing, where subtle differences are harder to capture and calibration across languages is more complex. Our A/B approach is designed to pinpoint these fine-grained distinctions.
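As a rough illustration, a single blind judgment can be thought of as a small record like the sketch below; the field names and types are our own illustration, not the benchmark's actual annotation schema.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative record for one blind A/B judgment. Field names are hypothetical,
# not the actual VOX-DUB annotation schema.
@dataclass
class PairwiseJudgment:
    utterance_id: str                   # which source clip was dubbed
    aspect: str                         # one of the five core aspects (listed below)
    outcome: Literal["A", "B", "same"]  # which anonymized version is better, or a tie
    annotator_id: str                   # annotator identifier

# Provider names and metadata stay hidden: the annotator only ever sees "A" and "B",
# and the mapping back to systems is restored after annotation.
judgment = PairwiseJudgment("clip_0001", "emotional_accuracy", "same", "rater_17")
```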
The Five Core Aspects
There is no single universal indicator of AI dubbing quality. To make comparisons fair and interpretable, we evaluate five distinct aspects that shape perception:
Pronunciation — Correctness in the target language and absence of phonetic errors.
Naturalness — Whether the speech sounds like a real human rather than a synthesizer.
Audio quality — Clarity and the absence of artifacts or noise.
Emotional accuracy — How well the original emotions are conveyed.
Voice similarity — How close the timbre is to the source actor.
Details and examples for edge cases are included in our annotation guidelines available on Hugging Face VOX-DUB.
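For use in the code sketches later in this post, the five aspects can be written as plain identifiers (the identifier names are ours, not part of the official guidelines):

```python
from enum import Enum

# The five evaluated aspects as identifiers; naming is ours, not the official guidelines.
class Aspect(str, Enum):
    PRONUNCIATION = "pronunciation"
    NATURALNESS = "naturalness"
    AUDIO_QUALITY = "audio_quality"
    EMOTIONAL_ACCURACY = "emotional_accuracy"
    VOICE_SIMILARITY = "voice_similarity"
```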
Data
We chose English (en-US) and Spanish (es-MX) as target languages. The source speech comes from French, German, Russian, Hindi, Chinese, and Japanese, as well as English and Spanish. The dataset is drawn from Creative Commons licensed YouTube videos with a focus on acting performances. The final release consists of 168 text–audio pairs, available on Hugging Face VOX-DUB.
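If the release follows standard Hugging Face conventions, loading the pairs might look roughly like the snippet below; the repository id, split name, and field layout are placeholders here, so refer to the actual dataset card.

```python
from datasets import load_dataset

# Hypothetical loading example: the repository id and split name are placeholders,
# not the confirmed VOX-DUB path; check the dataset card on Hugging Face.
dubbing_pairs = load_dataset("toloka/VOX-DUB", split="test")  # placeholder id and split

# Inspect a few of the 168 text–audio pairs (field names depend on the release).
for example in dubbing_pairs.select(range(3)):
    print(example)
```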
Systems
For the experiment, we tested four commercial systems: TTS arena leaders ElevenLabs and Minimax, alongside dubbing-focused solutions Deepdub and Dubformer. All systems were run via their APIs using default settings.
Annotators and Scale
Annotations were conducted by native speakers with strong experience in speech synthesis evaluation. A strict requirement was the use of headphones (no laptop speakers). To ensure reliability, each comparison was evaluated by three independent annotators. In total, this produced 30,240 judgment instances (168 utterances × 6 pairs × 5 aspects × 3 overlaps × 2 languages). We then applied the Bradley–Terry–Davidson (BTD) model to aggregate these judgments into a final ranking that reflects the relative quality of each system.
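To make the aggregation step concrete, here is a generic sketch of a Bradley–Terry–Davidson fit over pairwise counts with ties; the parameterization, helper names, and toy counts are illustrative and not Toloka's actual aggregation code.

```python
import numpy as np
from scipy.optimize import minimize

# Generic Bradley–Terry–Davidson (BTD) sketch, not Toloka's actual pipeline.
# results[(i, j)] = (wins_i, wins_j, ties) holds judgment counts for systems i and j.
# In VOX-DUB: 168 utterances × 6 pairs × 5 aspects × 3 annotators × 2 languages = 30,240 judgments.

def fit_btd(results, n_systems):
    def neg_log_likelihood(params):
        theta, log_nu = params[:n_systems], params[n_systems]
        p, nu = np.exp(theta), np.exp(log_nu)      # system strengths and tie parameter
        nll = 0.0
        for (i, j), (wins_i, wins_j, ties) in results.items():
            denom = p[i] + p[j] + nu * np.sqrt(p[i] * p[j])
            nll -= wins_i * np.log(p[i] / denom)
            nll -= wins_j * np.log(p[j] / denom)
            nll -= ties * np.log(nu * np.sqrt(p[i] * p[j]) / denom)
        return nll

    start = np.zeros(n_systems + 1)                # equal strengths, nu = 1
    fitted = minimize(neg_log_likelihood, start, method="L-BFGS-B")
    theta = fitted.x[:n_systems]
    return theta - theta.mean()                    # centre scores for identifiability

# Toy counts for 4 systems (made up, not the benchmark's data).
toy_counts = {
    (0, 1): (40, 30, 14), (0, 2): (50, 20, 14), (0, 3): (45, 25, 14),
    (1, 2): (38, 32, 14), (1, 3): (42, 28, 14), (2, 3): (35, 35, 14),
}
print(fit_btd(toy_counts, n_systems=4))            # higher score = stronger system
```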
The Results: A look at the dubbing leaderboard
The benchmark proved challenging for current systems. The figures below show the results for the two target languages, English (en-US) and Spanish (es-MX), with bar height showing system scores and vertical lines marking 95% confidence intervals.

Figure: system scores for English (en-US)

Figure: system scores for Spanish (es-MX)
System quality profiles show only minor variation between the two target languages and generally remain within the confidence intervals.
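The post reports 95% confidence intervals without spelling out how they were computed; one plausible approach, sketched below under that assumption, is a bootstrap over utterances that reuses the fit_btd() helper from the previous snippet.

```python
import numpy as np

# Assumed approach, not a description of Toloka's pipeline: resample utterances
# with replacement, pool their pairwise counts, refit BTD, and read off the
# 2.5th and 97.5th percentiles of the resampled scores.
def bootstrap_confidence_intervals(counts_per_utterance, n_systems, n_boot=1000, seed=0):
    """counts_per_utterance: {utterance_id: {(i, j): (wins_i, wins_j, ties)}}."""
    rng = np.random.default_rng(seed)
    utterance_ids = list(counts_per_utterance)
    resampled_scores = []
    for _ in range(n_boot):
        pooled = {}
        for uid in rng.choice(utterance_ids, size=len(utterance_ids), replace=True):
            for pair, (wins_i, wins_j, ties) in counts_per_utterance[uid].items():
                acc = pooled.setdefault(pair, [0, 0, 0])
                acc[0] += wins_i; acc[1] += wins_j; acc[2] += ties
        resampled_scores.append(fit_btd(pooled, n_systems))  # helper from the BTD sketch
    return np.percentile(np.stack(resampled_scores), [2.5, 97.5], axis=0)
```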
Annotators selected the “same” option in about one-third of cases, which aligns with expectations for subjective audio evaluations where subtle differences are often hard to distinguish.
Individual raters matched the aggregated label in 70% to 80% of cases, showing a strong level of agreement overall.
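As a rough reconstruction of that agreement figure (assuming the aggregated label is the per-comparison majority vote, which the post does not state explicitly), the calculation could look like this:

```python
from collections import Counter

# Rough reconstruction, not the exact computation: take the majority of the three
# annotators' choices per comparison as the aggregated label, then measure how
# often individual votes match it. Three-way splits are broken arbitrarily here.
def rater_agreement(votes_per_comparison):
    matches = total = 0
    for votes in votes_per_comparison:
        majority_label, _ = Counter(votes).most_common(1)[0]
        matches += sum(vote == majority_label for vote in votes)
        total += len(votes)
    return matches / total

# Toy data: three annotators per comparison, choosing "A", "B", or "same".
print(rater_agreement([["A", "A", "same"], ["B", "B", "B"], ["same", "A", "same"]]))
```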

Key Findings: The trade-offs in modern dubbing AI
Our analysis revealed several recurring patterns and challenges for current dubbing systems.
The emotion vs. audio quality trade-off
In highly expressive scenes (like shouting), systems that scored higher on emotion often lost points on audio quality. It seems boosting expressiveness frequently introduces artifacts and noise across all tested systems.
Audio examples: the original clip, the dubbed version with better sound quality, and the dubbed version where the emotion is conveyed better.
Two dominant types of pronunciation errors
Not all systems remain stable when transferring speech to other languages. The two most common error types are:
Phonetic “hallucinations” — Missing or inserted sounds and distorted pronunciation.
Audio example: unclear pronunciation
Strong foreign accent — Speech that sounds distinctly “non-native” to listeners.
Audio example: strong foreign accent
Prosody: The big picture is there, but the details are lost
Most systems successfully captured the overall mood of a scene. However, finer details like pauses, accents, and nuanced prosody were often missing, which can reduce the clarity of the intended meaning.
Audio examples: the original clip and the synthesized version.
What's next for dubbing evaluation?
Future versions of VOX-DUB will extend beyond audio into video fragments. Adding visual context will create a closer match to real-world dubbing and open the door to new lines of assessment:
Visual alignment: measuring lip-sync, rhythm, and pauses within the shot
Voice stability: checking whether the voice remains stable across different lines
Acoustic setting: reflecting the scene's acoustic environment, such as the actor's position in the scene and the room's reverberation
With video included, VOX-DUB can better reflect the challenges of real productions and drive progress toward models that handle full performances rather than isolated audio clips.
About Toloka
Toloka develops high-quality, multimodal datasets for demanding domains such as dubbing, video, coding, and more. If your team needs a bespoke benchmark or deeper evaluation support, our experts can build a framework tailored to your model and use case. Contact us today.