Sep 17, 2024

News
Toloka and Top Universities Launch Innovative Benchmark for Detecting AI-Generated Texts

We are excited to announce the release of Beemo, a new benchmark for detecting AI-generated texts created through a collaboration between Toloka and NLP researchers from the University of Oslo, Penn State University, and other contributors.

Beemo (a benchmark of expert-edited machine-generated texts) shows how subtle human edits to machine-generated texts affect detection accuracy, marking a significant step forward in developing and evaluating AI text detectors.

The effort to collect expert-edited machine-generated texts has attracted contributors who recognize the urgency of AI detection research.

Adaku Uchendu, a researcher at MIT Lincoln Laboratory, explains: "Disinformation poses a serious threat to democracy, and while large language models (LLMs) are impressive, they can accelerate this issue. To preserve the integrity of our information ecosystem, which is already compromised, it's essential to develop models that can distinguish artificial texts from human-written ones."


Beemo is now publicly available on GitHub and Hugging Face.



Why is detecting AI-generated texts so important?

Detecting AI-generated text is critical for mitigating the risk that generative AI technologies are misused for malicious purposes. Our recent post discusses current AI detection methods and their challenges: existing benchmarks often fall short in practical applications because they are too simplistic.

Reliable AI-generated text detection is vital for maintaining the quality of datasets used in training LLMs and addressing ethical and legal concerns. 

Without effective detection, AI-generated content can be misidentified as human-produced, leading to an increasing volume of misinformation and fake texts, and potential copyright complications. Additionally, overreliance on AI-generated datasets can degrade the performance of pretrained LLMs.

"In the fight against misinformation and fake news, it's crucial to identify not only entirely AI-generated text, but also texts co-authored by humans and LLMs. Such hybrid texts can be particularly deceptive, blending the natural tone of human texts with the persuasive style of AI," said Preslav Nakov, Professor and Department Chair of Natural Language Processing at MBZUAI. "This new benchmark provides a vital tool for researchers and practitioners to better understand and build automatic systems to detect such sophisticated forms of text, thus enhancing our ability to uphold information integrity in the age of generative AI."


What sets Toloka’s approach to benchmarking apart

The project is a unique collaboration between industry and academia, according to Vladislav Mikhailov, a postdoc at the Language Technology Group, University of Oslo: "With Beemo, we make one of the first attempts to address more practical scenarios for detecting content created through the interaction of generative language technologies and users. This resource is helpful for many research and development purposes, from benchmarking machine-generated text detectors to improving data annotation quality on crowdsourcing platforms."


Beemo features texts generated by LLMs (e.g., Llama and Mistral) and edited by expert annotators. Covering a variety of use cases, Beemo makes it possible to compare three responses to the same prompt: a human-written answer, an LLM-generated answer, and an expert-edited version of that LLM output.

To create the benchmark:

  1. We used the No_Robots dataset as the source of prompts and human-written texts, and generated responses with open-source, instruction-finetuned LLMs.

  2. A group of expert annotators refined 20-40% of the generated responses to make them more human-like. They checked the model outputs for factual accuracy, corrected grammar and phrasing, removed hallucinations, and added personal touches to enhance engagement and relevance.

  3. The edited responses were then validated by lead editors with a strong background in editing and annotating generated data.

Each dataset example belongs to one of five categories (open-ended generation, summarization, rewriting, open question answering, and closed question answering) and consists of a prompt, a gold-standard response to the prompt, an LLM's response, and its corresponding human-edited version.
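
For orientation, the minimal sketch below loads the benchmark and inspects one example. It assumes the Hugging Face `datasets` library, `toloka/beemo` as the dataset ID, and field names inferred from the structure described above; check the dataset card on Hugging Face for the actual schema.

```python
# A minimal sketch of loading Beemo. The dataset ID, split, and field
# names are assumptions; consult the dataset card for the exact schema.
from datasets import load_dataset

beemo = load_dataset("toloka/beemo", split="train")  # assumed ID and split

example = beemo[0]
print(example["category"])      # one of the five task categories (assumed field)
print(example["prompt"])        # the original prompt (assumed field)
print(example["gold_answer"])   # human-written reference response (assumed field)
print(example["model_output"])  # raw LLM-generated response (assumed field)
print(example["human_output"])  # expert-edited LLM response (assumed field)
```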

How to use this benchmark

This diagnostic benchmark can be used to develop and test new artificial text detectors, while offering a test bed for existing AI detection systems. It illustrates how human edits in a generated text can affect an AI detector's prediction, making the AI-generated content appear human-written.

There are several ways to use Beemo for experiments related to AI detection:

  1. Benchmark zero-shot and trained AI detection systems. You can feed machine-generated texts and their human-edited counterparts to an AI detector and compare the predictions; in many cases, the detector will classify the human-edited texts as human-written (see the sketch after this list).

  2. Explore the robustness of AI detectors across a diverse set of LLMs and prompt categories.

  3. Train your own AI detectors. Instead of the standard binary classification setup (human-written vs. machine-generated), we encourage building more nuanced detectors that can recognize edits made to machine-generated texts (a data-preparation sketch follows below).
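
To illustrate the first direction, here is a minimal sketch that scores a raw LLM output and its expert-edited counterpart with an off-the-shelf detector. The `openai-community/roberta-base-openai-detector` checkpoint is used purely as an illustrative stand-in for any detector, and the Beemo field names are the same assumptions as in the loading sketch above.

```python
# A sketch of experiment 1: compare a detector's predictions on a raw
# LLM output and on its expert-edited version. The detector checkpoint
# is an illustrative stand-in; Beemo field names are assumptions.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

machine_text = beemo[0]["model_output"]  # raw LLM-generated response (assumed field)
edited_text = beemo[0]["human_output"]   # expert-edited version (assumed field)

# Label names depend on the checkpoint, so we print both predictions and
# compare; edited texts often drift toward the "human-written" label.
print("machine-generated:", detector(machine_text, truncation=True))
print("human-edited:", detector(edited_text, truncation=True))
```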
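
For the third direction, one possible data-preparation step is to relabel each Beemo example into three classes (human-written, machine-generated, human-edited) before fine-tuning a classifier. The sketch below only builds the text-label pairs; field names remain the assumptions used above.

```python
# A sketch of experiment 3: turn each Beemo example into three training
# pairs for a three-class detector (field names are assumed).
LABELS = {"human": 0, "machine": 1, "edited": 2}

train_pairs = []
for ex in beemo:
    train_pairs.append((ex["gold_answer"], LABELS["human"]))     # human-written reference
    train_pairs.append((ex["model_output"], LABELS["machine"]))  # raw LLM output
    train_pairs.append((ex["human_output"], LABELS["edited"]))   # expert-edited output

# `train_pairs` can now feed any standard text-classification fine-tuning loop.
```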

A different direction for experiments could explore how training on human-edited generated data improves LLM performance. We anticipate additional applications, and we look forward to exploring them and collaborating with the AI community.

We encourage you to share how you use our benchmark: tag Toloka on LinkedIn and X so we can see your findings. Your insights are valuable and can help further improve the benchmark and its applications.


Team

Vladislav Mikhailov, Saranya Venkatraman, Jason Lucas, Jooyoung Lee, Ekaterina Artemova
