Meet Beemo: A benchmark for AI text detection

Beemo: Benchmark of Expert-edited Machine-generated Outputs. Beemo is the first benchmark of its kind, designed to evaluate AI detector performance on texts with mixed authorship: texts generated by an LLM and then edited by a human expert or another model.

Contributors

Ekaterina Artemova (Toloka AI)
Jason Lucas (The Pennsylvania State University)
Saranya Venkatraman (The Pennsylvania State University)
Jooyoung Lee (The Pennsylvania State University)
Sergei Tilga (Toloka AI)
Adaku Uchendu (Independent researcher)
Vladislav Mikhailov (University of Oslo)

Beemo: Data Structure

The benchmark covers use cases ranging from creative writing to summarization. Each of its 6,585 texts comes in multiple versions, supporting diverse MGT detection evaluations:
— Human-written versions
— AI versions generated by 10 instruction-finetuned LLMs
— Edited versions polished by expert annotators
— Plus 13,170 additional machine-generated texts edited by LLMs

Beemo composition (figure legend):
— H: human-written
— M: machine-generated
— E: expert-edited
— LLM: LLM-edited
— Llama 3.1 / GPT-4o: which LLM was used for editing

How Beemo was created

1. Generating responses with instruction-finetuned LLMs, using prompts from the No Robots dataset

2. Editing the responses with expert annotators

3. Editing the responses with state-of-the-art LLMs
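
The LLM-editing step can be pictured with a short prompt-based sketch. This is a hypothetical illustration, not the exact prompt or client code used to build Beemo; GPT-4o is one of the editor models named in the composition legend above.

```python
# Hypothetical sketch of the LLM-editing step: a strong model is asked to
# lightly polish a machine-generated response, mirroring what expert
# annotators do. The prompt wording below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()

def llm_edit(machine_response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Lightly edit the following text for fluency, "
                        "factuality, and style. Keep its meaning and "
                        "roughly its length."},
            {"role": "user", "content": machine_response},
        ],
    )
    return completion.choices[0].message.content
```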

Data Samples
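
The easiest way to explore samples is to load the benchmark directly. Below is a minimal loading sketch, assuming Beemo is published on the Hugging Face Hub; the dataset ID and field names are illustrative assumptions, not the official schema.

```python
# Minimal loading sketch; the dataset ID and column names are assumptions.
from datasets import load_dataset

ds = load_dataset("toloka/beemo", split="train")  # hypothetical dataset ID

record = ds[0]
# Each record is assumed to bundle the parallel versions of one prompt:
human_text = record["human_text"]            # H: human-written reference
machine_text = record["machine_text"]        # M: raw LLM output
expert_edited = record["expert_edited_text"] # E: expert-polished output
```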

80% of AI Detectors Fooled by Tweaking Just 20% of AI-Generated Text

Testing the best MGT detectors on Beemo reveals a gap in detecting AI text after editing.
Using AUROC as our main performance measure, we examined three task setups (an evaluation sketch follows the list):

  • human-written vs machine-generated

  • machine-generated vs expert-edited

  • human-written vs expert-edited
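
Each setup is a binary classification task scored with AUROC. A minimal sketch of the protocol, where `detector_score` stands in for any detector that returns higher scores for machine-generated text:

```python
# Score every text with a detector and compute AUROC for one pairwise setup.
from sklearn.metrics import roc_auc_score

def evaluate_setup(negatives, positives, detector_score):
    """AUROC for one setup, e.g. human-written (0) vs machine-generated (1)."""
    labels = [0] * len(negatives) + [1] * len(positives)
    scores = [detector_score(t) for t in negatives + positives]
    return roc_auc_score(labels, scores)

# Example: how well does a detector separate humans from expert edits?
# auroc = evaluate_setup(human_texts, expert_edited_texts, my_detector)
```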

Beemo Benchmark insights

Zero-shot detectors (e.g., Binoculars, DetectGPT, DetectLLM) excel at distinguishing human vs. machine-written texts and adapt well to expert-edited and LLM-edited content. Binoculars leads on Beemo, while DetectGPT is the most robust.
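
To make the zero-shot family concrete, here is a minimal log-likelihood baseline in the same spirit. It is not the actual Binoculars or DetectGPT scoring (Binoculars uses a two-model perplexity ratio, DetectGPT a perturbation discrepancy), but it shows the shared idea: machine-generated text tends to be unusually probable under a reference LM.

```python
# Minimal zero-shot baseline: score a text by its mean token log-likelihood
# under a reference LM. Machine-generated text tends to score higher.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    # The HF causal-LM loss is the mean negative log-likelihood of the
    # tokens; negate it so higher scores mean "more machine-like".
    return -model(ids, labels=ids).loss.item()
```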

AI detectors struggle to detect machine-generated text refined by experts. Even slight edits are enough to bypass detection.

Expert-edited texts are more likely to be classified as human-written than LLM-edited ones, highlighting that LLM editing doesn’t fully mimic human revisions.

The math-specialized Qwen2.5-Math takes the lead over Gemini thanks to surprisingly strong performance in the text domain.

Detector performance remains stable regardless of the extent of edits (from 20% to 80% of the text edited). Significant changes to AI-generated content don't make it easier for detectors to tell whether a human was involved.

How robust is your AI detector?
Test it with Beemo
