Meet Beemo: A benchmark for AI text detection
Beemo: Benchmark of Expert-Edited Machine-generated Output. Beemo is the first benchmark of its kind, designed for evaluating AI detector performance on texts with mixed authorship: texts generated by an LLM and then edited by a human expert or another model.
Contributors
Ekaterina Artemova (Toloka AI)
Jason Lucas (The Pennsylvania State University)
Saranya Venkatraman (The Pennsylvania State University)
Jooyoung Lee (The Pennsylvania State University)
Sergei Tilga (Toloka AI)
Adaku Uchendu (Independent researcher)
Vladislav Mikhailov (University of Oslo)
Beemo: Data Structure
The benchmark covers use cases ranging from creative writing to summarization. It contains 6,585 texts, each available in multiple versions, supporting diverse machine-generated text (MGT) detection evaluations (a loading sketch follows the list below):
Human-written versions
AI versions (generated by 10 instruction-finetuned LLMs)
Edited versions polished by experts
Plus 13,170 additional machine-generated texts edited by LLMs
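To make the layout concrete, here is a minimal Python sketch for loading the benchmark and inspecting one record. The Hugging Face dataset identifier and the column names are illustrative assumptions, not the official schema; check the released dataset for the exact fields.

```python
# Minimal sketch: load Beemo and inspect one record.
# Assumptions: the dataset ID "toloka/beemo" and the column names below
# are illustrative placeholders; consult the official release for the exact schema.
from datasets import load_dataset

ds = load_dataset("toloka/beemo", split="train")  # hypothetical ID and split

example = ds[0]  # indexing returns a plain dict
for field in ("prompt", "human_text", "model_output", "expert_edited_output"):
    # Print a short preview of each authorship version, if present.
    value = example.get(field, "<missing>")
    print(f"{field}: {str(value)[:120]}")
```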
Beemo composition: H = human-written, M = machine-generated, E = expert-edited, LLM = LLM-edited; Llama 3.1 and GPT-4o denote which model was used for editing.
Generating responses with instruction-finetuned LLMs from prompts in the No Robots dataset
Having expert annotators edit the responses
Having state-of-the-art LLMs edit the responses (a minimal sketch of this step follows)
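The LLM-editing step can be pictured roughly as follows. This is a sketch only: the editing instruction, the model choice (GPT-4o here), and the use of the OpenAI chat completions client are illustrative assumptions, not the authors' exact prompt or pipeline.

```python
# Minimal sketch of the LLM-editing step.
# Assumptions: the editing instruction, model name, and temperature are
# illustrative; they are not the exact setup used to build Beemo.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EDIT_INSTRUCTION = (
    "Lightly edit the following response for clarity, fluency, and factual "
    "consistency. Preserve the original meaning and length as much as possible."
)

def edit_response(machine_text: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to polish a machine-generated response."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": EDIT_INSTRUCTION},
            {"role": "user", "content": machine_text},
        ],
        temperature=0.3,  # keep edits conservative
    )
    return completion.choices[0].message.content

# Example: edited = edit_response("<machine-generated text>")
```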
Data Samples
80% of AI Detectors Fooled by Tweaking Just 20% of AI-Generated Text
Testing the strongest MGT detectors on Beemo reveals a clear gap: detectors are markedly less reliable on AI text once it has been edited.
Using AUROC as our main performance measure, we examined three task setups (an evaluation sketch follows the list):
Human-written vs machine-generated
Machine-generated vs expert-edited
Human-written vs expert-edited
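Each setup reduces to a binary classification problem over per-text detector scores, so AUROC can be computed directly. The sketch below assumes you already have detector scores for each group; the score arrays and the choice of positive class are placeholders, not Beemo's released evaluation code.

```python
# Minimal sketch: AUROC for the three Beemo setups.
# Assumptions: `scores_*` are detector scores per text (higher = "more machine-like");
# this is illustrative, not the official Beemo evaluation code.
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(negative_scores, positive_scores):
    """AUROC with the second group treated as the positive class (label 1)."""
    y_true = np.concatenate([np.zeros(len(negative_scores)), np.ones(len(positive_scores))])
    y_score = np.concatenate([negative_scores, positive_scores])
    return roc_auc_score(y_true, y_score)

# Placeholder detector scores for each authorship group.
scores_human = np.random.rand(100)          # human-written (H)
scores_machine = np.random.rand(100) + 0.5  # machine-generated (M)
scores_edited = np.random.rand(100) + 0.2   # expert-edited (E)

print("H vs M:", auroc(scores_human, scores_machine))
print("M vs E:", auroc(scores_edited, scores_machine))  # M as the positive class here
print("H vs E:", auroc(scores_human, scores_edited))
```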
Comparison of 4 MGT detection methods on Beemo. AI detectors compared: Binoculars, DetectGPT, DetectLLM, and MAGE.