Meet Beemo: A benchmark for AI text detection
Beemo: Benchmark of Expert Edited and Machine-generated Output. Beemo is the first benchmark of its kind, designed for evaluating AI detector performance on texts with mixed authorship — texts generated by an LLM and then edited by a human expert or another model.
Contributors
Ekaterina Artemova
Jason Lucas
Saranya Venkatraman
Jooyoung Lee
Sergei Tilga
Adaku Uchendu
Vladislav Mikhailov
Beemo: Data Structure
The benchmark covers use cases ranging from creative writing to summarization. Its 6,585 texts come in several versions, which support a range of machine-generated text (MGT) detection evaluations:
— Human-written versions
— AI versions (generated by 10 instruction-finetuned LLMs)
— Edited versions polished by experts
— Plus 13,170 additional machine-generated texts edited by LLMs
Beemo composition:
H = human-written
M = machine-generated
E = expert-edited
LLM = LLM-edited
Llama 3.1 = edited with Llama 3.1
GPT-4o = edited with GPT-4o
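One way to picture a single Beemo entry is as a record that holds every version of a response to the same prompt. The sketch below is illustrative only: the class name `BeemoRecord`, its field names, and the sample strings are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BeemoRecord:
    """Hypothetical container for all versions of one text (not the official schema)."""
    prompt: str
    human: str          # H: human-written version
    machine: str        # M: machine-generated version
    expert_edited: str  # E: machine-generated text edited by an expert
    # LLM: machine-generated text edited by a model, keyed by the editing model's name
    llm_edited: Dict[str, str] = field(default_factory=dict)

    def versions(self) -> Dict[str, str]:
        """Return every available version keyed by its legend label."""
        out = {"H": self.human, "M": self.machine, "E": self.expert_edited}
        for editor, text in self.llm_edited.items():
            out[f"LLM ({editor})"] = text
        return out


# Placeholder strings for illustration only
record = BeemoRecord(
    prompt="Summarize the article in one sentence.",
    human="A human-written summary.",
    machine="A machine-generated summary.",
    expert_edited="A machine-generated summary, revised by an expert.",
    llm_edited={
        "Llama 3.1": "A machine-generated summary, revised by Llama 3.1.",
        "GPT-4o": "A machine-generated summary, revised by GPT-4o.",
    },
)

labels = list(record.versions())
```

Keeping all versions side by side makes it straightforward to build any pairing of labels (for example H against E) for a detection experiment.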
How Beemo was created
Generating responses with instruction-finetuned LLMs, based on the No Robots dataset
Having expert annotators edit the responses
Having state-of-the-art LLMs edit the responses
Data Samples
80% of AI Detectors Fooled by Tweaking Just 20% of AI-Generated Text
Testing the best MGT detectors on Beemo reveals a gap in detecting AI text after editing.
Using AUROC as our main performance measure, we examined three task setups:
human-written vs machine-generated
machine-generated vs expert-edited
human-written vs expert-edited
Beemo Benchmark insights
Zero-shot detectors (e.g., Binoculars, DetectGPT, DetectLLM) excel at distinguishing human-written from machine-generated texts and adapt well to expert-edited and LLM-edited content. Binoculars leads on Beemo, while DetectGPT is the most robust.
AI detectors struggle to detect machine-generated text refined by experts. Even slight edits are enough to bypass detection.
Detector performance remains stable regardless of the extent of edits (from 20% to 80% of the text edited). Significant changes to AI-generated content don’t make it easier for detectors to tell whether a human was involved.
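To make "extent of edits" concrete, the sketch below estimates how much of a machine-generated text was changed and assigns it to a 20-point bucket, so that detector scores could then be compared bucket by bucket. How the benchmark itself measures edit extent is not stated here; the word-level `difflib` similarity used below is an assumed proxy, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def edit_fraction(original: str, edited: str) -> float:
    """Share of word-level content that differs: 1 minus the SequenceMatcher ratio."""
    a, b = original.split(), edited.split()
    if not a and not b:
        return 0.0
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def edit_bucket(fraction: float, width: float = 0.2) -> str:
    """Map a fraction in [0, 1] to a percentage-range label such as '20-40%'."""
    n_buckets = round(1 / width)
    index = min(int(fraction / width), n_buckets - 1)  # a fraction of 1.0 falls in the last bucket
    low = round(index * width * 100)
    high = round((index + 1) * width * 100)
    return f"{low}-{high}%"


# Toy pair: four of six words are kept, so about one third of the content differs
machine_text = "the cat sat on the mat"
expert_text = "the cat sat on a rug"

fraction = edit_fraction(machine_text, expert_text)
print(f"edit fraction = {fraction:.2f}, bucket = {edit_bucket(fraction)}")
```

Grouping edited texts this way and scoring each group separately is one way to check whether detection quality changes as edits become heavier.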