Meet Beemo: A benchmark for AI text detection

Beemo (Benchmark of Expert-Edited and Machine-generated Output) is the first benchmark of its kind, designed to evaluate AI-detector performance on texts with mixed authorship: texts generated by an LLM and then edited by a human expert or by another model.

Contributors

  • Ekaterina Artemova, Toloka AI

  • Jason Lucas, The Pennsylvania State University

  • Saranya Venkatraman, The Pennsylvania State University

  • Jooyoung Lee, The Pennsylvania State University

  • Sergei Tilga, Toloka AI

  • Adaku Uchendu, Independent researcher

  • Vladislav Mikhailov, University of Oslo

Beemo: Data Structure

Figure: Beemo composition. H = human-written, M = machine-generated, E = expert-edited, LLM = LLM-edited; "Llama 3.1" and "GPT-4o" indicate which model performed the editing.

The benchmark covers use cases ranging from creative writing to summarization. Its 6,585 texts come in multiple versions, supporting diverse machine-generated text (MGT) detection evaluations (a loading sketch follows this list):

  • Human-written versions

  • AI versions (generated by 10 instruction-finetuned LLMs)

  • Edited versions polished by experts

  • Plus 13,170 additional machine-generated texts edited by LLMs
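
To explore this structure, the sketch below loads the corpus with the Hugging Face datasets library and prints one record. The Hub identifier toloka/beemo and the column names used here are illustrative assumptions, not a documented schema; check the dataset card for the actual fields.

    # A minimal sketch of exploring Beemo with the `datasets` library.
    # ASSUMPTIONS: the Hub identifier "toloka/beemo" and the column names
    # below are illustrative; consult the dataset card for the real schema.
    from datasets import load_dataset

    ds = load_dataset("toloka/beemo", split="train")  # hypothetical identifier

    # Each record is assumed to bundle the parallel versions of one prompt:
    # a human-written reference, a machine-generated response, and its edits.
    example = ds[0]
    for field in ("prompt", "model", "human_text", "machine_text", "edited_text"):
        print(field, "->", str(example.get(field))[:80])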

How Beemo was created

Beemo was created in three steps (a pipeline sketch follows this list):

  • Generating responses from 10 instruction-finetuned LLMs, prompted with the No Robots dataset

  • Editing the responses by expert annotators

  • Editing the responses by state-of-the-art LLMs
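
A minimal sketch of the generate-then-edit idea, assuming local instruction-tuned models via the transformers library. The model names, the editing instruction, and the HuggingFaceH4/no_robots identifier are illustrative assumptions, not the exact setup used to build Beemo (which used 10 LLMs and human experts).

    # Sketch of a generate-then-edit pipeline in the spirit of Beemo's construction.
    # ASSUMPTIONS: model names, prompt wording, and the split name are illustrative.
    from datasets import load_dataset
    from transformers import pipeline

    prompts = load_dataset("HuggingFaceH4/no_robots", split="train")

    generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    editor = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

    def generate_then_edit(prompt: str) -> dict:
        # Stage 1: an instruction-tuned LLM answers the prompt.
        machine_text = generator(
            prompt, max_new_tokens=256, return_full_text=False
        )[0]["generated_text"]
        # Stage 2: another model lightly edits the response (experts did this
        # by hand for the expert-edited portion of the benchmark).
        edit_request = f"Lightly edit the following text for fluency:\n\n{machine_text}"
        edited_text = editor(
            edit_request, max_new_tokens=256, return_full_text=False
        )[0]["generated_text"]
        return {"prompt": prompt, "machine_text": machine_text, "edited_text": edited_text}

    # e.g.: record = generate_then_edit(prompts[0]["prompt"])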

Data Samples

80% of AI Detectors Fooled by Tweaking Just 20% of AI-Generated Text

Testing the best MGT detectors on Beemo reveals a gap in detecting AI text after editing.
Using AUROC (area under the ROC curve) as the main performance measure, we examined three task setups (an evaluation sketch follows this list):

  • Human-written vs machine-generated

  • Machine-generated vs expert-edited

  • Human-written vs expert-edited
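
As a sketch of how one of these setups can be scored, the snippet below computes AUROC with scikit-learn. The detector_score function is a stand-in for any detector that returns a higher score for machine-generated text; it is an assumption for illustration, not part of Beemo.

    # Minimal AUROC evaluation sketch with scikit-learn.
    # ASSUMPTION: `detector_score` stands in for any MGT detector that maps
    # a text to a real-valued "machine-likeness" score.
    from sklearn.metrics import roc_auc_score

    def detector_score(text: str) -> float:
        raise NotImplementedError  # plug in Binoculars, MAGE, etc.

    def auroc(positive_texts, negative_texts):
        # Label 1 = the machine-generated side of the pairing, 0 = the other side.
        texts = list(positive_texts) + list(negative_texts)
        labels = [1] * len(positive_texts) + [0] * len(negative_texts)
        scores = [detector_score(t) for t in texts]
        return roc_auc_score(labels, scores)

    # e.g., machine-generated vs expert-edited:
    # print(auroc(machine_texts, expert_edited_texts))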

Figure: Comparison of 4 MGT detection methods on Beemo.

The evaluated AI detectors (a simplified scoring sketch follows this list):

  • Binoculars

  • DetectGPT

  • DetectLLM

  • MAGE
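
Binoculars, DetectGPT, and DetectLLM are zero-shot detectors that score texts using the log-likelihoods of an off-the-shelf language model, while MAGE is a trained classifier. The sketch below shows the simplest zero-shot idea, a perplexity-based score under GPT-2; it is a simplified stand-in for illustration, not the actual scoring rule of any detector listed above.

    # A simplified zero-shot detection score: perplexity under a small LM.
    # Lower perplexity tends to indicate machine-generated text. This is a
    # stand-in for illustration, not Binoculars/DetectGPT/DetectLLM itself.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    @torch.no_grad()
    def machine_likeness(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        # Cross-entropy of the text under the LM; negate so that higher
        # scores mean "more machine-like" (i.e., lower perplexity).
        loss = model(ids, labels=ids).loss
        return -loss.item()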

Beemo Benchmark insights

Zero-shot detectors (e.g., Binoculars, DetectGPT, DetectLLM) excel at distinguishing human vs. machine-written texts and adapt well to expert-edited and LLM-edited content. Binoculars leads on Beemo, while DetectGPT is the most robust.

AI detectors struggle to detect machine-generated text refined by experts. Even slight edits are enough to bypass detection.

Expert-edited texts are more likely to be classified as human-written than LLM-edited ones, highlighting that LLM editing doesn’t fully mimic human revisions.

Detector performance remains stable regardless of the extent of edits (20%-80% edited text). Significant changes to AI-generated content don’t make it easier for detectors to tell whether a human was involved.
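
One way to quantify "extent of edits" when reproducing this analysis is a normalized similarity between a machine-generated text and its edited version. The bucketing below uses Python's difflib and is an illustrative assumption; the benchmark may define edit extent differently.

    # Sketch: bucketing text pairs by how heavily they were edited.
    # ASSUMPTION: difflib's similarity ratio is used here for illustration.
    from difflib import SequenceMatcher

    def edited_fraction(machine_text: str, edited_text: str) -> float:
        # 0.0 = identical texts, 1.0 = completely rewritten.
        return 1.0 - SequenceMatcher(None, machine_text, edited_text).ratio()

    # e.g., keep only lightly edited pairs for a 20%-40% bucket:
    # pairs = [p for p in pairs if 0.2 <= edited_fraction(*p) < 0.4]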

How robust is your AI detector? Test it with Beemo.