The short answer
85-95%: Unmodified AI output. Directly submitted GPT-4o or Claude output, no edits.
60-80%: Lightly edited AI output. Minor human edits, corrected errors, added specifics.
30-60%: Paraphrased or humanized. Run through a humanizer tool or extensively rewritten.
These figures reflect ensemble detector performance on current AI model output. Single-metric detectors perform significantly worse, particularly on edited or paraphrased content.
Detection by AI model
Not all AI models are equally detectable. Older models with more predictable token distributions are easier to detect. Newer models trained with RLHF and instruction-following objectives produce text that is harder to distinguish from human writing at the statistical level.
Figures based on ensemble testing. Rates vary by content type and detector version.
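To make "detectable at the statistical level" concrete, here is a minimal sketch of the perplexity signal that statistical detectors build on. It assumes the Hugging Face transformers library and uses GPT-2 as a stand-in scoring model; production detectors use larger scoring models, calibration, and many additional features, so treat this as an illustration of the signal rather than a working detector.

```python
# Minimal sketch of the statistical (perplexity) signal, not a production detector.
# Assumes: pip install torch transformers. GPT-2 is only a stand-in scoring model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing the input ids as labels returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

# Lower perplexity = more predictable text = more likely to be flagged as AI-like.
print(perplexity("The results of the study indicate a significant improvement."))
print(perplexity("Grandma's borscht recipe survived two wars and a house fire."))
```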
Detection by content type
Content type is the second largest variable after whether the text has been paraphrased. Formal writing registers have lower perplexity in both AI and human output, making them harder to distinguish.
Creative fiction: 90-97% (Easy). AI fiction has characteristic plot structure, generic imagery, and low specificity.
Social media posts: 85-94% (Easy). Short text with AI-pattern phrasing is distinctive; limited human variation possible.
News articles: 82-91% (Easy). AI news has a flat inverted-pyramid structure, over-attribution, and smooth topic flow.
Marketing copy: 78-88% (Moderate). Benefit-focused structure is standard in both AI and human marketing writing.
Technical documentation: 70-82% (Moderate). Consistent terminology and procedural patterns are expected in human tech writing too.
Academic writing (native English): 72-86% (Moderate). Hedges and formal transitions trigger detectors, but semantic models can still distinguish them.
Academic writing (ESL): 60-78% (Hard). Formal ESL patterns closely match AI output; high false-positive risk.
Legal documents: 58-74% (Hard). Boilerplate language is inherently low-perplexity in both AI and human legal writing.
How modification affects detection
The relationship between editing and detection is not linear. The first round of edits has the largest effect on statistical detectors, while semantic detectors are more resistant to surface-level changes; a sketch of how an ensemble combines the two signals follows the table.
Raw AI output: Statistical 88%, Semantic 85%, Ensemble 87%
Light edit (fix errors, add one specific detail): Statistical 74%, Semantic 79%, Ensemble 77%
Moderate edit (vary sentence length, add 3+ specifics): Statistical 55%, Semantic 70%, Ensemble 63%
Heavy edit (structural changes, substantial new content): Statistical 38%, Semantic 58%, Ensemble 48%
Full humanizer pass (synonym replacement, burstiness): Statistical 29%, Semantic 55%, Ensemble 42%
Humanizer + manual review: Statistical 22%, Semantic 41%, Ensemble 31%
Approximate averages across content types. The semantic detector is the most resistant to surface-level changes.
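To make the statistical/semantic/ensemble distinction concrete, here is a hedged sketch of how an ensemble might weight the two signals and turn a score into an action. The weights, the 70% review threshold, and the score bands are assumptions made for the example, not any particular detector's actual configuration.

```python
# Illustrative ensemble combination, assuming each sub-detector returns a 0-100
# "likely AI" score. Weights and thresholds below are assumptions for the example.
from dataclasses import dataclass

@dataclass
class DetectorScores:
    statistical: float  # perplexity/burstiness-based score, 0-100
    semantic: float     # embedding/classifier-based score, 0-100

def ensemble_score(scores: DetectorScores,
                   w_statistical: float = 0.4,
                   w_semantic: float = 0.6) -> float:
    """Weighted average of the two signals; semantic is weighted higher here
    because it degrades more slowly under paraphrasing (see the table above)."""
    return w_statistical * scores.statistical + w_semantic * scores.semantic

def triage(score: float) -> str:
    """Map an ensemble score to an action, not a verdict."""
    if score >= 70:
        return "strong signal: review the submission"
    if score >= 40:
        return "ambiguous: look more carefully, gather other evidence"
    return "weak signal: not proof of human authorship either"

# Example: scores typical of a full humanizer pass land in the ambiguous band.
print(triage(ensemble_score(DetectorScores(statistical=29, semantic=55))))
```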
What these numbers mean in practice
For Educators
Unmodified student AI submissions are reliably caught at 85-95%. The concern is students who lightly edit or paraphrase, which drops detection to 60-80%. A semantic-model ensemble handles this better than perplexity-only tools, but heavily edited AI work remains harder to detect than unmodified output. Detection scores should be one input in a larger evaluation, not the sole criterion.
For Publishers and editors
Freelance content submitted without editing is caught reliably. The risk is AI-assisted content that was reviewed before submission. An ensemble detector with a 70%+ threshold catches most of this category. Post-publication monitoring catches cases where the detection threshold was tuned too loosely.
For Writers using AI tools
Light editing leaves detectable signals. If the goal is content that reads as human and passes detection, the editing required is substantial: structural changes, added specifics, varied sentence rhythm. Surface-level synonym replacement does not meaningfully reduce ensemble detection scores, though it does reduce perplexity-only scores.
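As a rough illustration of what "varied rhythm" means to a detector, here is a toy burstiness metric: the coefficient of variation of sentence lengths. The sentence-splitting heuristic and the metric itself are simplified assumptions; real detectors work on token-level statistics, but the intuition carries over.

```python
# Toy burstiness metric: coefficient of variation of sentence lengths.
# Uniform sentence lengths (low burstiness) are one signal associated with
# unedited AI output; human writing tends to vary more.
import re
import statistics

def burstiness(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

flat = "The model works well. The results are strong. The method is robust."
varied = "It failed. After three weeks of retraining on noisier data, though, the gains finally showed up."
print(burstiness(flat), burstiness(varied))  # flat scores near zero
```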
For Researchers and policy-makers
Detection accuracy figures from 2022-2024 studies are not current. The gap between unmodified and heavily paraphrased AI output has narrowed as models improved. Any policy relying on a specific detection rate should use current ensemble benchmarks and account for content type variation.
The direction of travel
Detection rates for unmodified AI output have stayed roughly stable (85-95%) despite significant improvements in AI model quality. This is because detector training keeps pace with model training: when GPT-4o was released, detectors were retrained on its output.
The harder problem is the edited and paraphrased range. Detection rates in the 30-60% zone mean many cases are genuinely ambiguous. This will likely remain true as both humanizer tools and AI models improve.
The practical implication: use detection as a filter and investigation trigger, not a verdict. A 90% score is strong evidence. A 55% score is a reason to look more carefully. A 25% score is not proof of human authorship. For more context, see AI Detection False Positives and Why Detectors Fail on Paraphrased Text.
See where your text falls
Seven detectors, one score, full breakdown. Free, no account needed.
Try Airno free