The short answer
85-95%: Unmodified AI output. Directly submitted GPT-4o or Claude output, no edits.
60-80%: Lightly edited AI output. Minor human edits, corrected errors, added specifics.
30-60%: Paraphrased or humanized. Run through a humanizer tool or extensively rewritten.
These figures reflect ensemble detector performance on current AI model output. Single-metric detectors perform significantly worse, particularly on edited or paraphrased content.
Detection by AI model
Not all AI models are equally detectable. Older models with more predictable token distributions are easier to detect. Newer models trained with RLHF and instruction-following objectives produce text that is harder to distinguish from human writing at the statistical level.
Figures based on ensemble testing. Rates vary by content type and detector version.
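To make "detectable at the statistical level" concrete, here is a minimal sketch of the perplexity signal that statistical detectors build on. It assumes the Hugging Face transformers library and uses GPT-2 as a stand-in scoring model; production detectors use larger scoring models, calibration, and many additional features, so treat this as an illustration of the signal rather than a working detector.

```python
# Minimal sketch of the statistical (perplexity) signal, not a production detector.
# Assumes: pip install torch transformers. GPT-2 is only a stand-in scoring model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing the input ids as labels returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

# Lower perplexity = more predictable text = more likely to be flagged as AI-like.
print(perplexity("The results of the study indicate a significant improvement."))
print(perplexity("Grandma's borscht recipe survived two wars and a house fire."))
```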
Detection by content type
Content type is the second largest variable after whether the text has been paraphrased. Formal writing registers have lower perplexity in both AI and human output, making them harder to distinguish.
Creative fiction: 90-97% (Easy). AI fiction has characteristic plot structure, generic imagery, and low specificity.
Social media posts: 85-94% (Easy). Short text with AI-pattern phrasing is distinctive; limited human variation possible.
News articles: 82-91% (Easy). AI news has a flat inverted-pyramid structure, over-attribution, and smooth topic flow.
Marketing copy: 78-88% (Moderate). Benefit-focused structure is standard in both AI and human marketing writing.
Technical documentation: 70-82% (Moderate). Consistent terminology and procedural patterns are expected in human tech writing too.
Academic writing (native English): 72-86% (Moderate). Hedges and formal transitions trigger detectors, but semantic models can still distinguish them.
Academic writing (ESL): 60-78% (Hard). Formal ESL patterns closely match AI output; high false-positive risk.
Legal documents: 58-74% (Hard). Boilerplate language is inherently low-perplexity in both AI and human legal writing.
How modification affects detection
The relationship between editing and detection is not linear. The first round of edits has the largest effect on statistical detectors, while semantic detectors are more resistant to surface-level changes; a sketch of how an ensemble combines the two signals follows the table.
Raw AI output: Statistical 88%, Semantic 85%, Ensemble 87%
Light edit (fix errors, add one specific detail): Statistical 74%, Semantic 79%, Ensemble 77%
Moderate edit (vary sentence length, add 3+ specifics): Statistical 55%, Semantic 70%, Ensemble 63%
Heavy edit (structural changes, substantial new content): Statistical 38%, Semantic 58%, Ensemble 48%
Full humanizer pass (synonym replacement, burstiness): Statistical 29%, Semantic 55%, Ensemble 42%
Humanizer + manual review: Statistical 22%, Semantic 41%, Ensemble 31%
Approximate averages across content types. The semantic detector is the most resistant to surface-level changes.
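To make the statistical/semantic/ensemble distinction concrete, here is a hedged sketch of how an ensemble might weight the two signals and turn a score into an action. The weights, the 70% review threshold, and the score bands are assumptions made for the example, not any particular detector's actual configuration.

```python
# Illustrative ensemble combination, assuming each sub-detector returns a 0-100
# "likely AI" score. Weights and thresholds below are assumptions for the example.
from dataclasses import dataclass

@dataclass
class DetectorScores:
    statistical: float  # perplexity/burstiness-based score, 0-100
    semantic: float     # embedding/classifier-based score, 0-100

def ensemble_score(scores: DetectorScores,
                   w_statistical: float = 0.4,
                   w_semantic: float = 0.6) -> float:
    """Weighted average of the two signals; semantic is weighted higher here
    because it degrades more slowly under paraphrasing (see the table above)."""
    return w_statistical * scores.statistical + w_semantic * scores.semantic

def triage(score: float) -> str:
    """Map an ensemble score to an action, not a verdict."""
    if score >= 70:
        return "strong signal: review the submission"
    if score >= 40:
        return "ambiguous: look more carefully, gather other evidence"
    return "weak signal: not proof of human authorship either"

# Example: scores typical of a full humanizer pass land in the ambiguous band.
print(triage(ensemble_score(DetectorScores(statistical=29, semantic=55))))
```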
What these numbers mean in practice
For Educators
Unmodified student AI submissions are reliably caught at 85-95%. The concern is students who lightly edit or paraphrase, which drops detection to 60-80%. A semantic-model ensemble handles this better than perplexity-only tools, but heavily edited AI work remains harder to detect than unmodified output. Detection scores should be one input in a larger evaluation, not the sole criterion.
For Publishers and editors
Freelance content submitted without editing is caught reliably. The risk is AI-assisted content that was reviewed before submission. An ensemble detector with a 70%+ threshold catches most of this category. Post-publication monitoring catches cases where the detection threshold was tuned too loosely.
For Writers using AI tools
Light editing leaves detectable signals. If the goal is content that reads as human and passes detection, the editing required is substantial: structural changes, added specifics, varied sentence rhythm. Surface-level synonym replacement does not meaningfully reduce ensemble detection scores, though it does reduce perplexity-only scores.
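As a rough illustration of what "varied rhythm" means to a detector, here is a toy burstiness metric: the coefficient of variation of sentence lengths. The sentence-splitting heuristic and the metric itself are simplified assumptions; real detectors work on token-level statistics, but the intuition carries over.

```python
# Toy burstiness metric: coefficient of variation of sentence lengths.
# Uniform sentence lengths (low burstiness) are one signal associated with
# unedited AI output; human writing tends to vary more.
import re
import statistics

def burstiness(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

flat = "The model works well. The results are strong. The method is robust."
varied = "It failed. After three weeks of retraining on noisier data, though, the gains finally showed up."
print(burstiness(flat), burstiness(varied))  # flat scores near zero
```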
For Researchers and policy-makers
Detection accuracy figures from 2022-2024 studies are not current. The gap between unmodified and heavily paraphrased AI output has narrowed as models improved. Any policy relying on a specific detection rate should use current ensemble benchmarks and account for content type variation.
The direction of travel
Detection rates for unmodified AI output have stayed roughly stable (85-95%) despite significant improvements in AI model quality. This is because detector training keeps pace with model training: when GPT-4o was released, detectors were retrained on its output.
The harder problem is the edited and paraphrased range. Detection rates in the 30-60% zone mean many cases are genuinely ambiguous. This will likely remain true as both humanizer tools and AI models improve.
The practical implication: use detection as a filter and investigation trigger, not a verdict. A 90% score is strong evidence. A 55% score is a reason to look more carefully. A 25% score is not proof of human authorship. For more context, see AI Detection False Positives and Why Detectors Fail on Paraphrased Text.
See where your text falls
Seven detectors, one score, full breakdown. Free, no account needed.
Try Airno free