How to evaluate an AI detector
Most comparison articles treat accuracy as a single number. It is not. A detector that scores 95% on a benchmark dataset of obvious AI text but has a 15% false positive rate on academic writing is not useful in practice. The metrics that matter:
True positive rate: How often the tool correctly identifies AI-generated text. Most tools score 80-95% on unmodified GPT output.
False positive rate: How often it incorrectly flags human writing as AI-generated. This is where tools diverge most; rates range from 2% to over 20% depending on content type. (A worked example follows this list.)
Paraphrase resistance: How well it handles AI text that has been lightly edited or paraphrased. Single-model detectors fail here; ensemble models are more resistant.
Ensemble depth: How many independent signals the tool measures. More detectors means fewer single points of failure.
Speed: Relevant for high-volume use cases. Cloud tools range from 1-10 seconds per document.
Explainability: Whether the tool shows which signals fired and why, or just returns a number.
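To make the first two metrics concrete, here is a minimal sketch of how true positive and false positive rates are computed from a labeled evaluation set. The `detect` function and the 0.5 threshold are placeholders for whatever tool and cutoff you are benchmarking, not any vendor's API.

```python
# Sketch: scoring a detector against texts with known ground truth.
# `detect` stands in for the tool under test; assume it returns an
# AI-probability score in [0, 1].

def evaluate(samples, detect, threshold=0.5):
    """samples: iterable of (text, is_ai) pairs with known labels."""
    tp = fp = fn = tn = 0
    for text, is_ai in samples:
        flagged = detect(text) >= threshold
        if is_ai and flagged:
            tp += 1   # AI text correctly caught
        elif is_ai:
            fn += 1   # AI text missed
        elif flagged:
            fp += 1   # human text wrongly flagged
        else:
            tn += 1   # human text correctly passed
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    return tpr, fpr
```

The two rates trade off against each other as the threshold moves: a vendor can quote a high detection rate that is only achievable at a false positive rate you would never accept on real submissions. Always ask for both numbers at the same operating point.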
The tools, compared
Airno
Detection depth: 8 detectors + DeBERTa-v3 fine-tuned
False positive rate: Lower (ensemble reduces single-model errors)
Paraphrase resistance: High (semantic models, not just phrase-matching)
Explainability: Full breakdown by detector type
Image detection: Yes (CLIP-based ensemble)
Speed: 2-5s (Railway cloud; cold starts add ~30s to the first request)
Best for: Anyone who needs to understand why a score is high, not just that it is: educators, editors, publishers, and researchers who need per-detector breakdowns. Also the only tool in this comparison with image detection. Free, no account required.
GPTZero
Detection depth: Proprietary model (perplexity + burstiness + in-house signals)
False positive rate: Moderate; higher on ESL and academic writing
Paraphrase resistance: Moderate
Explainability: Sentence-level highlighting (paid)
Image detection: No
Speed: 2-6s
Best for: Educators who want a free tier and sentence-level highlighting, plus the familiarity and institutional trust the brand carries. Full feature access requires a paid plan. See our Airno vs GPTZero comparison.
Originality.ai
Detection depth: AI detection + plagiarism combo
False positive rate: Moderate
Paraphrase resistance: Moderate
Explainability: Sentence-level scores
Image detection: No
Speed: 3-8s
Best for: Content agencies and SEO teams that want AI detection and plagiarism checking in one tool. Credit-based pricing with no free tier.
Turnitin
Detection depth: Institutional-grade, proprietary model
False positive rate: Moderate to high on ESL submissions (documented)
Paraphrase resistance: Moderate
Explainability: Percentage score only
Image detection: No
Speed: Integrated into the submission workflow
Caution: Institutional access only; not available to individuals, and documented false positive issues with ESL student writing make it a poor fit for personal use. See Turnitin alternatives.
Winston AI
Detection depth: Single model, claims high accuracy
False positive rate: Moderate
Paraphrase resistance: Low to moderate
Explainability: Document-level score, some highlighting
Image detection: No
Speed: 3-6s
Best for: Teams already using Winston for other features. Limited free tier; not a strong choice as a standalone detector.
Copyleaks
Detection depth: AI detection + plagiarism, multilingual
False positive rate: Moderate
Paraphrase resistance: Moderate
Explainability: Sentence-level AI probability
Image detection: No
Speed: 3-7s
Best for: Multilingual content teams needing detection in non-English languages, with strong plagiarism detection as the primary use case.
Quick decision guide
Individual writer checking own work: Airno. Free, no account, full detector breakdown so you know what to edit.
Teacher reviewing student submissions: Airno or GPTZero. Both have free tiers; GPTZero has sentence highlighting, Airno a lower false positive rate.
Publisher reviewing freelance submissions: Airno. Ensemble depth, explainability, and image detection for multi-format content.
Content agency with high volume: Originality.ai. Credit-based API access, AI + plagiarism combo, built for scale.
University (institutional): Turnitin. Institutional integration, though be aware of the ESL false positive issues.
Multilingual content: Copyleaks. Best multilingual support among the tools tested.
What no detector gets right
Every tool in this comparison fails in the same category: heavily paraphrased AI text. Once a language model's output is edited to include specific details, varied sentence lengths, and unusual word choices, detection accuracy drops significantly across all tools tested.
Ensemble detectors with semantic models (including Airno's DeBERTa-v3 fine-tuned model) are more resistant to paraphrasing than phrase-matching or perplexity-only approaches. But no detector is immune. Detection scores should be treated as probabilistic evidence, not verdicts.
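The arithmetic behind "probabilistic evidence" is worth spelling out. Suppose a detector catches 90% of AI text, wrongly flags 5% of human text, and 10% of the documents you review are actually AI-generated; all three numbers are illustrative assumptions, not measurements of any tool above. Bayes' rule then says a flag is far from proof:

```python
# Illustrative numbers only; not measured rates for any tool above.
tpr = 0.90        # assumed: detector catches 90% of AI text
fpr = 0.05        # assumed: detector flags 5% of human text
base_rate = 0.10  # assumed: 10% of reviewed documents are AI-generated

# Bayes' rule: P(AI | flagged)
p_flagged = tpr * base_rate + fpr * (1 - base_rate)
p_ai_given_flag = tpr * base_rate / p_flagged
print(f"{p_ai_given_flag:.0%}")  # 67% -- one flag in three is a false alarm
```

Lower the base rate to 1% and the same detector's flags are wrong roughly 85% of the time, which is why a single score should prompt a closer look, never a verdict.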
For context on the paraphrasing problem specifically, see Why AI Detectors Fail on Paraphrased Text. For the false positive problem, see AI Detection False Positives.
Why ensemble detection matters most in 2026
In 2023, GPT-4 output was detectable with simple perplexity models. In 2026, AI text generation is widespread enough that detection by perplexity alone has a high false positive rate: human writing in formal registers has low perplexity too.
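To see why, here is a minimal sketch of a perplexity-only detector, using GPT-2 as the scoring model and an arbitrary cutoff; both choices are our assumptions for illustration, since commercial tools do not disclose their scoring models:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Minimal perplexity-only detector sketch. GPT-2 is an arbitrary
# choice of scoring model; commercial tools use proprietary ones.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Using the input as its own labels yields the average
        # next-token cross-entropy; exp() of that is perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

def looks_ai(text: str, threshold: float = 30.0) -> bool:
    # Hypothetical cutoff: low perplexity read as "too predictable,
    # therefore AI". Formal, well-edited human writing also scores
    # low, which is exactly the false positive failure mode.
    return perplexity(text) < threshold
```

Run it on a well-edited abstract or a legal memo and the score often lands below the same cutoff that catches raw model output; predictability is a property of formal prose, not of authorship.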
The tools that have improved most in the last two years are those that added semantic models (transformers fine-tuned on AI vs. human text) alongside statistical measures. Single-metric detectors have not kept pace. When evaluating any AI detector in 2026, the most important question is not “what accuracy does it claim?” but “how many independent signals does it measure and how does it combine them?”
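As one sketch of what "combining independent signals" can mean in practice: a weighted average over per-detector scores, with the list of signals that fired reported alongside the number. The detector names and weights below are invented for illustration, not any tool's actual pipeline:

```python
# Illustrative ensemble combiner: weighted average of independent
# detector scores, each in [0, 1]. Names and weights are made up
# for the example; real tools tune these on labeled data.
WEIGHTS = {
    "perplexity": 0.15,
    "burstiness": 0.10,
    "semantic_classifier": 0.40,  # e.g. a fine-tuned transformer
    "stylometry": 0.20,
    "ngram_repetition": 0.15,
}

def combine(scores: dict[str, float]) -> dict:
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    fired = [name for name, s in scores.items() if s > 0.5]
    return {
        "ai_probability": total,
        "signals_fired": fired,          # which detectors agreed
        "agreement": len(fired) / len(scores),
    }

print(combine({
    "perplexity": 0.9,          # low perplexity alone...
    "burstiness": 0.4,
    "semantic_classifier": 0.2, # ...but the semantic model disagrees
    "stylometry": 0.3,
    "ngram_repetition": 0.5,
}))
```

Note how one anomalous signal (perplexity at 0.9) moves the combined score only modestly when the other signals disagree; that damping is the practical meaning of "fewer single points of failure".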
See all 8 detectors at once
Free, no account needed. Full breakdown shows exactly which signals fired and at what intensity.
Try Airno free