How to evaluate an AI detector
Most comparison articles treat accuracy as a single number. It is not. A detector that scores 95% on a benchmark dataset of obvious AI text but has a 15% false positive rate on academic writing is not useful in practice. The metrics that matter:
True positive rate: How often the tool correctly identifies AI-generated text. Most tools score 80-95% on unmodified GPT output.
False positive rate: How often it incorrectly flags human writing as AI-generated. This is where tools diverge most; rates range from 2% to over 20% depending on content type. (A worked example follows this list.)
Paraphrase resistance: How well it handles AI text that has been lightly edited or paraphrased. Single-model detectors fail here; ensemble models are more resistant.
Ensemble depth: How many independent signals the tool measures. More detectors means fewer single points of failure.
Speed: Relevant for high-volume use cases. Cloud tools range from 1-10 seconds per document.
Explainability: Whether the tool shows which signals fired and why, or just returns a number.
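To make the first two metrics concrete, here is a minimal sketch of how true positive and false positive rates are computed from a labeled evaluation set. The `detect` function and the 0.5 threshold are placeholders for whatever tool and cutoff you are benchmarking, not any vendor's API.

```python
# Sketch: scoring a detector against texts with known ground truth.
# `detect` stands in for the tool under test; assume it returns an
# AI-probability score in [0, 1].

def evaluate(samples, detect, threshold=0.5):
    """samples: iterable of (text, is_ai) pairs with known labels."""
    tp = fp = fn = tn = 0
    for text, is_ai in samples:
        flagged = detect(text) >= threshold
        if is_ai and flagged:
            tp += 1   # AI text correctly caught
        elif is_ai:
            fn += 1   # AI text missed
        elif flagged:
            fp += 1   # human text wrongly flagged
        else:
            tn += 1   # human text correctly passed
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    return tpr, fpr
```

The two rates trade off against each other as the threshold moves: a vendor can quote a high detection rate that is only achievable at a false positive rate you would never accept on real submissions. Always ask for both numbers at the same operating point.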
The tools, compared
Airno
Detection depth: 8 detectors + DeBERTa-v3 fine-tuned
False positive rate: Lower (ensemble reduces single-model errors)
Paraphrase resistance: High (semantic models, not just phrase-matching)
Explainability: Full breakdown by detector type
Image detection: Yes (CLIP-based ensemble)
Speed: 2-5s (Railway cloud; cold starts add ~30s to the first request)
Best for: Anyone who needs to understand why a score is high, not just that it is: educators, editors, publishers, and researchers who need per-detector breakdowns. Also the only tool in this comparison with image detection. Free, no account required.
GPTZero
Detection depth: Proprietary model (perplexity + burstiness + in-house signals)
False positive rate: Moderate; higher on ESL and academic writing
Paraphrase resistance: Moderate
Explainability: Sentence-level highlighting (paid)
Image detection: No
Speed: 2-6s
Best for: Educators who want a free tier and sentence-level highlighting, plus the familiarity and institutional trust the brand carries. Full feature access requires a paid plan. See our Airno vs GPTZero comparison.
Originality.ai
Detection depth: AI detection + plagiarism combo
False positive rate: Moderate
Paraphrase resistance: Moderate
Explainability: Sentence-level scores
Image detection: No
Speed: 3-8s
Best for: Content agencies and SEO teams that want AI detection and plagiarism checking in one tool. Credit-based pricing with no free tier.
Turnitin
Detection depth: Institutional-grade, proprietary model
False positive rate: Moderate to high on ESL submissions (documented)
Paraphrase resistance: Moderate
Explainability: Percentage score only
Image detection: No
Speed: Integrated into the submission workflow
Caution: Institutional access only; not available to individuals, and documented false positive issues with ESL student writing make it a poor fit for personal use. See Turnitin alternatives.
Winston AI
Detection depth: Single model, claims high accuracy
False positive rate: Moderate
Paraphrase resistance: Low to moderate
Explainability: Document-level score, some highlighting
Image detection: No
Speed: 3-6s
Best for: Teams already using Winston for other features. Limited free tier; not a strong choice as a standalone detector.
Copyleaks
Detection depth: AI detection + plagiarism, multilingual
False positive rate: Moderate
Paraphrase resistance: Moderate
Explainability: Sentence-level AI probability
Image detection: No
Speed: 3-7s
Best for: Multilingual content teams needing detection in non-English languages, with strong plagiarism detection as the primary use case.
Quick decision guide
Individual writer checking own work: Airno. Free, no account, full detector breakdown so you know what to edit.
Teacher reviewing student submissions: Airno or GPTZero. Both have free tiers; GPTZero has sentence highlighting, Airno a lower false positive rate.
Publisher reviewing freelance submissions: Airno. Ensemble depth, explainability, and image detection for multi-format content.
Content agency with high volume: Originality.ai. Credit-based API access, AI + plagiarism combo, built for scale.
University (institutional): Turnitin. Institutional integration, though be aware of the ESL false positive issues.
Multilingual content: Copyleaks. Best multilingual support among the tools tested.
What no detector gets right
Every tool in this comparison fails in the same category: heavily paraphrased AI text. Once a language model's output is edited to include specific details, varied sentence lengths, and unusual word choices, detection accuracy drops significantly across all tools tested.
Ensemble detectors with semantic models (including Airno's DeBERTa-v3 fine-tuned model) are more resistant to paraphrasing than phrase-matching or perplexity-only approaches. But no detector is immune. Detection scores should be treated as probabilistic evidence, not verdicts.
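The arithmetic behind "probabilistic evidence" is worth spelling out. Suppose a detector catches 90% of AI text, wrongly flags 5% of human text, and 10% of the documents you review are actually AI-generated; all three numbers are illustrative assumptions, not measurements of any tool above. Bayes' rule then says a flag is far from proof:

```python
# Illustrative numbers only; not measured rates for any tool above.
tpr = 0.90        # assumed: detector catches 90% of AI text
fpr = 0.05        # assumed: detector flags 5% of human text
base_rate = 0.10  # assumed: 10% of reviewed documents are AI-generated

# Bayes' rule: P(AI | flagged)
p_flagged = tpr * base_rate + fpr * (1 - base_rate)
p_ai_given_flag = tpr * base_rate / p_flagged
print(f"{p_ai_given_flag:.0%}")  # 67% -- one flag in three is a false alarm
```

Lower the base rate to 1% and the same detector's flags are wrong roughly 85% of the time, which is why a single score should prompt a closer look, never a verdict.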
For context on the paraphrasing problem specifically, see Why AI Detectors Fail on Paraphrased Text. For the false positive problem, see AI Detection False Positives.
Why ensemble detection matters most in 2026
In 2023, GPT-4 output was detectable with simple perplexity models. In 2026, AI text generation is widespread enough that detection by perplexity alone has a high false positive rate: human writing in formal registers has low perplexity too.
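To see why, here is a minimal sketch of a perplexity-only detector, using GPT-2 as the scoring model and an arbitrary cutoff; both choices are our assumptions for illustration, since commercial tools do not disclose their scoring models:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Minimal perplexity-only detector sketch. GPT-2 is an arbitrary
# choice of scoring model; commercial tools use proprietary ones.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Using the input as its own labels yields the average
        # next-token cross-entropy; exp() of that is perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

def looks_ai(text: str, threshold: float = 30.0) -> bool:
    # Hypothetical cutoff: low perplexity read as "too predictable,
    # therefore AI". Formal, well-edited human writing also scores
    # low, which is exactly the false positive failure mode.
    return perplexity(text) < threshold
```

Run it on a well-edited abstract or a legal memo and the score often lands below the same cutoff that catches raw model output; predictability is a property of formal prose, not of authorship.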
The tools that have improved most in the last two years are those that added semantic models (transformers fine-tuned on AI vs. human text) alongside statistical measures. Single-metric detectors have not kept pace. When evaluating any AI detector in 2026, the most important question is not “what accuracy does it claim?” but “how many independent signals does it measure and how does it combine them?”
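As one sketch of what "combining independent signals" can mean in practice: a weighted average over per-detector scores, with the list of signals that fired reported alongside the number. The detector names and weights below are invented for illustration, not any tool's actual pipeline:

```python
# Illustrative ensemble combiner: weighted average of independent
# detector scores, each in [0, 1]. Names and weights are made up
# for the example; real tools tune these on labeled data.
WEIGHTS = {
    "perplexity": 0.15,
    "burstiness": 0.10,
    "semantic_classifier": 0.40,  # e.g. a fine-tuned transformer
    "stylometry": 0.20,
    "ngram_repetition": 0.15,
}

def combine(scores: dict[str, float]) -> dict:
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    fired = [name for name, s in scores.items() if s > 0.5]
    return {
        "ai_probability": total,
        "signals_fired": fired,          # which detectors agreed
        "agreement": len(fired) / len(scores),
    }

print(combine({
    "perplexity": 0.9,          # low perplexity alone...
    "burstiness": 0.4,
    "semantic_classifier": 0.2, # ...but the semantic model disagrees
    "stylometry": 0.3,
    "ngram_repetition": 0.5,
}))
```

Note how one anomalous signal (perplexity at 0.9) moves the combined score only modestly when the other signals disagree; that damping is the practical meaning of "fewer single points of failure".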
See all 8 detectors at once
Free, no account needed. Full breakdown shows exactly which signals fired and at what intensity.
Try Airno free