The core idea: how surprising is each word?
A language model assigns a probability to every possible next word given the preceding context. Perplexity measures how surprised the model is, on average, by the words that actually appear.
Low perplexity means the text was very predictable to the model. Each word was close to what the model would have chosen. High perplexity means the text was unpredictable; many words were surprising choices the model would not have selected.
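Formally, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually appeared. A minimal sketch in Python (the probability lists are made-up illustrations, not real model outputs):

```python
import math

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative
    log-probability the model assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, not real model outputs.
predictable = [0.9, 0.85, 0.9, 0.8]  # each word close to the model's top choice
surprising = [0.05, 0.4, 0.02, 0.3]  # many words the model would not have picked

assert perplexity(predictable) < perplexity(surprising)
```

As a sanity check on the formula: if the model assigns probability 0.5 to every token, perplexity is exactly 2, meaning the model is on average as surprised as by a fair coin flip.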
Low perplexity text
“In today's rapidly evolving technological landscape, artificial intelligence has emerged as a transformative force that is reshaping industries and redefining human potential.”
Every word here is highly predictable given the preceding context. A language model assigns high probability to each token. This is characteristic of AI output.
High perplexity text
“The raccoon knocked over three bins last Tuesday. My neighbor blamed me, even though I explicitly told her last April that raccoons prefer unsecured lids.”
Specific details, unexpected references, and idiosyncratic connections are harder to predict. Human writing tends toward higher perplexity because real experiences are varied.
Why AI output has low perplexity
Large language models generate text by selecting the most probable continuation at each step (or sampling from high-probability options). This process is self-reinforcing: the model chooses words that are grammatically smooth, semantically appropriate, and stylistically conventional, because these are the words that appeared most often in its training data in similar contexts.
The result is text that is very predictable to other language models. When a detector model scores the text, it finds that each word was exactly what it would have expected. This low-perplexity signature is one of the most reliable statistical indicators of AI generation.
A rough spectrum, from lowest perplexity to highest:
- Typical unedited GPT-4 output: the model selects highest-probability tokens, producing smooth, conventional phrasing throughout.
- Lightly edited AI output: edits change a few tokens, but the underlying statistical texture is preserved.
- Heavily rewritten AI content: structural and phrasing changes shift some perplexity, but full human-level variance is rarely achieved.
- Formal academic human writing: academic convention constrains vocabulary and structure, producing AI-like smoothness without AI authorship.
- Casual personal writing: personal style, idiosyncratic vocabulary, and real-life specifics produce unexpected word choices.
- Poetry and experimental writing: deliberate rule-breaking produces very high perplexity regardless of authorship.
Burstiness: the companion metric
Perplexity alone is not sufficient. Formal human writing and technical documentation can have low perplexity without being AI-generated. A companion metric called burstiness helps address this limitation.
Burstiness measures the variance in perplexity across sentences. Human writing tends to mix high-complexity and low-complexity sentences within the same document. A writer might use a long, structurally complex sentence followed by a short, punchy one. This natural variation produces high burstiness.
AI output tends toward uniform sentence complexity. Each sentence is approximately as complex as the others. This produces low burstiness: smooth, consistent phrasing throughout the document, with few of the sharp shifts in sentence structure that characterize natural writing.
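As a sketch, burstiness can be computed as the spread of per-sentence perplexity. The helper below reuses the perplexity formula from above; the probability lists are invented for illustration:

```python
import math
import statistics

def sentence_perplexity(token_probs):
    # Same formula as document-level perplexity, applied to one sentence.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def burstiness(sentences):
    """Spread (population standard deviation) of per-sentence perplexity.
    High values mean the document mixes predictable and surprising sentences."""
    return statistics.pstdev(sentence_perplexity(s) for s in sentences)

# Hypothetical per-token probabilities for each sentence.
uniform = [[0.8, 0.85, 0.8], [0.82, 0.8, 0.83], [0.81, 0.84, 0.8]]  # AI-like
mixed = [[0.9, 0.9, 0.9], [0.05, 0.3, 0.1], [0.85, 0.8, 0.9]]       # human-like

assert burstiness(uniform) < burstiness(mixed)
```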
- AI signature: low perplexity plus low burstiness. The text is smooth and uniformly smooth; each sentence is about as predictable as the last.
- Human signature: variable perplexity plus high burstiness. Some sentences are predictable, others are not; the variance itself is a human signal.
Why perplexity alone is not enough for detection
Several real-world writing patterns produce low perplexity for reasons that have nothing to do with AI generation:
- Formal academic writing follows strict conventions that constrain vocabulary and structure, producing low perplexity without AI authorship. This is the source of the ESL false positive problem: non-native speakers who write highly formal, textbook-correct English are penalized for writing well.
- Legal and regulatory text uses standardized clause structures, precise boilerplate, and required phrasing that are inherently low-perplexity.
- Technical documentation, API references, and instructional content follow templated structures that produce smooth, predictable text.
- News writing follows inverted-pyramid structure and journalistic conventions that narrow the space of appropriate word choices.
A detector that relies exclusively on perplexity will produce high false positive rates for all of these content types. This is why Airno uses eight independent detectors: perplexity and burstiness are two of the signals, but they are weighted alongside pattern analysis, semantic deep learning (DeBERTa v3), frequency analysis, artifact detection, and more. When multiple independent signals converge on a high score, the result is much more reliable than any single metric.
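Conceptually, combining independent signals can be pictured as a weighted average of per-detector scores. The detector names, scores, and equal weights below are placeholders for illustration; Airno's actual weighting scheme is not described here:

```python
def combined_score(scores, weights):
    """Weighted average of independent detector scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

# Hypothetical scores and placeholder equal weights, for illustration only.
scores = {"statistical": 0.9, "pattern": 0.8, "deberta": 0.85,
          "frequency": 0.7, "artifact": 0.75}
weights = {name: 1.0 for name in scores}
print(round(combined_score(scores, weights), 2))  # prints 0.8
```

The point of the ensemble is that a single inflated detector (say, statistical smoothness from formal prose) moves the combined score only modestly, while convergence across many detectors pushes it high.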
How to interpret perplexity-related signals in detection results
When you run text through Airno and see the per-detector breakdown, the statistical detector is the one most closely correlated with perplexity and burstiness analysis. Here is how to interpret what you see:
- Statistical detector high, others low: probably human. Likely a false positive. The text is formally written (academic, legal, technical) but may be genuinely human-authored; formal writing genres produce low perplexity without AI generation.
- Statistical + pattern detectors high, DeBERTa low: uncertain. Moderate signal. Statistical smoothness and phrase patterns are present, but the semantic deep learning model does not flag the text; this could be AI-assisted editing rather than full AI generation.
- Statistical + DeBERTa high, others variable: likely AI. Stronger signal. The combination of surface statistical patterns and deep semantic features is harder to explain as a false positive, and DeBERTa v3 is specifically trained to resist surface-level evasion.
- All detectors elevated: almost certainly AI. Strongest signal. All eight independent detection methods agree. This combination almost never occurs for genuinely human-written text and typically indicates AI generation with, at most, minimal editing.
Further reading
For a practical overview of how all eight of Airno's detectors work together, see How AI Detection Works. For the false positive problem in formal writing contexts (directly related to the perplexity limitation explained here), see AI Detection False Positives. For detection accuracy across content types, see What Percentage of AI Content Is Detectable?
See all eight detectors on your text
Statistical, pattern, DeBERTa v3, frequency, CNN, artifact, and more. Free, no account needed.
Try Airno free