The core idea: how surprising is each word?
A language model assigns a probability to every possible next word given the preceding context. Perplexity measures how surprised the model is, on average, by the words that actually appear.
Low perplexity means the text was very predictable to the model. Each word was close to what the model would have chosen. High perplexity means the text was unpredictable; many words were surprising choices the model would not have selected.
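Formally, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually appeared. A minimal sketch in Python (the probability lists are made-up illustrations, not real model outputs):

```python
import math

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative
    log-probability the model assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, not real model outputs.
predictable = [0.9, 0.85, 0.9, 0.8]  # each word close to the model's top choice
surprising = [0.05, 0.4, 0.02, 0.3]  # many words the model would not have picked

assert perplexity(predictable) < perplexity(surprising)
```

As a sanity check on the formula: if the model assigns probability 0.5 to every token, perplexity is exactly 2, meaning the model is on average as surprised as by a fair coin flip.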
Low perplexity text
“In today's rapidly evolving technological landscape, artificial intelligence has emerged as a transformative force that is reshaping industries and redefining human potential.”
Every word here is highly predictable given the preceding context. A language model assigns high probability to each token. This is characteristic of AI output.
High perplexity text
“The raccoon knocked over three bins last Tuesday. My neighbor blamed me, even though I explicitly told her last April that raccoons prefer unsecured lids.”
Specific details, unexpected references, and idiosyncratic connections are harder to predict. Human writing tends toward higher perplexity because real experiences are varied.
Why AI output has low perplexity
Large language models generate text by selecting the most probable continuation at each step (or sampling from high-probability options). This process is self-reinforcing: the model chooses words that are grammatically smooth, semantically appropriate, and stylistically conventional, because these are the words that appeared most often in its training data in similar contexts.
The result is text that is very predictable to other language models. When a detector model scores the text, it finds that each word was exactly what it would have expected. This low-perplexity signature is one of the most reliable statistical indicators of AI generation.
A rough spectrum, from lowest perplexity to highest:
- Typical unedited GPT-4 output: the model selects highest-probability tokens, producing smooth, conventional phrasing throughout.
- Lightly edited AI output: edits change a few tokens, but the underlying statistical texture is preserved.
- Heavily rewritten AI content: structural and phrasing changes shift some perplexity, but full human-level variance is rarely achieved.
- Formal academic human writing: academic convention constrains vocabulary and structure, producing AI-like smoothness without AI authorship.
- Casual personal writing: personal style, idiosyncratic vocabulary, and real-life specifics produce unexpected word choices.
- Poetry and experimental writing: deliberate rule-breaking produces very high perplexity regardless of authorship.
Burstiness: the companion metric
Perplexity alone is not sufficient. Formal human writing and technical documentation can have low perplexity without being AI-generated. A companion metric called burstiness helps address this limitation.
Burstiness measures the variance in perplexity across sentences. Human writing tends to mix high-complexity and low-complexity sentences within the same document. A writer might use a long, structurally complex sentence followed by a short, punchy one. This natural variation produces high burstiness.
AI output tends toward uniform sentence complexity. Each sentence is approximately as complex as the others. This produces low burstiness: smooth, consistent phrasing throughout the document, with few of the sharp shifts in sentence structure that characterize natural writing.
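As a sketch, burstiness can be computed as the spread of per-sentence perplexity. The helper below reuses the perplexity formula from above; the probability lists are invented for illustration:

```python
import math
import statistics

def sentence_perplexity(token_probs):
    # Same formula as document-level perplexity, applied to one sentence.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def burstiness(sentences):
    """Spread (population standard deviation) of per-sentence perplexity.
    High values mean the document mixes predictable and surprising sentences."""
    return statistics.pstdev(sentence_perplexity(s) for s in sentences)

# Hypothetical per-token probabilities for each sentence.
uniform = [[0.8, 0.85, 0.8], [0.82, 0.8, 0.83], [0.81, 0.84, 0.8]]  # AI-like
mixed = [[0.9, 0.9, 0.9], [0.05, 0.3, 0.1], [0.85, 0.8, 0.9]]       # human-like

assert burstiness(uniform) < burstiness(mixed)
```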
- AI signature: low perplexity plus low burstiness. The text is smooth and uniformly smooth; each sentence is about as predictable as the last.
- Human signature: variable perplexity plus high burstiness. Some sentences are predictable, others are not; the variance itself is a human signal.
Why perplexity alone is not enough for detection
Several real-world writing patterns produce low perplexity for reasons that have nothing to do with AI generation:
- Formal academic writing follows strict conventions that constrain vocabulary and structure, producing low perplexity without AI authorship. This is the source of the ESL false positive problem: non-native speakers who write highly formal, textbook-correct English are penalized for writing well.
- Legal and regulatory text uses standardized clause structures, precise boilerplate, and required phrasing that are inherently low-perplexity.
- Technical documentation, API references, and instructional content follow templated structures that produce smooth, predictable text.
- News writing follows inverted-pyramid structure and journalistic conventions that narrow the space of appropriate word choices.
A detector that relies exclusively on perplexity will produce high false positive rates for all of these content types. This is why Airno uses eight independent detectors: perplexity and burstiness are two of the signals, but they are weighted alongside pattern analysis, semantic deep learning (DeBERTa v3), frequency analysis, artifact detection, and more. When multiple independent signals converge on a high score, the result is much more reliable than any single metric.
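Conceptually, combining independent signals can be pictured as a weighted average of per-detector scores. The detector names, scores, and equal weights below are placeholders for illustration; Airno's actual weighting scheme is not described here:

```python
def combined_score(scores, weights):
    """Weighted average of independent detector scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

# Hypothetical scores and placeholder equal weights, for illustration only.
scores = {"statistical": 0.9, "pattern": 0.8, "deberta": 0.85,
          "frequency": 0.7, "artifact": 0.75}
weights = {name: 1.0 for name in scores}
print(round(combined_score(scores, weights), 2))  # prints 0.8
```

The point of the ensemble is that a single inflated detector (say, statistical smoothness from formal prose) moves the combined score only modestly, while convergence across many detectors pushes it high.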
How to interpret perplexity-related signals in detection results
When you run text through Airno and see the per-detector breakdown, the statistical detector is the one most closely correlated with perplexity and burstiness analysis. Here is how to interpret what you see:
- Statistical detector high, others low: probably human. Likely a false positive. The text is formally written (academic, legal, technical) but may be genuinely human-authored; formal writing genres produce low perplexity without AI generation.
- Statistical + pattern detectors high, DeBERTa low: uncertain. Moderate signal. Statistical smoothness and phrase patterns are present, but the semantic deep learning model does not flag the text; this could be AI-assisted editing rather than full AI generation.
- Statistical + DeBERTa high, others variable: likely AI. Stronger signal. The combination of surface statistical patterns and deep semantic features is harder to explain as a false positive, and DeBERTa v3 is specifically trained to resist surface-level evasion.
- All detectors elevated: almost certainly AI. Strongest signal. All eight independent detection methods agree. This combination almost never occurs for genuinely human-written text and typically indicates AI generation with, at most, minimal editing.
Further reading
For a practical overview of how all eight of Airno's detectors work together, see How AI Detection Works. For the false positive problem in formal writing contexts (directly related to the perplexity limitation explained here), see AI Detection False Positives. For detection accuracy across content types, see What Percentage of AI Content Is Detectable?
See all eight detectors on your text
Statistical, pattern, DeBERTa v3, frequency, CNN, artifact, and more. Free, no account needed.
Try Airno free