Can AI-generated content be reliably distinguished from human content?

Why can't humans reliably spot AI-generated text?

People, even experts, are surprisingly bad at telling AI-generated content from human work, often performing no better than a coin flip. In a large study with 1,276 participants, the average person correctly identified AI-generated images, audio, and video only about 50% of the time, which is essentially guessing [9]. For academic writing specifically, 63 university lecturers correctly identified AI-generated excerpts just 57% of the time, and professional-level AI text fooled over 80% of them [8]. Even stroke experts reviewing scientific essays misclassified nearly one-third (31.8%) of human-written essays as AI-generated [1]. Medical school application readers fared about the same, with 56% accuracy in spotting AI-written personal statements [4]. In two other studies human accuracy was lower still: 19% for raters judging texts with five levels of AI use [5] and 26.5% for medical professionals judging plastic surgery passages [6]. The pattern is consistent: human judgment is unreliable, especially when the AI text is well-written.

Are AI detection tools any better than humans?

AI detection tools can outperform humans in some controlled tests, but they are inconsistent, easily fooled, and prone to false accusations. For example, GPTZero correctly identified 100% of AI-generated scientific essays and 95.5% of human-written ones in one study, though it relied on only a few key sentences to do so [1], and ZeroGPT achieved 91% accuracy on medical school essays [4]. However, other studies paint a messier picture: when three different detectors (ZeroGPT, PhraslyAI, Grammarly) were tested on the same texts, they could statistically separate different levels of AI use, but their absolute scores varied considerably and agreement was uneven (ICC 0.57-0.95) [5]. A comparison of four tools found F-scores for AI content ranging from 25% to 99% [7]. Another test found that a paid detector assigned a 0% AI likelihood to some human-written personal statements but an 84% likelihood to others, a huge, unreliable swing [2]. And when text is a mix of human and AI writing (e.g., a student editing an AI draft), detectors become much less useful: in one study, none of four tools could reliably distinguish mixed content from fully AI content [2]. False positives are a serious problem: even with strong overall discrimination (AUC 0.962), detectors at their optimal cutoff settings misclassified 25-50% of human-written passages as AI [6].
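
To see why a false-positive rate in that range matters in practice, the arithmetic can be sketched with Bayes' rule. This is a minimal illustration, not a calculation from any of the cited studies: the function name `flagged_precision`, the 90% detection rate, and the 20% share of AI-written submissions are hypothetical values chosen for the example. Only the 25% and 50% false-positive rates come from [6].

```python
def flagged_precision(sensitivity: float, false_positive_rate: float, ai_share: float) -> float:
    """Probability that a text flagged as AI was actually AI-generated (Bayes' rule)."""
    true_flags = sensitivity * ai_share                    # AI texts correctly flagged
    false_flags = false_positive_rate * (1.0 - ai_share)   # human texts wrongly flagged
    return true_flags / (true_flags + false_flags)


# Hypothetical scenario: 20% of submissions are AI-written and the detector
# catches 90% of them. Only the false-positive rates are taken from [6].
for fpr in (0.25, 0.50):
    precision = flagged_precision(sensitivity=0.90, false_positive_rate=fpr, ai_share=0.20)
    print(f"false-positive rate {fpr:.0%}: {precision:.0%} of flagged texts are actually AI")
```

Under these assumed numbers, fewer than half of the flagged texts would actually be AI-written, which is why a flag on its own is weak evidence against an individual writer.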

What makes AI text so hard to detect — and what might work better?

AI text is hard to detect because modern language models are trained to mimic human writing patterns so closely that the statistical differences are subtle and easily erased. AI-generated text tends to have lower "perplexity" (meaning each word is more predictable from the words before it), simpler vocabulary, and more repetitive sentence structures [1][3]. But these differences shrink when the AI is prompted to write at a professional level or when a human lightly edits the output [5][8]. Some advanced methods show promise: a stylometric model using 31 writing-style features achieved 81% accuracy on a multi-domain dataset and 98% on an education-specific one [12], and a dual-channel approach combining probability-curvature analysis with dynamic watermarking reached 95.4% accuracy with minimal quality loss [10]. Another technique called LLI (Linear Leaky Input) improved detection F-scores by about 55% over the best existing model by focusing on context relevance [11]. However, these are research tools, not widely available commercial detectors, and they still struggle with edited or mixed content. The most reliable approach may be embedding invisible watermarks during AI generation itself [10], but that requires cooperation from AI developers.
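
The perplexity signal mentioned above can be illustrated with a short sketch. It uses the standard definition (the exponential of the average negative log-probability per token). The per-token probabilities are made-up values standing in for what a language model might assign, and real detectors combine this signal with others, so this is an illustration of the idea and not how any named tool works.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical per-token probabilities a language model might assign.
predictable_text = [0.60, 0.50, 0.70, 0.55, 0.65]  # each word is easy to guess
surprising_text = [0.10, 0.30, 0.05, 0.20, 0.15]   # each word is hard to guess

print(f"predictable: {perplexity(predictable_text):.2f}")
print(f"surprising:  {perplexity(surprising_text):.2f}")
```

The more predictable sequence yields the lower perplexity. Light human editing or a different prompt can change the token probabilities enough to blur this gap, which fits the shrinking differences described above.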

Sources used in this answer

[1] Scientific Writing in the Era of Large Language Models: A Computational Analysis of AI- Versus Human-Created Content

Human experts misclassified 31.8% of human-written essays as AI; GPTZero correctly identified 100% of AI essays and 95.5% of human essays, but relied on only a few key sentences.

2025 · Rohan Khera, Aline F Pedroso, Vipina K Keloth, Hua Xu, Gisele S Silva, Lee H Schwamm · Stroke

[2] Accuracy of Artificial Intelligence Detection Software for Residency Personal Statements

Four AI detection tools assigned highly variable AI-likelihood scores to human-written personal statements (0-84%), and none could reliably distinguish mixed human-AI content from fully AI content.

2025 · Thomas Pak, Enrique Chiu Han, Cesar Eber Montelongo Hernandez, Kristina Collins, Alana Carrasco, Arya Nekovei, Darlene King, Diana M Robinson, Adam Brenner · Journal of Graduate Medical Education

[3] Detecting Artificial Intelligence-Generated Personal Statements in Professional Physical Therapist Education Program Applications: A Lexical Analysis

Recurrence quantification analysis (RQA) differentiated ChatGPT from human personal statements with 70% sensitivity and 91.4% specificity using a 13% determinism threshold.

2024 · John H Hollman, Beth A Cloud-Biebl, David A Krause, Darren Q Calley · Physical Therapy

[4] Death of the Personal Statement: A Qualitative Comparison Between Human-Authored and Artificial Intelligence-Generated Medical School Admissions Essays

Medical school application readers correctly identified AI authorship only 56% of the time; AI-generated essays scored higher on quality than human essays (5.02 vs 4.67 on a 7-point scale).

2025 · Matthew J Vaccaro, Ishpriya Sharma, Andrea P Espina Rey, Nicole Lyman, C. Palacios, Yu Zhang, Aashna Mehta, Angelo A. Leto Barone, Brian Kellogg · Journal of the American College of Surgeons

[5] Ability of AI detection tools and humans to accurately identify different forms of AI-generated written content

Three AI detection tools could statistically distinguish five levels of AI use, but their absolute scores varied significantly (ICC 0.57-0.95), and human raters achieved only 19% accuracy.

2025 · Adam Cheng, Yiqun Lin, Gabriel Reedy, Christine Joseph, Samantha Wirkowski, Viviane Mallette, Vikhashni Nagesh, David Krieser, Aaron Calhoun · Advances in Simulation (London, England)

[6] Evaluating the Detection Accuracy of AI-Generated Content in Plastic Surgery: A Comparative Study of Medical Professionals and AI Tools

Medical professionals correctly identified passage origin only 26.5% of the time; AI detection tools showed strong discriminatory power (AUC=0.962) but had false-positive rates of 25-50% at optimal cutoffs.

2025 · Keenan S. Fine, Emily E. Zona, Aidan W. O’Shea, Ellen C. Shaffrey, Pradeep K. Attaluri, Peter J. Wirth, A. Dingle, S. Poore · Plastic and Reconstructive Surgery

[7] Precision of academic plagiarism detection: A descriptive analysis of Artificial Intelligence verifiers

Four plagiarism detection tools varied widely in F-scores for AI content: Copyleaks 99%, Content at Scale 79%, ZeroGPT 69%, Scribbr 25%.

2026 · Luis Ebano Amor Oliva, Erika Guadalupe May Guillermo · Emerging Trends in Education

[8] Do humans identify AI-generated text better than machines? Evidence based on excerpts from German theses

University lecturers identified AI-generated texts only 57% of the time; professional-level AI text fooled over 80% of respondents, and human and machine performance did not differ significantly.

2025 · Alexandra Fiedler, Jörg Döpke · International Review of Economics Education

[9] As Good as a Coin Toss: Human Detection of AI-Generated Content

Across 1,276 participants, average detection of synthetic media was near chance (50%); accuracy dropped further with foreign languages, single-modality media, and human faces in images.

2024 · Di Cooke, Abigail Edwards, S. Barkoff, Kathryn Kelly · Communications of the ACM

[10] CurveMark: Detecting AI-Generated Text via Probabilistic Curvature and Dynamic Semantic Watermarking

A dual-channel detection framework (CurveMark) combining probability curvature and dynamic watermarking achieved 95.4% accuracy with minimal quality degradation (perplexity increase <1.3).

2025 · Yuhan Zhang, Xingxiang Jiang, Hua Sun, Yao Zhang, Deyu Tong · Entropy (Basel, Switzerland)

[11] Research on weight leakage input algorithm and artificial intelligence generated text detection

The LLI (Linear Leaky Input) algorithm improved AI text detection F-scores by 55.07% over the best existing model (chatgpt-detector-roberta) by enhancing context-relevance learning.

2025 · Hao Fang, Weijian Chen · Proceedings of the 2025 3rd International Conference on Artificial Intelligence, Systems and Network Security

[12] StyloAI: Distinguishing AI-Generated Content with Stylometric Analysis

StyloAI, using 31 stylometric features and a Random Forest classifier, achieved 81% accuracy on a multi-domain dataset and 98% on an education-specific dataset.

2024 · Chidimma Opara · AIED Companion
