Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Source: arXiv:2605.21713 · Published 2026-05-20 · By André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

TL;DR

This paper addresses the challenge of detecting whether peer reviews are human-written, fully AI-generated, or human-written but refined using large language models (LLMs). Unlike prior work focused on superficial textual features, Sem-Detect introduces a novel semantic-level approach that analyzes the claims and ideas expressed in reviews. The key insight leveraged is that different AI models tend to produce reviews with overlapping, convergent claims, whereas humans contribute more diverse and unique judgments. By comparing a target review against multiple AI-generated reviews of the same paper at the claim level, Sem-Detect effectively distinguishes between the three authorship categories, including LLM-polished human reviews—a subtler challenge that existing detectors struggle with.

Evaluated on a large dataset of over 20,000 reviews from leading ML conferences (ICLR and NeurIPS), Sem-Detect improves true positive rates by 25.5% over previous best domain-specific baselines at a stringent 0.1% false positive rate in binary detection. In the three-class setting, it achieves over 91% accuracy on fully AI and LLM-refined classes, with under 3.5% of LLM-refined reviews misclassified as fully AI-generated. The method also demonstrates robustness under out-of-distribution conditions and cross-domain transfer to medical imaging reviews without retraining, highlighting its practical applicability. Sem-Detect thus offers a more nuanced and semantically grounded detection framework tailored to the peer review context, preserving intellectual integrity while allowing legitimate LLM usage for polishing.

Key findings

Sem-Detect achieves an AUC of 0.999 and [email protected]% FPR of 0.760 on binary human vs fully AI-generated peer review detection, improving 25.5% over the strongest baseline EditLens (TPR 0.606).
In three-class classification (human, LLM-refined human, fully AI), Sem-Detect correctly identifies 91.18% of AI-generated and 91.61% of LLM-refined reviews, with only 0.66% of human reviews misclassified as AI-generated.
Less than 3.5% of LLM-refined human reviews are misclassified as fully AI-generated, demonstrating that semantic signals of human judgment persist despite LLM polishing.
Semantic features based on mean best-match claim similarity robustly separate AI-generated reviews (median similarity ≈0.73) from human and LLM-refined ones (median ≈0.64), validating the key hypothesis about claim-level convergence.
Sem-Detect’s confidence scores are well-calibrated, enabling trade-offs between automatic classification coverage and accuracy; at a 0.80 confidence threshold, 79% of reviews remain auto-classified with 94.7% accuracy.
Under out-of-distribution shifts with unseen LLMs and prompting templates, three-class Macro-F1 drops from 0.84 (in-distribution) to 0.68–0.71, but AI precision increases to 0.96–0.97, indicating conservative, reliable AI predictions.
Cross-domain evaluation on medical imaging peer reviews (MIDL 2022) shows comparable or higher F1 scores across classes without retraining, demonstrating robustness beyond ML conferences.
On recent ICLR 2026 reviews, Sem-Detect detects AI-generated content consistent with rising trends reported in literature, validating its contemporary relevance.

Threat model

The adversary consists of peer-review authors or agents who use LLMs either to generate entire reviews or to refine human-written reviews. They aim to produce convincing reviews that may mislead detection systems. They do not have control over all reference AI-generated reviews used for comparison, nor perfect knowledge or capability to avoid semantic claim-level detection. They cannot generate perfectly unique claim sets indistinguishable from human reviewers across multiple AI models.

Methodology — deep read

Threat Model & Assumptions: The adversary includes authors or reviewers who may generate peer reviews fully by AI models or produce human-written reviews refined or polished with LLMs. The detector must distinguish between fully AI-generated text and human-originated ideas—even if the human text has been edited by an LLM. It assumes access to the paper under review and multiple AI-generated reviews of the same paper. The adversary cannot perfectly mimic human claim diversity or evade claim-level semantic comparisons.
Data: The dataset consists of 20,165 peer reviews from 800 papers submitted to ICLR and NeurIPS conferences (2021-2022). It includes three classes: 3,065 authentic human-written reviews, 6,768 fully AI-generated reviews created by prompting four different LLMs to produce reviews simulating reviewer scores, and 12,332 human reviews refined using LLMs instructed not to add new content but to improve clarity. Reviews were cleaned to remove artifacts revealing generation. Additional evaluation data includes cross-domain reviews from a medical imaging conference (MIDL 2022) and recent submissions to ICLR 2026.
Architecture / Algorithm: Sem-Detect processes target reviews by pairing each with k=3 AI-generated reference reviews for the same paper and same review score to control for evaluation biases. Each review is segmented into atomic claims via an LLM, categorizing claims into types (e.g., evaluation, constructive input, clarification). It extracts textual features (perplexity, entropy, token likelihood, Fast-DetectGPT score) and semantic features computed from claim embeddings (cosine similarities between claims in the target review and reference AI reviews). Key semantic features include proportion of claims exceeding a similarity threshold, mean similarity scores, intra-review semantic diversity, and log of claim count.

Feature vectors consisting of 5 semantic and 4 textual features feed into a LightGBM classifier trained for three-class classification (human, LLM-refined, fully AI). Hyperparameters are tuned via randomized search optimizing macro-F1 with stratified five-fold cross-validation by paper.

Training Regime: The models are trained on 80% of the data and tested on 20%, split at the paper level to avoid leakage. Training uses boosted decision trees on the combined feature vectors. Claim embeddings are obtained from a dedicated LLM embedding model. Random seeds and hardware details are not explicitly listed, but standard practices apply.
Evaluation Protocol: Two main settings are evaluated: binary classification (human vs fully AI, excluding polished reviews) and three-class classification including LLM-refined class. Metrics include ROC-AUC, True Positive Rate at fixed low False Positive Rates (0.1%, 1%), and macro-F1 for multiclass. Baselines include both general-purpose detectors (Fast-DetectGPT, RADAR, Binoculars) and domain-specific methods (TF Model, Anchor, EditLens). Additional experiments test robustness to OOD shifts with unseen LLM models and prompt templates, cross-domain generalization to medical imaging peer review, and analysis on current conference reviewing data. Confidence calibration and thresholding are explored to trade coverage and accuracy.
Reproducibility: The dataset of 20k+ reviews and code for data processing, claim extraction, feature computation, and modeling are constructed and planned for release, facilitating reproducibility. Some LLMs used for generation and embeddings are proprietary or large-scale but clearly documented. Exact code release details are not specified but the paper implies openness.

Example End-to-End: For a given target peer review of a paper, Sem-Detect obtains three AI-generated reference reviews sharing the same evaluation score. It segments the target review into claims capturing individual judgments. For each claim, it computes embedding-based cosine similarities to claims in each reference review, aggregates these to produce semantic features reflecting claim overlap with synthetic AI reviews. It computes textual statistics on the raw review. The combined 9-D feature vector is fed into the LightGBM classifier, which outputs the predicted authorship class with calibrated confidence. Application of confidence thresholding enables uncertain predictions to be flagged for manual inspection.

Technical innovations

Combining claim-level semantic analysis of peer review content with textual statistical features for authorship detection, rather than relying solely on surface text signals.
Leveraging the observation that AI-generated reviews from different LLMs converge on similar claims, while humans produce more diverse and unique claims, enabling discriminative semantic similarity features.
Operationalizing semantic segmentation into atomic, claim-level units (across evaluation, constructive input, and clarification dialogue categories) to enhance interpretability and precision in semantic comparison.
Introducing a three-class classification framework that distinguishes fully AI-generated reviews, human-written reviews, and human reviews refined by LLMs, addressing the nuanced use of LLM assistance.
Using multiple AI-generated reference reviews per paper and matching by evaluation score to control for semantic confounders when comparing target reviews.

Datasets

ICLR and NeurIPS peer reviews — 20,165 reviews across 800 papers; publicly crawled from OpenReview with synthetic labels for AI/LLM-refined classes
MIDL 2022 medical imaging peer reviews — ~100 papers with human reviews and synthetic AI/LLM-refined reviews for cross-domain testing; publicly available on OpenReview
ICLR 2026 peer reviews — used for analysis but ground truth labels not specified.

Baselines vs proposed

LogRank: AUC = 0.576 vs Sem-Detect: 0.999
MAGE: AUC = 0.699 vs Sem-Detect: 0.999
Fast-DetectGPT: [email protected]% FPR = 0.021 vs Sem-Detect: 0.760
RADAR: AUC = 0.965 vs Sem-Detect: 0.999
Anchor: [email protected]% FPR = 0.541 vs Sem-Detect: 0.760
EditLens: [email protected]% FPR = 0.606 vs Sem-Detect: 0.760
TF Model (domain-specific): AUC = 0.926 vs Sem-Detect: 0.999

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.21713.

Fig 1

Fig 1: Classical AI-text detectors rely on textual features to

Fig 2

Fig 2 (page 1).

Fig 3

Fig 3 (page 1).

Fig 2

Fig 2: Sem-Detect pipeline. We construct our dataset by prompting LLMs to generate fully AI reviews from conference papers and

Fig 3

Fig 3: ROC curves on the binary-setting. LLM-

Fig 4

Fig 4: ROC curves for the collapsed binary task. Human and

Fig 7

Fig 7 (page 4).

Fig 8

Fig 8 (page 4).

Limitations

Performance degrades under out-of-distribution conditions (unseen LLM models and prompts), with macro-F1 dropping from 0.84 to ~0.68, indicating sensitivity to generation shifts.
LLM-refined human reviews remain the hardest class to distinguish, resulting in some misclassification, especially human reviews being classified as LLM-refined (35.38%), due in part to class imbalance.
Evaluation is primarily limited to ML conference peer reviews and one medical imaging venue; broader generalization to non-ML fields with different review styles remains untested.
Some reliance on proprietary or large-scale LLMs for generation and embedding, which may pose reproducibility or deployment challenges.
The method presumes availability of multiple AI-generated reference reviews targeting the same paper and score, which might not be feasible in all settings.
Potential risks remain if adversaries develop methods to diversify AI-generated claims or mimic human claim patterns closely.

Open questions / follow-ons

How to extend semantic claim-level detection techniques to other domains and languages where review styles or norms differ significantly?
Can adversarial training or augmentation with more diverse AI-generated claims improve robustness against emerging generative models designed to evade detection?
How to automate and scale claim extraction and semantic segmentation for diverse and complex review texts without supervision?
What additional metadata or contextual signals (e.g., reviewer identity patterns, submission timestamps) can complement semantic analysis for improved authorship attribution?

Why it matters for bot defense

For bot-defense practitioners, Sem-Detect exemplifies how contextual, semantic-level understanding combined with reference data comparison improves attribution beyond surface text features. This principle can be applied in CAPTCHA and bot-detection scenarios where distinguishing automated from human-generated content involves examining unique intent, reasoning, or claim diversity rather than lexical cues alone. Additionally, the approach's robustness strategies, such as confidence thresholding and out-of-distribution testing, provide practical defense architectures against adaptive adversaries. While developed for peer review, the concept of leveraging multiple reference outputs to detect collusion or duplication could inspire more resilient bot-detection frameworks that incorporate semantic fingerprints rather than just syntactic or probabilistic text measures.

Cite

bibtex

@article{arxiv2605_21713,
  title={ Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews },
  author={ André V. Duarte and Brian Tufts and Aditya Oke and Fei Fang and Arlindo L. Oliveira and Lei Li },
  journal={arXiv preprint arXiv:2605.21713},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.21713}
}

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​