On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories

Source: arXiv:2604.24450 · Published 2026-04-27 · By Syeda Kaneez Fatima, Yousuf Abrar, Abdul Rehman Tahir, Amelia Nawaz, Shamsa Abid, Abdul Ali Bangash

TL;DR

This paper studies reviewer bots in “agentic” pull requests, i.e., PRs created by autonomous coding agents, and asks whether the bots’ feedback characteristics relate to PR acceptance and resolution time. The authors analyze reviewer-bot comments from the AI_Dev dataset and focus on two dimensions: feedback quality (relevance, clarity, conciseness) and activity volume (comment count). The core idea is not to benchmark a new model, but to characterize bot review behavior at scale and connect that behavior to workflow outcomes using nonparametric statistics.

The main result is blunt: reviewer bots are usually polite and concise but only moderately relevant to the code changes they comment on, and more bot comments on a PR are associated with a longer time to resolution. The paper finds that feedback volume matters more than feedback quality for workflow delay, while quality metrics have weak or inconsistent associations with acceptance. The authors argue this points toward a dilution problem: as bots generate more comments on the same PR, the average relevance and clarity of those comments drop, which suggests that bot reviewers should be more selective and context-aware rather than more prolific.

Key findings

The study analyzes 7,416 reviewer-bot comments across 4,532 agentic PRs from the AI_Dev dataset; after filtering, the bot comments come from 29 distinct bots.
Reviewer-bot comments are overwhelmingly civil: 99.8% civil vs 0.2% uncivil, with uncivil language reportedly limited to rare cases like using “useless” in a review comment.
The most common comment types are Bugfix (14.0%), Testing (11.8%), and Documentation (11.6%); Refactoring (5.5%) and Logging (1.7%) are rare, while “Other” covers 55.5%.
Average feedback scores are 6.91 for relevance, 6.96 for clarity, and 8.69 for conciseness across all bot comments.
Bot comment count correlates positively with PR resolution time: Spearman ρ = 0.19 with Holm-Bonferroni adjusted p = 7.312e-13.
Bot comment count correlates negatively with mean relevance (ρ = -0.20, adjusted p = 8.648e-20) and mean clarity (ρ = -0.19, adjusted p = 1.39e-15), supporting a dilution effect as volume rises.
Mean relevance differs significantly between accepted and unaccepted PRs after correction (adjusted p = 0.002), but the difference is small in magnitude, and the paper reports no meaningful association between the feedback-quality metrics and workflow outcomes overall.
Mean relevance shows a small negative correlation with PR acceptance (ρ = -0.09, adjusted p = 0.0006), while mean conciseness shows a small positive correlation (ρ = 0.06, adjusted p = 0.01).
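The correlations in this list are Spearman rank correlations. As a minimal sketch of what that statistic computes (not the authors' code), the pure-Python version below ranks both variables, averages the ranks of tied values, and takes the Pearson correlation of the ranks. The function names and the sample values are hypothetical and invented for illustration; they are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with order[start].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical PR-level values, invented for illustration only.
comment_counts = [1, 1, 2, 3, 5, 8]
mean_relevance = [8.0, 7.5, 7.0, 7.0, 6.0, 5.5]
rho = spearman_rho(comment_counts, mean_relevance)
print(f"rho = {rho:.3f}")  # negative: more comments, lower mean relevance
```

The toy data is deliberately far more monotone than the real corpus, where the reported coefficients are around ±0.2. This sketch does not compute p-values; a statistics library would normally supply them.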

Threat model

The relevant adversary is not a malicious actor but the emergent interaction between autonomous code generators and autonomous reviewer bots in OSS GitHub workflows. The paper assumes reviewer bots can see the PR diff and produce comments, but they cannot access hidden maintainer deliberations, bot internals, or post hoc human review reasoning. It also assumes the observed PR metadata and comment text are sufficient to infer associations with merge outcomes and resolution time, while acknowledging that unobserved confounders like repository policy, PR size, or task complexity may influence both bot behavior and workflow outcomes.

Methodology — deep read

The paper’s threat model is implicit rather than adversarial in the security sense: the authors study the observable behavior of reviewer bots acting on agentic PRs in OSS GitHub repositories, not an active attacker trying to evade detection. The “adversary,” if any, is the workflow itself: autonomous code generation plus autonomous review may create noisy, low-signal feedback loops that slow PR resolution. The paper assumes reviewer-bot comments and associated diff hunks are sufficient proxies for review behavior; it does not inspect bot internal logic, hidden prompts, or maintainer-side human discussions. It also assumes PR outcomes (merged vs closed, and time-to-merge for merged PRs) are reasonable endpoints for workflow efficiency.

For data, the source is the AI_Dev dataset by Li et al. (2025), described as a large-scale empirical corpus of AI-based PRs, review comments, and metadata from GitHub OSS projects. The paper starts from reviewer comments paired with diff_hunk text, then filters out non-bot comments. The final comment-level corpus is 7,416 reviewer-bot comments from 29 bots across 4,532 agentic PRs. For the outcome analysis, they further define a cohort of 4,532 PRs with bot reviews: 3,054 accepted and 1,478 unaccepted. To reduce survival bias, they define a cutoff timestamp t_last as the creation time of the most recently accepted PR and exclude PRs created after that point, so recently opened PRs are not misclassified as unresolved simply because they had less time to close. The paper does not report a train/test split because this is not a prediction task; it is an observational study. Preprocessing details beyond bot-comment filtering and the survival-bias cutoff are not expanded in the text provided.
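The cutoff rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads "most recently accepted PR" as the PR with the latest merge timestamp, and the `PullRequest` fields, the function names, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PullRequest:
    number: int
    created_at: datetime
    merged_at: Optional[datetime]  # None if the PR was not accepted


def cutoff_timestamp(prs):
    """Return t_last: the creation time of the most recently accepted PR."""
    accepted = [pr for pr in prs if pr.merged_at is not None]
    if not accepted:
        return None
    latest = max(accepted, key=lambda pr: pr.merged_at)
    return latest.created_at


def apply_cutoff(prs):
    """Drop PRs created after t_last so that very recent PRs are excluded."""
    t_last = cutoff_timestamp(prs)
    if t_last is None:
        return []
    return [pr for pr in prs if pr.created_at <= t_last]


# Hypothetical sample, invented for illustration only.
prs = [
    PullRequest(1, datetime(2025, 1, 1), datetime(2025, 1, 3)),
    PullRequest(2, datetime(2025, 1, 5), None),
    PullRequest(3, datetime(2025, 1, 10), datetime(2025, 1, 12)),
    PullRequest(4, datetime(2025, 1, 11), None),  # created after t_last
]
kept = apply_cutoff(prs)
print([pr.number for pr in kept])
```

In this sample, PR 4 is excluded because it was opened after the creation time of the last accepted PR, so it has had less time to be resolved than any PR that remains in the cohort.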

Methodologically, the authors adopt the review-categorization framework from Sghaier et al. (2025). Each comment is classified along three categorical axes: Type, Nature, and Civility. Type covers the review focus, such as Bugfix, Testing, Documentation, Refactoring, Logging, or Other. Nature captures communicative style, such as Prescriptive, Clarification, Descriptive, or Other. Civility is binary: Civil or Uncivil. They also score each comment on three ordinal dimensions from 1 to 10: Relevance, Clarity, and Conciseness. Relevance is defined as pertinence to the code change, clarity as how well the comment conveys its message, and conciseness as brevity/efficiency. The authors use GPT-5.1 as an automated annotator, feeding it the comment text and the associated diff_hunk plus a prompt containing the full rubric and an example. A random subset of 200 comments is manually validated in parallel before scaling the annotation to the rest of the corpus. One concrete example of the pipeline: a bot comment attached to a diff_hunk is passed to GPT-5.1; the model assigns a Type (say, Testing), Nature (say, Prescriptive), Civility (Civil), then gives relevance/clarity/conciseness scores; these scores are later aggregated per PR by averaging across that PR’s comments.
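The per-PR aggregation step described above can be sketched in a few lines. The record layout below is an assumption made for illustration (the field names and the sample rows are hypothetical, not the replication kit's schema); only the idea of averaging each score across a PR's comments and counting the comments comes from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Annotation:
    """One annotated reviewer-bot comment (field names are hypothetical)."""
    pr_id: int
    comment_type: str  # e.g. "Bugfix", "Testing", "Other"
    nature: str        # e.g. "Prescriptive", "Descriptive"
    civility: str      # "Civil" or "Uncivil"
    relevance: int     # ordinal score, 1 to 10
    clarity: int
    conciseness: int


def aggregate_per_pr(annotations):
    """Average the three quality scores per PR and count the bot comments."""
    grouped = defaultdict(list)
    for ann in annotations:
        for score in (ann.relevance, ann.clarity, ann.conciseness):
            if not 1 <= score <= 10:
                raise ValueError(f"score {score} is outside the 1-10 rubric range")
        grouped[ann.pr_id].append(ann)
    return {
        pr_id: {
            "comment_count": len(items),
            "mean_relevance": mean(a.relevance for a in items),
            "mean_clarity": mean(a.clarity for a in items),
            "mean_conciseness": mean(a.conciseness for a in items),
        }
        for pr_id, items in grouped.items()
    }


# Hypothetical annotations, invented for illustration only.
annotations = [
    Annotation(101, "Testing", "Prescriptive", "Civil", 8, 7, 9),
    Annotation(101, "Other", "Descriptive", "Civil", 6, 9, 9),
    Annotation(102, "Bugfix", "Prescriptive", "Civil", 9, 8, 10),
]
per_pr = aggregate_per_pr(annotations)
print(per_pr[101])
```

The resulting PR-level rows (comment count plus three mean scores) are the independent variables that the outcome analysis compares against acceptance and resolution time.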

The manual validation is a key methodological step. Two authors independently annotate 200 random comments, and a third resolves disagreements. Agreement between human annotators and GPT-5.1 is reported as Cohen’s kappa for categorical labels and Krippendorff’s alpha for ordinal scores, with 95% confidence intervals. The reported values are high: Nature κ = 0.89, Type κ = 0.86, Civility κ = 1.0; relevance α = 0.94, conciseness α = 0.95, clarity α = 0.87. This is used to justify applying the GPT-5.1 annotator to the full set of 7,132 remaining comments after the 200-comment validation subset. The paper says the automated annotations include a rationale for each classification/scoring decision, and that the prompt template is in the replication kit, but it does not report the exact prompt text in the excerpt.
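For readers unfamiliar with the agreement statistic used for the categorical labels, here is a minimal sketch of Cohen's kappa for two raters. It is the standard formula, (observed agreement − chance agreement) / (1 − chance agreement), not the authors' script; the labels are invented for illustration, and returning 1.0 in the degenerate single-label case is a convention chosen for this sketch. Krippendorff's alpha for the ordinal scores is more involved and is not shown.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical Type labels, invented for illustration only.
human = ["Testing", "Bugfix", "Other", "Other", "Testing", "Other"]
model = ["Testing", "Bugfix", "Other", "Testing", "Testing", "Other"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

The degenerate case matters for heavily skewed labels such as Civility, where nearly every comment falls into one class and chance agreement is close to 1.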

For evaluation, the authors use a two-step nonparametric analysis of PR outcomes. First, for resolution time, they compute Spearman rank correlations between each PR-level independent variable and PR resolution time, where resolution time is measured from PR creation to acceptance for merged PRs. Second, for acceptance, they use the Mann–Whitney U test to compare accepted versus unaccepted PRs on the distributions of mean relevance, mean clarity, mean conciseness, and bot comment count. They apply Holm-Bonferroni correction to control family-wise error rate across multiple tests. The reported figures show the direction and magnitude of correlations: Fig. 1 plots mean relevance and mean conciseness against acceptance rate; Fig. 2 plots all three quality metrics plus comment count against resolution time; Fig. 3 plots comment count against mean relevance/clarity to illustrate the dilution effect. The paper does not report regression models, causal identification, or matched controls, so the findings are correlational rather than causal. Reproducibility is moderately good: the authors state that a replication kit is available via Zenodo, including the prompt template, but the dataset itself is AI_Dev rather than a new released corpus, and the full annotation outputs are only described, not exhaustively reproduced in the excerpt.
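The Holm-Bonferroni step mentioned above is simple enough to sketch directly. This is the standard step-down procedure, written as an illustration rather than taken from the replication kit; the function name and the raw p-values are hypothetical.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original input order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, invented for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_bonferroni(raw)
print(adjusted)
```

A hypothesis is rejected at level α when its adjusted p-value is at most α. Compared with plain Bonferroni, which multiplies every p-value by m, Holm's procedure controls the same family-wise error rate while being uniformly less conservative.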

Technical innovations

Introduces an empirical link between reviewer-bot feedback characteristics and agentic PR outcomes, rather than treating bot review as a black box.
Operationalizes reviewer-bot feedback quality with a three-part rubric adapted from Sghaier et al. (2025): relevance, clarity, and conciseness.
Uses GPT-5.1 as a high-throughput annotator for bot-review comment taxonomy, validated against 200 manually labeled comments with high agreement.
Defines a PR-level aggregation scheme that compares feedback quality and activity volume against merge outcome and resolution time with nonparametric tests and Holm-Bonferroni correction.

Datasets

AI_Dev — 7,416 reviewer-bot comments on 4,532 agentic PRs from GitHub OSS projects — public dataset by Li et al. (2025)
Manual validation subset — 200 randomly sampled comments — authors’ annotation sample
Outcome cohort — 4,532 PRs (3,054 accepted, 1,478 unaccepted) — derived from AI_Dev after cutoff filtering

Baselines vs proposed

Accepted vs unaccepted PRs on mean relevance: Mann–Whitney U significant after Holm-Bonferroni, adjusted p = 0.002; proposed analysis finds that the two groups differ, with the direction (per the Spearman result below) being slightly lower mean relevance among accepted PRs
Bot comment count vs PR resolution time: Spearman ρ = 0.19, adjusted p = 7.312e-13; proposed analysis finds more bot comments associated with slower resolution
Bot comment count vs mean relevance: Spearman ρ = -0.20, adjusted p = 8.648e-20; proposed analysis finds higher volume associated with lower relevance
Bot comment count vs mean clarity: Spearman ρ = -0.19, adjusted p = 1.39e-15; proposed analysis finds higher volume associated with lower clarity
Mean relevance vs PR acceptance: Spearman ρ = -0.09, adjusted p = 0.0006; proposed analysis finds a small negative correlation
Mean conciseness vs PR acceptance: Spearman ρ = 0.06, adjusted p = 0.01; proposed analysis finds a small positive correlation
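To make the accepted-versus-unaccepted comparison concrete, the sketch below computes the Mann–Whitney U statistic by direct pair counting, plus the share of pairs in which the first group scores higher. It is an illustration of the statistic only: it does not compute a p-value (in practice a library routine such as `scipy.stats.mannwhitneyu` would), the helper names are hypothetical, and the sample scores are invented.

```python
def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: number of (a, b) pairs with a > b, ties counting one half."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def probability_of_superiority(sample_a, sample_b):
    """Share of pairs in which sample_a beats sample_b (0.5 means no tendency)."""
    return mann_whitney_u(sample_a, sample_b) / (len(sample_a) * len(sample_b))


# Hypothetical per-PR mean relevance scores, invented for illustration only.
accepted = [6.0, 7.0, 7.5]
unaccepted = [7.0, 8.0]
u_accepted = mann_whitney_u(accepted, unaccepted)
share = probability_of_superiority(accepted, unaccepted)
print(u_accepted, share)
```

Pair counting is quadratic in the sample sizes but easy to verify by hand; rank-based formulas give the same U more efficiently. A share below 0.5, as in this toy sample, corresponds to the first group tending to score lower, which is the direction the small negative relevance correlation describes.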

Limitations

The framework collapses 55.5% of comments into an “Other” category, which weakens interpretability and may hide important bot-review subtypes.
The study is correlational; it does not establish that bot comment volume causes longer resolution times or that relevance causes acceptance.
Only the AI_Dev dataset is analyzed, so generalization to newer agent ecosystems, other repositories, or post-2025 toolchains is uncertain.
The paper observes comment text and diff hunks but cannot inspect the bot’s internal reasoning, prompts, or whether comments were acted on by humans before merge.
Acceptance analysis uses merged vs closed-without-merge outcomes, but the excerpt does not show controls for project size, PR complexity, or repository-specific norms.
The automated annotator is validated on only 200 comments, which is respectable for agreement checks but still a small slice (roughly 2.7%) of the 7,416-comment corpus.

Open questions / follow-ons

Can the observed dilution effect be reproduced under controlled experiments where comment volume is held constant and relevance is manipulated directly?
What semantic or task-specific metrics would better predict whether a bot comment is actionable for agentic PRs than the 1–10 relevance/clarity/conciseness rubric?
Where is the optimal operating point between bot comment volume and signal quality, and does it vary by repository or PR type?
Do human maintainers respond differently to bot comments than to human comments on the same agentic PRs, and does that mediate resolution time?

Why it matters for bot defense

For bot defense practitioners, this paper is a reminder that automation quality is not just about raw throughput; volume can actively degrade usefulness. In a CAPTCHA or bot-defense setting, a high rate of machine-generated signals can overwhelm reviewers or downstream moderation workflows if the signals are only loosely aligned with the underlying artifact. The paper’s main practical lesson is to favor targeted, high-relevance interventions and to monitor whether increased automation is creating noise rather than signal.

A second takeaway is methodological: if you are building detection, review, or triage systems, you should measure not only how many events the system produces but also whether those events remain semantically tied to the item under review. The authors’ “dilution” finding suggests that activity counts can be misleading on their own. For CAPTCHA practitioners, that maps cleanly to bot-detection pipelines: more classifier alerts or more automated comments are not inherently better if precision and contextual relevance fall as volume rises.

Cite

bibtex

@article{arxiv2604_24450,
  title={On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories},
  author={Syeda Kaneez Fatima and Yousuf Abrar and Abdul Rehman Tahir and Amelia Nawaz and Shamsa Abid and Abdul Ali Bangash},
  journal={arXiv preprint arXiv:2604.24450},
  year={2026},
  url={https://arxiv.org/abs/2604.24450}
}
