Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Source: arXiv:2605.13801 · Published 2026-05-13 · By Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

TL;DR

This paper addresses a critical problem in the evaluation of AI and machine learning models: the reproducibility crisis driven by unreliable and non-repeatable human-annotated evaluations. Current standard practice relies on a small number of annotations per item without modeling the variance introduced by individual annotators who label multiple items, causing underestimated p-values and overoptimistic claims of statistical significance. To overcome this, the authors propose a multi-level bootstrapping framework that realistically models annotator behavior across items using persistent rater identifiers. They leverage three distinct multi-level resampling strategies (sampling independent items and raters, sampling raters within items, and sampling stratified batches) to simulate human-annotated datasets more accurately.

Using three large annotated datasets (DICES, Toxicity, and D3code), the study empirically analyzes how accounting for annotator dependencies affects the tradeoffs between the number of items (N) and the number of responses per item (K) needed to achieve reliable null hypothesis significance tests (NHSTs). The results show that failing to model annotators across items significantly underestimates p-values and that realistic modeling requires substantially larger annotation budgets to reach statistical significance. Moreover, distribution-sensitive metrics such as MAE and Jensen-Shannon distance benefit from higher K, so shifting budget toward more responses per item is the more efficient allocation for those metrics. The paper offers quantitative guidance for building more reproducible evaluation protocols that better capture human disagreement and annotator variance in generative AI assessments.

Key findings

Accounting for rater behavior across items results in higher p-value estimates compared to ignoring it, implying prior evaluations may underestimate uncertainty.
On the DICES dataset, the p-value for Accuracy drops from 0.045 under independent sampling (S1) to 0.019 when accounting for rater-item dependencies (S2) at NK=5000 (Table 1).
Distribution-sensitive metrics such as MAE, Wins, and Jensen-Shannon Distance (JSD) achieve statistical significance at a lower total budget (NK=1000) but require more responses per item (K=40 to 100).
Using stratified batch sampling (S3) to model batches of annotators working on isolated item sets increases the required annotation budget by up to 5× compared to S2 (e.g., accuracy on the Toxicity dataset reaches significance at NK=500 under S2 vs NK=2500 under S3).
The optimal number of raters per item (K) depends on the sampling strategy, e.g., K is 40 under S2 but 100 under S1 for MAE on the DICES dataset.
Downsampling the number of raters leads to higher p-values and reduced statistical power, highlighting the importance of large annotator pools.
Models that ignore annotator overlap across items overestimate statistical power and can produce misleadingly low p-values.
Higher annotation budgets and more responses per item are necessary to reliably distinguish model performance under realistic rater behavior assumptions.

Threat model

n/a — The paper does not focus on adversarial threats but on statistical modeling of annotator variance to improve evaluation robustness and reproducibility.

Methodology — deep read

The paper's methodology centers around modeling human annotator behavior across multiple items to improve the reproducibility and statistical reliability of ML model evaluations.

Threat model & assumptions: The adversary is not explicitly defined since this is evaluation methodology research, but the problem is that annotator disagreement and bias introduce variance that is unaccounted for. Annotators label multiple items, introducing dependencies that make independent sampling assumptions invalid. The authors assume annotators do not collude but do have persistent subjective biases.
Data: They use three annotated datasets with multiple human ratings per item and persistent rater identifiers: (a) DICES (350 safety-labeled dialogs, 123 raters, fully crossed design with all raters annotating all items), (b) Toxicity (107,620 social media comments annotated by 17,280 raters in batches of 20 per rater), (c) D3code (4,554 items on offensiveness labeled by 4,309 raters from 21 countries). These datasets provide realistic rater-item interaction graphs.
Architecture/algorithm: The core technical contribution is a multi-level non-parametric bootstrap sampling framework with three variants:
- S1: Sample items and raters independently with replacement, assuming all raters rate all items (common but unrealistic assumption).
- S2: Sample items with replacement; for each sampled item independently sample raters who actually rated that item (models rater-item dependencies).
- S3: Sample stratified batches of items and their associated raters, modeling the batch-wise nature of some datasets (e.g., Toxicity).

These bootstrapping methods generate new simulated datasets reflecting different assumptions about annotator behavior. The simulation framework builds upon prior Variance Estimation Toolkit (VET) work but replaces parametric models with non-parametric bootstrapping to avoid distributional assumptions.
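The three variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation or the VET API: the toy `ratings` dictionary, the `batches` list, and the function names `sample_s1`, `sample_s2`, and `sample_s3` are all hypothetical, and the exact way the paper draws raters in S1 and S3 may differ from this reading.

```python
import random
from collections import defaultdict

# Hypothetical toy data: ratings[(item_id, rater_id)] = binary label.
ratings = {
    ("a", "r1"): 1, ("a", "r2"): 0,
    ("b", "r1"): 1, ("b", "r2"): 1,
    ("c", "r3"): 0, ("c", "r4"): 0,
    ("d", "r3"): 1, ("d", "r4"): 0,
}
# Assumed batch structure for S3: each batch is rated by its own raters.
batches = [["a", "b"], ["c", "d"]]


def raters_by_item(ratings):
    """Map each item to the sorted list of raters who actually rated it."""
    index = defaultdict(list)
    for item, rater in ratings:
        index[item].append(rater)
    return {item: sorted(rs) for item, rs in index.items()}


def sample_s1(ratings, n, k, rng):
    """S1: draw N items and K raters independently from the global pools.

    This presumes a fully crossed design; cells with no observed rating
    are skipped, so response lists can be shorter than K.
    """
    items = sorted({i for i, _ in ratings})
    raters = sorted({r for _, r in ratings})
    picked_items = [rng.choice(items) for _ in range(n)]
    picked_raters = [rng.choice(raters) for _ in range(k)]
    return [
        (item, [ratings[(item, r)] for r in picked_raters if (item, r) in ratings])
        for item in picked_items
    ]


def sample_s2(ratings, n, k, rng):
    """S2: draw N items, then K raters among those who rated each item."""
    by_item = raters_by_item(ratings)
    items = sorted(by_item)
    out = []
    for _ in range(n):
        item = rng.choice(items)
        picked = [rng.choice(by_item[item]) for _ in range(k)]
        out.append((item, [ratings[(item, r)] for r in picked]))
    return out


def sample_s3(ratings, batches, n_batches, k, rng):
    """S3: draw whole batches, then reuse the same K raters for every item in a batch."""
    by_item = raters_by_item(ratings)
    out = []
    for _ in range(n_batches):
        batch = rng.choice(batches)
        # Assumption: the raters of the first item stand in for the batch's rater pool.
        picked = [rng.choice(by_item[batch[0]]) for _ in range(k)]
        for item in batch:
            out.append((item, [ratings[(item, r)] for r in picked if (item, r) in ratings]))
    return out


rng = random.Random(0)
print(len(sample_s1(ratings, 3, 2, rng)))
print(len(sample_s2(ratings, 3, 2, rng)))
print(len(sample_s3(ratings, batches, 2, 2, rng)))
```

All draws are with replacement, so the same item or rater can appear more than once in a simulated dataset. On this toy data S1 often returns short response lists because the design is not fully crossed, which illustrates why its assumption is unrealistic for batched datasets.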

Training regime: The method does not train a model; it simulates sampling to estimate p-values for null hypothesis significance tests (NHST). For each setting, 1000 bootstrap repeats are drawn to build distributions of metrics under the null and alternative hypotheses. Annotation budgets (N×K) from 100 to 50,000 are explored, and the number of responses per item K is varied from 1 to 100.
Evaluation protocol: Metrics include accuracy, mean absolute error (MAE), wins (which model has lower MAE), precision, recall, F1-score, KL-divergence, and Jensen-Shannon distance. For given N and K, simulated samples generate distributions of metrics for models A (ideal) and B (perturbed by small effect ϵ). P-values are computed as the fraction of null samples exceeding alternative sample metrics. Comparisons are made to evaluate the inflation or deflation of p-values under different bootstrapping assumptions and annotation budgets.
Reproducibility: Code is based on the open VET toolkit, extended with these bootstrap methods. The datasets used are mostly public or referenced. Persistent rater IDs are rare in released datasets, which limits dataset choices and motivates calls for richer metadata releases. The reported hardware is 16-core CPU machines with 40-64 GB RAM; experiments take 12–18 hours per dataset.
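The evaluation protocol describes the p-value as the fraction of null samples exceeding the alternative sample metrics. A minimal sketch of one reading of that rule follows; the function name `empirical_p_value`, the choice to compare each null draw against the mean of the alternative draws, and the toy numbers are assumptions made for illustration, not details confirmed by the paper.

```python
def empirical_p_value(null_stats, alt_stats):
    """Fraction of null draws that meet or exceed the mean alternative statistic.

    Assumption: the alternative draws are summarized by their mean, and
    ties count toward the p-value (the conservative choice).
    """
    if not null_stats or not alt_stats:
        raise ValueError("both samples must be non-empty")
    threshold = sum(alt_stats) / len(alt_stats)
    exceed = sum(1 for s in null_stats if s >= threshold)
    return exceed / len(null_stats)


# Hypothetical bootstrap statistics, e.g. metric differences between models A and B.
null_stats = [0.0, 0.1, -0.1, 0.3, 0.05]
alt_stats = [0.2, 0.3, 0.1]
print(empirical_p_value(null_stats, alt_stats))  # 0.2
```

With 1000 bootstrap repeats the smallest nonzero p-value this estimator can report is 0.001, which bounds the resolution of any comparison between sampling strategies.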

Concrete example: On the DICES dataset with fully crossed raters, S2 bootstrapping samples items and then raters within items, keeping the dependencies of rater behavior across items intact. For an annotation budget of NK=5000 and K=40, the p-value for accuracy is 0.019 under S2 compared to 0.045 under S1 independent sampling. The gap shows how strongly the resampling assumption alone shifts the significance estimate, which is why evaluations that ignore such dependencies can misstate confidence.
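Because the budget is the product N×K, fixing the budget and K determines how many items can be sampled. The sketch below works through that arithmetic for the example above; the helper name `items_for_budget` and the particular grid values are illustrative assumptions (only the overall ranges, budgets from 100 to 50,000 and K from 1 to 100, come from the text).

```python
def items_for_budget(budget, k):
    """Largest number of items N such that N * K stays within the budget."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return budget // k


# Illustrative grid inside the stated ranges.
budgets = [100, 500, 1000, 2500, 5000, 50000]
ks = [1, 5, 10, 20, 40, 100]

grid = {(b, k): items_for_budget(b, k) for b in budgets for k in ks}
print(grid[(5000, 40)])  # 125
```

At NK=5000 and K=40 the simulation therefore draws 125 items, while the same budget at K=100 would leave only 50 items, which is the tradeoff between N and K that the study measures.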

Technical innovations

Introduction of multi-level non-parametric bootstrap sampling that accounts for annotators’ cross-item dependencies, improving p-value estimates in evaluations.
Comparison of three distinct bootstrapping methods (S1: independent item and rater sampling, S2: rater sampling within items, and S3: stratified batch sampling) tailored to real dataset structures, unlike prior independent or parametric approaches.
Empirical quantification of the tradeoffs between numbers of raters per item (K) and items (N) for reliable NHST under realistic annotator behavior.
Demonstration that distribution-sensitive metrics (e.g., MAE, JSD) require fewer total annotations but higher K, providing a strategic framework for annotation budget optimization.
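To make "distribution-sensitive" concrete, here is a hedged sketch of how MAE, Wins, and Jensen-Shannon distance could be computed from per-item responses. The function names and the definitions chosen (MAE against the mean of each item's K responses, base-2 Jensen-Shannon distance) are assumptions for illustration; the paper's exact formulations may differ.

```python
import math


def mae(predictions, responses):
    """Mean absolute error between each prediction and the mean of that item's responses."""
    errors = [abs(p - sum(r) / len(r)) for p, r in zip(predictions, responses)]
    return sum(errors) / len(errors)


def wins(mae_a, mae_b):
    """True when model A has the lower MAE."""
    return mae_a < mae_b


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


# Hypothetical responses: two items, K=2 binary labels each.
responses = [[1, 0], [1, 1]]
print(mae([0.0, 1.0], responses))            # 0.25
print(js_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0
```

With K=1 the mean response for an item can only be 0 or 1, so these metrics cannot register partial disagreement; larger K gives a finer estimate of each item's response distribution, which is consistent with the finding that they favor higher K.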

Datasets

DICES — 350 items, 123 raters, fully crossed, multi-annotation safety evaluation dataset
Toxicity — 107,620 social media comments, 17,280 raters, batch annotated dataset from Stanford Toxicity Dataset
D3code — 4,554 items, 4,309 raters, cross-cultural offensiveness dataset with demographic metadata

Baselines vs proposed

S1 (independent items and raters): Accuracy p-value = 0.045 vs S2 (raters within items) = 0.019 at NK=5000 on DICES
S1: MAE p-value = 0.032 vs S2: 0.020 at NK=1000 on DICES
S2: Accuracy p-value = 0.035 at NK=500 vs S3 (stratified batches): 0.009 at NK=2500 on Toxicity
S1: requires K=100 for significance vs S2: K=40 for distribution-sensitive metrics on DICES
Precision metric fails to achieve significance under S1 but succeeds under S2 at very high budget NK=50000 on DICES

Limitations

Requires datasets with persistent rater identifiers and high rater-to-item ratios, rare in public benchmarks.
The current bootstrapping methods are tailored to three datasets and may need adaptation for sparser or different annotation topologies.
Oversampling from finite rater pools is a suboptimal proxy for very large-scale annotation and may still underestimate real-world variance.
Does not address demographic subgroup variability, fairness, or intersectional annotation biases—future work suggested.
Focuses on NHST p-values but does not explore alternative statistical frameworks or adversarial annotation behaviors.

Open questions / follow-ons

How does annotator demographic and cultural heterogeneity affect N vs K tradeoffs and evaluation reproducibility?
Can this multi-level bootstrapping framework be extended to continuous or multi-dimensional annotation tasks?
What are the impacts of adversarial or low-quality annotators on the bootstrapping p-value estimations?
How do alternative statistical testing frameworks compare to NHST under modeled annotator dependencies?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights important challenges in how human annotation variance impacts the reliability of performance evaluations. In contexts where CAPTCHAs or bot detection systems are judged using human-labeled ground truth, ignoring annotator dependencies and biases can lead to overconfident conclusions about improvements or model superiority.

This study’s bootstrapping framework and empirical insights recommend increasing the number of annotators per challenge instance (K) and carefully designing annotation collection to capture annotator overlap. Practitioners should consider distribution-sensitive metrics and larger rater pools to ensure significance tests truly reflect robustness. These principles help avoid false-positive claims about attack resistance or bot detection accuracy and promote more reproducible, trustworthy evaluations in security-critical ML settings.

Cite

bibtex

@article{arxiv2605_13801,
  title={Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling},
  author={Deepak Pandita and Flip Korn and Chris Welty and Christopher M. Homan},
  journal={arXiv preprint arXiv:2605.13801},
  year={2026},
  url={https://arxiv.org/abs/2605.13801}
}
