Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization
Source: arXiv:2605.28802 · Published 2026-05-27 · By Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank
TL;DR
This paper investigates whether large language models (LLMs) can learn to simulate annotator-specific label-explanation behavior, moving beyond label disagreement to capture how annotators justify their decisions in free-text explanations. Using two sentence-pair classification tasks, natural language inference (NLI) and paraphrase judgment, with four annotators each, the authors first analyze the stability and detectability of annotator differences in explanations. They find that single explanations are dominated by input content and so only weakly reveal an annotator-specific signature, but that stable annotator-specific patterns emerge when explanation representations are averaged across multiple instances per annotator. To model this behavior, they evaluate prompting methods and supervised fine-tuning (SFT), and propose cross-annotator preference optimization (CAPO), which uses contrastive supervision between valid alternative annotations to sharpen the model's fit to a target annotator's style. Experiments on two datasets with Qwen3 and Llama3.2 models show that prompting is limited, that SFT captures annotator-specific traits better, and that CAPO further improves aggregation-level imitation and human-validated attribution of individual annotator reasoning style while maintaining task decision accuracy. The work suggests that human label variation (HLV) can be viewed and exploited as stable annotator-specific explanation behavior, enabling scalable, interpretable modeling of annotation perspectives grounded in annotator histories rather than labels alone.
Key findings
- Annotator label agreement (pairwise Cohen’s Kappa) ranges between 0.31 and 0.41, indicating moderate but consistent label-level differences across annotators (Fig 1).
- Single explanations carry weak annotator-specific signal due to input content overlap; embedding-based single-instance classification accuracy is only 41.9% (uniform chance is 25% with four annotators).
- After input-content removal and aggregation of explanation embeddings over groups of m=50 annotations, annotator classification accuracy rises sharply to 96.4%, revealing stable annotator-specific patterns (Fig 5).
- Supervised fine-tuning (SFT) surpasses prompting baselines in capturing annotator-specific behavior, improving label accuracy and explanation similarity metrics.
- Cross-annotator preference optimization (CAPO) further improves aggregation-aware imitation metrics such as group classifier confidence and normalized imitation score relative to SFT, without significant accuracy loss (Table 4).
- CAPO reduces distribution shift on multiple explanation-style features (length, modality, lexical diversity) as measured by feature KL divergence (Table 6).
- Human validation shows CAPO explanations are judged more coherent, better grounded, and more reflective of target annotator style than SFT or prompting, with 82.8% inter-annotator agreement.
- LLM-as-judge accuracy for correctly attributing outputs to the intended annotator is higher for CAPO (around 0.328 for VariErr) than for SFT or prompting (Table 4).
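The aggregation effect described in the findings can be illustrated on synthetic data. The following is a minimal sketch, not the paper's pipeline: it assumes each explanation embedding is the sum of a large shared item-content vector, a small annotator-specific offset, and noise. The dimensions, scales, per-item mean subtraction, and nearest-centroid classifier are all illustrative assumptions.

```python
import numpy as np

def make_embeddings(rng, styles, n_items, content_scale=3.0):
    """Synthetic embeddings of shape (n_items, n_annotators, dim)."""
    n_annotators, dim = styles.shape
    content = rng.normal(size=(n_items, 1, dim)) * content_scale  # shared per item
    noise = rng.normal(size=(n_items, n_annotators, dim))
    return content + styles[None, :, :] + noise

def remove_content(emb):
    """Subtract the per-item mean across annotators (a simple content-removal proxy)."""
    return emb - emb.mean(axis=1, keepdims=True)

def nearest_centroid_accuracy(queries, labels, centroids):
    """Fraction of queries whose closest centroid is the true annotator."""
    dists = ((queries[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == labels).mean())

def group_means(resid, m, n_groups, rng):
    """Average m bootstrapped residual embeddings per annotator, n_groups times."""
    n_items, n_annotators, _ = resid.shape
    queries, labels = [], []
    for a in range(n_annotators):
        for _ in range(n_groups):
            idx = rng.choice(n_items, size=m, replace=True)
            queries.append(resid[idx, a].mean(axis=0))
            labels.append(a)
    return np.array(queries), np.array(labels)

rng = np.random.default_rng(0)
styles = rng.normal(size=(4, 32)) * 0.15  # hypothetical small annotator offsets
train = remove_content(make_embeddings(rng, styles, 300))
test = remove_content(make_embeddings(rng, styles, 100))
centroids = train.mean(axis=0)  # one centroid per annotator, shape (4, 32)

# Single-instance view: every residual embedding is classified on its own.
single_q = test.reshape(-1, 32)          # item-major, annotator-minor order
single_y = np.tile(np.arange(4), 100)
acc_single = nearest_centroid_accuracy(single_q, single_y, centroids)

# Aggregated view: groups of m=50 residual embeddings are averaged first.
group_q, group_y = group_means(test, m=50, n_groups=50, rng=rng)
acc_group = nearest_centroid_accuracy(group_q, group_y, centroids)

print(f"single-instance accuracy: {acc_single:.3f}")
print(f"grouped (m=50) accuracy:  {acc_group:.3f}")
```

Averaging m residual embeddings shrinks the noise by roughly a factor of sqrt(m) while leaving the annotator offset intact, which is why the grouped accuracy exceeds the single-instance accuracy in this toy setting.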
Threat model
n/a — This is not a security paper focused on adversarial threats but rather a study on modeling human annotator variation and behavior.
Methodology — deep read
Assumptions: There is no adversary in the security sense. The task is to model and reproduce annotator-specific human label variation (HLV), including the free-text explanations annotators give for their decisions in sentence-pair tasks. The models have no access to demographic or persona metadata, only to repeated label-explanation pairs per annotator. The paper assumes that variation among annotators is a stable, meaningful signal rather than noise.
Data: Two public or semi-public sentence-pair datasets are used: VariErr NLI (entailment, neutral, contradiction labels) and R2 paraphrase judgment (integer relatedness scores from -5 to 5). Each dataset has 4 annotators per item, with 300/100/100 train/dev/test splits made at the item level, so each annotator's label-explanation pairs are grouped by item. Explanations are free-text justifications that annotators provide alongside their labels or scores.
Architecture/algorithm: The models are instruction-tuned Qwen3-4B and Llama 3.2-3B. Three simulation approaches are compared: (a) prompting baselines, comprising base (annotator ID only), in-context learning (ICL) with few-shot examples, value profile (VP) prompting that summarizes annotator style, and a VP-ICL combination; (b) supervised fine-tuning (SFT), which trains an independent LoRA adapter per annotator on that annotator's training data, optimizing a conditional language modeling loss to predict label-explanation pairs given the input and annotator ID; and (c) the proposed Cross-Annotator Preference Optimization (CAPO), which further fine-tunes the SFT adapters with a direct preference optimization (DPO) loss that contrasts the target annotator’s output with other annotators’ outputs for the same input. CAPO treats the alternative annotations as valid but less target-specific, not as mistakes.
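The preference objective can be sketched with the standard DPO loss for a single pair. This is a minimal illustration under stated assumptions, not the authors' implementation: the "chosen" sequence is the target annotator's label-explanation output, the "rejected" sequence is another annotator's output for the same input, the reference model is assumed to be the frozen SFT adapter, and `beta=0.1` is a hypothetical value. Inputs are summed sequence log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    chosen   = target annotator's output (assumed pairing for CAPO)
    rejected = another annotator's output on the same input
    beta     = hypothetical strength of the implicit KL constraint
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy that raises the target annotator's output and lowers the alternative: smaller loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because the rejected output is itself a valid annotation, the loss only pushes the adapter toward the target annotator relative to the reference; it does not label the alternative as wrong.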
Training regime: Models are first fine-tuned separately on each annotator's training-set annotations (SFT) and then refined with CAPO. CAPO constructs contrastive pairs conservatively, requiring equal labels on NLI and near-equal scores on paraphrase judgment. Hyperparameters, epochs, batch sizes, hardware details, and random-seed strategies are documented in the appendix, though not in full for every run. CAPO checkpoints are selected using aggregation-aware metrics computed on the development set.
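The conservative pair policy can be sketched as a filter over annotations of the same item. This is a hedged illustration: the `Annotation` record, the `build_pairs` helper, and the `max_score_gap` tolerance are hypothetical names and values, not taken from the paper, whose exact pair policy is given in its appendix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    item_id: str
    annotator: str
    label: object       # NLI label (str) or paraphrase score (int)
    explanation: str

def build_pairs(annotations, target, max_score_gap=None):
    """Pair the target annotator's output with other annotators' outputs on the same item.

    max_score_gap=None -> require identical labels (NLI-style policy).
    max_score_gap=k    -> require |score difference| <= k (paraphrase-style policy;
                          the tolerance value is an assumption).
    """
    by_item = {}
    for ann in annotations:
        by_item.setdefault(ann.item_id, []).append(ann)

    pairs = []
    for item_id, group in by_item.items():
        chosen = [a for a in group if a.annotator == target]
        others = [a for a in group if a.annotator != target]
        for c in chosen:
            for r in others:
                if max_score_gap is None:
                    compatible = c.label == r.label
                else:
                    compatible = abs(c.label - r.label) <= max_score_gap
                if compatible:
                    pairs.append({"item_id": item_id, "chosen": c, "rejected": r})
    return pairs

nli = [
    Annotation("i1", "A", "entailment", "The premise states it directly."),
    Annotation("i1", "B", "entailment", "Follows from the first clause."),
    Annotation("i1", "C", "neutral", "Not enough information."),
]
for pair in build_pairs(nli, target="A"):
    print(pair["item_id"], pair["chosen"].annotator, "vs", pair["rejected"].annotator)
```

Keeping the decision fixed (or nearly fixed) within a pair means the contrast is carried mostly by the explanation, which matches the stated goal of sharpening style rather than changing labels.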
Evaluation protocol: The evaluation takes four complementary views: (a) decision matching, via label accuracy and, for paraphrase scoring, mean absolute error; (b) reference explanation similarity, using ROUGE-L, BERTScore, and embedding cosine similarity; (c) aggregation-aware imitation, using KL divergence on handcrafted explanation-style feature distributions, group classifier confidence (whether generated outputs are classified in aggregate as belonging to the target annotator), and a normalized imitation score comparing model outputs with human references; and (d) human validation, in which expert annotators rate explanation coherence, fidelity, and style. In addition, an LLM-based judge attribution task tests single-output annotator recognition. Bootstrapped groups of about 50 outputs are used for stable aggregation.
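One of the aggregation-aware metrics, feature KL divergence, can be sketched for a single style feature. The example below is a minimal illustration that uses explanation length only. The bin edges, the add-one smoothing, and the direction of the divergence (human reference relative to model output) are assumptions; the paper's feature set and settings are described in its appendix.

```python
import math
from collections import Counter

LENGTH_EDGES = (5, 10, 20, 40)  # hypothetical bin edges, in whitespace tokens

def length_bin(text, edges=LENGTH_EDGES):
    """Map an explanation to a length bin index."""
    n_tokens = len(text.split())
    for i, edge in enumerate(edges):
        if n_tokens <= edge:
            return i
    return len(edges)

def smoothed_distribution(texts, n_bins=len(LENGTH_EDGES) + 1, alpha=1.0):
    """Histogram over length bins with add-alpha smoothing so no bin is zero."""
    counts = Counter(length_bin(t) for t in texts)
    total = len(texts) + alpha * n_bins
    return [(counts.get(b, 0) + alpha) / total for b in range(n_bins)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def length_kl(human_texts, model_texts):
    """KL between the human and model length distributions (lower is better)."""
    return kl_divergence(smoothed_distribution(human_texts),
                         smoothed_distribution(model_texts))

human = ["short one here"] * 10 + ["a " * 15] * 10
model_far = ["word " * 50] * 20
print(length_kl(human, human))      # identical distributions
print(length_kl(human, model_far))  # model writes much longer explanations
```

The metric is computed over a set of outputs rather than per instance, in line with the finding that annotator style is only stable in aggregate.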
Reproducibility: Code and adapter checkpoints are mentioned but not clearly released; the datasets are based on VariErr NLI and R2 paraphrase judgment, with links provided. The paper provides appendix details on prompt templates, pair policies, feature extraction, and hyperparameters. Overall, exact replication may require contacting the authors or additional implementation effort, but the methodology is described clearly enough to attempt reproduction with the available datasets.
Technical innovations
- Cross-Annotator Preference Optimization (CAPO): a novel contrastive training approach that uses valid annotations from other annotators as negative but plausible samples to sharpen LLM imitation of target annotator-specific label-explanation behavior.
- Use of aggregation-level embedding averaging and content-effect removal (residual embeddings E3, E4) to reveal stable annotator-specific explanation patterns not visible at single-instance level.
- Development of aggregation-aware evaluation metrics for annotator imitation, including group classifier confidence, normalized imitation scores, and feature distribution KL divergence.
- Insight that parametrically fine-tuned LoRA adapters per annotator encode annotator-specific style better than prompting on symbolic or profile inputs.
Datasets
- VariErr NLI — ~500 items (300/100/100 split) — publicly described, includes 4 annotators with label + explanation pairs
- R2 Paraphrase Judgment — ~500 items (300/100/100 split) — publicly described, includes 4 annotators with score + explanation pairs
Baselines vs proposed
- Qwen3 Base prompting: Label Accuracy 47.2% vs CAPO 62.7%
- Qwen3 SFT: Label Accuracy 63.8% vs CAPO 62.7% (competitive)
- Qwen3 SFT: Feature KL 0.084 vs CAPO 0.081 (lower is better)
- Qwen3 SFT: Group Classifier Confidence 0.845 vs CAPO 0.867
- Qwen3 SFT: Normalized Imitation Score 0.859 vs CAPO 0.888
- Llama3.2 Base prompting: Label Accuracy 31.5% vs CAPO 51.2%
- Llama3.2 SFT: Label Accuracy 51.2% vs CAPO 51.2% (equal)
- Llama3.2 SFT: Feature KL 0.113 vs CAPO 0.121 (slightly worse)
- Llama3.2 SFT: Group Classifier Confidence 0.909 vs CAPO 0.924
- Llama3.2 SFT: Normalized Imitation Score 0.946 vs CAPO 0.964
Limitations
- Annotator-specific behavior is weak and heavily confounded by input content at the single-instance level, requiring aggregation of many examples for stable modeling and evaluation.
- Datasets contain only 4 annotators each, which limits generalizability of findings across diverse annotator populations or domains.
- Evaluation relies heavily on indirect metrics such as group classifier confidence and LLM-as-judge attribution, which, while informative, are hard to interpret and may be biased.
- CAPO requires repeated annotations per annotator on same inputs, which may not be practical for many annotation projects.
- Human validation was limited to 50 samples with two annotators, restricting thorough assessment of model-generated explanation quality and style.
- Study focuses on sentence-pair tasks; applicability of CAPO and annotator explanation modeling to other annotation formats and modalities remains untested.
Open questions / follow-ons
- Can CAPO be extended to settings with more annotators and fewer repeated annotations per annotator?
- How well do annotator-specific explanation models generalize across domains and annotation tasks beyond sentence-pair classification?
- Can improved disentanglement techniques better separate input content influence from annotator explanation style at the single-instance level?
- What is the effect of modeling annotator behavior on downstream tasks like active learning, annotation quality control, or explanation-based debiasing?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, this work highlights the importance of capturing user-specific behavioral signatures not only in overt labels or answers (e.g., pass/fail) but also in explanation-style or interaction patterns that go beyond single responses. Techniques like CAPO suggest that models can learn stable, individual-level reasoning or justification styles from repeated behavior, which may help distinguish human users from attacker bots that emulate generic behavior. This informs the design of bot-detection methods that leverage consistency and stylistic nuance in human sessions rather than relying solely on aggregate correctness or single-instance signals. Furthermore, the emphasis on aggregation-aware evaluation cautions against trusting instance-level similarity metrics for user attribution, which is relevant for CAPTCHA analysis, where user decisions are often sparse and noisy. Overall, the paper’s approach encourages casting human variation as a rich, learnable signal in interpretable behavior modeling, which may inspire new defenses focused on personalized interaction fingerprints in CAPTCHA or bot-detection scenarios.
Cite
@article{arxiv2605_28802,
title={Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization},
author={Beiduo Chen and Pingjun Hong and Ziyun Zhang and Benjamin Roth and Anna Korhonen and Barbara Plank},
journal={arXiv preprint arXiv:2605.28802},
year={2026},
url={https://arxiv.org/abs/2605.28802}
}