Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Source: arXiv:2501.07493 · Published 2025-01-13 · By Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo et al.
TL;DR
This paper asks a practical security question that voting-based LLM leaderboards have mostly glossed over: if model identities are hidden during pairwise voting, can an adversary still learn which model produced which answer and then systematically bias the ranking? The authors focus on Chatbot Arena, whose leaderboard is updated from anonymous head-to-head votes, and argue that in the absence of strong bot protection and account controls, this anonymity can be broken cheaply enough to make leaderboard manipulation feasible.
The core result is a two-stage reranking attack. First, the attacker de-anonymizes responses with either prompt-based identity probing or a lightweight supervised classifier built on response text features; they report >95% target-model detection accuracy in many settings. Second, once the attacker knows which anonymous answer came from the target model, they selectively upvote or downvote it. In offline simulations of Chatbot Arena, they estimate that a few hundred to a few thousand malicious votes can move a model several leaderboard positions, with one-position shifts often requiring fewer than 1,000 votes. The paper goes beyond attack demonstration by working with the Chatbot Arena team to analyze and deploy mitigations intended to raise the cost of such manipulation.
Key findings
- Identity-probing prompts such as “Who are you?” exceeded 90% detection accuracy for all evaluated models in Table 2, with several models at 97–100%.
- The training-based detector using simple bag-of-words features achieved >95% test accuracy for many target models; e.g., gpt-4o-mini-2024-07-18 reached 95.8% in Table 3.
- TF-IDF also worked well but was slightly weaker than BoW in Table 3; for claude-3-5-sonnet-20240620, TF-IDF reached 92.6% vs BoW 93.7%.
- Length alone was nontrivial but much weaker: word-length accuracy was only 67.2–77.7% across the models in Table 3, showing the attack is not just exploiting trivial verbosity differences.
- Specialized or multilingual prompts produced the strongest de-anonymization: Figure 3 shows prompt categories like Math, Coding, and Chinese yielding the highest accuracies, with the text noting >95% across all evaluated models using optimal prompts.
- In the simulated leaderboard, moving top-ranked models by one position required fewer than 1,000 malicious votes in many cases; e.g., gemini-1.5-pro-exp-0801 needed 696 votes to move from rank 2 to rank 1 in Table 4(a).
- Some high-rank manipulations were still feasible at modest cost: gpt-4o-2024-05-13 required 1,668 votes to move from rank 3 to rank 1, and 3,756 votes to move from rank 3 to rank 5 in Table 4(a).
- The authors estimate the one-time detector-building cost at $440 in their setup, and argue that without mitigation the total attack cost is dominated by this offline training cost rather than by online voting.
- Mitigations materially increase attack cost by raising per-account or per-action expense: the paper reports that authentication, rate limiting, malicious-user identification, login, and reCAPTCHA are the main levers discussed with Chatbot Arena.
- The paper states that all experiments were simulated/offline and had no impact on the live Chatbot Arena leaderboard; Chatbot Arena was not aware of any real adversarial manipulation attempts at publication time.
Threat model
The adversary is a public Chatbot Arena user who can submit arbitrary prompts, observe two anonymous model responses, and (in the stronger version of the attack) separately query candidate models directly via APIs or open weights to collect training data. They know the public model roster and can choose prompts to maximize response separability. They cannot modify the platform code directly, and the paper assumes they do not already have privileged access to internal ranking data beyond what is public or inferable. The goal is to selectively upvote or downvote a chosen model enough to alter the arena leaderboard.
Methodology — deep read
Threat model and attack formulation: the attacker is a public user who can submit arbitrary prompts to Chatbot Arena and can also directly query candidate models outside the arena when the model is API-accessible or open-weight. The attacker knows the set of available leaderboard models and aims either to promote a target model or demote rivals. The key security assumption the paper attacks is anonymous pairwise voting: the user sees two responses without model identities, then votes, and that vote updates the Elo/Bradley-Terry-style leaderboard. The authors model the de-anonymization problem as binary classification: given a prompt-response pair (P, R), decide whether R came from a specific target model M or from any other model in the current model pool. They explicitly assume different models have different response distributions for a given prompt, which is a mild and natural assumption for this setting.
Data and experimental setup for de-anonymization: the paper evaluates 22 representative Chatbot Arena models overall, but the reported tables focus on a subset of models shown in Tables 2 and 3. For the training-based detector, they construct prompts from eight categories: normal chat in high-resource languages (English, Chinese, Spanish), normal chat in low-resource languages (Indonesian, Persian), and specialty prompts for Coding, Math, and Safety-violating instructions (Table 1). They sample 200 prompts per category and collect 50 responses per model for each prompt. For each prompt-model pair, they build a balanced binary dataset containing 50 responses from the target model and 50 uniformly sampled responses from other models. They split each dataset 80/20 into train/test, and report average test accuracy across prompts. The text does not specify decoding parameters in the excerpt beyond noting they are given in Appendix A.1.
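As a concrete illustration of that setup, here is a minimal sketch of the per-prompt dataset construction; `responses_by_model` (a model → prompt → list-of-responses mapping) and the helper name are assumptions for illustration, not the authors' code:

```python
import random

from sklearn.model_selection import train_test_split

def build_binary_dataset(responses_by_model, target_model, prompt,
                         n_per_class=50, seed=42):
    """Balanced per-prompt binary dataset, mirroring the paper's setup:
    50 target-model responses (label 1) vs. 50 responses sampled
    uniformly from the other models in the pool (label 0)."""
    rng = random.Random(seed)
    pos = responses_by_model[target_model][prompt][:n_per_class]
    other_models = [m for m in responses_by_model if m != target_model]
    neg = [rng.choice(responses_by_model[rng.choice(other_models)][prompt])
           for _ in range(n_per_class)]
    texts = pos + neg
    labels = [1] * n_per_class + [0] * n_per_class
    # 80/20 train/test split, as stated in the paper.
    return train_test_split(texts, labels, test_size=0.2, random_state=seed)
```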
Architecture / algorithm for de-anonymization: they test two detector families. The first is an identity-probing detector that asks direct identity questions such as “Who are you?”, “Which model are you?”, and related variants; the classifier is a simple string match that predicts positive if the response contains the model name or organization name anywhere in the text. This is easy to deploy but is often partially defeated by simple post-processing, and Chatbot Arena already filters votes mentioning model names. The second is a training-based detector built on extremely simple text features: response length (in words or characters), TF-IDF, or bag-of-words. These features feed a logistic regression classifier implemented with scikit-learn defaults and random_state=42. The main novelty is not the classifier itself, but showing that standard text features are strong enough to recover model identity from responses with very high accuracy on the arena’s own model pool. Figure 2 uses PCA projections of BoW features to show visually separable model-specific clusters for three prompts, which supports the quantitative results in Table 3.
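A minimal sketch of the training-based detector as described (bag-of-words features into a default scikit-learn logistic regression with random_state=42); the function name and pipeline packaging are assumptions, not the paper's released code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_bow_detector(train_texts, train_labels):
    """Bag-of-words features fed to a logistic regression with
    scikit-learn defaults, matching the setup the paper describes.
    Swap CountVectorizer for TfidfVectorizer for the TF-IDF variant."""
    detector = make_pipeline(CountVectorizer(),
                             LogisticRegression(random_state=42))
    detector.fit(train_texts, train_labels)
    return detector

# Usage with the dataset sketch above:
# X_train, X_test, y_train, y_test = build_binary_dataset(...)
# detector = train_bow_detector(X_train, y_train)
# detector.score(X_test, y_test)  # reported as average test accuracy
```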
Concrete end-to-end example of the attack: suppose an adversary wants to upvote a target model such as gpt-4o-2024-05-13. The attacker first selects a prompt type that maximizes separability, for example a math question or a non-English prompt, and trains a logistic regression detector on BoW features using 50 target-model responses and 50 negative responses per prompt. On a new Chatbot Arena interaction, the attacker submits a prompt, receives two anonymous responses, and runs the detector on each response. If the detector identifies one of them as the target model, the attacker votes for that response; otherwise, they abstain. This selective-voting policy converts otherwise honest, random exposure to target-model responses into a biased stream of votes. The paper’s simulation then propagates those votes through the arena’s Bradley-Terry ranking update process and checks after every 1,000 interactions whether the model has moved up or down the desired number of ranks.
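The selective-voting policy itself is a few lines. A sketch, assuming a fitted binary detector as above; the abstain-when-both-flagged behavior is an assumption, since the excerpt only describes detect-then-vote:

```python
def selective_vote(detector, response_a, response_b):
    """Selective-voting policy from the attack walkthrough: vote for
    the response attributed to the target model, abstain otherwise."""
    flag_a = detector.predict([response_a])[0] == 1
    flag_b = detector.predict([response_b])[0] == 1
    if flag_a and not flag_b:
        return "vote_a"    # upvote the target model's answer
    if flag_b and not flag_a:
        return "vote_b"
    return "abstain"       # stay passive, per the paper's assumption
```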
Training regime and evaluation protocol: the de-anonymization classifier is logistic regression with default hyperparameters, train/test 80/20, and random state 42. The paper does not describe iterative hyperparameter tuning, multiple seeds, or confidence intervals in the excerpt. Evaluation is by test accuracy for the detector, averaged across prompts; the identity-probing detector is averaged over 1,000 queries per prompt. For ranking manipulation, the authors simulate attacker behavior on historical voting data and compute Bradley-Terry coefficients and rankings after each 1,000 interactions. They define two quantities: votes, meaning malicious votes cast only when the target model is detected, and interactions, meaning total user queries including abstentions. The default simulation assumes 95% de-anonymization accuracy with symmetric 5% false positive/false negative rates, because the best prompt set in Section 2.4 reaches about that level. They also assume the attacker remains passive when detection fails. Appendix B.2 is said to contain ablations on detection accuracy and non-detection behavior, but those details are not in the excerpt.
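To make the votes-vs-interactions distinction concrete, here is a toy accounting sketch under the paper's default 95%-accuracy assumption; `p_target_shown` is a made-up parameter, and this is far simpler than the paper's full Bradley-Terry simulation:

```python
import random

def simulate_votes(n_interactions, p_target_shown, det_acc=0.95, seed=0):
    """Toy accounting of 'votes' vs. 'interactions' under 95% detection
    accuracy with symmetric 5% false-positive/false-negative rates.
    p_target_shown, the chance the target model appears in a random
    anonymous pairing, is a hypothetical parameter, not from the paper."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_interactions):
        if rng.random() < p_target_shown:
            detected = rng.random() < det_acc          # true positive
        else:
            detected = rng.random() < (1.0 - det_acc)  # false positive
        if detected:
            votes += 1  # attacker votes only on detection, else abstains
    return votes

# e.g. simulate_votes(82_000, p_target_shown=0.02) gives a feel for why
# the interaction budgets (Table 4b) dwarf the vote counts in Table 4a.
```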
Reproducibility and mitigation analysis: the paper says the attack was responsibly disclosed to Chatbot Arena in August 2024 and that the work was done entirely in simulated environments with no effect on the live leaderboard. The authors collaborated with the arena developers to discuss and implement mitigations; some existed already, including Cloudflare bot protection, malicious-user detection, and rate limiting, while others, such as reCAPTCHA and login, were being integrated. On the defense side, they propose a cost model with three components: one-time detector-building cost, per-account maintenance cost, and per-action cost. They then argue that authentication, rate limiting, and anomaly-based malicious-user detection raise attack cost by increasing the per-account expense or reducing the number of actions per account. The excerpt includes an explicit likelihood-test formulation for detecting malicious behavior from vote sequences under a benign-distribution assumption, plus a Neyman-Pearson likelihood-ratio framing when both benign and adversarial distributions are modeled, but the mathematical section is truncated before the full derivation, and no concrete detection-performance numbers for these defenses appear in the excerpt.
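For reference, a plausible rendering of the cost decomposition and the Neyman-Pearson framing; all notation here is assumed, since the excerpt truncates the paper's own derivation:

```latex
% Assumed notation: C = unit costs, N = counts, v_i = i-th vote, tau = threshold.
\text{AttackCost} \;\approx\; \underbrace{C_{\text{detector}}}_{\text{one-time}}
  \;+\; N_{\text{acct}}\, C_{\text{acct}}
  \;+\; N_{\text{action}}\, C_{\text{action}}

% Neyman-Pearson framing: flag an account as malicious when the
% likelihood ratio of its vote sequence exceeds a threshold.
\Lambda(v_1,\dots,v_n) \;=\;
  \frac{\prod_{i=1}^{n} p_{\text{adv}}(v_i)}{\prod_{i=1}^{n} p_{\text{benign}}(v_i)}
  \;>\; \tau
```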
Technical innovations
- A two-stage reranking attack that combines response de-anonymization with selective voting to bias anonymous leaderboard updates.
- A demonstration that very simple features, especially bag-of-words over model responses, are enough to identify target models with >95% accuracy in the arena setting.
- A simulation-based estimate of leaderboard manipulation cost under Bradley-Terry ranking updates, reporting the number of malicious votes/interactions needed to move specific models up or down.
- A defense cost model that decomposes attack expense into detector-building, per-account, and per-action terms, making mitigation tradeoffs explicit.
Datasets
- LMSYS-Chat-1M — 1M+ chat prompts/responses (used as prompt source for normal chat categories) — public
- Alpaca Code — size not specified in excerpt (used as prompt source) — public
- MATH — size not specified in excerpt (used as prompt source) — public
- AdvBench — size not specified in excerpt (used as prompt source) — public
- Historical Chatbot Arena voting data — size not specified in excerpt — Chatbot Arena internal/public logs referenced in Appendix A.4
Baselines vs proposed
- Identity-probing vs. no probing baseline: accuracy >90% for all evaluated models with the prompt “Who are you?” (Table 2) vs. chance-level anonymity without probing (implicit).
- BoW logistic regression vs. length-only features: for gpt-4o-mini-2024-07-18, BoW = 95.8% vs word-length = 68.5% (Table 3).
- BoW logistic regression vs. TF-IDF: claude-3-5-sonnet-20240620 accuracy = 93.7% vs TF-IDF = 92.6% (Table 3).
- For chatgpt-4o-latest, required votes to move from rank 1 to rank 4 = 1,315 (Table 4a) and required interactions = 82,000 (Table 4b).
- For llama-13b, required votes to move from rank 129 to rank 125 = 381 (Table 5a) and required interactions = 24,000 (Table 5b).
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2501.07493.

Fig 1: Chatbot Arena compiles a model leaderboard using crowdsourced user votes and …

Figs 2–8: captions not recoverable from the extraction; see the source paper.
Limitations
- The attack and leaderboard manipulation estimates are based on offline simulation, not live exploitation of Chatbot Arena.
- The de-anonymization results rely on access to many model responses and publicly available models; the threat may be weaker for closed models with restricted APIs or different output policies.
- The strongest detector results come from simple settings with balanced datasets and specific prompt families; generalization to unseen prompt distributions is not fully established in the excerpt.
- The paper assumes 95% de-anonymization accuracy in the attack simulation, but the real-world rate could be lower or higher depending on model mix and platform defenses.
- The excerpt does not provide full details on statistical uncertainty, multiple seeds, or confidence intervals for the detector and ranking simulations.
- Some mitigation analysis is conceptual and cost-model-based; the excerpt does not include full empirical evaluation of the defenses’ false positive rates or user impact.
Open questions / follow-ons
- How robust are the de-anonymization detectors under stronger platform-side response normalization, paraphrasing, or post-processing that removes stylistic signatures?
- How much does the attack cost increase if the defender combines authentication, rate limits, per-user query quotas, and anomaly detection simultaneously rather than in isolation?
- Can one build model-agnostic defenses that preserve anonymity while still allowing honest preference voting, without leaking enough signal for re-identification?
- Do the same vulnerabilities apply to other public voting systems with different aggregation rules than Bradley-Terry/Elo, and how does the attack transfer across ranking algorithms?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the paper is a concrete example of why “anonymous” crowdsourced evaluation is only as strong as the surrounding abuse-prevention stack. The attack does not require sophisticated adversarial ML; simple text features plus direct identity prompts can often recover which model generated a response, after which ordinary account-farming and selective voting can bias rankings. That means a platform protecting voting integrity needs more than content filtering: it needs account authentication, rate limiting, anomaly detection, and likely device- or risk-based controls tied to action budgets.
For a CAPTCHA or fraud engineer, the main takeaway is that the economic goal is to raise the marginal cost of each malicious vote above the expected ranking influence per vote. The paper’s cost model is useful because it separates the offline cost of building a detector from the online cost of sustaining many accounts and actions. If you operate a voting-based benchmark, you should think like an abuse-prevention system designer: model attacker throughput, not just classifier accuracy, and assume the attacker can adapt prompts to maximize separability. The paper also suggests that post-processing alone is brittle; it can be bypassed by indirect identity elicitation or by stronger classifiers trained on response style.
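A back-of-envelope sketch of that marginal-cost framing, using the paper's three-part cost model; every parameter value except the $440 detector cost is hypothetical:

```python
def marginal_cost_per_vote(c_detector, c_account, c_action,
                           votes_per_account, total_votes):
    """Attacker economics implied by the three-part cost model:
    amortized detector cost plus per-account and per-action spend,
    divided over the votes actually cast."""
    n_accounts = total_votes / votes_per_account
    total_cost = c_detector + n_accounts * c_account + total_votes * c_action
    return total_cost / total_votes

# Hypothetical example: $440 detector (the paper's figure), $5/account,
# $0.05/action, rate limits capping 50 votes per account, 1,000 votes:
# marginal_cost_per_vote(440, 5.0, 0.05, 50, 1000) -> ~$0.59 per vote.
# Tighter rate limits (smaller votes_per_account) raise this directly.
```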
Cite
@article{arxiv2501_07493,
  title={Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards},
  author={Yangsibo Huang and Milad Nasr and Anastasios Angelopoulos and Nicholas Carlini and Wei-Lin Chiang and Christopher A. Choquette-Choo and Daphne Ippolito and Matthew Jagielski and Katherine Lee and Ken Ziyu Liu and Ion Stoica and Florian Tramer and Chiyuan Zhang},
  journal={arXiv preprint arXiv:2501.07493},
  year={2025},
  url={https://arxiv.org/abs/2501.07493}
}