SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Source: arXiv:2605.24958 · Published 2026-05-24 · By Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang et al.

TL;DR

This paper addresses the under-explored problem of transfer-based adversarial attacks in the textual domain, where attackers craft adversarial examples on a surrogate ensemble without querying or otherwise accessing the victim model during generation. Existing methods often weight all submodels in the ensemble equally and rely on inaccurate word importance estimates, which limits transferability. SEP-Attack instead uses a Determinantal Point Process (DPP) to generate diverse ensemble weights that better represent each submodel's transferability. With these weights, a new confidence score gives a more accurate estimate of word importance and guides synonym substitution. SEP-Attack then selects the final adversarial examples by quantifying transferability through neighborhood averaging, keeping the candidates with the most robust transfer potential. Experiments on four popular NLP benchmark datasets, two real-world APIs (Alibaba Cloud and Google Cloud), and large language models show that SEP-Attack substantially outperforms state-of-the-art ensemble-based and query-based baselines. Notably, it achieves up to 99% attack success rate on IMDB with only 10 queries, and it remains effective against defenses such as HotFlip and SHIELD. These results position SEP-Attack as a simple yet highly effective framework for textual transfer attacks under strict black-box constraints.

Key findings

  • SEP-Attack achieves attack success rates (ASR) of 70.2%, 78.5%, and 75.9% against BERT, LSTM, and TextCNN on the MR dataset with only 10 queries, surpassing the next-best baseline by at least 12.5%
  • On the IMDB dataset, SEP-Attack reaches an ASR close to 99% against all three victim models, improving over baselines by more than 30% (e.g., a 53.7% improvement over the second-ranked method when attacking BERT)
  • SEP-Attack outperforms query-based methods (TextFooler, DeepWordBug, BBA, etc.) despite using only 10 queries versus 100 queries in baselines
  • The Determinantal Point Process technique effectively generates diverse surrogate ensemble weights from the weight space W, improving attack diversity and success
  • The confidence score metric based on weighted ensemble outputs enables more accurate word importance estimation compared to removing words as in TextFooler, reducing semantic distortion
  • SEP-Attack successfully bypasses adversarially trained models: it achieves 33.3% ASR vs 30.0% for BBA on the HotFlip-defended AG dataset, and 97.9% vs 37.8% for BBA on the SHIELD-defended IMDB dataset (Table 3)
  • Against real-world APIs, SEP-Attack achieves 64.1% ASR on the Alibaba Cloud API (vs 58.7% for BBA) and 75.3% on the Google Cloud API (vs 29.9% for LeapAttack)
  • The transferability quantification via averaging confidence scores in synonym neighborhoods (Eq. 7) allows effective pruning of candidates for efficient black-box attacks

Threat model

The adversary is a black-box attacker with no access to the victim model's architecture, parameters, or output logits. The attacker cannot query the victim model while generating adversarial examples and relies solely on a surrogate ensemble to craft transferable ones. The victim's task is assumed known (e.g., text classification), but the attacker observes nothing beyond input-output labels, and only during final attack testing. This models realistic scenarios where queries to the victim model are restricted or costly.

Methodology — deep read

The authors consider a realistic black-box threat model in which the adversary has no access to the victim model's architecture or parameters and cannot query it while generating attacks. Attacks are instead generated solely from a surrogate ensemble drawn from different classifiers (BERT, ALBERT, RoBERTa, LSTM, TextCNN). The adversary's goal is to find adversarial examples that fool the unknown victim model, i.e., to maximize transferability.

The experiments use four public benchmark datasets: MR and AG for short texts, and Yelp and IMDB for longer texts. From each test set, 500 examples are sampled. The ensemble consists of 4 surrogate models, with the victim model's architecture excluded during attack generation. Surrogate and victim model accuracies are reported in Table 1. Synonym sets for word substitutions are constructed based on part-of-speech tags (nouns, verbs, adjectives, adverbs).

SEP-Attack centers around three main components:

  1. Diversity-Based Surrogate Ensemble Weights Generation: Instead of equal weighting of submodels, the authors generate a large candidate weight space W (random real vectors in [0,1]^N) and use Determinantal Point Process (DPP) to sample diverse subsets aiming to maximize representativeness and diversity. The kernel matrix K = W * W^T is formed, and the subset with highest determinant is selected as the weight ensemble WE, containing E diverse weight vectors.

  2. Adversarial Example Generation: For each diversity-generated weight vector w_e in WE, SEP-Attack iteratively updates the original text in T=10 iterations. Each iteration involves: (a) Calculating a weighted confidence score δ(X,y,w_e) measuring model confidence on label y using the surrogate ensemble with weights w_e; (b) Computing word importance scores I_{x_i} = -δ(X\x_i,y,w_e) reflecting the confidence drop if word x_i is removed; (c) Sampling a replacement order O based on softmax on importance scores with scaling to emphasize high differences; (d) For each word in O, substituting with the synonym that minimizes confidence score δ, allowing perturbations to expand beyond the final limit ε to a larger η; (e) Afterwards, a pruning step removes unimportant substitutions by reverting words when it improves confidence and reduces perturbation distance to within ε.

  3. Transferable Example Selection: Given a large candidate set X formed by combining T iterations over E weights, the method estimates transferability score τ_t for each candidate X_t by computing the average confidence score δ over its synonym-based neighborhood (adjacency region) B_m(X_t). This neighborhood averaging measures robustness in the text space since flatter loss indicates better transferability. The top-K candidates by τ_t (descending) are selected as adversarial examples.
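The weight-generation component (item 1 above) can be sketched as a greedy search for a high-determinant subset of the kernel K = W * W^T. This is a minimal pure-Python sketch, not the authors' implementation: the greedy approximation, the function names (`gram`, `det`, `greedy_dpp_weights`), and the pool size of 200 are assumptions, and the paper may select or sample the subset differently. One property of this kernel is certain, though: K has rank at most N (the number of submodels), so any subset of more than N weight vectors has determinant zero.

```python
import random


def gram(vectors):
    """Kernel K = W W^T: pairwise dot products between weight vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, result = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return result


def greedy_dpp_weights(pool, e):
    """Greedily pick e vectors from pool (assumed selection rule): at each
    step add the vector that maximizes the determinant of the kernel
    submatrix restricted to the selected vectors."""
    kernel = gram(pool)
    chosen = []
    for _ in range(min(e, len(pool))):
        best, best_det = None, -1.0
        for i in range(len(pool)):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[kernel[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return [pool[i] for i in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    n_models = 4  # four surrogate submodels, as in the experiments
    # Candidate weight space W: random vectors in [0, 1]^N.
    pool = [[rng.random() for _ in range(n_models)] for _ in range(200)]
    for w in greedy_dpp_weights(pool, e=3):
        print([round(x, 3) for x in w])
```

Near-duplicate weight vectors make the submatrix almost singular, so the greedy step skips them in favor of vectors pointing in different directions, which is the diversity effect the text describes.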

The surrogate ensemble models rely on standard pretrained model fine-tuning for classification. The confidence score δ combines softmax probabilities weighted by the sampled ensemble weights w_e. Word substitutions use synonym sets filtered by POS tagging and semantic closeness.
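A minimal sketch of the weighted confidence score and the leave-one-out importance it feeds, under clearly stated assumptions. The summary does not give the exact formula for δ, so it is taken here to be the weighted sum of each surrogate's softmax probability for label y. Surrogates are modeled as plain callables returning class probabilities, and `toy_model`, the `scale` value, and sampling without replacement are hypothetical stand-ins rather than details from the paper.

```python
import math
import random


def toy_model(keyword, strength):
    """Hypothetical surrogate: predicts class 1 with probability `strength`
    when `keyword` is present, and 0.5 otherwise."""
    def predict(tokens):
        p_pos = strength if keyword in tokens else 0.5
        return [1.0 - p_pos, p_pos]
    return predict


def confidence(models, weights, tokens, label):
    """Assumed form of delta(X, y, w_e): weighted sum of P_n(y | X)."""
    return sum(w * m(tokens)[label] for m, w in zip(models, weights))


def word_importance(models, weights, tokens, label):
    """I_{x_i} = -delta(X \\ x_i, y, w_e): larger when removing x_i
    lowers the ensemble's confidence in the label more."""
    return [
        -confidence(models, weights, tokens[:i] + tokens[i + 1:], label)
        for i in range(len(tokens))
    ]


def sample_order(scores, scale=10.0, rng=None):
    """Sample a replacement order from a scaled softmax over importance
    scores, drawing positions without replacement (an assumption)."""
    rng = rng or random.Random()
    remaining, order = list(range(len(scores))), []
    while remaining:
        top = max(scores[i] for i in remaining)
        probs = [math.exp(scale * (scores[i] - top)) for i in remaining]
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order


if __name__ == "__main__":
    models = [toy_model("great", 0.9), toy_model("fun", 0.8)]
    weights = [0.7, 0.3]  # one weight vector w_e
    tokens = ["a", "great", "fun", "film"]
    scores = word_importance(models, weights, tokens, label=1)
    print(scores)
    print(sample_order(scores, rng=random.Random(0)))
```

A larger `scale` concentrates the sampled order on the highest-importance words, while a smaller one makes the order closer to uniform.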

Experiments run on NVIDIA A800 GPUs, with query budgets of 10 for ensemble-based attacks and 100 for query-based baselines. The main metric is attack success rate (ASR), the percentage of originally correctly classified texts that are misclassified after perturbation. Baselines include query-based attacks (TextFooler, DeepWordBug, BBA) and ensemble-based methods (HAWR, Ensemble).
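The ASR definition above translates directly into a few lines of code. The sketch below assumes one final victim prediction per attacked example; how the paper aggregates over a 10-query budget is not stated here, so that part is left out. The function name is hypothetical.

```python
def attack_success_rate(true_labels, clean_preds, adv_preds):
    """Percentage of originally correctly classified examples whose
    prediction is wrong after the adversarial perturbation."""
    eligible = [
        (y, adv)
        for y, clean, adv in zip(true_labels, clean_preds, adv_preds)
        if clean == y  # only examples the victim classified correctly
    ]
    if not eligible:
        return 0.0
    flipped = sum(adv != y for y, adv in eligible)
    return 100.0 * flipped / len(eligible)
```

Examples the victim already misclassifies are excluded from the denominator, so ASR is not inflated by the model's own errors.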

Overall, the method moves beyond equal ensemble weighting and simplistic importance estimation by leveraging DPP for weight diversity and an iterative update + pruning strategy. Transferability scoring through neighborhood averaging enables efficient candidate selection without victim queries. SEP-Attack is thus effective at generating high-quality adversarial text examples under black-box constraints.
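The update-and-prune schedule from component 2 can be illustrated with a small sketch. Several details are assumptions made for illustration: perturbation distance is counted as the number of substituted words, a substitution is kept only if it lowers δ, and pruning reverts the substitution whose reversal leaves δ lowest until at most ε remain. The names `with_token` and `substitute_and_prune` are hypothetical, and `delta` is any callable that scores a token list.

```python
def with_token(tokens, i, token):
    """Copy of tokens with position i replaced by token."""
    return tokens[:i] + [token] + tokens[i + 1:]


def substitute_and_prune(tokens, order, synonyms, delta, eta, eps):
    """Greedy synonym substitution with an expanded budget eta,
    followed by pruning back to the final budget eps (eta > eps)."""
    adv, changed = list(tokens), []

    # Expansion: visit words in the sampled order, up to eta substitutions.
    for i in order:
        if len(changed) >= eta:
            break
        options = synonyms.get(tokens[i], [])
        if not options:
            continue
        # Pick the synonym that minimizes the confidence score.
        best = min(options, key=lambda s: delta(with_token(adv, i, s)))
        if delta(with_token(adv, i, best)) < delta(adv):
            adv[i] = best
            changed.append(i)

    # Pruning: revert the least useful substitutions until within eps.
    while len(changed) > eps:
        i = min(changed, key=lambda j: delta(with_token(adv, j, tokens[j])))
        adv[i] = tokens[i]
        changed.remove(i)

    return adv, sorted(changed)
```

Overshooting to η and then pruning lets the search try more substitutions than the final budget allows and keep only those that contribute most to lowering δ.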

A concrete example end-to-end: starting from an input sentence X, SEP-Attack first generates diverse ensemble weights WE via DPP. For one sampled w_e, it computes confidence scores and word importances, then in each iteration replaces the most important words with synonyms that minimize weighted confidence score δ. After T iterations, it prunes substitutions to stay within perturbation distance ε. This yields candidate X'. Repeating over all WE yields a candidate set X. Transferability scores τ are computed for each X' by averaging δ over synonym neighbors. Finally, top-K highest τ candidates are selected as transferable adversarial examples for attack.
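The final selection step can be sketched as follows. This is an illustration under assumptions, not the paper's code: neighbors are drawn by swapping one random word for a synonym, `score_fn` stands in for whatever per-text score the paper averages in τ, and candidates are ranked by τ in descending order as the summary states, which leaves the sign convention to the score function. The function names and the default m=8 are hypothetical.

```python
import random


def sample_neighbors(tokens, synonyms, m, rng):
    """Draw m neighbors of a candidate by replacing one word with a synonym."""
    positions = [i for i, t in enumerate(tokens) if synonyms.get(t)]
    neighbors = []
    for _ in range(m):
        neighbor = list(tokens)
        if positions:
            i = rng.choice(positions)
            neighbor[i] = rng.choice(synonyms[tokens[i]])
        neighbors.append(neighbor)
    return neighbors


def transferability(tokens, synonyms, score_fn, m=8, rng=None):
    """tau: average score over a sampled synonym neighborhood."""
    rng = rng or random.Random(0)
    neighbors = sample_neighbors(tokens, synonyms, m, rng)
    return sum(score_fn(n) for n in neighbors) / len(neighbors)


def select_top_k(candidates, synonyms, score_fn, k, m=8, seed=0):
    """Rank candidates by tau (descending) and keep the top k."""
    rng = random.Random(seed)
    scored = [
        (transferability(c, synonyms, score_fn, m, rng), c) for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Averaging over neighbors favors candidates whose score is stable under small synonym changes, which matches the intuition in the text that flatter regions transfer better.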

Technical innovations

  • Use of Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights representing varying transferability levels of submodels, as opposed to uniform weighting.
  • Introduction of a weighted confidence score metric δ combining ensemble predictions with DPP-sampled weights to more accurately estimate word importance during attack.
  • Iterative synonym substitution with an update-and-prune schedule expanding perturbation budget initially (η > ε) then removing less important replacements to mitigate semantic distortion and optimize attack quality.
  • Novel transferability score τ computed by averaging confidence scores over a sampled synonym neighborhood around each adversarial candidate, enabling robust selection of transferable attacks without victim queries.

Datasets

  • MR — 500 test examples used — public benchmark for short text sentiment classification
  • AG — 500 test examples used — public benchmark for short text topic classification
  • Yelp — 500 test examples used — public long text sentiment classification dataset
  • IMDB — 500 test examples used — public long text movie review sentiment classification dataset

Baselines vs proposed

  • TextFooler: ASR = 47.1% on MR / 18.8% on AG vs SEP-Attack: 70.2% / 58.8%
  • DeepWordBug: ASR = 37.8% on MR / 23.5% on AG vs SEP-Attack: 70.2% / 58.8%
  • BBA: ASR = 61.7% on MR / 40.5% on AG vs SEP-Attack: 70.2% / 58.8%
  • Ensemble: ASR = 19.1% on MR / 12.7% on AG (10 query budget) vs SEP-Attack: 70.2% / 58.8%
  • HAWR: ASR = 38.0% on MR / 22.4% on AG vs SEP-Attack: 70.2% / 58.8%
  • Against defense HotFlip on AG: TextFooler 16.0%, DeepWordBug 27.0%, BBA 30.0%, HAWR 11.6%, SEP-Attack 33.3%
  • Against defense SHIELD on IMDB: TextFooler 4.1%, DeepWordBug 4.75%, BBA 37.8%, HAWR 23.9%, SEP-Attack 97.9%
  • Real-World API: Alibaba Cloud ASR: BBA 58.7% vs SEP-Attack 64.1%; Google Cloud ASR: LeapAttack 29.9% vs SEP-Attack 75.3%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.24958.

Fig 1: The overview of our proposed SEP-Attack framework, which consists of three main components. (1) Generating …

Limitations

  • The attack focuses on classification tasks; generalizability to other NLP tasks like question answering or generation is untested.
  • Synonym substitution relies on precomputed synonym sets which may limit perturbation diversity or semantic coherence.
  • No adversarial evaluation under adaptive defenses specifically designed against transfer-based attacks.
  • The technique depends on the quality and diversity of surrogate models; unknown if it scales to very large-scale ensembles or newer model architectures.
  • Semantic similarity or fluency metrics are not extensively analyzed, so linguistic naturalness of adversarial examples is unclear.
  • The discrete nature of text limits gradient-based optimizations, requiring heuristic perturbation schedules that may not find global optima.

Open questions / follow-ons

  • How effective is SEP-Attack against future or adaptive adversarial defenses designed specifically for transfer-based text attacks?
  • Can the DPP-based weight diversity generation scale or be optimized further when ensembles involve hundreds of large language models?
  • What is the semantic and fluency quality of the generated adversarial examples, and how can the method balance attack success rate and text naturalness better?
  • Can the method extend to other NLP tasks beyond classification, such as named entity recognition or question answering?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, SEP-Attack highlights key vulnerabilities in text classification models used for moderation or bot detection that rely on static surrogate ensembles for robustness evaluation. Its novel diversity-weighted ensemble approach shows that treating submodels with equal weights can underestimate transferable attack risks, suggesting defense evaluation should consider diverse surrogate weightings. The confidence scoring and transferability metrics provide insight into more robust criteria for detecting adversarial inputs. Furthermore, the demonstrated success against black-box real-world APIs with low query budgets warns of practical risks where attackers can generate transferable yet undetectable adversarial text payloads. Defensive strategies may need to incorporate detection of synonym substitution patterns, dynamic model ensembles, or robust semantic similarity checks to mitigate such attacks. The work encourages rigorous adversarial testing of language-based CAPTCHA or bot detection systems using surrogate ensembles with diverse weighting and transferability-aware adversarial candidate selection.

Cite

@article{arxiv2605_24958,
  title={SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack},
  author={Han Liu and Zhi Xu and Xiaotong Zhang and Feng Zhang and Xiaoming Xu and Wei Wang and Fenglong Ma and Hong Yu},
  journal={arXiv preprint arXiv:2605.24958},
  year={2026},
  url={https://arxiv.org/abs/2605.24958}
}

Articles are CC BY 4.0 — feel free to quote with attribution