BAMI: Training-Free Bias Mitigation in GUI Grounding

Source: arXiv:2605.06664 · Published 2026-05-07 · By Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang et al.

TL;DR

This paper targets a specific failure mode in GUI grounding, the task of localizing a UI element given a screenshot and a natural-language instruction. Rather than adding more training data, it diagnoses why state-of-the-art multimodal models fail on complex benchmarks. The authors introduce a lightweight attribution tool called Masked Prediction Distribution (MPD). MPD replaces computationally prohibitive Shapley-value attribution (estimated at ~10 hours per sample on an RTX 4090) with repeated random masking of screenshot regions (300 perturbations per sample, ~20 min), then aggregates prediction hotspots to reveal where a model's 'attention' is concentrated. Applying MPD to 50 error samples from TianXi-Action-7B on ScreenSpot-Pro, they attribute 74% of failures to inductive biases rather than knowledge gaps. These decompose into precision bias (20%), caused by digit-level tokenization of coordinates, and ambiguity bias (54%), caused by the mismatch between the edit-distance-like objective optimized in training (cross-entropy on token sequences) and the Euclidean-distance accuracy required at inference time.

Armed with this diagnosis, the authors propose BAMI (Bias-Aware Manipulation Inference), a training-free inference wrapper with two modules: (1) coarse-to-fine focus, which iteratively crops the screenshot around the model's prediction at a configurable ratio λ to progressively narrow the coordinate search space; and (2) candidate selection, which generates multiple mutually exclusive candidate boxes via masked re-prediction and then uses an external correction model (online API or a locally fine-tuned Qwen3-VL-8B) guided by GUI-specific prompt rules to pick the geometrically correct candidate, explicitly injecting Euclidean-space priors that the grounding model's token-level training objective lacks.

Evaluated on ScreenSpot-Pro and ScreenSpot-V2 across OS-Atlas-7B, UI-TARS-1.5-7B, and TianXi-Action-7B backbones, BAMI consistently improves all baselines without any weight updates. The headline result is TianXi-Action-7B rising from 51.9% to 57.8% on ScreenSpot-Pro, surpassing all other 7B-scale test-time methods including DiMo-GUI (49.7%) and GUI-RC (41.2%). A locally trained correction model variant achieves 56.2%, demonstrating that the approach does not require large proprietary API calls.

Key findings

MPD attribution on 50 TianXi-Action-7B error samples on ScreenSpot-Pro reveals: 14% knowledge gap, 20% precision bias, 54% ambiguity bias, 12% other—meaning 74% of failures are inductive bias and are theoretically addressable at inference time without retraining.
BAMI applied to TianXi-Action-7B raises ScreenSpot-Pro average accuracy from 51.9% to 57.8% (+5.9 pp), the best reported result among 7B-scale models at time of submission.
BAMI applied to OS-Atlas-7B improves ScreenSpot-Pro average from 18.9% to 41.6% (+22.7 pp), the largest absolute gain across tested backbones (Table 3).
BAMI applied to UI-TARS-1.5-7B improves from 40.8% to 51.9% (+11.1 pp) on ScreenSpot-Pro (Table 3).
Ablation (Table 4 left): coarse-to-fine focus alone yields 55.2% (+3.3 pp over the 51.9% baseline); candidate selection alone yields 54.3% (+2.4 pp); combined BAMI yields 57.8% (+5.9 pp). The combined gain slightly exceeds the sum of the individual gains (+5.7 pp), so the two components are complementary rather than redundant.
Ablation (Table 4 right): the vanilla candidate selection prompt reaches 55.7%; adding chain-of-thought (CoT) prompting reaches 57.0%; adding CoT plus key principles (KP, the Euclidean-space priors) reaches 57.8%. Most of the gain comes from CoT (+1.3 pp), with the GUI geometric priors adding a further +0.8 pp.
Crop ratio ablation (Fig 6a): ratios above 40% provide meaningful precision-bias reduction; optimal range is [50%, 70%]; 2 iterations is sufficient, with diminishing or negative returns beyond that.
Local correction model (Qwen3-VL-8B fine-tuned via LoRA on 128K dual-box samples) achieves 56.2% vs GPT-5's 57.8%, demonstrating near-parity without external API dependency (Table 5).

Threat model

n/a — this is a GUI grounding accuracy paper, not a security or adversarial robustness paper. The 'adversary' is implicitly the difficulty of the ScreenSpot-Pro benchmark (high-resolution professional software screenshots with small, densely packed UI elements), not a malicious actor. No adversarial manipulation of inputs, model extraction, or evasion attacks are considered.

Methodology — deep read

Threat model and assumptions: As noted above, there is no adversary in the security sense; the difficulty comes from the complex professional GUI screenshots in ScreenSpot-Pro that cause state-of-the-art grounding models to fail. The authors assume a frozen grounding model whose weights cannot be modified at deployment time, and an inference-time compute budget that can be modestly expanded: 2 crop iterations × 2–3 candidate predictions per iteration, plus a correction-model call in each iteration to select among the candidates. They also assume access to an external vision-language model (either an API or a locally fine-tuned model) for the correction step.

Data: Two evaluation benchmarks are used. ScreenSpot-Pro is the primary benchmark, covering professional software across six domains (Development, Creative, CAD, Scientific, Office, OS), with high-resolution screenshots and small targets—most baseline models score below 50% here. ScreenSpot-V2 covers simpler mobile/web/desktop scenarios. The MPD pilot study uses 50 manually analyzed error samples from TianXi-Action-7B on ScreenSpot-Pro; these are not a held-out split but a diagnostic sample. The correction model training uses 128K dual-box samples (two candidate bounding boxes per sample, one correct, one incorrect), sourced from public data according to the authors—exact dataset provenance is relegated to a supplementary appendix not included in the provided text. No dataset-level statistics (class balance, resolution distribution) are reported in the main paper.

Architecture and novel components: BAMI is not a neural module but an inference-time algorithm (Algorithm 1) wrapping any existing grounding model f and a correction model m. The outer loop runs N iterations (N=2 in practice). In each iteration, M candidate bounding boxes are generated (M=2–3): the grounding model predicts a box on the current (possibly cropped) image; that box's pixels are masked to black; the model predicts again on the masked image, ensuring candidate diversity through exclusion. This produces a candidate set Φ_t. The correction model m then selects the preferred box from Φ_t given the query and the unmasked image, guided by a structured prompt encoding four GUI-specific principles: functional preference, memory/pattern comparison, interactive component prioritization, and a fourth unlisted principle (redacted or omitted in the paper text). The selected box is used to crop the image at ratio λ for the next iteration. The final output is the center point of the last selected box. The novelty of the MPD attribution tool lies in replacing Shapley values (exponential complexity) with a Monte Carlo masking scheme: 300 random rectangular masks per sample, predictions aggregated into a spatial heatmap, reducing per-sample attribution from ~10 hours to ~20 minutes on an RTX 4090.
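The MPD idea described above (mask a random rectangle, re-predict, accumulate where the predictions land) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `predict` is a hypothetical stand-in for the grounding model, the image is a plain nested list of grayscale values, and the mask-size range (`min_frac`, `max_frac`) is an assumption, since the summary only states that 300 random rectangular masks are used.

```python
import random

def masked_prediction_distribution(image, predict, n_masks=300, seed=None,
                                   min_frac=0.1, max_frac=0.4):
    """Accumulate predicted points over randomly masked copies of `image`.

    image   : list of rows, each a list of grayscale pixel values
    predict : hypothetical callable, image -> (x, y) in pixel coordinates
    Returns an H x W heatmap (nested lists) whose entries sum to ~1.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    heat = [[0.0] * w for _ in range(h)]
    for _ in range(n_masks):
        # One random rectangle per perturbation; the size range is an assumption.
        mh = max(1, int(rng.uniform(min_frac, max_frac) * h))
        mw = max(1, int(rng.uniform(min_frac, max_frac) * w))
        top = rng.randint(0, h - mh)    # randint bounds are inclusive
        left = rng.randint(0, w - mw)
        masked = [row[:] for row in image]
        for r in range(top, top + mh):
            masked[r][left:left + mw] = [0] * mw
        # Record where the model points once this region is hidden.
        x, y = predict(masked)
        xi = min(max(int(round(x)), 0), w - 1)
        yi = min(max(int(round(y)), 0), h - 1)
        heat[yi][xi] += 1.0 / n_masks
    return heat
```

Each perturbation costs one forward pass, so the cost grows linearly with `n_masks`. A heatmap with a single dominant peak away from the ground truth points to a confident but wrong mode, whereas mass split across several regions is the kind of pattern the paper associates with ambiguity.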

Training regime: BAMI itself requires no training. The local correction model variant fine-tunes Qwen3-VL-8B using LoRA on 128K dual-box samples. Training details (learning rate, LoRA rank, epochs, batch size, hardware) are stated to be in the supplementary material, which is not included in the provided text. This is a reproducibility gap in the main paper. The grounding backbone models (OS-Atlas-7B, UI-TARS-1.5-7B, TianXi-Action-7B) are used frozen. All inference experiments run on a single RTX 4090 GPU.

Evaluation protocol: The correctness criterion is: a prediction is correct if the predicted bounding box center (x_c, y_c) lies within the ground-truth bounding box. This is the standard ScreenSpot-Pro metric. The primary metric is accuracy (fraction of correct predictions). Baselines span proprietary models (GPT-4o, Claude Computer Use), general open-source VLMs (Qwen2.5-VL-3B/7B), GUI-specific SFT models, RL-trained models, and test-time inference methods. Ablations cover: (1) crop ratio λ ∈ {30%,...,90%} with fixed 2 iterations; (2) iteration count 1–5 with fixed λ=50%; (3) target type (text vs. icon elements, shown via per-sample Euclidean distance plots in Fig 6b); (4) component ablation (C2F focus only, candidate selection only, both); (5) prompt design variants (vanilla, +CoT, +CoT+KP); (6) correction model choice across five online APIs and one local model. No statistical significance tests are reported. No cross-validation is used—evaluation is on the full ScreenSpot-Pro benchmark.
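The correctness criterion above is simple enough to state as code. The sketch below assumes boxes are `(x1, y1, x2, y2)` tuples in pixel coordinates and treats the ground-truth boundary as inclusive; both conventions are assumptions, not details confirmed by the paper.

```python
def center_hit(pred_box, gt_box):
    """True if the centre of pred_box lies inside gt_box.

    Boxes are (x1, y1, x2, y2); boundary points count as inside (assumption).
    """
    xc = (pred_box[0] + pred_box[2]) / 2
    yc = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= xc <= gt_box[2] and gt_box[1] <= yc <= gt_box[3]

def accuracy(pred_boxes, gt_boxes):
    """Fraction of predictions whose centre falls inside the matching ground truth."""
    hits = sum(center_hit(p, g) for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)

preds = [(10, 10, 20, 20), (0, 0, 10, 10)]
gts = [(12, 12, 18, 18), (12, 12, 18, 18)]
print(accuracy(preds, gts))  # 0.5
```

Because only the center point is scored, a predicted box can be much larger than the target and still count as correct, which is why the method returns a click point rather than a tight box.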

Concrete end-to-end example: Given the instruction 'launch next step in design run in vivado' and a high-resolution Vivado IDE screenshot: (1) TianXi-Action-7B predicts a coarse bounding box, which may be off by 100+ pixels due to discretization. (2) BAMI crops the image to 50–70% of the original around that coarse box, feeding the crop back to the same model for a refined prediction. (3) On the cropped image, the model predicts candidate box 1. Its pixels are masked; the model predicts candidate box 2. (4) The correction model (e.g., GPT-5 or local Qwen3-VL-8B) receives the original cropped image, both candidate boxes, the query, and the structured prompt rules. It selects box 1 or 2 based on Euclidean-space priors. (5) The selected box is used for the next crop iteration; after N=2 iterations, the center of the final selected box is returned as the click coordinate.
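The steps above can be sketched as a short loop. This is an illustrative reading of the Algorithm 1 description, not the released code: `ground` and `select` are hypothetical callables standing in for the grounding and correction models, images are nested lists of grayscale values, and the crop-window clamping and the offset bookkeeping that maps the result back to original coordinates are assumptions.

```python
def crop_window(box, width, height, ratio):
    """Window of size (ratio*width, ratio*height) centred on box, clamped to the image."""
    cw, ch = max(1, int(width * ratio)), max(1, int(height * ratio))
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    left = min(max(int(cx - cw / 2), 0), width - cw)
    top = min(max(int(cy - ch / 2), 0), height - ch)
    return left, top, left + cw, top + ch

def mask_box(image, box):
    """Return a copy of image with the box region set to 0 (black)."""
    out = [row[:] for row in image]
    x1, y1, x2, y2 = (int(v) for v in box)
    for r in range(max(y1, 0), min(y2, len(out))):
        for c in range(max(x1, 0), min(x2, len(out[0]))):
            out[r][c] = 0
    return out

def bami(image, query, ground, select, n_iters=2, n_cands=2, ratio=0.5):
    """Coarse-to-fine focus plus candidate selection; returns a click point (x, y).

    ground(image, query)            -> box (x1, y1, x2, y2)   [hypothetical]
    select(image, query, candidates) -> index of chosen box    [hypothetical]
    """
    off_x = off_y = 0          # top-left of the current crop in original coordinates
    cx = cy = None
    for _ in range(n_iters):
        masked, cands = image, []
        for _ in range(n_cands):
            box = ground(masked, query)
            cands.append(box)
            masked = mask_box(masked, box)   # exclude this region from the next prediction
        # The selector sees the unmasked current image and all candidates.
        chosen = cands[select(image, query, cands)]
        cx = (chosen[0] + chosen[2]) / 2 + off_x
        cy = (chosen[1] + chosen[3]) / 2 + off_y
        # Narrow the search space around the chosen box for the next iteration.
        left, top, right, bottom = crop_window(chosen, len(image[0]), len(image), ratio)
        image = [row[left:right] for row in image[top:bottom]]
        off_x, off_y = off_x + left, off_y + top
    return cx, cy
```

With `n_iters=2` and `n_cands=2` this makes four grounding calls and two selector calls per query. Masking before the next `ground` call is what keeps the candidates from collapsing onto the same region.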

Reproducibility: Code is publicly released at https://github.com/Neur-IO/BAMI. Grounding model weights are available through their respective public repositories. The local correction model's LoRA training details are in the supplementary material. The 128K dual-box training dataset provenance is not fully described in the main text. The 50-sample MPD analysis dataset is not released as far as can be determined from the provided text.

Technical innovations

Masked Prediction Distribution (MPD): a Monte Carlo attribution method that replaces exponential-cost Shapley value estimation with 300-sample random masking to produce spatial prediction heatmaps in ~20 minutes vs ~10 hours per sample on the same RTX 4090 hardware.
Formalization of precision bias as a consequence of digit-level tokenization (e.g., '789' → <7><8><9>), where cross-entropy training minimizes edit distance over token sequences rather than Euclidean distance over coordinate values, creating a systematic localization offset distinct from prior work's general 'hallucination' framing.
Formalization of ambiguity bias as a metric inconsistency: a candidate at edit distance 1 from ground truth (e.g., 189 vs. 789) can be Euclidean-distance 600 away, while a candidate at edit distance 3 (e.g., 801) is Euclidean-distance 12 away—prior GUI inference methods (DiMo-GUI, GUI-RC) address multi-modal ambiguity but not this specific token-vs-coordinate metric conflict.
Candidate diversity through masked exclusion: rather than sampling multiple predictions independently (which may collapse to the same mode), BAMI masks the pixel region of each accepted candidate before soliciting the next prediction, forcing mutual exclusivity in the candidate set—a simple but previously undemonstrated mechanism in GUI grounding inference pipelines.
Training-free correction via GUI-prior-injected prompts: the correction model is shown to require explicit Euclidean-space priority principles in its prompt to outperform naive selection (55.7% vanilla vs. 57.8% with CoT+KP), suggesting that VLMs used as selectors default to edit-distance-like preferences unless explicitly overridden.
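The metric conflict behind ambiguity bias can be checked directly with the numbers quoted above. The sketch below uses a standard Levenshtein edit distance over digit strings and the absolute difference of the coordinate values (the one-dimensional case of Euclidean distance); the function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def compare(ground_truth, candidate):
    """Return (token-level edit distance, coordinate distance) for digit strings."""
    return (edit_distance(ground_truth, candidate),
            abs(int(ground_truth) - int(candidate)))

for cand in ("189", "801"):
    print(cand, compare("789", cand))
# 189 (1, 600)
# 801 (3, 12)
```

The candidate that is closest in token space (189) is far away on screen, while the candidate that shares no digit positions with the target (801) is only 12 pixels off. A selector that ranks by coordinate distance reverses the ordering a token-level objective would prefer.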

Datasets

ScreenSpot-Pro — not explicitly sized in main text, covers 6 professional software domains (Development, Creative, CAD, Scientific, Office, OS) with text and icon target types — public benchmark, introduced in [13]
ScreenSpot-V2 — not explicitly sized in main text, covers mobile/web/desktop simple scenarios — public benchmark, introduced in [32]
Dual-box correction training set — 128K samples (paired candidate bounding boxes, one correct/one incorrect) — source described as public data, exact provenance in supplementary material not provided in main text

Baselines vs proposed

TianXi-Action-7B (backbone): ScreenSpot-Pro avg accuracy = 51.9% vs BAMI-7B (GPT-5 correction): 57.8%
TianXi-Action-7B (backbone): ScreenSpot-Pro avg accuracy = 51.9% vs BAMI-7B (local Qwen3-VL-8B correction): 56.2%
OS-Atlas-7B: ScreenSpot-Pro avg = 18.9% vs +BAMI: 41.6%
UI-TARS-1.5-7B: ScreenSpot-Pro avg = 40.8% vs +BAMI: 51.9%
UGround-7B: ScreenSpot-Pro avg = 16.5% vs +BAMI: 30.0%
DiMo-GUI-7B (test-time method): ScreenSpot-Pro avg = 49.7% vs BAMI-7B: 57.8%
GUI-RC (test-time method): ScreenSpot-Pro avg = 41.2% vs BAMI-7B: 57.8%
C2F Focus only (ablation): ScreenSpot-Pro = 55.2% vs full BAMI: 57.8%
Candidate Selection only (ablation): ScreenSpot-Pro = 54.3% vs full BAMI: 57.8%
Vanilla prompt (no CoT, no KP): 55.7% vs CoT+KP prompt: 57.8%
Gemini-2.5-Pro correction model: 57.2% vs GPT-5 correction model: 57.8%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06664.

Fig 1: Compared with conventional grounding models, BAMI …
Fig 2: Bias Mitigation Strategy. To address accuracy bias and …
Fig 3: Accuracy comparison on ScreenSpot-Pro. BAMI …
Fig 4: Error Attribution Analysis. (a) Proportions of attribution types. (b) Attribution analysis of model predictions. The deep red …
Fig 5: Illustration of BAMI. Step 1: Based on the initial prediction results of the grounding model, BAMI performs cropping around …
Fig 6 (page 4).
Fig 7 (page 4).
Fig 8 (page 4).

Limitations

MPD pilot analysis is based on only 50 error samples from one model (TianXi-Action-7B) on one benchmark (ScreenSpot-Pro); the 74%/14%/12% breakdown may not generalize to other model families, architectures, or benchmark distributions.
Inference cost is substantially increased: each grounding query requires 2 iterations × 2–3 grounding-model forward passes (4–6 in total) plus a correction-model call in each iteration, so roughly 5–8 model calls replace a single forward pass. No wall-clock latency numbers are reported, making it impossible to assess production feasibility.
The correction model training details (LoRA rank, learning rate, epochs, batch size, train/val split of the 128K dataset) are deferred to supplementary material not available in the main text, impeding reproducibility of the local variant.
No adversarial or distribution-shift evaluation: BAMI is tested only on two standard benchmarks; robustness to out-of-distribution GUIs (novel software, non-English interfaces, unusual aspect ratios) is untested.
The fourth candidate selection rule mentioned in Figure 5's prompt template is explicitly blanked out ('_______________') in the paper, suggesting incomplete disclosure of the prompt engineering that drives the method's performance.
BAMI's improvement on the CAD-Icon subcategory of ScreenSpot-Pro (29.7% vs. an 18.8% baseline for TianXi-Action-7B) and on several other subcategories is meaningful, but Scientific-Text drops from 80.6% to 77.8% (Table 3), indicating the method can hurt performance on subcategories the backbone already handles well. The authors do not discuss this regression.
No statistical significance testing is performed. ScreenSpot-Pro subcategory sample counts are not stated in the main text, and the Table 1 pilot covers only 50 samples, so apparent per-category gains may be within noise.

Open questions / follow-ons

Can the ambiguity bias correction be internalized into the grounding model itself—e.g., via a modified reward function during RL fine-tuning that penalizes edit-distance-correct but Euclidean-distance-wrong predictions—rather than requiring an external correction model at inference time?
What is the relationship between the number of MPD masking samples (300 in this work) and attribution quality as a function of screenshot resolution and element density? Is there a principled stopping criterion, or is 300 an arbitrary choice?
BAMI's coarse-to-fine cropping discards global context at fine stages; for tasks requiring understanding of relative position (e.g., 'the second button from the left'), does contextual information loss systematically degrade performance, and can a dual-stream (global + local) architecture preserve it?
The local correction model is trained on 128K dual-box samples—how sensitive is its performance to training set size, diversity, and the strategy used to generate 'hard negative' candidate boxes, and does it transfer across GUI domains not seen during fine-tuning?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, BAMI is directly relevant as a capabilities-assessment signal: it demonstrates that training-free inference wrappers can push GUI grounding models from ~52% to ~58% on professionally complex interfaces, closing the gap with human performance on tasks that CAPTCHA systems often assume are hard for automated agents (small targets, ambiguous icons, context-dependent element identification). The coarse-to-fine mechanism in particular is a practical recipe for any agent attempting to solve image-based challenges requiring precise pixel-level localization—it requires no new training data and is applicable to any frozen VLM backbone.

Practically, a bot-defense team should update their adversary capability model to account for test-time inference enhancements of this type. CAPTCHA designs that rely on small or visually ambiguous targets (e.g., 'click the smallest arrow' or 'select the icon that is not interactive') are specifically the failure modes that BAMI is engineered to overcome. The ambiguity bias correction via structured prompting is also noteworthy: it shows that a VLM acting as a 'meta-selector' over multiple candidate predictions can compensate for the systematic tokenization biases that previously provided some implicit resistance to automated solving. The remaining 14% knowledge-gap failures and the observed regression on some subcategories suggest that interfaces using truly novel or domain-specific UI paradigms still provide some natural resistance, but this gap is narrowing with inference-time methods alone.

Cite

bibtex

@article{arxiv2605_06664,
  title={BAMI: Training-Free Bias Mitigation in GUI Grounding},
  author={Borui Zhang and Bo Zhang and Bo Wang and Wenzhao Zheng and Yuhao Cheng and Liang Tang and Yiqiang Yan and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2605.06664},
  year={2026},
  url={https://arxiv.org/abs/2605.06664}
}
