Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Source: arXiv:2605.28775 · Published 2026-05-27 · By Suji Kim, Kangsan Kim, Sung Ju Hwang

TL;DR

This paper addresses how to efficiently specialize small computer-use agents (CUAs) for diverse software domains. Large specialized experts are costly and impractical for many real-world uses; small models are more feasible but perform unevenly across domains. The key observation is that naive data augmentation or large-scale training on target domains yields only marginal gains. The authors instead propose LEARNWEAK, a fully automated, annotation-free framework that uses a stronger teacher agent to identify the student's domain-specific weaknesses, generates new training tasks that target those weaknesses, and applies an error-aware preference optimization that separates planning errors from execution errors. Evaluated on the OSWorld benchmark over eight software domains, LEARNWEAK improves average task success by 11.6 and 11.1 absolute percentage points over the EvoCUA-8B and OpenCUA-7B baselines, respectively. It surpasses existing autonomous dataset generation baselines and, in some domains, lets the small agent outperform the larger teacher. The results indicate that student-aware data generation and fine-grained, error-targeted training are central to effective domain specialization of small CUAs, and they offer a practical path to adapting lightweight agents to diverse computer-use environments.

Key findings

  • LEARNWEAK specialization improves EvoCUA-8B from 50.69% to 62.24% average task success on OSWorld domains, +11.6 points.
  • LEARNWEAK specialization improves OpenCUA-7B from 37.65% to 48.72%, +11.1 points average across 8 domains.
  • Specialized EvoCUA-8B surpasses its 32B teacher in 3 domains (Gimp, Thunderbird, VSCode).
  • LEARNWEAK outperforms existing data synthesis baselines like AgentNet, AgentSynth, OS-Genesis, ZeroGUI, and WebSTAR by 5.58 points on average (Table 2).
  • Weakness-aware dataset generation performs best when the weakness reports come from the target student model itself rather than from other models (Table 3).
  • Iterative generation and weakness-report conditioning are both necessary; one-shot or exploration-only generation performs worse (Table 6).
  • Teacher choice matters until the teacher is strong enough: stronger teachers produce more reliable failure signals, which enable larger student gains (Table 4).
  • LEARNWEAK's error-aware preference optimization outperforms standard SFT and DPO by 9.62 points on average (Table 5).

Threat model

n/a – This is an algorithmic framework for domain-specific model specialization in computer-use agents rather than a security or adversarial threat model paper. The assumptions include having access to a stronger teacher agent and an executable environment with verification but do not involve active adversaries.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is not explicitly modeled here since this is a domain specialization approach for CUAs, but the process assumes access to a stronger reference (teacher) agent and seed tasks. The student policy is pretrained on diverse GUI tasks but weak in the target domain. No human annotation is used beyond a small set of seed queries. The environment is fully executable with a verification function to assess success.

  2. Data: The data consists of teacher and student trajectories executed in the target-domain environment over multiple iterative cycles. Initially, only a small number (K) of seed queries/tasks are defined per domain. Paired teacher-student executions then identify tasks where the teacher succeeds but the student fails; these failure sets are used to build weakness reports that describe the typical error types per domain. New synthetic queries are generated conditioned on the weakness reports and representative screenshots so that they target the student’s weaknesses. The loop runs for N iterations, aggregating failures into a final training set D_d for specialization. Evaluation uses the OSWorld benchmark with 8 domains.

  3. Algorithm/Architecture: LEARNWEAK consists of two main components:

  • LEARNWEAK-GEN: An automated dataset generation pipeline that iteratively improves student data. It compares teacher and student trajectories on tasks, extracts failure modes, summarizes weakness reports capturing planning vs execution error types, selects representative screenshots via clustering and VLM reranking, and uses a task-query generator to synthesize new queries targeting weaknesses and domain exploration.
  • LEARNWEAK-DPO: A training method based on Direct Preference Optimization (DPO) that applies selective preference learning losses. Training processes trajectories at the step level, keeping only steps where the teacher action differs from the replayed student action. Each step is categorized as a planning or an execution error, and masked losses update only the behaviorally relevant parts (e.g., execution parameters for execution failures). Low-rank adapters (LoRA) are trained on top of a frozen base student policy for modular domain specialization.
  4. Training Regime: For each domain, the student is specialized by training on the generated dataset. The students are EvoCUA-8B and OpenCUA-7B, with EvoCUA-32B as the teacher. Training uses the step-level masked DPO objective, which increases the relative likelihood of teacher actions over student mistakes. Hyperparameters, epochs, and batch sizes are given in the paper's appendix; the main text gives few exact values. The process is repeated independently per domain.

  5. Evaluation Protocol: Evaluation uses the OSWorld benchmark across 8 domains. The metric is the average task success rate (%) over 3 inference trials per task, with a maximum budget of 50 steps. Comparisons cover zero-shot students, several strong baselines (generalist LLMs and existing domain-specialized models), and data generation baselines under matched budgets. Ablations cover the dataset generation components (iteration, weakness conditioning), teacher choice, and training objectives. Statistical significance tests are not explicitly mentioned.

  6. Reproducibility: Code, prompts, and hyperparameters are said to be in the supplemental material and on the project page, but the dataset and models are not fully public. Because the method depends on a strong teacher model, full replication requires access to that model. The setting also assumes an executable environment with a verifier, which may limit reproduction to supported benchmarks.
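
The generation loop described under Data and LEARNWEAK-GEN can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Rollout`, `collect_failures`, `weakness_driven_generation`, and the four `toy_*` stand-ins are hypothetical names, the teacher, student, report writer, and query generator are passed in as plain callables, and screenshot selection (clustering plus VLM reranking) is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    task: str
    success: bool        # verdict of the environment's verifier
    actions: List[str]   # simplified stand-in for a trajectory


Agent = Callable[[str], Rollout]
Failure = Tuple[Rollout, Rollout]  # (teacher rollout, student rollout)


def collect_failures(tasks: List[str], teacher: Agent, student: Agent) -> List[Failure]:
    """Keep only tasks the teacher solves and the student does not."""
    failures = []
    for task in tasks:
        t_roll, s_roll = teacher(task), student(task)
        if t_roll.success and not s_roll.success:
            failures.append((t_roll, s_roll))
    return failures


def weakness_driven_generation(seed_tasks, teacher, student, summarize, generate, n_iters):
    """Run N rounds of paired rollouts, aggregating failures into one dataset."""
    tasks, dataset = list(seed_tasks), []
    for _ in range(n_iters):
        failures = collect_failures(tasks, teacher, student)
        dataset.extend(failures)      # aggregate into the training set
        report = summarize(failures)  # weakness report for this round
        tasks = generate(report)      # next queries, conditioned on the report
    return dataset


# Toy stand-ins (assumptions for illustration only).
def toy_teacher(task: str) -> Rollout:
    return Rollout(task, True, ["plan", "act"])


def toy_student(task: str) -> Rollout:
    return Rollout(task, "export" not in task, ["plan"])


def toy_summarize(failures: List[Failure]) -> str:
    return "fails on: " + ", ".join(t.task for t, _ in failures)


def toy_generate(report: str) -> List[str]:
    return ["export as png", "rename layer"]


dataset = weakness_driven_generation(
    ["open file", "export as pdf"],
    toy_teacher, toy_student, toy_summarize, toy_generate, n_iters=2,
)
print([t.task for t, _ in dataset])  # ['export as pdf', 'export as png']
```

Keeping only the pairs where the teacher succeeds and the student fails means every stored example comes with a working reference trajectory, which is what the preference-based training step needs.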

Technical innovations

  • Annotation-free iterative dataset generation pipeline that identifies student weaknesses by paired teacher-student rollouts and synthesizes targeted queries conditioned on weakness reports and representative screenshots.
  • Error-aware preference optimization that disentangles planning (wrong action types) and execution (parameter-level) failures and applies selective masked training updates for more behaviorally precise specialization.
  • Modular domain specialization via LoRA adapters on top of a frozen base CUA policy, enabling scalable multi-domain adaptation with localized updates.
  • Demonstration that student-aware data generation outperforms static or weakness-agnostic autonomous trajectory generation methods under matched data budgets.
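
To make the error-aware preference optimization more concrete, here is a small sketch of a step-level masked DPO loss in plain Python. It uses the standard DPO objective, -log sigmoid(beta * margin), over masked token log-probabilities. The function names, the "type"/"param" token roles, the rule that a differing action type means a planning error, and the choice to keep every token for planning errors are all assumptions made for illustration; the summary above only states that execution failures update execution parameters.

```python
import math
from typing import List


def classify_error(teacher_action: dict, student_action: dict) -> str:
    """Assumed rule: different action type = planning error, else execution error."""
    if teacher_action["type"] != student_action["type"]:
        return "planning"
    return "execution"


def build_mask(token_roles: List[str], error_type: str) -> List[bool]:
    """Select which action tokens contribute to the loss."""
    if error_type == "execution":
        return [role == "param" for role in token_roles]  # parameters only
    return [True] * len(token_roles)  # assumption: whole action for planning errors


def masked_logprob(token_logps: List[float], mask: List[bool]) -> float:
    """Sum the log-probabilities of the unmasked tokens."""
    return sum(lp for lp, keep in zip(token_logps, mask) if keep)


def masked_dpo_loss(pol_w, pol_l, ref_w, ref_l, mask_w, mask_l, beta=0.1) -> float:
    """DPO loss for one step; w = teacher (preferred), l = student (rejected)."""
    margin = (masked_logprob(pol_w, mask_w) - masked_logprob(ref_w, mask_w)) - (
        masked_logprob(pol_l, mask_l) - masked_logprob(ref_l, mask_l)
    )
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)) = -log(sigmoid(-x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy example: both actions are clicks, so only the coordinates differ.
teacher_action = {"type": "click", "x": 10, "y": 20}
student_action = {"type": "click", "x": 90, "y": 20}
roles = ["type", "param", "param"]

error_type = classify_error(teacher_action, student_action)
mask = build_mask(roles, error_type)

ref_w, ref_l = [-0.2, -1.0, -1.0], [-0.2, -1.0, -1.0]   # frozen reference policy
pol_w, pol_l = [-0.2, -0.5, -0.5], [-0.2, -2.0, -2.0]   # policy being trained

loss = masked_dpo_loss(pol_w, pol_l, ref_w, ref_l, mask, mask)
baseline = masked_dpo_loss(ref_w, ref_l, ref_w, ref_l, mask, mask)  # equals log(2)
print(error_type, mask, loss < baseline)
```

When the policy equals the reference the margin is zero and the loss is log 2; it drops below that once the policy assigns relatively more probability to the teacher's parameters than to the student's. In a real setup the log-probabilities would come from the LoRA-adapted student and its frozen base.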

Datasets

  • OSWorld — multi-domain computer-use benchmark — public via authors’ project

Baselines vs proposed

  • Zero-shot EvoCUA-8B: average success 50.69% vs LEARNWEAK specialized: 62.24%
  • Zero-shot OpenCUA-7B: 37.65% vs LEARNWEAK specialized: 48.72%
  • WebSTAR (data filtering baseline): 49.62% vs LEARNWEAK: 55.20% (on 4 domains)
  • AgentNet full trajectories: 47.91% vs LEARNWEAK: 55.20% (on 4 domains)
  • LEARNWEAK-DPO training: 55.20% vs standard DPO: 45.58% average success
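
The margins quoted in the key findings can be recomputed from the success rates listed above. The short check below uses only those reported numbers; the dictionary keys are labels chosen here for readability.

```python
# (baseline %, LEARNWEAK %) pairs taken from the comparison list above.
reported = {
    "EvoCUA-8B zero-shot": (50.69, 62.24),
    "OpenCUA-7B zero-shot": (37.65, 48.72),
    "WebSTAR (4 domains)": (49.62, 55.20),
    "AgentNet (4 domains)": (47.91, 55.20),
    "standard DPO": (45.58, 55.20),
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {name: round(ours - base, 2) for name, (base, ours) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The first two gains, 11.55 and 11.07 points, are the figures reported as +11.6 and +11.1. The 5.58 and 9.62 margins quoted in the key findings match the WebSTAR and standard DPO rows, respectively.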

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.28775.

Fig 1: Conceptual illustration of LEARNWEAK and performance gains after domain specialization,

Fig 2: Overview of LEARNWEAK framework. LEARNWEAK-GEN iteratively constructs domain

Limitations

  • Relies on availability of a strong reference teacher agent to identify student weaknesses.
  • Evaluation limited to 8 OSWorld domains; generalization to other software or real user workflows is untested.
  • No explicit adversarial robustness or distribution shift tests beyond domain specialization.
  • The process requires an executable environment with success verification; applicability to opaque or proprietary software may be constrained.
  • Performance gains depend on student and teacher model architectures; effectiveness on much smaller or different CUA backbones is unverified.
  • Statistical significance of improvements not reported; some ablations show variability across domains.

Open questions / follow-ons

  • How does LEARNWEAK performance scale with weaker or more limited teacher agents?
  • Can the framework be extended to multi-domain or continual adaptation rather than per-domain specialization?
  • How do real human usage patterns and user corrections affect robustness and specialization efficacy?
  • What are the limits of error-aware preference optimization for more fine-grained failure taxonomies beyond planning vs execution?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, LEARNWEAK illustrates an automated approach to gap-targeted specialization of small interactive agents, using a stronger reference agent to discover weaknesses. Although the paper targets computer-use environments rather than adversarial bots, its core ideas of student-aware data synthesis and error-sensitive training could inform adaptive bot detection or challenge generation, where agent failures and unusual behaviors must be identified and corrected iteratively. The emphasis on separating error types and on selective preference learning could also translate into more targeted retraining of detection classifiers or behavior simulators. Direct application, however, would require moving from GUI actions to bot interaction traces and accounting for adversarial threat models.

Cite

@article{arxiv2605_28775,
  title={Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents},
  author={Suji Kim and Kangsan Kim and Sung Ju Hwang},
  journal={arXiv preprint arXiv:2605.28775},
  year={2026},
  url={https://arxiv.org/abs/2605.28775}
}

Articles are CC BY 4.0 — feel free to quote with attribution