PRO-CUA: Process-Reward Optimization for Computer Use Agents

Source: arXiv:2605.29119 · Published 2026-05-27 · By Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

TL;DR

This paper addresses the difficulty of training Computer Use Agents (CUAs), which automate complex long-horizon GUI workflows across web and desktop environments. Existing training methods rely primarily on filtered behavior cloning (FBC) from expert trajectories, which suffers from distribution shift, provides no negative signal, and overfits easy tasks. Trajectory-level reinforcement learning struggles with sparse rewards and credit assignment, while step-level RL approaches often depend on brittle rule-based rewards tied to exact expert actions and on off-policy data, which limits generalization and data utilization.

The authors propose PRO-CUA, an on-policy iterative training framework that decouples environment interaction from policy optimization. PRO-CUA first collects on-policy states via live rollouts at elevated temperature, then samples diverse candidate thought-action pairs for each state. A Process Reward Model (PRM), a multimodal reasoning model, assigns each candidate a binary step-level reward by judging from visual and textual context whether the action functionally advances the task. The policy is then optimized with Group Relative Policy Optimization (GRPO), which uses each candidate's advantage relative to its group. This yields dense credit assignment, supervision that is not tied to exact imitation, and learning from both successful and failed trajectories, which mitigates distribution shift and improves data efficiency.

Empirical results on three live web benchmarks (WebVoyager, Mind2Web-Live, Online Mind2Web) show PRO-CUA outperforming both filtered behavior cloning and step-level RL with rule-based rewards. PRO-CUA improves success rate by up to 12.7 percentage points over FBC and 7.7 points over rule-based RL on WebVoyager (4B model). Ablations indicate that the key benefit comes from the PRM-based reward signal. Overall, PRO-CUA offers a scalable, more robust training paradigm for CUAs by combining on-policy state exploration with learned, functionally grounded step-level rewards.

Key findings

PRO-CUA improves success rate by 12.7 percentage points over filtered behavior cloning (29.7% to 42.4%) on WebVoyager with the 4B model.
PRO-CUA improves over rule-based step-level RL by 7.7 points on WebVoyager (34.7% to 42.4%) and 6.9 points on Mind2Web-Live (27.8% to 34.7%) with the 4B model.
PRM-based step-level rewards outperform rule-based rewards (36.6% vs 34.7% success on shared successful-trajectory subset), demonstrating effectiveness of learned functional reward.
PRO-CUA consistently yields substantially more usable training data by leveraging both successful and failed trajectories, unlike baselines that only learn from successful rollouts (Fig 4).
The Qwen3-VL-4B PRM, a lightweight model, achieves comparable training effectiveness to a much larger GPT5-mini PRM despite different reward calibration (Fig 3).
PRO-CUA’s group-relative advantage (GRPO) training objective acts as a dynamic curriculum, focusing learning on difficult states with sparse successful candidate actions.
With the larger 8B model, PRO-CUA similarly outperforms baselines, improving success rate on WebVoyager by 9.4 percentage points over rule-based RL (33.8% to 43.2%).
PRO-CUA improves over expert-trained or closed-data models despite using no external expert demonstrations.

Methodology — deep read

Threat Model & Assumptions: The paper defines no explicit adversary; the focus is on training CUAs that interact robustly with GUI environments despite distribution shift, sparse rewards, and limited expert data. The framework assumes no access to perfect expert demonstrations at training time and evaluates proficiency on dynamic live web tasks.
Data: They use three live web benchmarks derived from real-world web navigation tasks: WebVoyager, Mind2Web-Live, and Online Mind2Web. Training data consists of on-policy rollout trajectories collected live by the agent at an elevated sampling temperature (1.0) to encourage exploration. Each state comprises the task instruction, the action history, and the current screenshot (window w=1: only the most recent screenshot is kept, for memory efficiency). Failed as well as successful trajectories are collected to form the state dataset Dstate.
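The state layout described above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the class names, field names, and the `trajectory_to_states` helper are all hypothetical, and screenshots are represented as raw bytes purely for convenience.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One executed rollout step (hypothetical field names)."""
    thought: str
    action: str
    screenshot: bytes  # screen observed before the action was taken


@dataclass
class State:
    """Policy input at one step: instruction, text history, last w screenshots."""
    instruction: str
    action_history: List[str]
    screenshots: List[bytes]


def trajectory_to_states(instruction: str, steps: List[Step], w: int = 1) -> List[State]:
    """Turn one rollout into per-step states, keeping only the last w screenshots.

    No success filter is applied here: states from failed rollouts are kept too.
    """
    states = []
    for t in range(len(steps)):
        seen = [s.screenshot for s in steps[: t + 1]]
        states.append(State(
            instruction=instruction,
            action_history=[s.action for s in steps[:t]],
            screenshots=seen[-w:],
        ))
    return states
```

With `w=1`, each state carries the full textual action history but only a single image, which is the memory trade-off the summary mentions.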
Architecture / Algorithm:

Policy model: Qwen3-VL vision-language models fine-tuned for thought-action generation in GUI environments.
PRM: a multimodal step-level reward model that receives the task instruction, action history, candidate action, and an annotated screenshot cropped to the action's target element. It outputs a binary reward indicating whether the action functionally advances the task, rather than whether it exactly matches an expert action.
GRPO: Group Relative Policy Optimization optimizes the policy by sampling G candidate thought-action pairs per state, scoring them with the PRM, computing mean-centered relative advantages within each group, and applying clipped policy-gradient updates with KL regularization.
Decoupling live environment interaction (rollouts) from policy updates reduces hardware bottlenecks and latency.
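The group-relative update described in the list above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the paper's implementation: the function names are hypothetical, the clip range `eps=0.2` is a common PPO default rather than a value from the source, log-probabilities are taken per candidate rather than per token, and the KL regularization term is omitted for brevity.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Mean-center the PRM rewards of one candidate group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_surrogate(logp_new: List[float], logp_old: List[float],
                      advantages: List[float], eps: float = 0.2) -> float:
    """PPO-style clipped objective averaged over one group (to be maximized).

    eps is an assumed value; the KL penalty is left out of this sketch.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)       # pessimistic bound
    return total / len(advantages)
```

For a group with binary rewards `[1, 0, 0, 0]`, the single rewarded candidate receives advantage 0.75 and the others -0.25, while a group whose candidates all receive the same reward yields all-zero advantages.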

Training Regime:

10 training iterations.
At each iteration, rollout 256 tasks with max 20-step trajectories.
Collect on-policy states at temperature 1.0.
Generate G candidate thought-action pairs per collected state (the value of G is not stated).
Optimize policy using GRPO with a constant learning rate (5e-6 for RL, 1e-5 for FBC) over 1 epoch per iteration.
Training performed on NVIDIA A6000 GPUs.
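The regime above can be collected into a single configuration object, which also makes the implied data budget easy to compute. The class and field names are hypothetical, and `group_size=8` is a placeholder assumption only, since G is not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    iterations: int = 10             # from the summary
    tasks_per_iteration: int = 256   # from the summary
    max_steps: int = 20              # from the summary
    rollout_temperature: float = 1.0
    lr_rl: float = 5e-6
    lr_fbc: float = 1e-5
    epochs_per_iteration: int = 1
    group_size: int = 8              # ASSUMPTION: G is not stated in the summary

    def max_states_per_iteration(self) -> int:
        """Upper bound on collected states if every rollout uses all steps."""
        return self.tasks_per_iteration * self.max_steps

    def max_prm_calls_per_iteration(self) -> int:
        """Upper bound on PRM gradings: one per candidate per state."""
        return self.max_states_per_iteration() * self.group_size


cfg = TrainConfig()
print(cfg.max_states_per_iteration())  # 256 tasks * 20 steps
```

Under these numbers an iteration yields at most 5,120 states; the real count is lower whenever rollouts terminate before the 20-step limit.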

Evaluation Protocol:

Task success evaluated by GPT-5 automatic judge based on full trajectory and screenshots.
Baselines: Filtered Behavior Cloning (FBC) trained on successful trajectories only; step-level RL with rule-based rewards, likewise limited to successful trajectories.
Comparisons run on identical Qwen3-VL backbones.
Evaluation across 3 live benchmarks, excluding domains with anti-bot mechanisms.
Ablations isolate reward sources and data utilization.

Reproducibility:

Code released (URL provided in paper).
Uses public Qwen3-VL base models.
Live benchmarks public but subject to dynamic changes in real web environment.

Example: At iteration N, the current policy rolls out 256 tasks on the live web, producing trajectories of at most 20 steps. The collected state set Dstate includes states from both successful and failed rollouts. For each state, the policy samples a group of candidate actions, which the PRM grades with binary rewards based on multimodal reasoning about functional progress. GRPO then updates the policy to raise the probability of candidates that score above their group's average, which concentrates learning on challenging states. The improved policy is rolled out in iteration N+1, and the loop repeats. Over the iterations this raises success rates above the baselines and makes better use of interaction data.
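The two-stage loop in this example can be sketched with injected callables standing in for the live browser, the policy sampler, the PRM, and the optimizer. Every function name here is hypothetical. Dropping uniform-reward groups before the update is an illustrative choice, not something the summary states; it relies only on the fact that such groups have zero mean-centered advantage.

```python
from typing import Callable, List, Tuple

Group = Tuple[str, List[str], List[int]]  # (state, candidates, binary rewards)


def collect_states(tasks: List[str], rollout: Callable[[str], List[str]]) -> List[str]:
    """Stage 1: roll out the current policy and pool every visited state."""
    states: List[str] = []
    for task in tasks:
        states.extend(rollout(task))  # failed rollouts are kept as well
    return states


def build_groups(states: List[str], sample: Callable[[str], str],
                 prm: Callable[[str, str], int], group_size: int) -> List[Group]:
    """Stage 2a: sample G candidates per state and grade each with the PRM."""
    groups = []
    for s in states:
        cands = [sample(s) for _ in range(group_size)]
        groups.append((s, cands, [prm(s, c) for c in cands]))
    return groups


def informative(groups: List[Group]) -> List[Group]:
    """Keep groups with mixed rewards (assumed filter, see lead-in)."""
    return [g for g in groups if 0 < sum(g[2]) < len(g[2])]


def run_iteration(tasks, rollout, sample, prm, update, group_size: int):
    """One PRO-CUA-style iteration: collect states, grade groups, update."""
    states = collect_states(tasks, rollout)
    groups = build_groups(states, sample, prm, group_size)
    update(informative(groups))  # Stage 2b: offline policy update
    return len(states), len(groups)
```

Because the environment is touched only inside `collect_states`, the grading and update steps can run without a live browser, which is the decoupling the summary describes.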

Technical innovations

Decoupling live environment rollouts from policy optimization by alternating between on-policy state collection and offline policy updates.
Using a multimodal Process Reward Model (PRM) to assign functionally grounded, binary step-level rewards independent of expert action matching.
Applying Group Relative Policy Optimization (GRPO), which uses relative advantages within sampled thought-action groups and thereby acts as a dynamic curriculum that focuses training on difficult states.
Training on the agent’s own execution states including failure states to mitigate distribution shift and improve robustness and data utilization.
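The dynamic-curriculum effect follows from simple arithmetic on mean-centered advantages with binary rewards. The derivation below is a worked sketch, assuming plain mean-centering without standard-deviation normalization; the symbols k (number of rewarded candidates in a group) and A_i are introduced here for illustration and are not taken from the paper.

```latex
% Mean-centered advantage of candidate i in a group of size G, r_i \in \{0, 1\}
A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j

% If exactly k of the G candidates are rewarded:
A_i =
\begin{cases}
  1 - \dfrac{k}{G} & \text{if } r_i = 1 \\[6pt]
  -\dfrac{k}{G}    & \text{if } r_i = 0
\end{cases}
\qquad\Longrightarrow\qquad
\sum_{i=1}^{G} \lvert A_i \rvert
  = k\left(1 - \frac{k}{G}\right) + (G - k)\,\frac{k}{G}
  = \frac{2k(G - k)}{G}
```

Groups with k = 0 or k = G contribute nothing, so states the policy already solves reliably drop out of the update. When k is small, each rare successful candidate carries an advantage close to 1, which is consistent with the summary's claim that learning concentrates on states where successful candidates are sparse.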

Datasets

WebVoyager — size unspecified — public web navigation benchmark
Mind2Web-Live — size unspecified — public web navigation benchmark
Online Mind2Web — size unspecified — public web navigation benchmark

Baselines vs proposed

Filtered Behavior Cloning (FBC): WebVoyager success = 29.7% vs PRO-CUA = 42.4% (4B model)
Rule-based Step-level RL: WebVoyager success = 34.7% vs PRO-CUA = 42.4% (4B model)
Filtered Behavior Cloning (FBC): Mind2Web-Live success = 26.4% vs PRO-CUA = 34.7% (4B model)
Rule-based Step-level RL: Mind2Web-Live success = 27.8% vs PRO-CUA = 34.7% (4B model)
Qwen3-VL-4B PRM step-level rewards: 36.6% success vs rule-based step rewards 34.7% on shared successful-trajectory subset
PRO-CUA (8B model) outperforms rule-based Step-RL on WebVoyager (43.2% vs 33.8%) and Mind2Web-Live (30.6% vs 25.0%)
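The gaps in this list are absolute differences in success rate, and it can help to see them next to the corresponding relative gains. The snippet below recomputes both from the figures reported above; the `gains` helper and the dictionary keys are hypothetical labels, and no numbers beyond those in the list are used.

```python
from typing import Dict, Tuple


def gains(baseline: float, proposed: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    pp = round(proposed - baseline, 1)
    rel = round(100.0 * (proposed - baseline) / baseline, 1)
    return pp, rel


# (baseline success %, PRO-CUA success %) as reported in the list above
REPORTED: Dict[str, Tuple[float, float]] = {
    "4B WebVoyager vs FBC": (29.7, 42.4),
    "4B WebVoyager vs rule-based RL": (34.7, 42.4),
    "4B Mind2Web-Live vs FBC": (26.4, 34.7),
    "4B Mind2Web-Live vs rule-based RL": (27.8, 34.7),
    "8B WebVoyager vs rule-based RL": (33.8, 43.2),
    "8B Mind2Web-Live vs rule-based RL": (25.0, 30.6),
}

for name, (base, ours) in REPORTED.items():
    pp, rel = gains(base, ours)
    print(f"{name}: +{pp} points ({rel}% relative)")
```

Reading the headline gains as percentage points avoids confusing them with relative improvements, which are considerably larger for the same pairs of numbers.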

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.29119.

Fig 1: Overview of the PRO-CUA pipeline. PRO-CUA alternates between two stages across multiple training
Fig 2: Process Reward Model (PRM) grading
Fig 3 (page 2).
Fig 4 (page 2).
Fig 5 (page 2).
Fig 6 (page 2).
Fig 7 (page 2).
Fig 8 (page 2).

Limitations

Focus on web-based computer use tasks; effectiveness on desktop or mobile app GUIs not evaluated and may differ.
Policy architecture limited to recent snapshot (w=1) visual context, lacking explicit long-term memory or retrieval components.
PRM rewards are binary and not perfectly calibrated; noisy signal tolerated but quality and calibration could impact learning.
Evaluation depends on GPT-5 automatic judge which itself could have evaluation bias or errors.
Exclusion of web domains with anti-bot mechanisms limits test diversity and generalization claims.
The group size G, i.e. the number of candidate actions sampled per state for GRPO, is not explicitly stated.

Open questions / follow-ons

How would PRO-CUA perform on desktop or mobile app GUIs with different interface conventions and interaction patterns?
Can integrating long-term memory, retrieval, or enhanced multimodal context improve agent performance and reward model accuracy?
How sensitive is PRO-CUA to the quality and calibration of the Process Reward Model, and how can PRMs be further improved for CUA tasks?
What are the limits of scalability of PRO-CUA in terms of longer horizon tasks or more complex workflows?

Why it matters for bot defense

Bot-defense and CAPTCHA engineers building automated-interaction detectors can read PRO-CUA as a case study in handling distribution shift and sparse rewards when training agents that operate on live websites. Its combination of on-policy state collection and learned step-level rewards shows how a training pipeline can capture agent failure modes and improve error recovery without perfect expert demonstrations, which could inform anomaly detection and bot behavioral modeling.

Although PRO-CUA targets agent training rather than bot detection, its decoupling of environment interaction from optimization and its use of flexible multimodal signals for credit assignment may inspire defensive designs for distinguishing human from automated agents. The process reward model also highlights the value of multimodal, context-aware evaluation of whether an action is rational and advances a task, an angle that could be used to refine CAPTCHA challenges or adaptive bot verification mechanisms.

Cite

@article{arxiv2605_29119,
  title={PRO-CUA: Process-Reward Optimization for Computer Use Agents},
  author={Yifei He and Rui Yang and Hao Bai and Tong Zhang and Han Zhao},
  journal={arXiv preprint arXiv:2605.29119},
  year={2026},
  url={https://arxiv.org/abs/2605.29119}
}
