APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Source: arXiv:2606.12366 · Published 2026-06-10 · By Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang

TL;DR

This paper addresses the problem of poor out-of-distribution (OOD) language instruction generalization in Vision-Language-Action (VLA) models, particularly those using continuous action experts trained from scratch alongside pretrained Vision-Language Models (VLMs). The authors identify a fundamental cause: structural imbalance in VLA data where visual and action diversity far exceed language diversity, causing policies trained end-to-end to learn visual shortcuts and corrupt the underlying VLM language representations. The key novelty is a Bayesian policy factorization into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, enabling a two-stage training method called Action expert PreTraining (APT). Stage 1 pretrains the action expert on balanced vision-action pairs alone with a frozen VLM, isolating it from brittle language signals. Stage 2 injects language conditioning via a gated fusion mechanism that preserves the pretrained visuomotor priors while aligning actions to instructions. Applied to mainstream continuous-action VLA architectures, APT significantly improves OOD instruction following and compositional task generalization in both simulation and real robot experiments. Notably, APT enables joint finetuning of VLM and action expert without harming language understanding, outperforming prior gradient-stopping or co-training approaches.

Key findings

  • APT raises success rates on the LIBERO-PRO benchmark from near 0% (OpenVLA, π0) to up to 62% on position perturbation and 48% on task perturbation, outperforming baselines like π0.5 and LangForce (Table 1).
  • On rigid object pick-place tasks, APT variants achieve up to 98% success on seen objects and 84% on unseen objects, compared to 84%/70% for π0.5 and 42%/30% for π0 (Table 2).
  • APT outperforms π0.5 by a wide margin in real-world clutter pick-place OOD settings, e.g., 25/30 vs 18/30 in SO and 7/10 vs 4/10 in UO settings (Table 3).
  • APT maintains stable language following on compositional task chaining (multi-task concatenated prompts), where π0.5 collapses to executing only the first task (Fig. 6, Fig. 7c).
  • Ablations show that two-stage training and gated-fusion language injection outperform single-stage training and token-insertion methods, with the largest gains on unseen objects and environments (Fig. 4, 5).
  • Jointly finetuning VLM with a pretrained action expert yields better language generalization than stopping gradients (contrary to prior work) (Table 2).
  • APT’s action expert pretraining improves generalization across different architectures (π-style and GR00T-style), showing broad applicability (Fig. 4).
  • Training the VA prior alone on vision-action pairs reduces noisy gradients and prevents visual shortcut learning, confirmed experimentally and supported by Bayesian factorization analysis.

Threat model

The paper does not explicitly define an adversarial threat model; rather, it addresses challenges arising from model optimization under structurally imbalanced training data causing shortcut learning and failure to ground language instructions. The 'adversary' can be considered the inherent dataset imbalance and gradient noise that corrupt pretrained language understanding, but active or malicious attackers are not considered.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is implicit—issues addressed are not caused by active attacks but by model shortcut learning under data imbalance. No explicit adversarial threat or capability is modeled. The focus is on improving generalization to OOD language instructions that differ lexically, compositionally, or by unseen objects/environmental layouts.

  2. Data: Training uses existing VLA datasets containing triplets of (vision, language, continuous actions), e.g., LIBERO, LIBERO-PRO, and rigid pick-place benchmarks, plus real robot demonstrations. The key insight is that language annotations are sparse relative to vision-action pairs (multiple vision-action frames share one instruction), causing structural imbalance. Dataset splits separate seen/unseen object and environment sets for OOD evaluation. Additional large-scale pretraining datasets amplify this imbalance but provide diverse visuomotor priors.

  3. Architecture/Algorithm: Following a Bayesian formulation, the VLA policy π(a|v, ℓ) is factored as proportional to the product of a language-agnostic VA prior π_p(a|v) and a language-conditioned likelihood L(ℓ|v,a). The action expert is a Transformer-based diffusion model. Visual and language tokens are encoded by a frozen Qwen3-VL VLM backbone. The action expert processes action history, proprioceptive state, and noisy future action tokens using multimodal causal self-attention. A novel layer-wise gated fusion mechanism injects intermediate VLM features into each attention layer of the action expert, with learnable scalar gates controlling how much the vision-language features influence the output.

The two-stage training exploits this model: Stage 1 trains the VA prior on vision-action pairs only with language tokens masked, activating half the attention layers to learn visuomotor priors without language bias. Stage 2 adds language tokens and interleaved attention layers, jointly finetuning all layers to align actions with instructions, preserving the prior. This avoids shortcut gradients harming the VLM and improves language grounding.
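The factorization described above can be written out explicitly. The block below is a sketch using the summary's symbols (π, π_p, L); the normalizer and the log form are standard consequences of Bayes' rule and are added here for illustration, so the paper's exact notation may differ.

```latex
% Bayes' rule applied to the action a, given vision v and instruction \ell
\pi(a \mid v, \ell)
  = \frac{L(\ell \mid v, a)\,\pi_p(a \mid v)}{p(\ell \mid v)}
  \;\propto\; \pi_p(a \mid v)\, L(\ell \mid v, a),
\qquad
p(\ell \mid v) = \int L(\ell \mid v, a)\,\pi_p(a \mid v)\,\mathrm{d}a .

% Log form: the prior and the likelihood contribute additively
\log \pi(a \mid v, \ell)
  = \log \pi_p(a \mid v) + \log L(\ell \mid v, a) - \log p(\ell \mid v).
```

Read this way, Stage 1 fits the first term without ever seeing ℓ, and Stage 2 only has to learn how the instruction reweights actions that the prior already considers plausible.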

  4. Training Regime: Stage 1 pretrains the action expert with the VLM frozen; Stage 2 finetuning optionally lets gradients flow into the VLM, allowing joint optimization. Training uses a standard diffusion loss with denoising steps conditioned on visual and language input. The main text does not give exact epochs or batch sizes; models are pretrained on large-scale datasets and then finetuned on the benchmarks and real-world data. Seeds and hardware details are not specified.

  5. Evaluation Protocol: Evaluations occur on LIBERO-PRO (with position and task perturbations), a rigid object pick-place benchmark (seen/unseen objects/environment splits), and multiple real robot tabletop manipulation tasks, including cluttered scenes and compositional instructions (task coaching and chaining). Metrics are success rates on held-out OOD instructions and compositional tasks. Ablations examine single vs two-stage training, language injection methods, gradient stopping (knowledge insulation), and architecture generality. Results consistently show substantial generalization improvements. No formal statistical tests are reported.

  6. Reproducibility: The paper links to a public project page containing model and code artifacts. Some datasets are standard in the field (e.g., LIBERO), but full datasets and pretrained weights may not be fully public. The method builds on the publicly available Qwen3-VL VLM and common diffusion architectures, enabling partial reproducibility. However, complete replication requires robot hardware and large-scale VLA training data.

Concrete example end-to-end: Given a robot manipulation trajectory with multiple vision-action frames paired to a single instruction, Stage 1 trains the action expert conditioned on visual observations alone using frozen VLM features to learn plausible visuomotor mappings. After convergence, Stage 2 integrates language tokens via gated fusion and expands the action expert network, jointly optimizing to condition the previously learned prior on the instruction, thereby aligning continuous action outputs to the intended task while preserving robust visuomotor control.
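The two-stage schedule in this example can be sketched as a small configuration helper. This is an illustrative sketch, not the authors' code: `StagePlan`, `make_plan`, and `build_batch` are hypothetical names, and the choice of even-indexed layers as the Stage 1 half is an assumption (the summary only says that half the attention layers are active in Stage 1 and that interleaved layers are added in Stage 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    use_language: bool         # are instruction tokens visible to the action expert?
    active_layers: tuple       # indices of action-expert attention layers in use
    train_vlm: bool            # do gradients flow into the VLM backbone?
    train_action_expert: bool  # is the action expert being optimized?


def make_plan(stage, num_layers, finetune_vlm=False):
    """Return which parts of the model are active and trainable in each stage."""
    if stage == 1:
        # VA prior: language masked, VLM frozen, half of the layers active.
        # ASSUMPTION: the active half is the even-indexed layers.
        return StagePlan(False, tuple(range(0, num_layers, 2)), False, True)
    if stage == 2:
        # Language-conditioned finetuning: all layers active and trained jointly;
        # unfreezing the VLM is optional.
        return StagePlan(True, tuple(range(num_layers)), finetune_vlm, True)
    raise ValueError(f"unknown stage: {stage}")


def build_batch(frame, plan):
    """Drop the instruction from a (vision, language, action) frame when masked."""
    return {
        "vision": frame["vision"],
        "action": frame["action"],
        "language": frame["language"] if plan.use_language else None,
    }


frame = {"vision": "rgb_frame_0", "language": "pick up the mug", "action": [0.1, -0.2]}
stage1 = make_plan(1, num_layers=8)
stage2 = make_plan(2, num_layers=8, finetune_vlm=True)
print(build_batch(frame, stage1))
print(build_batch(frame, stage2))
```

The point of the split is visible in `build_batch`: the same trajectory frame feeds both stages, but the instruction only reaches the action expert once the visuomotor prior has been trained.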

Technical innovations

  • Bayesian policy factorization separating VLA policy into a language-agnostic vision-action prior and a language-conditioned likelihood, facilitating robust action expert pretraining.
  • Two-stage training procedure with Stage 1 pretraining of the action expert on vision-action pairs alone (with frozen VLM), followed by Stage 2 language-conditioned finetuning integrating VLM features via gated fusion.
  • Novel layer-wise gated fusion mechanism injecting intermediate VLM features into the action expert's self-attention layers with learnable gates, preserving visuomotor priors during language conditioning.
  • Demonstration that joint finetuning of VLM and action expert after proper action expert pretraining outperforms prior gradient-stopping approaches to preserve language understanding.
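To make the gated fusion idea concrete, here is a minimal pure-Python sketch of a scalar-gated residual injection. It rests on assumptions not stated in the summary: the gate is squashed with tanh, it is initialized at zero, and the VLM features enter through single-head dot-product attention. All function names are hypothetical, and the paper's actual layer may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head dot-product attention: each query row attends over context rows."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ci for qi, ci in zip(q, c)) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def gated_fusion(action_tokens, vlm_features, gate):
    """Residual injection h + tanh(gate) * Attn(h, vlm_features).

    ASSUMPTION: tanh squashing of a scalar gate; with gate = 0 the layer is the
    identity, so the pretrained visuomotor prior is untouched at the start.
    """
    attended = cross_attend(action_tokens, vlm_features)
    g = math.tanh(gate)
    return [
        [h + g * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(action_tokens, attended)
    ]


action_tokens = [[0.5, -1.0], [1.5, 0.25]]
vlm_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

closed = gated_fusion(action_tokens, vlm_features, gate=0.0)  # gate shut
opened = gated_fusion(action_tokens, vlm_features, gate=1.0)  # gate partly open
print(closed == action_tokens, opened != action_tokens)
```

Under these assumptions, a gate that starts closed means Stage 2 begins exactly at the Stage 1 solution and admits vision-language features only as far as training opens the gate.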

Datasets

  • LIBERO-PRO — size unspecified — public research benchmark with perturbations for measuring instruction generalization
  • Rigid Object Pick-Place Benchmark — size unspecified — standard simulation dataset with object/environment splits
  • Real-world tabletop manipulation demonstrations — approx. 30 demos per task — collected by authors

Baselines vs proposed

  • OpenVLA: success on LIBERO-PRO perturbations = 0% vs APT: up to 62% (Table 1)
  • π0: success rate = 42% on Pick-Place SO vs APT: 88% without VLM finetuning, 96% with (Table 2)
  • π0.5 (Knowledge Insulation applied): 84% SO success vs APT (KI+2-Stage+Ft VLM): 98% (Table 2)
  • LangForce: up to 48% success on LIBERO-PRO Task perturbation vs APT: 48% Pos and 62% Task (Table 1)
  • CaP-X: below 27% average vs APT(Ft VLM): 40% average on compositional tasks (Table 1)
  • On Real-World Clutter Pick-Place: π0.5 = 18/30 SO successes vs APT = 25/30 (Table 3)
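The real-world results are reported as raw counts, which are easier to compare as rates. The short sketch below converts the Table 3 counts quoted in this summary (25/30 vs 18/30 in SO, 7/10 vs 4/10 in UO) into percentages and percentage-point gaps; the dictionary layout and function names are illustrative only.

```python
from fractions import Fraction

# (successes, trials) as quoted in this summary from Table 3.
REAL_WORLD = {
    "SO": {"APT": (25, 30), "pi0.5": (18, 30)},
    "UO": {"APT": (7, 10), "pi0.5": (4, 10)},
}


def success_rate(successes, trials):
    """Exact success rate as a fraction of trials."""
    return Fraction(successes, trials)


def gap_in_points(setting, ours="APT", baseline="pi0.5"):
    """Percentage-point difference between two methods in one setting."""
    rows = REAL_WORLD[setting]
    return 100 * (success_rate(*rows[ours]) - success_rate(*rows[baseline]))


for setting, rows in REAL_WORLD.items():
    for method, (s, n) in rows.items():
        print(f"{setting} {method}: {s}/{n} = {float(100 * success_rate(s, n)):.1f}%")
    print(f"{setting} gap: {float(gap_in_points(setting)):.1f} points")
```

With only 30 and 10 trials per setting, each trial moves the rate by roughly 3 and 10 percentage points respectively, so the gaps are best read as indicative rather than precise.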

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12366.

Fig 1

Fig 1: Action expert pretraining (APT) enables

Fig 2

Fig 2: Overview of APT. In Stage 1, the action expert is pretrained as a VA prior conditioned

Fig 6

Fig 6: Results on

Fig 7

Fig 7: Real-world cases. (a) pick-place task, (b) clutter pick-place

Fig 4

Fig 4: Action expert pretraining applies to diverse architectures.

Fig 5

Fig 5: Ablation on large-scale pretraining and

Fig 8

Fig 8 (page 8).

Limitations

  • Does not explicitly model long-horizon memory or multi-step task progress tracking, limiting performance on very long compositional instruction sequences.
  • Evaluations focus mainly on tabletop manipulation; extension to mobile manipulation or locomotion domains is not explored.
  • Exact training hyperparameters, seeds, and hardware details are not fully disclosed, potentially impeding exact reproducibility.
  • Limited evaluation of robustness under strong distribution shifts beyond object/environment variations, e.g., extreme visual domain changes.
  • Potential failure modes include premature task switching or confusion among visually similar distractors in cluttered scenes, as observed in real robot tests.

Open questions / follow-ons

  • How can long-term memory or recurrence mechanisms be integrated with APT to further enhance compositional multi-step instruction following?
  • Can APT be extended or adapted to mobile manipulation or locomotion tasks involving more complex state/action spaces?
  • What is the impact of drastically different visual domains (e.g., sim-to-real gaps or novel lighting conditions) on the preservation of VA priors?
  • How does the choice of VLM backbone (beyond Qwen3-VL) affect the effectiveness of gated fusion and instruction generalization?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, APT's insights into multimodal policy factorization and pretraining to mitigate shortcut learning can be instructive. The core challenge—imbalance between language and visual/action data causing degraded language grounding—parallels bot detection strategies where attackers may exploit modal shortcuts or misleading cues. The Bayesian factorization and staged pretraining approach could inspire methods to better fuse and protect language-based signals in interactive bot challenges, preventing adversarial bypass through purely visual or action shortcuts. The gated fusion mechanism ensuring preservation of pretrained semantic representations while enabling conditioning may inform improved CAPTCHA designs that robustly integrate text and visual features without corrupting their joint interpretation under adversarial conditions.

Cite

bibtex
@article{arxiv2606_12366,
  title={APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies},
  author={Kechun Xu and Zhenjie Zhu and Anzhe Chen and Rong Xiong and Yue Wang},
  journal={arXiv preprint arXiv:2606.12366},
  year={2026},
  url={https://arxiv.org/abs/2606.12366}
}

Articles are CC BY 4.0 — feel free to quote with attribution