Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Source: arXiv:2606.02274 · Published 2026-06-01 · By Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang et al.

TL;DR

Dexterity-BEV (Dex-BEV) targets two limitations of end-to-end robotic manipulation policies built on 2D pretrained vision-language foundation models: the lack of explicit 3D spatial understanding, and spatial-temporal misalignment across robot embodiments, camera setups, and datasets. The authors introduce aligned vertex maps and vertex spectrums, which lift 2D visual inputs into a shared 3D coordinate system using camera calibration and, when available, depth images. They further designate a canonical Bird's-Eye-View (BEV) reference frame and synthesize BEV images from multi-view RGB-D data, which promotes viewpoint invariance and robustness to camera pose variations.

The authors also develop a data processing pipeline that spatially and temporally aligns trajectories and observations from heterogeneous sources, including different robot types and tele-operators. Experimentally, Dex-BEV is evaluated on two simulation benchmarks (LIBERO and RoboTwin 2.0) and on physical robot platforms with diverse manipulation tasks. It outperforms π0 and X-VLA on RoboTwin 2.0 and on the real-robot tasks, roughly matches X-VLA on the official LIBERO benchmark, and remains robust to large variations in camera viewpoint and workspace layout, which supports the value of explicit 3D alignment and BEV representation for scalable dexterous manipulation.

Key findings

On the official LIBERO benchmark, Dex-BEV achieves an average success rate of 97.8%, comparable to X-VLA (98.1%) and better than π0 (94.2%) and the 2D ablation (92.8%).
On the dual-arm RoboTwin 2.0 benchmark, Dex-BEV obtains 76.0% average success, surpassing π0 (46.4%), X-VLA (70.0%), and the 2D ablation (64.8%), demonstrating cross-embodiment robustness.
On a modified LIBERO with randomized camera viewpoints and scene layouts, Dex-BEV maintains ~90% success, whereas X-VLA and the 2D ablation fall below 10%.
Real-world robot experiments over five complex long-horizon tasks show Dex-BEV success rates between 76.7% and 96.7%; going by the per-task figures listed under Baselines vs proposed, it leads X-VLA by 13.3 to 23.3 percentage points and π0 by 26.6 to 53.3.
The temporal alignment method normalizes trajectory execution speed across robots and tele-operators, improving policy consistency for quasi-static manipulation tasks.
3D spatial alignment integrates camera intrinsics/extrinsics and enforces a unified TCP frame convention, enabling robot-agnostic SE(3) action representations.
Ablation removing the 3D input and alignments causes significant performance degradation across both simulation and real-robot tasks.
Synthesized BEV images from multi-view point clouds substantially reduce input viewpoint variance as reflected in consistent pixel-wise object positions.

Threat model

The threat model here is not a security one. Instead of a malicious adversary, it considers natural environmental and system variations that challenge manipulation policies: changes in camera viewpoint, shifts in robot base and scene pose, differing teleoperator behaviors, and heterogeneous robot embodiments. Tampering with robot hardware and malicious sensor miscalibration are out of scope. The challenges are distribution shifts and alignment mismatches that the policy must generalize over.

Methodology — deep read

The goal is a robust end-to-end visuomotor manipulation policy that generalizes across robot embodiments, camera views, and diverse datasets. The perturbations considered are camera viewpoint changes, robot base pose shifts, and operator variations, not malicious attacks on the model.

The data comprise multiple large-scale datasets (LIBERO, RoboTwin 2.0, RoboMind, Agibot, and others) with RGB videos, depth when available, and action trajectories from various robotic platforms and teleoperation systems. They are unified spatially and temporally through a custom pipeline that combines manual GUI-aided calibration, ICP registration, and depth synthesis by vision models where depth channels are missing. Robot URDF models are registered to enforce a unified TCP frame, so that forward kinematics is consistent across embodiments.

The core algorithmic novelty is elevating 2D visual inputs into a shared 3D reference frame by introducing aligned vertex maps and vertex spectrums. Pixel-wise depth values are back-projected with intrinsic and extrinsic camera matrices to 3D vertex maps. These are transformed into a canonical Bird's-Eye-View (BEV) frame, providing a geometric structure invariant to camera pose variation. For RGB-only views, the vertex spectrum encodes depth hypotheses per pixel to generate volumetric positional embeddings.
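The back-projection step described above can be sketched with a standard pinhole camera model. This is a minimal illustration and not the authors' implementation: the function names, the metric-depth assumption, and the representation of the vertex spectrum as one vertex map per candidate depth are all assumptions made here.

```python
import numpy as np

def backproject_to_vertex_map(depth, K, T_ref_cam):
    """Lift an (H, W) depth image in meters to an (H, W, 3) vertex map.

    K is the 3x3 pinhole intrinsic matrix. T_ref_cam is the 4x4 rigid
    transform that maps camera-frame points into the shared reference frame.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Invert the projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth], axis=-1)      # (H, W, 3), camera frame
    R, t = T_ref_cam[:3, :3], T_ref_cam[:3, 3]
    return pts_cam @ R.T + t                        # (H, W, 3), reference frame

def vertex_spectrum(K, T_ref_cam, shape, depth_hypotheses):
    """Assumed form of a vertex spectrum: one vertex map per depth hypothesis.

    Returns an array of shape (D, H, W, 3) for D candidate depths.
    """
    return np.stack([
        backproject_to_vertex_map(np.full(shape, float(d)), K, T_ref_cam)
        for d in depth_hypotheses
    ])
```

Because every pixel ends up with coordinates in the same reference frame, two cameras looking at the same surface point produce the same 3D value, which is what makes the representation insensitive to where the cameras are mounted.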

The architecture uses a pretrained 2D Vision-Language Model (VLM) backbone enhanced by adding 3D positional features derived from aligned vertex maps/spectrums and synthetic BEV images constructed by orthographic projection of fused multi-camera colored point clouds. Visual and language tokens are fused to form context embeddings. An action decoder trained by flow matching predicts SE(3) pose chunks expressed in the aligned BEV coordinate frame.
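The orthographic BEV synthesis mentioned above can be illustrated as a simple top-down rasterization of a fused colored point cloud. The sketch below is hypothetical: the function name `render_bev`, the z-up axis convention, the grid parameters, and the "highest point wins" rule are assumptions, and the paper's renderer may differ.

```python
import numpy as np

def render_bev(points, colors, x_range, y_range, resolution):
    """Rasterize a colored point cloud into a top-down orthographic image.

    points: (N, 3) in the canonical frame, z pointing up (assumed).
    colors: (N, 3) RGB values in 0..255.
    x_range, y_range: (min, max) extents in meters; resolution: meters/pixel.
    Row 0 corresponds to y_min. The highest point in each cell sets its color.
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=np.uint8)
    w = int(round((x_range[1] - x_range[0]) / resolution))
    h = int(round((y_range[1] - y_range[0]) / resolution))
    col = np.floor((points[:, 0] - x_range[0]) / resolution).astype(int)
    row = np.floor((points[:, 1] - y_range[0]) / resolution).astype(int)
    keep = (col >= 0) & (col < w) & (row >= 0) & (row < h)
    image = np.zeros((h * w, 3), dtype=np.uint8)
    if not keep.any():
        return image.reshape(h, w, 3)
    flat = (row * w + col)[keep]
    z, rgb = points[keep, 2], colors[keep]
    # Sort by pixel first, then by height, so the last entry of each pixel
    # group is the topmost point.
    order = np.lexsort((z, flat))
    flat, rgb = flat[order], rgb[order]
    top = np.append(flat[1:] != flat[:-1], True)
    image[flat[top]] = rgb[top]
    return image.reshape(h, w, 3)
```

Since the image is indexed by canonical x and y only, an object at a fixed workspace position lands on the same pixels no matter which cameras observed it.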

Training runs for multiple epochs on simulated datasets in batches; hyperparameters are reportedly tuned with care, but few details are given, and neither random seed control nor hardware is mentioned. Evaluation uses held-out trajectories and compares against strong baselines (π0, X-VLA) across multiple tasks, with metrics such as success rate. Ablations remove the 3D inputs and alignment to quantify their impact. Distribution-shift tests randomize camera and scene poses. Real-world experiments test policies on distinct dual-arm robots over long-horizon tasks.
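Since training details are sparse, the flow-matching objective used for the action decoder is worth spelling out. The sketch below shows a common linear-path formulation of conditional flow matching. It is an assumption that Dex-BEV uses this exact variant: the time convention, the noise distribution, and all function names are illustrative, and the conditioning on visual and language context is omitted.

```python
import numpy as np

def flow_matching_pair(actions, rng):
    """Build one training example for linear-path conditional flow matching.

    actions: (B, D) flattened action chunks (for example, stacked SE(3) poses).
    Returns the interpolated input x_t, the time t, and the regression target.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions      # straight path from noise to data
    target_velocity = actions - noise          # time derivative of that path
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = np.array(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

At inference time a trained network would take the place of `velocity_fn`, starting from Gaussian noise and integrating toward an action chunk.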

The authors release code, pretrained checkpoints, and data processing tools to facilitate reproducibility. Some datasets used are public benchmarks, others are internal or proprietary. Overall, the methodology is a systematic integration of calibrated 3D spatial alignment, BEV representation, temporal trajectory normalization, and flow matching policy learning for improved generalization across complex manipulation domains.

A concrete example is LIBERO, where multi-view RGB-D frames from a single-arm Franka robot are transformed into BEV vertex maps. Policies trained on this input reach a 97.8% success rate and stay at about 90% under perturbed camera views, whereas X-VLA and the 2D ablation drop below 10%. The output SE(3) poses are always expressed in the canonical BEV frame, which facilitates cross-embodiment transfer.
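To make the idea of a shared action frame concrete, the sketch below re-expresses a TCP pose from a robot's base frame in a canonical frame using homogeneous transforms. The helper names and the optional `T_tcp_fix` offset (a constant transform from a robot's native TCP frame to a shared TCP convention) are hypothetical and not taken from the paper.

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_canonical(T_base_tcp, T_bev_base, T_tcp_fix=None):
    """Express a base-frame TCP pose in the canonical (BEV) frame.

    T_bev_base: pose of the robot base in the canonical frame.
    T_tcp_fix: assumed constant offset to a shared TCP convention
    (identity when the robot already follows it).
    """
    if T_tcp_fix is None:
        T_tcp_fix = np.eye(4)
    return T_bev_base @ T_base_tcp @ T_tcp_fix

# Two robots mounted differently but reaching the same workspace pose.
robot_a = to_canonical(make_pose(np.eye(3), [0.4, 0.0, 0.2]), np.eye(4))
robot_b = to_canonical(make_pose(rot_z(np.pi), [0.6, 0.0, 0.2]),
                       make_pose(rot_z(np.pi), [1.0, 0.0, 0.0]))
print(np.allclose(robot_a, robot_b))
```

Both robots yield the same canonical pose even though their base-frame commands differ, which is the property that lets one policy output serve several embodiments.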

Technical innovations

Introduction of pixel-wise aligned vertex maps and vertex spectrums to lift 2D visual inputs into a unified 3D BEV-aligned coordinate frame leveraging camera calibration and optional depth data.
Synthesis of BEV images by aggregating multi-camera RGB-D data into a top-down orthographic projection that is invariant to camera pose variations.
Unification of observation-action representations by expressing robot proprioceptive measurements and output actions as SE(3) poses in the BEV frame across diverse robot embodiments.
A comprehensive data processing pipeline combining manual 3D GUI alignment, rule-based algorithms, and vision foundation models to spatially and temporally align heterogeneous datasets and trajectories.
Temporal alignment of manipulation trajectories via normalized end-effector speed to reduce variations due to human teleoperation and robot platform differences.
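One plausible reading of the speed-normalized temporal alignment in the last item is arc-length resampling: waypoints are re-spaced so that each control step covers the same end-effector distance, regardless of how fast the original operator moved. The sketch below is an assumption made for illustration, not the authors' procedure. It handles positions only, and orientations and gripper states would need separate interpolation.

```python
import numpy as np

def resample_constant_speed(positions, step):
    """Resample an (N, 3) end-effector path at equal arc-length intervals.

    step is the distance in meters covered per output waypoint. Pauses in the
    input (repeated positions) contribute no arc length and collapse.
    """
    positions = np.asarray(positions, dtype=float)
    seg = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    n = max(int(np.floor(s[-1] / step + 1e-9)), 1)       # number of output steps
    targets = np.minimum(np.linspace(0.0, n * step, n + 1), s[-1])
    return np.stack(
        [np.interp(targets, s, positions[:, k]) for k in range(3)], axis=1
    )

# A fast and a slow demonstration of the same straight-line motion.
fast = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]]
slow = [[k * 0.125, 0.0, 0.0] for k in range(9)]
print(np.allclose(resample_constant_speed(fast, 0.25),
                  resample_constant_speed(slow, 0.25)))
```

After resampling, the two demonstrations are identical, so the policy no longer has to model operator-specific pacing. This only makes sense for quasi-static tasks, where the timing of a motion carries little information.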

Datasets

LIBERO — thousands of manipulation demonstrations — public benchmark
RoboTwin 2.0 — large-scale dual-arm manipulation dataset — public/internal hybrid
Agibot-Alpha/Beta — internal dataset of bimanual manipulation — private
RoboMind 2.0 — internal real robot dataset — private
Droid — real-world manipulation dataset with depth synthesized via FoundationStereo — private/internal
DexForce W1, W1*, A1 — proprietary robot datasets with teleoperated trajectories — internal

Baselines vs proposed

π0 [14]: LIBERO average success = 94.2% vs Dex-BEV = 97.8%
X-VLA [39]: LIBERO average success = 98.1% vs Dex-BEV = 97.8%
2D Ablation (no 3D inputs/alignment): LIBERO average success = 92.8% vs Dex-BEV = 97.8%
π0 [14]: RoboTwin average success = 46.4% vs Dex-BEV = 76.0%
X-VLA [39]: RoboTwin average success = 70.0% vs Dex-BEV = 76.0%
2D Ablation: RoboTwin average success = 64.8% vs Dex-BEV = 76.0%
X-VLA (official checkpoint): modified LIBERO (camera and scene perturbation) success < 10% vs Dex-BEV ≈ 90%
Real-world Task: Fold Mailer Box: π0 43.3%, X-VLA 56.7%, Dex-BEV 76.7%
Real-world Task: Fold Cloth (Agilex): π0 66.7%, X-VLA 80.0%, Dex-BEV 93.3%
Real-world Task: Scoop Popcorn: π0 60.0%, X-VLA 70.0%, Dex-BEV 86.7%
Real-world Task: Handover Book: π0 40.0%, X-VLA 70.0%, Dex-BEV 93.3%
Real-world Task: Fold Cloth (A1): π0 63.3%, X-VLA 76.7%, Dex-BEV 96.7%
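The per-task margins on the real-world tasks can be computed directly from the success rates listed above. The short script below only restates those numbers; the variable and function names are illustrative.

```python
# Success rates (%) as listed above, in the order (π0, X-VLA, Dex-BEV).
results = {
    "Fold Mailer Box": (43.3, 56.7, 76.7),
    "Fold Cloth (Agilex)": (66.7, 80.0, 93.3),
    "Scoop Popcorn": (60.0, 70.0, 86.7),
    "Handover Book": (40.0, 70.0, 93.3),
    "Fold Cloth (A1)": (63.3, 76.7, 96.7),
}

def margins(table):
    """Return {task: (gain over π0, gain over X-VLA)} in percentage points."""
    return {
        task: (round(dex - pi0, 1), round(dex - xvla, 1))
        for task, (pi0, xvla, dex) in table.items()
    }

def mean(values):
    values = list(values)
    return sum(values) / len(values)

per_task = margins(results)
for task, (over_pi0, over_xvla) in per_task.items():
    print(f"{task}: +{over_pi0} over π0, +{over_xvla} over X-VLA")

# Average success rate per method across the five tasks.
averages = [round(mean(r[i] for r in results.values()), 1) for i in range(3)]
print("averages (π0, X-VLA, Dex-BEV):", averages)
```

The largest gap is on Handover Book and the smallest is against X-VLA on Fold Cloth (Agilex).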

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.02274.

Fig 1: We introduce Dexterity-BEV (Dex-BEV), a series of technical and systematic contributions [...]

Figs 2-8 (page 2).

Limitations

Heavy reliance on accurate camera calibration (intrinsic and extrinsic parameters) restricts deployment in unstructured or dynamic environments where calibrations are unavailable or imprecise.
Depth images are optional but missing depth must be synthesized or simulated, which may reduce input fidelity and policy accuracy in RGB-only setups.
Temporal alignment assumes quasi-static manipulation tasks; tasks requiring highly dynamic timing or physics (e.g., throwing) may break this assumption and degrade performance.
Current work only applies the BEV aligned framework to Vision-Language-Action (VLA) models; extension to World-Action Models (WAMs) with explicit future 3D state prediction is deferred.
Evaluation primarily focuses on representative simulation benchmarks and select real robots; wider generalization to more diverse robot morphologies and unstructured real-world domains remains to be demonstrated.
While a comprehensive pipeline is proposed for spatial and temporal alignment, full automation and universal reliability of these procedures may be challenging, limiting ease of adoption.

Open questions / follow-ons

Can calibration-free or self-supervised approaches leverage foundation models to enable BEV lifting without explicit camera parameters for unstructured environments?
How can temporal alignment generalize to highly dynamic or contact-rich manipulation tasks beyond quasi-static assumptions?
Can this unified 3D alignment framework be extended to incorporate explicit future state predictions in World-Action Models (WAMs)?
What are the scalability limits of BEV representations and alignment pipelines when integrating more heterogeneous robot morphologies and large-scale multi-modal datasets?

Why it matters for bot defense

For bot-defense engineers working on CAPTCHA or bot-detection systems that involve perception and action, Dexterity-BEV’s systematic approach to spatial-temporal alignment and 3D input representation offers an instructive case study. It shows how grounding noisy, viewpoint-varying multi-camera visual inputs into a unified 3D coordinate system can greatly improve model robustness and generalization. Similarly, CAPTCHAs that seek to robustly verify human interactions or distinguish bots might benefit from integrating 3D awareness or geometric consistency checks rather than relying purely on raw 2D pixel data.

Furthermore, the pipeline’s temporal alignment techniques could inspire analogous normalization in behavioural inputs or event trajectories in user interaction patterns. While robotics manipulation differs fundamentally from CAPTCHA bot defense, the core insights about reducing input-output modality misalignments, encoding viewpoint-invariant spatial information, and leveraging canonical reference frames can parallel strategies for ensuring robust, generalizable challenge-response protocols in automated adversarial environments.

Cite

@article{arxiv2606_02274,
  title={Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning},
  author={Huayi Zhou and Wei Gao and Dekun Lu and Ruiji Liu and Zhanqi Zhang and Ziyang Zhang and Jian Chen and Wenlve Zhou and Sheng Xu and Shumin Li and Kangyi Guo and Shichen Xu and Zixin Huang and Yongyi Su and Kui Jia},
  journal={arXiv preprint arXiv:2606.02274},
  year={2026},
  url={https://arxiv.org/abs/2606.02274}
}
