VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Source: arXiv:2606.12396 · Published 2026-06-10 · By Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

TL;DR

This paper addresses the challenge of grounding vision-language-action (VLA) models for autonomous driving in dense 3D geometry. VLA models combine scene understanding and language reasoning to generate driving trajectories, but existing methods fail to capture the continuous, dense 3D spatial information that safe motion planning requires. Prior approaches fall into three groups: those that rely on sparse 3D perception outputs (boxes, lanes) lacking dense spatial precision, those that inject dense features from frozen 3D foundation models without any objective that enforces geometric grounding, and geometry-only methods that drop language reasoning. The authors propose VLGA, a four-expert Mixture-of-Transformers architecture that adds a dedicated geometry expert supervised by a dense per-pixel pointmap reconstruction loss against LiDAR scans. This geometry expert operates alongside the vision-language, perception, and action experts, providing a parameter-isolated dense spatial modality stream that explicitly reconstructs the 3D world during training. Evaluations on nuScenes (open-loop) and Bench2Drive (closed-loop simulator) show VLGA achieving state-of-the-art results on safety-critical metrics: the lowest L2 displacement and 3-second collision rate among VLAs without ego-state input, and the best closed-loop driving score with improved comfort and success rate. Ablations confirm that dense geometry supervision is needed to keep the geometric stream relevant for action planning. The paper thus establishes explicit dense 3D geometry supervision within a VLA framework as a key step toward operationally grounded, interpretable autonomous driving policies combining language, vision, geometry, and action.

Key findings

VLGA-Large achieves the lowest average L2 displacement on nuScenes without ego status: 0.50 m, outperforming all VLA baselines on 15/16 planning metrics (Table 1).
VLGA-Large has the lowest 3-second collision rate on nuScenes without ego status at 0.18%, reducing collisions by ~2.5–3x compared to other VLA models with similar L2 error.
On Bench2Drive closed-loop driving, VLGA attains state-of-the-art Driving Score of 79.08, +0.71 over prior best UniDriveVLA, with better success rate (52.73%) and comfort at comparable efficiency (Table 2).
Ablation shows adding the geometry expert alone reduces collision rate from 0.169% to 0.149%; adding dense pointmap supervision further reduces it to 0.136%, an 8.7% relative decrease (Table 4).
VLGA improves success rates notably on spatially demanding skills such as Emergency Brake (55.00% vs 50.00%) and Give Way (40.00% vs 30.00%), while matching prior art on Merging and performing comparably on Overtaking and Traffic Sign recognition tasks (Table 3).
VLGA keeps the pretrained vision-language backbone and perception expert frozen and trains only the new geometry expert and the action expert, using a two-stage schedule; this shows that dense geometry can be integrated without disrupting the upstream models.
Dense pointmap supervision is computed from multi-frame LiDAR projections, but the geometry decoder that predicts the reconstructions is discarded at inference, so the learned geometry stream benefits downstream planning without runtime LiDAR.
VLGA uses a four-expert Mixture-of-Transformers with masked joint attention integrating vision-language, perception, geometry, and action modalities, allowing the action expert to attend to dense geometry tokens independently of language parameters.

Threat model

The adversary is the autonomous driving environment presenting complex, safety-critical 3D spatial scenarios where the agent must plan collision-free trajectories from visual and language inputs without access to ego motion ground truth or LiDAR at test time. The model is assumed not to receive privileged sensor data or oracle object detection. It must ground high-level navigation instructions and dense scene geometry robustly for safe operation in crowded, dynamic urban scenarios.

Methodology — deep read

The authors approach the problem of dense 3D geometric grounding in vision-language-action models under a threat model where the autonomous driving agent must infer precise 3D structure from multi-view cameras and language instructions, without any privileged sensor input at test time (e.g., no ego status or LiDAR). The adversary is indirectly modeled as the environment complexity, requiring long-horizon spatial precision for collision-free planning.

Data comes from two established autonomous driving benchmarks: nuScenes for open-loop evaluation (28,130 training frames, 6-camera rig, ego pose removed for the hardest test), and Bench2Drive simulation routes in CARLA for closed-loop testing. Ground truth consists of LiDAR sweeps and trajectory annotations, with LiDAR pointmap projections obtained by accumulating multiple point clouds and projecting into camera pixel patches to create dense per-pixel 3D supervision.
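The projection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes a pinhole camera, points already expressed in the camera frame, and a 16-pixel patch size. The function name `patch_pointmap` and all parameters are hypothetical, and the sketch stores camera-frame coordinates for simplicity, whereas the summary says the targets live in the ego LiDAR frame.

```python
# Hypothetical sketch: build a sparse per-patch pointmap target from LiDAR points.
# Assumes points are already in the camera frame (x right, y down, z forward)
# and a pinhole camera with intrinsics fx, fy, cx, cy. None of these names
# come from the paper.

def patch_pointmap(points_cam, fx, fy, cx, cy, width, height, patch=16):
    """Return {(row, col): (x, y, z)}, keeping the nearest point per patch."""
    target = {}
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # behind the camera: cannot be projected
        u = fx * x / z + cx  # pinhole projection to pixel coordinates
        v = fy * y / z + cy
        if not (0.0 <= u < width and 0.0 <= v < height):
            continue  # lands outside the image
        key = (int(v) // patch, int(u) // patch)
        # Keep the closest return so occluded surfaces do not overwrite visible ones.
        if key not in target or z < target[key][2]:
            target[key] = (x, y, z)
    return target


points = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (1.0, 0.0, -2.0), (100.0, 0.0, 1.0)]
targets = patch_pointmap(points, fx=500.0, fy=500.0, cx=480.0, cy=272.0,
                         width=960, height=544)
print(targets)
```

Patches that receive no LiDAR return simply have no entry, which is why accumulating several sweeps helps densify the supervision.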

VLGA's architecture is based on a Mixture-of-Transformers (MoT) framework with four modality-specialized experts: Understanding (vision-language semantics), Perception (sparse object-level queries, e.g., agent boxes and lanes), Geometry (a new expert for dense 3D spatial tokens), and Action (trajectory decoder). The geometry expert operates on tokens emitted by a pretrained geometry backbone (DVGT-2), which produces a 60x34 token grid per camera at 960x544 resolution, for a total of 12,240 tokens across 6 cameras. A learnable per-patch projector maps the geometry features into the MoT token space.
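The token count follows directly from the stated grid and resolution. The short check below reproduces the arithmetic; the 16-pixel patch size is inferred from 960/60 and 544/34 and is not stated in the summary.

```python
def token_grid(width, height, patch):
    """Return (cols, rows) of a non-overlapping patch grid."""
    if width % patch or height % patch:
        raise ValueError("resolution must be divisible by the patch size")
    return width // patch, height // patch


# Patch size 16 is an inference from the reported 60x34 grid at 960x544.
cols, rows = token_grid(960, 544, 16)
tokens_per_camera = cols * rows
total_tokens = tokens_per_camera * 6  # six-camera rig

print(cols, rows, tokens_per_camera, total_tokens)  # 60 34 2040 12240
```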

The geometry stream is supervised with a dense per-pixel pointmap reconstruction objective: a 5-layer transformer decoder consumes geometry tokens and predicts 3D points in ego LiDAR frame coordinates per patch with uncertainty logits, optimizing a confidence-weighted regression loss akin to Pi3. This direct geometric supervision aligns the geometry tokens explicitly with continuous 3D structure rather than relying on indirect action losses.
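To make the idea of a confidence-weighted regression loss concrete, here is a minimal sketch of one common formulation used for pointmap regression: each point's Euclidean error is scaled by a learned confidence, and a log term discourages the model from driving confidence to its minimum everywhere. The exact form used in the paper may differ; the mapping from logit to confidence, the value of `alpha`, and the function name are all assumptions.

```python
import math


def confidence_weighted_loss(pred, target, logits, alpha=0.2):
    """Mean of conf * ||pred - target|| - alpha * log(conf) over valid points.

    conf = 1 + exp(logit) is one common choice that keeps confidence above 1;
    this mapping and alpha are illustrative assumptions, not taken from the paper.
    """
    total = 0.0
    for p, t, s in zip(pred, target, logits):
        conf = 1.0 + math.exp(s)
        err = math.dist(p, t)  # Euclidean distance between 3D points
        total += conf * err - alpha * math.log(conf)
    return total / len(pred)


target = [(1.0, 2.0, 10.0), (0.0, 0.0, 5.0)]
perfect = confidence_weighted_loss(target, target, logits=[0.0, 0.0])
shifted = confidence_weighted_loss([(1.0, 2.0, 11.0), (0.0, 0.0, 5.0)],
                                   target, logits=[0.0, 0.0])
print(perfect, shifted)
```

Only patches that actually received a LiDAR return would contribute, so in practice the inputs would be filtered by a validity mask first.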

Training proceeds in two stages: (1) a geometry stage, in which the geometry expert, projector, and decoder are trained with the pointmap loss only while the action expert and the other experts stay frozen, warming up the geometry stream; and (2) a joint stage, in which the action expert is unfrozen and trained with combined action and pointmap losses (pointmap weighted at 0.1), enabling co-adaptation without diluting geometric fidelity. The optimizer is AdamW at 5e-5 with batch size 128 and momentum EMA, over 10+30 epochs for nuScenes and 3+7 epochs for Bench2Drive, run on 8 H100 GPUs.
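The schedule can be summarized as a small configuration sketch. The module names, the action-loss weight of 1.0, the stage-one pointmap weight of 1.0, and the reading of "10+30" and "3+7" as stage-one plus stage-two epochs are assumptions made for illustration; only the 0.1 pointmap weight in the joint stage is stated in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset       # hypothetical module names
    action_weight: float
    pointmap_weight: float
    epochs_nuscenes: int
    epochs_bench2drive: int

    def loss(self, action_loss, pointmap_loss):
        """Weighted sum of the two objectives for this stage."""
        return (self.action_weight * action_loss
                + self.pointmap_weight * pointmap_loss)


GEOMETRY_MODULES = frozenset({"geometry_expert", "projector", "pointmap_decoder"})

# Stage 1: pointmap loss only, action expert frozen.
GEOMETRY = Stage("geometry", GEOMETRY_MODULES, 0.0, 1.0, 10, 3)
# Stage 2: action expert unfrozen, pointmap loss down-weighted to 0.1.
JOINT = Stage("joint", GEOMETRY_MODULES | {"action_expert"}, 1.0, 0.1, 30, 7)

for stage in (GEOMETRY, JOINT):
    print(stage.name, sorted(stage.trainable), stage.loss(2.0, 1.0))
```

The vision-language backbone and perception expert appear in neither trainable set, matching the statement that they stay frozen throughout.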

Evaluation uses multiple open-loop planning metrics on nuScenes, including average L2 trajectory error over 1-3 sec horizons and collision rate, under both ST-P3 and UniAD protocols. Closed-loop Bench2Drive uses Driving Score combining success, efficiency, and comfort, plus per-skill success breakdown (Merging, Overtaking, Emergency Brake, Give Way, Traffic Sign). Ablations isolate effects of geometry expert addition and dense supervision.
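As a rough illustration of how L2 is reported at 1, 2, and 3 second horizons, the sketch below computes both aggregation styles that are commonly described in the open-loop planning literature: the error at the horizon waypoint, and the average over all waypoints up to the horizon (the latter is usually associated with the ST-P3 protocol, the former with UniAD). The 2 Hz waypoint rate and the trajectories are assumptions; which convention each table in the paper uses should be checked against the paper itself.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean error between predicted and ground-truth (x, y)."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def l2_at_horizons(errors, hz=2, horizons=(1, 2, 3), cumulative=False):
    """L2 per horizon: value at the horizon step, or mean up to it if cumulative."""
    out = {}
    for h in horizons:
        n = h * hz  # number of waypoints up to this horizon
        out[h] = sum(errors[:n]) / n if cumulative else errors[n - 1]
    return out


# Hypothetical 3-second trajectory at 2 Hz with a growing lateral offset.
gt = [(float(k), 0.0) for k in range(1, 7)]
pred = [(float(k), 0.1 * k) for k in range(1, 7)]

errors = l2_errors(pred, gt)
at_step = l2_at_horizons(errors)
averaged = l2_at_horizons(errors, cumulative=True)
print(at_step, averaged)
```

Because planning error usually grows with time, the cumulative average is lower than the at-horizon value for the same trajectory, so numbers from the two conventions are not directly comparable.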

Reproducibility is partially supported by the use of public datasets (nuScenes, Bench2Drive), pretrained backbone checkpoints from prior work (DVGT-2, UniDriveVLA), and detailed training schedules. However, the paper does not mention any release of code or model weights. The dense pointmap decoder is discarded at inference to avoid runtime overhead.

Example end-to-end: input 6-camera frames and navigation command are encoded by frozen vision-language and perception backbones; pretrained geometry backbone processes multi-view images into dense patch features; these features are projected and integrated into the MoT alongside language and perception tokens; the geometry expert attends jointly; finally, the action expert attends to all streams and predicts a future trajectory. During training, the geometry tokens also pass through a lightweight decoder to reconstruct dense 3D LiDAR point clouds, with the regression loss guiding parameter updates. Subsequently, action loss finetunes weights to align final trajectories with ground truth, all while maintaining dense geometric grounding.

This approach systematically delivers dense pixel-level 3D awareness to end-to-end vision-language policies, improving long-horizon planning safety without sacrificing scene understanding or language reasoning.

Technical innovations

Introducing a dedicated, parameter-isolated geometry expert modality stream within a vision-language-action Mixture-of-Transformers, supervised explicitly with dense per-pixel LiDAR pointmap reconstruction.
Applying a dense per-pixel confidence-weighted regression loss on 3D points in ego LiDAR frame space to directly supervise the geometry stream rather than relying solely on action or sparse box supervision.
Two-stage training schedule that first warms up geometry components using pointmap loss before joint action+geometry optimization to prevent interference and maintain geometric fidelity.
Integration of dense geometry tokens alongside sparse perception and language streams using masked joint attention, allowing the action expert to leverage all modalities for trajectory prediction without conflating language and geometry parameters.
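Masked joint attention over several expert streams can be pictured as a block-structured boolean mask. The sketch below builds such a mask from an expert-level visibility table. The visibility pattern shown is purely hypothetical: the summary only states that the action expert attends to all streams, so the rows for the other experts are placeholders, and `build_joint_mask` is an illustrative name.

```python
def build_joint_mask(lengths, visible):
    """Expand expert-level visibility into a token-level boolean mask.

    lengths: ordered mapping expert -> number of tokens in the joint sequence.
    visible: mapping expert -> set of experts whose tokens it may attend to.
    mask[q][k] is True when query token q may attend to key token k.
    """
    owners = [name for name, n in lengths.items() for _ in range(n)]
    return [[k in visible[q] for k in owners] for q in owners]


# Tiny token counts for readability; real streams are far longer.
lengths = {"understanding": 2, "perception": 1, "geometry": 2, "action": 1}

# Hypothetical visibility; only the "action" row reflects the summary.
visible = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "geometry": {"understanding", "geometry"},
    "action": {"understanding", "perception", "geometry", "action"},
}

mask = build_joint_mask(lengths, visible)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each expert keeps its own parameters; the mask only governs which tokens exchange information inside the shared attention step.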

Datasets

nuScenes — 28,130 training keyframes for open-loop planning, 6-camera images with LiDAR ground truth — public
Bench2Drive — 220 closed-loop driving routes in CARLA simulator across 12 towns — public

Baselines vs proposed

UniDriveVLA-Large (without ego status): 3s collision rate = 0.27% vs VLGA-Large: 0.18%
UniDriveVLA-Large (without ego status): L2 average = 0.51 m vs VLGA-Large: 0.50 m
UniDriveVLA (Bench2Drive): Driving Score = 78.37 vs VLGA: 79.08
DVGT-2 (with ego status): 3s collision rate = 0.47% vs VLGA-Large (without ego status): 0.18%
VGGDrive (with ego status): L2 avg = 0.31 m vs VLGA-Large (without ego status): 0.41 m but collision rate 0.18%
VLGA Base (without ego status): collision rate 0.35% vs UniDriveVLA Base: 0.41% (same scale comparison)
Adding geometry expert reduces collisions from 0.169% to 0.149%, adding pointmap supervision further to 0.136%
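The relative changes implied by the figures listed above can be checked with a few lines of arithmetic. All inputs are the numbers quoted in this section; nothing else is assumed.

```python
def relative_reduction(before, after):
    """Fractional decrease from before to after."""
    return (before - after) / before


comparisons = {
    "UniDriveVLA-Large -> VLGA-Large (3s collision, %)": (0.27, 0.18),
    "DVGT-2 -> VLGA-Large (3s collision, %)": (0.47, 0.18),
    "baseline -> + geometry expert (ablation, %)": (0.169, 0.149),
    "+ geometry expert -> + pointmap supervision (ablation, %)": (0.149, 0.136),
    "baseline -> full model (ablation, %)": (0.169, 0.136),
}

for label, (before, after) in comparisons.items():
    print(f"{label}: {relative_reduction(before, after):.1%} lower, "
          f"{before / after:.2f}x")

driving_score_gain = round(79.08 - 78.37, 2)
print("Bench2Drive Driving Score gain:", driving_score_gain)  # 0.71
```

The 8.7% relative decrease quoted in the key findings corresponds to the step from 0.149% to 0.136%; measured from the 0.169% baseline, the full model's reduction is about 19.5%.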

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12396.

Fig 1: Paradigms for grounding driving policies in 3D geometry. Existing approaches expose 3D structure

Fig 2: VLGA architecture. A four-expert Mixture-of-Transformers coupled by masked joint attention: an

Limitations

High inference cost due to large vision-language backbone constrains deployment on edge devices; distillation and quantization are future work.
Dense pointmap supervision is per-frame only; temporal consistency across multi-frame input not yet addressed.
No adversarial or out-of-distribution testing reported; robustness under anomaly or adversarial conditions unknown.
Frozen vision-language and perception backbones may limit adaptability; only geometry and action components are trained.
Dense geometry decoder is discarded at inference; effectiveness relies on learned representations but decoder's runtime cost is not analyzed.
Bench2Drive evaluation is in simulation rather than real-world closed-loop deployment.

Open questions / follow-ons

How can temporal consistency be leveraged to enforce smooth dense geometry understanding across video frames to improve long-horizon planning?
What are the robustness properties of the geometry stream under sensor noise, occlusions, and adversarial perturbations?
Can model distillation or quantization methods substantially reduce inference costs while preserving dense geometric grounding?
How does the geometry expert contribute to interpretability or failure diagnosis in safety-critical driving scenarios?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, VLGA exemplifies how grounding high-level language and vision understanding in a dense spatial modality can improve actionable decision-making reliability and safety. Analogously, defensive systems might benefit from explicit spatial grounding features rather than relying only on high-level semantic reasoning, especially where precise spatial reasoning about a user's environment or behavior matters. The architecture’s parameter-isolated expert design preserving modality-specific streams in a joint transformer may inspire multi-modal systems that need modular, interpretable components, for example to isolate anomalous input modalities in bot detection.

While the domain is autonomous driving, VLGA's approach to supervised dense 3D geometry reconstruction could translate to multi-modal bot-detection tasks that require fusing vision, language, and spatial context. The dense geometric supervision ensures the system cannot ignore fine-grained spatial details, reducing brittleness where attackers try to exploit a lack of grounding. It also shows the value of combining sparse structured queries with dense spatial signals, a concept applicable to multi-modal CAPTCHA challenges involving spatial puzzles or behavior analysis. However, the computational cost and system complexity caution against deploying such large-scale multi-expert models outside specialized hardware environments.

Cite

bibtex

@article{arxiv2606_12396,
  title={VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving},
  author={Jin Yao and Dhruva Dixith Kurra and Tom Lampo and Zezhou Cheng and Danhua Guo and Burhan Yaman},
  journal={arXiv preprint arXiv:2606.12396},
  year={2026},
  url={https://arxiv.org/abs/2606.12396}
}
