Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Source: arXiv:2605.22809 · Published 2026-05-21 · By Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu et al.

TL;DR

Sensor2Sensor addresses data scarcity in autonomous driving system (ADS) validation by converting unstructured, in-the-wild monocular dashcam videos into high-fidelity, multi-modal sensor data streams that match the target autonomous vehicle (AV) embodiment. Proprietary AV logs are limited in diversity and scale; Sensor2Sensor instead draws on external sources such as internet driving videos and dashcams, which capture rare, diverse, and safety-critical scenarios. The core novelty is bridging a large embodiment gap, from monocular single-view video to synchronized multi-view cameras and LiDAR point clouds, with a generative, diffusion-based sensor conversion approach.

Key findings

  • Sensor2Sensor achieves a Fréchet Inception Distance (FID) of 6.47 in multi-view image generation on a curated paired dataset, outperforming baselines VGGT (FID 250.93), π3 (FID 246.27), and X-Drive (FID 8.30).
  • On multi-view video generation, Sensor2Sensor attains a Fréchet Video Distance (FVD) of 278.12, substantially better than reconstruction baselines π3 (2007.35) and VGGT (2373.15) and the ablation without view-concatenation (293.73), indicating superior temporal coherence.
  • Chamfer Distance for LiDAR generation improves by 13.37% versus X-Drive baseline (8.68 vs 10.02), showcasing more accurate 3D geometry reconstruction.
  • Human evaluation across 26 participants shows Sensor2Sensor's output is top-ranked over 83% of the time for images and 68% for LiDAR on dashcam data, and similarly strong on internet footage (85%+ preference), signifying strong alignment with real data.
  • Ablation confirms that conditioning input via view concatenation (VC) outperforms channel concatenation (CC) with better FID (6.47 vs 6.88) and LPIPS (0.316 vs 0.346).
  • DAgger finetuning improves temporal video generation stability and realism, reducing FVD from 288.90 to 278.12 and FID from 24.65 to 21.54.
  • Perception models trained on real data achieve comparable LiDAR detection and image segmentation performance on Sensor2Sensor-generated data, confirming strong downstream utility.
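
The relative gains quoted in the findings above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `relative_reduction` is hypothetical, and the only inputs are the figures already listed in this summary.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (lower-is-better metrics)."""
    return (baseline - ours) / baseline * 100.0


# Figures as quoted in the key findings.
chamfer_gain = relative_reduction(10.02, 8.68)       # X-Drive vs Sensor2Sensor
fvd_dagger_gain = relative_reduction(288.90, 278.12)  # before vs after DAgger finetuning
fid_vc_gain = relative_reduction(6.88, 6.47)          # channel vs view concatenation

print(f"Chamfer Distance reduction: {chamfer_gain:.2f}%")
print(f"FVD reduction from DAgger:  {fvd_dagger_gain:.2f}%")
print(f"FID reduction from VC:      {fid_vc_gain:.2f}%")
```

The first value reproduces the 13.37% Chamfer Distance improvement stated above. The other two are derived here from the quoted numbers and are not figures reported by the paper.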

Threat model

n/a — This work focuses on data synthesis and sensor conversion for autonomous driving system validation and does not explicitly model an adversary or security threats.

Methodology — deep read

  1. Threat Model & Assumptions: No adversary is defined, because the work concerns data generation for autonomous driving rather than security. The key assumption is that large-scale paired data of dashcam videos and matching AV sensor logs does not exist, so training pairs must be synthesized and the conversion learned generatively. The model targets safety-critical ADS validation scenarios by generating realistic sensor data across modalities and views.

  2. Data: The authors use proprietary AV log datasets comprising approximately 100,000 driving scenes (~10 seconds each) with synchronized 8-view cameras (360 degrees) and LiDAR scans. They reconstruct 4D scenes from this data using dynamic 4D Gaussian Splatting (4DGS). From these reconstructions, synthetic dashcam-style monocular videos are rendered by sampling intrinsic and extrinsic parameters that mimic real-world dashcam optics and placements. This yields paired (synthetic dashcam, AV log) training data with perfect spatio-temporal alignment. Evaluation uses a curated paired "Fixed-Camera-to-AV" dataset of 1,000 sequences (3 seconds each) and an in-the-wild dataset collected from internet videos, OEM dashcams, phone recordings, and other ADAS sources.

  3. Architecture/Algorithm: Sensor2Sensor uses a conditional diffusion model with a multi-branch U-Net architecture enabling simultaneous multi-view image and LiDAR point cloud generation. Each modality employs its own VAE encoding/decoding: multi-view images are generated conditioned on input dashcam frames plus camera poses encoded as raymaps to enforce geometric consistency; LiDAR is represented as range-view spin images with 4 channels (range, intensity, elongation, validity) and encoded by a LiDAR VAE. A novel cross-sensor attention module allows joint feature fusion between image and LiDAR branches to ensure cross-modal consistency. For conditioning, the input monocular dashcam video is treated as a 9th view with a binary mask to distinguish noise-free conditions from generated views. During video generation, a DAgger-based auto-regressive training approach is used to reduce error accumulation in multi-step rollout.

  4. Training: The diffusion model is trained on the synthetic paired data rendered from 4DGS reconstructions, with emphasis on multi-view geometry and sensor fusion. The LiDAR VAE loss combines multiple L1 and LPIPS terms across channels with a KL divergence term. Training includes iterative DAgger finetuning for temporal stability. Hyperparameters such as epoch count, batch size, optimizer details, and hardware are not detailed in the source.

  5. Evaluation Protocol: Metrics include FID and LPIPS for image quality, FVD for video realism, PSNR and SSIM for pixel-level and structural similarity, Chamfer Distance for LiDAR accuracy, and human perceptual evaluation for realism and alignment. Baselines include VGGT and π3 (reconstruction-based) and X-Drive and a CAT3D adaptation (generative), along with several conditioning ablations. Evaluations are run on held-out paired fixed-camera AV data and on in-the-wild videos to test generalization.

  6. Reproducibility: The paper does not mention a public release of code or data. The paired dataset is proprietary, and the work relies heavily on proprietary AV logs and 4DGS reconstructions that contain sensitive data. The architectural diagrams and loss formulations are detailed enough to allow replication in principle.
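
To make the range-view LiDAR representation from step 3 concrete, the sketch below projects a point cloud into a small image with the four channels named there (range, intensity, elongation, validity). It is an illustration under stated assumptions, not the authors' implementation: the function name `to_range_view`, the uniform inclination bounds, the image size, and the nearest-return rule are all hypothetical choices. Real sensors typically use a per-beam inclination table instead of a uniform grid.

```python
import math


def to_range_view(points, height=4, width=8, incl_min=-0.4, incl_max=0.2):
    """Project (x, y, z, intensity, elongation) points into an H x W x 4 range image.

    Channels per pixel: [range, intensity, elongation, validity].
    Inclination bounds (radians) and image size are illustrative assumptions.
    """
    image = [[[0.0, 0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for x, y, z, intensity, elongation in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # a point at the sensor origin has no direction
        azimuth = math.atan2(y, x)        # horizontal angle in [-pi, pi]
        inclination = math.asin(z / rng)  # vertical angle
        if not (incl_min <= inclination <= incl_max):
            continue  # outside the assumed vertical field of view
        col = int((azimuth + math.pi) / (2.0 * math.pi) * width) % width
        row = int((incl_max - inclination) / (incl_max - incl_min) * height)
        row = min(row, height - 1)
        pixel = image[row][col]
        # Keep the nearest return when several points fall into one pixel.
        if pixel[3] == 0.0 or rng < pixel[0]:
            image[row][col] = [rng, intensity, elongation, 1.0]
    return image


cloud = [
    (10.0, 0.0, 0.0, 0.5, 0.1),  # straight ahead
    (20.0, 0.0, 0.0, 0.9, 0.2),  # same direction, farther away
    (0.0, 5.0, 0.0, 0.3, 0.0),   # to the left
]
range_image = to_range_view(cloud)
valid_pixels = sum(pixel[3] for row in range_image for pixel in row)
print("valid pixels:", valid_pixels)
```

Pixels with validity 0 carry no measurement, which is why a validity channel is needed alongside range: an empty pixel and a return at zero range would otherwise look the same.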

Example end-to-end: Starting from a 4DGS reconstruction of an AV log scene, the system renders a synthetic dashcam monocular video from a sampled virtual camera pose reflecting real dashcam optics. This synthetic monocular view is paired with the original multi-view camera images and LiDAR from the AV log to form training data. Sensor2Sensor is trained to condition on the monocular view and output the full multi-view and LiDAR sensor suite, learning geometric and modality consistency through the novel cross-sensor attention diffusion architecture. At inference, Sensor2Sensor takes real-world unseen monocular dashcam footage as input and generates temporally consistent, multi-view camera streams plus LiDAR point clouds aligned to the target AV embodiment for downstream ADS use.
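
The conditioning layout described above, where the dashcam input is appended as a 9th view and flagged with a binary mask, can be sketched as follows. This is a toy illustration only: `build_view_sequence` is a hypothetical name, latents are flat lists here, not spatial feature maps, and the noising line is the standard forward-diffusion formula, which the paper's exact schedule may not match.

```python
import math
import random

NUM_TARGET_VIEWS = 8  # AV camera count quoted in this summary


def build_view_sequence(target_latents, cond_latent, alpha_bar, rng):
    """Stack noised target views and one clean condition view along the view axis.

    Returns (views, mask), where mask[i] == 1 marks the noise-free condition view.
    """
    views, mask = [], []
    for latent in target_latents:
        # Standard forward-diffusion noising: sqrt(a) * x0 + sqrt(1 - a) * eps.
        noised = [
            math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for x in latent
        ]
        views.append(noised)
        mask.append(0)  # 0 = view the model must generate
    views.append(list(cond_latent))
    mask.append(1)      # 1 = clean dashcam condition, never noised
    return views, mask


rng = random.Random(0)
targets = [[0.0] * 4 for _ in range(NUM_TARGET_VIEWS)]
dashcam = [1.0, 2.0, 3.0, 4.0]
views, mask = build_view_sequence(targets, dashcam, alpha_bar=0.5, rng=rng)
print(len(views), mask)
```

With channel concatenation, the alternative tested in the ablation, the condition would instead be appended to the channels of every target view and the number of views would stay at eight.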

Technical innovations

  • A synthetic data pairing pipeline using dynamic 4D Gaussian Splatting to reconstruct scenes from AV logs and render realistic, diverse monocular dashcam views enabling supervised training without real paired data.
  • A conditional multi-modal diffusion model architecture that jointly generates multi-view images and LiDAR point clouds from a single monocular input via dedicated VAEs and U-Net branches with a novel cross-sensor attention module for inter-modality feature fusion.
  • Encoding camera poses as raymaps and incorporating a binary conditioning mask for the input monocular view to enforce precise geometric control and improved generative fidelity across multiple views.
  • Use of a DAgger-based auto-regressive training strategy to significantly improve temporal coherence and reduce error accumulation in long video rollouts of multi-sensor outputs.
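
A raymap, as mentioned in the list above, stores one ray per pixel so that the network sees camera geometry in image-aligned form. The sketch below builds one for a pinhole camera. It assumes a six-value encoding (ray origin plus unit direction in the world frame) and a camera-to-world pose; the paper may use a different parameterization, and `raymap` is a hypothetical name.

```python
import math


def raymap(fx, fy, cx, cy, rotation, translation, height, width):
    """Return an H x W grid of rays (ox, oy, oz, dx, dy, dz) in the world frame.

    rotation: 3x3 camera-to-world matrix as nested lists.
    translation: camera center in world coordinates.
    """
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel center through the pinhole model.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction into the world frame.
            d_world = [
                sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)
            ]
            norm = math.sqrt(sum(c * c for c in d_world))
            row.append(tuple(translation) + tuple(c / norm for c in d_world))
        rows.append(row)
    return rows


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = raymap(fx=2.0, fy=2.0, cx=1.0, cy=1.0,
              rotation=identity, translation=(1.0, 2.0, 3.0),
              height=2, width=2)
print(rays[0][0])
```

Every ray shares the same origin for a single pinhole camera, so the per-pixel information sits in the direction channels. Different views are distinguished by both origin and direction.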

Datasets

  • Internal AV logs — ~100,000 10s scenes — proprietary fleet-collected AV data with synchronized 8-view cameras and LiDAR
  • Fixed-Camera-to-AV paired dataset — 1,000 sequences of 3 seconds — curated from internal AV sensor configurations
  • In-the-wild driving dataset — diverse dashcam, internet, phone, ADAS videos — manually collected, non-public

Baselines vs proposed

  • VGGT [44]: Multi-view image FID = 250.93 vs Sensor2Sensor: 6.47
  • π3 [48]: Multi-view image FID = 246.27 vs Sensor2Sensor: 6.47
  • X-Drive [51]: Multi-view image FID = 8.30 vs Sensor2Sensor: 6.47
  • VGGT [44]: Video FVD = 2373.15 vs Sensor2Sensor: 278.12
  • π3 [48]: Video FVD = 2007.35 vs Sensor2Sensor: 278.12
  • Ours without VC (view concatenation): Video FVD = 293.73 vs Sensor2Sensor with VC: 278.12
  • X-Drive [51]: Chamfer Distance = 10.02 vs Sensor2Sensor: 8.68
  • Human Evaluation on dashcam images: X-Drive preference 3.08% vs Sensor2Sensor 83.46%
  • Human Evaluation on dashcam LiDAR: X-Drive 8.08% vs Sensor2Sensor 68.08%
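
For readers unfamiliar with the Chamfer Distance used in the LiDAR comparison above, here is a brute-force sketch. Conventions vary (squared vs unsquared distances, sum vs mean, one-sided vs symmetric), and this summary does not say which one the paper uses. The version below assumes the symmetric mean of unsquared nearest-neighbor distances, so its values are not comparable to the 8.68 and 10.02 reported.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbor Euclidean distance from a to b,
    plus the same from b to a. Brute force, O(len(a) * len(b)).
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(generated, reference))
```

The quadratic cost is fine for a toy example. Full LiDAR sweeps would normally call for a spatial index such as a k-d tree.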

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22809.

Fig 1: Sensor2Sensor is a novel generative paradigm for translating in-the-wild monocular videos from varied sources such as dash-

Fig 2: Synthetic paired-data curation pipeline. We recon-

Limitations

  • Relies on proprietary datasets and 4DGS reconstructions, limiting reproducibility and benchmarking on public datasets.
  • Evaluation mainly focuses on visual/sensor fidelity and does not include adversarial robustness or explicit safety-critical scenario coverage analysis.
  • Generalization tested qualitatively on in-the-wild data but lacks quantitative metrics on truly out-of-distribution conditions or extreme weather.
  • Model complexity and computational cost for diffusion and 4DGS reconstruction may challenge real-time or large-scale deployment.
  • LiDAR is modeled via range-view images rather than raw point clouds, which might limit some types of point cloud details or downstream compatibility.
  • Although DAgger improves temporal stability, long-term rollout drift and error accumulation remain possible and are not fully quantified across extended sequences.

Open questions / follow-ons

  • How well does Sensor2Sensor perform under extreme weather, lighting, or rare event conditions not represented in training data?
  • Can the approach be extended to other sensor modalities beyond cameras and LiDAR, such as radar or event cameras?
  • What are the limits of temporal consistency and error accumulation over longer video rollouts beyond 3-second clips?
  • How might model performance vary across diverse AV embodiments with differing sensor configurations or capture setups?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Sensor2Sensor's core contribution is bridging a large domain gap with conditional diffusion, generating multi-modal, multi-view sensor data from monocular inputs. The approach shows how reconstruction-based synthetic paired data and cross-modal attention can maintain geometric and temporal consistency at scale. Autonomous driving and CAPTCHA tasks differ, but a similar challenge arises in user interaction validation and fake user detection: translating sparse or weakly aligned input streams into richer, structured multi-modal outputs. Conditional diffusion with modality-attentive architectures and synthetic data pairing could inspire CAPTCHA tasks that use sensor or modality translation consistency checks to detect synthetic or bot-generated content. Direct application would require adaptation to modalities relevant to CAPTCHA (e.g., behavioral biometrics, motion, touch) and verification under adversarial conditions, both of which remain open challenges.

Cite

@article{arxiv2605_22809,
  title={Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving},
  author={Jiahao Wang and Bo Sun and Yijing Bai and Vincent Casser and Songyou Peng and Zehao Zhu and Meng-Li Shih and Xander Masotto and Shih-Yang Su and Kanaad V Parvate and Tiancheng Ge and Linn Bieske and Dragomir Anguelov and Mingxing Tan and Chiyu Max Jiang},
  journal={arXiv preprint arXiv:2605.22809},
  year={2026},
  url={https://arxiv.org/abs/2605.22809}
}

Articles are CC BY 4.0 — feel free to quote with attribution