IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Source: arXiv:2605.16258 · Published 2026-05-15 · By Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou et al.

TL;DR

IVGT reconstructs coherent, continuous 3D geometry and appearance from multiple unposed RGB images, that is, images with no known camera poses. Unlike existing visual geometry foundation models that explicitly regress discrete pixel-aligned 3D points, IVGT learns an implicit neural scene representation that supports continuous spatial queries within a canonical coordinate system. This representation models a signed distance function (SDF) and view-dependent colors, enabling direct extraction of smooth mesh surfaces and rendering of RGB, depth, and normal maps from arbitrary novel viewpoints. IVGT is trained end-to-end with 2D image supervision and geometric regularization on a diverse multi-dataset collection, and it generalizes across scenes without per-scene optimization or pose priors.

Empirically, IVGT demonstrates strong performance on numerous vision tasks including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation. On ScanNet mesh reconstruction, IVGT matches or outperforms most per-scene optimized baselines while running as a feed-forward, generalizable model. Its mesh outputs are visually smoother and more complete than explicit pointmap reconstructions. IVGT also achieves competitive novel view synthesis quality on RealEstate10K and DL3DV and competitive camera pose estimation errors on ScanNet, Sintel, and TUM-dynamics datasets. These results suggest IVGT's implicit continuous geometry modeling better captures underlying 3D structure from unposed views than prior explicit visual geometry transformers.

Key findings

IVGT achieves an F-score of 0.647 on ScanNet mesh reconstruction, surpassing most per-scene optimization baselines except MonoSDF (which requires hours of optimization).
Mean absolute trajectory error (ATE) on ScanNet camera pose estimation is 0.032m, comparable to or better than feed-forward baselines such as VGGT (0.035m) and WorldMirror (0.037m).
On pointmap reconstruction benchmarks (7-Scenes, NRGBD, DTU), IVGT obtains lower mean accuracy and completeness errors than prior methods like Fast3R and Point3R, e.g., 7-Scenes mean accuracy of 0.016m vs 0.020m for VGGT.
IVGT directly supports continuous 3D queries returning SDF values and colors for any spatial position, avoiding the redundancy and discontinuities in explicit pixel-aligned representations.
Use of ray-depth encoding for spatial position disambiguation improves implicit geometry learning when absolute positional encoding is ambiguous.
Two-stage training with 2D RGB, depth, normal supervision plus geometric regularization (Eikonal loss, smoothness) stabilizes the implicit representation and improves surface quality.
IVGT uses differentiable volume rendering from predicted SDF to jointly supervise geometry and view-consistent appearance without known camera poses.
IVGT renders RGB, depth, and normal maps from novel viewpoints using the same implicit SDF representation, demonstrating unified geometry and appearance modeling.

Threat model

n/a – The paper is focused on neural scene representation for 3D reconstruction and rendering from multi-view images, not on adversarial threats or security models.

Methodology — deep read

Threat Model & Assumptions: No adversary is modeled, as the paper focuses on multi-view reconstruction from unposed RGB images. The setting assumes that only multi-view RGB images of a static scene are available, with neither camera poses nor scene-specific optimization permitted at test time.
Data: The model is trained jointly on multiple diverse datasets including ARKitScenes, CO3Dv2, HyperSim, MegaDepth, OmniObject3D, ScanNet, ScanNet++, Unreal4K, and WildRGBD. These datasets include real and synthetic scenes and objects, with RGB-D data and camera poses for supervision during training. Surface normals are derived from RGB using a separate monocular normal prediction method (DSine). The dataset mixture covers both object-level and scene-level data.
Architecture / Algorithm:

The backbone is a transformer-based multi-view feature extractor initialized from VGGT weights. Each input image is tokenized with a DINO tokenizer. The transformer alternates intra-view and cross-view self-attention layers to aggregate features across views into a global neural scene representation F within a canonical coordinate system.
For an arbitrary 3D query point x in the canonical frame, the model projects x onto each input view using predicted camera parameters to retrieve pixel-aligned image features.
To avoid ambiguity of absolute 3D coordinates, the model encodes the relative ray depth of x with respect to each valid view using a learned MLP, then aggregates ray-depth embeddings and image features.
The concatenated local feature z is passed to cascaded MLP decoders: an 8-layer MLP predicts the signed distance function (SDF) value and an intermediate appearance feature; a 2-layer MLP then predicts the view-dependent RGB color conditioned on the SDF gradient (surface normal) and viewing direction.
SDF values are converted to densities with a learnable parameter to enable differentiable volume rendering. The model can render RGB, depth, and surface normal maps of novel views.
Surface extraction is done by querying the learned SDF on a high resolution grid and running Marching Cubes to extract continuous meshes.
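The projection and ray-depth steps above can be sketched for a single view as follows. This is a minimal illustration, not the authors' code: it assumes a pinhole camera with a canonical-to-camera rotation `R` and translation `t`, and it takes "ray depth" to mean the Euclidean distance from the camera center to the query point. The helper name `project_point` and all of its parameters are hypothetical. In IVGT the camera parameters are predicted by the network and the ray depth is fed through a learned MLP, neither of which is modeled here.

```python
import math

def project_point(x, R, t, fx, fy, cx, cy, width, height):
    """Project canonical-frame point x into one view (hypothetical helper).

    Assumes x_cam = R @ x + t with a pinhole camera (fx, fy, cx, cy).
    Returns (u, v, ray_depth, valid); u and v are None behind the camera.
    """
    x_cam = [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]
    # Assumption: "ray depth" is the distance from the camera center to x.
    ray_depth = math.sqrt(sum(c * c for c in x_cam))
    if x_cam[2] <= 1e-6:  # behind or on the camera plane: no valid pixel
        return None, None, ray_depth, False
    u = fx * x_cam[0] / x_cam[2] + cx
    v = fy * x_cam[1] / x_cam[2] + cy
    valid = 0.0 <= u < width and 0.0 <= v < height
    return u, v, ray_depth, valid

# Identity pose: the camera sits at the canonical origin looking down +z.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera = dict(R=R_identity, t=[0.0, 0.0, 0.0],
              fx=100.0, fy=100.0, cx=50.0, cy=50.0, width=100, height=100)

# Two points on the same viewing ray: same pixel, different ray depth.
near_hit = project_point([0.5, 0.0, 2.0], **camera)
far_hit = project_point([1.0, 0.0, 4.0], **camera)
```

Both query points land on the same pixel and would therefore retrieve the same image feature; only the ray depth tells them apart, which is the ambiguity the ray-depth encoding is meant to resolve.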

Training Regime:

Two-stage training: Stage 1 optimizes RGB reconstruction loss, rendered depth loss, surface normal loss, and camera pose estimation loss using a Huber loss on predicted camera parameters.
Stage 2 adds Eikonal and smoothness regularizations on the SDF field to improve geometry stability, plus a direct depth supervision loss with predicted uncertainty.
Hyperparameters: AdamW optimizer, with the learning rate warmed up to 2e-4 and then decayed on a cosine schedule.
Each mini-batch samples 8 viewpoints (4 context + 4 novel) per iteration and 1024 rays per view; points along each ray are drawn with error-bounded sampling.
Trained on 4 NVIDIA A800 GPUs for 4 days.
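The learning-rate schedule described above can be sketched in a few lines. Only the peak value of 2e-4 and the cosine shape come from the summary; the linear warmup shape, the warmup length, the total step count, and the final learning rate of zero are assumptions chosen for illustration, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=2e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Assumed shape: the summary only states "warming up to 2e-4 with
    cosine decay"; warmup length and min_lr are illustrative choices.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative run: 1000 steps in total, the first 100 used for warmup.
schedule = [lr_at(s, total_steps=1000, warmup_steps=100) for s in range(1001)]
```

The resulting values could be supplied to an optimizer such as AdamW as a per-step learning rate.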

Evaluation Protocol:

Metrics include mesh reconstruction accuracy, completeness, chamfer distance, precision, recall, and F-score on ScanNet.
Pointmap reconstruction accuracy and completeness on 7-Scenes, NRGBD, and DTU.
Camera pose errors (ATE, relative translation, rotation) after Sim(3) Umeyama alignment on ScanNet, Sintel, and TUM-dynamics.
Novel view synthesis metrics on RealEstate10K and DL3DV (PSNR, SSIM, LPIPS).
Ablations on ray-depth encoding, loss components, and mesh vs pointmap representations.
Comparisons are against state-of-the-art explicit visual geometry foundation models (e.g., VGGT) and per-scene optimization baselines (e.g., NeuS, VolSDF, MonoSDF).
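For readers unfamiliar with the mesh metrics listed above, the sketch below computes precision, recall, and F-score between two point sets sampled from the predicted and ground-truth surfaces. It is a brute-force illustration under assumptions: the 0.05 m distance threshold is a common choice for indoor benchmarks but is not confirmed by this summary, and the helper names are hypothetical.

```python
import math

def nearest_dist(p, cloud):
    """Distance from point p to its nearest neighbor in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)

def f_score(pred, gt, threshold=0.05):
    """Precision, recall, and F-score at a distance threshold (assumed 5 cm)."""
    # Precision: fraction of predicted points close to the ground truth.
    precision = sum(nearest_dist(p, gt) < threshold for p in pred) / len(pred)
    # Recall: fraction of ground-truth points close to the prediction.
    recall = sum(nearest_dist(g, pred) < threshold for g in gt) / len(gt)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: two predicted points are accurate, one floats off the surface,
# and the last ground-truth point is not covered at all.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pred_points = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.02), (2.0, 0.0, 0.5)]
precision, recall, f = f_score(pred_points, gt_points)
```

Accuracy and completeness follow the same pattern, replacing the thresholded fraction with the mean nearest-neighbor distance in each direction.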

Reproducibility:

Code is publicly released at https://github.com/wzzheng/IVGT/
Uses public and widely used datasets with standard splits.
Whether pretrained model weights are released, and any hyperparameter tuning beyond what is described above, is not explicitly stated.

Example End-to-End Inference: Given N unposed multi-view images of a scene, IVGT tokenizes them and runs the transformer to obtain a global scene feature representation. For a 3D query point, it projects the point onto the valid images, retrieves pixel features and ray-depth embeddings, then decodes SDF and color values via the cascaded MLPs. Volume rendering integrates these values along camera rays to produce novel-view images and depth. Marching Cubes extracts the continuous surface mesh from the zero-level set of the predicted SDF volume. All of this happens in a single forward pass, without pose inputs or scene-specific optimization.
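The volume rendering step can be illustrated on a toy scene. The sketch below is a stand-in, not IVGT's implementation: it converts SDF values to densities with a Laplace-CDF mapping in the style of VolSDF (an assumption; the summary only says a learnable parameter is used, here the fixed `beta`), replaces the learned decoders with an analytic sphere SDF and a constant color, and uses uniform samples instead of error-bounded sampling. All function names are hypothetical.

```python
import math

def sdf_to_density(sdf, beta=0.05):
    """Assumed VolSDF-style mapping: density = (1/beta) * LaplaceCDF(-sdf)."""
    s = -sdf
    if s <= 0:  # outside the surface: density decays with distance
        cdf = 0.5 * math.exp(s / beta)
    else:       # inside the surface: density saturates at 1/beta
        cdf = 1.0 - 0.5 * math.exp(-s / beta)
    return cdf / beta

def render_ray(sdf_fn, color_fn, origin, direction, near, far,
               n_samples=256, beta=0.05):
    """Alpha-composite color and depth along one ray with uniform samples."""
    dt = (far - near) / n_samples
    transmittance = 1.0
    rgb, depth = [0.0, 0.0, 0.0], 0.0
    for i in range(n_samples):
        t = near + (i + 0.5) * dt
        x = [origin[k] + t * direction[k] for k in range(3)]
        alpha = 1.0 - math.exp(-sdf_to_density(sdf_fn(x), beta) * dt)
        weight = transmittance * alpha
        color = color_fn(x)
        for k in range(3):
            rgb[k] += weight * color[k]
        depth += weight * t
        transmittance *= 1.0 - alpha
    return rgb, depth, 1.0 - transmittance  # color, expected depth, opacity

def unit_sphere_sdf(x):
    """Toy stand-in for the learned SDF decoder."""
    return math.sqrt(sum(c * c for c in x)) - 1.0

def constant_color(x):
    """Toy stand-in for the learned view-dependent color decoder."""
    return (1.0, 0.5, 0.0)

# A ray from z = -3 toward the sphere meets the surface near t = 2.
rgb, depth, opacity = render_ray(unit_sphere_sdf, constant_color,
                                 [0.0, 0.0, -3.0], [0.0, 0.0, 1.0], 0.0, 6.0)
```

The rendered depth lands close to the true intersection distance, with a small bias that shrinks as `beta` decreases. Because every operation is differentiable, 2D losses on the rendered color and depth can propagate back to the SDF.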

Technical innovations

Introduction of an implicit visual geometry transformer that learns a continuous signed distance function (SDF) representation from pose-free multi-view RGB images, enabling continuous surface extraction and rendering.
Design of a ray-depth positional encoding to disambiguate 3D points projected to the same pixel across views, addressing the limitations of absolute coordinate positional encodings under varying canonical frames.
A two-stage training regime combining 2D supervision from RGB, depth, and surface normal images with geometric regularizations (Eikonal and smoothness loss) to stabilize implicit geometry modeling without requiring known camera poses.
Use of differentiable volume rendering on predicted SDF and view-dependent colors to jointly supervise geometry and appearance under a unified implicit neural scene representation.
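The Eikonal regularization mentioned above penalizes deviations of the SDF gradient norm from 1, which is the property a true distance field satisfies. The sketch below illustrates the loss on analytic fields; it approximates gradients with central finite differences, whereas a trained model would typically use automatic differentiation. The helper names and sample points are hypothetical, and the smoothness term is not shown.

```python
import math

def sdf_gradient(sdf_fn, x, eps=1e-4):
    """Central finite-difference gradient of a scalar field at point x."""
    grad = []
    for k in range(3):
        x_plus, x_minus = list(x), list(x)
        x_plus[k] += eps
        x_minus[k] -= eps
        grad.append((sdf_fn(x_plus) - sdf_fn(x_minus)) / (2.0 * eps))
    return grad

def eikonal_loss(sdf_fn, points):
    """Mean of (||grad f(x)|| - 1)^2 over the sampled points."""
    total = 0.0
    for x in points:
        norm = math.sqrt(sum(g * g for g in sdf_gradient(sdf_fn, x)))
        total += (norm - 1.0) ** 2
    return total / len(points)

def sphere_sdf(x):
    """A valid distance field: unit gradient norm away from the center."""
    return math.sqrt(sum(c * c for c in x)) - 1.0

def scaled_field(x):
    """Same zero-level set as sphere_sdf, but not a true distance field."""
    return 2.0 * sphere_sdf(x)

samples = [(0.3, -0.2, 0.9), (1.5, 0.4, -0.7), (-0.8, 1.1, 0.2)]
loss_valid = eikonal_loss(sphere_sdf, samples)
loss_scaled = eikonal_loss(scaled_field, samples)
```

Both fields describe the same surface, but only the first has a near-zero Eikonal loss, which shows why the term constrains the field itself and not just its zero-level set.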

Datasets

ScanNet — 1513 scenes (train/test split per Guo et al. 2022) — public
ARKitScenes — unspecified size — public
CO3Dv2 — unspecified size — public
HyperSim — unspecified size — public
MegaDepth — unspecified size — public
OmniObject3D — unspecified size — public
ScanNet++ — unspecified size — public
Unreal4K — synthetic dataset — public
WildRGBD — unspecified size — public
7-Scenes — unspecified size — public
NRGBD — unspecified size — public
DTU — unspecified size — public
RealEstate10K — 200 test scenes — public
DL3DV — 112 unseen scenes — public

Baselines vs proposed

COLMAP (per-scene): ScanNet F-score = 0.537 vs IVGT (generalizable) F-score = 0.647
MonoSDF (per-scene optimization): ScanNet F-score = 0.733 vs IVGT F-score = 0.647
VGGT (feed-forward baseline): 7-Scenes accuracy mean = 0.020m vs IVGT = 0.016m
WorldMirror (feed-forward): ScanNet pose ATE = 0.037m vs IVGT = 0.032m
FLARE: RealEstate10K PSNR = 16.33 (sparse views) vs IVGT = 18.97
AnySplat: RealEstate10K PSNR = 17.62 (sparse views) vs IVGT = 18.97

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.16258.

Fig 1: IVGT implicitly models coherent 3D geometry and appearance from pose-free multi-view

Limitations

IVGT performance on mesh reconstruction is slightly below the best per-scene optimized methods (e.g., MonoSDF), which require long training times per scene.
The paper does not evaluate robustness under severe occlusions, dynamic scenes, or non-Lambertian materials, which could impact implicit geometry learning.
Camera pose supervision is required during training, which limits training on large-scale data that lacks pose annotations.
Surface extraction relies on uniform grid sampling and Marching Cubes, which can be computationally expensive for large scenes.
Novel view synthesis results are slightly worse in PSNR and LPIPS than those of some specialized rendering methods, indicating a trade-off between rendering fidelity and general, continuous geometry modeling.

Open questions / follow-ons

How well does IVGT perform under challenging real-world conditions such as strong lighting changes, motion blur, or occlusions?
Can the implicit visual geometry transformer be extended to handle dynamic scenes or non-rigid objects?
How can novel view synthesis fidelity be improved further within the implicit SDF-based approach without sacrificing mesh reconstruction quality?
What are the limits of generalization across drastically different scene types or sensor modalities beyond the combined datasets studied?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, IVGT presents a novel way to create high-fidelity, continuous 3D representations from unposed multi-view images without needing camera parameters or per-instance optimization. This might inspire new CAPTCHAs or bot-detection schemes leveraging implicit 3D geometry understanding, for example by requiring interaction with or recognition of continuous 3D scenes rendered from multi-view inputs. IVGT’s ability to reconstruct coherent surface geometry and appearance in a single forward pass could enable CAPTCHA challenges based on spatial reasoning or verifying user knowledge of complex 3D shapes, potentially increasing resistance to automated attacks that rely on 2D image-based recognition. The paper also highlights that continuous implicit representations outperform discrete pixel-aligned pointmaps for geometry consistency, which may guide design choices in leveraging 3D scene priors for bot interaction security. However, deploying such 3D implicit models for real-time CAPTCHA tasks would require addressing computational efficiency and robustness concerns discussed in the paper’s limitations.

Cite

@article{arxiv2605_16258,
  title={IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation},
  author={Yuqi Wu and Tianyu Hu and Wenzhao Zheng and Yuanhui Huang and Haowen Sun and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2605.16258},
  year={2026},
  url={https://arxiv.org/abs/2605.16258}
}
