Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
Source: arXiv:2604.28193 · Published 2026-04-30 · By Vinayak Gupta, Chih-Hao Lin, Shenlong Wang, Anand Bhattad, Jia-Bin Huang
TL;DR
GenWildSplat targets a specific cluster of failure modes in sparse-view 3D reconstruction: real-world photo collections are usually unposed, contain only a handful of useful views, and mix illumination changes with transient distractors like people or cars. Prior in-the-wild Gaussian-splatting systems typically handle this by optimizing per scene with learned appearance embeddings or visibility masks, which is slow, brittle under sparse coverage, and hard to generalize. The paper’s core claim is that a feed-forward model can handle all three problems at once if it learns a better geometric prior, an explicit lighting-conditioned appearance transform, and external occlusion supervision.
The main novelty is GenWildSplat, which builds on AnySplat/VGGT-style sparse multi-view geometry prediction and adds two key components: an appearance adapter that modulates canonical 3D Gaussian colors toward a target lighting code, and a segmentation-based masking pipeline that suppresses transient objects during supervision. Training is staged with a curriculum over synthetic lighting variation, multiple scenes, and then synthetic occlusions, because direct end-to-end training on real sparse imagery was unstable. On PhotoTourism and MegaScenes, the method reports better reconstruction quality than optimization-based in-the-wild baselines while running in about 3 seconds per scene with no test-time optimization.
Key findings
- On MegaScenes (3-view), GenWildSplat improves PSNR from 13.17 for NexusSplats to 14.43 and SSIM from 0.335 to 0.402, while lowering LPIPS from 0.552 to 0.496.
- On MegaScenes (6-view), GenWildSplat improves PSNR from 13.92 for NexusSplats to 15.84 and SSIM from 0.397 to 0.440, while lowering LPIPS from 0.518 to 0.407.
- Against feed-forward baselines on MegaScenes, the full model reaches PSNR 15.84 / SSIM 0.440 / LPIPS 0.407, compared with Vanilla AnySplat at 12.65 / 0.311 / 0.412.
- The paper reports runtime of about 3 seconds for GenWildSplat inference, versus 2.4 hours for NexusSplats and 5–8 hours for other optimization-based in-the-wild baselines.
- Ablation on MegaScenes shows removing the appearance adapter drops PSNR from 15.84 to 13.76 and SSIM from 0.440 to 0.391.
- Ablation shows removing occlusion handling hurts LPIPS materially, from 0.407 to 0.513, indicating transient masking is important for perceptual quality.
- Ablation shows removing curriculum learning is the most damaging single change: PSNR falls to 11.72 and SSIM to 0.318, with the authors describing Gaussian color collapse.
- For PhotoTourism, the paper claims state-of-the-art rendering quality, surpassing optimization-based in-the-wild methods while remaining feed-forward, but the excerpt we received does not include the numeric table values for those scenes.
Threat model
The relevant adversary is the uncontrolled data distribution: sparse, unposed, real-world photo collections containing unknown camera intrinsics/extrinsics, heterogeneous lighting, and transient foreground objects that should not become part of the reconstructed static scene. The method assumes the attacker cannot alter the training pipeline beyond ordinary dataset contamination and cannot provide dense, calibrated multi-view supervision at test time. It also assumes access to a pretrained segmentation prior for transient object detection; if that prior fails, the reconstruction supervision degrades. There is no explicit malicious attacker model beyond robustness to nuisance variation.
Methodology — deep read
Threat model and assumptions: this is not a security paper, but the practical “adversary” is the data itself — sparse, unposed internet photo collections with heavy nuisance variation. The model must handle unknown camera poses, strong appearance shifts across time and weather, and transient foreground objects. It assumes only a small number of input views are available at test time (the evaluation uses 3 and 6 views on MegaScenes, and 6 views on PhotoTourism) and that no per-scene optimization is allowed. It also assumes access to a pre-trained semantic segmentation model at inference/training time for transient masking, which is an external prior rather than something learned end-to-end.
Data: training uses 700+ outdoor scenes from DL3DV, augmented to simulate in-the-wild conditions. Appearance diversity is synthesized with DiffusionRenderer via offline unconditioned relighting, costing about 30 minutes per scene. Transient occluders are generated by compositing COCO segments (e.g., people, cars) at random locations, with exact ground-truth masks available for supervision. Evaluation is on PhotoTourism (6 input views across 3 scenes) and a curated subset of 20 challenging MegaScenes scenes, selected to have fewer than 20 registered images so the benchmark remains genuinely sparse rather than artificially subsampled. The paper does not describe train/val/test scene splits in detail in the excerpt, beyond stating that scenes are unseen during evaluation and that the MegaScenes subset was curated for sparsity.
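The paper does not include code for this augmentation step, but the compositing it describes is straightforward; the sketch below shows one plausible implementation in which a COCO object crop is pasted at a random location and the exact binary mask is recorded for loss masking. Function and argument names here are illustrative assumptions.

import numpy as np

def composite_transient(image, segment_rgb, segment_mask, rng=None):
    # image:        (H, W, 3) float array in [0, 1], a clean DL3DV frame
    # segment_rgb:  (h, w, 3) object crop (e.g., a person or car from COCO)
    # segment_mask: (h, w) binary mask of the object within the crop
    # Returns the occluded image and the full-resolution ground-truth
    # transient mask used to exclude these pixels from supervision.
    rng = rng or np.random.default_rng()
    H, W, _ = image.shape
    h, w = segment_mask.shape
    top = int(rng.integers(0, H - h))
    left = int(rng.integers(0, W - w))

    out = image.copy()
    gt_mask = np.zeros((H, W), dtype=bool)

    m = segment_mask.astype(bool)
    region = out[top:top + h, left:left + w]
    region[m] = segment_rgb[m]               # overwrite static pixels with the occluder
    gt_mask[top:top + h, left:left + w] = m  # the exact mask is known by construction
    return out, gt_mask

Because the occluder is synthetic, no segmentation model is needed to obtain the mask during this stage of training, which is what gives stage 3 clean supervision for ignoring transient regions.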
Architecture / algorithm: GenWildSplat extends AnySplat. A 24-layer VGGT transformer backbone with alternating frame and global attention extracts multi-view features from sparse unposed images. Three heads then predict per-view depth, camera intrinsics/extrinsics, and per-pixel Gaussian attributes. These are unprojected into canonical 3D Gaussians with appearance-independent geometry and canonical SH colors. A separate light encoder produces a 16-dimensional lighting vector per input image; an MLP expands that code and an appearance adapter uses it to modulate the canonical Gaussian colors into lighting-specific colors. The transformed Gaussians are then rasterized with a differentiable splatting renderer. For occlusions, a YOLOv8 segmentation network classifies transient categories (the paper names person, car, dog, etc.), merges them into a binary transient mask, and uses that mask to zero out transient pixels in both input and rendered images before reconstruction losses are computed. The key novelty is that appearance is handled in 3D, not via per-image 2D style transfer or a scene-specific latent optimized at test time.
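The excerpt specifies the lighting code's size (16) and that an MLP expands it to modulate canonical Gaussian colors, but not the modulation's exact form; the PyTorch sketch below assumes a FiLM-style per-channel scale and shift, and is meant only to show where the appearance adapter sits in the pipeline, not to reproduce the authors' layer definitions.

import torch
import torch.nn as nn

class AppearanceAdapter(nn.Module):
    # Lighting-conditioned color transform for canonical 3D Gaussians.
    # The scale/shift parameterization is an assumption; the paper only
    # states that an MLP expands the 16-d lighting code and modulates colors.
    def __init__(self, color_dim=3, light_dim=16, hidden=128):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Linear(light_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * color_dim),  # per-channel scale and shift
        )

    def forward(self, canonical_colors, light_code):
        # canonical_colors: (N, 3) appearance-independent Gaussian colors
        # light_code:       (16,) code from the light encoder for one input view
        scale, shift = self.expand(light_code).chunk(2, dim=-1)
        return canonical_colors * (1.0 + scale) + shift  # lighting-specific colors

Only colors are transformed; the Gaussian means, covariances, and opacities stay fixed, which is what allows a single reconstruction to be re-rendered under each input photo's lighting.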
Training regime: the paper uses curriculum learning because direct joint training on geometry, lighting, and occlusions was unstable. Stage 1 trains on a single synthetic scene with illumination variation but no transients, so the model can learn to separate appearance from geometry. Stage 2 adds more synthetic scenes to build broader geometry/appearance priors. Stage 3 introduces synthetic occlusions with ground-truth masks so the network learns to ignore transient regions. The model is initialized from AnySplat pre-trained weights. Training runs for 40K iterations total: 10K in stage 1, 10K in stage 2, and 20K in stage 3, using a perceptual-loss weight of λ = 0.05. The paper says this takes about 2 days on a single RTX A6000. A concrete end-to-end example: given a sparse set of unposed photos of, say, a landmark with tourists and changing daylight, VGGT features feed the depth/camera/Gaussian heads; the geometry is unprojected to canonical Gaussians; the light encoder extracts a code from each input photo; the appearance adapter transforms the Gaussian colors to match that photo’s lighting; the segmentation network masks tourists; and the masked rendered image is compared against the masked input with MSE + perceptual loss.
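A minimal version of the masked objective described above, assuming images in [0, 1], a binary transient mask, and an externally supplied perceptual distance (for example an LPIPS network); the 0.05 weight is the value reported in the paper, and zeroing transient pixels in both images follows the description given earlier.

import torch.nn.functional as F

def masked_reconstruction_loss(rendered, target, transient_mask,
                               perceptual_fn, lam=0.05):
    # rendered, target: (B, 3, H, W) rendered and input images in [0, 1]
    # transient_mask:   (B, 1, H, W), 1 where a transient object was detected
    # perceptual_fn:    callable returning a scalar perceptual distance;
    #                   the specific perceptual loss is an assumption here
    keep = 1.0 - transient_mask          # 1 on static pixels, 0 on transients
    rendered_m = rendered * keep         # zero out transients in both images,
    target_m = target * keep             # as described in the text
    return F.mse_loss(rendered_m, target_m) + lam * perceptual_fn(rendered_m, target_m)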
Evaluation protocol and reproducibility: the primary metrics are PSNR, SSIM, and LPIPS. Baselines include optimization-based in-the-wild methods (GS-W, WildGaussians, NexusSplats, Gaussian-in-the-Wild) and feed-forward variants built from AnySplat, plus AnySplat combined with style transfer or DiffusionRenderer. The paper notes that COLMAP poses hurt baseline performance under sparsity, so it uses VGGT poses for a fairer comparison across methods, an important detail because it strengthens the baselines rather than weakening them. The evaluation also includes ablations for the appearance adapter, occlusion handling, and curriculum learning. Reproducibility is partial: the architecture and training schedule are described, but the excerpt does not state whether code, weights, or the curated sparse MegaScenes subset are publicly released. The project website is mentioned, but release status is unclear from the text we have.
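For reference, PSNR follows directly from the mean squared error; the helper below assumes images in [0, max_val]. SSIM and LPIPS are typically taken from standard implementations (for example scikit-image and the lpips package) and are not re-derived here.

import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB between two images in [0, max_val].
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)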
Technical innovations
- A 3D appearance adapter that conditions canonical Gaussian colors on a learned lighting code, instead of requiring per-scene latent optimization as in WildGaussians or NexusSplats.
- A segmentation-driven transient masking strategy that uses an external pretrained model to supervise only static regions, avoiding collapse of learned visibility/uncertainty masks (a minimal mask-merging sketch follows this list).
- A curriculum that separates appearance learning, multi-scene geometric generalization, and occlusion handling, which the authors report is necessary for stable convergence.
- A feed-forward sparse-view reconstruction pipeline that predicts depth, camera parameters, and Gaussians from unposed images in a single pass, extending AnySplat to unconstrained outdoor imagery.
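As noted in the masking item above, the merge step can be kept simple once the segmentation model has produced per-instance masks and class labels; the sketch below abstracts away the YOLOv8 interface, and the transient class list is an assumption since the paper names only a few example categories.

import numpy as np

# Example transient categories; the paper explicitly names person, car,
# and dog, so the remainder of this list is an assumption.
TRANSIENT_CLASSES = {"person", "car", "dog", "bicycle", "bus", "truck"}

def merge_transient_masks(instance_masks, class_names, image_shape):
    # instance_masks: list of (H, W) boolean arrays from a segmentation model
    # class_names:    predicted class name for each instance
    # image_shape:    (H, W) of the input image
    merged = np.zeros(image_shape, dtype=bool)
    for mask, name in zip(instance_masks, class_names):
        if name in TRANSIENT_CLASSES:
            merged |= mask.astype(bool)
    return merged  # True wherever any transient instance was detected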
Datasets
- DL3DV — 700+ outdoor scenes — public dataset; augmented with synthetic relighting and composited COCO occlusions
- COCO segments — number of composited segments not stated; used for synthetic transient occluders — public source
- PhotoTourism — 3 scenes, 6 input views used in evaluation — public benchmark
- MegaScenes — 20 curated challenging scenes with fewer than 20 registered images each for evaluation — public benchmark subset curated by authors
Baselines vs proposed
- NexusSplats, MegaScenes (3-view): PSNR 13.17 / SSIM 0.335 / LPIPS 0.552 vs proposed 14.43 / 0.402 / 0.496
- NexusSplats, MegaScenes (6-view): PSNR 13.92 / SSIM 0.397 / LPIPS 0.518 vs proposed 15.84 / 0.440 / 0.407
- Vanilla AnySplat, MegaScenes: PSNR 12.65 / SSIM 0.311 / LPIPS 0.412 vs proposed 15.84 / 0.440 / 0.407
- DiffusionRenderer + AnySplat, MegaScenes: PSNR 13.59 / SSIM 0.309 / LPIPS 0.444 vs proposed 15.84 / 0.440 / 0.407
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.28193.

Fig 1: GenWildSplat reconstructs 3D scenes from sparse, unposed images with varying illumination and transient objects in a single feed-forward pass.

Fig 2: Limitations of prior work [16, 36].

Fig 3: Overview of the GenWildSplat pipeline, which takes sparse, unposed images as input.

Limitations
- Sparse viewpoints still leave unseen regions, so geometry is incomplete where the input images provide no coverage.
- The model can produce artifacts or double geometry when test viewpoints are far outside the training distribution.
- Performance degrades on indoor scenes, especially when the external occlusion mask is imperfect or depth discontinuities confuse segmentation.
- The method does not model cast shadows or physically correct relighting, so it is not a full illumination-consistent renderer.
- The excerpt does not report standard deviations, confidence intervals, or statistical tests for the quantitative comparisons.
- Reproducibility is limited by unclear release status for code, checkpoints, and the curated MegaScenes evaluation subset in the provided text.
Open questions / follow-ons
- Can the appearance adapter be extended from lighting transfer to physically grounded shadow and indirect-illumination modeling without sacrificing feed-forward speed?
- How much of the gain comes from the curriculum versus the specific architecture, and would the same training schedule stabilize other feed-forward Gaussian reconstructor backbones?
- Can the method be made robust to indoor scenes and to semantic-mask failures, perhaps via uncertainty-aware masking or self-correcting occlusion estimation?
- Would a larger and more diverse real-world training corpus reduce the double-geometry failures on viewpoints far outside the training distribution?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the paper is mostly relevant as a signal that sparse, messy, real-world image understanding is becoming faster and more generalizable. Systems that rely on appearance variation, transient-object suppression, or cross-view consistency may become easier to automate at scale if a model can reconstruct 3D structure from only a few unconstrained photos. That matters for abuse workflows that involve listing moderation, marketplace scraping, photo verification, or scene-based integrity checks, because a bot could potentially normalize lighting and viewpoint before downstream analysis.
At the same time, the method’s weaknesses are useful too. It still struggles with unseen geometry, extreme viewpoint shifts, indoor scenes, and inaccurate masks. A defender designing visual challenges or forensic checks could exploit exactly those gaps by using configurations with hard occlusions, unusual camera geometry, or shadow-heavy indoor setups. More broadly, the paper suggests that relying on simple multi-view inconsistency or lighting variation as a fraud signal will get less reliable over time, so CAPTCHA and bot-defense systems should combine geometry-aware checks with behavioral and device-side signals rather than depending on image appearance alone.
Cite
@article{arxiv2604_28193,
  title={Generalizable Sparse-View 3D Reconstruction from Unconstrained Images},
  author={Vinayak Gupta and Chih-Hao Lin and Shenlong Wang and Anand Bhattad and Jia-Bin Huang},
  journal={arXiv preprint arXiv:2604.28193},
  year={2026},
  url={https://arxiv.org/abs/2604.28193}
}