RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Source: arXiv:2605.31535 · Published 2026-05-29 · By Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

TL;DR

Self-supervised novel view synthesis (NVS) aims to predict novel images of static scenes from input views without relying on ground-truth camera poses. Despite abundant internet video data, scaling self-supervised NVS to large datasets remains challenging due to brittle optimization in multi-network systems and instability when training on real-world videos containing dynamic content. This paper presents RayDer, a unified transformer backbone that consolidates camera pose estimation, scene reconstruction, and rendering into a single feed-forward model. RayDer introduces a minimal per-view dynamic state token treated as a nuisance factor during training to absorb dynamic content, enabling stable training on unconstrained dynamic video while still targeting static-scene NVS. Through architectural unification, dynamic state modeling, and improved pose autoregression, RayDer enables well-behaved scaling of NVS with respect to both data and compute. Empirically, RayDer exhibits clean power-law scaling over three orders of magnitude in dataset size (from tens of thousands to millions of videos) and model sizes (59M to 743M parameters). It outperforms prior self-supervised methods trained on smaller static-scene datasets, achieving zero-shot novel view synthesis and camera pose estimation competitive with state-of-the-art supervised approaches. The work demonstrates that the main bottleneck to scaling self-supervised NVS is not data scarcity but the design of the learning system itself, which RayDer addresses by unified modeling and handling dynamics as nuisance variables during training.

Key findings

Introducing a per-view dynamic state token stabilizes training on videos with dynamic content, eliminating the divergence observed with the previous RayZer baseline (Table 1: the Config A baseline diverges; Configs B and C are stable).
Consolidating camera estimation, scene reconstruction, and rendering into a single unified transformer backbone improves zero-shot NVS PSNR on RE10K from 23.76 (Config C) to 25.33 (Config D) and enhances camera pose transfer metrics (Table 1).
RayDer exhibits clean power-law scaling with respect to training compute and data size, modeled with R² > 0.99 by equations such as MSE(C, D) ≈ 0.0033 + 200·C^(-0.40) + 2.6·D^(-0.60) (Fig. 11).
Training on large-scale unconstrained video with dynamics (~2.7M videos) substantially outperforms training on a combined static-scene mix of ~247k videos, e.g. NVS PSNR 29.38 vs 28.68 on RE10K (Table 3).
Random-order autoregressive pose prediction over multiple views improves both pose accuracy and novel view quality, e.g. PSNR increase from 25.12 to 26.28 and pose accuracy recall@10° from 70.1 to 84.4 on RE10K (Table 1, Config E to G).
Local high-resolution intra-frame layers further improve synthesis quality to 26.87 PSNR on RE10K (Config H), highlighting the impact of patch size on fine detail (Table 1).
Larger models gain more from increased data scale; small models saturate early or overfit on small data (Fig. 10).
Zero-shot evaluations on unseen datasets show RayDer’s NVS and camera pose performance generalizes well beyond training domains (Fig. 9).
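
The fitted scaling law quoted in the findings can be evaluated directly. The following is a minimal sketch, not the authors' code: the coefficients are copied from the equation above, the units of C and D are not stated in this summary (so the inputs in the loop are hypothetical placeholders), and the PSNR conversion assumes pixel values scaled to [0, 1].

```python
import math

# Coefficients as quoted in the summary (Fig. 11). Units of C and D are not given here.
E_INF, A_C, ALPHA, A_D, BETA = 0.0033, 200.0, 0.40, 2.6, 0.60


def scaling_mse(compute: float, data: float) -> float:
    """Evaluate MSE(C, D) = 0.0033 + 200 * C^-0.40 + 2.6 * D^-0.60."""
    return E_INF + A_C * compute ** -ALPHA + A_D * data ** -BETA


def mse_to_psnr(mse: float) -> float:
    """PSNR in dB, assuming pixel values in [0, 1] (an assumption, not from the source)."""
    return -10.0 * math.log10(mse)


if __name__ == "__main__":
    # Hypothetical (C, D) pairs chosen only to show the trend.
    for c, d in [(1e6, 1e4), (1e8, 1e5), (1e10, 1e6)]:
        m = scaling_mse(c, d)
        print(f"C={c:.0e} D={d:.0e} MSE={m:.5f} PSNR={mse_to_psnr(m):.2f} dB")
```

Under this functional form the constant term acts as a floor: the predicted MSE approaches 0.0033 as C and D grow, and each power-law term shrinks independently of the other.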

Threat model

The adversary is implicit: the training system has no pose supervision and must learn camera poses and static scene representation from video that includes dynamic, time-varying elements. The adversary can be seen as scene content dynamics, which act as a nuisance factor corrupting pose representations and destabilizing training. There is no explicit adversary that injects adversarial inputs or attempts to fool the system; rather, the method focuses on robustness to uncontrolled real-world videos with unknown dynamics. The system assumes dynamic content cannot be reconstructed and is treated as noise or nuisance during training.

Methodology — deep read

The paper tackles scaling self-supervised NVS from unconstrained real-world video by addressing instability from dynamic scene content and architectural complexity from multi-network pipelines.

Threat model and assumptions: The goal is static-scene novel view synthesis given unposed multi-view inputs coming from monocular videos, some containing dynamic content. The adversary is not directly modeled, but the challenge is learning without supervision of camera pose and scene dynamics; dynamic scene elements are treated as nuisance factors rather than reconstructed.

Data: The main training data is SpatialVid, a large-scale internet video corpus (~2.7M videos) with dynamic content. Two complementary datasets are used for initial experiments: Segment Anything-Video (SA-V) and SpatialVid-HQ (SV-HQ). Static-scene NVS test benchmarks include RealEstate-10K (RE10K), DL3DV-10K, CO3D, and WildRGBD, among others. Each training clip contains 8 views with about 0.5 s spacing, preprocessed to 256x256 resolution. Supervision comes purely from self-supervised image reconstruction losses, without pose or geometry ground truth.
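
The clip layout described above (8 views, roughly 0.5 s apart) can be illustrated with a small frame-index helper. This is a sketch under assumptions: the function name, the `start` offset, and the rounding of the frame step are hypothetical choices, and the paper's actual sampling and resizing pipeline is not described in this summary.

```python
def sample_view_indices(fps: float, num_frames: int, start: int = 0,
                        num_views: int = 8, spacing_s: float = 0.5) -> list[int]:
    """Return frame indices for `num_views` views spaced about `spacing_s` seconds apart.

    The step is rounded to a whole number of frames (at least 1), so the
    real spacing only approximates `spacing_s` for arbitrary frame rates.
    """
    step = max(1, round(fps * spacing_s))
    indices = [start + i * step for i in range(num_views)]
    if indices[-1] >= num_frames:
        raise ValueError("clip too short for the requested views")
    return indices


if __name__ == "__main__":
    # A 10 s clip at 30 fps: one view every 15 frames.
    print(sample_view_indices(fps=30.0, num_frames=300))
```

At 30 fps, 8 views at this spacing span about 3.5 s of video, so very short clips would have to be skipped or sampled more densely.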

Architecture: RayDer is a single feed-forward transformer backbone unifying three tasks: camera pose and per-view dynamic state estimation from input images, scene representation encoding, and novel view rendering conditioned on target poses and states. The model uses ViT-style layers with a shared backbone, adaptive normalization to indicate token roles, and incorporates a predicted nuisance dynamic state vector per input view (dimensionality ~d_state) to absorb time-varying content. The rendering decoder consumes the latent scene representation z plus target camera pose and dynamic state to yield the target image.

Training regime: AdamW optimizer is used with batch size 256, 256x256 resolution, and training over up to 500k steps on various fractions of SpatialVid (1%, 10%, 100%). Four model sizes are studied from 59M to 743M parameters by varying depth, width, and number of heads. Curriculum training employs random-order autoregressive conditioning over input views, including dynamic state dropout to ensure robustness to unknown dynamic states at inference. Losses are primarily photometric reconstruction between predicted and ground-truth novel views.
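
Two ingredients of the training regime, random-order autoregressive conditioning and dynamic state dropout, can be sketched in a few lines. This is an illustrative reading of the description above rather than the authors' implementation: the function names, the choice of zeroing a dropped state, and the dropout probability are all assumptions.

```python
import random


def random_order_schedule(num_views: int, rng: random.Random):
    """Return (context_views, target_view) pairs following a random view order.

    At step k the view order[k] is predicted from the k views that precede it
    in the shuffled order, so temporal frame order carries no information.
    """
    order = list(range(num_views))
    rng.shuffle(order)
    return [(order[:k], order[k]) for k in range(1, num_views)]


def drop_dynamic_state(states, p_drop: float, rng: random.Random):
    """Replace each per-view state vector by zeros with probability p_drop.

    Zeroing is a placeholder for "state unknown"; a learned null embedding
    would be another option (both are assumptions, not from the source).
    """
    return [[0.0] * len(s) if rng.random() < p_drop else list(s) for s in states]


if __name__ == "__main__":
    rng = random.Random(0)
    for context, target in random_order_schedule(8, rng):
        print(f"predict view {target} from views {context}")
    print(drop_dynamic_state([[0.3, -1.2], [0.8, 0.1]], p_drop=0.5, rng=rng))
```

Shuffling per training sample means the model cannot rely on "next frame" regularities, which matches the stated motivation of avoiding temporal shortcuts.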

Evaluation protocol: Metrics include PSNR, LPIPS, and SSIM for NVS quality, plus transfer-based camera pose accuracy metrics (recall at a 10-degree threshold, with a translation threshold of 0.1 m). Evaluations are zero-shot on unseen datasets to avoid train-test leakage. Ablations compare the following configurations: baseline (multi-network RayZer variant), dynamic state modeling, unified single network, parallel-target attention for computational efficiency, autoregressive pose conditioning, and high-resolution local layers. Statistical tests are not detailed, but the trends hold across multiple datasets and metrics.
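
For orientation, the two simplest metrics named above can be computed as follows. This is a generic sketch using standard definitions, not the paper's evaluation code: it assumes pixel values in [0, 1], measures rotation error as the geodesic angle between two rotation matrices, and leaves out both the translation threshold and the transfer step of the pose protocol.

```python
import math


def psnr(pred, target, max_val: float = 1.0) -> float:
    """PSNR in dB between two equal-length flat pixel sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def rotation_error_deg(r_a, r_b) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices.

    Uses trace(r_a^T r_b) = sum of elementwise products, then
    angle = arccos((trace - 1) / 2), clamped for numerical safety.
    """
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def recall_at(errors_deg, threshold_deg: float = 10.0) -> float:
    """Fraction of errors at or below the threshold."""
    return sum(e <= threshold_deg for e in errors_deg) / len(errors_deg)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    a = math.radians(30.0)
    rot_z = [[math.cos(a), -math.sin(a), 0.0], [math.sin(a), math.cos(a), 0.0], [0.0, 0.0, 1.0]]
    print(rotation_error_deg(identity, rot_z))
    print(recall_at([5.0, 12.0, 9.0, 30.0]))
```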

Reproducibility: Code and models are released publicly via GitHub (https://github.com/compvis/rayder) although some training data (SpatialVid) may be proprietary. Details on hyperparameters, seeds, and batch compositions are in the supplementary.

Concrete example: Starting from a RayZer baseline, training directly on dynamic SpatialVid videos diverges because dynamic content corrupts the pose representation. Introducing per-view dynamic state embeddings, predicted alongside the pose tokens, fully stabilizes training (Configs B and C). Next, consolidating the three subtasks into one transformer backbone (Config D) improves pose accuracy and novel view quality at the same time. To reduce training and inference compute, parallel-target attention masks enable caching. Further, training autoregressively over input views in randomized order forces the model to learn true geometric pose representations instead of temporal shortcuts, significantly improving pose accuracy and NVS quality (Config G). Adding local high-resolution attention layers recovers fine spatial detail without a large compute increase (Config H). This final model, RayDer-L (743M params), trained on full SpatialVid scale, achieves state-of-the-art zero-shot NVS performance on multiple standard benchmarks.

Technical innovations

Unified single transformer backbone integrating camera pose estimation, dynamic state prediction, scene reconstruction, and rendering to simplify scaling and improve performance over multi-network pipelines.
Introduction of a minimal, per-view dynamic state embedding treated purely as a nuisance factor to absorb time-varying scene content during training on dynamic videos, enabling stable training without modeling 4D scene dynamics at inference.
Random-order autoregressive pose prediction conditioning scheme that prevents models from exploiting frame-order shortcuts and improves geometric pose accuracy and novel view synthesis quality.
Parallel-target attention masking that enables efficient key-value caching and parallel novel view rendering, reducing per-target compute by a factor of ~7 with minimal quality loss.
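
One way to picture the parallel-target attention mask is as a block-structured boolean matrix. The sketch below is an assumption about how such a mask could look, consistent with the description above but not taken from the paper: context tokens attend only to context tokens (so their keys and values can be computed once and cached), and each target's tokens attend to the context plus their own target, never to other targets.

```python
def parallel_target_mask(num_context: int, num_targets: int, tokens_per_target: int):
    """Build mask[q][k], True when query token q may attend to key token k.

    Token layout (hypothetical): all context tokens first, then the tokens of
    target 0, target 1, and so on.
    """
    total = num_context + num_targets * tokens_per_target

    def group(i: int) -> int:
        # -1 marks a context token; otherwise the index of the target it belongs to.
        return -1 if i < num_context else (i - num_context) // tokens_per_target

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            # Every query sees the context; target queries also see their own target.
            mask[q][k] = group(k) == -1 or group(q) == group(k)
    return mask


if __name__ == "__main__":
    for row in parallel_target_mask(num_context=2, num_targets=2, tokens_per_target=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no context row depends on any target column, the context computation is independent of which targets are rendered, and because targets do not see each other, several novel views can be decoded in one pass.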

Datasets

SpatialVid — ~2.7M videos — large-scale real-world internet video with dynamic content (proprietary)
Segment Anything-Video (SA-V) — unspecified size — diverse open-world video dataset
SpatialVid-HQ (SV-HQ) — unspecified size — curated, partially dynamic-scene video dataset
RealEstate-10K (RE10K) — 10,000 sequences — annotated static-scene dataset for zero-shot testing
DL3DV-10k — 10,000 sequences — camera pose estimation transfer benchmark
CO3D — unspecified size — multi-view object-centric dataset
Static Mix — ~247k videos — combined public static-scene NVS datasets (RE10K, DL3DV-10K, uCO3D)

Baselines vs proposed

RayZer baseline (multi-network): NVS PSNR = 22.53 dB vs RayDer single-network: 25.33 dB (Config C to D on SA-V, Table 1)
Dynamic state modeling: training instability eliminated vs RayZer baseline divergence (Fig. 4, Table 1 Config A vs B,C)
Static dataset mix only: RE10K PSNR = 28.68 vs general SpatialVid training: 29.38 (Table 3)
RayDer Config E (parallel-target attention): NVS PSNR = 25.12 vs Config D (no parallel attention): 25.33 (minor quality loss for 7x speedup)
Autoregressive random order (Config G): NVS PSNR = 26.28, pose R@10° = 84.4 vs autoregressive ordered (Config F): 24.49 PSNR, 73.6 pose recall (Table 1)
Local high-res layers (Config H): NVS PSNR = 26.87 vs no local layers (G): 26.28 (Table 1)
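
PSNR differences of a fraction of a dB are easier to interpret as relative changes in mean squared error. The sketch below converts the gaps quoted above using the standard relation PSNR = -10·log10(MSE), under the simplifying assumption that the reported PSNR behaves like the PSNR of a single average MSE (benchmark PSNR is usually averaged per image, so the ratios are only indicative). The function and constant names are hypothetical.

```python
def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """Ratio new_MSE / old_MSE implied by a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


# (label, baseline PSNR, improved PSNR) as quoted in the comparisons above.
COMPARISONS = [
    ("Static Mix -> SpatialVid", 28.68, 29.38),
    ("Config F -> Config G", 24.49, 26.28),
    ("Config G -> Config H", 26.28, 26.87),
]


def summarize(comparisons=COMPARISONS):
    """Return (label, gain in dB, implied MSE ratio) for each comparison."""
    rows = []
    for label, before, after in comparisons:
        gain = after - before
        rows.append((label, round(gain, 2), round(mse_ratio_from_psnr_gain(gain), 3)))
    return rows


if __name__ == "__main__":
    for label, gain, ratio in summarize():
        print(f"{label}: +{gain:.2f} dB, MSE x{ratio:.3f}")
```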

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.31535.

Fig 9: Zero-shot qualitative samples of RayDer compared with E-RayZer [89] in (a) typical (non-dense view) NVS settings, (b) an

Fig 10: Scaling Across Data and Model Size. We evaluate models trained on SpatialVid (2.7M total samples) at different model

Fig 11: Compute-Optimal Scaling Analysis. RayDer’s compute-optimal performance (i.e., the compute-quality Pareto frontier)

Fig 12: Qualitative Scaling. RayDer’s qualitative behavior follows the trends seen in quantitative evals (Fig. 10): more data &

Fig 8: Final Architecture Overview. RayDer unifies camera estimation (a) and novel view synthesis (b) in a single transformer

Fig 6 (page 7).

Fig 7 (page 7).

Limitations

The dynamic state embedding captures moving content as a nuisance factor but does not reconstruct or explicitly model dynamic objects, limiting video reconstruction fidelity.
Training data (SpatialVid) is proprietary and dynamic content distribution may bias learned priors; full generalization across all video domains untested.
Computational cost of largest models and training on millions of videos may be prohibitive for many practitioners.
The autoregressive training scheme introduces complexity and potential train-test domain gaps, partially mitigated by random-order autoregression.
Evaluation focuses primarily on static-scene novel view synthesis; applicability to fully dynamic 4D scene reconstruction is out of scope.
No detailed adversarial robustness tests against deliberate camera or scene perturbations.

Open questions / follow-ons

How can the nuisance dynamic state embeddings be extended or disentangled to enable explicit dynamic scene reconstruction or 4D NVS?
What are optimal architectures or training strategies to reduce computational cost while maintaining scaling benefits on extremely large video corpora?
How does RayDer perform on videos with extreme scene dynamics, highly non-rigid motion, or severe occlusions beyond those seen in training?
Can the approach generalize to or incorporate other geometric supervision signals (e.g., sparse SfM, IMU) to further improve pose and rendering accuracy?

Why it matters for bot defense

From a bot-defense or CAPTCHA perspective, RayDer’s approach to jointly estimating camera pose and scene representation from unconstrained video containing dynamic objects is instructive for defenses relying on learned 3D scene understanding at scale. The method demonstrates how to stabilize training with nuisance dynamic embeddings, which could inspire defenses that leverage multi-view consistency and dynamic scene disentanglement to detect automation or spoofing attempts in visual tasks. Moreover, the unified architecture simplifying scaling could influence design of lightweight, scalable 3D vision models for real-time bot detection. However, applying such a large-scale synthesis model directly to CAPTCHA would require careful adaptation as RayDer targets static-scene novel view synthesis rather than interactive challenge generation or adversarial resistance specifically.

Cite

bibtex

@article{arxiv2605_31535,
  title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video},
  author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
  journal={arXiv preprint arXiv:2605.31535},
  year={2026},
  url={https://arxiv.org/abs/2605.31535}
}
