Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Source: arXiv:2605.31595 · Published 2026-05-29 · By Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han et al.

TL;DR

This paper addresses the challenges in dynamic 4D scene reconstruction from monocular video, specifically critiquing existing feed-forward methods that predict per-pixel 3D Gaussians for each frame. Such per-pixel approaches suffer from redundant Gaussian duplication and strong view-dependent biases that prevent effective modeling of global scene motion. The authors propose C4G, a novel feed-forward framework leveraging a compact set of learnable Gaussian query tokens conditioned on timestamps. These query tokens aggregate multi-frame features via a transformer decoder and decode a unified Gaussian representation whose position depends on the target timestamp, enabling globally coherent motion without per-scene optimization. A video diffusion model-based refinement further enhances rendering quality by recovering fine details.

Experiments across multiple dynamic scene datasets demonstrate C4G achieves competitive or superior novel-view synthesis despite using 0.007x the number of Gaussians compared to prior feed-forward methods, while not requiring camera pose input. The compact query-based design also yields stronger motion modeling and robustness to large temporal gaps. Analysis shows the emergent spatiotemporal attention learned by queries enables consistent 4D feature lifting, supporting downstream tasks like point tracking and dynamic scene understanding. Overall, C4G effectively overcomes key limitations of prior pixel-wise 4DGS approaches by learning a globally consistent, compact representation of dynamic scenes from monocular video in a fully feed-forward manner.

Key findings

C4G uses only ~2,000 Gaussians per scene (one per query token, N=2048), which is 0.007x or fewer than prior pixel-wise methods (e.g., 802K Gaussians in NeoVerse).
On multiple datasets (DyCheck, ADT, TUM-Dynamics, NVIDIA), C4G achieves PSNR gains of roughly 1.5-4 dB over NeoVerse and pose-required methods, along with higher SSIM and lower LPIPS (e.g., on TUM-Dynamics, PSNR=19.52 vs 15.26 for NeoVerse).
C4G's novel-view synthesis quality holds up as the temporal gap ∆t between input frames grows from 2 to 8: its PSNR degrades by only ~1.3 dB, whereas 4DGT degrades by 3 dB at ∆t=8.
Learnable query tokens in the transformer decoder attend spatially to consistent scene regions across all frames and temporally to frames near the target timestamp.
Rendering enhancement using a video diffusion model reduces ghost artifacts and occlusion holes, resulting in cleaner, sharper novel view outputs.
4D feature lifting yields temporally and spatially consistent feature fields, improving point tracking accuracy by >30% over vanilla foundation model features (e.g., DINOv3).
Augmenting Gaussians with lifted foundation model semantics enables open-vocabulary dynamic scene understanding, outperforming static scene baselines on DAVIS with mIoU of 0.63 vs 0.55 for LSeg.
Sinusoidal time embeddings (dim 256) outperform rotary embeddings for timestamp conditioning, improving PSNR by ~3 dB on DyCheck.
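
The sinusoidal timestamp embedding mentioned in the last finding can be sketched as follows. This is a minimal illustration of the standard transformer-style sinusoidal encoding applied to a scalar timestamp, not the authors' code: the function name, the `max_period` constant, and the sine-then-cosine layout are assumptions, and only the dimension of 256 comes from the summary above.

```python
import math

def sinusoidal_time_embedding(t: float, dim: int = 256,
                              max_period: float = 10000.0) -> list[float]:
    """Embed a scalar timestamp t into a dim-dimensional vector.

    Hypothetical helper: the frequency spacing (max_period) is an assumption.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1.0 and decaying.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_time_embedding(0.5)
print(len(emb))  # 256
```

Because the encoding is a fixed function of t, it can be evaluated at any target timestamp, including ones that fall between input frames.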

Threat model

This is not a security paper, and no adversary is modeled. The paper assumes monocular video input free of adversarial manipulation; its focus is generalizing feed-forward dynamic scene reconstruction without per-scene optimization.

Methodology — deep read

Threat model & assumptions: The model is designed for feed-forward 4D reconstruction of dynamic scenes from monocular video, assuming access to video frames with timestamps but no camera pose information. The adversary model is not explicitly defined; the focus is on robustly modeling scene motion despite missing multi-view geometry and large temporal gaps without per-scene optimization.
Data: Training uses multiple video datasets: Spring, Kubric, and RealEstate10K. Evaluation uses DyCheck, ADT, TUM-Dynamics, and NVIDIA, which cover different dynamic scene scenarios. Input frames are resized to 224x224, and each input sequence consists of T timestamped frames. Supervision comes from the RGB frames and their timestamps; no camera poses or ground-truth 3D geometry are used.
Architecture / algorithm: C4G consists of a visual feature extractor, a learnable query-based 4D Gaussian decoder, and a video diffusion model (VDM) refinement module.

Visual feature extractor: a VGGT encoder extracts geometry-aware multi-frame features Ft from each frame It, also injected with learnable per-frame timestamp embeddings.
Gaussian decoder DG: takes the learnable queries Q (N=2048 tokens) concatenated with all feature maps and processes them through L=2 transformer layers with self-attention, conditioned on a target timestamp tb via learned timestamp embeddings. For each token, the decoder outputs temporally modulated 3D Gaussian parameters (position mean µi, opacity σi, covariance Σi, and RGB color with spherical harmonics of degree 0).
VDM refinement: a pretrained Wan2.1-VACE-1.3B video diffusion model is fine-tuned with context frames and rendered frames as inputs. It uses ControlNet-style conditioning to enhance the rendered outputs post hoc via iterative denoising.
Feature lifting decoder DF: reuses Gaussian decoder attention patterns, taking arbitrary foundation model features as inputs and producing aligned per-Gaussian 4D feature attributes for downstream tasks.
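
The idea of a fixed set of queries pooling information from every frame can be illustrated with a toy example. The sketch below is not the C4G decoder: it uses a single attention head with identity projections, random vectors in place of learned queries and VGGT features, and omits timestamp conditioning and the Gaussian parameter heads. All names and sizes are hypothetical; only the pattern of concatenating queries with frame tokens and applying self-attention follows the description above.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in seq:
        row = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in seq])
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, seq)) for j in range(dim)])
    return outputs, weights

random.seed(0)
DIM = 8
NUM_QUERIES = 4                       # the summary reports N=2048; tiny here
NUM_FRAMES, TOKENS_PER_FRAME = 3, 5   # toy sizes, not from the paper

def rand_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]

queries = rand_tokens(NUM_QUERIES)
frame_tokens = rand_tokens(NUM_FRAMES * TOKENS_PER_FRAME)

# Concatenate queries with the tokens of all frames, attend, keep query rows.
sequence = queries + frame_tokens
outputs, weights = self_attention(sequence)
query_outputs = outputs[:NUM_QUERIES]
print(len(query_outputs), len(weights[0]))  # 4 19
```

The number of query outputs stays at `NUM_QUERIES` no matter how many frames are appended, which is the property that keeps the Gaussian count independent of video length and resolution.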

Training regime:

Batch size of 1 per GPU on 4 NVIDIA H100 GPUs.
AdamW optimizer with learning rates 1e-5 for transformer decoder, 1e-7 for backbone, cosine annealing schedule.
Loss: photometric loss combining MSE and LPIPS; auxiliary depth and normal losses from MoGe-2; tracking loss from CowTracker.
Initialized from pretrained C3G static scene weights to stabilize dynamic modeling.
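
The cosine annealing schedule with two learning rates can be sketched as below. This assumes the standard cosine annealing formula decaying to a minimum of zero; the total step count, the minimum learning rate, and the absence of warmup are assumptions, since the summary gives only the base rates (1e-5 for the decoder, 1e-7 for the backbone).

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       base_lr: float, min_lr: float = 0.0) -> float:
    """Standard cosine annealing from base_lr down to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Base rates from the summary; the step count is a hypothetical placeholder.
BASE_LRS = {"decoder": 1e-5, "backbone": 1e-7}
TOTAL_STEPS = 1000

for step in (0, 500, 1000):
    lrs = {name: cosine_annealed_lr(step, TOTAL_STEPS, lr)
           for name, lr in BASE_LRS.items()}
    print(step, lrs)
```

Keeping the backbone rate two orders of magnitude below the decoder rate means the pretrained features change only slightly while the decoder adapts to dynamic scenes.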

Evaluation protocol:

Novel-view synthesis assessed by PSNR, SSIM, LPIPS on holdout datasets.
Comparisons to per-scene optimized and feed-forward baselines, with and without pose data.
Temporal robustness tested by varying strides ∆t between input frames.
Qualitative visual comparisons and ablation on loss terms and embedding choices.
4D feature lifting validated via temporal-invariance metrics and dynamic scene semantic segmentation mIoU.
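
For reference, PSNR is derived from mean squared error as 10·log10(MAX² / MSE). The sketch below computes it over flat lists of pixel values in [0, 1]; the function name and the list-based image representation are illustrative choices, not the paper's evaluation code.

```python
import math

def psnr(pred: list[float], target: list[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. about 20 dB.
value = psnr([0.5] * 16, [0.6] * 16)
print(round(value, 2))
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.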

Reproducibility: The paper does not explicitly mention public code or weights. The datasets are publicly known, and the method builds on pretrained components: the VGGT backbone, the video diffusion model, and auxiliary models such as MoGe-2 for depth supervision. Given the detailed architectural and training hyperparameters, replication should be feasible.

End-to-end example: Given a monocular video sequence, C4G first extracts geometry-aware features with timestamp embeddings. The transformer decoder takes 2048 learnable queries and aggregates features globally across all frames, outputting a set of 3D Gaussians temporally modulated by a target timestamp tb. This avoids per-pixel duplication and yields a compact global representation encoding dynamic motion. The Gaussians are rendered into a novel view image, which is post-processed by the video diffusion model to recover fine details. Finally, foundation model features are lifted into the same Gaussian tokens for constructing 4D feature fields used in tracking and semantics.

Technical innovations

Introduction of learnable timestamp-conditioned Gaussian query tokens enabling compact global aggregation of multi-frame features for feed-forward 4D dynamic reconstruction.
Time embedding conditioning on both input features and query tokens to decode temporally coherent Gaussians that represent the scene at any target timestamp.
Rendering refinement via a video diffusion model conditioned on context frames and rendered frames to enhance high-frequency details without per-scene optimization.
First feed-forward 4D feature lifting decoder reusing Gaussian decoder attention patterns to produce consistent 4D feature fields augmenting each Gaussian with semantic attributes.

Datasets

DyCheck — size not explicitly stated — public dynamic video benchmark with multiple synchronized cameras.
ADT — size not specified — dynamic monocular video dataset.
TUM-Dynamics — size not specified — dynamic scene video dataset.
NVIDIA Dynamic Dataset — size not specified — dynamic video sequences.
Spring — training — publicly available complex scenes dataset.
Kubric — training — procedurally generated synthetic dataset.
RealEstate10K — training — large monocular video dataset of real scenes.

Baselines vs proposed

Shape of Motion (per-scene, pose) PSNR=14.13 on DyCheck vs Ours PSNR=15.64
MoSca (per-scene, pose) PSNR=11.93 on DyCheck vs Ours PSNR=15.64
4DGT (feed-forward, pose) PSNR=12.15 on DyCheck vs Ours PSNR=15.64
MoVieS (feed-forward, pose) PSNR=11.99 on DyCheck vs Ours PSNR=15.64
NeoVerse (feed-forward, no pose) PSNR=11.90 on DyCheck vs Ours PSNR=15.64
TUM-Dynamics novel view synthesis at ∆t=6: 4DGT PSNR=17.27 vs Ours PSNR=19.52
TUM-Dynamics novel view synthesis at ∆t=8: 4DGT PSNR=16.27 vs Ours PSNR=19.23
Average temporal consistency of DINOv3-L features for tracking: 23.2 without lifting vs 39.3 with C4G feature lifting
Semantic segmentation (mIoU) on DAVIS: LSeg 0.55 vs Ours 0.63
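
The margins implied by the numbers above can be checked with a few lines of arithmetic. The snippet only restates values listed in this summary (the DyCheck PSNRs, N=2048 queries, and NeoVerse's 802K Gaussians); variable names are illustrative.

```python
# DyCheck PSNR values as listed above.
OURS_DYCHECK = 15.64
BASELINES_DYCHECK = {
    "Shape of Motion": 14.13,
    "MoSca": 11.93,
    "4DGT": 12.15,
    "MoVieS": 11.99,
    "NeoVerse": 11.90,
}

# PSNR margin of C4G over each baseline, in dB.
gaps = {name: round(OURS_DYCHECK - score, 2)
        for name, score in BASELINES_DYCHECK.items()}
for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} dB")

# Gaussian count ratio: 2048 query tokens vs 802K per-pixel Gaussians.
ratio = 2048 / 802_000
print(f"Gaussian count ratio: {ratio:.4f}x")
```

On DyCheck the margin ranges from about 1.5 dB (Shape of Motion) to about 3.7 dB (NeoVerse), and the count ratio comes out well under the 0.007x stated in the key findings.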

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.31595.

Fig 1: Failures of pixel-wise feed-forward 4D reconstruction [102, 59, 104]. (a) Duplicated

Fig 2: Pixel-wise 4DGS vs.

Limitations

The method requires pretrained backbones and foundation models which may not be available or optimal in all contexts.
No explicit adversarial robustness or attack evaluation is presented, leaving open security concerns of model manipulation.
The approach currently fixes spherical harmonic degree to 0, which might limit view-dependent color modeling and could affect fidelity in some scenarios.
Training and evaluation use only certain dynamic scene benchmarks; generalization to arbitrary or highly complex scenes is not fully studied.
Dependency on quality of foundation models for depth, normals, and tracking supervision may limit applicability where these models underperform.
Lack of publicly released code or pretrained weights could hinder reproducibility for external researchers.
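
To make the spherical-harmonics limitation concrete: at degree 0 the only basis function is the constant Y_0^0 = 1/(2√π), so a Gaussian's color cannot vary with viewing direction. The sketch below uses the offset-and-scale convention common in 3D Gaussian Splatting implementations (rgb = 0.5 + C0 · coefficient); whether C4G follows exactly this convention is an assumption, and the function name is hypothetical.

```python
import math

SH_C0 = 0.5 / math.sqrt(math.pi)  # Y_0^0, approximately 0.2820948

def sh0_color(dc, view_dir):
    """Evaluate degree-0 SH color; view_dir is accepted but has no effect."""
    del view_dir  # degree 0 has no directional terms
    return tuple(min(max(0.5 + SH_C0 * c, 0.0), 1.0) for c in dc)

dc = (1.0, 0.0, -1.0)  # illustrative per-channel DC coefficients
front = sh0_color(dc, (0.0, 0.0, 1.0))
side = sh0_color(dc, (1.0, 0.0, 0.0))
print(front == side)  # True
```

Higher SH degrees add direction-dependent terms that can capture effects such as specular highlights, at the cost of more parameters per Gaussian.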

Open questions / follow-ons

How does C4G perform under strong occlusions or extreme view changes beyond tested datasets?
Can the query-based approach be extended to handle multi-view or multi-camera input for improved geometry?
What are the trade-offs in increasing query token number or transformer depth for scaling to very large scenes?
How robust is the method to noisy or imperfect monocular input videos, such as motion blur or lighting changes?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners, C4G's approach exemplifies advantages of using compact, temporally coherent global representations to encode dynamic visual information efficiently, potentially informing defenses against automated attacks that rely on simulating or reconstructing 3D scenes dynamically. Its feed-forward inference and pose-free design highlight deployment benefits where scene parameters are unavailable or costly to estimate. The feature lifting module producing consistent 4D embeddings over time could inspire new ways to analyze temporal consistency or detect synthetic manipulations in video CAPTCHA challenges. However, no direct security or adversarial evaluation is provided, so adaptation to bot-detection contexts would require additional robustness analysis. Overall, C4G’s architecture suggests a promising direction for dynamic scene representation that balances efficiency, temporal coherence, and spatial fidelity without per-scene optimization.

Cite

bibtex

@article{arxiv2605_31595,
  title={Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction},
  author={Mungyeom Kim and Minkyeong Jeon and Honggyu An and Jaewoo Jung and Hyuna Ko and Jisang Han and Hyeonseo Yu and Donghwan Shin and Sunghwan Hong and Takuya Narihira and Kazumi Fukuda and Yuki Mitsufuji and Seungryong Kim},
  journal={arXiv preprint arXiv:2605.31595},
  year={2026},
  url={https://arxiv.org/abs/2605.31595}
}
