GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Source: arXiv:2606.03921 · Published 2026-06-02 · By Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

TL;DR

GARDEN addresses the enduring challenge of converting multi-view RGB images into 3D scene reconstructions that are both visually faithful and physically structured for downstream simulation and interaction. Existing pipelines tend to produce monolithic, gravity-ambiguous scene representations where foreground objects are entangled with background geometry, inhibiting stable physics simulation. Current solutions often replace reconstructed objects with retrieved CAD models, sacrificing instance-specific fidelity and incurring costly retrieval and alignment steps. GARDEN introduces a physically-grounded factorization approach that explicitly leverages gravity as a universal physical prior to align reconstructions into a gravity-view coordinate frame. This disentangles rigid foreground objects from the background, outputting a structured hybrid scene representation comprising standalone object meshes with accurate 6-DoF poses and a clean high-fidelity background.

Key findings

  • GARDEN reduces object-centric RMSE from 0.2163 (LiteReality) to 0.1880 on 3DGS background after simulation (Table 1).
  • Post-simulation LPIPS metric improves from 0.5854 (LiteReality) to 0.4444 (GARDEN point cloud) for object fidelity.
  • Holistic scene RMSE improves from 0.2664 to 0.1616, and LPIPS from 0.6522 to 0.4081, after 10-second physics simulation (Table 2).
  • GARDEN’s gravity direction prediction achieves a mean angular error of 1.40° on Hypersim and 1.56° on TartanAir, outperforming all baselines, with 0% of predictions exceeding 10° of error (Table 3).
  • Removing gravity-view alignment degrades 6-DoF pose accuracy and simulation stability (Fig. 4a, ablation Table 4).
  • GARDEN cuts processing time from 4330 s (LiteReality) to 560 s, a speedup of roughly 7.7x.
  • Conditional 3D point classification effectively removes duplicated object geometry from background, avoiding scene entanglement.
  • Physics simulation with MuJoCo confirms stable 10-second rollouts with gravity aligned to +y-axis, validating physical correctness of reconstruction.
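
The headline numbers above are easier to compare as relative changes. The short sketch below recomputes them from the figures quoted in this list; the helper name `relative_reduction` and the dictionary layout are illustrative and do not come from the paper.

```python
# Values copied from the Key findings list; lower is better for RMSE and LPIPS.
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` lowers `baseline`."""
    return (baseline - proposed) / baseline


reported = {
    "object RMSE (3DGS, post-sim)": (0.2163, 0.1880),
    "object LPIPS (point cloud, post-sim)": (0.5854, 0.4444),
    "scene RMSE (post-sim)": (0.2664, 0.1616),
    "scene LPIPS (post-sim)": (0.6522, 0.4081),
}

for name, (baseline, proposed) in reported.items():
    print(f"{name}: {relative_reduction(baseline, proposed):.1%} lower")

# Runtime: LiteReality 4330 s vs GARDEN 560 s.
speedup = 4330 / 560
print(f"runtime speedup: {speedup:.2f}x")
```

On these figures the object-level RMSE gain is the smallest (about 13%), while the holistic scene metrics improve by well over a third.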

Threat model

The adversary modeled is the ambiguity and noise in pure RGB-based multi-view reconstruction, which causes global rotation ambiguity and object-background entanglement that preclude stable physical simulation. The system assumes no external sensor data or adversarial input. The threat is uncertainty in coordinate system alignment and geometry factorization leading to physically inconsistent reconstructions that hinder downstream robot interaction or simulation.

Methodology — deep read

The authors address the problem of reconstructing multi-view RGB scenes into 3D environments that support physical simulation by resolving three key challenges: arbitrary global rotation ambiguity, entanglement of rigid foreground objects with background geometry, and lack of physical grounding for stable object placement.

Threat Model and Assumptions: The adversary in this context is the ambiguity and noise inherent in multi-view RGB reconstruction, including errors in pose estimation and depth inference. The system assumes RGB-only input without external sensors like IMUs or depth cameras. No malicious adversarial attacks are modeled.

Data: Training uses large simulation datasets with gravity-aligned coordinate ground truths, including Hypersim, TartanAir, vKitti, and InternScenes. InternScenes with paired object and scene meshes enables supervised learning for 3D conditional point classification. The system is also evaluated on real-world multi-view RGB scenes.

Architecture: The pipeline has three core components. Gravity-View Alignment regresses a rotation, expressed in a 6D representation, that aligns the reconstructed scene to a gravity-consistent frame, using global camera tokens derived from the multi-view foundation model DepthAnything-3. Object Generation feeds user-supplied or vision-language-generated 2D bounding boxes into SAM-3/3D for amodal mesh reconstruction, followed by FoundationPose for 6-DoF pose refinement constrained by the gravity frame. A conditional 3D point classification Transformer takes cropped scene points and mesh-sampled points as input and classifies duplicate object geometry so that it can be removed from the background.

Training Regime: Gravity-View Alignment module is trained for 10k steps with AdamW optimizer on a cluster of 32 NVIDIA H20 GPUs at a learning rate of 5e-6. The conditional point classifier is trained for 2k steps with a 5e-4 learning rate. Training includes augmentations emulating real-world reconstruction artifacts such as ghosting, deformation, scanning jitter, and pose/scale disturbances.

Evaluation Protocol: Metrics include object-level RMSE, SSIM, and LPIPS, holistic scene-level metrics, gravity estimation angular error, and computational latency. Object and scene renderings are assessed both statically and after a 10-second physics simulation rollout in MuJoCo, which penalizes unstable or physically inconsistent pose placements. Gravity estimation is compared against classical RANSAC and Manhattan-frame methods on held-out scenes. Ablations systematically remove the gravity alignment and factorization components.
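
The gravity metrics in this protocol (mean angular error and the Fail@10° rate) can be sketched as follows. This is a minimal reading of the metric names, assuming Fail@10° is the fraction of scenes whose angular error exceeds 10°; the function names and the sample error values are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def angular_error_deg(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Angle in degrees between a predicted and a ground-truth direction."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def summarize(errors_deg: Sequence[float], threshold_deg: float = 10.0) -> tuple[float, float]:
    """Return (mean error, fraction of errors above the threshold)."""
    mean = sum(errors_deg) / len(errors_deg)
    fail_rate = sum(e > threshold_deg for e in errors_deg) / len(errors_deg)
    return mean, fail_rate


# Made-up per-scene errors, purely to exercise the functions.
mean_err, fail_at_10 = summarize([1.0, 2.0, 12.0])
```

A mean error alone can hide rare catastrophic failures, which is why a thresholded failure rate is usually reported alongside it.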

Reproducibility: The paper does not explicitly mention public code or pre-trained weights release. Dataset InternScenes is referenced as non-public large-scale simulation data. Training recipes, hyperparameters, and hardware details are documented.

Concrete Example: For a given multi-view RGB indoor scene, the system extracts per-view global tokens from DepthAnything-3 and passes them through a Transformer decoder to estimate a rotation matrix that aligns the reconstruction to gravity. A 2D bounding box from a user or a vision-language model then triggers SAM-3/3D to generate an initial object mesh, and FoundationPose refines this mesh’s 6-DoF pose under the gravity-frame constraint. A point classification Transformer receives cropped global points together with points sampled from the conditioning object mesh and predicts per-point membership probabilities, separating object points from background points and pruning the duplicates. The final scene, with independent object meshes and a cleaned background, is exported to MuJoCo for physics simulation. RGB renderings from the simulator confirm visually faithful, physically stable results.
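
The pruning step in this example can be sketched as a simple filter over per-point membership probabilities. The classifier itself is not reproduced here; the 0.5 threshold and the function name `prune_background` are assumptions made for illustration, since the summary does not state how probabilities are turned into a keep/remove decision.

```python
from typing import Sequence

Point = tuple[float, float, float]


def prune_background(
    points: Sequence[Point],
    object_prob: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> tuple[list[Point], list[Point]]:
    """Split cropped scene points into (kept background, removed duplicates).

    `object_prob[i]` is the predicted probability that `points[i]` belongs to
    the object whose mesh conditioned the classifier.
    """
    if len(points) != len(object_prob):
        raise ValueError("points and object_prob must have the same length")
    background: list[Point] = []
    removed: list[Point] = []
    for point, prob in zip(points, object_prob):
        (removed if prob >= threshold else background).append(point)
    return background, removed


crop = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0), (0.5, 0.5, 0.5)]
probs = [0.05, 0.92, 0.40]
kept, dropped = prune_background(crop, probs)
```

Points flagged as object geometry are dropped from the background because the standalone object mesh already represents them; keeping both would leave two copies of the same surface in the simulated scene.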

Technical innovations

  • Use of gravity direction as a physical prior to align multi-view RGB reconstructions into a unified Gravity-View coordinate frame, resolving global rotation ambiguity.
  • A CAD-free pipeline producing explicit object meshes with accurate 6-DoF poses for independent rigid-body manipulation within the gravity-aligned space.
  • Transformer-based conditional 3D point classification module that removes duplicated object geometry from background point clouds using mesh-conditioned features.
  • Dual-representation hybrid scene combining explicit rigid-body meshes for physics and cleaned point cloud or 3D Gaussian Splatting for high-fidelity rendering.

Datasets

  • Hypersim — large-scale synthetic indoor scenes used for gravity alignment training and evaluation — public
  • TartanAir — synthetic multi-view indoor/outdoor scenes for gravity estimation evaluation — public
  • vKitti — synthetic outdoor scenes with ground-truth pose used for Gravity-View Alignment training — public
  • InternScenes — large-scale simulation scenes with paired object and environment meshes for supervised point-level classification — non-public

Baselines vs proposed

  • LiteReality: object-centric RMSE = 0.2163 vs GARDEN (3DGS, post-sim) RMSE = 0.1880
  • LiteReality: object-centric LPIPS = 0.5854 vs GARDEN (point cloud, post-sim) LPIPS = 0.4444
  • LiteReality: holistic scene RMSE = 0.2664 vs GARDEN (point cloud, post-sim) RMSE = 0.1616
  • Plane-RANSAC gravity estimation mean angular error = 24.87° (Hypersim) vs GARDEN GV predictor = 1.40°
  • COLMAP’s Manhattan alignment Fail@10° = 0% (Hypersim) vs GARDEN GV predictor Fail@10° = 0%
  • Removing Gravity-View alignment degrades 6-DoF pose accuracy (NVS ablation: RMSE worsens from 0.2751 to 0.2961)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03921.

Fig 1: GARDEN factorizes unstructured RGB images into a structured, gravity-

Figs 2-8 (page 2).

Limitations

  • The approach depends on the availability of approximate gravity direction during training via synthetic datasets; transfer to arbitrary real-world orientations may degrade.
  • Conditional 3D point classification may struggle in cluttered scenes with highly similar or repeated object instances causing segmentation errors.
  • The method currently requires user or vision-language input to specify target objects, which prevents fully autonomous scene factorization.
  • No evaluation or robustness testing against adversarial or out-of-distribution multi-view images was reported.
  • Real-world datasets used for evaluation are limited and non-public, affecting reproducibility and generalization estimates.
  • The system focuses on rigid objects and static backgrounds, not addressing dynamic scenes or deformable objects.

Open questions / follow-ons

  • How well does gravity alignment generalize to outdoor or non-upright scenes where gravity direction may be ambiguous or dynamically changing?
  • Can the conditional point classification module scale to highly cluttered indoor scenes with repeated object instances?
  • How can target objects be selected fully autonomously, without user or vision-language guidance, in large scenes?
  • What is the impact of noisy camera pose estimates or calibration errors on the robustness of gravity-view alignment and reconstruction fidelity?

Why it matters for bot defense

From a bot-defense perspective, GARDEN’s method of physically-grounded 3D scene factorization highlights the importance of embedding real-world physical constraints—like gravity alignment and object disentanglement—into visual data representations. This physical interpretability could inspire more robust CAPTCHA or bot-detection mechanisms that rely on 3D spatial reasoning or physics simulation consistency tests rather than pure 2D image features. Additionally, GARDEN’s elimination of CAD retrieval steps in favor of direct, geometry-grounded reconstruction offers a pathway for more efficient, attack-resistant scene understanding pipelines. However, the system’s dependency on multi-view inputs and 3D reconstruction limits its immediate applicability in typical web-based bot challenges, which tend to be single-image and lightweight. Integrating physical priors such as gravity alignment could nonetheless inspire next-generation CAPTCHAs requiring deeper interaction realism verification.

Cite

bibtex
@article{arxiv2606_03921,
  title={GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images},
  author={Jiahao Sun and Dingkun Wei and Zehong Shen and Hongyu Zhou and Yujun Shen and Liang Li},
  journal={arXiv preprint arXiv:2606.03921},
  year={2026},
  url={https://arxiv.org/abs/2606.03921}
}

Articles are CC BY 4.0 — feel free to quote with attribution