City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

Source: arXiv:2605.30310 · Published 2026-05-28 · By Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

TL;DR

City-Mesh3R addresses the challenge of reconstructing high-fidelity, watertight 3D meshes at city scale from large, unordered collections of multi-view images. Existing methods such as NeRF and Gaussian Splatting often produce incomplete, noisy, or irregular surfaces that are unsuitable for downstream simulation, and they are computationally expensive at this scale. Small-scale reconstruction techniques do not scale to city-scale scenes involving thousands to millions of images. The paper proposes a scalable distributed pipeline that splits the problem into hierarchical sparse and dense reconstruction stages, enabling end-to-end image-to-mesh reconstruction without exhaustive pairwise matching and with efficient merging of partial results.

The key innovations are topological image clustering that partitions images into overlapping groups, independent cluster-wise sparse SfM with hierarchical merging, geometry-aware spatial partitioning for dense reconstruction, and curvature-adaptive remeshing that allocates mesh complexity where it is needed. Empirically, City-Mesh3R outperforms the state-of-the-art city-scale methods CityGS-v2 and CityGS-X on several urban scenes from the GauU-Scene and UrbanScene3D datasets, achieving higher precision, recall, and F1 scores for surface reconstruction. Its runtime is far below CityGS-v2's, though somewhat above CityGS-X's. Qualitative results show improved mesh regularity, watertightness, and fine structural detail, supporting the method's suitability for simulation and urban planning applications.

Key findings

City-Mesh3R achieves higher Precision, Recall, and F1 scores than CityGS-v2 and CityGS-X on CUHK-LOWER (Precision=0.142, Recall=0.091, F1=0.111) and SZIIT (Precision=0.1925, Recall=0.0647, F1=0.0968) scenes from GauU-Scene.
Runtime on the evaluated scenes is 83-113 minutes, well below CityGS-v2's 296-371 minutes but above CityGS-X's 74-94 minutes.
Sparse SfM step using clustering reduces 2D reprojection error to 1.38 pixels with only 4.373% image rejection, outperforming N-cut (1.441px, 9.78% rejection) and agglomerative clustering (1.563px, 10.27% rejection).
Proposed mesh refinement stage (Poisson + mesh optimization/remeshing) improves F1 from 0.029 to 0.0968 on SZIIT and from 0.071 to 0.111 on CUHK-LOWER compared to Poisson reconstruction only.
Adaptive curvature-guided remeshing yields better mesh quality and optimization stability compared to uniform Continuous Remeshing, improving F1 score on CUHK-LOWER from 0.0861 to 0.111 and on SZIIT from 0.0542 to 0.0968 under a fixed vertex budget.
The distributed divide-and-conquer approach avoids exhaustive pairwise matching, which is what lets the pipeline scale to much larger city scenes.
Overlapping image clustering with SLPA enables robust sparse SfM map merging, improving global map coherence and reducing computational overhead.
Mesh stitching through Delaunay triangulation and boundary-aware refinement produces watertight global meshes with smooth transitions across partition boundaries.
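
As a quick consistency check on the numbers above, F1 is the harmonic mean of precision and recall. The snippet below is a small sketch that recomputes F1 from the quoted precision/recall pairs; the function name `f1_score` is our own and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the key findings above.
reported = {
    "CUHK-LOWER": (0.142, 0.091, 0.111),
    "SZIIT": (0.1925, 0.0647, 0.0968),
}

for scene, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{scene}: computed F1 = {f1:.4f}, reported F1 = {f1_reported}")
```

Both recomputed values agree with the reported F1 scores to the precision quoted in the summary.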

Threat model

n/a; the paper is a large-scale 3D reconstruction system paper focused on accuracy and scalability rather than security or adversarial threat modeling.

Methodology — deep read

The methodology consists of several tightly integrated steps designed for scalability and high-quality reconstruction.

Threat Model & Assumptions: The primary challenge is large-scale city reconstruction from unordered multi-view images capturing complex urban geometry. The framework assumes access only to these images and no prior GPS or external localization. Adversaries and security threats are not explicitly modeled; focus is on reconstruction fidelity and computational scalability.
Data: Experiments use large-scale urban datasets: GauU-Scene (4 scenes: CUHK-LOWER, CUHK-UPPER, LFLS, SZIIT) and UrbanScene3D (Residence scene). Each scene contains thousands of images covering city-scale extents with diverse building types and scales. Ground-truth 3D meshes or reconstructions from competing methods serve as references. Images are unordered and unlabeled. Quantitative evaluation uses reprojection errors and 3D surface quality metrics.
Architecture / Algorithm:
- Global Feature Extraction: Images are encoded using DINOv2, a vision transformer, to produce global descriptors for efficient similarity computation.
- Image Similarity Graph & Clustering: Cosine similarity thresholding forms edges in a graph of images. The Speaker-Listener Label Propagation Algorithm (SLPA) produces overlapping clusters of images to enable distributed processing with cluster overlap for robust merging.
- Cluster-wise Sparse SfM: Each image cluster undergoes sparse Structure-from-Motion reconstruction independently using local features restricted to cluster edges and COLMAP as the backend mapper. Reconstructions are aligned and merged using Sim(3) transformations estimated by RANSAC on shared camera poses and track merging with outlier filtering.
- Sparse Map Spatial Partitioning: The merged global sparse point cloud is projected onto a dominant support plane and partitioned into a regular grid with overlap. This spatial partitioning is distinct from topological clustering and aids dense reconstruction.
- Geometry-aware Camera Selection: For each spatial partition, cameras observing points are scored using triangulation quality and viewpoint complementarity, selecting a compact, informative subset rather than all posed cameras.
- Dense Reconstruction and Surface Initialization: Using the fixed SfM poses of the selected cameras and cached MASt3R depth predictions, dense depth maps are aligned to the global scale per view and refined with multi-view reprojection consistency. These are fused volumetrically with TSDF integration, then converted to oriented points.
- Mesh Extraction: Screened Poisson Surface Reconstruction converts the oriented points into a watertight initial mesh.
- Mesh Refinement: The initial mesh is optimized via differentiable rendering, minimizing silhouette and normal losses against monocular normal estimates using the Isotropic Adam optimizer. Curvature-adaptive remeshing varies the target edge length spatially, concentrating mesh density in high-curvature areas, while a speed-aware slack schedule controls refinement over iterations to maintain memory efficiency and stability.
- Mesh Stitching: Overlapping partition meshes are clipped to non-overlapping exterior regions. The intersection overlap is retessellated by Delaunay triangulation on shared vertices. Boundary-aware topological refinement stitches partitions to a global watertight mesh.
Training Regime: The method involves no training beyond pretrained models (e.g., DINOv2 features, monocular normal estimators). Mesh optimization uses iterative differentiable rendering with vertex position updates via Adam, run over multiple iterations until convergence and guided by stabilization metrics.
Evaluation Protocol: Reconstruction quality is quantified by Precision, Recall, and F1 metrics computed on reconstructed mesh surfaces against references. Runtime is measured for the entire pipeline. Ablations compare clustering methods (SLPA, N-cut, agglomerative), mesh refinement versus baseline Poisson, and adaptive remeshing versus uniform remeshing under vertex budget constraints. Qualitative comparisons visualize mesh quality and normal smoothness.
Reproducibility: Implementation details include using COLMAP, MASt3R for feature matching, DINOv2 for feature extraction, and existing depth prediction networks. Hardware is dual NVIDIA A6000 Ada GPUs with 48GB memory each. Neither a code release nor pretrained weights are explicitly mentioned; the datasets are public benchmarks.
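
To make the image-clustering step described above more concrete, here is a minimal sketch of a cosine-similarity graph followed by a basic Speaker-Listener Label Propagation (SLPA) pass. It is an illustration under stated assumptions, not the authors' implementation: the descriptors are toy 3-D vectors standing in for DINOv2 features, and the threshold, iteration count, and membership cutoff `r` are hypothetical values chosen for the example.

```python
import random
from collections import Counter

import numpy as np


def similarity_graph(desc: np.ndarray, threshold: float) -> dict[int, set[int]]:
    """Connect images whose descriptors have cosine similarity >= threshold."""
    unit = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(desc)
    return {i: {j for j in range(n) if j != i and sim[i, j] >= threshold}
            for i in range(n)}


def slpa(adj: dict[int, set[int]], iterations: int = 20, r: float = 0.2,
         seed: int = 0) -> list[set[int]]:
    """Basic SLPA: returns possibly overlapping clusters of node ids."""
    rng = random.Random(seed)
    # Every node starts with a memory holding only its own label.
    memory = {v: Counter({v: 1}) for v in adj}
    order = list(adj)
    for _ in range(iterations):
        rng.shuffle(order)
        for listener in order:
            heard = Counter()
            for speaker in sorted(adj[listener]):
                # Speaker utters a label sampled by frequency in its memory.
                labels, counts = zip(*sorted(memory[speaker].items()))
                heard[rng.choices(labels, weights=counts)[0]] += 1
            if not heard:
                continue  # isolated image: keeps only its own label
            # Listener stores the most popular label it heard (random tie-break).
            top = max(heard.values())
            winners = sorted(lab for lab, c in heard.items() if c == top)
            memory[listener][rng.choice(winners)] += 1
    # Post-processing: keep labels whose relative frequency is at least r.
    clusters: dict[int, set[int]] = {}
    for v, mem in memory.items():
        total = sum(mem.values())
        kept = [lab for lab, c in mem.items() if c / total >= r]
        if not kept:  # fall back to the single most frequent label
            kept = [mem.most_common(1)[0][0]]
        for lab in kept:
            clusters.setdefault(lab, set()).add(v)
    unique = {frozenset(c) for c in clusters.values()}
    return [set(c) for c in sorted(unique, key=sorted)]


# Toy descriptors: images 0-2 look alike, images 3-5 look alike.
desc = np.array([[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1],
                 [0.0, 1.0, 0.1], [0.1, 1.0, 0.0], [0.1, 0.9, 0.1]])
adj = similarity_graph(desc, threshold=0.8)
clusters = slpa(adj)
print(clusters)
```

Because a node can retain more than one label after thresholding, an image may belong to several clusters, and those shared images are what allow partial SfM maps to be aligned later. Labels only travel along graph edges, so clusters never span disconnected parts of the similarity graph.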

Example end-to-end process:

Given thousands of unordered city images, extract DINOv2 features and build similarity graph.
Cluster images using SLPA into overlapping groups.
Run sparse SfM on each cluster independently to create partial sparse point clouds.
Align and merge partial maps into a global sparse city map.
Project sparse points to ground plane and spatially partition into grid cells.
For each cell, select a subset of cameras with high-quality geometry.
Use cached depth predictions and perform per-view scaling and alignment.
Fuse depth maps volumetrically and extract a watertight mesh with Poisson reconstruction.
Optimize mesh with differentiable rendering losses and curvature-guided adaptive remeshing.
Stitch overlapping partition meshes with Delaunay triangulation and topological refinement to form a global mesh.
Evaluate reconstruction precision and runtime, observing improvements over prior city-scale methods.
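
The align-and-merge step in the process above relies on Sim(3) transformations estimated from cameras shared between clusters. The sketch below shows one standard way to estimate such a transform in closed form, the Umeyama least-squares alignment, applied to corresponding camera centers. It is an assumption that a solver of this kind sits inside the RANSAC loop the paper describes; the RANSAC wrapper and outlier filtering are omitted, and all variable names are hypothetical.

```python
import numpy as np


def umeyama_sim3(src: np.ndarray, dst: np.ndarray):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard so that R is a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: camera centers of shared images in two cluster frames.
rng = np.random.default_rng(0)
centers_a = rng.normal(size=(10, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 2.5, np.array([1.0, -2.0, 0.5])
centers_b = s_true * centers_a @ R_true.T + t_true

s, R, t = umeyama_sim3(centers_a, centers_b)
print("scale:", s)
print("translation:", t)
```

In a RANSAC setting, this solver would be run on small random subsets of shared cameras, and the transform with the most inliers would be kept before merging tracks.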

Technical innovations

A two-stage hierarchical divide-and-conquer partitioning that separately clusters images topologically for sparse SfM, then spatially partitions the sparse map for dense reconstruction to reduce complexity.
Use of overlapping image clusters via SLPA for distributed SfM that enables robust map merging without exhaustive global feature matching.
Curvature-aware adaptive vertex density remeshing that concentrates mesh resolution in regions of high geometric detail while coarsening flat areas, improving detail capture and memory efficiency.
A differentiable mesh refinement pipeline combining silhouette and normal losses optimized via Isotropic Adam with an adaptive remeshing speed schedule tailored for large-scale urban scenes.
A mesh stitching approach based on decomposition into non-overlapping exterior volumes and Delaunay triangulation of overlap boundaries for seamless watertight global mesh integration.
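
The curvature-aware remeshing idea listed above can be illustrated with a small sketch. The summary does not give the paper's actual sizing function or slack schedule, so the mapping below (target length shrinking with absolute curvature, clamped to a range) is purely an assumption. The 4/3 and 4/5 split/collapse thresholds are the conventional ones from isotropic remeshing, not values confirmed by the paper.

```python
import numpy as np


def target_edge_length(curvature, l_min, l_max, scale=1.0):
    """Shorter target edges where |curvature| is large, clamped to [l_min, l_max]."""
    return np.clip(l_max / (1.0 + scale * np.abs(curvature)), l_min, l_max)


def classify_edges(vertices, edges, curvature, l_min, l_max):
    """Label each edge 'split', 'collapse', or 'keep' against its local target."""
    target = target_edge_length(curvature, l_min, l_max)
    length = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    # Per-edge target: average of the targets at the two endpoints.
    edge_target = 0.5 * (target[edges[:, 0]] + target[edges[:, 1]])
    labels = np.full(len(edges), "keep", dtype=object)
    labels[length > 4.0 / 3.0 * edge_target] = "split"
    labels[length < 4.0 / 5.0 * edge_target] = "collapse"
    return labels


# Toy example: three edges along a line; only the last vertex is highly curved.
vertices = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
                     [1.5, 0.0, 0.0], [2.5, 0.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 3]])
curvature = np.array([0.0, 0.0, 0.0, 10.0])

labels = classify_edges(vertices, edges, curvature, l_min=0.1, l_max=1.0)
print(list(labels))  # ['collapse', 'keep', 'split']
```

The short edge in the flat region is collapsed, while the edge touching the high-curvature vertex is split, which is the qualitative behavior the summary attributes to curvature-adaptive remeshing under a fixed vertex budget.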

Datasets

GauU-Scene — multiple urban scenes including CUHK-LOWER, CUHK-UPPER, LFLS, SZIIT — public urban multi-view benchmarks
UrbanScene3D — Residence scene — public urban scene reconstruction benchmark

Baselines vs proposed

CityGS-v2: F1 = 0.1009 (CUHK-LOWER) vs City-Mesh3R: F1 = 0.1110
CityGS-X: F1 = 0.0965 (CUHK-LOWER) vs City-Mesh3R: F1 = 0.1110
CityGS-v2: Runtime = 340.9 min (CUHK-LOWER) vs City-Mesh3R: 95.0 min
CityGS-X: Runtime = 75.0 min (CUHK-LOWER) vs City-Mesh3R: 95.0 min
MASt3R-COLMAP SfM: 2D reprojection error = 1.17 px, time = 53.88 hrs vs City-Mesh3R clustering SfM: 1.38 px, time = 2.74 hrs
Poisson only mesh: F1 = 0.029 (SZIIT) vs Poisson + Opt/Remesh: 0.0968
Continuous Remeshing: F1 = 0.0861 (CUHK-LOWER) vs Curvature-guided remeshing: 0.111
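
The runtime trade-offs in the comparison above are easier to read as ratios. The snippet below is a small arithmetic sketch over the figures quoted in this section; the helper name `speedup` is our own.

```python
def speedup(baseline_time: float, proposed_time: float) -> float:
    """How many times faster the proposed method is (values < 1 mean slower)."""
    return baseline_time / proposed_time


# Full-pipeline runtimes on CUHK-LOWER, in minutes, as listed above.
city_mesh3r_min = 95.0
for name, baseline_min in {"CityGS-v2": 340.9, "CityGS-X": 75.0}.items():
    print(f"{name}: {speedup(baseline_min, city_mesh3r_min):.2f}x")

# Sparse SfM: MASt3R-COLMAP vs clustering-based SfM, in hours and pixels.
sfm_speedup = speedup(53.88, 2.74)
error_increase_px = 1.38 - 1.17
print(f"SfM: {sfm_speedup:.1f}x faster, +{error_increase_px:.2f} px reprojection error")
```

Read this way, City-Mesh3R is about 3.6x faster than CityGS-v2 but slower than CityGS-X on this scene, and the clustered SfM trades roughly 0.21 px of reprojection error for a speedup of nearly 20x.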

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30310.

Fig 1

Fig 1: Our method reconstructs simulation-ready watertight 3D meshes with high geometric fidelity, smooth surface normals, from

Fig 2

Fig 2: Our proposed pipeline: Starting from an unordered input image set, we first build a sparse SfM representation of the entire scene

Fig 3

Fig 3: Qualitative Comparison with recent city-scale surface reconstruction methods (CityGS-v2[27] and CityGS-X[12]) on GauU-

Fig 4

Fig 4: (A) Ablation Study of "Poisson Only" vs "Poisson + MeshOpt/Remesh" across different scenes of GauU-Scene dataset

Fig 5

Fig 5: Area Partitioning Visualizer: The selected partition’s

Fig 6

Fig 6: Extra Qualitative Comparison Results with recent city-scale surface reconstruction methods (CityGS-v2 and CityGS-X) on

Fig 7

Fig 7: Mesh Stitching Result: (A) Two adjacent partition

Limitations

No adversarial robustness or attack scenarios evaluated; focus is on reconstruction quality under nominal conditions only.
Dependency on pretrained monocular normal estimators for mesh refinement, which could introduce unknown biases or errors.
Evaluation primarily on urban scenes with assumed dominant support plane; less suited for highly irregular terrains or non-Manhattan-world cities.
No explicit end-to-end learned model training or adaptation described, limiting generalization beyond camera setups matching pretrained components.
Mesh stitching method relies on overlap consistency; robustness under extreme partition boundary noise or missing data is unclear.
Computational cost is still fairly high (83-113 minutes per scene): well below CityGS-v2 but above CityGS-X. Real-time or near-real-time reconstruction is not addressed.
Details on code and pretrained model release are not provided, impacting reproducibility.

Open questions / follow-ons

How robust is the pipeline to highly sparse or inconsistent image coverage, such as missing viewpoints or occlusions?
Can the method adapt to multi-level non-planar urban scenes with complex terrains instead of a single dominant support plane assumption?
What is the sensitivity of the final mesh quality to errors in sparse SfM or cluster merging? Could further global optimization improve results?
Can learned depth or normal estimators be integrated end-to-end to further improve refinement stages or speed?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners working with 3D scene understanding or synthetic urban environment generation, City-Mesh3R presents a scalable approach to generate high-fidelity, watertight city-scale meshes from large sets of unordered images. Its distributed sparse-to-dense pipeline allows handling very large datasets common in urban scenarios, improving the geometric quality and regularity of meshes compared to prior NeRF or Gaussian splatting approaches.

This method could aid in generating robust 3D synthetic environments for training or testing visual-bot detection systems that require accurate physical or urban layouts. The curvature-adaptive remeshing and mesh refinement also ensure stability and quality of the meshes used for downstream simulations or agent interactions, which is important for realistic rendering and bot behavior modeling. However, the complexity and runtime remain considerable compared to smaller scale reconstructions, so practical integration would require parallel or cloud-based infrastructure.

Cite

bibtex

@article{arxiv2605_30310,
  title={City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images},
  author={Sayan Paul and Sourav Ghosh and Siddharth Katageri and Soumyadip Maity and Sanjana Sinha and Brojeshwar Bhowmick},
  journal={arXiv preprint arXiv:2605.30310},
  year={2026},
  url={https://arxiv.org/abs/2605.30310}
}
