Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering
Source: arXiv:2604.27590 · Published 2026-04-30 · By Davide Di Nucci, Riccardo Catalini, Guido Borghi, Roberto Vezzani
TL;DR
Fake3DGS tackles a gap the 2D deepfake literature mostly ignores: once scene edits happen in a 3D representation, the resulting renders can stay geometrically and photometrically consistent across views, so pixel-artifact detectors lose much of their signal. The paper formalizes “3D fake detection” as deciding whether a 3D scene has been altered in geometry, layout, or appearance even if the rendered images look realistic. To make the task concrete, the authors build Fake3DGS, a benchmark of edited 3D Gaussian Splatting scenes derived from UCO3D reconstructions and manipulated with two editing pipelines, GaussCtrl and Instruct-GS2GS.
The main empirical takeaway is that current 2D detectors do reasonably well when train and test edits come from the same distribution, but they drop sharply under cross-editor generalization, often getting real scenes right while missing fakes. The proposed 3D-aware detector, Fake3DD, operates directly on Gaussian attributes with a modified PointTransformerV3 backbone and substantially improves detection, especially on unseen editing methods. The strongest reported gain is in the cross-edit setting, where Fake3DD reaches 98.7% overall accuracy on GaussCtrl→Instruct-GS2GS and improves the best 2D baseline by 14.9 percentage points overall, driven mainly by a 24.3-point jump on the fake class.
Key findings
- Fake3DGS contains more than 41k reconstructed scenes, balanced between real and manipulated samples, and the authors say the compressed benchmark footprint drops from ~7 TB to ~200 GB after 8-bit scalar quantization plus PNG encoding.
- In the mixed split, the best 2D baseline reported is CoDE at 92.2% overall accuracy, while Fake3DD reaches 98.9% overall accuracy with 98.1% fake accuracy and 99.5% real accuracy.
- In the GaussCtrl→Instruct-GS2GS cross-edit setting, the strongest 2D baseline is DM at 83.8% overall accuracy, but it only gets 74.8% fake accuracy; Fake3DD reaches 98.7% overall with 99.1% fake and 98.4% real accuracy.
- In the opposite cross-edit direction, Instruct-GS2GS→GaussCtrl, Fake3DD reports 98.6% overall accuracy, compared with the strongest 2D baseline DM at 82.4% overall accuracy.
- Several 2D baselines show a consistent pattern: real accuracy remains high (often in the 93–98% range) while fake accuracy collapses under cross-edit transfer, e.g. CLIP ViT-B fake accuracy drops to 41.5% in one cross-edit protocol.
- The ablation study in Fig. 2 shows opacity is the most important Gaussian feature group: removing it causes the largest drop, down to 92.5% overall accuracy relative to the full-feature model.
- The authors report that higher-order spherical harmonics contribute minimally, while the zeroth-order SH term s0 and scale provide moderate gains and quaternion orientation has limited effect.
Threat model
The adversary can modify a 3D Gaussian Splatting scene directly, altering geometry, appearance, or spatial layout, and then render arbitrarily many realistic views that remain multi-view consistent. The defender sees either rendered images or the scene representation but does not know whether the scene has been edited. The paper assumes the attacker may use unseen editing methods at test time and that 2D artifact cues may be weak or absent; it does not assume the adversary can tamper with the detector or its training data.
Methodology — deep read
The threat model is passive 3D manipulation detection. The adversary can directly edit a 3D Gaussian Splatting scene representation to change geometry, object appearance, or spatial layout, then re-render arbitrarily many photorealistic views. The detector does not get oracle access to the original scene and must decide whether a rendered view or the underlying scene is authentic. The paper assumes the attacker’s edits are realistic enough that simple 2D artifact cues may be absent, and in cross-edit evaluation the manipulation method at test time is unseen during training.
For data, the benchmark starts from UCO3D, which already provides multi-view imagery, camera poses, and pretrained 3DGS reconstructions. The authors subsample categories to reduce category imbalance by taking the minimum instance count across categories and sampling that many objects per category. They then export reconstructions as nerfstudio checkpoints and compress the Gaussian parameters using a Self-Organizing Gaussian Grids-style pipeline implemented with gsplat in the nerfstudio ecosystem. Each Gaussian parameter tensor is uniformly scalar-quantized to 8 bits per parameter, sorted into locally coherent 2D grids, and losslessly PNG-compressed. This is important because the benchmark is large: the paper states over 41k scenes in total, evenly balanced between real and manipulated samples, and that the storage footprint shrinks from roughly 7 TB to about 200 GB after compression.
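To make the storage math concrete, here is a minimal Python sketch of the 8-bit scalar quantization plus lossless PNG packing step, under stated assumptions: it operates on a single attribute channel given as a NumPy array, uses plain min-max quantization, and only notes in a comment the Self-Organizing Gaussian Grids sorting that the actual pipeline applies before PNG encoding. It is an illustration, not the released code.

```python
import numpy as np
from PIL import Image

def quantize_and_pack(param: np.ndarray, side: int) -> tuple[Image.Image, float, float]:
    """Uniformly quantize one Gaussian parameter channel to 8 bits and pack it
    into a 2D grid that can be losslessly PNG-compressed.

    `param` holds one value per Gaussian (e.g. opacity); `side` is the grid
    side length. NOTE: this is an illustrative min-max quantizer only; the
    paper's SOG-style pipeline additionally reorders Gaussians into locally
    coherent grids so PNG filters compress better.
    """
    lo, hi = float(param.min()), float(param.max())
    q = np.round((param - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)

    # Pad to a full side*side grid and reshape into an 8-bit grayscale image.
    pad = side * side - q.size
    q = np.pad(q, (0, pad), mode="edge").reshape(side, side)
    return Image.fromarray(q, mode="L"), lo, hi

def dequantize(img: Image.Image, lo: float, hi: float, n: int) -> np.ndarray:
    """Invert the 8-bit quantization; `n` is the original Gaussian count."""
    q = np.asarray(img, dtype=np.float32).reshape(-1)[:n]
    return q / 255.0 * (hi - lo) + lo

# Usage: compress the opacity channel of a scene with ~100k Gaussians.
opacity = np.random.rand(100_000).astype(np.float32)      # stand-in for real data
side = int(np.ceil(np.sqrt(opacity.size)))
img, lo, hi = quantize_and_pack(opacity, side)
img.save("opacity.png")                                    # lossless PNG encoding
restored = dequantize(Image.open("opacity.png"), lo, hi, opacity.size)
print("max abs error:", np.abs(restored - opacity).max())  # bounded by roughly (hi - lo) / 510
```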
Fake samples are generated with two editing pipelines. GaussCtrl edits rendered images from 3DGS and reconstructs an edited scene using depth-conditioned ControlNet guidance plus attention-based latent alignment to preserve geometry consistency. Instruct-GS2GS iteratively edits input images with InstructPix2Pix and then optimizes the underlying scene to match the edited views. To produce captions for the edits, the authors use Meta-Llama-3-8B-Instruct locally, with three fixed prompt templates that request one of three instruction families: object material/type change, background/surface change, or object color change. For each original caption they generate three edited captions, then randomly select one edited caption per scene so the dataset has a balanced edit-type distribution. Fig. 1 and Fig. 3 show example edits such as grass→snow, sheep→giraffe, or color changes like silver→gold.
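The caption-editing protocol is simple enough to sketch. The snippet below is a hypothetical illustration of the three instruction families and the one-edit-per-scene sampling; the template wording and the `generate_fn` wrapper around a local Meta-Llama-3-8B-Instruct call are assumptions, not the paper's exact prompts.

```python
import random

# Paraphrased stand-ins for the three instruction families described in the paper;
# the exact prompt wording used with Meta-Llama-3-8B-Instruct is an assumption here.
EDIT_TEMPLATES = {
    "material": "Rewrite this caption so the main object's material or type changes: '{caption}'",
    "background": "Rewrite this caption so the background or supporting surface changes: '{caption}'",
    "color": "Rewrite this caption so the main object's color changes: '{caption}'",
}

def make_edit_instruction(caption: str, generate_fn, rng: random.Random) -> dict:
    """Produce one edited caption per scene.

    `generate_fn` is any callable mapping a prompt string to generated text
    (e.g. a wrapper around a local Meta-Llama-3-8B-Instruct pipeline).
    All three edit families are generated, then one is sampled uniformly so the
    dataset stays balanced across edit types, mirroring the paper's protocol.
    """
    candidates = {
        family: generate_fn(tmpl.format(caption=caption))
        for family, tmpl in EDIT_TEMPLATES.items()
    }
    family = rng.choice(sorted(candidates))
    return {"edit_type": family, "edited_caption": candidates[family]}

# Usage with a dummy generator (replace with a real LLM call):
rng = random.Random(0)
print(make_edit_instruction("a sheep standing on grass", lambda p: p.upper(), rng))
```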
The proposed detector, Fake3DD, works directly on the 3DGS representation rather than on rendered RGB images. The backbone is PointTransformerV3, chosen because a Gaussian scene can be treated as an unordered 3D set with local spatial context. The authors modify the input feature representation so each Gaussian is encoded by its 3D mean coordinates plus attributes: opacity, scale, quaternion orientation, and spherical harmonics features, including the zeroth-order term s0 as the view-independent color component. The encoder and decoder stages of PointTransformerV3 produce per-Gaussian contextual features, and these are aggregated into a scene-level embedding by global mean pooling over Gaussians in the same scene. A classification head then outputs a binary real/fake prediction. The novelty is not a new loss or exotic training scheme; it is the choice to classify directly in Gaussian-attribute space so the model can exploit scene-level inconsistencies that are invisible in many rendered frames.
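A minimal sketch of that data flow, assuming PyTorch and substituting a plain per-Gaussian MLP for the modified PointTransformerV3 backbone (whose real interface is not reproduced here), looks like this:

```python
import torch
import torch.nn as nn

class Fake3DDSketch(nn.Module):
    """Minimal sketch of a Fake3DD-style classifier over Gaussian attributes.

    The paper uses a modified PointTransformerV3 backbone; here a shared
    per-Gaussian MLP stands in for it (an assumption) so the overall data flow
    — per-Gaussian features -> contextual encoding -> mean pooling over the
    scene -> binary real/fake head — stays visible without the real backbone.
    """

    def __init__(self, feat_dim: int = 14, hidden: int = 256):
        super().__init__()
        # feat_dim = 3 (mean) + 1 (opacity) + 3 (scale) + 4 (quaternion) + 3 (SH DC term s0)
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.head = nn.Linear(hidden, 2)   # logits for {real, fake}

    def forward(self, gaussians: torch.Tensor) -> torch.Tensor:
        # gaussians: (batch, num_gaussians, feat_dim)
        per_gaussian = self.backbone(gaussians)   # per-Gaussian features
        scene_embed = per_gaussian.mean(dim=1)    # global mean pooling over the scene
        return self.head(scene_embed)             # scene-level real/fake logits

# Usage: a batch of 2 scenes with 4096 Gaussians each.
model = Fake3DDSketch()
x = torch.randn(2, 4096, 14)
print(model(x).shape)   # torch.Size([2, 2])
```

The point of the sketch is the aggregation order: per-Gaussian contextual features first, then scene-level pooling, then a binary head, which is where the method departs from per-image 2D detectors.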
Evaluation uses three split regimes. The mixed split is an 80/20 train-test split over edited scenes, with approximately 33k scenes for training and 8k for testing, and both editors present in train and test. The cross-edit splits train on one editing method and test on the other, so the model must generalize to an unseen generator. All 2D baselines are fine-tuned from released weights on rendered images from the same split protocol; the baselines include ResNet-50-based methods from Wang et al., reconstruction-based DM from Wang et al., CoDE, CLIP linear probes with ViT-B/16 and ViT-L/14, UFD (Ojha et al.), and DINOv2 fine-tuning. Table 1 reports overall accuracy plus class-wise fake and real accuracy. One concrete example: in the GaussCtrl→Instruct-GS2GS setting, DM reaches 83.8% overall accuracy but only 74.8% on fake samples, suggesting it is conservative under distribution shift; Fake3DD reaches 98.7% overall, with 99.1% fake accuracy and 98.4% real accuracy. The paper also performs an ablation in Fig. 2 by dropping one Gaussian attribute group at a time and measuring the accuracy change relative to the full model.
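For readers re-implementing the evaluation, the class-wise metrics in Table 1 amount to accuracy computed separately on the fake and real subsets. A small sketch, assuming the label convention 1 = fake and 0 = real (this encoding is not stated in the excerpt):

```python
import numpy as np

def classwise_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Overall, fake-class, and real-class accuracy as read from Table 1
    (assumed labels: 1 = fake/manipulated, 0 = real)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "overall": float((y_pred == y_true).mean()),
        "fake": float((y_pred[y_true == 1] == 1).mean()),
        "real": float((y_pred[y_true == 0] == 0).mean()),
    }

# Example: a detector that misses fakes under distribution shift but keeps real accuracy.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])
print(classwise_accuracy(y_true, y_pred))   # {'overall': 0.75, 'fake': 0.5, 'real': 1.0}
```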
Reproducibility looks relatively strong. The paper says code and data are publicly released, uses a standard nerfstudio checkpoint format, relies on open-source components such as gsplat and Meta-Llama-3-8B-Instruct, and reports the prompts used to generate captions. What is less clear from the text is the exact optimizer, learning rate, epoch count, batch size, random seed strategy, and whether significance testing or confidence intervals were computed; those details are not provided in the excerpt. The paper also notes that splits are performed at the edit level rather than enforcing strict scene-level separation, so it is not fully clear whether different versions of the same underlying scene can appear in both train and test.
Technical innovations
- Formalizes 3D fake detection as a security task for manipulated 3D scene representations, extending beyond image-only deepfake detection.
- Introduces Fake3DGS, a large-scale benchmark of real and edited 3D Gaussian Splatting scenes with controlled manipulations of geometry, appearance, and layout.
- Proposes Fake3DD, a 3D-aware detector that classifies scenes directly from Gaussian-level attributes using a modified PointTransformerV3 backbone.
- Uses multi-view coherence and Gaussian attribute aggregation instead of relying on 2D rendering artifacts, which is the main departure from prior 2D deepfake detectors.
- Compresses the dataset representation with 8-bit per-parameter quantization and PNG encoding to make large-scale benchmarking practical.
Datasets
- Fake3DGS — >41k scenes; balanced real vs manipulated; 3DGS reconstructions compressed from ~7 TB to ~200 GB — built from UCO3D and edited with GaussCtrl / Instruct-GS2GS
- UCO3D — large-scale public source dataset with >1000 categories (exact selected subset size not stated)
Baselines vs proposed
- CoDE: overall accuracy = 92.2% vs proposed: 98.9% (mixed split)
- DM (GaussCtrl→Instruct-GS2GS): overall accuracy = 83.8% vs proposed: 98.7%; fake accuracy = 74.8% vs 99.1%; real accuracy = 95.8% vs 98.4%
- DM (Instruct-GS2GS→GaussCtrl): overall accuracy = 82.4% vs proposed: 98.6%; fake accuracy = 80.3% vs 98.6%; real accuracy = 96.9% vs 98.3%
- CLIP ViT-B (cross-edit): overall accuracy = 74.3% vs proposed: 98.7% in the corresponding GaussCtrl→Instruct-GS2GS protocol; fake accuracy = 41.5% vs 99.1%
- CLIP ViT-L (cross-edit): overall accuracy = 75.4% vs proposed: 98.6% in the corresponding Instruct-GS2GS→GaussCtrl protocol; fake accuracy = 44.5% vs 98.6%
- DINOv2-B (mixed split): overall accuracy = 87.8% vs proposed: 98.9%; fake accuracy = 89.7% vs 98.1%; real accuracy = 85.8% vs 99.5%
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.27590.

Fig 1: Sample renderings from the Fake3DGS Dataset. Below each view of the scene, …

Fig 2 (page 5): Ablation over Gaussian attribute groups (opacity, scale, quaternion, spherical harmonics), referenced in the text above.

Fig 3 (page 5): Example edits such as grass→snow, sheep→giraffe, and silver→gold, referenced in the text above.

Limitations
- The benchmark currently uses only two editing pipelines (GaussCtrl and Instruct-GS2GS), so generalization to other 3D editors is not established.
- The paper explicitly says splits are performed at the edit level rather than strict scene-level separation, so some versions of the same underlying scene can appear across partitions.
- No optimizer, learning-rate schedule, epoch count, batch size, or seed protocol is given in the excerpt, which makes exact reproduction of training harder from the paper text alone.
- Evaluation is limited to classification; there is no edit localization, attribution, or region-level detection benchmark yet.
- The detector is trained and tested on the same underlying dataset family (UCO3D-derived scenes), so robustness to other 3D asset types, real-world capture noise, or different reconstruction pipelines is not shown.
- The compression pipeline preserves the quantized representation, but the effect of quantization error on downstream detection is not separately isolated.
Open questions / follow-ons
- How well does a Gaussian-attribute detector transfer to other 3D scene representations such as NeRFs, meshes, or hybrid radiance-field formats?
- Can the benchmark be extended to edit localization, so the model identifies which region or which Gaussian clusters were manipulated?
- What happens under stronger distribution shift, such as new object categories, outdoor scenes, or reconstruction quality variation from different source pipelines?
- Can passive detection be combined with watermarking or provenance signals to distinguish benign editing from malicious tampering more reliably?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the important lesson is that authenticity checks cannot rely only on 2D render artifacts when the underlying content may be generated or modified in a 3D representation. If a platform accepts user-uploaded 3D assets, AR/VR scenes, or synthetic media generated through 3D pipelines, a 2D-only detector will likely overfit to a specific editor and miss cross-editor manipulations. This paper suggests a more robust direction: inspect the latent 3D structure itself, or at least use signals derived from it, because view-consistent fakes can look clean in every 2D projection.
Practically, a bot-defense team could use this result as a warning about generalization. A model that performs well on known manipulation tools may still fail on a new editing pipeline, and the failure mode will likely be asymmetric: many fakes slip through while real content stays easy to recognize. The most actionable implication is to build evaluation suites that include held-out generators and held-out editing methods, not just random train/test splits, and to treat 3D-aware evidence as a separate modality rather than assuming image-level fingerprints are enough.
Cite
@article{arxiv2604_27590,
  title={Fake3DGS: A Benchmark for 3D Manipulation Detection in Neural Rendering},
  author={Davide Di Nucci and Riccardo Catalini and Guido Borghi and Roberto Vezzani},
  journal={arXiv preprint arXiv:2604.27590},
  year={2026},
  url={https://arxiv.org/abs/2604.27590}
}