PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

Source: arXiv:2605.28241 · Published 2026-05-27 · By Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao et al.

TL;DR

PointQ-Bench addresses critical gaps in point cloud quality assessment (PCQA), moving beyond traditional scalar quality prediction to a comprehensive understanding that includes diagnostic defect identification, usability grading, and interpretable quality reasoning. Existing PCQA benchmarks focus mainly on mean opinion score (MOS) regression and do not capture real-world inspection needs such as defect taxonomy, evidence-grounded descriptions, and handling ambiguous cases near usability boundaries. PointQ-Bench fills this gap with 3,083 samples spanning real scans, synthetic distortions, and AI-generated content, annotated with MOS, multi-label defect tags, usability levels, expert textual comments, and over 12,000 question-answer pairs.

Key findings

PointQ-Bench contains 3,083 point clouds from three source families: 1,168 authentic, 1,315 synthetic distorted, and 600 AI-generated samples, covering 8 major defect types.
Quality annotations per sample include MOS scores (1–5), 3-level usability labels (good, usable, bad), multi-label defect tags, and expert evidence-grounded comments.
Models are evaluated on three perception tasks (anomaly sensing, defect diagnosis, usability grading) and one open-ended cognition task (quality reasoning, scored with SSFRQ-5D).
Strong proprietary 2D multimodal large language models (MLLMs) such as GPT-5 achieve up to 88.68% yes/no accuracy, 42.32% defect diagnosis F1, and 50.13% usability grading macro-F1, outperforming junior-level human baselines.
Native 3D vision-language models (ShapeLLM, PointLLM) lag behind 2D MLLMs on fine-grained tasks, indicating difficulty translating 3D input into quality reasoning.
Multi-view input improves usability grading by up to +5.5 points but has inconsistent impact on anomaly sensing and defect diagnosis, with variance explained mainly by model and data source interactions rather than view count alone.
Open-ended quality reasoning performance shows high structural completeness and reasoning coherence (near ceiling on SSFRQ-5D dimensions S1, S2, R) but low faithfulness (F ~1.07 for GPT-5) and final quality accuracy (Q ~1.39 for GPT-5), revealing a limitation in evidence grounding and calibrated decision-making.
Ambiguous boundary samples are preserved rather than discarded as disagreement-prone cases, which makes it possible to evaluate calibration under uncertainty.

Threat model

The adversary is not explicitly defined as this is a perception-quality benchmark rather than a security-focused paper. The implicit threat scenario is no-reference quality assessment where no pristine underlying point cloud is available. Models must generalize across heterogeneous real, synthetic, and AI-generated data under noisy, ambiguous, and defect-ridden conditions. The adversary cannot access references but may attempt to evade defect detection or calibration by exploiting ambiguity in input distributions.

Methodology — deep read

The authors propose PointQ-Bench as a multidimensional benchmark to extend PCQA beyond scalar MOS predictions toward richer quality understanding. The threat model assumes a no-reference scenario where no pristine point cloud reference is available, reflecting real use cases such as inspection and generative content evaluation.

Data is aggregated from seven public datasets to cover three complementary source families: authentic real scans (LiDAR-Net, ScanObjectNN, CO3D), synthetic distorted point clouds (ModelNet40-C, SJTU-PCQA, WPC), and AI-generated point clouds (T23D-CompBench). Each sample is annotated independently by eight trained experts with continuous quality scores, three-level usability labels, multi-label defect tags from a fixed vocabulary of eight defect types, and evidence-grounded free-form descriptions. Annotators used both interactive 3D views and multiple 2D projections during labeling. Scores were normalized to mitigate rater bias, and majority voting determined the final usability labels. Ambiguous boundary samples were retained as a separate evaluation split so that uncertainty is preserved rather than discarded.
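The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not say which normalization was used, so per-rater z-scoring is an assumption, and the `min_margin` threshold for flagging a boundary-ambiguous sample is a hypothetical parameter.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_per_rater(scores_by_rater):
    """Normalize raw scores within each rater (assumed z-score normalization).

    scores_by_rater: {rater_id: {sample_id: raw_score}}
    """
    normalized = {}
    for rater, scores in scores_by_rater.items():
        mu = mean(scores.values())
        sigma = pstdev(scores.values()) or 1.0  # constant rater: avoid dividing by zero
        normalized[rater] = {sid: (s - mu) / sigma for sid, s in scores.items()}
    return normalized


def aggregate_usability(votes, min_margin=2):
    """Majority vote over 'good' / 'usable' / 'bad' labels.

    Returns (label, is_ambiguous). A sample is flagged ambiguous when the
    winning label leads the runner-up by fewer than `min_margin` votes
    (hypothetical rule, used here only for illustration).
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    return top_label, (top_count - runner_up) < min_margin
```

With eight annotators, a 5-to-3 vote would be kept in the main split under this rule, while a 4-to-4 split would be routed to the ambiguous split instead of being dropped.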

The benchmark integrates multiple tasks: yes/no anomaly sensing (noticeable defect presence), defect diagnosis (multi-label classification of defect types), usability grading (3-class label), and open-ended explanatory quality reasoning. Perception tasks are evaluated with standard classification metrics: accuracy for yes/no, sample-level F1 for defect tags, and macro-F1 for usability. For free-form text reports, the authors define SSFRQ-5D, a 5-dimensional rubric that scores structural sufficiency, specificity, faithfulness, reasoning coherence, and quality accuracy from 0 to 2 per dimension. LLM-based automatic judges parse the model outputs and normalize the responses before metrics are computed.
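The three perception metrics named above can be written out in a few lines. The sketch below uses the standard definitions; the convention that two empty tag sets score 1.0 and the example tag names used in the usage note are assumptions, since the summary does not list the eight defect types.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches, used for the yes/no task."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sample_f1(true_sets, pred_sets):
    """Sample-level F1: F1 per sample over tag sets, then averaged."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        if not t and not p:
            scores.append(1.0)  # assumption: both empty counts as a perfect match
            continue
        tp = len(t & p)
        scores.append(2 * tp / (len(t) + len(p)))
    return sum(scores) / len(scores)


def macro_f1(y_true, y_pred, labels=("good", "usable", "bad")):
    """Macro-F1: unweighted mean of per-class F1 over the usability labels."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

For example, `sample_f1([{"noise", "hole"}], [{"noise"}])` gives 2/3: one of the two true tags is recovered with no false positives. Macro-F1 weights the three usability classes equally, so a model that never predicts a rare class is penalized even when its overall accuracy looks acceptable.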

Evaluation covers 14 state-of-the-art models from three families: proprietary 2D multimodal LLMs (e.g., GPT-5), open-source 2D MLLMs (e.g., Qwen-3.5), and native 3D vision-language models (e.g., PointLLM). Models receive either single-view or multi-view 2D projections or point cloud inputs where applicable. Inference uses zero-shot prompting without fine-tuning. Model outputs are free-form text parsed for label extraction.
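To make the label-extraction step concrete, here is a deliberately simple stand-in. The summary says the paper uses LLM-based judges for this normalization; the regex rules below are a swapped-in simplification for illustration only, and the function names are hypothetical.

```python
import re

USABILITY_LABELS = ("good", "usable", "bad")


def parse_yes_no(text):
    """Return the first standalone 'yes' or 'no' in the response, else None."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    return match.group(1) if match else None


def parse_usability(text):
    """Return the usability label if exactly one label word appears, else None.

    Word boundaries keep 'unusable' from matching 'usable'. Responses that
    mention several labels are left unresolved rather than guessed.
    """
    lowered = text.lower()
    found = [lab for lab in USABILITY_LABELS if re.search(rf"\b{lab}\b", lowered)]
    return found[0] if len(found) == 1 else None
```

Returning `None` for unresolved responses keeps parsing failures visible instead of silently mapping them to a class, which matters when zero-shot models answer in free-form prose.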

Ablations test sampling density and strategies for 3D inputs and the effect of multi-view counts for 2D VLMs. Human-AI agreement on SSFRQ-5D dimensions is assessed via double-blind expert scoring on 200 samples and quadratic weighted Cohen’s kappa. Additional decision splits maintain boundary-ambiguous samples to evaluate uncertainty calibration. The pipeline is fully automated with reproducibility details and extensive annotation logs described in appendices, though code and data release status is unclear.
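Quadratic weighted Cohen's kappa, the agreement statistic mentioned above, can be computed directly from two lists of ordinal ratings. The sketch below uses the standard definition and assumes integer rubric scores in {0, 1, 2}, which gives three levels; the function name is illustrative.

```python
def quadratic_weighted_kappa(a, b, n_levels=3):
    """Quadratic weighted Cohen's kappa for two raters with ordinal labels 0..n_levels-1."""
    n = len(a)
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    # If both raters use a single identical level, den is 0; treat as full agreement.
    return 1.0 - num / den if den else 1.0
```

The quadratic weights penalize a 0-versus-2 disagreement four times as heavily as a 0-versus-1 disagreement, which suits an ordinal rubric where near misses between a human expert and an automatic judge are less serious than opposite verdicts.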

To illustrate, on an example point cloud sample, eight experts score MOS and annotate defect types by inspecting multiple 2D projections and 3D views. These labels provide ground truth for yes/no defect detection, multi-label identification, usability grading, and open-ended descriptive quality reasoning. Model outputs for these tasks undergo natural language parsing to yield comparable outputs for metric evaluation. This comprehensive framework enables rich diagnostic evaluation rather than simple scalar regression.

Technical innovations

PointQ-Bench is the first large-scale benchmark explicitly targeting multidimensional point cloud quality understanding beyond scalar score prediction, integrating real, synthetic, and AI-generated data with rich annotations including defect taxonomy and evidence-grounded descriptions.
SSFRQ-5D is a novel, lightweight 5-dimensional evaluation protocol designed for reliable assessment of open-ended quality reports by measuring structural sufficiency, specificity, faithfulness, reasoning coherence, and quality accuracy, validated through human-AI agreement.
The benchmark preserves and separately evaluates ambiguity in human annotations near quality boundaries, enabling study of calibration and uncertainty rather than discarding ambiguous samples.
The unified multi-task evaluation framework frames point cloud quality assessment as progressive perception subtasks (anomaly sensing, defect diagnosis, usability grading) combined with cognition-oriented open-ended quality reasoning, enabling holistic model evaluation.
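One way to represent an SSFRQ-5D score in code is a small validated record with one field per rubric dimension. This is a sketch under assumptions: the class and function names are hypothetical, integer scores in {0, 1, 2} are assumed from the stated 0 to 2 range, and averaging per dimension is inferred from the reported fractional values rather than confirmed by the summary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SSFRQScore:
    """One judged report: five rubric dimensions, each scored 0, 1, or 2 (assumed integers)."""
    structural_sufficiency: int
    specificity: int
    faithfulness: int
    reasoning_coherence: int
    quality_accuracy: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (0, 1, 2):
                raise ValueError(f"{f.name} must be 0, 1, or 2, got {value!r}")


def dimension_means(scores):
    """Average each rubric dimension over a list of judged reports."""
    return {
        f.name: sum(getattr(s, f.name) for s in scores) / len(scores)
        for f in fields(SSFRQScore)
    }
```

Keeping the dimensions separate, rather than collapsing them into one total, is what exposes the pattern reported for GPT-5: near-ceiling structure and coherence alongside much lower faithfulness and quality accuracy.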

Datasets

LiDAR-Net — 369 samples — public authentic scan dataset
ScanObjectNN — 300 samples — public authentic object scans
CO3D — 499 samples — public multi-view real video reconstructions
ModelNet40-C — 710 samples — public synthetic distortions from CAD models
SJTU-PCQA — 344 samples — public synthetic distortions with subjective annotations
WPC — 261 samples — public distorted synthetic samples
T23D-CompBench — 600 samples (subset) — AI-generated textured meshes from 8 generators
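As a quick consistency check, the per-dataset counts listed above can be summed by source family and compared with the totals quoted in the key findings. The family keys are labels chosen for this sketch; the counts are taken from the list.

```python
DATASETS = [
    ("LiDAR-Net", "authentic", 369),
    ("ScanObjectNN", "authentic", 300),
    ("CO3D", "authentic", 499),
    ("ModelNet40-C", "synthetic", 710),
    ("SJTU-PCQA", "synthetic", 344),
    ("WPC", "synthetic", 261),
    ("T23D-CompBench", "ai_generated", 600),
]


def totals_by_family(rows):
    """Sum sample counts per source family."""
    totals = {}
    for _name, family, count in rows:
        totals[family] = totals.get(family, 0) + count
    return totals


family_totals = totals_by_family(DATASETS)
grand_total = sum(family_totals.values())
```

The sums come out to 1,168 authentic, 1,315 synthetic distorted, and 600 AI-generated samples, for 3,083 in total, matching the figures stated earlier.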

Baselines vs proposed

Random guess: Yes/No accuracy = 50.67%, defect diagnosis (What) F1 = 14.70%, usability grading (How) macro-F1 = 34.26% vs GPT-5 (mv6): Yes/No = 83.59%, What = 42.32%, How = 50.13%
Junior-level human baseline: Yes/No = 86.25%, What = 36.67%, How = 48.35% vs GPT-5: What +5.65 points, How +1.78 points, Yes/No +2.43 points
ShapeLLM (13B): Yes/No = 89%, What = 32.42%, How = 25.01% vs GPT-5 (mv6): What +9.9 points, How +25.12 points, indicating a gap between native 3D VLMs and 2D MLLMs
Multi-view input increases usability grading (How) by up to +5.5 points on AI-generated samples but often degrades Yes/No accuracy by up to −7.5 points on real and synthetic data
Faithfulness (F) dimension in SSFRQ-5D for GPT-5 is 1.07/2, and Quality Accuracy (Q) is 1.39/2, showing a gap between structured reporting and grounded quality judgment
PointLLM (7B): yes/no accuracy is around 15-16% across sampling densities, far below strong 2D MLLMs and human baselines
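The point gaps quoted against the junior-level human baseline can be reproduced with simple subtraction. The sketch below uses GPT-5's best reported value per task, which means the yes/no figure is the 88.68% from the key findings rather than the 83.59% of the mv6 row; the dictionary keys are labels chosen for this example.

```python
GPT5_BEST = {"yes_no": 88.68, "what": 42.32, "how": 50.13}
JUNIOR_HUMAN = {"yes_no": 86.25, "what": 36.67, "how": 48.35}


def point_gap(model, baseline):
    """Per-task difference in percentage points, rounded to two decimals."""
    return {task: round(model[task] - baseline[task], 2) for task in model}


gaps = point_gap(GPT5_BEST, JUNIOR_HUMAN)
```

These are differences in percentage points, not relative improvements: a gap of 5.65 points on defect diagnosis F1 corresponds to 42.32% versus 36.67%.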

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.28241.

Fig 1: PointQ-Bench: An overview of a unified benchmark for point cloud quality assessment. The benchmark …

Fig 2: Comprehensive analysis of multi-view effects. (a) Average gains across source families; (b) variance …

Limitations

Despite large-scale annotation, the dataset size (3,083) remains moderate compared to some large-scale PCQA datasets, potentially limiting deep learning scaling.
Model evaluations are zero-shot with no fine-tuning or supervised adaptation on PointQ-Bench, possibly underestimating achievable performance.
Native 3D vision-language models struggle, but the paper does not deeply investigate core architectural bottlenecks limiting fine-grained 3D quality reasoning.
Ambiguity handling is novel but boundaries of unresolvable disagreement are not fully characterized, limiting interpretation of uncertainty evaluation.
The SSFRQ-5D evaluation requires manual expert references for each sample, which constrains scalability, and reproducibility is limited without public code or weight releases.
The impact of temporal or dynamic point cloud sequences is not explored; only static point clouds are benchmarked.

Open questions / follow-ons

How can native 3D vision-language architectures be improved to match or exceed the fine-grained diagnostic quality understanding of 2D multimodal LLMs?
Can the SSFRQ-5D evaluation be extended or automated further to scale open-ended quality reasoning assessment without expert references?
What model adaptations or training regimes enable effective uncertainty calibration and ambiguity-aware reasoning on boundary samples?
How might temporal or dynamic point-cloud sequences and multimodal temporal inputs influence diagnostic quality assessment and usability grading?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, PointQ-Bench exemplifies a comprehensive framework for addressing quality assessment challenges beyond simple scalar scoring, analogous to moving from simple pass/fail CAPTCHA detection toward richer human behavior diagnostics and interpretability. The complex task decomposition (anomaly detection, defect diagnosis, usability grading, evidence-grounded explanation) parallels the need in bot defense for interpretability and actionable reasoning rather than opaque binary signals.

This work highlights that multimodal language models—particularly strong 2D MLLMs—can perceive coarse defects reliably but struggle with grounded reasoning and calibrated decisions. Similarly, bot defenses relying on score-based heuristics may miss nuanced or adversarial behaviors if interpretability and uncertainty calibration are lacking. The methodology of preserving ambiguity and evaluating open-ended explanations with multi-dimensional criteria (structural sufficiency, faithfulness, coherence) provides a template for developing diagnostics and interpretable reasoning in CAPTCHA and bot detection systems. Additionally, the finding that more views or inputs do not uniformly improve performance cautions against blindly increasing input complexity without targeted task-aware design, a principle applicable in designing multi-aspect human interaction analyses for bot detection.

Cite

@article{arxiv2605_28241,
  title={PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment},
  author={Duanchu Wang and Cheng Li and Junjie Yang and Jing Huang and Zihang Cheng and Zhi Gao and ZhuBohong and Di Wang},
  journal={arXiv preprint arXiv:2605.28241},
  year={2026},
  url={https://arxiv.org/abs/2605.28241}
}
