DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion
Source: arXiv:2502.16419 · Published 2025-02-23 · By Jianbin Jiao, Xina Cheng, Kailun Yang, Xiangrong Zhang, Licheng Jiao
TL;DR
This paper addresses the critical challenge of 3D human pose estimation under real-world deficiency-aware conditions such as occlusions, noise interference, and missing viewpoints. Traditional multi-view methods often rely on complex multi-stage pipelines, which can lead to cumulative errors, lengthy training, and insufficient robustness to real-world deficiencies. To tackle these issues, the authors propose DeProPose, a novel end-to-end multi-view 3D pose estimation model that simplifies network architecture by directly extracting 3D poses from images. The key innovation is an adaptive multi-view feature fusion mechanism that dynamically weights each view based on combined projection and absolute error measurements, enabling the model to focus on reliable views and downweight noisy or occluded inputs. This design reduces information loss, lowers complexity, and boosts robustness and accuracy in challenging scenes.
To rigorously evaluate their method, the authors create a new Deficiency-Aware 3D Pose Estimation (DA-3DPE) dataset that simulates noise, occlusion, and missing data scenarios across multiple views. Experimental results on the DA-3DPE dataset and the standard Human3.6M benchmark demonstrate that DeProPose significantly improves accuracy under deficiency-aware conditions compared to existing state-of-the-art multi-view fusion and pose estimation methods. It also provides strong performance in conventional, non-deficiency scenarios with a simpler, unified model that adapts dynamically to multi-view data quality variations.
Key findings
- DeProPose achieves superior robustness to occlusion, noise, and missing viewpoints compared to state-of-the-art baselines on the DA-3DPE deficiency-aware dataset.
- The adaptive fusion weights based on combined projection error and absolute error enable effective downweighting of noisy or occluded views, improving final 3D pose accuracy.
- Direct end-to-end 3D feature extraction from multi-view images avoids error compounding and information loss found in two-stage pipelines.
- The DA-3DPE dataset includes controlled simulations of Gaussian noise, salt-and-pepper noise, speckle noise, missing data blocks, and occlusion from Pascal VOC objects, enabling comprehensive deficiency evaluation.
- Spatial-temporal modeling with Swin Transformer backbone and positional encoding of camera ray information captures dynamics of poses and 3D spatial relations across views.
- DeProPose reduces training complexity and hyperparameter tuning effort relative to multi-stage modular methods that combine CNNs, LSTMs, and GCNs.
- Multi-view fusion improves accuracy: weighted feature fusion outperforms simple stacking or averaging (see Fig. 2 and the corresponding analysis).
- The projection error and absolute error formulation for adaptive weights yields a principled fusion evaluation metric rather than heuristic weighting.
Threat model
The adversary models real-world environmental deficiencies (occlusions, sensor noise, and missing camera views) rather than intentional tampering: it does not manipulate ground-truth labels or camera calibration, but corrupts input data in ways that degrade quality. The model assumes synchronized multi-view video inputs with known camera poses and relies on ground-truth annotations for supervised training. It cannot recover scenarios where critical view information is completely missing, but it adapts to partial deficiencies through weighted fusion.
Methodology — deep read
The paper begins by defining a deficiency-aware 3D human pose estimation task where input data from synchronized multi-view video streams can suffer from occlusion, noise, and missing frames. The threat model assumes an imperfect multi-camera setup where adversaries do not intentionally corrupt data but real-world conditions cause deficiencies. Ground truth 2D and 3D annotations are available during training to compute fusion weights.
Data input consists of multi-view frames organized as a tensor X ∈ R^{V×T×H×W×C}, where V is view count, T frames per video, H/W frame dimensions, and C color channels. Preprocessing converts images into spatio-temporal feature tensors.
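As a minimal illustration of this tensor layout (the sizes below are assumptions for the sketch, not the paper's settings):

```python
import numpy as np

# Illustrative sizes, not the paper's settings: 4 views, 27 frames, 256x256 RGB.
V, T, H, W, C = 4, 27, 256, 256, 3

# Multi-view input tensor X in R^{V x T x H x W x C}.
X = np.zeros((V, T, H, W, C), dtype=np.float32)
print(X.shape)  # (4, 27, 256, 256, 3)
```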
DeProPose's architecture includes: (1) a Deficiency-Aware Image Encoder based on a Swin Transformer backbone that extracts high-dimensional spatial-temporal features F ∈ R^{V×T×D} per view; (2) a Temporal Encoder that processes F to capture motion dynamics across time for each view; (3) a Positional Encoder that embeds 3D spatial position information from each view's camera ray angles; (4) a Spatial-Temporal Fusion module that merges temporal and positional features into a robust representation F_tp.
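A toy sketch of how these stages compose. The stand-ins here (a moving-average "temporal encoder", a sin/cos ray-angle embedding, additive fusion, and all sizes and angles) are our assumptions for illustration, not the paper's learned modules:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, D = 4, 8, 16  # views, frames, feature dim (illustrative sizes)

# Stand-in for the Swin-based image encoder: per-view, per-frame
# features F in R^{V x T x D}; real features come from the backbone.
F = rng.normal(size=(V, T, D))

def temporal_encode(F):
    # Causal moving average over time per view, standing in for
    # attention over the T axis.
    return np.cumsum(F, axis=1) / np.arange(1, F.shape[1] + 1)[None, :, None]

def positional_encode(angles, T, D):
    # Embed each view's camera ray angle with sin/cos features at
    # multiple frequencies, broadcast over time -> (V, T, D).
    k = np.arange(D // 2)
    pe = np.concatenate([np.sin(np.outer(angles, 1.0 / 10.0 ** k)),
                         np.cos(np.outer(angles, 1.0 / 10.0 ** k))], axis=1)
    return np.repeat(pe[:, None, :], T, axis=1)

angles = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])  # assumed ray angles
# Spatial-temporal fusion: combine temporal and positional features.
Ftp = temporal_encode(F) + positional_encode(angles, T, D)
print(Ftp.shape)  # (4, 8, 16)
```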
The key novel component is an Adaptive Multi-View Feature Fusion Adapter that combines features from all views weighted by coefficients ω_v determined by fused projection and absolute error terms. For each view v, the predicted 3D pose is projected to corresponding 2D pose space to compute a projection error e_proj relative to ground truth 2D keypoints. Absolute error e_abs measures feature difference from annotated 3D pose features. Weights are computed as ω_v = 1/(e_proj + e_abs + ε), normalizing view contributions by reliability.
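A minimal sketch of the weight formula, assuming the per-view errors are already computed and that weights are normalized to sum to one (the explicit normalization and all numbers are illustrative):

```python
import numpy as np

def adaptive_weights(e_proj, e_abs, eps=1e-6):
    """Per-view reliability weights omega_v = 1 / (e_proj + e_abs + eps),
    normalized so contributions sum to one (normalization detail assumed)."""
    w = 1.0 / (np.asarray(e_proj) + np.asarray(e_abs) + eps)
    return w / w.sum()

# Illustrative errors: view index 2 is heavily occluded (large errors).
e_proj = np.array([0.10, 0.12, 0.90, 0.15])
e_abs = np.array([0.05, 0.06, 0.70, 0.08])
w = adaptive_weights(e_proj, e_abs)

# Fuse per-view features (V x D) into one representation.
feats = np.random.default_rng(0).normal(size=(4, 32))
fused = (w[:, None] * feats).sum(axis=0)
# The occluded view receives the smallest weight.
```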
Intermediate features from the Transformer blocks are additionally checked with an intermediate error E_mid, and the total loss combines the projection, absolute, and intermediate error terms to guide end-to-end training. The model thus directly predicts 3D poses from fused multi-view features while dynamically downweighting uncertain or erroneous views.
The training regime uses standard backpropagation to minimize a multi-term loss (L1 to L4), with established training splits on Human3.6M and the new DA-3DPE dataset. DA-3DPE was synthetically generated by adding three noise types (Gaussian, salt-and-pepper, speckle), missing data blocks, and object overlays from Pascal VOC for occlusion scenarios.
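The noise and missing-block corruptions can be sketched as follows. All parameters are assumptions, and the Pascal VOC occlusion overlays are omitted since they require external images:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(img, sigma=0.1):
    # Additive Gaussian noise, clipped back into [0, 1].
    return np.clip(img + rng.normal(0, sigma, img.shape), 0, 1)

def salt_and_pepper(img, p=0.05):
    # Randomly force a fraction p of pixels to pure black or white.
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < p / 2] = 0.0       # pepper
    out[mask > 1 - p / 2] = 1.0   # salt
    return out

def speckle_noise(img, sigma=0.1):
    # Multiplicative noise proportional to pixel intensity.
    return np.clip(img * (1 + rng.normal(0, sigma, img.shape)), 0, 1)

def missing_block(img, size=32):
    # Zero out a random square block to simulate missing data.
    out = img.copy()
    y = rng.integers(0, img.shape[0] - size)
    x = rng.integers(0, img.shape[1] - size)
    out[y:y + size, x:x + size] = 0.0
    return out

img = rng.random((128, 128, 3))
corrupted = missing_block(salt_and_pepper(gaussian_noise(speckle_noise(img))))
```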
Evaluation protocol compares DeProPose to several baselines including two-stage methods and recent fusion transformers using standard mean per joint position error (MPJPE) metrics under multiple deficiency configurations. Ablations isolate benefit of adaptive fusion weights and spatial-temporal encoding. Human3.6M evaluation benchmarks conventional scenarios while DA-3DPE targets deficiency-aware robustness.
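MPJPE, the evaluation metric used throughout, is simply the mean Euclidean distance over joints:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints. pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 0.0, 4.0])  # every joint off by a 3-4-5 offset
print(mpjpe(pred, gt))  # 5.0
```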
The paper provides a GitHub link for code release, promoting reproducibility. However, hyperparameters, random seeds, and hardware setups are not exhaustively specified, and the end-to-end pipeline is described conceptually rather than as a step-by-step implementation.
Overall, the methodology demonstrates a comprehensive design integrating spatio-temporal transformer-based image encoding, camera ray positional embedding, and a principled error-driven adaptive fusion to effectively mitigate deficiency-aware estimation challenges in multi-view 3D pose.
Technical innovations
- Introduction of an adaptive multi-view feature fusion mechanism computing weights based on combined projection and absolute errors, allowing dynamic downweighting of noisy or occluded views.
- End-to-end single-stage 3D human pose estimation from multi-view images using a Swin Transformer backbone with temporal and positional encoding, bypassing error-prone two-stage 2D-to-3D lifting pipelines.
- Creation of the DA-3DPE dataset with controlled deficiency-aware perturbations (noise, missing data, occlusion) to benchmark robustness in multi-view 3D human pose estimation.
- Integration of camera ray angle positional encoding to enhance spatial understanding and mitigate viewpoint variation effects in multi-view fusion.
Datasets
- Human3.6M — 3.6 million images with 3D pose annotations — Public
- DA-3DPE (Deficiency-Aware 3D Pose Estimation) — Size not explicitly reported — Generated by introducing noise, missing data, and occlusion perturbations to Human3.6M or a similar baseline
Baselines vs proposed
- Two-stage methods (traditional 2D pose estimation + 3D lifting): MPJPE under deficiency scenarios is significantly higher than DeProPose's (exact numbers not specified).
- MTF-Transformer and FusionFormer: DeProPose exceeds accuracy and robustness on DA-3DPE, especially with adaptive weights fusion.
- Simple stacking or averaging fusion: baseline methods show degraded performance under occlusion/noise vs DeProPose’s adaptive weighted fusion.
- Effect of removing adaptive fusion weights: model accuracy drops noticeably (quantitative ablation details not fully disclosed).
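The gap between naive averaging and reliability-weighted fusion can be illustrated numerically. Here a synthetic "view" is corrupted with heavy noise, and oracle inverse-noise weights stand in for the learned adaptive weights (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
true_feat = rng.normal(size=32)

# Four views observe the same feature; view index 2 is heavily corrupted.
noise_levels = np.array([0.05, 0.05, 1.0, 0.05])
views = true_feat + rng.normal(size=(4, 32)) * noise_levels[:, None]

# Simple averaging lets the corrupted view leak into the fused feature.
avg = views.mean(axis=0)

# Oracle inverse-noise weights stand in for the learned adaptive weights.
w = 1.0 / (noise_levels + 1e-6)
w /= w.sum()
weighted = (w[:, None] * views).sum(axis=0)

err_avg = np.linalg.norm(avg - true_feat)
err_weighted = np.linalg.norm(weighted - true_feat)
# With these noise levels, the weighted fusion error is far smaller.
```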
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2502.16419.

Fig 1: Illustration of the proposed framework for multi-view 3D human pose estimation.

Fig 2: Comparison of different fusion methods. (a) Mathematical fusion [15], …

Fig 3: Architecture of the proposed multi-view temporal 3D human pose estimation model; the pipeline begins with the Multi-View …
Limitations
- DA-3DPE dataset size and diversity details are limited; synthetic noise and occlusion may not fully capture real-world deficiency complexity.
- Evaluation focuses on multi-view setups with 4 calibrated cameras; generalization to fewer views or uncalibrated systems unclear.
- Method requires ground truth 2D and 3D annotations during training to compute projection and absolute errors; limits applicability in unlabelled scenarios.
- Details on computational cost, inference time, and model size are not reported, which are critical for real-time applications.
- No explicit adversarial robustness evaluation; only natural deficiency disturbances considered.
- Limited ablation on temporal vs spatial encoding contributions to performance; interaction effects remain partially unexplored.
Open questions / follow-ons
- How effective is the adaptive fusion mechanism in scenarios with uncalibrated or fewer camera views?
- Can the method be extended to handle real-time streaming data with variable camera availability and asynchronous frames?
- What are the trade-offs in computational complexity and latency versus accuracy gains in practical deployment?
- How robust is the method to adversarial perturbations or deliberate occlusions beyond naturally occurring deficiencies?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, DeProPose’s methodology highlights the importance of adaptive multi-view fusion to improve robustness when input data is noisy, incomplete, or occluded—analogous to bot traffic where partial or corrupted signals must be integrated. The projection-error weighted fusion mechanism exemplifies a principled way to dynamically estimate view reliability, which could inspire multi-modal bot detection systems combining various telemetry sources by weighting based on confidence metrics. Additionally, the use of spatial-temporal transformers and positional encoding to model dependencies over multiple views may parallel multi-factor behavioral analysis for bot detection under uncertainty.
The DA-3DPE dataset’s focus on deficiency-aware conditions aligns with challenges in bot detection where data deprivation or noise can obscure signals. Developing synthetic deficiency datasets and adaptive fusion could aid CAPTCHA systems in leveraging partial user interaction data without compromising accuracy. However, practical implementation must consider computation constraints and absence of ground truth labels typical in live bot-defense. Overall, the work advances understanding of robustness under deficient input quality, a critical requirement in real-world bot-defense environments.
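As a loose analogue, a bot-risk score could fuse several telemetry sources with inverse-error weights in the spirit of DeProPose's view weighting. All source names and numbers below are hypothetical:

```python
import numpy as np

def fuse_signals(scores, errors, eps=1e-6):
    """Fuse per-source bot-likelihood scores, weighting each source by
    the inverse of its recent calibration error (hypothetical scheme)."""
    w = 1.0 / (np.asarray(errors, dtype=float) + eps)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(scores, dtype=float)))

# Three hypothetical sources: mouse dynamics, TLS fingerprint, timing.
scores = [0.9, 0.2, 0.8]     # per-source bot-likelihood
errors = [0.05, 0.50, 0.10]  # recent validation error per source
risk = fuse_signals(scores, errors)  # the unreliable source barely contributes
```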
Cite
@article{arxiv2502_16419,
  title={DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion},
  author={Jianbin Jiao and Xina Cheng and Kailun Yang and Xiangrong Zhang and Licheng Jiao},
  journal={arXiv preprint arXiv:2502.16419},
  year={2025},
  url={https://arxiv.org/abs/2502.16419}
}