SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

Source: arXiv:2605.31551 · Published 2026-05-29 · By Parthsarthi Rawat

TL;DR

This paper addresses the challenge of estimating accurate 3D world-space poses of multiple soccer players from broadcast video, a task made difficult by the wide camera angles, partial occlusions, and zooming typical of broadcast footage. The author proposes SMART, a method that fine-tunes the SMPLest-X body mesh recovery model on a stratified clip split of the WorldPose dataset with multi-task supervision that incorporates depth and 2D keypoint losses, as well as broadcast-specific augmentations. The adapted model recovers a camera-relative skeleton and pelvis depth for each player. To place poses in a global frame, SMART combines this model with a RAFT optical-flow camera pose tracker that improves on prior Lucas-Kanade methods in this setting. Foot-plane anchoring and two-pass temporal smoothing further refine global pose placement. On the FIFA Skeletal Tracking Challenge 2026, SMART improves the competition score by 38.6% over the baseline on validation (1.053 to 0.647; lower is better) and generalizes well to the held-out test set (0.593). Both global and local MPJPE drop substantially, and global error remains the larger share of the score, as expected given its sensitivity to camera pose.

Key findings

SMART reduces competition score on validation set from 1.053 (baseline) to 0.647, a 38.6% improvement.
On test set, SMART achieves score 0.593 with Global MPJPE 0.324 m and Local MPJPE 0.054 m.
Foot-plane anchoring reduces global MPJPE by 44 mm and local MPJPE by 12 mm compared to naive fine-tuning.
Multi-task depth supervision dominates improvements, lowering local MPJPE from 0.067 m to 0.055 m.
RAFT-small with MAD outlier filtering reduces camera rotation error to 0.041° per frame, versus 0.043° for the Lucas-Kanade baseline.
Naive fine-tuning improves global MPJPE from 0.522 m to 0.425 m but worsens local MPJPE from 0.065 m to 0.079 m due to viewpoint overfitting.
Temporal refinement network reduces local MPJPE by 22% on validation but degrades competition score due to oversmoothing.
Global MPJPE accounts for ~55% of total score; camera tracking is the main bottleneck.

Threat model

The system assumes a naturalistic adversary arising from the challenges of broadcast soccer video: varied camera viewpoints, zoom levels, occlusions, and typical broadcast noise and artifacts. The adversary cannot manipulate metadata such as camera intrinsics, bounding boxes, or the initial camera pose, but it introduces realistic appearance and motion variations that complicate accurate pose estimation. The goal is robustness against these environmental visual challenges rather than against malicious attacks or spoofing.

Methodology — deep read

Threat Model & Assumptions: The adversary here can be understood as the noise and variability in broadcast soccer video sources, including wide camera angles, zoom, partial occlusions of players, varying lighting conditions, and broadcast distortions. The method assumes access to per-frame player bounding boxes; camera intrinsics and distortion parameters; and an initial camera pose. The adversary cannot alter these inputs but introduces natural visual complexity. The goal is to produce accurate 3D joint poses in a consistent world frame.
Data: The primary evaluation data is 20 FIFA World Cup 2022 broadcast sequences at 50 fps and 1920x1080 resolution, split into 6 validation and 14 test clips. The provided metadata includes bounding boxes, camera intrinsics (Kt), distortion parameters (kt), and initial camera poses (R0, t0). For training, the WorldPose dataset is used; it contains 89 broadcast soccer clips with pseudo-ground-truth 3D poses. A stratified split (70 train, 19 val) is made along clip boundaries to avoid temporal overlap and to match camera height and view angles. No competition frames are used during training, to avoid leakage.
Architecture / Algorithm: The core body recovery module is SMPLest-X (ViT-H backbone, 687M parameters), which regresses SMPL-X body shape and pose parameters from expanded player crops (512x384 px) and outputs 10,475-vertex meshes. FIFA-15 skeleton joints are derived by a fixed vertex lookup. The model is fine-tuned with a multi-task loss: weighted 3D MPJPE (extremities weighted 3x), 2D keypoint reprojection error, and an L1 loss on pelvis depth, which is crucial for global placement. For camera pose tracking, RAFT-small optical flow is computed on broadcast frames inside the convex hull of 714 pitch landmarks and filtered with Median Absolute Deviation (MAD) to reject flow caused by player motion; a homography is then fitted with RANSAC and decomposed into a per-frame rotation and translation. A camera pose is rejected if the rotation change exceeds 60°. Pitch-plane anchoring back-projects the lowest foot pixel in the bounding box onto the pitch plane to align the feet in world space. Final skeleton placement is refined by a two-stage L-BFGS optimization that minimizes joint reprojection error and aligns the skeleton centroid with the bounding box center. Temporal smoothing uses two passes: (1) smoothing and outlier correction of the global root trajectory, and (2) independent Gaussian smoothing of the root-relative joints.
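The foot-plane anchoring step is a ray-plane intersection. The following is a minimal sketch, not the paper's implementation. It assumes a pinhole camera with the convention x_cam = R · x_world + t, a pitch plane at world z = 0, and a pixel that has already been undistorted; the function name `backproject_to_pitch` and the example camera are hypothetical.

```python
import numpy as np

def backproject_to_pitch(pixel, K, R, t):
    """Intersect the viewing ray through `pixel` with the world plane z = 0.

    Assumptions (not from the paper): x_cam = R @ x_world + t, the pixel is
    already undistorted, and the pitch is the plane z = 0 in world space.
    Returns a 3D world point, or None if the ray misses the plane.
    """
    u, v = pixel
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, camera frame
    ray_world = R.T @ ray_cam                            # same ray, world frame
    cam_center = -R.T @ t                                # camera centre in world frame
    if abs(ray_world[2]) < 1e-9:
        return None                                      # ray parallel to the pitch
    s = -cam_center[2] / ray_world[2]                    # solve cam_center.z + s * ray.z = 0
    if s <= 0:
        return None                                      # intersection behind the camera
    return cam_center + s * ray_world

# Hypothetical camera 10 m above the pitch origin, looking straight down.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])       # 180 degree rotation about the x axis
t = np.array([0.0, 0.0, 10.0])       # equals -R @ camera_centre
foot_world = backproject_to_pitch((1060.0, 540.0), K, R, t)
```

With this toy camera, a pixel 100 px to the right of the principal point lands 1 m along the world x axis. A real pipeline would also have to undo lens distortion using the provided distortion parameters before casting the ray.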
Training Regime: Only the SMPLest-X decoder and prediction heads are fine-tuned; the ViT-H backbone is frozen. Training runs for 15 epochs with batch size 16 on one 24 GB A10G GPU, using the Adam optimizer at LR=2e-6 with Automatic Mixed Precision (AMP). Loss weights are λ1=1.0 (3D), λ2=0.1 (2D), and λ3=0.5 (depth), with certain wrist and foot joints weighted 3x. Broadcast augmentations include random crop, horizontal flip with symmetric joint permutation, and color jitter.
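To make the loss weighting concrete, here is a hedged NumPy sketch of how the three terms might be combined for one player. Only the λ values and the 3x extremity weight come from the summary; the normalization by the weight sum, the plain Euclidean 2D term, and the helper names are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

# Loss weights as reported in the summary.
LAMBDA_3D, LAMBDA_2D, LAMBDA_DEPTH = 1.0, 0.1, 0.5

def make_joint_weights(num_joints, extremity_idx, factor=3.0):
    """Per-joint weights: 1.0 everywhere, `factor` on the chosen extremities.

    `extremity_idx` is hypothetical; the summary does not list the joint indices.
    """
    weights = np.ones(num_joints)
    weights[list(extremity_idx)] = factor
    return weights

def multitask_loss(pred_3d, gt_3d, pred_2d, gt_2d, pred_depth, gt_depth, joint_weights):
    """Weighted 3D MPJPE + 2D keypoint error + L1 pelvis depth for one sample."""
    per_joint_3d = np.linalg.norm(pred_3d - gt_3d, axis=-1)        # (J,) 3D joint errors
    loss_3d = np.sum(joint_weights * per_joint_3d) / np.sum(joint_weights)
    loss_2d = np.mean(np.linalg.norm(pred_2d - gt_2d, axis=-1))    # mean 2D keypoint error
    loss_depth = abs(pred_depth - gt_depth)                        # L1 on pelvis depth
    return LAMBDA_3D * loss_3d + LAMBDA_2D * loss_2d + LAMBDA_DEPTH * loss_depth
```

In an actual training loop the same arithmetic would be written with differentiable tensor operations and averaged over the batch.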
Evaluation Protocol: The main metric is the competition score L = Global MPJPE + 5 × Local MPJPE, measured in meters (lower is better); the 5x weight emphasizes local pose quality. Evaluation is done on the held-out FIFA validation and test sets. Ablations cover incremental fine-tuning modifications, camera tracker comparisons (rotation error against ground truth), and the effect of temporal smoothing. Camera tracking is verified against ground-truth camera poses on the WorldPose validation split (19 clips).
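The score arithmetic can be checked directly from the reported numbers. The helper names below are illustrative; only the formula and the input values come from the summary.

```python
def competition_score(global_mpjpe_m, local_mpjpe_m, local_weight=5.0):
    """Competition score: Global MPJPE + 5 x Local MPJPE, both in meters."""
    return global_mpjpe_m + local_weight * local_mpjpe_m

def global_share(global_mpjpe_m, local_mpjpe_m, local_weight=5.0):
    """Fraction of the score contributed by the global term."""
    return global_mpjpe_m / competition_score(global_mpjpe_m, local_mpjpe_m, local_weight)

def relative_improvement(baseline, new):
    """Relative reduction of a lower-is-better score."""
    return (baseline - new) / baseline

test_score = competition_score(0.324, 0.054)        # reported test-set errors
val_gain = relative_improvement(1.053, 0.647)       # baseline vs SMART on validation
test_global_share = global_share(0.324, 0.054)
```

Recomputing from the rounded test-set errors gives about 0.594 rather than the reported 0.593, a gap that is presumably due to rounding of the individual terms. The validation gain works out to about 38.6%, and the global term contributes roughly 55% of the test score, consistent with the key findings.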
Reproducibility: The paper does not explicitly mention a release of open-source code or pretrained weights. The data consists of the publicly available WorldPose dataset and the FIFA Challenge data (likely restricted). The methodology is described in enough detail to reproduce given access to the same datasets and baseline models. As a concrete example, processing one broadcast frame runs as follows: detect player bounding box → crop and resize → predict SMPL-X parameters → extract pelvis depth → propagate camera pose via RAFT optical flow from the initial (R0, t0) → back-project the foot pixel ray to the pitch plane → refine skeleton placement by optimization → apply temporal smoothing over frames.

Technical innovations

Domain-adaptive fine-tuning of SMPLest-X on broadcast soccer video via stratified clip splits and multi-task supervision including pelvis depth losses.
Use of RAFT-small dense optical flow with MAD outlier filtering for robust per-frame camera rotation and translation estimation, outperforming Lucas-Kanade sparse flow.
Foot-plane anchoring by back-projecting the lowest foot pixel to known pitch plane to constrain absolute player height and improve global and local pose accuracy.
Two-pass temporal smoothing separating global root trajectory cleaning via outlier rejection and Gaussian filtering from independent smoothing of root-relative joint offsets.
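The two-pass smoothing idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the running-median reference, the MAD-style threshold `k`, the linear interpolation over flagged frames, the kernel widths, and the root joint index are all hypothetical choices.

```python
import numpy as np

def gaussian_smooth(x, sigma):
    """Smooth along axis 0 with a normalized Gaussian kernel and edge padding."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    flat = x.reshape(x.shape[0], -1)
    padded = np.pad(flat, ((radius, radius), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, c], kernel, mode="valid") for c in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape(x.shape)

def running_median(x, radius):
    """Per-coordinate running median over a window of 2 * radius + 1 frames."""
    padded = np.pad(x, ((radius, radius), (0, 0)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * radius + 1, axis=0)
    return np.median(windows, axis=-1)

def two_pass_smooth(joints_world, root_idx=0, sigma_root=2.0, sigma_local=1.0,
                    median_radius=2, k=3.0):
    """joints_world: (T, J, 3) world-space joints for one player."""
    root = joints_world[:, root_idx].copy()              # (T, 3) global root trajectory
    local = joints_world - root[:, None, :]              # root-relative joint offsets

    # Pass 1: flag root outliers against a running median, interpolate, then smooth.
    dev = np.linalg.norm(root - running_median(root, median_radius), axis=1)
    med = np.median(dev)
    mad = np.median(np.abs(dev - med))
    bad = dev > med + k * (1.4826 * mad + 1e-6)
    frames = np.arange(len(root))
    good = frames[~bad]
    for c in range(3):
        root[bad, c] = np.interp(frames[bad], good, root[good, c])
    root = gaussian_smooth(root, sigma_root)

    # Pass 2: smooth the root-relative offsets independently of the root.
    local = gaussian_smooth(local, sigma_local)
    return root[:, None, :] + local
```

Separating the passes lets the root trajectory take a wider kernel and outlier handling without blurring limb motion, which is consistent with the summary's observation that heavier learned smoothing hurt fast actions.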

Datasets

FIFA World Cup 2022 Broadcast — 20 sequences (6 val, 14 test) — FIFA Challenge data with bounding boxes, intrinsics
WorldPose — 89 clips — Public broadcast soccer dataset with pseudo-ground-truth 3D poses

Baselines vs proposed

FIFA Baseline (SAM-3D): Score = 1.053 vs SMART (val): 0.647
SAM-3D + Placement MLP: Score = 0.864 vs SMART (val): 0.647
SMPLest-X pretrained (no adaptation): Score = 0.846 vs SMART (val): 0.647
Naive fine-tuning SMPLest-X: Score = 0.820 vs SMART (val): 0.647
Foot-plane anchoring only: Score = 0.714 vs SMART full training: 0.647
Camera Tracking Rotation Error: Lucas-Kanade = 0.043° vs RAFT-small+MAD = 0.041°
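For context on the "RAFT-small+MAD" entry, the snippet below sketches one way to apply Median Absolute Deviation filtering to dense flow vectors before homography fitting. It is a simplification under assumptions: it compares each vector with the median flow, which treats camera-induced flow as roughly uniform over the sampled region, and the threshold `k` and the function name are hypothetical.

```python
import numpy as np

def mad_inlier_mask(flow, k=3.0):
    """Return a boolean mask of flow vectors consistent with the dominant motion.

    flow: (N, 2) array of per-pixel flow vectors sampled on the pitch.
    A vector is kept when its distance to the median flow is within
    k scaled MADs of the median distance (k is an assumed threshold).
    """
    median_flow = np.median(flow, axis=0)
    residual = np.linalg.norm(flow - median_flow, axis=1)
    med_res = np.median(residual)
    mad = np.median(np.abs(residual - med_res))
    scale = 1.4826 * mad + 1e-6          # 1.4826 makes MAD comparable to a std. dev.
    return residual <= med_res + k * scale

# Toy example: 20 background vectors near (2, 0) px and 3 player vectors.
background = [[2.0 + 0.01 * (i % 5 - 2), 0.0] for i in range(20)]
players = [[15.0, -8.0]] * 3
flow = np.array(background + players)
mask = mad_inlier_mask(flow)
```

In this toy case the 20 background vectors are kept and the 3 player vectors are rejected; the surviving correspondences would then feed the RANSAC homography fit described in the methodology.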

Limitations

Diving and airborne poses are underrepresented in training data; foot-plane anchoring fails when players are off the ground.
SMPLest-X model is per-frame only and struggles with occlusions; temporal body models might improve robustness.
High computational cost due to the large 687M-parameter ViT-H backbone, limiting inference speed and scalability.
Camera pose tracking can suffer during aggressive zoom and unstable shots causing pitch-plane deformation, raising global MPJPE.
Temporal smoothing with learned refiner over-smooths fast actions, degrading performance compared to simple Gaussian smoothing.
No adversarial evaluation against intentional visual perturbations or synthetic occlusions reported.

Open questions / follow-ons

How would a temporally-aware body mesh estimator/model improve pose estimation under heavy occlusion conditions?
Can lighter-weight backbones or distillation techniques match the accuracy of SMPLest-X ViT-H at lower computational cost in broadcast settings?
Can the camera tracking pipeline be improved to better handle aggressive zoom-induced pitch-plane deformation?
How effective would an end-to-end approach that jointly optimizes body mesh recovery and camera pose estimation be?

Why it matters for bot defense

While this work focuses on 3D pose estimation for soccer players, several elements are relevant for bot-defense and CAPTCHA practitioners focused on human motion analysis. Accurate global and local pose estimation in cluttered, occluded scenes with moving cameras is analogous to challenges faced in robust activity recognition or user gesture verification. The extensive camera pose tracking and world-space anchoring strategies demonstrate practical steps to achieve consistency across frames and viewpoints, which could help build stronger bot defenses relying on natural human motion cues rather than superficial keypoints. The emphasis on temporal smoothing to reduce jitter while preserving fast motions also informs how to balance temporal coherence with responsiveness.

However, the approach depends on significant domain-specific metadata inputs (camera intrinsics, initial pose, player tracking) which may limit direct applicability to unconstrained real-world CAPTCHA scenarios. The high computational cost of the base model suggests that simpler, more efficient architectures might be needed for real-time or embedded testing environments. Overall, the techniques highlight state-of-the-art tradeoffs in 3D human pose recovery in complex video streams, valuable for building and evaluating motion-based bot detection methods.

Cite

@article{arxiv2605_31551,
  title={SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation},
  author={Parthsarthi Rawat},
  journal={arXiv preprint arXiv:2605.31551},
  year={2026},
  url={https://arxiv.org/abs/2605.31551}
}
