SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

Source: arXiv:2606.09679 · Published 2026-06-08 · By Parthsarthi Rawat

TL;DR

This paper presents enhancements to the FOOTPASS baselines for player-centric ball-action spotting in broadcast soccer video, as part of the SoccerNet 2026 challenge. The task requires detecting which player performs which of eight action classes, and when, under severe class imbalance (a 213:1 ratio of pass to tackle events). Building on three baseline models (TAAD, TAAD+GNN, and TAAD+DST), the author contributes four improvements: gradient checkpointing, which allows full fine-tuning of heavy video backbones within GPU memory limits; fusion of graph neural network (GNN) player-interaction logits into the DST encoder input, which combines tactical context with visual features; square-root frequency class weighting, which mitigates class imbalance without overfitting; and a multi-stage post-processing pipeline comprising per-class logit gating, temporal refinement, jersey re-assignment, and ensemble inference. Together these extensions raise Macro F1 on the test set from 0.493 to 0.548 and reach 0.446 on the challenge evaluation server.

The methodology integrates visual and graph-based player representations and addresses the large inter-class frequency gap, which causes rare classes such as tackles and blocks to be poorly detected. The post-processing curtails false positives on rare classes at the cost of some recall. However, domain shift between the test and challenge sets degrades rare-class performance, because the gating thresholds were fit to the test set. The paper provides a detailed ablation showing roughly equal contributions from GNN fusion and class weighting, with additional gains from post-processing and ensembling. Per-class results highlight the persistent difficulty of long-tail classes in sparse event spotting.

Overall, this work demonstrates a practical approach to improve multi-agent spatio-temporal action spotting in sports broadcast by incorporating player interaction context, carefully balanced training, and tailored post-processing. It also reveals challenges in robustness under dataset shifts common in real-world annotation-scarce settings.

Key findings

  • Gradient checkpointing enabled full backbone fine-tuning at batch size 6 on a single 22 GB GPU, so that all backbone weights could be updated within the memory budget.
  • Fusion of GNN logits with TAAD logits in the DST encoder raised test set Macro F1 from 0.493 to 0.505.
  • Square-root class weighting (w_c = 1/√n_c, where n_c is the number of training events in class c) improved Macro F1 further to 0.521, addressing the 213:1 pass-to-tackle imbalance.
  • Post-processing (per-class logit gating, temporal frame refinement, jersey re-assignment) increased Macro F1 to 0.535.
  • Ensembling GNN-DST (with class weighting) and Base-DST boosted final test Macro F1 to 0.548.
  • On rare classes, tackle F1 fell from 0.256 (test) to 0.056 (challenge), indicating overfitting of logit gating thresholds to test distribution.
  • Post-processing removed 57% of block false positives and 74% of tackle false positives while retaining all true positives on the test set.
  • Tracking limitations caused ~20% of ground truth events to lack bounding boxes, reducing player identification accuracy.
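Macro F1, the metric quoted in the findings above, averages per-class F1 with equal weight, so a class with only a handful of events moves the score as much as the pass class does. The sketch below illustrates this with hypothetical true-positive, false-positive, and false-negative counts that are not taken from the paper. How the benchmark matches predictions to ground-truth events is outside the scope of this sketch.

```python
def f1(tp, fp, fn):
    """Per-class F1 from raw counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores."""
    scores = [f1(*c) for c in per_class_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts, chosen only to show the effect of a rare class.
example_counts = {
    "pass": (900, 50, 50),   # common class, detected well
    "tackle": (1, 3, 3),     # rare class, detected poorly
}

per_class = {name: f1(*c) for name, c in example_counts.items()}
score = macro_f1(example_counts)
print(per_class)
print(f"macro F1 = {score:.3f}")
```

In this toy case the single rare class pulls the average far below the pass score, which is why suppressing a few rare-class false positives can shift Macro F1 noticeably.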

Threat model

The paper has no adversary in the security sense. The closest analogue is the detection setting itself: an automated system must detect player actions accurately from broadcast soccer video despite occlusions, player tracking errors, and highly imbalanced event frequencies. The system cannot eliminate all false positives, especially for rare classes, and cannot rely on perfect player bounding boxes or a uniform data distribution.

Methodology — deep read

The task centers on automatically detecting player-centric actions from broadcast soccer footage, and is made difficult by noisy video data, multi-agent interactions, sparse events, and extreme class imbalance. The goal is to predict (action, player, frame) triplets for eight action classes.

Data comes from the publicly released SoccerNet FOOTPASS dataset, which provides 26 tracked players per frame and annotated action events. The training data contains approximately 37,000 pass events but only 174 tackle events, a severe 213:1 class imbalance. Preprocessing crops player regions via RoIAlign and extracts spatiotemporal visual features with an X3D-S backbone.

The baseline models are a cascade: TAAD extracts per-player action logits from video features; TAAD+GNN applies EdgeConv layers over a graph of players (nodes with spatial, velocity, role, jersey, and visual features) exploiting spatial and tactical context; TAAD+DST employs a Denoising Sequence Transducer transformer model to autoregressively decode spatiotemporally smoothed action events based on the per-player logits. The DST is trained with cross-entropy loss and label smoothing over 750-frame context windows.
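As a rough illustration of the EdgeConv operation mentioned above, the sketch below uses the standard formulation, in which each node takes a channel-wise maximum over messages computed from its own features and the difference to each neighbor. The toy graph, the single dense layer, and the hand-picked weights are assumptions made for readability. The paper's actual node features, neighborhood definition, and layer sizes are not reproduced here.

```python
def linear_relu(vec, weight, bias):
    """One dense layer followed by ReLU: max(0, W @ vec + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weight, bias)]


def edge_conv(features, neighbors, weight, bias):
    """EdgeConv: h_i = max over j in N(i) of MLP([x_i, x_j - x_i])."""
    out = []
    for i, x_i in enumerate(features):
        messages = []
        for j in neighbors[i]:
            diff = [b - a for a, b in zip(x_i, features[j])]
            messages.append(linear_relu(x_i + diff, weight, bias))
        # Channel-wise max aggregates the edge messages for node i.
        out.append([max(channel) for channel in zip(*messages)])
    return out


# Toy graph: 3 players whose only node feature is a 2-D position (hypothetical).
positions = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

# Hand-picked 2x4 weights: each message reduces to ReLU(x_j), easy to check by hand.
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]

updated = edge_conv(positions, neighbors, weight, bias)
print(updated)
```

With these weights each node simply receives the channel-wise maximum of its neighbors' positions. A trained layer would instead learn weights that mix absolute and relative features.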

Four key extensions were implemented: (1) gradient checkpointing enabled memory-efficient full fine-tuning of the X3D backbone on a single 22 GB GPU at a reduced batch size; (2) GNN logits were concatenated with TAAD logits to form a fused 598-dimensional input token for the DST encoder, allowing joint reasoning over the visual and graph-based streams; (3) to counter class imbalance, the action-head cross-entropy loss was weighted by the inverse square root of per-class counts, raising the weight of rare classes by roughly an order of magnitude relative to common ones; (4) a multi-stage post-processing pipeline applied per-class logit gating thresholds, calibrated on the test set, to suppress false positives in rare classes, aligned each event to the peak-logit frame within a window, and reassigned jerseys when confidence margins warranted, followed by per-class non-maximum suppression and ensemble merging of predictions from the GNN-DST and Base-DST models.
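The square-root weighting in extension (3) can be checked with the two class counts quoted in this summary. The sketch below computes raw weights w_c = 1/√n_c. Counts for the other six classes are not given here, so they are left out, and any normalization the author may apply to the weights is not modeled.

```python
import math

# Training-event counts quoted in this summary (pass count is approximate).
class_counts = {"pass": 37_000, "tackle": 174}


def sqrt_inverse_weights(counts):
    """Return the raw weight w_c = 1 / sqrt(n_c) for each class."""
    return {name: 1.0 / math.sqrt(n) for name, n in counts.items()}


weights = sqrt_inverse_weights(class_counts)
count_ratio = class_counts["pass"] / class_counts["tackle"]
weight_ratio = weights["tackle"] / weights["pass"]

print(f"count ratio  pass:tackle = {count_ratio:.1f}")
print(f"weight ratio tackle:pass = {weight_ratio:.1f}")
```

Because the weight ratio is the square root of the count ratio, a 213:1 imbalance becomes a weight gap of roughly 14.6 between these two classes rather than the 213 that plain inverse-frequency weighting would give. That compression is the sense in which the scheme boosts rare classes without letting them dominate the loss.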

Models were trained from scratch for 20 epochs. The backbone was frozen for the first 2 epochs and then fine-tuned with the AdamW optimizer, using a learning rate of 5e-5 for the backbone and 1e-3 for the head layers, with a 50-step warmup. Evaluation used Macro F1, averaged equally over all 8 classes, on the test and challenge splits, which emphasizes robustness on rare classes. Ablations were conducted incrementally and showed a gain from each added component. Per-class results and failure modes were analyzed, revealing significant degradation on rare classes under distribution shift.
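The schedule described above can be sketched as a per-group learning-rate function. The two base rates, the 50-step warmup, and the 2-epoch freeze come from this summary. The linear warmup shape, the constant rate afterwards, and the use of a zero rate to model a frozen backbone are assumptions of this sketch (in a real framework, freezing would normally be done by disabling gradients).

```python
BASE_LRS = {"backbone": 5e-5, "head": 1e-3}  # values reported in the summary
WARMUP_STEPS = 50                            # reported in the summary
FREEZE_EPOCHS = 2                            # reported in the summary


def lr_at(step, base_lr, warmup_steps=WARMUP_STEPS):
    """Assumed linear warmup to base_lr over warmup_steps, then constant."""
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale


def group_lrs(step, epoch, freeze_epochs=FREEZE_EPOCHS):
    """Learning rate per parameter group at a given global step and epoch."""
    lrs = {name: lr_at(step, lr) for name, lr in BASE_LRS.items()}
    if epoch < freeze_epochs:
        lrs["backbone"] = 0.0  # stand-in for a frozen backbone
    return lrs


print(group_lrs(step=0, epoch=0))     # start of training, backbone frozen
print(group_lrs(step=49, epoch=0))    # warmup finished, backbone still frozen
print(group_lrs(step=5000, epoch=2))  # backbone unfrozen at its own base rate
```

Whether the warmup restarts when the backbone is unfrozen is not stated in this summary; the sketch uses a single global warmup.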

No public code or frozen weights were referenced, but training and inference details are extensively provided. Post-processing does not require retraining and is applied solely to DST raw JSON outputs.

Technical innovations

  • Use of gradient checkpointing on the X3D backbone to enable full fine-tuning within GPU memory constraints, trading extra recomputation for lower activation memory.
  • Fusion of GNN-generated player-interaction logits directly into the DST transformer's encoder, allowing joint spatial-visual context modeling.
  • Application of square-root frequency class weighting to balance a severe 213:1 class imbalance without overfitting rare classes.
  • A multi-stage post-processing pipeline with per-class logit gating, temporal frame refinement, and jersey reassignment to reduce false positives on rare classes.
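The post-processing stages listed above can be sketched on a list of event dictionaries. The event layout, the threshold value, the window radius, and the suppression gap below are all hypothetical. Jersey reassignment and ensemble merging are left out, so this is an illustration of gating, frame refinement, and per-class suppression under stated assumptions rather than the author's pipeline.

```python
def gate(events, thresholds, default=0.0):
    """Keep an event only if its logit clears its class threshold."""
    return [e for e in events
            if e["logit"] >= thresholds.get(e["action"], default)]


def refine_frame(event, logits_by_frame, radius):
    """Move the event to the highest-logit frame within +/- radius frames."""
    lo, hi = event["frame"] - radius, event["frame"] + radius
    window = {f: v for f, v in logits_by_frame.items() if lo <= f <= hi}
    if not window:
        return event
    best = max(window, key=window.get)
    return {**event, "frame": best, "logit": window[best]}


def nms(events, min_gap):
    """Per-class temporal NMS: strongest first, drop same-class events closer than min_gap."""
    kept = []
    for e in sorted(events, key=lambda ev: ev["logit"], reverse=True):
        if all(k["action"] != e["action"]
               or abs(k["frame"] - e["frame"]) >= min_gap for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda ev: ev["frame"])


# Hypothetical raw predictions.
events = [
    {"action": "pass", "player": 7, "frame": 100, "logit": 2.0},
    {"action": "pass", "player": 7, "frame": 103, "logit": 1.5},
    {"action": "tackle", "player": 4, "frame": 200, "logit": 0.8},
    {"action": "tackle", "player": 5, "frame": 300, "logit": 3.1},
]
thresholds = {"tackle": 2.5}  # hypothetical rare-class gate

final = nms(gate(events, thresholds), min_gap=10)
refined = refine_frame(events[0], {98: 1.0, 101: 2.4, 120: 9.0}, radius=3)
print(final)
print(refined)
```

Here the weak tackle at frame 200 is gated out, the duplicate pass at frame 103 is suppressed, and the refinement example moves an event from frame 100 to frame 101 while ignoring the out-of-window peak at frame 120. Thresholds fit on one split can over-prune on another, which is the failure the paper reports on the challenge set.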

Datasets

  • SoccerNet FOOTPASS — ~37,000 pass events, 174 tackle events — publicly released multi-modal multiclass soccer action spotting dataset

Baselines vs proposed

  • TAAD+DST baseline: Macro F1 = 0.493 (test)
  • GNN logit fusion: Macro F1 = 0.505 vs baseline 0.493
  • Square-root class weighting: Macro F1 = 0.521 vs 0.505
  • With post-processing (logit gating + frame refinement): Macro F1 = 0.535 vs 0.521
  • Final ensemble and gate tuning: Macro F1 = 0.548 vs 0.535
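The step-by-step gains implied by the figures above can be tabulated directly. The sketch below uses only the test-set Macro F1 values reported in this summary; the shortened step labels are paraphrases.

```python
# Test-set Macro F1 after each cumulative step, as reported in this summary.
steps = [
    ("TAAD+DST baseline", 0.493),
    ("+ GNN logit fusion", 0.505),
    ("+ square-root class weighting", 0.521),
    ("+ post-processing", 0.535),
    ("+ ensemble and gate tuning", 0.548),
]


def incremental_gains(rows):
    """Return (name, gain over the previous row), rounded to 3 decimals."""
    return [(name, round(score - prev, 3))
            for (_, prev), (name, score) in zip(rows, rows[1:])]


gains = incremental_gains(steps)
total = round(steps[-1][1] - steps[0][1], 3)

for name, gain in gains:
    print(f"{name:32s} {gain:+.3f}")
print(f"{'total':32s} {total:+.3f}")
```

Each step contributes between 0.012 and 0.016, for a total of 0.055, which matches the summary's description of roughly comparable contributions from the individual components.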

Limitations

  • Severe performance degradation on rare classes (tackle F1 dropped from 0.256 test to 0.056 challenge) due to overfitting of logit gating thresholds.
  • Tracking failures cause ~20% of ground truth events to lack player bounding boxes, reducing player identification accuracy.
  • Post-processing thresholds calibrated on test set did not transfer well to challenge set, leading to precision collapse on rare classes.
  • The model ensemble excluded the Base-DST tackle predictions because they added false positives without true positives.
  • The paper does not evaluate robustness under adversarial or significantly shifted video conditions beyond the test versus challenge split.
  • No public release of code or pretrained weights, limiting reproducibility.

Open questions / follow-ons

  • How to improve rare class precision and recall robustness under distribution shift without overfitting gating thresholds?
  • What alternative loss or sampling strategies (e.g., focal loss) best mitigate long-tail class imbalance in multi-agent spatio-temporal detection?
  • Can incorporating more robust tracking or multi-view data reduce missing bounding boxes and improve player identity assignment?
  • How might domain adaptation methods improve generalization from test to challenge sets in sports action spotting?

Why it matters for bot defense

This work is relevant to bot-defense or CAPTCHA practitioners concerned with robust multi-agent temporal event detection under severe class imbalance and scarce positive examples. The approach of fusing contextual graph-based features with visual cues via a transformer decoder can inspire architectures detecting coordinated behaviors or rare events in complex scenes. The class weighting and post-processing pipeline illustrate pragmatic means to balance precision and recall without retraining.

Moreover, the sensitivity of gating thresholds and classifier calibration to distribution shifts highlights a common challenge in real-world detection systems: tuning models to maintain stable false positive rates amidst imbalanced classes and domain changes. Similar ideas could be adapted for behavioral biometrics or interactive human verification tasks where rare but critical signals must be detected reliably. Overall, the methodology underscores the importance of integrated spatio-temporal and relational modeling alongside tailored post-processing to improve multi-agent action recognition systems.

Cite

bibtex
@article{arxiv2606_09679,
  title={SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines},
  author={Parthsarthi Rawat},
  journal={arXiv preprint arXiv:2606.09679},
  year={2026},
  url={https://arxiv.org/abs/2606.09679}
}

Articles are CC BY 4.0 — feel free to quote with attribution