Towards Open World Sound Event Detection
Source: arXiv:2605.03934 · Published 2026-05-05 · By P. H. Hai, L. T. Minh, L. H. Son
TL;DR
This paper argues that standard sound event detection is too closed-world for real deployments: at inference, audio scenes contain novel, unlabeled, and often overlapping events that current detectors neither recognize as unknown nor learn from incrementally. The authors formalize this as Open-World Sound Event Detection (OW-SED), borrowing the open-world learning loop from vision: detect known events, flag unknown ones, have an oracle label some of them, and update the model without catastrophic forgetting.
The main technical response is WOOT, a Transformer-style detector built around a 1D Deformable DETR adaptation for temporal audio. The design combines sparse deformable temporal attention, feature disentanglement into class-specific and class-agnostic query representations, a one-to-many matching stage, and a diversity loss intended to keep query slots from collapsing onto the same known events. On URBAN-SED and DESED, the paper claims the 1D deformable model is competitive in closed-world SED, while WOOT improves open-world behavior over prior baselines; however, the truncated text does not expose the exact numerical deltas or full ablation tables, so those specifics are not recoverable here.
Key findings
- The paper claims to be the first formulation of Open-World Sound Event Detection (OW-SED), extending open-world learning from object detection to audio event detection.
- WOOT uses 1D deformable self-attention and 1D deformable cross-attention to focus on sparse temporal positions instead of dense full-sequence attention, which the authors argue is better suited to overlapping and ambiguous audio events.
- Feature disentanglement splits each query into class-agnostic and class-specific components; the class-agnostic branch drives eventness, while the class-specific branch drives classification.
- The open-world framework adds a one-to-many matching stage so that multiple queries can learn from the same ground-truth event, instead of only one Hungarian-matched query.
- A diversity loss is added in the second training stage to push query features apart and reduce redundancy among slots that otherwise collapse onto the same event.
- Experiments are reported on URBAN-SED and DESED, with the authors stating that their method is marginally better than leading closed-world techniques and significantly better than existing open-world baselines; exact metric values are not visible in the provided text.
- The model is built on PROB-style eventness modeling with a multivariate Gaussian over queries and combines classification and eventness scores multiplicatively as p(l|q)=f_cls(q)·f_event(q).
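The PROB-style scoring in the last point can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function names, the Gaussian parameters, and the `exp(-d²/2)` mapping from Mahalanobis distance to a score are assumptions, since the excerpt only states that eventness comes from a Gaussian over query embeddings and is multiplied into the class score.

```python
import numpy as np

def eventness(q, mu, cov_inv):
    """Eventness from the Mahalanobis distance of a query embedding q to a
    learned Gaussian N(mu, Sigma) over query embeddings. The exp(-d^2/2)
    mapping to (0, 1] is an assumption; the paper's exact mapping is not
    visible in the excerpt."""
    diff = q - mu
    m2 = float(diff @ cov_inv @ diff)   # squared Mahalanobis distance
    return np.exp(-0.5 * m2)

def final_probs(cls_probs, q, mu, cov_inv):
    # Decoupled scoring: p(l|q) = f_cls(q) * f_event(q)
    return cls_probs * eventness(q, mu, cov_inv)
```

A query sitting at the Gaussian mean gets eventness 1 and keeps its class probabilities unchanged; queries far from the learned distribution are suppressed regardless of how confident the classifier is.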
Threat model
The adversary is not an active attacker but the open-world environment itself: at deployment, audio streams contain previously unseen acoustic events, overlapping events, and ambiguous boundaries that the model has never been trained on. The model knows only the current set of labeled classes K_t, may retain only a small replay buffer of old data, and must detect unknowns as class 0 so that a human oracle can label some of them later. What it cannot assume is that all future classes are known in advance or that retraining from scratch on the full history is feasible.
Methodology — deep read
The threat/problem model is not adversarial in the cybersecurity sense; the paper’s core assumption is a deployed SED system that sees an evolving acoustic environment where some future events are not in the training ontology. In OW-SED, the model must recognize known classes, assign an unknown label to unfamiliar events, and support later incremental updates when a human oracle labels some unknowns. The source text also explicitly assumes constrained replay: only a small subset of historical data can be retained, so the updated model must avoid catastrophic forgetting under memory/privacy/compute limits.
Data-wise, the paper names two datasets: URBAN-SED and DESED. The provided excerpt does not include exact sample counts, class counts, train/val/test splits, or preprocessing settings beyond the standard audio pipeline described in the method: input audio is converted to a Mel-spectrogram X ∈ R^{1×T0×F0}. The backbone is ResNet-50, which processes the spectrogram into a feature tensor f ∈ R^{C×T×F}; a 1×1 convolution maps channels to the transformer embedding size d, then the tensor is reshaped to a 1D temporal sequence X_S ∈ R^{T×D} with D = d×F. The positional encoding is a one-dimensional sinusoidal code based only on the time index t, broadcast across frequency bins. The excerpt does not report augmentation, window length, hop size, optimizer, batch size, training epochs, random seeds, or hardware.
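The shape bookkeeping in this pipeline is easy to get wrong, so here is a minimal numpy walk-through of the reshape and positional encoding. All sizes are illustrative stand-ins (the paper does not report them in the excerpt), and the linear map `W` stands in for the 1×1 convolution:

```python
import numpy as np

C, T, F, d = 2048, 50, 4, 256            # example sizes only, not from the paper
feat = np.random.randn(C, T, F)          # backbone output f in R^{C x T x F}

# A 1x1 convolution over channels is a per-position linear map C -> d
W = np.random.randn(d, C) / np.sqrt(C)
proj = np.einsum('dc,ctf->dtf', W, feat)             # (d, T, F)

# Reshape to a 1D temporal sequence X_S in R^{T x D}, with D = d * F
D = d * F
X_S = proj.transpose(1, 0, 2).reshape(T, D)

# 1D sinusoidal positional code over the time index only,
# broadcast across all D features at each time step
pos = np.arange(T)[:, None]
idx = np.arange(D)[None, :]
angles = pos / np.power(10000.0, (2 * (idx // 2)) / D)
pe = np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))
X_S = X_S + pe
```

The key point is that frequency is folded into the embedding dimension, so the transformer sees a purely temporal sequence and only time positions need encoding.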
Architecturally, WOOT is framed as a 1D Deformable DETR adapted to audio. The encoder replaces dense self-attention with 1D Deformable Self-Attention (1D-DSA): each query attends to a small set of sampled temporal positions around a reference time rather than all positions. The sampling offsets and attention weights are learned from the query vector, and fractional positions are handled by interpolation. The decoder uses learned event queries and 1D Deformable Cross-Attention (1D-DCA) to refine them against encoder features. Prediction heads output (i) temporal boundaries via an MLP over center and duration, (ii) class probabilities via softmax, and (iii) eventness via a Gaussian-based score. The paper’s baseline framework is PROB: class prediction and eventness are decoupled, with eventness estimated from Mahalanobis distance to a learned Gaussian N(µ, Σ) over query embeddings. The final probability is the product of class score and eventness score.
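The sparse sampling at the heart of 1D-DSA/1D-DCA can be sketched as follows. This is a simplified single-head version under stated assumptions: in the real model the offsets and attention weights are produced by linear projections of the query vector, whereas here they are passed in directly, and the interpolation scheme is assumed to be linear:

```python
import numpy as np

def deformable_attn_1d(feats, ref_t, offsets, weights):
    """Sparse 1D attention: sample encoder features at a few fractional time
    positions around the reference time ref_t and mix them with softmaxed
    weights. In the actual model, offsets and weights are predicted from the
    query vector; here they are explicit arguments for clarity."""
    T, D = feats.shape
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()                                  # softmax over sample points
    out = np.zeros(D)
    for off, wi in zip(offsets, w):
        t = float(np.clip(ref_t + off, 0, T - 1))    # fractional sample position
        lo = int(np.floor(t))
        hi = min(lo + 1, T - 1)
        frac = t - lo
        out += wi * ((1 - frac) * feats[lo] + frac * feats[hi])  # linear interp
    return out
```

Cost scales with the handful of sampled points per query instead of the full sequence length, which is the efficiency argument for deformable attention over dense self-attention.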
The main novelty for open-world behavior is feature disentanglement. For each query q, a disentanglement block f_dis produces q_agn = f_dis(q), interpreted as the class-agnostic component. The class-specific feature is then q_spec = q - q_agn. The agnostic feature is used for eventness, while the specific feature is used for classification; the original q is retained for localization. A cosine-similarity-style disentangle loss L_dis penalizes overlap between q_agn and q_spec. The intended effect is to make unknown-event detection rely on general event presence cues rather than known-class identity, and to reduce interference when new classes are added incrementally.
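The disentanglement step has a simple algebraic core, sketched below. The linear form of `f_dis` is an assumption (the excerpt does not give its architecture), and the squared-cosine penalty is one plausible reading of "cosine-similarity-style disentangle loss":

```python
import numpy as np

def disentangle(q, W_dis):
    # q_agn = f_dis(q); a linear map stands in for the paper's disentanglement
    # block, whose exact form is not in the excerpt
    q_agn = W_dis @ q
    q_spec = q - q_agn          # class-specific residual
    return q_agn, q_spec

def disentangle_loss(q_agn, q_spec, eps=1e-8):
    # Cosine-similarity-style penalty: overlap between the class-agnostic and
    # class-specific components is pushed toward zero
    cos = q_agn @ q_spec / (np.linalg.norm(q_agn) * np.linalg.norm(q_spec) + eps)
    return cos ** 2
```

At the optimum the two components are orthogonal, so eventness (from q_agn) and classification (from q_spec) draw on non-overlapping directions of the query embedding.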
Training is two-stage. First, after standard bipartite matching, the paper relaxes one-to-one assignment by treating some unmatched queries as semi-matched if they satisfy two conditions: predicted class confidence exceeds threshold α and the overlap ratio with a same-class ground-truth segment exceeds threshold β. These semi-matched queries are trained like matched ones, except their localization loss is set to zero. The motivation is specific to SED: unlike object detection, partial temporal coverage can still be a valid event prediction. Second, the paper introduces a diversity loss to discourage redundant query representations after one-to-many matching, because without it many queries can collapse onto the same event and starve the model of capacity for unknown classes. The excerpt states the existence and motivation of this loss, but the exact formula is truncated.
Evaluation is described at a high level only. The paper claims experiments on URBAN-SED and DESED, comparing against strong SED methods in closed-world settings and against open-world baselines in OW-SED settings. The excerpt references Figure 1 for the task definition and Figure 2 for the architecture. It also states that the 1D deformable model is competitive in closed world, while WOOT is better in open world. Because the provided text is truncated before the results section, there are no visible numeric PSDS/F1/unknown-detection scores, no per-dataset tables, no ablation slices, and no statistical significance tests to report.
Reproducibility is limited in the provided material. The architecture and loss definitions are explicit enough to reimplement the core method, but the excerpt includes no code release, pretrained weights, hyperparameters, seed strategy, or exact data processing. Because the text is truncated before the experiments section, it is also unclear whether the authors release a benchmark split for OW-SED or fully specify the evaluation protocol beyond the dataset names and the open-world incremental-learning framing. One concrete example of the paper's matching logic: a query that lands inside a known-class dog-bark interval but is not selected by Hungarian matching can still be included as semi-matched if its class confidence is high and its predicted segment sufficiently overlaps the ground truth. That query then contributes to classification training but not to localization regression, which preserves useful temporal coverage while avoiding over-penalizing partial matches.
Technical innovations
- First formalization of OW-SED, which couples known-class detection, unknown-event flagging, and incremental learning in audio.
- A 1D deformable temporal transformer that adapts Deformable DETR-style sparse attention from 2D vision to 1D sound event timelines.
- Query feature disentanglement into class-specific and class-agnostic components, using the latter for eventness and the former for classification.
- A one-to-many matching rule tailored to SED partial overlaps, plus a diversity loss to prevent query-slot collapse under relaxed assignment.
Datasets
- URBAN-SED — size not specified in provided text — public dataset
- DESED — size not specified in provided text — public dataset
Baselines vs proposed
- PROB: not enough information in the provided text to extract a metric/value comparison.
- Existing leading closed-world techniques: described as marginally outperformed, but no numeric metric is visible in the excerpt.
- Existing open-world baselines: described as significantly outperformed, but no numeric metric is visible in the excerpt.
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.03934.

Fig 1: Introduction to the Open-World Sound Event Detection (OW-SED) task

Fig 2: Illustration of the WOOT model architecture
Limitations
- The excerpt does not expose the actual result tables, so the claimed gains cannot be verified numerically from the provided text.
- No dataset sizes, split protocol, optimizer, batch size, training epochs, or hardware details are visible in the excerpt.
- The open-world benchmark definition appears novel, but the paper text provided does not show whether the evaluation is robust to different class orders, unknown-class compositions, or replay-buffer sizes.
- The one-to-many heuristic depends on thresholds α and β; the excerpt does not show sensitivity analysis for these hyperparameters.
- It is unclear from the provided text whether unknown-event evaluation includes calibration quality or only detection/localization metrics.
- The proposal assumes a human oracle labels discovered unknowns, which is realistic for some workflows but not for fully autonomous deployment.
Open questions / follow-ons
- How sensitive is OW-SED performance to the unknown-class mixture, class-ordering, and the amount of replay memory?
- Would the one-to-many matching and diversity loss still help under heavier polyphony or in non-stationary background noise?
- Can the unknown-event discovery stage be calibrated to rank candidate novel classes for annotation more effectively than a binary unknown flag?
- How does WOOT compare to open-vocabulary audio methods that use text or audio prompts, if those were adapted to the same benchmark?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the useful idea is not audio-specific; it is the open-world control loop. The paper treats “unknown” as a first-class output, then designs training so the detector can later absorb newly discovered categories without wiping out prior knowledge. In bot defense, the analogous problem is adversarial traffic that does not fit the current bot taxonomy: you want the system to surface uncertain or novel behavior patterns, cluster them for analyst review, and fold confirmed patterns into future models while preserving performance on known abuse families.
The technical lesson is that open-world systems often need representation separation and assignment relaxation. Feature disentanglement can help isolate broad “is this suspicious at all?” signals from narrow family-specific signatures, and one-to-many matching is a reminder that forcing a single label assignment can waste useful partial matches. I’d read this as a design pattern for anomaly triage pipelines: separate general anomaly detection from class attribution, allow multiple candidate explanations during learning, and add explicit diversity pressure so the model doesn’t memorize only the easiest known clusters.
Cite
@article{arxiv2605_03934,
  title={Towards Open World Sound Event Detection},
  author={P. H. Hai and L. T. Minh and L. H. Son},
  journal={arXiv preprint arXiv:2605.03934},
  year={2026},
  url={https://arxiv.org/abs/2605.03934}
}