Rapid data quality investigations of gravitational-wave events with the Data Quality Report Builder toolkit

Source: arXiv:2605.16183 · Published 2026-05-15 · By Derek Davis, Zach Yarbrough, Joseph Areeda, Ronaldas Macas, Nicolas Arnaud, Adrian Helmling-Cornell et al.

TL;DR

The paper introduces the Data Quality Report Builder (DQRbuild) toolkit, which automates and accelerates the vetting of gravitational-wave (GW) event candidates by checking for data quality (DQ) issues in multiple detectors of the LIGO-Virgo-KAGRA network. The toolkit combines statistical noise analyses, glitch identification, and environmental noise projections to replace a manual event validation process that has historically been time-consuming. Using all significant candidates from the third observing run (O3), the authors show that DQRbuild identifies 96% of the data quality problems previously flagged by human experts, at the cost of a 24% false alarm rate.

This work responds to increasing GW event rates and the need for rapid event vetting to support low-latency follow-ups. DQRbuild supports distributed workflows across multiple detector sites, orchestrated by flexible configuration, alert listening, database tracking, and result reporting components. The paper details the statistical and binary tasks used, including HVeto for transient noise correlations, stationarity checks, and environmental noise projections. The toolkit was extensively deployed during the ongoing fourth observing run (O4) to validate candidate events. The authors also reflect on future challenges towards fully automated vetting as GW detections scale up.

Key findings

DQRbuild can identify 96% of data quality problems found manually during O3.
The false alarm rate for DQ issues flagged by DQRbuild is approximately 24%.
Typical latency achievable by DQRbuild workflows is approximately 5 minutes, matching O3 retraction timelines.
Statistical tasks report interpretable p-values to quantify the chance of misidentifying DQ issues.
DQRbuild was deployed across multiple LVK nodes (LIGO Hanford, LIGO Livingston, Virgo, and a central node), supporting concurrent multi-detector analysis.
The HVeto task identified a key noise correlation responsible for the retraction of candidate S191011af within 5 minutes of data availability.
Glitches with SNR > 6.5 occurred about once per minute during O3, giving a ~12% chance that a glitch overlaps a typical 8-second analysis window in a single detector and a ~32% chance across a three-detector network.
25% of O3 GW events with false alarm rate <1/yr had associated DQ issues impacting parameter estimation.
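
The glitch-overlap percentages quoted above can be checked with a short calculation. The sketch below assumes glitches arrive as an independent Poisson process at the same rate in every detector; that assumption is not stated in the summary and is introduced here only for illustration.

```python
import math

def overlap_probability(rate_per_s: float, window_s: float, n_detectors: int = 1) -> float:
    """Chance that at least one glitch lands in the window in at least one detector.

    Assumes independent Poisson glitch arrivals with the same rate in each
    detector (an illustrative assumption, not taken from the paper).
    """
    p_single = 1.0 - math.exp(-rate_per_s * window_s)
    return 1.0 - (1.0 - p_single) ** n_detectors

rate = 1.0 / 60.0   # about one glitch per minute, as quoted for O3
window = 8.0        # typical analysis window in seconds

p1 = overlap_probability(rate, window)
p3 = overlap_probability(rate, window, n_detectors=3)
print(f"one detector: {p1:.3f}, three detectors: {p3:.3f}")
```

Under these assumptions the single-detector value comes out near 12.5% and the three-detector value near 33%. Compounding the rounded 12% figure instead gives 1 - 0.88^3, or about 32%, which matches the number quoted above.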

Threat model

The threat is unintentional noise artifacts and instrumental/environmental disturbances that mimic or contaminate gravitational-wave signals, potentially causing false alarms or biasing parameter estimation. The adversary is thus the combination of complex, non-Gaussian, non-stationary noise sources inherent in the detectors and their environments, which the data quality tools must detect and mitigate to avoid false astrophysical interpretations.

Methodology — deep read

Threat Model & Assumptions: The "adversary" here is not malicious. It is instrumental and environmental noise that produces false-positive or biased GW candidates. The framework assumes nobody is deliberately manipulating the data; instead, noise produces non-Gaussian, transient artifacts that complicate signal validation. Auxiliary sensor information is critical for tracing noise to its source. The goal is to automate the data quality vetting that human experts perform under these conditions.
Data: The toolkit was validated on all significant candidate events shared as public alerts during the third observing run (O3). These candidates had human-vetted data quality reports against which DQRbuild outputs were compared. O3 data include strain time series from the LIGO Hanford, LIGO Livingston, and Virgo detectors, plus auxiliary channels monitoring the environment and instrumental subsystems. Candidate analysis windows are typically 8 seconds long. Statistical noise characterization relies on long-duration PSDs (e.g., 512 s). The data include known glitches, Gaussian noise simulations for stationarity tests, and recorded environmental sensor data with known coupling functions.
Architecture / Algorithm: DQRbuild is a modular Python-based software toolkit structured into several components:

dqr-configuration: A flexible task manager producing HTCondor workflows in the form of DAG files, configured by user-supplied YAML/JSON files that specify tasks for each candidate.
dqr-alert: Listens for public GW alerts from the LIGO-Virgo-KAGRA alert network (GraceDB via Apache Kafka) and triggers autonomous workflow launches.
dqr-database: A SQL backend that tracks the locations, versions, and metadata of task results across distributed nodes so that reports can be collated centrally.
dqr-tasks: Implements numerous scientific analyses split into "statistical" (producing p-values), "binary" (pass/fail), and "qualitative" categories.
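
As a rough illustration of what a configuration-to-DAG step could look like, the sketch below turns a small JSON task list into HTCondor DAGMan text. The JSON schema (`tasks`, `name`, `submit`, `after`), the task names, and the submit-file names are all hypothetical and are not taken from dqr-configuration; only the `JOB` and `PARENT ... CHILD` lines are standard DAGMan syntax.

```python
import json

def build_dag(config: dict) -> str:
    """Render a minimal DAGMan description from a task configuration.

    The config layout used here is a made-up example, not the real
    dqr-configuration format.
    """
    lines = []
    # One JOB line per task: "JOB <name> <submit file>".
    for task in config["tasks"]:
        lines.append(f"JOB {task['name']} {task['submit']}")
    # One dependency line per (parent, child) pair.
    for task in config["tasks"]:
        for parent in task.get("after", []):
            lines.append(f"PARENT {parent} CHILD {task['name']}")
    return "\n".join(lines) + "\n"

# Hypothetical configuration for a single detector.
example = json.loads("""
{
  "tasks": [
    {"name": "hveto_H1", "submit": "hveto_H1.sub"},
    {"name": "stationarity_H1", "submit": "stationarity_H1.sub"},
    {"name": "upload_H1", "submit": "upload_H1.sub",
     "after": ["hveto_H1", "stationarity_H1"]}
  ]
}
""")

dag_text = build_dag(example)
print(dag_text)
```

Keeping the dependency graph in a declarative file means the set of tasks run for a candidate can change without touching the code that writes the DAG.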

Key statistical tasks include:

HVeto: Uses a hierarchical veto algorithm to find auxiliary channels whose transient noise coincides with strain glitches more often than chance would predict, iteratively vetoing channels based on coincidence rates.
Stationarity: Computes the ratio ρ of short-term to long-term PSD estimates, weighted by the f^(-7/3) inspiral spectrum; p-values quantify the deviation from Gaussian-noise expectations using bilby-generated simulations.
iDQ: Uses machine-learned models of auxiliary channel data to estimate the probability that a glitch is present, and compares the value at the event time with the surrounding background.
PEMcheck: Projects the environmental noise contribution onto the strain data using measured coupling functions and observed ambient sensor noise levels.
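
To make the coincidence idea behind HVeto concrete, the sketch below scores a single round: for each auxiliary channel it counts strain glitches that have an auxiliary trigger nearby and converts the count into a Poisson significance, -log10 P(at least n coincidences | mu), with mu = N_strain * N_aux * T_window / T_total. This follows the statistic as it is commonly described for hierarchical veto methods and is an assumption here, not a detail confirmed by the summary. The channel names and trigger times are invented, and the iterative veto-and-repeat step is omitted.

```python
import bisect
import math

def poisson_tail(n: int, mu: float) -> float:
    """P(X >= n) for X ~ Poisson(mu), summing the upper tail directly."""
    if n <= 0:
        return 1.0
    term = math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))
    total, k = 0.0, n
    while True:
        total += term
        k += 1
        term *= mu / k
        if k > mu and term <= 1e-16 * total:
            return min(total, 1.0)

def count_coincidences(strain_times, aux_times, window_s):
    """Number of strain triggers with an aux trigger within +/- window_s / 2."""
    aux_sorted = sorted(aux_times)
    half = window_s / 2.0
    n = 0
    for t in strain_times:
        i = bisect.bisect_left(aux_sorted, t - half)
        if i < len(aux_sorted) and aux_sorted[i] <= t + half:
            n += 1
    return n

def coincidence_significance(strain_times, aux_times, window_s, duration_s):
    """-log10 of the chance probability of the observed coincidence count."""
    mu = len(strain_times) * len(aux_times) * window_s / duration_s
    if mu == 0.0:
        return 0.0
    n = count_coincidences(strain_times, aux_times, window_s)
    return -math.log10(poisson_tail(n, mu))

# Invented trigger times (seconds) over a 1000 s stretch of data.
strain = [100.0, 250.0, 400.0, 620.0, 810.0]
aux_channels = {
    "seismometer": [100.02, 250.01, 399.98, 620.03, 900.0],  # hypothetical
    "magnetometer": [50.0, 333.0, 777.0],                    # hypothetical
}
significance = {
    name: coincidence_significance(strain, times, window_s=0.1, duration_s=1000.0)
    for name, times in aux_channels.items()
}
winner = max(significance, key=significance.get)
print(winner, round(significance[winner], 2))
```

In a full hierarchical scheme the winning channel's coincident strain triggers would be removed and the scoring repeated until no channel exceeds a significance threshold.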

Each task takes local strain and auxiliary channel data as input and writes JSON results containing p-values or binary pass/fail flags.
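
One standard way to turn a simulation ensemble into a p-value, as the Stationarity task is described as doing, is the empirical tail fraction with a +1 correction. The sketch below applies it to a stand-in statistic and writes the result as JSON. The Gaussian stand-in for the simulated ρ values, the observed value, and the JSON field names are all assumptions made for illustration; the real task's statistic and output schema are not given in this summary.

```python
import json
import random

def empirical_p_value(observed: float, simulated: list) -> float:
    """Fraction of simulated statistics at least as extreme as the observed one.

    The +1 in numerator and denominator keeps the estimate away from zero
    when no simulation exceeds the observation.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)

# Stand-in for statistics measured on simulated Gaussian noise (assumption).
rng = random.Random(0)
simulated_stats = [rng.gauss(1.0, 0.05) for _ in range(999)]

observed_stat = 1.30  # hypothetical value measured around a candidate

result = {
    "task": "stationarity",   # hypothetical field names
    "ifo": "H1",
    "statistic": observed_stat,
    "p_value": empirical_p_value(observed_stat, simulated_stats),
}
print(json.dumps(result, indent=2))
```

With N simulations the smallest reportable p-value is 1/(N + 1), so the size of the simulation ensemble sets the resolution of the test.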

Training Regime: Task configurations set parameter thresholds (e.g., significance levels for HVeto), durations for PSD estimation, and, for iDQ, machine learning classifiers pretrained on historical noise data. No conventional from-scratch ML training is described; iDQ runs online and continuously at the LIGO sites. The whole workflow is distributed across multiple computing clusters using HTCondor, with centralized configuration and alert management.
Evaluation Protocol: Performance is assessed by running DQRbuild on all O3 public alert candidates, comparing automatically flagged DQ issues against previously manually identified problems. Metrics include recall (true positive rate) of 96% and false alarm rate of 24% for DQ issue detection. Latency analysis shows the toolkit produces results within minutes to match low-latency follow-up needs.
Reproducibility: The toolkit is in active use by the LVK collaboration and is integrated into O4 event validation. Its components (dqr-configuration, dqr-alert, dqr-database, and dqr-tasks) are modular and configurable. O3 data are mostly public (GraceDB events), and the simulation method for stationarity testing (bilby noise injections) is standard. No public frozen model weights or closed datasets are mentioned, and a full setup requires access to LVK infrastructure and data.

Example End-to-End Workflow: When a new candidate alert arrives, dqr-alert triggers dqr-configuration, which assembles a task DAG from configuration files specifying which science tasks to run. Each task processes detector strain and auxiliary sensor data locally and produces JSON results with p-values or flags. Results are uploaded via dqr-upload to the central dqr-database; dqr-html waits for them and generates a web report summarizing all task results. For candidate S191011af, HVeto identified the known noise coupling within 5 minutes, matching the manual retraction timeline. The pipeline automates vetting that previously took multiple days by hand while retaining interpretability through p-values. The setup is distributed across LVK nodes because of the large data volumes involved.

Technical innovations

Development of a flexible, configuration-driven workflow manager (dqr-configuration) enabling distributed multi-node data quality analyses using HTCondor DAGs.
Integration of multiple complementary statistical and binary data quality tasks into a unified automated vetting toolkit capable of rapid (~5 min) response.
Streaming integration with LVK alert network via dqr-alert leveraging Apache Kafka for real-time autonomous workflow triggering across international detector sites.
Use of well-calibrated p-values from diverse tasks (e.g., stationarity, HVeto, iDQ) allowing systematic aggregation of heterogeneous noise diagnostics into automated data quality decisions.
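
The summary does not say how the per-task p-values are combined into a decision. As one simple possibility, the sketch below flags a candidate when any task's p-value falls below a Bonferroni-corrected threshold. The rule, the default `alpha`, and the task p-values shown are all hypothetical and are not the paper's method.

```python
def flag_dq_issue(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag a candidate if any task p-value is below alpha / (number of tasks).

    Bonferroni correction is used here only as a simple, well-known example
    of combining several tests; DQRbuild's actual rule may differ.
    """
    if not p_values:
        raise ValueError("at least one task p-value is required")
    threshold = alpha / len(p_values)
    failing = sorted(name for name, p in p_values.items() if p < threshold)
    return {
        "dq_issue": bool(failing),
        "threshold": threshold,
        "failing_tasks": failing,
    }

# Hypothetical p-values from three statistical tasks.
report = flag_dq_issue({"hveto": 0.002, "stationarity": 0.40, "idq": 0.09})
print(report)
```

A per-task threshold like this keeps the decision traceable to the specific test that triggered it, which fits the emphasis on interpretable p-values.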

Datasets

O3 public GW candidate alerts: all significant transient candidates shared as public alerts during the third observing run, sourced from GraceDB and the O3 public alert system
Simulated colored Gaussian noise for the stationarity p-value calculation: generated with bilby and matched to O3 noise spectra

Baselines vs proposed

Manual human vetting of O3 candidate events: identified 100% of DQ issues in dataset — DQRbuild automated detection: 96% recall of these issues
DQRbuild false alarm rate on O3 candidates for DQ issues: 24%
HVeto task latency: identifies known noise coupling for S191011af within 5 minutes compared to previous multi-day manual investigations
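
The summary does not define "false alarm rate" precisely, and two common conventions give different numbers from the same confusion counts. The sketch below computes recall together with both conventions. The counts are made up, chosen only so that recall and one of the two rates land on the quoted 96% and 24%; they are not the paper's data.

```python
def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Recall plus two common readings of 'false alarm rate'."""
    return {
        # Fraction of human-identified DQ issues that the tool also flagged.
        "recall": tp / (tp + fn),
        # Fraction of clean candidates that were flagged anyway.
        "false_positive_rate": fp / (fp + tn),
        # Fraction of all flags that turned out to be spurious.
        "false_discovery_rate": fp / (fp + tp),
    }

# Made-up counts for illustration only (NOT taken from the paper).
metrics = detection_metrics(tp=24, fn=1, fp=12, tn=38)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the false positive rate is 0.24 while the false discovery rate is about 0.33, which shows why the definition matters when comparing against the 24% figure.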

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.16183.

Fig 3: An example result page for the HVeto task. The displayed candidate is S191011af, which

Fig 4: An example result page for the GSpyNetTree task. The displayed candidate is S191225aq,

Limitations

24% false alarm rate indicates significant room for reducing spurious DQ issue flags without losing recall.
The evaluation is retrospective on O3 data; performance on new data with possibly novel noise characteristics is not yet fully characterized.
The paper reports no adversarial analysis or robustness testing against intentional perturbations or unexpected noise sources.
Relies heavily on auxiliary channel coverage and premeasured coupling functions; gaps or miscalibrations there may reduce sensitivity.
Some tasks (e.g., iDQ) incorporate ML classifiers trained offline, but the paper does not provide details on retraining strategies or adaptability to non-stationary noise over time.
The centralized database and reporting depend on reliable data upload from distributed sites, which may pose operational challenges at scale.

Open questions / follow-ons

How can false alarm rates for data quality flags be meaningfully reduced without sacrificing recall, possibly via more advanced ML classifiers or combined analyses?
What is the performance of DQRbuild tasks under future observing runs (O4 and beyond) with evolving instrument noise properties and increased detection rates?
Can the toolkit be extended or adapted to anticipate novel noise sources or coupling mechanisms not present in historical data?
How can the federation of distributed nodes and real-time alert integration be optimized for scalability and fault tolerance as network complexity increases?

Why it matters for bot defense

Although this paper is about automated vetting of astrophysical signal candidates in gravitational-wave data, the core problems it tackles are conceptually relevant to bot defense and CAPTCHA systems: real-time anomaly detection in complex, noisy data streams, efficient distributed workflow orchestration, and quantifying false positives against true detections. Bot-defense systems likewise need rapid, automated decisions based on heterogeneous signals, with a risk of false alarms. Using statistically calibrated p-values and hierarchical vetoing to identify correlated noise sources parallels approaches for detecting bot-like behavior in noisy user interaction data. The architecture of distributed, modular workflows that react to event triggers within minutes can also inform scalable bot detection systems that respond to streaming user requests. However, the domain-specific nature of gravitational-wave auxiliary sensors and coupling functions limits direct applicability. Overall, bot-defense engineers can draw lessons on integrating multiple automated tests with statistical rigor, with attention to latency-accuracy tradeoffs and federated data processing.

Cite

@article{arxiv2605_16183,
  title={Rapid data quality investigations of gravitational-wave events with the Data Quality Report Builder toolkit},
  author={Derek Davis and Zach Yarbrough and Joseph Areeda and Ronaldas Macas and Nicolas Arnaud and Adrian Helmling-Cornell and Paolina Doliva and Olivia Godwin and Hirotaka Yuzurihara and Benjamin Mannix and Sofia Alvarez-Lopez and Max Trevor and Rachael Huxford and Philippe Nguyen and Beverly Berger and Chayan Chatterjee and Francesco Di Renzo and Christiano Palomba and Viola Sordini and Dimitrios Pesios and Marissa Walker and Airene Ahuja and Man Leong Chan and Julian Ding and Raymond Frey and Franz Herbst and Yannick Lecoeuche and Annudesh Liyanage and Jess McIver and Raymond Ng and Sophie Perry and Caitlin Rawcliffe and Robert Schofield},
  journal={arXiv preprint arXiv:2605.16183},
  year={2026},
  url={https://arxiv.org/abs/2605.16183}
}
