WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

Source: arXiv:2606.18147 · Published 2026-06-16 · By Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu et al.

TL;DR

WEQA addresses question answering over wearable health data, which is continuous, high-dimensional, and temporally complex, unlike the text that foundation LLMs are pretrained on. The authors propose a query-adaptive agentic framework that combines large language model (LLM) reasoning with specialized wearable analytical tools and pretrained physiological prediction models. At its core, an LLM controller synthesizes an execution plan for each query, deciding how to route computation among sensor-native statistical analyses, temporal reasoning, and predictive models. A grounded auditing step then verifies the evidence and calibrates risk-aware language. The paper also introduces a benchmark for wearable health QA built on four datasets spanning cardiovascular, respiratory, and mental health domains, covering descriptive, longitudinal, predictive, and open-ended insight questions. Empirically, WEQA improves accuracy over strong LLM and multi-agent baselines by at least 24% on tasks requiring complex temporal or predictive reasoning, and human evaluations with medical experts report better clinical soundness, personalization, and usefulness. The framework also uses fewer tokens and has lower latency, which reflects the efficiency benefits of its query-adaptive routing. Overall, the results suggest that wearable health QA needs adaptive orchestration of specialized sensor-native models with LLM reasoning, beyond text-based prompting or fixed agent pipelines.

Key findings

WEQA improves accuracy by at least 24% over LLM-only and existing agentic baselines on tasks requiring complex temporal or predictive reasoning, across four datasets in cardiovascular, respiratory, and mental health domains.
WEQA achieves 95.6% exact match and 9.2 MAE on Short-Horizon Analytical QA, surpassing ReAct (64.8% / 31.3 MAE) and multi-agent baselines (72.0% / 157.2 MAE).
On Long-Horizon Analytical QA, WEQA reaches 94.0% exact match on numerical reasoning and 95.1% accuracy on trend and correlation queries, exceeding all baselines.
For Predictive Reasoning QA, WEQA attains 83.9% balanced accuracy on classification tasks and 10.9 MAE for blood pressure regression, outperforming ReAct’s 59.2% and 15.4 MAE respectively.
WEQA reduces average token usage and latency compared to existing agents, using around 10,490 tokens vs 31,000–41,000 tokens and lower average latencies.
Human evaluation with 12 medical experts and 8 users shows WEQA achieves an average score of 3.9/5 on clinical soundness, personalization, and usefulness, surpassing baselines significantly.
Ablations demonstrate that query-adaptive planning, grounded response auditing, and adaptive predictive inference each contribute distinctly to improved personalization, safety, and usefulness.
WEQA’s benefits generalize across backbone LLMs—both Gemini-3.0-Flash and Qwen3-Max yield strong results.
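
The token figures above can be turned into a relative saving with a few lines of arithmetic. This is only a sketch over the approximate averages quoted in this summary; the helper name `reduction_pct` is made up for illustration.

```python
# Token counts as quoted in the key findings above (approximate averages).
WEQA_TOKENS = 10_490
BASELINE_TOKENS_LOW, BASELINE_TOKENS_HIGH = 31_000, 41_000


def reduction_pct(ours: int, baseline: int) -> float:
    """Percentage of baseline tokens saved, rounded to one decimal place."""
    return round(100 * (1 - ours / baseline), 1)


low = reduction_pct(WEQA_TOKENS, BASELINE_TOKENS_LOW)
high = reduction_pct(WEQA_TOKENS, BASELINE_TOKENS_HIGH)
print(f"WEQA uses roughly {low}% to {high}% fewer tokens than the baselines")
```

Under these numbers the saving is about two thirds to three quarters of the baseline token budget.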

Threat model

No explicit adversarial threat model considered; the system assumes queries come from legitimate users interested in personal wearable health insights and does not address adversarial attempts to manipulate input data or produce misleading answers.

Methodology — deep read

The paper does not define an adversary. WEQA assumes a legitimate user querying their own wearable data, and it focuses on safe, clinically grounded responses rather than on resisting manipulation of inputs.

Data: WEQA was evaluated on a curated benchmark using four public wearable datasets: TILES (longitudinal physiological and behavioral sensing), UK COVID-19 and COVID-19 Sounds (respiratory audio recordings), and PPG-BP (photoplethysmography waveform data for blood pressure estimation). Together the benchmark contains 358 users, 1123 question-answer pairs, 6 sensing modalities, and spans cardiovascular, respiratory, and mental health tasks. Question-answer pairs were synthetically constructed with programmatic answers for analytic tasks and label-derived ground truth for prediction tasks.

Architecture and Algorithms: WEQA consists of three core components: (1) Query-Aware Planning — an LLM-based controller (prompted Gemini-3.0-Flash by default) that processes the user query, sensor metadata, available tools/models, and optional personal history to synthesize an adaptive execution plan. It identifies query objectives, data modalities, temporal scope, reasoning types, personalization needs, and risk level. The controller performs iterative replanning to adapt to intermediate evidence.

(2) Evidence Construction — divided into sensor analytical reasoning (invoking specialized statistical, temporal, and cross-sensor analytical tools on raw sensor streams) and adaptive predictive inference (selecting from pretrained ML and wearable foundation models conditioned on the query and personalization context). Predictive models provide uncertainty estimates and the controller may combine multiple model outputs for low-confidence queries. Personalized model adaptation includes few-shot fine-tuning with self-labeled user history or using history-aware baseline-relative models.

(3) Grounded Response Auditing — synthesizes the final response by verifying the evidence supports major claims, calibrating response uncertainty and confidence, enforcing risk-appropriate language, and optionally referencing external medical knowledge sources for clinical context.
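
The three components above can be pictured as a plan, evidence, audit pipeline. The sketch below is a toy stand-in, not the authors' implementation: keyword rules replace the prompted LLM controller, a plain average replaces the predictive models, and every name, field, and threshold (`Plan`, `Evidence`, `plan_query`, the 0.7 confidence cutoff) is hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Plan:
    """Hypothetical execution plan; field names are illustrative only."""
    objective: str
    route: str           # "analytical" or "predictive"
    temporal_scope: str  # "short" or "long"
    risk_level: str      # "low" or "high"


@dataclass
class Evidence:
    values: dict = field(default_factory=dict)  # evidence name -> number
    confidence: float = 1.0


def plan_query(query: str) -> Plan:
    """Stand-in for the LLM controller: keyword rules, not a prompted model."""
    q = query.lower()
    predictive = any(w in q for w in ("risk", "predict", "likely"))
    long_scope = any(w in q for w in ("recently", "trend", "month", "weeks"))
    return Plan(
        objective=query,
        route="predictive" if predictive else "analytical",
        temporal_scope="long" if long_scope else "short",
        risk_level="high" if predictive else "low",
    )


def build_evidence(plan: Plan, heart_rate: list, model_probs: list) -> Evidence:
    """Route to a sensor statistic or to an average over model outputs."""
    if plan.route == "analytical":
        return Evidence({"mean_hr": mean(heart_rate)}, confidence=1.0)
    # Assumption: disagreement between models is treated as low confidence.
    spread = max(model_probs) - min(model_probs)
    return Evidence({"risk_probability": mean(model_probs)}, confidence=1.0 - spread)


def audit(plan: Plan, evidence: Evidence, draft_claims: dict) -> str:
    """Drop claims without supporting evidence, then hedge risky answers."""
    supported = [text for key, text in draft_claims.items() if key in evidence.values]
    needs_hedge = plan.risk_level == "high" or evidence.confidence < 0.7
    prefix = "This is not a diagnosis. " if needs_hedge else ""
    return prefix + " ".join(supported)


plan = plan_query("What was my average heart rate today?")
evidence = build_evidence(plan, [60.0, 70.0, 80.0], [])
draft = {"mean_hr": "Your average heart rate was 70 bpm.",
         "sleep": "Your sleep improved."}
print(audit(plan, evidence, draft))
```

In this toy run the sleep claim is dropped because no evidence backs it, which is the behaviour the auditing stage is described as enforcing.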

Training Regime: WEQA is training-free at the controller level (the LLM is prompted rather than fine-tuned), using pretrained foundation wearable models for prediction. Predictive models and analytics tools are pretrained or heuristic algorithms focusing on physiological data.

Evaluation Protocol: The benchmark evaluates four task families: Short-Horizon Analytical QA (short-window statistical tasks), Long-Horizon Analytical QA (longitudinal trend and correlation reasoning), Predictive Reasoning QA (classification and regression tasks), and Open-Ended Insight QA (synthesis and clinically contextualized response generation). Metrics include exact match accuracy, mean absolute error (MAE), balanced accuracy (unweighted average recall, UAR) for classification, and human Likert ratings on correctness, personalization, usefulness, and clinical soundness. Baselines include LLM prompting with text or image encodings, ReAct iterative coding agents, Multi-Agent frameworks, and ablations of WEQA components.
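
For readers less familiar with these metrics, the sketch below shows how exact match, MAE, and balanced accuracy are commonly computed. It assumes strict equality for exact match (the paper may apply a tolerance or normalization), and the toy inputs are invented for illustration, not taken from the benchmark.

```python
def exact_match(preds, golds):
    """Fraction of predictions equal to the reference answer."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def mean_absolute_error(preds, golds):
    """Average absolute difference between predicted and reference values."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)


def balanced_accuracy(preds, golds):
    """Unweighted average recall: the mean of per-class recall."""
    recalls = []
    for cls in sorted(set(golds)):
        idx = [i for i, g in enumerate(golds) if g == cls]
        recalls.append(sum(preds[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels (made up): four positives, one negative, classifier always says 1.
golds = [1, 1, 1, 1, 0]
preds = [1, 1, 1, 1, 1]
print(exact_match(preds, golds))        # plain accuracy looks high
print(balanced_accuracy(preds, golds))  # per-class recall exposes the failure
print(mean_absolute_error([120, 130], [125, 120]))
```

Balanced accuracy is the natural choice when classes are imbalanced, since a classifier that always predicts the majority class scores only 0.5 on a two-class task.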

Reproducibility: The paper does not state whether code will be released, but it gives detailed dataset references, evaluation protocols, and baseline comparisons. The benchmark is built on public datasets (TILES, UK COVID-19, COVID-19 Sounds, PPG-BP). The LLM prompting configurations and architectural details are described in the appendices.

Example Workflow: For a query like “Did my sleep quality improve recently?”, the controller first identifies the analytical nature and long-term temporal scope. It plans to run analytical tools extracting sleep statistics and trend analysis from multi-day sensor streams. The analytical evidence includes local and longitudinal statistics. If uncertainty or clinical risk is detected, the grounded auditing module cross-verifies evidence and integrates medical context before generating a safety-calibrated, personalized response.
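
The trend-analysis step in this workflow could look something like the sketch below, which fits a least-squares slope to nightly sleep scores. It is an assumption-laden illustration: the paper's actual analytical tools are not specified here, and `trend_slope`, `describe_trend`, the `min_slope` threshold, and the sample scores are all hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (needs 2+ points)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def describe_trend(values, min_slope=0.5):
    """Map the slope to a coarse label; min_slope is an illustrative cutoff."""
    slope = trend_slope(values)
    if slope > min_slope:
        return "improving"
    if slope < -min_slope:
        return "declining"
    return "stable"


# Made-up nightly sleep-efficiency scores (percent) over five nights.
nightly_scores = [70, 72, 74, 76, 78]
print(trend_slope(nightly_scores), describe_trend(nightly_scores))
```

A controller following the workflow above would pass a compact summary like this label and slope to the auditing step, rather than the raw multi-day stream.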

This end-to-end adaptive orchestration from query parsing to specialized sensor-native analysis to clinically grounded output exemplifies WEQA’s novel approach compared to fixed LLM prompting or monolithic models.

Technical innovations

A query-adaptive LLM controller that dynamically synthesizes execution plans tailored to each wearable health query’s modality, temporal scope, and risk profile rather than using a fixed reasoning pipeline.
Integration of sensor analytical reasoning and adaptive predictive inference pathways, combining specialized physiological statistical tools with pretrained ML/foundation models for nuanced wearable data interpretation.
Grounded response auditing that verifies evidence support, calibrates uncertainty, enforces clinical safety-aware language, and incorporates external medical knowledge to produce trustworthy, risk-appropriate answers.
A unified benchmark spanning multiple wearable datasets, sensing modalities, and health task families (descriptive, longitudinal, predictive, open insight) enabling systematic evaluation of adaptive agentic wearable health QA.

Datasets

TILES — multimodal longitudinal wearable sensing (~4 months) — public
UK COVID-19 and COVID-19 Sounds — respiratory audio collection for COVID screening — public
PPG-BP — photoplethysmography waveform data for blood pressure estimation — public

Baselines vs proposed

LLM-Text (hourly input): Short-Horizon Analytical QA exact match = 6.0% vs WEQA 95.6%
LLM-Image (hourly input): Short-Horizon Analytical QA exact match = 3.2% vs WEQA 95.6%
ReAct agent: Predictive classification balanced accuracy = 59.2% vs WEQA 83.9%
Multi-Agent: Predictive classification balanced accuracy = 55.2% vs WEQA 83.9%
ReAct agent: Blood pressure regression MAE = 15.4 vs WEQA 10.9
Multi-Agent: Blood pressure regression MAE = 24.9 vs WEQA 10.9
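
The absolute gaps implied by the comparisons above can be computed directly. The sketch uses only the numbers listed in this section; the gaps are absolute differences (percentage points for accuracy, raw units for MAE), which may not match how the paper itself defines its headline 24% improvement.

```python
# (baseline, task, baseline score, WEQA score, higher_is_better)
COMPARISONS = [
    ("LLM-Text", "short-horizon exact match", 6.0, 95.6, True),
    ("LLM-Image", "short-horizon exact match", 3.2, 95.6, True),
    ("ReAct", "predictive balanced accuracy", 59.2, 83.9, True),
    ("Multi-Agent", "predictive balanced accuracy", 55.2, 83.9, True),
    ("ReAct", "blood pressure MAE", 15.4, 10.9, False),
    ("Multi-Agent", "blood pressure MAE", 24.9, 10.9, False),
]


def absolute_gain(baseline: float, weqa: float, higher_is_better: bool) -> float:
    """Improvement of WEQA over a baseline; positive means WEQA is better."""
    diff = weqa - baseline if higher_is_better else baseline - weqa
    return round(diff, 1)


for name, task, base, weqa, hib in COMPARISONS:
    print(f"{name:12s} {task:30s} gain = {absolute_gain(base, weqa, hib)}")
```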

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.18147.

Fig 1: Examples of heterogeneous wearable health

Fig 2: WEQA query-adaptive agent framework for wearable health question answering. Given a user query,

Fig 3: Examples of query-adaptive reasoning pathways in WEQA for heterogeneous wearable health questions.

Limitations

Benchmark uses only four public datasets and may not capture full diversity of real-world wearable sensor quality, modalities, or clinical populations.
Human evaluation conducted on a relatively small stratified sample (15 responses rated by 12 experts and 8 users), limiting statistical power and generalizability.
Study does not extensively test adversarial robustness or attacks against the system; safety claims remain unverified under adversarial conditions.
The LLM controller is prompted rather than fine-tuned, which may limit ability to fully optimize query planning or adapt to unseen query types.
No clear evaluation of handling missing or corrupted sensor data beyond replanning; performance under noisy real-world conditions is unclear.
While personalization is supported, details on user privacy, data security, and real-time deployment constraints are not discussed.

Open questions / follow-ons

How can WEQA’s query-adaptive agent be extended to handle adversarial inputs or sensor data corruption for improved robustness?
What is the performance and user experience impact of including real-time streaming data versus batch query analysis in wearable health QA?
Can personalization adapt dynamically over time with continuous user feedback to improve prediction and explanation quality?
How does integration of external clinical knowledge bases at scale affect grounding quality and safety calibration in responses?

Why it matters for bot defense

WEQA highlights that effective natural language question answering over complex sensor data demands adaptive, multi-step reasoning workflows rather than fixed, text-centric LLM prompting. For bot-defense and CAPTCHA practitioners, this illustrates the value of query-adaptive agent architectures that synthesize reasoning over heterogeneous, high-dimensional inputs, verifying grounding and calibrating response confidence. The use of uncertainty-aware auditing to produce safety-calibrated outputs also informs approaches to risk-aware response generation in sensitive domains. Furthermore, the efficiency gains from dynamic routing and compact evidence summaries may inspire techniques to reduce computational overhead in agentic systems interacting with multimodal inputs. Finally, WEQA’s modular framework illustrates how to effectively integrate pretrained domain models with LLM reasoning, a pattern applicable beyond wearable health to other complex question-answering or bot-detection contexts involving multimodal data.

Cite

@article{arxiv2606_18147,
  title={WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning},
  author={Yuwei Zhang and Tong Xia and Bianca Emmerich and Yu Yvonne Wu and Dimitris Spathis and Xin Liu and Daniel McDuff and Cecilia Mascolo},
  journal={arXiv preprint arXiv:2606.18147},
  year={2026},
  url={https://arxiv.org/abs/2606.18147}
}
