STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

Source: arXiv:2605.25162 · Published 2026-05-24 · By Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu

TL;DR

This paper addresses the chronic scarcity of large-scale, complex, domain-specific task-oriented dialogue (TOD) data needed for vertical large language models. Existing data acquisition methods suffer a trilemma: expert annotation is costly, real-world service conversations are limited by privacy, and static datasets quickly become outdated. To overcome this, the authors propose STREAM, a novel data-centric framework that mines authentic interaction signals from publicly available streaming media—live streams, short videos, and web pages—and synthesizes high-fidelity task-oriented dialogues with rich service behaviors. STREAM integrates role-grounded persona construction, Conversational Blueprinting to model dialogue strategies and trajectories, and retrieval-augmented generation (RAG) for knowledge-aware response synthesis. Leveraging this framework, the authors release StreamDial, a large-scale multi-domain TOD dataset spanning Automotive, Restaurant, and Hotel verticals, containing 87,498 dialogue sessions and nearly 1.5 million turns with an average length of 17 turns per session.

Intrinsic evaluation by multiple LLM judges and human annotators shows that StreamDial dialogues outperform strong public baselines (drawn from CrossWOZ, RiSAWOZ, and TransferTOD) on coherence, informativeness, naturalness, diversity, flexibility, and overall quality. Downstream experiments on Dialogue State Tracking (DST) confirm improved task performance and cross-domain generalization for models trained on StreamDial data. Preliminary multilingual transfer experiments on Qwen3-8B further suggest the dataset’s utility across languages under resource constraints. The work advances automated, scalable synthesis of temporally fresh, strategically rich TOD data, which matters for training domain-focused conversational agents that go beyond simple slot-filling.

Key findings

  • StreamDial dataset contains 87,498 dialogue sessions and 1,497,320 turns total, with an average of 17.11 turns per session, balanced across Automotive, Restaurant, and Hotel domains.
  • Automatic LLM judges (Qwen3-Max, GPT-5.2, Gemini3-Pro) rate StreamDial dialogues significantly higher in Overall Quality than Open Data baselines (CrossWOZ, RiSAWOZ, TransferTOD), with average quality improvements of 1.5–3 points on a 10-point scale (Fig 3).
  • Human evaluation of 1,431 anonymized sessions confirms intrinsic quality improvements on coherence, informativeness, diversity, and naturalness dimensions without sacrificing dialogue fluency or role adherence.
  • DST fine-tuning with StreamDial (Hybrid) achieves JGA of up to 96.97% and Slot-value F1 of 99.37% on Qwen3-1.7B models, surpassing baseline public data performance (JGA 93.43%, F1 98.81%) and Stream-Only or Seed-Only variants (Table 4).
  • Cross-domain consistency: StreamDial improves dialogue quality and DST utility consistently across all three vertical domains, demonstrating the generality of the mining and synthesis approach (Table 3).
  • Streaming Signal Ingestion (SSI) reduces ASR word error rates to 3.5–10.5% using domain-specific lexicons and LLM-assisted correction, a step that is critical for denoising raw stream data before dialogue synthesis.
  • Graph-based dialogue filtering reduces redundancy by clustering similar dialogues based on semantic similarity thresholds, preserving dataset diversity and realism.
  • Multilingual transfer experiments using English, French, and Korean translations with Qwen3-8B show promising cross-lingual gains on the X-RiSAWOZ benchmark under a controlled training budget.
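
The word error rate quoted for SSI above is the standard ASR metric: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name and the toy sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example (invented sentences, not data from the paper).
reference = "the hotel offers free airport pickup before ten in the evening"
hypothesis = "the hotel offers free airport pick up before ten in evening"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

In the toy pair, three word-level edits against eleven reference words give a WER of 3/11, well above the 3.5–10.5% range reported for SSI.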

Methodology — deep read

  1. Threat model and assumptions: No adversary is defined, because the work concerns data mining and synthesis rather than security. The framework assumes access only to publicly available streaming sources, without violating privacy. The synthesized dialogues aim to simulate complex task-oriented scenarios realistically while preserving role consistency and dialogue strategy.

  2. Data provenance, size, and preprocessing: STREAM ingests heterogeneous rich media from raw streaming sources, including 320K+ public accounts on platforms such as Douyin, Kuaishou, Xiaohongshu, and Ctrip. Sources are filtered by domain keywords, interaction activity (e.g., >1,000 viewers), and content quality (ads and off-topic streams are removed). The raw data types are web pages (domain knowledge), live streams (audio, video, and synchronized bullet chats), and short videos (edited clips plus comments).

  3. Architecture and algorithm: STREAM comprises four cascading phases.
     (a) Streaming Signal Ingestion (SSI) extracts atomic interaction signals, namely user questions (Qu), agent responses (Ra), aligned QA pairs (Pqa), dialogue strategies (Stag), and account metadata (Iacc), via ASR, denoising, temporal and semantic alignment, and LLM-based evidence correction.
     (b) Adaptive Persona Synthesis (APS) creates paired role-grounded personas for users (Pu) and agents (Pa). Pu integrates real-time user question distributions with seed dialogues and generates diverse natural-language expressions of intent. Pa models professional agent roles, communication style, domain expertise, constraints, and goals, grounded in account metadata and domain knowledge bases.
     (c) Conversational Blueprinting (CB) constructs a dialogue plan B with hierarchical elements: a rhythm overview of stages (e.g., intent mining), key user signal definitions, coping strategies for specific scenarios (e.g., competitor comparison), and a dialogue flow atlas mapping multi-path trajectories.
     (d) Interactive Dialogue Generation (IDG) synthesizes multi-turn dialogues H by simulating interactive multi-agent exchanges conditioned on Pu, Pa, and B, with retrieval-augmented generation (RAG) supporting knowledge injection. The retrieval pool combines seed dialogues and stream-derived evidence samples and serves both user and agent turns. Graph-based dialogue filtering detects semantic clusters of similar dialogues and samples from them proportionally to maintain dataset diversity.

  4. Training regime: Details on training parameters for synthesis models (e.g., hyperparameters or hardware) are not fully specified. The DST downstream fine-tuning uses Qwen3 and Gemma3 backbones with standard settings, ensuring fair comparison between different training data regimes. Multilingual experiments translate the synthetic data with an established pipeline and validate with held-out test splits.

  5. Evaluation protocol: Intrinsic dialogue quality is measured by LLM-as-a-Judge evaluation over six dimensions on 2,000 samples per condition, anonymized and randomized to eliminate bias. Three independent LLM judges are used: Qwen3-Max, GPT-5.2, and Gemini3-Pro. A complementary human evaluation covers 1,431 sessions rated by 53 annotators, who score the same six dimensions after inspecting full quadruplet sessions (Pu, Pa, B, H). Downstream extrinsic utility is assessed by fine-tuning three DST models (Qwen3-1.7B, Qwen3-8B, Gemma3-4B) on multiple data conditions and reporting Joint Goal Accuracy (JGA) and Slot-value F1 on standard splits. Cross-domain evaluation probes generalization. Multilingual transfer is evaluated on X-RiSAWOZ in the Automotive domain.

  6. Reproducibility: The authors publicly release the StreamDial dataset in the quadruplet schema (Pu, Pa, B, H); the Chinese original is available, and English, Korean, and French translations are in progress. Annotated splits, translation prompts, and post-processing rules are to be released. The release of code or synthesis models is not explicitly stated. The use of publicly available stream sources and extensive filtering procedures aids reproducibility of the data mining pipeline.
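
The quadruplet schema (Pu, Pa, B, H) mentioned in items 3, 5, and 6 can be pictured as a small record type. The sketch below is for orientation only: every class name, field name, and example value is a hypothetical assumption, not the released StreamDial format.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    role: str          # "user" for Pu, "agent" for Pa
    description: str   # free-text persona (hypothetical field)


@dataclass
class Blueprint:
    rhythm_overview: list[str]          # ordered stages, e.g. "intent mining"
    key_user_signals: list[str]         # signals the agent should watch for
    coping_strategies: dict[str, str]   # scenario -> strategy
    flow_atlas: dict[str, list[str]]    # stage -> stages that may follow it


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class Session:
    user_persona: Persona                               # Pu
    agent_persona: Persona                              # Pa
    blueprint: Blueprint                                # B
    history: list[Turn] = field(default_factory=list)   # H


def average_turns(sessions: list[Session]) -> float:
    """Mean number of turns per session."""
    if not sessions:
        raise ValueError("need at least one session")
    return sum(len(s.history) for s in sessions) / len(sessions)


# Invented example values, for illustration only.
example = Session(
    user_persona=Persona("user", "first-time buyer comparing two SUVs"),
    agent_persona=Persona("agent", "dealership sales consultant"),
    blueprint=Blueprint(
        rhythm_overview=["intent mining", "recommendation"],
        key_user_signals=["mentions a budget"],
        coping_strategies={"competitor comparison": "contrast concrete specs"},
        flow_atlas={"intent mining": ["recommendation"]},
    ),
    history=[
        Turn("user", "I'm choosing between two SUVs."),
        Turn("agent", "What matters most to you, range or cabin space?"),
    ],
)
print(average_turns([example]))
```

Keeping the personas and blueprint next to the dialogue history is what lets an annotator or a judge model check a session against its intended roles and plan, as the human evaluation described above does.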

Technical innovations

  • A novel multi-phase pipeline (SSI, APS, CB, IDG) that mines noisy streaming media (live streams, short videos) to synthesize structured, role-grounded, and blueprint-guided multi-turn task-oriented dialogues.
  • Integration of retrieval-augmented generation (RAG) within a multi-agent interactive loop to enable knowledge-aware and context-sensitive dialogue turn generation.
  • Use of conversational blueprints decomposed into hierarchical components—rhythm overview, key nodes, coping strategies, and dialogue flow atlas—for explicit, executable dialogue trajectory control.
  • Graph-based dialogue filtering leveraging semantic similarity on paired user and agent representations to identify and sample diverse dialogue communities, reducing redundancy and preserving dataset richness.
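
The graph-based filtering idea in the last bullet can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: each dialogue is represented by one toy vector (standing in for the paired user and agent representations), connected components of a thresholded similarity graph stand in for community detection, and the threshold and keep ratio are invented values.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_clusters(vectors, threshold):
    """Connected components of the graph linking dialogues with cosine >= threshold."""
    n = len(vectors)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


def sample_proportionally(clusters, keep_ratio, seed=0):
    """Keep a fixed fraction of each cluster, and at least one dialogue per cluster."""
    rng = random.Random(seed)
    kept = []
    for cluster in clusters:
        k = max(1, math.ceil(keep_ratio * len(cluster)))
        kept.extend(rng.sample(cluster, k))
    return sorted(kept)


# Toy embeddings: four near-duplicates, two near-duplicates, one singleton.
vectors = [
    [1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [1.0, 0.05, 0.0], [0.95, 0.0, 0.2],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = similarity_clusters(vectors, threshold=0.9)
kept = sample_proportionally(clusters, keep_ratio=0.5)
print(clusters, kept)
```

Sampling a fraction of every cluster, rather than dropping whole clusters, thins out near-duplicates while keeping rare dialogue types such as the singleton represented.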

Datasets

  • StreamDial — 87,498 dialogues, 1,497,320 turns — constructed from public streaming media from Douyin, Kuaishou, Xiaohongshu, and Ctrip in Automotive, Restaurant, Hotel domains.
  • CrossWOZ — public — Open Data baseline
  • RiSAWOZ — public — Open Data baseline
  • TransferTOD — public — Open Data baseline
  • X-RiSAWOZ — public — used in multilingual transfer evaluation
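
As a quick consistency check on the StreamDial figures listed above, the reported average session length follows directly from the session and turn totals. The variable names are illustrative.

```python
# Totals as reported for StreamDial.
sessions = 87_498
turns = 1_497_320

average_turns_per_session = turns / sessions
print(f"{average_turns_per_session:.2f} turns per session")
```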

Baselines vs proposed

  • Open Data (CrossWOZ + RiSAWOZ + TransferTOD): Overall Quality by Qwen3-Max = 6.32 (Automotive), StreamDial = 8.98
  • Open Data: Overall Quality by GPT-5.2 = 5.89 (Automotive), StreamDial = 7.81
  • Open Data: Overall Quality by Gemini3-Pro = 6.91 (Automotive), StreamDial = 8.82
  • DST Joint Goal Accuracy on Qwen3-1.7B SFT w/ Public Data = 93.43%, +StreamDial (Hybrid) = 96.97%
  • DST Slot-value F1 on Qwen3-1.7B SFT w/ Public Data = 98.81%, +StreamDial (Hybrid) = 99.37%
  • DST JGA on Qwen3-8B SFT w/ Public Data = 93.94%, +StreamDial (Hybrid) = 96.72%
  • DST F1 on Qwen3-8B SFT w/ Public Data = 98.56%, +StreamDial (Hybrid) = 99.41%
  • DST JGA on Gemma3-4B SFT w/ Public Data = 95.71%, +StreamDial (Hybrid) = 97.98%
  • DST F1 on Gemma3-4B SFT w/ Public Data = 98.78%, +StreamDial (Hybrid) = 99.54%
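
For readers less familiar with the two DST metrics in this list, the sketch below uses their usual definitions: Joint Goal Accuracy counts a turn as correct only if the whole predicted state matches the gold state, while slot-value F1 gives credit per (slot, value) pair. This is an assumption about the scoring, since the paper's exact script may differ, and the slot names and values are invented.

```python
def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if not gold_states or len(gold_states) != len(pred_states):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(gold == pred for gold, pred in zip(gold_states, pred_states))
    return correct / len(gold_states)


def slot_value_f1(gold_states, pred_states):
    """Micro-averaged F1 over (slot, value) pairs across all turns."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_states, pred_states):
        gold_pairs, pred_pairs = set(gold.items()), set(pred.items())
        tp += len(gold_pairs & pred_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy three-turn dialogue with one wrong slot value in the second turn.
gold = [
    {"brand": "BYD"},
    {"brand": "BYD", "budget": "200k"},
    {"brand": "BYD", "budget": "200k", "seats": "5"},
]
pred = [
    {"brand": "BYD"},
    {"brand": "BYD", "budget": "250k"},
    {"brand": "BYD", "budget": "200k", "seats": "5"},
]
print(f"JGA = {joint_goal_accuracy(gold, pred):.3f}")
print(f"Slot-value F1 = {slot_value_f1(gold, pred):.3f}")
```

A single wrong value costs a whole turn under JGA but only one pair under F1, which is consistent with the F1 figures above sitting closer to 100% than the JGA figures.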

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.25162.

Fig 1: Overview of the Stream framework. Starting from heterogeneous rich-media sources (web pages, live streams, and

Limitations

  • ASR quality varies with word error rates between 3.5% and 10.5%, which may introduce noise despite domain lexicon normalization and LLM correction.
  • The synthesized dialogue quality is ultimately limited by signal extraction from noisy and unstructured streaming data, requiring extensive filtering and heuristic thresholds that may introduce bias.
  • Dialogue generation details, such as model architectures, hyperparameters, and training specifics for synthesis modules, are not fully documented, impacting reproducibility.
  • The dataset currently covers mainly three verticals (Automotive, Restaurant, and Hotel), which limits immediate applicability to domains that require other kinds of service dialogue.
  • Multilingual data release and transfer experiments are ongoing, with translations relying on automatic pipelines and only partial manual inspection, potentially affecting quality.
  • No adversarial evaluation or robustness testing against malicious or deliberately crafted dialogues is reported.
  • The framework relies on public streaming platforms that could be region and language dependent, possibly limiting global applicability.

Open questions / follow-ons

  • How well does the STREAM pipeline generalize to additional verticals beyond Automotive, Restaurant, and Hotel, particularly domains with more complex or sensitive service workflows?
  • Can the RAG-powered multi-agent generation be improved with stronger grounding mechanisms or multi-modal inputs (e.g., video, image cues) from streams to enhance dialogue fidelity?
  • How does the temporal freshness and dynamic updating of the dataset affect model robustness over longer periods, especially as streaming content evolves rapidly?
  • What are the trade-offs between automated streaming data mining versus controlled human annotation of vertical dialogues in terms of data quality and downstream task performance?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners focused on task-oriented dialogue systems, this work offers a novel data-centric approach to synthesize realistic, complex dialogues at scale by mining publicly available streaming media. The STREAM framework provides mechanisms to move beyond synthetic template-based or static crowd-annotated corpora by capturing authentic user-agent interactions with rich service behaviors such as negotiation, constraint management, and recovery. Practitioners designing dialogue-based bot defenses or human verification challenges can leverage StreamDial’s multi-domain dialogues for training and evaluating models that need to detect or simulate human-like strategic conversational behaviors. Given the detailed persona and blueprint metadata aligned with each dialogue, there is potential for more granular analysis of utterance authenticity and behavioral consistency, informing bot-detection features or adversarial robustness tests against sophisticated conversational bots. Additionally, the retrieval-augmented and multi-agent generation methodology presents avenues for both generating challenging synthetic dialogues and understanding real user intent signals embedded in noisy streaming content. However, for direct CAPTCHA or bot defense deployment, adaptation to adversarial attack scenarios and robustness assessments beyond the synthesis domain will be essential.

Cite

@article{arxiv2605_25162,
  title={STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media},
  author={Liang Xue and Haoyu Liu and Cheng Wang and Pengyu Chen and Haozhuo Zheng and Yang Liu},
  journal={arXiv preprint arXiv:2605.25162},
  year={2026},
  url={https://arxiv.org/abs/2605.25162}
}

Articles are CC BY 4.0 — feel free to quote with attribution