Abstracting Cross-Domain Action Sequences into Interpretable Workflows

Source: arXiv:2606.14654 · Published 2026-06-12 · By Gaurav Verma, Scott Counts

TL;DR

This paper addresses the challenge of extracting high-level, interpretable user activities from noisy, granular, and domain-specific low-level action sequences logged by digital applications. Prior approaches relying on pattern mining or deep learning often require extensive domain-specific training data and are sensitive to noise, limiting cross-application generalization and interpretability. To overcome these limitations, the authors propose WorkflowView, a hierarchical framework that leverages large language models (LLMs) through zero-shot and few-shot prompting to progressively abstract raw action sequences into natural language descriptions and then high-level task or activity inferences. The approach is validated across three diverse real-world domains: browser interaction logs for reconstructing user tasks with high semantic alignment (µsim = 0.91), MOOC interaction logs where few-shot prompting predicts student dropout with a weighted F1 score of 0.90, and anonymized Microsoft Word telemetry enabling privacy-preserving analysis of AI tool usage and workflow categorization. These results demonstrate the robustness and flexibility of LLM-based abstractions to produce actionable insights without heavy reliance on domain-specific training data or fine-tuning. The paper also discusses practical deployment concerns like computational cost and user privacy when integrating LLM inference within logging infrastructures.

Key findings

WorkflowView achieves a zero-shot mean semantic similarity of 0.911 (±0.042) between generated and ground-truth browser task descriptions on the Mind2Web dataset containing 2,022 tasks across 137 websites.
In few-shot MOOC dropout prediction using 67,699 student-course pairs, WorkflowView attains a weighted F1 = 0.90, precision = 0.83, recall = 0.98 using only 5 few-shot examples per class, comparable to supervised baselines trained on thousands of samples.
The best-performing temporal window for MOOC dropout prediction (F1 = 0.89) uses actions from 6 days to 24 hours before the last activity, scoring well above the majority and random baselines.
Increasing the number of few-shot examples per class beyond 5 for MOOC dropout prediction yields diminishing returns or slightly lower performance, likely due to LLM context length limitations.
WorkflowView enables discovery and multi-class classification of user activities in Microsoft Word document workflows, revealing meaningful categories such as active content editing (15%) and collaboration (reviewing comments, real-time editing).
The framework provides privacy-preserving insights by abstracting granular telemetry without exposing sensitive textual or personal data.
WorkflowView can work well with smaller or open-weight LLMs such as Phi-4 and gpt-oss-20b, demonstrating model-agnostic flexibility.
Hierarchical prompts progressively denoise and contextualize raw action logs into natural language summaries and activity labels, facilitating modular reuse across downstream tasks.
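
The temporal-window finding above can be illustrated with a small filter. This is a minimal sketch, not the authors' code: it assumes the window is measured backwards from each student's last logged action, that both bounds are inclusive, and that events arrive as (timestamp, action) pairs. The action names are hypothetical placeholders.

```python
from datetime import datetime, timedelta


def window_before_last_activity(events, start=timedelta(days=6), end=timedelta(hours=24)):
    """Keep events between `start` and `end` before the last logged action.

    Assumption: bounds are inclusive and measured from the latest timestamp.
    """
    if not events:
        return []
    last = max(ts for ts, _ in events)
    lo, hi = last - start, last - end
    return [(ts, action) for ts, action in sorted(events) if lo <= ts <= hi]


# Hypothetical log for one student-course pair.
last = datetime(2024, 1, 10, 12, 0)
log = [
    (last - timedelta(days=8), "load_video"),       # too old, dropped
    (last - timedelta(days=3), "problem_check"),    # inside the window
    (last - timedelta(hours=30), "pause_video"),    # inside the window
    (last - timedelta(hours=2), "close_courseware"),  # too recent, dropped
    (last, "click_about"),                          # the last activity itself
]
selected = window_before_last_activity(log)
print([action for _, action in selected])
```

Only the filtered actions would then be passed to the description prompt, which also keeps the prompt short.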

Threat model

N/A; the work focuses on interpreting noisy low-level UI action logs into high-level workflows using LLM abstraction rather than defending against an adversary or mitigating active attacks. Privacy considerations emphasize anonymization and aggregation but no formal adversary model is defined.

Methodology — deep read

Threat model & assumptions: The adversary is not explicitly modeled as this is not a traditional security paper, but the approach assumes logged action sequences are noisy, granular, and possibly contain irrelevant or exploratory user interactions. The goal is to abstract these logs without requiring large annotated datasets. Privacy concerns are acknowledged with anonymized telemetry data.
Data provenance and preprocessing: Three datasets covering diverse domains were used: (a) Mind2Web browser logs with 2,022 web task sequences across 137 websites, (b) MOOC logs from Feng et al. (2019) with 67,699 student-course pairs and 22 unique student action types, and (c) anonymized Microsoft Word telemetry from 50,000 users in the US engaging with an AI document assistant. Time-stamped actions are tokenized and optionally converted to natural language descriptions.
Architecture and algorithm: WorkflowView is a hierarchical LLM prompting pipeline with three layers: Layer 1 converts low-level action sequences into natural language descriptions; Layer 2 abstracts these summaries into high-level activity or task inferences; Layer 3 optionally categorizes inferred activities into predefined or discovered classes (e.g., binary dropout classification or multi-class document activity). Prompts are carefully designed for modular progressive reasoning and denoising. TnT-LLM is used for category label discovery in Word workflows.
Training regime: No fine-tuning is performed. The approach relies on zero-shot or few-shot prompting using GPT-4o (version dated 2024-05-13). Few-shot prompts include K examples per class (K in {1,3,5,10,20}) randomly sampled from training sets. No gradient updates or epochs apply.
Evaluation protocol: Metrics depend on task: semantic similarity to ground-truth natural language browser tasks (using cosine similarity of OpenAI text-embedding-ada-002 embeddings) with retrieval-style metrics (MRR, Recall@K); weighted precision, recall, and F1 for dropout prediction; and interpretable activity categories with prevalence statistics for Word logs. Baselines include majority/random for dropout, and fine-tuned seq2seq for browser tasks. Model sensitivity to few-shot examples and temporal window hyperparameters is analyzed.
Reproducibility: Prompts used at all layers are provided in the appendix. No training code or model weights are released, since the approach relies on closed-access LLM APIs. Mind2Web is publicly available; the MOOC and Word telemetry datasets are restricted or internal.
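
The three-layer pipeline described above might be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the prompt wording is a placeholder (the real prompts are in the paper's appendix), `llm` is a hypothetical callable mapping a prompt string to a completion string, and `fake_llm` is a stand-in so the sketch runs without any API.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def describe_actions(actions: Sequence[str], llm: LLM) -> str:
    """Layer 1: turn raw actions into a natural language description."""
    return llm("Describe in plain language what the user did:\n" + "\n".join(actions))


def infer_activity(description: str, llm: LLM) -> str:
    """Layer 2: abstract the description into a high-level task."""
    return llm("Summarize the high-level task behind this description:\n" + description)


def classify_activity(activity: str, labels: Sequence[str], llm: LLM,
                      examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Layer 3 (optional): pick a label, with optional few-shot examples."""
    shots = "".join(f"Activity: {a}\nLabel: {l}\n\n" for a, l in examples)
    prompt = f"Choose one label from {list(labels)}.\n\n{shots}Activity: {activity}\nLabel:"
    answer = llm(prompt).strip()
    return answer if answer in labels else "unknown"  # guard against free-form output


def workflow_view(actions: Sequence[str], llm: LLM,
                  labels: Optional[Sequence[str]] = None,
                  examples: Sequence[Tuple[str, str]] = ()) -> Dict[str, str]:
    description = describe_actions(actions, llm)
    activity = infer_activity(description, llm)
    result = {"description": description, "activity": activity}
    if labels:
        result["label"] = classify_activity(activity, labels, llm, examples)
    return result


def fake_llm(prompt: str) -> str:
    """Canned responses standing in for a real model call."""
    if prompt.startswith("Describe"):
        return "The user searched for shampoo and added it to a list."
    if prompt.startswith("Summarize"):
        return "Build a shopping list"
    return "shopping"


out = workflow_view(["CLICK search box", "TYPE shampoo", "CLICK add to list"],
                    fake_llm, labels=["shopping", "booking"])
print(out)
```

Passing the model as a callable keeps each layer reusable: the same Layer 1 and Layer 2 output can feed a dropout classifier in one setting and a category-discovery step in another.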

Concrete example: For a browser interaction log consisting of clicks and typed inputs, WorkflowView first prompts GPT-4o to generate a natural language description of the sequence, e.g., “The user created a Walgreens shopping list and added personal care and shower essentials items.” Next, a second prompt abstracts this description into a succinct task summary. The resulting summary is then compared to the ground-truth textual task label by embedding similarity; across the dataset the mean cosine similarity is ~0.91, indicating that the abstractions stay semantically close to the ground-truth task labels.
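
The embedding-based comparison can be sketched with plain Python. The code below is a hedged illustration: the 2-dimensional vectors are toy stand-ins for real text embeddings, and the retrieval protocol (rank every ground-truth label by cosine similarity to each generated summary) is an assumption about how MRR and Recall@K would be computed, not a description of the paper's exact setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_metrics(generated, truth, k=1):
    """Return (mean similarity, MRR, Recall@k); generated[i] should match truth[i]."""
    sims, reciprocal_ranks, hits = [], [], 0
    for i, g in enumerate(generated):
        scores = [cosine(g, t) for t in truth]
        sims.append(scores[i])
        # Rank of the correct label = 1 + number of labels scoring strictly higher.
        rank = 1 + sum(s > scores[i] for s in scores)
        reciprocal_ranks.append(1.0 / rank)
        hits += rank <= k
    n = len(generated)
    return sum(sims) / n, sum(reciprocal_ranks) / n, hits / n


# Toy embeddings (assumed values, for illustration only).
truth = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
generated = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

mean_sim, mrr, recall_at_1 = retrieval_metrics(generated, truth, k=1)
_, _, recall_at_2 = retrieval_metrics(generated, truth, k=2)
print(round(mean_sim, 3), round(mrr, 3), round(recall_at_1, 3), recall_at_2)
```

Mean similarity measures how close each summary is to its own label, while MRR and Recall@K additionally check that it is closer to its own label than to the other tasks' labels.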

Technical innovations

Hierarchical LLM prompting framework (WorkflowView) that progressively abstracts raw UI interaction logs into natural language descriptions and then interpretable high-level activities without fine-tuning.
Cross-domain approach validated in zero-shot and few-shot settings on diverse datasets (web browsers, MOOCs, document workflows), demonstrating high semantic fidelity and predictive accuracy with minimal training data.
Integration of LLMs with domain-agnostic prompting to denoise noisy, granular timestamped action sequences into coherent textual semantic summaries.
Use of modular architecture enabling flexible downstream task adaptation (e.g., dropout prediction, multi-class classification) by adding lightweight classification layers on top of LLM-generated summaries.
Application of existing label discovery method (TnT-LLM) for unsupervised activity taxonomy generation from LLM-produced activity summaries in complex document workflows.

Datasets

Mind2Web — 2,022 web task sequences across 137 websites — public (Deng et al., 2023)
MOOC Interaction Logs — 67,699 student-course pairs from 44,008 students in 247 courses, 22 unique action types — from Feng et al. (2019), not public
Microsoft Word AI Tool Telemetry — anonymized logs from 50,000 US users in June 2025, approx. 2,000 unique UI actions — internal Microsoft data

Baselines vs proposed

Majority baseline (MOOC dropout): weighted F1 = 0.759 vs WorkflowView: weighted F1 = 0.90 (5-shot)
Random baseline (MOOC dropout): weighted F1 = 0.622 vs WorkflowView: weighted F1 = 0.90 (5-shot)
Fine-tuned seq2seq model (browser task reconstruction, Appendix A.1.1): lower semantic similarity vs WorkflowView zero-shot µsim = 0.911
Supervised MOOC dropout models (Fu et al., 2021; Basnet et al., 2022; Feng et al., 2019) report F1 between 0.84 and 0.91 vs WorkflowView few-shot F1 = 0.90
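
For readers unfamiliar with the metric, weighted F1 is the support-weighted average of per-class F1 scores. The sketch below computes it from scratch and applies it to a majority-class baseline. The labels are a toy example with an assumed 80/20 class split, not the MOOC data, so the resulting number is not comparable to the figures above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(y_true)
    return total


# Toy labels (assumed): 1 = dropout, 0 = retained.
y_true = [1] * 8 + [0] * 2

# Majority baseline: always predict the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
majority_pred = [majority_label] * len(y_true)

score = weighted_f1(y_true, majority_pred)
print(round(score, 4))
```

Because the minority class contributes an F1 of zero under the majority baseline, the weighted score sits below the majority-class share, which is why an imbalanced dataset can still yield a fairly high baseline.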

Limitations

Relies on closed-source proprietary LLMs (GPT-4o) for core inference, limiting reproducibility and deployment unless similar models are accessible.
WorkflowView sensitivity to the number of few-shot examples suggests brittleness due to LLM context window limitations, potentially requiring careful prompt engineering.
Evaluation on only three domains; generalization to other application areas or highly heterogeneous datasets remains untested.
Privacy analysis is descriptive rather than formal; how abstraction impacts privacy guarantees under adversarial threat is not rigorously studied.
No adversarial robustness testing — e.g., deliberate noise or obfuscation in logs to evade or mislead LLM interpretations.
Datasets like Microsoft Word telemetry are internal, limiting external validation or benchmarking.

Open questions / follow-ons

How can WorkflowView be adapted for rigorous adversarial robustness against noisy or manipulated interaction sequences?
What formal privacy guarantees can be established for LLM-based abstraction of sensitive user behavioral data?
Can the hierarchical prompting framework be extended to real-time streaming logs for online workflow detection and intervention?
How does WorkflowView perform across languages, cultures, or significantly different UI paradigms (e.g., mobile apps, gaming)?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, WorkflowView illustrates a promising new approach to interpret low-level action streams into meaningful high-level activities using LLMs. This could enable more sophisticated detection of anomalous or malicious workflows rather than relying solely on heuristic or shallow feature-based classifiers. The zero-shot and few-shot LLM prompting paradigm minimizes costly domain-specific model retraining while providing interpretable intermediate representations, which could improve transparency in behavioral bot detection. However, practical deployment would need to consider LLM inference latency and privacy implications when integrating with real-time bot defenses. Additionally, the robustness of such LLM abstractions to adversarial user behaviors and the contextual diversity of bot interactions remains an open area for further research. Overall, WorkflowView offers a conceptual framework that can be adapted to augment behavioral analysis within bot-detection pipelines using advanced natural language understanding of action sequences.

Cite

@article{arxiv2606_14654,
  title={Abstracting Cross-Domain Action Sequences into Interpretable Workflows},
  author={Gaurav Verma and Scott Counts},
  journal={arXiv preprint arXiv:2606.14654},
  year={2026},
  url={https://arxiv.org/abs/2606.14654}
}
