Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Source: arXiv:2606.03951 · Published 2026-06-02 · By Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo et al.

TL;DR

Demo2Tutorial automatically converts raw recordings of human-computer interaction into structured, multimodal software tutorials. The motivation is the gap between unstructured demonstration videos (passive, untrimmed recordings without guidance) and user-friendly tutorials that give step-by-step, multimodal instructions designed for teaching. The framework combines four components: synchronized screen video and user action logging (HE-Recorder), multimodal semantic parsing (Action Parser), hierarchical task structuring with iterative actor-critic refinement (Step Planner), and tutorial composition with key-frame selection and adaptive visual highlights (Tutorial Composer).

The authors curate TutorialBench, a new benchmark of 110 task tutorials spanning seven common software applications, sourced from official vendor documentation and corresponding human-recorded demonstrations. On this benchmark, Demo2Tutorial-generated tutorials outscore human-authored official tutorials in overall quality (86.2 vs. 79.1), as judged by GPT-4o on five dimensions: actionability, completeness, conciseness, annotation, and image relevance. Downstream, supplying generated tutorials to GUI agents raises planning success rates on OSWorld tasks by 2.7 to 17.6 percentage points over the no-tutorial baseline (11.7 to 17.6 points on Chrome). A user study of 20 participants finds that generated tutorials reduce task completion time by 10.5% and are preferred by 80% of participants over raw demonstration videos, suggesting improved pedagogical value.

Key findings

  • Demo2Tutorial achieves an overall tutorial quality score of 86.2 on TutorialBench, surpassing human-authored tutorials at 79.1 (Tab. 1).
  • Visual quality score for Demo2Tutorial tutorials is 88.7 vs. 70.5 for human-authored, showing automated adaptive visual annotations outperform manual ones.
  • End-to-end text-only tutorial generation baseline scores only 59.2, highlighting the need for visual grounding in tutorial creation.
  • Agent-S3 GUI agents given Demo2Tutorial tutorials improve task success rates with GPT-o4-mini by 11.7 percentage points on Chrome (from 47.1% to 58.8%) and by 2.7 points on VLC; with GPT-5 the gains are 17.6 and 11.1 points respectively (Tab. 2).
  • User study with 20 participants shows 10.5% faster task completion (131.6s vs. 147.1s) using generated tutorials vs. raw demos (Fig. 5a).
  • 80% of participants prefer Demo2Tutorial tutorials over raw demonstration videos for learning software skills (Fig. 5b).
  • The full system outscores the vanilla multi-agent baseline, which lacks actor-critic refinement and Composer enhancement, by 15.9 points in overall quality (86.2 vs. 70.3), consistent with the Step Planner's refinement loop improving structural coherence and instructional clarity.
  • Key-frame selection using multi-criterion scoring (text relevance, sharpness, motion stability, temporal proximity) significantly outperforms uniform frame sampling in tutorial visual quality.

Threat model

n/a — The paper focuses on extracting and distilling human-computer interaction experience into tutorials for pedagogical use and agent learning, rather than security or adversarial threats.

Methodology — deep read

No adversary is modeled or addressed; the method concerns tutorial generation from recorded human experience.

Data are collected in two forms: (1) raw human demonstrations recorded by experts using a custom HE-Recorder that synchronizes high-resolution, 30 FPS full-screen video with detailed mouse and keyboard action logs, and (2) official, high-quality, human-authored tutorials sourced from vendor websites across seven software applications (PowerPoint, Word, Excel, Photoshop, Premiere Pro, After Effects, Acrobat). The dataset contains 110 samples with balanced representation. Demonstrators are instructed not to copy the official tutorials directly, so that the recordings remain authentic, untrimmed workflows. Actions within 1-second windows are merged and redundant key events consolidated, which makes the action log coarser and less noisy.
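The log-merging step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Action` and `MergedAction` records and their field names are hypothetical, and the grouping rule (same action kind, starting within one second of the group start, immediate key repeats collapsed) is one plausible reading of "actions within 1-second windows are merged and redundant keys consolidated".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    t: float        # timestamp in seconds (hypothetical field)
    kind: str       # e.g. "key" or "click" (hypothetical labels)
    payload: str    # key name or "x,y" coordinate string


@dataclass
class MergedAction:
    start: float
    end: float
    kind: str
    payloads: List[str] = field(default_factory=list)


def merge_actions(actions: List[Action], window: float = 1.0) -> List[MergedAction]:
    """Merge same-kind actions that occur within `window` seconds of the group start."""
    merged: List[MergedAction] = []
    for a in sorted(actions, key=lambda x: x.t):
        last = merged[-1] if merged else None
        if last is not None and last.kind == a.kind and a.t - last.start <= window:
            last.end = a.t
            # Consolidate redundant keys: drop an immediate repeat of the same payload.
            if last.payloads[-1] != a.payload:
                last.payloads.append(a.payload)
        else:
            merged.append(MergedAction(a.t, a.t, a.kind, [a.payload]))
    return merged


log = [
    Action(0.00, "key", "ctrl"),
    Action(0.05, "key", "ctrl"),      # auto-repeat of a held key
    Action(0.10, "key", "c"),
    Action(2.50, "click", "400,300"),
]
result = merge_actions(log)
for m in result:
    print(m.kind, m.payloads, m.start, m.end)
```

In this toy log the three key events collapse into one "ctrl, c" group and the later click stays separate, which is the kind of coarsening the paragraph describes.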

The architecture consists of four sequential modules. The HE-Recorder captures dual streams (visual and action). The Action Parser uses GPT-4o and chain-of-thought prompting with action-grounded visual inputs (screenshots with red bounding boxes indicating mouse coordinates) to produce detailed semantic descriptions for each action, including before/after states, differences, action verbalization, and inferred user intent.
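To make the Action Parser's inputs and outputs concrete, the sketch below shows one possible record for a parsed action and a helper that computes the box to draw around the mouse position. Both are assumptions for illustration: the field names simply mirror the outputs listed above, and the box size and clipping rule are hypothetical, since the summary does not say how the red bounding boxes are sized.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ParsedAction:
    # Hypothetical field names mirroring the outputs listed in the summary.
    before_state: str
    after_state: str
    difference: str
    verbalization: str
    intent: str


def cursor_box(x: int, y: int, screen_w: int, screen_h: int,
               half: int = 40) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a box centred on the cursor, clipped to the screen."""
    left = max(0, x - half)
    top = max(0, y - half)
    right = min(screen_w - 1, x + half)
    bottom = min(screen_h - 1, y + half)
    return left, top, right, bottom


# A click near the left edge of a 1920x1080 screen: the box is clipped at x = 0.
box = cursor_box(10, 500, 1920, 1080)

example = ParsedAction(
    before_state="Insert tab is closed",
    after_state="Insert tab is open",
    difference="Ribbon now shows Insert commands",
    verbalization="Click the Insert tab",
    intent="Prepare to add a picture",
)
print(box, example.verbalization)
```

Keeping the parser output as a fixed record like this would let later stages (planning, composition) consume it without re-reading free-form model text, though the paper may structure it differently.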

The Step Planner organizes parsed actions bottom-up into hierarchical task graphs with three levels: atomic steps grouped by sub-goal, chapters clustering related steps, and a top-level tutorial goal. A chunking mechanism handles long sequences. An actor-critic iterative refinement loop alternates between generating structured tutorial drafts (actor) and evaluating quality on coverage, granularity, orderliness, and learnability (critic), with actionable feedback to improve the draft over iterations until convergence or max iterations.
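The refinement loop can be sketched as below. This is a minimal sketch, assuming the actor and critic are plain callables standing in for model calls; the names `refine`, `toy_actor`, and `toy_critic`, the score range, the stopping threshold, and the iteration cap are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

Draft = List[str]
# Hypothetical interfaces: the actor maps (actions, feedback) to a draft,
# the critic maps a draft to (score in [0, 1], textual feedback).
Actor = Callable[[List[str], str], Draft]
Critic = Callable[[Draft], Tuple[float, str]]


def refine(actions: List[str], actor: Actor, critic: Critic,
           threshold: float = 0.9, max_iters: int = 5) -> Tuple[Draft, float, int]:
    """Alternate actor and critic until the score passes `threshold` or `max_iters` is hit."""
    feedback = ""
    best_draft: Draft = []
    best_score = -1.0
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        draft = actor(actions, feedback)
        score, feedback = critic(draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break
    return best_draft, best_score, iterations


def toy_actor(actions: List[str], feedback: str) -> Draft:
    # Toy stand-in: only removes consecutive duplicates once the critic asks for it.
    if "duplicate" in feedback:
        out: Draft = []
        for a in actions:
            if not out or out[-1] != a:
                out.append(a)
        return out
    return list(actions)


def toy_critic(draft: Draft) -> Tuple[float, str]:
    # Toy stand-in: penalises consecutive duplicate steps.
    dups = sum(1 for a, b in zip(draft, draft[1:]) if a == b)
    if dups:
        return 1.0 - dups / len(draft), "remove duplicate consecutive steps"
    return 1.0, ""


steps, score, iters = refine(
    ["Open Insert tab", "Open Insert tab", "Click Picture"], toy_actor, toy_critic
)
print(steps, score, iters)
```

The loop keeps the best-scoring draft rather than the last one, a design choice made here so that a late regression by the actor cannot degrade the result; whether the paper does the same is not stated.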

The Tutorial Composer converts the hierarchical tutorial into image-text interleaved documents. It selects key-frames per step via a multi-weight scoring function combining OCR-based text relevance, image sharpness (Laplacian variance), motion stability (temporal consistency), and Gaussian-weighted temporal proximity to the recorded action. Selected frames are enhanced by adaptive visual annotations using SAM2 for UI segmentation and RapidOCR for text detection; edits include click markers, drag trajectories, hotkey badges, magnifier effects, adaptive cropping, highlighting, and contrast adjustment, tailored per action type and interface layout.
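The key-frame scoring can be illustrated with a small pure-Python sketch. It assumes a weighted sum of the four criteria; the weights, the Gaussian width `sigma`, the normalisation of sharpness by the maximum over candidates, and the function names are hypothetical, and the text-relevance and stability values are passed in as precomputed numbers rather than derived from OCR or video.

```python
import math
from typing import List, Tuple

Gray = List[List[float]]  # grayscale frame as rows of pixel intensities


def laplacian_variance(img: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels (higher means sharper)."""
    vals = []
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            vals.append(img[r - 1][c] + img[r + 1][c] + img[r][c - 1]
                        + img[r][c + 1] - 4 * img[r][c])
    if not vals:
        return 0.0
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def temporal_proximity(t_frame: float, t_action: float, sigma: float = 0.5) -> float:
    """Gaussian weight that peaks at 1.0 when the frame coincides with the action."""
    return math.exp(-((t_frame - t_action) ** 2) / (2 * sigma ** 2))


def frame_score(text_rel: float, sharpness: float, stability: float, proximity: float,
                weights: Tuple[float, float, float, float] = (0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of the four criteria, each expected in [0, 1]."""
    return sum(w * p for w, p in zip(weights, (text_rel, sharpness, stability, proximity)))


def pick_keyframe(candidates: List[dict], t_action: float) -> Tuple[int, List[float]]:
    raw = [laplacian_variance(c["img"]) for c in candidates]
    max_sharp = max(raw) or 1.0  # avoid dividing by zero when every frame is flat
    scores = [
        frame_score(c["text"], s / max_sharp, c["stability"],
                    temporal_proximity(c["t"], t_action))
        for c, s in zip(candidates, raw)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


flat = [[0.5] * 4 for _ in range(4)]                                     # blurry stand-in
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]     # sharp stand-in
candidates = [
    {"t": 0.2, "img": flat, "text": 0.9, "stability": 0.9},
    {"t": 1.0, "img": checker, "text": 0.8, "stability": 0.8},
]
best, scores = pick_keyframe(candidates, t_action=1.0)
print(best, scores)
```

Here the second frame wins despite slightly lower text relevance, because it is sharper and coincides with the action time; a different weighting would shift that trade-off.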

The evaluation protocol comprises three main experiments. First, tutorial generation is evaluated on TutorialBench against official tutorials and baselines, using GPT-4o as a judge to score five dimensions: actionability, completeness, and conciseness for content, and annotation quality and image relevance for visuals. A human consistency check is reported to confirm that human ratings correlate with the VLM scores. Baselines include text-only and vision-only generation, and a vanilla multi-agent system lacking actor-critic refinement and Composer enhancement.

Second, GUI agent planning is evaluated on OSWorld benchmark tasks (Chrome and VLC domains) using Agent-S3 framework with GPT-o4-mini and GPT-5 planners. Performance metrics are task success rates under varying levels of tutorial context: baseline prompt-only, text-only, image-only, and full multimodal tutorial input.

Third, a human user study with 20 participants compares generated tutorials vs raw demonstration videos for learning a nontrivial PowerPoint task. Metrics are task completion time and user format preference.

The entire system and dataset are publicly released, supporting reproducibility. The chain-of-thought prompting, actor-critic iterative refinement, and multimodal visual annotation are documented in enough detail to allow replication. However, no hyperparameters are specified for the GPT-4o calls; the system relies on prompting rather than conventional model training.

Technical innovations

  • A multimodal Action Parser combining action logs and visual context, using GPT-4o with structured chain-of-thought prompting to reconstruct detailed semantic action and intent descriptions.
  • A bottom-up hierarchical Step Planner that groups atomic actions into steps and chapters, employing an actor-critic iterative refinement loop to ensure tutorial clarity and structural coherence.
  • An intelligent Tutorial Composer that selects key-frames for each instructional step using a multi-criterion scoring function, followed by adaptive visual highlighting and editing (e.g., magnification, contrast adjustment) powered by UI segmentation and OCR.
  • Creation of TutorialBench, a novel benchmark pairing authentic human demonstrations with official tutorials across seven common software applications for standardized tutorial generation evaluation.

Datasets

  • TutorialBench — 110 samples spanning PowerPoint, Word, Excel, Photoshop, Premiere Pro, After Effects, Acrobat — curated from official software vendor tutorials and matched human-recorded demonstrations using HE-Recorder

Baselines vs proposed

  • Human-authored tutorials (GT): Overall quality score = 79.1 vs. Demo2Tutorial = 86.2 (Tab. 1)
  • Text-based generation: Overall score = 59.2 vs. Demo2Tutorial = 86.2 (Tab. 1)
  • Vision-based generation: Overall score = 64.3 vs. Demo2Tutorial = 86.2 (Tab. 1)
  • Vanilla Multi-Agent (no actor-critic refinement or Composer enhancement): Overall score = 70.3 vs. Demo2Tutorial = 86.2 (Tab. 1)
  • Agent-S3 GPT-o4-mini baseline Chrome success rate = 47.1% vs. +Tutorial = 58.8% (+11.7%) (Tab. 2)
  • Agent-S3 GPT-o4-mini baseline VLC success rate = 53.4% vs. +Tutorial = 56.1% (+2.7%) (Tab. 2)
  • Agent-S3 GPT-5 baseline Chrome success rate = 52.9% vs. +Tutorial = 70.6% (+17.6%) (Tab. 2)
  • Agent-S3 GPT-5 baseline VLC success rate = 59.6% vs. +Tutorial = 70.7% (+11.1%) (Tab. 2)
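The deltas above are differences in percentage points, not relative improvements. The short check below recomputes both from the figures quoted in this summary, along with the user study's time reduction (131.6s vs. 147.1s); the dictionary layout and function name are only for illustration. Note that the rounded GPT-5 Chrome figures give 17.7 rather than the reported +17.6, which may reflect rounding of the underlying counts (an assumption, not something the summary states).

```python
from typing import Dict, Tuple

# (planner, domain) -> (baseline success %, success % with tutorial), as quoted above.
results: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("GPT-o4-mini", "Chrome"): (47.1, 58.8),
    ("GPT-o4-mini", "VLC"): (53.4, 56.1),
    ("GPT-5", "Chrome"): (52.9, 70.6),
    ("GPT-5", "VLC"): (59.6, 70.7),
}


def gains(base: float, with_tutorial: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in %), each rounded to one decimal."""
    pp = round(with_tutorial - base, 1)
    rel = round(100 * (with_tutorial - base) / base, 1)
    return pp, rel


table = {key: gains(*vals) for key, vals in results.items()}
for (planner, domain), (pp, rel) in table.items():
    print(f"{planner} / {domain}: +{pp} points, +{rel}% relative")

# User study: relative reduction in task completion time.
time_reduction = round(100 * (147.1 - 131.6) / 147.1, 1)
print(f"completion time reduced by {time_reduction}%")
```

The relative gains are considerably larger than the point gains because the baselines sit near 50%, so it matters which of the two a reader has in mind when comparing against other agent benchmarks.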

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03951.

Fig 1: Demonstration vs. …

Fig 2: Overview of Demo2Tutorial. Our framework comprises four key components: (1) HE-Recorder captures synchronized screen …

Fig 3: TutorialBench statistics across 7 software applications …

Limitations

  • TutorialBench covers only seven software applications and 110 samples, limiting diversity and generalization to other software or domains.
  • No explicit evaluation on adversarial or noisy demonstration inputs; robustness to imperfect or malicious recordings is unknown.
  • User study has limited scale (20 participants) and only tests one specific, non-trivial PowerPoint task; broader applicability to other tasks or learner populations is untested.
  • Actor-critic iterative refinement relies on pretrained large language and vision models, so it may be computationally expensive and dependent on external APIs.
  • Visual quality improvements may depend on accuracy of UI segmentation and OCR, which can vary across software or UI themes.
  • No ablation is reported that isolates the impact of each component (Recorder fidelity, Parser prompting, Planner iterations, Composer editing) on final tutorial quality.

Open questions / follow-ons

  • How well does Demo2Tutorial generalize to diverse software beyond the seven evaluated, particularly with different UI paradigms or languages?
  • Can the system robustly handle imperfect or incomplete demonstrations, such as noisy user inputs or multitasking behavior?
  • How does the tutorial quality affect long-term human learning retention and transfer to other software skills?
  • Could the framework be combined with adversarial robustness methods to guard against forged demonstrations or data poisoning?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Demo2Tutorial presents an intriguing methodology for transforming raw human interaction data into structured, interpretable multimodal instructions. This structured knowledge representation could inspire new ways to automate tutorial generation for security training interfaces or user onboarding in CAPTCHA challenges. The actor-critic refinement loop and adaptive visual annotation techniques highlight how detailed semantic parsing and hierarchical organization can improve the clarity and usability of automated instructional content.

From a defense perspective, methods for parsing coarse user input logs into high-level intents and decomposing workflows can inform detection of automated or scripted behaviors by contrasting genuine human procedural structures with bot actions. Additionally, the generation of multimodal tutorials may facilitate designing better human-in-the-loop verification steps or training regimes in CAPTCHA systems, enhancing task learnability and security. However, as the paper is not security-focused, direct adversarial evaluations or robustness to automated exploit attempts remain to be explored for bot defense applicability.

Cite

bibtex
@article{arxiv2606_03951,
  title={Demo2Tutorial: From Human Experience to Multimodal Software Tutorials},
  author={Zechen Bai and Zhiheng Chen and Yiqi Lin and Kevin Qinghong Lin and Difei Gao and Xiangwu Guo and Xin Wang and Mike Zheng Shou},
  journal={arXiv preprint arXiv:2606.03951},
  year={2026},
  url={https://arxiv.org/abs/2606.03951}
}

Articles are CC BY 4.0 — feel free to quote with attribution