Unsupervised Skill Discovery for Agentic Data Analysis

Source: arXiv:2606.06416 · Published 2026-06-04 · By Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen et al.

TL;DR

This paper addresses the challenge of discovering reusable skills for data-analytic agents without relying on costly labeled supervision or ground-truth answers. Traditional skill discovery in data analysis tasks often depends on expensive human annotation or predefined success criteria, which are infeasible for broad, diverse analytical tasks that vary widely in formats and objectives. To overcome these limitations, the authors propose DataCOPE, an unsupervised verifier-guided skill discovery framework that iteratively coordinates a data-analytic agent, an unsupervised verifier, and a skill manager to discover and distill reusable procedural knowledge from unlabeled exploration trajectories.

DataCOPE introduces two verifier instantiations tailored to different data-analysis task types: an Adaptive Checklist Verifier for open-ended report-style analysis, which automatically generates and refines task-specific verification checklists to score report coverage; and an Answer Agreement Verifier for reasoning-style tasks that clusters trajectories by final answers and uses self-consistency to estimate confidence. By using these unsupervised signals derived internally from exploration trajectories, DataCOPE groups trajectories contrastively to guide the skill manager in distilling reusable, robust analytical skills. Empirical evaluation on two benchmark suites (Deep Data Research for reporting and DABStep for reasoning) shows consistent and substantial improvements over strong baselines across four different base models. The method improves mean accuracy scores by 9.71% on report-style tasks and 32.30% on reasoning-style tasks, while reducing token usage and inference cost.

The paper contributes a principled approach to unsupervised skill discovery in heterogeneous data analysis settings, demonstrating improved generalization and transferability of discovered skills without any external supervision, success labels, or human annotations.

Key findings

DataCOPE achieved a mean accuracy improvement of 9.71% on report-style tasks (Deep Data Research) over baseline skill-creator methods.
On reasoning-style tasks (DABStep), DataCOPE improved mean accuracy by 32.30% compared to baselines.
The framework reduced token consumption during analysis by 41.7% to 73.4% while simultaneously improving accuracy (Table V).
Removing adaptive checklist refinement decreased report accuracy on the 10-K subset from 67.12% to 57.30%, showing the importance of iterative checklist refinement.
For reasoning tasks, answer clustering was critical, with removal dropping accuracy from 62.82% to 47.93%, even below an unfiltered baseline with raw trajectories.
Self-consistency contributed additional gains for reasoning tasks, improving accuracy from 55.92% to 62.82% when combined with answer clustering.
Skill granularity mattered: using all 9 DABStep task category-specific skills outperformed coarser skill sets, indicating the importance of appropriate specialization.
DataCOPE skills transferred well across different base models (Claude, GPT-5.2, DeepSeek, Qwen), providing consistent improvements.

Threat model

n/a — The paper does not frame a security-related adversary or threat model. The focus is on unsupervised skill discovery without assumptions about adversarial interference.

Methodology — deep read

Threat Model and Assumptions: The paper models no adversary, since it presents a skill discovery framework rather than a security study. The main assumption is that no ground-truth labels, success signals, or human feedback are available for the data analysis tasks. Skill discovery must rely solely on unlabeled interactions and outputs from an otherwise fixed, pretrained LLM-based data-analytic agent.
Data: The authors used two benchmark suites. For report-style tasks, the Deep Data Research datasets (subsets MIMIC, GLOBEM, 10-K) were split into an unlabeled exploration partition (D_explore) and a held-out test partition (D_test). The DABStep datasets for reasoning-style tasks were split the same way. Labels and success annotations are not accessed during skill discovery; they are used only for offline evaluation.
Architecture and Algorithm: The core components are:

A Data-Analytic Agent πθ: a pretrained LLM agent (e.g., Claude Sonnet, Qwen3.5) that generates exploration trajectories interacting with the data analysis environment (including code generation, reasoning, and report writing). The agent operates under the ReAct paradigm producing interleaved thoughts and actions.
An Unsupervised Verifier ϕ: derives indirect quality signals and groups trajectories based on task-specific unsupervised criteria.
A Skill Manager ψω: distills reusable procedural knowledge by contrasting good and bad trajectory groups based on verifier signals.

The framework iterates over rounds r, where at each iteration the agent generates multiple trajectories per task under the current skill S(r). The verifier clusters or scores these trajectories to generate groups and signal scores. The skill manager then updates the skill S(r+1) by distilling contrasting patterns between better and worse groups.
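The round structure described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function names, the callable signatures, and the toy stand-ins for the agent, verifier, and skill manager are all assumptions made for the example, and a skill is represented as a plain string.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions for this sketch, not the paper's API).
Agent = Callable[[str, str], str]  # (task, skill) -> trajectory text
Verifier = Callable[[Dict[str, List[str]]], Tuple[List[str], List[str]]]
SkillManager = Callable[[str, List[str], List[str]], str]


def discover_skill(
    tasks: List[str],
    agent: Agent,
    verifier: Verifier,
    manager: SkillManager,
    rounds: int = 3,
    samples_per_task: int = 1,
    initial_skill: str = "",
) -> str:
    """Run the explore -> verify -> distill loop for a fixed number of rounds."""
    skill = initial_skill
    for _ in range(rounds):
        # 1. The agent samples trajectories for every task under the current skill.
        by_task = {
            task: [agent(task, skill) for _ in range(samples_per_task)]
            for task in tasks
        }
        # 2. The verifier groups trajectories into better and worse sets.
        better, worse = verifier(by_task)
        # 3. The skill manager distills the contrast into an updated skill.
        skill = manager(skill, better, worse)
    return skill


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_agent(task: str, skill: str) -> str:
    return f"{task}:{len(skill)}"


def toy_verifier(by_task: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    flat = [t for trajectories in by_task.values() for t in trajectories]
    half = len(flat) // 2
    return flat[:half], flat[half:]


def toy_manager(skill: str, better: List[str], worse: List[str]) -> str:
    return skill + f"[+{len(better)}/-{len(worse)}]"


final_skill = discover_skill(
    ["a", "b"], toy_agent, toy_verifier, toy_manager, rounds=2, samples_per_task=2
)
print(final_skill)  # [+2/-2][+2/-2]
```

In a real setting the three callables would wrap LLM calls; the loop itself stays the same regardless of which verifier is plugged in.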

For report-style tasks, the verifier is instantiated as an Adaptive Checklist Verifier that automatically generates and refines task-specific checklists. These checklists specify checkable analytical criteria derived from the task prompt. Each report trajectory is scored on checklist coverage, providing a soft quality signal. The skill manager contrasts high-scoring and low-scoring groups to improve both the data-analytic reporting skill and the checklist-generation skill iteratively.
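Checklist coverage scoring can be illustrated with a small sketch. The paper's verifier presumably uses a model to judge each criterion; here a keyword-matching `keyword_judge` stands in for that judgment, and the fixed `threshold` used to split reports into high and low groups is an assumption, as are all function names.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (report, checklist item) -> satisfied?


def coverage_score(report: str, checklist: List[str], judge: Judge) -> float:
    """Fraction of checklist items the report satisfies (0.0 for an empty checklist)."""
    if not checklist:
        return 0.0
    return sum(judge(report, item) for item in checklist) / len(checklist)


def split_by_score(
    reports: List[str], checklist: List[str], judge: Judge, threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Partition reports into high- and low-coverage groups (threshold is an assumption)."""
    high, low = [], []
    for report in reports:
        score = coverage_score(report, checklist, judge)
        (high if score >= threshold else low).append(report)
    return high, low


def keyword_judge(report: str, item: str) -> bool:
    """Hypothetical stand-in for a model-based judgment: case-insensitive substring match."""
    return item.lower() in report.lower()


checklist = ["missing values", "trend"]
reports = [
    "We handled missing values and plotted the trend.",
    "Summary only.",
]
high, low = split_by_score(reports, checklist, keyword_judge)
print(coverage_score(reports[0], checklist, keyword_judge))  # 1.0
print(len(high), len(low))  # 1 1
```

The two groups returned by `split_by_score` play the role of the high-scoring and low-scoring sets that the skill manager contrasts.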

For reasoning-style fixed-answer tasks, the verifier uses Answer Agreement: clustering final answers from multiple trajectory samples to identify agreement groups and using cluster sizes for self-consistency confidence. The skill manager focuses on refining skills that explain cluster-level divergences and improves reasoning robustness.
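Answer agreement can be sketched as grouping sampled trajectories by their normalized final answer and treating the relative size of the largest group as a self-consistency confidence. The normalization rule and the function names below are assumptions for illustration; the paper may cluster answers differently.

```python
from typing import Dict, List, Tuple


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def answer_agreement(answers: List[str]) -> Tuple[Dict[str, List[int]], str, float]:
    """Cluster trajectory indices by final answer and score the majority cluster.

    Returns (clusters, majority_answer, confidence), where confidence is the
    share of trajectories that fall in the largest cluster.
    """
    if not answers:
        raise ValueError("need at least one answer")
    clusters: Dict[str, List[int]] = {}
    for index, answer in enumerate(answers):
        clusters.setdefault(normalize(answer), []).append(index)
    # Ties resolve to the cluster that appeared first.
    majority = max(clusters, key=lambda key: len(clusters[key]))
    confidence = len(clusters[majority]) / len(answers)
    return clusters, majority, confidence


sampled = ["42", " 42 ", "41", "42", "forty"]
clusters, majority, confidence = answer_agreement(sampled)
print(clusters)    # {'42': [0, 1, 3], '41': [2], 'forty': [4]}
print(majority)    # 42
print(confidence)  # 0.6
```

A low confidence marks a task where trajectories diverge, which is where contrasting the clusters is most informative for the skill manager.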

Training Regime:

For report-style tasks, one trajectory per instance is sampled; iterative skill refinement alternates three updates of the data-analytic agent’s skill interleaved with two updates of the checklist agent’s skill.
For reasoning tasks, ten trajectories per instance are sampled, and the skill manager runs three refinement iterations.
Base models include Claude Sonnet 4.5/4.6, GPT-5.2, DeepSeek-V4-Pro, Qwen 3.5-397B.
Sampling temperature during exploration is 1.0; evaluation is at 0.0 temperature.
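The settings listed above can be collected into a small configuration sketch. The numeric values come from the summary; the field names, the `ExplorationConfig` class, and the exact ordering produced by `update_schedule` (agent update first, then checklist update, alternating) are assumptions about how the interleaving might be arranged.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExplorationConfig:
    """Hypothetical container for the reported exploration settings."""
    samples_per_task: int
    agent_skill_updates: int
    checklist_skill_updates: int = 0
    explore_temperature: float = 1.0
    eval_temperature: float = 0.0


REPORT = ExplorationConfig(samples_per_task=1, agent_skill_updates=3, checklist_skill_updates=2)
REASONING = ExplorationConfig(samples_per_task=10, agent_skill_updates=3)


def update_schedule(config: ExplorationConfig) -> List[str]:
    """Interleave agent and checklist updates, agent first (assumed ordering)."""
    schedule: List[str] = []
    for i in range(max(config.agent_skill_updates, config.checklist_skill_updates)):
        if i < config.agent_skill_updates:
            schedule.append("agent")
        if i < config.checklist_skill_updates:
            schedule.append("checklist")
    return schedule


print(update_schedule(REPORT))
# ['agent', 'checklist', 'agent', 'checklist', 'agent']
print(update_schedule(REASONING))
# ['agent', 'agent', 'agent']
```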

Evaluation Protocol: Held-out test sets with ground-truth answers and checklists were used only for final offline evaluation. Metrics are sample-averaged and item-averaged checklist accuracy for reports, and overall accuracy for reasoning tasks. Baselines include Skill Creator, a supervised skill discovery baseline using labeled trajectories. Ablations evaluate components of the verifier (checklist refinement, task-specificity, answer clustering, self-consistency) and skill granularity. Iteration analysis tracks performance and verifier scores across refinement rounds to assess monotonicity and convergence.
Reproducibility: The paper does not explicitly mention releasing code or distilled skills. Deep Data Research and DABStep are academic benchmarks, but this summary could not confirm whether they are publicly available. The authors report detailed hyperparameters and the base models used.
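The two report metrics named in the evaluation protocol can differ when checklists vary in length. The sketch below uses assumed definitions, since the summary does not spell them out: sample-averaged accuracy is taken as the mean of per-report pass rates, and item-averaged accuracy as the pass rate pooled over all checklist items. The data are made up for illustration.

```python
from typing import List


def sample_averaged(results: List[List[bool]]) -> float:
    """Mean of per-report pass rates (assumed definition); empty checklists are skipped."""
    per_sample = [sum(items) / len(items) for items in results if items]
    if not per_sample:
        raise ValueError("no non-empty checklists to score")
    return sum(per_sample) / len(per_sample)


def item_averaged(results: List[List[bool]]) -> float:
    """Pass rate pooled over every checklist item (assumed definition)."""
    total = sum(len(items) for items in results)
    if total == 0:
        raise ValueError("no checklist items to score")
    return sum(sum(items) for items in results) / total


# Made-up example: one short checklist fully passed, one longer one mostly failed.
results = [[True, True], [True, False, False, False]]
print(sample_averaged(results))  # 0.625
print(item_averaged(results))    # 0.5
```

Under these definitions, sample averaging weights every report equally while item averaging weights reports with longer checklists more heavily.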

Example: For a report-style task on MIMIC data, at iteration r=0 the agent generates one report trajectory per instance. The checklist agent produces an initial checklist of criteria for the task. The verifier scores each report on these criteria to partition them into positive and negative groups. The skill manager distills contrasting procedural knowledge from these groups to update the report-generation skill to S(1). Subsequently, the checklist agent refines the checklist to capture missing criteria using the contrasting positive and negative reports. This loop iterates until the average checklist score stops improving. The final distilled skill is then used on held-out tasks with no further tuning.
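The stopping rule in the example ("until the average checklist score stops improving") can be sketched as follows. The function name, the `min_delta` tolerance, and the score values are illustrative assumptions; the score used is the unsupervised verifier score, not a ground-truth metric.

```python
from typing import List


def last_improving_round(avg_scores: List[float], min_delta: float = 0.0) -> int:
    """Return the index of the last round before the average score stopped improving.

    A round counts as an improvement only if it beats the best score so far by
    more than `min_delta` (a hypothetical tolerance, not from the paper).
    """
    if not avg_scores:
        raise ValueError("need at least one round of scores")
    best = float("-inf")
    for round_index, score in enumerate(avg_scores):
        if score <= best + min_delta:
            return round_index - 1  # the previous round was the last improvement
        best = score
    return len(avg_scores) - 1


# Made-up average checklist scores per refinement round.
print(last_improving_round([0.40, 0.55, 0.61, 0.60]))  # 2
print(last_improving_round([0.10, 0.20]))              # 1
```

Keeping the skill from the returned round, rather than the final one, is one way to guard against the non-monotonic refinement noted under Limitations.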

Technical innovations

A novel unsupervised verifier-guided skill discovery framework (DataCOPE) that leverages internally derived verifier signals from unlabeled agent trajectories to guide skill distillation without access to ground-truth labels or success labels.
Design of an Adaptive Checklist Verifier that automatically generates, scores, and iteratively refines task-specific checklists for open-ended report-style data-analysis tasks to provide unsupervised quality signals.
Introduction of an Answer Agreement Verifier for reasoning tasks that clusters trajectories by their final answers and utilizes cluster-based self-consistency as an auxiliary uncertainty signal to guide unsupervised skill improvement.
An iterative skill refinement loop that coordinates a data-analytic agent, an unsupervised verifier, and a skill manager, enabling contrastive skill distillation to yield reusable procedural knowledge transferable across tasks and models.

Datasets

Deep Data Research — size and public status unspecified — academic benchmark for report-style data analysis
DABStep — size and public status unspecified — benchmark for reasoning-style data analysis

Baselines vs proposed

Skill Creator baseline vs DataCOPE on report-style tasks: overall average accuracy 51.34% vs 57.10%
Skill Creator baseline vs DataCOPE on reasoning-style tasks: mean accuracy 51.73% vs 61.44%
Removing checklist refinement reduces report accuracy on 10-K from 67.12% to 57.30%
Removing task-specific checklists further reduces report accuracy to 52.21%
Removing answer clustering on reasoning tasks drops accuracy from 62.82% to 47.93%
Removing self-consistency on reasoning tasks reduces accuracy from 62.82% to 55.92%
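The comparisons above are easier to read as absolute percentage-point differences. The snippet below only does arithmetic on the figures listed in this section; the row labels and the `point_gap` helper are introduced for illustration.

```python
def point_gap(reference: float, other: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(other - reference, 2)


# (label, reference accuracy %, compared accuracy %), taken from the list above.
rows = [
    ("Report tasks: Skill Creator -> DataCOPE", 51.34, 57.10),
    ("Reasoning tasks: Skill Creator -> DataCOPE", 51.73, 61.44),
    ("10-K: full -> no checklist refinement", 67.12, 57.30),
    ("10-K: full -> no task-specific checklists", 67.12, 52.21),
    ("Reasoning: full -> no answer clustering", 62.82, 47.93),
    ("Reasoning: full -> no self-consistency", 62.82, 55.92),
]

gaps = {label: point_gap(reference, other) for label, reference, other in rows}
for label, gap in gaps.items():
    print(f"{label}: {gap:+.2f} points")
# Report tasks: Skill Creator -> DataCOPE: +5.76 points
# Reasoning tasks: Skill Creator -> DataCOPE: +9.71 points
# 10-K: full -> no checklist refinement: -9.82 points
# 10-K: full -> no task-specific checklists: -14.91 points
# Reasoning: full -> no answer clustering: -14.89 points
# Reasoning: full -> no self-consistency: -6.90 points
```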

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.06416.

Fig 1: Supervised skill discovery requires costly data annota-

Fig 2: Overview of the DataCOPE framework. The data-analytic agent samples trajectories from an unlabeled exploration

Limitations

The quality of the verifier signals depends heavily on the adaptive checklist and answer clustering heuristics, which may not generalize perfectly to all data-analysis task formats.
Results show that iterative refinement is not always monotonic; later iterations can degrade performance on some datasets, highlighting challenges in convergence and stability.
The approach relies on multiple samples of trajectories per task during exploration, which may lead to increased inference cost, though overall token efficiency is improved with learned skills.
Evaluation uses only offline held-out metrics with ground-truth checklists or answers; the paper reports no real-world deployment or adversarial testing to confirm robustness.
Code, pretrained skill models, and datasets are not stated as publicly released, limiting immediate replication or extension by others.
The method assumes access to structured environment states (e.g., data files, code execution results), which may limit application scenarios where such interfaces are unavailable.

Open questions / follow-ons

How well would the verifier-guided skill discovery generalize to other complex data-analysis domains or multi-modal data beyond tabular formats?
Can the framework be adapted to accommodate dynamic or evolving task objectives where success criteria shift over time?
What is the impact of verifier design choices (e.g., clustering method, checklist granularity) on skill discovery quality under distribution shifts?
How can the iterative refinement procedure be stabilized or optimized to avoid performance degradation during later iterations?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, DataCOPE shows one way to acquire procedural skills from unlabeled behavioral data, which may carry over by analogy to learning attacker or bot behavior patterns without known ground truth. Its verifier-guided skill distillation uses indirect, self-derived signals, such as agreement clustering or adaptive criteria, to characterize behavioral quality and improve automated agents.

Although the method is applied to data-analytic language agents, it suggests approaches to unsupervised behavioral profiling or anomaly detection in bot defense, where labeled attack data is scarce and success criteria vary. The contrastive grouping and refinement mechanisms could inform defenses that adapt to evolving attacker tactics by extracting reusable behavioral "signatures" from unlabeled interaction logs, augmenting bot detection without retraining on costly labeled attack samples.

Cite

bibtex

@article{arxiv2606_06416,
  title={Unsupervised Skill Discovery for Agentic Data Analysis},
  author={Zhisong Qiu and Kangqi Song and Shengwei Tang and Shuofei Qiao and Lei Liang and Huajun Chen and Shumin Deng},
  journal={arXiv preprint arXiv:2606.06416},
  year={2026},
  url={https://arxiv.org/abs/2606.06416}
}
