OpenRoundup: Multi-Table Data Wrangling Through Interactive Visualization

Source: arXiv:2606.12648 · Published 2026-06-10 · By Stephen Kasica, Charles Berret, Tamara Munzner

TL;DR

OpenRoundup addresses a key gap in data wrangling tools for data journalists by treating multi-table consolidation as the fundamental unit of work, rather than focusing on single-table transformations. It enables journalists—who often lack programming expertise—to interactively combine heterogeneous source tables from independent publications into a single consolidated, analysis-ready dataset without writing code. The system leverages an eager table consolidation approach wherein a composite table is incrementally built early in the wrangling process through two declarative operations: Stack (vertical union) and Pack (horizontal join on keys). OpenRoundup’s browser-based client-only architecture using DuckDB-WASM provides strong data privacy suitable for sensitive pre-publication datasets.

The paper’s contributions include the design of a multi-panel interactive visualization interface incorporating a schema-first, values-on-demand paradigm with live schema previews, data quality alerts, and an operation tree treemap showing the evolving consolidation structure. Evaluation through a replication study of 17 real-world journalist programming workflows demonstrates that all were expressible via OpenRoundup’s interface alone, affirming coverage of practical consolidation scenarios. A deployment study with four professional data journalists confirmed the tool’s utility for practitioners comfortable conceptually with joins but unable to program, while revealing secondary value for data journalism education. Overall, OpenRoundup bridges the research-to-practice gap imposed by schema heterogeneity, transient one-off workflows, and non-technical users in accountability journalism.

Key findings

  • OpenRoundup enabled replication of 17 published journalistic data consolidation workflows without any programming, demonstrating expressive coverage of practical tasks.
  • The system’s eager consolidation strategy assembles a composite table early via incremental operations, improving early visibility of schema mismatches and join compatibility.
  • Two declarative operations, Stack (variadic vertical union) and Pack (binary horizontal join), suffice to model the majority of journalistic table consolidation scenarios.
  • The interactive interface with five coordinated panels produced live schema previews and data quality alerts that helped users identify structural and quality issues before propagation.
  • A client-only DuckDB-WASM backend enables execution entirely in-browser, supporting sensitive journalism data with strong privacy guarantees without server-side processing.
  • Deployment with four professional journalists showed non-programmers could successfully use OpenRoundup for real consolidation tasks, expanding accessibility beyond coding-centric tools.
  • The system surfaced unexpected secondary pedagogical value for teaching data journalism concepts and joins in non-technical settings.

Threat model

The adversary is any external party attempting to access sensitive or embargoed journalism data during wrangling. Because user data privacy is paramount, all operations run client-side in the browser via DuckDB-WASM, with no external servers or data transmission. An attacker outside the user's local environment therefore cannot observe intermediate data transformations or schemas. The paper does not, however, present a comprehensive adversarial threat evaluation.

Methodology — deep read

  1. Threat model & assumptions: The threat model is implicit—journalists have sensitive, embargoed data that must be wrangled locally and privately in the browser to prevent leaks. Users may lack programming skills but understand join concepts. Adversaries are external entities not able to access client-local state or data because no server-side processing occurs.

  2. Data: For evaluation, the authors collected 17 published data journalist programming workflows from various accountability journalism projects, which served as ground truth for replication. They also conducted a deployment study with four professional journalists using their own sensitive datasets. No specific dataset sizes or splits are reported, but sources reflect typical heterogeneous public and third-party tabular data encountered in journalism.

  3. Architecture/algorithm: OpenRoundup is implemented as a browser application powered by DuckDB compiled to WebAssembly (DuckDB-WASM), enabling SQL query execution locally. The system models table consolidation as an operation tree formed via interactive user application of two operations: Stack (variadic vertical union of tables aligned by column index) and Pack (binary join combining tables on shared key columns). This snowball approach incrementally constructs a composite table that accumulates structure and rows as additional tables are incorporated.

The interface has five coordinated panels: Data Inventory (lists tables, columns, operations), Schema Panel (shows schema and columns of focused table/operation), Table Rows Panel (spreadsheet view of data values), Composite Schema Panel (recursive treemap visualizing the operation tree), and Column Focus Panel (on-demand detailed analytics). Column cards support direct manipulation such as reordering within tables.

  4. Training regime: Not applicable as no machine learning is introduced; the system is a data wrangling tool.

  5. Evaluation protocol: The replication study involved reproducing 17 published journalist workflows entirely via the interface’s declarative operations without code, assessing expressive coverage. The deployment study observed four professional journalists performing typical tasks, collecting qualitative feedback on utility, usability, and workflows. The studies emphasize real-world ecological validity rather than quantitative metrics like accuracy or runtime.

  6. Reproducibility: The system is open-source, and the paper provides detailed design descriptions, but some evaluation data are unpublished or proprietary to journalists. No formal cross-validation or adversarial tests were reported. The interactive visual components and layered architecture (browser UI + WASM backend) are thoroughly described.

Concrete example end-to-end: A journalist begins with multiple source tables from independent public datasets with heterogeneous schemas. In OpenRoundup’s Data Inventory, they select a pair of tables and apply the Pack operation to horizontally join rows on a shared key. The Schema Panel shows the resulting combined schema with column cards. The Table Rows Panel displays sample data, and alerts indicate mismatches or missing data. The journalist then stacks additional tables vertically using Stack operations, gradually building a composite table whose operation tree is visualized in the Composite Schema Panel. At any point, they can reorder columns, hide irrelevant columns, and inspect detailed column statistics in the Column Focus Panel. After iterative exploration and schema refinement via direct manipulation, the journalist exports the analysis-ready consolidated table for downstream use without writing any code.
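The walkthrough above can be approximated in SQL. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for DuckDB-WASM, and the table names, columns, and values are invented for the example. It also assumes that Pack behaves like a left join and Stack like UNION ALL; this summary does not specify the join type or the SQL that OpenRoundup actually generates.

```python
import sqlite3

# In-memory database standing in for the in-browser SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facilities (facility_id TEXT, county TEXT);
    CREATE TABLE inspections_2023 (facility_id TEXT, violations INTEGER);
    CREATE TABLE legacy_records (facility_id TEXT, county TEXT, violations INTEGER);
    INSERT INTO facilities VALUES ('F1', 'Adams'), ('F2', 'Baker'), ('F3', 'Clark');
    INSERT INTO inspections_2023 VALUES ('F1', 4), ('F2', 0);
    INSERT INTO legacy_records VALUES ('F9', 'Dale', 7);
""")

# Pack: horizontal join of two tables on a shared key.
# A LEFT JOIN (assumed) keeps unmatched rows so missing data stays visible.
conn.execute("""
    CREATE VIEW packed AS
    SELECT f.facility_id, f.county, i.violations
    FROM facilities AS f
    LEFT JOIN inspections_2023 AS i ON f.facility_id = i.facility_id
""")

# Stack: vertical union of the packed result with another table.
conn.execute("""
    CREATE VIEW composite AS
    SELECT facility_id, county, violations FROM packed
    UNION ALL
    SELECT facility_id, county, violations FROM legacy_records
""")

# A simple stand-in for a data quality alert: rows with no join partner.
unmatched = conn.execute(
    "SELECT COUNT(*) FROM packed WHERE violations IS NULL"
).fetchone()[0]

rows = conn.execute("SELECT * FROM composite ORDER BY facility_id").fetchall()
for row in rows:
    print(row)
print("facilities without an inspection row:", unmatched)
```

Defining each step as a view rather than a materialized table mirrors the idea of a composite that is rebuilt from its sources, so changing an earlier step changes everything downstream.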

Technical innovations

  • Eager table consolidation strategy assembling a composite multi-table schema incrementally early in the wrangling pipeline, diverging from delayed consolidation models.
  • A minimal declarative vocabulary for table consolidation using only two operations—Stack (variadic vertical union by column position) and Pack (binary horizontal join on shared keys)—providing expressive coverage for heterogeneous journalistic data.
  • A coordinated multi-panel interactive visualization interface implementing a schema-first, values-on-demand paradigm with live schema previews, ambient data quality alerts, and a recursive treemap operation tree visualization.
  • Client-only browser architecture employing DuckDB compiled to WebAssembly (DuckDB-WASM), allowing full privacy-preserving execution of complex SQL operations without server dependence.
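One way to picture the operation tree and the schema-first preview is as a small recursive data structure. The following is a minimal sketch, not OpenRoundup's implementation: the class names Source, Stack, and Pack, and the functions preview_schema and outline, are hypothetical, and the rule that a Stack takes its column names from its first child is an assumption. The point is that the resulting schema can be derived from column names alone, without reading any rows.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Source:
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Stack:
    children: Tuple["Node", ...]  # variadic: any number of inputs


@dataclass(frozen=True)
class Pack:
    left: "Node"
    right: "Node"
    key: str  # shared key column (binary join)


Node = Union[Source, Stack, Pack]


def preview_schema(node: Node) -> List[str]:
    """Derive the output columns of a node from schemas only (no row data)."""
    if isinstance(node, Source):
        return list(node.columns)
    if isinstance(node, Stack):
        schemas = [preview_schema(child) for child in node.children]
        if len({len(s) for s in schemas}) != 1:
            raise ValueError("Stack inputs differ in column count")
        return schemas[0]  # assumed: names come from the first child
    left, right = preview_schema(node.left), preview_schema(node.right)
    if node.key not in left or node.key not in right:
        raise ValueError(f"key {node.key!r} missing from one side of Pack")
    return left + [c for c in right if c != node.key]


def outline(node: Node, depth: int = 0) -> List[str]:
    """Indented text outline, a textual stand-in for the nested treemap."""
    pad = "  " * depth
    if isinstance(node, Source):
        return [pad + node.name]
    if isinstance(node, Stack):
        lines = [pad + "Stack"]
        for child in node.children:
            lines.extend(outline(child, depth + 1))
        return lines
    lines = [pad + f"Pack on {node.key}"]
    lines.extend(outline(node.left, depth + 1))
    lines.extend(outline(node.right, depth + 1))
    return lines


tree = Pack(
    left=Stack((
        Source("arrests_2022", ("county", "arrests")),
        Source("arrests_2023", ("county", "arrests")),
    )),
    right=Source("population", ("county", "residents")),
    key="county",
)
schema = preview_schema(tree)
tree_lines = outline(tree)
print(schema)
print("\n".join(tree_lines))
```

Because each node has exactly one parent, this structure is a tree rather than a DAG, which matches the limitation noted later about shared intermediate nodes.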

Datasets

  • 17 journalistic data consolidation workflows: sourced from published accountability journalism projects; dataset sizes not reported
  • Deployment study datasets: confidential, drawn from the four professional data journalists' own sensitive data

Baselines vs proposed

  • Prior journalist programming workflows requiring code: all 17 replicated using the OpenRoundup interface alone, without code
  • Manual spreadsheet consolidation workflows: qualitative deployment feedback reports improved accessibility and schema visibility
  • Existing multi-source wrangling tools (Tableau Prep Builder, OpenRefine): OpenRoundup is positioned as supporting non-technical users on heterogeneous public tables with local browser execution; no direct quantitative comparison is presented

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12648.

Fig 1: OpenRoundup supports interactive table assembly with

Fig 2: The OpenRoundup interface, showing an intermediate step in the Crime & Heat usage example (Sec. V-A). The

Fig 3: OpenRoundup interaction flow. Users navigate clock-

Fig 4: (c), is a recursive treemap visualization [63] of the entire

Fig 5 (page 6).

Fig 6 (page 8).

Fig 5: OpenRoundup is built on a three-layer architecture

Fig 6: This 3 × 3 matrix defines the nine sagas used to

Limitations

  • Evaluation primarily qualitative with small deployment sample (4 journalists) and limited generalization testing
  • No adversarial robustness or stress testing on large-scale or heavily corrupted datasets reported
  • Focuses only on schema mapping and consolidation, deferring entity resolution, duplicate detection, and data fusion to future work
  • Stack aligns tables by column position, so it may struggle with highly inconsistent or unaligned schemas
  • Operation tree model does not support shared intermediate nodes or DAG structures, limiting some reuse opportunities
  • No automated semantic schema-matching or inference beyond basic ambient alerts, requiring user decisions throughout
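The positional-alignment limitation is easy to reproduce in miniature. The sketch below is illustrative only: stack_by_position and position_alerts are hypothetical helpers, the data is invented, and the alert wording is not taken from OpenRoundup. It shows how a union by column position silently places values under the wrong header when two sources order their columns differently, and how a name comparison at each position can flag the problem.

```python
from typing import List, Sequence, Tuple

Table = Tuple[List[str], List[tuple]]  # (header, rows)


def stack_by_position(tables: Sequence[Table]) -> Table:
    """Vertical union that aligns columns purely by position."""
    header = list(tables[0][0])
    rows: List[tuple] = []
    for columns, data in tables:
        if len(columns) != len(header):
            raise ValueError("tables differ in column count")
        rows.extend(data)
    return header, rows


def position_alerts(tables: Sequence[Table]) -> List[str]:
    """Flag positions where a later table's column name differs from the first."""
    header = tables[0][0]
    alerts = []
    for index, (columns, _) in enumerate(tables[1:], start=1):
        for pos, (expected, actual) in enumerate(zip(header, columns)):
            if expected != actual:
                alerts.append(
                    f"table {index}, position {pos}: "
                    f"'{actual}' stacked under '{expected}'"
                )
    return alerts


# Same three columns, but the second source lists them in a different order.
report_a: Table = (["city", "year", "count"], [("Austin", 2023, 12)])
report_b: Table = (["year", "city", "count"], [(2024, "Austin", 15)])

stacked = stack_by_position([report_a, report_b])
alerts = position_alerts([report_a, report_b])

print(stacked[0])
for row in stacked[1]:
    print(row)
for alert in alerts:
    print("ALERT:", alert)
```

In this toy case the fix is to reorder the columns of one source before stacking, which corresponds to the direct-manipulation reordering the interface is described as supporting.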

Open questions / follow-ons

  • How can automated semantic schema matching or alignment inference be integrated to further reduce human effort in heterogeneous table consolidation?
  • What are the scalability limits for client-side browser execution with DuckDB-WASM on larger or more complex multi-table datasets?
  • Can OpenRoundup’s declarative Stack and Pack vocabulary be extended to support non-key joins, fuzzy merges, or entity resolution within the interactive framework?
  • How does schema-first, eager consolidation fare compared to delayed consolidation in diverse domain contexts outside journalism?
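To make the fuzzy-merge question concrete, the sketch below shows what approximate key matching could look like before a Pack. It is purely hypothetical and not part of OpenRoundup: fuzzy_key_map is an invented helper, the 0.8 similarity cutoff is an arbitrary choice, and the county names are made up. It uses get_close_matches from Python's standard difflib module.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional


def fuzzy_key_map(
    left_keys: List[str], right_keys: List[str], cutoff: float = 0.8
) -> Dict[str, Optional[str]]:
    """Map each left key to its closest right key, or None if nothing is close."""
    mapping: Dict[str, Optional[str]] = {}
    for key in left_keys:
        candidates = get_close_matches(key, right_keys, n=1, cutoff=cutoff)
        mapping[key] = candidates[0] if candidates else None
    return mapping


left = ["St. Louis County", "Jefferson County", "Adair County"]
right = ["St Louis County", "Jeferson County", "Boone County"]

mapping = fuzzy_key_map(left, right)
for source_key, matched_key in mapping.items():
    print(f"{source_key!r} -> {matched_key!r}")
```

A mapping like this would still need human review, since string similarity alone can pair keys that refer to different entities.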

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, OpenRoundup's approach offers lessons in designing interactive consolidation tools that emphasize early schema visibility, minimal and interpretable operations, and privacy-preserving, client-side computation. The eager consolidation concept parallels incremental integration of multi-source threat intelligence or user behavior datasets, where schema uncertainty is common and early structural feedback is valuable. The Stack and Pack vocabulary abstracts complex data merging into simple composable steps—a useful pattern for designing explainable and user-guided aggregation pipelines in bot and fraud detection. Additionally, the architectural decision to perform all processing locally with a WebAssembly SQL engine aligns with privacy-preserving analytics on sensitive behavioral data, a growing concern in security-sensitive bot detection. While OpenRoundup targets data journalists, its principles in multi-table interactive data wrangling and schema reasoning can inform bot-defense systems dealing with heterogeneous multi-source logs or telemetry integration in a user-friendly manner.

Cite

@article{arxiv2606_12648,
  title={OpenRoundup: Multi-Table Data Wrangling Through Interactive Visualization},
  author={Stephen Kasica and Charles Berret and Tamara Munzner},
  journal={arXiv preprint arXiv:2606.12648},
  year={2026},
  url={https://arxiv.org/abs/2606.12648}
}

Articles are CC BY 4.0 — feel free to quote with attribution