TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

Source: arXiv:2605.07507 · Published 2026-05-08 · By Hanqing Zhao

TL;DR

TCMIIES (Traditional Chinese Medicine Information Intelligent Extraction System) addresses the practical barrier between LLM-powered information extraction and domain researchers who lack programming skills or cannot risk sending sensitive unpublished data to intermediary services. The problem is real: systematic review workflows in specialized fields like TCM require structured extraction across hundreds or thousands of papers, yet most LLM tooling assumes a developer audience. The paper's contribution is a self-contained single HTML file (~1,028 lines, 45 KB) that runs entirely in the browser, uses commercial LLM APIs directly from the client, and provides a GUI for defining custom extraction schemas without writing any code or standing up any backend.

The technical core is a schema-guided prompting framework where the user defines fields (name, description, type, required flag) through a form UI, and the system automatically synthesizes a structured system prompt enforcing JSON output with field-level type annotations and fallback instructions. This auto-generation mechanism is the paper's main novel contribution over generic prompt-engineering practices. The system also implements intelligent column-name mapping for exports from Chinese academic databases (CNKI, Wanfang) using rule-based substring matching across Chinese and English header variants, concurrent batch processing with configurable parallelism and exponential-backoff retry, and session-storage-based progress persistence.

Evaluation used 500 TCM papers from CNKI across five sub-disciplines, tested with three LLM providers (DeepSeek deepseek-v4-flash, Qwen qwen-turbo, GLM-4 glm-4-flash) on two six-field extraction templates. Overall JSON compliance averaged 94.2% and field-level extraction accuracy averaged 81.6% against two TCM expert annotators (Cohen's κ = 0.82). Reliability stress tests showed 100% batch completion under 10% simulated API failure injection. The paper is primarily a system/demo paper rather than a research advance in ML methodology; its contributions are engineering and accessibility rather than new learning algorithms.

Key findings

Structured JSON output compliance across three providers and two extraction templates averaged 94.2%, with DeepSeek deepseek-v4-flash highest at 95.8% and GLM-4 glm-4-flash lowest at 92.5% (Table 1).
Field-level extraction accuracy averaged 81.6% across six fields and three providers against TCM expert annotation (Cohen's κ = 0.82 inter-annotator agreement); DeepSeek led at 83.7%, GLM-4 trailed at 79.7% (Table 2).
The hardest extraction field was Research Limitations (avg 73.5%) and the easiest was Research Topic (avg 89.3%), consistent with the pattern that explicitly stated information is easier for LLMs to extract than implicit or distributed information.
The three-attempt retry mechanism raised the effective success rate from 88.4% (first attempt) to 97.1%, achieving 100% batch completion under a simulated 10% random API failure rate across the 500-paper corpus (Table 3).
Session recovery via sessionStorage preserved 100% of in-progress results across simulated browser closures; pause/resume cycling over a 200-paper batch showed 100% data integrity.
Estimated cost for processing 1,000 papers with DeepSeek v4-flash is $0.38 total (~2,000 input tokens + 500 output tokens per paper at May 2026 pricing), making 10,000-paper batches feasible at approximately $3.80 (Table 4).
The entire system is a single self-contained HTML file (~45 KB) with no build step, no backend, and no installation, loading Vue.js 3.4 and SheetJS 0.18.5 from jsDelivr CDN at runtime.
Provider-specific parameter tuning — reasoning_effort='max' and thinking mode for DeepSeek V4/Reasoner, enable_thinking=true with thinking_budget=81920 tokens for Qwen — is claimed as a factor in DeepSeek's accuracy advantage, though no ablation isolating this effect is reported.
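
The cost estimate in the findings above is a linear extrapolation, which can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted here ($0.38 per 1,000 papers, roughly 2,000 input and 500 output tokens per paper); the function names are illustrative, and the per-token prices behind the $0.38 figure are not reproduced because this summary does not state them.

```python
# Figures quoted in the summary: $0.38 per 1,000 papers,
# ~2,000 input tokens + ~500 output tokens per paper.
COST_PER_1000_PAPERS_USD = 0.38
TOKENS_PER_PAPER = 2_000 + 500


def batch_cost_usd(n_papers: int) -> float:
    """Linear extrapolation of the quoted per-1,000-paper cost."""
    return COST_PER_1000_PAPERS_USD * n_papers / 1_000


def batch_tokens(n_papers: int) -> int:
    """Approximate total tokens (input + output) for a batch."""
    return TOKENS_PER_PAPER * n_papers


if __name__ == "__main__":
    for n in (500, 1_000, 10_000):
        print(f"{n:>6} papers: ~${batch_cost_usd(n):.2f}, ~{batch_tokens(n):,} tokens")
```

The extrapolation assumes the per-paper token counts stay near the quoted averages; longer inputs such as full texts would scale the cost accordingly.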

Threat model

The paper is not a security research paper in the adversarial ML sense, but it has an implicit privacy threat model. The assumed adversary is an intermediary cloud service or SaaS platform that could log, retain, or train on user-submitted research data, which is particularly relevant for unpublished manuscripts, proprietary literature collections, or sensitive clinical data. The system's defense is architectural: no backend exists, so there is no intermediary to compromise. The only data leaving the user's browser is the content of individual paper records sent as LLM API prompts directly to the chosen provider (e.g., DeepSeek, OpenAI). The system does not defend against: (1) the LLM provider itself logging or training on API inputs, (2) prompt injection attacks embedded in malicious paper content, (3) API key theft from localStorage on shared or compromised machines, or (4) network-level interception of API calls (no additional TLS pinning beyond browser defaults). Apart from noting that Base64 encoding of stored keys is not encryption, the paper does not articulate these residual risks.

Methodology — deep read

Threat model and assumptions: This is not a security paper in the adversarial ML sense. The implicit trust model concerns data privacy: the assumption is that researchers distrust intermediary servers and want raw research data (including unpublished manuscripts) to never leave their machine except as direct API calls to their chosen LLM provider. The adversary is implicitly a cloud service operator or analytics platform that might log or train on user data. The system sidesteps this by having no backend at all — all processing happens client-side in the browser. API keys are Base64-encoded in localStorage (the paper explicitly notes this is not encryption, only obfuscation against casual inspection).
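
To make the "obfuscation, not encryption" point concrete, the sketch below shows that a Base64-encoded string can be reversed by anyone with read access to it, with no secret required. It is written in Python for illustration only; the actual system runs in the browser, and the function names and the example key are invented for this sketch.

```python
import base64


def obfuscate(api_key: str) -> str:
    """Base64-encode a key, as the summary says the system does before storing it."""
    return base64.b64encode(api_key.encode("utf-8")).decode("ascii")


def recover(stored: str) -> str:
    """Reverse the encoding; no password or secret is needed."""
    return base64.b64decode(stored).decode("utf-8")


if __name__ == "__main__":
    stored = obfuscate("sk-example-not-a-real-key")  # made-up key for the demo
    print("stored value:", stored)
    print("recovered   :", recover(stored))
```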

Data provenance and splits: The evaluation corpus consists of 500 TCM research papers exported from CNKI (China National Knowledge Infrastructure), evenly split across five sub-disciplines: herbal pharmacology, acupuncture clinical trials, formula composition studies, TCM syndrome research, and integrative medicine reviews (100 papers each). Each record contains six metadata fields: title (篇名), abstract (摘要), keywords (关键词), authors (作者), source journal (来源), and publication date (发表时间). For accuracy evaluation, a random sample of 20 papers per sub-discipline (100 total) was annotated by two TCM domain experts; inter-annotator agreement was Cohen's κ = 0.82. The dataset is not publicly released; provenance is a CNKI export from Hebei University's TCM Informatics Laboratory. There is no train/test split in the ML sense — the LLMs are used zero-shot with no fine-tuning.
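
For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that formula; the toy labels are invented, and the summary does not say how the annotators' judgments were coded, so this is not a reconstruction of the paper's calculation.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0]  # toy labels, not from the paper
    annotator_2 = [1, 1, 1, 0, 0, 0, 0, 1]
    print(cohens_kappa(annotator_1, annotator_2))
```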

Architecture and algorithm: The system is a Vue.js 3 single-page application packaged as one HTML file. The pipeline has five stages: (1) API Configuration — user selects provider from a registry (DeepSeek, OpenAI, Qwen, Zhipu AI, custom), enters API key (stored Base64 in localStorage), selects model, and sets concurrency parameters. (2) Data Upload — SheetJS parses .xlsx/.xls/.csv entirely in the browser via FileReader API; column names are passed through Algorithm 1, a rule-based substring matcher against a hardcoded mapping table of seven semantic categories in both Chinese and English variants. (3) Extraction Configuration — user defines fields via GUI (name, description, type, required); Algorithm 2 auto-generates a system prompt consisting of a role declaration, a JSON schema template with field names and string type annotations, and a fallback instruction for missing values. The user also writes a prompt template referencing column names via interpolation. (4) Task Execution — a concurrent engine launches up to 10 simultaneous Fetch API calls to the provider's OpenAI-compatible chat completions endpoint; each call includes the auto-generated system prompt and the interpolated user prompt for one paper row. On failure, up to three retries occur with 1-second delays. Progress is auto-saved to sessionStorage every 10 records. (5) Result Export — parsed JSON responses are mapped back to spreadsheet rows and exported via SheetJS as .xlsx, .csv (UTF-8 BOM), or .json.
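
The column-mapping step (Algorithm 1) can be sketched as a substring lookup over a bilingual pattern table. The version below is a Python illustration, not the paper's code: the table covers only the six header pairs listed in this summary (the paper's table has seven categories, which are not reproduced here), the English patterns are assumptions, and the function name is hypothetical.

```python
from typing import Dict, List

# Illustrative pattern table. Chinese headers come from the summary;
# the lowercase English patterns are assumptions made for this sketch.
PATTERNS: Dict[str, List[str]] = {
    "Title": ["篇名", "title"],
    "Abstract": ["摘要", "abstract"],
    "Keywords": ["关键词", "keyword"],
    "Authors": ["作者", "author"],
    "Source": ["来源", "source"],
    "Date": ["发表时间", "date"],
}


def map_columns(headers: List[str]) -> Dict[str, str]:
    """Map each raw header to the first category whose pattern it contains."""
    mapping: Dict[str, str] = {}
    for header in headers:
        normalized = header.strip().lower()
        for category, patterns in PATTERNS.items():
            if any(pattern in normalized for pattern in patterns):
                mapping[header] = category
                break  # first match wins; unmatched headers are left unmapped
    return mapping


if __name__ == "__main__":
    print(map_columns(["篇名", "摘要", "关键词", "Author(s)", "DOI"]))
```

Substring matching tolerates decorated headers such as "Author(s)", at the price of possible false matches when one pattern is contained in an unrelated header.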

JSON parsing robustness: LLM responses may include explanatory text surrounding the JSON object. The system applies the regex {[\s\S]*} to the raw response string and passes the match to JSON.parse. Because the quantifier is greedy, the match runs from the first opening brace to the last closing brace, which recovers a single JSON object wrapped in explanatory text. This is a practical engineering fix for models that prepend or append natural language to their structured output despite explicit instructions not to.
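
The behavior of this fallback can be illustrated with a short sketch. It is written in Python rather than the system's browser JavaScript, and the function name and the choice to return None on failure are assumptions; the pattern itself is the one quoted above, with the braces escaped. Being greedy, the pattern spans from the first opening brace to the last closing brace of the response.

```python
import json
import re
from typing import Any, Optional

# Same pattern as quoted in the summary, with braces escaped for clarity.
JSON_SPAN = re.compile(r"\{[\s\S]*\}")


def extract_json(raw: str) -> Optional[Any]:
    """Return the parsed JSON found in a chatty LLM response, or None."""
    match = JSON_SPAN.search(raw)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # e.g. the greedy match swallowed two separate objects


if __name__ == "__main__":
    wrapped = 'Here is the result:\n{"Research Topic": "acupuncture"}\nHope this helps.'
    print(extract_json(wrapped))
    two_objects = 'Schema: {"a": "string"} Answer: {"a": "x"}'
    print(extract_json(two_objects))
```

The second call shows the limit of the approach: when a response contains two brace-delimited objects, the greedy match covers both and parsing fails.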

Training regime: None — the system performs zero-shot inference only. No fine-tuning, no gradient updates, no epochs or batch sizes in the ML training sense. Hyperparameters are API call parameters: temperature (suppressed for DeepSeek reasoning models), reasoning_effort='max' for DeepSeek V4/Reasoner, thinking_budget=81920 for Qwen models. These are set programmatically based on model name matching, not user-configurable in the current UI.
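
The model-name-based parameter selection described above might look roughly like the following. This is a sketch under stated assumptions: the parameter names and values (reasoning_effort, enable_thinking, thinking_budget=81920, no temperature for DeepSeek reasoning models) are taken from this summary and have not been checked against the providers' API documentation, while the matching rules, the default temperature, and the function name are invented for illustration.

```python
from typing import Any, Dict


def build_request_params(model: str, temperature: float = 0.2) -> Dict[str, Any]:
    """Pick request parameters from the model name (hypothetical matching rules)."""
    name = model.lower()
    params: Dict[str, Any] = {"model": model}
    if name.startswith("deepseek") and ("v4" in name or "reasoner" in name):
        # Per the summary: temperature is suppressed and reasoning effort maximized.
        params["reasoning_effort"] = "max"
    elif name.startswith("qwen"):
        # Per the summary: thinking mode with an 81,920-token budget.
        params["temperature"] = temperature
        params["enable_thinking"] = True
        params["thinking_budget"] = 81920
    else:
        # Any other model gets only the (assumed) default temperature.
        params["temperature"] = temperature
    return params


if __name__ == "__main__":
    for model in ("deepseek-v4-flash", "qwen-turbo", "glm-4-flash"):
        print(build_request_params(model))
```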

Evaluation protocol: Three dimensions were measured. (1) Structured output compliance: proportion of LLM responses that parse as valid JSON matching the schema, measured over all 500 papers × 2 templates × 3 providers. (2) Extraction accuracy: field-level proportion of extracted values matching expert annotation (exact or semantic equivalence), measured on the 100-paper annotated subset. No statistical significance tests are reported; no confidence intervals are provided. (3) System reliability: three stress tests on the batch engine — 500 papers with 10% random failure injection, pause/resume over 200 papers, and session recovery after browser closure. Metrics are completion rate, first-attempt success rate, after-retry success rate, pause/resume data integrity, and session recovery success rate. There are no held-out attacker tests, no distribution shift evaluation across different databases or domains, and no cross-validation.

Concrete end-to-end example: A researcher uploads a CNKI export CSV of 100 acupuncture clinical trial abstracts. The system's Algorithm 1 auto-maps 篇名 → Title, 摘要 → Abstract, 关键词 → Keywords. The researcher defines six fields via the GUI: Research Topic (text, required), Methodology (text, required), Dataset (text, optional), Main Conclusions (text, required), Innovation Points (text, optional), Research Limitations (text, optional). Algorithm 2 generates a system prompt like: 'You are an academic paper information extraction assistant... {"Research Topic": string, "Methodology": string, ...} If a field is not mentioned, fill in Not mentioned.' The user writes a template: 'Please extract from the following paper. Title: Abstract: '. The engine launches, say, 5 concurrent Fetch calls to DeepSeek's API. For a paper whose abstract does not explicitly state limitations, the model fills 'Not mentioned' for Research Limitations. The response is regex-matched for the JSON object, parsed, and written into the results table. After 100 papers, the researcher exports to Excel with original columns plus six new extracted columns.
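
The prompt-synthesis step in this example (Algorithm 2) can be approximated as below. This is a hedged Python sketch, not the paper's implementation: the role sentence and fallback instruction follow the wording quoted above, but the exact prompt layout, the Field class, and both function names are assumptions. The {Title}-style placeholders are this sketch's own convention (Python's str.format); the summary does not preserve the paper's interpolation syntax.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Field:
    name: str
    description: str
    type: str = "string"
    required: bool = False


def build_system_prompt(fields: List[Field]) -> str:
    """Turn a GUI-style field schema into a JSON-constrained system prompt."""
    schema = ", ".join(f'"{f.name}": {f.type}' for f in fields)
    notes = "\n".join(
        f"- {f.name}{' (required)' if f.required else ''}: {f.description}"
        for f in fields
    )
    return (
        "You are an academic paper information extraction assistant.\n"
        "Return only a JSON object of this shape:\n"
        "{" + schema + "}\n"
        "Field notes:\n" + notes + "\n"
        "If a field is not mentioned, fill in Not mentioned."
    )


def render_user_prompt(template: str, row: Dict[str, str]) -> str:
    """Fill column values into the user's template (placeholder syntax assumed)."""
    return template.format(**row)


if __name__ == "__main__":
    fields = [
        Field("Research Topic", "Main subject of the study", required=True),
        Field("Research Limitations", "Limitations stated by the authors"),
    ]
    print(build_system_prompt(fields))
    print(render_user_prompt("Title: {Title}\nAbstract: {Abstract}",
                             {"Title": "Example title", "Abstract": "Example abstract"}))
```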

Reproducibility: The system is described as open-source and actively deployed at Hebei University, but no GitHub URL or DOI for the code is provided in the paper text. The evaluation dataset (CNKI export) is not publicly available due to database licensing. The LLM models used are commercial API endpoints whose behavior may change over time; no frozen model versions or API snapshots are guaranteed.

Technical innovations

Schema-guided automatic system prompt generation (Algorithm 2) that transforms a GUI-defined field schema into a structured JSON-constrained system prompt without manual prompt engineering, extending prior schema-guided prompting work (Wang et al. [5]) by automating the prompt synthesis step itself.
Intelligent Chinese-English column name mapping (Algorithm 1) via rule-based substring matching over a predefined bilingual pattern table, specifically targeting the heterogeneous export formats of CNKI and Wanfang databases — a problem not addressed by existing English-centric IE tooling.
Pure browser-based LLM extraction pipeline (no backend, no installation) implemented as a single 45 KB HTML file, extending the browser-based NLP paradigm from simpler classification tasks to multi-field structured extraction via commercial LLM APIs.
Provider-specific API parameter optimization (reasoning_effort, thinking_budget) applied conditionally based on model-name matching, enabling exploitation of chain-of-thought reasoning capabilities of DeepSeek and Qwen models without user configuration.
Concurrent batch processing engine with configurable parallelism, exponential-backoff retry, pause/resume via reactive state flags, and sessionStorage-based progress persistence — combining these reliability primitives in a zero-dependency browser context is not standard in existing LLM extraction tooling.
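
The batch engine listed above combines bounded concurrency with retries. The sketch below shows one way to express that pattern in Python's asyncio; the real system uses browser Fetch calls, so this illustrates the control flow rather than the paper's code. This summary describes the retry delay both as exponential backoff and as fixed 1-second delays, so the delay is parameterized: factor=2.0 gives exponential backoff, factor=1.0 gives a fixed delay. All names are hypothetical, and pause/resume and progress persistence are omitted.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def call_with_retry(
    task: Callable[[], Awaitable[Any]],
    attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Run task(), retrying on any exception with a growing delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(base_delay * factor ** attempt)


async def run_batch(
    rows: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    concurrency: int = 5,
    **retry_kwargs: Any,
) -> List[Any]:
    """Process rows with at most `concurrency` workers in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(row: Any) -> Any:
        async with semaphore:
            try:
                return await call_with_retry(lambda: worker(row), **retry_kwargs)
            except Exception as exc:
                return exc  # keep the batch going; record the failure in place

    return await asyncio.gather(*(guarded(row) for row in rows))
```

Returning the exception object for a failed row, instead of raising, lets the rest of the batch finish and keeps results aligned with input order.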

Datasets

CNKI TCM Corpus (evaluation) — 500 papers across 5 TCM sub-disciplines (herbal pharmacology, acupuncture, formula composition, syndrome research, integrative medicine) — proprietary CNKI export, Hebei University TCM Informatics Laboratory, not publicly available

Baselines vs proposed

DeepSeek deepseek-v4-flash (Paper Info template): JSON compliance = 96.4% vs Qwen qwen-turbo: 94.8% vs GLM-4 glm-4-flash: 93.2% (Table 1)
DeepSeek deepseek-v4-flash (Lit. Review template): JSON compliance = 95.2% vs Qwen qwen-turbo: 93.6% vs GLM-4 glm-4-flash: 91.8% (Table 1)
DeepSeek deepseek-v4-flash: avg field extraction accuracy = 83.7% vs Qwen qwen-turbo: 81.6% vs GLM-4 glm-4-flash: 79.7% (Table 2, 100-paper annotated subset)
Research Topic field: DeepSeek 91.2% vs Qwen 89.5% vs GLM-4 87.3% (Table 2)
Research Limitations field (hardest): DeepSeek 75.8% vs Qwen 73.5% vs GLM-4 71.2% (Table 2)
Batch engine first-attempt success rate: 88.4% vs after-3-retry success rate: 97.1% (Table 3, 10% failure injection)
Wang et al. [5] schema-guided prompting reported 97% JSON compliance vs TCMIIES overall 94.2% (author-acknowledged gap attributed to use of flash/turbo model tiers and Chinese-language content)
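
The headline averages can be recomputed from the per-provider figures listed above as a consistency check. The sketch below uses only numbers from this section; the variable names are illustrative.

```python
from statistics import mean

# JSON compliance (%) per provider: (Paper Info template, Lit. Review template).
compliance = {
    "DeepSeek": (96.4, 95.2),
    "Qwen": (94.8, 93.6),
    "GLM-4": (93.2, 91.8),
}
# Average field-level extraction accuracy (%) per provider.
accuracy = {"DeepSeek": 83.7, "Qwen": 81.6, "GLM-4": 79.7}

per_provider = {name: round(mean(vals), 1) for name, vals in compliance.items()}
overall_compliance = round(mean(v for vals in compliance.values() for v in vals), 1)
overall_accuracy = round(mean(accuracy.values()), 1)

if __name__ == "__main__":
    print(per_provider)
    print(overall_compliance)
    print(overall_accuracy)
```

The compliance figures reproduce the reported 95.8%, 92.5%, and 94.2%. The unweighted mean of the three accuracy figures comes to 81.67%, which rounds to 81.7% rather than the reported 81.6%; the small difference may come from rounding in the underlying per-field values.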

Limitations

Evaluation dataset is a single institutional CNKI export (500 papers, one domain — TCM); no cross-database, cross-domain, or cross-language generalization is demonstrated, making accuracy claims narrow in scope.
No statistical significance testing or confidence intervals reported for any accuracy or compliance metric; with n=100 for accuracy evaluation (20 per sub-discipline), differences between providers (e.g., 83.7% vs 79.7%) may not be statistically meaningful.
The paper evaluates only abstract-level metadata (title, abstract, keywords); no full-text processing is tested, which is where most of the hard extraction challenges in systematic review workflows actually occur.
API dependency and model version instability: results are tied to specific commercial model snapshots as of May 2026; the paper provides no mechanism for reproducibility as model APIs update, and the evaluation cannot be replicated without the same CNKI export.
Base64 encoding of API keys in localStorage is explicitly acknowledged as not encryption — on a shared or compromised machine, API credentials are trivially recoverable; this is a real operational risk for institutional deployments.
No adversarial or red-team evaluation: prompt injection via malicious content in paper abstracts (e.g., an abstract containing instructions to override the system prompt and exfiltrate the API key) is not discussed or tested.
The provider-specific parameter optimizations (reasoning_effort, thinking_budget) are presented as contributing to DeepSeek's accuracy advantage, but no ablation is run to isolate their effect from model capability differences — the claim is plausible but unverified.
The regex-based JSON fallback parser ({[\s\S]*}) is greedy: it matches from the first opening brace to the last closing brace. When a verbose chain-of-thought response contains more than one brace-delimited object (for example a schema echo or intermediate reasoning before the final extraction), the match spans all of them and is unlikely to parse as valid JSON; this edge-case behavior is not characterized.
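
The concern about missing confidence intervals can be made concrete with a rough calculation. The sketch below computes normal-approximation (Wald) 95% intervals for the best and worst provider accuracies, treating each as a binomial proportion over n = 100 independent trials. That is a simplifying assumption: the reported figures are field-level averages, so the effective sample size and dependence structure in the paper may differ.

```python
from math import sqrt
from typing import Tuple


def wald_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * sqrt(p * (1 - p) / n)
    return (max(0.0, p - half_width), min(1.0, p + half_width))


if __name__ == "__main__":
    deepseek = wald_interval(0.837, 100)
    glm4 = wald_interval(0.797, 100)
    print(f"DeepSeek: {deepseek[0]:.3f} to {deepseek[1]:.3f}")
    print(f"GLM-4:    {glm4[0]:.3f} to {glm4[1]:.3f}")
    print("intervals overlap:", deepseek[0] <= glm4[1])
```

Under these assumptions the two intervals overlap substantially, which is consistent with the caution that a four-point gap at n = 100 may not be statistically meaningful.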

Open questions / follow-ons

How does extraction accuracy degrade for full-text papers versus abstracts, especially for fields like Research Limitations and Dataset that require reading beyond the abstract — and can a chunking or hierarchical extraction strategy close this gap?
Can the schema-guided prompting framework be adapted for adversarial robustness: specifically, can a malicious actor embed prompt injection instructions in a paper abstract to override the system prompt, extract the API key from localStorage context, or corrupt batch results, and what mitigations are effective in a browser-only architecture?
The paper claims provider-specific parameter tuning (reasoning_effort, thinking_budget) improves DeepSeek and Qwen accuracy, but no ablation isolates this effect — what is the actual accuracy delta attributable to these parameters versus underlying model capability differences?
TCMIIES uses zero-shot extraction; the DSPy-inspired active learning direction mentioned in future work raises the question of whether a few-shot bootstrap from user corrections (without fine-tuning) could close the ~15% accuracy gap to domain-specific fine-tuned models like BERT4TCM on the same TCM sub-disciplines.

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, TCMIIES is not directly relevant as an attack or defense tool, but it illustrates a broader trend worth tracking: the democratization of LLM-powered automation through zero-installation, browser-based wrappers around commercial APIs. The same architectural pattern (a client-side HTML file that drives LLM API calls in batch, with retry logic and structured output parsing) is exactly the kind of tooling that lowers the barrier for automated form-filling, CAPTCHA-solving pipeline construction, and large-scale web data extraction campaigns. A practitioner should note that the concurrent batch engine with configurable parallelism, exponential-backoff retry, and pause/resume is functionally a lightweight, browser-native bot harness; what mainly distinguishes it from a scraping tool is the target (exported literature records vs. web forms). The 94%+ JSON compliance rate also signals that structured output extraction from LLMs is now reliable enough to be embedded in automation workflows with only lightweight error handling, such as the regex fallback and retries used here.

More directly, the paper's reported extraction accuracy (81.6% zero-shot, measured against expert annotations whose inter-annotator agreement was κ = 0.82) provides a data point on what current commercial LLMs can extract from unstructured text with minimal prompting. For bot-defense engineers designing behavioral or content-based signals, this suggests that LLM-powered bots can now reliably extract structured semantics from pages without any custom model training, meaning that signals relying on the assumption that automated clients cannot understand page semantics are increasingly fragile. The paper has low direct threat relevance but high indirect signal value for trend monitoring.

Cite

@article{arxiv2605_07507,
  title={TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature},
  author={Hanqing Zhao},
  journal={arXiv preprint arXiv:2605.07507},
  year={2026},
  url={https://arxiv.org/abs/2605.07507}
}
