The Relic Condition: When Published Scholarship Becomes Material for Its Own Replacement

Source: arXiv:2604.16116 · Published 2026-04-17 · By Lin Deng, Chang-bo Liu

TL;DR

This paper argues that a scholar’s published corpus can be mined for a stable reasoning architecture and then reconstituted as a deployable LLM-based “scholar-bot” that performs core academic labor at expert-judged quality. The core claim is not that the model imitates style, but that publication makes a scholar’s recurrent distinctions, evaluative thresholds, refusal patterns, and theory-choice habits legible enough to be extracted as inference-time constraints. The authors frame the result as an “existence proof” for a broader structural vulnerability in academic publication systems, which they call the Relic condition.

Empirically, the paper reports two distilled scholar-bots built from closed local corpora of humanities/social-science scholars. These bots were evaluated across peer review, supervision, lecturing, and panel-style debate by three senior academics, plus a small student survey. The preserved expert record is strongly positive: all review/supervision reports judged outputs benchmark-attaining; appointment-level syntheses placed both bots at Senior Lecturer level or higher in the Australian university system; panel scores ranged from 7.9 to 9.1 out of 10; and a 10-participant student survey showed ceiling-heavy ratings on information reliability, theoretical depth, and logical rigor. The paper’s broader argument is that the technical threshold for extracting academically useful reasoning from public scholarship has already been crossed with modest engineering effort, making consent, disclosure, and deployment governance urgent.

Key findings

Scholar A was reconstructed from 68 analytical units (approximately 1,742 pages) and Scholar B from 35 fully processed local corpus items spanning papers, long-form works, and chapters.
Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining.
For Scholar B, one full 5-point rubric averaged 4.6/5 for peer review, 4.4/5 for supervision, and 4.2/5 for lecture.
Panel scores in the 10-round two-bot and 15-turn three-way debate settings ranged from 7.9 to 8.9/10 for Scholar A and from 8.5 to 8.9/10 for Scholar B; in the 15-turn three-way panel, Scholar C scored 9.1 and 8.8/10 from two evaluators.
Six appointment-level syntheses placed both bots at or above Senior Lecturer level in the Australian university system; no synthesis placed either bot below Senior Lecturer.
The student survey (n = 10) produced ceiling-heavy means of 6.800 for information reliability, 6.800 for theoretical depth, 6.800 for logical rigor, and 6.680 composite performance on a 7-point scale.
Participants were already frontier-model users: 10/10 answered “yes” to ongoing frontier-model use.
Theory comparison and discussion was the most common student use case (9/10 participants, 90%), suggesting the system was valued in judgment-heavy, synthesis-heavy tasks rather than simple drafting.

Threat model

The adversary is any actor with access to a scholar’s public publication corpus and a sufficiently capable LLM who wants to reconstruct and deploy a scholar-specific reasoning system for academic functions. The paper assumes no access to private notes, hidden archives, or direct participation by the original scholar, and it frames publication itself as the vulnerable surface. The adversary cannot rely on proprietary internal training data from the scholar and, in the main experiment, does not need it; the key claim is that the public record already encodes enough relational tacit knowledge to enable functional substitution.

Methodology — deep read

The paper’s threat model is an extraction-and-substitution scenario, not a classic cyberattack. The “adversary” is any actor who can use publicly available scholarly corpora and a capable LLM to reconstruct a scholar-specific reasoning system, then deploy it for academic functions. The source scholars are not said to be collaborating; in fact, the paper’s framing implies the opposite: the published record alone is enough to support distillation. The authors explicitly treat publication as a legible public substrate and focus on closed-corpus, local analysis rather than hidden archives, private notes, or direct access to the scholars. They also distinguish their claim from mere base-model competence by comparing distilled outputs against baseline runs with leading general chat systems (Gemini, ChatGPT, Claude) in a broader project record, though those baseline trials are not reported with the same archival completeness here.

Data provenance is narrowly defined. Scholar A was distilled from 68 analytical units totaling about 1,742 pages; Scholar B from 35 fully processed local corpus items spanning papers, long-form works, and chapters. The excerpt does not enumerate each item, but it does say the corpus was “local” and “closed,” meaning the method used only the published corpus selected for each scholar, not external web retrieval or hidden materials. The paper also reports a third scholar-bot, Scholar C, introduced only in the final panel stress test and distilled from a comparably scaled local corpus from a third critical heritage scholar. The student study used 10 research-degree students from the author’s disciplinary network; these respondents were all already frontier-model users. No evidence in the excerpt suggests a randomized sample or a large external cohort. The paper reports descriptive statistics only; there is no indication of inferential modeling, significance testing, or cross-validation for the survey.

The architecture is described as an eight-layer extraction framework feeding a nine-module skill architecture, but the excerpt does not fully enumerate all eight layers or nine modules by name. What is clear is the conceptual design: the extraction pipeline reconstructs “stable reasoning architectures” from published text and reassembles them into executable, interpretable skill constraints for a general-purpose LLM in a “very-high-reasoning configuration.” The outputs are not framed as fine-tuned weights; they are inference-time constraints grounded in local corpus analysis. In plain terms, the system seems to infer what distinctions the scholar habitually makes, what evidence they consider admissible, what objections they typically prioritize, and which conceptual moves they refuse, then prompts or constrains the base model to follow those patterns. The authors emphasize that the goal is scholar-specific reasoning, not generic academic style. They argue this is visible because Scholar A and Scholar B produce different competence profiles: Scholar A is stronger in conceptual boundary control and critical-theoretical precision, while Scholar B is stronger in operationalization, mediation analysis, and pedagogical scaffolding.
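To make the idea of inference-time constraints concrete, here is a minimal sketch of how extracted patterns might be turned into a constraint prompt. It is not the authors' pipeline: the `ScholarProfile` class, its field names, and `render_constraints` are hypothetical, loosely mirroring the four kinds of pattern described above (habitual distinctions, admissible evidence, prioritized objections, refused moves). The list entries are placeholders, not content from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ScholarProfile:
    """Hypothetical container for patterns extracted from a closed corpus."""
    name: str
    distinctions: list[str] = field(default_factory=list)
    admissible_evidence: list[str] = field(default_factory=list)
    prioritized_objections: list[str] = field(default_factory=list)
    refused_moves: list[str] = field(default_factory=list)


def render_constraints(profile: ScholarProfile) -> str:
    """Turn a profile into plain-text constraints for a system prompt."""
    sections = [
        ("Always maintain these distinctions", profile.distinctions),
        ("Treat only this evidence as admissible", profile.admissible_evidence),
        ("Raise these objections first", profile.prioritized_objections),
        ("Never make these conceptual moves", profile.refused_moves),
    ]
    lines = [f"You reason as {profile.name} would, under these constraints."]
    for heading, items in sections:
        if not items:
            continue  # skip empty categories rather than emit a bare heading
        lines.append(f"{heading}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Placeholder entries only; they are not taken from the paper.
profile = ScholarProfile(
    name="Scholar A",
    distinctions=["<distinction 1>"],
    refused_moves=["<refused move 1>"],
)
constraint_prompt = render_constraints(profile)
print(constraint_prompt)
```

The point of the sketch is the design choice the summary describes: the scholar-specific material lives in an inspectable data structure that is passed to a general-purpose model at inference time, rather than in fine-tuned weights.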

The training regime is only partially specified in the excerpt. The paper repeatedly refers to a “very-high-reasoning configuration,” but does not name the exact base model, compute budget, optimizer, number of epochs, batch size, learning rate, or seed strategy in the provided text. From the wording, the system seems more like a prompt-/constraint-engineered inference pipeline than a conventional supervised training run. The important engineering novelty is the closed-corpus distillation pipeline itself, not model pretraining or end-to-end finetuning. Because the excerpt lacks explicit hyperparameters, those details should be treated as unreported rather than inferred.

Evaluation is multi-pronged.

First, three senior academics produced 18 task-specific reports across peer review, supervision, and lecture tasks, plus six appointment-level syntheses. The expert archive is heterogeneous: some reviewers used explicit five-point rubrics, others wrote benchmark-oriented narrative reports, and the appointment-level syntheses varied in evidentiary strictness. The authors therefore report numeric scores only where explicit and otherwise summarize qualitative convergence. For Scholar B, one complete rubric gave 5/5 standard sense, 5/5 proportionality, 4/5 defensibility, 5/5 actionability, and 4/5 consistency for peer review; supervision got 5/5 diagnostic accuracy, 4/5 priority ordering, 4/5 feasibility judgment, 5/5 developmental awareness, and 4/5 independence orientation; lecture got 4/5 accuracy, 5/5 structure, 5/5 hierarchy, 4/5 learnability, and 3/5 Q&A robustness. The paper reports average within-function scores of 4.6/5, 4.4/5, and 4.2/5 respectively. For Scholar A, the preserved reports are less numerically itemized but repeatedly described as benchmark-attaining and theoretically strong.

Second, the panel tests use order reversal as a within-task stability check. The first panel was a 10-round two-bot exchange with Scholar A opening; the second reversed the speaking order; the third expanded to a 15-turn three-way panel by adding Scholar C. Evaluators scored each participant in every panel, and the authors emphasize that swapping who opened changed initiative and visible originality more than the underlying reasoning architecture.

Third, the student survey reports means, medians, standard deviations, minima, and maxima for five performance items and five confidence items on a 7-point scale. The paper explicitly cautions that the sample is small and the distribution has ceiling effects.
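The Scholar B within-function averages quoted above follow directly from the itemized rubric scores. The short check below recomputes them; the item scores are copied from the summary, while the dictionary layout and variable names are illustrative only.

```python
from statistics import mean

# Item scores for Scholar B from the one complete five-point rubric.
rubric = {
    "peer review": {
        "standard sense": 5,
        "proportionality": 5,
        "defensibility": 4,
        "actionability": 5,
        "consistency": 4,
    },
    "supervision": {
        "diagnostic accuracy": 5,
        "priority ordering": 4,
        "feasibility judgment": 4,
        "developmental awareness": 5,
        "independence orientation": 4,
    },
    "lecture": {
        "accuracy": 4,
        "structure": 5,
        "hierarchy": 5,
        "learnability": 4,
        "Q&A robustness": 3,
    },
}

# Mean of the five items within each academic function.
function_means = {
    task: round(mean(items.values()), 1) for task, items in rubric.items()
}
print(function_means)
# {'peer review': 4.6, 'supervision': 4.4, 'lecture': 4.2}

# The single lowest-scored item across all three functions.
lowest = min(
    ((task, item, score) for task, items in rubric.items()
     for item, score in items.items()),
    key=lambda entry: entry[2],
)
print(lowest)
# ('lecture', 'Q&A robustness', 3)
```

The recomputed means match the reported 4.6, 4.4, and 4.2, and the only item scored below 4 is Q&A robustness in the lecture task.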

One concrete end-to-end example is given through the student use case rather than through the extraction pipeline itself. Research-degree students, already accustomed to frontier LLMs, used a scholar-bot most often for theory comparison and discussion (9 of 10 respondents). Across the ten respondents, the survey recorded means of 6.8 for information reliability, 6.8 for theoretical depth, and 6.8 for logical rigor, with a composite performance score of 6.68 out of 7. The interpretation offered is that the bot was especially valuable in judgment-heavy tasks such as conceptual contrast, synthesis, and writing framing, which are precisely the areas where general-purpose chat systems often underperform.

Reproducibility is limited in the excerpt: the authors describe the pipeline and the evaluation archive, but the provided text does not mention public code, frozen weights, release of corpora, or a downloadable benchmark suite. The paper does note that the expert-evaluation record is “preserved,” implying that at least some archival trace exists, but the extent of public availability is not clear from the excerpt.

Technical innovations

An eight-layer distillation pipeline that reconstructs scholar-specific reasoning architectures from published corpora alone and converts them into inference-time constraints.
A nine-module skill architecture designed to express stable academic competencies as executable constraints rather than a generic persona prompt.
A closed-corpus, local-analysis method that aims to recover relational tacit knowledge from publication text while explicitly excluding hidden archives and private materials.
A within-task order-reversal panel design used to test whether reasoning signatures remain stable when the speaking order changes.
A third-scholar stress-test discussant introduced only in the final panel round to probe whether the framework generalizes beyond the two primary bots.

Datasets

Scholar A corpus — 68 analytical units (~1,742 pages) — local closed corpus from published works
Scholar B corpus — 35 fully processed local corpus items — local closed corpus from papers, long-form works, and chapters
Scholar C corpus — comparably scaled local corpus — local closed corpus from a third critical heritage scholar
Expert evaluation archive — 18 task-specific reports + 6 appointment-level syntheses + panel rounds — preserved expert record
Research-degree-student survey — n = 10 — author’s disciplinary network

Baselines vs proposed

No like-for-like baseline scores are reported in the excerpt; the figures below are for the distilled scholar-bots themselves.

Peer review (Scholar B, one complete rubric): 4.6/5 mean; judged benchmark-attaining
Supervision (Scholar B, one complete rubric): 4.4/5 mean; judged benchmark-attaining
Lecture (Scholar B, one complete rubric): 4.2/5 mean; judged benchmark-attaining
Panel Round 1 (Scholar A opens): Scholar A = 8.2/10; Scholar B = 8.5/10
Panel Round 2 (order reversed): Scholar A = 7.9/10; Scholar B = 8.8/10
Panel Round 3 (three-way, two independent evaluators): Scholar A = 8.9/10 and 8.3/10; Scholar B = 8.9/10 and 8.6/10; Scholar C = 9.1/10 and 8.8/10
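The order-reversal check can be read off the Round 1 and Round 2 panel scores listed above. The sketch below computes how much each bot's score moved when the speaking order was swapped. The scores are those reported in the summary; the variable names are illustrative. The excerpt does not say whether each figure is a single evaluator's score or an average, and with one pair of rounds nothing here supports a significance claim.

```python
# Panel scores (out of 10) reported for the two-bot rounds.
round_1 = {"Scholar A": 8.2, "Scholar B": 8.5}  # Scholar A opens
round_2 = {"Scholar A": 7.9, "Scholar B": 8.8}  # speaking order reversed

# Change in each bot's score after the order swap (rounded to one decimal
# to avoid floating-point noise such as -0.2999999999999998).
shift = {bot: round(round_2[bot] - round_1[bot], 1) for bot in round_1}

# Gap between the two bots within each round.
gap_round_1 = round(round_1["Scholar B"] - round_1["Scholar A"], 1)
gap_round_2 = round(round_2["Scholar B"] - round_2["Scholar A"], 1)

print(shift)        # {'Scholar A': -0.3, 'Scholar B': 0.3}
print(gap_round_1)  # 0.3
print(gap_round_2)  # 0.9
```

Both bots moved by 0.3 points in opposite directions, and Scholar B scored higher in both rounds. That is consistent with the authors' reading that speaking order affects the visible score somewhat without reversing the ranking.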

Limitations

The evaluation archive is heterogeneous, so not all reports are directly comparable; the authors avoid forcing a single common metric.
The student survey is very small (n = 10), non-random, and drawn from the author’s disciplinary network, so it is exploratory only.
There are pronounced ceiling effects on the 7-point survey scale, which reduces sensitivity to between-condition differences.
The excerpt does not provide the exact base model, hyperparameters, optimizer, batch size, number of epochs, or seed strategy, so methodological reproducibility is incomplete from the provided text alone.
The paper does not show that distillation works equally well across disciplines, authors with less legible corpora, or scholars whose work is intentionally less standardized.
The broader baseline comparison against Gemini/ChatGPT/Claude is referenced but not reported with the same archival fullness, so the causal contribution of the distillation pipeline versus generic model capability remains partly underdocumented in the excerpt.
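The ceiling effect noted above can be made concrete with a small bound. The sketch assumes that each response is a whole number from 1 to 7 and that the reported item mean of 6.800 is exact for n = 10; neither assumption is stated in the excerpt. The function name is illustrative.

```python
def min_respondents_at_ceiling(mean_score: float, n: int, scale_max: int = 7) -> int:
    """Lower bound on how many respondents gave the top rating.

    Assumes integer responses: every respondent below the ceiling loses at
    least one point, so the total shortfall caps how many can be below it.
    """
    total_points = round(mean_score * n)           # points actually awarded
    shortfall = scale_max * n - total_points       # points lost to the ceiling
    return max(n - shortfall, 0)


at_ceiling = min_respondents_at_ceiling(6.8, 10)
print(at_ceiling)  # 8
```

Under those assumptions, a mean of 6.8 from ten respondents leaves only two points of total shortfall, so at least eight of the ten gave the maximum rating on each such item. That leaves very little room for the scale to register differences between conditions.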

Open questions / follow-ons

How much of the effect comes from corpus-specific reasoning extraction versus careful prompt engineering around a strong base model?
Which scholarly styles are least distillable: intentionally self-revising, highly dialogic, or cross-disciplinary work with weaker textual regularities?
Can the method be generalized beyond humanities/social science corpora to fields with more formalized methods or more heterogeneous authorship patterns?
What governance mechanisms would actually be enforceable for disclosure, consent, compensation, and deployment restriction once public corpora are legible to extraction pipelines?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the paper is a reminder that high-value human expertise can sometimes be approximated by systems built from public traces rather than from privileged data. That matters because the same basic pattern—extracting stable behavioral or reasoning signatures from public or semi-public artifacts—shows up in account abuse, impersonation, and synthetic-user generation. The specific academic setting is unusual, but the deeper lesson is general: if a public artifact stream is consistent enough, an attacker may be able to reconstruct an operationally useful persona or policy engine. Defenders should therefore think not only about detecting generic LLM text, but about how much of a target’s distinctive behavior is already exposed in their public corpus, and how much downstream substitution an attacker could achieve with modest engineering effort.

Cite

bibtex

@article{arxiv2604_16116,
  title={The Relic Condition: When Published Scholarship Becomes Material for Its Own Replacement},
  author={Lin Deng and Chang-bo Liu},
  journal={arXiv preprint arXiv:2604.16116},
  year={2026},
  url={https://arxiv.org/abs/2604.16116}
}
