Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

Source: arXiv:2605.29007 · Published 2026-05-27 · By Xinming Yang, Jun Li

TL;DR

This paper addresses the challenge of generating synthetic student errors for educational applications, where privacy and ethical constraints limit access to large labelled corpora of real student mistakes. The authors propose a two-agent framework that produces targeted wrong answers according to a five-class taxonomy of cognitive failure modes adapted from the revised Bloom's taxonomy. A Generation Agent (GA) drafts candidate incorrect solutions conditioned on a target error class, and an Examination Agent (EA) judges whether each draft is both incorrect and consistent with the intended cognitive failure mode, rejecting drafts that fail either check and prompting regeneration. Applied to TheoremQA, a challenging domain-expert science reasoning benchmark, the method enables controlled, class-stratified error generation at scale without real student data.

The evaluation shows that targeted error generation is substantially harder than unconstrained wrong-answer generation: the strongest GPT-5-based configurations reach targeted-error rates of around 83–88% on 20 questions, and structural-blindness errors (E5) are the hardest to generate correctly (0.35–0.70 per configuration). Answer-grounding (providing the correct solution to the GA) and a feedback loop in which the EA rejects non-conforming drafts and prompts retries both substantially improve targeted-error generation, especially on weaker backends. Expanding few-shot exemplars or adding textbook context to GA inputs has limited effect beyond answer-grounding. A fine-tuned EA classifier, as an alternative to prompt-based EA heuristics, improves acceptance rates and reduces inference latency. The two-agent pipeline supplies a reusable methodology for building synthetic error datasets with fine-grained control over cognitive failure modes, supporting personalized tutoring, teacher training, and education research when authentic student data is unavailable.

Key findings

Targeted error generation rates on TheoremQA with GPT-5 backend range from 0.83 to 0.88 across pipelines with EA feedback and answer grounding (Table 3).
Structural blindness (E5) is the hardest error class to generate correctly, with targeted-error rates between 0.35 and 0.70 across configurations (Figure 2).
Adding answer-grounding (supplying the correct solution to the GA) improves targeted-error rates by up to +6 percentage points over no-answer variants on GPT-5 (P0 to P1, Table 3).
The feedback loop (EA judging plus retry prompting) yields its largest gains on the weaker backend (o3+GPT-4o), increasing the targeted-error rate by +21 points (P1 to P3, Table 3).
The fine-tuned BERT EA classifier (P8) improves acceptance substantially on the weak backend (+25 points on o3+GPT-4o) but has mixed effects on stronger ones (Table 3).
Expanded few-shot exemplars and external textbook excerpts supplied to the GA bring minor or no consistent improvements beyond answer-grounding and the feedback loop (Table 3).
Retry counts and token consumption are heavily skewed by hard classes like E5, with prompted EA pipelines requiring up to 3 retries on average per instance for E5 error generation (Table 4).
Prompted-EA loops (P1, P3) can cost up to 3–5x more tokens and latency than fine-tuned EA (P8) on the GPT-5 backend (Table 6).

Threat model

The adversary is a system or model attempting to generate synthetic incorrect answers conditioned on a specific cognitive failure mode for educational use. It can access the input question, class definitions, and optionally the correct answer and external textual context. The adversary cannot directly inspect or manipulate internal model weights and cannot produce guaranteed error-class consistent outputs without supervision. It is also constrained by the underlying LLM capability to suppress correct answers and diversify generation under rejection feedback.

Methodology — deep read

The paper targets generating synthetic student errors aligned with five cognitive failure classes (E1 mental typos, E2 knowledge gaps, E3 misconceptions, E4 wrong choice, E5 structural blindness) adapted from revised Bloom's taxonomy.

Threat Model & Assumptions: The "adversary" here is the LLM-based generator itself, which tries to produce wrong answers matching a specific cognitive failure mode, given domain questions but no access to real student error corpora. The framework assumes the model is truthful about failure classification but needs supervision to suppress correct answers.

Data: The main evaluation uses TheoremQA, a domain-expert annotated dataset of 200 science and math questions covering multiple topics. A 20-question Tier-1 subset is manually author-annotated post-generation for error correctness and class alignment, yielding 2,530 verified (question, class, response) triples across pipelines and backends. Tier-2 evaluation scales to 200 questions with automated metrics.

Architecture / Algorithm: The system decomposes error generation into two pre-trained agents:

Generation Agent (GA): prompts a pretrained LLM (multi-class) with question, error class definitions, optionally correct answers, and few-shot exemplars to draft an erroneous solution matching the target class.
Examination Agent (EA): a second LLM or fine-tuned BERT classifier judges if the GA draft is incorrect and matches the intended error class, rejecting mismatches.

This decoupling uses a retry loop with a cap of 5 attempts, where EA guides GA to iteratively re-sample until acceptance.
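The retry loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Verdict` fields, the `generate` and `examine` callables, and the way feedback is accumulated are all assumptions introduced here. Only the two acceptance conditions (incorrect, and matching the target class) and the cap of 5 attempts come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 5  # retry cap stated in the text


@dataclass
class Verdict:
    """Hypothetical EA output; field names are illustrative."""
    is_incorrect: bool    # EA judges the draft's answer to be wrong
    matches_class: bool   # EA judges the failure mode to match the target
    feedback: str = ""    # reason handed back to the GA on rejection

    @property
    def accepted(self) -> bool:
        # A draft passes only if both conditions hold.
        return self.is_incorrect and self.matches_class


# Stand-ins for the two agents (assumed signatures).
Generate = Callable[[str, str, List[str]], str]  # (question, class, feedback) -> draft
Examine = Callable[[str, str, str], Verdict]     # (question, class, draft) -> verdict


def generate_targeted_error(
    question: str,
    target_class: str,
    generate: Generate,
    examine: Examine,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[Optional[str], int]:
    """Return (accepted draft or None, number of attempts used)."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        draft = generate(question, target_class, feedback)
        verdict = examine(question, target_class, draft)
        if verdict.accepted:
            return draft, attempt
        # Rejected: keep the EA's reason so the next draft can react to it.
        feedback.append(verdict.feedback)
    return None, max_attempts
```

Returning the attempt count alongside the draft makes it straightforward to log per-class retry statistics of the kind the paper reports.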

Nine pipeline variants (P0–P8) explore GA input information (presence of correct answer, few-shot examples, textbook excerpts), EA mechanism (prompted vs fine-tuned classifier), and presence of feedback loops.
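One way to picture the design space these variants cover is as a small configuration record that decides which optional inputs reach the GA. The sketch below is an assumption-laden illustration: `PipelineConfig`, its field names, and `ga_prompt_sections` are hypothetical, and no mapping from field values to the paper's P0 through P8 labels is implied, since this summary does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class PipelineConfig:
    """One point in the design space; field names are illustrative."""
    answer_grounded: bool    # GA sees the correct answer
    few_shot_count: int      # number of exemplars shown to the GA
    textbook_context: bool   # GA sees a textbook excerpt
    ea_kind: str             # "prompted" or "finetuned"
    feedback_loop: bool      # EA rejections trigger GA retries

    def __post_init__(self) -> None:
        if self.ea_kind not in ("prompted", "finetuned"):
            raise ValueError(f"unknown ea_kind: {self.ea_kind!r}")
        if self.few_shot_count < 0:
            raise ValueError("few_shot_count must be non-negative")


def ga_prompt_sections(
    cfg: PipelineConfig,
    question: str,
    class_definition: str,
    correct_answer: Optional[str] = None,
    exemplars: Sequence[str] = (),
    textbook_excerpt: Optional[str] = None,
) -> List[str]:
    """Assemble the GA prompt parts that this configuration enables."""
    sections = [
        f"Question: {question}",
        f"Target error class: {class_definition}",
    ]
    if cfg.answer_grounded and correct_answer is not None:
        sections.append(f"Correct answer (do not reproduce it): {correct_answer}")
    for example in exemplars[: cfg.few_shot_count]:
        sections.append(f"Example error: {example}")
    if cfg.textbook_context and textbook_excerpt is not None:
        sections.append(f"Textbook excerpt: {textbook_excerpt}")
    return sections
```

Keeping the switches in one immutable record makes each ablation a single-field change, which is the kind of comparison the variants are set up to support.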

Training Regime: GA remains zero-shot or few-shot prompted with no fine-tuning; EA is either a prompted LLM or a BERT-base-uncased classifier fine-tuned on 1,600 manually labelled triples. Fine-tuning uses supervised classification; details on epochs or hyperparameters are not fully specified.

Evaluation Protocol: Main metric is targeted-error rate = fraction of generated responses that are both incorrect and match requested cognitive class, verified by authors on the Tier-1 dataset. Three-way labels (correct, incorrect-and-right-class, incorrect-and-wrong-class) enable detailed failure analysis. Results are reported per backend (GPT-5, GPT-5-mini, OpenAI o3+GPT-4o), pipeline, and error class. Retry counts, token consumption, latency, and EA acceptance thresholds are analyzed. Human expert validation is the gold standard; automatic EA labeling is used but not deemed fully reliable.
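The targeted-error rate follows directly from the three-way labels: it is the share of responses labelled incorrect-and-right-class. A minimal sketch, assuming each verified response carries exactly one of three string labels (the label spellings and function names below are illustrative, not taken from the paper):

```python
from collections import Counter
from typing import Dict, Iterable

CORRECT = "correct"
RIGHT_CLASS = "incorrect_right_class"
WRONG_CLASS = "incorrect_wrong_class"
VALID_LABELS = {CORRECT, RIGHT_CLASS, WRONG_CLASS}


def label_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of responses falling under each of the three labels."""
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    return {label: counts[label] / total for label in sorted(VALID_LABELS)}


def targeted_error_rate(labels: Iterable[str]) -> float:
    """Share of responses that are both incorrect and in the requested class."""
    return label_breakdown(labels)[RIGHT_CLASS]
```

For example, 17 right-class errors, 2 wrong-class errors, and 1 correct answer out of 20 responses give a targeted-error rate of 0.85; the other two fractions show whether misses come from the model answering correctly or from drifting into another error class.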

Reproducibility: The authors release the framework, prompting recipes, a 1,800-record replication artefact (Tier-1 annotated triples), and the fine-tuned EA model weights. However, underlying models (GPT-5, GPT-4o) are closed source. Full raw data and code availability is implied but not detailed.

Example Workflow: For a TheoremQA question and target error class E3 (misconception), the GA produces a candidate incorrect reasoning trace and answer by conditioning on class definitions and optionally the correct answer. The EA inspects this triple and rejects if it finds the answer correct or the failure pattern inconsistent with E3. If rejected, GA retries up to 5 times. Accepted outputs are collected to build synthetic student error datasets. Human annotators verify the error class applicability on Tier-1.

Overall, the method systematically tests how well large LMs can be controlled to generate cognitively-grounded errors rather than arbitrary wrong answers, enabling synthetic data generation for education research in domains lacking real error corpora.

Technical innovations

Introduction of a two-agent framework that decouples error generation (GA) from error type verification (EA) to produce class-consistent synthetic student errors.
Use of a retry loop controlled by an Examination Agent to iteratively guide the Generation Agent until generating error responses matching target cognitive failure modes.
Adoption and adaptation of a five-class taxonomy based on revised Bloom’s taxonomy to enable fine-grained, cognitively meaningful control of synthetic error generation.
Comparison of prompted vs fine-tuned EA models, demonstrating a lightweight path to scalable, human-supervised class-consistency verification with reduced inference costs.

Datasets

TheoremQA — 200 questions — domain-expert curated science and math problems; publicly documented, but student error labels are not fully open.
Tier-1 verified pool — 2,530 author-verified (question, target class, response) triples drawn from 20 questions × 5 classes × 9 pipelines × 3 backends (the full grid would contain 2,700 cells) — released as replication artefact.

Baselines vs proposed

P0 (baseline GA without correct answer, no EA feedback): targeted-error rate = 0.69 (o3+GPT-4o), 0.77 (GPT-5), 0.61 (GPT-5-mini) vs P1 (answer + EA feedback): 0.54, 0.87, 0.71 respectively (Table 3).
P1 to P3 (adding EA feedback loop) improves targeting by +0.21 on o3+GPT-4o, +0.01 on GPT-5, and no change on GPT-5-mini.
Fine-tuned EA (P8) vs prompted EA (P1): +0.25 increase on o3+GPT-4o, -0.04 on GPT-5, +0.03 on GPT-5-mini.
Expanded few-shot examples and textbook context (P4 to P7) do not outperform P1/P3 by more than a few points; sometimes slightly lower (Table 3).

Limitations

Small scale expert validation limited to 20 questions per pipeline/backend in Tier-1 due to manual labeling effort; may not generalize fully.
Relative scarcity of public student misconception datasets matching their taxonomy prevents external validation against real student data.
The system relies on closed-source LLMs (GPT-5, GPT-4o, OpenAI o3) limiting reproducibility and understanding of underlying model traits.
No attempt is made to test the ecological validity—whether generated errors resemble authentic student mistakes in real educational settings.
The taxonomy is fixed to five classes adapted from Bloom's revision; other error taxonomies may yield different results or error difficulty.
Automated EA judgments are not yet reliable enough to replace human validation fully, limiting deployment scalability.

Open questions / follow-ons

How well do synthetic errors generated by this method align with authentic student misconceptions in realistic educational settings across subjects?
Can more fine-grained or alternative error taxonomies improve targeted error generation and downstream educational utility?
To what extent can model fine-tuning or trajectory conditioning, rather than prompt engineering alone, improve targeted error controllability?
What are the best strategies to scale human or automated evaluation to larger and more diverse datasets while preserving label quality?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners interested in robustness and human-mimicking synthetic data generation, this work offers a novel methodology to generate controlled synthetic errors that adhere to defined cognitive error classes. Analogously, generating adversarial human-like mistakes in response content or interaction patterns could serve to test bot detection systems or train more nuanced classifiers. The two-agent pipeline combining generative attempts with iterative judgment and rejection loops may inspire designs of multi-agent robustness testing frameworks in bot defense. However, the paper focuses primarily on educational reasoning domains and cognitive taxonomy rather than security, so direct application requires careful adaptation to attacker behavior modeling and CAPTCHA challenge-response contexts.

Cite

@article{arxiv2605_29007,
  title={Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation},
  author={Xinming Yang and Jun Li},
  journal={arXiv preprint arXiv:2605.29007},
  year={2026},
  url={https://arxiv.org/abs/2605.29007}
}
