Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development

Source: arXiv:2605.05598 · Published 2026-05-07 · By Ran Bi, Shiyao Wei, Yuanyiyi Zhou

TL;DR

Prober.ai addresses a well-documented problem in AI-assisted education: general-purpose LLMs, when used for writing support, act as cognitive prosthetics that generate polished text on demand, measurably suppressing the active reasoning processes students need to develop argumentative writing skills. The paper's motivating evidence is Kosmyna et al.'s (2025) EEG study showing reduced alpha-band directed transfer function connectivity during ChatGPT-assisted writing — a neural signal of reduced cognitive engagement — alongside behavioral observations that AI-assisted essays tend to be structurally shallow despite surface polish. The proposed solution inverts the standard AI-tutor paradigm: instead of helping students write better text, the system forces students to think better by acting only as a structured questioner.

Prober.ai is a web-based writing tool that constrains Gemini 3 Flash Preview through persona-specific system prompts and rigid JSON output schemas to produce exclusively inquiry-based questions about argumentative weaknesses, never evaluative statements or rewrites. Two complementary LLM personas — 'Reviewer #2' (expert logical scrutiny, four Toulmin-mapped questions) and 'Confused Reader' (novice-clarity probing, two questions) — are implemented via separate prompt configurations. A two-phase API architecture (/challenge → /unlock) enforces a 'pedagogical friction' mechanism: the student must write a substantive reflective defense of their argument before the system unlocks any concrete revision suggestion. This gates directive feedback behind metacognitive effort, operationalizing facilitative rather than directive feedback as defined in the AI-education literature.

The prototype was reportedly built in a 36-hour hackathon sprint, although the paper elsewhere describes a one-month competition period (an inconsistency in the source text), and was awarded second place at the NY EdTech Hackathon (March 2026). Evaluation is limited to approximately 50 development-phase test invocations and qualitative observations; there is no controlled user study, no learning outcome measurements, and no comparison against a human-feedback baseline. The paper is therefore best characterized as a system description and design rationale with proof-of-concept validation, not an empirical efficacy study.

Key findings

Across approximately 50 test invocations during development, the Gemini 3 Flash Preview model produced valid JSON conforming to specified output schemas with a parsing failure rate below 5%; all failures were recoverable via a regex fallback mechanism.
The /challenge endpoint (longer composite prompt) returned responses in 3–5 seconds and the /unlock endpoint (shorter prompt, minimal output schema) in 1–3 seconds, both on the Gemini 3 Flash Preview model.
The dual-persona constraint mechanism produced qualitatively differentiated outputs on identical essays: Reviewer #2 targeted logical structure (warrant gaps, overgeneralization, scope), while Confused Reader targeted jargon, unexplained concepts, and explanatory leaps. This is consistent with the intended persona separation, although it rests on developer observation only.
The structured JSON output schema enforces exactly four question fields for Reviewer #2 (claim_question, reasoning_question, counterargument_question, scope_or_implication_question) and exactly two for Confused Reader (clarification_question, co_construction_question), with optional excerpt anchor fields for each.
The gating mechanism is enforced at both the API level (non-empty userDefense string required by /unlock endpoint) and the frontend level (input validation disables the unlock button until a defense is entered), creating a two-layer enforcement of the reflection requirement.
The pedagogy guide (pedagogy_guide.md, 129 lines) is injected as internal LLM context at every /challenge invocation, and the model is explicitly instructed not to surface it to the student, separating the pedagogical scaffolding layer from the student-facing interface.
The paper cites Latifi et al. (2021, BJET) finding that question-based peer feedforward significantly enhanced argumentation quality and learning depth compared to traditional evaluative peer feedback — used as theoretical justification for the question-only output constraint.
The abstract states the prototype was developed in '36 hours' but Section 6.1 states 'a one-month competition period' — this is an unresolved factual inconsistency in the source document that undermines the hackathon timeline claim.

Threat model

The system's implicit threat model is pedagogical rather than adversarial-security in nature. The 'adversary' is the student's own cognitive shortcut behavior: given access to AI-generated revision suggestions without a prerequisite effort cost, students will bypass active reasoning and outsource thinking to the AI, producing measurable cognitive debt (as evidenced by Kosmyna et al.'s EEG findings). The system assumes the student has access to the /unlock endpoint and will attempt to reach it with minimal effort. What the student cannot do in the intended design is receive revision suggestions without submitting at least a non-empty defense string. This constraint is trivially circumventable, however, because no quality check is applied to the defense. No external adversary (prompt injection via student essay, API key theft, rate limit abuse, multi-account evasion) is formally modeled or defended against, though the client-side API key management via localStorage is noted as a deployment consideration. The paper does not address prompt injection risk despite the student essay being directly concatenated into the LLM prompt.
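A minimal sketch of the non-emptiness gate described above, written in Python for illustration only. The prototype's backend is Node.js/Express and its validation code is not reproduced in this summary, so the function name `can_unlock` and the decision to treat whitespace-only input as empty are assumptions.

```python
def can_unlock(user_defense: str) -> bool:
    """Hypothetical gate: return True when the defense is non-empty.

    Assumption: whitespace-only input counts as empty. The paper only
    states that a non-empty userDefense string is required.
    """
    return bool(user_defense.strip())


# The gate rejects an empty defense but accepts any other text,
# regardless of whether it reflects on the argument at all.
print(can_unlock(""))    # False
print(can_unlock("x"))   # True
print(can_unlock("My claim holds because the evidence covers city driving."))  # True
```

The second call shows why the gate has no quality threshold: a single character is indistinguishable from a genuine reflection as far as this check is concerned.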

Methodology — deep read

The paper describes a system architecture and prompt engineering methodology, not an experimental study, so 'methodology' here refers to the technical and design methodology rather than an empirical research protocol.

Threat model and design assumptions: The adversary in the pedagogical sense is the student's own cognitive shortcut behavior, specifically the tendency to use available AI-generated text as a substitute for active reasoning. The system assumes that if revision suggestions are available before reflection, students will skip the reasoning step. It also assumes that the LLM's default behavior (agreeableness, generation, evaluation) is pedagogically harmful and must be actively overridden. A student who deliberately games the defense requirement with low-effort text is not formally modeled: the defense gating checks only that the string is non-empty, not the quality of the reflection.

Data and content: No training data is used; the system relies entirely on Gemini 3 Flash Preview's pre-trained capabilities. The pedagogy_guide.md (129 lines) serves as an injected knowledge base codifying Toulmin-based question module taxonomy, diagnostic trigger categories, and example question templates. A pre-loaded demo essay (a K–12 level argumentative essay about driverless cars, exhibiting overgeneralization, causal leaps, weak counterarguments, and scope issues) serves as the primary test artifact. No dataset splits, labels, or preprocessing are reported because this is not a trained ML system.

Architecture and novel components: The system follows a client-server architecture. The backend is Node.js/Express.js (server.js, 282 lines) acting as an API gateway to the Gemini API via the @google/generative-ai SDK (v0.21.0). The frontend is vanilla HTML/CSS/JavaScript with Quill.js for rich text editing. The novel architectural component is the two-endpoint API design: POST /challenge accepts the essay and persona selection and returns structured JSON questions; POST /unlock accepts the original question, relevant excerpt, and student defense and returns a {suggestion, tip} JSON object. Critically, these endpoints use entirely separate LLM prompt configurations — /challenge uses the adversarial questioning persona while /unlock uses a 'helpful writing tutor' persona — preventing what the authors call 'persona contamination.' The prompt for /challenge is a composite of six concatenated components: (1) persona system prompt, (2) global negative constraints, (3) full pedagogy guide as internal context, (4) internal reasoning protocol specifying four hidden chain-of-thought steps, (5) JSON output schema, (6) student essay.
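The six-part composite prompt for /challenge can be sketched as an ordered concatenation. This is an illustrative Python sketch, not the authors' server.js code: the class name `ChallengePromptParts`, the function name `build_challenge_prompt`, and the blank-line separator are all assumptions. Only the component order comes from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengePromptParts:
    """The six components, in the order the paper lists them."""
    persona_prompt: str        # (1) persona system prompt
    negative_constraints: str  # (2) global negative constraints
    pedagogy_guide: str        # (3) full pedagogy guide as internal context
    reasoning_protocol: str    # (4) hidden four-step reasoning protocol
    output_schema: str         # (5) JSON output schema
    student_essay: str         # (6) student essay


def build_challenge_prompt(parts: ChallengePromptParts) -> str:
    """Join the six components in order.

    Assumption: sections are separated by one blank line; the paper
    does not describe the separator or any section labels.
    """
    ordered = [
        parts.persona_prompt,
        parts.negative_constraints,
        parts.pedagogy_guide,
        parts.reasoning_protocol,
        parts.output_schema,
        parts.student_essay,
    ]
    return "\n\n".join(section.strip() for section in ordered)


# Placeholder strings stand in for the real prompt text.
prompt = build_challenge_prompt(ChallengePromptParts(
    persona_prompt="You are Reviewer #2.",
    negative_constraints="Never rewrite the student's text.",
    pedagogy_guide="(contents of pedagogy_guide.md)",
    reasoning_protocol="(four internal steps, not shown in output)",
    output_schema="(JSON schema with four question fields)",
    student_essay="Driverless cars will end all accidents.",
))
print(prompt.splitlines()[0])
```

Because the student essay is appended as plain text after the instructions, any instruction-like content inside the essay reaches the model in the same channel as the system's own constraints.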

LLM constraint methodology: Three mechanisms are layered to override default LLM behavior. First, explicit negative constraint instructions in the system prompt prohibit text rewriting, evaluative language ('unclear,' 'weak,' 'insufficient'), yes/no questions, leading questions, and paraphrasing of student text. Second, a four-step internal reasoning protocol instructs the model to perform argument segmentation, issue detection from a diagnostic trigger list, epistemic state classification, and trigger prioritization entirely internally without outputting these steps. Third, a rigid JSON output schema constrains the output format to specific named fields only, with each question field limited to 2–3 sentences. LLM response parsing uses a two-stage regex extraction (fenced JSON blocks first, raw JSON object fallback second) with schema validation. The Confused Reader persona's specialized fields (clarification_question, co_construction_question) are backwards-mapped to generic field names for frontend rendering compatibility.
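The two-stage extraction with schema validation might look like the following Python sketch. The regular expressions, the function name `extract_questions`, and the validation rule (each required field must be a non-empty string) are assumptions; the paper specifies only the order of the two stages and the field names. The Confused Reader field mapping is omitted because the generic field names are not given.

```python
import json
import re

# Reviewer #2 field names as listed in the paper's output schema.
REVIEWER_FIELDS = (
    "claim_question",
    "reasoning_question",
    "counterargument_question",
    "scope_or_implication_question",
)

# Build the fence marker programmatically so this listing holds no literal fence.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
RAW_JSON = re.compile(r"\{.*\}", re.DOTALL)


def extract_questions(raw: str, required=REVIEWER_FIELDS) -> dict:
    """Stage 1: fenced JSON block. Stage 2: first-to-last brace fallback.

    Raises ValueError when no object is found, when the JSON is invalid
    (json.JSONDecodeError is a ValueError subclass), or when a required
    field is missing or empty.
    """
    fenced = FENCED_JSON.search(raw)
    if fenced:
        candidate = fenced.group(1)
    else:
        bare = RAW_JSON.search(raw)
        if bare is None:
            raise ValueError("no JSON object found in model response")
        candidate = bare.group(0)

    data = json.loads(candidate)
    missing = [
        name for name in required
        if not isinstance(data.get(name), str) or not data[name].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return data


# A response with chatter around an unfenced object exercises the fallback stage.
reply = (
    'Here are the questions: {"claim_question": "Q1?", "reasoning_question": "Q2?", '
    '"counterargument_question": "Q3?", "scope_or_implication_question": "Q4?"}'
)
print(sorted(extract_questions(reply)))
```

Validating after extraction matters because a regex match only shows that something brace-delimited was found, not that it carries the fields the frontend expects.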

Frontend interaction design: The Write–Challenge–Defend–Improve loop is the primary interaction paradigm. Excerpt highlighting is implemented via substring search of LLM-generated excerpt text within the Quill Delta API, applying rgba(250, 204, 21, 0.4) highlighting on hover — this is noted as fragile under whitespace or punctuation mismatches. A session export feature generates a print-ready HTML interaction log. A fully offline demo mode pre-bakes all feedback and unlock responses for the driverless cars essay, enabling the complete loop without API connectivity.
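The fragility of exact substring matching can be illustrated with a short Python sketch. The whitespace-tolerant variant is one possible mitigation and is not something the paper proposes or implements; both function names are hypothetical, and the sketch operates on plain strings rather than on the Quill editor. Punctuation mismatches are not handled here.

```python
import re


def find_exact(editor_text: str, excerpt: str) -> int:
    """Exact substring search; returns -1 when the excerpt is not found."""
    return editor_text.find(excerpt)


def find_whitespace_tolerant(editor_text: str, excerpt: str):
    """Match the excerpt while letting any whitespace run match any other.

    Returns a (start, end) span into editor_text, or None if nothing matches.
    """
    words = excerpt.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, editor_text)
    return match.span() if match else None


# The editor holds a line break where the model's excerpt has a space.
editor_text = "Driverless cars never get tired,\nso they will end all accidents."
excerpt = "never get tired, so they will end all accidents"

print(find_exact(editor_text, excerpt))  # -1
span = find_whitespace_tolerant(editor_text, excerpt)
print(repr(editor_text[span[0]:span[1]]))
```

The exact search fails on a single newline difference, which is the kind of mismatch the authors report, while the tolerant search still locates the passage to highlight.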

Evaluation protocol: Evaluation is informal and developer-conducted. The JSON parsing failure rate over roughly 50 invocations and the endpoint latency figures (3–5 seconds for /challenge, 1–3 seconds for /unlock) are the only quantitative metrics reported. Qualitative assessments of question quality (targeting genuine weaknesses vs. surface issues), persona differentiation, and gating effectiveness are based on developer observation during testing. There is no inter-rater reliability measurement for question quality, no student user study, no learning outcome assessment, no comparison against a human-feedback baseline, no ablation of individual constraint mechanisms, and no distribution shift testing across essay genres. The paper explicitly acknowledges that claims about cognitive engagement preservation remain theoretical.

Technical innovations

Persona-constrained LLM output via layered negative-constraint system prompts and explicit JSON output schemas that override a general-purpose model's default evaluative and generative behaviors. By contrast, prior writing tools (Grammarly, QuillBot) operate at the syntactic level, and general AI agents (ChatGPT) produce unconstrained evaluative output.
Two-phase gated API architecture (/challenge → /unlock) that implements pedagogical friction as a first-class architectural primitive, enforcing mandatory student reflection as a prerequisite to accessing directive revision suggestions rather than treating reflection as optional.
Dual-persona framework operationalizing two independent critical dimensions (logical rigor via 'Reviewer #2' with four Toulmin-mapped question categories; communicative clarity via 'Confused Reader' with two question categories) through separate prompt configurations that prevent persona contamination across endpoints.
Composite prompt construction methodology combining persona identity, negative constraints, injected pedagogical knowledge base, hidden chain-of-thought reasoning protocol, JSON schema specification, and student essay into a single ordered prompt — the internal reasoning chain is explicitly instructed to execute without appearing in output.
Session export as a structured learning artifact (print-ready HTML interaction log capturing essay excerpt, each challenge question with text anchor, student defense, AI suggestion, and writing tip) designed to serve dual purposes as student review material and potential research data for educators.

Datasets

Demo essay (driverless cars, K-12 argumentative writing) — 1 sample essay — internally authored by the development team, not publicly released as a dataset
pedagogy_guide.md — 129 lines — internally authored knowledge base, not a public dataset

Baselines vs proposed

Grammarly/QuillBot (grammar-focused tools): targets argument structure = No vs. Prober.ai: Yes (qualitative comparison, Table 1)
ChatGPT/Gemini (general AI agents): provides challenging feedback = No vs. Prober.ai: Yes (qualitative comparison, Table 1)
ChatGPT/Gemini (general AI agents): gates suggestions behind reflection = No vs. Prober.ai: Yes (qualitative comparison, Table 1)
JSON schema compliance: parsing failure rate across ~50 test invocations = <5% with 100% recovery via regex fallback (no baseline comparison reported)
/challenge endpoint latency: 3–5 seconds; /unlock endpoint latency: 1–3 seconds (no baseline latency comparison reported)

Limitations

No controlled user study: all claims about cognitive engagement preservation, argumentative writing improvement, and learning outcomes are theoretical, grounded in cited prior literature not in any empirical evaluation of Prober.ai itself.
Trivial gating enforcement: the /unlock endpoint requires only a non-empty defense string; a student can submit a single character or nonsense text and unlock the suggestion, meaning the pedagogical friction mechanism has no quality threshold — this is a significant design gap unacknowledged by the authors.
No adversarial evaluation of LLM output quality: the ~50 invocation test is developer-conducted with no inter-rater reliability, no rubric-based scoring, no measurement of question pedagogical alignment, and no testing on edge-case essays (off-topic, very short, non-English, deliberately adversarial input).
Factual inconsistency in the source document: the abstract claims the prototype was developed in '36 hours' while Section 6.1 states 'a one-month competition period'; this undermines the paper's own timeline claims and suggests insufficient proofreading for a research publication.
Excerpt matching fragility: the contextual highlighting mechanism relies on exact substring matching and fails on whitespace or punctuation differences between LLM-generated excerpts and editor content — acknowledged by authors but unfixed in the prototype.
No persistent student model: each session is independent with no tracking of recurring weaknesses, improvement trajectories, or adaptive difficulty, limiting the system to one-shot interactions without longitudinal pedagogical value.
Single LLM dependency: the system is entirely dependent on Gemini 3 Flash Preview, a commercial API with no frozen weights, no reproducibility guarantees, and potential output drift over time — there is no model versioning or output consistency safeguard described.
Genre specificity: question modules and diagnostic triggers are optimized for argumentative/persuasive essays only; the system's applicability to narrative, expository, or analytical writing is untested and architecturally unsupported.

Open questions / follow-ons

Does the gated feedback architecture actually produce measurable differences in argumentative essay quality, revision depth, or cognitive engagement compared to immediate-access feedback — and does any benefit survive when the defense requirement has no quality threshold?
How robust are the persona-constraint mechanisms against prompt injection via adversarially crafted student essays that attempt to override the system prompt's negative constraints or extract the pedagogy guide content?
Can the internal reasoning chain (argument segmentation, issue detection, epistemic state classification, trigger prioritization) be made auditable or validated without surfacing it to the student — i.e., is there a way to measure whether the hidden chain-of-thought is actually executing as specified or whether the model is bypassing it?
How does the system's pedagogical model degrade across the full spectrum of student writing quality — from very weak essays (where question overload may overwhelm rather than scaffold) to very strong essays (where the LLM may struggle to identify genuine weaknesses and produce artificially manufactured critique)?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Prober.ai is most relevant as a case study in LLM behavioral constraint methodology rather than as a direct security tool. The core technical challenge the paper addresses is forcing a general-purpose generative model to produce only a specific, structured output type while suppressing its default behaviors. That challenge is directly analogous to problems in bot-detection pipelines that use LLMs for behavioral classification, explanation generation, or challenge construction. The three-layer constraint approach (negative-constraint system prompts, hidden chain-of-thought reasoning protocols, rigid JSON output schemas) provides a concrete implementation pattern for constraining LLM outputs to specific taxonomies, which is applicable to any system where an LLM must produce machine-parseable, behavior-typed outputs rather than free text. The <5% JSON parsing failure rate with regex fallback recovery is a rough reference point for similar schema-constrained deployments, bearing in mind that it comes from only about 50 developer-run invocations.

The gated interaction architecture also has structural parallels to challenge-response CAPTCHA design: the system enforces a mandatory cognitive effort step before delivering a reward (revision suggestion), analogous to a CAPTCHA enforcing proof-of-work before granting access. The key security-relevant weakness the paper inadvertently highlights is that effort gating without quality assessment is trivially bypassable — a lesson directly applicable to CAPTCHA design, where non-empty response validation (analogous to this system's non-empty defense check) is insufficient without behavioral or semantic quality signals. Bot-defense engineers evaluating LLM-based challenge generation systems should note the excerpt matching fragility and the absence of adversarial input testing as critical gaps to address before production deployment.

Cite

bibtex

@article{arxiv2605_05598,
  title={Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development},
  author={Ran Bi and Shiyao Wei and Yuanyiyi Zhou},
  journal={arXiv preprint arXiv:2605.05598},
  year={2026},
  url={https://arxiv.org/abs/2605.05598}
}
