Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

Source: arXiv:2605.25183 · Published 2026-05-24 · By Jake Stephen, Niraj K. Jha

TL;DR

This paper addresses the challenge of inducing expert-level, mechanistic reasoning in language models (LMs) within the specialized scientific domain of neuroscience. Rather than relying on large, heterogeneous web-scale corpora, the authors hypothesize that a high-fidelity knowledge graph (KG) distilled from a single authoritative neuroscience textbook, used as structured supervision, is sufficient for expert reasoning to emerge. They build the KG with a dual-LLM filtering pipeline and expand it with a masked language model trained on the KG topology. From the KG they generate multi-hop question-answering (QA) datasets with chain-of-thought reasoning traces, use these to fine-tune the LM, and then apply reinforcement learning with KG-path-derived rewards that act as implicit process supervision. The result is a 14B-parameter model that surpasses a frontier proprietary generalist model (Gemini 3.1 Pro) in neuroscience-domain accuracy and shows improved compositional generalization on reasoning chains of 3 to 5 hops. Notably, the model's reasoning traces explicitly align with KG paths that capture mechanistic neuroscience concepts, indicating structured internalization of domain knowledge beyond surface-level pattern matching.

Key findings

The seed knowledge graph contained 6,157 high-confidence triples after dual-LLM validation and grew to 19,755 triples and 9,187 nodes after GraphMERT-based topology learning and expansion.
Supervised fine-tuning (SFT) on 1-hop and 2-hop KG-derived QA pairs raised average accuracy on the 3-5 hop evaluation set from 78.1% to 87.3%, primarily by closing factual gaps.
Reinforcement learning (RL) with KG-path-derived rewards on 2-hop QA items raised average accuracy further to 89.5%, with the largest gains (up to +3.2 pp) on 5-hop questions, indicating deeper compositional reasoning.
The normalized accuracy degradation per added hop (δ) fell by 44% relative to the base model (from 2.40 pp/hop to 1.35 pp/hop), demonstrating compositional generalization beyond the hop lengths seen in training.
Reasoning traces in model outputs faithfully traverse multiple entity-relation hops in the KG, as shown in qualitative examples (Fig. 3), which points to mechanistic neuroscience understanding.
The 14B-parameter specialized model outperformed Gemini 3.1 Pro, a multi-trillion parameter generalist model, with average neuroscience-domain accuracy of 89.5% vs 84.8% on the 3-5 hop QA evaluation.
The dual-LLM consensus filter removed 1,843 hallucinated or faulty triples, supporting high ontological fidelity in the KG.
The RL algorithm (GRPO) combined final-answer correctness with a path-alignment reward, gated so that the path reward is assigned only when the answer is correct, to avoid reward hacking.
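
The 44% reduction in δ quoted above can be checked directly from the two reported degradation rates. The snippet below is only an arithmetic check; the variable names are illustrative and do not come from the paper.

```python
# Degradation rates quoted in the key findings, in percentage points per added hop.
base_delta = 2.40    # base Qwen3-14B
tuned_delta = 1.35   # SFT + RL model

# Relative reduction in per-hop degradation.
relative_reduction = (base_delta - tuned_delta) / base_delta

print(f"{relative_reduction:.0%}")  # rounds to the 44% quoted above
```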

Threat model

The adversary here is the AI model itself: it is prone to hallucination and spurious correlations when reasoning over complex scientific knowledge, and it lacks a robust capability for step-by-step verification. Human preference feedback is assumed to be unavailable because of annotation cost. The defense is supervised and RL training rooted in verifiable KG path traversal, intended to prevent reward hacking and shallow reasoning. The dual-LLM validation filter and the textbook are treated as trusted: the adversary is assumed unable to manipulate or bypass either.

Methodology — deep read

Threat Model & Assumptions: The goal is robust, specialized reasoning on neuroscience questions. The paper does not define an adversary explicitly; implicitly, it assumes that models prone to hallucination and spurious correlations need grounding in structured domain knowledge. Training cannot rely on extensive human annotation of reasoning traces and must avoid 'reward hacking', where superficially correct answers mask shallow reasoning.
Data: The core data source is a single authoritative neuroscience textbook (Kandel et al.), parsed from a digitized PDF into overlapping text chunks (~300 tokens each). The Qwen3-14B language model, given a few-shot prompt, extracted biomedical entities and directed relations (from a closed vocabulary of 40 neuroscience-specific relations) as triples. Two separate LLMs (gpt-oss-20b and Mistral-Nemo-12B) then validated each extracted triple in a consensus filter to remove hallucinated or inconsistent facts, leaving 6,157 seed triples. GraphMERT, an encoder-only masked language model that learns graph topology (six layers, 12 attention heads), was trained on these seed triples to propose additional triples, which were again filtered by the dual-LLM validator. The final neuroscience KG has 9,187 nodes and 19,755 triples, with an average node degree of 2.15 (19,755 triples over 9,187 nodes).
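
The consensus step described above can be sketched as a filter that keeps a triple only when every validator accepts it. This is a minimal illustration, not the authors' pipeline: the unanimity rule, the `Triple` type, the toy relation set, and the two rule-based validators (stand-ins for the LLM validators) are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# A validator returns True when it judges a triple to be well-formed and supported.
Validator = Callable[[Triple], bool]


def consensus_filter(triples: Iterable[Triple], validators: List[Validator]) -> List[Triple]:
    """Keep a triple only if every validator accepts it (unanimity is an assumption)."""
    return [t for t in triples if all(validate(t) for validate in validators)]


# Hypothetical stand-ins for the two LLM validators; real ones would prompt a model.
TOY_RELATIONS = {"projects_to", "modulates"}  # toy subset, not the paper's 40 relations


def validator_a(t: Triple) -> bool:
    return t.relation in TOY_RELATIONS  # reject relations outside the closed vocabulary


def validator_b(t: Triple) -> bool:
    return t.head != t.tail  # reject self-loops


candidates = [
    Triple("substantia nigra", "projects_to", "striatum"),
    Triple("striatum", "projects_to", "striatum"),  # rejected by validator_b
    Triple("striatum", "is_near", "thalamus"),      # rejected by validator_a
]
kept = consensus_filter(candidates, [validator_a, validator_b])
print(kept)
```

Requiring agreement from both validators trades recall for precision, which matches the stated goal of a small, high-confidence seed graph.
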
Architecture / Algorithm: The pipeline includes KG distillation, multi-hop path extraction, and KG-grounded QA curriculum generation. KG paths of lengths 1 to 5 hops were extracted via depth-first traversal, pruned for hub nodes, weak relations, and transitive redundancy. Multi-hop paths were converted to natural language multiple-choice question (MCQ) items paired with chain-of-thought (CoT) reasoning traces. This dataset includes 13,919 1-hop QA pairs, 30,000 2-hop pairs (5,000 reserved for RL), and 1,000 items per hop length for 3-5 hop evaluation.
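
A rough sketch of the path-extraction step follows: a depth-first traversal that enumerates simple directed paths of 1 to `max_hops` edges and skips high-degree hub nodes. The function name, the degree threshold, and the placeholder relation names are hypothetical, and the pruning of weak relations and transitive redundancy mentioned above is omitted.

```python
from collections import defaultdict


def extract_paths(triples, max_hops, hub_degree_limit):
    """Enumerate simple directed paths of 1..max_hops edges via DFS, skipping hub nodes."""
    adjacency = defaultdict(list)
    degree = defaultdict(int)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
        degree[head] += 1
        degree[tail] += 1

    # Hub pruning: nodes whose degree exceeds the (assumed) limit are never visited.
    hubs = {node for node, d in degree.items() if d > hub_degree_limit}
    paths = []

    def dfs(node, path, visited):
        if path:
            paths.append(tuple(path))  # record every prefix as its own path
        if len(path) == max_hops:
            return
        for relation, nxt in adjacency.get(node, ()):
            if nxt in visited or nxt in hubs:
                continue
            path.append((node, relation, nxt))
            visited.add(nxt)
            dfs(nxt, path, visited)
            visited.remove(nxt)
            path.pop()

    for start in list(adjacency):
        if start not in hubs:
            dfs(start, [], {start})
    return paths


# Toy chain; node names follow the document's example, relation names are placeholders.
toy_triples = [
    ("local interneurons", "rel_1", "cerebellum"),
    ("cerebellum", "rel_2", "sensorimotor transformations"),
    ("sensorimotor transformations", "rel_3", "voluntary motor behavior"),
]
paths = extract_paths(toy_triples, max_hops=3, hub_degree_limit=10)
for p in paths:
    print(len(p), "hop(s):", " -> ".join([p[0][0]] + [edge[2] for edge in p]))
```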

The Qwen3-14B LM first underwent supervised fine-tuning (SFT) on 1-hop and 2-hop QA items using LoRA adapters, which were then merged into the full weights. This stage grounds the model in domain axioms and the CoT format.

It then underwent reinforcement learning (RL) on 5,000 held-out 2-hop QA prompts with Group Relative Policy Optimization (GRPO), generating N=4 completions per prompt. Rewards combine answer correctness (+1/-1) and a path alignment score (the overlap between the model's reasoning trace and the ground-truth KG path, capped at 0.8). The path reward is gated: it is zeroed out when the answer is incorrect, which prevents partial credit for shallow keyword matching. A KL-divergence penalty anchors the policy to the frozen SFT checkpoint to prevent entropy collapse. RL uses full-parameter fine-tuning with DeepSpeed ZeRO Stage-3 and optimizer CPU offloading on 4 A100 GPUs.
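
The gated reward and the group-relative step can be sketched as below. Only the ±1 correctness term, the 0.8 cap, the gating rule, and N=4 come from the description above. The alignment metric (fraction of ground-truth path nodes mentioned in the trace), the additive combination, and all function names are assumptions; the advantage normalization is the common GRPO form, shown here with a population standard deviation.

```python
from statistics import mean, pstdev

PATH_REWARD_CAP = 0.8  # cap quoted in the summary


def path_alignment(trace, kg_path_nodes):
    """Assumed metric: fraction of ground-truth path nodes named in the trace, capped."""
    if not kg_path_nodes:
        return 0.0
    text = trace.lower()
    hits = sum(1 for node in kg_path_nodes if node.lower() in text)
    return min(hits / len(kg_path_nodes), PATH_REWARD_CAP)


def gated_reward(answer, gold, trace, kg_path_nodes):
    """+1 or -1 for the answer; path credit is added only when the answer is correct."""
    if answer != gold:
        return -1.0  # gate: a wrong answer earns no path reward
    return 1.0 + path_alignment(trace, kg_path_nodes)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of N=4 completions for a single prompt (illustrative text only).
path_nodes = ["local interneurons", "cerebellum",
              "sensorimotor transformations", "voluntary motor behavior"]
full_trace = ("local interneurons -> cerebellum -> "
              "sensorimotor transformations -> voluntary motor behavior")
completions = [
    ("B", full_trace),                                # correct, full path
    ("B", "The answer is B."),                        # correct, no path
    ("A", full_trace),                                # wrong, path ignored
    ("B", "cerebellum -> voluntary motor behavior"),  # correct, half the path
]
rewards = [gated_reward(ans, "B", trace, path_nodes) for ans, trace in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group the third completion names every node on the path but still receives the lowest reward, which is the behavior the gating is meant to enforce.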

Training Regime: GraphMERT was trained for 20 epochs with batch size 8. SFT uses LoRA modules, which are merged before RL. The RL phase runs for 3 epochs on the 5,000 held-out 2-hop QA prompts. The reported settings are AdamW with learning rate 2e-6, KL penalty 0.12, batch size 1 with gradient accumulation 16, bfloat16 precision, sampling temperature 0.6, top-p 0.9, and a maximum completion length of 1792 tokens, on 4×A100-80GB GPUs with memory optimizations.
Evaluation Protocol: Models are evaluated on held-out 3-hop to 5-hop QA items (total ~3,000), measuring final answer accuracy and normalized degradation rate δ in accuracy per additional hop. Baselines include the base zero-shot Qwen3-14B model and Gemini 3.1 Pro, a commercial multi-trillion parameter generalist model. Ablations examine contributions of SFT vs RL. Qualitative analyses include inspection of model <think> CoT reasoning traces for alignment to KG paths.
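
The evaluation metrics can be sketched as follows. The summary does not give the exact normalization behind δ, so this example assumes the simplest reading: the accuracy drop between the shortest and longest hop length, divided by the number of added hops. The function names and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_hop(records):
    """records: iterable of (hops, is_correct). Returns {hops: accuracy in percent}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hops, is_correct in records:
        totals[hops] += 1
        correct[hops] += int(is_correct)
    return {h: 100.0 * correct[h] / totals[h] for h in sorted(totals)}


def degradation_per_hop(acc):
    """Assumed definition of delta: mean accuracy drop (pp) per added hop."""
    hops = sorted(acc)
    if len(hops) < 2:
        raise ValueError("need at least two hop lengths to measure degradation")
    return (acc[hops[0]] - acc[hops[-1]]) / (hops[-1] - hops[0])


# Toy records (not the paper's data): 10 items per hop length.
toy_records = (
    [(3, True)] * 9 + [(3, False)] * 1
    + [(4, True)] * 8 + [(4, False)] * 2
    + [(5, True)] * 7 + [(5, False)] * 3
)
acc = accuracy_by_hop(toy_records)
delta = degradation_per_hop(acc)
print(acc, delta)
```
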
Reproducibility: The Qwen3-14B base is open-weight. The final neuroscience benchmark quiz and fine-tuned LM are available in a public GitHub repository. All KG triples, QA datasets, and prompts are detailed. Some underlying resources, namely the textbook PDF and the proprietary Gemini 3.1 Pro model, are closed. The release status of the full training code and RL checkpoints is not explicitly stated.

Example end-to-end: A 3-hop multiple-choice neuroscience question is generated from a KG path representing neural populations and their interactions. The fine-tuned model answers with a <think> chain-of-thought trace whose bolded terms exactly match the KG nodes, demonstrating traversal of the factual chain (e.g., "local interneurons → cerebellum → sensorimotor transformations → voluntary motor behavior"). During RL, such outputs are scored for answer correctness and reasoning-trace overlap.

Technical innovations

Use of a dual-LLM consensus filter (gpt-oss-20b + Mistral-Nemo-12B) to validate triples extracted from a single textbook source, yielding a high-fidelity neuroscience KG with fewer hallucinations.
Adaptation of GraphMERT, a graph masked language model, to expand the seed KG by proposing new domain triples learned from KG topology.
Generation of a multi-hop neuroscience question-answer curriculum (1-5 hops), grounded in explicit KG paths and paired with chain-of-thought reasoning traces for supervision.
Reinforcement learning with Group Relative Policy Optimization (GRPO) that uses KG path alignment as a verifiable implicit reward model combined with answer correctness, avoiding the need for human preference labels.
Demonstration that RL trained solely on 2-hop QA with path rewards enables zero-shot compositional generalization to 3-5 hop tasks, improving reasoning depth and robustness.

Datasets

Neuroscience KG — 19,755 triples, 9,187 nodes — derived from Kandel et al. textbook, publicly available
Neuroscience QA Curriculum — 13,919 1-hop, 30,000 2-hop (5,000 for RL), 1,000 per 3-to-5 hop — synthetic from KG paths
Evaluation QA set — ~3,000 multi-hop questions (3 to 5 hops) — from KG path sampling

Baselines vs proposed

Qwen3-14B (Base): Average accuracy (3-5 hops) = 78.1% vs Proposed SFT+RL = 89.5%
Qwen3-14B (SFT only): Average accuracy = 87.3% vs Proposed SFT+RL = 89.5%
Gemini 3.1 Pro: Average accuracy = 84.8% vs Proposed SFT+RL = 89.5%
Normalized degradation rate δ (pp/hop): Gemini 3.1 Pro = 2.35 vs Proposed SFT+RL = 1.35

Limitations

The approach depends heavily on a single textbook source, limiting domain coverage and potentially embedding textbook-specific biases or gaps.
The dual-LLM validation filter, while effective, relies on the accuracy and agreement of two open-weight LLMs that may share latent biases or blind spots.
The reinforcement learning phase uses only 2-hop QA items for training; direct RL training on deeper hops was not evaluated and may limit ultimate reasoning depth.
Qualitative alignment between model reasoning traces and KG paths was shown for examples but lacks a large-scale quantitative annotation study.
The evaluation compares only a single proprietary generalist baseline, Gemini 3.1 Pro, without broader benchmarking against other domain-specific or open models.
The training and evaluation focus on neuroscience; generalizability of this bottom-up KG-driven approach to other scientific disciplines requires further study.

Open questions / follow-ons

Can the KG-driven bottom-up reasoning approach generalize to other scientific domains beyond neuroscience with similar efficiency?
How does direct reinforcement learning training on 3-to-5-hop QA prompts affect deeper compositional reasoning compared to 2-hop-only RL?
What is the impact of incorporating human expert annotations in reasoning trace supervision on model accuracy and interpretability?
Could combining this KG-grounded approach with web-scale pretrained models improve performance further by blending deep domain knowledge with broad world knowledge?

Why it matters for bot defense

This research is a strong example of using structured symbolic knowledge, in the form of a rigorously validated knowledge graph, to induce deep compositional reasoning in a language model without additional web-scale domain corpora. For bot-defense and CAPTCHA practitioners, the key takeaway is that domain-specific, mechanistic knowledge graphs can let much smaller models reach expert-level reasoning in a narrow domain, rather than relying on the heuristics and shallow pattern exploits that advanced bots often abuse. Multi-hop question generation and reinforcement learning with path-grounded rewards offer a rigorous way to train models that produce verifiable, interpretable reasoning traces instead of bare final predictions. These ideas could carry over to bot defense as challenge-response mechanisms that require deep symbolic inference traces, making it substantially harder for bots to succeed through memorization or pattern matching alone. At the architecture level, neurosymbolic graph distillation and structured QA curricula grounded in logic paths suggest directions for CAPTCHA generation that emphasize compositional, multi-step understanding over simple puzzles or statistical correlations.

Cite


@article{arxiv2605_25183,
  title={Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience},
  author={Jake Stephen and Niraj K. Jha},
  journal={arXiv preprint arXiv:2605.25183},
  year={2026},
  url={https://arxiv.org/abs/2605.25183}
}
