Agents-K1: Towards Agent-native Knowledge Orchestration

Source: arXiv:2606.13669 · Published 2026-06-11 · By Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong et al.

TL;DR

Agents-K1 addresses a gap in current large language model (LLM)-based research agents by focusing on scientific knowledge orchestration rather than agent orchestration alone. Existing infrastructures commonly treat scientific documents as flat text or abstracts and rely on coarse citation graphs, which loses key entities, claims, evidence from multimodal content (figures, tables, equations), and nuanced citation intent. Agents-K1 introduces an end-to-end pipeline that constructs agent-native, multimodal scientific knowledge graphs capturing fine-grained entities, typed relations, citation roles, and evidence provenance from full-text papers. The system integrates three components: a multimodal parser with a five-module schema that extracts entities and relations across text, figures, and tables; a compact 4-billion-parameter extraction backbone trained with reinforcement learning guided by rule-based rewards; and a tri-source CLI interface for agent-grounded retrieval and cross-document graph traversal. Applied to 2.46 million papers from six scientific domains, the pipeline produces Scholar-KG, a large-scale, richly structured scientific knowledge graph; a one-million-paper subset is publicly released. The reported experiments show substantial improvements over strong baselines in information extraction metrics, knowledge graph construction quality, and multi-hop reasoning on scientific QA benchmarks, and the resulting graph supports evidence-traceable, causal-aware reasoning workflows.

Key findings

Agents-K1 processes full multimodal scientific papers (not just abstracts) capturing entities, claims, evidence, mechanisms, and citation intent in a unified knowledge graph schema.
The 4B-parameter extraction model, trained with Group Relative Policy Optimization (GRPO) and a rule-based reward, matches or exceeds the performance of an 8B open-source baseline on 10 extraction benchmarks and approaches the NER accuracy of a 32B model.
On the FrontierScience-Research benchmark, Agents-K1 raises Gemini-3 overall accuracy from 7.9% to 24.6% and GPT-5.2 from 25.2% to 39.4%.
On geoscience research questions, Gemini-3 rationale accuracy improves from 52.3% to 69.5% with Agents-K1 graph retrieval.
Multi-hop QA performance on HotpotQA, 2WikiMultiHopQA, and MuSiQue surpasses nine graph-augmented retrieval baselines.
Scholar-KG is built from 2.46 million papers spanning computer science, chemistry, biology, earth science, physics, and materials, with a 1M paper subset released.
Citation relations are fine-grained and typed into 5 levels (peripheral to foundational) with evidence spans and argumentative roles supporting causal and lineage tracing.
Semantic anchors aggregate fine-grained multimodal entities into robust modality-agnostic nodes linking text, figures, tables, and equations, improving retrieval robustness and extensibility.

Threat model

The work does not address adversarial threats. It assumes a research agent that consumes scientific literature and needs reliable, traceable, and richly structured knowledge. There is no active attacker; the closest analogue to an adversary is incomplete or ambiguous scientific documentation. The system's goal is to reduce knowledge gaps and improve evidence provenance rather than to defend against malicious actors.

Methodology — deep read

The core idea of Agents-K1 is to build an agent-native scientific knowledge graph (KG) from full multimodal papers to enable structured retrieval and verifiable reasoning.

Threat model & assumptions: The paper targets research agents acting as knowledge consumers that need comprehensive access to claims, evidence, and scientific context across multiple papers. There is no adversarial focus; the pipeline assumes access to the raw PDFs and metadata of scientific literature, and information integrity depends on accurate parsing and extraction.
Data: The system processes 2.46 million scientific papers from six disciplines: computer science, chemistry, biology, earth science, physics, and materials. A one-million-paper subset of Scholar-KG, plus full metadata, is released. Papers are parsed holistically, including text, figures, tables, and equations, with layout and caption information preserved. Benchmark datasets are used to evaluate extraction and reasoning, but their data splits are not described in detail.
Architecture / algorithm: The pipeline has three components:

KG Layer: A multimodal parser decomposes PDF documents into content units across modalities (text, figure, table, equation). Semantic anchors provide modality-agnostic summaries, so entities can be aggregated without brittle cross-modal alignment. The knowledge graph is a three-layer heterogeneous graph capturing fine-grained entities, anchors, and document structure. A disaggregated schema models five kinds of content: (1) meta/factual entities (metadata, authors); (2) explicit textual scientific entities (methods, datasets, metrics, theorems); (3) implicit abstractions (motivations, claims, limitations); (4) typed citation relationships with strength and argumentative roles; and (5) inter-entity relations (causal, structural, comparative). Entities are canonicalized via controlled vocabularies and embeddings.
LLM Layer: A 4-billion-parameter transformer backbone for information extraction is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning approach. Its rule-based reward enforces JSON format validity and schema compliance, and scores task-specific F1 on NER, relation extraction, and long-form extraction outputs. The model trades off accuracy against computational cost relative to larger baselines.
CLI Layer: GraphAnything CLI is a tri-source agent interface that fuses real-time web search, multimodal graph-based retrieval from Scholar-KG, and cross-document knowledge traversal. It supports graph operators, multi-agent coordination, idea generation, method specification, and executable research workflows.
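The rule-based reward described for the LLM Layer can be illustrated with a minimal Python sketch. It is not the authors' reward function: the required keys, the entity fields (`text`, `type`), the gating order, and the 0.1 / 0.2 partial credits are all assumptions chosen for illustration. It only shows how JSON validity, schema compliance, and an F1 term might be combined into one scalar.

```python
import json
from typing import Set, Tuple

# Hypothetical schema: these required top-level keys are an assumption.
REQUIRED_KEYS = {"entities", "relations"}


def f1_score(predicted: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]) -> float:
    """Set-level F1 between predicted and gold (text, type) pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def rule_based_reward(output_text: str, gold_entities: Set[Tuple[str, str]]) -> float:
    """Gate on format, then schema, then add an F1-weighted term.

    The partial credits (0.1 for valid JSON, 0.2 for a valid schema) are
    illustrative values, not taken from the paper.
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0  # output is not valid JSON
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return 0.1  # valid JSON, but required keys are missing
    try:
        predicted = {(e["text"], e["type"]) for e in parsed["entities"]}
    except (TypeError, KeyError):
        return 0.1  # entity records do not follow the assumed shape
    return 0.2 + 0.8 * f1_score(predicted, gold_entities)


gold = {("GRPO", "method"), ("HotpotQA", "dataset")}
perfect = json.dumps({
    "entities": [
        {"text": "GRPO", "type": "method"},
        {"text": "HotpotQA", "type": "dataset"},
    ],
    "relations": [],
})
print(rule_based_reward(perfect, gold))
print(rule_based_reward("not json at all", gold))
```

Gating the F1 term behind the format checks means a malformed output can never outscore a well-formed one, which is one plausible reading of "enforces JSON format validity and schema compliance".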

Training regime: Epochs, batch sizes, hardware setup, and seeds for the 4B-parameter model are not stated in this summary. What is stated is that training uses reinforcement learning with rule-based rewards that enforce valid structured extraction.
Evaluation protocol: Agents-K1 is evaluated on 10 scientific information extraction benchmarks against 8B and 32B baseline models. Question-answering evaluation covers FrontierScience-Research, HotpotQA, 2WikiMultiHopQA, and MuSiQue, measuring accuracy and rationale correctness. Cross-domain and cross-modality performance is also analyzed. Baselines include strong open-source models and heuristic systems.
Reproducibility: The authors release a one-million-paper subset of Scholar-KG and provide the GraphAnything CLI. The codebase and trained models are accessible via links, and the full 2.46-million-paper corpus is available through an SCP link. Details of some evaluation datasets and of any proprietary corpora are less clear.

Example end-to-end flow: A raw scientific paper PDF is ingested by MinerU multimodal parsers, which extract text paragraphs, figures, tables, and equations into structured content units. Semantic anchors summarize each content unit, bridging modalities. The 4B extraction backbone, trained with rule-based rewards, processes these inputs and produces JSON outputs containing named entities, entity relations, and citation roles. The extracted knowledge forms nodes and edges in Scholar-KG with fine-grained typing and provenance. An agent using GraphAnything CLI can then query the graph for relevant entities and traverse multi-hop citation or methodological relations, answering complex scientific questions with support from exact evidence spans.
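The last step of this flow, multi-hop traversal over typed edges that carry provenance, can be sketched with a toy in-memory graph. This is not the GraphAnything CLI or the Scholar-KG storage format: the `Edge` fields, the node identifiers, and the relation labels below are hypothetical and exist only to make the idea concrete.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str  # hypothetical labels, e.g. "cites", "proposes"
    target: str
    evidence: str  # provenance: where the relation was stated


def multi_hop_paths(edges: List[Edge], start: str, max_hops: int) -> List[List[Edge]]:
    """Breadth-first enumeration of cycle-free paths of 1..max_hops edges."""
    outgoing: Dict[str, List[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge.source, []).append(edge)

    paths: List[List[Edge]] = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget exhausted on this branch
        visited = {start} | {e.target for e in path}
        for edge in outgoing.get(node, []):
            if edge.target in visited:
                continue  # do not revisit a node on the same path
            new_path = path + [edge]
            paths.append(new_path)
            queue.append((edge.target, new_path))
    return paths


# Toy graph with made-up identifiers.
edges = [
    Edge("paper:B", "cites", "paper:A", "paper:B, Sec. 1"),
    Edge("paper:A", "proposes", "method:M", "paper:A, Sec. 3"),
    Edge("method:M", "evaluated_on", "dataset:D", "paper:A, Table 2"),
]

for path in multi_hop_paths(edges, "paper:B", max_hops=3):
    hops = " -> ".join(f"[{e.relation}] {e.target}" for e in path)
    sources = "; ".join(e.evidence for e in path)
    print(f"paper:B -> {hops}  (evidence: {sources})")
```

Because every edge keeps its own evidence string, each returned path doubles as a citation trail, which is the property the summary calls evidence-traceable reasoning.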

Technical innovations

Introduction of a five-module multimodal parsing schema capturing entities, evidence, citations, and typed inter-entity relations from full papers including figures, tables, and equations rather than abstracts alone.
Use of semantic anchors as modality-agnostic intermediate nodes bridging fine-grained entities from heterogeneous data modalities to improve graph robustness and retrieval.
Training a 4B parameter extraction backbone with Group Relative Policy Optimization (GRPO) reinforcement learning guided by rule-based rewards that enforce structured output validity and improve efficiency.
GraphAnything CLI integrates tri-source retrieval (web, multimodal graph, cross-document traversal) with executable graph operations enabling closed-loop research workflows grounded in structured scientific knowledge.
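For readers unfamiliar with GRPO, the group-relative step can be sketched as follows. This shows the standard formulation, in which several outputs are sampled for one prompt and each reward is normalized against the group's mean and standard deviation; whether Agents-K1 uses exactly this variant is not stated in the summary, and the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group.

    advantage_i = (reward_i - mean(group)) / (std(group) + eps)
    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four sampled extractions for one prompt, scored by a rule-based reward.
advantages = group_relative_advantages([1.0, 0.2, 0.2, 0.6])
print([round(a, 3) for a in advantages])
```

Outputs that score above the group average get a positive advantage and are reinforced; no separate learned value model is needed, which suits rewards that are cheap to compute by rule.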

Datasets

Scholar-KG — 2.46 million scientific papers across six disciplines (computer science, chemistry, biology, earth science, physics, materials) — full corpus released via SCP, one-million-paper subset publicly released
FrontierScience-Research benchmark — unspecified size — for multi-hop scientific reasoning
HotpotQA — existing multi-hop QA benchmark
2WikiMultiHopQA — existing multi-hop QA benchmark
MuSiQue — multi-hop QA benchmark

Baselines vs proposed

8B open-source extraction model: Agents-K1 4B model matches or exceeds its F1 scores on 10 extraction benchmarks (exact numbers not specified)
32B baseline (NER accuracy): Agents-K1 4B extraction backbone reaches comparable accuracy (exact numbers not specified)
FrontierScience-Research benchmark: Gemini-3 accuracy improved from 7.9% to 24.6% with Agents-K1 graph; GPT-5.2 accuracy improved from 25.2% to 39.4%
Geoscience research questions: Gemini-3 rationale accuracy increases from 52.3% to 69.5% with Agents-K1
Multi-hop QA on HotpotQA, 2WikiMultiHopQA, MuSiQue: Agents-K1 outperforms nine graph-augmented retrieval baselines (exact scores not provided)
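The reported before/after figures are easier to compare as absolute and relative gains. The short calculation below uses only the numbers listed above; the dictionary labels are shorthand introduced here.

```python
from typing import Dict, Tuple

# (accuracy without Agents-K1, accuracy with Agents-K1), in percent.
reported: Dict[str, Tuple[float, float]] = {
    "FrontierScience-Research / Gemini-3": (7.9, 24.6),
    "FrontierScience-Research / GPT-5.2": (25.2, 39.4),
    "Geoscience rationale / Gemini-3": (52.3, 69.5),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point gain, ratio of after to before)."""
    return round(after - before, 1), round(after / before, 2)


for name, (before, after) in reported.items():
    points, ratio = gains(before, after)
    print(f"{name}: +{points} points ({ratio}x)")
# FrontierScience-Research / Gemini-3: +16.7 points (3.11x)
# FrontierScience-Research / GPT-5.2: +14.2 points (1.56x)
# Geoscience rationale / Gemini-3: +17.2 points (1.33x)
```

The absolute gains are similar across the three settings (about 14 to 17 points), while the relative gain is largest where the baseline was weakest.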

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13669.

Fig 1: Agents-K1: Architecture and Capabilities. Left: Extracting multimodal knowledge from

Figs 2-8 (page 1): captions not available.

Limitations

Training hyperparameters, epochs, batch sizes, and hardware configurations for the 4B extraction model are not fully specified, which limits reproducibility.
The paper does not extensively report adversarial evaluation or robustness against noisy, contradictory scientific documents.
Distribution shifts, such as applying the pipeline to radically different domains or less structured documents beyond scientific papers, are only lightly explored with the schema-adaptive General-KG extension.
Citation relation classification and evidence extraction rely partly on heuristic and classifier-based approaches that may require manual verification or calibration, possibly limiting scalability.
Some benchmark datasets and evaluation metrics are not fully detailed in the summary, so exact comparisons are hard to verify.
While a large subset of Scholar-KG is released, access to the full dataset requires an SCP link and may be subject to access constraints.

Open questions / follow-ons

How to extend agent-native knowledge orchestration beyond scientific literature to highly heterogeneous or semi-structured real-world corpora with minimal domain adaptation?
What are the limits of reinforcement learning with rule-based rewards for structured information extraction across vastly different document styles or noisy inputs?
Can the citation intent and argumentative role classification be further refined with more advanced natural language inference models or human-in-the-loop feedback at scale?
How can agent-facing graph interfaces like GraphAnything CLI be generalized to support collaborative multi-agent workflows or user-in-the-loop verification beyond research agents?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Agents-K1 presents a compelling example of building agent-native knowledge orchestration that tightly integrates rich multimodal scientific evidence, citation roles, and typed entity relations into a unified graph. This approach contrasts with many existing keyword- or snippet-based retrieval methods prone to shallow context or brittle provenance. Agents working in complex adversarial settings involving CAPTCHA or bot-detection could benefit from adopting similar layered knowledge graphs to ground decisions in well-structured, auditable evidence rather than isolated text segments. Furthermore, the reinforcement-learned structured extraction backbone highlights how tight integration of extraction and retrieval can enable more reliable multi-hop reasoning, important in detecting sophisticated automated or coordinated abuse patterns that require deeper contextual understanding. Finally, the tri-source agent interface demonstrates how unifying real-time data retrieval with structured knowledge bases supports robust and flexible workflows, a principle that could inspire advanced bot-defense tooling combining external threat intelligence with internal behavioral graphs.

Cite

bibtex

@article{arxiv2606_13669,
  title={Agents-K1: Towards Agent-native Knowledge Orchestration},
  author={Zongsheng Cao and Bihao Zhan and Jinxin Shi and Jiong Wang and Fangchen Yu and Zhijie Zhong and Zijie Guo and Tianshuo Peng and Zhuo Liu and Yi Xie and Xiang Zhuang and Yue Fan and Runmin Ma and Shiyang Feng and Xiangchao Yan and Anran Liu and Peng Ye and Wenlong Zhang and Shufei Zhang and Chunfeng Song and Fenghua Ling and Jie Zhou and Liang He and Bo Zhang and Lei Bai},
  journal={arXiv preprint arXiv:2606.13669},
  year={2026},
  url={https://arxiv.org/abs/2606.13669}
}
