LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Source: arXiv:2605.26781 · Published 2026-05-26 · By Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

TL;DR

This paper introduces LiveK12Bench, a benchmark for evaluating the reasoning capabilities of large multimodal models (LMMs) on authentic high school-level examinations. Unlike prior static, limited datasets, LiveK12Bench continuously ingests real examination papers across four core K-12 disciplines (Mathematics, Physics, Chemistry, Biology) through an automated pipeline, which reduces the risk of data leakage and contamination. It provides a multimodal dataset of more than 2,000 verified questions and a "Mock Exam" evaluation scheme that scores models not only on final-answer accuracy but also on reasoning-process correctness, efficiency, and robustness in an end-to-end setting. Challenging subsets probe model limitations such as complex visual layout understanding and long-horizon reasoning.

Extensive experiments on 12 state-of-the-art proprietary and open-source LMMs reveal pronounced performance drops under realistic exam constraints. For example, GPT-5's overall exam score falls sharply from 79 to 53 out of 100 when factoring in rigorous process and efficiency evaluations. Models exhibited notable vulnerabilities to complex page layouts and degraded reasoning quality, highlighting a significant gap between idealized benchmark performance and real-world educational readiness. The dataset, code, and evaluation framework are publicly released to facilitate ongoing research.

Key findings

LiveK12Bench dataset contains 2,114 manually verified questions spanning Mathematics, Physics, Chemistry, and Biology.
The dataset includes three evaluation modalities: Text-Only, Text-Image, and Image-Only (full exam page input).
GPT-5's overall exam score decreased from 79 to 53 out of 100 when incorporating reasoning process and efficiency metrics.
Gemini-3-pro achieved the highest overall exam scores across most subjects, and led the second-best model on process exam score by 2.6 points in Chemistry and 3.1 points in Biology.
Claude-opus-4.6 ranked highest in reasoning efficiency (ARL metric) across all subjects, indicating concise and accurate generations.
Models perform substantially worse on Chemistry and Biology relative to Mathematics and Physics.
All evaluated LMMs struggled significantly on the Complex Layout subset, with exam scores dropping as much as 38.9 points compared to standard evaluation.
Process reasoning evaluation penalized models for condition interpretation errors, logical assumption errors, and deductive reasoning errors – exposing frequent flaws beyond final answer correctness.

Threat model

n/a — This paper is primarily focused on realistic, comprehensive evaluation of AI model capabilities in educational exam contexts rather than addressing adversarial threats or attacks.

Methodology — deep read

The authors target the evaluation of LMMs on authentic K-12 exams. There is no adversary in this setting; the focus is on model performance under realistic constraints. The dataset is sourced from 200 Chinese high school exams published in 2026, spanning Mathematics, Physics, Chemistry, and Biology. Because the papers are newly published, the risk that they already appear in model training corpora is reduced.

Raw exam papers (PDFs and scans) pass through a structural document extraction pipeline that applies OCR to text, formulas, and tables and crops embedded images. The authors fine-tuned the OCR models of the MinerU framework on their internal exam data to improve accuracy on complex formula characters. The extracted markdown text and images are then parsed into structured JSON question-answer databases by a large language model. Its prompts vary to encode question types, output format, and layout cues so that exam components are located and segmented reliably. The results undergo human expert verification and deduplication.
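The final deduplication step can be illustrated with a minimal sketch. The summary does not describe the authors' schema or their deduplication method, so the `Question` fields, the `fingerprint` normalization, and the hash-based matching below are all assumptions made for illustration; a real pipeline would likely also compare images and near-duplicates.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical record for one parsed exam question (schema is assumed)."""
    subject: str
    question_type: str
    text: str
    answer: str
    image_paths: list[str] = field(default_factory=list)


def fingerprint(q: Question) -> str:
    """Hash the question text after normalizing character width, case, and whitespace."""
    normalized = unicodedata.normalize("NFKC", q.text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(questions: list[Question]) -> list[Question]:
    """Keep the first occurrence of each fingerprint, preserving input order."""
    seen: set[str] = set()
    unique: list[Question] = []
    for q in questions:
        key = fingerprint(q)
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```

Exact-match hashing only catches duplicates that differ in spacing, case, or full-width versus half-width characters, which is why the source still pairs this step with human expert verification.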

LiveK12Bench supports three input modalities: Text-Only (textual question and options), Text-Image (text plus cropped visuals), and Image-Only (full exam page images with question indices). These simulate increasing levels of realism and decreasing levels of manual intervention. Three challenging subsets target complex layouts, rigorous process reasoning (which mitigates lucky guessing), and long-horizon reasoning.
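A small sketch may help make the three modalities concrete. The `ExamItem` fields, the payload layout, and the instruction string for the Image-Only case are hypothetical; the summary only states what each modality contains, not how the benchmark encodes it.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT_ONLY = "text_only"
    TEXT_IMAGE = "text_image"
    IMAGE_ONLY = "image_only"


@dataclass
class ExamItem:
    """Hypothetical container for one question and its source page."""
    index: int                 # question number on the exam page
    text: str                  # extracted question text and options
    cropped_images: list[str]  # paths to figures cropped for this question
    page_image: str            # path to the full exam page scan


def build_input(item: ExamItem, modality: Modality) -> dict:
    """Assemble the model input for one item under the chosen modality."""
    if modality is Modality.TEXT_ONLY:
        return {"text": item.text, "images": []}
    if modality is Modality.TEXT_IMAGE:
        return {"text": item.text, "images": list(item.cropped_images)}
    if modality is Modality.IMAGE_ONLY:
        # The model sees only the page and must locate the question itself.
        prompt = f"Answer question {item.index} on this exam page."
        return {"text": prompt, "images": [item.page_image]}
    raise ValueError(f"unknown modality: {modality}")
```

In the Image-Only branch the extracted question text is deliberately withheld, which is what makes that setting end-to-end.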

The core evaluation protocol is a multi-dimensional "Mock Exam" scheme that requires models to generate full reasoning chains plus final answers under strict format constraints. Scoring covers the accuracy of final answers, the correctness of the reasoning process assessed through error-type identification, efficiency measured as accuracy weighted by reasoning response length (ARL), and a composite exam score that blends outcome and process scores using weights assigned by human educators. To limit subjective bias, evaluations are arbitrated by panels of several different advanced LLMs.
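The blending of outcome and process scores can be sketched as follows. This is an illustrative reading, not the paper's formula: the penalty values, the equal outcome and process weights, and the clamp at zero are assumptions, and only the three error-type names come from the findings reported in this summary.

```python
# Hypothetical per-error penalties; the paper fixes its own hyperparameters.
PENALTY = {
    "condition_interpretation": 0.3,
    "logical_assumption": 0.3,
    "deductive_reasoning": 0.4,
}


def process_score(errors: list[str]) -> float:
    """Start from 1.0 and subtract a penalty per identified error, floored at 0."""
    total = sum(PENALTY[e] for e in errors)
    return max(0.0, 1.0 - total)


def overall_score(
    correct: bool,
    errors: list[str],
    w_outcome: float = 0.5,   # assumed weight, not from the paper
    w_process: float = 0.5,   # assumed weight, not from the paper
) -> float:
    """Blend final-answer correctness and process quality on a 0-100 scale."""
    outcome = 1.0 if correct else 0.0
    return 100.0 * (w_outcome * outcome + w_process * process_score(errors))
```

Under these assumed weights, a correct final answer reached through flawed reasoning scores below 100, which is the mechanism by which a process-aware score can fall well under plain accuracy.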

For experiments, 12 mainstream multimodal models (including GPT-5 variants, Gemini-3, Claude-opus, open-source Qwen3 and Kimi-k2.5) were evaluated on multiple dataset splits. Models used fixed prompts to ensure uniformity. Accuracy, process exam score (PES), overall exam score (OES), and efficiency metrics (ARL) were reported. Detailed ablations on subsets and modality variations measured robustness to layout noise and test conditions. The authors fixed penalty hyperparameters for process evaluation and averaged scores across multi-model judges. The dataset, code, and evaluation scripts are publicly released to facilitate reproducibility and future benchmarks. Specific per-discipline and per-model results with detailed error analyses enable fine-grained insights.
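Averaging scores across multiple judge models can be sketched in a few lines. The judge names and the per-question score layout are hypothetical; the summary states only that scores were averaged across multi-model judges.

```python
from statistics import mean


def aggregate_judges(scores_by_judge: dict[str, list[float]]) -> list[float]:
    """Average each question's score across all judges.

    Every judge must score the same questions in the same order.
    """
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("judges must score the same non-empty set of questions")
    # zip(*...) regroups the per-judge lists into per-question tuples.
    return [mean(column) for column in zip(*scores_by_judge.values())]


# Usage with two hypothetical judges scoring two questions.
panel = {"judge_a": [1.0, 0.0], "judge_b": [0.5, 1.0]}
averaged = aggregate_judges(panel)
```

A plain mean treats all judges equally; a weighted mean or majority vote would be alternative design choices that the summary does not rule in or out.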

Technical innovations

Automated, continuous ingestion pipeline parsing authentic exam PDFs into structured multi-modal question-answer datasets using OCR and LLM parsing with variable templates to mitigate data contamination.
Novel multi-dimensional Mock Exam evaluation protocol combining final answer correctness, detailed stepwise reasoning process auditing, and efficiency metrics to simulate realistic test-taking constraints.
Introduction of an Image-Only modality with full-page exam snapshot inputs requiring autonomous context extraction, enabling truly end-to-end evaluation with minimal human intervention.
Establishment of three challenging subsets—Complex Layout, Rigorous Process, and Long-Horizon Reasoning—to specifically probe LMMs' vulnerabilities observed in real exams.

Datasets

LiveK12Bench: 2,114 questions sourced from 200 recent authentic Chinese high school exam papers (publicly available with code)

Baselines vs proposed

GPT-5: Overall Exam Score dropped from 79 to 53 when reasoning process and efficiency were evaluated jointly
Gemini-3-pro: Achieved highest Accuracy and Overall Exam Score across four disciplines with Chemistry PES lead of +2.6 and Biology PES lead of +3.1 over second best
Claude-opus-4.6: Ranked highest in reasoning efficiency (ARL) across all subjects, outperforming GPT-5 and Gemini-3
Open-source Qwen3 and Kimi-k2.5 models scored lower on exam accuracy than the proprietary models but showed higher reasoning efficiency than GPT-5 and Gemini
All models' exam scores dropped significantly (up to -38.9 points) on the Complex Layout subset compared to standard evaluation
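To put the reported GPT-5 figures in perspective, the drop from 79 to 53 can be expressed in both absolute and relative terms. The helper name `score_drop` is illustrative; the two input numbers are the ones reported above.

```python
def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return the absolute drop in points and the drop relative to the starting score."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


abs_drop, rel_drop = score_drop(79, 53)
print(f"{abs_drop:.0f} points, {rel_drop:.1%}")  # 26 points, 32.9%
```

In other words, roughly a third of the score disappears once process and efficiency are taken into account.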

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.26781.

Fig 1: Performance degradation of cutting-
Fig 2: Overall framework of LiveK12Bench. The framework consists of three interconnected
Fig 3 (page 2).
Fig 4 (page 2).
Fig 3: Examples from the Three Challenging Subsets in LiveK12Bench. The figure illustrates
Fig 6 (page 3).
Fig 7 (page 3).
Fig 8 (page 3).

Limitations

Dataset limited to Chinese high school exams; cross-cultural or different education system generalization is untested.
Evaluation focused on K-12 exams; higher education level or other disciplines remain unexplored.
No adversarial attack or robustness testing beyond layout complexity and process evaluation was conducted.
Efficiency metrics rely on token counts as a proxy for reasoning time, which may not map perfectly to actual runtime or latency.
Human intervention remains necessary for verification and process error identification; fully automated grading is left to future work.
Limited discussion on model fine-tuning or adaptation strategies to improve process rigor under this benchmark.

Open questions / follow-ons

How can LMM architecture and training be improved to handle complex exam layouts more robustly for Image-Only inputs?
What targeted fine-tuning or curriculum learning methods can enhance reasoning process rigor and reduce logical assumption or deductive errors?
Can the Mock Exam evaluation scheme be extended to other educational levels and additionally incorporate adaptive timing constraints?
How can fully automated, scalable grading systems be designed to audit reasoning chains accurately with minimal human intervention?

Why it matters for bot defense

This work gives bot-defense and CAPTCHA practitioners a view of where current multimodal AI systems fall short when processing highly structured, visually complex documents under realistic constraints. The large performance degradation observed when models must interpret entire exam pages on their own, without manual cropping or annotation, points to weaknesses in visual grounding and contextual extraction that would also limit malicious bots built on similar models. The reasoning process auditing framework shows the value of verifying internal logical consistency rather than only answer correctness, a principle that can inform bot-detection schemes that try to distinguish genuine human cognitive effort from model-generated shortcuts. The dynamic data ingestion pipeline also offers a pattern for continuously refreshing challenge datasets to mitigate data reuse and contamination. Overall, the benchmark exposes gaps between idealized AI reasoning benchmarks and the real-world compositional understanding needed for robust, human-like automated task solving.

Cite

bibtex

@article{arxiv2605_26781,
  title={LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?},
  author={Xiaohan Wang and Mingze Yin and Yilin Zhao and Gang Liu and Dian Li},
  journal={arXiv preprint arXiv:2605.26781},
  year={2026},
  url={https://arxiv.org/abs/2605.26781}
}
