K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Source: arXiv:2606.02404 · Published 2026-06-01 · By Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park et al.

TL;DR

K-BrowseComp addresses a key gap in Korean-centric agentic evaluation by introducing a comprehensive web browsing agent benchmark grounded in culturally and linguistically relevant Korean contexts. This benchmark comprises 400 questions divided into a 300-problem manually validated subset (K-BrowseComp-Verified) and a 100-problem synthetic diagnostic subset designed adversarially to stress-test model weaknesses. The benchmark specifically focuses on assessing compositional, multi-hop, and parallel-branching web search and reasoning abilities in a Korean setting, which remains under-explored in contrast to English-centric benchmarks such as BrowseComp.

State-of-the-art global large language models (LLMs) like GPT-5.5 perform significantly worse on K-BrowseComp-Verified (max 45.67% accuracy) than on prior English benchmarks (84.4% on BrowseComp), while Korean proprietary LLMs underperform drastically (0-10.33%). The synthetic split designed from observed failure modes yields even lower accuracies (max 26.0%) for top models, demonstrating persistent challenges and providing a rigorous diagnostic for future system improvements. The authors release the dataset, question construction and validation procedures, failure taxonomy, and generation code publicly to catalyze Korean AI agent development.

Key findings

GPT-5.5 achieves 45.67% accuracy on K-BrowseComp-Verified vs 84.4% on the original English BrowseComp benchmark, indicating substantial performance drop in Korean agentic context.
DeepSeek-V4-Pro and GLM-5.1 score similarly around 30% accuracy on K-BrowseComp-Verified, compared to >80% on BrowseComp.
Korean proprietary LLMs from government-funded projects score poorly, ranging from 0.00% to 10.33% accuracy on K-BrowseComp-Verified.
On the synthetic 100-problem split generated using failure-mode targeting and few-shot prompting, the strongest model only reaches 26.0% accuracy, underscoring the increased challenge.
K-BrowseComp includes trajectory-level failure modes such as premature candidate commitment, constraint-tracking failure, cross-source hopping failure, and incomplete trajectory finalization (F0-F8), each detailed and manually annotated.
Trajectory analyses reveal many errors occur after relevant evidence has been retrieved, due to failure of maintaining candidates, constraints, role bindings, and final-answer consistency.
Search effort analyses show incorrect trials use more search calls on average than correct trials, indicating failures are not due to insufficient search budget.
The synthetic generation pipeline yields a 37.3% acceptance rate after multiple filtering steps ensuring questions are adversarial, well-formed, uniquely answerable, and specifically target browsing failure modes.

Threat model

The adversary is the evaluated web browsing agent model, which interacts with publicly accessible Korean web sources via a constrained search API with limited budget. The model attempts to retrieve and integrate web evidence to answer compositional queries but cannot access private, paid, or non-textual resources. The evaluation assumes the agent does not cheat via memorization, and the questions require multi-hop or parallel search constrained by Korean language and cultural knowledge.

Methodology — deep read

The authors first identified a lack of Korean-contextualized benchmarks for compositional web browsing agent evaluation, focusing on multi-hop and parallel-branching questions requiring cultural and regional knowledge.

Threat model: The adversary is effectively the language model or browsing agent evaluated, tasked with answering questions that require compositional reasoning over Korean web sources. It is permitted to make up to 10 search calls per question using Perplexity Search API. The agent must maintain consistent state during multi-step retrievals and reasoning. It cannot access private or non-textual sources; the challenge is to operate effectively in a low-resource, linguistically specialized environment.
Data construction: 300 questions were manually constructed by 17 native Korean annotators targeting scenarios difficult to solve by direct search yet easily verifiable once the relevant evidence is found. Questions require at least 4 steps of multi-hop or parallel constraint satisfaction, are uniquely answerable, temporally stable, and only use public textual Korean web evidence. The final set passed rigorous author validation checking gold answers, intermediate reasoning steps, and cited sources.
Architecture & generation: Beyond verified questions, the authors aimed to scale dataset construction by using a web-browsing LLM agent (Claude Code) to generate synthetic questions that are adversarially hard. They leveraged a failure taxonomy (F1-F8 modes) identified from baseline errors across multiple Korean websites to instruct the agent in generating questions focusing on those failure modes, using few-shot exemplars from human-written problems and iterative refinement over four generations.
Training regime: Not applicable as this is a benchmarking study. However, evaluation models include various proprietary and open-weight LLMs including GPT-5.5, DeepSeek-V4-Pro, GLM-5.1, Gemini-3.1-Flash-Lite, and several Korean foundation models.
Evaluation protocol: Models are evaluated as agents using the search_evals framework with Perplexity Search as backend. Each question allows 10 search calls. Final answers are extracted by GPT-5.4-mini and matched against the gold answer for pass@1 accuracy. Expected calibration error is measured to assess confidence alignment. Results are reported separately for verified and synthetic splits. Error trajectories are manually annotated with failure modes to diagnose model weaknesses.
Reproducibility: The authors publicly release the dataset, synthetic question generation code, and evaluation scripts. Baselines are evaluated with fixed external retrieval API, ensuring consistency. No frozen model weights are released since many baselines are closed-access proprietary models.

Concrete example (from Figure 4): A question requires identifying a webtoon-based drama with specific broadcast properties, then finding a song title from its OST Part 4 containing a keyword. The gold trajectory involves sequentially verifying broadcaster, ranking, identifying the drama, then querying OST lyrics. Model failures include shortcutting search order or committing prematurely to wrong candidates, demonstrating difficulties in state maintenance rather than retrieval alone.

Technical innovations

Introduction of K-BrowseComp, the first Korean web browsing agent benchmark emphasizing compositional multi-hop and parallel reasoning grounded in culturally relevant Korean web sources.
A novel hybrid dataset construction methodology combining manually curated, validated questions with adversarial synthetic question generation guided by an empirically derived failure-mode taxonomy.
Development of a detailed trajectory-level failure taxonomy (F0-F8) categorizing browsing agent errors like candidate capture and constraint-tracking failures to illuminate persistent weaknesses beyond retrieval.
Use of adversarial filtering via multi-model evaluation during synthetic question generation to ensure challenging, well-formed, uniquely answerable tasks that elicit targeted browsing failures.

Datasets

K-BrowseComp-Verified — 300 questions — publicly released, manually constructed and validated by Korean native speakers.
K-BrowseComp-Synthetic — 100 questions — generated using LLM agents with failure-mode targeting and adversarial filtering, publicly released.

Baselines vs proposed

GPT-5.5: accuracy = 45.67% on K-BrowseComp-Verified vs 84.4% on BrowseComp (English)
DeepSeek-V4-Pro: accuracy = 30.00% Verified vs 83.4% on BrowseComp
GLM-5.1: accuracy = 30.67% Verified
Korean proprietary LLM K-EXAONE-236B: accuracy = 10.33% Verified
GPT-5.5: accuracy = 26.00% on Synthetic split (lower than Verified 45.67%)
GPT-5.4-mini: accuracy = 0.00% on Synthetic (used in adversarial filtering)
Gemini-3.1-Flash-Lite: accuracy = 11.33% Verified, 11.00% Synthetic

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.02404.

Fig 4

Fig 4: Representative trajectory-level failures in K-BROWSECOMP. Each panel contrasts the required

Fig 2

Fig 2 (page 7).

Fig 3

Fig 3 (page 7).

Fig 5

Fig 5: Excerpt from the written instructions provided to contributors for constructing K-BROWSECOMP-

Limitations

Performance numbers reflect evaluation with fixed 10 search call budget; longer or adaptive search was not tested.
Synthetic question generation excludes failure mode F0 (incomplete trajectories), potentially missing some realistic errors.
Use of closed-access proprietary models limits reproducibility of exact baseline numbers.
No adversarial evaluations beyond internal failure-mode targeting; external attackers or query distributions not explored.
Question categories vary heavily in distribution between verified and synthetic splits, complicating direct comparisons.
Analysis is focused on Korean web contexts; generalization to other languages or domains is untested.

Open questions / follow-ons

Can more advanced state-maintenance architectures reduce constraint-tracking and candidate capture failures identified in K-BrowseComp?
How do multi-turn dialogue agents with external memory modules perform on these compositional Korean web browsing tasks?
Can the synthetic generation pipeline be extended to other languages and domains to scale agentic benchmarks more broadly?
What methods can improve agent calibration and final answer stabilization (F0 failure mode) under long-horizon multi-step retrieval?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, K-BrowseComp highlights the significant difficulty that even advanced LLM-based browsing agents face when operating in linguistically and culturally distinct web environments. This suggests that region- or language-specific web agents remain vulnerable to challenge questions that require maintaining complex intermediate states, verifying multiple constraints simultaneously, and integrating dispersed evidence from semi-structured sources. The detailed failure taxonomy can guide the design of adversarial CAPTCHAs or interaction tests that exploit these persistent agentic weaknesses, particularly in Korean contexts.

Moreover, the synthetic generation method leveraging failure-mode targeting could be adapted to create adversarial challenge questions dynamically, focusing on known browsing failure points. This illustrates a promising direction for evolving CAPTCHA systems that remain difficult for automated agents despite rapid LLM improvements, by requiring extended reasoning chains grounded in local knowledge that agents currently struggle to maintain consistently.

Cite

bibtex

@article{arxiv2606_02404,
  title={ K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts },
  author={ Nahyun Lee and Dongkeun Yoon and Guijin Son and Geewook Kim and Dayoon Ko and Jeonghun Park and Haneul Yoo and Jaewon Cho and Junghun Park and Changyoon Lee and Kyochul Jang and Jaeyeon Kim and Eunsu Kim and Woojin Cho and Seungone Kim },
  journal={arXiv preprint arXiv:2606.02404},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.02404}
}

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​