Skip to content

AI4SE and SE4AI Exploration: A Decade Looking Back and Forward

Source: arXiv:2606.19630 · Published 2026-06-17 · By H. Sinan Bank, Daniel R. Herber, Thomas Bradley

TL;DR

This article provides a comprehensive retrospective and prospective analysis of the intersection between artificial intelligence (AI) and systems engineering (SE), a field branded AI4SE/SE4AI. It traces the evolution over roughly the last six years (2020–2025) across three phases: foundational (conceptual frameworks and definitions), applied (practical experimentation and early deployments), and LLM inflection (recent surge in large language model adoption and validation-focused tools). The study uniquely combines a systematic literature review with a human-AI agreement analysis on 1,712 INCOSE INSIGHT articles and 889 SERC publications to identify consensus relevant works and quantify AI model reliability in topic triage. From this, the authors distill convergence points, critical research gaps, and practical implications for systems engineering practitioners integrating AI.

Key findings

  • From 2020 onward, the share of INCOSE INSIGHT articles addressing AI+SE rose fourfold from under 2% to about 8%.
  • Human expert and proprietary AI models (Gemini 3.1, Claude Opus 4.6, GPT-5.2) showed 97.8%–99.0% agreement on INSIGHT corpus relevance judgments; local open-source models varied widely, from 80.8% agreement (Mixtral) up to 96.7% (DeepSeek).
  • Cohen’s kappa for proprietary models ranged 0.67–0.77 on INSIGHT and 0.66–0.85 on SERC title-only judgments, indicating moderate to strong agreement beyond chance; local models ranged from near chance (κ=0.10) to comparable (κ=0.71).
  • 33 articles showed consensus relevance between human and at least one proprietary model; these averaged 9.5 citations each versus 2.4 for general INSIGHT articles—a 7.1 citation advantage.
  • The field has experienced three phases: Foundational (2020–21) focused on conceptual roadmaps; Applied (2022–23) saw NLP for requirements and ontology-based knowledge graph use mature; LLM Inflection (2024–25) saw rapid uptake of large language models in systems engineering tools and workflows.
  • Augmented intelligence (AI as an augmenting, not replacing, tool) is widely accepted as the dominant paradigm in AI4SE/SE4AI.
  • Five critical gaps identified: lack of industrial-scale empirical validation, conflation of software and systems engineering tasks, absence of systems engineering-specific AI benchmarks, missing traceability frameworks for AI-generated artifacts, and underdeveloped workforce transformation curricula.

Threat model

n/a — This is a meta-research and literature review paper, not focused on adversarial threats or direct security implications.

Methodology — deep read

The authors took a multi-pronged approach combining literature review, human expert relevance assessment, and AI model triage to characterize the state of AI in systems engineering.

  1. Threat Model & Assumptions: The adversary is not adversarial in a traditional sense but the study assumes domain experts and AI models both attempting to classify relevance of papers to the AI+SE domain correctly. The human expert's classifications serve as reference comparators rather than gold standard ground truth.

  2. Data: The data comprise two corpora—1,712 INCOSE INSIGHT articles indexed by Google Scholar (title-only judged) and 889 SERC publications (judged by title-only and title+abstract). A human expert identified 46 INSIGHT articles as relevant; AI models judged all articles' titles for relevance.

  3. Algorithm and Models: Six AI models evaluated relevance—three proprietary cloud-hosted (Gemini 3.1 Pro Preview, Claude Opus 4.6, GPT-5.2) and three open-source locally hosted models (Mixtral 8x22B, Llama 3.3 70B, DeepSeek R1 70B). These models independently classified papers as relevant or not, with SERC corpus including finer-grained categories for relevance direction.

  4. Training Regime: Not detailed for the models since these are pre-existing language models leveraged for zero-shot or few-shot relevance judgments. No retraining was conducted; focus was on model outputs and agreement.

  5. Evaluation Protocol: Agreement metrics included simple percentage agreement and Cohen’s kappa to account for chance. Title-only vs. title+abstract input conditions were compared for the SERC corpus to measure stability and model changes of mind. Citation counts from Google Scholar were analyzed to assess impact difference between agreed-relevant and general articles.

  6. Reproducibility: The human-AI agreement dataset and the AI4SE/SE4AI Explorer web application codebase are publicly available with links. The SERC corpus is non-public but partially indexed for human relevance judgments.

One concrete example: For each paper, the human expert read only the title to assign relevance. Simultaneously, each AI model was provided the paper title as input and independently produced a relevance classification. The authors then computed pairwise percentage agreement and Cohen’s kappa between each model and the human judgments across the entire corpus, revealing substantial variability especially among open-source models. When adding abstracts for SERC papers, proprietary models showed stable agreement (~6% papers reclassified), confirming title-based judgments largely captured relevance, while open-source models showed more volatility (15–37% reclassified). This comparison underscores differential model reliability and the importance of input scope.

Technical innovations

  • Systematic human-AI agreement study to assess domain-specific literature triage reliability across six diverse AI language models.
  • Development and public release of the AI4SE/SE4AI Explorer web application enabling users to compare their own relevance judgments with both human and AI raters on SE-AI literature.
  • Proposal of a three-phase evolution framework for AI+SE research: foundational, applied, and LLM inflection phases, explaining shifts in maturity and focus.
  • Identification and quantification of research gaps specific to AI for systems engineering distinct from software engineering, including the need for SE-specific benchmarks and traceability for AI-generated artifacts.

Datasets

  • INCOSE INSIGHT corpus — 1,712 articles — Publicly indexed via Google Scholar
  • SERC publications corpus — 889 publications — Restricted archive from Systems Engineering Research Center
  • Human–AI agreement dataset — Released at Harvard Dataverse (https://doi.org/10.7910/DVN/IKLUYN)

Baselines vs proposed

  • Human expert agreement baseline: N/A (reference comparator)
  • Proprietary models on INSIGHT corpus: agreement with human 97.8% (GPT-5.2), 99.0% (Opus 4.6)
  • Local open-source models on INSIGHT: agreement 80.8% (Mixtral) to 96.7% (DeepSeek)
  • Cohen’s kappa proprietary models on INSIGHT: 0.67–0.77; local models range 0.10–0.71
  • Citation impact: Agreed relevant AI articles mean citations = 9.5 vs general INSIGHT articles = 2.4

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.19630.

Fig 3

Fig 3: Mean citations per article: agreed-relevant AI articles versus general INCOSE INSIGHT articles.

Limitations

  • Human relevance judgments rely on a single expert without intra-rater reliability assessment, limiting consensus robustness.
  • Screening based only on titles for INSIGHT corpus assumes titles accurately reflect content, which may not always hold.
  • Local open-source AI model performance varied widely, indicating instability and possibly sensitive to architecture and training data.
  • Cohen’s kappa comparisons across corpora should be interpreted with caution due to prevalence variations in relevant articles (≈3% INSIGHT vs. ≈16–18% SERC).
  • Citation impact comparisons are confounded by venue selection bias and recency effects on citation accumulation.
  • The LLM inflection phase is emergent with relatively fewer agreed-relevant articles, partially due to limited indexing and recent publication lag.

Open questions / follow-ons

  • How can industrial-scale empirical validation studies for AI4SE tools be designed and conducted to establish practical effectiveness and reliability?
  • What domain-specific benchmarks beyond SysEngBench can be developed to meaningfully evaluate AI applications across diverse systems engineering tasks?
  • How can traceability and provenance frameworks for AI-generated and AI-modified systems engineering artifacts be standardized and incorporated into lifecycle tooling?
  • What curricula and competency models are required to prepare the systems engineering workforce for responsible AI adoption and oversight?

Why it matters for bot defense

For practitioners in bot-defense and CAPTCHA design, this paper underscores broader challenges in the reliable incorporation of AI tools in complex, safety- and governance-critical workflows. Key takeaways emphasize the augmented intelligence paradigm where human oversight remains essential given AI-generated artifact errors and classification variability. The documented variability between AI models in domain-specific triage illustrates cautionary lessons about AI reliability boundaries relevant to automated CAPTCHA generation, adaptive challenge design, and risk assessment. Additionally, the identified gaps around traceability and validation provide parallels for bot-defense environments that must maintain auditability and robust defense logic. Finally, this work highlights the value of community-shared benchmarks and transparent evaluation interfaces, which could inspire similar shared testbeds or human-AI agreement studies in CAPTCHA and bot-detection tool research.

Cite

bibtex
@article{arxiv2606_19630,
  title={ AI4SE and SE4AI Exploration: A Decade Looking Back and Forward },
  author={ H. Sinan Bank and Daniel R. Herber and Thomas Bradley },
  journal={arXiv preprint arXiv:2606.19630},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.19630}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution