Breaking Changes in Software Ecosystems: A Systematic Literature Review

Source: arXiv:2605.24397 · Published 2026-05-23 · By Juntao Chen, Tingting Bi, Yanlin Wang, Patanamon Thongtanunam

TL;DR

This paper presents the first comprehensive systematic literature review focused specifically on breaking changes in modern software ecosystems. The authors survey 97 empirical studies spanning five major ecosystems (Maven/Java, npm/JavaScript, Python, Web APIs, and Linux distributions) and covering breaking change classification, causes, impact, detection, and management. They develop a novel four-dimensional taxonomy of breaking changes across Nature, Detectability, Scope, and Visibility. The survey finds that syntactic breaking changes are well studied and reasonably detectable, whereas behavioral breaks remain difficult to identify automatically. Semantic Versioning, the dominant compatibility signaling scheme, is frequently violated, which undermines trust in the scheme. Transitive dependencies are a major source of breakage, and diagnosing them is hard because of information asymmetry between maintainers and consumers. The authors identify three pressing research challenges (behavioral break detection at scale, Semantic Versioning failures, and transitive dependency propagation) and three research opportunities, including large language models for behavioral contract inference and ecosystem-level dependency graph intelligence. Overall, the paper maps the breaking change problem space, summarizes the current state of research, and charts future directions for improving ecosystem reliability.

Key findings

Developed a four-dimensional taxonomy of breaking changes covering Nature (syntactic vs behavioral), Detectability (compile, link, runtime), Scope (direct, transitive, widespread), and Visibility (public, internal, beta, deprecated)
Maintenance and design improvements account for more breaking changes than new feature additions
Identified 43 detection techniques achieving high accuracy on syntactic breaks but limited coverage on behavioral breaks
Documented 66 strategies for communicating, preventing, and recovering from breaking changes, organized by actor roles
Semantic Versioning is widely adopted but frequently violated—many non-major releases contain breaking changes
Behavioral breaking changes represent 68.1% of breaks in one npm dataset but receive less research attention due to detection difficulty
Transitive dependency propagation is the largest source of client-impacting breakages and suffers from information asymmetry challenges
Research on breaking changes has grown steadily since 2010, with 63% of papers published since 2020
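The four taxonomy dimensions listed in the first finding can be sketched as a small data model. This is a minimal illustration built only from the dimension and value names given in this summary; the class names, the `BreakingChange` record, and the example instance are hypothetical, and the paper's full taxonomy may contain finer-grained categories.

```python
from dataclasses import dataclass
from enum import Enum


class Nature(Enum):
    SYNTACTIC = "syntactic"
    BEHAVIORAL = "behavioral"


class Detectability(Enum):
    COMPILE = "compile"
    LINK = "link"
    RUNTIME = "runtime"


class Scope(Enum):
    DIRECT = "direct"
    TRANSITIVE = "transitive"
    WIDESPREAD = "widespread"


class Visibility(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    BETA = "beta"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class BreakingChange:
    """One breaking change labelled along the four taxonomy dimensions."""
    description: str
    nature: Nature
    detectability: Detectability
    scope: Scope
    visibility: Visibility


# Hypothetical example: a public method is removed, which a compiler would flag.
removed_method = BreakingChange(
    description="public method removed from a library class",
    nature=Nature.SYNTACTIC,
    detectability=Detectability.COMPILE,
    scope=Scope.DIRECT,
    visibility=Visibility.PUBLIC,
)

# Number of distinct label combinations these four dimensions allow.
label_space = len(Nature) * len(Detectability) * len(Scope) * len(Visibility)
print(label_space)  # 72
```

Treating each dimension as an independent enum keeps the labels orthogonal, so a change can be, for instance, behavioral and transitive at the same time.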

Threat model

n/a; the paper is a systematic literature review focused on software ecosystem reliability rather than security threat modeling. The closest analogue to an adversary in this context is a maintainer who inadvertently ships a breaking change, not a malicious actor.

Methodology — deep read

The authors conduct a systematic literature review following Kitchenham and Charters' guidelines. In April 2026 they searched three academic databases (IEEE Xplore, ACM Digital Library, Springer) using a Boolean query of 24 breaking change-related terms, applied to full text and metadata. The search covered computer science papers from 2010 to 2026 and yielded 2,995 results, reduced to 2,750 unique papers after de-duplication. They applied explicit exclusion criteria (e.g., fewer than 5 pages, no empirical evaluation, peripheral mention only, grey literature, duplicates), with an LLM assisting the initial screening, which excluded about 2,695 papers. The remaining 55 papers were manually reviewed against the inclusion criteria (empirical study or evidence on breaking changes), resulting in 50 seed papers. Backward and forward snowballing over the seed studies' references and Google Scholar citations then added 47 more papers, with the 2010 cutoff relaxed to include influential older works, for a total of 97 primary studies.
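The selection funnel in the paragraph above can be checked with simple arithmetic. The sketch below uses only the counts reported there; the variable names are illustrative, and the exclusion count is approximate in the source.

```python
# Counts as reported in the summary; the 2,695 exclusion figure is approximate.
retrieved = 2995
unique = 2750
excluded_by_criteria = 2695
seed_papers = 50
snowballed = 47

duplicates_removed = retrieved - unique
manually_reviewed = unique - excluded_by_criteria
rejected_in_manual_review = manually_reviewed - seed_papers
primary_studies = seed_papers + snowballed

print(f"duplicates removed:        {duplicates_removed}")
print(f"manually reviewed:         {manually_reviewed}")
print(f"rejected in manual review: {rejected_in_manual_review}")
print(f"primary studies:           {primary_studies}")
```

The derived values (245 duplicates, 55 papers for manual review, 5 rejected, 97 primary studies) agree with the figures stated in the text.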

Data extraction employed a 10-item schema covering bibliographic info, breaking change types, reasons, impacts, detection techniques, prevention strategies, and mitigation approaches. Two authors independently extracted and cross-validated the data. For taxonomy construction (RQ1), they performed open card sorting in three iterative rounds: extraction of terms, clustering into normalized types, then hierarchical organization into four dimensions. For thematic analysis (RQ2-4), they applied line-by-line coding, theme generation, and synthesis following Cruzes and Dybå's guidelines. Results were validated through cross-checking and iteration to consensus.

The final synthesis presents a multi-dimensional taxonomy and thematic insights into reasons, impact, detection methods, and management strategies. Detection approaches include static, dynamic, learning-based, and hybrid techniques. The evaluation was based on a qualitative meta-synthesis mapping methodologies and evidential support reported by primary studies. The survey does not perform new empirical experiments but rigorously aggregates existing findings.

For example, in one npm dataset [S1], 68.1% of detected breaking changes were behavioral in nature, revealing detection gaps. Similarly, Semantic Versioning compliance studies showed frequent violations where minor/patch versions contained breaking changes, misleading downstream clients about safety. Overall, the methodology is comprehensive, rigorous, and transparent, leveraging both automatic assistance and manual expert curation.
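To make the Semantic Versioning violation described above concrete, here is a minimal sketch of the check such compliance studies imply: a release that contains a breaking change but is not signalled as a major bump counts as a violation. The function names are hypothetical, the parser handles only plain MAJOR.MINOR.PATCH strings, and the exemption for 0.y.z versions follows the SemVer 2.0.0 rule that anything may change during initial development. Whether a release actually contains a breaking change is taken as an input here, since detecting it is the hard part the survey discusses.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify an upgrade as 'major', 'minor', or 'patch'."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def violates_semver(old: str, new: str, has_breaking_change: bool) -> bool:
    """True when a breaking change ships in a release not signalled as major."""
    if parse_version(old)[0] == 0:
        # SemVer 2.0.0 treats 0.y.z as initial development: no stability promise.
        return False
    return has_breaking_change and bump_kind(old, new) != "major"


print(violates_semver("1.4.2", "1.5.0", has_breaking_change=True))  # True
print(violates_semver("1.4.2", "2.0.0", has_breaking_change=True))  # False
```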

Technical innovations

A novel four-dimensional taxonomy synthesizing breaking change characteristics across Nature, Detectability, Scope, and Visibility dimensions, capturing both technical and normative aspects
Comprehensive mapping and comparative analysis of 43 state-of-the-art breaking change detection techniques across ecosystems, highlighting coverage gaps especially for behavioral breaks
Systematic synthesis of 66 ecosystem-wide strategies for communicating, preventing, and recovering from breaking changes tailored by actor roles (maintainers, consumers, ecosystem managers)
Identification of three key open challenges (behavioral break detection at scale, Semantic Versioning's trust failure, and transitive dependency break propagation) and three research opportunities (LLM-augmented behavioral contract inference, ecosystem dependency graph intelligence, domain-specific ML tooling)

Datasets

npm breaking change dataset by Kong et al. — 1,519 breaking changes — public, referenced as [S1]
97 primary empirical studies collected in this SLR — heterogeneous datasets spanning Maven, npm, Python, Web APIs, Linux distributions — aggregated from academic literature

Baselines vs proposed

Static detection techniques: high accuracy on syntactic breaks (compilation-time failures) — behavioral break detection accuracy remains limited
Semantic Versioning adherence: violations are frequent; many minor and patch updates introduce breaking changes, contrary to SemVer expectations
Behavioral breaks account for 68.1% of breaks in the npm dataset [S1], versus 31.9% for syntactic breaks, highlighting detection gaps
Transitive dependency breakage causes more client impact than breakage in direct dependencies and is harder to diagnose
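The gap between syntactic and behavioral detection noted in the list above can be illustrated with a toy example. The three `parse_ids_*` functions are hypothetical versions of one library function, and the two checkers are simplified stand-ins for static and dynamic detection, not techniques taken from the surveyed papers. A signature comparison catches the added required parameter but misses a change in return semantics, which only shows up when both versions are run on an input that exercises the difference.

```python
import inspect


def parse_ids_v1(text):
    """Hypothetical v1: returns IDs in input order, keeping duplicates."""
    return [int(tok) for tok in text.split(",")]


def parse_ids_v2(text):
    """Hypothetical v2: same signature, but now returns sorted unique IDs."""
    return sorted({int(tok) for tok in text.split(",")})


def parse_ids_v3(text, sep):
    """Hypothetical v3: adds a required parameter, so old call sites fail."""
    return [int(tok) for tok in text.split(sep)]


def signature_changed(old, new):
    """Static-style check: compare the declared call signatures only."""
    return inspect.signature(old) != inspect.signature(new)


def behavior_changed(old, new, sample_inputs):
    """Dynamic-style check: run both versions and compare their outputs."""
    return any(old(x) != new(x) for x in sample_inputs)


print(signature_changed(parse_ids_v1, parse_ids_v3))            # True
print(signature_changed(parse_ids_v1, parse_ids_v2))            # False
print(behavior_changed(parse_ids_v1, parse_ids_v2, ["3,1,2"]))  # True
print(behavior_changed(parse_ids_v1, parse_ids_v2, ["1,2,3"]))  # False
```

The last line shows why behavioral detection depends on input coverage: with an already sorted, duplicate-free input, the two versions agree and the break goes unnoticed.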

Limitations

The review relies on primary studies with heterogeneous methodologies and varying dataset quality, limiting direct quantitative synthesis
Behavioral breaking changes are underrepresented in existing empirical studies because they are hard to measure, which may bias both the taxonomy and the detection survey
Semantic Versioning violation statistics and impact analysis are ecosystem-dependent, and generalizability may vary
Lack of large-scale empirical validation or benchmarking of detection and mitigation strategies beyond reported accuracies in primary studies
Snowballing may miss some unpublished or grey literature sources that could provide additional insights
The survey does not address adversarial actor models or deliberate break introduction from a security perspective

Open questions / follow-ons

How to reliably detect behavioral breaking changes at scale with automated methods that generalize beyond syntactic analysis?
What new ecosystem-wide trust or signaling mechanisms can replace or augment Semantic Versioning to reduce unsafe upgrades?
How can transitive dependency breakage be predicted and mitigated considering asymmetrical information between maintainers and consumers?
What domain-specific tooling can address silent behavioral drift in emerging ecosystems such as machine learning libraries?
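As a starting point for the transitive dependency question above, the sketch below walks a reverse dependency graph to list every package that could be reached by a break, together with its distance from the broken package. The package names and edges are invented for illustration, and reachability only gives an upper bound: it does not say whether a given client actually uses the changed behavior.

```python
from collections import deque

# Hypothetical dependency edges: package -> packages it depends on directly.
DEPENDS_ON = {
    "app": ["web-framework"],
    "web-framework": ["http-client"],
    "http-client": ["url-parser"],
    "cli-tool": ["url-parser"],
    "url-parser": [],
}


def reverse_edges(depends_on):
    """Invert the graph: package -> packages that depend on it directly."""
    dependents = {pkg: [] for pkg in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(pkg)
    return dependents


def affected_clients(depends_on, broken_pkg):
    """Breadth-first search returning {client: distance from broken_pkg}."""
    dependents = reverse_edges(depends_on)
    distance = {}
    seen = {broken_pkg}
    queue = deque([(broken_pkg, 0)])
    while queue:
        pkg, dist = queue.popleft()
        for client in dependents.get(pkg, []):
            if client not in seen:
                seen.add(client)
                distance[client] = dist + 1
                queue.append((client, dist + 1))
    return distance


impact = affected_clients(DEPENDS_ON, "url-parser")
direct = sorted(p for p, d in impact.items() if d == 1)
transitive = sorted(p for p, d in impact.items() if d > 1)
print("direct:", direct)
print("transitive:", transitive)
```

In this toy graph, `app` never declares `url-parser` as a dependency yet sits three hops downstream of it, which is the kind of information asymmetry the survey highlights.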

Why it matters for bot defense

Bot-defense and CAPTCHA systems are increasingly built atop complex software ecosystems with many third-party dependencies. Understanding breaking changes in these ecosystems helps practitioners anticipate and mitigate risks from dependency upgrades that might introduce incompatibilities or behavioral regressions breaking security or detection logic. The taxonomy and surveyed detection strategies inform how to classify and prioritize changes affecting CAPTCHA pipelines and infrastructure code. The identified challenges, such as incomplete behavioral break detection and Semantic Versioning violations, alert engineers to gaps in current tooling that may cause silent failures impacting bot-defense reliability. The paper’s research opportunities, including LLM-augmented contract inference and dependency graph intelligence, suggest promising avenues to develop more robust update management techniques critical for maintaining secure, dependable CAPTCHA services amid rapidly evolving dependencies and transitive breakages.

Cite

bibtex

@article{arxiv2605_24397,
  title={Breaking Changes in Software Ecosystems: A Systematic Literature Review},
  author={Juntao Chen and Tingting Bi and Yanlin Wang and Patanamon Thongtanunam},
  journal={arXiv preprint arXiv:2605.24397},
  year={2026},
  url={https://arxiv.org/abs/2605.24397}
}
