Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition
Source: arXiv:2605.06098 · Published 2026-05-07 · By Denis Neumüller, Florian Sihler, Raphael Straub, Matthias Tichy
TL;DR
This paper addresses the problem of automatically identifying which algorithm a given piece of source code implements — a task relevant to software maintenance, quality assurance, and reverse engineering. Prior approaches either rely on formal semantic equivalence proofs (which are undecidable in the general case and therefore limited in scope) or machine learning classifiers (which cannot easily generalize to algorithms outside their training distribution and do not address multi-method search). The authors propose AlDeSCo, a prototype that uses a hand-crafted, embedded Java domain-specific language (DSL) to express structural search patterns over abstract syntax trees (ASTs), a matching algorithm that traverses those pattern trees against Spoon-parsed ASTs, and an initial catalog of patterns for six algorithms. Pattern authoring is done manually by a developer who inspects reference implementations found via web search and encodes the key structural features in the DSL.
The central novelty is the DSL itself: it provides typed builders for Java constructs (loops, assignments, binary operations, etc.), binding constraints that enforce variable identity across pattern nodes, two wildcard types (wide for siblings, depth for arbitrary nesting), alternatives (oneOf), optionals, and both ordered and unordered matching modes. Convenience wrappers like loop() (matching for/while/do-while/foreach), anyMod(), and compareCommutative() reduce pattern verbosity and improve recall by abstracting over common implementation variations.
Evaluation on a six-algorithm subset of BigCloneEval shows a macro-averaged F1 of 0.74 for AlDeSCo, compared to 0.35 for CodeLlama and a best recall of 0.20 for the strongest of four code clone detection (CCD) tools (CloneWorks, SourcererCC, NiCad, Oreo) when those tools are given reference implementations as queries. AlDeSCo's recall of 0.62 macro-averaged substantially outperforms all CCD baselines, with the largest gains on Type 3 and Type 4 clones (semantically similar but syntactically divergent implementations).
Key findings
- AlDeSCo achieves a macro-averaged F1-score of 0.74 on a six-algorithm subset of BigCloneEval (Fibonacci, Bubble Sort, Binary Search, and three others), outperforming CodeLlama's F1 of 0.35 on the same benchmark.
- AlDeSCo achieves a macro-averaged recall of 0.62 across six algorithms; the best-performing CCD tool tested (CloneWorks, SourcererCC, NiCad, or Oreo — paper does not name the single best explicitly in the truncated text) achieves only 0.20 recall on the same task.
- CodeLlama achieves similar recall to AlDeSCo but dramatically worse precision, causing its F1 to collapse to 0.35; AlDeSCo's higher F1 is therefore largely a precision advantage.
- Binary Search is identified as the weakest individual result for AlDeSCo — good precision but insufficient recall — suggesting that purely structural AST patterns struggle when algorithmic identity depends heavily on numeric boundary conditions or data-structure context rather than control-flow shape.
- AlDeSCo's runtime is stated to be orders of magnitude lower than CodeLlama's inference time, making it more practical for integration into CI pipelines (exact latency figures are not provided in the available text).
- The approach achieves good precision and recall for all six algorithms except Binary Search, implying that most imperative sorting and arithmetic algorithms have sufficiently distinctive AST skeletons to be captured by structural patterns alone.
- Type 3 and Type 4 clone recall is substantially higher for AlDeSCo than for all CCD tools, indicating that the hand-crafted semantic abstraction in DSL patterns captures implementation variation that text/token similarity metrics miss.
- The DSL design was iteratively refined: an early version was used to describe several algorithms, improvement opportunities were identified, and the convenience features (loop(), anyMod(), compareCommutative(), etc.) were added in a second pass — the final DSL is therefore shaped by empirical pattern-writing experience rather than a priori language design alone.
Threat model
n/a — this is a software engineering / program comprehension paper, not a security paper. The implicit operational model assumes a benign codebase being analyzed for maintenance or quality purposes. The 'adversary' is implementation variability (renamed identifiers, restructured control flow, interleaved unrelated statements), not a malicious actor. No attacker capabilities, evasion scenarios, or security properties are formally defined.
Methodology — deep read
Threat model and assumptions: This is not a security paper in the adversarial sense. The implicit assumption is a cooperative or neutral setting — a developer or CI tool running AlDeSCo on a Java codebase. The adversary is the variability of real-world implementations (renamed variables, restructured control flow, inlined helpers, added logging), not a malicious actor trying to evade detection. The system is explicitly a best-effort approach acknowledging that semantic equivalence is undecidable.
Data provenance and benchmark: The evaluation uses BigCloneEval, a publicly available benchmark of manually verified Java code clone pairs drawn from real-world open-source projects (IJaDataset). The authors use a six-algorithm subset covering algorithms such as Fibonacci, Bubble Sort, Binary Search, and three others (the full list is not entirely enumerated in the truncated text). BigCloneEval provides per-clone-pair labels and classifies clones into Types 1–4 by degree of syntactic divergence. The exact number of files or clone pairs used in the subset is not stated in the available text — this is a reproducibility gap.
Pattern authoring (training analog): There is no machine learning training. Instead, each search pattern is manually authored by a developer who (1) performs a web search for the algorithm name to find reference implementations, (2) identifies key structural features of those implementations, and (3) encodes those features in the Java-embedded DSL. This is analogous to writing hand-crafted rules and introduces author bias: patterns reflect the implementations the author happened to find and inspect. The paper does not quantify how many reference implementations were consulted per algorithm or how long pattern authoring took.
Architecture and algorithm: The pipeline has four stages. First, the developer writes a search pattern in the embedded Java DSL using typed builder objects (forLoop(), whileLoop(), binOp(), assignment(), varRead(), varWrite(), returns(), etc.). Builders are composable: each exposes construct-specific configuration (e.g., binOp().ops('<', '<=') restricts the operator). Second, calling build() on the root builder performs a recursive transformation into a pattern tree — an intermediate representation that separates pattern specification from matching logic. During this transformation, convenience methods are desugared into core primitives (e.g., after(stmt) inserts wideWildcard() nodes; anywhereInRhs(stmt) wraps content in depthWildcard()); binding identifiers are registered; matching order (ORDERED via .next() or UNORDERED via .has()) is resolved; and attribute predicates are generated. Third, Spoon (an open-source Java source analysis framework) parses each Java file in the target codebase into its AST. Fourth, the matching procedure walks the pattern tree against each AST. The matcher maintains an immutable matching-state set tracking: (a) already-matched node pairs as a bidirectional mapping, (b) binding constraints (identifier → AST node), and (c) current depth. Each pattern node returns a set of successor matching states rather than a single state, enabling the exploration of all possible wildcard consumptions and UNORDERED permutations. wideWildcard nodes generate up to N+1 states (consuming 0 through N siblings); UNORDERED blocks try all permutations of pattern children against AST children, pruning eagerly when types mismatch. A match is declared successful when all required pattern nodes have been satisfied and all binding constraints map to non-empty matched sets. The root node type (e.g., method()) determines valid AST entry points, limiting the search space.
Concrete end-to-end example: For the prime-factors algorithm, the developer writes a pattern asserting: a method body containing (after arbitrary statements) a loop whose condition uses < or <=, whose body contains a nested loop whose condition uses % or / with both operands bound to identifiers 'num' and 'index', whose body contains (after arbitrary statements) an assignment that writes 'num' and reads 'index' anywhere on the right-hand side, followed optionally by a return. Spoon parses a candidate file. The matcher finds all method nodes as entry points, attempts the pattern on getPrimeFactors(int n), the wideWildcard before the outer loop consumes the LinkedList declaration, the outer oneOf() selects forLoop, the inner oneOf() selects whileLoop, binding constraints confirm that the same variable appears in the condition and the assignment, and multiple valid terminal states are unified into a single positive match. The match location (line range of the method) is reported and can be injected as a comment.
Evaluation protocol: The primary metrics are precision, recall, and F1-score, computed per algorithm and then macro-averaged across the six algorithms. BigCloneEval provides the ground-truth labels. For the LLM comparison, CodeLlama is prompted (prompting strategy not detailed in available text) to classify whether a code snippet implements a given algorithm; its outputs are evaluated on the same benchmark. For the CCD comparison, four tools (CloneWorks, SourcererCC, NiCad, Oreo) are each provided with one or more reference implementations of each algorithm as queries; the tools retrieve candidate clones from the benchmark corpus, and recall is computed against ground truth. Results are also broken down by clone type (T1–T4) to assess where each approach succeeds or fails. No statistical significance tests are reported. No cross-validation is performed (patterns are fixed rules, not learned parameters). There is no held-out adversarial test set.
Reproducibility: The authors name their prototype AlDeSCo and use Spoon (a public framework). BigCloneEval is public. However, the paper does not explicitly state whether AlDeSCo's source code or the pattern catalog is publicly released, and the exact subset of BigCloneEval used (number of files, clone pairs per algorithm) is not fully specified in the available text, limiting independent replication.
Technical innovations
- A fluent, embedded Java DSL for expressing structural AST search patterns with typed construct builders, two-dimensional wildcards (wideWildcard for sibling-level skipping, depthWildcard for arbitrary-depth nesting), binding constraints that enforce cross-node variable identity, and explicit ordered/unordered matching modes — prior concept-recognition approaches (e.g., Mesnard et al., Metzger et al.) relied on formal semantic equivalence proofs or PDG/data-flow analysis rather than a developer-friendly pattern language.
- A pattern tree intermediate representation that decouples DSL specification from matching logic, enabling desugaring of convenience features into core primitives at compile time and allowing clean separation of the builder and matcher subsystems.
- An immutable, set-valued matching-state model that tracks all viable wildcard consumptions and UNORDERED permutations simultaneously, enabling correct handling of ambiguous matches (e.g., UNORDERED blocks where multiple variables could satisfy a binding) without backtracking restarts.
- A convenience layer (loop(), anyMod(), varDefOrAss(), compareCommutative(), ite().optionalOtherwise()) that abstracts over Java syntactic variants and commutativity, directly improving recall on Type 3/4 clones without requiring separate pattern variants — a practical engineering contribution absent from prior AST-matching tools.
- Empirical demonstration that hand-crafted AST structural patterns outperform both a state-of-the-art LLM (CodeLlama) and four code clone detection tools on the algorithm recognition task, establishing a new performance reference point on BigCloneEval for this specific problem formulation.
Datasets
- BigCloneEval (six-algorithm subset) — size not fully specified in available text; ground-truth clone pairs manually verified — public benchmark derived from IJaDataset (real-world open-source Java projects)
Baselines vs proposed
- CodeLlama (LLM, zero-shot or prompted classification): macro-averaged F1 = 0.35 vs AlDeSCo: F1 = 0.74
- Best CCD tool among {CloneWorks, SourcererCC, NiCad, Oreo} (reference-implementation-as-query): macro-averaged recall = 0.20 vs AlDeSCo: recall = 0.62
- CCD tools collectively: macro-averaged recall ≤ 0.20 (all four tools) vs AlDeSCo: recall = 0.62
- Binary Search (AlDeSCo per-algorithm): good precision, insufficient recall (exact figures not provided in available text) vs other five algorithms: good precision and recall
Limitations
- Pattern authoring is entirely manual and requires a skilled developer to read reference implementations and encode features in the DSL; there is no mechanism for automatically inducing patterns from examples, making the approach labor-intensive to scale to large algorithm catalogs.
- The evaluation covers only six algorithms on one benchmark (BigCloneEval subset); generalizability to algorithms with more complex or less structurally distinctive implementations (e.g., dynamic programming, graph algorithms) is undemonstrated.
- The approach is Java-specific: Spoon parses Java, the DSL builders correspond to Java constructs, and the pattern catalog targets Java idioms. Extension to other languages would require rebuilding the parser integration and all builder types.
- Binary Search already exhibits low recall, suggesting a fundamental ceiling for purely structural AST matching when algorithm identity depends on numeric invariants, index arithmetic correctness, or data-structure type — information not captured by AST shape alone.
- No adversarial or obfuscation evaluation: the paper does not test whether simple refactorings (e.g., converting a for-loop to recursion, using iterators instead of index variables, extracting helper methods) defeat the patterns, which is directly relevant to detecting intentional evasion.
- The exact number of BigCloneEval instances per algorithm, train/test split details (if any), and inter-rater agreement on ground truth labels are not reported, making statistical interpretation of the F1 scores difficult.
- No ablation study quantifying the individual contribution of each DSL feature (wildcards, binding constraints, unordered matching, convenience wrappers) to the final F1 score; it is therefore unclear which primitives are load-bearing for performance.
Open questions / follow-ons
- Can patterns be learned or synthesized automatically from a corpus of labeled algorithm implementations (e.g., via program synthesis or inductive logic programming over AST node sequences), reducing the manual authoring burden while preserving interpretability?
- How does recall degrade when algorithms are implemented recursively, split across multiple methods, or inlined into larger functions — and can the multi-pattern binding mechanism (used for quicksort partition+swap) scale to handle these cases systematically?
- Would augmenting AST patterns with lightweight data-flow or type information (e.g., array element access patterns, monotonic variable update detection) recover the recall lost on Binary Search and similar index-sensitive algorithms without sacrificing the DSL's developer-friendliness?
- How robust are the hand-crafted patterns to deliberate obfuscation or adversarial refactoring — a question directly relevant if the technique were applied to detect algorithm plagiarism or license-violation scenarios rather than benign maintenance?
Why it matters for bot defense
For bot-defense and CAPTCHA engineers, this paper is most directly relevant to the problem of client-side JavaScript or WASM analysis: detecting whether a bot's automation script, browser extension, or headless browser contains a known algorithm implementation (e.g., a specific hash function, CAPTCHA-solving routine, or fingerprint spoofing algorithm). AlDeSCo's approach — structural AST pattern matching with binding constraints and wildcards — is essentially a form of semantic code signature matching that is more robust to variable renaming and minor refactoring than token-level or line-level clone detection. A bot-defense team could adapt the DSL concept to a JavaScript AST (e.g., via Babel or Acorn) to build a catalog of patterns for known solver implementations, running it at submission time on deobfuscated client code or as a static analysis step on intercepted scripts.
However, the limitations are significant in this context. The manual pattern authoring bottleneck means the catalog will always lag behind new solver variants, and the paper shows that even carefully crafted patterns fail on structurally divergent implementations (Type 4 clones, Binary Search recall). A sophisticated bot operator who restructures their solver into recursion, splits it across closures, or transpiles it from another language would likely evade the patterns. The technique is better framed as a high-precision signal for known, poorly-obfuscated solver code than as a general-purpose bot detection method. Combined with other signals (behavioral biometrics, TLS fingerprinting, timing analysis), it could raise the cost of automation by forcing bot authors to substantially restructure their implementations for each new detection signature — a useful friction even if not a complete solution.
Cite
@article{arxiv2605_06098,
title={ Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition },
author={ Denis Neumüller and Florian Sihler and Raphael Straub and Matthias Tichy },
journal={arXiv preprint arXiv:2605.06098},
year={ 2026 },
url={https://arxiv.org/abs/2605.06098}
}