On the Complexity of the Matching Problem of Regular Expressions with Backreferences

Source: arXiv:2605.07289 · Published 2026-05-08 · By Soh Kumabe, Yuya Uezato

TL;DR

This paper studies the matching complexity of regular expressions with backreferences (REWBs) through the lens of fine-grained complexity, with an eye toward ReDoS-safe regex engines. The core problem is that classical regex matching is linear-time, but once backreferences are allowed, practical engines can become much more expensive; the paper asks which restricted backreference fragments still permit near-linear matching and which ones are provably hard.

The main result is a sharp boundary for 1-use REWBs: the authors give an O(n log^2 n)-time algorithm for matching, improving a recent O(n^2) algorithm for the ABCBD subproblem of Nogami and Terauchi (MFCS 2025), and then lift that to all 1-use REWBs via a canonical-form reduction. On the hardness side, they prove k-REWB matching cannot be solved in O(n^{2k-ε}) time under SETH, show W[2]-hardness when parameterized by expression length (strengthening prior W[1]-hardness), and prove a triangle-detection-based lower bound ruling out n^{1+o(1)} time for a fixed 2-use 2-REWB unless triangle detection itself becomes near-linear.

Key findings

1-use REWB matching is solvable in O(|w| log^2 |w|) time; Theorem 8 states the hidden constant is 2^{O(|E|^4)} after reduction to canonical form.
ABCBD matching improves from Nogami–Terauchi's O(|w|^2) to O(|w| log^2 |w|); Theorem 7 gives a uniform algorithm with hidden constant 2^{O((|A|+|B|+|C|+|D|)^2)}.
For any fixed k, k-REWB matching under SETH cannot be solved in O(n^{2k-ε}) time for any ε>0; Theorem 5 is stated for a fixed ω-use k-REWB expression Ψ_k.
REWB matching parameterized by expression size |E| is W[2]-hard; this strengthens earlier W[1]-hardness results for the same parameterization.
There exists a fixed 2-use 2-REWB Ψ such that matching cannot be solved in n^{1+o(1)} time unless triangle detection has an n^{1+o(1)} algorithm; Theorem 4 gives a conditional lower bound based on triangle detection.
Theorem 6 reduces any 1-use REWB to a disjunction of ABCBD instances plus a residual Regex E′, with K=O(|E|^2), |A_i|=O(|E|^2), |B_i|=O(|E|), |C_i|=O(|E|^2), |D_i|=O(|E|).
Theorem 9 solves the auxiliary XY^αZ problem in O(|w| log |w|) time for fixed α≥2, with hidden constant 2^{O(s^2)} + α·2^{O(s)} where s=|X|+|Y|+|Z|.

Threat model

The adversary controls the input string and aims to trigger worst-case matching cost in a fixed regular-expression engine; the expression itself is fixed by the application. In the hardness reductions, the adversary is computational: if a matching algorithm existed with the claimed faster-than-bound runtime, it would imply unexpectedly fast algorithms for SETH-hard problems, k-OV, or triangle detection. The model does not assume the attacker can change the regex, observe internals, or exploit memory corruption; the only resource being attacked is algorithmic time on valid inputs.

Methodology — deep read

The threat model is the standard ReDoS setting: the expression is fixed by the application, while the attacker controls the input string w and tries to force worst-case matching behavior. The paper is not about learning a model or classifying benign versus malicious inputs; it is about exact worst-case matching complexity for REWBs under fixed-expression semantics. The authors treat the expression as either fixed or as part of the input depending on the theorem. For the hardness side, the adversary is a reduction adversary showing that if you could match these REWBs faster than the claimed bound, you would also solve SETH-hard or triangle-detection-hard problems faster. For the algorithmic side, the assumption is that the expression is fixed, and the task is to match a single string against it in near-linear time.

The paper does not use empirical data, benchmark corpora, or trainable components. There are no datasets, labels, splits, or preprocessing pipelines. Instead, the objects of study are formal languages and reductions. The main “input” to the algorithms is a string w together with fixed Regex fragments A,B,C,D or a fixed REWB E. The paper explicitly notes uniformity: the algorithms take the relevant expression(s) together with w, preprocess based on the expression(s), and then process w; there is no non-uniform advice. For the algorithmic theorem on 1-use REWBs, the authors first prove a canonical decomposition theorem (Theorem 6) that turns an arbitrary 1-use REWB into a disjunction of ABCBD instances plus a leftover plain Regex part E′. The size blowups are explicit: quadratic in |E| for the branches and O(|E|^2) for each Regex component, with K=O(|E|^2). This reduction is central because it lets them focus on a single structured pattern A(B)xC\xD.

The ABCBD problem asks whether a string w can be split as w = w_A w_B w_C w_B w_D with each piece accepted by a fixed Regex A,B,C,D. The authors normalize the instance first, including reducing to a binary alphabet Σ={0,1} and ensuring ϵ is not in L(D), which lets them assume w_D is nonempty. They then split the problem into two subproblems. In the XYYZ case (their name for the α=2 case with w_C=ϵ), the string contains an adjacent repetition w_B w_B, and the algorithm exploits the periodic structure induced by that overlap. They use Crochemore-style maximal local power enumeration together with periodicity properties of regular languages / transition monoids to handle this part. In the more general branching ABCBD case, they work on the suffix tree of w. The key combinatorial object is a node v corresponding to a candidate w_B, and they maintain sets of suffix-start indices in a heavy-light style decomposition: H_v is the heavy child’s index set and L_v the light remainder. For each i in L_v, they need to find a j in H_v such that the suffixes starting at i and j form a feasible decomposition. This is where the algorithm splits into Right-B and Left-B cases based on whether the shorter suffix w_B w_D starts at j or the longer suffix w_B w_C w_B w_D starts at j. The “branching” condition says the two candidate suffixes diverge at the node for w_B; this enables suffix-tree-based navigation rather than brute-force scanning.

The architecture is highly nontrivial and built out of several subroutines. For the Right-B case, they need to enforce the C- and D-constraints while w_B changes as they traverse suffix-tree nodes. To manage the D-constraint, they partition candidate indices into a constant number of equivalence classes using the finite transition monoid of Regex, then maintain separate instances of a custom RightEnumerator data structure for each class and merge them with disjoint sets. A straightforward segment-tree implementation yields O(n log^3 n), and the paper claims the extra logarithm is removed by using factorization forests, reaching O(n log^2 n). For the Left-B case, they need to find j<i such that the gap from j+|w_B| to i matches C, but the constraint depends on |w_B|. They split this further into near-j and far-j. In near-j, overlapping occurrences induce periodicity that can be exploited in O(log n) time. In far-j, the distance is large enough to support a more elaborate data-structure treatment. The paper says the full far-j machinery is spread across Sections 10–12 and involves LeftEnumerator and auxiliary structures, but the excerpt does not fully spell out every invariant; still, the overall pattern is a heavy-light traversal plus specialized interval/query structures over suffix-tree leaves.

The evaluation protocol is purely theoretical. There are no measured runtimes or experimental comparisons; the correctness and complexity claims are proofs. The main complexity metrics are worst-case running time in |w| for fixed expressions, and conditional lower bounds under SETH, k-OV, and triangle-detection hypotheses. The comparison baselines are prior algorithms and prior hardness results: the standard dynamic-programming algorithm for k-REWB matching runs in O(|E||w|^{1+2k}), Nogami and Terauchi’s ABCBD algorithm runs in O(|E|^2|w|^2), and the previous parameterized hardness was W[1]-hardness. The paper strengthens these results to W[2]-hardness and better fine-grained lower bounds. Reproducibility is partially supported by the fact that the paper is a full technical theory paper with detailed section-by-section algorithmic decomposition, but there is no code release, implementation, or benchmark suite in the provided text. The proof techniques named explicitly include suffix trees, transition monoids, factorization forests, periodicity arguments, and reductions from triangle detection and orthogonal vectors. A concrete end-to-end example is the ABCBD task: given w, the algorithm first normalizes the expression, then builds a suffix tree of w, classifies candidate suffix starts into heavy and light sets at each suffix-tree node, uses a monoid-based partition to preserve D-constraints, and then queries the appropriate enumerator to test whether there exists j producing w=w_A w_B w_C w_B w_D with each segment accepted by the corresponding Regex. The final running time is O(|w| log^2 |w|), with constants exponential in the combined expression size.

Technical innovations

A canonical-form theorem reduces every 1-use REWB to a disjunction of ABCBD instances plus a residual plain Regex, with explicit polynomial blowup in |E|.
A near-linear O(|w| log^2 |w|) algorithm for ABCBD uses suffix trees, heavy-light decomposition over suffix-tree leaves, transition-monoid partitioning, and factorization forests.
A separate O(|w| log |w|) algorithm for XY^αZ (equivalently X(Y)x(\x)^{α-1}Z) isolates the repeated middle block and exploits periodicity.
The paper strengthens parameterized hardness from W[1]-hardness to W[2]-hardness when parameterized by expression size.
It gives fine-grained lower bounds showing k-REWB matching needs O(n^{2k-ε}) time or more under SETH, and fixed 2-use 2-REWB matching is triangle-detection-hard to near-linear time.

Baselines vs proposed

Standard DP for k-REWB matching: O(|E|·|w|^{1+2k}) vs proposed 1-use REWB algorithm: O(|w| log^2 |w|) after canonical reduction.
Nogami and Terauchi (MFCS 2025) for ABCBD: O(|E|^2|w|^2) vs proposed: O(|w| log^2 |w|).
Segment-tree-based implementation of RightEnumerator: O(|w| log^3 |w|) vs proposed factorization-forest-based version: O(|w| log^2 |w|).
For XY^αZ, the paper contrasts prior generic dynamic programming-style approaches implicitly with its O(|w| log |w|) algorithm; no explicit competing runtime is quoted in the provided text.

Limitations

The near-linear-time result is only proved for 1-use REWBs; the complexity of r-use 1-REWB for any constant r≥2 remains open.
The 1-use algorithm still has a log^2 n factor, so it is not truly linear-time.
The hidden constants are large and depend exponentially on expression size (e.g., 2^{O((|A|+|B|+|C|+|D|)^2)} and 2^{O(|E|^4)}), which may limit practical use.
The algorithmic sections are intricate and multi-stage; the authors themselves call simplification a meaningful open problem.
The results do not cover other practical regex features such as lookaround, greedy/lazy semantics, or atomic grouping, which the paper flags as important future work.
No implementation, benchmarking, or empirical validation is provided in the excerpt, so practical performance remains untested.

Open questions / follow-ons

Can the O(|w| log^2 |w|) algorithm for 1-use REWBs be simplified or improved to true linear time?
What is the complexity of r-use 1-REWB matching for fixed r≥2?
Can the exponential dependence on expression size in the near-linear algorithm be reduced to polynomial while preserving near-linear dependence on |w|?
How do these techniques extend to lookaround, atomic grouping, or other practical regex features that real engines support?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the paper is mainly a warning and a design guide: backreferences are a structural source of worst-case matching cost, and even restricted forms can remain hard. If a validation or challenge-generation pipeline uses regexes with backreferences, the result suggests that simply relying on a generic regex engine is not enough to guarantee ReDoS safety; the exact fragment matters. In practice, a defender would want to inventory which regexes are 1-use-like versus more general, avoid attacker-controlled backreference patterns, and treat near-linear theoretical bounds as upper bounds only after checking constant factors and unsupported language features. The paper also implies that safe handling may require either restricting the regex feature set or using specialized matchers for known tractable fragments, rather than assuming all “regex” behaves like classical Thompson NFA matching.

Cite

bibtex

@article{arxiv2605_07289,
  title={ On the Complexity of the Matching Problem of Regular Expressions with Backreferences },
  author={ Soh Kumabe and Yuya Uezato },
  journal={arXiv preprint arXiv:2605.07289},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.07289}
}

On the Complexity of the Matching Problem of Regular Expressions with Backreferences ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​