Asymptotically Optimal Codes for Correcting Burst Deletions and Insertions in Labeled DNA Sequences

Source: arXiv:2606.14573 · Published 2026-06-12 · By Wenhao Liu, Zhengyi Jiang, Zhongyi Huang, Hanxu Hou

TL;DR

This paper addresses the challenge of correcting burst synchronization errors—specifically bursts of consecutive insertions or deletions—in labeling sequences used in DNA-based data storage. Fluorescent labeling enables random access, but biochemical noise induces burst indels in the label domain, corrupting data readout. The authors formalize burst t-deletion/insertion A-labeling codes to correct single bursts of length t ≥ 1 acting on sequences labeled over a minimal label set S of dinucleotide probes. They establish a fundamental information-theoretic lower bound on redundancy of any such code as log4 n + O(1), resolving the previously open question even for t=1. For t ≥ 2, they give explicit code constructions and encoding/decoding algorithms with O(n²) complexity, based on a novel “generalized run-length limited” constraint on the DNA sequences. The scheme achieves asymptotic optimality with redundancy matching the lower bound’s dominant term up to a small O(log log n) overhead. Extensive combinatorial arguments underpin the lower bounds, and the construction cleverly navigates the complex dependencies between DNA and label domains.

Key findings

Established redundancy lower bound of log4 n + O(1) for any burst t-deletion/insertion A-labeling code with t | n, for all t ≥ 1 (Theorem 4).
For burst length t ≥ 2 and code length n ≥ 7t + 3, proposed explicit encoding/decoding algorithms run in O(n²) time, practical for moderate block lengths.
Introduced a generalized ℓ-run-length-limited (RLL) constraint on DNA sequences that enforces a classical ℓ-RLL property on labeling sequences, enabling control of burst error localization (Theorem 5).
Achieved asymptotic redundancy r(n,t) = log4 n + (t−1) log4 log8/3 n + O(1), matching the fundamental redundancy lower bound up to a small O(log log n) term.
Demonstrated that the labeling map is non-surjective and many-to-one, complicating code design and analysis; constructed sphere-packing bounds operating strictly over the valid image space of labeling sequences.
Applied interleaving and run-based counting arguments to tightly characterize deletion balls and derive bounds on the size of valid received sequences.
Developed enumerative encoding/decoding based on rank/unrank algorithms with dynamic programming on constrained sequences, enabling efficient mapping between information and constrained codewords.

Threat model

The threat model assumes a single burst of t consecutive insertions or deletions (indels) occurring in the labeling sequence after the DNA strand is encoded and labeled. The adversary—representing biochemical noise or readout artifacts—can corrupt the labeling domain but cannot alter the underlying DNA sequence x, nor inject multiple bursts or substitutions. The boundary bases x1 and xn are assumed reliably known and unaltered, enabling unique decoding. The model and codes do not protect against multiple bursts, distributed random errors, or adversaries with the ability to modify DNA bases themselves.

Methodology — deep read

The paper addresses a DNA storage scenario where data is encoded into a DNA strand x over the quaternary alphabet {A,T,C,G}, then fluorescent labeling produces an 11-ary labeling sequence LS(x) over a minimal label set S of 10 dinucleotide patterns plus a mismatch label 0. The core problem is that the labeling sequence is corrupted by a burst of t consecutive insertions or deletions in the label domain, inducing synchronization errors that shift positions and corrupt reconstruction.

Threat Model & Assumptions: The adversary induces exactly one burst of t consecutive label insertions or deletions (indels) in the labeling sequence after encoding. The boundary bases x1 and xn of the DNA strand are assumed known, enabling unique reconstruction from valid labeling sequences.

Data & Preprocessing: The codewords are DNA sequences x ∈ Σ_4^n with n divisible by t, mapped via LS to labeling sequences in Σ_11^n. The labeling map is non-surjective and many-to-one, and the error acts on the labeling domain, not directly on x.

Algorithm & Architecture:

They model burst errors as a burst-t deletion or insertion acting on LS(x).
They prove a redundancy lower bound using sphere-packing arguments restricted to the valid image space of LS, leveraging interleaved subsequences and run-length counts of x.
To design codes meeting this bound, they impose a generalized run-length limited (RLL) constraint directly on x viewed as a sequence of DNA base pairs mapped to a 16-ary alphabet.
This generalized RLL constraint restricts consecutive symbols coming from the same equivalence class (defined by label mapping), ensuring the labeling sequence satisfies a classical RLL constraint tightly controlling long runs, thus limiting positional error ambiguity.
Encoding combines generalized RLL constraints with Varshamov–Tenengolts (VT) and shifted-VT (SVT) codes on a matrix representation of the DNA sequence to reduce burst errors to localizable single insertions/deletions.
The encoding/decoding algorithms utilize enumerative coding techniques based on ranking and unranking sequences satisfying the generalized RLL constraint, with precomputed tables for dynamic programming.

Training & Hyperparameters: Not applicable; this is a coding theory construct.

Evaluation Protocol:

They rigorously prove lower bounds on redundancy.
They provide explicit code constructions with proven asymptotic redundancy closely matching the lower bound.
The complexity of encoding/decoding is O(n²).
The paper includes theoretical analyses but no empirical experimental data.

Reproducibility:

The code constructions and algorithms are described in full detail.
The proofs are constructive.
No software release mentioned, but the formal nature aids replication.

Example:

For t=1, the paper clarifies the lower bound on redundancy log_4 n + O(1).
For t≥2 and n divisible by t, sequences are structured into interleaved subsequences to facilitate counting argument and code design.
The generalized RLL constraint ensures the labeling sequences avoid long identical runs, aiding decoder error localization.
Matrix construction applying VT and SVT codes localizes and corrects bursts. This comprehensive methodology tightly couples combinatorics, code construction, and error models specific to the DNA labeling channel.

Technical innovations

Formal introduction of burst t-deletion/insertion A-labeling codes correcting a single burst of length t ≥1 in the DNA labeling domain, a novel model reconciling the DNA encoding and noisy labeling channel.
Establishment of the first nontrivial information-theoretic redundancy lower bound of log4 n + O(1) for burst labeling codes, resolving a longstanding open problem even for t=1.
A novel generalized run-length limited (RLL) constraint on base-pair grouped DNA sequences that deterministically implies classical RLL constraints on labeling sequences, bridging domain mismatch.
Explicit O(n²)-time encoding and decoding algorithms combining generalized RLL constraints with Varshamov–Tenengolts and shifted-VT codes to localize and correct burst indels efficiently.
Sphere-packing bound technique applied strictly over the valid image of the labeling map to handle its non-surjective, many-to-one structure.

Baselines vs proposed

Standard q-ary burst deletion codes: redundancy lower bound roughly log11 n + O(1) vs proposed: log4 n + O(1), significantly tighter (Theorem 4).
Code redundancy for t≥2 from prior single-error models not applicable; proposed explicit code achieves redundancy within O(log log n) gap of fundamental lower bound.
Encoding and decoding time complexity: classical VT/SVT codes decode in O(n) time vs proposed combined scheme decodes in O(n²) due to generalized RLL constraints and matrix construction.

Limitations

Explicit code construction and asymptotic optimality proven only for burst length t ≥ 2 and code lengths n divisible by t satisfying n ≥ 7t + 3; t=1 remains intricate for explicit code.
Decoding complexity is O(n²), which may be prohibitive for very large block lengths typical in DNA storage applications.
No empirical evaluation on real or synthetic biochemical labeling data to validate practical error rates or decoding robustness under noise beyond a single burst indel.
The model assumes knowledge of boundary bases x1 and xn, which may be challenging or unrealistic in some DNA readout contexts.
No adversarial or distribution-shift robustness analysis beyond worst-case single burst error considered.
Code rate calculations rely on asymptotic polynomial expansions; finite length performance and overhead constants are not fully characterized.

Open questions / follow-ons

How to design explicit codes and efficient decoding algorithms for the single-error case t=1 with matching redundancy bounds?
What are the finite-length performance characteristics and overhead constants of the proposed codes in practical DNA storage scenarios?
Can the generalized RLL framework be extended to handle multiple bursts or combinations of burst and scattered errors effectively?
How robust are the codes under realistic biochemical noise models beyond worst-case bursts, including substitutions, interleaving errors, or coverage gaps?

Why it matters for bot defense

For bot-defense and CAPTCHA systems leveraging DNA-based random access or molecular tagging techniques, this work presents foundational coding advances for correcting burst synchronization errors in fluorescence-based labeling readouts. Ensuring reliable positional decoding amidst biochemical noise is critical for security: the generalized RLL codes and burst correction methods here provide a formal framework for designing robust label sets that tolerate localized indel noise bursts common in biochemical assays. While the problem is domain-specific, the conceptual separation between encoding constraints and error domains, and techniques for localizing synchronization errors, may inspire analogous methods in CAPTCHA challenges where burst insertion/deletion corruption occurs. The explicit lower bound provides theoretical guidance on the minimal redundancy required to secure reliable reconstruction under burst errors, an important metric for balancing label complexity and decoding reliability. However, practical application requires integration with experimental noise models and error patterns specific to molecular bot-defense implementations.

Cite

bibtex

@article{arxiv2606_14573,
  title={ Asymptotically Optimal Codes for Correcting Burst Deletions and Insertions in Labeled DNA Sequences },
  author={ Wenhao Liu and Zhengyi Jiang and Zhongyi Huang and Hanxu Hou },
  journal={arXiv preprint arXiv:2606.14573},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.14573}
}

Asymptotically Optimal Codes for Correcting Burst Deletions and Insertions in Labeled DNA Sequences ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​