Private Information Retrieval for Large-Scale DNA-Based Data Storage

Source: arXiv:2606.14557 · Published 2026-06-12 · By Gökberk Erdoğan, Daniella Bar-Lev, Rawad Bitar, Antonia Wachter-Zeh, Zohar Yakhini

TL;DR

This paper explores the adaptation of Private Information Retrieval (PIR) protocols to synthetic DNA-based data storage systems. PIR allows a user to retrieve a record without revealing which record is requested, a well-studied primitive in digital storage. However, DNA-based storage differs significantly due to its biochemical access model, which lacks electronic-style random access and in-memory algebraic computations. The authors propose two novel two-server PIR protocols tailored to DNA storage capabilities, exploiting cheap random primer synthesis and PCR amplification that is independent of database size. These protocols balance privacy, efficiency, and biochemical feasibility while addressing the unique constraints of molecular operations.

The paper formulates three scenarios with increasing biochemical user capabilities and privacy requirements, analyzing information-theoretic privacy leakage and failure probabilities in each. One protocol uses base counting from sequencing data to recover the record but incurs high communication cost. The other protocol efficiently compresses answers using XOR of binary-encoded sequences, achieving PIR rate 1/2 with careful randomized primer pool design. Theoretical leakage bounds are derived, showing perfect information-theoretic privacy is achievable at appropriate parameter settings. Overall, the paper lays foundational work for secure, scalable molecular retrieval in large-scale DNA data storage, highlighting key trade-offs and implementation considerations.

Key findings

A direct implementation of classical two-server PIR in DNA storage requires O(n) specific primer synthesis per query, which is infeasible for large n.
Synthesizing random primers is inexpensive and PCR amplification time is independent of pool size, enabling new protocol designs exploiting these features.
Protocol 1, base counter approach, returns counts of A, C, G, T bases per position and has communication cost O(ℓ log n) bits.
Protocol 2, XOR approach, compresses answers to 2ℓ bits per server, achieving PIR rate 1/m = 1/2 for two-server setting.
Theorem 1: Fixing primer pool size r leads to privacy leakage ε ≥ 1 - log(1 + 1/n), preventing perfect information-theoretic privacy except for trivial n=1.
Theorem 2: Randomizing primer pool with Bernoulli probability p=1/2 yields zero privacy leakage (ε=0), achieving perfect IT privacy.
Scenario 2 and 3 require repeating protocol over multiple server pairs to achieve low failure probability, with m ≥ 2λ / (2ℓ_bc − log r) servers needed for failure ≤ 2^−λ.
Barcode sets must have minimum Hamming distance ≥ 2ρ+1 to prevent PCR primer cross-talk, bounding maximum usable database size n.

Threat model

The adversary is an honest-but-curious server hosting one database replica, who can observe queries and protocol transcripts but does not collude with other servers. The adversary cannot perform biochemical operations beyond those specified, nor collude to aggregate queries. The model assumes the adversary cannot infer the queried record index f beyond the information leakage bounded by mutual information between queries and f.

Methodology — deep read

Threat model and assumptions: The adversary is an honest-but-curious non-colluding server possessing one of the two DNA databases. The goal is to ensure the queried record index f remains private from any single server. The model assumes perfect biochemical operations (no PCR cross-amplification or sequencing errors) and non-colluding servers. Privacy is quantified via mutual information leakage I(Qj;F).
Data provenance and model: The database D consists of n records encoded as DNA strands composed of barcode (address), payload (data), and universal primer sequence. Barcode sequences are codewords with minimum Hamming distance to prevent PCR cross-talk. The universe of barcode sequences size is u = 4^ℓ_bc.
Architecture and algorithms: Two main PIR protocols are proposed:

Protocol 1 (Base Counter): Servers PCR amplify queried primer pools, sequence them, reconstruct strands, and count base occurrences per position. Answers are matrices of counts over 4 nucleotides and ℓ positions. The user subtracts the matrices from the two servers to reconstruct the queried sequence.
Protocol 2 (XOR): After sequencing and reconstruction, servers convert strands per position to 2-bit binary representations, XOR all reconstructed sequences, and return the XOR to user. The user XORs both server responses to decode the queried strand.

Queries are generated by user via biochemical random primer synthesis: user selects random subset R of primers excluding (if possible) the target barcode bf. A duplicate of R (R') is created biochemically. The target primer bf is added to R' to form Rf. The two servers receive (R,Rf) or (Rf,R) in random order.

Training regime: This is not a learning system; the protocols do not involve machine training but biochemical PCR operations and sequencing followed by computational decoding. Primer pools are sampled either at fixed size r or Bernoulli(p) random sampling.
Evaluation protocol: Information-theoretic privacy leakage ε=I(Qj;F) is analyzed theoretically, deriving closed-form expressions under combinatorial and probabilistic assumptions. Failure probabilities due to bf appearing in random pools are bounded. Communication costs are evaluated in bits per query response. The PIR rate metric is used: retrieved information divided by download cost.
Reproducibility: The paper is theoretical with algorithmic pseudocode and detailed protocol descriptions. No experimental implementation or code release at this stage. Assumptions of noiseless biochemical conditions are limiting but set a foundation for future noisy evaluations.

Concrete example: To retrieve record df, user synthesizes random primer pool R excluding bf (or Bernoulli sampling with p=1/2), creates copy R', adds target primer bf to get Rf, and sends queries (R,Rf) or (Rf,R) to servers. Servers perform PCR, sequencing, reconstruct reads, XOR (Protocol 2) or base count (Protocol 1) their sequences, return answers. User XORs or subtracts answers to decode df barcode sequence. Privacy leakage depends on randomness in R and protocol parameters.

Technical innovations

Formulating PIR for DNA-based storage by exploiting biochemical random primer synthesis and PCR amplification properties unique to molecular storage.
Designing two novel two-server PIR protocols (base counter and XOR) adapted to biochemical constraints where in-pool algebraic computation is impossible prior to sequencing.
Analyzing information-theoretic privacy leakage in molecular PIR setting under combinatorial barcode design and random primer sampling, deriving explicit leakage bounds.
Introducing three user biochemical capability scenarios that trade off privacy, efficiency, and feasibility in DNA-based PIR.
Applying coding-theoretic principles (barcode design with minimum Hamming distance) to ensure PCR primer specificity and prevent cross-amplification.

Baselines vs proposed

Classical two-server PIR with direct PCR primer synthesis: requires O(n) primer synthesis operations per query, infeasible for large DNA databases.
Protocol 1 (Base Counter): communication cost O(ℓ log n) bits per answer vs Protocol 2 (XOR): communication cost 2ℓ bits per answer, achieving PIR rate 1/2 for two servers.
Privacy leakage ε in Protocol 1 fixed pool size r case ≥ 1 - log(1 + 1/n), not zero, vs Protocol 2 with Bernoulli p=1/2 primer sampling achieving ε=0 perfect information-theoretic privacy.
Failure probability in Scenario 2 with fixed r decreases exponentially with number of servers m ≥ 2λ / (2ℓ_bc−log r) vs scenario 3 where imperfect copying further motivates multiple server pairs for robustness.

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.14557.

Fig 1

Fig 1: A schematic representation of the database setup. From left to right: the

Fig 2

Fig 2: A schematic representation of the user’s query preparation.

Fig 3

Fig 3: Schematic representations of the subroutines Answer1 and Recovery1.

Fig 4

Fig 4 (page 3).

Fig 4

Fig 4: Schematic representations of the subroutines Answer2 and Recovery2.

Fig 5

Fig 5: Leakage for fixed r. u ≈1.1 · 1012, n ≈2.7 · 109. H(F) = 31.36.

Fig 7

Fig 7 (page 4).

Fig 8

Fig 8 (page 4).

Limitations

Protocols assume noiseless biochemical operations—perfect PCR specificity, zero sequencing errors, and flawless pool copying—limiting real-world applicability.
The information-theoretic privacy guarantees hinge on precise random primer synthesis parameters that may be challenging to control experimentally.
The barcode design constraint (minimum Hamming distance) limits maximal database size and may require complex coding constructions.
Failure probabilities in weaker biochemical capabilities scenarios require multiple server pairs, increasing practical complexity and resource requirements.
No experimental or simulation validation of protocols to demonstrate feasibility or performance under realistic biochemical noise and error models.
Privacy leakage analysis is limited to non-colluding two-server setting; extensions to multi-server colluding adversaries are not addressed.
Communication overhead in base counter protocol is high, potentially limiting scalability despite improvements over electronic PIR.

Open questions / follow-ons

How do realistic biochemical noise sources—PCR bias, sequencing errors, strand dropouts—impact correctness and privacy guarantees of DNA-based PIR?
Can error-correcting codes or robust decoding algorithms improve protocol resilience to experimental noise while maintaining privacy?
What are efficient barcode code constructions maximizing database size while ensuring primer specificity and PCR cross-talk prevention?
How can the protocols be extended to multi-server settings with colluding adversaries or coded storage representations?

Why it matters for bot defense

This research broadens the horizon of private retrieval protocols beyond classical digital databases into emerging molecular data storage technologies. For bot-defense and CAPTCHA practitioners, this work highlights the evolving landscape of secure data retrieval models that must account for hardware and physical substrate constraints. While DNA storage is not yet mainstream in CAPTCHA or bot-defense domains, analogous challenges exist in designing privacy-preserving retrieval systems under constrained computation or physical access limitations.

Practitioners can draw from these insights about leveraging substrate-specific primitives (here, random primer synthesis and PCR) to design privacy mechanisms that balance system constraints and adversary models. It also underscores the importance of carefully analyzing leakage and failure probabilities in privacy protocols when moving outside classical digital memory assumptions. Consequently, lessons on query encoding strategies, redundant server usage, and noise tolerance can inform advanced bot-defense CAPTCHAs deployed in resource-constrained or novel hardware-backed contexts.

Cite

bibtex

@article{arxiv2606_14557,
  title={ Private Information Retrieval for Large-Scale DNA-Based Data Storage },
  author={ Gökberk Erdoğan and Daniella Bar-Lev and Rawad Bitar and Antonia Wachter-Zeh and Zohar Yakhini },
  journal={arXiv preprint arXiv:2606.14557},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.14557}
}

Private Information Retrieval for Large-Scale DNA-Based Data Storage ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​