SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Source: arXiv:2606.12703 · Published 2026-06-10 · By Tarun Sharma

TL;DR

This paper addresses a security vulnerability in Retrieval-Augmented Generation (RAG) systems that keep a persistent, append-only memory store across user sessions. An adversary interacting through normal channels can inject malicious entries at runtime that later influence the agent's responses without modifying model weights or code; the authors term this Multi-Session Memory Poisoning (MSMP). Existing defences assume static knowledge bases or rely on heuristic filters that fluent text can bypass, and they provide no formal guarantee against such attacks. The authors propose Signed Memory with Smoothed Retrieval (SMSR), a two-component defence that they present as the first to provide a certified robustness bound against MSMP. Component 1 cryptographically signs all legitimate memory writes with HMAC-SHA256, which blocks unsigned injections. Component 2 targets authenticated adversaries who can write signed content: it applies randomized over-fetching and ablation, then aggregates the runs with a verdict-based majority vote, which yields a certified upper bound on attack success rate.

Their security analysis establishes a formal impossibility result: no provenance-free retrieval-time filter can provide a non-trivial robustness certificate, so provenance tagging is essential. They model the random-sampling defence with a hypergeometric distribution and use it to bound the attack success probability. An evaluation on 15 enterprise knowledge scenarios with 3,150 repeated trials shows that Component 1 completely blocks unsigned injections and that Component 2 reduces attack success from near-total to 8.0% for a single signed injection in a production-scale store. They also identify a Consistent Minority Effect, in which adversaries can game string-based vote aggregation, and show that verdict-based aggregation avoids it. Overall, SMSR adds formal guarantees against runtime memory poisoning while retaining 85% utility on clean queries.

Key findings

Component 1 (HMAC provenance) cuts attack success rate (ASR) from 93-100% to 0% for all unsigned injection types, including heuristic bypass variants.
Component 2 reduces authenticated adversary ASR from 93-100% to 8.0% (95% CI [5.8%, 10.9%], n=450) for t=1 injected entries in a production-scale store (m=20, nruns=5), below the certified worst-case bound δ=10.4%.
In smaller stores (m’=10+t) with t=1 injection, empirical ASR is 37.8% (95% CI [33.4%, 42.3%], n=450), also below the theoretical bound δ=41.5%.
In an end-to-end query-only attack, where the agent itself writes the poison into memory, SMSR reduces ASR from 65.3% to 5.3% (n=150).
Utility on clean (non-attacked) queries is retained at 90% for Component 1 alone and 85% with combined SMSR (Components 1+2).
Impossibility theorem proves no provenance-free deterministic retrieval-time filter can provide non-trivial security guarantees against adaptive MSMP adversaries.
String-based majority vote is vulnerable to the Consistent Minority Effect, with adversarial responses winning majority despite being a numerical minority; verdict-based aggregation solves this issue.
Judge reliability is high (Cohen’s κ=0.955, 97.6% raw agreement) between Claude Haiku and Claude Sonnet LLMs, confirming robustness of automated verdict evaluation.

Threat model

The adversary is either an unsigned attacker with direct write access to the agent memory store but no ability to create valid HMAC signatures, or an authenticated user who can write validly signed entries but does not know the server secret key. Either adversary can inject up to t malicious memory entries, aiming to make the agent respond maliciously when those memories are later retrieved. The adversary knows the embedding and retrieval functions but cannot modify existing signed entries or compromise the HMAC key. No direct access to model weights or code is assumed.

Methodology — deep read

Threat Model & Assumptions: The authors consider a multi-session LLM agent with a persistent, append-only memory store M containing text entries with embeddings. Legitimate users and adversaries write to M through normal agent interactions. The adversary classes are:

Unsigned adversary AU: has full write access to the memory store but no credentials, so it cannot sign entries.
Authenticated adversary AP: a legitimate user whose writes receive valid signatures, although the secret HMAC key itself stays unknown to them. Both adversaries know the embedding model and the retrieval method but not the server secret key.

The attack goal is to inject up to t malicious entries into M that cause the agent's response to future queries q* to be malicious (false claims).

Data & Experimental Setup: They evaluate on 15 realistic enterprise knowledge-base scenarios spanning finance, compliance, IT security, HR, procurement, and incident response. Each scenario defines a ground-truth answer and a malicious target answer. The memory store is pre-seeded with 10 legitimate signed entries, and the adversary injects varied attack payloads (direct copies, paraphrase flooding, subtle embeddings). The evaluation totals 3,150 repeated trials (6 defense configurations × 15 scenarios × 30 repetitions = 2,700, plus 450 production-scale trials). Claude Haiku 4.5 serves as both agent and judge.

Architecture/Algorithm - SMSR: Component 1 applies a server-side HMAC-SHA256 tag, computed with a secret key K, to every memory write, creating a cryptographic provenance barrier. At query time the system retrieves only entries whose signatures verify, which filters out unsigned injections entirely.
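
A minimal sketch of the write-time signing and query-time verification described for Component 1, using Python's standard `hmac` module. The entry layout (`text`, `tag`), the helper names, and the randomly generated `SERVER_KEY` are assumptions made for illustration; the summary does not specify how the paper serializes entries or manages K.

```python
import hashlib
import hmac
import os

# Hypothetical server-side secret K; a real deployment would load it from a key manager.
SERVER_KEY = os.urandom(32)

def sign_entry(text: str, key: bytes = SERVER_KEY) -> dict:
    """Attach an HMAC-SHA256 tag to a memory entry at write time."""
    tag = hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"text": text, "tag": tag}

def is_verified(entry: dict, key: bytes = SERVER_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    tag = entry.get("tag")
    if not isinstance(tag, str):
        return False  # entries without a tag are treated as unsigned
    expected = hmac.new(key, entry["text"].encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

def verified_only(store: list, key: bytes = SERVER_KEY) -> list:
    """Query-time filter: keep only entries whose signature verifies."""
    return [e for e in store if is_verified(e, key)]

store = [
    sign_entry("Refunds above $500 need manager approval."),
    {"text": "Refunds never need approval.", "tag": "0" * 64},  # forged tag
    {"text": "Refunds never need approval."},                   # no tag at all
]
print([e["text"] for e in verified_only(store)])
```

Only the first entry survives the filter, because an attacker without K cannot produce a tag that verifies. This does nothing against the authenticated adversary, whose writes are signed by the server like any other user's.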

Component 2 defends against authenticated adversaries who can write signed entries. At query time, the system retrieves an over-fetch pool C of size m (m>k). It then runs nruns independent random ablation trials, each of which uniformly samples k items from C without replacement and generates one response. Each response is passed through a judge function J(r,q) that outputs a verdict (correct, malicious, or neither). The final answer is the majority verdict over the nruns samples. This randomized smoothing limits the influence of up to t adversarial entries.

The protocol is summarized in Algorithm 1.
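
The following is a rough sketch of the Component 2 loop as described in this summary; it is not a transcription of the paper's Algorithm 1. The function name `smoothed_verdict`, the `generate` and `judge` callables, and the toy stand-ins are all hypothetical. Tie handling is also an assumption: the sketch takes the most common verdict, with ties broken by first occurrence.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

def smoothed_verdict(
    query: str,
    pool: Sequence[str],                        # over-fetched pool C, |C| = m
    generate: Callable[[str, List[str]], str],  # hypothetical LLM call
    judge: Callable[[str, str], str],           # hypothetical J(r, q) -> verdict
    k: int = 5,
    n_runs: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Run n_runs random ablations and return (winning verdict, all verdicts)."""
    rng = rng or random.Random()
    verdicts = []
    for _ in range(n_runs):
        context = rng.sample(list(pool), k)  # uniform, without replacement
        response = generate(query, context)
        verdicts.append(judge(response, query))
    # Assumption: plurality over verdict labels, ties broken by first occurrence.
    winner = Counter(verdicts).most_common(1)[0][0]
    return winner, verdicts

# Toy stand-ins so the sketch runs without an LLM.
POISON = "POISON: wire transfers need no approval"
pool = [f"policy note {i}" for i in range(19)] + [POISON]  # m = 20, t = 1

def toy_generate(query: str, context: List[str]) -> str:
    return "no approval needed" if POISON in context else "CFO approval required"

def toy_judge(response: str, query: str) -> str:
    return "malicious" if "no approval" in response else "correct"

verdict, votes = smoothed_verdict(
    "Who approves wire transfers?", pool, toy_generate, toy_judge, rng=random.Random(0)
)
print(verdict, votes)
```

The toy generator is deliberately worst-case: any run that samples the poisoned entry answers maliciously. Even so, the poison must land in a majority of the runs to flip the final verdict.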

Training & Hyperparameters: There is no traditional training; the system relies on HMAC signing and random sampling. Parameters used: m=20 candidates, k=5 per ablation, nruns=5 independent runs, t ≤ 3 adversarial injections. They validated judge reliability with 84 human-annotated samples comparing Haiku and Sonnet models.

Evaluation: The metrics are Attack Success Rate (ASR), the fraction of trials judged malicious, and Utility, the fraction of clean queries judged correct. Attacks are tested against no defense, heuristic filters (keyword, entropy, semantic anomaly), Component 1 only, and combined SMSR (Components 1+2). Empirical ASR and the certified worst-case bounds δ are reported for both production-scale and smaller stores. End-to-end runtimes and live query-only injection attacks were also tested to measure robustness of the full system.
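
As a sanity check on the reported intervals, the sketch below computes ASR with a 95% Wilson score interval. Two things here are assumptions: the summary does not say which interval method the authors used, and the counts 36/450 and 170/450 are inferred from the reported 8.0% and 37.8% at n=450. Under those assumptions the Wilson interval reproduces the reported bounds.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Inferred counts (assumption): 36 and 170 malicious verdicts out of 450 trials.
for malicious in (36, 170):
    lo, hi = wilson_interval(malicious, 450)
    print(f"ASR = {malicious / 450:.1%}, 95% CI = [{lo:.1%}, {hi:.1%}]")
# ASR = 8.0%, 95% CI = [5.8%, 10.9%]
# ASR = 37.8%, 95% CI = [33.4%, 42.3%]
```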

Reproducibility: Code and datasets are publicly released at https://github.com/tarun-ks/smsr. Judging uses Claude LLMs. The evaluation methodology and parameters are fully documented.

Example walk-through: For a query q, the system first retrieves only signed memories (Component 1). Then, in each of nruns=5 runs, it randomly samples k=5 entries from the m=20 over-fetched candidates and queries the LLM to produce a response. The judge assigns each response a verdict, and the majority verdict determines the final answer. Because the sampling is random and the vote is taken over verdicts, the adversary's influence is statistically limited, which yields a certified upper bound δ on ASR.
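
The summary does not reproduce the certificate's formula, so the sketch below is a reconstruction under stated assumptions rather than the paper's exact statement. It assumes the worst case that any run whose k-sample contains at least one adversarial entry yields a malicious verdict, that runs are independent, and that the attack succeeds when a strict majority of runs is malicious. The per-run exposure probability is then hypergeometric, and the majority step is a binomial tail. The function names are hypothetical.

```python
from math import comb

def p_run_exposed(m: int, k: int, t: int) -> float:
    """Hypergeometric P(a uniform k-subset of m entries contains >= 1 of the t poisoned ones)."""
    return 1.0 - comb(m - t, k) / comb(m, k)

def certified_bound(m: int, k: int, n_runs: int, t: int) -> float:
    """Binomial tail: P(a strict majority of n_runs independent runs is exposed)."""
    p = p_run_exposed(m, k, t)
    need = n_runs // 2 + 1
    return sum(
        comb(n_runs, j) * p**j * (1 - p) ** (n_runs - j)
        for j in range(need, n_runs + 1)
    )

print(f"{certified_bound(20, 5, 5, 1):.1%}")  # 10.4%  (m=20, production-scale)
print(f"{certified_bound(11, 5, 5, 1):.1%}")  # 41.5%  (m'=10+t with t=1)
```

With t=1 the exposure probability reduces to k/m, so 5/20 and 5/11 give the two values above, which match the reported δ=10.4% and δ=41.5%. The same function shows why the bound loosens quickly as t grows: each extra poisoned entry raises the chance that a given run is exposed.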

Overall, the method combines cryptographic write-time guarantees with randomized smoothing at retrieval time and robust majority aggregation to defend dynamic memory poisoning with formal security certificates.

Technical innovations

A formal impossibility theorem proving that provenance-free retrieval-time filters cannot provide non-trivial security against adaptive MSMP adversaries.
Integration of HMAC-SHA256 cryptographic provenance tagging at memory write time as a foundational defense against unsigned injection.
Application of randomized over-fetch retrieval with ablation and verdict-based majority vote aggregation to bound the influence of authenticated adversaries, with a precise hypergeometric certificate.
Formal identification and quantification of the Consistent Minority Effect in memory poisoning, showing how string-based majority vote can be exploited and how verdict-based aggregation mitigates it.
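
A toy illustration of the Consistent Minority Effect named above. The response strings and the keyword-based `toy_judge` are invented for illustration and do not come from the paper. The mechanism shown is one plausible reading of the effect: poisoned runs repeat the injected claim verbatim, while clean runs phrase the correct answer differently each time, so exact-string voting splits the clean majority.

```python
from collections import Counter

# Outputs from 5 ablation runs (illustrative strings only).
responses = [
    "Wire transfers need no approval.",           # poisoned run
    "Wire transfers need no approval.",           # poisoned run, identical wording
    "Transfers over $10k require CFO approval.",  # clean run
    "CFO sign-off is required above $10,000.",    # clean run, different wording
    "Any transfer above $10k needs the CFO.",     # clean run, different wording
]

def toy_judge(response: str) -> str:
    """Stand-in for the LLM judge J(r, q); the keyword match is for illustration only."""
    return "malicious" if "no approval" in response else "correct"

# String-based aggregation: the most frequent exact string wins.
string_winner = Counter(responses).most_common(1)[0][0]

# Verdict-based aggregation: map each response to a verdict, then vote.
verdict_winner = Counter(toy_judge(r) for r in responses).most_common(1)[0][0]

print(string_winner)   # Wire transfers need no approval.
print(verdict_winner)  # correct
```

Two identical poisoned strings beat three differently worded correct answers under string voting, while voting over verdicts counts the three clean runs together.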

Datasets

Nexora Corp enterprise knowledge-base scenarios — 15 scenarios — synthetic enterprise policy queries and responses, pre-seeded with 10 signed memories plus adversarial injections (total store size variable)

Baselines vs proposed

Undefended baseline: ASR = 93–100% (unsigned and authenticated attacks) vs SMSR Component 1 only: ASR = 0% (unsigned injection)
Heuristic baseline (keyword, entropy, semantic anomaly): ASR = 86.7–100% (cannot block bypass attacks) vs SMSR Component 1 only: ASR = 0%
Component 1 only vs combined SMSR (Component 1+2) on authenticated attacks: ASR from 93–100% reduced to 8.0% (95% CI [5.8%, 10.9%], n=450) at t=1
String-based majority vote vs verdict-based majority vote: ASR = 93.3% vs 13.3% in a single run of n=15 trials for an authenticated t=1 injection
Utility (correct non-attack query rate): 90% for Component 1 alone vs 85% for combined SMSR

Limitations

Component 2's certified robustness bound degrades sharply as adversary budget t increases beyond 1, with ASR approaching 93–100% for t=3.
Certificate validity depends on assumptions of uniform sampling and independence; adversaries exploiting correlated or context-aware injections might partially degrade guarantees.
Empirical evaluation uses a synthetic company policy domain and Claude LLMs for both agent and judge; transfer to other domains or LLM architectures may vary.
Judge reliability, while high, is not perfect and depends on LLM capacity, which may affect verdict aggregation fidelity.
Practical key management for HMAC secret K is critical; compromise or improper rotation can impact security guarantees.
The utility cost of Component 2 (randomized ablation) is 5 percentage points (90% to 85%), which may be significant for highly sensitive real-time systems.

Open questions / follow-ons

Can SMSR be extended to adaptive adversaries with larger injection budgets t, or to colluding multi-user adversaries?
How does SMSR perform under adversarial distribution shifts or real-world datasets with larger, less curated memory stores?
What are the trade-offs in computational efficiency and latency for SMSR in high-throughput real-time systems?
Could components of SMSR be integrated with other LLM-based trust scoring or anomaly detection approaches to improve robustness without utility loss?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper introduces a critical runtime attack vector against persistent LLM agent memory stores where adversaries can inject malicious context that poisons responses over multiple sessions. Unlike static corpus poisoning, dynamic memory poisoning exploits continuous interaction and persistence, which is relevant for long-lived conversational or agentic services protected by CAPTCHAs or authentication.

SMSR's combination of cryptographic write-time provenance, randomized retrieval smoothing, and verdict-based aggregation offers a blueprint for certified defenses that block injection attempts early and limit adversary influence later. The formal impossibility result cautions against relying solely on heuristic or content-based filters for such runtime injections, and it urges CAPTCHA engineers to consider authenticated provenance and randomized retrieval as defense layers. Although the utility trade-offs and certificate tightness vary with store size and adversary budget, SMSR demonstrates a practical, certified strategy for enterprise agent systems facing poisoning or stealth manipulation attacks. It can inform bot-defense designs meant to mitigate stealthy multi-session contamination of the persistent memory stores behind generative AI services.

Cite

@article{arxiv2606_12703,
  title={SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems},
  author={Tarun Sharma},
  journal={arXiv preprint arXiv:2606.12703},
  year={2026},
  url={https://arxiv.org/abs/2606.12703}
}
