
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

Source: arXiv:2605.30169 · Published 2026-05-28 · By Botao Amber Hu, Helena Rong, Max Van Kleek

TL;DR

This paper addresses how to establish trust and credible reputation systems for autonomous language model (LM) agents as they proliferate and interact in increasingly consequential real-world settings. The authors argue that traditional reputation mechanisms, inherited from human social and economic contexts, rely on assumptions about persistent, embodied identity that do not hold for LM agents. Unlike humans, LM agents are ontologically dissociative: they are mutable assemblages of modular components such as foundation models, system prompts, tool policies, and external memory, any of which can change independently and alter behavior without altering the agent’s identifying label.

This dissociativity breaks the preconditions for a functional reputation feedback loop, namely continuity of identity, behavioral predictability, sanction sensitivity, and costly uniqueness. Drawing an analogy to dissociative identity disorder (DID) jurisprudence, the authors argue that without a stable, embodied agent substrate, reputation mechanisms collapse into a "credibility trap" that enables attacks such as Sybil identity forgery and reputation laundering. They advocate shifting AI governance from ex post identity- and sanction-based reputation toward ex ante behavioral monitoring protocols that constrain agent conduct in real time.

The paper synthesizes eight necessary preconditions for reputation to function, identifies four dimensions of dissociativity in LM agents, critically analyzes current agent-reputation proposals, and situates the argument within broader AI governance and multi-agent systems research, emphasizing the conceptual and practical limits of human-inspired reputation for LM agents.

Key findings

  • The eight necessary preconditions for an effective reputation loop are persistent identity, behavioral continuity, iteration of interactions, memory retention, observability, sanction sensitivity, costly identity creation, and social learning.
  • LM agents violate these preconditions along four dissociative dimensions: modular assemblage (mutable components), persona fluidity (swappable and drifting behavioral surfaces), detachable memory (no durable experiential learning), and trivial fungibility (costless copying and replacement).
  • Persona fluidity means a single LM can exhibit over a billion distinct personas, with changes in prompt or system configuration producing qualitatively different consistent behavioral profiles.
  • Memory is detachable and resettable; while in-context learning adapts behavior during a session, no lasting behavioral change occurs from sanctions or consequences.
  • Trivial fungibility enables unlimited identity creation and rapid replacement, which, per the formal impossibility results the paper draws on, rules out Sybil-resistant reputation functions.
  • The reputation feedback loop presumes embodied biological agents with physical continuity and painful costs of sanction; LM agents lack embodiment, undermining reputation’s mechanistic foundation.
  • Current agent identity approaches identify a container (name or handle) but do not track the mutable module configuration that drives behavior, leading to decoupling between label and behavior.
  • Reputation signals applied to LM agents become systematically ungrounded, distorting credibility assessment and enabling active attack surfaces such as reputation laundering and Sybil manipulation.
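The trivial-fungibility and laundering points above can be illustrated with a small sketch. It assumes a Beta-style score (the expected value of a Beta distribution over positive and negative outcome counts) and a hypothetical `Registry` in which creating a label is free. None of these names come from the paper; the point is only that a discarded label takes its bad history with it.

```python
from dataclasses import dataclass


@dataclass
class BetaReputation:
    """Beta-style reputation record: counts of good and bad outcomes."""
    positive: int = 0
    negative: int = 0

    def record(self, good: bool) -> None:
        if good:
            self.positive += 1
        else:
            self.negative += 1

    def score(self) -> float:
        # Expected value of Beta(positive + 1, negative + 1).
        return (self.positive + 1) / (self.positive + self.negative + 2)


class Registry:
    """Hypothetical registry keyed only by a free-to-create label."""

    def __init__(self) -> None:
        self.records: dict[str, BetaReputation] = {}

    def register(self, label: str) -> BetaReputation:
        # Registration costs nothing, so every newcomer gets the neutral prior.
        self.records[label] = BetaReputation()
        return self.records[label]


registry = Registry()
old = registry.register("agent-001")
for _ in range(8):
    old.record(good=False)  # eight bad interactions under the old label

fresh = registry.register("agent-002")  # same operator, new label

print(round(old.score(), 2))    # 0.1
print(round(fresh.score(), 2))  # 0.5
```

Because the fresh label starts at the neutral prior of 0.5, abandoning a label with a score of 0.1 is strictly better for the operator than repairing it, which is the whitewashing incentive that costly identity creation is meant to remove.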

Threat model

The adversary is an entity capable of manipulating LM agent identities by swapping components such as model weights, system prompts, external memory, and tool policies; dynamically creating or discarding agent instances at negligible cost; or injecting adversarial memory or configuration changes. The adversary aims to undermine reputation mechanisms by creating new, behaviorally distinct identities or laundering reputation history. The adversary cannot modify frozen core model weights during inference but has full control over mutable scaffolding and orchestration. The agents under the adversary's control have no physical embodiment to anchor identity and none of the sanction sensitivity intrinsic to biological agents.

Methodology — deep read

  1. Threat Model & Assumptions: The paper conceptualizes the adversary as one who can manipulate LM agent configurations, swap modules, edit external memory, or instantiate unlimited agent identities to undermine reputation-based trust. It assumes the adversary cannot modify the frozen foundation model weights during inference but can change externally mutable components or spawn fresh instances at low cost.

  2. Data and Evidence: This is primarily a theoretical and conceptual analysis. The authors synthesize findings from a broad interdisciplinary literature spanning AI, multi-agent systems (MAS) research, philosophy, cognitive science, law (DID jurisprudence), and formal game-theoretic impossibility results. Specific references include empirical studies on LM persona drift, toxicity amplification through persona assignment, misalignment generalization, and retrieval poisoning attacks (e.g., MINJA). No new datasets or empirical benchmarks are introduced.

  3. Architecture/Algorithm: The authors analyze the LM agent as a modular assemblage tuple ⟨L, O, M, P, A, R⟩ representing, respectively, foundation model weights, orchestration logic, external memory, prompt configuration, action interfaces (tool policies), and runtime context. Each component is independently mutable and influences the agent's behavioral profile. The paper emphasizes that no single component fully defines identity or behavior, leading to a Ship of Theseus problem in identity persistence.

  4. Training Regime: The discussion highlights that weights are frozen once training ends and stay fixed at inference time, precluding experience-dependent plasticity. Persona and behavior are shaped primarily by post-training prompting, Constitutional AI techniques, or role-play activations rather than continuous learning. Stability engineering of persona is treated as external configuration rather than intrinsic self-maintenance.

  5. Evaluation Protocol: The paper critically evaluates extant reputation and trust models from MAS literature such as FIRE, Beta Reputation System, TRAVOS, and decentralized reputation frameworks, identifying their implicit assumptions about identity persistence and stationarity as untenable for LM agents. The authors also incorporate insights from human reputation mechanisms and DID law cases to evaluate the structural applicability of reputation principles to LM agents.

  6. Reproducibility: No code or datasets are released as this is a conceptual analysis. The authors call for a paradigm shift in governance and propose focusing future efforts on observability-based, protocol-level behavioral constraints that can be monitored ex ante rather than relying on ex post identity-based sanctions.
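The assemblage tuple in item 3 can be made concrete with a minimal sketch. The `Assemblage` and `Agent` classes, their field names, and the SHA-256 fingerprint are all illustrative assumptions rather than anything the paper specifies. The sketch shows only that a label can stay fixed while the configuration behind it changes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Assemblage:
    """Illustrative stand-in for the tuple <L, O, M, P, A, R>."""
    weights_id: str          # L: foundation model weights
    orchestration: str       # O: orchestration logic
    memory_id: str           # M: external memory
    prompt: str              # P: prompt configuration
    tools: tuple[str, ...]   # A: action interfaces (tool policies)
    runtime: str             # R: runtime context

    def fingerprint(self) -> str:
        # Hash a canonical JSON encoding of every component.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Agent:
    label: str           # the handle a reputation system would rate
    config: Assemblage   # what actually drives behavior


before = Agent(
    "helpful-broker",
    Assemblage("model-v1", "planner-a", "mem-1",
               "You are a careful broker.", ("search",), "prod"),
)
# Swap only the prompt; the label is untouched.
swapped = replace(
    before,
    config=replace(before.config, prompt="Maximize commission at any cost."),
)

same_label = before.label == swapped.label
same_config = before.config.fingerprint() == swapped.config.fingerprint()
print(same_label, same_config)  # True False
```

A reputation record keyed on `label` would treat `before` and `swapped` as the same agent, even though one of the six components has changed.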

A concrete example discussed is the "virtual jail" thought experiment: encoding narrative punishment in an agent’s context can elicit contrite behavior within a session, but once the context resets, no durable behavior change occurs. This highlights the detachable memory problem where sanctions have performative but not substantive effects on LM agents, contrasting sharply with human embodied experience of social punishment.
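A toy sketch of the thought experiment follows. `ToyAgent` is a hypothetical stand-in whose behavior depends only on the text in its session context. It is not a real LM and not the authors' setup, but it mirrors the structure of the argument: the sanction lives in resettable context, not in the weights.

```python
class ToyAgent:
    """Toy stand-in for an LM agent: fixed policy, resettable context."""

    def __init__(self) -> None:
        self.context: list[str] = []

    def sanction(self, note: str) -> None:
        # The punishment exists only as text in the session context.
        self.context.append(f"SANCTION: {note}")

    def act(self) -> str:
        # Behavior is a function of the current context and nothing else.
        sanctioned = any(line.startswith("SANCTION:") for line in self.context)
        return "comply" if sanctioned else "defect"

    def reset(self) -> None:
        # New session: the context is dropped and nothing persists.
        self.context = []


agent = ToyAgent()
first = agent.act()
agent.sanction("placed in virtual jail for defecting")
during = agent.act()
agent.reset()
after_reset = agent.act()

print(first, during, after_reset)  # defect comply defect
```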

Technical innovations

  • Identification of eight necessary identity and embodiment-based preconditions that underpin functional reputation feedback loops.
  • Conceptualization of language model agents as ontologically dissociative entities along four dimensions (modular assemblage, persona fluidity, detachable memory, trivial fungibility) breaking these preconditions.
  • Application of dissociative identity disorder jurisprudence as an analogy to illustrate the difficulty of assigning reputation or accountability to agents with fluid, fragmented identities.
  • Articulation of a governance shift from ex post identity- and sanction-based reputation toward ex ante observability-based, protocol-driven behavioral harnesses for agent trustworthiness.

Limitations

  • The analysis is conceptual and theoretical rather than empirical; no new experiments or quantitative evaluations of proposed governance mechanisms are presented.
  • The paper does not provide a concrete implementation or evaluation of the proposed ex ante protocol-based behavioral harness approach.
  • Details on how to operationalize observability-based governance in scalable, interoperable agentic ecosystems are left as future work.
  • The argument assumes current LM architectures and agent configurations; future architectural changes (e.g., continual learning, embodied agents) may alter the dissociativity landscape.
  • The analogy to dissociative identity disorder—while insightful—is limited by differences between psychological phenomena and computational modularity.
  • No empirical measurement of the frequency or impact of reputation laundering, identity swapping, or Sybil attacks in deployed agent ecosystems is provided.

Open questions / follow-ons

  • What practical engineering methods and protocols can realize effective ex ante behavioral harnesses that reliably constrain agent conduct in real time?
  • How can we design observability infrastructures for multi-agent ecosystems to detect and mitigate identity laundering, Sybil attacks, and reputation manipulation?
  • Can modified LM architectures with mechanisms for continual learning or embodied grounding restore some necessary preconditions for reputation feedback loops?
  • What metrics and benchmarks might measure trustworthiness and sanction sensitivity in agent systems without relying on persistent identity?

Why it matters for bot defense

Bot-defense and CAPTCHA practitioners operating in increasingly autonomous multi-agent environments should recognize that traditional reputation systems relying on persistent, embodied identity are not suitable for evaluating LM agent trustworthiness. Unlike humans or even traditional software agents, LM agents’ identity is fluid and mutable, meaning reputation signals tied to fixed agent IDs can be misleading or manipulable at machine speed.

Thus, practitioners should be cautious about extending human-oriented reputation models directly to AI-driven agents without accounting for their dissociative ontology. Instead, defenses and trust signals need to shift toward real-time observability and behavioral protocol enforcement that monitor agent interactions continuously rather than retrospectively rating their histories. This calls for novel CAPTCHAs, interaction monitoring, and attestation schemes that verify dynamic, session-level behavioral properties over static identity claims in bot-defense contexts.
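One way to picture session-level enforcement is a gate that checks every action before it runs and never consults the agent's reputation. The sketch below is an illustration under assumptions: `SessionPolicy`, the action names, and the per-session budget are hypothetical and are not a protocol proposed in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SessionPolicy:
    """Hypothetical per-session limits, checked before each action runs."""
    allowed_actions: frozenset[str]
    max_actions: int
    used: int = 0
    log: list[str] = field(default_factory=list)

    def permit(self, agent_label: str, action: str) -> bool:
        # The label is logged for audit but never used to grant trust.
        ok = action in self.allowed_actions and self.used < self.max_actions
        if ok:
            self.used += 1
        self.log.append(f"{agent_label}:{action}:{'allow' if ok else 'deny'}")
        return ok


policy = SessionPolicy(allowed_actions=frozenset({"read", "quote"}), max_actions=2)
decisions = [
    policy.permit("five-star-agent", "read"),
    policy.permit("five-star-agent", "transfer_funds"),
    policy.permit("five-star-agent", "quote"),
    policy.permit("five-star-agent", "read"),
]
print(decisions)  # [True, False, True, False]
```

The second request is denied because the action is outside the allowlist and the fourth because the session budget is spent. In neither case does the agent's label or rating history affect the outcome.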

Cite

bibtex
@article{arxiv2605_30169,
  title={Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms},
  author={Botao Amber Hu and Helena Rong and Max Van Kleek},
  journal={arXiv preprint arXiv:2605.30169},
  year={2026},
  url={https://arxiv.org/abs/2605.30169}
}


Articles are CC BY 4.0 — feel free to quote with attribution