If These Walls Could Talk: Critical Play with Large Language Models in Museums

Source: arXiv:2606.15565 · Published 2026-06-14 · By Anders Sundnes Løvlie

TL;DR

This paper examines the emerging use of Large Language Models (LLMs) in museums as chatbots that simulate conversations with historical figures or artefacts. While these role-playing chatbots offer engaging, playful, and immersive visitor experiences, the paper highlights a fundamental dilemma: LLMs cannot be relied upon to tell the truth, yet attempts to constrain or correct them to improve factual accuracy tend to reduce their conversational engagement and life-likeness. Through analysis of several museum chatbot examples, including the Anne Marie Carl-Nielsen bot that was limited to factual base knowledge but proved bland and unengaging, and a striking worker mannequin chatbot employing a literary persona to enable freer conversation, the author argues for embracing the unreliability of LLMs via design for "critical play." This means allowing chatbots to inhabit fictional or semi-fictional roles that represent historical narratives, discourse styles, or diverse perspectives, leveraging their capacity for dialogue and role play rather than insisting on factual correctness. Such an approach provokes critical reflection and visitor engagement without misleading as authoritative sources of truth. The paper situates these findings in museology and play theory, supporting a design philosophy that balances illusion and criticality rather than naive trust or strict regulation.

Key findings

The largest GPT-3 model (175B parameters) was truthful on only 58% of a set of 817 test questions, less than smaller models and far below humans (94%).
Efforts to restrict LLM chatbots to curated knowledge bases (e.g. Anne Marie Carl-Nielsen chatbot with Retrieval-Augmented Generation) led to limited conversational ability and bland, FAQ-like responses.
Visitors and museum staff preferred chatbots that were less restricted and more conversational, even if less factual.
Using fictional personas based on historical novels or culturally resonant characters (e.g. Pelle the Conqueror, Viking Sorceress) improved engagement and allowed freer, playful dialogue.
Role playing within a fictional or semi-fictional identity enables chatbots to embody diverse historical narratives or discourses without the requirement of strict factual accuracy.
Museum visitors tend to overtrust AI outputs, even in low-stakes settings, increasing risks of misinformation if chatbots are presented as authoritative.
Chatbots that embrace deliberate unreliability and unreliability as a feature of design can facilitate critical reflection and transformative experiences aligned with critical play theory.

Threat model

The paper assumes a non-adversarial setting where museum visitors interact with LLM chatbots lacking deep technical knowledge. The main risk comes from inadvertent misinformation generated by LLM hallucinations or confabulations rather than malicious attacks. Visitors may overtrust chatbot responses due to persuasive conversational abilities. The chatbot deployment context offers no strong user-side controls to verify truthfulness. There is no explicit consideration of attackers aiming to subvert the systems.

Methodology — deep read

The paper adopts a design research and conceptual analysis approach, grounded in case studies and prototype experiments within museum contexts. The threat model centers on museum visitors as users without technical expertise, who may overtrust chatbots but lack ability to discern chatbot truthfulness. Adversaries or attackers are not directly considered; the focus is mainly on unintentional misinformation from LLM hallucinations or confabulations.

Data sources include direct interactions with deployed or prototype LLM-based museum chatbots, marketing materials, and evaluations drawn from user feedback and observations at sites like the Dalí Museum, the Metropolitan Museum of Art, the National Gallery of Denmark, and the Workers Museum in Copenhagen. The paper reviews prior empirical LLM evaluation studies (e.g. Lin et al. 2022) for hallucination rates and truthfulness metrics.

The chatbot architecture generally involves GPT-4 or GPT-3 based LLMs, sometimes combined with Retrieval-Augmented Generation techniques limiting the knowledge to curated archives. Voice recognition and synthesis (e.g. ElevenLabs voice models) were used in some installations for audio interaction. Novelty lies not in technical algorithms but in the design framing: an intentional use of storytelling and fictional role playing to match LLM capabilities rather than attempting full factual strictness.

Training regimes, hyperparameters, or hardware specifics are not explicitly detailed given reliance on commercially available LLM APIs. Evaluation procedures involve qualitative user testing, observer notes, and analysis of conversational transcripts from prototypes exploring different prompt strategies and knowledge constraints. The paper details a concrete example: iteratively developing a chatbot for 19th-century striking workers, shifting from strict historical source constraints (producing bland answers) to role playing the protagonist from a historical novel (yielding more engaging dialogue).

Reproducibility is limited; much evaluation is qualitative and dependent on proprietary LLM APIs and museum installations that are not publicly accessible. The author flags these constraints and prioritizes design insights over algorithmic innovation.

Technical innovations

Proposal to embrace LLM unreliability through design for critical play rather than attempting to enforce factual correctness.
Use of fictional or semi-fictional personas rooted in historical novels or mythology as chatbot roles to better align with LLM capabilities and visitor engagement.
Integration of role-play as a central design strategy to balance museological concerns of truth with enabling lively, immersive conversation experiences.
Application of conversational styles (e.g. poetic verse of Viking sorceress) to enable educational encounters focused on discourse and culture rather than purely factual content.

Baselines vs proposed

Lin et al. (2022) GPT-3 family models: Truthfulness on 817 questions = 58% for 175B model vs human baseline = 94%
Anne Marie Carl-Nielsen chatbot with Retrieval-Augmented Generation: conversational breadth very limited and answers mostly bland vs unconstrained prompt-chatbots: more engaging but less reliable

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.15565.

Fig 1

Fig 1: The wedding dress exhibit of the Chat with Natalie experience (the Metropolitan Museum of Art, New

Fig 2

Fig 2: The Anne Mare Carl-Nielsen chatbot at the National Gallery of Denmark. (Photo: Jonas Heide Smith)

Fig 3

Fig 3: Visitor talking to one of the prototypes in the Workers Museum, in front of the mannequins of striking

Fig 4

Fig 4: Storyboard illustrating the "Viking Sorceress" prototype.

Limitations

Evaluation relies on qualitative user feedback and observations rather than quantitative or controlled user studies.
Most museum chatbot examples are proprietary installations with limited accessibility or reproducibility for external validation.
The approach prioritizes design framing over algorithmic mitigation; no technical solutions to completely solve hallucination are demonstrated.
The paper discusses concepts primarily from a single domain (museums); generalizability to other contexts is not empirically tested.
No systematic user studies on visitor understanding of chatbot unreliability or impact on knowledge acquisition are presented.
The bifurcation between factual and fictional role play may oversimplify complex visitor expectations about historical truth.

Open questions / follow-ons

How can museums effectively communicate chatbot unreliability to visitors to prevent overtrust while maintaining engagement?
What are best practices for balancing factual integrity with conversational depth in chatbot design for heritage institutions?
Can hybrid architectures combining retrieval-based methods with generative models better guarantee truthfulness without losing engagement?
How do various visitor demographics interpret and learn from fictional role-play chatbots versus factually constrained chatbots?

Why it matters for bot defense

The paper’s in-depth exploration of LLM unreliability and visitor overtrust is highly relevant for bot defense practitioners concerned with user trust and behavior around AI-generated content. While not directly about CAPTCHAs, the findings highlight that conversational AI systems—even those that are engaging and appear authoritative—can propagate misinformation if not carefully designed or constrained. For CAPTCHA engineers considering integrating LLMs or conversational agents in user flows, this work cautions against naive deployment without safeguards or transparency.

Furthermore, the concept of designing chatbots as deliberately unreliable fictional personas to induce critical reflection could inspire innovative approaches in bot defense systems where engagement rather than strict correctness is prioritized. It suggests a nuanced balance between AI-generated textual fluency and factual grounding, a theme that resonates with building trustworthy and robust human-AI interaction systems in security-sensitive settings.

Cite

bibtex

@article{arxiv2606_15565,
  title={ If These Walls Could Talk: Critical Play with Large Language Models in Museums },
  author={ Anders Sundnes Løvlie },
  journal={arXiv preprint arXiv:2606.15565},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.15565}
}

If These Walls Could Talk: Critical Play with Large Language Models in Museums ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​