Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
Source: arXiv:2606.04592 · Published 2026-06-03 · By Leonard Kinzinger, Jochen Hartmann
TL;DR
This paper investigates how well large language models (LLMs) can create synthetic individual-level digital twins using heterogeneous panel survey data rather than bespoke, purpose-collected datasets traditionally used for this task. The authors use the German Socio-Economic Panel (SOEP), a rich, naturalistic longitudinal dataset with nearly 1,000 question-answer pairs per participant, to build detailed synthetic respondent personas. They systematically explore a 3×5×2×2 experimental grid varying: model choice (three open-weight LLMs), information depth (ranked by normalized Shannon entropy across question responses), embedding method (narrative summary vs raw dialog history), and reasoning mode (with or without explicit internal reasoning). Over 2.1 million twin responses across 500 participants and 183 held-out questions are evaluated. The results show twin quality improves with more information but with diminishing returns beyond the 75th percentile of entropy-ranked items, which offers a cost-effective balance. Using raw dialog history embeddings raises hold-out accuracy across all models at full information, while adding explicit reasoning improves rank-order correlation without accuracy gains. The best performance reached 78.8% accuracy and a Fisher-z rank correlation of r=0.59 on held-out SOEP items, comparable to prior bespoke digital twin benchmarks. Overall, the study demonstrates that synthetic individual twins of practical quality can be constructed from standard CRM or panel data, and identifies key construction choices impacting performance.
Key findings
- Twin accuracy improves monotonically with added information depth, but shows diminishing returns after including the 75% normalized Shannon entropy quartile of questions, making it a cost-efficient Pareto point.
- Switching embedding from narrative persona summaries to raw dialog history raises hold-out accuracy for every model and reasoning mode at the 100% information depth.
- Explicit reasoning mode (generating internal thought traces prior to output) increases rank-order correlation without changing accuracy.
- Best accuracy achieved is 78.8% and highest Fisher-z rank correlation is r = 0.590 on the SOEP held-out evaluation set.
- Qwen 3 leads on average across accuracy, correlation, and dispersion ratio metrics, with Gemma 4 attaining the single best accuracy cell.
- Individual-level twins built from pre-existing heterogeneous panel data (SOEP) can match or approach performance of twins built on custom-designed surveys.
- The normalized Shannon entropy serves as an effective information density ranking to select survey items for digitized twin construction.
- Evaluation against 183 held-out questions across 500 fixed participants enables robust statistical differentiation between construction methods.
Threat model
The adversary is essentially absent as the task is non-adversarial: the goal is to faithfully replicate an individual's survey responses for marketing research applications, assuming honest participants and no deliberate attacks. The models do not account for manipulation, spoofing, or data poisoning scenarios but instead focus on modeling heterogeneity and fidelity in normal, real-world panel data settings. Adversaries cannot change or erase ground-truth panel data, and twin construction assumes full access to an individual's historical responses.
Methodology — deep read
The authors frame the threat model around digital twins as synthetic respondents aiming to replicate individual human survey response patterns, assuming no adversarial manipulation but focusing on maximizing fidelity to recorded panel data.
They use the German Socio-Economic Panel (SOEP)-Core v40 EU Edition, a representative longitudinal survey with 16,055 participants after filtering for completeness and linkage, each with 949 quality-checked question-answer pairs across 10 thematic domains relevant for market research. Of these, 38 demographic questions form a fixed baseline persona context, 728 persona-context questions are used for twin construction, and 183 questions are held out for evaluation, ensuring no overlap.
From the full SOEP dataset, they sample 500 participants with fixed random seed to create a consistent evaluation subset for the full experimental grid. This size balances statistical power and compute cost, as all inference and scoring are done locally with 3 open-weight LLMs.
The three evaluated LLMs are Qwen 3 (30B parameters), Gemma 4 (26B parameters), and Ministral 3 (14B parameters), chosen for open-source availability, strong performance on relevant benchmarks, and native German support matching the SOEP source language. All inference is done on a single NVIDIA H100 GPU using the vLLM framework.
Persona context embedding is tested in two forms: a narrative persona summary created by a Chain-of-Density hierarchical summarization method compressing individual responses into dense prose, and a raw dialog history preserving question-answer pairs as alternating user-assistant turns. Reasoning modes include a non-reasoning setting where the model directly responds, and a reasoning mode inspired by chain-of-thought prompting, where the model first generates an explicit internal thought trace before outputting the final answer. For two models dedicated reasoning checkpoints are used; for Gemma 4 inference-time prompting elicits reasoning.
Information depth is operationalized by ranking the 728 persona-context questions by normalized Shannon entropy of their response distributions across the large SOEP sample. Entropy values range from 0 (no variation) to 1 (maximum heterogeneity). Questions are split into quartiles, and five cumulative information levels are formed: baseline demographics only, plus successive entropy quartiles up to 100%.
The authors run all 3×5×2×2 = 60 construction cells, generating over 2.1 million answers on held-out questions across 500 participants. Model outputs are scored using metrics adapted from prior twin literature including accuracy (scaled closeness to human response), rank-order correlation (how well twins reproduce inter-individual differences), and dispersion ratio (comparative variance of twin to human responses).
Statistical significance of differences is assessed by paired metrics on the fixed participant and held-out item sets, isolating marginal effects of construction choices. The study uses open weights and local computation for reproducibility within license constraints. Some pipeline steps such as question normalization and LLM-based reformulation are documented but code release is not explicitly stated.
An example end-to-end flow: for participant A, the highest-entropic 75% questions plus demographics form the context via raw dialog history; Qwen 3 reasoning mode then processes this context plus a held-out test question prompt, outputs internal reasoning and then final predicted answer; this output is scored against the participant's actual answer to compute accuracy and rank-order correlation. Repeated over all participants and items builds the evaluation matrix.
Technical innovations
- Systematic exploration of a 3×5×2×2 construction-method grid integrating three open-weight LLMs, five levels of information depth ranked by normalized Shannon entropy, two embedding approaches, and two reasoning modes on a single large panel dataset.
- Use of normalized Shannon entropy to rank and select survey items for inclusion in the persona context, providing a principled, cost-efficient strategy for context budgeting.
- Demonstrating that a raw dialog-history embedding of question-answer pairs outperforms narrative persona summaries in predicting held-out individual responses.
- Showing that explicit reasoning modes improve rank-order correlation metrics independently of accuracy gains in synthetic twin generation.
- Building individual-level digital twins at scale from pre-existing heterogeneous panel microdata instead of bespoke, data-collection-intensive survey instruments.
Datasets
- German Socio-Economic Panel (SOEP)-Core v40 EU Edition — ~16,000 filtered participants with 949 usable question-answer pairs each — proprietary academic panel data accessed under contract with DIW Berlin
Baselines vs proposed
- Demographics-only context baseline: Accuracy ~5.1% lower than adding the first entropy quartile of questions.
- Narrative summary embedding at 100% info depth: Lower accuracy than raw dialog history embedding by a consistent margin across all three models and reasoning modes.
- Non-reasoning mode vs reasoning mode: Rank-order correlation increased (e.g., Fisher-z correlations reach 0.590 best in reasoning) without significant accuracy difference.
- Gemma 4 at dialog-thinking-100% setting achieves best cell accuracy of 78.8%.
- Qwen 3 achieves highest average rank-order correlation (r = 0.590) across all cells.
Limitations
- Study restricted to open-weight LLMs due to SOEP license constraints; closed-source frontier models excluded which might perform differently.
- Use of a single cross-sectional 2023 wave snapshot limits longitudinal fine-grained temporal dynamics modeling.
- Participants limited to 500 subsample to balance compute and statistical power; larger samples might reveal subtler effects.
- No adversarial or out-of-distribution evaluation to test twin robustness against deliberate manipulation or domain shifts.
- No public release of code or model checkpoints reported, limiting immediate reproducibility by external researchers.
- Findings primarily situated on German-language data; multilingual performance seen as a lower bound relative to English datasets.
Open questions / follow-ons
- How would closed-source foundation models or larger proprietary LLMs perform relative to the open-weight models tested here?
- What is the impact of longitudinal modeling and incorporating multi-wave time dynamics on synthetic twin fidelity?
- How robust are synthetic twins to adversarial inputs or data corruption scenarios common in real-world panel systems?
- Can calibration or debiasing methods be integrated to reduce known LLM response biases like shrinkage or demographic stereotyping in this context?
Why it matters for bot defense
For practitioners building bot defenses and CAPTCHA mechanisms, the findings highlight that LLM-based synthetic human models can nearly replicate individual human behavioral patterns from existing panel data not explicitly designed for such tasks. This implies that bot-detection systems examining repeated survey-like interactions or behavioral questionnaires must consider that sophisticated synthetic personas can be generated from standard CRM or loyalty program data, potentially simulating genuine responders with high accuracy.
From a CAPTCHA design perspective, simply aggregating demographic or coarse persona attributes may be insufficient for catching highly personalized synthetic bots. Instead, incorporating nuanced response heterogeneity and contextual behavioral histories—akin to the raw dialog-embedding approach—could offer more stringent behavioral fingerprints. Moreover, explicit reasoning traces, while improving rank-order correlation, do not directly improve accuracy, suggesting that defense systems should monitor for subtle discrepancies in reasoning consistency or diversity rather than just answers. Ultimately, digital twin advances underline the importance of multidimensional, longitudinal behavior modeling rather than surface response matching in bot-defense frameworks.
Cite
@article{arxiv2606_04592,
title={ Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata? },
author={ Leonard Kinzinger and Jochen Hartmann },
journal={arXiv preprint arXiv:2606.04592},
year={ 2026 },
url={https://arxiv.org/abs/2606.04592}
}