Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Source: arXiv:2605.12411 · Published 2026-05-12 · By Eilam Shapira, Moshe Tennenholtz, Roi Reichart

TL;DR

This paper addresses the problem of predicting the next decision of an unfamiliar AI agent engaging in natural language negotiation or bargaining games after observing only a few prior interactions. Unlike prior work that either models entire populations or uses direct LLM prompting, the authors formulate it as a target-adaptive text-tabular prediction task. Each decision point is represented as a tabular row combining structured game state features, dialogue text embeddings, and a novel LLM-as-Observer hidden state feature extracted from a frozen small LLM that observes the current state and dialogue but does not produce a direct prediction. Trained on 13 frontier LLM agents from a benchmark tournament and tested on 91 held-out scaffolded agents from a university hackathon, their multimodal tabular model substantially outperforms both direct few-shot LLM prompting and tabular baselines without the Observer representation.

The results show that hidden states of an LLM observing the interaction contain rich decision-relevant signals that are not surfaced by prompting alone. With 16 adaptation games (K=16), adding the Observer features improves response (accept/reject) AUC by approximately 4 points and reduces bargaining proposal prediction error by 14% over game+text baselines. The approach generalizes across different LLM providers and varying agent prompt/control logic. This framing clarifies that target adaptation from few examples is best approached by extracting reusable representations and combining them with structured features in tabular models, rather than direct generation by a large LLM predictor.

Key findings

At K=16 adaptation games, the LLM-as-Observer representation improves response-prediction AUC over the game+text tabular baseline by ~4 percentage points in bargaining (0.791 to 0.831) and ~2 points in negotiation (0.803 to 0.826).
Observer hidden states reduce bargaining proposal prediction error by 14% at K=16 compared to game+text features alone (e.g., reducing one-off offer prediction error from $552 to $473 on a $10,000 nominal split).
LLM-as-Predictor direct few-shot prompting baseline lags behind tabular models in response prediction by up to ~6.7 points AUC and shows negative or near-zero median R2 for proposal prediction.
Feature ablations confirm structured game-state features are essential, Observer features add critical decision information, and generic dialogue embeddings become largely redundant once Observer features are included.
Observer hidden state gains are stable across different frozen small LLMs (Gemma-2-2B, Qwen3-1.7B, Llama-3.2-1B) and mid-to-late Transformer layer representations (relative depth 0.6–0.9).
Cross-population transfer: models trained on the 13 frontier LLM agents (varying underlying LLM) successfully adapt to 91 held-out scaffolded agents (sharing LLM but varying prompts/control logic), demonstrating robustness to different agent heterogeneity axes.
Direct output logits of the Observer LLM are less informative than its hidden state representations, justifying use of LLM as an encoder rather than a few-shot predictor.

Threat model

The threat model concerns an observer who wishes to predict the next strategic decision of a previously unseen language-based AI agent interacting in a negotiation or bargaining game. The observer sees only the public game state, offer history, and dialogue messages. The target agent's private prompt, underlying LLM weights, control logic, and fallback code are completely hidden. The target cannot be queried interactively; only logs of K previously observed games are available for adaptation. This black-box setting rules out direct probing of, or intervention in, the target's internals.

Methodology — deep read

The methodology proceeds in several steps:

Threat Model & Assumptions: The adversary is an unfamiliar language-based AI agent making decisions in negotiation and bargaining games. The predictor never sees the adversary's prompt, code, or control logic and only accesses public game state, offer history, and dialogue text. Private information such as private valuations is masked when appropriate.
Data: Training uses a source population of 13 frontier LLM agents from the GLEE round-robin tournament dataset comprising ~64,000 games and 197,000 decisions across bargaining and negotiation settings. The target population is a newly introduced university hackathon dataset with 91 scaffolded agents (student-built variants differing in prompting pipelines and control logic), with 4,921 games and 11,341 decisions. Both datasets include offers, accept/reject decisions, dialogue messages, and game state variables across varying horizons and information settings.
Architecture: Each decision point is encoded as a tabular row composed of three feature blocks:

Game-state features: structured variables such as current offer, round number, previous offers, public config parameters.
Dialogue features: fixed text embeddings produced by a sentence encoder on the conversation so far.
Observer features: hidden states extracted from a small frozen LLM (Gemma-2-2B, Qwen3-1.7B, or Llama-3.2-1B) prompted with the current game state and dialogue but whose direct answer is discarded. Instead, latent representations from mid-to-late Transformer layers (relative depth 0.6–0.9) form decision-oriented features.
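The three feature blocks above can be pictured as one flat row per decision point. The following is a minimal sketch, not the authors' code: the helper names (`pick_layer`, `mean_pool`, `build_row`), the mean-pooling over tokens, and the 26-layer example are all assumptions made for illustration. In practice the per-token hidden vectors would come from the frozen Observer LLM; here they are plain lists of floats.

```python
from typing import List, Sequence


def pick_layer(num_layers: int, relative_depth: float) -> int:
    """Map a relative depth in [0, 1] to a 1-based Transformer layer index."""
    if not 0.0 <= relative_depth <= 1.0:
        raise ValueError("relative_depth must be in [0, 1]")
    return max(1, min(num_layers, round(relative_depth * num_layers)))


def mean_pool(token_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token hidden vectors into one vector (assumed pooling choice)."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def build_row(
    game_state: Sequence[float],
    dialogue_emb: Sequence[float],
    observer_vec: Sequence[float],
) -> List[float]:
    """Concatenate the three feature blocks into a single tabular row."""
    return list(game_state) + list(dialogue_emb) + list(observer_vec)


# Hypothetical 26-layer Observer: relative depth 0.6-0.9 spans layers 16-23.
first_layer = pick_layer(26, 0.6)
last_layer = pick_layer(26, 0.9)

# Toy hidden states for two tokens with a 2-dimensional hidden size.
observer_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
row = build_row(game_state=[0.5, 3.0], dialogue_emb=[0.1], observer_vec=observer_vec)
```

Keeping the blocks as separate arguments makes feature ablations (dropping the dialogue or Observer block) a one-line change.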

A tabular foundation model (TabPFN v2.6) performs target-adaptive prediction by conditioning on labeled source rows plus K labeled games of the target agent, allowing rapid adaptation. It supports both classification mode (response prediction) and regression mode (proposal prediction).

Training Regime: The tabular model is fit on up to 3,000 decisions randomly sampled from the source agents' games, balanced across agents. Adaptation augments this set with the labeled decisions from the target's K previously observed games. The Observer LLMs stay frozen, with no gradient updates. Five random seeds are used for robustness.
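One way to read this regime is as the construction of a labeled context set. The sketch below is an illustration under stated assumptions: rows are dicts with hypothetical `agent` and `game` keys, balancing is done round-robin across agents, and "K observed games" is taken to mean games with index below K. The paper's exact sampling procedure may differ.

```python
import random
from collections import defaultdict


def sample_balanced(source_rows, budget=3000, seed=0):
    """Draw up to `budget` rows, cycling over agents so each is represented evenly."""
    rng = random.Random(seed)
    by_agent = defaultdict(list)
    for row in source_rows:
        by_agent[row["agent"]].append(row)
    for rows in by_agent.values():
        rng.shuffle(rows)  # random order within each agent
    agents = sorted(by_agent)
    picked = []
    while len(picked) < budget and any(by_agent[a] for a in agents):
        for agent in agents:
            if by_agent[agent] and len(picked) < budget:
                picked.append(by_agent[agent].pop())
    return picked


def build_context(source_rows, target_rows, k, budget=3000, seed=0):
    """Context = balanced source sample + decisions from the target's first k games."""
    adaptation = [row for row in target_rows if row["game"] < k]
    return sample_balanced(source_rows, budget, seed) + adaptation


# Toy data: 3 source agents with 10 decisions each, one target with 4 games.
source = [{"agent": a, "game": g} for a in "abc" for g in range(10)]
target = [{"agent": "t", "game": g} for g in range(4)]
context = build_context(source, target, k=2, budget=6)
```

With K=0 the context contains source rows only, which corresponds to the zero-shot end of the adaptation range.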
Evaluation: Metrics are AUC for binary response prediction (accept/reject) and R2 for proposal regression over normalized offers. Evaluation is cross-population: models are fit on the frontier-LLM tournament and tested on held-out hackathon agents, which measures generalization to scaffolded agents with different prompts and control logic. K varies from 0 to 16 adaptation games. Baselines are the tabular model without Observer features and LLM-as-Predictor direct prompting (Gemini 2.5 Flash).
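For readers less familiar with the two metrics, here is a small self-contained sketch using their standard definitions (pairwise-ranking AUC and the coefficient of determination). The function names are illustrative; this is not the paper's evaluation code.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when predictions are worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


accept_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])  # 3 of 4 pairs ranked correctly
perfect_fit = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
reversed_fit = r2([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The last case shows how R2 can go negative, which is what a negative median R2 for a predictor indicates: on those targets, predicting the mean offer would have done better.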
Reproducibility: Code and the 91-agent hackathon dataset of 4,921 games will be released upon acceptance. The GLEE tournament dataset is public. Observers are frozen pre-trained LLMs.

Example end-to-end: For a new hackathon agent (target), K prior games with labeled decisions are provided. At a new decision point consisting of public game state and dialogue, the system:

Extracts game-state features
Encodes dialogue with sentence embeddings
Passes game state + dialogue through frozen Observer LLM, extracts latent hidden states
Forms a multimodal tabular feature row
Performs target-adaptive inference with the TabPFN model, conditioned on the source population and the target's K examples, to predict either the response (accept/reject) or the next proposed offer.
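The final conditioning step can be illustrated without any trained model. The sketch below deliberately swaps in a k-nearest-neighbour vote as a stand-in for the tabular foundation model; it is not TabPFN and not the authors' method. It only shows the interface: labeled source rows and labeled target rows are concatenated into one context, and the query row is scored against that context with no gradient updates. All names and the toy feature values are hypothetical.

```python
import math


def knn_accept_probability(context, query_row, k=3):
    """Stand-in predictor: share of 'accept' labels among the k nearest context rows."""
    ranked = sorted(context, key=lambda item: math.dist(item[0], query_row))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


def predict_response(source_context, target_context, query_row, k=3):
    """Condition on both populations by concatenating their labeled rows."""
    return knn_accept_probability(source_context + target_context, query_row, k)


# Each context item is (feature_row, label) with label 1 = accept, 0 = reject.
source_context = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([1.0, 1.0], 1)]
target_context = [([0.9, 1.0], 1), ([1.0, 0.9], 1)]
query = [1.0, 1.0]

without_adaptation = predict_response(source_context, [], query)
with_adaptation = predict_response(source_context, target_context, query)
```

In this toy case the target's own examples shift the predicted accept probability, which is the qualitative effect that few-shot adaptation is meant to capture.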

This approach leverages population-level generalization alongside target-specific adaptation without fine-tuning the LLM, and achieves superior accuracy and calibrated predictions compared to direct prompting alone.

Technical innovations

Formulating prediction of unfamiliar strategic LLM-based agents as a target-adaptive text-tabular learning problem combining structured game features, dialogue text embeddings, and decision-oriented LLM hidden representations.
Introducing LLM-as-Observer: using a small frozen LLM to encode decision-time game state and dialogue into latent hidden states that serve as features rather than as direct few-shot predictors.
Employing a tabular foundation model (TabPFN) to perform few-shot target adaptation by conditioning on the source population data plus the target’s K labeled games, enabling robust cross-population transfer.
Demonstrating that frozen LLM hidden states provide richer, more reliable decision signals than the LLM’s prompted output logits for downstream prediction tasks.

Datasets

GLEE frontier LLM tournament — ~64,000 games, 197,000 decisions — public benchmark
University hackathon scaffolded agents — 4,921 games, 11,341 decisions — newly introduced dataset from 91 student teams with engineered control logic and prompting

Baselines vs proposed

Game+text features tabular baseline: bargaining response AUC at K=16 = 0.791 vs full Observer model = 0.831
Game+text features tabular baseline: negotiation response AUC at K=16 = 0.803 vs full Observer model = 0.826
LLM-as-Predictor direct prompting: bargaining response AUC at K=16 = 0.770 vs Observer model = 0.831
LLM-as-Predictor direct prompting: negotiation response AUC at K=16 = 0.785 vs Observer model = 0.826
Game+text features tabular baseline: bargaining proposal R2 at K=16 = 0.622 vs Observer model = 0.676 (14% error reduction)
LLM-as-Predictor direct prompting: bargaining proposal R2 at K=16 = -0.3 (worse than always predicting the mean offer)
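The headline numbers above can be cross-checked with a few lines of arithmetic. This sketch assumes the 14% figure is a relative reduction; it computes that reduction both from the dollar errors quoted in the key findings and from the unexplained variance (1 - R2). The variable names are illustrative.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Dollar error on a $10,000 nominal split: $552 -> $473.
dollar_error_cut = relative_reduction(552, 473)

# Unexplained variance: 1 - 0.622 -> 1 - 0.676.
unexplained_cut = relative_reduction(1 - 0.622, 1 - 0.676)

# Response AUC gains in percentage points at K=16.
bargaining_auc_gain = (0.831 - 0.791) * 100
negotiation_auc_gain = (0.826 - 0.803) * 100

print(f"dollar error reduction: {dollar_error_cut:.1%}")      # about 14.3%
print(f"unexplained variance cut: {unexplained_cut:.1%}")     # about 14.3%
print(f"AUC gain, bargaining: {bargaining_auc_gain:.1f} pts")   # about 4.0
print(f"AUC gain, negotiation: {negotiation_auc_gain:.1f} pts") # about 2.3
```

Both readings of the error reduction land near 14%, and the roughly 4-point AUC gain corresponds to the bargaining task.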

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12411.

Fig 1: Alice (seller) and Bob (buyer) negotiate via free-text offers. Following Bob’s $5,000

Limitations

Evaluation only on synthetic negotiation and bargaining games (GLEE benchmark), which may differ from real-world marketplace interactions in complexity and noise.
The Observer LLMs are relatively small (1–2B parameters), so results might differ with larger or differently structured models.
Transfer evaluated across two axes of heterogeneity (LLM variation vs scaffolding), but other sources of agent variation remain untested.
No reported adversarial evaluation where the target agents deliberately try to deceive the predictor or obfuscate strategies.
The approach depends on the availability of K prior games of the target; performance degrades with fewer or zero adaptation examples.
In private-information configurations, valuations are masked from the predictor; real settings may expose richer signals that are unavailable here.

Open questions / follow-ons

How well do these target-adaptive text-tabular models scale to more complex environments with richer communication protocols or more than two agents?
Can this approach be extended to continuous adaptation settings where the target agent’s behavior evolves over time or in response to the predictor?
What are the impacts of adversarial targeting or deceptive communication by the opponent agent on prediction accuracy?
Could incorporating other modalities (e.g., speech) or more fine-grained latent state extraction from larger LLMs yield further gains?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, this paper illustrates a rigorous methodology for anticipating the likely next actions of automated agents communicating in natural language—particularly when those agents are unknown or black boxes. The target-adaptive text-tabular modeling framework, especially the LLM-as-Observer representation, enables accurate few-shot predictions by combining structured contextual signals with latent representations extracted from observing LLMs. Such predictive capabilities may be leveraged in bot-detection schemes that must classify or anticipate behavior of language-based bots interacting through dialogue or commerce interfaces. The approach’s robustness to variation in internal agent design and prompt-engineering is encouraging for real-world settings where internal models are hidden. In contrast to directly prompting large LLMs to predict bot outputs, extracting intermediate hidden states and combining them with rich game-state and dialogue features yields better-calibrated, adaptable predictions. However, practical adoption would require adapting to more diverse real-world agent heterogeneity and potentially adversarial tactics.

Cite

@article{arxiv2605_12411,
  title={Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling},
  author={Eilam Shapira and Moshe Tennenholtz and Roi Reichart},
  journal={arXiv preprint arXiv:2605.12411},
  year={2026},
  url={https://arxiv.org/abs/2605.12411}
}
