Counterintuitive problems in discrete probability

Source: arXiv:2606.07516 · Published 2026-06-05 · By Luca Avena, Gianmarco Bet, Bernardo Busoni

TL;DR

This manuscript compiles a carefully curated set of counterintuitive and challenging problems in discrete probability, along with fully worked solutions. It was created as part of a research project examining how state-of-the-art large language models (LLMs) reason through discrete probability tasks, in particular whether LLMs make systematic mistakes that mimic known human cognitive biases. The dataset combines classical paradoxes, puzzles from recreational mathematics, and novel problems designed to foil heuristic intuition and force rigorous mathematical reasoning. The authors rephrase existing problems to prevent models from simply recalling known answers, aiming to elicit genuine reasoning. The problems are presented mostly as open-ended questions to avoid leading cues and to better capture reasoning capabilities. The detailed solutions clarify why the intuitive answers fail and why the mathematically rigorous ones hold. This publicly available collection serves as a transparent reference for the exact problems used in the authors' LLM evaluation experiments, but may also advance research on cognitive biases, probabilistic reasoning, and AI model evaluation more broadly.

The dataset includes problems exploring a diverse range of fundamental probabilistic phenomena such as conditioning, sequence patterns, Bayesian updates, paradoxes, stopping rules, and adversarial guessing games. The solutions integrate empirical calculations, combinatorial arguments, Markov chains, and Bayes’ theorem. For example, problems like "Coins Under Cups" compute exact win probabilities involving subtle dependencies. Others, such as "An Unfair Coin in a Haystack", apply Bayes’ theorem to large but skewed prior spaces. Some problems are variants of well-known paradoxes (e.g., Monty Hall variants with nonstandard prizes).
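The Bayesian update on a large, skewed prior space can be illustrated with a minimal sketch. The setup below is hypothetical and is not the paper's "An Unfair Coin in a Haystack" problem: it assumes one two-headed coin hidden among 999 fair coins, a coin drawn uniformly at random, and 10 tosses that all land heads. The function name `posterior_two_headed` is illustrative only.

```python
from fractions import Fraction


def posterior_two_headed(n_coins: int, n_heads: int) -> Fraction:
    """Posterior probability that the drawn coin is the two-headed one.

    Hypothetical setup (an assumption, not taken from the paper):
    exactly one of `n_coins` coins is two-headed, the rest are fair,
    and the drawn coin shows heads on all `n_heads` tosses.
    """
    prior_biased = Fraction(1, n_coins)
    prior_fair = 1 - prior_biased

    # Likelihood of observing only heads under each hypothesis.
    like_biased = Fraction(1)
    like_fair = Fraction(1, 2) ** n_heads

    # Law of total probability for the evidence, then Bayes' theorem.
    evidence = prior_biased * like_biased + prior_fair * like_fair
    return prior_biased * like_biased / evidence


if __name__ == "__main__":
    post = posterior_two_headed(n_coins=1000, n_heads=10)
    print(post, round(float(post), 4))  # 1024/2023 0.5062
```

Even after ten heads in a row, the posterior in this assumed setup is only about one half, because the tiny prior 1/1000 nearly cancels the likelihood ratio of 1024.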

The work provides formalized, reproducible reasoning walkthroughs and precise probabilistic computations that expose common pitfalls of intuition and heuristic shortcuts. This makes it well suited to evaluating AI probabilistic reasoning in settings that resist superficial pattern matching or memorization.

Key findings

  • Reformulated classical paradox and puzzle problems designed to disrupt heuristic reasoning and expose cognitive biases in probability.
  • Precise probability solutions reported for 20+ problems; e.g., 'Coins Under Cups' yields P(Alice)/P(Bob) ≈ 0.75, 'Head-Head or Head-Tails?' ratio ≈ 0.94.
  • Bayesian inference applied to large prior spaces yields counterintuitive posterior probabilities, e.g., biased coin identification at ~0.48 probability.
  • Sequence pattern race problem yields differing win probabilities (Alice:Bob ≈ 1.25) arising from Markov chain analysis of coin toss orderings.
  • Complex combinatorial problems solved exactly, e.g., probabilities for multiple animal selection and conditional events (Pen of Cows and Sheep problem at 2/3 conditional probability).
  • Open-ended problem formats to circumvent memorization biases and better test genuine reasoning.
  • Mathematical reasoning steps include probabilistic conditioning, the law of total probability, Bayes’ theorem, and solving linear systems derived from state transitions.
  • Probabilistic modeling reveals the fragility of intuitive judgements, showing that switching choices is sometimes counterproductive (as in a Monty Hall variant).
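The "linear systems derived from state transitions" technique can be sketched on a small pattern race. The patterns used here (HH versus TH with a fair coin) are a hypothetical textbook case, not the paper's sequence race problem, whose patterns are not given in this summary. The function `pattern_race` and its parameters are illustrative names, and the sketch assumes that neither pattern contains the other.

```python
from fractions import Fraction


def pattern_race(a: str, b: str, p_heads: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability that pattern `a` appears before `b` in i.i.d. coin tosses."""
    if a in b or b in a:
        raise ValueError("sketch assumes neither pattern contains the other")

    # States: every prefix of either pattern; "" is the start state.
    states = sorted({s[:k] for s in (a, b) for k in range(len(s) + 1)})
    index = {s: i for i, s in enumerate(states)}

    def step(state: str, toss: str) -> str:
        # Keep the longest suffix of the history that is a pattern prefix.
        t = state + toss
        while t not in index:
            t = t[1:]
        return t

    # Linear system M x = rhs with x[s] = P(a wins | current state s).
    n = len(states)
    M = [[Fraction(0)] * n for _ in range(n)]
    rhs = [Fraction(0)] * n
    for s, i in index.items():
        M[i][i] = Fraction(1)
        if s == a:
            rhs[i] = Fraction(1)  # absorbing state: a has appeared
        elif s != b:              # b is absorbing with value 0
            for toss, prob in (("H", p_heads), ("T", 1 - p_heads)):
                M[i][index[step(s, toss)]] -= prob

    # Exact Gauss-Jordan elimination over the rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
                rhs[r] -= f * rhs[col]

    start = index[""]
    return rhs[start] / M[start][start]


if __name__ == "__main__":
    print(pattern_race("HH", "TH"))    # 1/4
    print(pattern_race("HTH", "HHT"))  # 1/3
```

In this assumed example the two patterns have the same length, yet TH wins three times out of four, which is the kind of asymmetry that simple guessing misses.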

Threat model

The adversary is the large language model (LLM) under evaluation, which attempts to solve discrete probability problems. The LLM may draw on patterns or memorized solutions but has no direct knowledge of the problem rephrasings, which are designed to block retrieval. The model cannot access external sources or unseen auxiliary data during evaluation and is tested purely on its internal reasoning capability. The evaluation aims to detect systematic errors linked to known human cognitive biases rather than adversarial attacks.

Methodology — deep read

The paper's methodology centers on compiling a dataset of counterintuitive discrete probability problems carefully reconstructed or newly developed to prevent direct recognition by large language models (LLMs), hence requiring genuine probabilistic reasoning rather than memorization.

  1. Threat model & assumptions: The adversary is the LLM being evaluated, which attempts to solve discrete probability problems. It is assumed that the LLM sees only problem statements reformulated to inhibit pattern matching and retrieval of known textbook answers. No assumptions are made about the LLM’s internal architecture or training data; the focus is on externally observable reasoning errors linked to cognitive heuristics.

  2. Data: The dataset consists of over 20 discrete probability problems sourced from classical paradox literature, cognitive bias research, recreational math, and the authors’ own constructions. Problems are mostly open-ended to avoid the leading cues of multiple-choice formats. They are mathematically elementary but deliberately subtle, emphasizing probabilistic conditioning, independence, and Markov chains. Provenance: publicly released via arXiv as a transparent reference.

  3. Architecture/algorithm: The authors do not propose a new model but use detailed human mathematical solutions as the benchmark. Each solution works through precise probabilistic calculations: combinatorial enumerations, applications of Bayes’ theorem, absorbing-state Markov chains, and careful logical conditioning.

  4. Training regime: N/A for model training; the focus is on dataset and problem construction to evaluate LLM behavior. The manuscript refers to experimental LLM evaluations in the parent research project but does not detail them.

  5. Evaluation protocol: The dataset is intended as a benchmark for evaluating LLMs’ probabilistic reasoning. Metrics in companion evaluation work include correctness and answer ratio comparisons. This document provides detailed baseline solutions to ensure clarity in evaluation.

  6. Reproducibility: The dataset is public and open-access, with all problem statements and human-made step-by-step solutions. No frozen weights or restricted datasets are involved. The solutions spell out the probability calculations explicitly, allowing full reproducibility.

Example: For the 'Coins Under Cups' problem, the solution defines sequences of Bernoulli random variables for coin outcomes under two observation orders, partitions events based on stopping times when a player first sees two heads, then sums event probabilities using binomial and combinatorial terms to yield P(Alice wins) ≈ 0.3120 and P(Bob wins) ≈ 0.4169, giving a ratio P(Alice wins)/P(Bob wins) = 0.3120/0.4169 ≈ 0.75. This detailed decomposition exemplifies the authors' methodology, which combines formal probabilistic modeling with combinatorial enumeration to derive precise results that run counter to simplistic intuition.

Technical innovations

  • Compilation of a diverse, rigorously solved benchmark dataset of discrete probability problems designed specifically to break heuristic reasoning and test true probabilistic understanding in AI.
  • Problem reformulation strategy that prevents direct recognition by LLMs, forcing genuine reasoning rather than retrieval.
  • Extensive human-crafted detailed solutions showcasing stepwise probabilistic derivations, applications of Bayes’ theorem, and Markov chain analysis for complex state transitions.
  • Use of open-ended question format rather than multiple-choice to eliminate leading cues, better assessing model reasoning capabilities.

Datasets

  • Counterintuitive Discrete Probability Problems Dataset — 20+ curated problems with detailed human solutions — publicly available via arXiv preprint arXiv:2606.07516

Baselines vs proposed

  • Classical reasoning heuristic baseline: often fails on these problems by producing intuitive but mathematically incorrect results.
  • Human derived probabilities: 'Coins Under Cups' ratio P(Alice wins)/P(Bob wins) ≈ 0.75 vs naive intuition expecting equality or reversed ordering.
  • Sequence pattern race: P(Alice wins)/P(Bob wins) = 1.25 from Markov chain analysis, whereas simple guessing would suggest ≈ 1.
  • Bayesian update for unfair coin identification: posterior probability ~0.48 of biased coin presence when nine coins showed heads repeatedly, contradicting naive certainty.
  • Fruit market Bayesian inference: probability apple is green given man's claim = 17/44 ≈ 0.39, contrasting with naive trust in claim.

Limitations

  • No adversarial robustness analysis on how LLMs might strategically exploit problem formulations or distributions.
  • Dataset limited to discrete probability, excluding continuous or more complex probabilistic reasoning challenges.
  • Unclear how well problems generalize beyond reformulated classical paradoxes and constructed puzzles to real-world probabilistic tasks.
  • No extensive empirical evaluation results for LLM performance reported in this manuscript itself; relies on a companion research context.
  • Open-ended question format may be harder to automatically evaluate using simple accuracy metrics.

Open questions / follow-ons

  • How do different LLM architectures and training datasets affect systematic errors on these counterintuitive problems?
  • Can automated reasoning systems reliably generalize beyond this curated dataset to more complex or continuous probability domains?
  • What are effective methods to prompt or constrain LLMs to avoid heuristic biases exposed by these problems?
  • How do human subjects’ error patterns compare quantitatively to LLM error patterns on this dataset?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this dataset and corresponding analysis highlight the subtle ways probabilistic heuristics and intuitive but incorrect reasoning can fail — a critical insight when designing challenges that aim to differentiate between genuine human reasoning and superficial pattern matching by automated systems. Understanding probabilistic paradoxes helps build test scenarios that are easy for genuine humans familiar with logical reasoning but difficult for bots relying on shallow heuristic shortcuts, especially when models may rely on memorized patterns rather than true inference.

Practitioners can leverage the detailed solutions and problem structures to design CAPTCHAs or challenge-response tests that require nontrivial probabilistic reasoning, conditioning, and Bayesian inference rather than simple recognition or rote computation. This could be pertinent when evaluating systems that claim to use probabilistic or reasoning-based bot detection components, ensuring these systems themselves do not fall victim to common heuristic biases internally. Moreover, such problems illustrate the importance of open-ended challenges and nuanced conditional probability setups that resist trivial automation.

Cite

bibtex
@article{arxiv2606_07516,
  title={Counterintuitive problems in discrete probability},
  author={Luca Avena and Gianmarco Bet and Bernardo Busoni},
  journal={arXiv preprint arXiv:2606.07516},
  year={2026},
  url={https://arxiv.org/abs/2606.07516}
}

Articles are CC BY 4.0 — feel free to quote with attribution