Reasoning with Sampling: Cutting at Decision Points

Source: arXiv:2605.30327 · Published 2026-05-28 · By Felix Zhou, Anay Mehrotra, Quanquan C. Liu

TL;DR

This paper addresses the challenge of efficient reasoning with large language models by sampling from a sharpened power distribution of reasoning traces. Prior work showed that such power sampling, which upweights high-probability base model outputs, can elicit strong reasoning abilities without additional training but suffers from inefficient sampling due to uniform random truncation points. The authors propose Entropy-Cut Metropolis–Hastings (Entropy-Cut MH), a training-free Markov chain Monte Carlo (MCMC) sampler that leverages next-token entropy jumps in the base model as a proxy to identify and resample at critical decision points of the reasoning trace, instead of uniform cuts. This targeted cutting enables the sampler to move efficiently between qualitatively distinct reasoning strategies, improving mixing time and sample quality.

The authors analyze Entropy-Cut MH under a reasoning-tree model in which decision points correspond to high-entropy branches, and prove that the mixing time scales with the number of semantic decisions (k) rather than the total token length (T), a substantial speedup when k ≪ T. Empirically, across several language models (Qwen and Phi variants) and reasoning benchmarks (MATH500, HumanEval, GPQA Diamond, AIME26), Entropy-Cut MH outperforms uniform-cut MH and other baselines, including low-temperature sampling, Sequential Monte Carlo, and Twisted Monte Carlo. It gains up to 36 percentage points of accuracy over standard sampling on MATH500 and produces higher-likelihood samples without sacrificing diversity. Overall, the method offers a practical, general way to extract existing reasoning capabilities from pretrained models at test time.
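To make the distinction between the sharpened power distribution over whole traces and plain token-level low-temperature sampling concrete, here is a minimal sketch on a toy two-step model. All probabilities and names (`first`, `after_b`, `ALPHA = 2.0`) are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

ALPHA = 2.0  # hypothetical sharpening power, for illustration only


def sharpen(p, alpha):
    """Raise probabilities to the power alpha and renormalize."""
    w = np.asarray(p, dtype=float) ** alpha
    return w / w.sum()


# Toy base model: the first token is A or B; A has one continuation,
# B splits evenly between two continuations.
first = np.array([0.4, 0.6])      # p(A), p(B)
after_b = np.array([0.5, 0.5])    # p(b1 | B), p(b2 | B)

# Probabilities of the three complete traces: A, B+b1, B+b2.
traces = np.array([first[0], first[1] * after_b[0], first[1] * after_b[1]])

# Power distribution: sharpen the probability of each whole trace.
power = sharpen(traces, ALPHA)

# Token-level low temperature: sharpen each conditional separately.
lt_first = sharpen(first, ALPHA)
lt_after_b = sharpen(after_b, ALPHA)
low_temp = np.array([lt_first[0],
                     lt_first[1] * lt_after_b[0],
                     lt_first[1] * lt_after_b[1]])

print("power distribution:", power)
print("token-level low temperature:", low_temp)
```

In this toy case the two distributions disagree on how much mass trace A receives, which is why a low-temperature proposal needs a Metropolis–Hastings correction to target the power distribution.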

Key findings

Entropy jumps in the base model's next-token distribution are a reliable proxy for decision points in reasoning traces, confirmed by 1.33× higher normalized edit distance and 1.36× higher distinct-answer fraction when resampling at high-entropy jump positions.
Theoretical analysis under a reasoning-tree model shows that the mixing time of Entropy-Cut MH scales as O(k), where k is the number of semantic decisions, whereas that of uniform-cut MH scales as Ω(T/b_1), the token depth T divided by the position b_1 of the earliest decision, which can be much larger (Theorem 4.1).
On MATH500 with Qwen2.5-7B, Entropy-Cut MH improves accuracy from 35.9% (standard) and 67.4% (uniform-cut MH) to 71.9%.
Entropy-Cut MH outperforms SMC, TMC, and uniform-cut MH baselines across multiple models (Qwen2.5-Math-7B, Qwen3-8B-Base, Phi-3.5-mini-instruct, Phi-4-mini-instruct) and datasets with improvements of up to +5.6 percentage points over uniform-cut MH.
The algorithm yields reasoning traces with higher base-model log-likelihoods than the baselines (Figure 4), indicating samples that follow the power distribution more closely.
Entropy-Cut MH maintains sample diversity comparable to standard sampling despite substantially improving pass@1 accuracy.
The method requires no additional training, curated datasets, or verifiers, relying solely on base model entropies and Metropolis–Hastings correction.
Computationally, entropy calculations add negligible cost since entropies come from standard forward passes used for acceptance ratios.
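The last point can be sketched as follows: given the logits a forward pass already produces, the next-token entropy at every position is a cheap reduction. This is a minimal NumPy illustration, not the authors' code; the function name `next_token_entropies` and the `(T, V)` logit layout are assumptions.

```python
import numpy as np


def next_token_entropies(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (T, V), one row of unnormalized scores per position.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax: subtract the row maximum first.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_probs = shifted - log_z
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


# Hypothetical logits: a maximally uncertain position and a near-certain one.
demo = np.array([[0.0, 0.0, 0.0, 0.0],
                 [100.0, 0.0, 0.0, 0.0]])
H = next_token_entropies(demo)
print(H)
```

The uniform row reaches the maximum entropy ln 4, while the peaked row is close to zero, which is the contrast the sampler relies on.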

Threat model

The paper does not address an adversarial security threat. It assumes access to a pretrained base language model whose weights and outputs the sampler can query. The only "adversary" is the difficulty of sampling correctly from a complex, sharpened distribution over reasoning traces; no active attacker or model manipulation is assumed.

Methodology — deep read

Threat Model & Assumptions: There is no adversary in this work. The approach is a test-time method that uses only base-model probabilities and requires no additional training. The key assumption is that the base model's next-token entropy reliably indicates high-level decision points in reasoning traces, that is, positions where the semantics of possible continuations diverge sharply.
Data: Evaluation datasets include MATH500 (competitive math problems), HumanEval (code generation), GPQA Diamond (graduate-level science questions), and AIME26 (math olympiad problems), covering diverse reasoning tasks. Each dataset is evaluated in a zero-shot, single-attempt setting. Details on dataset splits and preprocessing are in the paper's appendix.
Architecture / Algorithm: The core algorithm is Entropy-Cut Metropolis–Hastings (MH), an MCMC sampler that targets the power distribution Π_T(x) ∝ p(x)^α, where p(x) is the base model's probability of trace x and α > 1 is the sharpening power.
The sampler iteratively proposes a new trace by cutting the current token sequence at a position m and resampling the suffix from a low-temperature proposal distribution p_prop (temperature 1/α). The novelty lies in the cut distribution λ_β(m; x) ∝ Δ_m(x)^β, where Δ_m(x) = max(0, h_m − h_{m−1}) is the positive entropy jump and h_t is the next-token entropy given the prefix x_{<t}.
Entropy-Cut MH thus concentrates proposals on tokens near semantic decision points (large Δ_m), enabling transitions between different reasoning strategies (e.g., proof by induction vs. proof by contradiction), whereas uniformly chosen cut positions mostly revise local details.
Training Regime: No training is involved; the method operates at inference time on pretrained base models. The hyperparameters are the power α = 4.0 (α = 5.0 for HumanEval), the entropy-cut exponent β = 4.0, block sizes T and B chosen per dataset, and N_MCMC = 10 MCMC steps per stage.
Evaluation Protocol: Performance metrics are accuracy on reasoning datasets, pass@k accuracy for code generation, and analysis of trace likelihoods. Baselines include standard sampling, low-temperature sampling, Sequential Monte Carlo (SMC), Twisted Monte Carlo (TMC), and uniform-cut Metropolis–Hastings. Multiple independent runs (8 for most datasets, 64 for AIME26) are averaged. Computational cost is measured using a single H200 GPU per run. Figures illustrate log-likelihood distributions and diversity metrics.
Reproducibility: The authors provide detailed pseudocode (Algorithms 1 and 2) and clear hyperparameter settings. The SMC and TMC baselines were reimplemented because no public code was available. Dataset evaluation protocols follow established benchmarks. The text reviewed here does not mention a public code release.
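The cut distribution described under Architecture / Algorithm can be sketched in a few lines. This is an illustration of the stated formula λ_β(m; x) ∝ Δ_m(x)^β, not the authors' implementation. Two choices here are assumptions: the first position gets a jump of zero (it has no predecessor), and the function falls back to a uniform distribution when no jump is positive. The entropy values are hypothetical.

```python
import numpy as np


def entropy_cut_distribution(entropies, beta=4.0):
    """Probability of cutting at each position, proportional to (entropy jump)^beta."""
    h = np.asarray(entropies, dtype=float)
    jumps = np.zeros_like(h)
    # Positive entropy jump: max(0, h_m - h_{m-1}); position 0 is assumed to have none.
    jumps[1:] = np.maximum(0.0, h[1:] - h[:-1])
    weights = jumps ** beta
    total = weights.sum()
    if total == 0.0:
        # Assumption: with no positive jump anywhere, cut uniformly at random.
        return np.full(h.size, 1.0 / h.size)
    return weights / total


# Hypothetical per-position entropies with two spikes, at positions 2 and 5.
entropies = [0.1, 0.1, 1.5, 0.3, 0.2, 0.9, 0.1]
lam = entropy_cut_distribution(entropies, beta=4.0)
print(lam.round(4))
```

With β = 4, a jump twice as large receives 16 times the weight, so nearly all cuts land on the largest spike while smaller spikes remain reachable.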

Concrete Example End-to-End: For a math problem from MATH500, the method generates a reasoning trace conditioned on the prompt. The base language model computes next-token distributions and their entropies. The sampler calculates entropy jumps ∆t along the trace, selects cut positions biased toward large ∆t, and proposes new suffixes from those points using a low-temperature version of the base model. Metropolis–Hastings acceptance criteria ensure samples approximate the power distribution. Over iterations, this enables exploration of qualitatively distinct proof strategies instead of minor edits. Resulting samples have higher probabilities under the sharpened distribution and yield improved accuracy on the final answer.
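The loop described above can be sketched on a toy model. Everything in this example is an assumption made for illustration: the "base model" is a hypothetical three-token Markov chain, a small `FLOOR` is added to the cut weights so that every position keeps a nonzero cut probability, and the acceptance ratio is the standard Metropolis–Hastings ratio for a state-dependent cut choice, which may differ in detail from the paper's Algorithms 1 and 2.

```python
import numpy as np

ALPHA, BETA, T, V = 4.0, 4.0, 6, 3
FLOOR = 1e-3  # assumption: keeps every cut position reachable

# Hypothetical base model: a first-order Markov chain over V tokens.
INIT = np.array([0.6, 0.3, 0.1])
TRANS = np.array([[0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33],
                  [0.10, 0.80, 0.10]])


def sharpen(p):
    """Low-temperature version of p (temperature 1/ALPHA), normalized per row."""
    w = p ** ALPHA
    return w / w.sum(axis=-1, keepdims=True)


PROP_INIT, PROP_TRANS = sharpen(INIT), sharpen(TRANS)


def step_dists(x, init, trans):
    """Row t holds the next-token distribution at position t given x[:t]."""
    return np.vstack([init] + [trans[tok] for tok in x[:-1]])


def suffix_log_prob(x, m, init, trans):
    """Log-probability of the tokens x[m:] given the prefix x[:m]."""
    d = step_dists(x, init, trans)
    return float(np.log(d[np.arange(m, len(x)), x[m:]]).sum())


def cut_probs(x):
    """Cut distribution: (positive entropy jump)^BETA plus FLOOR, normalized."""
    d = step_dists(x, INIT, TRANS)
    h = -(d * np.log(d)).sum(axis=1)
    w = np.full(len(x), FLOOR)
    w[1:] += np.maximum(0.0, h[1:] - h[:-1]) ** BETA
    return w / w.sum()


def resample_suffix(x, m, rng):
    """Copy x and redraw tokens m..T-1 from the low-temperature proposal."""
    y = x.copy()
    for t in range(m, len(y)):
        dist = PROP_INIT if t == 0 else PROP_TRANS[y[t - 1]]
        y[t] = rng.choice(V, p=dist)
    return y


def mh_step(x, rng):
    """One cut-and-resample proposal followed by an accept/reject decision."""
    lam_x = cut_probs(x)
    m = int(rng.choice(len(x), p=lam_x))
    y = resample_suffix(x, m, rng)
    lam_y = cut_probs(y)
    # log of [pi(y) * lam(m; y) * q(x suffix)] / [pi(x) * lam(m; x) * q(y suffix)]
    log_ratio = (
        ALPHA * (suffix_log_prob(y, m, INIT, TRANS) - suffix_log_prob(x, m, INIT, TRANS))
        + np.log(lam_y[m]) - np.log(lam_x[m])
        + suffix_log_prob(x, m, PROP_INIT, PROP_TRANS)
        - suffix_log_prob(y, m, PROP_INIT, PROP_TRANS)
    )
    if np.log(rng.random()) < log_ratio:
        return y, True
    return x, False


def run_chain(n_steps, seed=0):
    rng = np.random.default_rng(seed)
    x = resample_suffix(np.zeros(T, dtype=int), 0, rng)  # initial trace
    accepted = 0
    for _ in range(n_steps):
        x, ok = mh_step(x, rng)
        accepted += ok
    return x, accepted


final_trace, n_accepted = run_chain(200, seed=0)
print(final_trace, n_accepted)
```

Because the prefix before the cut is shared by the current and proposed traces, only the suffix terms appear in the ratio. The cut probability at m is evaluated on both traces because its normalizer depends on the entropy jumps in the resampled suffix.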

Technical innovations

Entropy-Cut MH sampler uses the base model's next-token entropy jump as a proxy for semantic decision points, focusing proposals at those positions rather than uniform random cuts.
Theoretical proof that Entropy-Cut MH mixing time scales with the number of semantic decisions (k) instead of total token length (T), enabling significantly faster convergence in reasoning traces with few key decisions.
Fully training-free, dataset-free Metropolis–Hastings sampler that retains the target power distribution, requiring only base model access and entropy computations already produced during forward passes.
Empirical demonstration that entropy jumps mark branching points that produce more diverse and distinct reasoning trajectories upon suffix resampling.

Datasets

MATH500 — 500 problems — public competitive math dataset
HumanEval — 164 coding problems — open-source code generation benchmark
GPQA Diamond — 198 questions — graduate-level science questions
AIME26 — 26 problems — mathematics olympiad problems

Baselines vs proposed

Standard sampling: MATH500 accuracy = 35.9% vs Entropy-Cut MH = 71.9%
Uniform-Cut MH: HumanEval accuracy = 66.8% vs Entropy-Cut MH = 68.9%
SMC (Sequential Monte Carlo): GPQA Diamond accuracy = 28.8% vs Entropy-Cut MH = 30.2%
TMC (Twisted Monte Carlo): AIME26 accuracy = 9.4% vs Entropy-Cut MH = 9.4% (equal best)
Low-Temperature sampling: MATH500 accuracy = 62.3% vs Entropy-Cut MH = 71.9%
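The absolute gains implied by the figures listed above can be checked with a few lines of arithmetic. The numbers are copied from this list; the dictionary layout is only an illustrative choice. Each row uses a different benchmark, so the gains are not comparable across rows.

```python
# (baseline, benchmark) -> (baseline accuracy %, Entropy-Cut MH accuracy %)
reported = {
    ("Standard sampling", "MATH500"): (35.9, 71.9),
    ("Uniform-Cut MH", "HumanEval"): (66.8, 68.9),
    ("SMC", "GPQA Diamond"): (28.8, 30.2),
    ("TMC", "AIME26"): (9.4, 9.4),
    ("Low-Temperature sampling", "MATH500"): (62.3, 71.9),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {key: round(ours - base, 1) for key, (base, ours) in reported.items()}

for (baseline, benchmark), gain in gains.items():
    print(f"{benchmark:13s} vs {baseline:25s} +{gain:.1f} pp")
```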

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30327.

Fig 2: Illustration of a low-conductance

Limitations

The analysis relies on an approximate symmetry assumption in the reasoning-tree model, which may not hold in real-world reasoning traces.
Entropy jump heuristic for decision points may not capture all important semantic decisions, especially in less structured or very noisy traces.
Experiments cover only a handful of math, code, and science reasoning benchmarks; generalization to other tasks is not demonstrated.
The MCMC approach requires multiple iterative sampling steps, potentially increasing inference latency compared to single-pass generation.
SMC and TMC baselines are reimplemented by authors due to lack of public code, which may affect comparison fidelity.
No explicit evaluation under adversarial conditions or distribution shift scenarios to test robustness of the sampler.

Open questions / follow-ons

Can entropy-based cuts be combined with other learned or adaptive heuristics to better identify semantic decision points in varied reasoning domains?
How does Entropy-Cut MH perform on out-of-distribution tasks or with larger-scale models like GPT-4 or PaLM?
Can the approach be extended to multimodal reasoning, or to tasks with mixed-format inputs (e.g., code interleaved with natural language)?
What is the tradeoff between inference latency and accuracy improvements, and can the MCMC sampling be accelerated?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Entropy-Cut MH shows how large language models can be coaxed into producing high-quality multi-step reasoning without costly retraining. Using an internal uncertainty measure (entropy jumps) to locate critical decision points could inspire defenses that detect or perturb the reasoning style of bots. Sampling efficiency also matters for real-time verification, where prompt-response latency is a constraint, although MCMC still costs more than single-pass generation. That said, Entropy-Cut MH is a test-time sampling algorithm for improving reasoning: a model-internal technique, not an attack or defense method in itself. Practitioners might ask whether similar entropy-based heuristics can distinguish human-like from bot-like symbolic reasoning, or guide the design of challenges that demand adaptive reasoning. More broadly, the power-sampling framework shows that seemingly subtle probability sharpening can unlock latent reasoning capabilities, which is relevant when designing CAPTCHAs or interaction protocols that rely on strong semantic comprehension.

Cite

@article{arxiv2605_30327,
  title={Reasoning with Sampling: Cutting at Decision Points},
  author={Felix Zhou and Anay Mehrotra and Quanquan C. Liu},
  journal={arXiv preprint arXiv:2605.30327},
  year={2026},
  url={https://arxiv.org/abs/2605.30327}
}
