To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Source: arXiv:2605.00737 · Published 2026-05-01 · By Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi et al.
TL;DR
This paper addresses a crucial but underexplored aspect of tool-augmented large language models (LLMs): the decision of whether to invoke an external tool for a given task instance. While it is established that tools like web search can enhance LLM capabilities, the utility of calling them is highly task- and instance-dependent, and inappropriate or excessive calls can degrade output quality and incur costs. The authors present a novel, principled framework rooted in decision theory, encompassing three dimensions—necessity (true need), utility (performance gain), and affordability (cost-effectiveness)—to assess LLM tool-calling behavior.
Using this framework, they analyze six open-source LLMs across three real-world question-answering tasks with two different web search tools. They find that LLMs' self-perceptions of tool need and utility are systematically misaligned with the true need and utility derived from an optimal allocation oracle. This misalignment leads to suboptimal tool-calling decisions characterized by unnecessary or harmful calls, poor cost management, and failure to prioritize high-utility instances under budget constraints. To address this, the authors leverage internal LLM hidden states to train lightweight classifiers (latent need and latent utility estimators) that substantially close the gap between actual and optimal tool calling. The proposed controlled tool-calling strategy outperforms models' default self-decisions in accuracy and cost efficiency, though it falls short of oracle-level performance, highlighting the inherent difficulty of modeling tool behavior. This work moves the understanding and optimization of LLM tool use beyond simple aggregate metrics toward a nuanced, principled decision-making paradigm.
Key findings
- Across six models and three QA tasks, optimal tool calling (oracle) outperforms ALWAYS TOOL by up to +0.07 absolute accuracy while using fewer calls (e.g., GPT-OSS-120B on the entity task: oracle 0.83 with 300 calls vs. always 0.78 with 500 calls, Table 1); a sketch of how such an oracle is computed follows this list.
- Models' SELF-DECISION tool calling is suboptimal, lagging oracle accuracy by 0.05 to 0.08 while also using fewer calls, indicating inefficient tool use (Table 1).
- True need (whether the model fails under NO TOOL) and true positive utility (performance gain under ALWAYS TOOL) are correlated, but only imperfectly: about 51% of truly needy cases yield positive utility, while 34% of no-need cases see performance degrade when the tool is called (Fig 2).
- Perceived need and perceived utility, derived from model responses and call behavior, do not align well with true need and utility, causing frequent misjudgments (Fig 5). Larger models are better calibrated, but substantial errors remain.
- Under explicit budget constraints, models prioritize tool calls by utility poorly: the NDCG ranking score drops sharply as the budget allowance increases, indicating weak cost-aware decision making (Fig 6, Fig 14).
- Models tend to systematically violate budget constraints, with some (Gemma, Qwen3-IT) exceeding call budgets even at high per-call costs (Fig 13).
- Lightweight latent estimators trained on LLM hidden states improve alignment with normative tool-calling policies and increase utility gains relative to SELF-DECISION baselines, though still below oracle performance (Section 4.3).
- Results replicate across two different web search backends, indicating generality of findings (Appendix G referenced).
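The oracle referenced above can be reconstructed from per-instance factuality scores under the NO TOOL and ALWAYS TOOL baselines. A minimal sketch, assuming such scores are available as arrays (the function and variable names are this sketch's, not the paper's):

```python
import numpy as np

def oracle_policy(score_no_tool: np.ndarray, score_tool: np.ndarray,
                  budget=None) -> np.ndarray:
    """Per-instance oracle: call the tool only where it strictly helps.

    Under a budget, spend the calls on the largest-gain instances.
    Returns a boolean mask pi with pi[i] = True iff instance i gets a call.
    """
    gain = score_tool - score_no_tool      # true utility per instance
    pi = gain > 0                          # unconstrained oracle decision
    if budget is not None and pi.sum() > budget:
        top = np.argsort(-gain)[:budget]   # highest-gain instances first
        pi = np.zeros(len(gain), dtype=bool)
        pi[top] = gain[top] > 0            # never call where gain <= 0
    return pi

# Oracle accuracy then takes the better outcome per instance:
#   np.where(oracle_policy(s0, s1), s1, s0).mean()
```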
Methodology — deep read
The authors develop a comprehensive framework to analyze and optimize LLM tool-calling decisions along three dimensions: necessity (true need), utility (performance improvement), and affordability (cost-effectiveness).
Threat model and assumptions: The adversary is not explicitly treated here, as this is not a security-focused paper; rather, the focus is on the rational decision-making of LLMs in a cost-constrained environment. Assumptions include availability of external search tools and task instances drawn from realistic QA datasets.
Data: Three datasets are used: the Entity Task (entity-centric descriptive prompts from real user chat logs), the InVivoQuery Task (real factual user queries), and the BFCL Task (Berkeley Function Calling Leaderboard), which evaluates function calling with known ground-truth answers. Each dataset contains roughly 500 instances. Performance is measured via factuality scores judged by a combination of three human annotators and LLM-as-judge methods.
Architecture/algorithm: Tool calling policy π(x) ∈ {0,1} governs whether to call the external tool or not. The LLM produces either y = M(x) (no tool) or y = M(x, r) using tool response r. The authors define three policy baselines: NO TOOL (never call), ALWAYS TOOL (always call), and SELF-DECISION (model autonomously chooses). Key novel components are Latent Need Estimator (LNE) and Latent Utility Estimator (LUE), both lightweight MLP classifiers trained on the LLM's final token hidden representations (h(x)) to predict true need and true utility respectively.
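As a concrete picture of what such a lightweight classifier over final-token hidden states could look like, here is a minimal sketch; the two-layer architecture and hidden width are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class LatentEstimator(nn.Module):
    """Binary probe over a frozen LLM's final-token hidden state h(x).

    The same architecture can serve as the Latent Need Estimator (LNE,
    target: does the model need the tool?) or the Latent Utility Estimator
    (LUE, target: would calling the tool improve the output?).
    """
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),  # single logit for the binary target
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)

# h is, e.g., the last-layer hidden state of the final prompt token; the LLM
# stays frozen, so only this small head is trained, typically with
# nn.BCEWithLogitsLoss against the binarized need/utility labels.
```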
Training regime: Classifiers are trained to predict binarized targets (need: yes/no; utility: positive vs. neutral-or-negative) derived from performance comparisons between the NO TOOL and ALWAYS TOOL setups. Training details such as batch size, epochs, and optimizer are not fully specified in the truncated text, but training is cheap since the inputs are hidden states from frozen LLMs.
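A sketch of how those binarized targets could be derived from the two baseline runs; the exact thresholding is not spelled out in the text, so the need criterion below (NO TOOL score under 0.5) is an assumption:

```python
def make_targets(score_no_tool, score_tool, need_threshold=0.5):
    """Binarize per-instance need and utility labels from baseline scores."""
    # Need: the model scores poorly without the tool (threshold assumed).
    need = [s < need_threshold for s in score_no_tool]
    # Utility: the tool call strictly improves the factuality score;
    # ties and regressions both fall in the neutral-or-negative class.
    utility = [t > s for s, t in zip(score_no_tool, score_tool)]
    return need, utility
```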
Evaluation protocol: Evaluation measures include factuality accuracy on task outputs, the number of tool calls (against the budget constraint), true vs. perceived need and utility classification accuracy, and cost-aware utility gain (Gain_K) over varying budgets. NDCG is used to evaluate how well models prioritize high-utility calls. Models' tool-call decisions are analyzed descriptively (self-reports) versus normatively (oracle decisions using ground-truth utility). Ablations consider cost-aware versus no-cost scenarios. Statistical uncertainty is partially addressed via multiple annotators and tasks.
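For reference, a small sketch of an NDCG@k computation for a tool-call priority list; treating negative gains as zero relevance is an assumption of this sketch:

```python
import numpy as np

def ndcg_at_k(true_gain: np.ndarray, call_order: np.ndarray, k: int) -> float:
    """NDCG@k of a policy's call ordering against true per-instance utility.

    true_gain:  per-instance true utility (ALWAYS TOOL minus NO TOOL score).
    call_order: instance indices in the order the policy would call them.
    """
    rel = np.maximum(true_gain, 0.0)                # negative gains earn no credit
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1 / log2(rank + 1)
    dcg = float(np.sum(rel[call_order[:k]] * discounts))
    ideal = float(np.sum(np.sort(rel)[::-1][:k] * discounts))
    return dcg / ideal if ideal > 0 else 0.0
```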
Reproducibility: The evaluated models are all open-source LLMs ranging from 3B to 120B parameters, with links shared. Datasets appear to derive from existing public or published sources. Code availability is not explicitly stated, and oracle computations require running both the NO TOOL and ALWAYS TOOL policies per instance, which may be computationally costly.
Concrete example: Given an Entity task prompt, the model first yields a predicted need signal under the NO TOOL configuration. Where need is detected, the LUE ranks instances by predicted utility, and under the budget constraint the lightweight classifiers determine which instances receive a web search call (via SerpApi). The final outputs are evaluated for factuality gains. This process is contrasted against naive SELF-DECISION and oracle policies using ground-truth scores to demonstrate improved alignment and efficiency; a sketch of the full loop follows.
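A hedged sketch of that controlled tool-calling loop, putting the estimators together; `lne`, `lue`, `llm`, and `web_search` are hypothetical callables standing in for the trained probes, the model, and the SerpApi backend, not the authors' code:

```python
import numpy as np

def controlled_tool_calling(instances, hidden_states, lne, lue, llm,
                            web_search, budget):
    """Gate by predicted need, then spend the budget on top predicted utility."""
    p_need = lne(hidden_states)             # P(model needs the tool | h(x))
    p_util = lue(hidden_states)             # P(tool call helps | h(x))
    candidates = np.where(p_need > 0.5)[0]  # consider only 'needy' instances
    ranked = candidates[np.argsort(-p_util[candidates])]
    call_set = set(ranked[:budget].tolist())  # top-K by predicted utility

    outputs = []
    for i, x in enumerate(instances):
        if i in call_set:
            r = web_search(x)                        # external evidence
            outputs.append(llm(x, tool_response=r))
        else:
            outputs.append(llm(x))                   # parametric answer only
    return outputs
```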
Technical innovations
- A novel three-dimensional tool-calling assessment framework inspired by rational choice theory capturing necessity, utility, and affordability.
- Definition and empirical operationalization of ground-truth (normative) need and utility metrics by contrasting NO TOOL and ALWAYS TOOL model behaviors.
- Identification and quantification of systematic misalignment between LLMs’ self-perceptions and actual need/utility leading to suboptimal tool use.
- Use of LLM internal hidden states to train lightweight classifiers (latent need and utility estimators) that better predict true need and utility for informed tool-calling decisions under budget.
- Application of rank-based budget allocation policies derived from utility estimators to efficiently prioritize costly tool calls.
Datasets
- Entity Task — ~500 instances — derived from real chat logs (Karnam et al., 2026)
- InVivoQuery Task — ~500 instances — real-world user factual queries (Karnam et al., 2026)
- BFCL Task — functional calling evaluation — Berkeley Function Calling Leaderboard (Patil et al., 2025)
Baselines vs proposed
- NO TOOL: average factuality score ~0.56–0.72 across models vs SELF-DECISION: 0.74–0.81 vs OPTIMAL: 0.83–0.88 (Entity task, Table 1).
- ALWAYS TOOL: 0.78–0.82 vs OPTIMAL: 0.83–0.88, showing the suboptimality of the always-call baseline.
- Under budget constraints, latent-estimator-based controllers achieve higher utility gain (Gain_K) than SELF-DECISION policies, but still fall short of the oracle (Figures 6, 14); see the sketch after this list.
- Models with cost-unaware prompts often exceed budgets and incur worse utility gains compared to cost-aware prompting (Fig 6).
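The exact definition of Gain_K is not reproduced in this summary; a plausible reading, assumed here, is the summed true utility over a policy's first K calls, net of a per-call cost:

```python
def gain_at_k(call_order, true_gain, k, cost_per_call=0.0):
    """Cost-aware utility gain of a policy's first k tool calls.

    call_order: instance indices in the order the policy calls them.
    true_gain:  per-instance true utility (ALWAYS TOOL minus NO TOOL score).
    """
    chosen = call_order[:k]
    return sum(true_gain[i] for i in chosen) - cost_per_call * len(chosen)
```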
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.00737.

Fig 1: Given input x, the model M decides π(x) ∈ {0, 1}: either call a tool and condition on its response r, or answer directly without it.

Fig 2: True need and true positive utility

Fig 37: [BFCL Task] factuality distribution across different models.
Limitations
- Oracle decisions require running both NO TOOL and ALWAYS TOOL policies per instance, impractical for many real-world deployments.
- Latent estimators rely on proxy targets from model performance rather than direct modeling of tool behavior, limiting accuracy.
- Benchmarks restricted to web search tools and three QA datasets; generality to other tools or tasks uncertain.
- Models evaluated are limited to open-source LLMs; closed/proprietary large LLMs may behave differently.
- Human annotation is used for factuality verification on only two of the three datasets, leaving potential evaluator noise.
- Budget constraints modeled as fixed uniform costs; real APIs may have more complex cost structures.
- No explicit adversarial evaluation or robustness tests under distribution shift.
Open questions / follow-ons
- How to directly model external tool behavior to improve utility estimation beyond proxy classifier approaches?
- Can reinforcement learning or feedback mechanisms better align LLM perceptions with true tool utility?
- How do the findings generalize to other types of external tools beyond web search, e.g., calculators or retrieval-augmented generation?
- What is the impact of more sophisticated cost models (e.g., variable pricing, latency penalties) on tool-calling policies?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, this work highlights that LLMs' automated decisions to access external web services (analogous to tool calls) can be inefficient or even harmful if not carefully managed. Understanding the nuanced decision-making framework around necessity, utility, and cost can inform the design of smarter LLM agents that query external data only when it truly benefits the task. The latent need and utility estimators trained on internal LLM states offer a promising lightweight approach to improving such invocation decisions, potentially reducing attack surface or abuse risk by avoiding unnecessary external calls. Applying cost- and utility-aware control to LLM tool calls can likewise reduce wasteful or overly frequent network lookups, a factor relevant to rate limiting, CAPTCHAs, and detecting suspicious automated querying behavior.

Practitioners should note the misalignment issues uncovered here: naive LLM self-assessment alone is insufficient to gauge when external data is needed or beneficial, so ongoing evaluation of real tool utility and stricter cost constraints will be critical in production bot-defense scenarios. Finally, the paper's framework invites further exploration of decision-theoretic metrics beyond aggregate accuracy, an approach potentially valuable for fine-grained bot detection or CAPTCHAs that challenge correct agent behavior under cost constraints.
Cite
@article{arxiv2605_00737,
  title={To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling},
  author={Qinyuan Wu and Soumi Das and Mahsa Amani and Arijit Nag and Seungeon Lee and Krishna P. Gummadi and Abhilasha Ravichander and Muhammad Bilal Zafar},
  journal={arXiv preprint arXiv:2605.00737},
  year={2026},
  url={https://arxiv.org/abs/2605.00737}
}