TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

Source: arXiv:2605.17170 · Published 2026-05-16 · By Hanzhang Shen, Haoran Wu, Yiren Zhao, Robert Mullins

TL;DR

This paper addresses the challenge of compressing the key–value (KV) cache for large language models (LLMs) used in agentic inference tasks, which involve long-context processing, multimodal inputs, and structured interactions with tool calls. Existing quantization approaches typically treat the KV cache uniformly or consider only one dimension of token heterogeneity, leading to suboptimal accuracy-memory tradeoffs. The authors profile agentic workloads and find that token quantization sensitivity varies significantly along three orthogonal axes: temporal recency (how recent the token is), modality (text or image), and semantic role (e.g., user query, tool call, reasoning). Leveraging this insight, they propose TriAxialKV, a mixed-precision KV-cache quantization framework that assigns each token a triaxial tag, calibrates per-tag sensitivity via offline measurements of attention output distortion, and optimally allocates INT2/INT4 bitwidths under a memory budget. This is integrated with a custom fused Triton decode kernel and paged memory allocator for serving.

On BFCL Memory and OSWorld, which cover function-calling and multimodal agentic workloads, TriAxialKV stays close to full BF16 task accuracy (within 1.1 points on BFCL Memory and 1.8 points on OSWorld across models) while reducing KV-cache memory footprint by 4.5× and improving end-to-end throughput by 26–52% on real GPUs (NVIDIA B200 and H100). Ablations confirm the importance of modeling all three axes. Compared to uniform quantization baselines such as KIVI and SGLang FP4, TriAxialKV maintains accuracy with far lower bitwidth and higher concurrency of in-flight requests, demonstrating both the necessity and practicality of joint triaxial token sensitivity modeling for aggressive KV-cache compression in agentic LLM inference.

Key findings

Per-token KV quantization sensitivity varies by more than an order of magnitude and is well captured by three axes: temporal recency, token modality, and semantic role.
TriAxialKV's mixed INT2/INT4 allocation maintains task accuracy within 1.1 points on BFCL Memory and within 1.8 points on OSWorld compared to BF16 KV cache across multiple large models (Qwen3-14B, 32B, 235B, Falcon3-10B, InternVL3.5-38B).
TriAxialKV supports 4.5× larger KV cache size relative to BF16 baseline on Qwen3-VL-32B (OSWorld) while matching accuracy.
End-to-end serving throughput improves by 26% to 52% over BF16 baseline, with 1.26× on Qwen3-VL-8B (B200), 1.32× on Qwen3-VL-32B (B200), and 1.52× on Qwen3-VL-32B (H100).
TriAxialKV enables 3.4–4.0× higher average concurrency of in-flight requests under the same memory budget compared to BF16 baseline.
Ablation removing semantic or temporal axes degrades accuracy by 4–6 points on BFCL Memory, with semantic axis having the largest impact.
A memory budget sweep near calibrated bitwidth shows accuracy drops steeply (~5% per 0.1 bit decrease), highlighting the importance of careful bitwidth selection via calibration.
Uniform low-bit baselines like KIVI and SGLang FP4 suffer 4–5 point accuracy drops due to uniform quantization ignoring semantic sensitivity, especially for critical tokens like system prompts.
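
As a rough sanity check on the capacity figures above, the sketch below relates the share of INT2 elements to the average payload bitwidth and to the capacity gain over a 16-bit cache. The function names and the `overhead_bits` parameter (a stand-in for scales, zero-points, and paging metadata, whose real cost this summary does not state) are hypothetical and not taken from the paper.

```python
def average_bits(frac_int2: float) -> float:
    """Average payload bits per KV element when `frac_int2` of the
    elements are stored as INT2 and the rest as INT4."""
    if not 0.0 <= frac_int2 <= 1.0:
        raise ValueError("frac_int2 must be in [0, 1]")
    return 2.0 * frac_int2 + 4.0 * (1.0 - frac_int2)


def capacity_gain(frac_int2: float, overhead_bits: float = 0.0,
                  baseline_bits: float = 16.0) -> float:
    """How many times more KV elements fit in the same memory,
    relative to a `baseline_bits` cache (16 for BF16).
    `overhead_bits` is an assumed per-element metadata cost."""
    return baseline_bits / (average_bits(frac_int2) + overhead_bits)


def int2_fraction_for_gain(gain: float, overhead_bits: float = 0.0,
                           baseline_bits: float = 16.0) -> float:
    """Invert capacity_gain: the INT2 share needed to reach `gain`."""
    target_bits = baseline_bits / gain - overhead_bits
    frac = (4.0 - target_bits) / 2.0
    if not 0.0 <= frac <= 1.0:
        raise ValueError("gain is not reachable with an INT2/INT4 mix")
    return frac
```

Under these assumptions, an all-INT4 cache gives a 4× gain and an all-INT2 cache gives 8×, so a 4.5× gain with zero overhead corresponds to roughly 22% of elements at INT2. Any real metadata overhead would push that share higher.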

Methodology — deep read

Threat Model & Assumptions: No adversary is modeled, since this is a performance and efficiency engineering paper. The focus is on reducing KV cache size and memory bandwidth without harming model accuracy. The system assumes the inference workload is agentic, with structured token sequences that are heterogeneous along temporal, modal, and semantic axes. It also assumes that token metadata can be recovered from chat-template structure alone, without semantic understanding of the content.
Data: They calibrate their quantization scheme on 5% of workload datasets (BFCL and OSWorld), capturing KV caches across multiple layers (4–6 layers per model). Tokens are tagged by temporal recency, modality (text, image), and semantic role (system prompt, user, reasoning, tool call, observations, delimiters). Calibration sets are used for measuring per-tag quantization sensitivity by replaying attention outputs under different quantization bitwidths.
Architecture/Algorithm: TriAxialKV assigns each prefill token a triaxial tag based on temporal bucket (current turn, recent past turns, older), modality, and semantic role. From calibration data it measures per-tag output distortion D_k(2) and D_k(4), the MSE of the attention output when tag k is quantized to 2 or 4 bits. Given token counts N_k per tag and an average-bitwidth budget B, an allocator selects 2-bit or 4-bit quantization per tag, either by exhaustive enumeration or greedily, to minimize total expected output MSE. The KV cache is stored in two pooled buffers, one INT2 and one INT4, and a fused Triton kernel dequantizes the KV entries on the fly during attention.
Training Regime: Not applicable since quantization allocation is calibrated offline without model retraining. Sensitivity is measured by quantizing only one tag at a time at 2 and 4 bits and measuring resultant attention output distortion vs full precision.
Evaluation Protocol: Metrics include task accuracy on BFCL function-calling and OSWorld multimodal benchmarks, end-to-end throughput (tokens/s) on GPUs (NVIDIA B200 and H100), KV cache size scaling (relative to BF16 baseline), and concurrency of in-flight requests. Baselines include full BF16, uniform FP4 (SGLang FP4), and KIVI 2-bit uniform quantization. Ablations remove each axis from tagging to measure impact on accuracy. Memory budgets are swept to analyze tradeoffs.
Reproducibility: The method integrates into the SGLang serving system with custom Triton kernels. Calibration uses 5% holdout data from public benchmarks BFCL and OSWorld. Models evaluated include Qwen3 variants (14B, 32B, 235B) and Falcon3-10B. Detailed algorithms and code release status are unspecified.
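
The per-tag sensitivity measurement described under Training Regime can be sketched as follows. This is a minimal pure-Python illustration, not the authors' calibration code: `fake_quantize`, `attention`, and `tag_distortion` are hypothetical names, the per-vector min/max quantizer is an assumed stand-in for whatever grouping the paper uses, and a single query replaces the replayed attention layers.

```python
import math


def fake_quantize(vec, bits):
    """Round-trip a vector through uniform asymmetric quantization
    with 2**bits levels spanning [min(vec), max(vec)] (assumed scheme)."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / scale) * scale for x in vec]


def attention(query, keys, values):
    """Single-query scaled dot-product attention over all tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # stable softmax
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tag_distortion(query, keys, values, tags, tag, bits):
    """MSE of the attention output when only tokens carrying `tag`
    have their keys and values quantized to `bits`."""
    ref = attention(query, keys, values)
    q_keys = [fake_quantize(k, bits) if t == tag else k
              for k, t in zip(keys, tags)]
    q_vals = [fake_quantize(v, bits) if t == tag else v
              for v, t in zip(values, tags)]
    out = attention(query, q_keys, q_vals)
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
```

Calling `tag_distortion` once per tag at 2 and 4 bits, then averaging over calibration samples, would yield the D_k(2) and D_k(4) table that the allocator consumes. Measuring the error at the attention output rather than on the raw KV tensors is the design choice the summary attributes to the paper.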

Example: For a prefill with N tokens, each token receives a triaxial label (temporal, modal, semantic). Calibration supplies sensitivity scores D_k(2) and D_k(4) for each tag k. The allocator chooses b_k ∈ {2, 4} per tag to minimize ∑_k D_k(b_k)·N_k subject to the token-weighted average bitwidth (∑_k b_k·N_k)/(∑_k N_k) ≤ B, using either full enumeration (≤22 tags) or a greedy approximation. During serving, tokens are routed to the INT2 or INT4 KV pool accordingly and dequantized on the fly by the fused kernel, enabling a larger KV cache and higher throughput at matched accuracy.
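
The allocation step in this example can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names are hypothetical, the budget is read as a token-weighted average bitwidth, and the greedy rule (upgrade tags in order of largest per-token distortion reduction) is one plausible heuristic, since the summary does not specify the paper's greedy criterion.

```python
from itertools import product


def total_cost(bits, d2, d4, counts):
    """Sum over tags of D_k(b_k) * N_k."""
    return sum((d2[k] if bits[k] == 2 else d4[k]) * counts[k]
               for k in counts)


def within_budget(bits, counts, budget):
    """True if the token-weighted average bitwidth is <= budget."""
    used = sum(bits[k] * counts[k] for k in counts)
    return used <= budget * sum(counts.values()) + 1e-9


def allocate_exhaustive(d2, d4, counts, budget):
    """Try every 2/4-bit assignment; 2**len(counts) candidates."""
    tags = sorted(counts)
    best_bits, best_cost = None, None
    for choice in product((2, 4), repeat=len(tags)):
        bits = dict(zip(tags, choice))
        if not within_budget(bits, counts, budget):
            continue
        cost = total_cost(bits, d2, d4, counts)
        if best_cost is None or cost < best_cost:
            best_bits, best_cost = bits, cost
    if best_bits is None:
        raise ValueError("budget below 2 bits is infeasible")
    return best_bits


def allocate_greedy(d2, d4, counts, budget):
    """Start at all-INT2, then upgrade tags to INT4 in order of
    distortion saved per extra bit, (D_k(2) - D_k(4)) / 2."""
    bits = {k: 2 for k in counts}
    if not within_budget(bits, counts, budget):
        raise ValueError("budget below 2 bits is infeasible")
    for k in sorted(counts, key=lambda t: d2[t] - d4[t], reverse=True):
        trial = dict(bits)
        trial[k] = 4
        if d2[k] > d4[k] and within_budget(trial, counts, budget):
            bits = trial
    return bits


# Made-up calibration numbers, for illustration only.
example_counts = {"system": 100, "tool_call": 200, "old_image": 700}
example_d2 = {"system": 0.9, "tool_call": 0.5, "old_image": 0.06}
example_d4 = {"system": 0.05, "tool_call": 0.04, "old_image": 0.01}
example_bits = allocate_exhaustive(example_d2, example_d4,
                                   example_counts, budget=2.6)
```

With these made-up numbers and a 2.6-bit budget, both allocators keep the small but sensitive `system` and `tool_call` tags at INT4 and push the large `old_image` tag to INT2. Enumeration over 22 tags means about 4.2 million candidates, which is why exhaustive search stays practical at that size.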

Technical innovations

Identification that per-token KV quantization sensitivity in agentic inference workloads varies significantly along three orthogonal axes – temporal recency, modality, and semantic role – and jointly modeling these axes for bit allocation.
Development of a taxonomy-driven, chat-template-only token tagger that produces triaxial tags in a single pass without model inference.
A novel offline calibration procedure measuring attention-output distortion per tag and per bitwidth, directly optimizing for attention accuracy rather than raw KV quantization error.
A bitwidth allocation algorithm (exact by enumeration, approximate by greedy search) that minimizes expected attention output MSE under a bitwidth budget, enabling mixed INT2/INT4 precision assignment per token segment.
An end-to-end serving system integrating paged INT2/INT4 memory pools with a fused Triton kernel for fast on-the-fly mixed-precision decode during attention computation.
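
To make the chat-template-only tagger concrete, here is a minimal sketch that tags tokens in a single pass over template segments, with no model inference. The `Segment` type, the role and modality strings, and the `recent_window` threshold are illustrative assumptions; the summary names the three axes but not the exact taxonomy or cutoffs.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    """One chat-template span (hypothetical structure)."""
    turn: int        # conversation turn the span belongs to
    role: str        # e.g. "system", "user", "reasoning", "tool_call"
    modality: str    # "text" or "image"
    num_tokens: int  # tokens the span expands to


def temporal_bucket(turn, current_turn, recent_window=2):
    """Map a turn index to 'current', 'recent', or 'older'.
    `recent_window` is an assumed cutoff, not from the paper."""
    age = current_turn - turn
    if age <= 0:
        return "current"
    if age <= recent_window:
        return "recent"
    return "older"


def tag_tokens(segments, recent_window=2):
    """Return one (temporal, modality, role) tag per token."""
    if not segments:
        return []
    current_turn = max(s.turn for s in segments)
    tags = []
    for s in segments:
        tag = (temporal_bucket(s.turn, current_turn, recent_window),
               s.modality, s.role)
        tags.extend([tag] * s.num_tokens)
    return tags


def count_tags(tags):
    """Token counts per tag, i.e. the N_k used by the allocator."""
    return Counter(tags)
```

Because the tag depends only on template metadata, the same pass can also produce the per-tag token counts needed for bit allocation.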

Datasets

Berkeley Function Calling Leaderboard (BFCL) Memory — 5% calibration subset, public benchmark for function-calling agentic inference
OSWorld — 5% calibration subset, multimodal (text+image), computer-use agentic inference benchmark

Baselines vs proposed

SGLang BF16: Task accuracy baseline = 100% reference; TriAxialKV Mixed matches within ±1.1 points on BFCL Memory
SGLang BF16: Throughput baseline = 1×; TriAxialKV Mixed achieves 1.26× (Qwen3-VL-8B B200), 1.32× (Qwen3-VL-32B B200), 1.52× (Qwen3-VL-32B H100)
SGLang FP4: Task accuracy drops by up to 7.1 points on Qwen3-14B and 32B on BFCL Memory, while TriAxialKV Mixed stays within 1.1 points
KIVI 2-bit uniform: Task accuracy drops 4–5 points on BFCL Memory compared to BF16; TriAxialKV Mixed recovers these losses by per-tag allocation

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.17170.

Fig 1: Comparison of single-axis KV-cache compression methods, including PM-KVQ [25] for

Fig 2: Overview of the TriAxialKV compression flow. During the prefill stage, KV entries are

Fig 3 (page 2).

Limitations

Calibration data covers only 5% of workload; generalization to unseen prompt or domain distributions is not shown.
Evaluation is limited to two agentic benchmarks (BFCL Memory, OSWorld) and a small set of models, mostly Qwen3 variants; applicability to other models or tasks is untested.
Adversarial robustness or security against KV cache manipulation is not studied.
The fused Triton kernels and paged memory pools add implementation complexity; portability to other serving frameworks unclear.
Memory savings and throughput improvements depend on hardware configuration; benefits may vary on other GPUs or CPUs.
No end-user latency measurements or tail latency distributions reported, which are important for interactive agentic workloads.

Open questions / follow-ons

Can the triaxial quantization framework extend to transformer KV caches with additional modalities or semantic roles beyond those studied?
How robust is the per-tag sensitivity calibration to distribution shifts in user behavior or prompt distributions over time?
Would joint optimization of bitwidth and token eviction further improve memory and throughput tradeoffs?
Can the approach be adapted for on-device or constrained hardware where memory and compute budgets differ substantially?

Why it matters for bot defense

While this work focuses on compressing KV caches to enable efficient agentic LLM inference, its insights on structured token heterogeneity and sensitivity-guided mixed-precision quantization could inform bot-defense systems that rely on LLM-based reasoning or multi-turn conversations. Bot-defense engineers working on CAPTCHAs or interaction analysis might consider analogous token tagging along semantic and temporal axes to optimize feature extraction or model state caching to reduce memory overheads. The general principle of exploiting orthogonal token metadata axes for fine-grained resource allocation might inspire more efficient bot interaction tracking or detection pipelines. However, since the method targets LLM serving system internals, direct application to CAPTCHA challenge design or response scoring would require adaptation to those systems' specifics. Understanding performance-accuracy tradeoffs in token caching could indirectly help optimize defenses that use generative agents or multimodal LLMs behind the scenes.

Cite

@article{arxiv2605_17170,
  title={TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks},
  author={Hanzhang Shen and Haoran Wu and Yiren Zhao and Robert Mullins},
  journal={arXiv preprint arXiv:2605.17170},
  year={2026},
  url={https://arxiv.org/abs/2605.17170}
}
