WarpGuard: Protected-Site Control-Flow Integrity for CUDA SASS Binaries

Source: arXiv:2606.11871 · Published 2026-06-10 · By Igor Santos-Grueiro

TL;DR

WarpGuard addresses the pressing problem of control-flow corruption attacks inside CUDA GPU device code binaries. Recent research revealed that GPU memory bugs can escalate into control-flow hijacks by corrupting return continuations, function pointers, dispatch tables, or indirect branch targets. Existing CUDA security approaches focus on source or PTX levels, which do not reflect the final executed binary, NVIDIA SASS, where inlining, register allocation, and SIMT execution shape the actual control flow. WarpGuard is the first system to enforce protected-site control-flow integrity (CFI) for CUDA binaries at the executed SASS level. It recovers consumption sites of control-flow state from SASS binaries, synthesizes per-site policies, authenticates backward-edge return continuations, validates recoverable forward edges by checking targets, and fails closed on violations. The system explicitly accounts for fixed-edge, unsupported, fallback, and profile-excluded sites to maintain policy soundness.

Evaluated on a corpus of 77 CUDA artifacts, WarpGuard identifies 51,621 SASS control-flow sites including 1,343 returns and 154 forward target sets. It records 52.2 million dynamic checks confirming enforcement viability at scale. Representative backward- and forward-edge attack tests show that native execution leads to attacker-directed behavior, detect-only mode logs violations correctly, and enforcement mode halts execution securely before illegal control transfers. Public CUDA code examples demonstrate the presence of the recovered consumption patterns in real applications. WarpGuard decouples instrumentation placement from backend enforcement, enabling performance-aware deployment strategies.

Key findings

WarpGuard classified 51,621 SASS control-flow consumption sites across 77 CUDA artifacts, including 1,343 protected returns and 154 supported forward indirect target sets.
WarpGuard recorded 52.2 million dynamic control-flow integrity checks at runtime, supporting the claim that enforcement and monitoring are viable at scale.
In backward-edge corruption attacks, enforcement mode stops execution before the corrupted return continuation is released, whereas detect-only mode logs the violation and lets execution continue.
In forward-edge corruption attacks, enforcement mode validates each indirect target against the recovered per-site target set and fails closed before an invalid transfer is released.
104 out of 836 analyzed functions have no device-side returns; 292 have callsites only, and 440 include instrumented returns (Table A8), indicating varied backward-edge exposure.
Unsupported or ambiguous indirect call sites are explicitly marked and excluded from protected coverage rather than assumed allowed, improving policy soundness.
Three instrumentation backends (WG-NVBit, WG-ST, WG-PC) enable flexible placement strategies from runtime insertion to static patching with verified patch manifests for sm_89 architectures.
SIMT lane-local and warp-uniform control-flow states are modeled conservatively to prevent one corrupted lane from hiding within a warp-level aggregate check.
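
The last finding, on lane-local versus warp-uniform state, can be illustrated with a toy host-side model. This is only a sketch under stated assumptions, not the paper's device-side implementation: the "first active lane" aggregate check is an assumed strawman, and the function names, addresses, and lane numbers are hypothetical.

```python
from typing import List, Set

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def warp_uniform_check(targets: List[int], active: List[bool], allowed: Set[int]) -> bool:
    """Strawman aggregate check: validate only the first active lane's target."""
    for lane in range(len(targets)):
        if active[lane]:
            return targets[lane] in allowed
    return True  # no active lane consumes a target


def lane_local_check(targets: List[int], active: List[bool], allowed: Set[int]) -> List[int]:
    """Return every active lane whose target falls outside the allowed set."""
    return [
        lane
        for lane in range(len(targets))
        if active[lane] and targets[lane] not in allowed
    ]


# Hypothetical per-site target set and per-lane targets.
allowed = {0x1000, 0x1400}
targets = [0x1000] * WARP_SIZE
active = [True] * WARP_SIZE

targets[17] = 0xDEAD0  # one corrupted, active lane
targets[3] = 0xBAD     # corrupted, but predicated off below
active[3] = False      # an inactive lane never consumes its target

uniform_ok = warp_uniform_check(targets, active, allowed)
bad_lanes = lane_local_check(targets, active, allowed)
print(uniform_ok, bad_lanes)
```

In this model the aggregate check passes because lane 0 is clean, while the lane-local check flags lane 17 and ignores the predicated-off lane 3.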

Threat model

The adversary controls inputs to vulnerable CUDA device code and can induce device memory corruptions affecting global, local, or shared memory accessible by the kernel. These corruptions may tamper with return continuations, function pointers, dispatch tables, or indirect branch targets consumed by SASS instructions. The adversary cannot compromise the host process, CUDA driver/runtime, GPU firmware, trusted backend instrumentation keys, or the WarpGuard policy generator. The model includes denial of service by triggering fail-closed enforcement. The defense protects the device-side consumption points of corrupted state, preventing illegal control transfers at the SASS binary level.

Methodology — deep read

WarpGuard's methodology centers on securing CUDA device-side control-flow at the binary SASS instruction level, the actual code executed by NVIDIA GPUs after compiling from CUDA source and PTX. The core idea is to enforce Control-Flow Integrity at recovered protected SASS consumption sites where control-flow state (return continuations, indirect call targets, indirect branch targets) is consumed.

Threat Model & Assumptions: The adversary controls inputs to potentially vulnerable device code and can cause device memory corruption in global, local, or shared memory. Such corruption can reach return continuations, function pointers, or dispatch tables used by the device binary. The attacker cannot compromise the host process, the CUDA driver, the firmware, or the trusted backend state that holds cryptographic keys. WarpGuard assumes a trusted host runtime stack and tooling environment.
Data: The evaluation uses 77 CUDA binary artifacts, including compiled cubins, fatbins, and JIT-produced images. From these, 51,621 SASS control-flow sites are recovered. The data involves static disassembly of binaries (using nvdisasm, cuobjdump) and dynamic execution traces recording over 52 million runtime CFI checks. Functions and their returns, call sites, and indirect branches are identified, with site IDs computed as hashes of module digests, architecture, function starts, SASS PC offsets, and site classes.
Architecture/Algorithm: WarpGuard works as a load-time CFI enforcement layer. The host tooling statically recovers SASS control-flow sites and synthesizes per-site policies that dictate the allowed targets. Each policy records whether a site is protected, unsupported, or fixed-edge, and whether its check fails closed or allows any target. Backward-edge checks authenticate expected return continuations using cryptographic MACs and a shadow stack indexed by CUDA thread geometry. Forward-edge checks validate recovered indirect call and branch targets against policy sets derived from constant relocation entries, ELF metadata, and dispatch table scans. Three check-placement backends are implemented: WG-NVBit (dynamic instrumentation), WG-ST (timing-matched static trampolines), and WG-PC (patch-cache verified static rewrites). Checks are inserted immediately before the original control-flow consumption instruction and respect predicate guards for SIMT lane-level correctness.
Training Regime: Not applicable as WarpGuard is a rule-based static/dynamic analysis and instrumentation system rather than a learned model.
Evaluation Protocol: Static recovery metrics (number of sites, protected versus unsupported coverage, function return characteristics) are reported alongside dynamic enforcement results covering millions of runtime checks. Representative backward-edge and forward-edge corruption attacks from prior work simulate exploits, and each is run in native baseline, detect-only, and enforcement modes. Real public CUDA device code using dispatch tables and callbacks is audited for matching consumption site patterns. Results are stratified by site outcome category. SIMT lane-local versus warp-uniform checks are validated by divergent-lane attack tests.
Reproducibility: Source code and anonymized artifacts are publicly available at https://anonymous.4open.science/r/warpguard-anon/. The analysis depends on standard CUDA tools (nvdisasm, cuobjdump), and instrumentation uses NVBit and custom static patching infrastructure. Detailed tables in appendices provide reproducible site classification and policy provenance. The system is modular to adapt to new GPU architectures or binary formats.
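
The Data item above states that site IDs are hashes of the module digest, architecture, function start, SASS PC offset, and site class. A minimal sketch of such a derivation follows. The summary does not give the hash function, field encoding, or ID length, so SHA-256, length-prefixed fields, and a 16-hex-digit truncation are all assumptions here, as are the function name `site_id` and the example values.

```python
import hashlib
import struct


def site_id(module_digest: bytes, arch: str, func_start: int,
            pc_offset: int, site_class: str) -> str:
    """Derive a stable identifier for one control-flow site (assumed scheme)."""
    fields = (
        module_digest,
        arch.encode("ascii"),
        struct.pack("<Q", func_start),
        struct.pack("<Q", pc_offset),
        site_class.encode("ascii"),
    )
    h = hashlib.sha256()
    for field in fields:
        # Length-prefix each field so different field splits cannot collide.
        h.update(struct.pack("<I", len(field)))
        h.update(field)
    return h.hexdigest()[:16]  # truncation length is an arbitrary choice


# Hypothetical module and sites, for illustration only.
digest = hashlib.sha256(b"example cubin bytes").digest()
ret_site = site_id(digest, "sm_89", 0x0, 0x2A0, "return")
call_site = site_id(digest, "sm_89", 0x0, 0x2A0, "indirect_call")
print(ret_site, call_site)
```

Including the site class in the hash means a return and an indirect call at the same offset receive different IDs, so a policy keyed by ID cannot be applied to the wrong kind of site.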

Technical innovations

Protected-site CUDA binary CFI defined and implemented at executed SASS level accounting explicitly for fixed-edge, unsupported, fallback, and profile-excluded sites, improving precision over source or PTX-level approaches.
Authenticated backward-edge continuation checks using keyed tokens and per-thread-slot shadow stack metadata to verify return addresses at the point of consumption in register- or memory-backed returns.
Selective forward-edge CFI enforcing per-site recovered indirect call/branch target sets derived from binary metadata and ELF-decoded dispatch tables with fail-closed semantics rather than coarse allow-any policies.
SIMT-aware checking that models lane locality and predication effects conservatively, preventing masked-lane attacks where one corrupted lane hides within warp-uniform aggregate checks.
Multi-backend instrumentation architecture separating policy synthesis from check placement including a dynamic NVBit helper insertion backend, timing-matched static trampolines (WG-ST), and verified patch-cache static rewrites (WG-PC) for scalable deployment.
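
The authenticated backward-edge check listed above can be modeled on the host as a keyed token pushed at call time and verified when the return continuation is consumed. This is a sketch under stated assumptions rather than the paper's token format: HMAC-SHA256, the `(block, thread)` slot indexing, the inclusion of stack depth in the MAC, and all names and addresses are hypothetical.

```python
import hashlib
import hmac
import struct
from typing import Dict, List, Tuple

KEY = b"demo-key-not-for-production"  # stands in for the trusted backend key

ThreadSlot = Tuple[int, int]  # (linear block index, linear thread index), assumed


def token(slot: ThreadSlot, depth: int, return_pc: int) -> bytes:
    """MAC binding a return continuation to its thread slot and call depth."""
    msg = struct.pack("<QQQQ", slot[0], slot[1], depth, return_pc)
    return hmac.new(KEY, msg, hashlib.sha256).digest()


class ShadowStack:
    """Per-thread-slot stack of authenticated return tokens."""

    def __init__(self) -> None:
        self._frames: Dict[ThreadSlot, List[bytes]] = {}

    def on_call(self, slot: ThreadSlot, return_pc: int) -> None:
        frames = self._frames.setdefault(slot, [])
        frames.append(token(slot, len(frames), return_pc))

    def on_return(self, slot: ThreadSlot, consumed_pc: int) -> bool:
        """True if the consumed continuation matches the one recorded at call time."""
        frames = self._frames.get(slot, [])
        if not frames:
            return False  # no matching call: fail closed
        expected = frames.pop()
        actual = token(slot, len(frames), consumed_pc)
        return hmac.compare_digest(expected, actual)


stack = ShadowStack()
slot = (0, 5)

stack.on_call(slot, 0x2A0)
clean = stack.on_return(slot, 0x2A0)      # untampered continuation

stack.on_call(slot, 0x2A0)
hijacked = stack.on_return(slot, 0x7F0)   # continuation corrupted in memory

underflow = stack.on_return(slot, 0x2A0)  # return with no recorded call
print(clean, hijacked, underflow)
```

Binding the token to the slot and depth is a design choice in this sketch: it keeps a valid token from one thread or frame from being replayed in another.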

Datasets

CUDA binary artifact corpus — 77 binaries — collected from public and in-house CUDA applications and libraries

Baselines vs proposed

Native execution (no CFI): attacker-selected control-flow hijack achieved in corruption attack scenarios
Detect-only mode: violation events recorded correctly on both backward- and forward-edge corruption attacks but execution proceeds
Enforcement mode: fails closed by aborting execution before releasing corrupted return or indirect call targets, preventing hijack
Coverage: 51,621 total recovered control-flow sites with 1,343 returns and 154 supported forward target sets classified
Performance: Detailed overhead numbers not disclosed; three backends provide tradeoffs between runtime instrumentation and static patching
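
The three execution modes compared above differ only in what happens when a consumed target is outside the recovered per-site set. The following host-side sketch models that difference for a forward-edge check. The names `Mode`, `CfiViolation`, and `release_target`, the site label, and the addresses are hypothetical, and treating a checked site with no recovered set as a violation is an assumption of this sketch.

```python
from enum import Enum
from typing import Dict, FrozenSet, List, Tuple


class Mode(Enum):
    NATIVE = "native"
    DETECT_ONLY = "detect-only"
    ENFORCE = "enforce"


class CfiViolation(Exception):
    """Raised in enforcement mode before an illegal target is released."""


def release_target(mode: Mode, site: str, target: int,
                   policy: Dict[str, FrozenSet[int]],
                   log: List[Tuple[str, int]]) -> int:
    """Return the target to branch to, or fail closed, depending on the mode."""
    if mode is Mode.NATIVE:
        return target  # no check at all
    # Assumption: a checked site without a recovered set allows nothing.
    allowed = policy.get(site, frozenset())
    if target in allowed:
        return target
    log.append((site, target))  # both checked modes record the violation
    if mode is Mode.ENFORCE:
        raise CfiViolation(f"{site}: target {target:#x} not in policy")
    return target  # detect-only: log, then let execution proceed


policy = {"site-a": frozenset({0x100, 0x180})}
log: List[Tuple[str, int]] = []

native = release_target(Mode.NATIVE, "site-a", 0xBAD0, policy, log)
detect = release_target(Mode.DETECT_ONLY, "site-a", 0xBAD0, policy, log)
try:
    release_target(Mode.ENFORCE, "site-a", 0xBAD0, policy, log)
    halted = False
except CfiViolation:
    halted = True
print(hex(native), hex(detect), halted, len(log))
```

Native and detect-only modes both release the attacker-chosen target, but only detect-only leaves a log entry. Enforcement raises before the target is returned.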

Limitations

Does not provide complete memory safety or prevent the initial memory corruption; complementary to memory safety tools but does not replace them
Coverage limited to recoverable SASS consumption sites; inline functions, return-free kernels, or ambiguous indirect call targets remain unsupported or excluded
Does not protect against malicious binaries; assumes trusted host, runtime, and backend environment
No protection guarantees for unsupported, fallback, profile-excluded, or fixed-edge sites beyond audit/logging
SIMT model is conservative but may introduce false negatives or require careful tuning for newer GPU architectures
Current implementation focuses on NVIDIA GPUs (sm_89, etc.); portability to other vendors or architectures is not addressed
Instrumentation backends have varied deployment complexity; static patching requires verified manifests and is architecture-specific

Open questions / follow-ons

How to improve coverage of unsupported or fallback sites, for example those with ambiguous indirect calls or fully inlined return-free code?
What are the end-to-end performance overheads on diverse CUDA workloads for each instrumentation backend, especially static patching (WG-PC) at scale?
Can GPU hardware support integrated secure enclaves or key storage to eliminate the reliance on trusted backend software state for authentication?
How to extend or generalize WarpGuard to other GPU vendors or heterogeneous parallel compute platforms outside NVIDIA CUDA?

Why it matters for bot defense

For bot-defense and CAPTCHA systems that increasingly rely on GPU-accelerated workloads or CUDA-based computation for rendering, verification, or anti-automation logic, WarpGuard offers a principled way to mitigate device-side control-flow hijacks that could undermine trust in GPU-executed binaries. Because it protects the binary layer that actually executes (SASS), control-flow attacks such as code reuse or function-pointer hijacking are blocked at protected sites in enforcement mode even when a memory corruption flaw is exploited. The fine-grained protected-site CFI model, with its explicit accounting for unsupported and fallback sites, also yields a detailed map of the remaining attack surface that can guide GPU binary deployment.

Practitioners should consider WarpGuard's approach as complementary to memory safety and software integrity efforts on GPUs. It explicitly enforces security policies at the true execution boundary rather than at source or higher-level IR, which often miss subtle attack surfaces exposed by GPU compiler optimizations and SIMT lane scheduling. While targeted at CUDA device code, the architectural principles could inform security hardening of other GPU programming models used in bot-detection pipelines that demand cryptographic verification of computation integrity.

Cite

bibtex

@article{arxiv2606_11871,
  title={WarpGuard: Protected-Site Control-Flow Integrity for CUDA SASS Binaries},
  author={Igor Santos-Grueiro},
  journal={arXiv preprint arXiv:2606.11871},
  year={2026},
  url={https://arxiv.org/abs/2606.11871}
}
