SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Source: arXiv:2606.13673 · Published 2026-06-11 · By Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song et al.

TL;DR

This paper addresses the core challenge of open-ended spatial reasoning in vision-language models (VLMs), specifically how the design of an agent's action interface impacts its ability to compose complex 3D and 4D spatial analyses. Existing approaches either run a single-pass code execution committing to a fixed analysis strategy without intermediate feedback or use structured tool-call interfaces that limit flexible composition of tool outputs. SpatialClaw proposes a training-free, iterative framework treating code itself as the action interface by maintaining a persistent Python kernel preloaded with input frames and a suite of perception and geometry primitives. A VLM backend generates and executes one code cell per step conditioned on all previous outputs, enabling flexible composition, inspection, and revision of spatial analyses based on intermediate results. Evaluated on 20 diverse spatial benchmarks covering static/dynamic 3D/4D reasoning, SpatialClaw achieves 59.9% average accuracy, outperforming prior spatial agents by +11.2 absolute points, with consistent gains across six VLM backbones of varying scale and architecture without any task- or model-specific tuning.

Key findings

SpatialClaw achieves 59.9% average accuracy across 20 spatial reasoning benchmarks, outperforming recent spatial agents by +11.2 percentage points (Figure 1).
Consistent performance gains occur across six backbone models from two different families (Qwen and Gemma4) with parameters from 27B to 397B, without any system prompt or toolset adaptation.
SpatialClaw yields largest improvements (+14.3 percentage points average) in multi-view spatial reasoning and +18.3 points on video spatial & 4D reasoning benchmarks, tasks requiring multi-step geometric composition.
Compared to single-pass code execution and structured tool-call interfaces using the same toolset, SpatialClaw outperforms them on 11 out of 13 meta-task categories (Figures 4a and 4b, Table 2).
The persistent Python kernel stores all intermediate results as variables visible to subsequent steps, enabling iterative inspect-and-revise reasoning rather than committing upfront to a fixed program.
System prompts encode general spatial reasoning principles rather than task-specific templates, enabling zero-shot transfer across benchmarks and backbone models.
A planner LLM generates a step-by-step analysis plan that conditions the main agent’s iterative code generation applied in a five-stage loop (planning, code generation, execution, feedback assembly, answer submission).

Threat model

The work is not a security paper and does not address adversarial threats. The 'adversary' metaphor corresponds to inherent challenges in flexible spatial reasoning where standard fixed pipelines or single-pass programs are insufficient. The agent is assumed to have full access to input images and tools but cannot anticipate the task-specific composition requirements in advance.

Methodology — deep read

Threat model & assumptions: This research focuses on enhancing agentic spatial reasoning in vision-language models augmented with perception tools. The adversary here is not adversarial attackers but practical challenges in flexible, iterative spatial reasoning. The system does not target adversarial robustness but rather generalization and compositional flexibility without training.
Data: The evaluation aggregates 20 publicly established spatial reasoning benchmarks covering a broad spectrum of tasks: single-image, multi-view 3D, general spatial, video spatial/4D reasoning, and general video understanding. Examples range from metric distance and object motion to spatial planning, totaling up to 1,000 samples per benchmark for scalability. Metadata like video frame rate and frame indices are also provided.
Architecture/algorithm: SpatialClaw revolves around a persistent Python kernel that retains all intermediate computational variables, including segmentation masks, depth maps, geometric reconstructions, point clouds, and plots. The kernel is preloaded with perception tools (e.g., SAM3 for segmentation, Depth Anything 3 for 3D reconstruction) and scientific libraries (NumPy, SciPy, Matplotlib). The vision-language model acts as an agent that iteratively generates and executes one Python code cell per step, conditioned on previous feedback (text outputs, images, variable summaries). This design contrasts with prior single-pass code or fixed structured tool invocation.

Each step, the agent produces a structured response in markdown comprising purpose, reasoning, next goal, and code fields. A static AST-based checker validates code safety before execution to prevent unsafe operations.

Training regime: SpatialClaw itself is training-free, relying on pretrained VLM backbones and code generation capabilities out of the box. System prompts encode spatial reasoning discipline instead of fine-tuning on individual benchmarks or development data.
Evaluation protocol: Metrics are accuracy on diverse spatial reasoning benchmarks, with no benchmark-specific prompts or toolset modifications. Baselines include original VLM no-tool performances, prior spatial agents that use single-pass or structured call interfaces, and other spatial toolkits. Ablations compare the three action interfaces using shared toolsets and prompts. Statistical or significance tests are not specified but performance deltas and absolute scores are given. Held-out attacker scenarios or adversarial tests are not performed.
Reproducibility: Code and project page links are provided. The dataset benchmarks are public, but the kernel with all perception modules and the exact prompt are internal. No explicit frozen weights released for the entire system, but pre-trained open VLM models are used. The large scale and complexity may pose challenges for complete reproduction.

Example end-to-end: Given a question about the distance between a heater and a door in an input multi-view scene, SpatialClaw’s agent iteratively writes Python cells to segment objects using SAM3, reconstruct 3D geometry with Depth Anything 3, compute nearest point distances using SciPy’s KDTree, and visualize intermediate masks or plots. At each step, the agent inspects the intermediate results rendered into its context and revises computations such as replacing centroid distance with KDTree-based closest point distance until confident to submit an answer. This contrasts with a baseline single-pass approach that commits to one static program upfront.

Overall, the methodology carefully integrates tool-based perception, iterative code generation, and a persistent execution environment, underpinned by a spatial reasoning prompt guiding disciplined analysis, to enable generalizable, flexible spatial reasoning across diverse VLM backbones and tasks.

Technical innovations

Introduction of a persistent Python kernel as an action interface that maintains all intermediate state across iterative code execution steps unlike single-pass or structured calls.
Use of code generation by a VLM agent as the sole means to orchestrate perception tools and scientific libraries, enabling flexible, open-ended spatial reasoning compositions at test time.
A five-stage outer loop combining planning, code generation, secure execution, feedback assembly with intermediate visualizations, and answer submission, forming a disciplined inspect-and-revise reasoning scaffold.
Unified system prompts encoding general spatial reasoning principles and verification discipline that provide strong zero-shot generalization across 20 benchmarks without task-specific tuning.

Datasets

ERQA (2025) — ~1000 samples — public benchmark for single-image spatial reasoning
Omni3D (2025) — ~1000 samples — multi-view spatial reasoning benchmark
OmniSpatial (2026) — ~1000 samples — general spatial reasoning
SPBench (2025) — ~1000 samples — single-image spatial reasoning
MindCube (2025) — ~1000 samples — multi-view spatial reasoning
MMSI (2025) — ~1000 samples — multi-view spatial reasoning
SPAR-Bench (2025) — ~1000 samples — multi-view spatial reasoning
BLINK (2024) — ~1000 samples — general spatial reasoning
SpatialTree (2025) — ~1000 samples — general spatial reasoning
ViewSpatial (2025) — ~1000 samples — general spatial reasoning
MMSI-Video (2025) — ~1000 samples — video spatial & 4D reasoning
OSI-Bench (2025) — ~1000 samples — video spatial & 4D reasoning
PAI-Bench (2025) — ~1000 samples — video spatial & 4D reasoning
VSI-Bench-U (2025) — ~1000 samples — video spatial & 4D reasoning
VSTI-Bench (2026) — ~1000 samples — video spatial & 4D reasoning
DSI-Bench (2025) — ~1000 samples — video spatial & 4D reasoning
CV-Bench (2025) — ~1000 samples — general video understanding
PerceptComp (2026) — ~1000 samples — general video understanding
Video-MME (2025) — ~1000 samples — general video understanding
Video-MME-v2 (2026) — ~1000 samples — general video understanding

Baselines vs proposed

No-tool baseline (Qwen3.5-397B): average accuracy = 57.3% vs SpatialClaw: 60.4%
Single-pass code execution (Gemma4-31B): 55.2% average vs SpatialClaw: 59.9%
Structured tool-call interface (Gemma4-31B): 56.7% average vs SpatialClaw: 59.9%
SpaceTools-Toolshed (Chen et al., 2026): 54.3% average vs SpatialClaw: 63.6%
pySpatial (Luo et al., 2026): 50.7% average vs SpatialClaw: 63.6%
VADAR (Marsili et al., 2025): 33.3% average vs SpatialClaw: 61.3%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13673.

Fig 1

Fig 1: SpatialClaw improves spatial reasoning across the board. Per-benchmark accuracy on 20 spatial

Fig 2

Fig 2: SpatialClaw studies code as the action interface for spatial reasoning. Three action interfaces on

Fig 3

Fig 3 (page 2).

Fig 4

Fig 4 (page 2).

Fig 5

Fig 5 (page 2).

Fig 6

Fig 6 (page 2).

Fig 7

Fig 7 (page 2).

Fig 3

Fig 3: Agentic loop for iterative code execution. SpatialClaw wraps a persistent kernel in a five-stage loop.

Limitations

No adversarial evaluation or robustness analysis against malicious input or attacker behavior is presented.
The approach depends on large-scale VLM backbones with strong code generation ability and is not demonstrated on smaller or less capable models.
The evaluation caps large datasets at 1,000 samples for feasibility, potentially limiting statistical power for some benchmarks.
The interactive framework adds computational overhead and latency due to iterative code execution, questioning suitability for low-latency applications.
Intermediate results rely on perception tool outputs whose errors propagate; no error analysis or calibration of perception modules is detailed.
Reproducibility is partially limited by internal tool implementations and prompt engineering that are not fully open-sourced.

Open questions / follow-ons

How would SpatialClaw perform under adversarial perturbations to input images or segmentation masks?
Can the iterative code interface be extended to support learning or adaptation of new perception tools on the fly?
What are quantitative tradeoffs between iteration budget, latency, and accuracy in practical usage scenarios?
How to integrate uncertainty estimation or confidence scoring into the iterative reasoning loop for robust decision making?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the SpatialClaw work illuminates how an agentic spatial reasoning interface that supports iterative composition, feedback, and analysis revision can substantially improve understanding of complex 3D/4D visual tasks. If CAPTCHAs or bot challenges require spatial reasoning about objects, motion, or scene geometry beyond single-frame visual text/image cues, adopting an interface that allows intermediate evidence inspection and flexible use of perception primitives could yield substantially more robust and human-aligned behavior in automated solvers. However, the iterative nature implies higher processing costs. The paradigm also suggests designing tool interfaces for visual question answering or interaction that emphasize persistent state and compositional coding rather than monolithic API calls. This insight may motivate novel CAPTCHA designs that challenge bot systems lacking deep iterative code-like reasoning and intermediate evidence verification abilities, raising the technical bar for automated attacks.

Cite

bibtex

@article{arxiv2606_13673,
  title={ SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning },
  author={ Seokju Cho and Ryo Hachiuma and Abhishek Badki and Hang Su and Byung-Kwan Lee and Chan Hee Song and Sifei Liu and Subhashree Radhakrishnan and Seungryong Kim and Yu-Chiang Frank Wang and Min-Hung Chen },
  journal={arXiv preprint arXiv:2606.13673},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.13673}
}

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​