Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

Source: arXiv:2606.14535 · Published 2026-06-12 · By Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim

TL;DR

SCDP uses a multi-scale ResNet-18 image encoder to capture both coarse scene context and fine-grained local detail, and samples point-wise visual features aligned with the evolving end-effector trajectory estimate during the diffusion policy's iterative denoising process. Experiments on the Meta-World and DexArt benchmarks show that SCDP substantially outperforms prior single-view baselines and matches or exceeds policies that use additional wrist or depth cameras. Real-robot tests on precise, contact-rich tasks (USB insertion, battery insertion, grasping with distractors) confirm its robustness and accuracy with only a single global RGB view. The approach emphasizes data efficiency and spatial attention, and requires neither external perception modules nor large pretrained vision models.

Key findings

SCDP achieves over 80% success rates on all Meta-World difficulty groups using only 20 demonstrations and a single global RGB camera (Table 1).
On challenging Meta-World "Hard" tasks, SCDP attains an average 82.5% success, outperforming DP (16.7%) and DP3 (60.3%) baselines by large margins (Table 2).
SCDP matches or exceeds the performance of multi-camera baselines such as DP with wrist-view (74.2%) and DP3 with depth (60.3%) on Meta-World Hard tasks using only a single camera (Table 3).
In ablations, multi-scale features (F1–F5) improve success rate by 13.9 percentage points over using only the highest semantic level (F5) (Fig. 4b).
Increasing the future trajectory reconstruction horizon from 0 to 12 steps raises success rate from 75.8% to 83.9%, enabling richer spatial context (Fig. 4a).
Real-world experiments show SCDP achieves 85% success on the grasp stage of USB insertion and 40% on the insert stage, and maintains high success under distractors, outperforming the DP, DP3, OTTER, and SKIL baselines (Table 4).
SCDP retains robust performance in cluttered and distractor-rich scenes, demonstrating focus on task-relevant regions along the end-effector trajectory.
Data-efficiency results show SCDP learning faster and converging to higher performance with fewer demos than the baselines (Fig. 3).

Threat model

The adversary is environmental complexity and ambiguity in the visual input from a single global camera, including distractors, clutter, and occlusions. The policy must infer precise end-effector trajectories without privileged wrist-camera views or additional sensing. The adversary cannot manipulate the robot's internals or sensor calibration but may introduce visual distractors or lighting variation that challenge perception.

Methodology — deep read

Threat Model & Assumptions: The adversary is the environment and the visual ambiguities inherent in single-camera robotic manipulation, including clutter and distractors. The policy has access to only a single RGB camera view, with no wrist cameras or additional sensors. The method assumes that task-relevant visual features lie near the projections of the end-effector trajectory onto the image plane. Occlusion and indirect-interaction cases, where relevant regions lie far from the end-effector, are not addressed.
Data: The method is evaluated primarily on two established benchmarks: Meta-World (54 diverse manipulation tasks from easy to very hard difficulty groups) and DexArt (4 dexterous hand manipulation tasks). Meta-World models are trained with 20 expert scripted demonstrations per task for 1,000 epochs across 3 random seeds. DexArt uses 100 RL-expert demos for 3,000 epochs. Image inputs are single-view RGB frames with multi-scale feature extraction.
Architecture / Algorithm: The core model (SCDP) has three components: (1) a multi-scale image encoder (ResNet-18 backbone) produces hierarchical feature maps F1 to F5 with varying spatial resolutions and channel depths; (2) a spatial conditioning module reconstructs the future end-effector trajectory from intermediate action samples during diffusion denoising, projects these 3D points onto the 2D image plane, and samples point-wise visual features along these anchor points from the multi-scale maps via bilinear interpolation, aggregating them temporally into a spatial context vector F; (3) a conditional denoising U-Net diffusion network predicts cleaner action chunks conditioned on F and the current robot state, using FiLM layers for modulation.
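
The point-wise sampling in component (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), anchor points already expressed in the camera frame, and feature maps whose coordinates are the image coordinates divided by a per-level stride. All function names, the toy feature maps, and the strides are hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[List[float]]]  # feature map indexed as [row][col][channel]


def project_point(p_cam: Sequence[float], fx: float, fy: float,
                  cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a 3D camera-frame point to pixel coordinates (u, v)."""
    x, y, z = p_cam
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(fmap: Grid, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature vector at continuous map coordinates (u, v)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so anchors that project slightly outside the map still return a value.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(len(fmap[0][0]))
    ]


def sample_multiscale(fmaps: Sequence[Grid], strides: Sequence[int],
                      u: float, v: float) -> List[float]:
    """Concatenate features sampled at the same image location from every scale."""
    feats: List[float] = []
    for fmap, stride in zip(fmaps, strides):
        # Simplifying assumption: map coordinate = pixel coordinate / stride.
        feats.extend(bilinear_sample(fmap, u / stride, v / stride))
    return feats


# Toy single-channel maps: fine map stores the column index, coarse map 10 * row index.
fine = [[[float(c)] for c in range(4)] for _ in range(4)]
coarse = [[[10.0 * r] for _ in range(2)] for r in range(2)]

u, v = project_point((0.0, 0.0, 1.0), fx=2.0, fy=2.0, cx=1.5, cy=1.0)
anchor_feature = sample_multiscale([fine, coarse], [1, 2], u, v)
print(anchor_feature)  # [1.5, 5.0]
```

In a full model the per-anchor vectors would then be aggregated over the trajectory into the context vector F; how that aggregation is done is not reproduced here.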

During diffusion, predicted action trajectories at each denoising step k are used to refine attention locations dynamically, allowing evolving focus on task-relevant regions. Training optimizes mean squared error between noise and predicted noise with perturbed actions. Inference uses DDIM sampling for efficient generation.
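
The training objective and sampler named above have a standard textbook form, sketched here on plain lists of floats. The sketch assumes the usual noise-prediction parameterization with a cumulative noise level `alpha_bar` and a deterministic DDIM update (eta = 0). The paper's noise schedule, its conditioning on F, and its network are not shown, and every name below is illustrative.

```python
import math
from typing import List

Vec = List[float]


def add_noise(a0: Vec, eps: Vec, alpha_bar: float) -> Vec:
    """Forward process: a_k = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [s * a + n * e for a, e in zip(a0, eps)]


def noise_mse(eps_pred: Vec, eps: Vec) -> float:
    """Training loss: mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def ddim_step(a_k: Vec, eps_pred: Vec,
              alpha_bar_k: float, alpha_bar_prev: float) -> Vec:
    """One deterministic DDIM update from noise level k to the previous level."""
    # Estimate the clean action chunk implied by the current noise prediction.
    a0_hat = [
        (a - math.sqrt(1.0 - alpha_bar_k) * e) / math.sqrt(alpha_bar_k)
        for a, e in zip(a_k, eps_pred)
    ]
    # Re-noise that estimate to the (lower) previous noise level.
    return [
        math.sqrt(alpha_bar_prev) * a0 + math.sqrt(1.0 - alpha_bar_prev) * e
        for a0, e in zip(a0_hat, eps_pred)
    ]


# Toy check: with a perfect noise prediction, stepping to alpha_bar = 1
# recovers the clean actions.
clean = [0.2, -0.5]
noise = [1.0, -2.0]
noisy = add_noise(clean, noise, alpha_bar=0.36)
recovered = ddim_step(noisy, noise, alpha_bar_k=0.36, alpha_bar_prev=1.0)
```

In SCDP the intermediate estimate would additionally be used to re-sample the visual features at each step, which is what lets the attention locations move as the trajectory sharpens.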

Training Regime: On Meta-World, models are trained for 1,000 epochs on 20 expert demos per task with three fixed random seeds (0, 1, 2). DexArt training uses 3,000 epochs and 100 RL-generated demos. Batch size and optimizer settings are reported, but optimizer hyperparameters are not discussed in depth, and no seed strategy beyond the three fixed seeds is mentioned.
Evaluation Protocol: Performance is measured as task success rate on held-out test episodes across difficulty groups. Comparisons include baselines DP (diffusion policy with global visual features), DP3 (point-cloud augmented), SKIL and OTTER (methods with explicit task-relevant visual conditioning). Ablation studies vary reconstruction horizon s of trajectory anchors and multi-scale feature utilization. Statistical significance or confidence intervals beyond mean and standard deviation over seeds are not explicitly reported. Both simulation and real-robot experiments test robustness to distractors and clutter.
Reproducibility: The paper does not confirm a public release of code or weights. The benchmarks used (Meta-World, DexArt) are publicly available. The architecture and hyperparameters are described in enough detail for academic reproduction, but dataset splits and exact training pipelines are less detailed. Real-world hardware specifics and camera calibration details are summarized but not fully disclosed.
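
The mean-and-standard-deviation reporting described under Evaluation Protocol amounts to the small computation below. The episode outcomes are made-up placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population one) is an assumption, since the summary does not say which the authors use.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def success_rate(outcomes: List[int]) -> float:
    """Percentage of successful episodes (1 = success, 0 = failure)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize(per_seed: Dict[int, List[int]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-seed success rates."""
    rates = [success_rate(o) for o in per_seed.values()]
    return mean(rates), stdev(rates)


# Hypothetical outcomes for seeds 0, 1, 2 (illustrative only).
episodes = {0: [1, 1, 1, 0], 1: [1, 1, 0, 0], 2: [1, 1, 1, 1]}
avg, spread = summarize(episodes)
print(f"{avg:.1f} +/- {spread:.1f}")
```

With only three seeds the spread estimate is coarse, which is one reason the absence of confidence intervals is worth noting.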

Concrete Example Walkthrough: For USB insertion, the SCDP model takes a single global RGB image of the scene and the current robot state as input. Intermediate diffusion denoising steps generate partial action trajectories forecasting end-effector positions over a short horizon. These 3D points are projected into the image to sample visual features from multi-scale feature maps. The aggregated spatial context conditions the U-Net to refine action predictions iteratively. Over multiple denoising steps, the predicted trajectory becomes more precise, focusing attention on the USB port and connector geometry, enabling stepwise grasp, insertion, and push actions with geometric alignment inferred purely from RGB data.
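
One way to picture how a partial action trajectory becomes a set of attention anchors is the sketch below. It assumes that actions are end-effector position deltas and that a reconstruction horizon of 0 leaves only the current end-effector position as an anchor. Both are assumptions made for illustration; the paper's exact reconstruction may differ, and `reconstruct_anchors` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def reconstruct_anchors(ee_pos: Point,
                        delta_actions: Sequence[Point],
                        horizon: int) -> List[Point]:
    """Integrate up to `horizon` position deltas starting from the current pose.

    Returns the current position followed by one anchor per integrated step.
    """
    anchors = [tuple(ee_pos)]
    x, y, z = ee_pos
    for dx, dy, dz in delta_actions[:horizon]:
        x, y, z = x + dx, y + dy, z + dz
        anchors.append((x, y, z))
    return anchors


# Toy action chunk taken from an intermediate denoising step (illustrative values).
chunk = [(0.5, 0.0, 0.0), (0.5, 0.25, 0.0), (0.0, 0.25, 0.0)]
anchors = reconstruct_anchors((0.0, 0.0, 0.5), chunk, horizon=2)
print(anchors)  # [(0.0, 0.0, 0.5), (0.5, 0.0, 0.5), (1.0, 0.25, 0.5)]
```

Each anchor would then be projected into the image and used as a sampling location, so a longer horizon exposes the policy to visual context further along the planned motion.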

Technical innovations

Use of intermediate diffusion action trajectory estimates as spatial attention anchors to extract point-wise visual features from multi-scale image encodings, enabling task-relevant spatial conditioning.
Integration of multi-scale hierarchical visual feature maps from a single RGB camera backbone to capture both coarse scene context and fine-grained local details necessary for precise manipulation.
Formulation of a conditional diffusion policy denoising network modulated by spatial context vectors aggregated along the predicted end-effector trajectory, replacing fixed global embeddings.
Demonstration that single global RGB camera manipulation performance matches or exceeds multi-camera and depth-augmented baselines on challenging tasks, without reliance on wrist cameras or pretrained large vision models.
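
The modulation of the denoising network by a context vector is typically done with FiLM, which scales and shifts each feature channel using parameters predicted from the condition. The sketch below shows the generic operation, assuming a single linear layer maps the condition to per-channel `gamma` and `beta`. The weights are arbitrary illustrative numbers, and some implementations use `(1 + gamma)` as the scale instead.

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]


def linear(cond: Vec, weight: Mat, bias: Vec) -> Vec:
    """Plain affine map: one output per weight row."""
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weight, bias)]


def film(features: Mat, gamma: Vec, beta: Vec) -> Mat:
    """Feature-wise modulation: out[ch][t] = gamma[ch] * features[ch][t] + beta[ch]."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Two feature channels over two time steps of the action sequence.
features = [[1.0, 2.0], [3.0, 4.0]]

# Condition vector (stand-in for the spatial context and robot state).
cond = [1.0, 2.0]
gamma = linear(cond, weight=[[2.0, 0.0], [0.0, 0.25]], bias=[0.0, 0.0])
beta = linear(cond, weight=[[0.0, 0.0], [1.0, 0.0]], bias=[0.0, 0.0])

modulated = film(features, gamma, beta)
print(modulated)  # [[2.0, 4.0], [2.5, 3.0]]
```

Because `gamma` and `beta` change whenever the sampled context changes, the same convolutional weights can respond differently as the anchors move between denoising steps.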

Datasets

Meta-World — 54 tasks — publicly available manipulation benchmark
DexArt — 4 dexterous hand manipulation tasks — publicly available benchmark

Baselines vs proposed

DP (Diffusion Policy, global visual features): Meta-World Hard avg success = 16.7% vs SCDP = 82.5%
DP3 (Diffusion Policy with point-cloud input): Meta-World Hard avg success = 60.3% vs SCDP = 82.5%
DP with Wrist Camera: Meta-World Hard avg success = 74.2% vs SCDP single RGB = 82.5%
SKIL (Semantic Keypoint Imitation Learning): Meta-World Hard avg success = 27.5% vs SCDP = 82.5%
OTTER (Vision-Language Model Conditioning): Meta-World Hard avg success = 47.2% vs SCDP = 82.5%
Real-world USB Insertion Grasp: DP = 60% vs SCDP = 85%
Real-world Distractor Robustness (Push Cube Task): DP success drops to 50%, SCDP maintains 75%
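
For quick reference, the margins implied by the Meta-World Hard figures listed above can be computed directly. The snippet only restates the numbers in this section as percentage-point differences and adds no new results.

```python
# Meta-World Hard average success rates (%) as listed in this section.
SCDP_HARD = 82.5
baselines = {
    "DP": 16.7,
    "DP3": 60.3,
    "DP + wrist camera": 74.2,
    "SKIL": 27.5,
    "OTTER": 47.2,
}

# Absolute improvement of SCDP over each baseline, in percentage points.
margins = {name: round(SCDP_HARD - score, 1) for name, score in baselines.items()}

for name, gain in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"SCDP vs {name}: +{gain} points")
```

The smallest gap is against DP with a wrist camera, which fits the claim that a single global view can stand in for the extra camera on these tasks.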

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.14535.

Fig 1

Fig 1: (a, b) In a single-camera setting, precise and robust manipulation requires the robot to

Fig 2

Fig 2: SCDP Architecture. (a) SCDP constructs hierarchical feature maps using a multi-scale

Fig 4

Fig 4: Design Ablations. Effects of (a) reconstruction horizon, (b) multi-scale features, and (c)

Limitations

Assumption that task-relevant visual information lies near the end-effector trajectory limits applicability to tool-use or manipulation involving contacts away from the end-effector.
Heavily occluded environments challenge the extraction of reliable visual features along the attention anchors.
Lack of external perception modules or pretrained large visual models may limit generalization to visually complex or highly cluttered scenes.
No explicit adversarial robustness evaluation against domain or viewpoint shifts beyond tested distractor and clutter scenarios.
Training relies largely on scripted expert demonstrations and small datasets (e.g., 20 demos), which may limit scalability to more diverse data.
No public release of code or models currently documented, hindering reproducibility.

Open questions / follow-ons

How can spatial conditioning be generalized to tasks where relevant visual cues are spatially distant from the end-effector (e.g., tool-use or multi-contact scenarios)?
Can additional sensor modalities (force, tactile, depth) complement and enhance the robustness of spatially conditioned diffusion policies in occluded or cluttered scenes?
What are the limits of this approach under stronger domain shifts such as lighting changes, camera viewpoint variations, or dynamic backgrounds?
Could training with larger, more diverse datasets and/or pretrained vision models improve SCDP's generalization without sacrificing focus on fine-grained task relevance?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners concerned with robot interaction security, SCDP shows that accurate robotic manipulation is achievable from minimal visual input (a single camera) without complex multi-view setups. This matters when designing systems that must resist spoofing or deception by attackers who rely on simplified visual cues. Spatial conditioning on inferred action trajectories may also inspire detection or validation mechanisms that focus on the relevant visual regions, improving robustness against the adversarial scene clutter or occlusions often used in bot attacks. The demonstrated efficiency and robustness to distractors suggest potential for practical deployment in the constrained monitoring environments typical of embedded or edge-capable CAPTCHA-solving robots.

Cite

@article{arxiv2606_14535,
  title={Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera},
  author={Seoyoon Kim and Kanghyun Kim and Dongwoo Ko and Yeong Jin Heo and Min Jun Kim},
  journal={arXiv preprint arXiv:2606.14535},
  year={2026},
  url={https://arxiv.org/abs/2606.14535}
}
