Difference-Aware Retrieval Policies for Imitation Learning
Source: arXiv:2606.09758 · Published 2026-06-08 · By Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal, Siddhartha Srinivasa, Abhishek Gupta
TL;DR
This paper addresses a fundamental limitation of behavior cloning (BC) in imitation learning: poor generalization and instability when encountering out-of-distribution (OOD) states due to compounding errors during deployment. The authors propose Difference-Aware Retrieval Policies (DARP), a semi-parametric method that combines local neighborhood structure with parametric action prediction to reduce prediction variance and improve robustness. Instead of predicting actions solely from a global parametric policy, DARP retrieves k-nearest neighbor expert states and actions at inference time and predicts the query action conditioned on neighbor states, neighbor actions, and the relative difference vectors between the neighbors and the query. Predictions from each neighbor are then aggregated in a permutation-invariant way. This architecture implicitly implements a Laplacian smoothing effect on the policy over the expert data manifold, encouraging stability without requiring explicit regularization or tuning. The method remains purely in the behavior cloning setting and does not require additional data or online expert feedback.
Empirical evaluation across diverse continuous control and robotic manipulation tasks, including those with high-dimensional visual features, shows DARP outperforms standard BC by 15–46% in success rates and task scores. Ablations highlight the importance of using difference vectors and permutation-invariance in the aggregator. The approach generalizes easily to multimodal action distributions using expressive aggregation models. Overall, DARP presents a conceptually elegant and practically effective way to reduce the brittleness of imitation learning purely via architectural changes combined with nearest-neighbor retrieval at inference.
Key findings
- DARP improves performance by 15-46% over standard behavior cloning across multiple domains, including locomotion (MuJoCo) and manipulation (Robosuite, RoboCasa) with low-dimensional and visual features.
- In MuJoCo locomotion tasks, DARP achieves mean rewards of 3545 (Hopper), 4383 (Ant), 4894 (Walker), and 5515 (HalfCheetah) compared to BC scores of 2313, 2376, 2658, and 1063 respectively (Table 1).
- In robotic manipulation with low-dimensional states, DARP raises success rates substantially; for example, DrawerClose success improves from 54% (BC) to 85% (DARP) (Table 2).
- For vision-based Robosuite tasks using R3M image embeddings, DARP achieves ∼35% absolute improvement in success rate over BC (Table 3).
- Ablations show that incorporating state difference vectors (s_i - s_q) and using permutation-invariant aggregation functions are crucial for DARP’s performance, while neighbor actions contribute moderately (Fig. 6).
- Fully parametric multimodal aggregators (e.g., Set Transformers or diffusion models) further boost performance over simple averaging and enable richer action distributions.
- DARP implicitly performs Laplacian smoothing on the k-NN graph of expert states at a fixed smoothing level (λ≈1) with no additional hyperparameters, improving policy stability and reducing variance (Theorem 2).
- Nearest neighbor retrieval with a meaningful distance metric is important; random neighbor selection degrades performance significantly.
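The Laplacian smoothing finding above can be illustrated with a small numerical check. This is a minimal sketch, assuming a directed, unweighted k-NN graph and the random-walk Laplacian L = I - D^{-1}W; the exact construction in the paper's Theorem 2 may differ. Under that assumption, one smoothing step f - λLf with λ = 1 is exactly the average of f over each node's neighbors. The helper names `knn_adjacency` and `laplacian_smooth` and the synthetic data are illustrative and not taken from the paper.

```python
import numpy as np

def knn_adjacency(states, k):
    """Directed k-NN graph: row i has ones at the k nearest other states."""
    n = len(states)
    d2 = ((states[:, None, :] - states[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # a state is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest neighbors
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], idx] = 1.0
    return W

def laplacian_smooth(W, f, lam=1.0):
    """One smoothing step f - lam * L f, with L = I - D^{-1} W (random-walk Laplacian)."""
    D_inv = 1.0 / W.sum(axis=1, keepdims=True)
    L_f = f - D_inv * (W @ f)
    return f - lam * L_f

# Synthetic stand-in for expert data (assumption, for illustration only).
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 3))
noisy_actions = np.sin(states[:, :1]) + 0.3 * rng.normal(size=(200, 1))

W = knn_adjacency(states, k=8)
smoothed = laplacian_smooth(W, noisy_actions, lam=1.0)
neighbor_mean = (W @ noisy_actions) / W.sum(axis=1, keepdims=True)
```

Here `smoothed` and `neighbor_mean` coincide, which is one way to read the statement that averaging over retrieved neighbors acts as smoothing at a fixed level without a tunable λ.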
Threat model
n/a — This work is focused on mitigating compounding error in imitation learning policies due to distribution shift during deployment, not on defending against an active adversary.
Methodology — deep read
Threat model & assumptions: The environment is a finite-horizon MDP, and the expert demonstrations consist of state-action pairs. No adversary is modeled, since the work is an imitation learning improvement; the assumptions are those of the standard BC setup, with no additional online interaction or data. The focus is on improving stability and generalization under the covariate shift that arises during rollout.
Data provenance & preprocessing: Uses public simulation benchmarks (MuJoCo locomotion, Robosuite manipulation, and RoboCasa kitchen manipulation) and a real-robot furniture assembly task (FurnitureBench). Expert demonstrations are state-action pairs collected with MimicGen or similar methods. Splits and dataset sizes are not explicitly detailed in the truncated text. High-dimensional state representations include pre-trained R3M visual embeddings.
Architecture & algorithm: DARP is a semi-parametric policy architecture that decomposes action prediction into an aggregation over k-nearest-neighbor (k-NN) states and actions retrieved from the expert dataset. For a query state s_q at inference, it retrieves the k nearest expert states (s_i*) and their actions (a_i*), and computes difference vectors Δs_i = s_i* - s_q. For each neighbor tuple (s_i*, a_i*, Δs_i), a parametric predictor f_θ produces a candidate action a_i' = f_θ(s_i*, a_i*, Δs_i). These candidate actions are aggregated via a permutation-invariant function g_ψ (e.g., average, Set Transformer, or DeepSets) to yield the final predicted action: â_q = g_ψ({a_i' : i ∈ k-NN(s_q)}). This reparameterizes imitation learning from a global state-action mapping to local neighborhood-conditioned predictions.
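The inference path described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: retrieval uses brute-force Euclidean distance, the aggregator g_ψ is a plain mean, and f_θ is a hypothetical stand-in that adds a linear correction in Δs_i to the neighbor action (the paper uses neural networks). The function names and the synthetic linear expert are invented for the example.

```python
import numpy as np

def retrieve_knn(query, expert_states, k):
    """Indices of the k expert states closest to the query (squared Euclidean)."""
    d2 = ((expert_states - query) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

def f_theta(s_i, a_i, delta_s, W):
    """Hypothetical predictor: neighbor action plus a linear correction in delta_s.
    s_i is accepted to mirror the (s_i*, a_i*, Δs_i) signature but unused here."""
    return a_i + delta_s @ W

def darp_predict(query, expert_states, expert_actions, W, k=5):
    idx = retrieve_knn(query, expert_states, k)
    s_nb, a_nb = expert_states[idx], expert_actions[idx]
    delta = s_nb - query                         # Δs_i = s_i* - s_q
    candidates = f_theta(s_nb, a_nb, delta, W)   # one candidate action per neighbor
    return candidates.mean(axis=0)               # permutation-invariant aggregation

# Synthetic linear expert a = s @ A (assumption, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
expert_states = rng.normal(size=(500, 4))
expert_actions = expert_states @ A
query = rng.normal(size=4)

# For a linear expert, a_q = a_i* - Δs_i @ A, so W = -A recovers the action exactly.
pred = darp_predict(query, expert_states, expert_actions, W=-A, k=5)
# With W = 0 the difference vectors are ignored and the result is a plain neighbor average.
knn_only = darp_predict(query, expert_states, expert_actions, W=np.zeros_like(A), k=5)
```

Comparing `pred` with `knn_only` against the true action `query @ A` shows, in this toy setting, why conditioning on the difference vector matters: the correction term accounts for the offset between each neighbor and the query.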
Training regime: The model is trained solely with the standard behavior cloning objective, minimizing mean squared error (or maximizing likelihood, for multimodal distributions) between predicted actions â_q and expert actions a_q* at all expert query states s_q*. The architecture enforces implicit Laplacian smoothing without explicit smoothness regularizers or hyperparameters. Training uses feedforward or convolutional neural networks for f_θ; specifics on training steps, batch sizes, epochs, or hardware are not provided. Ablations vary aggregator types and neighborhood selection.
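To make the training objective concrete, here is a deliberately simplified sketch. It assumes a linear f_θ over the concatenated features [s_i*, a_i*, Δs_i, 1] and a mean aggregator, in which case the BC mean-squared-error objective reduces to ordinary least squares on neighbor-averaged features. The paper trains neural networks instead, and whether a query state may retrieve itself during training is not stated in the summary; excluding it here is an assumption. All function names are illustrative.

```python
import numpy as np

def neighbor_features(states, actions, k):
    """Mean of [s_i, a_i, Δs_i, 1] over each state's k nearest other expert states."""
    n = len(states)
    d2 = ((states[:, None, :] - states[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # assumption: a query never retrieves itself
    idx = np.argsort(d2, axis=1)[:, :k]       # (n, k) neighbor indices
    s_nb, a_nb = states[idx], actions[idx]    # (n, k, ds) and (n, k, da)
    delta = s_nb - states[:, None, :]         # Δs_i = s_i* - s_q
    feats = np.concatenate([s_nb, a_nb, delta, np.ones((n, k, 1))], axis=-1)
    return feats.mean(axis=1)                 # mean aggregation commutes with a linear f_theta

def fit_linear_darp(states, actions, k=5):
    """Minimize the BC mean squared error in closed form."""
    X = neighbor_features(states, actions, k)
    theta, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return theta, X

def bc_mse(theta, X, actions):
    return float(((X @ theta - actions) ** 2).mean())

# Synthetic linear expert (assumption, for illustration only).
rng = np.random.default_rng(0)
states = rng.normal(size=(300, 4))
actions = states @ rng.normal(size=(4, 2))

theta, X = fit_linear_darp(states, actions, k=5)
loss = bc_mse(theta, X, actions)
```

On this synthetic linear expert the training loss is numerically zero, because the query action is an exact linear function of the neighbor state, neighbor action, and difference vector. With a neural f_θ the same loss would be minimized by gradient descent rather than in closed form.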
Evaluation protocol: Tested on multiple domains using held-out environment rollouts, comparing DARP against BC, nearest neighbor retrieval (R&P), locally weighted regression (LWR), MRIL (explicit Laplacian regularization), and REGENT (transformer-based in-context learning). Metrics include cumulative rewards, success rates, and rollout stability under covariate shift. Experiments cover low-dimensional states, visual embeddings, and real robot tasks. Results are averaged over 50-100 trials and reported with confidence intervals. Ablations isolate component contributions. Evaluation emphasizes out-of-distribution robustness and noise resistance.
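Since success rates are reported over 50-100 trials with confidence intervals, it can help to see how wide such intervals are at that sample size. The summary does not say which interval the authors use; the sketch below assumes a 95% Wilson score interval. The trial count of 100 is hypothetical, chosen only to match the 54% and 85% DrawerClose rates quoted elsewhere in this summary.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Hypothetical counts: 100 trials each, matching the reported 54% and 85% rates.
bc_lo, bc_hi = wilson_interval(54, 100)
darp_lo, darp_hi = wilson_interval(85, 100)
print(f"BC:   [{bc_lo:.3f}, {bc_hi:.3f}]")
print(f"DARP: [{darp_lo:.3f}, {darp_hi:.3f}]")
```

With only 50 trials the intervals widen noticeably, which is worth keeping in mind when comparing methods whose success rates differ by a few points.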
Reproducibility: Code and demos are publicly released at https://weirdlabuw.github.io/darp-site/. Expert datasets are publicly known benchmarks or simulation datasets. Detailed hyperparameters and model architectures appear in Appendix sections (not fully included here). The method requires neighbor retrieval infrastructure but is compatible with standard BC training pipelines.
Technical innovations
- Reparameterization of imitation learning policies via difference-aware retrieval from expert neighborhoods, conditioning action prediction on neighbor states, actions, and difference vectors rather than global mapping.
- Moving neighborhood aggregation out of the loss function, where it acts as explicit Laplacian regularization (MRIL), and into the architecture as an implicit design choice (iMRIL), enabling hyperparameter-free smoothing under standard BC training.
- Use of permutation-invariant aggregation functions (e.g. Set Transformers) to flexibly model complex multimodal action distributions over retrieved neighbors.
- Theoretical connection proving equivalence of iMRIL neighbor aggregation to manifold Laplacian smoothing with a fixed spectral filter, reducing variance and improving closed-loop stability.
- Practical application of this framework across low/high-dimensional states and diverse robotics tasks without requiring additional data, simulators, or online oracle feedback.
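The permutation-invariant aggregation mentioned above can be sketched with a DeepSets-style function, rho(mean_i phi(a_i')). This is an illustrative sketch with random, untrained weights; the layer sizes, the tanh nonlinearity, and the name `deepsets_aggregate` are assumptions, and the paper's aggregators (e.g., Set Transformers) are more expressive.

```python
import numpy as np

def deepsets_aggregate(candidates, params):
    """DeepSets-style aggregator: encode each candidate, mean-pool, then decode."""
    W1, b1, W2, b2 = params
    h = np.tanh(candidates @ W1 + b1)   # phi: applied to each candidate independently
    pooled = h.mean(axis=0)             # symmetric pooling removes any dependence on order
    return pooled @ W2 + b2             # rho: maps the pooled code to an action

# Random, untrained weights and candidate actions (assumption, for illustration only).
rng = np.random.default_rng(0)
act_dim, hidden = 2, 16
params = (rng.normal(size=(act_dim, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, act_dim)), np.zeros(act_dim))
candidates = rng.normal(size=(5, act_dim))

out = deepsets_aggregate(candidates, params)
out_shuffled = deepsets_aggregate(candidates[rng.permutation(5)], params)
```

Because the only interaction between candidates is the mean over the neighbor axis, `out` and `out_shuffled` agree up to floating-point rounding, so the predicted action does not depend on the order in which neighbors are retrieved.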
Datasets
- MuJoCo — several continuous control locomotion tasks (Hopper, Walker, Ant, HalfCheetah) — simulation benchmark
- Robosuite — robotic manipulation tasks (Stack, Threading, Square Peg) — simulation benchmark
- RoboCasa — robotic kitchen manipulation tasks (Drawer, Door, Stove) — simulation benchmark
- FurnitureBench — real-world furniture assembly task — hardware evaluation
Baselines vs proposed
- Behavior Cloning (BC): mean Hopper reward 2313.65 vs DARP 3545.57
- BC: mean Ant reward 2376.20 vs DARP 4383.28
- BC: mean Walker reward 2658.40 vs DARP 4894.01
- BC: mean HalfCheetah reward 1063.23 vs DARP 5515.41 (Table 1)
- BC: RoboCasa Drawer success rate 54% vs DARP 85% (Table 2)
- BC: Robosuite Stack success rate 47% vs DARP 72%
- BC: High-dimensional visual state Robosuite Threading success 38% vs DARP 76% (Table 3)
- MRIL (explicit smoothing) generally improves over BC but underperforms DARP in all tested tasks
- Nearest neighbor (R&P) and locally weighted regression (LWR) perform worse than BC
- REGENT (transformer in-context baseline) generally underperforms compared to DARP
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.09758.

Fig 1: Overview of DARP: Unlike standard BC (left), DARP (right) utilizes a retrieval-based reparameteri-

Fig 3: The “DrawerClose”

Fig 4: The “Threading” Ro-

Fig 5: Real-world Furni-
Limitations
- The distance metric used for neighbor retrieval relies on a suitable embedding space; robustness to poor metrics is not deeply explored.
- Experiments focus on offline datasets with expert demonstrations; no analysis of adaptation to non-expert or noisy data.
- No direct adversarial robustness or security evaluations; threat model is non-adversarial and does not consider malicious perturbations.
- Computational overhead of retrieval and aggregation at inference time in real-time systems is not quantified.
- Limited discussion of scalability with very large datasets or very high-dimensional state spaces beyond visual embeddings.
- Few details on hyperparameter sensitivity (e.g., neighborhood size k) beyond default choices.
Open questions / follow-ons
- How well does DARP perform with noisy, suboptimal, or mixed-quality demonstration datasets?
- Can the retrieval architecture be combined with online adaptation or feedback from expert corrections?
- What are the computational tradeoffs and latency impacts for real-time deployment with large-scale retrieval?
- How sensitive is DARP to the choice of distance metric and embedding space for neighbor retrieval in unseen domains?
Why it matters for bot defense
For bot-defense engineers and CAPTCHA practitioners, DARP shows how an imitation learning agent can be made more robust purely through inference-time retrieval and implicit smoothing, without additional labeled data or online feedback. Attack or automation bots that rely on learned control policies or behavior cloning could adopt neighbor-conditioned policies, which would reduce their brittleness on unseen states and improve their reliability. Conversely, the implicit Laplacian smoothing induced by neighbor aggregation could inspire defenses that use data-driven local consistency to detect or disrupt automated agents that exploit brittle learned policies. The method is primarily aimed at robotics and sequential decision-making, however, so direct application to CAPTCHA-breaking or bot detection may require domain adaptation and attention to adversarial attack models that DARP does not address. Still, the general principle of semi-parametric, retrieval-based smoothing could guide bot-defense mechanisms that monitor or constrain behavior based on similarity to known expert or human-like patterns.
Cite
@article{arxiv2606_09758,
  title={Difference-Aware Retrieval Policies for Imitation Learning},
  author={Quinn Pfeifer and Ethan Pronovost and Paarth Shah and Khimya Khetarpal and Siddhartha Srinivasa and Abhishek Gupta},
  journal={arXiv preprint arXiv:2606.09758},
  year={2026},
  url={https://arxiv.org/abs/2606.09758}
}