Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

Source: arXiv:2606.12334 · Published 2026-06-10 · By Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia et al.

TL;DR

This paper addresses a key limitation in robotic imitation learning policies conditioned on 3D point cloud observations. The authors identify that neural networks encoding Cartesian coordinates suffer from spectral bias, favoring learning low-frequency functions and thereby struggling to capture fine geometric details crucial for high-precision tasks such as peg insertion or drawer manipulation. To overcome this, they propose applying a NeRF-style Fourier feature mapping to input point cloud coordinates, projecting them into a high-dimensional sinusoidal embedding space to amplify high-frequency geometric cues.
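
As a rough formalization of the mapping described above (a sketch in NeRF-style notation; the symbols $\gamma$, $K$ and $\lambda_k$ are introduced here for illustration, and the paper's exact ordering and normalization may differ), each coordinate of a point $\mathbf{p} = (x, y, z)$ is passed through a sine and a cosine at each of $K$ wavelengths $\lambda_1, \dots, \lambda_K$:

```latex
\gamma(\mathbf{p}) =
\Big[
  \sin\!\big(\tfrac{2\pi}{\lambda_k} x\big),\ \cos\!\big(\tfrac{2\pi}{\lambda_k} x\big),\
  \sin\!\big(\tfrac{2\pi}{\lambda_k} y\big),\ \cos\!\big(\tfrac{2\pi}{\lambda_k} y\big),\
  \sin\!\big(\tfrac{2\pi}{\lambda_k} z\big),\ \cos\!\big(\tfrac{2\pi}{\lambda_k} z\big)
\Big]_{k=1}^{K}
\in \mathbb{R}^{6K}
```

The embedding dimension grows as $6K$, so short wavelengths can be added without changing the encoder that consumes the features.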

The method is evaluated extensively on three challenging benchmarks: RoboCasa (16 high-precision kitchen tasks with human demos), ManiSkill3 (4 diverse manipulation tasks), and four difficult real-world robotic manipulation tasks. Across different point cloud encoder architectures and diffusion-based imitation learning policies, Fourier features consistently improve success rates, by up to 20 percentage points on RoboCasa and 7 on ManiSkill3, and boost normalized real-world task scores from 14.8% to 40.2%. Qualitative analysis shows smoother, more decisive motions with Fourier features. The work thus supports Fourier features as a simple, architecture-agnostic, robust way to overcome spectral bias and let policies make better use of detailed spatial geometry in imitation learning.

Key findings

  • Fourier feature mappings improve mean success rate by up to 20 percentage points on RoboCasa and 7 percentage points on ManiSkill3 compared to Cartesian baselines; individual tasks can gain more, e.g. RoboCasa CloseDrawer rises from 34% to 72% (Fig 5, Tables 6,7).
  • On real robot tasks, Fourier features increase normalized score from 14.8% to 40.2%, notably improving tasks requiring fine geometric precision like Cup-Stacking (Table 8).
  • Fourier features significantly improve performance on larger point clouds rich in geometric detail, but have diminished benefit on heavily downsampled clouds (~2k points) (Fig 6).
  • Policies trained with Fourier features exhibit faster, more confident actions without hesitation, closely imitating expert demonstrations (qualitative results in Appendix A.5).
  • Minimal architecture changes are required; Fourier mappings improve all tested point cloud encoders including PointPatch, PCM, DP3, PointTransformer, and their multi-modal variants.
  • Fourier features remain robust across hyperparameter choices and do not require additional regularization.
  • Even when fine geometric detail is removed by large Gaussian jitter noise on inputs, Fourier features still confer benefits, suggesting improved learning dynamics beyond geometry encoding (Fig 7).
  • Spectral analysis shows that networks with Fourier features are orders of magnitude more sensitive to high-frequency input components than the baselines (Fig 8).
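
A back-of-the-envelope view of why a sinusoidal input mapping raises sensitivity to small displacements. This covers only the input features, not the network-level spectral analysis of Fig 8; $\lambda$ denotes a wavelength, and the two values used are the band limits reported in the methodology section:

```latex
\frac{\partial}{\partial x} \sin\!\Big(\frac{2\pi x}{\lambda}\Big)
  = \frac{2\pi}{\lambda} \cos\!\Big(\frac{2\pi x}{\lambda}\Big),
\qquad
\max_x \Big| \frac{\partial}{\partial x} \sin\!\Big(\frac{2\pi x}{\lambda}\Big) \Big|
  = \frac{2\pi}{\lambda}
  \approx
  \begin{cases}
    314\ \text{m}^{-1}  & \lambda = 0.02\ \text{m} \\
    1.57\ \text{m}^{-1} & \lambda = 4\ \text{m}
  \end{cases}
```

By comparison, a raw Cartesian coordinate in metres has unit slope: a 1 mm shift changes that input by 0.001, while the shortest-wavelength feature can change by up to roughly 0.31.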

Methodology — deep read

The authors address imitation learning (IL) of high-precision robotic manipulation policies from expert demonstrations, where the observation includes 3D point clouds constructed from calibrated multi-view depth images. The policy is learned via score-based diffusion models that iteratively denoise Gaussian noise to generate coherent action sequences conditioned on observations and goal embeddings.
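
To make the denoising objective concrete, here is a minimal NumPy sketch of a diffusion training loss on action chunks. It uses a DDPM-style noise-prediction target with a linear beta schedule as a stand-in; the summary only says the policy is score-based and trained with an MSE loss, so the schedule, the function names (`make_alpha_bar`, `add_noise`, `denoising_loss`) and the array shapes are all assumptions for illustration.

```python
import numpy as np

def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule (assumed, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(actions, noise, t, alpha_bar):
    """Forward process: blend clean action chunks with Gaussian noise at step t."""
    a = alpha_bar[t][:, None, None]  # (batch, 1, 1), broadcasts over horizon and action dims
    return np.sqrt(a) * actions + np.sqrt(1.0 - a) * noise

def denoising_loss(predict_noise, actions, obs_tokens, alpha_bar, rng):
    """MSE between the sampled noise and the model's prediction of it."""
    batch = actions.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)  # one diffusion step per sample
    noise = rng.standard_normal(actions.shape)
    noisy = add_noise(actions, noise, t, alpha_bar)
    pred = predict_noise(noisy, t, obs_tokens)
    return float(np.mean((pred - noise) ** 2))

# Illustrative shapes: 8 chunks of 16 actions with 7 dimensions, 32-dim observation tokens.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
actions = rng.standard_normal((8, 16, 7))
obs_tokens = rng.standard_normal((8, 32))

# Placeholder "model" that always predicts zero noise; a real policy would be a network.
zero_model = lambda noisy, t, obs: np.zeros_like(noisy)
loss = denoising_loss(zero_model, actions, obs_tokens, alpha_bar, rng)
```

With the zero-prediction placeholder the loss is just the mean squared noise, so it sits near 1; a trained network would drive it lower by using the observation tokens.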

  1. Threat Model & Assumptions: No adversary model applies, since this is a learning-focused paper. The implicit requirement is that the policy distinguish fine-grained spatial details in order to imitate expert actions correctly. The main assumption is that the spectral bias inherent in neural networks limits their ability to encode the high-frequency geometric details that precise policies need.

  2. Data: The method is evaluated on RoboCasa (16 tasks, 50 human expert demos per task), ManiSkill3 (4 tasks, 500 expert demos per task), and 4 real-world manipulation tasks with 75–102 demos each. Observations include point clouds generated by unprojecting depth images and transforming them into a world frame; pre-processing includes voxel downsampling and cropping to the relevant workspace.

  3. Architecture/Algorithm: The core novelty is a non-learned Fourier feature mapping applied to point cloud coordinates before they enter the point cloud encoder. The mapping uses 16 logarithmically spaced frequency bands with wavelengths from 2 cm to 4 m, applied via sinusoids to each of the three axes, which yields a 96-dimensional embedding per point (consistent with 16 bands × 3 axes × a sine and a cosine). These embeddings serve as input point features for encoders such as PointPatch, PointPatch-attn, PCM, DP3, and PointTransformer. The encoder's output tokens, together with a goal embedding from a frozen CLIP-based goal encoder, condition a transformer-based diffusion policy that iteratively denoises noisy action chunks.

  4. Training Regime: Models are trained with 5 random seeds, and 3 checkpoints are evaluated per seed. VariableJitter augmentation, in which a random noise scale is sampled per point cloud during training, is applied to augment the data and avoid overfitting to spurious frequencies. The diffusion model is trained with a score-matching loss implemented as MSE.

  5. Evaluation Protocol: Mean success rate over 50 rollouts for RoboCasa and 100 for ManiSkill3 tasks is reported, with 95% confidence intervals from bootstrapping. Real-world tasks are evaluated with 16 rollouts per task. Ablations include removing Fourier features, varying Fourier parameter settings, and adding Gaussian noise to inputs. Spectral analysis of sensitivity functions is conducted to explain performance gains.

  6. Reproducibility: The authors provide source code and rollout videos publicly. The used datasets RoboCasa and ManiSkill3 are public; real-world data is from their experimental setup. Training hyperparameters, Fourier frequencies, and model architectures are detailed in appendices.
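
The Fourier feature mapping from step 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `fourier_features`, the feature ordering, and the choice to output only sinusoids (without concatenating the raw coordinates) are assumptions. The band count and wavelength range follow the values reported above.

```python
import numpy as np

def fourier_features(points, num_bands=16, min_wavelength=0.02, max_wavelength=4.0):
    """Map (N, 3) coordinates in metres to (N, 6 * num_bands) sinusoidal features."""
    points = np.asarray(points, dtype=np.float64)
    # Logarithmically spaced wavelengths between the two limits (both included).
    wavelengths = np.geomspace(min_wavelength, max_wavelength, num_bands)
    angular_freqs = 2.0 * np.pi / wavelengths                      # (num_bands,)
    phases = points[:, :, None] * angular_freqs[None, None, :]     # (N, 3, num_bands)
    # Per axis: all sines first, then all cosines (ordering is an arbitrary choice).
    feats = np.concatenate([np.sin(phases), np.cos(phases)], axis=-1)  # (N, 3, 2 * num_bands)
    return feats.reshape(points.shape[0], -1)

# Random stand-in for a point cloud inside a 2 m cube.
cloud = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1024, 3))
embedding = fourier_features(cloud)
print(embedding.shape)  # (1024, 96)
```

Because the mapping has no learned parameters, it can be applied as a pre-processing step in front of any encoder that accepts per-point features.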

One concrete example: For the RoboCasa CloseDrawer task, point clouds from two static cameras and one in-hand camera are unprojected and voxel-downsampled (~32k points). Fourier features are computed per point and form the high-dimensional input features to the PointPatch encoder. The diffusion policy then learns to denoise action chunks into drawer-closing actions. The success rate improves from a 34% baseline to 72% with Fourier features, a large gain attributed to reduced spectral bias and better geometric encoding.
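
The point cloud pre-processing in this example (unprojection, workspace cropping, voxel downsampling) might look roughly like the sketch below. It assumes a pinhole camera model and keeps the centroid of each occupied voxel; the function names and the parameters `fx`, `fy`, `cx`, `cy`, `cam_to_world` and `voxel_size` are hypothetical, and the paper's actual pipeline may differ.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Lift an (H, W) depth image in metres to world-frame points of shape (M, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = depth > 0                                # skip pixels without a depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    cam_points = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coordinates
    return (cam_points @ cam_to_world.T)[:, :3]

def crop_to_workspace(points, lower, upper):
    """Keep only points inside an axis-aligned box."""
    mask = np.all((points >= lower) & (points <= upper), axis=1)
    return points[mask]

def voxel_downsample(points, voxel_size):
    """Replace all points in each occupied voxel by their centroid."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)  # guard against NumPy versions that return a 2-D inverse
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)  # accumulate points per voxel
    return sums / counts[:, None]

# Toy 4x4 depth image at a constant 2 m, camera frame equal to world frame.
depth = np.full((4, 4), 2.0)
cloud = unproject_depth(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5, cam_to_world=np.eye(4))
cloud = crop_to_workspace(cloud, lower=np.array([-2.0, -2.0, 0.0]), upper=np.array([2.0, 2.0, 3.0]))
sparse = voxel_downsample(cloud, voxel_size=2.0)
```

With several cameras, the per-camera clouds would be concatenated (for example with `np.concatenate`) before cropping and downsampling.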

Technical innovations

  • Systematic application of NeRF-style Fourier feature mappings to 3D Cartesian point cloud inputs in imitation learning policies to overcome spectral bias.
  • Integration of high-frequency Fourier features into diverse point cloud encoder architectures with minimal modification.
  • Demonstration that Fourier features consistently and robustly improve diffusion-based score matching policies for fine-grained robotic manipulation.
  • Use of VariableJitter data augmentation (a random noise scale sampled per point cloud) to improve training stability and generalization without task-specific frequency tuning.
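
A minimal sketch of what a VariableJitter-style augmentation could look like, based only on the description that a random noise scale is sampled per point cloud. The uniform distribution over the scale, the `max_sigma` value, and the function name `variable_jitter` are assumptions, as is applying the jitter to raw coordinates before any Fourier mapping.

```python
import numpy as np

def variable_jitter(points, max_sigma=0.01, rng=None):
    """Add Gaussian noise whose standard deviation is drawn once per point cloud."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.uniform(0.0, max_sigma)  # one noise scale shared by the whole cloud (assumed uniform)
    noisy = points + sigma * rng.standard_normal(points.shape)
    return noisy, sigma

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(2048, 3))
jittered, sigma = variable_jitter(cloud, max_sigma=0.01, rng=rng)
```

Sampling the scale per cloud, rather than fixing it, exposes the policy to both nearly clean and heavily perturbed geometry during training.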

Datasets

  • RoboCasa — ~50 demonstrations per each of 16 manipulation tasks — public benchmark
  • ManiSkill3 — 500 demonstrations per each of 4 manipulation tasks — public benchmark
  • Real World Tasks — 75 to 102 human demonstrations per task for 4 complex manipulation tasks — authors' experimental setup

Baselines vs proposed

  • PointPatch baseline (Cartesian coords): mean success rate on RoboCasa 13% vs with Fourier Features: 34%
  • CloseDrawer task: PointPatch baseline 34% vs with Fourier Features 72%
  • TurnOffSinkFaucet task: baseline 28% vs Fourier Features 63% success
  • ManiSkill3 average success rate: baseline ~70% vs Fourier ~77% (minor improvement)
  • Real World aggregate normalized score: baseline 14.8% vs Fourier Features 40.2%
  • PointPatch + RGB (pretrained): Fourier features bring a minor improvement in simulation but a large improvement on real-world tasks
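
Success rates like those above are reported with bootstrapped 95% confidence intervals. The sketch below shows a percentile bootstrap over binary rollout outcomes. The resampling unit (individual rollouts rather than seeds or checkpoints), the function name `bootstrap_ci`, and the example data are assumptions and not taken from the paper.

```python
import numpy as np

def bootstrap_ci(successes, num_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of binary rollout outcomes."""
    outcomes = np.asarray(successes, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Resample rollouts with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, outcomes.size, size=(num_resamples, outcomes.size))
    means = outcomes[idx].mean(axis=1)
    tail = (1.0 - confidence) / 2.0
    low, high = np.quantile(means, [tail, 1.0 - tail])
    return outcomes.mean(), low, high

# Made-up outcomes for illustration: 36 successes in 50 rollouts.
rollouts = np.array([1] * 36 + [0] * 14)
mean, low, high = bootstrap_ci(rollouts)
print(f"success rate {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only 50 rollouts the interval spans well over ten percentage points, which is worth keeping in mind when comparing per-task numbers.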

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12334.

Fig 1: Method Overview. Adding a Fourier feature mapping

Fig 2: Overview of PointPatch encoder family. We group the

Fig 3: Overview of all evaluation tasks from RoboCasa, ManiSkill3, and Real World benchmarks. Left: 4 of 16 RoboCasa tasks used

Fig 4: Left: Setup for the real-world drawer experiments.

Fig 5: Mean success rate across all tasks of 3D encoders with and without Fourier features on RoboCasa (left), ManiSkill3 (middle),

Fig 6 (page 4).

Fig 7 (page 4).

Fig 8 (page 4).

Limitations

  • Benefit of Fourier features diminishes on tasks or datasets with fewer geometric details or simpler manipulation scenarios (e.g. some ManiSkill3 tasks).
  • Effectiveness is reduced when point clouds are heavily downsampled (~2k points), limiting applicability to very sparse sensor data.
  • The study focuses on imitation learning with diffusion models; applicability to reinforcement learning or other policy classes is not explored.
  • Real-world evaluation limited to 4 tasks and relatively small demo datasets; generalization to wider robotics applications remains to be verified.
  • Potential hyperparameter sensitivity around Fourier frequency bands and data augmentation exists but was not exhaustively studied.
  • The spectral bias phenomenon is addressed at input encoding stage; complementary architectural methods to enhance frequency learning were not investigated.

Open questions / follow-ons

  • Can Fourier feature mappings be adapted or learned end-to-end jointly with the rest of the IL policy to further improve performance?
  • How do Fourier features affect policies trained with reinforcement learning or other modalities beyond imitation learning?
  • What are the limits of point cloud sparsity and noise under which Fourier features still provide meaningful improvements?
  • Can similar high-frequency encoding approaches be applied effectively to hybrid 2D/3D multi-modal policies at scale?

Why it matters for bot defense

Bot-defense or CAPTCHA systems that rely on detecting subtle spatial or temporal features in 3D sensor data or spatial embeddings may encounter challenges analogous to the spectral bias discussed here. This paper highlights that using Fourier feature mappings can enable models to better distinguish fine-grained geometric or signal details that are otherwise difficult for neural networks to learn due to low-frequency bias. From a bot-defense perspective, such techniques might be leveraged to design more precise behavioral or spatial pattern recognition modules, especially when working with 3D point cloud or spatial sensor data — enabling capture of high-frequency patterns indicative of human vs automated interaction. Conversely, adversarial bots seeking to mimic human fine-grained manipulation might exploit similar spectral encoding to improve their policy fidelity. Understanding frequency biases and Fourier feature methods can inform both defensive model architecture design and robustness assessment in bot or CAPTCHA settings involving physical or spatial interaction data. The cross-benchmark robustness and real-world validation shown here underscore the practical utility of Fourier feature mappings as a general-purpose tool for improving high-precision spatial reasoning in learned policies.

Cite

bibtex
@article{arxiv2606_12334,
  title={Fourier Features Let Agents Learn High Precision Policies with Imitation Learning},
  author={Balázs Gyenes and Emiliyan Gospodinov and Jan Frieling and Enrico Krohmer and Nicolas Schreiber and Xiaogang Jia and Niklas Freymuth and Gerhard Neumann},
  journal={arXiv preprint arXiv:2606.12334},
  year={2026},
  url={https://arxiv.org/abs/2606.12334}
}

Articles are CC BY 4.0 — feel free to quote with attribution