CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

Source: arXiv:2605.15122 · Published 2026-05-14 · By Michael Baumgartner, David Müller, Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross et al.

TL;DR

This paper addresses the challenge of accurate proprioceptive state estimation for legged robots executing highly dynamic and contact-rich motions. Existing methods commonly rely on binary contact states to model foot or limb contact, which fails to capture intermediate contact conditions such as directional slippage or partial contact. To address this limitation, the authors present CoCo-InEKF, a hybrid approach combining an invariant extended Kalman filter (InEKF) with a learned neural contact module that predicts continuous, directionally-aware contact velocity covariances instead of binary contact flags. This learned covariance expresses uncertainty about contact quality, allowing the filter to dynamically weight contact constraints in a nuanced way. Training is performed end-to-end via backpropagation through time (BPTT) using only state error losses, removing the need for heuristic ground-truth contact labels.

In extensive experiments on a bipedal robot performing complex motions including dancing and multi-contact ground interactions, CoCo-InEKF consistently outperforms prior methods including classical InEKF with heuristic contact detection, hybrid learned contact classifiers, and purely end-to-end transformers. The results demonstrate significant reductions in linear velocity estimation error and improved filter consistency. The model also proves insensitive to exact contact candidate placement, and an automated candidate selection procedure was introduced to ease application to new robots. Overall, the paper advances state estimation robustness for legged robots by replacing brittle binary contact assumptions with learned continuous covariances integrated into a differentiable InEKF framework.

Key findings

CoCo-InEKF reduces linear velocity RMSE by approximately 62% compared to the best hybrid baseline on dancing motions (0.046 m/s vs 0.121 m/s, Tab. III).
On ground motion datasets, CoCo-InEKF achieves linear velocity RMSE of 0.099 m/s, outperforming heuristic InEKF (4.448 m/s) and hybrid baselines (~0.4 m/s) by large margins (Tab. IV).
End-to-end training with state velocity loss avoids reliance on contact labels and yields a contact covariance output aligned with physical contact states but with richer nuances (Fig. 3).
Replacing binary contact flags with learned contact velocity covariances allows continuous modulation of contact confidence including slippage, improving filter consistency measured with NEES (Fig. 7).
Automated contact candidate selection matches or slightly outperforms hand-picked points in linear velocity ATE with stable performance across multiple runs (Tab. IX).
Model accuracy improves with the number of candidate contacts (from 4 to 18) with acceptable computational scaling, outperforming large-scale transformer baselines while being ~5–10× faster (Tab. VIII, V).
Longer input history (H=150) and longer BPTT unroll lengths (L=256) degrade performance compared to the default configuration (H=20, L=128), indicating an optimal temporal context size (Tabs. VI, VII).
MLP-based learned contact module achieves comparable or better accuracy than prior CNN architectures while reducing inference time by ~5×, enabling real-time onboard deployment (Tab. V).

Threat model

N/A — the work is not a security-focused paper. It assumes a standard robotics setting with no active adversaries or intentional spoofing. The focus is on improving robustness and accuracy under noisy and uncertain proprioceptive sensing in dynamic legged robot locomotion.

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly modeled since the focus is on improving legged robot state estimation robustness in dynamic, contact-rich motions. It assumes proprioceptive sensor inputs (IMU, joint encoders, actuator torque) and a predefined set of candidate contact points on the robot. The robot morphology and sensor placements are assumed known. The method is designed to accommodate partial or slipping contacts by learning continuous contact covariances rather than binary contact states.
Data: Training data is generated via simulation using a pretrained robot control policy executing various motions including dancing and full-body ground contact. For dancing, 81 sequences from a Reallusion motion dataset retargeted to the robot are used. The dataset contains 20-second episodes with randomized friction, terrain, and disturbance force domain randomization. For ground motions, a stylized falling policy generates 6-second episodes with randomized environments. Proprioceptive sensor streams and ground-truth states (orientation, position, velocity) are recorded at 600 Hz. Training employs parallel simulation with many environments (E=1280) for data aggregation.
Architecture / Algorithm: The base is an invariant extended Kalman filter (InEKF) estimating robot base orientation, linear velocity, position, and contact candidate positions jointly. A learned contact module predicts a 3x3 contact velocity covariance matrix per candidate point, represented via a lower-triangular matrix output by a neural network to guarantee positive semi-definite covariances. The neural net input over a history window H includes IMU gyro and accelerometer data, joint encoder positions and velocities, actuator torques, and relative contact point positions and velocities in the body frame. The contact covariances modulate the noise model in the InEKF prediction and correction steps.

This differentiable formulation omits the typical discrete contact state additions/removals and keeps all candidate contacts in the state vector permanently, enabling backpropagation through the entire Kalman filter graph. The contact module uses a multi-layer perceptron (MLP) for real-time onboard computation, with the architecture optimized over baselines. Slip states are implicitly encoded as increased covariance magnitude and anisotropy.

Training Regime: Training applies backpropagation through time with an unroll length (buffer) L=128 timesteps (about 213 ms), using Adam optimizer on velocity L2 loss between estimated and ground-truth body-frame velocities. No ground-truth contact labels are provided or required, avoiding heuristic labeling bias. The InEKF state is initialized to ground-truth at episode start but allowed to drift during rollout, ensuring realistic error accumulation. Experiments investigate hyperparameters including history length (H=20 default), batch size, and buffer length. Training runs on a single Nvidia RTX 4090 GPU for up to 5 days with 100k iterations.
Evaluation Protocol: Evaluation is performed on held-out simulated test sets for dancing and ground motions (100 sequences each). Metrics include absolute trajectory error (ATE) in linear velocity, position, and orientation, with RMSE, mean, median, and standard deviations reported. The normalized estimation error squared (NEES) quantifies filter consistency. Baselines include InEKF with heuristic and ground-truth contacts, learned hybrid classifiers from prior work, and purely end-to-end state estimation transformers (SET). Ablation studies probe architecture variations, history lengths, BPTT horizon sizes, and contact candidate counts. Real-time runtime performance is measured on the robot computer (Intel i7).
Reproducibility: While no public code or datasets are indicated to be released, the paper details the simulation policy source and dataset names (Reallusion retargeted motions) and enumerates hyperparameters and architectural choices. The method depends on differentiable Lie group EKF implementations and standard neural network training tools. Full reproducibility may be challenging without the robot simulator and exact training scripts.

Concrete Example Walkthrough: For each training iteration, data is gathered by rolling out a pretrained control policy in multiple parallel simulation environments with randomized terrain and friction. For each timestep, the neural contact module ingests the past 20 steps of sensor data and predicts contact velocity covariance matrices for all candidate contact points. These covariances parameterize the noise in the differentiable InEKF, which fuses joint kinematics and IMU data to estimate robot state over the rollout. The velocity estimation error relative to simulator ground-truth is computed and backpropagated through the InEKF and neural network parameters using BPTT. Parameters are updated using Adam to reduce velocity estimation errors in future iterations. This cycle continues over 100k iterations.

At test time, only proprioceptive sensing and contact candidate locations are required. The trained neural module outputs continuous covariance estimates per candidate point, which are combined with the InEKF to estimate the current robot state online in real time, enabling robust dynamic locomotion despite uncertain contact conditions.

Technical innovations

Introduction of continuous, learned contact velocity covariance matrices replacing brittle binary contact states in contact-aided InEKF for legged robot state estimation.
End-to-end differentiable InEKF formulation that permanently maintains all contact candidates in the state, enabling backpropagation through the filtering steps without discrete contact state changes.
Neural network contact module trained via backpropagation through time on state velocity error alone, eliminating the need for ground-truth contact labels or heuristics.
Automated contact candidate selection procedure based on farthest point sampling from sampled mesh points, providing robust performance without manual tuning.

Datasets

Reallusion motion capture dataset — 81 sequences retargeted for dancing motions — public dataset
Simulated ground motion dataset — synthetic falling and multi-contact episodes — generated in-house via simulation

Baselines vs proposed

InEKF with ground-truth contact: linear velocity RMSE = 0.176 m/s (dancing), 0.418 m/s (ground); CoCo-InEKF = 0.046 m/s (dancing), 0.099 m/s (ground) (Tab. III, IV)
InEKF with heuristic contacts: linear velocity RMSE = 2.675 m/s (dancing), 4.448 m/s (ground); CoCo-InEKF much lower at 0.046 m/s and 0.099 m/s
Hybrid Baseline+ (learned contact classification with slip): RMSE = 0.121 m/s (dancing), 0.428 m/s (ground); CoCo-InEKF reduces error by over 60%
State Estimation Transformer (SET), large: RMSE = 0.286 m/s (dancing), 0.096 m/s (ground); CoCo-InEKF better on dancing dataset and comparable on ground dataset while faster
MLP-based CoCo-InEKF inference time 0.14 ms NN / 0.42 ms filter vs SET large 3.02 ms (Tab. V)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.15122.

Fig 1

Fig 1: CoCo-InEKF. Given a set of predefined contact candidates

Fig 3

Fig 3: Contact Covariance. Visualization of the contact covariance

Fig 5

Fig 5: We hypothesize that as our method reasons about

Fig 4

Fig 4: Foot contact candidate configurations. As used in the

Fig 6

Fig 6: Consistency. Visualization of the normalized estimation error

Fig 6

Fig 6 (page 7).

Fig 7

Fig 7 (page 7).

Fig 8

Fig 8 (page 7).

Limitations

Evaluations primarily on simulated datasets with ground-truth state; limited explicit testing on physical robot data accuracy available.
The method assumes accurate robot morphology and sensor calibration; real-world sensor noise distributions may vary.
While continuous covariances model slippage implicitly, explicit slip detection or recovery is not formulated separately.
No adversarial robustness analysis or evaluation against sensor spoofing or malicious perturbations.
Automated contact candidate selection works well in tested scenarios but may degrade on novel robots with substantially different morphologies or contact patterns.
Global yaw drift remains unobservable and not directly optimized in velocity-only loss formulation.

Open questions / follow-ons

How well does the learned contact covariance generalize to novel terrains, robots, or motion types outside the training distribution?
Can the learned covariance predictions be combined explicitly with tactile or force sensing to further improve slip detection and rejection?
What are the effects of incorporating additional losses on position and orientation states to mitigate global drift in longer episodes?
How does the differentiable InEKF training scale to very large numbers of contact candidates or multi-robot systems?

Why it matters for bot defense

While CoCo-InEKF focuses on proprioceptive state estimation under complex contact dynamics in robotics, its core idea of replacing brittle binary discrete latent states with continuous learned uncertainty representations may inspire analogous approaches in bot defense for modeling uncertain or partial behavioral signals. The differentiable filtering combined with end-to-end learning of noise covariances provides a pathway to integrate structured model-based components with learned modules, potentially enabling more nuanced bot-behavior confidence scoring beyond binary labels. Automated candidate selection techniques may also be analogous to automated feature or token candidate selection in CAPTCHA design. However, the specific application domain and sensory modalities differ substantially from CAPTCHA bot detection, so direct application requires careful adaptation. Overall, the paper contributes principled methods for learning uncertainty-aware dynamic models that could inform future bot-defense ML filter designs.

Cite

bibtex

@article{arxiv2605_15122,
  title={ CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios },
  author={ Michael Baumgartner and David Müller and Agon Serifi and Ruben Grandia and Espen Knoop and Markus Gross and Moritz Bächer },
  journal={arXiv preprint arXiv:2605.15122},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.15122}
}

CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​