NeuROK: Generative 4D Neural Object Kinematics
Source: arXiv:2605.30347 · Published 2026-05-28 · By Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
TL;DR
NeuROK addresses the challenge of generating physically plausible 4D object dynamics (temporal deformations) from static 3D shapes without relying on category-specific physical priors or predefined models. Prior approaches typically require explicit physical models and system identification tailored to specific object types, limiting generalization and scalability. NeuROK introduces a learned neural kinematic state parameterization representing object dynamics within a low-dimensional latent space, which maps to plausible mesh deformations. This framework enables simulating diverse object types under varying physical conditions via a transformer-based encoder-decoder architecture trained on a large-scale curated 4D dataset.
The NeuROK approach formulates the learned latent space as generalized coordinates in the classical Lagrangian mechanics framework, allowing simulation of temporal object dynamics by solving Euler-Lagrange equations over the latent trajectories. This results in physically consistent 4D sequences generated from single static 3D objects by sampling latent states and propagating dynamics under initial conditions or external forces. Experiments demonstrate NeuROK’s superior reconstruction accuracy, simulation fidelity, and generalization across diverse dynamic object categories compared to existing methods, both model-based and end-to-end learned.
Key findings
- NeuROK reduces Chamfer distance (L1) in inverse kinematics to 0.028 vs 0.067 for KeyPointDeformer and 0.082 for CANOR (Tab. 1), showing more accurate kinematic state recovery.
- User studies show NeuROK 4D motion preferred for realism and alignment in 81.43% and 83.33% of cases respectively, far outperforming PhysDreamer (5.95%, 5.36%) and other baselines (Tab. 2).
- NeuROK achieves higher VBench Aesthetic Quality (0.483) and WorldScore Imaging Quality (51.1) than PhysDreamer (0.362, 48.43) and AnimateAnyMesh (0.450, 48.37).
- Compressing the latent space with the Active Subspace Method preserves deformation fidelity and improves inverse kinematics accuracy (Tab. 1 ablations).
- NeuROK’s physically inspired dynamics approximately conserve energy over time, indicating that the numerical Euler-Lagrange solutions are physically consistent (Fig. 8).
- NeuROK generalizes to object categories absent from the training data, synthesizing plausible motion dynamics (Fig. 9).
- Removing any one design component (model reduction, data augmentation, or dual-quaternion deformation parameterization) degrades inverse kinematics accuracy (Tab. 1).
- NeuROK simulates real scanned objects, e.g. a laptop closing motion, showing practical applicability beyond synthetic datasets (Fig. 7).
Methodology — deep read
Threat Model & Assumptions: The paper defines no adversary or threat model. NeuROK assumes no prior knowledge of the object's physical parameters and no category-specific structural priors; the goal is to learn a generalizable simulator for object-centric deformable bodies whose kinematic and dynamic models are unknown. Training requires only sequences of meshes showing the object deforming over time, without explicit action or force labels, and the system does not rely on predefined object-specific physical models or handcrafted constraints.
Data: The method is trained on a curated large-scale 4D dataset composed of sequences of deforming meshes collected from previous datasets (e.g., PartNet-Mobility) and synthetic physics simulations. Each object instance has a mesh M0 with n vertices and deformation trajectories {M1,...,MT} sharing topology. During training, random pairs of frames are sampled to obtain deformation fields. Point clouds are sampled from meshes for encoding features. The dataset size and full provenance details are in supplementary materials.
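The frame-pair sampling described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the area-weighted sampling scheme, and the choice to define the deformation field as the per-vertex displacement between the two sampled frames are all assumptions.

```python
import numpy as np

def sample_frame_pair(trajectory, rng):
    """Pick two distinct frame indices from a (T + 1, n, 3) mesh sequence."""
    src, dst = rng.choice(trajectory.shape[0], size=2, replace=False)
    return int(src), int(dst)

def sample_barycentric(vertices, faces, num_points, rng):
    """Area-weighted surface sampling; returns face ids and barycentric weights
    so the same material points can be located in any frame of the sequence."""
    tri = vertices[faces]                                        # (F, 3, 3)
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(normals, axis=1)
    face_ids = rng.choice(len(faces), size=num_points, p=areas / areas.sum())
    s = np.sqrt(rng.random(num_points))
    r = rng.random(num_points)
    weights = np.stack([1.0 - s, s * (1.0 - r), s * r], axis=1)  # rows sum to 1
    return face_ids, weights

def interpolate(values, faces, face_ids, weights):
    """Barycentric interpolation of per-vertex values at the sampled points."""
    return np.einsum("pk,pkd->pd", weights, values[faces[face_ids]])

# Toy sequence: a tetrahedron translating along x (shared topology across frames).
rng = np.random.default_rng(0)
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
shift = np.array([0.1, 0.0, 0.0])
trajectory = np.stack([verts + t * shift for t in range(4)])

src, dst = sample_frame_pair(trajectory, rng)
face_ids, weights = sample_barycentric(trajectory[src], faces, 64, rng)
vertex_field = trajectory[dst] - trajectory[src]   # assumed definition of the field
point_field = interpolate(vertex_field, faces, face_ids, weights)
```

Keeping face ids and barycentric weights, rather than raw point coordinates, is one way to exploit the shared topology: the sampled points stay attached to the same triangles in every frame.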
Architecture/Algorithm: NeuROK uses a conditional variational autoencoder framework with three components: (a) The Kinematic Prior Encoder Econd takes the static mesh M0, samples a point cloud of n_sample points with positional embeddings, and uses a perceiver-inspired transformer with K learnable tokens to produce the parameters of a Gaussian prior pM0(z) over the latent space Z. (b) The Variational Deformation Encoder EVAE encodes a deformation field ϕ together with M0: it computes a deformation vector at each sampled point and uses another transformer with 2K tokens to produce the posterior qM0(z|ϕ). (c) The Deformation Decoder D(z,M0) takes a sampled latent z (reshaped into K tokens), cross-attends it with query point clouds, and predicts deformation vectors via an MLP; these are mapped back to vertex deformations with barycentric smoothing. The latent dimension k is initially high.
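To make the decoder's data flow concrete, here is a heavily simplified sketch: a reparameterized draw from a diagonal Gaussian, a reshape into K tokens, one single-head cross-attention from query points to those tokens, and a two-layer MLP producing 3D deformation vectors. The weight names, layer sizes, single attention head, and the omission of positional embeddings and barycentric smoothing are all assumptions made for brevity; the weights are random, not trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps from a diagonal Gaussian."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def decode_deformation(z, query_points, params, num_tokens):
    """Query points attend to the latent tokens; an MLP maps the result to 3D."""
    tokens = z.reshape(num_tokens, -1)                # (K, d)
    q = query_points @ params["w_q"]                  # (P, d)
    k = tokens @ params["w_k"]                        # (K, d)
    v = tokens @ params["w_v"]                        # (K, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))     # (P, K), rows sum to 1
    hidden = np.maximum(attn @ v @ params["w_1"], 0)  # ReLU
    return hidden @ params["w_2"]                     # (P, 3) deformation vectors

# Hypothetical sizes and random weights, for shape checking only.
rng = np.random.default_rng(0)
K, d, P, H = 4, 8, 16, 32
params = {
    "w_q": rng.standard_normal((3, d)),
    "w_k": rng.standard_normal((d, d)),
    "w_v": rng.standard_normal((d, d)),
    "w_1": rng.standard_normal((d, H)),
    "w_2": rng.standard_normal((H, 3)),
}
mu, log_var = np.zeros(K * d), np.zeros(K * d)        # stand-in for the prior's output
z = sample_latent(mu, log_var, rng)
deformation = decode_deformation(z, rng.standard_normal((P, 3)), params, K)
```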
Dimension reduction is performed post hoc via the Active Subspace Method, identifying directions most influential on deformation norm and compressing Z to a lower kq-dimensional latent Q.
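The standard Active Subspace construction estimates the matrix C = E[∇f(z) ∇f(z)ᵀ] for a scalar function f and keeps the eigenvectors with the largest eigenvalues. The sketch below applies it to a toy decoder, taking f to be the squared deformation norm. The toy linear decoder, the standard-normal sampling of z (the paper would presumably sample from the learned prior), and the finite-difference gradients are assumptions.

```python
import numpy as np

def active_subspace(f, dim, num_keep, num_samples, rng, h=1e-5):
    """Estimate C = E[grad f grad f^T] by sampling and return its eigenvalues
    (descending) together with the top `num_keep` eigenvectors as columns."""
    grads = np.empty((num_samples, dim))
    for s in range(num_samples):
        z = rng.standard_normal(dim)
        for i in range(dim):
            e = np.zeros(dim)
            e[i] = h
            grads[s, i] = (f(z + e) - f(z - e)) / (2.0 * h)   # central difference
    C = grads.T @ grads / num_samples
    eigvals, eigvecs = np.linalg.eigh(C)                      # ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order[:num_keep]]

# Toy decoder: 8 latent dimensions, but only 2 directions move the 30 outputs.
rng = np.random.default_rng(0)
R = rng.standard_normal((2, 8))
A = rng.standard_normal((30, 2)) @ R
deformation_norm_sq = lambda z: float(np.sum((A @ z) ** 2))

eigvals, W = active_subspace(deformation_norm_sq, dim=8, num_keep=2,
                             num_samples=200, rng=rng)
reduce_latent = lambda z: W.T @ z   # z in Z  ->  q in the reduced space Q
expand_latent = lambda q: W @ q     # q in Q  ->  z in Z
```

In this toy case the eigenvalue spectrum drops to numerical zero after the second entry, which is the signal used to choose the reduced dimension k_q.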
Training Regime: The three models (Econd, EVAE, D) are trained jointly end-to-end with a conditional VAE loss that combines a reconstruction term (L2 error between sampled and decoded deformations) with KL divergence regularization, the latter weighted by λ=0.01. Each iteration draws random mesh instances and deformation samples from the 4D dataset. The networks are built from transformer cross- and self-attention layers.
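A minimal sketch of such an objective, assuming diagonal Gaussians for both the posterior and the learned prior, and assuming λ multiplies the KL term. The function names and the mean-over-points reduction are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    per_dim = log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(per_dim)

def cvae_loss(pred, target, mu_q, log_var_q, mu_p, log_var_p, kl_weight=0.01):
    """Squared L2 reconstruction error (mean over points) plus weighted KL."""
    recon = np.mean(np.sum((pred - target) ** 2, axis=-1))
    kl = gaussian_kl(mu_q, log_var_q, mu_p, log_var_p)
    return recon + kl_weight * kl, recon, kl

# Tiny example: posterior mean shifted by 1 in each of 4 unit-variance dimensions.
zeros = np.zeros(4)
kl_shifted = gaussian_kl(np.ones(4), zeros, zeros, zeros)
loss, recon, kl = cvae_loss(np.ones((5, 3)), np.zeros((5, 3)),
                            np.ones(4), zeros, zeros, zeros)
```

Unlike a standard VAE, the KL term here pulls the posterior toward a prior that is itself predicted from the static mesh, which is what lets the prior be sampled at test time without a deformation input.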
Evaluation Protocol: Inverse kinematics is evaluated by optimizing latent vectors to reconstruct target deformations and reporting Chamfer distance and volumetric IoU against the baselines Neural Deformation Graphs (NDG), SINGAPO, CANOR, KeyPointDeformer (KPD), and FreeArt3D. Generative 4D dynamics are assessed through user studies on realism and alignment with the conditioning, and through quantitative metrics (VBench and WorldScore) covering aesthetics, dynamics quality, and motion magnitude. Ablations probe the effect of key components. Energy conservation tests check the physical consistency of the latent ODE evolution under Euler-Lagrange dynamics. Generalization is tested on unseen categories.
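For reference, the two inverse kinematics metrics can be computed as below. Conventions differ between papers (squared versus unsquared distances, sum versus average of the two directions), so this is one common definition and not necessarily the one behind the reported numbers.

```python
import numpy as np

def chamfer_l1(a, b):
    """Symmetric Chamfer distance using unsquared Euclidean nearest-neighbor
    distances, averaged over both directions. a: (N, 3), b: (M, 3)."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M)
    return 0.5 * (dists.min(axis=1).mean() + dists.min(axis=0).mean())

def volumetric_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    union = np.logical_or(occ_a, occ_b).sum()
    if union == 0:
        return 1.0   # both grids empty: treat as a perfect match
    return np.logical_and(occ_a, occ_b).sum() / union

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
same = chamfer_l1(points, points)
single = chamfer_l1(np.array([[0.0, 0.0, 0.0]]), np.array([[3.0, 4.0, 0.0]]))
iou = volumetric_iou(np.array([True, True, False, False]),
                     np.array([True, False, True, False]))
```

The brute-force pairwise distance matrix is fine for a few thousand points; larger clouds would normally use a spatial index instead.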
Reproducibility: Code release and full dataset details are on the project page (link in paper). The latent space and deformation decoder weights are trained and frozen for experiments. Some datasets used are proprietary or curated from prior sources.
Concrete example: Given a static 3D mesh of a laptop (M0), the encoder produces a prior latent distribution pM0(z). A latent state z sampled from this prior is decoded by D into a plausible deformation, i.e. a specific 3D shape such as the laptop half-closed. Under physical conditions (contact/forces), the initial latent state and velocity (z0, ż0) are optimized to match known vertex positions and velocities; the Euler-Lagrange equations are then integrated to obtain the latent trajectory {zi} over time, which is decoded into a 4D sequence of deformed meshes showing the laptop closing realistically.
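The simulate-then-decode loop can be illustrated with a toy decoder. In the sketch below a hinged rod stands in for D(z, M0), with a single latent coordinate (the hinge angle); the latent acceleration comes from projecting Newton's law onto the decoder Jacobian, which is equivalent to the Euler-Lagrange equations for point masses attached to the decoded vertices. The toy decoder, lumped vertex masses, gravity as the only force, finite-difference derivatives, and the RK4 integrator are all assumptions; the paper's solver may differ.

```python
import numpy as np

def decode(q, radii):
    """Toy stand-in for D(q, M0): points on a rigid rod hinged at the origin.
    q[0] is the hinge angle measured from the downward vertical."""
    theta = q[0]
    return np.stack([radii * np.sin(theta),
                     -radii * np.cos(theta),
                     np.zeros_like(radii)], axis=1)            # (n, 3)

def jacobian(f, q, h=1e-5):
    """Central-difference Jacobian of the flattened decoder output, (3n, k)."""
    cols = []
    for i in range(q.size):
        e = np.zeros_like(q)
        e[i] = h
        cols.append((f(q + e) - f(q - e)).ravel() / (2.0 * h))
    return np.stack(cols, axis=1)

def latent_accel(f, q, v, masses, gravity, h=1e-4):
    """Solve (J^T m J) a = J^T (force - m * d2x), where d2x is the second
    directional derivative of the decoder along v (velocity-dependent term)."""
    J = jacobian(f, q)
    m = np.repeat(masses, 3)                                   # per-coordinate mass
    mass_matrix = J.T @ (m[:, None] * J)
    d2x = (f(q + h * v) - 2.0 * f(q) + f(q - h * v)).ravel() / h**2
    force = (masses[:, None] * gravity).ravel()
    return np.linalg.solve(mass_matrix, J.T @ (force - m * d2x))

def rk4_step(f, q, v, dt, masses, gravity):
    acc = lambda q_, v_: latent_accel(f, q_, v_, masses, gravity)
    k1q, k1v = v, acc(q, v)
    k2q, k2v = v + 0.5 * dt * k1v, acc(q + 0.5 * dt * k1q, v + 0.5 * dt * k1v)
    k3q, k3v = v + 0.5 * dt * k2v, acc(q + 0.5 * dt * k2q, v + 0.5 * dt * k2v)
    k4q, k4v = v + dt * k3v, acc(q + dt * k3q, v + dt * k3v)
    q_next = q + dt * (k1q + 2 * k2q + 2 * k3q + k4q) / 6.0
    v_next = v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return q_next, v_next

def total_energy(f, q, v, masses, gravity):
    xdot = (jacobian(f, q) @ v).reshape(-1, 3)
    kinetic = 0.5 * np.sum(masses[:, None] * xdot**2)
    potential = -np.sum(masses[:, None] * gravity * f(q))
    return kinetic + potential

radii = np.linspace(0.1, 1.0, 10)
masses = np.full(10, 0.1)
gravity = np.array([0.0, -9.81, 0.0])
f = lambda q: decode(q, radii)

q, v = np.array([1.0]), np.array([0.0])        # initial latent state and velocity
e0 = total_energy(f, q, v, masses, gravity)
trajectory = [q]
for _ in range(200):
    q, v = rk4_step(f, q, v, 0.01, masses, gravity)
    trajectory.append(q)
energy_drift = total_energy(f, q, v, masses, gravity) - e0
meshes = [f(q_t) for q_t in trajectory]        # decoded 4D sequence
```

Tracking `energy_drift` mirrors the kind of energy conservation check the summary attributes to Fig. 8: with no dissipation or external input, total energy should stay close to its initial value.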
Technical innovations
- Learning a data-driven low-dimensional latent space as a neural kinematic state parameterization that maps to plausible deformed meshes, replacing over-parameterized geometry-based representations.
- Integration of the learned latent space with classical Lagrangian mechanics to generate physically consistent 4D object dynamics via solving Euler-Lagrange equations over latent trajectories.
- A scalable transformer-based conditional VAE architecture that infers instance-specific kinematic state distributions from raw 3D meshes without relying on physical or action annotations.
- Application of dimension reduction (Active Subspace Method) to further compress latent spaces while preserving deformation fidelity, enhancing simulation efficiency and accuracy.
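The coupling between the latent space and Lagrangian mechanics named in the second item can be written out as follows. This is a textbook derivation under stated assumptions, not a transcription of the paper: $x(q) = D(q, M_0)$ denotes the stacked vertex positions, $M_v$ an assumed constant vertex mass matrix, $V$ a potential energy, and $Q$ any generalized external force.

```latex
\begin{align}
  J(q) &= \frac{\partial x}{\partial q}, \qquad
  M(q) = J(q)^{\top} M_v \, J(q), \\
  \mathcal{L}(q, \dot q) &= \tfrac{1}{2}\, \dot q^{\top} M(q)\, \dot q - V(q), \\
  \frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot q_i}
    - \frac{\partial \mathcal{L}}{\partial q_i} &= Q_i, \\
  \sum_j M_{ij}\, \ddot q_j
    + \sum_{j,k} \Gamma_{ijk}\, \dot q_j \dot q_k
    + \frac{\partial V}{\partial q_i} &= Q_i, \\
  \Gamma_{ijk} &= \tfrac{1}{2}\left(
      \frac{\partial M_{ij}}{\partial q_k}
    + \frac{\partial M_{ik}}{\partial q_j}
    - \frac{\partial M_{jk}}{\partial q_i} \right).
\end{align}
```

Here $\Gamma_{ijk}$ are the Christoffel symbols of the first kind of the metric $M(q)$. When $Q = 0$, the total energy $E = \tfrac{1}{2}\dot q^{\top} M(q)\,\dot q + V(q)$ is constant along exact solutions, so any drift observed in practice comes from the numerical integrator or from approximating the derivatives of $D$.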
Datasets
- PartNet-Mobility [104] — Medium scale, publicly available 3D articulated object dataset with mobility annotations.
- Curated large-scale 4D dataset compiled from multiple prior works and synthetic physical simulations — size and full source details in supplement (not fully public).
Baselines vs proposed
- NeuralDeformationGraphs: Chamfer L1 = 0.670 vs NeuROK: 0.028
- SINGAPO: Chamfer L1 = 0.313 vs NeuROK: 0.028
- FreeArt3D: Chamfer L1 = 0.169 vs NeuROK: 0.028
- CANOR: Chamfer L1 = 0.082 vs NeuROK: 0.028
- KeyPointDeformer: Chamfer L1 = 0.067 vs NeuROK: 0.028
- PhysDreamer user preference: realism = 5.95% vs NeuROK = 81.43%; alignment = 5.36% vs NeuROK = 83.33%
- AnimateAnyMesh VBench Aesthetic Quality = 0.450 vs NeuROK = 0.483
- PhysDreamer WorldScore Motion Magnitude = 0.783 vs NeuROK = 2.343
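To put the Chamfer numbers above on a common scale, the relative error reduction against each baseline can be computed directly from the listed values. The helper name is illustrative; the inputs are only the figures quoted in this list.

```python
neurok = 0.028
baselines = {
    "NeuralDeformationGraphs": 0.670,
    "SINGAPO": 0.313,
    "FreeArt3D": 0.169,
    "CANOR": 0.082,
    "KeyPointDeformer": 0.067,
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

reductions = {name: relative_reduction(value, neurok)
              for name, value in baselines.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower Chamfer L1")
```

Even against the strongest listed baseline, KeyPointDeformer, the reduction works out to roughly 58%.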
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30347.

Fig 1: We present a versatile and scalable framework for generating simulative 4D dynamics of static 3D objects under physical

Fig 2: Kinematic state parameterization. (a) Several kinematic state parameterizations can be used to describe a physical system. The

Limitations
- The learned latent space and dynamics are data-driven and depend heavily on the quality and diversity of the training 4D dataset; rare or unseen phenomena may not be well represented.
- No explicit adversarial robustness or security evaluation was conducted; the model’s behavior under adversarial perturbations or malicious input manipulations is unknown.
- The method requires objects to share mesh topology across their deformed states, limiting applicability to objects with changing topology or topology-ambiguous deformations.
- Physical accuracy is approximate and emergent from learned latent dynamics rather than exact physical simulation; detailed quantitative error bounds on physical laws are not reported.
- Real-world object simulation requires pre-captured static 3D meshes and initial velocity conditions, which may be challenging in uncontrolled environments.
- Certain algorithmic details, such as Christoffel symbol computation and numerical solver stability, are in supplementary material and not fully analyzed.
Open questions / follow-ons
- How well does NeuROK handle complex contact, collisions, and topological changes in real-world deformable objects?
- Can the framework extend to incorporate multimodal sensory inputs beyond geometry, such as vision or tactile data, to improve latent space learning?
- What are the limits of generalization when simulating completely novel materials or composite objects with heterogeneous dynamics?
- How robust is the learned latent dynamics to noisy or partial observations at test time, and can it be adapted online?
Why it matters for bot defense
NeuROK introduces a novel generative framework for simulating physically plausible 4D dynamics directly from static 3D objects via a learned latent kinematic space, bypassing reliance on category-specific physical models. For bot-defense and CAPTCHA practitioners, embedding such a model could enable the creation of interactive, physically consistent 3D stimuli or challenges that reflect natural object dynamics under user manipulation.
Unlike classical scripted or heuristic animations, NeuROK’s approach offers scalable generation of diverse dynamic behaviors from raw shape inputs, providing a rich source of realistic but hard-to-predict temporal deformations. This could help design bot-resistant 3D puzzles or interactive captchas where human users reason about plausible object motion, leveraging the model’s ability to synthesize consistent spatiotemporal object states. However, deployment would require carefully curated 3D assets and possible adaptation to real-time constraints. The model’s generalizability to unseen objects also enables flexible CAPTCHA content generation beyond fixed domains.
Cite
@article{arxiv2605_30347,
title={NeuROK: Generative 4D Neural Object Kinematics},
author={Chen Geng and Guangzhao He and Yue Gao and Yunzhi Zhang and Shangzhe Wu and Jiajun Wu},
journal={arXiv preprint arXiv:2605.30347},
year={2026},
url={https://arxiv.org/abs/2605.30347}
}