Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Source: arXiv:2605.08054 · Published 2026-05-08 · By Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu

TL;DR

This paper addresses a fundamental failure mode of existing training-free diffusion noise optimization (DNO) methods for human motion generation: they break down when constraints become highly challenging—e.g., navigating a 0.4-meter-wide gap, ducking under a 0.5-meter barrier, or walking exactly 6 steps over 4 meters. The authors observe that the choice of initial noise in DDIM-based diffusion is the root bottleneck: random noise simply lacks the structural prior needed to satisfy rare or extreme spatiotemporal constraints, no matter how many optimization steps are applied.

The proposed framework, Retrieval-Guided Diffusion Noise Optimization (RG-DNO), augments the training-free DNO pipeline with a retrieval stage that searches a large motion dataset (HumanML3D) for clips that already partially satisfy the hard constraints. A novel 'relational task parsing' step—executable either by hand or via an LLM (DeepSeek R1)—decomposes the full constraint set into a difficult subset (sent to retrieval) and easier subsets (handled by standard random-noise optimization). The retrieved clip's inverted diffusion noise is then blended with a random-noise trajectory via a learned binary mask selected from a tractable candidate set and scored by a reward combining constraint error and motion quality penalties.

Experiments on three purpose-designed highly-constrained tasks show that RG-DNO substantially reduces constraint error versus the ProgMoGen+DNO baseline (e.g., constraint error drops from 0.0162 to 0.0050 on the narrow-gap task, and from 0.000115 to 0.000049 on the low-barrier task) while also reducing joint jitter. On the numerical step-count task, success rate improves from 0.469 to 0.594. The method is training-free, requires no fine-tuning of the base diffusion model, and integrates cleanly into the existing ProgMoGen/DNO ecosystem.

Key findings

On Task-1 (0.4 m narrow gap, 5 m walk), RG-DNO reduces constraint error from 0.0162 (ProgMoGen+DNO) to 0.0050, and max scene penetration from 0.073 to 0.027; with 5-run noise search (NS=5) constraint error reaches 0.0000 vs 0.0019 for the baseline (Table 1).
On Task-2 (0.5 m overhead barrier, 5 m walk), constraint error drops from 0.000115 to 0.000049 and max scene penetration from 0.149 to 0.079; with NS=5, error is 0.000007 vs 0.000015 for baseline, and max scene penetration 0.048 vs 0.064 (Table 1).
On Task-3 (6 steps over 4 m with concurrent hand-raise), success rate improves from 0.469 to 0.594, semantic success rate from 0.375 to 0.438, and constraint error from 0.282 to 0.0003 (Table 2).
Ablation on Task-2 shows that removing relational task parsing (using full constraint set for retrieval) causes constraint error to spike to 0.000132 (worse than full method at 0.000049) and max acceleration to 0.289, confirming that targeted retrieval subset selection is critical (Table 4).
Using retrieved noise alone (no random noise mixing) produces local foot-skate ratio of 0.180 vs 0.134 for the full method and max acceleration of 0.109 vs 0.194, demonstrating that retrieval alone degrades motion quality on non-retrieval constraint dimensions (Table 4).
DNO failure threshold for the overhead barrier is empirically around 0.5 m height; below this threshold constraint error and joint jitter rise sharply for the baseline but remain controlled for RG-DNO (Fig. 5a).
The learned reward model MotionCritic, which aligns with human perceptual preferences, was found to be counterproductive for highly-constrained tasks because it penalizes rare but necessary motions such as side-walking.
Combining RG-DNO with LLM-adjusted text prompts further reduces constraint error on Task-2 to 0.000027 and max acceleration to 0.168, outperforming either intervention alone (Table 5).

Methodology — deep read

Threat model and assumptions (problem framing): The adversary here is the constraint function itself—the system must satisfy user-specified spatiotemporal and numerical constraints that lie outside the typical distribution of motions the diffusion model has learned. The key assumption is that a sufficiently large motion dataset (HumanML3D, ~14k clips) contains examples that partially instantiate the rare required skills (e.g., side-stepping, crawling), even if no clip perfectly matches all constraints simultaneously. The base diffusion model (MDM-RoHM) is frozen throughout; all optimization is over the initial noise vector z in the DDIM sampling schedule.

Data: The base model uses the training split of HumanML3D (14,616 motion sequences from AMASS and HumanAct12, covering a wide range of daily activities). The retrieval database uses the entire HumanML3D dataset (training + validation + test). Motion sequences are preprocessed into a 284-dimensional per-frame representation (RoHM format): root trajectory r, joint positions J, joint velocities J-dot, SMPL rotation parameters theta, and foot-contact labels f. This SMPL-compatible representation is necessary because scene-penetration constraints require body-surface geometry unavailable in the native MDM joint representation. The paper notes the base MDM-RoHM is retrained from scratch on this representation using MDM's default training configuration, but exact epoch count and hardware for pretraining are not reported.

Architecture and novel components: The backbone generator G is MDM retrained on the RoHM 284-dim representation, used with a 50-step DDIM sampler. RG-DNO introduces three novel modules on top: (1) Relational Task Parsing: given a full constraint set C, this module—either hand-coded or LLM-driven (DeepSeek R1)—identifies the most difficult constraint c_D, categorizes all other constraints as 'connected', 'conflicting', or 'none' relative to c_D, and partitions C into retrieval set C_R, random-noise fitting set C_1, and retrieved-noise fitting set C_2. A binary retrieval-confidence flag s determines whether c_D also remains in C_1. (2) Constraint-based Retrieval: the dataset D is searched exhaustively (Eq. 5) for the clip minimizing the constraint loss F_{C_R}; spatial horizontal transforms H are allowed (rotation/translation in the ground plane) to improve alignment; semantic consistency is enforced by text-annotation matching; duration is normalized by linear temporal interpolation. The top-k clips below an error threshold are retained and one is selected as x_R. The retrieved noise z_R is obtained by DDIM inversion of Hx_R conditioned on text C_0 (Eq. 7). (3) Masked Noise Optimization: Two separate noise trajectories are optimized—z_1 initialized from random noise N(0,I) to minimize F_{C_1}, and z_2 initialized from z_R to minimize F_{C_2}—both using gradient-normalized Adam with cosine LR decay and warmup (LR=0.05, N_1=100 steps). A frame×pose-feature binary mask M ∈ R^{T×d} is then selected from a tractable candidate set (2^{N_T} temporal candidates with N_T=8 segments, or 2^{N_S} spatial candidates with N_S=8 body-part partitions) by minimizing F_C(G(z')) + R(G(z')) over all candidates in parallel, where z' = Mz_1 + (1-M)*z_2. Soft boundary refinement via sigmoid parameterization is applied after discrete selection. The reward R penalizes joint jitter (max acceleration), foot-skating, noise de-correlation (from DNO), and optionally semantic text-motion alignment. Final motion is produced by a second DNO round initialized from z' (LR=0.02, N_2=400 steps, Eq. 12).

Training regime: The base MDM-RoHM is pretrained with MDM's default configuration (details deferred to the base paper; exact epochs/hardware not specified in this paper). The RG-DNO optimization is entirely inference-time: 100 steps for stage-1 (z_1, z_2, z' and retrieval fitting) and 400 steps for stage-2 (delta_z from z' initialization), with the frozen generator. Reward weights lambda_k are all set to 1.0 as a default. No random seed strategy is described.

Evaluation protocol: Three purpose-designed tasks form the benchmark: Task-1 (narrow gap, C.Err = mean scene penetration / distance metric), Task-2 (low barrier, C.Err = mean per-frame barrier violation), Task-3 (step count + concurrent action). 32 samples are generated per method per task. Metrics are: Foot Skate ratio, Max Joint Acceleration, Constraint Error, Max Scene Penetration (Tasks 1-2), Success Rate / Semantic Success Rate / Pace Pattern Score (Task 3). Additionally, Task HSI-2 from ProgMoGen (joint-based obstacle avoidance) is used to compare against MaskControl. Ablation (Table 4) removes each module independently on Task-2. Task difficulty sweeps (Fig. 5) vary barrier height and step count to identify the failure-onset threshold for DNO. No cross-validation or statistical significance tests are reported. Evaluation is deterministic in the sense that 32 samples are drawn and metrics are averaged, but no confidence intervals are provided.

Concrete end-to-end example (Task-2): A character must walk 5 m while ducking under a 0.5 m overhead barrier between z∈[2,3]. Constraint set C = {loss_barrier, loss_target_pos, loss_y_tstart, loss_y_tend}. Relational task parsing identifies loss_barrier as the hardest constraint (c_D); loss_target_pos is 'connected' to it; loss_y_tstart and loss_y_tend are neither connected nor conflicting. C_R = {loss_barrier, loss_target_pos}; C_2 = {loss_barrier, loss_target_pos, loss_y_tstart, loss_y_tend}; C_1 = C (all constraints). Retrieval searches HumanML3D for clips minimizing loss_barrier+loss_target_pos, finding a crawling or very-low-crouch clip; this is temporally resampled to the target length and spatially aligned. z_R is obtained by DDIM-inverting the retrieved clip. z_1 is optimized for 100 steps from random noise to fit C_1; z_2 is optimized for 100 steps from z_R to fit C_2. Temporal mask candidates (2^8=256) are evaluated in parallel; the winning mask preserves z_1 for frames outside the barrier zone and z_2 for frames in the barrier zone. Reward filtering eliminates masks causing jitter or foot-skating discontinuities. The composite z' is then used to initialize 400 more steps of DNO over the full F_C, yielding the final motion.

Reproducibility: A project page is provided (https://hanchaoliu.github.io/RetrievalGuidedDNO/). Code release status is not explicitly stated in the available text. The base MDM-RoHM checkpoint is retrained but full training configuration relies on MDM defaults. The LLM prompts for DeepSeek R1 are referenced as supplementary material. HumanML3D is a publicly available dataset.

Technical innovations

Relational task parsing: a novel divide-and-conquer decomposition of a mixed constraint set into retrieval-target, random-noise-fitting, and retrieved-noise-fitting subsets based on difficulty ordering and inter-constraint relationships, extending the motion programming framework of ProgMoGen [24] which treats all constraints uniformly.
Reward-guided binary mask selection over a tractable 2^N candidate set for composing random-noise and retrieved-noise trajectories in both temporal and spatial (body-part) dimensions, replacing the intractable direct optimization of M ∈ R^{T×d} and avoiding the over-smoothing of fixed linear interpolation (e.g., M=0.5).
Constraint-based retrieval with horizontal spatial transform and semantic consistency filtering to obtain DDIM-invertible motion references as improved diffusion noise initializations for hard constraints, a use of retrieval distinct from prior retrieval-augmented generation methods (ReMoDiffuse [48], RMD [23]) which use retrieved clips as generation conditioning rather than noise initialization.
LLM-automated relational task parsing using DeepSeek R1 to infer constraint difficulty ordering and inter-constraint relationships without user specification, enabling zero-shot generalization to novel task descriptions.
Two-stage noise optimization pipeline: a fast first stage (100 steps, LR=0.05) for mask construction and noise pre-conditioning, followed by a full second stage (400 steps, LR=0.02) initialized from the composed noise z', separating the structural prior injection problem from the fine-grained constraint satisfaction problem.

Datasets

HumanML3D — 14,616 motion sequences — Derived from AMASS and HumanAct12; publicly available

Baselines vs proposed

Unconstrained MDM-RoHM (Task-1): C.Error = 14.101, Max SP. = 0.506 vs proposed: C.Error = 0.0050, Max SP. = 0.027
ProgMoGen+DNO (Task-1): C.Error = 0.0162, Max SP. = 0.073, Max Acc. = 0.178 vs proposed: C.Error = 0.0050, Max SP. = 0.027, Max Acc. = 0.120
ProgMoGen+DNO NS=5 (Task-1): C.Error = 0.0019, Max SP. = 0.029 vs proposed NS=5: C.Error = 0.0000, Max SP. = 0.001
Unconstrained MDM-RoHM (Task-2): C.Error = 0.000115 vs proposed: C.Error = 0.000049
ProgMoGen+DNO (Task-2): C.Error = 0.000115, Max SP. = 0.149, Max Acc. = 0.261 vs proposed: C.Error = 0.000049, Max SP. = 0.079, Max Acc. = 0.194
ProgMoGen+DNO NS=5 (Task-2): C.Error = 0.000015, Max SP. = 0.064 vs proposed NS=5: C.Error = 0.000007, Max SP. = 0.048
ProgMoGen+DNO (Task-3): C.Error = 0.282, Succ. Rate = 0.469, Sem. Succ. Rate = 0.375 vs proposed: C.Error = 0.0003, Succ. Rate = 0.594, Sem. Succ. Rate = 0.438
ProgMoGen (Task HSI-2): C.Error = 0.097, Foot Skate = 0.189 vs proposed: C.Error = 0.000, Foot Skate = 0.165
MaskControl (Task HSI-2): C.Error = 0.000, Foot Skate = 0.146 vs proposed: C.Error = 0.000, Foot Skate = 0.165

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.08054.

Fig 1

Fig 1: Training-free Human Motion Generation for Highly-

Fig 2

Fig 2: Overview of Retrieval-Guided Diffusion Noise Optimization. Given a motion generation task represented by a combined

Fig 3

Fig 3 (page 1).

Fig 4

Fig 4 (page 1).

Fig 5

Fig 5 (page 3).

Fig 6

Fig 6 (page 3).

Fig 7

Fig 7 (page 3).

Fig 8

Fig 8 (page 3).

Limitations

Small evaluation scale: only 32 samples per task per method, and only 3 custom tasks are quantitatively evaluated; no confidence intervals or statistical significance tests are reported, making it difficult to assess result reliability.
Retrieval quality is a hard ceiling: if the dataset contains no clips resembling the required rare skill (e.g., a completely novel movement type), the method degrades to standard DNO; the paper does not characterize this failure mode quantitatively or identify out-of-distribution skill categories.
Computational cost of two-stage optimization (100 + 400 steps per sample on top of DDIM generation) is not reported—inference latency and GPU memory requirements relative to the baseline are absent, which matters for practical deployment.
The base MDM-RoHM is retrained from scratch but full training details (epochs, hardware, convergence) are not provided in this paper, relying on MDM defaults; reproducibility of the base model is partially deferred.
The LLM-based relational task parsing (DeepSeek R1) is evaluated only qualitatively (Fig. 3); no ablation measures parsing accuracy, error rate, or sensitivity to prompt phrasing across a systematic task set.
Diversity is acknowledged to decrease under highly-constrained settings; no diversity metric (e.g., APD—average pairwise distance) is reported, making the quality-diversity tradeoff unquantified.
The method is tested only on walking-centric tasks; generalization to upper-body-dominant actions, contact-rich manipulation, or multi-person scenarios is undemonstrated.

Open questions / follow-ons

How does performance degrade when the retrieval database does not contain any clip that satisfies the hard constraints even approximately—what is the minimum dataset coverage needed, and can synthetic augmentation of the retrieval corpus substitute for real clips?
Can the two-stage noise optimization be replaced by a learned noise prior (e.g., a small normalizing flow conditioned on constraint descriptors) to amortize the per-sample optimization cost across tasks?
The relational task parsing rules are manually designed heuristics; can they be learned from a dataset of constraint decompositions, or systematically derived from constraint Jacobian structure, to handle more complex multi-objective interactions automatically?
The method currently handles one retrieved reference per task; how would it extend to tasks requiring the simultaneous blending of multiple distinct skills (e.g., crawling through a gap while counting steps), where multiple disjoint clips each partially satisfy different hard constraints?

Why it matters for bot defense

At first glance this paper is squarely in the graphics/animation domain, but it contains two threads relevant to bot-defense engineers. First, the core problem—satisfying rare, out-of-distribution behavioral constraints that stump standard generative models—is structurally analogous to the challenge of generating synthetic human behavioral traces (mouse movement, typing rhythm, touch gesture) that must satisfy hard statistical or kinematic constraints to pass bot-detection heuristics. The retrieval-guided noise initialization idea could be adapted to augment synthetic behavior generators: rather than pure noise, seed diffusion-based trajectory generators with inverted noise from real human examples that already match the target constraint profile, improving the authenticity of edge-case synthetic human behaviors used in red-team or training-data pipelines.

Second, and more defensively, the paper highlights that LLM-guided constraint decomposition can automatically identify which aspects of a behavioral constraint are hardest to satisfy and retrieve targeted examples to meet them. A bot-defense team should note that this capability—if applied to CAPTCHA or behavioral biometrics challenges—could allow an adversary to decompose the challenge's implicit constraint set, retrieve real human examples matching the hardest sub-constraints, and blend them into a synthetic response that satisfies the full constraint with low error. This is not an immediate operational threat (the pipeline requires a frozen base motion model retrained on a large domain-specific dataset), but it points toward the need for CAPTCHA designs whose constraint structure is not decomposable into independently satisfiable sub-problems, and whose hardness derives from joint rather than marginal constraint satisfaction.

Cite

bibtex

@article{arxiv2605_08054,
  title={ Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization },
  author={ Hanchao Liu and Fang-Lue Zhang and Shining Zhang and Tai-Jiang Mu and Shi-Min Hu },
  journal={arXiv preprint arXiv:2605.08054},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.08054}
}

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization ​

TL;DR ​

Key findings ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​