ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Source: arXiv:2605.06667 · Published 2026-05-07 · By Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi, Philip Torr et al.

TL;DR

ActCam addresses the problem of jointly controlling character motion and camera trajectory in video generation — a capability that existing methods handle separately or require task-specific fine-tuning to combine. Prior art either conditions on 2D pose signals that become geometrically ambiguous under camera movement, or requires custom architectures and expensive training (e.g., Uni3C) to unify both control axes. ActCam's core insight is that both control signals — human pose and scene depth — can be made geometrically consistent with a target camera trajectory at inference time, without touching model weights, by constructing camera-aligned conditioning inputs from a reference image, a driving video, and a per-frame camera specification.

The method builds on VACE, a pretrained flow-matching video diffusion model that already accepts dense depth and pose conditioning. ActCam's pipeline extracts a background-only depth mesh (by inpainting the character out of the reference image), recovers 4D human motion from the driving video via monocular 3D estimation (GVHMR), aligns the recovered skeleton to the background scene geometry using a weighted centroid depth-affine transform, and rasterizes both pose and depth+pose control signals from the target camera viewpoint. A two-phase denoising schedule then applies depth+pose conditioning during early high-noise steps to establish scene structure and camera geometry, then drops depth and continues with pose-only conditioning to refine motion details without propagating monocular depth artifacts into high-frequency content.

ActCam is evaluated on RealisDance-Val across 4 camera presets (400 total test cases) and a static-camera benchmark against nine baselines. It outperforms Uni3C — its closest competitor — on all reported metrics including MPJPE (0.2087 vs 0.2121), Sampson Error (0.4546 vs 0.5665), Subject Consistency (0.9212 vs 0.9084), Imaging Quality (0.7212 vs 0.6640), and WorldScore 3D-Consistency (0.6304 vs 0.539). On the static-camera benchmark it achieves the highest VBench average (86.47) among ten methods including VACE itself (85.33) and SteadyDancer (85.15). A 17-user 2AFC study strongly favors ActCam over Uni3C on camera adherence, motion faithfulness, and visual quality.

Key findings

ActCam reduces MPJPE from 0.2121 (Uni3C) to 0.2087 on RealisDance-Val with 4 camera presets, indicating improved 3D pose fidelity in the generated video despite zero-shot operation.
Sampson Error (geometric consistency under moving camera) drops from 0.5665 (Uni3C) to 0.4546 — a ~20% reduction — suggesting that camera-aligned depth+pose conditioning enforces more consistent multi-view geometry.
WorldScore 3D-Consistency improves from 0.539 (Uni3C) to 0.6304 (+0.091 absolute, about 17% relative), and Object Control from 0.9878 to 0.9953, both on the joint-camera evaluation benchmark.
On the static-camera VBench benchmark (RealisDance-Val, ten methods including ActCam), ActCam achieves an average of 86.47 vs second-best VACE at 85.33, with Appearance Fidelity 58.66 vs VACE's 57.81 and Temporal Consistency 98.88 vs UniAnimate-DiT's 98.78.
The two-phase conditioning cutoff N_D = 0.2N (i.e., depth conditioning applied only during the first 20% of denoising steps) is identified as optimal via ablation; N_D = 0 causes camera/character motion conflation, while N_D = N (depth conditioning throughout) causes static background artifacts under dynamic camera motion (Fig 4 and Fig 5).
Removing the static reference character from the background depth mesh before inserting the animated skeleton is shown via ablation to be necessary to prevent character duplication in generated outputs.
A 17-participant 2AFC user study shows a clear preference for ActCam over Uni3C on all three dimensions (camera adherence, motion faithfulness, visual quality), with a non-trivial tie fraction. Exact percentages were not recoverable from the truncated text, but Fig 3 shows the Uni3C preference share clearly below 0.5 on all axes.
ActCam is model-agnostic at inference time: it requires only that the backbone accept dense depth and pose conditioning, and was demonstrated on VACE (Wan 2.1 14B backbone), enabling applicability to future architecturally compatible models without retraining.

Methodology — deep read

Threat model and assumptions: This is not a security paper. The adversarial framing does not apply. The system assumes: (a) a pretrained image-to-video diffusion model (VACE, based on Wan 2.1 14B) that accepts dense per-frame depth and pose conditioning; (b) a reference image defining identity and scene; (c) a driving video providing the acting performance; (d) a target per-frame camera specification as intrinsics K_τ ∈ R^{3×3} and extrinsics (R_τ, t_τ) ∈ SO(3) × R^3. The method assumes monocular depth estimation (MoGe) and 3D human motion estimation (GVHMR) are sufficiently accurate for the conditioning pipeline to be coherent. No adversarial robustness is discussed.
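To make the camera specification in (d) concrete, here is a minimal sketch of a per-frame trajectory of intrinsics K_τ and extrinsics (R_τ, t_τ), plus a pinhole projection. The world-to-camera convention x_cam = R·x_world + t, the focal length, the image size, and the helper names are all assumptions for illustration; the paper's actual conventions are not stated in this summary.

```python
import numpy as np

def intrinsics(focal_px, width, height):
    """Pinhole intrinsics K with the principal point at the image centre (assumed)."""
    return np.array([[focal_px, 0.0, width / 2.0],
                     [0.0, focal_px, height / 2.0],
                     [0.0, 0.0, 1.0]])

def yaw_rotation(angle_rad):
    """Rotation about the camera y-axis, i.e. a horizontal pan."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def pan_trajectory(num_frames, total_deg):
    """Per-frame extrinsics (R_tau, t_tau) for a pure pan with zero translation."""
    angles = np.deg2rad(np.linspace(0.0, total_deg, num_frames))
    return [(yaw_rotation(a), np.zeros(3)) for a in angles]

def project(points_world, K, R, t):
    """Project (N, 3) world points to (N, 2) pixels using x_cam = R @ x_world + t."""
    cam = points_world @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical values: 640x480 frame, 500 px focal length, 5-frame 45 degree pan.
K = intrinsics(focal_px=500.0, width=640, height=480)
trajectory = pan_trajectory(num_frames=5, total_deg=45.0)
point = np.array([[0.0, 0.0, 2.0]])            # a point 2 units in front of the first camera
uv_first = project(point, K, *trajectory[0])   # lands on the image centre
uv_last = project(point, K, *trajectory[-1])   # u = 500 * tan(45 deg) + 320, outside the frame
```

Whether a positive yaw corresponds to a "rightward" pan depends on the axis convention, so the sign here should not be read as the paper's definition.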

Data: The primary evaluation dataset is RealisDance-Val (Zhou et al. 2025b), used both for moving-camera evaluation (100 clips × 4 camera presets = 400 test cases) and static-camera evaluation. For each clip, a reference image I_ref is sampled and the original clip provides the acting motion signal. No separate training data is used — the method is entirely inference-time. The backbone VACE was trained on an unspecified video dataset combining depth, optical flow, pose, and activation-mask conditions. Dataset provenance for VACE training is not detailed in this paper. No data augmentation or held-out split strategy specific to ActCam is described beyond the use of RealisDance-Val as a validation set.

Condition construction pipeline (the core algorithmic contribution):
Step 1: background depth extraction. MoGe (Wang et al. 2025b) estimates a depth map D_ref from I_ref. The character is segmented (using Liu et al. 2024 for the binary mask M) and inpainted out of I_ref, and MoGe is re-run on the inpainted image to produce a background-only depth map D_bg, which is unprojected into a 3D mesh.
Step 2: scene transfer (depth alignment). Because D_ref and D_bg are estimated in independent passes, their 3D coordinate frames are inconsistent. A weighted centroid alignment resolves this: background pixels (u,v) ∉ M receive importance weights w(u,v) = exp(−dist(x_ref_{u,v}, M)) that decay exponentially with distance from the character mask boundary, emphasizing pixels near the character for better local registration. Weighted centroids p_ref and p_bg are computed in 3D space from each depth estimate. Character depth values are then affinely transformed along the z-axis: z^char_bg = (z^char_ref − p^z_ref) · (p^z_bg / p^z_ref) + p^z_bg. This accounts for scale and translation mismatch between the two monocular estimates.
Step 3: 4D motion recovery. GVHMR (Chu et al. 2024) extracts a per-frame SMPL articulated state S_τ from the driving video V_act. Unlike 2D keypoints, SMPL parameters are 3D and thus viewpoint-stable.
Step 4: character fitting into the scene. The recovered pose sequence S is aligned to the reference character in the assembled 3D scene using the Umeyama (1991) least-squares similarity transformation, estimated at τ=0 between S_0 and 3D keypoints extracted from I_ref, yielding rotation R ∈ SO(3), translation t̂, and scale s. The full aligned sequence Ŝ_τ = s·R·S_τ + t̂ is used for all subsequent rendering.
Step 5: rendering control signals. Two signals are rasterized under the target camera trajectory C, where R_C denotes rasterization under C (distinct from the rotation R of Step 4). The pose control video c_pose = R_C(Ŝ_τ) ∈ R^{T×H×W×3} renders the OpenPose skeleton on a black background. The joint depth+pose signal c_{pose+depth} = R_C(Ŝ_τ, D_bg) superimposes the OpenPose skeleton on the min-max normalized grayscale background depth mesh, following VACE's training-time format exactly. The background is rendered as a mesh (not a point cloud) to reduce sparsity artifacts under rotation.
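The depth alignment of Step 2 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the distance to the mask is measured in pixels (the summary does not give the unit), only the z component of the weighted centroids is computed because that is all the affine formula uses, and the function names are hypothetical.

```python
import numpy as np

def boundary_weights(mask):
    """w(u,v) = exp(-distance to the character mask) on background pixels, 0 inside the mask.

    Brute-force pixel distances (assumption); fine for small images, a distance
    transform would be the practical choice at full resolution.
    """
    bg = np.argwhere(~mask)                    # (Nb, 2) background pixel coordinates
    fg = np.argwhere(mask)                     # (Nm, 2) character pixel coordinates
    d = np.sqrt(((bg[:, None, :] - fg[None, :, :]) ** 2).sum(-1)).min(axis=1)
    w = np.zeros(mask.shape)
    w[~mask] = np.exp(-d)
    return w

def weighted_centroid_z(depth, weights):
    """z component of the weighted centroid of a depth map."""
    return float((weights * depth).sum() / weights.sum())

def align_character_depth(z_char_ref, d_ref, d_bg, mask):
    """z_bg = (z_ref - p_ref_z) * (p_bg_z / p_ref_z) + p_bg_z, as written in Step 2."""
    w = boundary_weights(mask)
    p_ref_z = weighted_centroid_z(d_ref, w)
    p_bg_z = weighted_centroid_z(d_bg, w)
    return (z_char_ref - p_ref_z) * (p_bg_z / p_ref_z) + p_bg_z

# Synthetic check: if the background pass returns depths at exactly twice the
# scale of the reference pass, character depths should also be doubled.
rng = np.random.default_rng(0)
mask = np.zeros((8, 8), dtype=bool)
mask[3:5, 3:5] = True                          # hypothetical character mask
d_ref = rng.uniform(1.0, 3.0, size=mask.shape)
d_bg = 2.0 * d_ref
z_char_ref = np.array([1.5, 1.6, 1.7])
z_char_bg = align_character_depth(z_char_ref, d_ref, d_bg, mask)
weights = boundary_weights(mask)
```

The synthetic case shows why the formula handles scale mismatch: with p^z_bg = 2·p^z_ref the offset terms cancel and the character depth is simply rescaled by the same factor as the background.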

Two-phase conditioning schedule: Let t ∈ [0,1] be the continuous flow-matching timestep (t=0 is pure noise, t=1 is data). A cutoff t_stop defines two phases: for t ≤ t_stop (early, high-noise steps), conditioning uses c_{pose+depth}; for t > t_stop, conditioning switches to c_pose only. The threshold is parameterized as N_D = t_stop × N, where N is the total number of denoising steps. The ablation (Fig 4) sweeps N_D/N ∈ {0, 0.1, 0.2, 1.0} and finds N_D = 0.2N optimal. With N_D = 0 the model misattributes camera motion as body motion; with N_D = N it propagates coarse depth artifacts into fine-grained details and freezes dynamic elements (Fig 5, e.g., a barbell fails to move with the character). N_D = 0.2N balances structure enforcement and detail freedom.
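The hard switch can be expressed as a per-step lookup. The sketch below assumes uniformly indexed denoising steps with step 0 the noisiest, and uses string labels as stand-ins for the two control videos; how the backbone consumes the signal is outside its scope, and the total of 50 steps is a hypothetical value since N is not reported.

```python
def conditioning_for_step(step, num_steps, depth_fraction=0.2):
    """Return the control signal for a 0-indexed denoising step.

    Steps below N_D = depth_fraction * num_steps use depth+pose, the rest pose only.
    """
    n_d = round(depth_fraction * num_steps)
    return "pose+depth" if step < n_d else "pose"

def schedule(num_steps, depth_fraction=0.2):
    """Full per-step plan for one sampling run."""
    return [conditioning_for_step(i, num_steps, depth_fraction) for i in range(num_steps)]

plan = schedule(50)                 # hypothetical N = 50
pose_only = schedule(50, 0.0)       # the N_D = 0 ablation setting
depth_always = schedule(50, 1.0)    # the N_D = N ablation setting
```

With N = 50 the first 10 steps carry depth+pose conditioning and the remaining 40 carry pose only; the two ablation extremes fall out of the same function by changing `depth_fraction`.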

Training regime: Not applicable — ActCam introduces no trainable parameters. The backbone VACE is frozen. All computations are inference-time geometry transformations, monocular estimation calls, and standard rasterization. Hardware, batch size, epoch count, and seed strategy are not reported (not relevant for the zero-shot framing). The number of denoising steps N is kept identical across all compared methods for fairness. Specific N is not reported in the truncated text.

Evaluation protocol: Moving-camera evaluation: 400 test cases (100 clips × 4 camera presets) from RealisDance-Val. Metrics: VBench (Subject Consistency, Background Consistency, Appearance Fidelity, Imaging Quality, Temporal Consistency, Motion Smoothness); MPJPE (3D pose error between driving video and generated video, computed via 3D human pose estimation on output); Sampson Error (epipolar geometric consistency); WorldScore 3D-Consistency and Object Control. Static-camera evaluation: RealisDance-Val, static viewpoint, VBench only, ten methods including ActCam. User study: 17 participants, 2AFC design, same 4 camera presets, anonymized video pairs, 3 questions (camera adherence, motion faithfulness, visual quality). No statistical significance tests (p-values, confidence intervals) are reported. No cross-validation is described. Distribution shift (e.g., out-of-domain scenes, extreme motions, non-humanoid subjects) is not tested.
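For readers unfamiliar with the two geometric metrics, here are their textbook definitions in NumPy. This is a sketch of the standard formulas only: which pose estimator, joint set, units, or fundamental-matrix estimator the paper uses is not stated in this summary, and the demo inputs are synthetic.

```python
import numpy as np

def mpjpe(pred, ref):
    """Mean per-joint position error: mean Euclidean distance over (..., J, 3) joints."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())

def sampson_error(F, x1, x2):
    """First-order epipolar error for (N, 2) correspondences with x2^T F x1 = 0."""
    h1 = np.hstack([x1, np.ones((len(x1), 1))])   # homogeneous points in image 1
    h2 = np.hstack([x2, np.ones((len(x2), 1))])   # homogeneous points in image 2
    Fx1 = h1 @ F.T                                # rows are F @ x1
    Ftx2 = h2 @ F                                 # rows are F.T @ x2
    num = np.einsum("ij,ij->i", h2, Fx1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den

# Synthetic MPJPE check: every joint is displaced by (3, 4, 0), so the error is 5.
ref_joints = np.zeros((2, 17, 3))
pred_joints = ref_joints + np.array([3.0, 4.0, 0.0])
pose_err = mpjpe(pred_joints, ref_joints)

# Synthetic Sampson check: F for a pure x-translation, where matches must share a row.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 20.0], [30.0, 40.0]])
x2 = np.array([[12.0, 20.0], [33.0, 41.0]])       # second match is one row off
epi_err = sampson_error(F, x1, x2)
```

In the second correspondence the one-pixel row offset gives a squared epipolar residual of 1 divided by a gradient norm of 2, so the Sampson error is 0.5, while the first correspondence satisfies the constraint exactly.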

End-to-end concrete example: Given I_ref (a person standing in a gym), V_act (a barbell curl video), and C (45° rightward pan). (1) MoGe estimates D_ref; the character is segmented and inpainted; MoGe re-runs on the inpainted image, yielding D_bg. (2) Weighted centroid alignment places the character at a consistent depth (z-axis) in the background mesh. (3) GVHMR extracts SMPL parameters for all T frames of the curl. (4) Umeyama alignment at τ=0 fits the SMPL skeleton to the gym scene. (5) For the first 20% of denoising steps, the model receives c_{pose+depth} rasterized under the 45° pan camera: the background walls shift across the frame in the rendered depth map, signaling camera rotation. For the remaining 80%, only c_pose is used, allowing the barbell (a dynamic foreground object not in the skeleton) to move naturally rather than being frozen by the coarse depth constraint. The output video shows the character curling the barbell from a 45° angle while the gym background exhibits the expected parallax.
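Step (4) of the example relies on Umeyama alignment, which estimates a rotation, translation, and scale in closed form. The sketch below implements the standard Umeyama (1991) solution on synthetic joints; the 17-joint layout and the ground-truth transform are made up for the check, and how the paper selects keypoints from I_ref is not covered.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                     # cross-covariance of dst and src
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid returning a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: transform random "joints" with a known similarity and recover it.
rng = np.random.default_rng(0)
joints_0 = rng.normal(size=(17, 3))                # stand-in for S_0
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 1.7, np.array([0.2, -0.1, 3.0])
target = s_true * joints_0 @ R_true.T + t_true     # stand-in for keypoints from I_ref
s, R, t = umeyama(joints_0, target)
aligned = s * joints_0 @ R.T + t                   # same transform applies to every S_tau
```

Because the transform is estimated once at τ=0 and then reused, the driving motion keeps its own dynamics and is only re-posed and rescaled into the scene's coordinate frame.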

Reproducibility: No code release is confirmed in the truncated text. The project page (https://elkhomar.github.io/actcam/) exists. VACE backbone weights are presumably public (Wan 2.1). GVHMR and MoGe are cited as existing tools. The specific N_D hyperparameter and all algorithmic formulas are provided in the paper. RealisDance-Val is a public benchmark. The method should in principle be reproducible given access to these components, though no frozen weights or official code repository is confirmed.

Technical innovations

Camera-aligned joint conditioning pipeline: ActCam constructs per-frame pose and depth conditioning signals that are co-registered to the target camera viewpoint via 3D unprojection and rasterization, unlike prior 2D-keypoint methods (e.g., MimicMotion, SteadyDancer) that become geometrically ambiguous under camera rotation.
Reference character removal with geometry-aware scene transfer: to prevent the static reference character in the depth mesh from conflicting with the dynamic pose signal, ActCam inpaints the character out of I_ref, re-estimates depth, and re-inserts the animated skeleton via a boundary-weighted affine depth alignment (Eq. 4), a step absent in Uni3C and all 2D-keypoint baselines.
Two-phase conditioning schedule (Eq. 5): depth conditioning is applied only during the first N_D = 0.2N denoising steps to enforce coarse scene structure and camera geometry, then dropped so that pose-only guidance refines high-frequency details without propagating monocular depth estimation artifacts — a novel staged inference strategy not present in prior ControlNet-style or camera-control works.
Training-free joint camera and motion control: unlike Uni3C (Cao et al. 2025) which requires task-specific fine-tuning of a custom architecture on motion+camera data, ActCam achieves competitive or superior results by operating entirely at inference time on any backbone that accepts dense depth and pose conditioning.
Mesh-based background rendering for camera control: ActCam renders the background depth as a 3D mesh rather than a point cloud under the target camera, which the authors argue reduces sparsity and visual artifacts under large viewpoint changes compared to point-cloud-based reprojection approaches used in related work.

Datasets

RealisDance-Val — 100 reference clips used per camera preset (400 total test cases for moving-camera eval, plus static-camera eval) — public benchmark (Zhou et al. 2025b)
VACE training dataset — size and provenance unspecified in this paper — internal/proprietary to VACE authors (Jiang et al. 2025)

Baselines vs proposed

Uni3C [Cao et al. 2025]: VBench Average = 0.8370 vs ActCam: 0.8497
Uni3C [Cao et al. 2025]: Subject Consistency = 0.9084 vs ActCam: 0.9212
Uni3C [Cao et al. 2025]: Imaging Quality = 0.6640 vs ActCam: 0.7212
Uni3C [Cao et al. 2025]: MPJPE = 0.2121 vs ActCam: 0.2087 (lower is better)
Uni3C [Cao et al. 2025]: Sampson Error = 0.5665 vs ActCam: 0.4546 (lower is better)
Uni3C [Cao et al. 2025]: WorldScore 3D-Consistency = 0.539 vs ActCam: 0.6304
Uni3C [Cao et al. 2025]: Object Control = 0.9878 vs ActCam: 0.9953
VACE [Jiang et al. 2025] (static camera): VBench Average = 85.33 vs ActCam: 86.47
SteadyDancer [Zhang et al. 2025a] (static camera): VBench Average = 85.15 vs ActCam: 86.47
UniAnimate-DiT [Wang et al. 2025c] (static camera): Temporal Consistency = 98.78 vs ActCam: 98.88
Moore-AnimateAnyone [Hu et al. 2024] (static camera): VBench Average = 83.78 vs ActCam: 86.47
MimicMotion [Zhang et al. 2024] (static camera): VBench Average = 82.27 vs ActCam: 86.47
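The absolute gaps above are easier to compare as relative changes. The short calculation below uses only numbers reported in this section; it is plain arithmetic, not an additional result from the paper.

```python
def relative_change(baseline, proposed):
    """Signed relative change of the proposed value with respect to the baseline."""
    return (proposed - baseline) / baseline

# (Uni3C, ActCam) pairs copied from the moving-camera comparison.
reported = {
    "MPJPE": (0.2121, 0.2087),
    "Sampson Error": (0.5665, 0.4546),
    "WorldScore 3D-Consistency": (0.539, 0.6304),
    "Imaging Quality": (0.6640, 0.7212),
}
changes = {name: relative_change(*pair) for name, pair in reported.items()}

for name, value in changes.items():
    print(f"{name}: {value:+.1%}")
```

The MPJPE improvement is about 1.6% relative, far smaller than the roughly 20% Sampson Error reduction or the roughly 17% 3D-Consistency gain, so the pose-fidelity advantage over Uni3C is best read as near parity.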

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06667.

Fig 1: Overview. ActCam enables zero-shot joint control of acting motion and camera motion for single-image video generation from a reference image,

Fig 2: ActCam pipeline. Given a reference image, an acting video, and a target camera trajectory, we (1) estimate background depth from an inpainted

Limitations

Small user study: 17 participants in the 2AFC study is statistically underpowered for the variance expected in subjective video quality judgments; no confidence intervals or p-values are reported for any metric.
Single backbone dependency: all experiments use VACE (Wan 2.1 14B) as the backbone; generalizability to other architectures (e.g., CogVideoX, Kling-based models) that also accept depth/pose conditioning is claimed but not empirically validated.
Monocular depth quality as a silent ceiling: the entire scene geometry pipeline depends on MoGe's monocular depth accuracy; failure cases from MoGe (e.g., reflective surfaces, thin structures, extreme viewpoints) would propagate into conditioning artifacts, but no ablation on depth estimation quality or failure-mode analysis is presented.
4-camera-preset evaluation scope: the moving-camera benchmark uses only 4 cinematic camera presets; generalization to arbitrary or rapidly changing camera trajectories (e.g., drone footage, handheld shake, zoom + pan combinations) is not evaluated.
No distribution shift evaluation: all test clips come from RealisDance-Val; performance on out-of-domain scenes (non-studio backgrounds, non-frontal reference images, multiple occlusions, crowds) is not assessed.
Multi-character support is noted qualitatively (Fig 13) but not quantitatively benchmarked; it is described as contingent on backbone support rather than a validated capability of ActCam's conditioning pipeline.
The optimal N_D hyperparameter (0.2) is determined empirically on the same evaluation set used for reporting results, which risks overfitting the hyperparameter to the benchmark; no held-out validation split is described for this tuning step.
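To illustrate the first limitation, the sketch below computes a 95% Wilson score interval for a binomial preference rate. The vote counts are hypothetical (the study's actual numbers are not available here), and it assumes one independent vote per participant, which may not match the study design.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical outcome: 13 of 17 participants prefer one method.
low, high = wilson_interval(13, 17)
# Same preference rate with ten times as many votes, for comparison.
low_big, high_big = wilson_interval(130, 170)
```

With 17 votes the interval spans roughly 0.53 to 0.90, so even a 76% observed preference is only loosely pinned down; the same rate over 170 votes gives a much narrower range.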

Open questions / follow-ons

How does ActCam degrade as monocular depth estimation quality decreases — e.g., for outdoor scenes with sky, water, or highly reflective surfaces where MoGe is known to struggle — and can a confidence-weighted depth conditioning signal mitigate this?
The N_D cutoff is a hard switch; would a soft annealing schedule (e.g., linearly decaying depth conditioning weight over denoising steps) yield better trade-offs between structure enforcement and detail freedom, particularly for scenes with many dynamic objects?
ActCam handles a single foreground character; extending to multi-character scenes requires independent 3D fitting, occlusion handling, and potentially separate conditioning streams — can the two-phase schedule generalize to this without per-character N_D tuning?
The method relies on VACE's existing dense conditioning interface; as video generation models shift toward latent-space or token-based conditioning (rather than pixel-aligned dense maps), what is the minimal interface assumption needed to preserve ActCam's camera-aligned conditioning strategy?

Why it matters for bot defense

ActCam is a video generation paper with no direct connection to CAPTCHA or bot-defense. However, practitioners in bot-defense should be aware of its implications for synthetic media detection and liveness verification. The method enables high-fidelity, photorealistic videos of humans performing arbitrary motions from arbitrary camera angles without any training, using only a single reference image as identity anchor. This significantly lowers the barrier to generating convincing synthetic human video that could be used to spoof video-based liveness checks, behavioral biometrics, or video CAPTCHAs that rely on detecting 'natural' human motion cues.

Specifically, the 3D-consistent pose conditioning means that generated subjects exhibit geometrically plausible body proportions and joint kinematics under viewpoint changes, whereas naive video generation typically leaves artifacts (e.g., limb deformation, inconsistent shadows, unnatural parallax) that violate this property. Bot-defense engineers building video liveness systems should note that ActCam-class methods produce outputs that score highly on temporal consistency (98.88, static-camera benchmark), subject consistency (0.9212, moving-camera benchmark), and 3D geometric coherence (WorldScore 3D-Consistency 0.6304), metrics that correlate with human perceptual plausibility. Detection systems that rely on 2D keypoint anomalies or depth inconsistency as forgery signals may be particularly vulnerable to this pipeline, since ActCam explicitly constructs geometrically consistent 3D-grounded signals. Because the method is zero-shot and runs entirely at inference time, it requires no fine-tuning for new identities or scenes, which makes it scalable as an attack vector.

Cite

bibtex

@article{arxiv2605_06667,
  title={ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation},
  author={Omar El Khalifi and Thomas Rossi and Oscar Fossey and Thibault Fouque and Ulysse Mizrahi and Philip Torr and Ivan Laptev and Fabio Pizzati and Baptiste Bellot-Gurlet},
  journal={arXiv preprint arXiv:2605.06667},
  year={2026},
  url={https://arxiv.org/abs/2605.06667}
}
