JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Source: arXiv:2606.20563 · Published 2026-06-18 · By Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

TL;DR

This paper addresses the challenging problem of 3D visual illusions, where a single 3D mesh reveals entirely different semantics from specific viewing angles. Existing methods either rely on slow optimization with color saturation artifacts or naive stitching that yields visible seams and semantic leakage. JanusMesh presents a fast, zero-shot, training-free two-stage framework that generates highly realistic dual-semantic 3D visual illusions in just 3–5 minutes. The first stage introduces a cross-space dual-branch denoising approach that decodes 3D latents into voxel space, performs CLIP-guided orientation alignment for optimal silhouette matching, and uses Signed Distance Field (SDF) blending to fuse geometries seamlessly. The second stage involves a novel view-conditioned texture synthesis module that projects view-specific 2D diffusion priors onto the fused geometry for coherent texturing. Extensive quantitative experiments, ablations, and perceptual user studies show JanusMesh significantly outperforms prior optimization and concatenation baselines in semantic clarity, geometric coherence, visual realism, and runtime efficiency.

Key findings

JanusMesh generates dual-semantic 3D illusions in 3–5 minutes on a single RTX 4090 GPU, versus ∼40 minutes for SDS-based methods.
Achieves 84% GPT-4.1-mini semantic accuracy, outperforming Direct Concatenation (76%) and Shape from Semantics (70%).
Lowest FID score of 185.555 compared to baselines, indicating superior visual realism.
Object detection counts average 0.86 with 18% multi-object rate, versus 2.1 and 56% respectively for naive stitching, showing strong geometric fusion.
CLIP-guided Orientation Search selects optimal rotation angles, improving semantic silhouette alignment and reducing fusion artifacts.
SDF averaging for voxel fusion outperforms union, Gaussian blur, Minkowski dilation, and polar blending in geometric smoothness and shape quality (see Fig. 12).
Noise guidance strategies (Noise Blending and Space Control) improve geometric fusion and convergence, especially for challenging silhouette pairs.
User study with 50 participants rated JanusMesh results as clearly recognizable 78.5% of the time and preferred the method 71% over baselines.

Threat model

The threat model is an observer adversary attempting to decipher the intended semantics of a 3D visual illusion from multiple viewpoints. The adversary can freely observe the object from arbitrary angles but should only recognize one semantic at the designated viewpoints. The system's goal is to prevent semantic leakage or recognition of unintended semantics at other views. There is no active adversary tampering with the generation process or mesh geometry, nor attempts at adversarial attacks on the neural pipeline.

Methodology — deep read

The methodology proceeds in two main stages to generate dual-semantic 3D visual illusions from two text prompts y1 and y2 at viewpoints θ1 and θ2.

Threat Model & Assumptions: The model assumes an adversary that tries to perceive the object from multiple viewpoints to identify semantics. The system aims to prevent semantic leakage so only the target semantics are recognizable from specified views. No adversarial attacks or manipulation scenarios are considered.
Data: The dataset consists of 60 distinct object prompts encompassing birds, mammals, reptiles, plants, and man-made artifacts, derived from Shape From Semantics [40] and augmented with common objects. Experiments sample 2 prompts randomly per trial to form prompt pairs. No annotated ground-truth 3D data is used; generation is zero-shot and prompt-driven.
Architecture/Algorithm: Leveraging TRELLIS [73], a two-stage Rectified Flow 3D generator with sparse structured latents, Stage 1 uses dual-branch denoising starting from shared noise z_t. Two independent denoising branches conditioned on text prompts produce clean latent voxel predictions (x1_1|t and x2_1|t). These are decoded into voxel occupancy grids (v1, v2). To fuse the geometries seamlessly, the method converts these to Signed Distance Fields (SDFs), averages them element-wise, thresholding via τ=0.8 and truncation clip_s=12, producing a blended voxel that maintains a smooth intermediate shape (SDF blending). The fused voxel is inverse-rotated and re-encoded to latent space for the next denoising step, iterating over 25 steps.

CLIP-guided Orientation Search is used to align object 2 to object 1 optimally: renders of object 1 are scored by CLIP text-image similarity to select an anchor view; object 2 is rotated within a discrete 90° grid across X,Y,Z axes (28 candidates) and the rotation maximizing image-image CLIP similarity to object 1's anchor view is chosen. This ensures semantic shapes' silhouettes align well before blending.

Stage 2 performs view-conditioned texture synthesis. The fused geometry from Stage 1 is textured separately per viewpoint by rendering from θ1 and θ2, and predicting view-specific textures using depth-conditioned ControlNet-guided Stable Diffusion. The textures are projected back onto the mesh surface and aggregated using cosine-weighted blending to smooth seams. View-dependent texture selection chooses textures from the corresponding text prompt branch based on the current angle, switching smoothly at angular boundaries.

Noise Guidance improves geometric fusion. Two strategies are introduced: Noise Blending Guidance initializes the latent from a weighted combination of pre-generated single-semantic voxels fused at target angles and pure Gaussian noise using parameter α=0.3. Space Control Guidance interpolates between the guidance latent and noise at timestep t0=10/25 to restrict generation early and improve convergence on challenging pairs. Noise guidance is optional and pair-dependent.

Training regime: The framework is training-free and zero-shot, running 25 denoising steps per branch for geometry and 30 steps for texture synthesis. Guidance scale ω=7.5 is used for classifier-free guidance between t=0.5 to 0.95 noise levels. Computations are done on a single NVIDIA RTX 4090 GPU.

Evaluation: Metrics include CLIP text-image similarity (1,000 renders ±20° jitter), GPT-4.1-mini semantic classification accuracy, FID/KID vs Objaverse reference renders, object detection with OWLv2 at junction views, view-conditional CLIP contrast to detect semantic leakage, and Impact Factor measuring curvature smoothness at fusion seams. A user study with 50 participants rated perceptual recognizability, semantic alignment, and orientation strategies. Ablations tested voxel fusion strategies, noise guidance methods, and orientation search effects.

Reproducibility: Model code and project page are available as per paper. The dataset is constructed from publicly available prompts, but no explicit 3D ground truth meshes are provided. Precise hyperparameters and runtime details are disclosed, enabling replication of the zero-shot pipeline.

Concrete example: For a "Peacock" vs "Pineapple" prompt pair, Stage 1 starts from noise z_t, generates clean voxels for each prompt, aligns them via CLIP-guided rotation search, blends via SDF averaging providing a smooth intermediary geometry, then re-encodes fused latent for progressive denoising over 25 steps. Stage 2 textures the fused mesh by rendering from both target angles and synthesizing view-specific textures with depth control, projecting these back and blending seamlessly. Final result is a single mesh recognized as a peacock at θ=0° and pineapple at θ=180°, with abstract appearance elsewhere.

Technical innovations

Cross-space dual-branch denoising framework combining 3D latent Rectified Flow generation with voxel-space SDF blending to ensure seamless geometric fusion.
CLIP-guided adaptive orientation search between object pairs for optimal silhouette alignment prior to fusion, improving semantic coherence.
View-conditioned texture synthesis via projecting view-specific 2D diffusion priors onto the fused 3D geometry with cosine-weighted mesh texture aggregation.
Noise guidance strategies (Noise Blending and Space Control) injecting spatial priors into initial latent noise for more stable and geometry-preserving fusion.
Extension of zero-shot multi-view illusions from 2D pixel spaces (Visual Anagrams) to fully textured 3D meshes without training or per-object optimization.

Datasets

Custom 60-object prompt set — 60 distinct objects across birds, mammals, reptiles, plants, man-made artifacts — derived from Shape from Semantics and supplemented with common objects.

Baselines vs proposed

Shape from Semantics [40]: GPT Accuracy = 70%, FID = 194.136, Runtime ∼40 min vs JanusMesh: GPT Accuracy = 84%, FID = 185.555, Runtime 3–5 min
Direct Concatenation: CLIP = 29.030, GPT Accuracy = 76%, multi-object rate 56% vs JanusMesh: CLIP = 28.170, GPT Accuracy = 84%, multi-object rate 18%
TRELLIS [73]: GPT Accuracy = 60%, FID = 174.130 vs JanusMesh: GPT Accuracy = 84%, FID = 185.555
DreamBeast [42]: GPT Accuracy = 65%, FID = 184.956 vs JanusMesh: GPT Accuracy = 84%, Runtime 3–5 min vs DreamBeast’s 3-4 hours

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20563.

Fig 1

Fig 1: Zero-shot 3D Visual Illusion Generation. Given two different text

Fig 2

Fig 2: Comparison of 3D visual illusion generation methods. (a) SDS-Based

Fig 3

Fig 3 (page 2).

Fig 4

Fig 4 (page 2).

Fig 5

Fig 5 (page 2).

Fig 6

Fig 6 (page 2).

Fig 3

Fig 3: Pipeline overview. (a) Stage 1 employs dual-branch denoising. At each

Fig 4

Fig 4: SDF blending. (a) Given two rotation-aligned voxels, we compute their SDFs,

Limitations

Small dataset with 60 prompt objects and synthetic prompt pairs; lacks large-scale or real-world validated benchmarks.
No adversarial robustness or deliberate semantic leakage attack evaluations performed.
Viewpoint variations tested within ±20° jitter but not against large distribution shifts or occlusions.
Orientation search samples limited to 90° intervals; finer granularity might improve fusion but untested.
Noise guidance strategies are heuristically parameterized with α and t0; no automatic tuning or ablation on these hyperparameters beyond qualitative assessment.
No direct user perceptual testing under varied viewing conditions or illumination simulated.

Open questions / follow-ons

How can the CLIP-guided orientation search be extended to continuous or finer rotation sampling efficiently?
Can the noise guidance parameters be optimized automatically or learned to generalize better across diverse object pairs?
How does the method scale to more than three-object illusions or complex scenes with occlusion and lighting variations?
What are the limits of semantic complexity or similarity between objects beyond which fusion fails or semantic leakage increases?

Why it matters for bot defense

JanusMesh introduces key techniques applicable to bot defense and CAPTCHA systems that rely on visual or 3D challenges. The ability to generate dual-view semantic illusions with strong geometric coherence and rapid, zero-shot generation enables new types of 3D CAPTCHAs that are difficult for automated systems to interpret from multiple angles. The cross-space SDF blending approach ensures that the puzzle object appears as a coherent mesh rather than separate parts, strengthening illusion integrity against simple heuristic attacks. CLIP-guided orientation search could be leveraged to adaptively generate challenges tuned to maximize semantic ambiguity except at target orientations, increasing bot recognition difficulty. However, practical deployment would require further robustness analysis against adversarial 3D reconstructions or neural network bypass methods. The fast runtime (<5 min) enables scalable challenge generation on-demand without retraining models, favoring dynamic CAPTCHA designs.

Cite

bibtex

@article{arxiv2606_20563,
  title={ JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising },
  author={ Siang-Ling Zhang and Huai-Hsun Cheng and Tsung-Ju Yang and Yu-Lun Liu },
  journal={arXiv preprint arXiv:2606.20563},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.20563}
}

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​