LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Source: arXiv:2605.12449 · Published 2026-05-12 · By Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille

TL;DR

LychSim is a UE5-based simulation framework aimed at making high-fidelity, controllable 3D simulation usable for vision researchers who do not want to become graphics-engine programmers. The paper’s core problem is not “can we render pretty scenes?” but rather how to expose simulation as a practical research instrument for closed-loop optimization, OOD evaluation, and synthetic data generation while keeping scene control, annotation quality, and agent interaction manageable. The authors’ main design move is to wrap UE5 complexity behind a Python API, then add a procedural generation pipeline and rich annotations that go beyond standard RGB/depth/segmentation outputs.

What is new here is the combination of three things in one system: a unified Python interface for manipulating heterogeneous UE asset types; a data pipeline that generates diverse scenes with scene-level procedural rules and object-level pose alignments, enabling semantically aligned 3D supervision; and native MCP integration so an LLM can query, navigate, and modify the world in a closed loop. The paper demonstrates the framework through three case studies rather than a single benchmark win: synthetic-data usage for spatial VLMs, RL-based adversarial examination of Segment Anything, and interactive language-driven scene planning. The result is presented as a platform contribution: a public release of code and annotations, plus examples showing that the framework can expose model weaknesses and support agentic scene editing.

Key findings

LychSim exposes a unified Python API that can spawn StaticMesh, SkeletalMesh, and Blueprint objects through one interface, avoiding separate Unreal workflows for each asset class (Fig. 2).
The simulator outputs standard labels plus novel supervision: part IDs, dense point maps, per-object occlusion/truncation ratios, and occlusion relationships for objects that extend beyond the image plane (Section 2.3).
The authors state that LychSim can synthesize OOD scenes with uncommon camera viewpoints, severe occlusion, high-density layouts, and semantically cluttered same-category groups, using scene-level procedural rules to guide sampling (Section 2.2).
In the adversarial examiner case study, an RL-trained Gaussian policy searches camera viewpoints around a target object to minimize SAM IoU; Fig. 4 shows an example trajectory where IoU drops from 0.84 to 0.64 under adversarial exploration.
The interactive planning demo uses MCP to let an agentic LLM inspect and revise scenes over multiple turns; the paper reports it can detect and correct issues like a vase floating in midair and handle user edits such as moving furniture closer to a window (Fig. 3, Fig. 5).
LychSim is positioned as extending UnrealCV with more 2D/3D ground truths, procedural generation, and native Python/MCP integration, rather than as a physics-first simulator like Isaac Sim or MuJoCo (Section 5).
The paper claims public release of the full C++ and Python source code plus procedural rules and object-level annotations, which would materially improve reproducibility if actually released as stated (Conclusion, Acknowledgements).

Threat model

The main adversary is a closed-loop optimizer or agent operating inside the simulator, with the ability to query rendered outputs, inspect structured scene state, move the camera, and manipulate objects through the exposed API or MCP tools. In the adversarial-examiner setting, the attacker is assumed to search for failure-inducing viewpoints and cannot directly alter the segmentation model’s weights or the simulator internals. In the interactive scene-planning setting, the LLM may produce invalid or physically implausible actions, but it is constrained to the simulator’s exposed tool interface and scene specification. The paper does not define a formal white-box or adaptive security attacker against the framework itself.

Methodology — deep read

The threat model is mostly an evaluation and closed-loop interaction setting, not a classic security model. The adversary in the adversarial-examiner case is an optimization process that can query a vision model repeatedly inside simulation, change camera pose, and search for failure cases; it is not assumed to tamper with the simulator internals. For the interactive MCP setting, the “adversary” is more like a reasoning agent or LLM that may propose invalid layouts; the system is designed to let the agent query scene state and then revise its actions. The paper does not define a formal attacker with white-box access to model weights, nor does it discuss defenses against simulator abuse.

On data and assets, the system starts from UE5’s native ecosystem and the Fab Asset Marketplace, which provides artist-made indoor and outdoor environments spanning architectural styles, geographies, and lighting conditions. The paper says LychSim also supports external scene layouts such as Infinigen and HSSD-200, but it does not give a dataset size, number of scenes, number of assets, or an exact train/val/test split for any released corpus in the excerpt provided. The authors add two important annotation layers: object-level metadata (category, canonical scale, pose alignment) and scene-level procedural rules (navigable floors, road areas, pedestrian walkways, dynamic trajectories). Those annotations are then used to modify and repopulate scenes procedurally. Preprocessing is mostly geometric and semantic enrichment: asset categorization, pose alignment, and rule encoding for allowed placements. Because the excerpt does not enumerate counts, labels, or split strategy, reproducibility of the data component is only partially specified here.

Architecturally, LychSim is built on Unreal Engine 5 but exposed through a streamlined Python API that hides engine-specific complexity. The key software abstraction is a unified object interface: users call add_obj with an object ID, asset path, location, and rotation, and the library internally handles whether the asset is a StaticMesh, SkeletalMesh, or Blueprint. Rendering calls similarly expose synchronized outputs such as RGB, segmentation, depth, and point maps. The novel part is not a new renderer; it is the packaging of UE5 functionality into a research-friendly control layer plus a standardized agent-tool interface through MCP. The MCP server exports the Python API as tools for navigation, scene queries, capture of live renderings, and object manipulation, so an external agent can run closed-loop planning rather than only offline inference.
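The unified object interface described above can be illustrated with a small in-memory mock. This is a sketch under stated assumptions, not the LychSim API: only the add_obj name and its four arguments (object ID, asset path, location, rotation) come from the paper summary. The MockSim class, the infer_kind helper, the asset paths, and the idea of inferring the asset class from the common Unreal naming prefixes SM_, SK_, and BP_ are all hypothetical and exist only to make the single-entry-point dispatch visible.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

# ASSUMPTION: asset class is inferred from a filename prefix. The real
# library's detection logic is not described in the summary.
_KIND_BY_PREFIX = {"SM_": "StaticMesh", "SK_": "SkeletalMesh", "BP_": "Blueprint"}


@dataclass
class SceneObject:
    obj_id: str
    asset_path: str
    kind: str
    location: Vec3
    rotation: Vec3


def infer_kind(asset_path: str) -> str:
    """Hypothetical helper: map an asset path to an asset class."""
    name = asset_path.rsplit("/", 1)[-1]
    for prefix, kind in _KIND_BY_PREFIX.items():
        if name.startswith(prefix):
            return kind
    raise ValueError(f"cannot infer asset kind from {asset_path!r}")


@dataclass
class MockSim:
    """In-memory stand-in for a simulator client (not the LychSim client)."""
    objects: Dict[str, SceneObject] = field(default_factory=dict)

    def add_obj(self, obj_id: str, asset_path: str,
                location: Vec3, rotation: Vec3) -> SceneObject:
        # One entry point for every asset class; the caller never branches.
        if obj_id in self.objects:
            raise ValueError(f"object id {obj_id!r} already exists")
        obj = SceneObject(obj_id, asset_path, infer_kind(asset_path),
                          location, rotation)
        self.objects[obj_id] = obj
        return obj


# Hypothetical asset paths, one per asset class.
sim = MockSim()
sim.add_obj("chair_0", "/Game/Props/SM_Chair", (0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
sim.add_obj("walker_0", "/Game/Characters/SK_Pedestrian", (200.0, 0.0, 0.0), (0.0, 90.0, 0.0))
sim.add_obj("door_0", "/Game/Interactive/BP_Door", (0.0, 300.0, 0.0), (0.0, 0.0, 0.0))
kinds = {obj_id: obj.kind for obj_id, obj in sim.objects.items()}
```

The point of the design is that all three calls have the same shape; any per-class handling stays behind the interface.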

The procedural generation pipeline is the other major algorithmic component. Instead of relying on purely manual scene assembly, the system combines curated UE5 assets with annotated procedural priors to generate new layouts under controllable visual complexity regimes. The paper highlights targeted sampling for OOD conditions such as uncommon viewpoints, severe occlusions, high-density scenes, and semantically cluttered same-class object groupings. For ground truth, the engine can render depth, instance segmentation, surface normals, point maps, 2D/3D boxes, part segmentation, occlusion ratios, truncation ratios, and occlusion relationships. A concrete end-to-end example is the point-map / occlusion pipeline: given a scene with a bicycle partly hidden behind a pedestrian, the system uses instance-level depth buffers plus geometric projection to recover which parts of the bicycle are hidden or fall outside the image frame, then computes occlusion and truncation statistics and can also emit part-level visibility if part IDs are rendered. These outputs follow directly from modeling the 3D structure beyond visible pixels rather than from post-processing 2D masks.
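One plausible way to turn instance-level depth buffers into the two ratios is sketched below with NumPy. The definitions are assumptions, since the summary does not give the paper's exact formulas: truncation is taken as the fraction of the object's projected silhouette that falls outside the image frame, and occlusion as the fraction of the in-frame silhouette hidden behind a nearer surface. The function name, the padded-canvas layout, and the eps tolerance are hypothetical.

```python
import numpy as np


def occlusion_truncation(obj_depth, scene_depth, pad, eps=1e-3):
    """Return (occlusion_ratio, truncation_ratio) for one object.

    obj_depth:   (H + 2*pad, W + 2*pad) depth of this object rendered alone
                 on a canvas padded beyond the image; np.inf where absent.
    scene_depth: (H, W) depth of the nearest surface in the full scene.
    ASSUMPTION: these ratio definitions are illustrative, not the paper's.
    """
    h, w = scene_depth.shape
    if obj_depth.shape != (h + 2 * pad, w + 2 * pad):
        raise ValueError("obj_depth must be scene_depth padded by `pad` per side")

    amodal = np.isfinite(obj_depth)          # full projected silhouette
    total = int(amodal.sum())
    if total == 0:
        return 0.0, 0.0

    obj_in = obj_depth[pad:pad + h, pad:pad + w]
    in_frame = np.isfinite(obj_in)           # silhouette inside the image
    n_in = int(in_frame.sum())
    truncation = 1.0 - n_in / total

    # Visible where the object is the front-most surface at that pixel.
    visible = in_frame & (obj_in <= scene_depth + eps)
    occlusion = 0.0 if n_in == 0 else 1.0 - int(visible.sum()) / n_in
    return occlusion, truncation


# Toy 4x4 image with a 2-pixel pad. The object sits at depth 5 and spills
# two columns past the right image edge; a nearer surface (depth 2) covers
# the bottom two image rows.
obj = np.full((8, 8), np.inf)
obj[2:6, 4:8] = 5.0
scene = np.full((4, 4), 5.0)
scene[2:4, :] = 2.0
occ, trunc = occlusion_truncation(obj, scene, pad=2)
```

In this toy case half of the silhouette lies outside the frame and half of the in-frame part is hidden, so both ratios come out to 0.5.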

Training and optimization details are sparse in the excerpt because LychSim is primarily a framework paper rather than a model-training paper. For the RL adversarial examiner, the paper says they follow prior work and train a Gaussian policy to search camera viewpoints around a target object while minimizing the IoU of SAM predictions. It does not specify the exact policy network architecture, reward shaping, optimizer, number of steps, batch size, or hardware. Likewise, for the language-driven scene planning demos, the paper describes a tool-using agentic workflow with a scene specification file, a skill file, and MCP tools, but it does not report prompt templates, decoding settings, or failure-rate statistics. The system examples are therefore better read as proof-of-capability demonstrations than as fully benchmarked methods.
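The query-then-revise pattern behind the scene-planning demos can be sketched without any LLM or MCP dependency. Everything here is a hypothetical stand-in: the tool names query_scene and move_obj, the scene representation, and the scripted_agent function (a rule that replaces the LLM) are illustrative assumptions, not the tools LychSim exports. The sketch only shows the loop shape: observe structured state, propose tool calls, apply them, and repeat until nothing needs fixing.

```python
from typing import Callable, Dict, List

Scene = Dict[str, Dict[str, float]]


def make_tools(scene: Scene) -> Dict[str, Callable]:
    """Hypothetical tool set exposed to the agent."""
    def query_scene() -> Scene:
        # Return a copy so the agent cannot mutate state without a tool call.
        return {name: dict(state) for name, state in scene.items()}

    def move_obj(name: str, z: float) -> None:
        scene[name]["z"] = z

    return {"query_scene": query_scene, "move_obj": move_obj}


def scripted_agent(observation: Scene, tol: float = 0.01) -> List[dict]:
    """Stand-in for the LLM: fix any object not resting on its support."""
    actions = []
    for name, state in observation.items():
        if abs(state["z"] - state["support_z"]) > tol:
            actions.append({"tool": "move_obj",
                            "args": {"name": name, "z": state["support_z"]}})
    return actions


def run_loop(scene: Scene, max_turns: int = 5) -> List[dict]:
    tools = make_tools(scene)
    log = []
    for _ in range(max_turns):
        actions = scripted_agent(tools["query_scene"]())
        if not actions:          # nothing left to correct
            break
        for action in actions:
            tools[action["tool"]](**action["args"])
            log.append(action)
    return log


# A vase floating above a table top at height 0.75 (units are arbitrary).
scene = {
    "table": {"z": 0.0, "support_z": 0.0},
    "vase": {"z": 1.4, "support_z": 0.75},
}
log = run_loop(scene)
```

After the loop the vase rests on its support surface and the log holds the single corrective call, mirroring the floating-vase correction the paper describes qualitatively.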

Evaluation is case-study driven. There is no single unified benchmark table in the provided text, and no statistical significance tests are reported. Instead, the paper uses three slices: synthetic-data generation for diagnosing spatial VLM weaknesses and supporting post-training; adversarial examination of Segment Anything, where Figure 4 shows a successful attack trajectory; and interactive planning, where Figures 3 and 5 show multi-turn scene editing and layout generation. The only explicit numeric result in the excerpt is the SAM adversarial example in Figure 4, where IoU changes from 0.84 to 0.64 along an adversarial trajectory. The evaluation protocol is therefore qualitative plus illustrative quantitative snapshots, with comparisons to prior systems discussed narratively rather than through extensive ablations. Reproducibility is partially supported by the stated release of code and annotations, but the excerpt does not include frozen weights, downloadable assets, or a complete experiment script set.

One concrete example end to end: a researcher wants to test segmentation robustness under challenging viewpoints. They use the Python API to spawn a target object and other scene elements, request a camera pose from the RL policy, render RGB and masks, compute SAM’s predicted mask, and feed the resulting IoU into the reward function. The policy updates its camera distribution to seek worse IoU, and because LychSim can render from arbitrary viewpoints with consistent scene state, the examiner can systematically probe failure regions rather than relying on chance. This workflow is the clearest example of what LychSim adds over ordinary dataset generation: the environment is not just a renderer but a controllable optimization loop.
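The loop in this example can be sketched with a Gaussian policy trained by a plain REINFORCE update. This is an illustration under assumptions, not the authors' method: the paper does not specify the policy architecture, reward shaping, or optimizer, so the fixed-variance Gaussian, the mean-reward baseline, and all hyperparameters below are hypothetical. The toy_iou function is a synthetic stand-in for "render at this viewpoint, run the segmenter, score IoU"; its numbers are made up and unrelated to Fig. 4.

```python
import numpy as np


def toy_iou(view):
    """Synthetic IoU surface over (azimuth, elevation) in radians.

    ASSUMPTION: a smooth dip around one weak viewpoint. A real examiner
    would render the scene and score the segmentation model here.
    """
    weak_spot = np.array([1.0, 0.5])
    d2 = np.sum((np.asarray(view) - weak_spot) ** 2, axis=-1)
    return 0.9 - 0.4 * np.exp(-d2 / 2.0)


def train_examiner(iou_fn, steps=200, batch=64, sigma=0.5, lr=0.5, seed=0):
    """REINFORCE on the mean of a fixed-variance Gaussian camera policy."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)                      # initial camera distribution mean
    for _ in range(steps):
        views = mu + sigma * rng.standard_normal((batch, 2))
        reward = -iou_fn(views)           # lower IoU means higher reward
        advantage = reward - reward.mean()   # mean baseline reduces variance
        # Gradient of log N(view; mu, sigma^2 I) w.r.t. mu is (view - mu) / sigma^2.
        grad = (advantage[:, None] * (views - mu) / sigma**2).mean(axis=0)
        mu = mu + lr * grad               # gradient ascent on expected reward
    return mu


start_iou = float(toy_iou(np.zeros(2)))
mu_final = train_examiner(toy_iou)
final_iou = float(toy_iou(mu_final))
```

The policy mean drifts toward the viewpoint region where the synthetic IoU is lowest, which is the behavior the examiner relies on; swapping toy_iou for a render-and-score call is the part a real simulator would supply.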

Technical innovations

A unified Python API that abstracts away Unreal Engine 5 asset heterogeneity and lets researchers manipulate StaticMesh, SkeletalMesh, and Blueprint objects through the same commands.
A procedural scene-generation pipeline that uses scene-level rules and object-level pose alignments to create OOD visual conditions while preserving semantically plausible layouts.
Native MCP integration that turns the simulator into a tool-using environment for agentic LLMs, enabling closed-loop navigation, querying, and object manipulation.
New ground-truth outputs beyond standard vision labels, including part-level segmentation, dense point maps, occlusion ratios, truncation ratios, and occlusion relationships.
A hybrid generation strategy that combines curated UE5/Fab assets with external layouts such as Infinigen and HSSD-200 rather than relying on a single procedural paradigm.

Datasets

UE5 Fab Asset Marketplace scenes — size not specified — Epic Games Fab marketplace
External scene layouts (Infinigen) — size not specified — public/open source per cited prior work
External scene layouts (HSSD-200) — size not specified — public dataset

Baselines vs proposed

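Segment Anything (SAM) under the adversarial examiner: IoU = 0.84 at the start of the example trajectory vs IoU = 0.64 after adversarial viewpoint exploration (Fig. 4). This is a single illustrative trajectory, not an aggregate baseline comparison.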
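For scale, the size of that drop can be worked out from the standard IoU definition, with P the predicted mask and G the ground-truth mask. Only the two endpoint values come from the paper; the absolute and relative differences are derived here.

```latex
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}, \qquad
\Delta_{\mathrm{abs}} = 0.84 - 0.64 = 0.20, \qquad
\Delta_{\mathrm{rel}} = \frac{0.20}{0.84} \approx 23.8\%
```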

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12449.

Fig 1: We introduce LychSim, a controllable and interactive simulation framework designed
Fig 2 (page 2).
Fig 3: Agentic integration and interactive scene planning. Left: Our LychSim provides Python
Fig 4: Case study of adversarial examiner for instance segmentation. Left: pseudo-code for
Fig 5 (page 2).
Fig 6 (page 2).

Limitations

The paper excerpt provides almost no concrete dataset cardinalities, making it hard to judge scale or diversity relative to prior synthetic corpora.
Training details for the RL adversarial examiner are underspecified: no architecture, optimizer, reward coefficients, rollout length, or hardware are given in the excerpt.
Most results are case studies and qualitative demonstrations rather than controlled benchmark comparisons or ablation studies.
The interactive LLM planning demo can produce physically implausible layouts and object collisions, a weakness the authors themselves acknowledge.
No formal analysis is provided of simulator fidelity versus real-world domain gap, despite the system’s focus on OOD evaluation and synthetic data.
Reproducibility depends on a promised public release of code and annotations; the excerpt does not confirm that the release is complete or that all referenced assets are redistributable.

Open questions / follow-ons

How much do the newly exposed annotations, especially occlusion ratios and part-level visibility, improve downstream robustness or 3D representation learning compared with standard instance masks?
How well do policies or LLM agents trained in LychSim transfer to real-world scenes, especially under significant appearance and geometry shift?
Can the MCP loop be made reliable enough for long-horizon scene editing without drifting into invalid layouts or compounding object collisions?
What is the cost tradeoff between generating richer 3D supervision in UE5 and using cheaper procedural pipelines such as Blender-based or physics-first simulators?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, LychSim is most relevant as an evaluation substrate rather than a CAPTCHA system itself. It shows how simulation can be used to systematically generate hard OOD cases, which maps directly onto stress-testing human-verification or bot-detection models against viewpoint shifts, clutter, occlusion, and adversarial search. The RL adversarial-examiner pattern is especially relevant: if a model is deployed as a visual verifier, a simulator like this can search for inputs or camera conditions that reduce confidence, exposing brittle regions before an attacker does.

The MCP-style closed loop is also a useful design pattern for red-teaming. Instead of generating a static test set, you let an agent query the environment, inspect outputs, and iteratively adapt. For CAPTCHA or bot-defense, that implies a move from fixed challenge generation toward adaptive challenge synthesis, where the test distribution can respond to model weaknesses. The caveat is that the paper’s demos are still lightweight and not a full robustness study, so a practitioner should treat LychSim as a promising lab for stress testing and data generation, not as evidence that interactive simulation alone solves generalization or anti-bot hardening.

Cite

@article{arxiv2605_12449,
  title={LychSim: A Controllable and Interactive Simulation Framework for Vision Research},
  author={Wufei Ma and Chloe Wang and Siyi Chen and Jiawei Peng and Patrick Li and Alan Yuille},
  journal={arXiv preprint arXiv:2605.12449},
  year={2026},
  url={https://arxiv.org/abs/2605.12449}
}
