HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Source: arXiv:2606.06390 · Published 2026-06-04 · By Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

TL;DR

This paper addresses the challenge of generating realistic, coherent, and simulation-ready whole-home indoor scenes from high-level user prompts. Existing methods are limited by the scarcity of large-scale, high-fidelity 3D residential data and typically focus on isolated tasks such as floorplan generation or single-room furnishing, producing layouts that lack global consistency and interaction readiness. The authors propose a unified hierarchical framework that decomposes whole-home scene synthesis into four controllable stages: (1) floorplan generation, in which a large language model (LLM) fine-tuned on a curated large-scale dataset of real residential floorplans emits vectorized floorplans in a novel K-D tree representation; (2) image-driven hierarchical furnishing, in which foundation 2D image generation models draft furniture layouts via multi-view roaming constrained by 3D unfurnished shells; (3) recursive refinement, in which a fine-tuned vision-language model (VLM) iteratively corrects physical and semantic violations in the layout; and (4) placement of small manipulable objects on supporting surfaces for embodied AI simulation. The pipeline enriches scenes with physical attributes, textures, and lighting to produce fully sim-ready 3D homes. Quantitative experiments and user studies indicate greater layout diversity and higher design appeal than prior work. The authors plan to release the floorplan dataset (roughly 300K floorplans; 314K remain after filtering) and 5K furnished whole-home scenes to support embodied AI and interior design research.

Key findings

Curated 314K real residential floorplans from over 1 million raw images with detailed semantic, geometric, and caption annotations enabling LLM training.
The K-D tree-based floorplan representation reduces geometric errors and network perplexity, improving layout validity over polygon coordinate regression (Fig 7, Sec 4.3).
Hierarchical furnishing via multi-level roaming (top-down and ego-centric views) generates spatially consistent furniture layouts grounded by explicit 3D shell constraints.
A VLM-based iterative refiner corrects placement errors, blocked doors, collisions, and scale inconsistencies over multiple refinement steps, stopping when the layout passes validation or an iteration limit is reached.
Surface-centric small object placement synthesizes plausible manipulable objects on desks, counters, etc., with physical attributes assigned using PhysX-Anything for simulation readiness.
User studies show that scenes generated by the pipeline outperform rule-based and other baseline methods in layout diversity and design appeal (Sec 4.4).
The planned dataset release includes roughly 300K floorplans with captions (314K remain after filtering) and 5K fully furnished 3D home scenes, each with more than 15 manipulable objects on average, for embodied AI use.
Comparison with prior works indicates improved global coherence and sim-readiness in generated multi-room homes.

Threat model

n/a — The paper focuses on generative methods for indoor scene synthesis and does not consider adversarial threats or attacker capabilities. The main challenges addressed relate to data scarcity and geometric/semantic consistency rather than security.

Methodology — deep read

Threat Model & Assumptions: The paper does not frame a threat model; it targets indoor scene generation for embodied AI and design, not security. The closest analogue to an adversary is the set of failure modes the pipeline must handle: layout errors, inconsistency, and physical implausibility.
Data: The authors curate a large-scale dataset by collecting 1.084 million raw floorplan images from an online real estate repository. They annotate 2K images manually to train a detection model recognizing architectural elements (walls, doors, windows), then apply OCR for room labels and dimensions. After filtering for quality, 314K vectorized floorplans remain. These are represented as binary K-D trees encoding recursive axis-aligned partitions with semantic room labels. Each floorplan is paired with a detailed textual caption describing spatial constraints for supervised text-to-structure learning.
Architecture / Algorithm: The floorplan generator is a fine-tuned large language model (LLM). The input is a natural language prompt describing layout criteria (room counts, adjacency, openings). The output is a structured JSON object representing the floorplan as a K-D tree plus attachment information (doors, windows). The K-D tree encodes a hierarchical spatial partition through splitting axes and coordinates, which reduces geometric errors during reconstruction and improves training stability relative to raw polygon regression.
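
To make the K-D tree representation concrete, the following is a minimal sketch of decoding such a tree into labeled room rectangles. The node classes, field names, and the `decode` function are hypothetical; the paper's actual JSON schema is not given here, and doors and windows are omitted.

```python
from dataclasses import dataclass
from typing import Union

Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)


# Hypothetical node types; not the paper's actual schema.
@dataclass
class Leaf:
    room: str  # semantic room label, e.g. "kitchen"


@dataclass
class Split:
    axis: str     # "x" or "y": the axis being cut
    coord: float  # position of the cut along that axis
    low: "Node"   # subtree on the low-coordinate side of the cut
    high: "Node"  # subtree on the high-coordinate side of the cut


Node = Union[Leaf, Split]


def decode(node: Node, bounds: Rect) -> list[tuple[str, Rect]]:
    """Recursively turn a K-D tree into labeled axis-aligned rectangles."""
    if isinstance(node, Leaf):
        return [(node.room, bounds)]
    x0, y0, x1, y1 = bounds
    if node.axis == "x":
        if not x0 < node.coord < x1:
            raise ValueError("split lies outside its parent cell")
        low_b, high_b = (x0, y0, node.coord, y1), (node.coord, y0, x1, y1)
    elif node.axis == "y":
        if not y0 < node.coord < y1:
            raise ValueError("split lies outside its parent cell")
        low_b, high_b = (x0, y0, x1, node.coord), (x0, node.coord, x1, y1)
    else:
        raise ValueError(f"unknown axis: {node.axis!r}")
    return decode(node.low, low_b) + decode(node.high, high_b)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


# A 10 x 8 footprint: cut at x = 6, then cut the left part at y = 5.
tree = Split("x", 6.0,
             Split("y", 5.0, Leaf("living room"), Leaf("kitchen")),
             Leaf("bedroom"))
rooms = decode(tree, (0.0, 0.0, 10.0, 8.0))
```

Because every split subdivides an existing cell, the decoded rooms tile the footprint with no gaps or overlaps by construction. Independent polygon coordinates offer no such guarantee, which is one plausible reading of why the representation yields fewer geometric errors.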

For furnishing, a 3D empty shell is created from the floorplan to provide explicit spatial constraints. The authors adopt a hierarchical indoor roaming approach with two stages. First, a global top-down rendering of the empty shell is inpainted with furniture arrangements by a 2D image generation model, conditioned on room structure and functional cues; this proposes positions for large, room-defining furniture. Second, ego-centric camera views are selected via a heatmap coverage heuristic, and the resulting partial views are inpainted locally to add smaller objects and wall-attached furniture. Objects are then segmented, reconstructed in 3D, and integrated into the scene.
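
The exact heatmap coverage heuristic for choosing ego-centric views is not described here. One plausible reading is a greedy set-cover over a weighted grid, sketched below. The function name, the grid encoding, the candidate view names, and the 0.9 coverage target are all assumptions made for illustration.

```python
Cell = tuple[int, int]


def select_views(
    heat: dict[Cell, float],
    candidates: dict[str, set[Cell]],
    max_views: int,
    target: float = 0.9,
) -> list[str]:
    """Greedily pick views that cover the most not-yet-seen heat.

    heat maps floor cells to an importance weight; candidates maps a
    (hypothetical) camera pose name to the set of cells it can see.
    """
    remaining = dict(heat)
    total = sum(heat.values())
    chosen: list[str] = []
    while len(chosen) < max_views and total > 0:
        covered = total - sum(remaining.values())
        if covered / total >= target:
            break  # enough of the heatmap is already visible
        unused = [name for name in candidates if name not in chosen]
        if not unused:
            break
        best = max(
            unused,
            key=lambda n: sum(remaining.get(c, 0.0) for c in candidates[n]),
        )
        gain = sum(remaining.get(c, 0.0) for c in candidates[best])
        if gain <= 0:
            break  # no remaining view adds new coverage
        chosen.append(best)
        for cell in candidates[best]:
            remaining.pop(cell, None)
    return chosen


# A 2 x 3 grid where the right-hand column matters most.
heat = {(0, 0): 1.0, (0, 1): 1.0, (0, 2): 3.0,
        (1, 0): 1.0, (1, 1): 1.0, (1, 2): 3.0}
candidates = {
    "door": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "window": {(0, 2), (1, 2)},
    "corner": {(0, 1), (0, 2)},
}
views = select_views(heat, candidates, max_views=3)
```

In this example the greedy pass picks "window" first (gain 6 of 10) and then "door" (gain 4), after which the whole grid is covered and "corner" is never needed.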

Next, recursive layout refinement uses a fine-tuned vision-language model (VLM) to iteratively detect and fix layout violations such as collisions, blocked doors, and scaling errors. The refiner observes rendered overhead views together with the structured 3D layout and outputs corrective transformation actions that are applied to 3D bounding boxes. This closed loop continues until the layout passes validation or an iteration limit is reached.
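
The control flow of this closed loop can be sketched as follows. This is not the authors' refiner: a simple rule-based `propose_fix` stands in for the VLM, boxes are 2D top-down footprints rather than 3D bounding boxes, only collisions are checked, and all function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # top-down footprint (x0, y0, x1, y1)


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Overlap extent along x and y; a value <= 0 means no overlap on that axis."""
    return (min(a[2], b[2]) - max(a[0], b[0]),
            min(a[3], b[3]) - max(a[1], b[1]))


def find_violations(layout: dict[str, Box]) -> list[tuple[str, str]]:
    """Validation step: list every pair of colliding objects."""
    names = sorted(layout)
    found = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ox, oy = overlap(layout[a], layout[b])
            if ox > 0 and oy > 0:
                found.append((a, b))
    return found


def propose_fix(layout: dict[str, Box], pair: tuple[str, str]):
    """Stand-in for the VLM: push the second box out along the shallower axis."""
    a, b = layout[pair[0]], layout[pair[1]]
    ox, oy = overlap(a, b)
    if ox <= oy:
        sign = 1.0 if b[0] + b[2] >= a[0] + a[2] else -1.0
        return pair[1], (sign * ox, 0.0)
    sign = 1.0 if b[1] + b[3] >= a[1] + a[3] else -1.0
    return pair[1], (0.0, sign * oy)


def refine(layout: dict[str, Box], max_iters: int = 10):
    """Apply one corrective action per step until valid or out of budget."""
    layout = dict(layout)
    for step in range(max_iters):
        violations = find_violations(layout)
        if not violations:
            return layout, step
        name, (dx, dy) = propose_fix(layout, violations[0])
        x0, y0, x1, y1 = layout[name]
        layout[name] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return layout, max_iters


draft = {"sofa": (0.0, 0.0, 2.0, 1.0),
         "table": (1.5, 0.2, 2.5, 0.8),
         "lamp": (4.0, 4.0, 4.5, 4.5)}
fixed, steps = refine(draft)
```

Here the table overlaps the sofa by 0.5 along x, so a single action shifts it to `(2.0, 0.2, 3.0, 0.8)` and the loop exits after one step. A fuller version would also check room bounds, door clearance, and object scale, as the refiner described above does.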

Finally, a surface-object synthesis stage identifies furniture with support surfaces (desks, tables) and uses inpainting conditioned on local context to generate layouts of small manipulable objects (e.g., books, lamps). Each synthesized object is assigned physical properties (mass, density, rigidity) via PhysX-Anything to ensure simulation compatibility. Objects that collide with others or are unstable are filtered out before insertion.
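
The final filtering step can be illustrated with a small rejection-sampling sketch. It does not reproduce the inpainting-driven proposal or the PhysX-Anything attribute assignment. Candidate positions are drawn at random, "stable" is approximated as "footprint lies fully on the surface", and the function names and parameters are hypothetical.

```python
import random

Box = tuple[float, float, float, float]  # footprint (x0, y0, x1, y1) on the surface


def intersects(a: Box, b: Box) -> bool:
    """True when two footprints overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_on_surface(
    surface: Box,
    sizes: dict[str, tuple[float, float]],
    seed: int = 0,
    tries: int = 50,
) -> dict[str, Box]:
    """Place each object on the surface, skipping any that cannot fit."""
    rng = random.Random(seed)
    sx0, sy0, sx1, sy1 = surface
    placed: dict[str, Box] = {}
    for name, (w, d) in sizes.items():
        if w > sx1 - sx0 or d > sy1 - sy0:
            continue  # larger than the support surface: filtered out
        for _ in range(tries):
            x = rng.uniform(sx0, sx1 - w)
            y = rng.uniform(sy0, sy1 - d)
            candidate = (x, y, x + w, y + d)
            if not any(intersects(candidate, other) for other in placed.values()):
                placed[name] = candidate
                break  # accepted: on the surface and collision-free
    return placed


desk = (0.0, 0.0, 1.2, 0.6)  # metres
objects = {"book": (0.20, 0.15), "lamp": (0.15, 0.15),
           "mug": (0.08, 0.08), "rug": (2.0, 1.5)}
layout = place_on_surface(desk, objects, seed=0)
```

The oversized "rug" is always rejected because it cannot fit on the desk. The other objects are accepted only at positions that do not overlap anything already placed.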

Training Regime: The floorplan LLM is fine-tuned on the 314K vectorized floorplan-caption pairs with varying prompt difficulty. The VLM refiner is trained on synthetic corrupted layout states paired with oracle corrective actions generated by a verification pipeline. Model-in-the-loop samples are added to improve robustness. Specific epochs, batch sizes, seeds, and hardware details are not disclosed.
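
One way to picture the corrupted-state and oracle-action pairs is sketched below. It is a loose illustration under strong assumptions: the corruption is a single random translation, the "oracle" is simply its inverse, and the paper's verification pipeline is not modeled. The function names and the action format are hypothetical.

```python
import random

Box = tuple[float, float, float, float]  # top-down footprint (x0, y0, x1, y1)


def corrupt(layout: dict[str, Box], seed: int = 0, max_shift: float = 1.0):
    """Shift one random object and return (corrupted layout, corrective action)."""
    rng = random.Random(seed)
    name = rng.choice(sorted(layout))
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    x0, y0, x1, y1 = layout[name]
    corrupted = dict(layout)
    corrupted[name] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    # Assumption: the training target is the inverse of the injected perturbation.
    action = {"object": name, "translate": (-dx, -dy)}
    return corrupted, action


def apply_action(layout: dict[str, Box], action: dict) -> dict[str, Box]:
    """Apply a translate action to the named object."""
    dx, dy = action["translate"]
    x0, y0, x1, y1 = layout[action["object"]]
    out = dict(layout)
    out[action["object"]] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return out


clean = {"bed": (0.0, 0.0, 2.0, 1.6), "desk": (3.0, 0.0, 4.2, 0.6)}
corrupted, action = corrupt(clean, seed=1)
restored = apply_action(corrupted, action)
```

Applying the recorded action to the corrupted layout recovers the clean layout up to floating-point rounding, which is the property a supervised (state, action) pair needs.
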
Evaluation Protocol: Quantitative metrics for floorplan validity, geometric correctness, and furniture layout plausibility are reported. Ablations demonstrate the benefits of the K-D tree representation and iterative refinement (Fig 7 and Fig 8). User studies compare generated scenes against rule-based and other baseline methods on layout diversity and visual appeal. No explicit statistical testing is reported. Cross-validation and held-out attacker evaluations are not applicable.
Reproducibility: The authors commit to releasing the curated 300K floorplan dataset and 5K fully furnished home scenes along with pipeline code on their project webpage. Details on pretrained weights or full training scripts are not specified. The furnishing relies on foundation image generation models and fine-tuned VLMs but no public checkpoints are mentioned.

Technical innovations

A K-D tree-based hierarchical floorplan representation that enables structured, natural-language-conditioned LLM generation with fewer geometric errors than polygon coordinate regression.
Hierarchical multi-view indoor scene furnishing approach combining top-down global and ego-centric local image inpainting with explicit 3D shell constraints for spatially coherent furniture layout.
VLM-based recursive refinement module that iteratively detects and corrects physical and semantic layout violations via a sequential decision-making process.
Surface-centric small manipulable object synthesis with physics-aware attribute assignment for enhanced simulation readiness in embodied AI scenarios.

Datasets

HomeWorld Floorplan Dataset — 314K vectorized real residential floorplans — Curated from 1.084 million raw images from real estate listings (pending public release)
HomeWorld Furnished Whole-Home Dataset — 5K fully furnished, simulation-ready 3D home scenes with more than 15 manipulable objects per scene on average — Generated and annotated by the pipeline (pending public release)

Baselines vs proposed

K-D tree floorplan generator vs polygon coordinate regression floorplan: Reduced geometric error and improved layout validity (Fig 7)
Pipeline vs rule-based synthetic generation (ProcTHOR-10K): Higher layout diversity and stronger design appeal in user studies
Pipeline vs single-step refinement ablation: Iterative VLM refinement removes the majority of layout violations (Fig 8)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.06390.

Fig 1 (page 1).
Fig 2 (page 1).
Fig 3 (page 1).
Fig 4 (page 1).
Fig 5 (page 1).
Fig 6 (page 1).
Fig 7 (page 1).
Fig 8 (page 1).

Limitations

The hierarchical furnishing depends on 2D image inpainting models which can introduce artifacts and may not fully capture 3D spatial constraints beyond the explicit shell.
Physical attribute assignment for small objects uses approximate predictions from PhysX-Anything; complex physics or dynamic interactions are not fully validated.
No adversarial or robustness evaluation against unexpected floorplan structures or adversarial prompt manipulations.
Limited direct quantitative metrics on furniture-level placement accuracy or object-level interaction counts.
No cross-validation or evaluation on out-of-distribution floorplans; generalization beyond the curated dataset remains to be tested.
Details on training hyperparameters, computational cost, and runtime are not provided.

Open questions / follow-ons

How well does the iterative VLM-based refinement generalize to diverse or atypical floorplans beyond those seen in training?
Can the pipeline be extended to incorporate dynamic interactions or deformable objects for richer embodied AI simulation?
How can stronger 3D geometric priors or multi-modal transformers be integrated to further reduce 2D inpainting artifacts in furnishing?
What is the impact of varying the scale and diversity of the manipulable object library on task performance in embodied downstream tasks?

Why it matters for bot defense

From a bot-defense or CAPTCHA perspective, the paper’s methodology addresses generating complex, physically plausible 3D scenes that could serve as challenging environments for embodied AI agents, potentially useful benchmarks for interaction-based CAPTCHAs. The hierarchical generation with iterative validation reinforces the importance of modular, multi-stage approaches in producing realistic yet controllable synthetic data. Deploying such generated environments for testing embodied agent robustness requires assurance that layouts avoid unrealistic shortcuts or unreachable configurations, a concern addressed here via explicit 3D constraints and iterative refinement. Furthermore, the curated large-scale floorplan dataset with rich semantic annotations may advance spatial reasoning benchmarks relevant to verifying agent behavior or evaluating human versus automated navigation capabilities in simulated environments. Overall, the approach exemplifies the value in combining large-scale real-world priors with learned refinement agents to yield high-fidelity interactive scenes, an important insight for CAPTCHAs targeting real-world-like embodied AI interactions.

Cite

bibtex

@article{arxiv2606_06390,
  title={HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes},
  author={Wenbo Li and Xiaoliang Ju and Zipeng Qin and Rongyao Fang and Hongsheng Li},
  journal={arXiv preprint arXiv:2606.06390},
  year={2026},
  url={https://arxiv.org/abs/2606.06390}
}
