Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

1Wuhan University, 2Hedra, 3JD Explore Academy
* denotes equal contribution, denotes corresponding author

Our method generates more plausible 3D scenes given input human motions and floor plans. It excels in two key aspects: (1) avoiding collisions between humans and objects, as well as between objects, a significant improvement over MIME, and (2) providing better support for human-object interactions compared to DiffuScene.

Abstract

Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous autoregressive human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often placing overlapping objects in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to the spatial constraints imposed by the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we have developed an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D-FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework can generate more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods.
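To make the two guidance mechanisms concrete, below is a minimal, hedged PyTorch sketch (not the paper's code) of how such collision costs could be computed. It assumes axis-aligned object footprints parameterized as [cx, cz, sx, sz] in normalized floor coordinates, a binary free-space mask rasterized from the human motion (1 where space must stay clear), and a binary floor-plan mask on the same H x W grid; all names and shapes are illustrative assumptions.

import torch

def rasterize_boxes(boxes, H, W):
    # boxes: (N, 4) rows of [cx, cz, sx, sz] in normalized [0, 1] floor coordinates.
    zs = torch.linspace(0.0, 1.0, H).view(H, 1, 1)   # grid rows (z axis)
    xs = torch.linspace(0.0, 1.0, W).view(1, W, 1)   # grid columns (x axis)
    cx, cz, sx, sz = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Soft inside/outside tests keep the occupancy differentiable for guidance gradients.
    inside_x = torch.sigmoid((sx - (xs - cx).abs()) * 50.0)   # (1, W, N)
    inside_z = torch.sigmoid((sz - (zs - cz).abs()) * 50.0)   # (H, 1, N)
    return (inside_x * inside_z).max(dim=-1).values           # (H, W) soft occupancy of all boxes

def collision_costs(boxes, free_mask, floor_mask):
    # free_mask: 1 where human motion requires free space; floor_mask: 1 inside the room.
    occ = rasterize_boxes(boxes, *free_mask.shape)
    human_object = (occ * free_mask).mean()            # objects intruding into space humans move through
    out_of_room  = (occ * (1.0 - floor_mask)).mean()   # objects extending beyond the floor plan
    return human_object + out_of_room

Because the occupancy is soft, both terms are differentiable, which is what allows collision costs of this kind to steer the diffusion sampler described in the Method section.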

Method

SHADE learns a diffusion model that gradually denoises the noisy scene x_T by simultaneously considering the contact bounding boxes, free-space mask, floor plan, and time step. During inference, SHADE applies three spatial collision guidance functions to ensure the generated scenes are plausible: they avoid conflicts with human motions, respect room boundaries, and prevent object overlap.
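The sketch below illustrates, in hedged form, how such collision guidance could be injected into a standard DDPM sampling loop; it is not SHADE's implementation. Here `model` is assumed to predict the noise given the noisy scene parameters, the conditioning (contact boxes, free-space mask, floor plan), and the time step; `guidance_cost` stands in for the spatial collision costs (e.g. the differentiable costs sketched after the abstract); the schedule and update rule are plain DDPM, and all signatures and shapes are illustrative.

import torch

@torch.no_grad()
def guided_sample(model, cond, guidance_cost, betas, scale=1.0, shape=(1, 12, 8)):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                # x_T: pure-noise scene parameters
    for t in reversed(range(len(betas))):
        eps = model(x, cond, t)                           # predicted noise at step t
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        # Spatial collision guidance: follow the negative gradient of the collision cost.
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            grad = torch.autograd.grad(guidance_cost(x_in, cond), x_in)[0]
        mean = mean - scale * grad
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x                                              # x_0: denoised scene parameters

The design mirrors classifier guidance: the gradient of a cost defined on the partially denoised scene nudges each reverse step toward collision-free layouts without retraining the diffusion model.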

Illustrative Video

Comparison with SOTA Methods

References

[ATISS] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In NeurIPS 2021.

[MIME] Hongwei Yi, Chun-Hao P. Huang, Shashank Tripathi, Lea Hering, Justus Thies, and Michael J. Black. MIME: Human-Aware 3D Scene Generation. In CVPR 2023.

[DiffuScene] Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner. DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis. In CVPR 2024.

Citation

@misc{hong2024humanaware3dscenegeneration,
      title={Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models}, 
      author={Xiaolin Hong and Hongwei Yi and Fazhi He and Qiong Cao},
      year={2024},
      eprint={2406.18159},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.18159}, 
}