FlashWorld: High-quality 3D Scene Generation within Seconds

1MAC Lab, Xiamen University, 2Tencent, 3Yes Lab, Fudan University
arXiv 2025

*Project Leader, #Corresponding Author

Abstract

We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, faster than previous works while achieving superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach in which the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports the MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation that matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required at inference. In addition, we propose a strategy that leverages massive single-view images and text prompts during this process to improve the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

Teaser image

Motivation

The dominant paradigm in 3D scene generation is the MV-oriented pipeline:

Pipeline

These methods generate high-quality but inconsistent multi-view images, resulting in degraded scene quality after 3D reconstruction.

Another paradigm with higher potential but less explored is the 3D-oriented pipeline:

Pipeline

While the 3D-oriented pipeline generates consistent multi-view images, the results are blurry due to limited data and imperfect camera annotations.

We propose to combine the strengths of both paradigms through distillation, with the MV-oriented model as the teacher to improve visual quality and the 3D-oriented model as the student to maintain 3D consistency.

Pipeline

This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required.
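The cross-mode distillation idea above can be sketched in a few lines. The snippet below is a toy illustration, not the paper's implementation: `teacher_mv` and `student_3d` are hypothetical linear stand-ins for the frozen MV-oriented teacher and the trainable 3D-oriented student, and a simple L2 regression of student views onto teacher views serves as a surrogate for the distribution-matching objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two generation modes (not the paper's code).
W_teacher = rng.normal(size=(8, 8))        # frozen MV-oriented teacher
W_student = rng.normal(size=(8, 8)) * 0.1  # trainable 3D-oriented student

def teacher_mv(z):
    """High-quality multi-view output for conditioning/noise z (frozen)."""
    return z @ W_teacher

def student_3d(z, W):
    """Views rendered by the 3D-oriented student for the same z."""
    return z @ W

# Cross-mode distillation loop: nudge the student's outputs toward the
# teacher's. Here an L2 loss replaces the distribution-matching objective.
lr = 0.01
for step in range(1000):
    z = rng.normal(size=(32, 8))           # batch of noise/conditioning
    target = teacher_mv(z)                 # teacher's multi-view images
    pred = student_3d(z, W_student)        # student's rendered views
    grad = z.T @ (pred - target) / len(z)  # gradient of 0.5*||pred-target||^2
    W_student -= lr * grad

final_err = float(np.mean((student_3d(z, W_student) - teacher_mv(z)) ** 2))
print(final_err)  # near zero after training
```

In the actual method the student keeps its 3D Gaussian bottleneck during this process, so the distilled model inherits the teacher's appearance quality while the rendered views remain 3D-consistent by construction.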

Demo

BibTeX

@article{flashworld2025,
  title={FlashWorld: High-quality 3D Scene Generation within Seconds},
  author={First Author and Second Author and Third Author},
  journal={arXiv preprint},
  year={2025},
  url={https://your-domain.com/your-project-page}
}