FlashWorld: High-quality 3D Scene Generation within Seconds

1MAC Lab, Xiamen University, 2Tencent, 3Yes Lab, Fudan University
arXiv 2025

*Project Leader, #Corresponding Author

Abstract

We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, faster than previous works while achieving superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach in which the model directly produces 3D Gaussian representations during multi-view generation. While it ensures 3D consistency, the 3D-oriented approach typically suffers from poor visual quality. FlashWorld combines a dual-mode pre-training phase with a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior of a video diffusion model, we first pre-train a dual-mode multi-view diffusion model that jointly supports MV-oriented and 3D-oriented generation. To bridge the quality gap in 3D-oriented generation, we further propose cross-mode post-training distillation, which matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required at inference. In addition, we propose a strategy that leverages massive single-view images and text prompts during this process to improve the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.
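To make the training recipe concrete, below is a minimal, heavily simplified PyTorch sketch of the cross-mode distillation idea: a frozen teacher running in MV-oriented mode supervises a student running in 3D-oriented mode. The module names, tensor shapes, and the simple regression loss are illustrative assumptions for exposition only, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualModeDenoiser(nn.Module):
    """Toy stand-in for the dual-mode multi-view diffusion model.
    mode="mv": denoise multi-view images directly (MV-oriented).
    mode="3d": return renderings of predicted 3D Gaussians (3D-oriented).
    A real model would share weights and differ only in the output path."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy_views, t, mode="3d"):
        # Placeholder: both modes map noisy views -> predicted clean views.
        return self.backbone(noisy_views)

def cross_mode_distillation_step(student, teacher, clean_views, optimizer):
    """One post-training step: pull the student's 3D-oriented outputs toward the
    frozen teacher's MV-oriented outputs (a crude surrogate for the
    distribution-matching objective described above)."""
    b = clean_views.shape[0]
    t = torch.rand(b, 1, 1, 1, device=clean_views.device)   # random noise level
    noise = torch.randn_like(clean_views)
    noisy = (1.0 - t) * clean_views + t * noise              # linear noising

    student_out = student(noisy, t, mode="3d")               # 3D-consistent branch
    with torch.no_grad():
        teacher_out = teacher(noisy, t, mode="mv")           # high-quality branch

    loss = F.mse_loss(student_out, teacher_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    student = DualModeDenoiser()
    teacher = DualModeDenoiser().eval()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    views = torch.rand(2, 3, 64, 64)                         # toy multi-view batch
    print(cross_mode_distillation_step(student, teacher, views, opt))

In the actual method, the student's 3D-oriented branch renders predicted 3D Gaussians, so supervising its renderings with the MV-oriented teacher improves fidelity without giving up consistency.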

[Teaser figure]

Motivation

The dominant paradigm in 3D scene generation is the MV-oriented pipeline:

[Figure: MV-oriented pipeline]

These methods generate high-quality but inconsistent multi-view images, leading to low scene quality after 3D reconstruction.

Another paradigm with higher potential but less explored is the 3D-oriented pipeline:

[Figure: 3D-oriented pipeline]

While the 3D-oriented pipeline generates consistent multi-view images, the results are blurry due to limited data and imperfect camera annotations.

We propose to combine the strengths of both paradigms through distillation, with the MV-oriented model as the teacher to improve visual quality and the 3D-oriented model as the student to maintain 3D consistency.

[Figure: cross-mode distillation pipeline]

This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required.

Demo

You can try our demo on Hugging Face Spaces for free: https://huggingface.co/spaces/imlixinyang/FlashWorld-Demo-Spark
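If you prefer to query the Space programmatically, the hedged snippet below uses the gradio_client package. The Space id is taken from the URL above; the Space's actual endpoint names and inputs are not documented here and should be read from the printed API description.

# Requires `pip install gradio_client`.
from gradio_client import Client

client = Client("imlixinyang/FlashWorld-Demo-Spark")  # Space id from the demo URL above
client.view_api()  # prints the Space's available endpoints and their parameters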

BibTeX

@misc{li2025flashworld,
  title={FlashWorld: High-quality 3D Scene Generation within Seconds},
  author={Xinyang Li and Tengfei Wang and Zixiao Gu and Shengchuan Zhang and Chunchao Guo and Liujuan Cao},
  year={2025},
  eprint={2510.13678},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}