Recent advances in 3D generation have leveraged synthetic datasets with ground-truth 3D assets and predefined camera trajectories. However, the potential of adopting real-world datasets, which can yield significantly more realistic 3D scenes, remains largely unexplored. In this work, we tackle the key challenge posed by the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) a Trajectory Diffusion Transformer, acting as the Cinematographer, models the distribution of camera trajectories conditioned on textual descriptions; (2) a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. Fine-tuned from a 2D diffusion model, it directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising; (3) the 3D Gaussians are further refined by the Detailer with a novel SDS++ loss, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods in real-world 3D generation.
Director3D handles diverse types of scenes within a single joint framework. See more examples in our Gallery.
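To illustrate how the three components fit together, here is a minimal Python sketch of the Cinematographer → Decorator → Detailer pipeline. All module names, tensor shapes, placeholder networks, and hyperparameters are illustrative assumptions rather than the released Director3D code; the iterative denoising loops and the differentiable Gaussian rasterizer are replaced by stubs so the example runs end to end.

```python
# Hypothetical orchestration of the three-stage pipeline described above.
# Everything here is a stand-in, not the official Director3D API.
import torch
import torch.nn as nn


class Cinematographer(nn.Module):
    """Trajectory diffusion stand-in: maps a text embedding to a camera trajectory."""

    def __init__(self, n_frames: int = 24, pose_dim: int = 12):
        super().__init__()
        self.n_frames, self.pose_dim = n_frames, pose_dim
        # Placeholder for the Trajectory Diffusion Transformer.
        self.net = nn.Linear(512, n_frames * pose_dim)

    @torch.no_grad()
    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # The real model runs an iterative denoising loop over noisy trajectories.
        return self.net(text_emb).view(-1, self.n_frames, self.pose_dim)


class Decorator(nn.Module):
    """Multi-view latent diffusion stand-in: cameras + text -> pixel-aligned 3D Gaussians."""

    def __init__(self, gaussians_per_view: int = 4096, gaussian_dim: int = 14):
        super().__init__()
        self.gaussian_dim = gaussian_dim
        self.net = nn.Linear(512 + 12, gaussians_per_view * gaussian_dim)

    @torch.no_grad()
    def forward(self, text_emb: torch.Tensor, cameras: torch.Tensor) -> torch.Tensor:
        # One Gaussian set per camera; the real model denoises multi-view latents jointly.
        cond = torch.cat(
            [text_emb.unsqueeze(1).expand(-1, cameras.shape[1], -1), cameras], dim=-1
        )
        return self.net(cond).view(cameras.shape[0], -1, self.gaussian_dim)


def detailer_step(gaussians, render_fn, sds_grad_fn, lr: float = 1e-2):
    """One refinement step: render the Gaussians and push them along an SDS-style gradient."""
    gaussians = gaussians.clone().requires_grad_(True)
    rendered = render_fn(gaussians)            # differentiable rasterization (placeholder)
    rendered.backward(sds_grad_fn(rendered))   # inject the 2D-diffusion-prior gradient
    return (gaussians - lr * gaussians.grad).detach()


if __name__ == "__main__":
    text_emb = torch.randn(1, 512)              # stand-in for a text encoder embedding
    cameras = Cinematographer()(text_emb)       # (1, 24, 12) flattened camera poses
    gaussians = Decorator()(text_emb, cameras)  # (1, 24 * 4096, 14) pixel-aligned Gaussians
    # Dummy render / gradient functions so the sketch runs end to end.
    render = lambda g: g.mean(dim=1)
    sds_grad = lambda img: torch.randn_like(img)
    gaussians = detailer_step(gaussians, render, sds_grad)
    print(cameras.shape, gaussians.shape)
```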
@article{li2024director3d,
  author  = {Xinyang Li and Zhangyu Lai and Linning Xu and Yansong Qu and Liujuan Cao and Shengchuan Zhang and Bo Dai and Rongrong Ji},
  title   = {Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text},
  journal = {arXiv preprint arXiv:2406.17601},
  year    = {2024},
}