Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation <i>(NeurIPS 2023)</i>

TL;DR

We propose Debiased Score Distillation Sampling (D-SDS), an efficient technique to address the Janus problem. It consists of Score Debiasing, which clips the scores of diffusion models with a linearly increasing threshold, and Prompt Debiasing, which removes words that conflict with view prefixes (e.g., 'smiling' with 'back view'). By introducing our techniques to existing text-to-3D generation frameworks such as DreamFusion, SJC, and Magic3D, you can eliminate artifacts such as multiple faces, horns, and ears from the generated 3D objects, resulting in more view-consistent objects.

SDS

Debiased-SDS (Ours)

"a small kitten", "a majestic giraffe with a long neck", "a cute and chubby panda munching on bamboo" (SJC)

"a colorful toucan with a large beak" (IF-DreamFusion, ThreeStudio Implementation)

"a playful and cuddly kitten with big eyes" (Magic3D, ThreeStudio Implementation)

"a flamingo standing on one leg in shallow water" (ProlificDreamer, ThreeStudio Implementation)

Abstract

Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (e.g., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main cause of the view inconsistency problem: the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead.

Score Debiasing

Visualization of the magnitude of the estimated score during the optimization.

The magnitude of the estimated score represents the (scaled) deviation of the perturbed and denoised image from the original image rendered from the 3D field. When the perturbed and denoised image deviates significantly from the rendered image at certain pixels, the 2D score magnitude surges there. This spike results in the creation of undesired elements, such as extra legs, beaks, horns, or faces. To address this problem, we apply dynamic clipping to the 2D scores, beginning with a low threshold and gradually increasing it. This approach preserves fine detail in the shapes while reducing unwanted artifacts.
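The clipping schedule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official implementation: the function names, the element-wise symmetric clipping, and the threshold endpoints `c_min` and `c_max` are assumptions chosen for the example.

```python
import numpy as np

def clip_threshold(step, total_steps, c_min=0.5, c_max=2.0):
    """Linearly increase the clipping threshold from c_min to c_max.

    c_min and c_max are illustrative values, not taken from the paper.
    """
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # keep the fraction inside [0, 1]
    return c_min + (c_max - c_min) * frac

def debias_score(score, step, total_steps, c_min=0.5, c_max=2.0):
    """Clip the estimated 2D score element-wise to [-c, c] at this step."""
    c = clip_threshold(step, total_steps, c_min, c_max)
    return np.clip(score, -c, c)

if __name__ == "__main__":
    score = np.array([-3.0, 0.1, 3.0])  # toy score with two large spikes
    for step in (0, 50, 100):
        print(step, debias_score(score, step, total_steps=101))
```

Early in the optimization the low threshold suppresses large score spikes; later the higher threshold lets more of the score through, so fine detail can still be refined.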

Prompt Debiasing

Samples from Stable Diffusion given a text prompt with contradiction.

Based on our findings, contradictions between the view prompts and user prompts make it hard to optimize 3D scenes. For example, even when "Back view of" is given in the prompt, the word "smiling" biases diffusion models towards the front view of objects. Thus, we remove the conflicting words so that the prompts are consistent with the viewing direction of an object. Specifically, we calculate pointwise mutual information (PMI) to identify the conflicting words between the user prompts and the view prompts, utilizing a large language model.
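As a rough sketch of the PMI test, the snippet below scores each word against a view with PMI(word, view) = log(P(view | word) / P(view)) and flags words whose PMI falls below a threshold. The probabilities are hard-coded, made-up numbers standing in for language-model estimates, and the function names and the threshold of 0 are assumptions for illustration only.

```python
import math

def pmi(p_view_given_word, p_view):
    """Pointwise mutual information between a word and a view."""
    return math.log(p_view_given_word / p_view)

def words_to_drop(cond_probs, view_prior, view, threshold=0.0):
    """Return words whose PMI with `view` is below `threshold`.

    cond_probs: {word: {view: P(view | word)}}, assumed to come from a
    language model; here they are supplied by hand.
    """
    return [
        word
        for word, probs in cond_probs.items()
        if pmi(probs[view], view_prior[view]) < threshold
    ]

if __name__ == "__main__":
    # Hypothetical probabilities, not measured values.
    view_prior = {"front": 0.35, "back": 0.3}
    cond_probs = {
        "smiling": {"front": 0.7, "back": 0.05},
        "kitten": {"front": 0.35, "back": 0.3},
    }
    print(words_to_drop(cond_probs, view_prior, "back"))
    print(words_to_drop(cond_probs, view_prior, "front"))
```

With these toy numbers, "smiling" is flagged for the back view because it makes that view much less likely than the prior, while "kitten" is view-neutral and kept.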

Debiased-SDS Framework

Overall illustration of our framework.

Using the above two debiasing techniques, we propose a simple and efficient debiased score-distilling text-to-3D generation framework. First, we perform prompt debiasing to make the prompts consistent with the viewing direction of an object. Then, we perform score debiasing to remove the artifacts in the generated 3D objects. Note that our framework is easily applicable to any score-distilling text-to-3D generation framework, such as DreamFusion, SJC, or Magic3D. We provide an implementation of our techniques applied to DreamFusion and SJC, which can be found in our official repository for D-SDS. For further details, please refer to our paper.
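The order of the two steps can be illustrated with a toy update. In this sketch the "3D parameters" are simply an array updated directly by the clipped score, `score_fn` is a placeholder for a 2D diffusion model, and every name and constant is hypothetical; real frameworks backpropagate the score through a differentiable renderer.

```python
import numpy as np

def build_view_prompt(user_prompt, view, conflicting):
    """Drop words that conflict with `view`, then prepend the view prefix."""
    banned = conflicting.get(view, set())
    kept = [w for w in user_prompt.split() if w.lower() not in banned]
    return f"{view} view of " + " ".join(kept)

def debiased_step(params, score_fn, prompt, step, total_steps,
                  lr=0.1, c_min=0.5, c_max=2.0):
    """One toy update: clip the score with a linearly growing threshold."""
    c = c_min + (c_max - c_min) * step / max(total_steps - 1, 1)
    score = np.clip(score_fn(params, prompt), -c, c)
    return params - lr * score

if __name__ == "__main__":
    # Hypothetical conflict table; in practice it would come from the PMI test.
    conflicting = {"back": {"smiling"}}
    prompt = build_view_prompt("a smiling dog", "back", conflicting)
    print(prompt)

    # Placeholder score function that returns a fixed, spiky score.
    toy_score = lambda params, text: np.array([10.0, -10.0, 0.1])
    params = debiased_step(np.zeros(3), toy_score, prompt, step=0, total_steps=10)
    print(params)
```

Prompt debiasing runs once per sampled view before the score is queried, and score debiasing is applied to every score estimate inside the optimization loop.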

More Results

SDS

Debiased-SDS (Ours)

"an elegant teacup with delicate floral designs" (ProlificDreamer, ThreeStudio Implementation)

"a colorful toucan with a large beak" (Magic3D, ThreeStudio Implementation)

"a baby bunny, sitting on top of a stack of pancakes" (IF-DreamFusion, ThreeStudio Implementation)

"a playful and cuddly kitten with big eyes" (IF-DreamFusion, ThreeStudio Implementation)

"a kangaroo wearing boxing gloves" (IF-DreamFusion, ThreeStudio Implementation)

BibTeX

@article{hong2023debiasing,
  title={Debiasing scores and prompts of 2d diffusion for robust text-to-3d generation},
  author={Hong, Susung and Ahn, Donghoon and Kim, Seungryong},
  journal={arXiv preprint arXiv:2303.15413},
  year={2023}
}