DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

1University of Washington 2Adobe Research
*Work done during an internship at Adobe

Diffusion Sampling vs. Browsing
(Hover to See Multi-Modal Previews)

Original Diffusion Sampling

"Wide shot of an astronaut pushing through colorful jungle vines on an alien planet"

Full-step diffusion sampling...

Proposed Diffusion Browsing

"Wide shot of an astronaut pushing through colorful jungle vines on an alien planet"

Previewing branches...

Abstract

Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation, keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations, including RGB and scene intrinsics, at more than 4× real-time speed (less than 1 second for a 4-second video), and these previews convey appearance and motion consistent with the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
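
To make the browsing idea concrete, the sketch below shows one way a preview hook could sit inside a standard denoising loop: at user-chosen steps, intermediate transformer features are passed to a lightweight decoder instead of waiting for the full sample. This is a minimal illustration rather than the paper's implementation; `denoiser`, `preview_decoder`, `cond`, and the diffusers-style `scheduler` interface are all assumed placeholders.

```python
# Minimal sketch of previewing during denoising (illustration only, not the
# authors' code). Assumptions: `denoiser` returns a noise prediction plus
# intermediate transformer features, `preview_decoder` is the lightweight
# decoder mapping those features to multi-modal previews, and `scheduler`
# exposes a diffusers-style `timesteps` list and `step()` update.
import torch

@torch.no_grad()
def sample_with_previews(denoiser, preview_decoder, scheduler, latents, cond,
                         preview_at=(0.02, 0.10, 0.20)):
    """Run ordinary denoising, emitting cheap previews at chosen fractions of the schedule."""
    num_steps = len(scheduler.timesteps)
    preview_indices = {int(round(f * num_steps)) for f in preview_at}
    previews = {}
    for i, t in enumerate(scheduler.timesteps):
        noise_pred, features = denoiser(latents, t, cond, return_features=True)
        if i in preview_indices:
            # Decode RGB + intrinsics previews from intermediate features
            # (the paper reports >4x real time, i.e. <1 s for a 4 s clip).
            previews[i] = preview_decoder(features)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents, previews
```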

Timestep-Wise Evolution of Scene Intrinsics

The grid shows the preview modalities (rubber, RGB, base color, depth, normal, metallic, roughness) in columns, with rows for the ground truth (GT) and previews at denoising timesteps of 2%, 4%, 10%, 20%, 40%, and 100%. Basic scene structure and intrinsics emerge early and stabilize quickly.

Rubber-Like Previews

Each column shows a different sample, and the two rows show previews at 10% and 20% of the denoising steps. Even at 10% of the denoising steps, the previews already show clear structure.

Steered Variation Generation

Our trained decoder enables channel-targeted steering at intermediate denoising steps, allowing users to generate meaningful variations in scene intrinsics (base color, depth, and normal maps) in the preview stage. The example shows the steered result at 10% of the denoising steps; the result on the left is unsteered.
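
One ingredient mentioned in the abstract, stochasticity reinjection, can be illustrated with a generic re-noising sketch: re-noise a clean-latent estimate from an intermediate step back to a chosen noise level and finish sampling several times to obtain variations that keep the coarse layout. This is a rough, hypothetical illustration of reinjection only; the paper's channel-targeted steering of intrinsic channels is not reproduced here, and `denoiser`, `scheduler`, and `x0_estimate` are assumed placeholders.

```python
# Generic stochasticity-reinjection sketch (illustration only, not the paper's
# exact steering procedure). `x0_estimate` is the current clean-latent estimate
# at the chosen intermediate step (roughly what the preview visualizes);
# `denoiser` and `scheduler` follow the same hypothetical interface as above.
import torch

@torch.no_grad()
def reinject_and_finish(denoiser, scheduler, x0_estimate, cond, t_restart, num_variants=4):
    """Re-noise a clean(ish) estimate back to level `t_restart` with fresh
    noise, then finish denoising to obtain several variations."""
    variants = []
    remaining = [t for t in scheduler.timesteps if t <= t_restart]
    for _ in range(num_variants):
        noise = torch.randn_like(x0_estimate)
        # Push the estimate back to the noise level of t_restart.
        x = scheduler.add_noise(x0_estimate, noise, torch.tensor([t_restart]))
        for t in remaining:
            noise_pred, _ = denoiser(x, t, cond, return_features=True)
            x = scheduler.step(noise_pred, t, x).prev_sample
        variants.append(x)
    return variants
```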

Analysis

Linear probing and our multi-branch decoder

Linear probing results

Linear probing shows that intrinsic channels (base color, depth, normals) become predictable very early across both timesteps and transformer blocks, while RGB sharpens gradually. Guided by this, we attach lightweight decoders to intermediate features and train multiple branches jointly with branch-wise losses plus an ensemble loss. This multi-branch design reduces the superposition problem that appears at noisy timesteps and yields sharp, consistent previews that match the final video.
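
The sketch below illustrates this training recipe: several lightweight branch decoders are supervised by branch-wise reconstruction losses plus a loss on their averaged (ensemble) prediction. The module names, the per-frame 2D feature-map assumption, and the L1 losses are illustrative choices, not the paper's exact architecture or objective.

```python
# Minimal sketch of multi-branch decoding with branch-wise + ensemble losses
# (illustrative only; the real feature taps, decoder architecture, modalities,
# and loss weights are specific to the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreviewBranch(nn.Module):
    """Lightweight decoder head attached to one intermediate feature tap,
    assumed here to be a per-frame 2D feature map."""
    def __init__(self, feat_dim, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_channels, 3, padding=1),
        )

    def forward(self, feats):
        return self.net(feats)

def multi_branch_loss(branches, feature_taps, target, ensemble_weight=1.0):
    """Sum of per-branch reconstruction losses plus a loss on the ensemble mean."""
    preds = [branch(f) for branch, f in zip(branches, feature_taps)]
    branch_loss = sum(F.l1_loss(p, target) for p in preds)
    ensemble = torch.stack(preds).mean(dim=0)
    return branch_loss + ensemble_weight * F.l1_loss(ensemble, target)
```

The ensemble term pushes branches attached to different features toward agreement, which is one simple way to realize the "branch-wise plus ensemble" objective described above.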

Pipeline diagram