MusicInfuser: Making Video Diffusion Listen and Dance

Comparison with Prior Work

MusicInfuser infuses a listening capability into the text-to-video model Mochi, producing dancing videos while preserving prompt adherence.

Abstract

We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.
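The layer-selection step can be pictured in a few lines of PyTorch. The snippet below is a minimal sketch, not the paper's implementation: it assumes the per-layer influence scores (the output of the guidance-inspired constructive influence function) have already been computed, and the helper names are hypothetical.

import torch.nn as nn

def select_adaptable_layers(layer_scores, k):
    # layer_scores: {layer_name: influence estimate}. Computing these
    # estimates is the paper's contribution and is not reproduced here;
    # this helper only ranks layers and keeps the k most constructive ones.
    ranked = sorted(layer_scores, key=layer_scores.get, reverse=True)
    return set(ranked[:k])

def freeze_all_but(model: nn.Module, adaptable):
    # Freeze every parameter, then re-enable gradients only for the
    # selected layers, so fine-tuning updates a small fraction of the
    # video diffusion model and the rest of its prior stays intact.
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(layer) for layer in adaptable)

Whether the selected layers are then updated directly or through lightweight adapters, the criterion decides where adaptation happens, which is what keeps training cheap enough for a single GPU.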

Music- and Text-Controlled Dance Generation

Dance generation with different music tracks and text prompts, showing the ability to control the style, setting, and dancer attributes.

Audio Speed Control

The audio input is modified to play at different speeds, and the generated dance movements adjust accordingly.

The same setup, but with “advanced dance” added to the prompt. The more complex choreography also adjusts to the speed changes.
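Speed-varied inputs like these can be produced with standard audio time-stretching. Below is a minimal sketch using librosa; the tool and file names are our assumption, since the page does not specify how the audio was modified.

import librosa
import soundfile as sf

# Load the original track at its native sample rate (file name is hypothetical).
y, sr = librosa.load("dance_track.wav", sr=None)

# Time-stretch without changing pitch: rate > 1 speeds the music up,
# rate < 1 slows it down.
for rate in (0.5, 1.0, 1.5, 2.0):
    stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(f"dance_track_x{rate}.wav", stretched, sr)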

Group Dance Generation

MusicInfuser generalizes to generating dance videos with two dancers simply by modifying the prompt.

MusicInfuser also generates dance videos with larger groups of dancers.

Generalization to Unseen Music

MusicInfuser shows flexibility by adapting to in-the-wild music tracks outside the training distribution.

Generalization to Unseen Subjects

MusicInfuser generalizes beyond the training distribution to unseen subjects, generating dance videos for characters and appearances not present in the alignment data.

Longer Dance Videos (Extrapolation)

We generate dance videos twice as long as those used during training, demonstrating our method's ability to extrapolate to longer sequences.

Citation

@article{hong2025musicinfuser,
  title   = {MusicInfuser: Making Video Diffusion Listen and Dance},
  author  = {Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
  journal = {arXiv preprint arXiv:2503.14505},
  year    = {2025}
}