MusicInfuser: Making Video Diffusion Listen and Dance
University of Washington
Please turn on your audio! 🔈
Comparison with Prior Work
MusicInfuser infuses listening capability into the text-to-video model (Mochi) and produces dancing videos while preserving prompt adherence.
Abstract
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality.
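The adapter design described above (a music-video cross-attention layer plus a low-rank update to frozen weights) can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: the dimensions, the choice of applying the LoRA update to the query projection, and all variable names are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, r = 64, 4             # hidden size and LoRA rank (illustrative values)
T_vid, T_aud = 16, 32    # number of video / audio tokens (illustrative)

# Frozen projections from the pretrained video diffusion model.
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

# Low-rank adapter: only A and B would be trained.
A = rng.standard_normal((d, r)) * 0.01
B = np.zeros((r, d))     # zero init, so the adapter starts as a no-op

video_tokens = rng.standard_normal((T_vid, d))
audio_tokens = rng.standard_normal((T_aud, d))

# Music-video cross-attention: video tokens attend to audio features.
Q = video_tokens @ (W_q + A @ B)   # frozen weight + low-rank update
K = audio_tokens @ W_k
V = audio_tokens @ W_v
attn = softmax(Q @ K.T / np.sqrt(d))
out = attn @ V                     # (T_vid, d) music-conditioned features
print(out.shape)
```

Because `B` is initialized to zero, the adapted query projection exactly equals the frozen one at the start of fine-tuning, so training begins from the pretrained model's behavior.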
Music- and Text-Controlled Dance Generation
Dance generation with different music tracks and text prompts, showing the ability to control the style, setting, and dancer attributes.
Audio Speed Control
The audio input is played at different speeds, and the generated dance movements adjust accordingly.
Audio Speed Control (Advanced Dance)
The same setup, but with "advanced dance" added to the prompt. The more complex choreography also adapts to the speed changes.
Group Dance Generation
Two Dancers
MusicInfuser generalizes to generating dance videos with two dancers simply by modifying the prompt.
Multiple Dancers
MusicInfuser also generates dance videos with many dancers.
Generalization to Unseen Music
MusicInfuser shows flexibility by adapting to in-the-wild music tracks outside the training distribution.
Longer Dance Videos (Extrapolation)
We generate dance videos twice the length of those used for training, demonstrating our method's ability to extrapolate to longer sequences.
Citation
@article{hong2025musicinfuser,
  title={MusicInfuser: Making Video Diffusion Listen and Dance},
  author={Hong, Susung and Kemelmacher-Shlizerman, Ira and Curless, Brian and Seitz, Steven M},
  journal={arXiv preprint arXiv:2503.14505},
  year={2025}
}