IMPACT

Iterative Mask-based PArallel DeCoding for Text-to-Audio Generation with Diffusion Modeling
Kuan-Po Huang, Shu-wen Yang, Huy Phan, Bo-Ru Lu, Byeonggeun Kim, Sashank Macha, Qingming Tang,
Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang
Amazon, National Taiwan University

Abstract

Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of the diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics, including Fréchet Distance (FD) and Fréchet Audio Distance (FAD), while significantly reducing latency compared to prior models.

Special Features
State-of-the-art performance on the key metrics FD and FAD on the AudioCaps evaluation set (see the sketch below for how these Fréchet-style metrics are computed).
Faster generation than AudioLDM2, the Tango series, MAGNET, and all other baseline models.
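For reference, both FD and FAD are Fréchet distances between two multivariate Gaussians fitted to embeddings of generated and reference audio (FAD is conventionally computed on VGGish embeddings, FD on PANN embeddings). Below is a minimal sketch of the shared formula, assuming the embedding means and covariances have already been estimated:

import numpy as np
from scipy import linalg

def frechet_distance(mu_gen, cov_gen, mu_ref, cov_ref):
    # Fréchet distance between N(mu_gen, cov_gen) and N(mu_ref, cov_ref):
    #   ||mu_gen - mu_ref||^2 + Tr(cov_gen + cov_ref - 2 (cov_gen cov_ref)^(1/2))
    diff = mu_gen - mu_ref
    covmean = linalg.sqrtm(cov_gen @ cov_ref, disp=False)[0].real
    return float(diff @ diff + np.trace(cov_gen + cov_ref - 2.0 * covmean))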

Diagram of IMPACT

[Figure: IMPACT framework diagram]
  • Training phase: Mask generative modeling
  • Inference phase: Generate a sequence of latents
    • A key point is that the sequence is generated gradually over an iterative process. The model starts from a sequence consisting entirely of mask embeddings. At each iteration, a randomly selected subset of positions is predicted, and those predictions serve as input for the next iteration. The process repeats until all positions are predicted (see the sketch after this list).
    • The benefit of this scheme is that latents generated at later iterations can leverage the content predicted at earlier iterations as conditioning.
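A minimal sketch of this decoding loop, assuming a hypothetical interface model(latents, masked, text_emb) that predicts a latent for every position, plus a cosine masking schedule; the actual IMPACT architecture and schedule may differ:

import math
import torch

@torch.no_grad()
def parallel_decode(model, text_emb, seq_len, latent_dim, num_iters=8):
    latents = torch.zeros(seq_len, latent_dim)       # zeros stand in for the learned mask embedding
    masked = torch.ones(seq_len, dtype=torch.bool)   # True = not yet predicted
    for step in range(num_iters):
        pred = model(latents, masked, text_emb)      # predict every position
        # Cosine schedule: fraction of the sequence still masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / num_iters)
        n_reveal = int(masked.sum()) - int(frac * seq_len)
        # Randomly pick which still-masked positions to commit this iteration;
        # later iterations condition on everything revealed so far.
        cand = masked.nonzero(as_tuple=True)[0]
        chosen = cand[torch.randperm(cand.numel())[:n_reveal]]
        latents[chosen] = pred[chosen]
        masked[chosen] = False
    return latents                                   # fully generated latent sequence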

IMPACT's Latency vs FAD↓, KL↓

[Figure: scatter plot of latency vs. FAD and KL for IMPACT and baseline models]

Models falling in the green area are both faster than MAGNET and better than MAGNET on the objective metrics.

(Latency: time required to generate a batch of 8 audio clips, measured in seconds on a single V100 GPU.)
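A minimal timing harness of the kind typically used for such measurements; generate_fn is a hypothetical stand-in for a model's batched text-to-audio call, not IMPACT's actual API, and a CUDA GPU is assumed:

import time
import torch

def measure_latency(generate_fn, prompts, n_warmup=2, n_runs=5):
    for _ in range(n_warmup):
        generate_fn(prompts)             # warm up kernels and caches
    torch.cuda.synchronize()             # ensure pending GPU work is done
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn(prompts)
    torch.cuda.synchronize()             # wait for the last batch to finish
    return (time.perf_counter() - start) / n_runs

# Example: average seconds per batch of 8 prompts.
# latency = measure_latency(model.generate, ["A dog barks"] * 8)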

Audio Samples

Each caption below is paired with audio generated by IMPACT Base, IMPACT Large, MAGNET-s, AudioLDM2, and Tango 2, alongside the ground-truth recording:

  • Machine grinding wood
  • Firecrackers popping as a crowd of people cheer and whistle
  • A dog barks with distant birds chirping then people speak
  • A baby laughing loudly
  • A person is snoring
  • Bird chirping while waves come in with high wind
  • Helicopter engine running
  • Several gunshots with a click and glass breaking
  • Train horns honking as wind blows into a microphone while a group of people talk and an electronic beep repeatedly sounds during a vehicle engine running idle
  • A crowd murmurs as a siren blares and then stops at a distance
  • Church bells ringing
  • Birds chirping and water trickling
  • Emergency sirens wail as a truck engine accelerates and drives by
  • Very strong wind is blowing, and leaves are rustling on the trees

Please consider citing our paper if you find it useful:

@article{huang2025impact,
    title     = {IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling},
    author    = {Huang, Kuan-Po and Yang, Shu-wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-yi and Kao, Chieh-Chi and Wang, Chao},
    journal   = {arXiv preprint arXiv:2506.00736},
    year      = {2025},
}