IMPACT

Iterative Mask-based PArallel DeCoding for Text-to-Audio Generation with Diffusion Modeling
Kuan-Po Huang, Shu-wen Yang, Huy Phan, Bo-Ru Lu, Byeonggeun Kim, Sashank Macha, Qingming Tang,
Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang
Amazon, National Taiwan University

Abstract

Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of the diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics, including Fréchet Distance (FD) and Fréchet Audio Distance (FAD), while significantly reducing latency compared to prior models.

Special Features
State-of-the-art performance on the key metrics FD and FAD on the AudioCaps evaluation set (see the sketch below for how these Fréchet-style metrics are computed).
Faster generation than AudioLDM2, the Tango series, MAGNET, and all other baseline models.
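For reference, both FD and FAD are Fréchet distances between two multivariate Gaussians fitted to embeddings of generated and reference audio (FAD is conventionally computed on VGGish embeddings, FD on PANN embeddings). Below is a minimal sketch of the shared formula, assuming the embedding means and covariances have already been estimated:

import numpy as np
from scipy import linalg

def frechet_distance(mu_gen, cov_gen, mu_ref, cov_ref):
    # Fréchet distance between N(mu_gen, cov_gen) and N(mu_ref, cov_ref):
    #   ||mu_gen - mu_ref||^2 + Tr(cov_gen + cov_ref - 2 (cov_gen cov_ref)^(1/2))
    diff = mu_gen - mu_ref
    covmean = linalg.sqrtm(cov_gen @ cov_ref, disp=False)[0].real
    return float(diff @ diff + np.trace(cov_gen + cov_ref - 2.0 * covmean))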

Diagram of IMPACT

[Figure: IMPACT framework diagram]
  • Training phase: Mask generative modeling
  • Inference phase: Generate a sequence of latents
    • A key point is that the sequence is generated gradually over an iterative process. The model starts from a sequence consisting entirely of mask embeddings. At each iteration, a randomly selected subset of positions is predicted, and those predictions serve as input for the next iteration. The process repeats until all positions are predicted (see the sketch after this list).
    • The benefit of this scheme is that latents generated at later iterations can leverage the content predicted at earlier iterations as conditioning.
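A minimal sketch of this decoding loop, assuming a hypothetical interface model(latents, masked, text_emb) that predicts a latent for every position, plus a cosine masking schedule; the actual IMPACT architecture and schedule may differ:

import math
import torch

@torch.no_grad()
def parallel_decode(model, text_emb, seq_len, latent_dim, num_iters=8):
    latents = torch.zeros(seq_len, latent_dim)       # zeros stand in for the learned mask embedding
    masked = torch.ones(seq_len, dtype=torch.bool)   # True = not yet predicted
    for step in range(num_iters):
        pred = model(latents, masked, text_emb)      # predict every position
        # Cosine schedule: fraction of the sequence still masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / num_iters)
        n_reveal = int(masked.sum()) - int(frac * seq_len)
        # Randomly pick which still-masked positions to commit this iteration;
        # later iterations condition on everything revealed so far.
        cand = masked.nonzero(as_tuple=True)[0]
        chosen = cand[torch.randperm(cand.numel())[:n_reveal]]
        latents[chosen] = pred[chosen]
        masked[chosen] = False
    return latents                                   # fully generated latent sequence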

IMPACT's Latency vs FAD↓, KL↓

[Figure: scatter plot of latency vs. FAD and KL for IMPACT and baseline models]

Models falling in the green area are both faster than MAGNET and better than MAGNET on the objective metrics.

(Latency: time required to generate a batch of 8 audio clips, measured in seconds on a single V100 GPU.)
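A minimal timing harness of the kind typically used for such measurements; generate_fn is a hypothetical stand-in for a model's batched text-to-audio call, not IMPACT's actual API, and a CUDA GPU is assumed:

import time
import torch

def measure_latency(generate_fn, prompts, n_warmup=2, n_runs=5):
    for _ in range(n_warmup):
        generate_fn(prompts)             # warm up kernels and caches
    torch.cuda.synchronize()             # ensure pending GPU work is done
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn(prompts)
    torch.cuda.synchronize()             # wait for the last batch to finish
    return (time.perf_counter() - start) / n_runs

# Example: average seconds per batch of 8 prompts.
# latency = measure_latency(model.generate, ["A dog barks"] * 8)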

Audio Samples

Each caption below is paired with audio generated by IMPACT Base, IMPACT Large, MAGNET-s, AudioLDM2, and Tango 2, alongside the ground-truth recording:

  • Machine grinding wood
  • Firecrackers popping as a crowd of people cheer and whistle
  • A dog barks with distant birds chirping then people speak
  • A baby laughing loudly
  • A person is snoring
  • Bird chirping while waves come in with high wind
  • Helicopter engine running
  • Several gunshots with a click and glass breaking
  • Train horns honking as wind blows into a microphone while a group of people talk and an electronic beep repeatedly sounds during a vehicle engine running idle
  • A crowd murmurs as a siren blares and then stops at a distance
  • Church bells ringing
  • Birds chirping and water trickling
  • Emergency sirens wail as a truck engine accelerates and drives by
  • Very strong wind is blowing, and leaves are rustling on the trees

Please consider citing our paper if you find it useful:

@article{huang2025impact,
    title     = {IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling},
    author    = {Huang, Kuan-Po and Yang, Shu-wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-yi and Kao, Chieh-Chi and Wang, Chao},
    journal   = {arXiv preprint arXiv:2506.00736},
    year      = {2025},
}