Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of the diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models.
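To make the idea of iterative mask-based parallel decoding over continuous latents more concrete, the following is a minimal sketch and not the IMPACT implementation. It assumes a cosine masking schedule and a random unmasking order, both of which are common in masked generative models but are not confirmed by the text above. The function names (`cosine_masked_count`, `placeholder_sampler`, `iterative_parallel_decode`) are hypothetical, and `placeholder_sampler` simply draws Gaussian noise where a real system would run a text-conditioned diffusion sampler.

```python
import math
import numpy as np


def cosine_masked_count(step, num_steps, seq_len):
    """Number of positions still masked after `step` of `num_steps` iterations.

    Assumption: a cosine schedule, which is not confirmed for IMPACT.
    """
    ratio = math.cos(math.pi / 2 * step / num_steps)
    return int(math.floor(seq_len * ratio))


def placeholder_sampler(latents, known, positions, rng):
    """Hypothetical stand-in for a text-conditioned diffusion sampler.

    A real model would condition on the prompt and on the already known
    latents; here we only draw Gaussian noise of the right shape.
    """
    return rng.standard_normal((len(positions), latents.shape[1]))


def iterative_parallel_decode(seq_len, dim, num_steps,
                              sampler=placeholder_sampler, seed=0):
    """Fill a sequence of continuous latents over several parallel steps."""
    rng = np.random.default_rng(seed)
    latents = np.zeros((seq_len, dim))       # continuous latents, not token ids
    known = np.zeros(seq_len, dtype=bool)    # False means "still masked"
    order = rng.permutation(seq_len)         # assumed random unmasking order
    filled = 0
    for step in range(1, num_steps + 1):
        remaining = cosine_masked_count(step, num_steps, seq_len)
        target_filled = seq_len - remaining
        positions = order[filled:target_filled]
        if len(positions) > 0:
            # All positions chosen for this step are generated in parallel.
            latents[positions] = sampler(latents, known, positions, rng)
            known[positions] = True
        filled = target_filled
    return latents, known


latents, known = iterative_parallel_decode(seq_len=16, dim=4, num_steps=8)
```

In this sketch the number of decoding iterations (`num_steps`) is independent of the sequence length, which is the source of the speed advantage of parallel decoding over position-by-position generation. Keeping the latents continuous avoids quantizing audio into discrete tokens.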
Description | IMPACT Base | IMPACT Large | MAGNET-s | AudioLDM2 | Tango 2 | Ground Truth |
---|---|---|---|---|---|---|
Machine grinding wood | ||||||
Firecrackers popping as a crowd of people cheer and whistle | ||||||
A dog barks with distant birds chirping then people speak | ||||||
A baby laughing loudly | ||||||
A person is snoring | ||||||
Bird chirping while waves come in with high wind | ||||||
Helicopter engine running | ||||||
Several gunshots with a click and glass breaking | ||||||
Train horns honking as wind blows into a microphone while a group of people talk and an electronic beep repeatedly sounds during a vehicle engine running idle | ||||||
A crowd murmurs as a siren blares and then stops at a distance | ||||||
Church bells ringing | ||||||
Birds chirping and water trickling | ||||||
Emergency sirens wail as a truck engine accelerates and drives by | ||||||
Very strong wind is blowing, and leaves are rustling on the trees | ||||||
@article{huang2025impact,
title = {IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling},
author = {Huang, Kuan-Po and Yang, Shu-wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-yi and Kao, Chieh-Chi and others},
journal = {arXiv preprint arXiv:2506.00736},
year = {2025},
}