Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state of the art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high audio quality and fidelity while ensuring fast inference. IMPACT applies iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach removes the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics, including Fréchet Distance (FD) and Fréchet Audio Distance (FAD), while significantly reducing latency compared to prior models.
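
The general shape of iterative mask-based parallel decoding over continuous latents can be illustrated with a toy loop. This is a minimal sketch and not the IMPACT implementation: the cosine unmasking schedule, the random unmasking order, and the names `cosine_masked_count`, `toy_latent_sampler`, and `iterative_parallel_decode` are all assumptions made for illustration, and the Gaussian sampler is only a stand-in for whatever diffusion-based sampler produces each latent.

```python
import math
import random


def cosine_masked_count(step, num_steps, seq_len):
    """Positions left masked after `step` of `num_steps` (assumed cosine schedule)."""
    ratio = math.cos(math.pi / 2 * step / num_steps)
    return int(math.floor(ratio * seq_len))


def toy_latent_sampler(position, dim, rng):
    """Hypothetical stand-in for a per-position continuous latent sampler."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def iterative_parallel_decode(seq_len=16, dim=4, num_steps=4, seed=0):
    rng = random.Random(seed)
    latents = [None] * seq_len      # None marks a still-masked position
    masked = list(range(seq_len))
    rng.shuffle(masked)             # random unmasking order (an assumption)
    history = []                    # number of positions filled at each step

    for step in range(1, num_steps + 1):
        if not masked:
            break
        # Force every remaining position to be filled on the final step.
        keep_masked = 0 if step == num_steps else cosine_masked_count(
            step, num_steps, seq_len
        )
        n_fill = min(len(masked), max(1, len(masked) - keep_masked))
        to_fill, masked = masked[:n_fill], masked[n_fill:]
        # All positions selected in this step are filled together,
        # which is the "parallel" part of the decoding.
        for pos in to_fill:
            latents[pos] = toy_latent_sampler(pos, dim, rng)
        history.append(n_fill)

    return latents, history


if __name__ == "__main__":
    latents, history = iterative_parallel_decode()
    print("positions filled per step:", history)
```

The point of the sketch is that the number of decoding steps is fixed and small relative to the sequence length, with several positions filled per step, rather than one position per step as in autoregressive decoding.
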
| Description | IMPACT Base | IMPACT Large | MAGNET-s | AudioLDM2 | Tango 2 | Ground Truth |
|---|---|---|---|---|---|---|
| Machine grinding wood | ||||||
| Firecrackers popping as a crowd of people cheer and whistle | ||||||
| A dog barks with distant birds chirping then people speak | ||||||
| A baby laughing loudly | ||||||
| A person is snoring | ||||||
| Bird chirping while waves come in with high wind | ||||||
| Helicopter engine running | ||||||
| Several gunshots with a click and glass breaking | ||||||
| Train horns honking as wind blows into a microphone while a group of people talk and an electronic beep repeatedly sounds during a vehicle engine running idle | ||||||
| A crowd murmurs as a siren blares and then stops at a distance | ||||||
| Church bells ringing | ||||||
| Birds chirping and water trickling | ||||||
| Emergency sirens wail as a truck engine accelerates and drives by | ||||||
| Very strong wind is blowing, and leaves are rustling on the trees | ||||||
@article{huang2025impact,
title = {IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling},
author = {Huang, Kuan-Po and Yang, Shu-wen and Phan, Huy and Lu, Bo-Ru and Kim, Byeonggeun and Macha, Sashank and Tang, Qingming and Ghosh, Shalini and Lee, Hung-yi and Kao, Chieh-Chi and others},
journal = {arXiv preprint arXiv:2506.00736},
year = {2025},
}