Stability AI has unveiled Stable Audio, a latent diffusion model designed for controllable audio generation. The model is conditioned on text metadata together with the duration and start time of the audio, which gives users control over both the content and the length of the generated audio and even enables the creation of complete songs.
Stable Audio addresses a significant limitation of previous audio diffusion models, which could not generate audio of a specified duration. Those models were trained on random chunks cropped from longer audio files and cut or padded to a predetermined length, so they produced fixed-size outputs that could start or stop in the middle of a musical phrase. Stable Audio overcomes this through its duration and start time conditioning, which lets the model generate audio of a requested length, up to the size of its training window. In addition, the model works on a heavily downsampled latent representation of audio, which makes inference much faster than operating on raw audio.
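To make the duration and start time conditioning more concrete, here is a minimal sketch of how such timing values could be derived for a training chunk and how a fixed-size output could be trimmed to a requested length. It is an illustration under stated assumptions, not Stability AI's implementation: the names `TimingConditioning`, `sample_training_chunk`, and `trim_to_duration`, the field names, and the window sizes used in the example are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class TimingConditioning:
    # Hypothetical field names for the two timing values described above.
    seconds_start: float  # where the chunk begins within the full file
    seconds_total: float  # duration of the full file


def sample_training_chunk(total_samples, sample_rate, window_samples, rng):
    """Pick a random fixed-size window and record where it sits in the file."""
    # Files shorter than the window can only start at 0 (they would be padded).
    max_start = max(0, total_samples - window_samples)
    start = rng.randint(0, max_start)
    cond = TimingConditioning(
        seconds_start=start / sample_rate,
        seconds_total=total_samples / sample_rate,
    )
    return start, cond


def trim_to_duration(num_generated_samples, requested_seconds, sample_rate):
    """Return how many samples to keep from a fixed-size generated output."""
    requested = int(round(requested_seconds * sample_rate))
    # The output can never be longer than what the model actually produced.
    return min(num_generated_samples, requested)


if __name__ == "__main__":
    rng = random.Random(0)
    sr = 44_100
    # Assumed example: an 80-second file and a 30-second training window.
    start, cond = sample_training_chunk(80 * sr, sr, 30 * sr, rng)
    print(start, cond)
    # Keep 30 seconds out of an assumed 95-second generated window.
    print(trim_to_duration(95 * sr, 30.0, sr))
```

During training, the model would see the two timing values alongside each chunk; at generation time, setting the start to zero and the total to the desired duration asks for a complete piece of that length, and the remainder of the fixed-size output can be trimmed away.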