
Commit b94880e

sanchit-gandhi, patrickvonplaten, and williamberman authored
Add AudioLDM (#2232)
* Add AudioLDM
* up
* add vocoder
* start unet
* unconditional unet
* clap, vocoder and vae
* clean-up: conversion scripts
* fix: conversion script token_type_ids
* clean-up: pipeline docstring
* tests: from SD
* clean-up: cpu offload vocoder instead of safety checker
* feat: adapt tests to audioldm
* feat: add docs
* clean-up: amend pipeline docstrings
* clean-up: make style
* clean-up: make fix-copies
* fix: add doc path to toctree
* clean-up: args for conversion script
* clean-up: paths to checkpoints
* fix: use conditional unet
* clean-up: make style
* fix: type hints for UNet
* clean-up: docstring for UNet
* clean-up: make style
* clean-up: remove duplicate in docstring
* clean-up: make style
* clean-up: make fix-copies
* clean-up: move imports to start in code snippet
* fix: pass cross_attention_dim as a list/tuple to unet
* clean-up: make fix-copies
* fix: update checkpoint path
* fix: unet cross_attention_dim in tests
* film embeddings -> class embeddings
* Apply suggestions from code review (Co-authored-by: Will Berman <[email protected]>)
* fix: unet film embed to use existing args
* fix: unet tests to use existing args
* fix: make style
* fix: transformers import and version in init
* clean-up: make style
* Revert "clean-up: make style" (this reverts commit 5d6d1f8)
* clean-up: make style
* clean-up: use pipeline tester mixin tests where poss
* clean-up: skip attn slicing test
* fix: add torch dtype to docs
* fix: remove conversion script out of src
* fix: remove .detach from 1d waveform
* fix: reduce default num inf steps
* fix: swap height/width -> audio_length_in_s
* clean-up: make style
* fix: remove nightly tests
* fix: imports in conversion script
* clean-up: slim-down to two slow tests
* clean-up: slim-down fast tests
* fix: batch consistent tests
* clean-up: make style
* clean-up: remove vae slicing fast test
* clean-up: propagate changes to doc
* fix: increase test tol to 1e-2
* clean-up: finish docs
* clean-up: make style
* feat: vocoder / VAE compatibility check
* feat: possibly expand / cut audio waveform
* fix: pipeline call signature test
* fix: slow tests output len
* clean-up: make style
* make style

---------

Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: William Berman <[email protected]>
1 parent 1870fb0 commit b94880e

File tree

14 files changed (+2318, -24 lines)


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -134,6 +134,8 @@
       title: AltDiffusion
     - local: api/pipelines/audio_diffusion
       title: Audio Diffusion
+    - local: api/pipelines/audioldm
+      title: AudioLDM
     - local: api/pipelines/cycle_diffusion
       title: Cycle Diffusion
     - local: api/pipelines/dance_diffusion
```
docs/source/en/api/pipelines/audioldm.mdx

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AudioLDM

## Overview

AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.

This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).

## Text-to-Audio

The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm](https://huggingface.co/cvssp/audioldm) and generate text-conditional audio outputs:
```python
import torch
import scipy.io.wavfile

from diffusers import AudioLDMPipeline

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# save the audio sample as a .wav file (AudioLDM generates audio at a 16 kHz sampling rate)
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```
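
If you are working in a notebook, you can also listen to the generated sample inline instead of writing it to disk. A minimal sketch using IPython's display utilities (not part of `diffusers` itself), assuming the 16 kHz waveform `audio` from the snippet above:

```python
from IPython.display import Audio

# render an inline audio player for the generated 16 kHz waveform
Audio(data=audio, rate=16000)
```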

### Tips

Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects that the model may not be familiar with.

Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps give higher-quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. Both arguments are shown in the sketch below.
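
To make the speed/quality and length trade-offs concrete, here is a minimal sketch that re-uses the `pipe` object from the snippet above. The `generator` seeding is an optional addition (a standard `diffusers` pipeline argument) so that the two runs differ only in the arguments being compared, and the step counts and clip lengths are illustrative:

```python
import torch

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"

# fast draft: few denoising steps and a short clip (quick, lower quality)
draft = pipe(
    prompt,
    num_inference_steps=10,
    audio_length_in_s=2.5,
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]

# higher quality: more denoising steps and a longer clip (slower inference)
final = pipe(
    prompt,
    num_inference_steps=100,
    audio_length_in_s=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]
```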

### How to load and use different schedulers

The AudioLDM pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers
that can be used with it, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`],
[`EulerAncestralDiscreteScheduler`], etc. We recommend the [`DPMSolverMultistepScheduler`], as it is currently the
fastest scheduler available.

To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
method, or pass the `scheduler` argument to the pipeline's `from_pretrained` method. For example, to use the
[`DPMSolverMultistepScheduler`], you can do the following:
```python
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch

>>> # option 1: load the pipeline, then swap in the new scheduler from its config
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

>>> # option 2: load the scheduler first, then pass it to `from_pretrained`
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
```

## AudioLDMPipeline
[[autodoc]] AudioLDMPipeline
	- all
	- __call__
