Add AudioLDM #2232
Merged
79 commits
88b77bc
Add AudioLDM
9b353b0
up
1a3ea27
add vocoder
22a315d
Merge branch 'main' into audioldm
patrickvonplaten 1023f68
start unet
81bff99
unconditional unet
990b622
Merge remote-tracking branch 'origin/audioldm' into audioldm
6aa4fda
clap, vocoder and vae
2482b42
clean-up: conversion scripts
9d986c4
fix: conversion script token_type_ids
004fed8
clean-up: pipeline docstring
9feb6ba
tests: from SD
bf3964c
clean-up: cpu offload vocoder instead of safety checker
f200e80
feat: adapt tests to audioldm
dd04c2e
feat: add docs
1c26ca9
clean-up: amend pipeline docstrings
d32bd7f
clean-up: make style
447013e
clean-up: make fix-copies
08d6a1f
fix: add doc path to toctree
9597761
clean-up: args for conversion script
10c584d
clean-up: paths to checkpoints
0f15408
fix: use conditional unet
d99c9e8
clean-up: make style
293f2a4
fix: type hints for UNet
8b52493
clean-up: docstring for UNet
13f6f3e
Merge branch 'main' into audioldm
sanchit-gandhi 0039921
clean-up: make style
1222a63
Merge remote-tracking branch 'origin/audioldm' into audioldm
0be0789
clean-up: remove duplicate in docstring
3f5f863
clean-up: make style
3033ac1
clean-up: make fix-copies
dd1882f
clean-up: move imports to start in code snippet
4471f08
fix: pass cross_attention_dim as a list/tuple to unet
e81696f
clean-up: make fix-copies
b8165a1
fix: update checkpoint path
1a1dc58
fix: unet cross_attention_dim in tests
3947e37
film embeddings -> class embeddings
williamberman 1503f75
Apply suggestions from code review
sanchit-gandhi cccf556
Merge pull request #1 from williamberman/will/audioldm
sanchit-gandhi 074f883
fix: unet film embed to use existing args
94dc761
fix: unet tests to use existing args
e66476e
fix: make style
9f776a2
fix: transformers import and version in init
5d6d1f8
clean-up: make style
8b4ea07
Revert "clean-up: make style"
876f241
clean-up: make style
dfc1c85
clean-up: use pipeline tester mixin tests where poss
ad80911
clean-up: skip attn slicing test
4d2b902
Merge branch 'main' into audioldm
patrickvonplaten 68cc47e
fix: add torch dtype to docs
6bc6a75
fix: remove conversion script out of src
99a3388
fix: remove .detach from 1d waveform
ed9be20
fix: reduce default num inf steps
87755de
fix: swap height/width -> audio_length_in_s
42294e5
clean-up: make style
c62a070
Merge remote-tracking branch 'origin/audioldm' into audioldm
21d6448
fix: remove nightly tests
01f9ade
fix: imports in conversion script
a9faabb
clean-up: slim-down to two slow tests
9f26689
clean-up: slim-down fast tests
7bc812d
fix: batch consistent tests
f0002f1
clean-up: make style
a0a156a
clean-up: remove vae slicing fast test
a01022a
clean-up: propagate changes to doc
460231e
fix: increase test tol to 1e-2
9cb4426
Merge branch 'main' into audioldm
sanchit-gandhi c8a7436
clean-up: finish docs
01fbbcf
Merge remote-tracking branch 'origin/audioldm' into audioldm
ee67277
clean-up: make style
5620390
Merge branch 'main' into audioldm
patrickvonplaten d8ab1a1
feat: vocoder / VAE compatibility check
56e3fb9
feat: possibly expand / cut audio waveform
e66dfc7
fix: pipeline call signature test
4d7849e
Merge remote-tracking branch 'origin/audioldm' into audioldm
7ed071a
fix: slow tests output len
b90d564
clean-up: make style
ef0e8b3
Merge branch 'main' into audioldm
patrickvonplaten b0ade43
Merge branch 'main' into audioldm
patrickvonplaten ef6c8e0
make style
patrickvonplaten
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AudioLDM

## Overview

AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.

This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).
## Text-to-Audio

The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm](https://huggingface.co/cvssp/audioldm) and generate text-conditional audio outputs:

```python
import torch
import scipy.io.wavfile

from diffusers import AudioLDMPipeline

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# save the audio sample as a .wav file (AudioLDM generates 16 kHz audio)
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```
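`scipy.io.wavfile.write` accepts the float32 waveform directly, but some audio tools expect 16-bit PCM. As a minimal sketch of that conversion (the `float_to_int16` helper is hypothetical, not part of `diffusers` or `scipy`; a sine wave stands in for a generated sample):

```python
import numpy as np


def float_to_int16(waveform):
    """Clip a float waveform to [-1, 1] and scale it to 16-bit PCM."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


# stand-in for a generated sample: one second of a 440 Hz sine at 16 kHz
audio = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0).astype(np.float32)
pcm = float_to_int16(audio)  # dtype int16, same length as the input
```

The resulting array can be passed to `scipy.io.wavfile.write` in place of the float data.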
### Tips

Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects that the model may not be familiar with.

Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
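As a rough guide, the length of the returned waveform scales with `audio_length_in_s` and the 16 kHz vocoder sample rate. A back-of-the-envelope sketch (the `approx_num_samples` helper is illustrative, not a `diffusers` API; the pipeline may round the requested length to a whole number of latent frames, so treat the result as approximate):

```python
SAMPLE_RATE = 16000  # AudioLDM's vocoder generates audio at 16 kHz


def approx_num_samples(audio_length_in_s, rate=SAMPLE_RATE):
    """Approximate number of waveform samples for a requested duration."""
    return int(audio_length_in_s * rate)


five_s = approx_num_samples(5.0)  # the 5 s example above -> 80000 samples
ten_s = approx_num_samples(10.0)  # 10 s -> 160000 samples
```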
### How to load and use different schedulers

The AudioLDM pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers
that can be used with the AudioLDM pipeline, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`],
[`EulerAncestralDiscreteScheduler`], etc. We recommend the [`DPMSolverMultistepScheduler`], as it is currently the fastest
scheduler available.

To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the
[`DPMSolverMultistepScheduler`], you can do the following:

```python
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch

>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
```
## AudioLDMPipeline
[[autodoc]] AudioLDMPipeline
	- all
	- __call__
cool!