Add AudioLDM #2232

Merged
merged 79 commits into from
Mar 23, 2023
79 commits
88b77bc
Add AudioLDM
Feb 3, 2023
9b353b0
up
Feb 3, 2023
1a3ea27
add vocoder
Feb 9, 2023
22a315d
Merge branch 'main' into audioldm
patrickvonplaten Feb 13, 2023
1023f68
start unet
Feb 14, 2023
81bff99
unconditional unet
Feb 14, 2023
990b622
Merge remote-tracking branch 'origin/audioldm' into audioldm
Feb 16, 2023
6aa4fda
clap, vocoder and vae
Feb 17, 2023
2482b42
clean-up: conversion scripts
Feb 20, 2023
9d986c4
fix: conversion script token_type_ids
Feb 20, 2023
004fed8
clean-up: pipeline docstring
Feb 20, 2023
9feb6ba
tests: from SD
Feb 20, 2023
bf3964c
clean-up: cpu offload vocoder instead of safety checker
Feb 20, 2023
f200e80
feat: adapt tests to audioldm
Feb 21, 2023
dd04c2e
feat: add docs
Feb 21, 2023
1c26ca9
clean-up: amend pipeline docstrings
Feb 21, 2023
d32bd7f
clean-up: make style
Feb 21, 2023
447013e
clean-up: make fix-copies
Feb 21, 2023
08d6a1f
fix: add doc path to toctree
Feb 21, 2023
9597761
clean-up: args for conversion script
Feb 21, 2023
10c584d
clean-up: paths to checkpoints
Feb 21, 2023
0f15408
fix: use conditional unet
Feb 21, 2023
d99c9e8
clean-up: make style
Feb 21, 2023
293f2a4
fix: type hints for UNet
Feb 21, 2023
8b52493
clean-up: docstring for UNet
Feb 21, 2023
13f6f3e
Merge branch 'main' into audioldm
sanchit-gandhi Feb 21, 2023
0039921
clean-up: make style
Feb 21, 2023
1222a63
Merge remote-tracking branch 'origin/audioldm' into audioldm
Feb 21, 2023
0be0789
clean-up: remove duplicate in docstring
Feb 21, 2023
3f5f863
clean-up: make style
Feb 21, 2023
3033ac1
clean-up: make fix-copies
Feb 21, 2023
dd1882f
clean-up: move imports to start in code snippet
Feb 21, 2023
4471f08
fix: pass cross_attention_dim as a list/tuple to unet
Feb 22, 2023
e81696f
clean-up: make fix-copies
Feb 22, 2023
b8165a1
fix: update checkpoint path
Feb 22, 2023
1a1dc58
fix: unet cross_attention_dim in tests
Feb 22, 2023
3947e37
film embeddings -> class embeddings
williamberman Feb 24, 2023
1503f75
Apply suggestions from code review
sanchit-gandhi Feb 27, 2023
cccf556
Merge pull request #1 from williamberman/will/audioldm
sanchit-gandhi Feb 27, 2023
074f883
fix: unet film embed to use existing args
Feb 27, 2023
94dc761
fix: unet tests to use existing args
Feb 27, 2023
e66476e
fix: make style
Feb 27, 2023
9f776a2
fix: transformers import and version in init
Feb 27, 2023
5d6d1f8
clean-up: make style
Feb 27, 2023
8b4ea07
Revert "clean-up: make style"
Feb 27, 2023
876f241
clean-up: make style
Feb 27, 2023
dfc1c85
clean-up: use pipeline tester mixin tests where poss
Feb 27, 2023
ad80911
clean-up: skip attn slicing test
Feb 27, 2023
4d2b902
Merge branch 'main' into audioldm
patrickvonplaten Mar 7, 2023
68cc47e
fix: add torch dtype to docs
Mar 17, 2023
6bc6a75
fix: remove conversion script out of src
Mar 17, 2023
99a3388
fix: remove .detach from 1d waveform
Mar 17, 2023
ed9be20
fix: reduce default num inf steps
Mar 17, 2023
87755de
fix: swap height/width -> audio_length_in_s
Mar 17, 2023
42294e5
clean-up: make style
Mar 17, 2023
c62a070
Merge remote-tracking branch 'origin/audioldm' into audioldm
Mar 17, 2023
21d6448
fix: remove nightly tests
Mar 17, 2023
01f9ade
fix: imports in conversion script
Mar 17, 2023
a9faabb
clean-up: slim-down to two slow tests
Mar 17, 2023
9f26689
clean-up: slim-down fast tests
Mar 17, 2023
7bc812d
fix: batch consistent tests
Mar 17, 2023
f0002f1
clean-up: make style
Mar 17, 2023
a0a156a
clean-up: remove vae slicing fast test
Mar 17, 2023
a01022a
clean-up: propagate changes to doc
Mar 17, 2023
460231e
fix: increase test tol to 1e-2
Mar 17, 2023
9cb4426
Merge branch 'main' into audioldm
sanchit-gandhi Mar 17, 2023
c8a7436
clean-up: finish docs
Mar 17, 2023
01fbbcf
Merge remote-tracking branch 'origin/audioldm' into audioldm
Mar 17, 2023
ee67277
clean-up: make style
Mar 17, 2023
5620390
Merge branch 'main' into audioldm
patrickvonplaten Mar 21, 2023
d8ab1a1
feat: vocoder / VAE compatibility check
Mar 23, 2023
56e3fb9
feat: possibly expand / cut audio waveform
Mar 23, 2023
e66dfc7
fix: pipeline call signature test
Mar 23, 2023
4d7849e
Merge remote-tracking branch 'origin/audioldm' into audioldm
Mar 23, 2023
7ed071a
fix: slow tests output len
Mar 23, 2023
b90d564
clean-up: make style
Mar 23, 2023
ef0e8b3
Merge branch 'main' into audioldm
patrickvonplaten Mar 23, 2023
b0ade43
Merge branch 'main' into audioldm
patrickvonplaten Mar 23, 2023
ef6c8e0
make style
patrickvonplaten Mar 23, 2023
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -134,6 +134,8 @@
title: AltDiffusion
- local: api/pipelines/audio_diffusion
title: Audio Diffusion
- local: api/pipelines/audioldm
title: AudioLDM
- local: api/pipelines/cycle_diffusion
title: Cycle Diffusion
- local: api/pipelines/dance_diffusion
82 changes: 82 additions & 0 deletions docs/source/en/api/pipelines/audioldm.mdx
@@ -0,0 +1,82 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AudioLDM

## Overview

AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://arxiv.org/abs/2301.12503) by Haohe Liu et al.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.

This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found [here](https://github.com/haoheliu/AudioLDM).

## Text-to-Audio

The [`AudioLDMPipeline`] can be used to load pre-trained weights from [cvssp/audioldm](https://huggingface.co/cvssp/audioldm) and generate text-conditional audio outputs:

```python
from diffusers import AudioLDMPipeline
import torch
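import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` may not load it on older SciPy versions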

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# save the audio sample as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```
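
If SciPy is not available, a waveform can also be written with Python's standard `wave` module. The following is only a sketch under stated assumptions: `write_wav_16bit` is a hypothetical helper (not part of `diffusers`), the sine tone stands in for the pipeline's `audio` output, and the waveform is assumed to be a 1-D sequence of floats in [-1, 1] sampled at 16 kHz, as in the example above. Unlike the SciPy call, this writes 16-bit integer PCM rather than floating-point samples.

```python
import math
import struct
import wave


def write_wav_16bit(path, samples, rate=16000):
    """Write a mono waveform of floats in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)  # mono
        f.setsampwidth(2)  # 2 bytes = 16 bits per sample
        f.setframerate(rate)
        # Little-endian signed 16-bit integers, one per sample.
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Stand-in for `audio`: one second of a 440 Hz tone at 16 kHz (assumption).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
num_written = write_wav_16bit("tone.wav", tone)
```

With a real pipeline output, `tone` would be replaced by the generated `audio` array.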

### Tips

Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.

Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: more steps give higher-quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
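
As a rough guide to what `audio_length_in_s` implies for the output, the number of waveform samples is the duration multiplied by the sampling rate. The sketch below assumes the 16 kHz rate used when saving the file in the example above. `expected_num_samples` is a hypothetical helper, not part of `diffusers`, and the pipeline may round the length internally before trimming the waveform.

```python
def expected_num_samples(audio_length_in_s, sampling_rate=16000):
    """Approximate number of samples in a waveform of the given duration.

    Hypothetical helper; the 16 kHz default is an assumption taken from
    the `rate=16000` used when writing the .wav file.
    """
    if audio_length_in_s <= 0:
        raise ValueError("audio_length_in_s must be positive")
    return int(audio_length_in_s * sampling_rate)


# A 5-second clip at 16 kHz holds 5.0 * 16000 samples.
five_seconds = expected_num_samples(5.0)
# Halving the duration halves the sample count.
short_clip = expected_num_samples(2.5)
```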

### How to load and use different schedulers

The AudioLDM pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers
that can be used with it, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`] and
[`EulerAncestralDiscreteScheduler`]. We recommend the [`DPMSolverMultistepScheduler`], as it is currently the fastest
scheduler available.

To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`]
method, or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the
[`DPMSolverMultistepScheduler`], you can do the following:

```python
>>> from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
>>> import torch

>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16)
>>> pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

>>> # or
>>> dpm_scheduler = DPMSolverMultistepScheduler.from_pretrained("cvssp/audioldm", subfolder="scheduler")
>>> pipeline = AudioLDMPipeline.from_pretrained("cvssp/audioldm", scheduler=dpm_scheduler, torch_dtype=torch.float16)
```
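
The first variant works because a new scheduler can be built from the existing scheduler's configuration: shared settings carry over, and settings the new class does not accept are dropped. The toy sketch below illustrates that pattern only. `ToyScheduler`, `ToyDDIM` and `ToyDPMSolver` are hypothetical stand-ins, not the `diffusers` classes, and the parameter names and values are illustrative assumptions.

```python
import inspect


class ToyScheduler:
    """Toy stand-in for a scheduler; NOT the diffusers implementation."""

    def __init__(self, num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02):
        self.config = {
            "num_train_timesteps": num_train_timesteps,
            "beta_start": beta_start,
            "beta_end": beta_end,
        }

    @classmethod
    def from_config(cls, config):
        # Keep only the keys this class's __init__ accepts; ignore the rest.
        accepted = inspect.signature(cls.__init__).parameters
        return cls(**{k: v for k, v in config.items() if k in accepted})


class ToyDDIM(ToyScheduler):
    def __init__(self, num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02, clip_sample=False):
        super().__init__(num_train_timesteps, beta_start, beta_end)
        self.config["clip_sample"] = clip_sample  # setting specific to this toy class


class ToyDPMSolver(ToyScheduler):
    def __init__(self, num_train_timesteps=1000, beta_start=0.0001, beta_end=0.02, solver_order=2):
        super().__init__(num_train_timesteps, beta_start, beta_end)
        self.config["solver_order"] = solver_order  # setting specific to this toy class


# Swap schedulers: shared settings carry over, `clip_sample` is dropped,
# and `solver_order` falls back to its default.
old = ToyDDIM(beta_start=0.0015, beta_end=0.0195)
new = ToyDPMSolver.from_config(old.config)
```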

## AudioLDMPipeline
[[autodoc]] AudioLDMPipeline
- all
- __call__