52 changes: 48 additions & 4 deletions docs/source/en/tasks/text-to-speech.md
@@ -24,24 +24,68 @@ languages and for multiple speakers. Several text-to-speech models are currently

You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
- Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+ Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
Comment on lines 25 to +27
Contributor: Isn't the order a bit confusing? I would leave one model or the other, but not both. I.e. the text says "Some models, like Dia, ..." but the example is "Here's an example ... with CSM".

Contributor: Agreed. I would remove the line just before to only mention CSM, and also add a link to CSM: https://huggingface.co/sesame/csm-1b


```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
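>>> # With no reference audio provided, CSM generates speech with a random voice.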
>>> output = pipe("Hello from Sesame.")
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines +38 to +41
Contributor: We have the same code below; I would double-check that we don't repeat ourselves.

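Outside a notebook, you can write the waveform to disk instead. Below is a minimal sketch, assuming the `soundfile` package is installed and that `output["audio"]` is a NumPy array (the `squeeze()` call is a precaution in case the array carries an extra leading dimension):

```python
>>> import soundfile as sf

>>> # "speech.wav" is an arbitrary output path, not one from the original docs.
>>> sf.write("speech.wav", output["audio"].squeeze(), output["sampling_rate"])
```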

You can also do conversational TTS; here is an example with Dia:

```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
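>>> # [S1]/[S2] tag alternating speakers; parenthesized cues such as (clears throat) add non-verbal sounds.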
>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
>>> output = pipe(text)
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines +43 to +56
Contributor: Move the above line here, like so:


Some models, like Dia, can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:

EXAMPLE

Note that Dia also accepts speaker tags such as [S1] and [S2] to generate a conversation between unique voices.


You can also do voice cloning with CSM:
Contributor: I would move this CSM example just after the "Hello from Sesame" example, to do something like @vasqu mentioned (all CSM examples together, and then Dia), and introduce it like so:

"By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:"


```python
>>> from datasets import Audio, load_dataset
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")

>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> conversation = [
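...     # First turn: reference audio with its transcript supplies the voice to clone.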
... {
... "role": "0",
... "content": [
... {"type": "text", "text": "What are you working on?"},
... {"type": "audio", "path": ds[0]["audio"]["array"]},
... ],
... },
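...     # Second turn: text only; the same role "0" means it is rendered in the cloned voice.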
... {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
... ]
>>> output = pipe(conversation)
```

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines 83 to 86
Contributor: As @vasqu mentioned, we can remove duplicated snippets for running in a notebook.


- For more examples on what Bark and other pretrained TTS models can do, refer to our
+ For more examples on what CSM and other pretrained TTS models can do, refer to our
[Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future.