More TTS pipeline examples #42484

…languages and for multiple speakers. Several text-to-speech models are currently…

You can easily generate audio using the `"text-to-audio"` pipeline (or its alias, `"text-to-speech"`). Some models, like Dia, can also be conditioned to generate non-verbal communication such as laughing, sighing and crying, or even to add music. Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:

```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
>>> output = pipe("Hello from Sesame.")
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```

> **Contributor comment on lines +38 to +41:** We have the same code below, would double check that we don't repeat ourselves.

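Outside a notebook, you can write the waveform to disk instead. Below is a minimal sketch using `soundfile` (the filename is illustrative, and it assumes `output["audio"]` is a NumPy array, squeezing out a possible batch dimension):

```python
>>> import numpy as np
>>> import soundfile as sf

>>> # soundfile expects mono audio as a 1-D array of samples
>>> sf.write("csm_output.wav", np.squeeze(output["audio"]), output["sampling_rate"])
```
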
You can also do conversational TTS. Here is an example with Dia:

```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
>>> output = pipe(text)
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```

> **Contributor comment on lines +43 to +56:** Move above line here like so: "Some models, like Dia, can also be conditioned to generate non-verbal communication such as laughing, sighing and crying, or even add music. Below is such an example:" Note that Dia also accepts speaker tags such as `[S1]` and `[S2]`.

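To illustrate the non-verbal conditioning mentioned above, prompts can mix speaker tags with cues such as `(laughs)` or `(sighs)`. Here is a minimal sketch reusing the Dia pipeline from the previous example (the exact set of supported tags depends on the checkpoint; see the model card):

```python
>>> # hypothetical prompt mixing speaker tags and non-verbal cues
>>> text = "[S1] That was hilarious! (laughs) [S2] (sighs) Fine, tell it again."
>>> output = pipe(text)
```
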
You can also do voice cloning with CSM:

> **Contributor comment:** I would move this CSM example just after the "Hello from Sesame" example, to do something like @vasqu mentioned (all CSM examples together, and then Dia). And introduce it like so: "By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:"

```python
>>> from datasets import Audio, load_dataset
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")

>>> # load a short reference clip and resample it to CSM's 24 kHz
>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> # the reference audio attached to speaker "0" conditions the generated voice
>>> conversation = [
...     {
...         "role": "0",
...         "content": [
...             {"type": "text", "text": "What are you working on?"},
...             {"type": "audio", "path": ds[0]["audio"]["array"]},
...         ],
...     },
...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
... ]
>>> output = pipe(conversation)
```

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```

> **Contributor comment on lines 83 to 86:** As @vasqu mentioned, we can remove duplicated snippets for running in a notebook.

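Like other 🤗 Transformers pipelines, the TTS pipelines accept the usual loading keyword arguments and lists of inputs. Here is a minimal sketch; the `torch_dtype` and `device_map` values are illustrative:

```python
>>> import torch
>>> from transformers import pipeline

>>> pipe = pipeline(
...     "text-to-audio",
...     model="sesame/csm-1b",
...     torch_dtype=torch.float16,  # illustrative: load weights in half precision
...     device_map="auto",  # illustrative: place the model on an available device
... )
>>> # passing a list of prompts returns a list of {"audio", "sampling_rate"} dicts
>>> outputs = pipe(["Hello from Sesame.", "See you soon."])
```
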
For more examples on what CSM and other pretrained TTS models can do, refer to our [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future.

> **Contributor comment:** Isn't the order a bit confusing? Would leave one model or the other but not both. I.e. "Some models, like Dia, ..." followed by "Here's an example ... with CSM".
>
> **Contributor comment:** Agreed. I would remove the line just before to only mention CSM, and also add a link to CSM: https://huggingface.co/sesame/csm-1b