From dc64cd2141ddd4d92789b8145b159a6ca205e7d6 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Fri, 28 Nov 2025 16:24:31 +0000
Subject: [PATCH 1/4] more tts pipeline examples

---
 docs/source/en/tasks/text-to-speech.md | 52 ++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index b285352acefd..73d6a1d10c71 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -24,9 +24,25 @@ languages and for multiple speakers. Several text-to-speech models are currently
 
 You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
 can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
 
-```py
+```python
 >>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+>>> output = pipe("Hello from Sesame.")
+```
+
+Here's a code snippet you can use to listen to the resulting audio in a notebook:
+
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do conversational TTS, here is an example with Dia:
+
+```python
+>>> from transformers import pipeline
 
 >>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
@@ -34,14 +50,42 @@ Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
 >>> output = pipe(text)
 ```
 
-Here's a code snippet you can use to listen to the resulting audio in a notebook:
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do voice cloning with CSM:
 
+```python
+>>> import soundfile as sf
+>>> import torch
+>>> from datasets import Audio, load_dataset
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+
+>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
+>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+>>> conversation = [
+...     {
+...         "role": "0",
+...         "content": [
+...             {"type": "text", "text": "What are you working on?"},
+...             {"type": "audio", "path": ds[0]["audio"]["array"]},
+...         ],
+...     },
+...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
+... ]
+>>> output = pipe(conversation)
+```
+
 ```python
 >>> from IPython.display import Audio
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-For more examples on what Bark and other pretrained TTS models can do, refer to our
+For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers

From b0736aceb117e4539772ed57005864922df00430 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:24:30 +0000
Subject: [PATCH 2/4] remove duplicate code

---
 docs/source/en/tasks/text-to-speech.md | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index 73d6a1d10c71..60b12ad1b422 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -50,11 +50,6 @@ You can also do conversational TTS, here is an example with Dia:
 >>> output = pipe(text)
 ```
 
-```python
->>> from IPython.display import Audio
->>> Audio(output["audio"], rate=output["sampling_rate"])
-```
-
 You can also do voice cloning with CSM:
 
 ```python
@@ -80,11 +75,6 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
-```python
->>> from IPython.display import Audio
->>> Audio(output["audio"], rate=output["sampling_rate"])
-```
-
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 

From 274c633de7be9adefd53c5faafe0358237557fc6 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:27:45 +0000
Subject: [PATCH 3/4] group examples by model

---
 docs/source/en/tasks/text-to-speech.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index 60b12ad1b422..b2c30e4d6bb7 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -22,9 +22,8 @@ Text-to-speech (TTS) is the task of creating natural-sounding speech from text,
 languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as
 [Dia](../model_doc/dia), [CSM](../model_doc/csm), [Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
 
-You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
-can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
+You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`).
+Here's an example of how you would use the `"text-to-speech"` pipeline with [CSM](https://huggingface.co/sesame/csm-1b):
 
 ```python
 >>> from transformers import pipeline
@@ -40,16 +39,6 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook:
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-You can also do conversational TTS, here is an example with Dia:
-
-```python
->>> from transformers import pipeline
-
->>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
->>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
->>> output = pipe(text)
-```
-
 You can also do voice cloning with CSM:
 
 ```python
@@ -75,6 +64,16 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
+You can also do conversational TTS, here is an example with Dia:
+
+```python
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
+>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
+>>> output = pipe(text)
+```
+
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 

From 8b4343a71f1afc983c1a07cf475c3e745dc0c2aa Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:29:59 +0000
Subject: [PATCH 4/4] nit

---
 docs/source/en/tasks/text-to-speech.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index b2c30e4d6bb7..297f936cd6b0 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -39,7 +39,7 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook:
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-You can also do voice cloning with CSM:
+By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:
 
 ```python
 >>> import soundfile as sf
@@ -64,7 +64,7 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
-You can also do conversational TTS, here is an example with Dia:
+Some models, like [Dia](https://huggingface.co/nari-labs/Dia-1.6B-0626), can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:
 
 ```python
 >>> from transformers import pipeline
@@ -74,6 +74,8 @@ You can also do conversational TTS, here is an example with Dia:
 >>> output = pipe(text)
 ```
 
+Note that Dia also accepts speaker tags such as [S1] and [S2] to generate a conversation between unique voices.
+
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
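
To keep the generated speech beyond a notebook session, you can also write it to a WAV file with `soundfile` (imported but unused in the voice cloning example above). Here is a minimal sketch, assuming the `output` dictionary returned by any of the pipeline calls in these patches; the file name is illustrative:

```python
>>> import numpy as np
>>> import soundfile as sf

>>> # The pipeline returns the waveform and its sampling rate; some models
>>> # may return a (1, num_samples) array, so flatten it to 1-D before writing.
>>> sf.write("speech.wav", np.squeeze(output["audio"]), output["sampling_rate"])
```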