From dc64cd2141ddd4d92789b8145b159a6ca205e7d6 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Fri, 28 Nov 2025 16:24:31 +0000
Subject: [PATCH 1/4] more tts pipeline examples

---
 docs/source/en/tasks/text-to-speech.md | 52 ++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index b285352acefd..73d6a1d10c71 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -24,9 +24,25 @@ languages and for multiple speakers. Several text-to-speech models are currently
 
 You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
 can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
 
-```py
+```python
 >>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+>>> output = pipe("Hello from Sesame.")
+```
+
+Here's a code snippet you can use to listen to the resulting audio in a notebook:
+
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do conversational TTS, here is an example with Dia:
+
+```python
+>>> from transformers import pipeline
 
 >>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
@@ -34,14 +50,42 @@ Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
 >>> output = pipe(text)
 ```
 
-Here's a code snippet you can use to listen to the resulting audio in a notebook:
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do voice cloning with CSM:
 
+```python
+>>> import soundfile as sf
+>>> import torch
+>>> from datasets import Audio, load_dataset
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+
+>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
+>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+>>> conversation = [
+...     {
+...         "role": "0",
+...         "content": [
+...             {"type": "text", "text": "What are you working on?"},
+...             {"type": "audio", "path": ds[0]["audio"]["array"]},
+...         ],
+...     },
+...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
+... ]
+>>> output = pipe(conversation)
+```
+
 ```python
 >>> from IPython.display import Audio
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-For more examples on what Bark and other pretrained TTS models can do, refer to our
+For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers

From b0736aceb117e4539772ed57005864922df00430 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:24:30 +0000
Subject: [PATCH 2/4] remove duplicate code

---
 docs/source/en/tasks/text-to-speech.md | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index 73d6a1d10c71..60b12ad1b422 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -50,11 +50,6 @@ You can also do conversational TTS, here is an example with Dia:
 >>> output = pipe(text)
 ```
 
-```python
->>> from IPython.display import Audio
->>> Audio(output["audio"], rate=output["sampling_rate"])
-```
-
 You can also do voice cloning with CSM:
 
 ```python
@@ -80,11 +75,6 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
-```python
->>> from IPython.display import Audio
->>> Audio(output["audio"], rate=output["sampling_rate"])
-```
-
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 

From 274c633de7be9adefd53c5faafe0358237557fc6 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:27:45 +0000
Subject: [PATCH 3/4] group examples by model

---
 docs/source/en/tasks/text-to-speech.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index 60b12ad1b422..b2c30e4d6bb7 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -22,9 +22,8 @@ Text-to-speech (TTS) is the task of creating natural-sounding speech from text,
 languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as
 [Dia](../model_doc/dia), [CSM](../model_doc/csm), [Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
 
-You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
-can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
+You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`).
+Here's an example of how you would use the `"text-to-speech"` pipeline with [CSM](https://huggingface.co/sesame/csm-1b):
 
 ```python
 >>> from transformers import pipeline
@@ -40,16 +39,6 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook:
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-You can also do conversational TTS, here is an example with Dia:
-
-```python
->>> from transformers import pipeline
-
->>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
->>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
->>> output = pipe(text)
-```
-
 You can also do voice cloning with CSM:
 
 ```python
@@ -75,6 +64,16 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
+You can also do conversational TTS, here is an example with Dia:
+
+```python
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
+>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
+>>> output = pipe(text)
+```
+
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 

From 8b4343a71f1afc983c1a07cf475c3e745dc0c2aa Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Tue, 2 Dec 2025 15:29:59 +0000
Subject: [PATCH 4/4] nit

---
 docs/source/en/tasks/text-to-speech.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index b2c30e4d6bb7..297f936cd6b0 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -39,7 +39,7 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook:
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-You can also do voice cloning with CSM:
+By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:
 
 ```python
 >>> import soundfile as sf
@@ -64,7 +64,7 @@ You can also do voice cloning with CSM:
 >>> output = pipe(conversation)
 ```
 
-You can also do conversational TTS, here is an example with Dia:
+Some models, like [Dia](https://huggingface.co/nari-labs/Dia-1.6B-0626), can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:
 
 ```python
 >>> from transformers import pipeline
@@ -74,6 +74,8 @@ You can also do conversational TTS, here is an example with Dia:
 >>> output = pipe(text)
 ```
 
+Note that Dia also accepts speaker tags such as [S1] and [S2] to generate a conversation between unique voices.
+
 For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
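
To keep the generated speech beyond a notebook session, you can also write it to a WAV file with `soundfile` (imported but unused in the voice cloning example above). Here is a minimal sketch, assuming the `output` dictionary returned by any of the pipeline calls in these patches; the file name is illustrative:

```python
>>> import numpy as np
>>> import soundfile as sf

>>> # The pipeline returns the waveform and its sampling rate; some models
>>> # may return a (1, num_samples) array, so flatten it to 1-D before writing.
>>> sf.write("speech.wav", np.squeeze(output["audio"]), output["sampling_rate"])
```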