52 changes: 48 additions & 4 deletions docs/source/en/tasks/text-to-speech.md
@@ -24,24 +24,68 @@ languages and for multiple speakers. Several text-to-speech models are currently

You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
- Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+ Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
Comment on lines 25 to +27
Contributor: Isn't the order a bit confusing? I would leave one model or the other, but not both. I.e. the text says "Some models, like Dia, ..." but the example is "Here's an example ... with CSM".

Contributor: Agreed. I would remove the line just before to only mention CSM, and also add a link to CSM: https://huggingface.co/sesame/csm-1b


```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
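>>> # With no reference audio provided, CSM generates speech with a random voice.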
>>> output = pipe("Hello from Sesame.")
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines +38 to +41
Contributor: We have the same code below; I would double-check that we don't repeat ourselves.

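Outside a notebook, you can write the waveform to disk instead. Below is a minimal sketch, assuming the `soundfile` package is installed and that `output["audio"]` is a NumPy array (the `squeeze()` call is a precaution in case the array carries an extra leading dimension):

```python
>>> import soundfile as sf

>>> # "speech.wav" is an arbitrary output path, not one from the original docs.
>>> sf.write("speech.wav", output["audio"].squeeze(), output["sampling_rate"])
```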

You can also do conversational TTS; here is an example with Dia:

```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
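>>> # [S1]/[S2] tag alternating speakers; parenthesized cues such as (clears throat) add non-verbal sounds.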
>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
>>> output = pipe(text)
```

Here's a code snippet you can use to listen to the resulting audio in a notebook:

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines +43 to +56
Contributor: Move the above line here, like so:


Some models, like Dia, can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:

EXAMPLE

Note that Dia also accepts speaker tags such as [S1] and [S2] to generate a conversation between unique voices.


You can also do voice cloning with CSM:
Contributor: I would move this CSM example just after the "Hello from Sesame" example, to do something like @vasqu mentioned (all CSM examples together, and then Dia), and introduce it like so:

"By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:"


```python
>>> from datasets import Audio, load_dataset
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")

>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> conversation = [
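...     # First turn: reference audio with its transcript supplies the voice to clone.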
... {
... "role": "0",
... "content": [
... {"type": "text", "text": "What are you working on?"},
... {"type": "audio", "path": ds[0]["audio"]["array"]},
... ],
... },
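...     # Second turn: text only; the same role "0" means it is rendered in the cloned voice.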
... {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
... ]
>>> output = pipe(conversation)
```

```python
>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])
```
Comment on lines 83 to 86
Contributor: As @vasqu mentioned, we can remove duplicated snippets for running in a notebook.


- For more examples on what Bark and other pretrained TTS models can do, refer to our
+ For more examples on what CSM and other pretrained TTS models can do, refer to our
[Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future.