diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md
index b285352acefd..73d6a1d10c71 100644
--- a/docs/source/en/tasks/text-to-speech.md
+++ b/docs/source/en/tasks/text-to-speech.md
@@ -24,9 +24,25 @@ languages and for multiple speakers. Several text-to-speech models are currently
 
 You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
 can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+Here's an example of how you would use the `"text-to-speech"` pipeline with CSM:
 
-```py
+```python
 >>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+>>> output = pipe("Hello from Sesame.")
+```
+
+Here's a code snippet you can use to listen to the resulting audio in a notebook:
+
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do conversational TTS; here is an example with Dia:
+
+```python
+>>> from transformers import pipeline
 
 >>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
@@ -34,14 +50,42 @@ Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
 >>> output = pipe(text)
 ```
 
-Here's a code snippet you can use to listen to the resulting audio in a notebook:
+```python
+>>> from IPython.display import Audio
+>>> Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+You can also do voice cloning with CSM:
+
+```python
+>>> import soundfile as sf
+>>> import torch
+>>> from datasets import Audio, load_dataset
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+
+>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
+>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+>>> conversation = [
+...     {
+...         "role": "0",
+...         "content": [
+...             {"type": "text", "text": "What are you working on?"},
+...             {"type": "audio", "path": ds[0]["audio"]["array"]},
+...         ],
+...     },
+...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
+... ]
+>>> output = pipe(conversation)
+```
 
 ```python
 >>> from IPython.display import Audio
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-For more examples on what Bark and other pretrained TTS models can do, refer to our
+For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers
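
The updated intro says Dia can be conditioned to generate non-verbal communications such as laughing or sighing, but the diff never shows what that conditioning looks like. Here is a minimal sketch, assuming the inline conventions from the Dia model card (speaker tags like `[S1]`/`[S2]`, non-verbal cues in parentheses); the dialogue itself is made up:

```python
>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
>>> # Speaker tags and parenthesized cues such as (laughs) and (sighs) follow
>>> # the Dia model card's conventions; the script is illustrative only.
>>> text = "[S1] I finally got the pipeline working. (laughs) [S2] Great, now write the docs. (sighs)"
>>> output = pipe(text)
```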
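
The voice-cloning snippet imports `soundfile` without using it; the natural companion step is writing the generated audio to disk rather than playing it in a notebook. A minimal sketch, reusing the `output` dict returned by any of the pipeline calls above (the filename is illustrative):

```python
>>> import soundfile as sf

>>> # output["audio"] is a NumPy array and output["sampling_rate"] an int;
>>> # squeeze() drops a possible leading channel dimension before writing.
>>> sf.write("speech.wav", output["audio"].squeeze(), output["sampling_rate"])
```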