Conversation

@ebezzam (Contributor) commented Nov 28, 2025

What does this PR do?

Related to an offline discussion with @eustlb and @Deep-unlearning, let's change the default pipeline TTS behavior to make it easier for users.

I pinned output_audio=True for CSM, and the pipeline now also inserts speaker IDs automatically (for CSM and Dia) when the user omits them, to make simple TTS usage more intuitive.

See some CSM and Dia examples below.

import soundfile as sf
import torch
from datasets import Audio, load_dataset

from transformers import pipeline


device = "cuda" if torch.cuda.is_available() else "cpu"


"""CSM"""
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# -- minimal TTS example
torch.manual_seed(0)
outputs = pipe("Hello from Sesame.")     # instead of pipe("[0]Hello from Sesame.")
fn = "csm_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal TTS example with voice cloning
torch.manual_seed(0)
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")


"""Dia"""
pipe = pipeline("text-to-audio", model="nari-labs/Dia-1.6B-0626", device=device)

# -- minimal TTS example
torch.manual_seed(42)
outputs = pipe(
    "Dia is an open weights text to dialogue model.",      # instead of pipe("[S1] Dia is an open weights text to dialogue model.")
    generate_kwargs={"max_new_tokens": 256},
)
fn = "dia_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal conversation example
# note: Dia doesn't support the chat template for voice cloning;
# explicit model loading should be used instead: https://huggingface.co/nari-labs/Dia-1.6B-0626#generation-with-text-and-audio-voice-cloning
torch.manual_seed(0)
outputs = pipe(
    "[S1] Dia is an open weights text to dialogue model. [S2] That's cool, tell me how it works.",
)
fn = "dia_pipeline_conversation.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

@Deep-unlearning, what do you think about adding such examples to the TTS page (while pruning the verbose comments)?

At least the CSM voice cloning example (and pointing to this dataset so they know what the original voice sounds like):

import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline


device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# prepare input
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # -- audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"}, 
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # -- desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

@eustlb (Contributor) left a comment:

LGTM, thanks @ebezzam! 🤗

Comment on lines +175 to +179
# Add speaker ID if needed and user didn't insert at start of text
if self.model.config.model_type == "csm":
    text = [f"[0]{t}" if not t.startswith("[") else t for t in text]
if self.model.config.model_type == "dia":
    text = [f"[S1] {t}" if not t.startswith("[") else t for t in text]
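For illustration, the speaker-ID defaulting in the snippet above can be exercised standalone (a minimal sketch; `add_default_speaker` is a hypothetical helper, and `model_type` is passed directly instead of being read from `self.model.config`):

```python
# Minimal standalone sketch of the speaker-ID defaulting shown above.
# add_default_speaker is a hypothetical name; model_type is passed in
# directly rather than read from self.model.config.model_type.
def add_default_speaker(text, model_type):
    if model_type == "csm":
        # CSM expects a "[0]"-style speaker tag at the start of the text
        return [f"[0]{t}" if not t.startswith("[") else t for t in text]
    if model_type == "dia":
        # Dia expects an "[S1] "-style speaker tag
        return [f"[S1] {t}" if not t.startswith("[") else t for t in text]
    return text

print(add_default_speaker(["Hello from Sesame."], "csm"))  # ['[0]Hello from Sesame.']
print(add_default_speaker(["[1]Already tagged."], "csm"))  # ['[1]Already tagged.']
print(add_default_speaker(["Tell me how it works."], "dia"))  # ['[S1] Tell me how it works.']
```

Text that already starts with a speaker tag is left untouched, so explicit multi-speaker prompts like the Dia conversation example still pass through unchanged.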
Hmm, really not a fan of such hidden processing. This is where the abstraction of the pipeline (which does make sense if you want to interchange model IDs without changing anything else) complicates things more than it simplifies them... but okay to keep here, since there is already so much custom processing in the audio pipeline code anyway.

Note we might remove this in the future though, if we find a good API for model-specific kwargs for each TTS model and a convenient way to default them.

@ebezzam (Contributor, author) replied:

Definitely, for example a preset as we discussed here.
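One hypothetical shape such a preset could take (purely illustrative; `TTS_PRESETS`, `resolve_kwargs`, and all keys below are made up for this sketch, not an existing transformers API):

```python
# Purely illustrative sketch of per-model default TTS kwargs ("presets"),
# as floated in the discussion above. None of these names exist in transformers.
TTS_PRESETS = {
    "csm": {"speaker_prefix": "[0]", "output_audio": True},
    "dia": {"speaker_prefix": "[S1] ", "max_new_tokens": 256},
}

def resolve_kwargs(model_type, user_kwargs):
    # Start from the model's preset defaults; user-provided kwargs win.
    merged = dict(TTS_PRESETS.get(model_type, {}))
    merged.update(user_kwargs)
    return merged

print(resolve_kwargs("dia", {"max_new_tokens": 512}))
# {'speaker_prefix': '[S1] ', 'max_new_tokens': 512}
```

The appeal is that the model-specific defaulting (speaker tags, output_audio, generation limits) lives in one declarative table instead of being scattered through the pipeline's preprocessing code.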

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
