Conversation

@ebezzam (Contributor) commented Nov 28, 2025

What does this PR do?

Related to an offline discussion with @eustlb and @Deep-unlearning, let's change the default pipeline TTS behavior to make it easier for users.

I pinned output_audio=True for CSM, and the pipeline now also inserts speaker IDs automatically (for CSM and Dia) when the user omits them, to make simple TTS usage more intuitive.

See some CSM and Dia examples below.

import soundfile as sf
import torch
from datasets import Audio, load_dataset

from transformers import pipeline


device = "cuda" if torch.cuda.is_available() else "cpu"


"""CSM"""
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# -- minimal TTS example
torch.manual_seed(0)
outputs = pipe("Hello from Sesame.")     # instead of pipe("[0]Hello from Sesame.")
fn = "csm_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal TTS example with voice cloning
torch.manual_seed(0)
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")


"""Dia"""
pipe = pipeline("text-to-audio", model="nari-labs/Dia-1.6B-0626", device=device)

# -- minimal TTS example
torch.manual_seed(42)
outputs = pipe(
    "Dia is an open weights text to dialogue model.",      # instead of pipe("[S1] Dia is an open weights text to dialogue model.")
    generate_kwargs={"max_new_tokens": 256},
)
fn = "dia_pipeline_tts.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

# -- minimal conversation example
# note: Dia doesn't support the chat template for voice cloning;
# explicit model loading should be used instead: https://huggingface.co/nari-labs/Dia-1.6B-0626#generation-with-text-and-audio-voice-cloning
torch.manual_seed(0)
outputs = pipe(
    "[S1] Dia is an open weights text to dialogue model. [S2] That's cool, tell me how it works.",
)
fn = "dia_pipeline_conversation.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

@Deep-unlearning, what do you think about adding such examples to the TTS page (while pruning the verbose comments)?

At least the CSM voice cloning example (and pointing to this dataset so they know what the original voice sounds like):

import soundfile as sf
import torch
from datasets import Audio, load_dataset
from transformers import pipeline


device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-audio", model="sesame/csm-1b", device=device)

# prepare input
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = [
    # -- audio/text pair(s) for voice cloning
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "What are you working on?"}, 
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    # -- desired audio response for voice cloning
    {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
]
outputs = pipe(conversation)
fn = "csm_pipeline_voice_cloning.wav"
sf.write(fn, outputs["audio"], outputs["sampling_rate"])
print(f"Audio saved to {fn}")

@eustlb (Contributor) left a comment:

LGTM, thanks @ebezzam! 🤗

Comment on lines +175 to +179
# Add speaker ID if needed and user didn't insert at start of text
if self.model.config.model_type == "csm":
    text = [f"[0]{t}" if not t.startswith("[") else t for t in text]
if self.model.config.model_type == "dia":
    text = [f"[S1] {t}" if not t.startswith("[") else t for t in text]
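For illustration, the speaker-ID defaulting in the snippet above can be exercised standalone (a minimal sketch; `add_default_speaker` is a hypothetical helper, and `model_type` is passed directly instead of being read from `self.model.config`):

```python
# Minimal standalone sketch of the speaker-ID defaulting shown above.
# add_default_speaker is a hypothetical name; model_type is passed in
# directly rather than read from self.model.config.model_type.
def add_default_speaker(text, model_type):
    if model_type == "csm":
        # CSM expects a "[0]"-style speaker tag at the start of the text
        return [f"[0]{t}" if not t.startswith("[") else t for t in text]
    if model_type == "dia":
        # Dia expects an "[S1] "-style speaker tag
        return [f"[S1] {t}" if not t.startswith("[") else t for t in text]
    return text

print(add_default_speaker(["Hello from Sesame."], "csm"))  # ['[0]Hello from Sesame.']
print(add_default_speaker(["[1]Already tagged."], "csm"))  # ['[1]Already tagged.']
print(add_default_speaker(["Tell me how it works."], "dia"))  # ['[S1] Tell me how it works.']
```

Text that already starts with a speaker tag is left untouched, so explicit multi-speaker prompts like the Dia conversation example still pass through unchanged.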
Hmm, really not a fan of such hidden processing. This is where the abstraction of the pipeline (which does make sense if you want to interchange model IDs without changing anything else) complicates things more than it simplifies them... but okay to keep here, since there is already so much custom processing in the audio pipeline code anyway.

Note we might remove this in the future though, if we find a good API for model-specific kwargs for each TTS model and a convenient way to default them.

@ebezzam (Contributor, author) replied:

Definitely, for example a preset as we discussed here.
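One hypothetical shape such a preset could take (purely illustrative; `TTS_PRESETS`, `resolve_kwargs`, and all keys below are made up for this sketch, not an existing transformers API):

```python
# Purely illustrative sketch of per-model default TTS kwargs ("presets"),
# as floated in the discussion above. None of these names exist in transformers.
TTS_PRESETS = {
    "csm": {"speaker_prefix": "[0]", "output_audio": True},
    "dia": {"speaker_prefix": "[S1] ", "max_new_tokens": 256},
}

def resolve_kwargs(model_type, user_kwargs):
    # Start from the model's preset defaults; user-provided kwargs win.
    merged = dict(TTS_PRESETS.get(model_type, {}))
    merged.update(user_kwargs)
    return merged

print(resolve_kwargs("dia", {"max_new_tokens": 512}))
# {'speaker_prefix': '[S1] ', 'max_new_tokens': 512}
```

The appeal is that the model-specific defaulting (speaker tags, output_audio, generation limits) lives in one declarative table instead of being scattered through the pipeline's preprocessing code.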

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
