
CosyVoice

Basic Information
  • Title: "CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens"
  • Authors:
    • 01 Zhihao Du,
    • 02 Qian Chen,
    • 03 Shiliang Zhang,
    • 04 Kai Hu,
    • 05 Heng Lu,
    • 06 Yexin Yang,
    • 07 Hangrui Hu,
    • 08 Siqi Zheng,
    • 09 Yue Gu,
    • 10 Ziyang Ma,
    • 11 Zhifu Gao,
    • 12 Zhijie Yan
  • Links:
  • Files:
    • ArXiv
    • [Publication] #TODO

Abstract


Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.


In recent years, text-to-speech based on large language models (LLMs) has moved into the mainstream thanks to its high naturalness and zero-shot capability. In this paradigm, speech signals are discretized into token sequences, which are modeled by a language model with text as the prompt and reconstructed into waveforms by a token-based vocoder. Clearly, speech tokens play a critical role in LLM-based TTS models.

Existing speech tokens are learned in an unsupervised manner and therefore lack explicit semantic information and alignment with the text. In this paper, we propose to represent speech with supervised semantic tokens, which are obtained from a multilingual speech recognition model by inserting vector quantization into its encoder.

Based on these tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of a language model for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.

Experimental results show that supervised semantic tokens clearly outperform existing unsupervised tokens in content consistency and speaker similarity for zero-shot voice cloning. In addition, we find that using large-scale data further improves synthesis performance, demonstrating the scalability of CosyVoice. To the best of our knowledge, this is the first attempt to introduce supervised speech tokens into TTS models.

1·Introduction


Text-to-Speech (TTS) technology has made remarkable strides in recent years, transitioning from robotic-sounding speech to producing voices that are nearly indistinguishable from human speakers. At the forefront of this advancement are Large Language Models (LLMs), which have been increasingly utilized in TTS systems to generate speech with a higher degree of naturalness and the ability to synthesize voices in a zero-shot fashion (TorToise TTS (2023); VALL-E (2023); BASE TTS (2024)). These LLM-based TTS models function by converting speech signals into sequences of tokens, with the LLM utilizing text as a condition to model these token sequences. A token vocoder is then employed to reconstruct the raw waveforms from the tokenized speech (HiFi-GAN (2020); EnCodec (2022)).

A critical aspect of the TTS process is the representation of speech tokens. Traditionally, tokens are acquired through unsupervised learning, which may not capture explicit semantic information or align well with corresponding text (HuBERT (2021); EnCodec (2022)). Recognizing this gap, our work introduces supervised semantic tokens extracted from a multilingual speech recognition model, Whisper (2022), by integrating vector quantization into the encoder. This innovation allows for more accurate semantic representation and alignment with text. Early studies have shown that quantizers with auxiliary automatic speech recognition (ASR) loss outperform k-means clustering on the universal speech model (USM) for speech-to-text translation and ASR tasks, as demonstrated in AudioPaLM (2023). Additionally, ASQ employed Gumbel-Softmax vector quantization to extract discrete speech representations that prioritize ASR-relevant information for ASR tasks. However, the impact of these approaches on text-to-speech (TTS) remains unclear.

Furthermore, leveraging these supervised tokens, we propose CosyVoice, a scalable and efficient zero-shot TTS synthesizer. CosyVoice comprises an LLM for converting text into semantic token sequences and a conditional flow matching model for the subsequent synthesis of speech from these tokens. In contrast to prior systems like TorToise TTS (2023), which employs an LLM in conjunction with a denoising diffusion probabilistic model (DDPM) (2020), CosyVoice utilizes a conditional flow matching approach, as it has been demonstrated to accelerate both training and inference compared to traditional diffusion models (VoiceBox (2023)). While existing methods incorporate flow matching in TTS (VoiceBox (2023); VoiceFlow (2023); Matcha-TTS (2023); ReFlow-TTS (2023)), they often rely on phoneme duration prediction, necessitating the use of supplementary phonemizers and forced aligners. CosyVoice, however, bypasses these dependencies, offering a more direct and efficient pathway from text to speech.

Our research contributes to the field of speech generation in several novel ways:

  • We are the first to integrate supervised speech tokens into TTS models, enhancing content consistency and speaker similarity in zero-shot voice cloning.
  • We propose CosyVoice, a scalable zero-shot TTS synthesis system that combines an LLM for text-to-token generation with a conditional flow matching model for token-to-speech synthesis, eliminating the need for additional phonemizers and forced aligners.
  • To further refine the quality of generated speech, we incorporate the x-vector into the LLM to separate the modeling of speech into semantic, speaker, and prosody components. The LLM models the semantic content and prosody, while the conditional flow matching model captures timbre and environmental information. We optimize the flow matching process with techniques such as classifier-free guidance (2022), a cosine scheduler, and masked conditions.

Our experimental results demonstrate the superiority of supervised semantic tokens over unsupervised counterparts. Additionally, the scalability of CosyVoice is evidenced by improved synthesis performance when utilizing large-scale data. This work, therefore, represents a significant step forward in the development of natural-sounding, versatile TTS systems.


2·Related Works

3·Methodology


As shown in Figure \ref{fig:overall}(b), our CosyVoice consists of four components, namely text encoder, speech tokenizer, large language model and conditional flow matching model. Specifically, the text encoder is used to align the semantic spaces of text and speech tokens, while the speech tokenizer is utilized to extract semantic tokens as illustrated in Figure \ref{fig:overall}(a). We employ a large language model to learn the whole sequence of text encodings and speech tokens, reformulating TTS as an auto-regressive sequence generation problem given text as prompts. Then, as shown in Figure \ref{fig:overall}(c), a conditional flow matching model is utilized to convert speech tokens into a Mel spectrogram via a denoising process on the optimal path. To obtain a perceptible signal, the HiFi-GAN (2020) vocoder is used to synthesize a waveform with the generated Mel spectrogram as input.
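
To make the data flow concrete, here is a minimal inference-time sketch of how the four components are chained. All names are placeholders for illustration (the released CosyVoice code does not expose exactly this interface); the components are passed in as callables so the sketch stays self-contained.

```python
def cosyvoice_infer(text, prompt_speech, *, s3_tokenizer, spk_model,
                    bpe_tokenize, text_encoder, token_lm, cfm, vocoder):
    """Sketch of the CosyVoice pipeline; every component is a user-supplied callable."""
    # Speech tokenizer (Figure (a)): supervised semantic tokens from the prompt speech.
    prompt_tokens = s3_tokenizer(prompt_speech)
    # Speaker embedding v from a pre-trained voice-print model.
    v = spk_model(prompt_speech)

    # Text encoder aligns BPE text with the speech-token semantic space.
    text_enc = text_encoder(bpe_tokenize(text))

    # LLM (Figure (b)): autoregressive text-to-token generation, conditioned on v.
    speech_tokens = token_lm(v, text_enc, prompt_tokens)

    # Conditional flow matching (Figure (c)): tokens -> Mel spectrogram.
    mel = cfm(speech_tokens, v)

    # HiFi-GAN vocoder: Mel spectrogram -> waveform.
    return vocoder(mel)
```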


3.1·Supervised Semantic Tokens for Speech


In CosyVoice, a supervised automatic speech recognition (ASR) model is employed to derive the supervised semantic speech ($\mathcal{S}^3$) tokenizer. The model is a fine-tuned version of our proprietary SenseVoice ASR model. It is trained on multilingual audio data and possesses rich audio content understanding capabilities. Different from the original ASR model, we split the encoder into two parts and insert a vector quantization layer between them. Given a Mel spectrogram $X$ as input, it undergoes positional encoding and $\mathrm{Encoder}_1$ to obtain context-aware representations $H$:

$$ H = \mathrm{Encoder_1}\left(\mathrm{PosEnc}(X)\right) $$

Then, a vector quantizer (VQ) is involved to obtain discrete tokens. For the hidden representation $\mathbf{h}_l$ at the frame $l$, the index of nearest embedding in the codebook $C$ is treated as the speech token $\mu_l$ at this timestep:

$$ \mu_l = \mathrm{VQ}(\mathbf{h}_l, C)=\arg\min_{\mathbf{c}_n\in C}{|| \mathbf{h}_l - \mathbf{c}_n ||_2} $$

where $||\cdot||_2$ denotes the L2 norm. At the training stage, codebook embeddings are updated via exponentially moving average (EMA):

$$ \mathbf{c}_{\mu_l} := \alpha \mathbf{c}_{\mu_l} + (1-\alpha) \mathbf{h}_l $$

where $\alpha$ is a pre-defined decay coefficient. The corresponding codebook embeddings of the speech tokens are used as the quantized hidden representations $\bar{H}=\{\mathbf{c}_{\mu_1}, \mathbf{c}_{\mu_2}, \dots, \mathbf{c}_{\mu_L}\}$ and passed through the remaining encoder layers $\mathrm{Encoder}_2$:

$$ \tilde{H} = \mathrm{Encoder_2}\left(\mathrm{PosEnc}(\bar{H})\right) $$

Note that, before the remaining encoder layers, we add an extra positional encoding to enhance the temporal information. After $\mathrm{Encoder}_2$, a transformer-based ASR decoder follows, predicting the posterior probability of text labels:

$$ P(Y|X)=\mathrm{ASRDecoder}\left(\tilde{H},Y^{Z-1}\right) $$

where $Y^{Z-1}$ represents the left-shifted text labels in the teacher-forcing training scheme.
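
The quantization and EMA update above can be written compactly; the following PyTorch sketch mirrors the formulas (the decay `alpha = 0.99` and the 256-dimensional toy features are assumptions for illustration, not values from the paper).

```python
import torch

def quantize_frames(H: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour VQ of encoder outputs H (L x D) against a codebook C (N x D)."""
    dists = torch.cdist(H, codebook)   # (L, N) pairwise L2 distances
    mu = dists.argmin(dim=-1)          # speech token index mu_l per frame
    H_bar = codebook[mu]               # quantized representations c_{mu_l}
    return mu, H_bar

@torch.no_grad()
def ema_update(codebook: torch.Tensor, H: torch.Tensor, mu: torch.Tensor, alpha: float = 0.99):
    """Per-frame EMA update: c_{mu_l} := alpha * c_{mu_l} + (1 - alpha) * h_l."""
    for l in range(H.size(0)):
        codebook[mu[l]] = alpha * codebook[mu[l]] + (1 - alpha) * H[l]
    return codebook

# Toy usage: 10 frames of 256-dim features against a 4,096-entry codebook.
H = torch.randn(10, 256)
C = torch.randn(4096, 256)
mu, H_bar = quantize_frames(H, C)
C = ema_update(C, H, mu)
```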


3.2·Large Language Model for TTS


In this section, we formulate the TTS task as an auto-regressive speech token generation problem with a large language model (LLM). For the LLM, sequence construction is the most important design matter; the sequence is constructed as follows:

$$ \left[\circled{S}, \mathbf{v}, \{\bar{\mathbf{y}}_u\}_{u\in[1:U]}, \circled{T}, \{\mu_l\}_{l\in[1:L]}, \circled{E} \right] $$

$\circled{S}$ and $\circled{E}$ denote the start and end of the sequence, respectively. $\mathbf{v}$ is a speaker embedding vector extracted from the speech $X$ with a pre-trained voice-print model (available at https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/sv-cam++). The text encodings $\bar{Y}=\{\bar{\mathbf{y}}_u\}_{u\in[1:U]}$ are obtained by passing the text through a Byte Pair Encoding (BPE) tokenizer and the text encoder:

$$ \bar{Y} = \mathrm{TextEncoder}(\mathrm{BPE}(Y)) $$

Since text and speech tokens lie at different semantic levels, the text encoder is used to align their semantic spaces and benefit LLM modeling. A start identifier $\circled{T}$ is inserted between the text encodings and the speech tokens $\{\mu_l\}_{l\in[1:L]}$, which are extracted with the supervised semantic tokenizer described in \ref{sec:sst}. At the training stage, we employ the teacher-forcing scheme, in which the left-shifted sequence is used as the model input and the original sequence serves as the expected output. Note that only the cross-entropy losses of the speech tokens and $\circled{E}$ are considered during training:

$$ \mathcal{L}_{LM} = -\frac{1}{L+1}\sum_{l=1}^{L+1}{\log{q(\mu_l)}} $$

where $\mu_{L+1}$ is the "end of sequence" token $\circled{E}$. $q(\mu_l)$ denotes the posterior probability of $\mu_l$, which is predicted by the softmax layer following the LLM.
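
The sequence layout and the masked loss can be sketched as below. The special-token embeddings `S`, `T`, `E`, the embedding table and the `eos_id` are hypothetical stand-ins; only the positions corresponding to the speech tokens and $\circled{E}$ contribute to the loss, as stated above.

```python
import torch
import torch.nn.functional as F

def build_lm_input(S, v, text_enc, T, speech_emb, E):
    """[S, v, text encodings, T, speech token embeddings, E]; all inputs are (*, D)."""
    return torch.cat([S, v, text_enc, T, speech_emb, E], dim=0)

def lm_loss(logits, speech_tokens, eos_id):
    """Mean cross entropy over the L speech tokens plus the end-of-sequence token.
    `logits` must already be sliced to those L + 1 output positions, shape (L+1, V)."""
    targets = torch.cat([speech_tokens, speech_tokens.new_tensor([eos_id])])
    return F.cross_entropy(logits, targets)

# Toy usage with D = 8 features and a vocabulary of 4,096 codes + 1 EOS token.
D, V = 8, 4097
S, v, T, E = (torch.randn(1, D) for _ in range(4))
text_enc = torch.randn(5, D)
speech_tokens = torch.randint(0, 4096, (12,))
speech_emb = torch.randn(12, D)
seq = build_lm_input(S, v, text_enc, T, speech_emb, E)
loss = lm_loss(torch.randn(13, V), speech_tokens, eos_id=4096)
```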


3.3·Optimal-Transport Conditional Flow Matching


In CosyVoice, an optimal-transport conditional flow matching model (OT-CFM) is employed to learn the distribution of Mel spectrograms and generate samples from it with the generated speech tokens as conditions. OT-CFM can achieve better performance compared to diffusion probabilistic models (DPMs), with simpler gradients, easier training and faster generation (Flow Matching (2022); OT-CFM (2023); Matcha-TTS (2023)). In continuous-time normalizing flows (CNFs), a probability density path is constructed from a prior distribution $p_0(X)$ to the data distribution of the Mel spectrogram $q(X)$. The probability density path is defined by a time-dependent vector field $\nu_t(X): [0,1]\times \mathbb{R}^{LD}\rightarrow \mathbb{R}^{LD}$, which generates the flow $\phi_t$ through the following ordinary differential equation (ODE):

$$ \frac{d}{dt}\phi_t(X) = \nu_t(\phi_t(X), t) $$

$$ \phi_0(X)\sim p_0(X)=\mathcal{N}(X;0,I) $$

$$ \phi_1(X)\sim p_1(X) $$

where $t\in[0, 1]$. By solving this initial value problem, we can approximate the speech distribution $q(X)$ with $p_1(X)$ and sample from it.

To learn the vector field $\nu_t(X)$, we define the optimal-transport (OT) flow and force a neural network to match it by minimizing the following loss:

$$ \mathcal{L}_{OT\text{-}CFM} = \mathbb{E}_{t,p_0(X_0),q(X_1)}\left\| \omega_t(\phi^{OT}_t(X_0,X_1)|X_1) - \nu_t(\phi^{OT}_t(X_0,X_1)|\theta) \right\| $$

where

$$ \phi^{OT}_t(X_0,X_1)=(1-(1-\sigma)t)X_0+tX_1 $$

$$ \omega_t(\phi^{OT}_t(X_0,X_1)|X_1)=X_1-(1-\sigma)X_0 $$

The speaker embedding $\mathbf{v}$, the speech tokens $\{\mu_l\}_{1:L}$, and the masked Mel spectrogram $\tilde{X}_1$ are also fed into the neural network to match the vector field with learnable parameters $\theta$:

$$ \nu_t(\phi^{OT}_t(X_0,X_1)|\theta) = \mathrm{NN}_\theta\left(\phi^{OT}_t(X_0,X_1),t;\mathbf{v},\{\mu_l\}_{1:L},\tilde{X}_1\right) $$

$\tilde{X}_1$ is a masked version of $X_1$, obtained by setting continuous frames to zero from a random start point to the end. Considering that generation is harder at the beginning of the process than later on, we employ a cosine scheduler for the timestep $t$:

$$ t:=1-\cos\left(\frac{1}{2}\pi t\right) $$

Under the scheduled flow, there are more generation steps at the beginning.

Classifier-free guidance (CFG) has been proven to improve the generation quality of diffusion probabilistic models (CFG (2022); iDDPM (2021); VoiceBox (2023)). Therefore, we propose to adapt CFG to conditional flow matching models. At the training stage, we randomly drop the conditions $\Psi=\{\mathbf{v}, \{\mu_l\}_{1:L}, \tilde{X}_1\}$ with a fixed probability of $0.2$. In this manner, we learn both conditional and unconditional flows. During generation, the vector field is modified as follows:

$$ \tilde{\nu}_t(\phi^{OT}_t(X_0,X_1)|\theta;\Psi) = (1+\beta)\cdot\nu_t(\phi^{OT}_t(X_0,X_1)|\theta;\Psi) - \beta \cdot \nu_t(\phi^{OT}_t(X_0,X_1)|\theta) $$

where $\beta$ is the guidance strength, set to $0.7$.
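
Putting the pieces of this subsection together, the sketch below shows one OT-CFM training step (cosine-scheduled timestep, OT interpolation, target vector field, random condition dropping) and the CFG-adjusted field at inference. `nn_theta` is a placeholder network, `sigma` is an assumed small constant not specified here, and the loss uses a mean-squared form of the matching objective; only the 0.2 drop probability and `beta = 0.7` come from the text.

```python
import torch

def cosine_schedule(t: torch.Tensor) -> torch.Tensor:
    """t := 1 - cos(pi * t / 2); allocates more generation steps to the beginning."""
    return 1.0 - torch.cos(0.5 * torch.pi * t)

def otcfm_step(nn_theta, x1, cond, sigma: float = 1e-4, p_drop: float = 0.2):
    """One training step: regress the OT target field omega_t at phi_t^OT(x0, x1)."""
    x0 = torch.randn_like(x1)                            # x0 ~ N(0, I)
    t = cosine_schedule(torch.rand(x1.size(0), 1, 1))    # scheduled timestep per sample
    x_t = (1.0 - (1.0 - sigma) * t) * x0 + t * x1        # phi_t^OT(x0, x1)
    target = x1 - (1.0 - sigma) * x0                     # omega_t(phi_t^OT | x1)
    if torch.rand(()) < p_drop:                          # drop conditions to learn the
        cond = None                                      # unconditional flow (for CFG)
    pred = nn_theta(x_t, t, cond)
    return ((pred - target) ** 2).mean()

def cfg_field(nn_theta, x_t, t, cond, beta: float = 0.7):
    """Classifier-free guidance: (1 + beta) * conditional - beta * unconditional."""
    return (1.0 + beta) * nn_theta(x_t, t, cond) - beta * nn_theta(x_t, t, None)
```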


3.3.1·Zero-Shot In-Context Learning


CosyVoice models exhibit zero-shot in-context learning capabilities, allowing for the replication of an arbitrary voice with only a brief reference speech sample. This process entails the careful construction of input sequences for the token language model (LM), depicted in Figure \ref{fig:icl-seq}. For prompt speech and input text in the same language, we merge them to form a unified input, treating the prompt speech tokens as pre-generated. With this input sequence, the autoregressive LM iteratively predicts subsequent tokens until it encounters the "end of sequence" token $\circled{E}$. However, when the prompt speech and input text differ linguistically, we omit the text and tokens associated with the prompt to prevent prosodic characteristics of the original language from influencing the target language. It is important to note that the prompt text, which corresponds to the prompt speech's content, can be transcribed either through human annotation or ASR models, such as SenseVoice. Similar to the prompt text, the prompt tokens are extracted from the prompt speech with $\mathcal{S}^3$ tokenizer.

After generating the speech tokens, they are appended after the prompt tokens, forming a composite condition for the flow-matching model. Additionally, the speaker embedding and the Mel spectrogram of the prompt speech are incorporated to further enhance timbre and environmental consistency.
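
The prompt-construction logic of this subsection can be summarized as follows; the function works on plain token-ID lists and is only an illustration of the same-language versus cross-lingual branching described above.

```python
def build_icl_inputs(prompt_text_ids, prompt_speech_tokens, input_text_ids,
                     same_language: bool):
    """Return (text_ids, pre_generated_tokens) for the token LM.

    Same language: merge prompt and input text, and treat the prompt speech
    tokens as already generated. Cross-lingual: drop the prompt text and tokens
    so that source-language prosody does not leak into the target language."""
    if same_language:
        text_ids = list(prompt_text_ids) + list(input_text_ids)
        pre_generated = list(prompt_speech_tokens)
    else:
        text_ids = list(input_text_ids)
        pre_generated = []
    return text_ids, pre_generated

# The LM continues generation after `pre_generated` until it emits the (E) token;
# for the flow-matching stage, generated tokens are appended after the prompt tokens.
```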


3.4·Rich Generation with Instruction


To enable further controllability of CosyVoice, we experiment with integrating additional instruction fine-tuning (TextrolSpeech (2023)). CosyVoice-instruct extends CosyVoice-base with enhanced instruction-following capabilities. Specifically, it supports controllability over various aspects such as speaker identity (i.e., speaker's characteristics), speaking style (including emotion, gender, speaking rate, and pitch), and fine-grained paralinguistic features. These features include the ability to insert laughter and breaths, to speak while laughing, and to emphasize certain words.

We fine-tuned CosyVoice-base using this training data without incorporating speaker embedding in the autoregressive language model. Table \ref{tab:example_instruct} shows some examples of speaker identity, speaking style, and fine-grained paralinguistic features.


4·Experiments

Dataset

Small-scale Single-lingual Dataset


We conduct experiments on the LibriTTS (2019) corpus, which contains 585 hours of speech from 2,456 English speakers. We follow the official data partition, where "train-clean-100", "train-clean-360" and "train-other-500" are merged for training and "dev-clean" is used for model selection. "test-clean" is used to construct the evaluation set as described in UniCATS (2023).


Large-scale Multi-lingual Dataset


To train the CosyVoice models, we have amassed a considerable dataset comprising multiple languages. Throughout the collection process, we utilize specialized in-house tools for speech detection, signal-to-noise ratio (SNR) estimation, speaker diarization, and separation. Subsequently, pseudo text labels are generated using SenseVoice-Large and Paraformer. These labels undergo a refinement process with the aid of force-alignment (FA) models, which helps eliminate low-quality data and enhances the accuracy of punctuation. A comprehensive breakdown of the training data's duration across various languages is presented in Table \ref{tab:dataset}. Table \ref{tab:dataset_instruct} presents the duration of the training data for different types of instructions.


4.1·Supervised Semantic Speech Tokenizer


For the small-scale single-lingual dataset, we employ the ESPNet Conformer ASR model as the backbone and insert the vector quantizer after the first six encoder layers. There is a single codebook with 4,096 codes. The first six encoder layers and the vector quantizer are employed as the speech tokenizer. As for the text tokenizer, a word-level sentence-piece model with a vocabulary size of 4,000 is trained on the training text. We train the quantizer-augmented ASR model on the Librispeech (2015) corpus for 50 epochs from scratch.

For the large-scale multi-lingual dataset, we employ the SenseVoice-Large rich recognition model (FunAudioLLM (2024)) as the backbone. As in the small-scale setting, we insert the vector quantizer after the first six encoder layers, with a single codebook of 4,096 codes. More hyper-parameter choices, such as the layer at which the quantizer is inserted and the number of codes, are left for future work. Different from the single-lingual experiments, we use the pre-trained checkpoint to initialize the SenseVoice-Large model rather than training it from scratch. After inserting the quantizer, we further fine-tune all parameters for 210,000 training steps on eight A800 GPUs.


4.2·CosyVoice Model Settings


We train a tiny model for the single-lingual experiments and a normal-size model for the multi-lingual experiments. Details of the model architecture settings are shown in Table \ref{tab:model}. The tiny model is trained on the LibriTTS training set for 50 epochs with four V100-32M GPUs, while the multi-lingual model is trained on our internal dataset for 800,000 steps with 64 V100-32M GPUs. The tiny and normal models are trained with learning rates of $10^{-3}$ and $10^{-4}$, respectively. The warmup step is set to 10,000.


5·Results

5.1·Evaluation on $S^3$ Tokenizer


In Table \ref{tab:en-token}, we demonstrate how vector quantization affects the recognition performance on the LibriTTS test sets. From the table, we can see that inserting a vector quantizer into the ASR encoder affects the recognition performance only slightly. The VQ-inserted Conformer ASR model achieves comparable WERs of 3.18% and 7.56% on the "test-clean" and "test-other" sets, respectively. This indicates that tokenizers trained in a supervised manner can maintain sufficient semantic information and alignment to text.

To assess the multi-lingual $\mathcal{S}^3$ tokenizer's ability to preserve semantic information, we compared the recognition performance of the quantizer-augmented SenseVoice-L against its original version and the Whisper-Large V3 model. The models underwent evaluation using the Common Voice zh-CN and en benchmarks, with the findings detailed in Table \ref{tab:tokenizer-performance}. From the table, we can see that our $\mathcal{S}^3$ tokens demonstrate robust recognition performance in both the Chinese and English test sets. Notably, on the common_voice_zh-CN set, $\mathcal{S}^3$ tokens surpass the performance of the Whisper-Large V3 model (FunAudioLLM (2024)), achieving a 4.14% relative reduction in error rate. This suggests a substantial correlation between $\mathcal{S}^3$ tokens and semantic content. It is worth noting that there is only a single codebook in the $\mathcal{S}^3$ tokenizer with a dictionary size of 4,096 entries.


5.2·Comparison with Baselines


We compare the proposed CosyVoice models with other TTS systems on content consistency and speaker similarity. For content consistency, an ASR model is employed to recognize the generated utterances. We report the word error rate (WER) and the numbers of insertion, deletion and substitution errors. As for speaker similarity, we employ the ERes2Net (2023) model to extract speaker embeddings of the prompt and generated utterances, and treat their raw cosine similarity as the speaker similarity. Experimental results are shown in Table \ref{tab:compare}.
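
A sketch of the two metrics, with `jiwer` computing WER and the ASR and speaker-embedding models passed in as callables (stand-ins for the recognizer and ERes2Net):

```python
import jiwer
import torch.nn.functional as F

def content_consistency(reference_text, generated_wav, asr_transcribe) -> float:
    """WER between the input text and the ASR transcript of the generated speech."""
    return jiwer.wer(reference_text, asr_transcribe(generated_wav))

def speaker_similarity(prompt_wav, generated_wav, embed) -> float:
    """Raw cosine similarity between speaker embeddings of prompt and generation."""
    e_prompt, e_gen = embed(prompt_wav), embed(generated_wav)
    return F.cosine_similarity(e_prompt, e_gen, dim=-1).item()
```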

Compared with other TTS models, the proposed CosyVoice framework achieves comparable content consistency and higher speaker similarity, even when using the same text and speech tokenizers. Comparing Exp-1, Exp-2 and Exp-3, we can see that both the text and speech tokenizers are critical for content consistency but have a negligible effect on speaker similarity. In the Exp-4 experiments, we replace the single-lingual text and speech tokenizers with their multi-lingual counterparts. Using only the LibriTTS corpus to train the model degrades both content consistency and speaker similarity. By involving the internal large-scale dataset, the performance is significantly improved, achieving human-parity quality.


5.3·Evaluation on Generation Quality of CosyVoice


We evaluate the quality of CosyVoice's speech synthesis by examining content consistency and speaker similarity. The "test-clean" subset of LibriTTS (2019) and the test set of AISHELL-3 (2020) are employed to construct an evaluation set for English and Chinese, respectively. For each text in these sets, we randomly select a prompt speech. Content consistency was evaluated using Whisper-Large V3 (2022) for English and Paraformer (2022) for Chinese recognition. Speaker similarity was quantified by calculating the cosine similarity between speaker embeddings of the generated and prompt speeches, extracted using ERes2Net (2023).

Similar to other autoregressive language models, we employ a random-sampling decoding strategy for our token LM and assess the synthesis process with five different random seed values: 0, 7, 42, 123, and 1,337. The resulting evaluation metrics are averaged to determine the mean and standard deviation. Additionally, we conduct ASR re-ranking to demonstrate the potential performance improvement in offline mode.
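
The multi-seed sampling and ASR re-ranking protocol can be sketched as below; `synthesize` and `asr_transcribe` are placeholder callables, and the seed list is the one given in the text.

```python
import torch
import jiwer

SEEDS = [0, 7, 42, 123, 1337]

def rerank_with_asr(text, prompt, synthesize, asr_transcribe, seeds=SEEDS):
    """Generate one candidate per seed and keep the one whose ASR transcript best
    matches the input text (lowest WER); the per-seed WERs can also be averaged
    to report the mean and standard deviation."""
    best_wav, best_wer, wers = None, float("inf"), []
    for seed in seeds:
        torch.manual_seed(seed)          # fix the random-sampling decoding seed
        wav = synthesize(text, prompt)
        wer = jiwer.wer(text, asr_transcribe(wav))
        wers.append(wer)
        if wer < best_wer:
            best_wav, best_wer = wav, wer
    return best_wav, best_wer, wers
```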

Tables \ref{tab:res-libritts} and \ref{tab:res-aishell} present the results for English and Chinese, respectively. On the English dataset, CosyVoice attained human-level performance with similar content recognition and higher speaker similarity. ASR re-ranking notably enhanced content consistency, yielding a reduced word error rate (WER) of 1.51%. CosyVoice outperformed ChatTTS in WER and the number of insertion and deletion errors, indicating superior content consistency. We did not assess speaker similarity for ChatTTS as it doesn't release voice cloning capabilities.

As for the results in Chinese, the utterances generated by CosyVoice achieve a CER, as well as insertion and deletion error counts, comparable to those of the original utterances. ChatTTS appears to have better generation ability in Chinese than in English in terms of CER. Although ChatTTS and CosyVoice achieve a similar CER, ChatTTS produces more insertion and deletion errors. This is due to the problem of speaker leaking, where modal particles of another speaker are generated unexpectedly. By contrast, CosyVoice does not suffer from this problem and produces far fewer insertion and deletion errors. With ASR re-ranking, CosyVoice reaches a remarkably low CER of 1.84%. As with English, CosyVoice also exhibits greater speaker similarity than the original utterances, showcasing its effective voice-cloning capability.


5.4·Emotion Controllability of CosyVoice


To verify the emotion controllability, we use the public speech emotion recognition model emo2vec (2023) (Github). We generated and evaluated 100 English utterances for each of the six emotions: happy, angry, sad, surprised, fearful, and disgusted. The content of the synthesized text is designed to match the target emotion. We then measure the accuracy of the predicted emotions from the synthesized speech for each emotion.

Table \ref{tab:emo_acc} shows the comparison of emotion control accuracy between CosyVoice-base and CosyVoice-instruct. For CosyVoice-instruct, the input consists of content text accompanied by a speaking style instruction (e.g., "Happy.<endofprompt>Content Text"). In contrast, CosyVoice-base only receives the content text as input. The results indicate that CosyVoice-instruct with emotional instructions demonstrates a significant improvement over both CosyVoice-base and CosyVoice-instruct without emotional instructions.
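
For reference, a sketch of how the two input formats are assembled and how per-emotion accuracy could be tallied; the `<endofprompt>` delimiter and the emotion list come from the text, while `synthesize` and `predict_emotion` (e.g., an emo2vec-based classifier) are placeholders.

```python
EMOTIONS = ["happy", "angry", "sad", "surprised", "fearful", "disgusted"]

def instruct_input(emotion: str, content_text: str) -> str:
    """CosyVoice-instruct input: style instruction + <endofprompt> + content text."""
    return f"{emotion.capitalize()}.<endofprompt>{content_text}"

def emotion_accuracy(samples, synthesize, predict_emotion, use_instruction=True):
    """samples: list of (target_emotion, content_text) pairs; returns accuracy per emotion."""
    hits = {e: 0 for e in EMOTIONS}
    totals = {e: 0 for e in EMOTIONS}
    for target, text in samples:
        prompt = instruct_input(target, text) if use_instruction else text
        wav = synthesize(prompt)
        totals[target] += 1
        hits[target] += int(predict_emotion(wav) == target)
    return {e: hits[e] / totals[e] for e in EMOTIONS if totals[e]}
```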


5.5·CosyVoice as a Data Generator


A straightforward application of CosyVoice is as a data generator to augment the training data of other tasks, such as ASR and speech-to-speech translation (S2ST). Taking the ASR task as an example, we conduct an experiment on the Librispeech corpus to evaluate CosyVoice's capability of generating high-quality data. The experimental results are shown in Table \ref{tab:data-syn}, where "Librispeech" denotes the original 960-hour data, and "Syn on LS text" and "Syn on MLS text" denote the data generated from the text of the Librispeech and MLS training sets, respectively. From the table, we can see that when training only on the synthesized data, the ASR model achieves results comparable to training on the original Librispeech set. When the original and synthesized data are combined, a notable enhancement in recognition accuracy is observed. An interesting finding is that involving the data synthesized from the MLS text significantly improves recognition performance. This may indicate that text diversity is more critical for the ASR task than the duration of the speech itself; the improvement can be attributed to the varied linguistic content introduced by the CosyVoice-synthesized samples. These findings underscore the high quality of the samples generated by CosyVoice.


6·Conclusions


In this paper, we introduce CosyVoice, a scalable multi-lingual speech generation model that supports zero-shot in-context learning, cross-lingual voice cloning, instructed generation, and fine-grained control of emotion and paralinguistic features. Experimental results show that the system architecture of CosyVoice is important for speaker similarity, while the text and speech tokenizers strongly affect content consistency. Besides, we find that scaling up the model size and data volume significantly improves performance. As a result, CosyVoice achieves human-parity generation quality.