Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
YuE would work similarly to the OuteTTS implementation, where an LLM (in this case, two separate llama models) is involved in generating the audio. YuE does not appear to use WavTokenizer, and instead of just speech it is capable of generating music and sung vocals.
A demo page with links to all the relevant code and models can be found here: https://map-yue.github.io/
Motivation
For end users: Music generation is a use case currently missing from the llama.cpp ecosystem. Users could leverage quantized versions of the LLMs to generate songs on their own or rented hardware, and since this model is capable of singing, it is more flexible than established non-LLM audio models.
For developers: I think this is an interesting next step in llama.cpp's TTS experiments, since this is also LLM-based. We first saw how language models running in llama.cpp could be paired with WavTokenizer to produce audible speech. This would rely on the same existing llama infrastructure, paired with new implementations on the music/audio side. It seems similar to Llasa-3B, as both use xcodec, so the implementation may be shareable between the two models.
Possible Implementation
The llama models should be able to leverage the existing llama implementation. For the audio side, this open-source project can be used as a reference: https://github.com/multimodal-art-projection/YuE (paper is pending).
This seems to require xcodec, which, if compatible, would also be progress towards supporting Llasa-3B.
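To make the shape of this concrete, here is a minimal sketch of how the two-stage flow could sit on top of the existing llama.cpp C API. This is an assumption-heavy illustration, not the actual YuE inference recipe: the GGUF file names, the prompt format, the token budgets, the direct hand-off of stage-1 tokens into stage 2, and the `xcodec_decode()` helper are all hypothetical (an xcodec decoder does not exist in llama.cpp yet); only the `llama_*` calls are existing API.

```cpp
// Hypothetical sketch of a two-stage YuE pipeline on top of llama.cpp.
// Model file names, prompt format, and xcodec_decode() are placeholders;
// only the llama_* calls below are existing llama.cpp API.
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

// The missing piece: an xcodec decoder turning stage-2 codec tokens into
// PCM samples. A real version would need a new ggml implementation.
std::vector<float> xcodec_decode(const std::vector<llama_token> & codec_tokens) {
    (void) codec_tokens; // stub only
    return {};
}

// Plain greedy generation loop using the existing sampler-chain API.
static std::vector<llama_token> generate(llama_context * ctx,
                                         std::vector<llama_token> prompt, int n_max) {
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    std::vector<llama_token> out;
    llama_token cur = 0;
    llama_batch batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size());
    for (int i = 0; i < n_max; ++i) {
        if (llama_decode(ctx, batch) != 0) {
            break; // decode failed (e.g. context full)
        }
        cur = llama_sampler_sample(smpl, ctx, -1);
        if (llama_token_is_eog(llama_get_model(ctx), cur)) {
            break;
        }
        out.push_back(cur);
        batch = llama_batch_get_one(&cur, 1);
    }
    llama_sampler_free(smpl);
    return out;
}

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // Stage 1 maps the text prompt to coarse music tokens; stage 2 refines
    // them into codec tokens. Both are llama-architecture models, so they
    // should load through the normal GGUF path (file names are made up).
    llama_model * s1 = llama_load_model_from_file("yue-s1-7b.gguf", mparams);
    llama_model * s2 = llama_load_model_from_file("yue-s2-1b.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx1 = llama_new_context_with_model(s1, cparams);
    llama_context * ctx2 = llama_new_context_with_model(s2, cparams);

    // Tokenize the text prompt with stage 1's vocab (prompt format is a guess).
    const std::string prompt = "[genre] pop [lyrics] ...";
    std::vector<llama_token> toks(prompt.size() + 16);
    const int n = llama_tokenize(s1, prompt.c_str(), (int32_t) prompt.size(),
                                 toks.data(), (int32_t) toks.size(), true, true);
    toks.resize(n > 0 ? n : 0);

    // Stage 1: text tokens -> music tokens. Stage 2: music tokens -> codec
    // tokens (the real model conditions per segment; this is simplified).
    std::vector<llama_token> music = generate(ctx1, toks, 2048);
    std::vector<llama_token> codec = generate(ctx2, music, 4096);

    // Final step, not implemented anywhere in llama.cpp yet:
    std::vector<float> pcm = xcodec_decode(codec);
    printf("generated %zu PCM samples\n", pcm.size());

    llama_free(ctx1);
    llama_free(ctx2);
    llama_free_model(s1);
    llama_free_model(s2);
    llama_backend_free();
    return 0;
}
```

If an xcodec decoder were implemented in ggml for the final step, the same code could presumably serve Llasa-3B as well, which is the sharing opportunity mentioned above.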