Add LTX 2.0 Video Pipelines #12915
base: main
Conversation
LTX 2.0 Vocoder Implementation
LTX 2.0 Video VAE Implementation
Cc: @matanby if you want to test this PR on your end. We will shortly be adding the upsampling pipeline as well.

no audio encode?

@bghira, so that I understand correctly, is the request for an analogue of

the audio autoencoder is missing the encode() function, which exists in the LTX-2 repo from Lightricks; ComfyUI has audio encoding as well

@bghira thanks for the clarification! We will support the audio VAE encoder.
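To make the request concrete: the ask is for the audio autoencoder to expose an encode() that is symmetric with its existing decode(). The sketch below is purely illustrative (a toy class, not the diffusers API or the LTX-2 model) and only shows the interface shape being discussed.

```python
# Hypothetical stand-in (NOT the actual diffusers/LTX-2 API) illustrating
# the symmetric encode()/decode() interface requested for the audio VAE.
class ToyAudioAutoencoder:
    """encode() maps a waveform to latents; decode() maps latents back."""

    def __init__(self, downsample_factor: int = 4):
        self.downsample_factor = downsample_factor

    def encode(self, waveform):
        # Toy "compression": average every `downsample_factor` samples.
        f = self.downsample_factor
        return [sum(waveform[i:i + f]) / f for i in range(0, len(waveform), f)]

    def decode(self, latents):
        # Toy "decompression": repeat each latent value back out.
        f = self.downsample_factor
        return [v for v in latents for _ in range(f)]
```

The real encoder is of course a learned network; the point here is only that encoding is the missing inverse of the decode path that already ships.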
sayakpaul left a comment
Some small comments.
tests/models/autoencoders/test_models_autoencoder_kl_ltx2_audio.py
num_rope_elems = num_pos_dims * 2

# 3. Create a 1D grid of frequencies for RoPE
freqs_dtype = torch.float64 if self.double_precision else torch.float32
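For context on what the snippet above is setting up: a 1D RoPE frequency grid is conventionally one frequency per channel pair, theta^(-2i/d). The sketch below is a pure-Python stand-in (the PR's actual tensor code and constants may differ) just to show the shape of that computation; the dtype line in the diff decides whether this grid is built in float64 or float32.

```python
def rope_freqs_1d(num_elems: int, theta: float = 10000.0):
    """Sketch (assumed, not the PR's exact code) of a 1D RoPE frequency
    grid: one frequency per channel pair, computed as theta^(-2i / d)."""
    return [theta ** (-2.0 * i / num_elems) for i in range(num_elems // 2)]
```

Each position index is later multiplied by these frequencies to produce the rotation angles applied to query/key channel pairs.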
(nit): we could set self.freqs_dtype in __init__ to avoid computing it on every call.
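The suggested refactor looks something like the following sketch (illustrative names, strings standing in for torch dtypes so the idea is self-contained):

```python
# Sketch of the nit above: pick the dtype once in __init__ rather than
# re-branching on self.double_precision in every forward pass.
# "float64"/"float32" stand in for torch.float64/torch.float32 here.
class RopeModule:
    def __init__(self, double_precision: bool = False):
        self.double_precision = double_precision
        # Cached once (the suggestion):
        self.freqs_dtype = "float64" if double_precision else "float32"

    def forward(self):
        # Callers now reuse the cached attribute.
        return self.freqs_dtype
```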
video_cross_attn_rotary_emb = self.cross_attn_rope(video_coords[:, 0:1, :], device=hidden_states.device)
audio_cross_attn_rotary_emb = self.cross_attn_audio_rope(
    audio_coords[:, 0:1, :], device=audio_hidden_states.device
)
(nit): would be nice to have a comment explaining the small indexing going on there.
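For readers following along, the indexing in question is `coords[:, 0:1, :]`: slicing with `0:1` (rather than plain `0`) selects only the first positional dimension while preserving the middle axis, so downstream code still sees a 3D shape. A nested-list stand-in (assuming `coords` has shape (batch, pos_dims, tokens); the real code operates on tensors):

```python
def slice_first_pos_dim(coords):
    """Mimics tensor indexing coords[:, 0:1, :] on shape (B, D, N):
    keep only the first positional dimension, but keep the axis."""
    return [batch[0:1] for batch in coords]

coords = [[[0, 1, 2],    # dim 0 (e.g. frame index) for 3 tokens
           [3, 4, 5],    # dim 1 (e.g. height)
           [6, 7, 8]]]   # dim 2 (e.g. width)
sliced = slice_first_pos_dim(coords)  # shape (1, 1, 3), axis preserved
```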
* Initial implementation of LTX 2.0 latent upsampling pipeline
* Add new LTX 2.0 spatial latent upsampler logic
* Add test script for LTX 2.0 latent upsampling
* Add option to enable VAE tiling in upsampling test script
* Get latent upsampler working with video latents
* Fix typo in BlurDownsample
* Add latent upsample pipeline docstring and example
* Remove deprecated pipeline VAE slicing/tiling methods
* make style and make quality
* When returning latents, return unpacked and denormalized latents for T2V and I2V
* Add model_cpu_offload_seq for latent upsampling pipeline

Co-authored-by: Daniel Gu <[email protected]>
What does this PR do?
This PR adds pipelines for the LTX 2.0 video generation model (code, weights). LTX 2.0 is an audio-video foundation model that generates videos with synced audio; it supports generation tasks such as text-to-video (T2V), text-image-to-video (TI2V), and more.
You can try out T2V generation as follows:
python scripts/ltx2_test_full_pipeline.py \
    --model_id Lightricks/LTX-2 \
    --revision refs/pr/3 \
    --cpu_offload

Note that LTX 2.0 video generation uses a lot of memory; CPU offloading is necessary even on an A100 with 80 GB of VRAM (assuming no memory optimizations other than bf16 inference are used).

Similarly, you can try out I2V generation with:

python scripts/ltx2_test_full_pipeline_i2v.py \
    --model_id Lightricks/LTX-2 \
    --revision refs/pr/3 \
    --image_path https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg \
    --cpu_offload

Here is an I2V sample from the above:
ltx2_i2v_sample.mp4
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu
@sayakpaul
@ofirbb